Skip to content

Extract all text if subelements are present#3

Open
karlb wants to merge 2 commits intoitkach:masterfrom
karlb:master
Open

Extract all text if subelements are present#3
karlb wants to merge 2 commits intoitkach:masterfrom
karlb:master

Conversation

@karlb
Copy link
Contributor

@karlb karlb commented Nov 22, 2017

The text function only extracts text via the element's text attribute. This returns only a part of the text when subelements are present:

If the element is created from an XML file, the text attribute holds either the text between the element’s start tag and its first child or end tag, or None

see https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.text

I've fixed this by iterating over all text elements in that element/subtree.

@itkach
Copy link
Owner

itkach commented Nov 26, 2017

Are you sure this is desirable? Looks like <title> content may be a complex structure, in which case concatenating texts doesn't produce anything meaningful. Third example at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-title.html shows title elements inside elements, main title and a sub-title. In this case joining two texts sort of works, but it needs some separator. In other cases it wouldn't make sense at all. All we want here is automatically give converted dictionary a reasonable title, but if source dictionary doesn't have a simple title there's no point in guessing.

@karlb
Copy link
Contributor Author

karlb commented Nov 26, 2017

I did the change mostly for the <availability> tag, which can contain <ref> tags. For those the current behavior does not work well.

I could change the code to only include the content of specific sub tags (probably only <ref> for now). Counter-suggestions welcome!

@itkach
Copy link
Owner

itkach commented Nov 28, 2017

I did the change mostly for the tag, which can contain tags.

In that case I think text() is the wrong thing to change. text() and attr() are simple utility functions, let them stay that way.

Perhaps xpaths for license and copyright tags can be improved, or maybe it makes sense to look for license and copyright in several places. But honestly, this is just best effort to populate tags automatically, in no way is it meant to absolve conversion author from verifying resulting dictionary.

@karlb
Copy link
Contributor Author

karlb commented Dec 6, 2017

in no way is it meant to absolve conversion author from verifying resulting dictionary.

I agree, and I created this ticket while doing exactly that. I've changed all FreeDict dictionaries so that the current implementation successfully finds license name and URL. Unfortunately, I didn't find a way to add the required refs without breaking the copyright extraction.
With the change in this PR all of these work fine for all FreeDict dictionaries.

In that case I think text() is the wrong thing to change. text() and attr() are simple utility functions, let them stay that way.

What about adding the new version as a separate text_including_subtags() and using it only for <copyright>? I could also make it more restrictive and include only the text of <ref> subtags. But skipping unknown tags while keeping their content seems to work well for web browsers, so I though it might be a sensible default here, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants