Extract all text if subelements are present by karlb · Pull Request #3 · itkach/tei2slob

karlb · 2017-11-22T10:24:29Z

The text function only extracts text via the element's text attribute. This returns only a part of the text when subelements are present:

If the element is created from an XML file, the text attribute holds either the text between the element’s start tag and its first child or end tag, or None

see https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.text

I've fixed this by iterating over all text elements in that element/subtree.

itkach · 2017-11-26T19:21:01Z

Are you sure this is desirable? Looks like <title> content may be a complex structure, in which case concatenating texts doesn't produce anything meaningful. Third example at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-title.html shows title elements inside elements, main title and a sub-title. In this case joining two texts sort of works, but it needs some separator. In other cases it wouldn't make sense at all. All we want here is automatically give converted dictionary a reasonable title, but if source dictionary doesn't have a simple title there's no point in guessing.

karlb · 2017-11-26T20:32:06Z

I did the change mostly for the <availability> tag, which can contain <ref> tags. For those the current behavior does not work well.

I could change the code to only include the content of specific sub tags (probably only <ref> for now). Counter-suggestions welcome!

itkach · 2017-11-28T03:34:08Z

I did the change mostly for the tag, which can contain tags.

In that case I think text() is the wrong thing to change. text() and attr() are simple utility functions, let them stay that way.

Perhaps xpaths for license and copyright tags can be improved, or maybe it makes sense to look for license and copyright in several places. But honestly, this is just best effort to populate tags automatically, in no way is it meant to absolve conversion author from verifying resulting dictionary.

karlb · 2017-12-06T17:04:15Z

in no way is it meant to absolve conversion author from verifying resulting dictionary.

I agree, and I created this ticket while doing exactly that. I've changed all FreeDict dictionaries so that the current implementation successfully finds license name and URL. Unfortunately, I didn't find a way to add the required refs without breaking the copyright extraction.
With the change in this PR all of these work fine for all FreeDict dictionaries.

In that case I think text() is the wrong thing to change. text() and attr() are simple utility functions, let them stay that way.

What about adding the new version as a separate text_including_subtags() and using it only for <copyright>? I could also make it more restrictive and include only the text of <ref> subtags. But skipping unknown tags while keeping their content seems to work well for web browsers, so I though it might be a sensible default here, too.

karlb added 2 commits November 22, 2017 11:19

extract all text if subelements are present

95b921a

return empty text on non-matching XML-path

7537902

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract all text if subelements are present#3

Extract all text if subelements are present#3
karlb wants to merge 2 commits intoitkach:masterfrom
karlb:master

karlb commented Nov 22, 2017

Uh oh!

itkach commented Nov 26, 2017 •

edited

Loading

Uh oh!

karlb commented Nov 26, 2017

Uh oh!

itkach commented Nov 28, 2017

Uh oh!

karlb commented Dec 6, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

karlb commented Nov 22, 2017

Uh oh!

itkach commented Nov 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

karlb commented Nov 26, 2017

Uh oh!

itkach commented Nov 28, 2017

Uh oh!

karlb commented Dec 6, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

itkach commented Nov 26, 2017 •

edited

Loading