Description
ScienceBeam fails to correctly extract article titles from PDF manuscripts that have front pages (cover pages with journal branding, terms & conditions, etc.). Instead of extracting the actual article title, it extracts header/footer text and journal information from the front page.
What are the Steps to Reproduce the issue?
- Submit a manuscript PDF with a front page containing journal branding and terms/conditions (e.g., Taylor & Francis "Expert Review" journal format)
- Process the document through ScienceBeam for XML conversion
- Examine the title element in the resulting XML file
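
The last step can be scripted. The following is a minimal sketch, not part of ScienceBeam: it assumes the output is JATS-style XML in which the title sits in an element named `article-title` (an assumption, adjust `tag` if the output differs). The helper `extract_titles` and the sample XML are hypothetical and only illustrate the check.

```python
import xml.etree.ElementTree as ET


def extract_titles(xml_text, tag="article-title"):
    """Return the normalized text of every element whose local name equals `tag`.

    `tag` defaults to "article-title", which is an assumption about the
    output schema rather than something confirmed in this report.
    """
    root = ET.fromstring(xml_text)
    titles = []
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments or processing instructions, if present
        # Drop any "{namespace}" prefix so the match works with or without namespaces.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == tag:
            # Join nested text (e.g. italics) and collapse whitespace.
            titles.append(" ".join("".join(elem.itertext()).split()))
    return titles


# Made-up sample standing in for a converted document.
sample = """<article><front><article-meta><title-group>
<article-title>Example  Journal <italic>Front Page</italic> Header</article-title>
</title-group></article-meta></front></article>"""

print(extract_titles(sample))
```

To check a real conversion result, read the XML file as text and pass its contents to `extract_titles`, then compare the returned title with the one on the manuscript's first content page.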
What is the Expected behaviour?
The title element should contain the actual manuscript title. For example, based on the content of 76_pdf-0.1.11.xml (see attached file 76.pdf), the title should be something like: "The Toxic Effects of Ethylene Glycol Tetraacetate Acid, Ferrum Lek and Methanol on the Glutathione System: correction Options"
Instead, the extracted title is (see screenshot):
![]()
Additional Context
Affected file:
The actual article content (abstract, body) is extracted correctly.
This appears to be a pattern recognition issue: ScienceBeam does not identify and skip front page content when locating the title.