Bug Report: Incorrect Title Extraction for Documents with Front Pages #561

@yashp048

Description

ScienceBeam fails to correctly extract article titles from PDF manuscripts that have front pages (cover pages with journal branding, terms & conditions, etc.). Instead of extracting the actual article title, it extracts header/footer text and journal information from the front page.

What are the Steps to Reproduce the issue?

  1. Submit a manuscript PDF with a front page containing journal branding and terms/conditions (e.g., Taylor & Francis "Expert Review" journal format)
  2. Process the document through ScienceBeam for XML conversion
  3. Examine the title element in the metadata section of the resulting XML file
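
The check in step 3 can be scripted. This is a minimal sketch, assuming the converted output is either JATS (title in an `article-title` element) or TEI (title in a `title` element). The function name `extract_title` is hypothetical and not part of ScienceBeam.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{ns}title' -> 'title'."""
    return tag.rsplit('}', 1)[-1]


def extract_title(xml_text, title_tags=('article-title', 'title')):
    """Return the text of the first title-like element, or None.

    Assumption: JATS output uses <article-title>, TEI output uses <title>.
    Tags are tried in the order given, ignoring namespaces.
    """
    root = ET.fromstring(xml_text)
    for wanted in title_tags:
        for element in root.iter():
            if isinstance(element.tag, str) and local_name(element.tag) == wanted:
                # Join nested text and collapse runs of whitespace.
                return ' '.join(''.join(element.itertext()).split())
    return None
```

Comparing the returned string against the title printed on the first page of the actual manuscript (not the cover page) shows whether the front page was picked up instead.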

What is the Expected behaviour?

The title element should contain the actual manuscript title. For example, based on the content of the attached files 76_pdf-0.1.11.xml and 76.pdf, the title should be something like: "The Toxic Effects of Ethylene Glycol Tetraacetate Acid, Ferrum Lek and Methanol on the Glutathione System: correction Options"

Instead, the extracted title is the front page text shown in the screenshot:
Image

Additional Context

Affected file:

76_pdf-normal.xml

The actual article content (abstract, body) is correctly extracted.
This appears to be a pattern recognition issue: ScienceBeam does not identify and skip front page content when locating the title.
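
One possible direction is a pre-check that flags cover pages before title extraction. The following is only a rough sketch of such a heuristic, not how ScienceBeam works: the marker phrases, the `min_hits` threshold, and both function names are assumptions chosen for illustration, and the input is assumed to be plain text already extracted per page.

```python
# Hypothetical boilerplate phrases that tend to appear on publisher cover pages.
COVER_PAGE_MARKERS = (
    'terms & conditions',
    'terms and conditions',
    'to cite this article',
    'submit your article to this journal',
    'view related articles',
)


def looks_like_front_page(page_text, markers=COVER_PAGE_MARKERS, min_hits=2):
    """Return True if the page contains at least `min_hits` marker phrases."""
    text = ' '.join(page_text.lower().split())  # normalise case and whitespace
    hits = sum(1 for marker in markers if marker in text)
    return hits >= min_hits


def first_content_page(pages):
    """Return the index of the first page that does not look like a cover page.

    Falls back to 0 when every page is flagged, so no document is skipped whole.
    """
    for index, page_text in enumerate(pages):
        if not looks_like_front_page(page_text):
            return index
    return 0
```

Title extraction could then start from the page returned by `first_content_page` rather than from page one. A keyword list like this would need tuning per publisher.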
