feat(scraper): support hierarchical parsing of XML structures #362

@arabold

Context

Following #341, all XML-variant formats (.xml, .xslt, .xsl, .xsd, .dtd, .wsdl) are now correctly routed to SourceCodePipeline and processed as source code. However, they are treated as flat text wrapped in code blocks, without any awareness of the XML document structure.

Proposal

Add hierarchical/structural parsing support for XML content, similar to how JsonPipeline handles JSON documents. This would allow the system to:

  • Parse XML into a document tree and split content along meaningful structural boundaries (elements, namespaces, sections)
  • Produce better chunks that preserve semantic context rather than splitting mid-element
  • Extract metadata from XML declarations, root elements, or schema definitions
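To make the proposal concrete, here is a rough, language-agnostic sketch of the idea, written in Python only because its standard library ships an XML parser. It is not the project's implementation and does not mirror the JsonDocumentSplitter API; `XmlChunk`, `split_xml`, `root_metadata`, and the `max_chars` threshold are all hypothetical names chosen for illustration. The splitter emits an element whole when it fits the size budget and otherwise descends into its children, tagging each chunk with its element path so the structural context survives.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class XmlChunk:
    path: str     # hypothetical: slash-joined element path kept as chunk context
    content: str  # serialized XML for this chunk


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def split_xml(xml_text: str, max_chars: int = 200) -> list[XmlChunk]:
    """Split along element boundaries instead of cutting mid-element."""
    root = ET.fromstring(xml_text)
    chunks: list[XmlChunk] = []

    def visit(elem: ET.Element, path: str) -> None:
        serialized = ET.tostring(elem, encoding="unicode").strip()
        children = list(elem)
        # Emit the element whole if it fits the budget, or if it is a leaf
        # (an oversized leaf is still emitted whole in this sketch).
        if len(serialized) <= max_chars or not children:
            chunks.append(XmlChunk(path, serialized))
            return
        # Otherwise descend; each child becomes a candidate chunk.
        for child in children:
            visit(child, f"{path}/{local_name(child.tag)}")

    visit(root, local_name(root.tag))
    return chunks


def root_metadata(xml_text: str) -> dict:
    """Collect basic metadata from the root element."""
    root = ET.fromstring(xml_text)
    tag = root.tag
    namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else None
    return {
        "root": local_name(tag),
        "namespace": namespace,
        "attributes": dict(root.attrib),
    }


doc = (
    "<catalog>"
    "<book id='1'><title>A</title></book>"
    "<book id='2'><title>B</title></book>"
    "</catalog>"
)
for chunk in split_xml(doc, max_chars=50):
    print(chunk.path, "->", chunk.content)
```

Known gaps in this sketch: when it descends into children, the parent's own attributes and text are dropped, and an oversized leaf element is not split further. A real implementation would likely need to carry the parent's opening tag along as context and fall back to a text splitter for large leaves.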

References

  • Custom file extensions #341 — Initial fix for XML-variant MIME type detection
  • JsonPipeline / JsonDocumentSplitter — Existing implementation for structured JSON parsing that could serve as a model

Metadata

  • Labels: enhancement (New feature or request)