feat(scraper): support hierarchical parsing of XML structures

## Context

Following #341, all XML-variant formats (`.xml`, `.xslt`, `.xsl`, `.xsd`, `.dtd`, `.wsdl`) are now correctly routed to `SourceCodePipeline` and processed as source code. However, they are treated as flat text wrapped in code blocks, without any awareness of the XML document structure.

## Proposal

Add hierarchical/structural parsing support for XML content, similar to how `JsonPipeline` handles JSON documents. This would allow the system to:

- Parse XML into a document tree and split content along meaningful structural boundaries (elements, namespaces, sections)
- Produce better chunks that preserve semantic context rather than splitting mid-element
- Extract metadata from XML declarations, root elements, or schema definitions
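
As a rough illustration of the splitting behaviour proposed above, here is a minimal sketch in Python using the standard-library `xml.etree.ElementTree` parser. It is not the project's implementation: `XmlChunk`, `split_xml`, `root_metadata`, and the `max_chars` threshold are hypothetical names chosen for this example, and a real splitter would presumably follow whatever interface `JsonDocumentSplitter` exposes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class XmlChunk:
    path: str     # element path with child positions, e.g. "catalog/book[0]"
    content: str  # serialized XML of the element kept as one chunk


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def split_xml(source: str, max_chars: int = 200) -> list[XmlChunk]:
    """Split along element boundaries, never in the middle of an element.

    Assumption for this sketch: chunk size is measured in characters.
    """
    root = ET.fromstring(source)
    chunks: list[XmlChunk] = []

    def visit(elem: ET.Element, path: str) -> None:
        text = ET.tostring(elem, encoding="unicode").strip()
        children = list(elem)
        # Keep the element whole if it fits or cannot be divided further.
        if len(text) <= max_chars or not children:
            chunks.append(XmlChunk(path, text))
            return
        # Otherwise descend. Simplification: the parent's own tag,
        # attributes, and direct text are not emitted as a chunk.
        for index, child in enumerate(children):
            visit(child, f"{path}/{local_name(child.tag)}[{index}]")

    visit(root, local_name(root.tag))
    return chunks


def root_metadata(source: str) -> dict:
    """Collect basic metadata from the root element."""
    root = ET.fromstring(source)
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else None
    return {
        "root": local_name(root.tag),
        "namespace": namespace,
        "attributes": dict(root.attrib),
    }
```

Calling `split_xml(document, max_chars=50)` on a small catalog with two `<book>` elements would yield one chunk per book, each tagged with its path, instead of a single flat code block. Recording the path with every chunk is one way to preserve the semantic context mentioned above; carrying the ancestor tags into each chunk would be another.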

## References

- #341 — Initial fix for XML-variant MIME type detection
- `JsonPipeline` / `JsonDocumentSplitter` — Existing implementation for structured JSON parsing that could serve as a model

feat(scraper): support hierarchical parsing of XML structures #362

Context

Proposal

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat(scraper): support hierarchical parsing of XML structures #362

Description

Context

Proposal

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions