Summary
Expand document ingestion beyond PDFs to support common text-based formats: Word documents, plain text, Markdown, HTML, and other office formats.
Motivation
Journalists work with diverse document types:
- Word documents (.docx) - Reports, memos, drafts
- Plain text (.txt) - Logs, exports, simple notes
- Markdown (.md) - Technical docs, notes
- HTML (.html) - Saved web pages, email exports
- Rich text (.rtf) - Legacy documents
- OpenDocument (.odt) - Open-source office docs
Currently only PDFs are supported. Supporting these formats broadens the range of evidence that can be indexed and searched.
Proposed Approach
Extraction Architecture
Add format-specific extractors behind a common trait:
pub trait TextExtractor: Send + Sync {
    /// File extensions this extractor handles
    fn extensions(&self) -> &[&str];
    /// Extract text content from bytes
    fn extract(&self, bytes: &[u8]) -> Result<String>;
}

// Registry of extractors
pub struct ExtractorRegistry {
    extractors: Vec<Box<dyn TextExtractor>>,
}

impl ExtractorRegistry {
    pub fn extract(&self, filename: &str, bytes: &[u8]) -> Result<String> {
        let ext = get_extension(filename);
        let extractor = self.find_for_extension(ext)?;
        extractor.extract(bytes)
    }
}
Format-Specific Libraries
| Format | Library | Notes |
|---|---|---|
| .docx | docx-rs | Pure Rust, handles modern Word |
| .txt | stdlib | Direct UTF-8/encoding detection |
| .md | Direct use | Already plain text |
| .html | scraper or html2text | Strip tags, preserve structure |
| .rtf | rtf-parser | Less common, lower priority |
| .odt | zip + XML parsing | OpenDocument is zipped XML |
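The trait and registry under Extraction Architecture leave `Result`, `get_extension`, and `find_for_extension` undefined. One way to fill them in, using only the standard library and covering just the .txt and .md rows of the table, might look like the sketch below. The `ExtractError` type, the `new` constructor, and `PlainTextExtractor` are assumptions made for illustration and are not part of the proposal.

```rust
use std::fmt;
use std::path::Path;

/// Hypothetical error type; the proposal only says `Result<String>`.
#[derive(Debug, PartialEq)]
pub enum ExtractError {
    UnsupportedExtension(String),
    InvalidContent(String),
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnsupportedExtension(ext) => write!(f, "no extractor for .{}", ext),
            Self::InvalidContent(msg) => write!(f, "could not extract text: {}", msg),
        }
    }
}

impl std::error::Error for ExtractError {}

pub type Result<T> = std::result::Result<T, ExtractError>;

pub trait TextExtractor: Send + Sync {
    /// File extensions this extractor handles (lowercase, no dot).
    fn extensions(&self) -> &[&str];
    /// Extract text content from bytes.
    fn extract(&self, bytes: &[u8]) -> Result<String>;
}

/// Assumed extractor for .txt and .md: accepts the bytes only if they are valid UTF-8.
pub struct PlainTextExtractor;

impl TextExtractor for PlainTextExtractor {
    fn extensions(&self) -> &[&str] {
        &["txt", "md"]
    }

    fn extract(&self, bytes: &[u8]) -> Result<String> {
        String::from_utf8(bytes.to_vec())
            .map_err(|e| ExtractError::InvalidContent(e.to_string()))
    }
}

pub struct ExtractorRegistry {
    extractors: Vec<Box<dyn TextExtractor>>,
}

impl ExtractorRegistry {
    pub fn new(extractors: Vec<Box<dyn TextExtractor>>) -> Self {
        Self { extractors }
    }

    /// Return the first registered extractor that claims this extension.
    fn find_for_extension(&self, ext: &str) -> Result<&dyn TextExtractor> {
        self.extractors
            .iter()
            .find(|e| e.extensions().contains(&ext))
            .map(|e| &**e) // &Box<dyn TextExtractor> -> &dyn TextExtractor
            .ok_or_else(|| ExtractError::UnsupportedExtension(ext.to_string()))
    }

    pub fn extract(&self, filename: &str, bytes: &[u8]) -> Result<String> {
        let ext = get_extension(filename);
        self.find_for_extension(&ext)?.extract(bytes)
    }
}

/// Lowercased extension without the dot, or an empty string if there is none.
fn get_extension(filename: &str) -> String {
    Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase()
}
```

Lowercasing the extension in `get_extension` means `notes.MD` and `notes.md` route to the same extractor. Extractors backed by docx-rs, scraper, or the other listed crates would implement the same trait and be added to the vector passed to `new`.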
Encoding Detection
For plain text files, detect encoding:
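- encoding_rs for decoding (it only sniffs byte-order marks; heuristic charset detection would need a companion crate such as chardetng)
- Default to UTF-8, fall back to common encodings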
- Store detected encoding in metadata for reference
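The "default to UTF-8, then fall back" policy above can be illustrated without any external crate. The sketch below is a standard-library stand-in, not the encoding_rs-based implementation the issue proposes: it checks for a byte-order mark, tries UTF-8, and otherwise treats the bytes as ISO-8859-1 (an assumed choice of fallback). The function names and the encoding labels it returns are hypothetical.

```rust
/// Decode raw bytes and report which encoding was assumed, so the label
/// can be stored in the document metadata.
pub fn decode_text(bytes: &[u8]) -> (String, &'static str) {
    // UTF-8 byte-order mark: strip it and validate the rest.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        if let Ok(s) = std::str::from_utf8(&bytes[3..]) {
            return (s.to_string(), "utf-8");
        }
    }
    // UTF-16 byte-order marks, little-endian then big-endian.
    if bytes.starts_with(&[0xFF, 0xFE]) {
        if let Some(s) = decode_utf16(&bytes[2..], u16::from_le_bytes) {
            return (s, "utf-16le");
        }
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        if let Some(s) = decode_utf16(&bytes[2..], u16::from_be_bytes) {
            return (s, "utf-16be");
        }
    }
    // Default: plain UTF-8 without a BOM.
    if let Ok(s) = std::str::from_utf8(bytes) {
        return (s.to_string(), "utf-8");
    }
    // Fallback: ISO-8859-1 maps every byte to the code point of the same
    // value, so this step cannot fail (but it may produce mojibake).
    (bytes.iter().map(|&b| b as char).collect(), "iso-8859-1")
}

/// Decode UTF-16 code units; returns None on odd length or invalid surrogates.
fn decode_utf16(bytes: &[u8], to_u16: fn([u8; 2]) -> u16) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| to_u16([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

Returning the label alongside the text keeps the "store detected encoding in metadata" step a matter of copying one field. A real implementation would likely want a smarter fallback than ISO-8859-1, since many legacy files are windows-1252 or a non-Latin encoding.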
Ingestion Flow Changes
User adds file
↓
Detect format by extension
↓
Route to appropriate extractor
↓
Extract text → store as blob
↓
Index in milli (same as PDF flow)
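The first three steps of this flow (detect, route, extract) could be sketched as below. `DocFormat`, `IngestedDoc`, and `ingest` are hypothetical names, the extractor is passed in as a closure to keep the example self-contained, and blob storage and milli indexing are left to the caller because the issue does not describe those APIs.

```rust
use std::path::Path;

/// Formats named in this issue, plus the existing PDF path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFormat {
    Pdf,
    Docx,
    Txt,
    Markdown,
    Html,
    Rtf,
    Odt,
}

impl DocFormat {
    /// Detect the format from the file extension, case-insensitively.
    pub fn from_filename(filename: &str) -> Option<Self> {
        let ext = Path::new(filename)
            .extension()?
            .to_str()?
            .to_ascii_lowercase();
        match ext.as_str() {
            "pdf" => Some(Self::Pdf),
            "docx" => Some(Self::Docx),
            "txt" => Some(Self::Txt),
            "md" => Some(Self::Markdown),
            "html" => Some(Self::Html),
            "rtf" => Some(Self::Rtf),
            "odt" => Some(Self::Odt),
            _ => None,
        }
    }
}

/// Result of the extraction step, ready to be stored and indexed.
#[derive(Debug)]
pub struct IngestedDoc {
    pub name: String,
    pub format: DocFormat,
    pub text: String,
}

/// Detect the format, route to the supplied extractor, and return the text.
/// Storing the blob and indexing it would follow, exactly as for PDFs.
pub fn ingest<F>(filename: &str, bytes: &[u8], extract: F) -> Result<IngestedDoc, String>
where
    F: Fn(DocFormat, &[u8]) -> Result<String, String>,
{
    let format = DocFormat::from_filename(filename)
        .ok_or_else(|| format!("unsupported file type: {}", filename))?;
    let text = extract(format, bytes)?;
    Ok(IngestedDoc {
        name: filename.to_string(),
        format,
        text,
    })
}
```

Rejecting unknown extensions before any extractor runs keeps the failure mode explicit: the user sees "unsupported file type" rather than a parser error from the wrong library.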
Metadata Updates
Add format info to document metadata:
{
  "name": "report.docx",
  "format": "docx",
  "original_hash": "...",
  "text_hash": "...",
  "extraction": "docx-rs"
}
Tasks
- Define TextExtractor trait
- Implement .txt extractor with encoding detection
- Implement .md extractor (passthrough with optional frontmatter stripping)
- Implement .docx extractor using docx-rs
- Implement .html extractor
- Refactor PDF extraction to use the trait
- Update ingestion pipeline to route by extension
- Update UI file picker to accept new formats
- Add format indicators in document list
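For the .md task, the optional frontmatter stripping could be as small as the sketch below. It assumes YAML-style frontmatter delimited by `---` lines at the very top of the file; the function name is hypothetical and the issue does not specify which frontmatter styles to support.

```rust
/// Return the Markdown body with a leading `---` ... `---` frontmatter
/// block removed. Input without a complete block is returned unchanged.
pub fn strip_frontmatter(markdown: &str) -> &str {
    // Frontmatter only counts if the opening delimiter is the first line.
    let rest = match markdown
        .strip_prefix("---\n")
        .or_else(|| markdown.strip_prefix("---\r\n"))
    {
        Some(rest) => rest,
        None => return markdown,
    };

    // Walk line by line, tracking the byte offset just past each line,
    // until the closing delimiter is found.
    let mut offset = 0;
    for line in rest.split_inclusive('\n') {
        offset += line.len();
        if line.trim_end_matches(|c: char| c == '\n' || c == '\r') == "---" {
            return &rest[offset..];
        }
    }

    // No closing delimiter: treat the whole input as body text.
    markdown
}
```

One caveat with this approach: a document that opens with a `---` thematic break and contains another one later would be misread as having frontmatter, which is one reason to keep the stripping optional.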
Priority Order
- .txt / .md - Trivial, immediate value
- .docx - Very common, good library support
- .html - Common for web research
- .odt / .rtf - Lower priority, less common
Open Questions
- Should we preserve any formatting metadata (headers, bold, etc.)?
- How to handle embedded images in .docx? (Could connect to vision model later)
- Maximum file size limits per format?
Related
- Independent of #11 (Add local vision/multimodal model support via mistralrs) and #12 (OCR pipeline using local vision models)
- Could later integrate with vision model for embedded images in .docx