Support additional document formats (.docx, .txt, .md, .html) #13

@monneyboi

Summary

Expand document ingestion beyond PDFs to support common text-based formats: Word documents, plain text, Markdown, HTML, and other office formats.

Motivation

Journalists work with diverse document types:

  • Word documents (.docx) - Reports, memos, drafts
  • Plain text (.txt) - Logs, exports, simple notes
  • Markdown (.md) - Technical docs, notes
  • HTML (.html) - Saved web pages, email exports
  • Rich text (.rtf) - Legacy documents
  • OpenDocument (.odt) - Open-source office docs

Currently only PDFs are supported. Supporting these formats would broaden the range of evidence that can be indexed and searched.

Proposed Approach

Extraction Architecture

Add format-specific extractors behind a common trait:

pub trait TextExtractor: Send + Sync {
    /// File extensions this extractor handles
    fn extensions(&self) -> &[&str];
    
    /// Extract text content from bytes
    fn extract(&self, bytes: &[u8]) -> Result<String>;
}

// Registry of extractors
pub struct ExtractorRegistry {
    extractors: Vec<Box<dyn TextExtractor>>,
}

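impl ExtractorRegistry {
    /// Route `bytes` to the extractor registered for the file's extension.
    pub fn extract(&self, filename: &str, bytes: &[u8]) -> Result<String> {
        // `get_extension` and `find_for_extension` are helpers not shown here;
        // the latter returns an error when no extractor handles the extension.
        let ext = get_extension(filename);
        let extractor = self.find_for_extension(ext)?;
        extractor.extract(bytes)
    }
}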
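
A minimal, std-only sketch of how the trait and registry above could fit together. Several pieces here are assumptions rather than part of the proposal: the `Result` alias (the issue does not say which one is intended), the `PlainTextExtractor` type, the `new` constructor, and the bodies of `get_extension` and `find_for_extension`.

```rust
use std::error::Error;
use std::path::Path;

// Assumption: a boxed std error stands in for whatever `Result` the project uses.
pub type Result<T> = std::result::Result<T, Box<dyn Error + Send + Sync>>;

pub trait TextExtractor: Send + Sync {
    /// File extensions this extractor handles (lowercase, without the dot).
    fn extensions(&self) -> &[&str];
    /// Extract text content from bytes.
    fn extract(&self, bytes: &[u8]) -> Result<String>;
}

/// Hypothetical extractor for .txt and .md that accepts strict UTF-8 only.
pub struct PlainTextExtractor;

impl TextExtractor for PlainTextExtractor {
    fn extensions(&self) -> &[&str] {
        &["txt", "md"]
    }

    fn extract(&self, bytes: &[u8]) -> Result<String> {
        Ok(String::from_utf8(bytes.to_vec())?)
    }
}

pub struct ExtractorRegistry {
    extractors: Vec<Box<dyn TextExtractor>>,
}

impl ExtractorRegistry {
    pub fn new(extractors: Vec<Box<dyn TextExtractor>>) -> Self {
        Self { extractors }
    }

    /// Lowercased extension of `filename`, if it has one.
    fn get_extension(filename: &str) -> Option<String> {
        Path::new(filename)
            .extension()
            .and_then(|e| e.to_str())
            .map(|e| e.to_ascii_lowercase())
    }

    /// First registered extractor that lists `ext`, or an error if none does.
    fn find_for_extension(&self, ext: &str) -> Result<&dyn TextExtractor> {
        for extractor in &self.extractors {
            if extractor.extensions().contains(&ext) {
                return Ok(&**extractor);
            }
        }
        Err(format!("unsupported format: .{ext}").into())
    }

    pub fn extract(&self, filename: &str, bytes: &[u8]) -> Result<String> {
        let ext = Self::get_extension(filename)
            .ok_or_else(|| format!("no file extension: {filename}"))?;
        self.find_for_extension(&ext)?.extract(bytes)
    }
}
```

Lowercasing the extension means `notes.TXT` and `notes.txt` reach the same extractor. Registration order decides which extractor wins if two claim the same extension.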

Format-Specific Libraries

Format   Library                Notes
.docx    docx-rs                Pure Rust, handles modern Word
.txt     stdlib                 Direct UTF-8/encoding detection
.md      Direct use             Already plain text
.html    scraper or html2text   Strip tags, preserve structure
.rtf     rtf-parser             Less common, lower priority
.odt     zip + XML parsing      OpenDocument is zipped XML

Encoding Detection

For plain text files, detect encoding:

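  • encoding_rs for decoding and byte-order-mark sniffing (guessing an unlabelled legacy charset needs a separate detector such as chardetng)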
  • Default to UTF-8, fall back to common encodings
  • Store detected encoding in metadata for reference
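
The fallback order above could look roughly like this. This is a std-only stand-in, not the encoding_rs-based implementation the issue proposes: it checks for a byte-order mark, then tries strict UTF-8, then falls back to ISO-8859-1 as the single "common encoding". The `Decoded` struct and the function names are hypothetical.

```rust
/// Decoded text plus the encoding label to store in metadata.
#[derive(Debug, PartialEq)]
pub struct Decoded {
    pub text: String,
    pub encoding: &'static str,
}

pub fn decode_text(bytes: &[u8]) -> Decoded {
    // 1. A byte-order mark is the strongest signal, so check it first.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        let text = String::from_utf8_lossy(&bytes[3..]).into_owned();
        return Decoded { text, encoding: "UTF-8" };
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        let text = decode_utf16(&bytes[2..], u16::from_le_bytes);
        return Decoded { text, encoding: "UTF-16LE" };
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        let text = decode_utf16(&bytes[2..], u16::from_be_bytes);
        return Decoded { text, encoding: "UTF-16BE" };
    }

    // 2. Default to UTF-8 when the bytes are valid UTF-8.
    if let Ok(s) = std::str::from_utf8(bytes) {
        return Decoded { text: s.to_owned(), encoding: "UTF-8" };
    }

    // 3. Last resort: ISO-8859-1 maps every byte to the code point of the
    //    same value, so this step can never fail (but it may guess wrong).
    let text = bytes.iter().map(|&b| b as char).collect();
    Decoded { text, encoding: "ISO-8859-1" }
}

/// Decode UTF-16 code units, replacing invalid sequences with U+FFFD.
/// A trailing odd byte is ignored.
fn decode_utf16(bytes: &[u8], to_u16: fn([u8; 2]) -> u16) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| to_u16([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}
```

Because the final fallback always succeeds, the stored encoding label is the only hint that a file may have been decoded with the wrong charset, which is one reason to keep it in metadata.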

Ingestion Flow Changes

User adds file
    ↓
Detect format by extension
    ↓
Route to appropriate extractor
    ↓
Extract text → store as blob
    ↓
Index in milli (same as PDF flow)
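
The first three steps of this flow (detect, route, extract) might be sketched as below. The `Format` enum, `IngestError`, `Extracted`, and `extract_for_ingestion` are hypothetical names; only the plain-text branch is filled in, and storing the blob and indexing in milli are left out because the issue does not show those APIs.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Txt,
    Markdown,
    Docx,
    Html,
    Rtf,
    Odt,
}

impl Format {
    /// Detect the format from the file extension (case-insensitive).
    pub fn from_filename(filename: &str) -> Option<Self> {
        let ext = Path::new(filename).extension()?.to_str()?.to_ascii_lowercase();
        match ext.as_str() {
            "pdf" => Some(Self::Pdf),
            "txt" => Some(Self::Txt),
            "md" => Some(Self::Markdown),
            "docx" => Some(Self::Docx),
            "html" => Some(Self::Html),
            "rtf" => Some(Self::Rtf),
            "odt" => Some(Self::Odt),
            _ => None,
        }
    }

    /// Value for the "format" field in document metadata.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Pdf => "pdf",
            Self::Txt => "txt",
            Self::Markdown => "md",
            Self::Docx => "docx",
            Self::Html => "html",
            Self::Rtf => "rtf",
            Self::Odt => "odt",
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum IngestError {
    UnknownFormat,
    NotImplemented(Format),
    InvalidText,
}

#[derive(Debug)]
pub struct Extracted {
    pub format: Format,
    pub text: String,
}

/// Detect the format, route to an extractor, and return the extracted text.
pub fn extract_for_ingestion(filename: &str, bytes: &[u8]) -> Result<Extracted, IngestError> {
    let format = Format::from_filename(filename).ok_or(IngestError::UnknownFormat)?;
    let text = match format {
        // Plain text and Markdown pass through; strict UTF-8 for brevity.
        Format::Txt | Format::Markdown => std::str::from_utf8(bytes)
            .map_err(|_| IngestError::InvalidText)?
            .to_owned(),
        // Every other extractor is still to be written in this sketch.
        other => return Err(IngestError::NotImplemented(other)),
    };
    Ok(Extracted { format, text })
}
```

Everything after extraction stays format-agnostic: the caller gets a `String` regardless of the source format, which is what lets the new formats share the PDF flow's storage and indexing steps.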

Metadata Updates

Add format info to document metadata:

{
  "name": "report.docx",
  "format": "docx",
  "original_hash": "...",
  "text_hash": "...",
  "extraction": "docx-rs"
}
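
As a rough illustration, the record above could be modelled as a plain struct. The `DocumentMetadata` type and the hand-rolled `to_json` are assumptions made to keep the sketch dependency-free; a real implementation would more likely derive serialization with a JSON library. The hash algorithm is not specified in this issue, so the hashes are carried as opaque strings.

```rust
/// Per-document metadata, mirroring the JSON fields above.
pub struct DocumentMetadata {
    pub name: String,
    pub format: String,
    /// Hash of the original file bytes (algorithm not specified here).
    pub original_hash: String,
    /// Hash of the extracted text (algorithm not specified here).
    pub text_hash: String,
    /// Name of the extractor that produced the text, e.g. "docx-rs".
    pub extraction: String,
}

/// Quote and escape a string as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters need a \uXXXX escape.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl DocumentMetadata {
    /// Serialize to a compact JSON object with a fixed field order.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"name\":{},\"format\":{},\"original_hash\":{},\"text_hash\":{},\"extraction\":{}}}",
            json_string(&self.name),
            json_string(&self.format),
            json_string(&self.original_hash),
            json_string(&self.text_hash),
            json_string(&self.extraction),
        )
    }
}
```

Keeping `original_hash` and `text_hash` separate lets the same source file be re-extracted later (for example with a better extractor) while still detecting whether the indexed text actually changed.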

Tasks

  • Define TextExtractor trait
  • Implement .txt extractor with encoding detection
  • Implement .md extractor (passthrough with optional frontmatter stripping)
  • Implement .docx extractor using docx-rs
  • Implement .html extractor
  • Refactor PDF extraction to use the trait
  • Update ingestion pipeline to route by extension
  • Update UI file picker to accept new formats
  • Add format indicators in document list
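
For the .md task, optional frontmatter stripping could be as small as the sketch below. It assumes YAML-style frontmatter delimited by `---` lines at the very top of the file; the function name is hypothetical, and other frontmatter styles (such as `+++` for TOML) are not handled.

```rust
/// Return `md` without a leading frontmatter block delimited by `---` lines.
/// If the file has no frontmatter, or the block is never closed, the input
/// is returned unchanged.
pub fn strip_frontmatter(md: &str) -> &str {
    // Frontmatter must open on the very first line.
    let rest = match md
        .strip_prefix("---\n")
        .or_else(|| md.strip_prefix("---\r\n"))
    {
        Some(rest) => rest,
        None => return md,
    };

    // Walk line by line until the closing delimiter, tracking the byte offset
    // so the remainder can be returned as a slice of the original input.
    let mut offset = 0;
    for line in rest.split_inclusive('\n') {
        offset += line.len();
        if line.trim_end() == "---" {
            return &rest[offset..];
        }
    }

    // No closing delimiter: treat the leading `---` as ordinary content.
    md
}
```

One caveat: a document that opens with a `---` horizontal rule and contains another `---` line later would lose everything in between, which is a reason to keep the stripping optional.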

Priority Order

  1. .txt / .md - Trivial, immediate value
  2. .docx - Very common, good library support
  3. .html - Common for web research
  4. .odt / .rtf - Lower priority, less common

Open Questions

  1. Should we preserve any formatting metadata (headers, bold, etc.)?
  2. How to handle embedded images in .docx? (Could connect to vision model later)
  3. Maximum file size limits per format?
