Support additional document formats (.docx, .txt, .md, .html) #13

## Summary

Expand document ingestion beyond PDFs to support common text-based formats: Word documents, plain text, Markdown, HTML, and other office formats.

## Motivation

Journalists work with diverse document types:
- **Word documents** (.docx) - Reports, memos, drafts
- **Plain text** (.txt) - Logs, exports, simple notes
- **Markdown** (.md) - Technical docs, notes
- **HTML** (.html) - Saved web pages, email exports
- **Rich text** (.rtf) - Legacy documents
- **OpenDocument** (.odt) - Open-source office docs

Currently only PDFs are supported. Adding these formats broadens the range of evidence that can be indexed and searched.

## Proposed Approach

### Extraction Architecture

Add format-specific extractors behind a common trait:

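```rust
use std::path::Path;

// Assumes `anyhow` for the `Result` alias; swap in the project's own error type if it differs.
use anyhow::{anyhow, Result};

pub trait TextExtractor: Send + Sync {
    /// File extensions this extractor handles (lowercase, without the dot)
    fn extensions(&self) -> &[&str];

    /// Extract text content from raw file bytes
    fn extract(&self, bytes: &[u8]) -> Result<String>;
}

/// Registry of extractors, looked up by file extension
pub struct ExtractorRegistry {
    extractors: Vec<Box<dyn TextExtractor>>,
}

impl ExtractorRegistry {
    /// Find the first registered extractor that claims the given extension
    fn find_for_extension(&self, ext: &str) -> Result<&dyn TextExtractor> {
        self.extractors
            .iter()
            .find(|e| e.extensions().contains(&ext))
            .map(|e| e.as_ref())
            .ok_or_else(|| anyhow!("unsupported file extension: .{ext}"))
    }

    pub fn extract(&self, filename: &str, bytes: &[u8]) -> Result<String> {
        // Normalize the extension so `Report.DOCX` and `report.docx` route the same way
        let ext = Path::new(filename)
            .extension()
            .and_then(|e| e.to_str())
            .map(str::to_ascii_lowercase)
            .ok_or_else(|| anyhow!("file has no extension: {filename}"))?;
        self.find_for_extension(&ext)?.extract(bytes)
    }
}
```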
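
To show how a concrete extractor might plug into this trait, here is a minimal, std-only sketch. `PlainTextExtractor` and the free function `find_for_extension` are hypothetical names used for illustration, and the boxed error type is only a stand-in for whatever `Result` alias the project ends up using.

```rust
use std::error::Error;

// Stand-in for the issue's `Result` alias; the real project may use a different error type.
type Result<T> = std::result::Result<T, Box<dyn Error + Send + Sync>>;

pub trait TextExtractor: Send + Sync {
    fn extensions(&self) -> &[&str];
    fn extract(&self, bytes: &[u8]) -> Result<String>;
}

/// Hypothetical extractor for formats that are already plain text.
pub struct PlainTextExtractor;

impl TextExtractor for PlainTextExtractor {
    fn extensions(&self) -> &[&str] {
        &["txt", "md"]
    }

    fn extract(&self, bytes: &[u8]) -> Result<String> {
        // Strict UTF-8 only; encoding detection is out of scope for this sketch.
        Ok(std::str::from_utf8(bytes)?.to_owned())
    }
}

/// Returns the first extractor that claims `ext`, if any.
pub fn find_for_extension<'a>(
    extractors: &'a [Box<dyn TextExtractor>],
    ext: &str,
) -> Option<&'a dyn TextExtractor> {
    extractors
        .iter()
        .find(|e| e.extensions().contains(&ext))
        .map(|e| e.as_ref())
}
```

With this shape, adding a format means writing one more `impl TextExtractor` and pushing it into the list; the lookup code does not change.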

### Format-Specific Libraries

| Format | Library | Notes |
|--------|---------|-------|
| .docx | `docx-rs` | Pure Rust, handles modern Word |
| .txt | stdlib + `encoding_rs` | Decode as UTF-8 first, detect other encodings as a fallback |
| .md | Direct use | Already plain text |
| .html | `scraper` or `html2text` | Strip tags, preserve structure |
| .rtf | `rtf-parser` | Less common, lower priority |
| .odt | `zip` + XML parsing | OpenDocument is zipped XML |

### Encoding Detection

For plain text files, detect encoding:
- `chardetng` for charset detection, `encoding_rs` for decoding (`encoding_rs` on its own only sniffs byte order marks)
- Default to UTF-8, fall back to common encodings
- Store detected encoding in metadata for reference
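
The "UTF-8 first, then fall back" order can be sketched with the standard library alone. This is an illustration of the control flow, not a replacement for a real detector: `decode_text` is a hypothetical helper, and the ISO-8859-1 last resort is an assumption chosen because it cannot fail. A production version would hand the fallback branch to a detection crate.

```rust
/// Decodes plain-text bytes and reports which encoding was assumed.
///
/// Order: UTF-8 BOM, UTF-16 BOMs, strict UTF-8, then ISO-8859-1 as a last resort.
pub fn decode_text(bytes: &[u8]) -> (String, &'static str) {
    // UTF-8 with a byte order mark: drop the BOM, replace any invalid sequences.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return (String::from_utf8_lossy(&bytes[3..]).into_owned(), "utf-8");
    }
    // UTF-16 is only recognized when a BOM announces it.
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return (decode_utf16(&bytes[2..], u16::from_le_bytes), "utf-16le");
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return (decode_utf16(&bytes[2..], u16::from_be_bytes), "utf-16be");
    }
    match std::str::from_utf8(bytes) {
        Ok(s) => (s.to_owned(), "utf-8"),
        // Assumption: treat anything else as ISO-8859-1, where each byte maps
        // directly to the Unicode code point with the same value.
        Err(_) => (bytes.iter().map(|&b| b as char).collect(), "iso-8859-1"),
    }
}

/// Decodes UTF-16 code units, substituting U+FFFD for unpaired surrogates.
/// A trailing odd byte is ignored.
fn decode_utf16(bytes: &[u8], to_u16: fn([u8; 2]) -> u16) -> String {
    let units = bytes.chunks_exact(2).map(|pair| to_u16([pair[0], pair[1]]));
    std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}
```

The returned label is the value that would be stored in the document metadata as the detected encoding.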

### Ingestion Flow Changes

```
User adds file
    ↓
Detect format by extension
    ↓
Route to appropriate extractor
    ↓
Extract text → store as blob
    ↓
Index in milli (same as PDF flow)
```
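
The "detect format by extension" step could be modeled as a small enum, sketched below. `DocFormat` and its methods are hypothetical names, and accepting `.markdown` and `.htm` as aliases is an assumption that goes beyond the extensions listed in this issue.

```rust
use std::path::Path;

/// Formats named in this issue, plus the existing PDF path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFormat {
    Pdf,
    Docx,
    Txt,
    Markdown,
    Html,
    Rtf,
    Odt,
}

impl DocFormat {
    /// Detects the format from the file extension, case-insensitively.
    /// Returns `None` for missing or unsupported extensions.
    pub fn from_filename(filename: &str) -> Option<Self> {
        let ext = Path::new(filename).extension()?.to_str()?.to_ascii_lowercase();
        match ext.as_str() {
            "pdf" => Some(Self::Pdf),
            "docx" => Some(Self::Docx),
            "txt" => Some(Self::Txt),
            "md" | "markdown" => Some(Self::Markdown), // alias is an assumption
            "html" | "htm" => Some(Self::Html),        // alias is an assumption
            "rtf" => Some(Self::Rtf),
            "odt" => Some(Self::Odt),
            _ => None,
        }
    }

    /// Short name suitable for the `format` metadata field.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Pdf => "pdf",
            Self::Docx => "docx",
            Self::Txt => "txt",
            Self::Markdown => "md",
            Self::Html => "html",
            Self::Rtf => "rtf",
            Self::Odt => "odt",
        }
    }
}
```

Returning `None` for unknown extensions lets the ingestion pipeline reject a file before any bytes are handed to an extractor.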

### Metadata Updates

Add format info to document metadata:

```json
{
  "name": "report.docx",
  "format": "docx",
  "original_hash": "...",
  "text_hash": "...",
  "extraction": "docx-rs"
}
```

## Tasks

- [ ] Define `TextExtractor` trait
- [ ] Implement `.txt` extractor with encoding detection
- [ ] Implement `.md` extractor (passthrough with optional frontmatter stripping)
- [ ] Implement `.docx` extractor using docx-rs
- [ ] Implement `.html` extractor
- [ ] Refactor PDF extraction to use the trait
- [ ] Update ingestion pipeline to route by extension
- [ ] Update UI file picker to accept new formats
- [ ] Add format indicators in document list
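
For the optional frontmatter stripping in the `.md` task, one possible shape is sketched below. It assumes YAML-style frontmatter delimited by `---` lines at the very top of the file; `strip_frontmatter` is a hypothetical helper name.

```rust
/// Removes a leading frontmatter block delimited by `---` lines.
/// Returns the input unchanged when no complete block is found.
pub fn strip_frontmatter(text: &str) -> &str {
    // The opening delimiter must be the very first line of the file.
    let Some(after_open) = text
        .strip_prefix("---\n")
        .or_else(|| text.strip_prefix("---\r\n"))
    else {
        return text;
    };

    // Walk line by line until the closing delimiter, tracking the byte offset.
    let mut offset = 0;
    for line in after_open.split_inclusive('\n') {
        offset += line.len();
        if line.trim_end() == "---" {
            return &after_open[offset..];
        }
    }

    // No closing delimiter: treat the whole file as body text.
    text
}
```

One limitation to keep in mind: a Markdown file that opens with a `---` horizontal rule and contains a second one later would lose the text in between, which is one reason to keep the stripping optional.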

## Priority Order

1. **.txt / .md** - Trivial, immediate value
2. **.docx** - Very common, good library support
3. **.html** - Common for web research
4. **.odt / .rtf** - Lower priority, less common

## Open Questions

1. Should we preserve any formatting metadata (headings, bold, etc.)?
2. How should embedded images in .docx be handled? (Could connect to a vision model later)
3. Should there be maximum file size limits per format?

## Related

- Independent of #11 (Vision models) and #12 (OCR)
- Could later integrate with vision model for embedded images in .docx
