Open
Description
Summary
Extract all available metadata from PDF files during ingestion, including author, title, creation/modification dates, and other standard PDF Info dictionary fields.
Current State
- `DocumentMetadata` only stores: `id`, `name`, `pdf_hash`, `text_hash`, `page_count`, `tags`, `created_at`
- `ExtractedDocument` only returns: `pdf_bytes`, `text`, `page_count`
Available PDF Metadata
PDFs have an Info dictionary (accessed via doc.trailer.get(b"Info") in lopdf) containing:
| Field | Type | Description |
|---|---|---|
| Title | String | Document title |
| Author | String | Person who created the content |
| Subject | String | Subject/topic |
| Keywords | String | Keywords (comma-separated) |
| Creator | String | Application that created the original (e.g., "Microsoft Word") |
| Producer | String | Application that converted to PDF |
| CreationDate | Date | When the document was created |
| ModDate | Date | When the document was last modified |
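Since `Keywords` arrives as a single comma-separated string, anything that wants individual terms (for example a search index) would need to split it. A minimal sketch, assuming plain comma separation as described in the table; `split_keywords` is a hypothetical helper name, not something that exists in the codebase:

```rust
/// Split a PDF `Keywords` string into trimmed, non-empty terms.
/// Hypothetical helper: assumes commas are the only separator.
pub fn split_keywords(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)               // drop whitespace around each term
        .filter(|s| !s.is_empty())    // skip empty entries such as ",,"
        .map(str::to_owned)
        .collect()
}
```

Real-world PDFs are not consistent about the separator, so the stored field can stay as the raw string and the split can happen only where terms are needed.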
Implementation
- `src/core/pdf/extractor.rs` - Extract metadata from the Info dictionary using:

  ```rust
  if let Ok(info_ref) = doc.trailer.get(b"Info") {
      if let Ok((_, info)) = doc.dereference(info_ref) {
          // Extract Title, Author, Subject, Keywords, Creator, Producer
          // Use Object::as_datetime() for CreationDate/ModDate
      }
  }
  ```

- `ExtractedDocument` - Add optional metadata fields
- `DocumentMetadata` - Add new optional fields:
  - `title: Option<String>`
  - `author: Option<String>`
  - `subject: Option<String>`
  - `keywords: Option<String>`
  - `creator: Option<String>`
  - `producer: Option<String>`
  - `pdf_created_at: Option<String>` (from PDF CreationDate)
  - `pdf_modified_at: Option<String>` (from PDF ModDate)
- Search index - Consider indexing title/author/keywords as searchable fields
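As a rough sketch of how the new optional fields could be grouped and populated: the `PdfInfo` struct, `non_empty`, and `from_lookup` below are hypothetical names for illustration. The lookup closure stands in for reading string values out of the lopdf Info dictionary, so the example stays independent of lopdf's API.

```rust
/// Hypothetical holder for values read from the PDF Info dictionary.
/// Field names mirror the proposed `DocumentMetadata` additions.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct PdfInfo {
    pub title: Option<String>,
    pub author: Option<String>,
    pub subject: Option<String>,
    pub keywords: Option<String>,
    pub creator: Option<String>,
    pub producer: Option<String>,
    pub pdf_created_at: Option<String>,
    pub pdf_modified_at: Option<String>,
}

/// Trim a raw value and treat empty strings as absent, so that
/// a blank `Title` ends up as `None` rather than `Some("")`.
pub fn non_empty(raw: Option<&str>) -> Option<String> {
    raw.map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
}

impl PdfInfo {
    /// Build from a key lookup. `get` is an assumption standing in for
    /// "fetch this Info dictionary entry as a string, if present".
    pub fn from_lookup<'a>(get: impl Fn(&str) -> Option<&'a str>) -> Self {
        Self {
            title: non_empty(get("Title")),
            author: non_empty(get("Author")),
            subject: non_empty(get("Subject")),
            keywords: non_empty(get("Keywords")),
            creator: non_empty(get("Creator")),
            producer: non_empty(get("Producer")),
            pdf_created_at: non_empty(get("CreationDate")),
            pdf_modified_at: non_empty(get("ModDate")),
        }
    }
}
```

Normalizing empty strings to `None` is a design choice, not a requirement: some producers write empty Info entries, and collapsing them keeps "missing" and "blank" from diverging in the stored metadata.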
Notes
- All new fields should be `Option<T>` since not all PDFs contain metadata
- lopdf has `Object::as_datetime()` for parsing PDF date format (`D:YYYYMMDDHHmmSS+HH'mm'`)
- Metadata syncs automatically since it's part of the metadata blob in iroh-docs
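To illustrate the date format itself, here is a standard-library-only sketch that turns a PDF date string into an ISO 8601-style string suitable for a field like `pdf_created_at: Option<String>`. It is not a replacement for lopdf's `Object::as_datetime()`. The function names are hypothetical, and the sketch assumes that fields after the year may be omitted (defaulting to `01` for month/day and `00` for time) and that the zone is either absent, `Z`, or a signed `HH'mm'` offset.

```rust
/// Convert a PDF date such as `D:20240131123045+01'00'` into an
/// ISO 8601-style string. Returns `None` if the input is malformed.
pub fn pdf_date_to_iso8601(raw: &str) -> Option<String> {
    let s = raw.trim();
    let s = s.strip_prefix("D:").unwrap_or(s);

    // Split the leading run of ASCII digits from the time zone suffix.
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, tz) = s.split_at(split);
    // Valid lengths: YYYY plus zero or more complete two-digit fields.
    if digits.len() < 4 || digits.len() > 14 || digits.len() % 2 != 0 {
        return None;
    }

    // Read a fixed-width field, or use the default when it is omitted.
    let field = |start: usize, len: usize, default: u32| -> Option<u32> {
        match digits.get(start..start + len) {
            Some(part) => part.parse().ok(),
            None => Some(default),
        }
    };
    let year = field(0, 4, 0)?;
    let month = field(4, 2, 1)?;
    let day = field(6, 2, 1)?;
    let hour = field(8, 2, 0)?;
    let minute = field(10, 2, 0)?;
    let second = field(12, 2, 0)?;

    // Basic range checks only; days per month are not validated.
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }

    let zone = parse_zone(tz)?;
    Some(format!(
        "{year:04}-{month:02}-{day:02}T{hour:02}:{minute:02}:{second:02}{zone}"
    ))
}

/// Parse the zone suffix: empty (unknown offset), `Z`, or `+HH'mm'` / `-HH'mm'`.
fn parse_zone(tz: &str) -> Option<String> {
    if tz.is_empty() {
        return Some(String::new());
    }
    if tz == "Z" {
        return Some("Z".to_owned());
    }
    let sign = match tz.chars().next()? {
        '+' => '+',
        '-' => '-',
        _ => return None,
    };
    // After the sign, fields are separated by apostrophes: HH'mm'
    let mut parts = tz[1..].split('\'');
    let hours: u32 = parts.next()?.parse().ok()?;
    let minutes: u32 = match parts.next() {
        Some("") | None => 0,
        Some(m) => m.parse().ok()?,
    };
    if hours > 23 || minutes > 59 {
        return None;
    }
    Some(format!("{sign}{hours:02}:{minutes:02}"))
}
```

Returning `None` on malformed input fits the `Option<T>` approach above: a PDF with an unparseable date simply ends up with no stored date rather than failing ingestion.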