Extract PDF metadata (author, title, dates, etc.) #5

@monneyboi

Summary

Extract all available metadata from PDF files during ingestion, including author, title, creation/modification dates, and other standard PDF Info dictionary fields.

Current State

  • DocumentMetadata only stores: id, name, pdf_hash, text_hash, page_count, tags, created_at
  • ExtractedDocument only returns: pdf_bytes, text, page_count

Available PDF Metadata

PDFs may carry an optional Info dictionary (accessed via doc.trailer.get(b"Info") in lopdf). When present, it can contain any of the following standard entries:

| Field        | Type   | Description                                                   |
|--------------|--------|---------------------------------------------------------------|
| Title        | String | Document title                                                |
| Author       | String | Person who created the content                                |
| Subject      | String | Subject/topic                                                 |
| Keywords     | String | Keywords (comma-separated)                                    |
| Creator      | String | Application that created the original (e.g., "Microsoft Word") |
| Producer     | String | Application that converted to PDF                             |
| CreationDate | Date   | When the document was created                                 |
| ModDate      | Date   | When the document was last modified                           |

Implementation

  1. src/core/pdf/extractor.rs - Extract metadata from the Info dictionary using:

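    if let Ok(info_ref) = doc.trailer.get(b"Info") {
        // Info is usually an indirect reference; resolve it to the actual object.
        if let Ok((_, info)) = doc.dereference(info_ref) {
            if let Ok(dict) = info.as_dict() {
                // String fields: dict.get(b"Title"), b"Author", b"Subject",
                // b"Keywords", b"Creator", b"Producer"
                // Date fields: Object::as_datetime() on b"CreationDate" / b"ModDate"
            }
        }
    }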
  2. ExtractedDocument - Add optional metadata fields

  3. DocumentMetadata - Add new optional fields:

    • title: Option<String>
    • author: Option<String>
    • subject: Option<String>
    • keywords: Option<String>
    • creator: Option<String>
    • producer: Option<String>
    • pdf_created_at: Option<String> (from PDF CreationDate)
    • pdf_modified_at: Option<String> (from PDF ModDate)
  4. Search index - Consider indexing title/author/keywords as searchable fields
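
The new fields from step 3 could be grouped roughly as in the sketch below. `PdfInfo`, `is_empty`, and `searchable_text` are hypothetical names used only for illustration; the issue does not say whether the fields should live directly on DocumentMetadata or in a nested struct, and the way title/author/keywords are joined for indexing (step 4) is an assumption.

```rust
/// Optional fields read from the PDF Info dictionary.
/// `PdfInfo` is a hypothetical grouping, not an existing type in the codebase.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PdfInfo {
    pub title: Option<String>,
    pub author: Option<String>,
    pub subject: Option<String>,
    pub keywords: Option<String>,
    pub creator: Option<String>,
    pub producer: Option<String>,
    /// From PDF CreationDate.
    pub pdf_created_at: Option<String>,
    /// From PDF ModDate.
    pub pdf_modified_at: Option<String>,
}

impl PdfInfo {
    /// True when every field is `None`, i.e. the PDF carried no usable Info entries.
    pub fn is_empty(&self) -> bool {
        *self == PdfInfo::default()
    }

    /// Space-joined title, author, and keywords (skipping missing ones),
    /// as one possible input for a search index.
    pub fn searchable_text(&self) -> String {
        [&self.title, &self.author, &self.keywords]
            .iter()
            .filter_map(|field| field.as_deref())
            .collect::<Vec<_>>()
            .join(" ")
    }
}
```

Deriving `Default` means a PDF without an Info dictionary simply yields `PdfInfo::default()`, so callers do not need a separate "no metadata" code path.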

Notes

  • All new fields should be Option<T> since not all PDFs contain metadata
  • lopdf has Object::as_datetime() for parsing PDF date format (D:YYYYMMDDHHmmSS+HH'mm')
  • Metadata syncs automatically since it's part of the metadata blob in iroh-docs
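
Since pdf_created_at and pdf_modified_at are proposed as Option<String>, the PDF date format has to be turned into some string form. The sketch below shows one way to do that using only the standard library, as an illustration of the format rather than a replacement for Object::as_datetime(). `pdf_date_to_iso8601` and `parse_offset` are hypothetical helpers, and choosing ISO 8601 as the stored representation is an assumption. Fields after the year are treated as optional, and a date without a timezone is emitted without an offset.

```rust
/// Convert a PDF date string such as "D:20240115103000+01'00'" into an
/// ISO 8601 string ("2024-01-15T10:30:00+01:00").
/// Hypothetical helper for illustration; it does not use lopdf.
pub fn pdf_date_to_iso8601(raw: &str) -> Option<String> {
    let s = raw.trim();
    let s = s.strip_prefix("D:").unwrap_or(s);

    // Split the leading run of digits (date and time) from the timezone suffix.
    let digits_len = s.bytes().take_while(|b| b.is_ascii_digit()).count();
    let (digits, tz) = s.split_at(digits_len);
    // Only the four-digit year is mandatory; later fields come in pairs.
    if digits.len() < 4 || digits.len() > 14 || digits.len() % 2 != 0 {
        return None;
    }

    // Read a two-digit field at `start`, or fall back to `default` if absent.
    let field = |start: usize, default: u32| -> Option<u32> {
        match digits.get(start..start + 2) {
            Some(part) => part.parse().ok(),
            None => Some(default),
        }
    };
    let year: u32 = digits[..4].parse().ok()?;
    let (month, day) = (field(4, 1)?, field(6, 1)?);
    let (hour, minute, second) = (field(8, 0)?, field(10, 0)?, field(12, 0)?);
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }

    let offset = parse_offset(tz)?;
    Some(format!(
        "{year:04}-{month:02}-{day:02}T{hour:02}:{minute:02}:{second:02}{offset}"
    ))
}

/// Turn the PDF timezone suffix ("Z", "+01'00'", "-0500", or empty) into an
/// ISO 8601 offset. An empty suffix yields an empty offset (local time).
fn parse_offset(tz: &str) -> Option<String> {
    let mut chars = tz.chars();
    let sign = match chars.next() {
        None => return Some(String::new()),
        Some('Z') => return Some("Z".to_string()),
        Some(c @ ('+' | '-')) => c,
        Some(_) => return None,
    };
    // Drop the apostrophes PDF uses as separators; the rest must be digits.
    let rest: String = chars.filter(|c| *c != '\'').collect();
    if !rest.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let (hours, minutes) = match rest.len() {
        2 => (&rest[..2], "00"),
        4 => (&rest[..2], &rest[2..]),
        _ => return None,
    };
    Some(format!("{sign}{hours}:{minutes}"))
}
```

Returning `Option<String>` keeps malformed dates from failing ingestion: a date that does not parse simply leaves the corresponding field as `None`.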
