Auto-summarize and tag documents on import #6

@monneyboi

Summary

When importing documents, automatically generate summaries and tags to give the agent better context during discovery queries.

Motivation

Currently, the agent must read a document's full text to understand its content. Pre-computed summaries and tags would:

  • Speed up agent queries (scan summaries instead of full text)
  • Improve semantic matching (summaries capture key themes)
  • Enable tag-based filtering before search
  • Sync to peers via iroh-docs (shared context)

Proposed Implementation

Metadata Extension

Add summary and tags fields to document metadata:

{
  "name": "paper.pdf",
  "pdf_hash": "...",
  "text_hash": "...",
  "summary": "Analysis of Arctic ice loss 2010-2023...",
  "tags": ["climate", "arctic", "data-analysis"],
  "created_at": "..."
}
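
A minimal sketch of how the extended record could be modelled. The field names come from the example above; the `DocumentMeta` class and the choice to treat `summary` and `tags` as optional (absent until background processing finishes) are assumptions for illustration, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DocumentMeta:
    """Hypothetical metadata record; field names follow the example above."""
    name: str
    pdf_hash: str
    text_hash: str
    created_at: str
    # Assumption: both new fields are optional so older records stay valid.
    summary: Optional[str] = None
    tags: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "DocumentMeta":
        data = json.loads(raw)
        # Records written before this change simply lack the new keys.
        return cls(
            name=data["name"],
            pdf_hash=data["pdf_hash"],
            text_hash=data["text_hash"],
            created_at=data["created_at"],
            summary=data.get("summary"),
            tags=list(data.get("tags", [])),
        )
```

Making the new fields optional means a record without a summary still parses, which fits the idea that a document is usable before summarization completes.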

Background Processing

  1. Import document immediately (don't block on summarization)
  2. Queue summarization as background task
  3. Show "processing" state in UI
  4. Document is searchable by full text immediately; summary enhances discovery once ready
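
The four steps above could be sketched roughly as follows. This is only an in-memory illustration under stated assumptions: `ImportPipeline`, the `summarize` callback (a stand-in for the LLM call), and the `"processing"` / `"ready"` / `"failed"` state names are all hypothetical, and a real implementation would persist state rather than keep it in a dict.

```python
import queue
import threading
from typing import Callable, Dict, List


class ImportPipeline:
    """Hypothetical sketch: import returns at once, summaries arrive later."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize  # stand-in for the LLM call
        self._queue: "queue.Queue[str]" = queue.Queue()
        self._lock = threading.Lock()
        self.docs: Dict[str, dict] = {}
        threading.Thread(target=self._worker, daemon=True).start()

    def import_document(self, name: str, text: str) -> None:
        # Steps 1-3: store the document, mark it "processing", enqueue it.
        with self._lock:
            self.docs[name] = {"text": text, "summary": None, "state": "processing"}
        self._queue.put(name)

    def search(self, term: str) -> List[str]:
        # Step 4: full-text search works regardless of summary state.
        with self._lock:
            return [n for n, d in self.docs.items() if term in d["text"]]

    def wait_idle(self) -> None:
        # Block until every queued document has been processed.
        self._queue.join()

    def _worker(self) -> None:
        while True:
            name = self._queue.get()
            try:
                with self._lock:
                    text = self.docs[name]["text"]
                summary = self._summarize(text)  # slow call, made outside the lock
                with self._lock:
                    self.docs[name].update(summary=summary, state="ready")
            except Exception:
                with self._lock:
                    self.docs[name]["state"] = "failed"
            finally:
                self._queue.task_done()
```

Running the summarizer outside the lock keeps `search` responsive while a summary is being generated.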

Tag Generation

  • LLM-suggested topic tags
  • Entity extraction (people, organizations, dates) as structured metadata
  • Useful for investigative journalism workflows
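
LLM-suggested tags may vary in case and spacing, so some normalization before storing them could help tag-based filtering. A minimal sketch, assuming the lowercase, hyphenated form seen in the metadata example (`"data-analysis"`); the `normalize_tags` helper, its rules, and the `limit` parameter are hypothetical.

```python
import re
from typing import Iterable, List


def normalize_tags(raw_tags: Iterable[str], limit: int = 10) -> List[str]:
    """Lowercase, hyphenate, and de-duplicate tags, keeping first-seen order."""
    seen = set()
    result: List[str] = []
    for raw in raw_tags:
        # Collapse any run of characters outside a-z / 0-9 into one hyphen.
        # Simplification: this is ASCII-only and drops non-Latin characters.
        tag = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
        if tag and tag not in seen:
            seen.add(tag)
            result.append(tag)
        if len(result) == limit:
            break
    return result
```

For example, `normalize_tags(["Climate", "climate ", "Data Analysis"])` yields `["climate", "data-analysis"]`.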

Considerations

  • Batch imports: Process queue with progress indicator
  • Re-summarization: Allow regenerating summaries after model upgrade
  • Sync: Summaries sync via iroh-docs, so peers benefit without re-processing
