Add cross-chunk context awareness to prevent information loss during document chunking #230

@Dhano

Description

⚠️ AI-Generated Issue Disclaimer: This issue was identified and generated using generative AI tools. The problem analysis and proposed solutions have not been manually tested or verified. Please validate the issue description and proposed solutions before implementation.

Problem Statement

The current _annotate_documents_single_pass implementation processes document chunks independently without considering context from previous chunks. This leads to significant information loss and extraction quality degradation, particularly for:

  • Coreference resolution (pronouns like "she", "he", "it")
  • Entity disambiguation (partial names in later chunks)
  • Cross-chunk relationships (entities and relationships spanning multiple chunks)
  • Context-dependent extractions (entities that only make sense with full context)

Current Behavior

# Each chunk is processed in isolation
for text_chunk in batch:
    batch_prompts.append(
        self._prompt_generator.render(
            question=text_chunk.chunk_text,  # Only current chunk
            additional_context=text_chunk.additional_context,  # Only doc-level context
        )
    )

Example Problem

Document: "Dr. Sarah Johnson is a cardiologist at Mayo Clinic. She specializes in heart surgery. Dr. Johnson has 15 years of experience."

Chunk 1: "Dr. Sarah Johnson is a cardiologist at Mayo Clinic."

  • Extracts: {"name": "Dr. Sarah Johnson", "profession": "cardiologist", "hospital": "Mayo Clinic"}

Chunk 2: "She specializes in heart surgery. Dr. Johnson has 15 years of experience."

  • Extracts: {"specialization": "heart surgery", "experience": "15 years"}
  • Lost: Connection between "She"/"Dr. Johnson" and "Dr. Sarah Johnson"

Proposed Solutions

Option 1: Sliding Window Context
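
The issue does not spell this option out, so the following is only a minimal sketch of one way a sliding context window could work, assuming chunks arrive as plain strings in document order. The helper name `build_prompts_with_window`, the `window_chars` parameter, and the `[Previous context]` / `[Current text]` labels are all hypothetical and are not part of the library.

```python
from typing import Iterable, List


def build_prompts_with_window(
    chunks: Iterable[str],
    window_chars: int = 200,
) -> List[str]:
    """Prefix each chunk with the tail of the previous chunk.

    Hypothetical helper for illustration only; not part of the library.
    """
    prompts: List[str] = []
    previous_tail = ""
    for chunk in chunks:
        if previous_tail:
            # Give the model the end of the previous chunk as read-only context.
            prompt = (
                f"[Previous context]: {previous_tail}\n\n"
                f"[Current text]: {chunk}"
            )
        else:
            # First chunk (or window disabled): nothing to prepend.
            prompt = chunk
        prompts.append(prompt)
        # Guard against window_chars == 0, since chunk[-0:] is the whole string.
        previous_tail = chunk[-window_chars:] if window_chars > 0 else ""
    return prompts
```

With the example document above, the prompt for Chunk 2 would carry "Dr. Sarah Johnson is a cardiologist at Mayo Clinic." as previous context, so "She" has an antecedent in view. A real prompt would probably also need to tell the model to extract only from the current text, otherwise entities in the context window may be extracted twice.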

Option 2: Entity Tracking
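
As a rough illustration of what entity tracking might look like, the sketch below keeps a running registry of full names seen in earlier chunks and resolves later partial mentions against it by token overlap. The `EntityRegistry` class and its methods are hypothetical, and the matching rule is an assumption chosen for simplicity. It does not resolve pronouns such as "She".

```python
from typing import List, Optional


class EntityRegistry:
    """Hypothetical running registry of entity names seen in earlier chunks."""

    def __init__(self) -> None:
        self._names: List[str] = []

    def add(self, name: str) -> None:
        """Record a name extracted from an already-processed chunk."""
        if name not in self._names:
            self._names.append(name)

    def resolve(self, mention: str) -> Optional[str]:
        """Return the most recently added name containing every token of the mention."""
        mention_tokens = set(mention.lower().split())
        if not mention_tokens:
            return None
        for name in reversed(self._names):
            if mention_tokens <= set(name.lower().split()):
                return name
        return None

    def as_context(self) -> str:
        """Render the known names as a line that could be added to a prompt."""
        if not self._names:
            return ""
        return "Known entities so far: " + "; ".join(self._names)


registry = EntityRegistry()
registry.add("Dr. Sarah Johnson")          # from Chunk 1
resolved = registry.resolve("Dr. Johnson")  # mention in Chunk 2
```

Here `resolved` is "Dr. Sarah Johnson", and `registry.as_context()` produces a string that could, for instance, be appended to the per-chunk context. Whether that is the right place to inject it is an open design question.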

Option 3: Overlapping Chunks
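
One possible reading of this option is to let consecutive chunks share a few sentences. The sketch below assumes the document has already been split into sentences. `split_with_overlap`, `chunk_size`, and `overlap` are hypothetical names, not existing library parameters.

```python
from typing import List


def split_with_overlap(
    sentences: List[str],
    chunk_size: int = 2,
    overlap: int = 1,
) -> List[List[str]]:
    """Group sentences into chunks that share `overlap` sentences with the previous chunk.

    Hypothetical helper for illustration only; not part of the library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing chunk
        # that only repeats sentences already covered.
        if start + chunk_size >= len(sentences):
            break
    return chunks
```

For the three-sentence example document with `chunk_size=2` and `overlap=1`, the first chunk contains both the "Dr. Sarah Johnson" sentence and the "She specializes" sentence, so the pronoun and its antecedent are seen together. The trade-off is that overlapping text is processed twice, so extractions from the shared region would need deduplication.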

Option 4: Post-Processing Coreference Resolution
