Description
⚠️ AI-Generated Issue Disclaimer: This issue was identified and generated using generative AI tools. The problem analysis and proposed solutions have not been manually tested or verified. Please validate the issue description and proposed solutions before implementation.
Problem Statement
The current _annotate_documents_single_pass implementation processes each document chunk independently, with no context from preceding chunks. Information that depends on earlier text is lost and extraction quality degrades, particularly for:
- Coreference resolution (pronouns like "she", "he", "it")
- Entity disambiguation (partial names in later chunks)
- Cross-chunk relationships (entities and relationships spanning multiple chunks)
- Context-dependent extractions (entities that only make sense with full context)
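As a rough illustration of the first category, the sketch below flags chunks that open with a pronoun, since such chunks most likely depend on an antecedent in earlier text. The helper name `leading_pronoun` and the pronoun list are assumptions made for this example; they are not part of the library, and the check is a simple heuristic rather than real coreference resolution.

```python
import re

# Assumed, non-exhaustive pronoun list for this illustration only.
PRONOUNS = {"she", "he", "it", "they"}


def leading_pronoun(chunk: str):
    """Return the chunk's first word if it is a pronoun, else None.

    Hypothetical diagnostic helper: a chunk that starts with a pronoun
    probably refers to an entity introduced in an earlier chunk.
    """
    match = re.match(r"\W*(\w+)", chunk)
    if match and match.group(1).lower() in PRONOUNS:
        return match.group(1)
    return None


if __name__ == "__main__":
    for chunk in (
        "Dr. Sarah Johnson is a cardiologist at Mayo Clinic.",
        "She specializes in heart surgery.",
    ):
        print(repr(leading_pronoun(chunk)), "<-", chunk)
```

A check like this only indicates where isolated processing is likely to hurt; it does not recover the missing antecedent.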
Current Behavior
# Each chunk is processed in isolation
for text_chunk in batch:
    batch_prompts.append(
        self._prompt_generator.render(
            question=text_chunk.chunk_text,  # Only current chunk
            additional_context=text_chunk.additional_context,  # Only doc-level context
        )
    )
Example Problem
Document: "Dr. Sarah Johnson is a cardiologist at Mayo Clinic. She specializes in heart surgery. Dr. Johnson has 15 years of experience."
Chunk 1: "Dr. Sarah Johnson is a cardiologist at Mayo Clinic."
- Extracts:
{"name": "Dr. Sarah Johnson", "profession": "cardiologist", "hospital": "Mayo Clinic"}
Chunk 2: "She specializes in heart surgery. Dr. Johnson has 15 years of experience."
- Extracts:
{"specialization": "heart surgery", "experience": "15 years"}
- Lost: Connection between "She"/"Dr. Johnson" and "Dr. Sarah Johnson"
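The example can be reproduced without the library. The following is a minimal sketch, assuming a hypothetical `build_prompt` helper in place of the real prompt renderer and a hypothetical `previous_chunk` parameter as one possible way to carry earlier text forward. It only shows what text each prompt contains; it does not call a model and is not a tested fix.

```python
from typing import List, Optional

# The two chunks exactly as given in the example.
CHUNKS = [
    "Dr. Sarah Johnson is a cardiologist at Mayo Clinic.",
    "She specializes in heart surgery. Dr. Johnson has 15 years of experience.",
]


def build_prompt(chunk: str, previous_chunk: Optional[str] = None) -> str:
    """Hypothetical prompt builder; stands in for the library's renderer."""
    parts = []
    if previous_chunk is not None:
        parts.append(f"Previous context: {previous_chunk}")
    parts.append(f"Text: {chunk}")
    return "\n".join(parts)


def isolated_prompts(chunks: List[str]) -> List[str]:
    """Mirror the current behavior: each prompt sees only its own chunk."""
    return [build_prompt(chunk) for chunk in chunks]


def context_aware_prompts(chunks: List[str]) -> List[str]:
    """Assumed alternative: each prompt also sees the preceding chunk."""
    prompts = []
    previous = None
    for chunk in chunks:
        prompts.append(build_prompt(chunk, previous))
        previous = chunk
    return prompts


if __name__ == "__main__":
    full_name = "Sarah Johnson"
    # Does the prompt for chunk 2 contain the full name at all?
    print("isolated:", full_name in isolated_prompts(CHUNKS)[1])
    print("context-aware:", full_name in context_aware_prompts(CHUNKS)[1])
```

With isolated prompts, the text sent for chunk 2 never mentions "Sarah Johnson", so no model could link "She" or "Dr. Johnson" back to the full name. Carrying the previous chunk forward is only one option and increases prompt length.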