Multi-Document extraction bleed (only last result captured) #260

@vayoa

Hey, I've been trying to use this library with multiple text documents, but I consistently find that only the last document in the list is actually analyzed: every result carries that document's extractions, even though each result's text field quotes its own document. Here's a demo script and its output.

import langextract as lx
from rich import print
from dotenv import load_dotenv
load_dotenv()


# Define your prompt
prompt = """
Extract information about Person and Company from the document.
For each Person, extract name, age, role.
For each Company, extract name, location, revenue.
Use exact text from the document; do not invent or paraphrase.
If a value is not present, output null or omit.
"""

# Example annotations
examples = [
    lx.data.ExampleData(
        text="Alice, aged 35, works as a Manager at Acme Corp based in London. Their revenue was $5M last year.",
        extractions=[
            lx.data.Extraction(
                extraction_class="Person",
                extraction_text="Alice",
                attributes={"age": "35", "role": "Manager"}
            ),
            lx.data.Extraction(
                extraction_class="Company",
                extraction_text="Acme Corp",
                attributes={"location": "London", "revenue": "$5M"}
            ),
        ],
    ),
    # possibly more examples
]

# Suppose we have multiple documents
documents = [
    lx.data.Document(document_id='1', text="Bob, 28, is a Developer at Beta LLC located in Berlin. Revenue: €2 million."),
    lx.data.Document(document_id='2', text="Charlie, 42, Senior Engineer at Gamma Inc, USA. Gamma Inc revenue last year was $10M."),
    # ...
]

# Run extraction
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=2,
    max_workers=4,
    # optionally chunking params etc.
)

# Process results
print(list(results))

Output:

[
    AnnotatedDocument(
        extractions=[
            Extraction(extraction_class='Person', extraction_text='Charlie', char_interval=CharInterval(start_pos=0, end_pos=7), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=1, group_index=0, description=None, attributes={'age': '42', 'role': 'Senior Engineer'}),
            Extraction(extraction_class='Company', extraction_text='Gamma Inc', char_interval=CharInterval(start_pos=32, end_pos=41), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=1, description=None, attributes={'location': 'USA', 'revenue': '$10M'})
        ],
        text='Bob, 28, is a Developer at Beta LLC located in Berlin. Revenue: €2 million.'
    ),
    AnnotatedDocument(
        extractions=[
            Extraction(extraction_class='Person', extraction_text='Charlie', char_interval=CharInterval(start_pos=0, end_pos=7), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=1, group_index=0, description=None, attributes={'age': '42', 'role': 'Senior Engineer'}),
            Extraction(extraction_class='Company', extraction_text='Gamma Inc', char_interval=CharInterval(start_pos=32, end_pos=41), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=1, description=None, attributes={'location': 'USA', 'revenue': '$10M'})
        ],
        text='Charlie, 42, Senior Engineer at Gamma Inc, USA. Gamma Inc revenue last year was $10M.'
    )
]

Notice how both results return Charlie as the person name, even though the text of the first result is clearly document 1 (Bob at Beta LLC).
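
For anyone who wants to check their own results for the same symptom, here is a minimal sketch that flags extractions whose character interval does not actually cover the extracted text in the document they are attached to. It does not call langextract at all: `Span` and `misaligned_spans` are hypothetical names, and `Span` is just a stand-in holding the three fields visible in the output above (`extraction_text`, `start_pos`, `end_pos`).

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for the Extraction fields shown in the output."""
    extraction_text: str
    start_pos: int
    end_pos: int


def misaligned_spans(text: str, spans: list[Span]) -> list[Span]:
    """Return the spans whose [start_pos, end_pos) slice of `text`
    differs from their extraction_text."""
    return [s for s in spans if text[s.start_pos:s.end_pos] != s.extraction_text]


# Document texts and intervals copied from the demo output.
doc1_text = "Bob, 28, is a Developer at Beta LLC located in Berlin. Revenue: €2 million."
doc2_text = "Charlie, 42, Senior Engineer at Gamma Inc, USA. Gamma Inc revenue last year was $10M."
spans = [Span("Charlie", 0, 7), Span("Gamma Inc", 32, 41)]

# Both results carried the same two extractions, so check them against each text.
bad_doc1 = misaligned_spans(doc1_text, spans)
bad_doc2 = misaligned_spans(doc2_text, spans)

print("doc 1 mismatches:", [s.extraction_text for s in bad_doc1])
print("doc 2 mismatches:", [s.extraction_text for s in bad_doc2])
```

With the values from the output, both spans fail against document 1 and none fail against document 2, which matches the extractions having been computed from the last document only.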
