Open
Description
Hey, I've been trying to use this library with multiple text documents, but only the last document in the list ever seems to be analyzed: every result carries the last document's extractions, even though each result's text field quotes its own document. Here's a demo script and its output.
import langextract as lx
from rich import print
from dotenv import load_dotenv
load_dotenv()
# Define your prompt
prompt = """
Extract information about Person and Company from the document.
For each Person, extract name, age, role.
For each Company, extract name, location, revenue.
Use exact text from the document; do not invent or paraphrase.
If a value is not present, output null or omit.
"""
# Example annotations
examples = [
    lx.data.ExampleData(
        text="Alice, aged 35, works as a Manager at Acme Corp based in London. Their revenue was $5M last year.",
        extractions=[
            lx.data.Extraction(
                extraction_class="Person",
                extraction_text="Alice",
                attributes={"age": "35", "role": "Manager"}
            ),
            lx.data.Extraction(
                extraction_class="Company",
                extraction_text="Acme Corp",
                attributes={"location": "London", "revenue": "$5M"}
            ),
        ],
    ),
    # possibly more examples
]
# Suppose we have multiple documents
documents = [
    lx.data.Document(document_id='1', text="Bob, 28, is a Developer at Beta LLC located in Berlin. Revenue: €2 million."),
    lx.data.Document(document_id='2', text="Charlie, 42, Senior Engineer at Gamma Inc, USA. Gamma Inc revenue last year was $10M."),
    # ...
]
# Run extraction
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=2,
    max_workers=4,
    # optionally chunking params etc.
)
# Process results
print(list(results))
Output:
[
    AnnotatedDocument(
        extractions=[
            Extraction(extraction_class='Person', extraction_text='Charlie', char_interval=CharInterval(start_pos=0, end_pos=7), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=1, group_index=0, description=None, attributes={'age': '42', 'role': 'Senior Engineer'}),
            Extraction(extraction_class='Company', extraction_text='Gamma Inc', char_interval=CharInterval(start_pos=32, end_pos=41), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=1, description=None, attributes={'location': 'USA', 'revenue': '$10M'})
        ],
        text='Bob, 28, is a Developer at Beta LLC located in Berlin. Revenue: €2 million.'
    ),
    AnnotatedDocument(
        extractions=[
            Extraction(extraction_class='Person', extraction_text='Charlie', char_interval=CharInterval(start_pos=0, end_pos=7), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=1, group_index=0, description=None, attributes={'age': '42', 'role': 'Senior Engineer'}),
            Extraction(extraction_class='Company', extraction_text='Gamma Inc', char_interval=CharInterval(start_pos=32, end_pos=41), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=1, description=None, attributes={'location': 'USA', 'revenue': '$10M'})
        ],
        text='Charlie, 42, Senior Engineer at Gamma Inc, USA. Gamma Inc revenue last year was $10M.'
    )
]
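Notice how both results return Charlie as the Person and Gamma Inc as the Company, even though the first result's text is clearly document 1 (Bob at Beta LLC). The reported char_interval values (0 to 7 and 32 to 41) are also offsets into document 2's text, not document 1's, yet both extractions are marked match_exact.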
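The mismatch can be confirmed without calling the model again by slicing each document's text with the reported offsets. The snippet below is a minimal sketch that hard-codes the two texts and the char_interval values from the output above; the helper name `misaligned` is hypothetical and not part of langextract.

```python
# Document texts and reported extractions, copied from the output above.
doc1 = "Bob, 28, is a Developer at Beta LLC located in Berlin. Revenue: €2 million."
doc2 = "Charlie, 42, Senior Engineer at Gamma Inc, USA. Gamma Inc revenue last year was $10M."

# (extraction_text, start_pos, end_pos) as reported for BOTH results.
reported = [("Charlie", 0, 7), ("Gamma Inc", 32, 41)]


def misaligned(text, extractions):
    """Hypothetical helper: return every extraction whose interval
    does not slice back to its extraction_text in the given text."""
    return [
        (expected, start, end, text[start:end])
        for expected, start, end in extractions
        if text[start:end] != expected
    ]


bad_doc1 = misaligned(doc1, reported)  # checked against the first result's text
bad_doc2 = misaligned(doc2, reported)  # checked against the second result's text

print(bad_doc1)
print(bad_doc2)
```

Against document 2 the offsets line up, while against document 1 they land on unrelated text, which suggests the first result was populated with document 2's extractions rather than being a model hallucination.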
vayoa