
Commit 5d01450

Fixes and docs
Signed-off-by: Alina Buzachis <[email protected]>
1 parent 3791783 commit 5d01450

File tree

3 files changed

+134
-10
lines changed


docling/models/table_confidence_model.py

Lines changed: 16 additions & 7 deletions
```diff
@@ -246,7 +246,10 @@ def _calculate_text_quality(text: str) -> float:
 
         # Penalize columns with mixed data types (e.g., numbers and text)
         for col in range(1, detected_table.num_cols + 1):
-            col_texts = [c.text for c in detected_table.table_cells if c.column == col]
+            col_texts = [
+                c.text for c in detected_table.table_cells
+                if c.start_col_offset_idx <= col <= c.end_col_offset_idx
+            ]
             if col_texts:
                 has_number = any(str(t).replace('.', '', 1).isdigit() for t in col_texts)
                 has_text = any(not str(t).replace('.', '', 1).isdigit() for t in col_texts)
@@ -327,17 +330,19 @@ def _calculate_completeness_score(self, detected_table: Table) -> ScoreValue:
                 1 for col in range(1, detected_table.num_cols + 1)
                 if sum(
                     1 for c in detected_table.table_cells
-                    if c.column == col and c.text and c.text.strip().replace('\u200b','')
+                    if c.start_col_offset_idx <= col <= c.end_col_offset_idx and c.text
+                    and c.text.strip().replace('\u200b','')
                 ) / max(1, detected_table.num_rows) < 0.1
             )
             total_penalty += (sparse_cols / detected_table.num_cols) * 0.1
 
         if detected_table.num_rows > 1:
             sparse_rows = sum(
-                1 for row in range(1, detected_table.num_rows + 1)
+                1 for row in range(detected_table.num_rows)  # 0-based
                 if sum(
                     1 for c in detected_table.table_cells
-                    if c.row == row and c.text and c.text.strip().replace('\u200b','')
+                    if c.start_row_offset_idx <= row <= c.end_row_offset_idx
+                    and c.text and c.text.strip().replace('\u200b','')
                 ) / max(1, detected_table.num_cols) < 0.1
             )
             total_penalty += (sparse_rows / detected_table.num_rows) * 0.1
@@ -387,10 +392,14 @@ def _calculate_layout_score(self, detected_table: Table) -> ScoreValue:
 
         # Calculate bonus for consistent column alignment
         aligned_fraction = 0.0
-        if detected_table.num_cols > 1 and all(hasattr(c, "column") for c in detected_table.table_cells):
+        if detected_table.num_cols > 1:
             consistent_columns = 0
-            for col in range(1, detected_table.num_cols + 1):
-                col_x_coords = [c.bbox.x for c in detected_table.table_cells if c.column == col]
+            for col in range(detected_table.num_cols):  # zero-based index
+                col_x_coords = [
+                    c.bbox.l  # use left edge for alignment
+                    for c in detected_table.table_cells
+                    if c.start_col_offset_idx <= col <= c.end_col_offset_idx
+                ]
                 if len(col_x_coords) > 1 and np.std(col_x_coords) < 5:
                     consistent_columns += 1
             aligned_fraction = consistent_columns / detected_table.num_cols
```

Lines changed: 112 additions & 0 deletions
## Table Confidence Model Documentation

This document explains the ``TableConfidenceModel`` used for scoring tables detected in document pages.

### Overview

The ``TableConfidenceModel`` evaluates detected tables and assigns multiple confidence scores to quantify aspects such as structure, text quality, completeness, and layout. The scoring system helps downstream processes filter low-quality tables and weight them appropriately for further processing.

The model uses heuristics to detect issues such as overlapping cells, mixed content types, sparse rows/columns, and irregular layouts. Each score is then adjusted using a configurable method such as ``sigmoid``, ``sqrt``, or ``linear``.

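As an illustration of what these adjustments do, here is a minimal sketch; the exact formulas and parameters are not given in this document, so the midpoint and steepness below are assumptions:

```python
import math


def adjust_score(raw: float, method: str = "sigmoid") -> float:
    """Illustrative score adjustment; the input is clamped to [0, 1] first."""
    x = max(0.0, min(1.0, raw))
    if method == "sigmoid":
        # Assumed midpoint 0.5 and steepness 10; the model's parameters may differ.
        return 1.0 / (1.0 + math.exp(-10.0 * (x - 0.5)))
    if method == "sqrt":
        # Boosts low and mid-range scores while keeping 0 and 1 fixed.
        return math.sqrt(x)
    return x  # "linear" leaves the clamped score unchanged
```

In practice, a sigmoid-style adjustment sharpens the separation between low and high confidence, while a square-root adjustment is more forgiving of mid-range scores.
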
### Model Configuration

**TableConfidenceOptions**

``TableConfidenceOptions`` is a placeholder for future configuration options, allowing customization of scoring behavior, thresholds, and weighting.

**TableConfidenceModel**

The ``TableConfidenceModel`` is the main class responsible for calculating table confidence scores. Its key attributes are:

- ``enabled``: Whether the model actively scores tables.
- ``adjustment_method``: Determines how raw scores are adjusted.

When the model is called on a batch of pages, it iterates through each table and calculates the following scores (a sketch of this flow appears below):

- ``Structure Score``: Measures grid integrity and penalizes overlapping cells. Larger tables and multi-row/multi-column tables can receive a bonus.
- ``Cell Text Score``: Evaluates the consistency and quality of cell text, penalizing mixed data types within a column and overlapping cells, while rewarding high-quality or structured text.
- ``Completeness Score``: Determines how fully populated a table is by comparing filled cells to the expected cell count, penalizing sparse rows or columns and accounting for merged cells.
- ``Layout Score``: Assesses visual and structural integrity, including grid presence, column alignment, row height consistency, OTSL sequences, and bounding box validity, applying both bonuses and penalties.

Scores are clamped to the 0-1 range and adjusted according to the specified method.

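A rough, standalone paraphrase of that per-table flow is sketched below, reusing the ``adjust_score`` sketch from the Overview. The ``TableScores`` container and the private helper names are written to mirror the four scores listed above and are not guaranteed to match the model's actual method names:

```python
from dataclasses import dataclass


@dataclass
class TableScores:
    """Simplified stand-in for TableConfidenceScores, used only in this sketch."""

    structure: float
    cell_text: float
    completeness: float
    layout: float


def _clamp(value: float) -> float:
    """Clamp a raw score into the 0-1 range."""
    return max(0.0, min(1.0, value))


def score_table(model, table) -> TableScores:
    """Compute, clamp, and adjust the four per-table scores (illustrative only)."""
    raw = TableScores(
        structure=model._calculate_structure_score(table),   # assumed helper name
        cell_text=model._calculate_cell_text_score(table),   # assumed helper name
        completeness=model._calculate_completeness_score(table),
        layout=model._calculate_layout_score(table),
    )
    return TableScores(
        structure=adjust_score(_clamp(raw.structure), model.adjustment_method),
        cell_text=adjust_score(_clamp(raw.cell_text), model.adjustment_method),
        completeness=adjust_score(_clamp(raw.completeness), model.adjustment_method),
        layout=adjust_score(_clamp(raw.layout), model.adjustment_method),
    )
```
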
### How Scores Are Calculated

The model uses the following heuristics:

- ``Structure``: Starts with a base score from the detection model, subtracts penalties for overlapping cells, and adds bonuses for multi-row or multi-column structures.
- ``Cell Text``: Begins with a base confidence, applies penalties for overlaps and mixed content types, evaluates text quality heuristically based on content and character composition, and blends this with the base score.
- ``Completeness``: Computes the ratio of filled cells to total expected cells, considering merged cells and table shape, with penalties for sparsely populated rows or columns (a simplified sketch follows below).
- ``Layout``: Combines visual cues such as grid presence, alignment, row height consistency, and OTSL sequences, while penalizing irregular bounding boxes, overlaps, low fill ratios, and inconsistent row heights.

These four scores together provide a comprehensive assessment of table quality.

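To make the completeness heuristic concrete, here is a simplified sketch that computes a fill ratio and a sparse-column penalty using the span-aware cell offsets; the actual method also penalizes sparse rows and accounts for merged cells and table shape, so treat this as an approximation rather than the implementation:

```python
def completeness_sketch(table) -> float:
    """Approximate completeness: fill ratio minus a penalty for sparse columns."""
    expected = max(1, table.num_rows * table.num_cols)
    filled = sum(
        1 for c in table.table_cells
        if c.text and c.text.strip().replace('\u200b', '')
    )
    score = filled / expected

    # A column is "sparse" when fewer than 10% of its rows contain text.
    sparse_cols = sum(
        1 for col in range(table.num_cols)
        if sum(
            1 for c in table.table_cells
            if c.start_col_offset_idx <= col <= c.end_col_offset_idx
            and c.text and c.text.strip().replace('\u200b', '')
        ) / max(1, table.num_rows) < 0.1
    )
    penalty = (sparse_cols / max(1, table.num_cols)) * 0.1
    return max(0.0, min(1.0, score - penalty))
```
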
### Table and Page Data Structures

**Table**

- ``Table`` extends ``BasePageElement`` and represents a detected table on a page.
- **Key attributes**:
  - ``num_rows`` and ``num_cols``: dimensions of the table grid.
  - ``table_cells``: list of ``TableCell`` objects representing individual cells.
  - ``detailed_scores``: holds a ``TableConfidenceScores`` object with all calculated confidence scores.
  - ``otsl_seq``: stores the Open Table Structure Language (OTSL) sequence for layout evaluation.

This structure is what the ``TableConfidenceModel`` consumes to calculate scores.

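The following dataclass-style stand-ins illustrate just the attributes the model reads; they are not the library's actual classes, which also carry bounding boxes and other metadata:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TableCellSketch:
    """Stand-in for TableCell: text plus zero-based, span-aware grid offsets."""

    text: str
    start_row_offset_idx: int
    end_row_offset_idx: int
    start_col_offset_idx: int
    end_col_offset_idx: int


@dataclass
class TableSketch:
    """Stand-in for Table with the attributes listed above."""

    num_rows: int
    num_cols: int
    table_cells: List[TableCellSketch] = field(default_factory=list)
    detailed_scores: Optional[object] = None  # would hold a TableConfidenceScores
    otsl_seq: List[str] = field(default_factory=list)
```
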
**TableConfidenceScores**

- Stores the four individual confidence scores:
  1. ``structure_score``
  2. ``cell_text_score``
  3. ``completeness_score``
  4. ``layout_score``
- Property ``total_table_score`` provides a weighted average using the default weights below (see the worked example that follows):
  1. Structure: 0.3
  2. Cell Text: 0.3
  3. Completeness: 0.2
  4. Layout: 0.2

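The arithmetic implied by these default weights is illustrated below with made-up score values:

```python
def total_table_score(structure: float, cell_text: float,
                      completeness: float, layout: float) -> float:
    """Weighted average with the default weights listed above."""
    return 0.3 * structure + 0.3 * cell_text + 0.2 * completeness + 0.2 * layout


# Example: scores of 0.9, 0.8, 0.6, and 0.7 yield
# 0.3*0.9 + 0.3*0.8 + 0.2*0.6 + 0.2*0.7 = 0.27 + 0.24 + 0.12 + 0.14 = 0.77
print(round(total_table_score(0.9, 0.8, 0.6, 0.7), 2))  # 0.77
```
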
**PageConfidenceScores**

- Aggregates confidence for an entire page.
- Contains per-table scores (``tables: Dict[int, TableConfidenceScores]``).
- Provides computed properties (see the sketch below):
  - ``table_score``: average across all tables on the page.
  - ``mean_score``: average across OCR, layout, parsing, and table scores.
  - ``mean_grade`` and ``low_grade``: map numeric scores to qualitative grades (POOR, FAIR, GOOD, EXCELLENT).

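A minimal sketch of the page-level aggregation is shown below; the grade thresholds here are illustrative assumptions, not the values used by the library:

```python
from statistics import mean
from typing import Dict


def page_table_score(table_totals: Dict[int, float]) -> float:
    """Average of the per-table total scores on a page (0.0 when there are none)."""
    return mean(table_totals.values()) if table_totals else 0.0


def to_grade(score: float) -> str:
    """Map a numeric score to a qualitative grade (thresholds are assumed)."""
    if score >= 0.9:
        return "EXCELLENT"
    if score >= 0.7:
        return "GOOD"
    if score >= 0.5:
        return "FAIR"
    return "POOR"
```
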
**ConfidenceReport**

- Aggregates confidence across the whole document.
- Holds ``pages: Dict[int, PageConfidenceScores]`` and document-level scores (``mean_score``, ``table_score``, etc.).

### Usage Example

The following example shows how to configure a document conversion pipeline to enable table structure extraction and run the ``TableConfidenceModel``:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_path = "path/to/document.pdf"  # placeholder input document

# 1. Configure table structure to use the ACCURATE mode.
table_options = TableStructureOptions(
    table_former_mode=TableFormerMode.ACCURATE
)

# 2. Define the pipeline options, ensuring table structure is enabled.
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    force_full_page_ocr=True,
    do_table_structure=True,  # Crucial for the table confidence model to run
    table_structure_options=table_options,
)

# 3. Instantiate the DocumentConverter with the pipeline options.
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

# 4. Run the conversion.
conv_result: ConversionResult = doc_converter.convert(source=pdf_path)

# 5. Access the table confidence scores for each page.
for page in conv_result.pages:
    if page.predictions.confidence_scores and page.predictions.confidence_scores.tables:
        for table_id, score in page.predictions.confidence_scores.tables.items():
            print(f"Page {page.page_no}, Table {table_id} scores: {score}")
```

tests/test_table_confidence_score.py

Lines changed: 6 additions & 3 deletions
```diff
@@ -28,11 +28,14 @@ def height(self):
 
 
 class Cell:
-    def __init__(self, text, bbox, row, column):
+    def __init__(self, text, bbox, row, column, row_span=1, col_span=1):
         self.text = text
         self.bbox = bbox
-        self.row = row
-        self.column = column
+        self.start_row_offset_idx = row - 1
+        self.end_row_offset_idx = self.start_row_offset_idx + (row_span - 1)
+
+        self.start_col_offset_idx = column - 1
+        self.end_col_offset_idx = self.start_col_offset_idx + (col_span - 1)
 
 
 class Cluster:
```