## Table Confidence Model Documentation

This document explains the ``TableConfidenceModel`` used for scoring tables detected in document pages.

### Overview

The ``TableConfidenceModel`` evaluates detected tables and assigns multiple confidence scores quantifying aspects such as structure, text quality, completeness, and layout. These scores help downstream processes filter out low-quality tables and weight them appropriately for further processing.

The model uses heuristics to detect issues such as overlapping cells, mixed content types, sparse rows or columns, and irregular layouts. Each raw score is then adjusted using a configurable method such as ``sigmoid``, ``sqrt``, or ``linear``.
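As a rough sketch of the adjustment step (a hypothetical helper, not the library's actual implementation), the three methods could be applied like this:

```python
import math

def adjust_score(raw: float, method: str = "sigmoid") -> float:
    """Adjust a raw confidence score (illustrative sketch only)."""
    raw = max(0.0, min(1.0, raw))  # clamp to [0, 1] first
    if method == "sigmoid":
        # A steepened logistic curve centered at 0.5 pushes
        # scores toward the extremes.
        return 1.0 / (1.0 + math.exp(-10.0 * (raw - 0.5)))
    if method == "sqrt":
        # Square root boosts mid-range scores.
        return math.sqrt(raw)
    # "linear": pass the clamped score through unchanged.
    return raw
```

The steepness constant (``10.0``) is an assumption chosen for illustration; the point is only that each method maps a clamped 0-1 score onto a reshaped 0-1 score.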

### Model Configuration

**TableConfidenceOptions**

``TableConfidenceOptions`` is a placeholder for future configuration options, allowing customization of scoring behavior, thresholds, and weighting.

**TableConfidenceModel**

The ``TableConfidenceModel`` is the main class responsible for calculating table confidence scores.
- ``enabled``: Whether the model actively scores tables.
- ``adjustment_method``: Determines how raw scores are adjusted.

When the model is called on a batch of pages, it iterates through each table and calculates the following scores:
- ``Structure Score``: Measures grid integrity and penalizes overlapping cells. Larger tables and multi-row/multi-column tables can receive a bonus.
- ``Cell Text Score``: Evaluates the consistency and quality of cell text, penalizing mixed data types within a column and overlapping cells, while rewarding high-quality or structured text.
- ``Completeness Score``: Determines how fully populated a table is by comparing filled cells to expected cells, penalizing sparse rows or columns and accounting for merged cells.
- ``Layout Score``: Assesses visual and structural integrity, including grid presence, column alignment, row height consistency, OTSL sequences, and bounding box validity, applying both bonuses and penalties.

Scores are clamped to the 0-1 range and adjusted according to the configured method.

### How Scores Are Calculated

The model uses the following heuristics:
- ``Structure``: Starts with a base score from the detection model, subtracts penalties for overlapping cells, and adds bonuses for multi-row or multi-column structures.
- ``Cell Text``: Begins with a base confidence, applies penalties for overlaps and mixed content types, evaluates text quality heuristically from content and character composition, and blends this with the base score.
- ``Completeness``: Computes the ratio of filled cells to total expected cells, considering merged cells and table shape, with penalties for sparsely populated rows or columns.
- ``Layout``: Combines visual cues such as grid presence, alignment, row height consistency, and OTSL sequences, while penalizing irregular bounding boxes, overlaps, low fill ratios, and inconsistent row heights.

Together, these four scores provide a comprehensive assessment of table quality.
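The core of the completeness heuristic reduces to a fill ratio. A minimal sketch (hypothetical helper; the real model additionally accounts for merged cells and sparse-row/column penalties):

```python
def completeness_score(filled_cells: int, num_rows: int, num_cols: int) -> float:
    """Ratio of filled cells to the cells a full grid would contain."""
    expected = num_rows * num_cols
    if expected == 0:
        # An empty or degenerate grid cannot be complete.
        return 0.0
    # Clamp so merged cells counted per-span cannot exceed 1.0.
    return min(1.0, filled_cells / expected)
```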

### Table and Page Data Structures

**Table**
- ``Table`` extends ``BasePageElement`` and represents a detected table on a page.
- **Key attributes**:
  - ``num_rows`` and ``num_cols``: dimensions of the table grid.
  - ``table_cells``: list of ``TableCell`` objects representing individual cells.
  - ``detailed_scores``: holds a ``TableConfidenceScores`` object with all calculated confidence scores.
  - ``otsl_seq``: stores the Open Table Structure Language (OTSL) sequence for layout evaluation.

This structure is what the ``TableConfidenceModel`` consumes to calculate scores.
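For orientation, the shape of this structure can be sketched as dataclasses. This is illustrative only: the field names follow the attributes listed above, not necessarily the library's full definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TableCell:
    """Minimal stand-in for a single detected cell."""
    text: str = ""

@dataclass
class Table:
    """Sketch of the table structure the confidence model consumes."""
    num_rows: int = 0
    num_cols: int = 0
    table_cells: List[TableCell] = field(default_factory=list)
    detailed_scores: Optional["TableConfidenceScores"] = None  # filled in by the model
    otsl_seq: List[str] = field(default_factory=list)  # OTSL tokens for layout checks
```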

**TableConfidenceScores**
- Stores the four individual confidence scores:
  1. ``structure_score``
  2. ``cell_text_score``
  3. ``completeness_score``
  4. ``layout_score``
- The ``total_table_score`` property provides a weighted average using default weights:
  1. Structure: 0.3
  2. Cell Text: 0.3
  3. Completeness: 0.2
  4. Layout: 0.2
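With those defaults the combined score is a straightforward weighted average; the weights sum to 1.0, so the result stays in the 0-1 range (illustrative computation, not the library's code):

```python
def total_table_score(structure: float, cell_text: float,
                      completeness: float, layout: float) -> float:
    """Weighted average using the default weights listed above."""
    return (0.3 * structure + 0.3 * cell_text
            + 0.2 * completeness + 0.2 * layout)
```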

**PageConfidenceScores**
- Aggregates confidence for an entire page.
- Contains per-table scores (``tables: Dict[int, TableConfidenceScores]``).
- Provides computed properties:
  - ``table_score``: the average across all tables on the page.
  - ``mean_score``: the average across OCR, layout, parsing, and table scores.
  - ``mean_grade`` and ``low_grade``: map numeric scores to qualitative grades (POOR, FAIR, GOOD, EXCELLENT).
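Such a grade mapping can be sketched as a simple threshold function. Note that the cut-off values below are assumptions chosen for illustration; the actual thresholds live in the library:

```python
def to_grade(score: float) -> str:
    """Map a numeric 0-1 score to a qualitative grade.

    The thresholds here are illustrative assumptions, not the
    library's actual cut-offs.
    """
    if score >= 0.9:
        return "EXCELLENT"
    if score >= 0.7:
        return "GOOD"
    if score >= 0.5:
        return "FAIR"
    return "POOR"
```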

**ConfidenceReport**
- Aggregates confidence across the whole document.
- Holds ``pages: Dict[int, PageConfidenceScores]`` and document-level scores (``mean_score``, ``table_score``, etc.).

### Usage Example

The following example shows how to configure a document conversion pipeline to enable table structure extraction and run the ``TableConfidenceModel``:

```python
# Imports (module paths as in recent docling releases; may vary by version)
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_path = "path/to/document.pdf"  # placeholder: substitute your input file

# 1. Configure table structure to use the ACCURATE mode
table_options = TableStructureOptions(
    table_former_mode=TableFormerMode.ACCURATE
)

# 2. Define the pipeline options, ensuring table structure is enabled.
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    force_full_page_ocr=True,
    do_table_structure=True,  # Crucial for the table confidence model to run
    table_structure_options=table_options,
)

# 3. Instantiate the DocumentConverter with the pipeline options.
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

# 4. Run the conversion.
conv_result: ConversionResult = doc_converter.convert(source=pdf_path)

# 5. Access table confidence scores.
for page in conv_result.pages:
    if page.predictions.confidence_scores and page.predictions.confidence_scores.tables:
        for table_id, score in page.predictions.confidence_scores.tables.items():
            print(f"Page {page.page_no}, Table {table_id} scores: {score}")
```