Fixes to llm output parsing when using LLM based ranking #16

amrit110 · 2025-11-06T19:44:27Z

This pull request introduces improvements to schema item filtering, LLM output sanitization, and prompt clarity for schema ranking. The most important changes are the addition of robust parsing and validation utilities for schema items, enhanced pre-processing and extraction of LLM-generated JSON, and improved handling of invalid LLM outputs. The schema ranking prompt has also been rewritten for clarity and strict output requirements.

Schema item parsing and filtering improvements:

Added utility functions (_parse_schema_item, _parse_column_ref, _get_foreign_key) to robustly parse and validate schema item references, and to extract foreign key relationships in src/pipe/add_schema.py. This ensures only valid columns and their foreign keys are included during schema filtering.
Refactored the filter_schema function to use these utilities, improving the reliability of schema filtering and handling edge cases.

LLM output sanitization and extraction:

Introduced _preprocess_json_string in src/pipe/llm_util.py to fix common LLM formatting errors in JSON output, such as improperly terminated arrays, missing quotes, and empty items, before parsing.
Updated extract_json to pre-process LLM output before parsing, improving robustness against malformed JSON.
Changed logging in extract_object to debug level for failed extractions, reducing noise in production logs.

Schema ranking and prompt handling:

Added _sanitize_schema_item and improved _process_output in src/pipe/rank_schema_llm.py to validate and sanitize LLM outputs, fall back to all schema items if the output is invalid, and log warnings when necessary.
Rewrote the schema ranking prompt in src/pipe/rank_schema_prompts/v1.py to clarify input/output formats, requirements, and examples, ensuring LLMs return strictly valid and relevant schema items.

sepideh-abedini · 2025-11-23T22:04:37Z

src/pipe/llm_util.py

+def _preprocess_json_string(text: str) -> str:
+    """
+    Pre-process JSON string to fix common LLM formatting errors.
+


This function actually modifies the LLM output, so I think postprocess_json_string would be a more appropriate name for it.

sepideh-abedini · 2025-11-23T22:07:27Z

src/pipe/llm_util.py

+    # Find patterns like: "something] where ] should be "]
+    text = re.sub(r'([^"])\]', r'\1"]', text)
+    # But undo if we just added "" which would be wrong
+    text = text.replace('"""]', '"]')


The replace call here doesn't actually undo the errors that the regex on line 130 can introduce. To fix those, we should use something like:
text = re.sub(r',\s*([^",]*)"', r',\1', text)
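To make the concern concrete, here is a minimal sketch that runs the two quoted lines on a few inputs. The function names apply_bracket_fix and repair_if_needed are hypothetical and not part of the PR; the guarded variant (only repair when json.loads fails) is one possible mitigation under that assumption, not necessarily what the pipeline should adopt.

```python
import json
import re


def apply_bracket_fix(text: str) -> str:
    """Run the two lines quoted from the diff, unchanged (hypothetical wrapper)."""
    # Insert a quote before any "]" that is not already preceded by a quote.
    text = re.sub(r'([^"])\]', r'\1"]', text)
    # The regex never inserts a quote after an existing quote, so it cannot
    # produce '"""]' by itself; this replace does not undo its mistakes.
    text = text.replace('"""]', '"]')
    return text


def repair_if_needed(text: str) -> str:
    """Assumed mitigation: leave text alone when it already parses as JSON."""
    try:
        json.loads(text)
        return text
    except json.JSONDecodeError:
        return apply_bracket_fix(text)


# Intended case: a missing closing quote is repaired.
print(apply_bracket_fix('["a", "b]'))
# Valid inputs that the regex corrupts: a numeric array and an empty array.
print(apply_bracket_fix('[1, 2]'))
print(apply_bracket_fix('[]'))
# The guarded variant leaves already-valid JSON untouched.
print(repair_if_needed('[1, 2]'))
```

The second and third calls show that valid JSON can come out invalid, and the replace on the following line does not restore it.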

sepideh-abedini · 2025-11-23T22:14:25Z

src/pipe/rank_schema_llm.py


-    def _process_output(self, row: dict[str, Any], output: str) -> Any:
-        return extract_object(output)
+    def _sanitize_schema_item(self, item: str) -> str | None:


Here, _sanitize_schema_item doesn't seem like the right name, since it could be confused with the actual "sanitization" we do in the MaskSQL pipeline to mask sensitive information in the input. I would prefer something like format_schema_item.

sepideh-abedini · 2025-11-23T22:26:10Z

src/pipe/rank_schema_llm.py

+            item_ref = item_ref + ("]" * bracket_count)
+        elif bracket_count < 0:
+            # More closing than opening - invalid
+            return None


Having the same number of opening and closing brackets isn't enough here: the item should exactly match one of our formats, [table] or [table].[column]. It's better to use a regex to check for an exact match against these patterns.
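A minimal sketch of such a check follows. The pattern and the helper name is_valid_schema_item are assumptions for illustration; in particular, it assumes table and column names may contain any character except square brackets, which may be looser or stricter than what the pipeline needs.

```python
import re

# Assumed pattern: [table] optionally followed by .[column], where a name
# is one or more characters other than "[" and "]".
SCHEMA_ITEM_RE = re.compile(r"\[[^\[\]]+\](?:\.\[[^\[\]]+\])?")


def is_valid_schema_item(item_ref: str) -> bool:
    """Return True only if the whole string is [table] or [table].[column]."""
    return SCHEMA_ITEM_RE.fullmatch(item_ref.strip()) is not None


for candidate in ["[users]", "[users].[id]", "[users].[id", "users.id", "[users].[id]]"]:
    print(candidate, is_valid_schema_item(candidate))
```

Using fullmatch rather than match or search means trailing junk such as an extra closing bracket is rejected instead of silently accepted.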

sepideh-abedini · 2025-11-23T22:35:09Z

src/pipe/rank_schema_llm.py

+                f"All LLM schema items were invalid for question_id={row.get('question_id')}, "
+                f"falling back to all schema items"
+            )
+            return self.extract_schema_items(row)


It would be more helpful to log this warning for every invalid item so we can catch them more easily. In that case, the warning should go somewhere after line 90, in an else branch of the if sanitized check inside the loop.
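Roughly, the suggested structure could look like the sketch below. Everything here is hypothetical scaffolding: format_item stands in for the PR's item validator, process_items for the body of _process_output, and all_items for the result of extract_schema_items(row). Only the shape of the loop and the placement of the warnings is the point.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("rank_schema_llm_sketch")

# Stand-in validator (assumption): accept only [table] or [table].[column].
_ITEM_RE = re.compile(r"\[[^\[\]]+\](?:\.\[[^\[\]]+\])?")


def format_item(item: str) -> Optional[str]:
    item = item.strip()
    return item if _ITEM_RE.fullmatch(item) else None


def process_items(raw_items: list, question_id, all_items: list) -> list:
    valid = []
    for raw in raw_items:
        formatted = format_item(raw)
        if formatted:
            valid.append(formatted)
        else:
            # Per-item warning, so each bad item shows up in the logs.
            logger.warning(
                "Invalid schema item %r for question_id=%s", raw, question_id
            )
    if not valid:
        # Keep the existing fallback when nothing usable remains.
        logger.warning(
            "All LLM schema items were invalid for question_id=%s, "
            "falling back to all schema items",
            question_id,
        )
        return list(all_items)
    return valid


print(process_items(["[users].[id]", "users.id"], 7, ["[users]", "[orders]"]))
```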

sepideh-abedini · 2025-11-23T22:37:15Z

src/pipe/rank_schema_llm.py

+            )
+            return self.extract_schema_items(row)
+
+        return sanitized_items


If we are going to rename _sanitize_schema_item, then we should also update the variables that use the "sanitized" keyword, such as sanitized_items.

sepideh-abedini · 2025-11-23T22:44:21Z

src/pipe/add_schema.py

I also recommend adding some tests to check that these new classes and functions work properly after the refactoring, so we can better catch edge cases or any issues they might cause.

Fixes to llm output parsing when using LLM based ranking

7fd3ba6

amrit110 self-assigned this Nov 6, 2025

amrit110 added bug Something isn't working enhancement New feature or request labels Nov 6, 2025

amrit110 marked this pull request as draft November 13, 2025 19:09

amrit110 added 2 commits November 17, 2025 09:02

Revert change to main.py

a464817

Merge branch 'main' into fix_llm_ranking

4eb2fdd

amrit110 requested a review from sepideh-abedini November 17, 2025 14:03

amrit110 marked this pull request as ready for review November 17, 2025 14:03

sepideh-abedini reviewed Nov 23, 2025

Merge branch 'main' into fix_llm_ranking

164a05b

3 participants