-
Notifications
You must be signed in to change notification settings - Fork 0
Fixes to llm output parsing when using LLM based ranking #16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| def _preprocess_json_string(text: str) -> str: | ||
| """ | ||
| Pre-process JSON string to fix common LLM formatting errors. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function actually modifies the LLM output, so I think a more appropriate name for it would be postprocess_json_string
| # Find patterns like: "something] where ] should be "] | ||
| text = re.sub(r'([^"])\]', r'\1"]', text) | ||
| # But undo if we just added "" which would be wrong | ||
| text = text.replace('"""]', '"]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix the probable errors that might come from the regex defined in line 130, we should use something like:
text = re.sub(r',\s*([^",]*)"', r',\1', text), and the replace method here actually doesn’t fix that issue.
|
|
||
| def _process_output(self, row: dict[str, Any], output: str) -> Any: | ||
| return extract_object(output) | ||
| def _sanitize_schema_item(self, item: str) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, _sanitize_schema_item doesn’t seem to be a proper name, as it might cause confusion with the actual "sanitization" we do in the MaskSQL pipeline to mask sensitive information from the input. So I would prefer something like format_schema_item.
| item_ref = item_ref + ("]" * bracket_count) | ||
| elif bracket_count < 0: | ||
| # More closing than opening - invalid | ||
| return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we don’t just need to have the same number of opening and closing brackets, it should exactly match our formats of [table] or [table].[column]. So it’s better to use a regex to check for an exact match with these patterns.
| f"All LLM schema items were invalid for question_id={row.get('question_id')}, " | ||
| f"falling back to all schema items" | ||
| ) | ||
| return self.extract_schema_items(row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more helpful if we print this warning for every single item that was invalid so we can catch them more easily. In this case, it would be better to add this warning somewhere after line 90, inside the if sanitized loop, using the else condition.
| ) | ||
| return self.extract_schema_items(row) | ||
|
|
||
| return sanitized_items |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to change _sanitize_schema_item, then we should also take into account these variables that use the "sanitized" keyword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend also including some tests to check whether these new classes or functions work properly after refactoring to better handle edge cases or any issues they might cause.
This pull request introduces improvements to schema item filtering, LLM output sanitization, and prompt clarity for schema ranking. The most important changes are the addition of robust parsing and validation utilities for schema items, enhanced pre-processing and extraction of LLM-generated JSON, and improved handling of invalid LLM outputs. The schema ranking prompt has also been rewritten for clarity and strict output requirements.
Schema item parsing and filtering improvements:
_parse_schema_item,_parse_column_ref,_get_foreign_key) to robustly parse and validate schema item references, and to extract foreign key relationships insrc/pipe/add_schema.py. This ensures only valid columns and their foreign keys are included during schema filtering.filter_schemafunction to use these utilities, improving the reliability of schema filtering and handling edge cases.LLM output sanitization and extraction:
_preprocess_json_stringinsrc/pipe/llm_util.pyto fix common LLM formatting errors in JSON output, such as array termination, missing quotes, and empty items, before parsing.extract_jsonto pre-process LLM output before parsing, improving robustness against malformed JSON.extract_objectto debug level for failed extractions, reducing noise in production logs.Schema ranking and prompt handling:
_sanitize_schema_itemand improved_process_outputinsrc/pipe/rank_schema_llm.pyto validate and sanitize LLM outputs, fallback to all schema items if output is invalid, and log warnings when necessary.src/pipe/rank_schema_prompts/v1.pyto clarify input/output formats, requirements, and examples, ensuring LLMs return strictly valid and relevant schema items.