Skip to content

Conversation

@amrit110
Copy link
Member

@amrit110 amrit110 commented Nov 6, 2025

This pull request introduces improvements to schema item filtering, LLM output sanitization, and prompt clarity for schema ranking. The most important changes are the addition of robust parsing and validation utilities for schema items, enhanced pre-processing and extraction of LLM-generated JSON, and improved handling of invalid LLM outputs. The schema ranking prompt has also been rewritten for clarity and strict output requirements.

Schema item parsing and filtering improvements:

  • Added utility functions (_parse_schema_item, _parse_column_ref, _get_foreign_key) to robustly parse and validate schema item references, and to extract foreign key relationships in src/pipe/add_schema.py. This ensures only valid columns and their foreign keys are included during schema filtering.
  • Refactored the filter_schema function to use these utilities, improving the reliability of schema filtering and handling edge cases.

LLM output sanitization and extraction:

  • Introduced _preprocess_json_string in src/pipe/llm_util.py to fix common LLM formatting errors in JSON output, such as array termination, missing quotes, and empty items, before parsing.
  • Updated extract_json to pre-process LLM output before parsing, improving robustness against malformed JSON.
  • Changed logging in extract_object to debug level for failed extractions, reducing noise in production logs.

Schema ranking and prompt handling:

  • Added _sanitize_schema_item and improved _process_output in src/pipe/rank_schema_llm.py to validate and sanitize LLM outputs, fallback to all schema items if output is invalid, and log warnings when necessary.
  • Rewrote the schema ranking prompt in src/pipe/rank_schema_prompts/v1.py to clarify input/output formats, requirements, and examples, ensuring LLMs return strictly valid and relevant schema items.

@amrit110 amrit110 self-assigned this Nov 6, 2025
@amrit110 amrit110 added bug Something isn't working enhancement New feature or request labels Nov 6, 2025
@amrit110 amrit110 marked this pull request as draft November 13, 2025 19:09
@amrit110 amrit110 marked this pull request as ready for review November 17, 2025 14:03
def _preprocess_json_string(text: str) -> str:
"""
Pre-process JSON string to fix common LLM formatting errors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function actually modifies the LLM output, so I think a more appropriate name for it would be postprocess_json_string

# Find patterns like: "something] where ] should be "]
text = re.sub(r'([^"])\]', r'\1"]', text)
# But undo if we just added "" which would be wrong
text = text.replace('"""]', '"]')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix the probable errors that might come from the regex defined in line 130, we should use something like:
text = re.sub(r',\s*([^",]*)"', r',\1', text), and the replace method here actually doesn’t fix that issue.


def _process_output(self, row: dict[str, Any], output: str) -> Any:
return extract_object(output)
def _sanitize_schema_item(self, item: str) -> str | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, _sanitize_schema_item doesn’t seem to be a proper name, as it might cause confusion with the actual "sanitization" we do in the MaskSQL pipeline to mask sensitive information from the input. So I would prefer something like format_schema_item.

item_ref = item_ref + ("]" * bracket_count)
elif bracket_count < 0:
# More closing than opening - invalid
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we don’t just need to have the same number of opening and closing brackets, it should exactly match our formats of [table] or [table].[column]. So it’s better to use a regex to check for an exact match with these patterns.

f"All LLM schema items were invalid for question_id={row.get('question_id')}, "
f"falling back to all schema items"
)
return self.extract_schema_items(row)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more helpful if we print this warning for every single item that was invalid so we can catch them more easily. In this case, it would be better to add this warning somewhere after line 90, inside the if sanitized loop, using the else condition.

)
return self.extract_schema_items(row)

return sanitized_items
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to change _sanitize_schema_item, then we should also take into account these variables that use the "sanitized" keyword.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend also including some tests to check whether these new classes or functions work properly after refactoring to better handle edge cases or any issues they might cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants