Feature/fix schema validation #1042
Conversation
IMO, the right fix is to apply column selection only to the target schema when in strict mode.
We say in our documentation that strict mode enforces an exact match between the source and target schemas:
strict: Whether to perform strict schema validation (default: False).
- False: Validates that all expected columns exist with compatible types (allows extra columns)
- True: Validates exact schema match (same columns, same order, same types)
Column filtering uses the _get_schema method. We can modify this method to receive the strict parameter and conditionally filter the expected columns.
I think the missing-column bug happens because the engine auto-fills the columns argument for has_valid_schema, which ends up filtering the expected schema down to only the columns that already exist in the DF. As a result, missing expected columns never get checked. Because of that, I agree with removing the auto-preselection here (or changing it to use exclude_columns instead).
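The mechanism can be reproduced with a toy example (plain dicts stand in for Spark schemas; the filtering comprehension mirrors the idea, not DQX's exact code):

```python
expected_schema = {"a": "int", "b": "string", "c": "double"}
df_columns = ["a", "b"]  # column "c" is missing from the DataFrame

# The engine auto-injects all existing DataFrame columns as `columns`:
injected_columns = list(df_columns)

# Simplified version of the filtering step inside _get_schema:
filtered_expected = {c: t for c, t in expected_schema.items() if c in injected_columns}

# "c" has been dropped from the expected schema, so the comparison can
# no longer detect that it is missing -> false PASS.
missing = set(filtered_expected) - set(df_columns)
print(filtered_expected)  # {'a': 'int', 'b': 'string'}
print(missing)            # set() -- the missing column is never flagged
```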
@ghanse While I understand the intent to conditionally filter based on strict mode, modifying _get_schema doesn't resolve the bug because of how the engine auto-injects arguments behind the scenes.
As @akshan-main correctly pointed out, the root cause is a semantic conflict with the columns argument. The @register_for_original_columns_preselection decorator forces the DQEngine to auto-inject the columns kwarg into the check with a list of all existing columns in the DataFrame so DQX internal columns are ignored.
However, has_valid_schema already uses the columns keyword argument to define a specific subset of columns to be validated. Because of the decorator, this user configuration is overridden. When _get_schema() is called with this auto-injected list of existing columns, it actively filters the expected_schema down to only the columns that currently exist in the DataFrame.
Because of this, today even with strict=False (permissive mode), missing columns silently pass. The auto-injected list causes _get_schema to drop any missing columns from the expected_schema before _get_permissive_schema_comparison even runs.
If we only selectively applied column selection to the target schema in strict mode as you suggested:
- In strict=True: _get_schema would still filter out the missing columns from the expected_schema because of the decorator's injection, resulting in a false PASS for missing columns.
- In strict=False: _get_schema wouldn't filter them out, so the missing columns would finally be correctly flagged.
This would create a paradoxical situation where strict mode is actually less strict than permissive mode regarding missing columns.
By removing the decorator, we prevent the engine from spoofing the columns kwarg, preserving the user's intent and allowing missing columns to be accurately checked against the full expected_schema in both modes. To still ignore DQX internal/result columns safely, I've moved the actual column preselection logic directly inside the apply() closure where it filters actual_schema without mutating the expected_schema definition.
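A minimal sketch of that idea, with illustrative names: the internal result-column names and the helper below are assumptions, and plain dicts stand in for Spark schemas:

```python
# Assumed names for DQX-internal result columns; the real ones may differ.
DQX_INTERNAL_COLUMNS = {"_errors", "_warnings"}

def apply_check(actual_schema: dict, expected_schema: dict) -> set:
    """Mirror of the apply() closure idea: trim internal columns from the
    actual schema only, then check against the untouched expected schema."""
    visible_actual = {
        col: dtype
        for col, dtype in actual_schema.items()
        if col not in DQX_INTERNAL_COLUMNS
    }
    # expected_schema is never filtered, so missing columns are caught.
    return set(expected_schema) - set(visible_actual)
```

For example, `apply_check({"a": "int", "_errors": "map"}, {"a": "int", "c": "double"})` returns `{"c"}`: the internal column is ignored, but the genuinely missing column is still reported.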
Let me know what you think!
We should invert this logic and apply column selection to the expected schema only when strict=True. The docs mention that non-strict mode will only validate the expected columns and not any extra columns.
Sorry @ghanse, I need some clarification on the exact requirements to ensure we align with the intended behavior:
- expected_schema filtering: Should the columns parameter filter the expected_schema only when strict=True? There is an existing test (has_valid_schema_with_specified_columns) that explicitly expects expected_schema filtering to happen when strict=False. If we change this, the test will need to be updated.
- Non-strict definition: In strict=False, does "expected columns" refer to the entire expected_schema, completely ignoring the columns parameter?
- Superset behavior: If the columns list contains columns not present in the DataFrame (e.g., DF has a, b, c but columns=["a", "b", "c", "d"]), should validation fail in both strict=True and strict=False? The documentation isn't fully clear on this edge case.
Sorry, I meant we should apply column selection on the expected schema only when strict=False.
- Schema filtering: Filter the expected_schema only when strict=False. When strict=True, we should filter the DataFrame schema and compare against the unfiltered expected_schema.
- Non-strict definition: When strict=False, the "expected columns" are the subset of expected columns after filtering using the columns parameter. When strict=True, the "expected columns" are the full expected schema, whether from expected_schema, ref_df_name, or ref_table. We can add this explanation to the docstring.
- Superset behavior: I don't think we currently cover this edge case. IMO the validation should fail in both cases if the DataFrame schema does not contain all columns.
I believe this is in line with the docstring:
strict: Whether to perform strict schema validation (default: False).
- False: Validates that all expected columns exist with compatible types (allows extra columns)
- True: Validates exact schema match (same columns, same order, same types)
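Pulling these rules together, here is a toy model of the agreed semantics. Plain dicts stand in for Spark StructType, and the helper name and messages are illustrative, not DQX's actual API:

```python
from typing import Optional

def validate_schema(actual: dict, expected: dict,
                    columns: Optional[list], strict: bool) -> list:
    """Toy model of the agreed semantics (names/messages are illustrative).

    - strict=False: filter the EXPECTED schema by `columns`; extra
      DataFrame columns are allowed (types must be equal in this toy,
      where the real check allows compatible types).
    - strict=True: filter the DATAFRAME schema by `columns` and compare
      it against the FULL expected schema (same columns, order, types).
    - Superset: `columns` naming fields absent from the DataFrame fails
      in both modes.
    """
    issues = []
    if columns:
        missing_requested = [c for c in columns if c not in actual]
        if missing_requested:
            issues.append(f"DataFrame is missing requested columns: {missing_requested}")
    if strict:
        # Filter the DataFrame schema, compare against the full expectation.
        target_actual = {c: t for c, t in actual.items()
                         if not columns or c in columns}
        if list(target_actual.items()) != list(expected.items()):
            issues.append("strict schema mismatch")
    else:
        # Filter the expected schema; extra DataFrame columns are allowed.
        target_expected = {c: t for c, t in expected.items()
                           if not columns or c in columns}
        for col, dtype in target_expected.items():
            if col not in actual:
                issues.append(f"missing column: {col}")
            elif actual[col] != dtype:
                issues.append(f"type mismatch for column: {col}")
    return issues
```

Note that the superset check runs before either branch, so a columns list naming absent fields fails in both modes, matching the behavior proposed above.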
I have squashed the commits and updated the PR with a signed final commit.
Key updates include:
- Docstring Updates: Refined the documentation to reflect the recent changes.
- Test Case Adjustment: Updated test_has_valid_schema_with_specified_columns.
  - Reason: Previously, the test filtered the expected schema even when strict=True, which failed to correctly identify the missing column c. I've updated the expected result to ensure it correctly validates the schema under strict mode.
Please let me know if this addresses the requirements!
Force-pushed from 09a5ca0 to d5b5f08.
Changes
- has_valid_schema silently skipped validation for columns missing from the checked DataFrame.
- Removed the @register_for_original_columns_preselection decorator to prevent DQEngine from auto-injecting columns.
- Moved the column preselection into the apply() logic to filter out DQX infrastructure columns, ensuring validation properly fails for missing expected columns.

Linked issues
Resolves #1039
Tests
Added test_apply_checks_with_has_valid_schema_missing_columns and test_apply_checks_with_has_valid_schema_permissive_missing_columns to tests/integration/test_apply_checks.py to explicitly guarantee that executing the check via DQEngine.apply_checks(...) correctly flags missing columns in strict=True and strict=False mode.