Feature/fix schema validation #1042

Open

fedeflowers wants to merge 1 commit into databrickslabs:main from fedeflowers:feature/fix-schema-validation

Conversation

@fedeflowers

Changes

  • Fixed a critical bug where has_valid_schema silently skipped validation for columns missing from the checked DataFrame.
  • Removed the @register_for_original_columns_preselection decorator to prevent DQEngine from auto-injecting columns.
  • Implemented manual column preselection in the apply() logic to filter out DQX infrastructure columns, ensuring validation properly fails for missing expected columns.
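The preselection described above can be reduced to a small sketch. This is purely illustrative: the infrastructure column names and helper names below are hypothetical placeholders, not the actual DQX internals, and plain dicts stand in for Spark schemas.

```python
# Hypothetical sketch of manual column preselection. The infrastructure
# column names are illustrative placeholders, not DQX's real ones.
DQX_INFRA_COLUMNS = {"_errors", "_warnings"}

def preselect(observed_schema: dict) -> dict:
    """Drop engine-injected result columns from the observed schema."""
    return {c: t for c, t in observed_schema.items() if c not in DQX_INFRA_COLUMNS}

def missing_columns(observed_schema: dict, expected_schema: dict) -> list:
    """Judge missing columns against the FULL expected schema, so a column
    absent from the DataFrame is reported instead of silently dropped."""
    actual = preselect(observed_schema)
    return [c for c in expected_schema if c not in actual]
```

The key point is that only the observed (actual) schema is filtered; the expected schema is never mutated, so a missing expected column can still be flagged.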

Linked issues

Resolves #1039

Tests

Added test_apply_checks_with_has_valid_schema_missing_columns and test_apply_checks_with_has_valid_schema_permissive_missing_columns to tests/integration/test_apply_checks.py. These tests explicitly guarantee that executing the check via DQEngine.apply_checks(...) correctly flags missing columns in both strict=True and strict=False modes.

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@fedeflowers fedeflowers requested a review from a team as a code owner February 21, 2026 15:27
@fedeflowers fedeflowers requested review from grusin-db and removed request for a team February 21, 2026 15:27
@ghanse (Collaborator), Feb 22, 2026


IMO, the right fix is to apply column selection only to the target schema when in strict mode.

We say in our documentation that strict mode enforces an exact match between the source and target schemas:

strict: Whether to perform strict schema validation (default: False).
  - False: Validates that all expected columns exist with compatible types (allows extra columns)
  - True: Validates exact schema match (same columns, same order, same types)

Column filtering uses the _get_schema method. We can modify this method to receive the strict parameter and conditionally filter the expected columns.


I think the missing-column bug happens because the engine auto-fills columns for has_valid_schema, which ends up filtering the expected schema down to only the columns that already exist in the DataFrame. As a result, missing expected columns never get checked. Because of that, I agree with removing the auto-preselection here (or changing it to use exclude_columns instead).

@fedeflowers (Author)

IMO, the right fix is to apply column selection only to the target schema when in strict mode.

We say in our documentation that strict mode enforces an exact match between the source and target schemas:

strict: Whether to perform strict schema validation (default: False).
  - False: Validates that all expected columns exist with compatible types (allows extra columns)
  - True: Validates exact schema match (same columns, same order, same types)

Column filtering uses the _get_schema method. We can modify this method to receive the strict parameter and conditionally filter the expected columns.

@ghanse While I understand the intent to conditionally filter based on strict mode, modifying _get_schema doesn't resolve the bug because of how the engine auto-injects arguments behind the scenes.

As @akshan-main correctly pointed out, the root cause is a semantic conflict with the columns argument. The @register_for_original_columns_preselection decorator forces the DQEngine to auto-inject the columns kwarg into the check with a list of all existing columns in the DataFrame so DQX internal columns are ignored.

However, has_valid_schema already uses the columns keyword argument to define a specific subset of columns to be validated. Because of the decorator, this user configuration is overridden. When _get_schema() is called with this auto-injected list of existing columns, it actively filters the expected_schema down to only the columns that currently exist in the DataFrame.

Because of this, today even strict=False (permissive mode) lets missing columns pass silently. The auto-injected list causes _get_schema to drop any missing columns from the expected_schema before _get_permissive_schema_comparison even runs.
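The failure mode can be reduced to a minimal, self-contained sketch (plain dicts stand in for Spark schemas; names are illustrative, not the actual DQX code):

```python
def buggy_missing_columns(df_columns, expected_schema, injected_columns):
    # The decorator injects columns=<all existing DataFrame columns>, and
    # the expected schema is then filtered by that list -- so any expected
    # column absent from the DataFrame is dropped before comparison.
    filtered_expected = {c: t for c, t in expected_schema.items() if c in injected_columns}
    return [c for c in filtered_expected if c not in df_columns]

# DataFrame has only `a`; the expected schema also requires `b`.
result = buggy_missing_columns(["a"], {"a": "int", "b": "string"}, injected_columns=["a"])
print(result)  # [] -- the missing column `b` is never reported
```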

If we only selectively applied column selection to the target schema in strict mode as you suggested:

  • In strict=True: _get_schema would still filter out the missing columns from the expected_schema because of the decorator's injection, resulting in a false PASS for missing columns.
  • In strict=False: _get_schema wouldn't filter them out, so the missing columns would finally be correctly flagged.

This would create a paradoxical situation where strict mode is actually less strict than permissive mode regarding missing columns.

By removing the decorator, we prevent the engine from spoofing the columns kwarg, preserving the user's intent and allowing missing columns to be accurately checked against the full expected_schema in both modes. To still ignore DQX internal/result columns safely, I've moved the actual column preselection logic directly inside the apply() closure where it filters actual_schema without mutating the expected_schema definition.

Let me know what you think!

@ghanse (Collaborator)

If we only selectively applied column selection to the target schema in strict mode as you suggested:

  • In strict=True: _get_schema would still filter out the missing columns from the expected_schema because of the decorator's injection, resulting in a false PASS for missing columns.
  • In strict=False: _get_schema wouldn't filter them out, so the missing columns would finally be correctly flagged.

We should invert this logic and apply column selection to the expected schema only when strict=True. The docs mention that non-strict mode will only validate the expected columns and not any extra columns.

@fedeflowers (Author)

Sorry @ghanse, I need some clarification on the exact requirements to ensure we align with the intended behavior:

  1. expected_schema filtering: Should the columns parameter filter the expected_schema only when strict=True? There is an existing test (has_valid_schema_with_specified_columns) that explicitly expects expected_schema filtering to happen when strict=False. If we change this, the test will need to be updated.
  2. Non-strict definition: In strict=False, does "expected columns" refer to the entire expected_schema, completely ignoring the columns parameter?
  3. Superset behavior: If the columns list contains columns not present in the DataFrame (e.g., DF has a, b, c but columns=["a", "b", "c", "d"]), should validation fail in both strict=True and strict=False? The documentation isn't fully clear on this edge case.

@ghanse (Collaborator)

Sorry, I meant we should apply column selection on the expected schema only when strict=False.

  1. Schema filtering: Filter the expected_schema only when strict=False. When strict=True, we should filter the DataFrame schema and compare against the unfiltered expected_schema.
  2. Non-strict definition: When strict=False, the "expected columns" are the subset of expected columns after filtering using the columns parameter. When strict=True, the "expected columns" are the full expected schema, either from the expected_schema, ref_df_name, or ref_table. We can add this explanation to the docstring.
  3. Superset behavior: I don't think we currently cover this edge case. IMO the validation should fail in both cases if the DataFrame schema does not contain all columns.

I believe this is in line with the docstring:

strict: Whether to perform strict schema validation (default: False).
  - False: Validates that all expected columns exist with compatible types (allows extra columns)
  - True: Validates exact schema match (same columns, same order, same types)
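These agreed semantics can be sketched in a few lines. This is an illustrative reduction under stated assumptions (plain dicts in place of Spark schemas, a made-up function name, no type-compatibility logic), not the actual DQX implementation:

```python
def validate_schema(observed, expected, columns=None, strict=False):
    """Illustrative sketch of the agreed semantics:
    - strict=False: filter the EXPECTED schema by `columns`; extra observed
      columns are allowed, missing expected columns fail.
    - strict=True: filter the OBSERVED schema by `columns` and require an
      exact match (same columns, same order) against the full expected schema.
    Returns a list of issues; an empty list means the check passes."""
    issues = []
    if strict:
        target = {c: t for c, t in observed.items() if columns is None or c in columns}
        if list(target.items()) != list(expected.items()):
            issues.append("schema mismatch")
    else:
        subset = {c: t for c, t in expected.items() if columns is None or c in columns}
        for name, dtype in subset.items():
            if name not in observed:
                issues.append(f"missing column: {name}")
            elif observed[name] != dtype:
                issues.append(f"type mismatch: {name}")
    return issues
```

Note that a missing expected column fails in both modes, which also covers the superset edge case from point 3.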

@fedeflowers (Author)

I have squashed the commits and updated the PR with a signed final commit.

Key updates include:

  • Docstring Updates: Refined the documentation to reflect the recent changes.
  • Test Case Adjustment: Updated test_has_valid_schema_with_specified_columns.
    • Reason: Previously, the test filtered the expected schema even when strict=True, which failed to correctly identify the missing column c. I've updated the expected result to ensure it correctly validates the schema under strict mode.
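The updated test scenario reduces to roughly the following (hypothetical column names and types; plain dicts stand in for Spark schemas):

```python
# DataFrame has columns a, b; the expected schema also requires c.
observed = {"a": "int", "b": "string"}
expected = {"a": "int", "b": "string", "c": "double"}
columns = ["a", "b"]

# strict=True: filter the observed schema by `columns`, then compare it
# against the FULL expected schema -- the missing column `c` now fails.
filtered_observed = {k: v for k, v in observed.items() if k in columns}
strict_result = filtered_observed == expected
print(strict_result)  # False: `c` is detected as missing
```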

Please let me know if this addresses the requirements!

@fedeflowers fedeflowers requested a review from ghanse February 23, 2026 08:51
@fedeflowers fedeflowers force-pushed the feature/fix-schema-validation branch from 09a5ca0 to d5b5f08 on February 24, 2026 14:51


Development

Successfully merging this pull request may close these issues.

[BUG]: has_valid_schema doesn't detect missing columns

3 participants