Enhance data contract quality rules generation with schema validation support#1043

Merged
mwojtyczka merged 18 commits into main from feat-odcs-schema-check on Mar 13, 2026

Conversation

@vb-dbrks
Contributor

Changes

  1. Updated DQGenerator to support schema validation rules generation from ODCS contracts, ensuring dataset schemas match contract definitions.
  2. Added generate_schema_validation parameter to generate_rules_from_contract method, defaulting to True.
  3. Enhanced documentation to reflect the new schema validation feature and its requirements.
  4. Introduced InvalidPhysicalTypeError for better error handling when physical types are missing or invalid in schema properties.
  5. Updated integration tests to cover schema validation scenarios, ensuring rules are generated correctly.
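A minimal sketch of the flow these changes describe, assuming a simplified contract model: each property's `physicalType` is turned into one column of a DDL string, and that string becomes the argument of a single dataset-level `has_valid_schema` rule. `ContractProperty`, `escape_column`, `build_expected_schema`, and `build_schema_rule` are hypothetical names for illustration, the exception class is a local stand-in for the one the PR adds, and the rule dictionary layout (including the `expected_schema` argument name) is an assumption rather than the PR's actual output.

```python
from dataclasses import dataclass
from typing import Optional


class InvalidPhysicalTypeError(ValueError):
    """Local stand-in for the exception this PR adds to dqx/errors.py."""


@dataclass
class ContractProperty:
    # Hypothetical, simplified view of one ODCS schema property.
    name: str
    physical_type: Optional[str] = None


def escape_column(name: str) -> str:
    # Backtick-quote the identifier and double any embedded backticks.
    return "`" + name.replace("`", "``") + "`"


def build_expected_schema(schema_name: str, properties: list) -> str:
    columns = []
    for prop in properties:
        # A missing or blank physicalType cannot be turned into DDL.
        if not prop.physical_type or not prop.physical_type.strip():
            raise InvalidPhysicalTypeError(
                f"Property '{prop.name}' in schema '{schema_name}' has no physicalType"
            )
        columns.append(f"{escape_column(prop.name)} {prop.physical_type.strip()}")
    return ", ".join(columns)


def build_schema_rule(schema_name, properties, generate_schema_validation=True):
    # The flag mirrors the new parameter: when False, no schema rule is emitted.
    if not generate_schema_validation:
        return []
    return [
        {
            "criticality": "error",  # assumed default for illustration
            "check": {
                "function": "has_valid_schema",
                "arguments": {
                    "expected_schema": build_expected_schema(schema_name, properties)
                },
            },
        }
    ]
```

With properties `id BIGINT` and `order date DATE`, the sketch produces the DDL string `` `id` BIGINT, `order date` DATE `` and wraps it in one rule; passing `generate_schema_validation=False` yields an empty list.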

Linked issues

Resolves #1016

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

Contributor

Copilot AI left a comment

Pull request overview

This PR enhances the data contract quality rules generation feature to support schema validation based on ODCS (Open Data Contract Standard) v3.x contracts. The changes introduce automatic generation of has_valid_schema rules that ensure dataset schemas match contract definitions, resolving issue #1016.

Changes:

  • Added schema validation rules generation via a new generate_schema_validation parameter (defaults to True)
  • Introduced InvalidPhysicalTypeError for better error handling when physicalType is missing or invalid
  • Updated sample contracts, unit tests, and integration tests to cover schema validation scenarios
  • Enhanced documentation to describe the new feature

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/databricks/labs/dqx/errors.py | Added InvalidPhysicalTypeError exception class for schema validation errors |
| src/databricks/labs/dqx/profiler/generator.py | Added generate_schema_validation parameter to generate_rules_from_contract method |
| src/databricks/labs/dqx/datacontract/contract_rules_generator.py | Implemented schema validation logic with Unity Catalog type validation and DDL generation |
| src/databricks/labs/dqx/datacontract/__init__.py | Updated module docstring to reflect ODCS v3.x support |
| tests/unit/test_datacontract_generator.py | Added comprehensive unit tests for schema validation including error cases and edge cases |
| tests/integration/test_datacontract_integration.py | Added integration tests covering all code paths for schema validation |
| tests/datacontract_helpers.py | Added helper function to filter schema validation rules |
| tests/__init__.py | Added package initialization file |
| tests/resources/sample_datacontract.yaml | Updated all properties with physicalType (Unity Catalog types) and added comprehensive data types schema |
| docs/dqx/src/pages/index.tsx | Added Data Contract capability to homepage |
| docs/dqx/docs/reference/profiler.mdx | Updated API reference to document generate_schema_validation parameter |
| docs/dqx/docs/guide/data_contract_quality_rules_generation.mdx | Added schema validation section to user guide |
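The test helper mentioned for tests/datacontract_helpers.py could look roughly like the sketch below. The function name `get_schema_validation_rules` appears later in this thread, but its signature and the rule dictionary layout used here are assumptions.

```python
def get_schema_validation_rules(rules: list) -> list:
    # Keep only rules whose check function is has_valid_schema.
    # Assumes each rule is a dict shaped like {"check": {"function": ...}}.
    return [
        rule
        for rule in rules
        if rule.get("check", {}).get("function") == "has_valid_schema"
    ]
```

A helper like this lets a test assert on the schema rule separately from the column-level rules generated from the same contract.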

@vb-dbrks vb-dbrks requested review from ghanse and removed request for grusin-db February 23, 2026 14:48
@github-actions

github-actions bot commented Feb 23, 2026

✅ 629/629 passed, 1 flaky, 41 skipped, 3h41m52s total

Flaky tests:

  • 🤪 test_profiler_workflow_serverless (10.009s)

Running from acceptance #4100

@codecov

codecov bot commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 97.35099% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.87%. Comparing base (4ebeef2) to head (b9730e3).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../labs/dqx/datacontract/contract_rules_generator.py | 97.33% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1043      +/-   ##
==========================================
+ Coverage   91.54%   91.87%   +0.32%     
==========================================
  Files          98       98              
  Lines        8945     9093     +148     
==========================================
+ Hits         8189     8354     +165     
+ Misses        756      739      -17     

ghanse previously requested changes Feb 23, 2026
Collaborator

@ghanse ghanse left a comment

Left a few comments.

Contributor

@mwojtyczka mwojtyczka left a comment

Code Review - PR #1043: Schema Validation Rules from Data Contracts

Overview

Well-implemented feature that generates dataset-level has_valid_schema rules from ODCS contracts. The core validation logic for Unity Catalog types (recursive validation of ARRAY/MAP/STRUCT, DECIMAL bounds checking, column name escaping) is solid and well-structured. Test coverage is thorough with both unit and integration tests covering happy paths, error paths, and edge cases.

Key Strengths

  • Clean separation: one validator method per complex type
  • Good error messages with context (schema name, property name, documentation links)
  • Recursion depth guard prevents stack overflow on malformed input
  • DECIMAL validation matches Spark's actual limits (precision 1-38, scale 0 to precision)
  • DDL normalization ensures consistent schema matching at runtime
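To make the validation points above concrete, here is a rough sketch of recursive type checking with a depth guard and DECIMAL bounds. It is not the PR's code: `MAX_DEPTH`, `PRIMITIVES`, `split_top_level`, and `is_valid_type` are hypothetical, the primitive list is only a subset of Unity Catalog types, and the depth limit of 20 is an assumed value.

```python
import re

MAX_DEPTH = 20  # assumed guard value; the PR's actual limit is not stated here
# Illustrative subset of primitive type names, not the full Unity Catalog list.
PRIMITIVES = {
    "BOOLEAN", "TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE",
    "STRING", "BINARY", "DATE", "TIMESTAMP", "TIMESTAMP_NTZ",
}
DECIMAL_RE = re.compile(
    r"^(?:DECIMAL|DEC|NUMERIC)\s*\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\)$", re.IGNORECASE
)


def split_top_level(text: str) -> list:
    # Split on commas that are not nested inside <...> or (...).
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


def is_valid_type(type_str: str, depth: int = 0) -> bool:
    if depth > MAX_DEPTH:  # recursion guard for deeply nested or malformed input
        return False
    t = type_str.strip()
    upper = t.upper()
    if upper in PRIMITIVES or upper == "DECIMAL":
        return True
    match = DECIMAL_RE.match(t)
    if match:
        precision = int(match.group(1))
        scale = int(match.group(2)) if match.group(2) is not None else 0
        return 1 <= precision <= 38 and 0 <= scale <= precision
    if upper.startswith("ARRAY<") and t.endswith(">"):
        return is_valid_type(t[6:-1], depth + 1)
    if upper.startswith("MAP<") and t.endswith(">"):
        parts = split_top_level(t[4:-1])
        return len(parts) == 2 and all(is_valid_type(p, depth + 1) for p in parts)
    if upper.startswith("STRUCT<") and t.endswith(">"):
        for field in split_top_level(t[7:-1]):
            name, sep, field_type = field.partition(":")
            if not sep or not name.strip() or not is_valid_type(field_type, depth + 1):
                return False
        return True
    return False
```

Each complex type gets its own branch, and every recursive call increments `depth`, so a pathologically nested string is rejected instead of recursing without bound. In this sketch `DECIMAL(38, 10)` is accepted while `DECIMAL(39, 0)` and `DECIMAL(5, 6)` are rejected.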

Issues and suggestions posted as inline comments below.

Contributor

@mwojtyczka mwojtyczka left a comment

minor comments

vb-dbrks and others added 2 commits March 11, 2026 21:16
…ator

Changes:
- Added comments to clarify the exclusion of VOID and OBJECT types from Unity Catalog types.
- Updated the `get_schema_validation_rules` function signature for better type hinting.
- Refined the `_generate_rules_from_temp_contract` function to specify types for parameters and return values.

These updates improve code clarity and maintainability, ensuring better adherence to type safety in schema validation processes.
Co-authored-by: Marcin Wojtyczka <marcin.wojtyczka@databricks.com>
Contributor

@mwojtyczka mwojtyczka left a comment

LGTM

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Data Contract Implicit Checks to Include Schema Validation

4 participants