Enhance data contract quality rules generation with schema validation support#1043
mwojtyczka merged 18 commits into main from
Conversation
Pull request overview
This PR enhances the data contract quality rules generation feature to support schema validation based on ODCS (Open Data Contract Standard) v3.x contracts. The changes introduce automatic generation of has_valid_schema rules that ensure dataset schemas match contract definitions, resolving issue #1016.
Changes:
- Added schema validation rules generation via a new `generate_schema_validation` parameter (defaults to `True`)
- Introduced `InvalidPhysicalTypeError` for better error handling when `physicalType` is missing or invalid
- Updated sample contracts, unit tests, and integration tests to cover schema validation scenarios
- Enhanced documentation to describe the new feature
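To make the change concrete, here is a minimal, self-contained sketch of how a dataset-level `has_valid_schema` rule could be derived from an ODCS schema section. The rule name `has_valid_schema` and the role of `physicalType` come from the PR description; the helper function, the rule dictionary layout, and the sample contract are illustrative assumptions, not the actual dqx implementation.

```python
# Sketch: derive a has_valid_schema rule from an ODCS v3.x schema object.
# Only has_valid_schema and physicalType are taken from the PR; the rest
# (function name, rule dict shape, backtick escaping) is assumed.
from typing import Any

def schema_validation_rule(schema: dict[str, Any]) -> dict[str, Any]:
    """Build a dataset-level has_valid_schema rule from an ODCS schema."""
    columns = []
    for prop in schema["properties"]:
        physical_type = prop.get("physicalType")
        if not physical_type:
            # The PR introduces InvalidPhysicalTypeError for this case;
            # a plain ValueError stands in here.
            raise ValueError(
                f"Missing physicalType for property {prop['name']!r} "
                f"in schema {schema['name']!r}"
            )
        # Escape column names with backticks so the generated DDL stays
        # valid for names containing special characters.
        columns.append(f"`{prop['name']}` {physical_type.upper()}")
    return {
        "name": f"{schema['name']}_has_valid_schema",
        "criticality": "error",
        "check": {
            "function": "has_valid_schema",
            "arguments": {"expected_schema": ", ".join(columns)},
        },
    }

contract_schema = {
    "name": "orders",
    "properties": [
        {"name": "order_id", "physicalType": "bigint"},
        {"name": "amount", "physicalType": "decimal(10,2)"},
    ],
}
rule = schema_validation_rule(contract_schema)
print(rule["check"]["arguments"]["expected_schema"])
# → `order_id` BIGINT, `amount` DECIMAL(10,2)
```

The generated DDL string can then be compared against the actual dataset schema at runtime, which is what the dataset-level rule does.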
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/databricks/labs/dqx/errors.py | Added InvalidPhysicalTypeError exception class for schema validation errors |
| src/databricks/labs/dqx/profiler/generator.py | Added generate_schema_validation parameter to generate_rules_from_contract method |
| src/databricks/labs/dqx/datacontract/contract_rules_generator.py | Implemented schema validation logic with Unity Catalog type validation and DDL generation |
| src/databricks/labs/dqx/datacontract/__init__.py | Updated module docstring to reflect ODCS v3.x support |
| tests/unit/test_datacontract_generator.py | Added comprehensive unit tests for schema validation including error cases and edge cases |
| tests/integration/test_datacontract_integration.py | Added integration tests covering all code paths for schema validation |
| tests/datacontract_helpers.py | Added helper function to filter schema validation rules |
| tests/__init__.py | Added package initialization file |
| tests/resources/sample_datacontract.yaml | Updated all properties with physicalType (Unity Catalog types) and added comprehensive data types schema |
| docs/dqx/src/pages/index.tsx | Added Data Contract capability to homepage |
| docs/dqx/docs/reference/profiler.mdx | Updated API reference to document generate_schema_validation parameter |
| docs/dqx/docs/guide/data_contract_quality_rules_generation.mdx | Added schema validation section to user guide |
✅ 629/629 passed, 1 flaky, 41 skipped, 3h41m52s total. Running from acceptance #4100
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
##             main    #1043      +/-   ##
==========================================
+ Coverage   91.54%   91.87%   +0.32%
==========================================
  Files          98       98
  Lines        8945     9093     +148
==========================================
+ Hits         8189     8354     +165
+ Misses        756      739      -17

View full report in Codecov by Sentry.
mwojtyczka left a comment:
Code Review - PR #1043: Schema Validation Rules from Data Contracts
Overview
Well-implemented feature that generates dataset-level has_valid_schema rules from ODCS contracts. The core validation logic for Unity Catalog types (recursive validation of ARRAY/MAP/STRUCT, DECIMAL bounds checking, column name escaping) is solid and well-structured. Test coverage is thorough with both unit and integration tests covering happy paths, error paths, and edge cases.
Key Strengths
- Clean separation: one validator method per complex type
- Good error messages with context (schema name, property name, documentation links)
- Recursion depth guard prevents stack overflow on malformed input
- DECIMAL validation matches Spark's actual limits (precision 1-38, scale 0-precision)
- DDL normalization ensures consistent schema matching at runtime
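The recursion depth guard praised above can be sketched as follows. The guard itself is described in the review; the function name, depth limit, and the restriction to ARRAY types (MAP and STRUCT are omitted for brevity) are assumptions.

```python
# Minimal sketch of recursive type validation with a depth guard, as the
# review describes; names and the depth limit are illustrative assumptions.
_PRIMITIVES = {"STRING", "INT", "BIGINT", "DOUBLE", "BOOLEAN", "DATE", "TIMESTAMP"}
_MAX_DEPTH = 100  # guard against stack overflow on malformed, deeply nested input

def is_valid_type(ddl: str, depth: int = 0) -> bool:
    """Recursively validate a Unity Catalog type string (primitives and ARRAY)."""
    if depth > _MAX_DEPTH:
        return False  # refuse pathologically nested input instead of recursing forever
    ddl = ddl.strip().upper()
    if ddl in _PRIMITIVES:
        return True
    if ddl.startswith("ARRAY<") and ddl.endswith(">"):
        return is_valid_type(ddl[6:-1], depth + 1)  # validate the element type
    return False

print(is_valid_type("array<array<int>>"))                 # → True
print(is_valid_type("array<" * 200 + "int" + ">" * 200))  # → False (depth guard)
```

Without the depth check, a crafted contract with hundreds of nested `array<...>` wrappers could exhaust the Python recursion limit, which is the malformed-input case the review refers to.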
Issues and suggestions posted as inline comments below.
…ator

Changes:
- Added comments to clarify the exclusion of VOID and OBJECT types from Unity Catalog types.
- Updated the `get_schema_validation_rules` function signature for better type hinting.
- Refined the `_generate_rules_from_temp_contract` function to specify types for parameters and return values.

These updates improve code clarity and maintainability, ensuring better adherence to type safety in schema validation processes.
Co-authored-by: Marcin Wojtyczka <marcin.wojtyczka@databricks.com>
Changes
Linked issues
Resolves #1016
Tests