Simplified API for validating DataFrames #24009
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
🛡️ TRIVY SCAN RESULT 🛡️ Target:
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (2)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 70.3.0 | 78.1.1 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/extended_sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/lineage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️ Target:
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| libexpat1 | CVE-2023-52425 | 🚨 HIGH | 2.5.0-1+deb12u1 | 2.5.0-1+deb12u2 |
| libexpat1 | CVE-2024-8176 | 🚨 HIGH | 2.5.0-1+deb12u1 | 2.5.0-1+deb12u2 |
| libgnutls30 | CVE-2025-32988 | 🚨 HIGH | 3.7.9-2+deb12u3 | 3.7.9-2+deb12u5 |
| libgnutls30 | CVE-2025-32990 | 🚨 HIGH | 3.7.9-2+deb12u3 | 3.7.9-2+deb12u5 |
| libicu72 | CVE-2025-5222 | 🚨 HIGH | 72.1-3 | 72.1-3+deb12u1 |
| libperl5.36 | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| libperl5.36 | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| libsqlite3-0 | CVE-2025-6965 | 🔥 CRITICAL | 3.40.1-2+deb12u1 | 3.40.1-2+deb12u2 |
| libxslt1.1 | CVE-2024-55549 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u1 |
| libxslt1.1 | CVE-2025-24855 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u1 |
| libxslt1.1 | CVE-2025-7424 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u2 |
| perl | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| perl-base | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl-base | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| perl-modules-5.36 | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl-modules-5.36 | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| sqlite3 | CVE-2025-6965 | 🔥 CRITICAL | 3.40.1-2+deb12u1 | 3.40.1-2+deb12u2 |
| sudo | CVE-2025-32462 | 🚨 HIGH | 1.9.13p3-1+deb12u1 | 1.9.13p3-1+deb12u2 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Java
Vulnerabilities (31)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (11)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Authlib | CVE-2025-59420 | 🚨 HIGH | 1.3.1 | 1.6.4 |
| Authlib | CVE-2025-61920 | 🚨 HIGH | 1.3.1 | 1.6.5 |
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| aiomysql | CVE-2025-62611 | 🚨 HIGH | 0.2.0 | 0.3.0 |
| apache-airflow-providers-common-sql | CVE-2025-30473 | 🚨 HIGH | 1.21.0 | 1.24.1 |
| deepdiff | CVE-2025-58367 | 🔥 CRITICAL | 7.0.1 | 8.6.1 |
| redshift-connector | CVE-2025-5279 | 🚨 HIGH | 2.1.5 | 2.1.7 |
| setuptools | CVE-2024-6345 | 🚨 HIGH | 65.5.1 | 70.0.0 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 65.5.1 | 78.1.1 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 70.3.0 | 78.1.1 |
| tornado | CVE-2025-47287 | 🚨 HIGH | 6.4.2 | 6.5 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO
No Vulnerabilities Found
Pull Request Overview
This PR introduces a simplified API for validating pandas DataFrames using OpenMetadata's existing test definitions. It provides a facade that wraps the existing validation infrastructure with a more user-friendly interface, and adds a short-circuit execution mode that stops validation on the first failure.
Key changes:
- New `DataFrameValidator` class providing a simple API for configuring and executing data quality tests on DataFrames
- Support for short-circuit validation mode that stops execution after the first test failure
- Comprehensive test suite covering success/failure scenarios, edge cases, and the new short-circuit functionality
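To make the short-circuit mode concrete, here is a minimal sketch of the idea in isolation. It is not the PR's implementation: `run_checks`, `CheckResult`, and the `short_circuit` flag are hypothetical names, and the checks run over a plain list of dicts so the example has no dependencies (a pandas DataFrame could be passed to the same function, since the checks are just callables).

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical types for illustration; not the classes added by this PR.
Check = Tuple[str, Callable[[Any], bool]]


@dataclass
class CheckResult:
    name: str
    passed: bool


def run_checks(data: Any, checks: List[Check], short_circuit: bool = True) -> List[CheckResult]:
    """Run checks in order; with short_circuit=True, stop after the first failure."""
    results: List[CheckResult] = []
    for name, check in checks:
        passed = bool(check(data))
        results.append(CheckResult(name, passed))
        if short_circuit and not passed:
            break  # later checks are never executed
    return results


rows = [{"id": 1, "email": "a@example.com"}, {"id": 1, "email": None}]
checks: List[Check] = [
    ("ids_unique", lambda r: len({x["id"] for x in r}) == len(r)),
    ("email_not_null", lambda r: all(x["email"] is not None for x in r)),
]

short = run_checks(rows, checks, short_circuit=True)   # stops at ids_unique
full = run_checks(rows, checks, short_circuit=False)   # runs both checks
```

With short-circuiting, only the first failing check is reported; without it, every check runs and all failures are collected.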
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| ingestion/src/metadata/sdk/data_quality/dataframes/dataframe_validator.py | Core validator facade providing the simplified API for DataFrame validation |
| ingestion/src/metadata/sdk/data_quality/dataframes/dataframe_validation_engine.py | Orchestration engine implementing the validation execution logic with short-circuit support |
| ingestion/src/metadata/sdk/data_quality/dataframes/dataframe_validator_adapter.py | Adapter translating DataFrame validation requests to the existing validator interface |
| ingestion/src/metadata/sdk/data_quality/dataframes/validation_results.py | Data models for validation results with helper properties |
| ingestion/src/metadata/sdk/data_quality/dataframes/__init__.py | Package exports for the new DataFrame validation module |
| ingestion/src/metadata/sdk/data_quality/__init__.py | Updated SDK exports to include new DataFrame validation classes |
| ingestion/src/metadata/data_quality/validations/column/pandas/__init__.py | New `__init__.py` exposing column validator classes |
| ingestion/src/metadata/data_quality/validations/table/pandas/__init__.py | New `__init__.py` exposing table validator classes |
| ingestion/tests/unit/sdk/data_quality/test_dataframe_validator.py | Comprehensive unit tests covering all validator functionality |
| ingestion/src/metadata/sdk/examples/dataframe_validation_example.py | Example code demonstrating various usage patterns |
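As a rough illustration of what "data models for validation results with helper properties" can mean, the sketch below shows one possible shape. The class and property names (`CheckOutcome`, `ValidationSummary`, `success`, `failures`, `pass_rate`) are assumptions made for this example and are not taken from `validation_results.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CheckOutcome:
    # Hypothetical per-test record; field names are illustrative only.
    name: str
    passed: bool
    message: str = ""


@dataclass
class ValidationSummary:
    # Hypothetical aggregate result with derived, read-only helpers.
    outcomes: List[CheckOutcome] = field(default_factory=list)

    @property
    def success(self) -> bool:
        """True only if every recorded test passed."""
        return all(o.passed for o in self.outcomes)

    @property
    def failures(self) -> List[CheckOutcome]:
        """The outcomes that did not pass, in execution order."""
        return [o for o in self.outcomes if not o.passed]

    @property
    def pass_rate(self) -> float:
        """Fraction of tests that passed; 1.0 when nothing was run."""
        if not self.outcomes:
            return 1.0
        return sum(o.passed for o in self.outcomes) / len(self.outcomes)


summary = ValidationSummary(
    [CheckOutcome("row_count", True), CheckOutcome("email_not_null", False, "2 nulls")]
)
```

Keeping the helpers as properties means callers can branch on `summary.success` without re-deriving it from the raw outcomes.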
        Returns:
            None
        """
        self._check_full_table_tests_included()
I call this here because, as the warning above explains, some tests need the whole DataFrame to return a valid result. That in turn means the whole DataFrame has to be in memory, a case for which the on_success/on_failure interface may be less clear.
I decided to let application code be explicit when validation is to be run in chunks (e.g. when the data source is too big for memory), for which the callbacks make a nice UX.
See the tests for the expected behavior of this warning.
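The trade-off described above can be sketched as follows. This is only an illustration under stated assumptions: `Check`, `requires_full_table`, and `validate_chunks` are hypothetical names, the chunks are plain lists rather than DataFrames, and the real `_check_full_table_tests_included` may behave differently. Only the `on_success`/`on_failure` callback names come from the comment itself.

```python
import warnings
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Check:
    # Hypothetical check descriptor; requires_full_table marks tests such as
    # uniqueness or row counts that are only meaningful on the complete data.
    name: str
    predicate: Callable[[Any], bool]
    requires_full_table: bool = False


def validate_chunks(
    chunks: Iterable[Any],
    checks: List[Check],
    on_success: Callable[[Any], None],
    on_failure: Callable[[Any], None],
) -> None:
    """Validate each chunk independently and route it to a callback."""
    full_table = [c.name for c in checks if c.requires_full_table]
    if full_table:
        # Per-chunk results for these checks may be misleading.
        warnings.warn(
            f"Checks {full_table} need the whole table; chunked results may be invalid.",
            UserWarning,
        )
    for chunk in chunks:
        if all(c.predicate(chunk) for c in checks):
            on_success(chunk)
        else:
            on_failure(chunk)


non_negative = Check("non_negative", lambda ch: all(v >= 0 for v in ch))
loaded: List[Any] = []
rejected: List[Any] = []
validate_chunks([[1, 2], [3, -1], [5]], [non_negative], loaded.append, rejected.append)
```

Here the caller opts into chunked processing explicitly, and the callbacks decide what happens to each chunk (for example, load it or quarantine it), which is the UX the comment argues for.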
Describe your changes:
Implements #24006
This PR adds a facade for validating pandas DataFrames with an API similar to #23850, including a short-circuit execution mode.