@edg956 edg956 commented Oct 24, 2025

Describe your changes:

Implements #24006

This PR adds a facade for validating pandas DataFrames. Its API is similar to the one in #23850, and it runs in a short-circuit execution mode.

Type of change:

  • New feature

Checklist:

  • I have read the CONTRIBUTING document.
  • The issue properly describes why the new feature is needed, what's the goal, and how we are building it. Any discussion
    or decision-making process is reflected in the issue.
  • I have added tests around the new logic.

@github-actions

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

github-actions bot commented Oct 24, 2025

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.12)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (31)

| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (2)

| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 70.3.0 | 78.1.1 |

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

@github-actions
Copy link
Contributor

github-actions bot commented Oct 24, 2025

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.9)

Vulnerabilities (19)

| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| libexpat1 | CVE-2023-52425 | 🚨 HIGH | 2.5.0-1+deb12u1 | 2.5.0-1+deb12u2 |
| libexpat1 | CVE-2024-8176 | 🚨 HIGH | 2.5.0-1+deb12u1 | 2.5.0-1+deb12u2 |
| libgnutls30 | CVE-2025-32988 | 🚨 HIGH | 3.7.9-2+deb12u3 | 3.7.9-2+deb12u5 |
| libgnutls30 | CVE-2025-32990 | 🚨 HIGH | 3.7.9-2+deb12u3 | 3.7.9-2+deb12u5 |
| libicu72 | CVE-2025-5222 | 🚨 HIGH | 72.1-3 | 72.1-3+deb12u1 |
| libperl5.36 | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| libperl5.36 | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| libsqlite3-0 | CVE-2025-6965 | 🔥 CRITICAL | 3.40.1-2+deb12u1 | 3.40.1-2+deb12u2 |
| libxslt1.1 | CVE-2024-55549 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u1 |
| libxslt1.1 | CVE-2025-24855 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u1 |
| libxslt1.1 | CVE-2025-7424 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u2 |
| perl | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| perl-base | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl-base | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| perl-modules-5.36 | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl-modules-5.36 | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| sqlite3 | CVE-2025-6965 | 🔥 CRITICAL | 3.40.1-2+deb12u1 | 3.40.1-2+deb12u2 |
| sudo | CVE-2025-32462 | 🚨 HIGH | 1.9.13p3-1+deb12u1 | 1.9.13p3-1+deb12u2 |

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (31)

| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (11)

| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Authlib | CVE-2025-59420 | 🚨 HIGH | 1.3.1 | 1.6.4 |
| Authlib | CVE-2025-61920 | 🚨 HIGH | 1.3.1 | 1.6.5 |
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| aiomysql | CVE-2025-62611 | 🚨 HIGH | 0.2.0 | 0.3.0 |
| apache-airflow-providers-common-sql | CVE-2025-30473 | 🚨 HIGH | 1.21.0 | 1.24.1 |
| deepdiff | CVE-2025-58367 | 🔥 CRITICAL | 7.0.1 | 8.6.1 |
| redshift-connector | CVE-2025-5279 | 🚨 HIGH | 2.1.5 | 2.1.7 |
| setuptools | CVE-2024-6345 | 🚨 HIGH | 65.5.1 | 70.0.0 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 65.5.1 | 78.1.1 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 70.3.0 | 78.1.1 |
| tornado | CVE-2025-47287 | 🚨 HIGH | 6.4.2 | 6.5 |

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

@TeddyCr TeddyCr requested a review from Copilot October 27, 2025 14:11

Copilot AI left a comment

Pull Request Overview

This PR introduces a simplified API for validating pandas DataFrames using OpenMetadata's existing test definitions. It provides a facade that wraps the existing validation infrastructure in a more user-friendly interface, and it adds a short-circuit execution mode that stops validation at the first failure.

Key changes:

  • New DataFrameValidator class providing a simple API for configuring and executing data quality tests on DataFrames
  • Support for short-circuit validation mode that stops execution after the first test failure
  • Comprehensive test suite covering success/failure scenarios, edge cases, and the new short-circuit functionality
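
The short-circuit behavior described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `SketchValidator`, `add_check`, `validate`, and the `short_circuit` flag are illustrative names, not the `DataFrameValidator` API added by this PR, and plain lists of dicts stand in for DataFrames so the example runs without pandas.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


@dataclass
class CheckResult:
    name: str
    passed: bool


@dataclass
class ValidationResult:
    results: List[CheckResult] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # True only when every check that actually ran has passed.
        return all(r.passed for r in self.results)


class SketchValidator:
    """Hypothetical facade; not the DataFrameValidator from this PR."""

    def __init__(self) -> None:
        self._checks: List[Tuple[str, Check]] = []

    def add_check(self, name: str, check: Check) -> "SketchValidator":
        self._checks.append((name, check))
        return self

    def validate(self, rows: List[Row], short_circuit: bool = True) -> ValidationResult:
        outcome = ValidationResult()
        for name, check in self._checks:
            passed = check(rows)
            outcome.results.append(CheckResult(name, passed))
            if short_circuit and not passed:
                break  # stop at the first failure; later checks never run
        return outcome


# Stand-in data: a duplicated id and a missing email.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 1, "email": None}]
validator = (
    SketchValidator()
    .add_check("id_unique", lambda rs: len({r["id"] for r in rs}) == len(rs))
    .add_check("email_not_null", lambda rs: all(r["email"] is not None for r in rs))
)
fast = validator.validate(rows)                       # stops after id_unique fails
full = validator.validate(rows, short_circuit=False)  # runs both checks
```

With short-circuiting, `fast` holds a single result, because the first failing check ends the run. `full` reports both failures at the cost of running every check.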

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| ingestion/src/metadata/sdk/data_quality/dataframes/dataframe_validator.py | Core validator facade providing the simplified API for DataFrame validation |
| ingestion/src/metadata/sdk/data_quality/dataframes/dataframe_validation_engine.py | Orchestration engine implementing the validation execution logic with short-circuit support |
| ingestion/src/metadata/sdk/data_quality/dataframes/dataframe_validator_adapter.py | Adapter translating DataFrame validation requests to the existing validator interface |
| ingestion/src/metadata/sdk/data_quality/dataframes/validation_results.py | Data models for validation results with helper properties |
| ingestion/src/metadata/sdk/data_quality/dataframes/__init__.py | Package exports for the new DataFrame validation module |
| ingestion/src/metadata/sdk/data_quality/__init__.py | Updated SDK exports to include new DataFrame validation classes |
| ingestion/src/metadata/data_quality/validations/column/pandas/__init__.py | New __init__.py exposing column validator classes |
| ingestion/src/metadata/data_quality/validations/table/pandas/__init__.py | New __init__.py exposing table validator classes |
| ingestion/tests/unit/sdk/data_quality/test_dataframe_validator.py | Comprehensive unit tests covering all validator functionality |
| ingestion/src/metadata/sdk/examples/dataframe_validation_example.py | Example code demonstrating various usage patterns |

Returns:
None
"""
self._check_full_table_tests_included()
Contributor Author

I call this here because, as the warning above explains, some tests need the whole DataFrame to return a valid result. That, however, means holding the whole DataFrame in memory, a case for which the on_success/on_failure interface may be less clear.

I decided to let application code be explicit about when validation runs in chunks (e.g., when the data source is too big for memory), a case where the callbacks make for a nice UX.

See the tests for the expected behavior of this warning.
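
A rough sketch of the trade-off described in this comment, under stated assumptions: `ChunkedSketchValidator`, `add_check`, `run`, and the `needs_full_table` flag are hypothetical names chosen for illustration, and lists of dicts stand in for DataFrame chunks so the example runs without pandas. Only the `on_success`/`on_failure` callback names and the idea of warning when full-table tests are included come from the comment itself.

```python
import warnings
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]
Chunk = List[Row]
Check = Callable[[Chunk], bool]


class ChunkedSketchValidator:
    """Hypothetical; illustrates the idea in the comment, not the PR's code."""

    def __init__(self) -> None:
        # Each entry: (name, check, needs_full_table)
        self._checks: List[Tuple[str, Check, bool]] = []

    def add_check(self, name: str, check: Check, needs_full_table: bool = False) -> "ChunkedSketchValidator":
        self._checks.append((name, check, needs_full_table))
        return self

    def _warn_if_full_table_checks(self) -> None:
        names = [name for name, _, full in self._checks if full]
        if names:
            warnings.warn(
                f"Checks {names} need the whole dataset; per-chunk results may be misleading",
                UserWarning,
                stacklevel=3,
            )

    def run(
        self,
        chunks: Iterable[Chunk],
        on_success: Callable[[Chunk], None],
        on_failure: Callable[[Chunk, List[str]], None],
    ) -> None:
        self._warn_if_full_table_checks()
        for chunk in chunks:
            failed = [name for name, check, _ in self._checks if not check(chunk)]
            if failed:
                on_failure(chunk, failed)
            else:
                on_success(chunk)


# id 2 appears in both chunks, so uniqueness holds per chunk but not overall.
chunks = [[{"id": 1}, {"id": 2}], [{"id": 2}, {"id": None}]]
loaded: List[Chunk] = []
rejected: List[List[str]] = []

validator = ChunkedSketchValidator()
validator.add_check("id_not_null", lambda c: all(r["id"] is not None for r in c))
validator.add_check(
    "id_unique",
    lambda c: len({r["id"] for r in c}) == len(c),
    needs_full_table=True,
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validator.run(
        chunks,
        on_success=loaded.append,
        on_failure=lambda chunk, failed: rejected.append(failed),
    )
```

The uniqueness check passes on each chunk even though id 2 occurs in both. This is the kind of result the warning is meant to flag when a check that needs the whole dataset is run chunk by chunk.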

@edg956 edg956 requested a review from TeddyCr October 28, 2025 13:09
Labels

Ingestion, safe to test (Add this label to run secure GitHub workflows on PRs)

Development

Successfully merging this pull request may close these issues.

Implement a simple API for validating DataFrames

3 participants