feat: add max_reported_failures config to limit SchemaError output
#2095
base: main
Conversation
Add a new configuration option to limit the number of failure cases shown in SchemaError messages. This prevents excessive log output when validating large dataframes with many failures.

- Add max_failure_cases field to PanderaConfig (default: 100)
- Support environment variable PANDERA_MAX_FAILURE_CASES
- Add max_failure_cases parameter to config_context
- Use -1 as sentinel value for showing all failure cases

Signed-off-by: Shuntaro Takahashi <[email protected]>
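The behavior this commit describes can be mimicked with a small standalone sketch. The names (PanderaConfig, PANDERA_MAX_FAILURE_CASES, the -1 sentinel) follow the commit message, but this is an illustrative stand-in, not pandera's actual implementation:

```python
import os
from dataclasses import dataclass


@dataclass
class PanderaConfig:
    # -1 is the sentinel for "show all failure cases"
    max_failure_cases: int = 100


def config_from_env() -> PanderaConfig:
    """Build a config, letting the environment variable override the default."""
    raw = os.environ.get("PANDERA_MAX_FAILURE_CASES")
    if raw is None:
        return PanderaConfig()
    return PanderaConfig(max_failure_cases=int(raw))
```

For example, setting `PANDERA_MAX_FAILURE_CASES=-1` in the environment would yield a config that reports every failure case.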
Update pandas error formatters to respect the max_failure_cases configuration when constructing error messages.

- Modify format_vectorized_error_message to limit failure cases
- Update consolidate_failure_cases to apply limits
- Add truncation message showing omitted count when limit exceeded
- Handle -1 as "show all" sentinel value

Signed-off-by: Shuntaro Takahashi <[email protected]>
Update polars backend to respect the max_failure_cases configuration when constructing error messages in run_check method.

- Get max_failure_cases from config context
- Apply limit when formatting failure case messages
- Add truncation message with omitted count when limit exceeded
- Handle -1 as "show all" sentinel value

Signed-off-by: Shuntaro Takahashi <[email protected]>
Add tests to verify the new max_failure_cases configuration works correctly across different scenarios.

- Add config tests for environment variable parsing
- Add config_context tests for temporary overrides
- Add pandas backend tests for error message formatting
- Add polars backend tests for error message formatting
- Test default (100), custom limits, -1 (unlimited), and 0 (none)

Signed-off-by: Shuntaro Takahashi <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Force-pushed from e5440a9 to 9c51515
@cosmicBboy Hello! I've opened this pull request and was hoping you'd have a moment to review it when you're free. Please let me know if there's anything I can do to help move this forward. Thank you for your time!
this is great! taking a look
Codecov Report: ❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #2095 +/- ##
==========================================
+ Coverage 81.51% 81.63% +0.12%
==========================================
Files 137 140 +3
Lines 10879 10962 +83
==========================================
+ Hits 8868 8949 +81
- Misses 2011 2013 +2
looks like there are some test failures. you can run specific tests locally like so: https://pandera.readthedocs.io/en/stable/CONTRIBUTING.html#run-a-specific-test-suite-locally
overall the changes look good, though it looks like there are CI test failures. You can run specific tests locally: https://pandera.readthedocs.io/en/stable/CONTRIBUTING.html#run-a-specific-test-suite-locally
ybressler left a comment
A few Qs for you:
- Why set the default to 100 and not -1? (Which is the most reasonable?)
- Your solution is implemented for pandas and polars, but not spark or ibis. Will you get these in another follow-up PR?
- Your logic between df frameworks is repeated and could benefit from refactoring into common code.
- Left some inline comments about form/nesting
Fixes: #577
@cosmicBboy @ybressler Thank you for your feedback! I will deal with the issues this weekend.
Force-pushed from ee89005 to d64177b
Changed the title: max_failure_cases config to limit SchemaError output → max_reported_failures config to limit SchemaError output
Force-pushed from 01cee03 to 7da5066
Code has significant improvements - well done!
Follow-up work:
- Your decomposed functions (e.g. format_failure_cases_with_truncation()) should have unit tests to fully establish their separated functionality. Do in a follow-up PR to limit the scope.
- I think it would be helpful to abstract the error reporting functionality into an abstract base class which becomes subclassed for each backend. This would allow neater testing, reusable design patterns, and more consistency throughout the codebase. This expands the scope significantly, so let's keep it out of the current scope.
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Rush Kirubi <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Signed-off-by: Adam Merberg <[email protected]>
…ILURES

Based on PR review feedback, renamed the constant for better clarity:
- Environment variable: PANDERA_MAX_FAILURE_CASES → PANDERA_MAX_REPORTED_FAILURES
- Config field: max_failure_cases → max_reported_failures
- Updated all references in code, tests, and documentation
- Renamed test files to match the new naming convention

This change makes it clearer that the configuration controls the maximum number of failures that will be reported in error messages, not the maximum number of failure cases that can occur.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Based on PR review feedback, simplified the nested conditionals:
- Handle unlimited case (-1) first with early return
- Handle zero case next (summary only)
- Use Python's forgiving slice behavior (no need to check bounds)
- Only add "omitted" message if we actually truncated

This reduces nesting from 3 levels to 2 and eliminates duplicate code. The logic is now clearer and more maintainable.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
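The control flow this commit describes can be sketched as a standalone helper. The function name and message wording here are hypothetical, not the actual pandera formatter; only the limit semantics (-1 unlimited, 0 summary only) come from the PR:

```python
def format_failure_cases(cases: list, limit: int) -> str:
    """Format failure cases, truncating output to at most `limit` entries.

    Limit semantics follow the PR: -1 means unlimited, 0 means summary only.
    """
    summary = f"{len(cases)} failure case(s) found"
    if limit == -1:
        # Unlimited: early return listing every case.
        return "\n".join([summary, *map(str, cases)])
    if limit == 0:
        # Summary only: no individual examples.
        return summary
    shown = cases[:limit]  # Python slicing is forgiving; no bounds check needed
    lines = [summary, *map(str, shown)]
    omitted = len(cases) - len(shown)
    if omitted > 0:
        # Only mention truncation if something was actually cut.
        lines.append(f"... and {omitted} more failure case(s) omitted")
    return "\n".join(lines)
```

For example, `format_failure_cases(["a", "b", "c"], 2)` would list two cases and note that one was omitted, while a limit of 5 would list all three with no truncation note.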
…urns

- Extract helper function to enable early returns in pandas formatter
- Remove obvious comments that don't add value
- Consolidate error message building to reduce duplication
- Further reduce nesting and improve readability

The logic is now much cleaner and easier to follow.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Each test function now has a single, clear purpose:
- Split compound tests into focused individual tests
- Each test validates one specific behavior
- Improved test names to clearly describe what they test
- Fixed overly strict assertions that were causing false failures

This makes tests more maintainable and easier to understand.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Environment variable handling is an integration concern, not a unit test responsibility. Removed monkeypatch-based tests as they test system configuration rather than the core logic. The config_context tests adequately cover the functionality of limiting reported failures.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
- Create common format_failure_cases_with_truncation function in pandera/backends/error_formatters.py
- Extract polars error formatting to pandera/backends/polars/error_formatters.py
- Update both pandas and polars backends to use the shared truncation logic
- Maintain backend-specific formatting while sharing core truncation behavior

This reduces code duplication and ensures consistent truncation behavior across all backends.

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Add format_failure_cases_message to pyspark backend for future extensibility
- Create ibis/error_formatters.py using common truncation logic
- Update ibis backend to use local error formatter with unified behavior
- Ensure all backends have consistent max_reported_failures support

Note: PySpark currently only supports scalar failures, but the infrastructure is now in place for future vectorized failure case support.

Signed-off-by: Shuntaro Takahashi <[email protected]>
PySpark only supports scalar failure cases, not vectorized failures. Removing the unused function to keep the codebase clean and avoid confusion.

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Applied black code formatting
- Fixed import sorting with isort
- Added missing newline at end of file
- All tests pass with formatted code

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Reduced pandas tests from 10 to 5 unique test functions (8 test cases with parametrize)
- Reduced polars tests from 8 to 5 unique test functions (7 test cases with parametrize)
- Used pytest.mark.parametrize to test multiple scenarios efficiently
- Removed redundant tests where results could be inferred from other tests
- All test coverage is maintained while reducing code duplication

Benefits:
- Faster test execution
- Easier maintenance
- Clearer test intent
- Less redundant code

Signed-off-by: Shuntaro Takahashi <[email protected]>
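The parametrize consolidation this commit describes could look roughly like the following. The helper `truncate_count` and the test body are hypothetical illustrations, not the PR's actual test suite; only the scenario values (default 100, custom limit, -1 unlimited, 0 summary-only) come from the conversation:

```python
import pytest


def truncate_count(total: int, limit: int) -> int:
    """How many failure cases get reported for a given limit (hypothetical helper)."""
    if limit == -1:
        # -1 sentinel: report everything
        return total
    return min(total, limit)


@pytest.mark.parametrize(
    "total, limit, expected",
    [
        (250, 100, 100),  # default limit of 100
        (250, 5, 5),      # custom limit
        (250, -1, 250),   # -1: unlimited
        (250, 0, 0),      # 0: summary only, no examples
    ],
)
def test_truncate_count(total, limit, expected):
    assert truncate_count(total, limit) == expected
```

One parametrized function covers four scenarios that would otherwise need four near-identical test bodies, which is the duplication reduction the commit is after.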
- Renamed test_max_reported_failures.py to test_polars_max_reported_failures.py
- Follows the convention of other test files in tests/polars/ directory
- Added missing newline at end of file

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Created comprehensive test suite for ibis backend max_reported_failures
- Tests are currently skipped due to environment-specific backend requirements
- Ibis uses the same error formatting code as pandas (via our unified implementation)
- Functionality is effectively tested through pandas tests
- PySpark tests not needed as it only supports scalar failures, not detailed failure cases

Note: Ibis tests can be enabled when a proper backend (DuckDB, etc.) is configured in the test environment.

Signed-off-by: Shuntaro Takahashi <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
…rted_failures

- Updated parametrized test cases in test_config.py
- All config tests now pass successfully

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Added max_reported_failures field to expected dictionaries in test_pandas_config.py
- Added max_reported_failures field to expected dictionaries in test_pyspark_config.py
- Tests now properly account for the new config field introduced in the feature

Signed-off-by: Shuntaro Takahashi <[email protected]>
Force-pushed from c3da5b3 to 4a8504d
This commit properly merges upstream's polars testing infrastructure changes
while preserving our macOS-specific polars-lts-cpu improvement.
Key changes:
- Add POLARS_VERSIONS = ["0.20.0", "1.32.2"] constant
- Update _testing_requirements to accept polars parameter with default
- Change polars handling from version pinning to parametrized: polars=={polars}
- Add DATAFRAME_EXTRAS set for special handling of dataframe libraries
- Update EXTRA_PYTHON_PYDANTIC to use 4-tuple (extra, pandas, pydantic, polars)
- Polars tests use latest pandas/pydantic versions with parametrized polars versions
- Other dataframe libraries (pyspark, dask, ibis, modin) use None for pandas/pydantic
- Preserve macOS polars-lts-cpu=={polars} handling for improved compatibility
- Update tests function signature to include polars parameter
The resulting configuration generates 2 polars test sessions per Python version:
- tests-X.Y(extra='polars', pandas='2.2.3', pydantic='2.10.6', polars='0.20.0')
- tests-X.Y(extra='polars', pandas='2.2.3', pydantic='2.10.6', polars='1.32.2')
This resolves the merge conflicts with upstream while maintaining the existing
codebase functionality and improving polars version testing coverage.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Force-pushed from d484ded to 4c111a5
Force-pushed from 0036afc to f1eb2e5
Feel free to resolve all addressed threads @stakahashy
@ybressler Thank you for taking care of this PR! I resolved the issues and will address the remaining tasks in another PR after this one is merged.
:param failure_cases: The failure cases to format (backend-specific type)
:param total_failures: Total number of failures
:param max_reported_failures: Maximum failures to report
    (-1 for unlimited, 0 for summary only)
Nit: Just as I'm passing by, I feel like None for no limit is cleaner, and summary (if necessary) could be a different argument? Having magic numbers like -1 and 0 is confusing to users (even if documented, it's not intuitive).
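The reviewer's suggestion could look roughly like this sketch. It is not the merged API: `None` replaces the -1 "no limit" sentinel, and summary-only reporting becomes an explicit boolean instead of the 0 sentinel; the function and message names are hypothetical:

```python
from typing import Optional


def format_failures(
    cases: list,
    max_reported: Optional[int] = 100,
    summary_only: bool = False,
) -> str:
    """Variant where None (not -1) means "no limit" and summary-only is explicit."""
    summary = f"{len(cases)} failure case(s) found"
    if summary_only:
        return summary
    # None reads as "no limit" without any magic-number check.
    shown = cases if max_reported is None else cases[:max_reported]
    lines = [summary, *map(str, shown)]
    if len(shown) < len(cases):
        lines.append(f"... {len(cases) - len(shown)} omitted")
    return "\n".join(lines)
```

The upside is that callers never need to remember sentinel semantics; the downside is that two parameters now interact, which is the scope trade-off the comment acknowledges.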
Summary
This PR addresses #2094: it adds a configuration option to limit the number of failure cases shown in SchemaError messages, preventing excessive log output when validating large dataframes with many validation failures.
Changes
- Add max_reported_failures configuration option to PanderaConfig (default: 100)
- Support environment variable PANDERA_MAX_REPORTED_FAILURES

Configuration Options
- -1: Current behavior; show all failure cases
- 0: Show only the count, no examples

Temporarily show only 4 failure cases
Example output with limit: