
Conversation

@stakahashy stakahashy commented Jul 28, 2025

Summary

This PR addresses #2094 by adding a configuration option to limit the number of failure cases shown in SchemaError messages, preventing excessive log output when validating large dataframes with many validation failures.

Changes

  • Add max_reported_failures configuration option to PanderaConfig (default: 100)
  • Support environment variable PANDERA_MAX_REPORTED_FAILURES
  • Update pandas and polars backends to respect the limit
  • Add comprehensive tests for the new feature

Configuration Options

  • Default: 100 failure cases shown
  • -1: Current behavior. Show all failure cases
  • 0: Show only count, no examples

Temporarily show only 4 failure cases:

  with config_context(max_reported_failures=4):
      schema.validate(df)

Example output with limit:

  SchemaError: Column 'amount' failed validator: greater_than(0) failure cases: -1.5, -2.3, -0.1, -5.2, ... and 496 more failures (500 total)
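
The environment-variable route might look like this (a minimal sketch; it assumes the variable is read when pandera's config module is first imported, so it has to be set beforehand):

  import os

  # Assumption: like other PANDERA_* settings, the variable is read when
  # pandera's config module is imported, so set it before importing pandera.
  os.environ["PANDERA_MAX_REPORTED_FAILURES"] = "10"

  import pandera as pa  # subsequent validations report at most 10 failure cases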

Add a new configuration option to limit the number of failure cases
shown in SchemaError messages. This prevents excessive log output
when validating large dataframes with many failures.

- Add max_failure_cases field to PanderaConfig (default: 100)
- Support environment variable PANDERA_MAX_FAILURE_CASES
- Add max_failure_cases parameter to config_context
- Use -1 as sentinel value for showing all failure cases

Signed-off-by: Shuntaro Takahashi <[email protected]>
Update pandas error formatters to respect the max_failure_cases
configuration when constructing error messages.

- Modify format_vectorized_error_message to limit failure cases
- Update consolidate_failure_cases to apply limits
- Add truncation message showing omitted count when limit exceeded
- Handle -1 as "show all" sentinel value

Signed-off-by: Shuntaro Takahashi <[email protected]>
Update polars backend to respect the max_failure_cases configuration
when constructing error messages in run_check method.

- Get max_failure_cases from config context
- Apply limit when formatting failure case messages
- Add truncation message with omitted count when limit exceeded
- Handle -1 as "show all" sentinel value

Signed-off-by: Shuntaro Takahashi <[email protected]>
Add tests to verify the new max_failure_cases configuration works
correctly across different scenarios.

- Add config tests for environment variable parsing
- Add config_context tests for temporary overrides
- Add pandas backend tests for error message formatting
- Add polars backend tests for error message formatting
- Test default (100), custom limits, -1 (unlimited), and 0 (none)

Signed-off-by: Shuntaro Takahashi <[email protected]>
@stakahashy stakahashy force-pushed the limit_log_length_on_schema_error branch from e5440a9 to 9c51515 on July 28, 2025 02:47
@stakahashy stakahashy changed the title to feat: add max_failure_cases config to limit SchemaError output on Jul 28, 2025
@stakahashy (Author)

@cosmicBboy Hello! I've opened this pull request and was hoping you'd have a moment to review it when you're free. Please let me know if there's anything I can do to help move this forward. Thank you for your time!

@cosmicBboy (Collaborator)

this is great! taking a look

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 96.62921% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.63%. Comparing base (ff8674a) to head (f3a272b).

Files with missing lines | Patch % | Lines
pandera/backends/pandas/error_formatters.py | 92.00% | 2 Missing ⚠️
pandera/backends/ibis/error_formatters.py | 96.29% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2095      +/-   ##
==========================================
+ Coverage   81.51%   81.63%   +0.12%     
==========================================
  Files         137      140       +3     
  Lines       10879    10962      +83     
==========================================
+ Hits         8868     8949      +81     
- Misses       2011     2013       +2     

☔ View full report in Codecov by Sentry.

@cosmicBboy (Collaborator)

looks like there are some test failures. you can run specific tests locally like so: https://pandera.readthedocs.io/en/stable/CONTRIBUTING.html#run-a-specific-test-suite-locally

@cosmicBboy (Collaborator)

overall the changes look good, though it looks like there are CI test failures. You can run specific tests locally: https://pandera.readthedocs.io/en/stable/CONTRIBUTING.html#run-a-specific-test-suite-locally

@ybressler (Collaborator) left a comment

A few Q's for you:

  1. Why set the default to 100 and not -1? (Which is the most reasonable?)
  2. Your solution is implemented for pandas and polars, but not spark or ibis. Will you get these in another follow-up PR?
  3. Your logic between df frameworks is repeated and could benefit from refactoring into common code.
  4. Left some inline comments about form/nesting.

@ybressler (Collaborator)

Fixes: #577

@stakahashy (Author)

@cosmicBboy @ybressler Thank you for your feedback! I will address the issues this weekend.

@stakahashy stakahashy force-pushed the limit_log_length_on_schema_error branch from ee89005 to d64177b on August 23, 2025 15:25
@stakahashy stakahashy changed the title from feat: add max_failure_cases config to limit SchemaError output to feat: add max_reported_failures config to limit SchemaError output on Aug 23, 2025
@stakahashy stakahashy force-pushed the limit_log_length_on_schema_error branch from 01cee03 to 7da5066 on August 23, 2025 16:56
@ybressler (Collaborator) left a comment

Code has significant improvements - well done!

Follow-up work:

  1. Your decomposed functions (e.g. format_failure_cases_with_truncation()) should have unit tests to fully establish their separated functionality. Do this in a follow-up PR to limit the scope.
  2. I think it would be helpful to abstract the error reporting functionality into an abstract base class that is subclassed for each backend. This would allow neater testing, reusable design patterns, and more consistency throughout the codebase. It expands the scope significantly, so let's keep it out of the current scope (a rough sketch follows below).
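
For reference, one way such a base class could look (purely illustrative; the class and method names below are hypothetical, not existing pandera APIs):

  from abc import ABC, abstractmethod

  class ErrorReporter(ABC):
      """Hypothetical base class for backend-specific error reporting."""

      def report(self, failure_cases, max_reported_failures: int) -> str:
          # Shared truncation policy lives in the base class.
          cases = self._to_list(failure_cases)
          total = len(cases)
          if max_reported_failures >= 0:  # -1 means "no limit"
              cases = cases[:max_reported_failures]
          return self._format(cases, total)

      @abstractmethod
      def _to_list(self, failure_cases) -> list:
          """Convert backend-native failure cases (Series, pl.DataFrame, ...) to a list."""

      @abstractmethod
      def _format(self, cases: list, total: int) -> str:
          """Render the (possibly truncated) cases as an error message."""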

deepyaman and others added 9 commits September 7, 2025 23:51
…ILURES

Based on PR review feedback, renamed the constant for better clarity:
- Environment variable: PANDERA_MAX_FAILURE_CASES → PANDERA_MAX_REPORTED_FAILURES
- Config field: max_failure_cases → max_reported_failures
- Updated all references in code, tests, and documentation
- Renamed test files to match the new naming convention

This change makes it clearer that the configuration controls the maximum
number of failures that will be reported in error messages, not the
maximum number of failure cases that can occur.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Based on PR review feedback, simplified the nested conditionals:
- Handle unlimited case (-1) first with early return
- Handle zero case next (summary only)
- Use Python's forgiving slice behavior (no need to check bounds)
- Only add "omitted" message if we actually truncated

This reduces nesting from 3 levels to 2 and eliminates duplicate code.
The logic is now clearer and more maintainable.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
stakahashy and others added 14 commits September 7, 2025 23:51
…urns

- Extract helper function to enable early returns in pandas formatter
- Remove obvious comments that don't add value
- Consolidate error message building to reduce duplication
- Further reduce nesting and improve readability

The logic is now much cleaner and easier to follow.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Each test function now has a single, clear purpose:
- Split compound tests into focused individual tests
- Each test validates one specific behavior
- Improved test names to clearly describe what they test
- Fixed overly strict assertions that were causing false failures

This makes tests more maintainable and easier to understand.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
Environment variable handling is an integration concern, not a unit test
responsibility. Removed monkeypatch-based tests as they test system
configuration rather than the core logic.

The config_context tests adequately cover the functionality of limiting
reported failures.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
- Create common format_failure_cases_with_truncation function in pandera/backends/error_formatters.py
- Extract polars error formatting to pandera/backends/polars/error_formatters.py
- Update both pandas and polars backends to use the shared truncation logic
- Maintain backend-specific formatting while sharing core truncation behavior

This reduces code duplication and ensures consistent truncation behavior across all backends.

Signed-off-by: Shuntaro Takahashi <[email protected]>
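
For context, the shared helper described above could look roughly like this (a sketch only; the function name comes from the commit, but the signature and message wording are assumptions, not the actual implementation):

  def format_failure_cases_with_truncation(
      failure_cases: list, max_reported_failures: int
  ) -> str:
      """Render failure cases, truncating the list at the configured limit."""
      total = len(failure_cases)
      if max_reported_failures == -1:  # -1 is the "show everything" sentinel
          return ", ".join(str(case) for case in failure_cases)
      if max_reported_failures == 0:  # 0 reports the count only, no examples
          return f"{total} failure cases (examples omitted)"
      shown = failure_cases[:max_reported_failures]  # slicing past the end is safe
      message = ", ".join(str(case) for case in shown)
      if total > len(shown):
          message += f", ... and {total - len(shown)} more failures ({total} total)"
      return message
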
- Add format_failure_cases_message to pyspark backend for future extensibility
- Create ibis/error_formatters.py using common truncation logic
- Update ibis backend to use local error formatter with unified behavior
- Ensure all backends have consistent max_reported_failures support

Note: PySpark currently only supports scalar failures, but the infrastructure
is now in place for future vectorized failure case support.

Signed-off-by: Shuntaro Takahashi <[email protected]>
PySpark only supports scalar failure cases, not vectorized failures.
Removing the unused function to keep the codebase clean and avoid confusion.

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Applied black code formatting
- Fixed import sorting with isort
- Added missing newline at end of file
- All tests pass with formatted code

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Reduced pandas tests from 10 to 5 unique test functions (8 test cases with parametrize)
- Reduced polars tests from 8 to 5 unique test functions (7 test cases with parametrize)
- Used pytest.mark.parametrize to test multiple scenarios efficiently
- Removed redundant tests where results could be inferred from other tests
- All test coverage is maintained while reducing code duplication

Benefits:
- Faster test execution
- Easier maintenance
- Clearer test intent
- Less redundant code

Signed-off-by: Shuntaro Takahashi <[email protected]>
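
A rough idea of the parametrized style (not the actual test code; the fixture names and the asserted message fragment are assumptions based on the example output in the PR description):

  import pytest
  from pandera.config import config_context  # import location assumed
  from pandera.errors import SchemaError

  @pytest.mark.parametrize(
      "limit, truncated",
      [
          (3, True),    # a small limit truncates the reported examples
          (-1, False),  # -1 reports every failure case
      ],
  )
  def test_error_message_respects_limit(limit, truncated, failing_schema, bad_df):
      # failing_schema / bad_df are hypothetical fixtures producing many failures
      with config_context(max_reported_failures=limit):
          with pytest.raises(SchemaError) as exc_info:
              failing_schema.validate(bad_df)
      assert ("more failures" in str(exc_info.value)) is truncated
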
- Renamed test_max_reported_failures.py to test_polars_max_reported_failures.py
- Follows the convention of other test files in tests/polars/ directory
- Added missing newline at end of file

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Created comprehensive test suite for ibis backend max_reported_failures
- Tests are currently skipped due to environment-specific backend requirements
- Ibis uses the same error formatting code as pandas (via our unified implementation)
- Functionality is effectively tested through pandas tests
- PySpark tests not needed as it only supports scalar failures, not detailed failure cases

Note: Ibis tests can be enabled when proper backend (DuckDB, etc.) is configured
in the test environment.

Signed-off-by: Shuntaro Takahashi <[email protected]>
Signed-off-by: Shuntaro Takahashi <[email protected]>
…rted_failures

- Updated parametrized test cases in test_config.py
- All config tests now pass successfully

Signed-off-by: Shuntaro Takahashi <[email protected]>
- Added max_reported_failures field to expected dictionaries in test_pandas_config.py
- Added max_reported_failures field to expected dictionaries in test_pyspark_config.py
- Tests now properly account for the new config field introduced in the feature

Signed-off-by: Shuntaro Takahashi <[email protected]>
@stakahashy stakahashy force-pushed the limit_log_length_on_schema_error branch 2 times, most recently from c3da5b3 to 4a8504d on September 7, 2025 16:17
This commit properly merges upstream's polars testing infrastructure changes
while preserving our macOS-specific polars-lts-cpu improvement.

Key changes:
- Add POLARS_VERSIONS = ["0.20.0", "1.32.2"] constant
- Update _testing_requirements to accept polars parameter with default
- Change polars handling from version pinning to parametrized: polars=={polars}
- Add DATAFRAME_EXTRAS set for special handling of dataframe libraries
- Update EXTRA_PYTHON_PYDANTIC to use 4-tuple (extra, pandas, pydantic, polars)
- Polars tests use latest pandas/pydantic versions with parametrized polars versions
- Other dataframe libraries (pyspark, dask, ibis, modin) use None for pandas/pydantic
- Preserve macOS polars-lts-cpu=={polars} handling for improved compatibility
- Update tests function signature to include polars parameter

The resulting configuration generates 2 polars test sessions per Python version:
- tests-X.Y(extra='polars', pandas='2.2.3', pydantic='2.10.6', polars='0.20.0')
- tests-X.Y(extra='polars', pandas='2.2.3', pydantic='2.10.6', polars='1.32.2')

This resolves the merge conflicts with upstream while maintaining the existing
codebase functionality and improving polars version testing coverage.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

Signed-off-by: Shuntaro Takahashi <[email protected]>
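
Very roughly, the parametrization described above can be pictured like this (an illustrative sketch only; the real noxfile.py also handles extras, pandas, and pydantic versions):

  import sys
  import nox

  POLARS_VERSIONS = ["0.20.0", "1.32.2"]

  @nox.session
  @nox.parametrize("polars", POLARS_VERSIONS)
  def tests(session, polars):
      # macOS installs the lts-cpu build for better compatibility, per the commit above
      package = "polars-lts-cpu" if sys.platform == "darwin" else "polars"
      session.install(f"{package}=={polars}")
      session.run("pytest", "tests/polars")
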
@stakahashy stakahashy force-pushed the limit_log_length_on_schema_error branch from d484ded to 4c111a5 on September 7, 2025 17:06
@stakahashy stakahashy force-pushed the limit_log_length_on_schema_error branch from 0036afc to f1eb2e5 on September 8, 2025 01:13
@ybressler (Collaborator)

Feel free to resolve all addressed threads @stakahashy

@stakahashy (Author)

Feel free to resolve all addressed threads @stakahashy

@ybressler Thank you for taking care of this PR! I resolved the issues and will address the remaining tasks in another PR after this one is merged.

@cosmicBboy cosmicBboy closed this Nov 18, 2025
@cosmicBboy cosmicBboy reopened this Nov 18, 2025
  :param failure_cases: The failure cases to format (backend-specific type)
  :param total_failures: Total number of failures
  :param max_reported_failures: Maximum failures to report
      (-1 for unlimited, 0 for summary only)
A collaborator commented:

Nit: just passing by, but I feel like None for no limit is cleaner, and summary (if necessary) could be a different argument. Having magic numbers like -1 and 0 is confusing to users (even if documented, it's not intuitive).
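
For illustration, the None-based alternative suggested here might look like this (hypothetical; not a change this PR makes):

  from typing import Optional

  def format_failure_cases_with_truncation(
      failure_cases: list,
      total_failures: int,
      max_reported_failures: Optional[int] = None,  # None instead of the -1 sentinel
      summary_only: bool = False,                    # replaces the 0 sentinel
  ) -> str:
      if summary_only:
          return f"{total_failures} failure cases"
      shown = (
          failure_cases
          if max_reported_failures is None
          else failure_cases[:max_reported_failures]
      )
      return ", ".join(str(case) for case in shown)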
