optimize pandas MultiIndex validation by avoiding materializing level values when possible #2118
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #2118      +/-   ##
==========================================
- Coverage   94.28%   93.49%   -0.79%
==========================================
  Files          91      135      +44
  Lines        7013    10796    +3783
==========================================
+ Hits         6612    10094    +3482
- Misses        401      702     +301
@cosmicBboy Would an optimization along the lines proposed here be viable in your view? There are probably a few design details to be worked out, but something like this would be a big performance improvement for us.
Hi @amerberg, this approach seems reasonable. Some clarifying questions:
As currently written, this implementation won't change failure reporting at all. The approach taken here is to abandon the optimization and switch to validating the full values as soon as any failure is encountered. That does mean failed validations will be slightly slower with this change. That includes things like …
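To make the fallback concrete, here is an illustrative sketch (the helper name and signature are hypothetical, not the PR's actual internals) of validating the small array of unique level values first and only materializing the full level when that fast path fails:

```python
import pandas as pd

def validate_level(mi: pd.MultiIndex, level: int, checks) -> bool:
    """Illustrative fast-path validation for one MultiIndex level.

    `checks` is assumed to be a list of callables returning bool,
    each with an outcome determined only by the unique values.
    """
    # Fast path: run checks on the small array of unique values.
    uniques = mi.levels[level]
    if all(check(uniques) for check in checks):
        # If the uniques pass, the full values must pass too
        # (by the "determined by unique" assumption).
        return True
    # Slow path: re-run on the materialized values so failure cases
    # reference actual rows; reporting is unchanged, at the cost of
    # doing the work twice on the failing path.
    full_values = mi.get_level_values(level)
    return all(check(full_values) for check in checks)
```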
That's right. Series and Index objects don't have the same codes/levels representation as a MultiIndex, so there isn't as much room for improvement there, as far as I know.
This PR implements it at the check level, and the optimization is then applied to a level at validation time if all of the checks for the schema component have `determined_by_unique=True`.
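For illustration, assuming the flag this PR introduces is exposed as a keyword argument on `Check` (the exact spelling of the API is a design detail still open above), a custom check might opt in like this:

```python
import pandera as pa

# A custom check whose outcome depends only on the unique values in
# the level, so it can opt in to the fast path introduced by this PR.
# Sketch only: assumes `determined_by_unique` is accepted as a
# `Check` keyword argument, per the attribute name used in this PR.
is_positive = pa.Check(
    lambda s: s > 0,
    name="is_positive",
    determined_by_unique=True,
)

index_schema = pa.MultiIndex([
    pa.Index(int, is_positive, name="id"),
    pa.Index(str, name="key"),
])
```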
This PR introduces a significant performance optimization for pandas MultiIndex validation that reduces both execution time and memory usage.
When validating a pandas MultiIndex, the current implementation calls `get_level_values` for every level. This is slow and memory intensive because pandas internally doesn't represent a MultiIndex level as a single array of values. Instead, it has one array with the "levels", which stores the unique values, and a separate array of "codes", which are integer references to positions in the levels array. Calling `get_level_values` on an index with many rows requires pandas to allocate and populate a large array using the levels and codes.

The key idea behind the change proposed here is that many common checks can be run just as well on the array of unique values as on the full array of values. For instance, we can check that a level has the right dtype, that integer values in a level are all positive, that strings in a level conform to a maximum length, or that a column of any type does not contain any nulls, just by looking at the unique values.
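As a concrete illustration of that representation, using pandas' public `levels`/`codes` attributes (sizes here are arbitrary):

```python
import numpy as np
import pandas as pd

# A MultiIndex with one million rows but only three unique values in
# its first level.
n = 1_000_000
mi = pd.MultiIndex.from_arrays(
    [np.random.choice(["a", "b", "c"], size=n), np.arange(n)],
    names=["key", "row"],
)

# Internally, the level is stored as a small array of unique values
# plus an array of integer codes pointing into it.
print(mi.levels[0])      # Index(['a', 'b', 'c'], dtype='object')
print(mi.codes[0][:5])   # e.g. [0 2 1 1 0], one small int per row

# get_level_values materializes all one million values.
full = mi.get_level_values("key")

# A check like "strings are at most one character long" can be
# answered from the three unique values alone.
assert (mi.levels[0].str.len() <= 1).all()
```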
The approach taken here is to define a new attribute, `determined_by_unique`, on the `Check` class, which can be set to `True` to indicate that the outcome of a check depends only on the unique values in the array being checked. This attribute is also set on built-in checks as appropriate. The `MultiIndexBackend` is also updated to validate on the unique levels when all checks on a level have `determined_by_unique=True`. (In the event that validating on unique values fails, we re-run validation on the fully materialized level to ensure that failure information will be returned correctly.)

This optimization can significantly improve running time and memory usage. For instance, in my local testing, the following benchmark script reported an average running time of 4.5s and peak memory usage of 453.9MB on main, which on this branch was reduced to 0.16s and 107.7MB: