fix: gc string view arrays in RepartitionExec#20500
fix: gc string view arrays in RepartitionExec#20500Samyak2 wants to merge 4 commits intoapache:mainfrom
Conversation
6a1547d to
471de13
Compare
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn hash_repartition_string_view_compaction() -> Result<()> { |
There was a problem hiding this comment.
This test does not actually exercise the regression it is meant to cover.
It only checks that repartition returns all rows. That would also pass before the gc() change.
As a result, we still do not have a test that would catch the over-counting bug if this logic regresses.
Please add an assertion that observes the compaction or accounting behavior directly. For example:
- Check that the total
get_array_memory_size()across the repartitioned outputs stays close to the original batch, instead of scaling with the number of output partitions. - Test spill behavior under a tight memory limit (e.g., spilled bytes).
- Verify
StringViewArraybuffer ownership after repartition, so outputs no longer all retain the original shared payload buffer.
There was a problem hiding this comment.
Yeah, I did try to add a test that checks for memory size specifically, but it seemed a bit fragile to assert on those numbers. Let me try the other approaches, thanks!
| if let Some(sv) = | ||
| col.as_any().downcast_ref::<StringViewArray>() | ||
| { | ||
| Arc::new(sv.gc()) |
There was a problem hiding this comment.
The new StringViewArray::gc() pass in repartition is very similar to the existing organize_stringview_arrays logic in datafusion/physical-plan/src/sorts/sort.rs.
A small shared helper would keep the workaround in one place and reduce the chance that sort/repartition diverge when Arrow view handling changes again.
There was a problem hiding this comment.
Agreed. I'll change this
There was a problem hiding this comment.
I have made this change.
- I wasn't sure of where to put this util. I have made it
pub(crate)and kept it inphysical-plan/src/common.rs - In the case of repartition, we would be doing the
RecordBatchconstructor validation twice when string view arrays are present (when there's no string view array, we return the same batch back). But this should be okay since the actual gc time will dominate over the time for schema validation, etc.
(will fix the other comment tomorrow)
Fixes apache#20491 - Took the fix from `ExternalSorter` introduced in apache#14823 - If any `StringViewArray` columns are present in the repartitioned input, we gc them to reduce duplicate tracking of the same string view buffer. - Fixes over-counting when there's a `RepartitionExec` above a partial agg on a `StringViewArray` column.
This is not needed for round robin repartition
4385fb4 to
0733758
Compare
Which issue does this PR close?
Utf8View/StringViewArray#20491.Rationale for this change
RepartitionExecabove a partial agg on aStringViewArraycolumn.ExternalSorterintroduced in Fix: External sort failing onStringViewdue to shared buffers #14823What changes are included in this PR?
StringViewArraycolumns are present in the repartitioned input, we gc them to reduce duplicate tracking of the same string view buffer.Are these changes tested?
Utf8View/StringViewArray#20491Are there any user-facing changes?
No