Skip to content

fix: gc string view arrays in RepartitionExec#20500

Open
Samyak2 wants to merge 4 commits intoapache:mainfrom
Samyak2:fix-repartition-string-view-counting
Open

fix: gc string view arrays in RepartitionExec#20500
Samyak2 wants to merge 4 commits intoapache:mainfrom
Samyak2:fix-repartition-string-view-counting

Conversation

@Samyak2
Copy link
Contributor

@Samyak2 Samyak2 commented Feb 23, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  • If any StringViewArray columns are present in the repartitioned input, we gc them to reduce duplicate tracking of the same string view buffer.

Are these changes tested?

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 23, 2026
@Samyak2 Samyak2 force-pushed the fix-repartition-string-view-counting branch from 6a1547d to 471de13 Compare February 25, 2026 07:44
@Samyak2 Samyak2 marked this pull request as ready for review February 25, 2026 10:03
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Samyak2

Thanks for working on this.

}

#[tokio::test]
async fn hash_repartition_string_view_compaction() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not actually exercise the regression it is meant to cover.
It only checks that repartition returns all rows. That would also pass before the gc() change.

As a result, we still do not have a test that would catch the over-counting bug if this logic regresses.

Please add an assertion that observes the compaction or accounting behavior directly. For example:

  • Check that the total get_array_memory_size() across the repartitioned outputs stays close to the original batch, instead of scaling with the number of output partitions.
  • Test spill behavior under a tight memory limit (e.g., spilled bytes).
  • Verify StringViewArray buffer ownership after repartition, so outputs no longer all retain the original shared payload buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did try to add a test that checks for memory size specifically, but it seemed a bit fragile to assert on those numbers. Let me try the other approaches, thanks!

if let Some(sv) =
col.as_any().downcast_ref::<StringViewArray>()
{
Arc::new(sv.gc())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new StringViewArray::gc() pass in repartition is very similar to the existing organize_stringview_arrays logic in datafusion/physical-plan/src/sorts/sort.rs.

A small shared helper would keep the workaround in one place and reduce the chance that sort/repartition diverge when Arrow view handling changes again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'll change this

Copy link
Contributor Author

@Samyak2 Samyak2 Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made this change.

  • I wasn't sure of where to put this util. I have made it pub(crate) and kept it in physical-plan/src/common.rs
  • In the case of repartition, we would be doing the RecordBatch constructor validation twice when string view arrays are present (when there's no string view array, we return the same batch back). But this should be okay since the actual gc time will dominate over the time for schema validation, etc.

(will fix the other comment tomorrow)

Samyak2 added 3 commits March 17, 2026 23:05
Fixes apache#20491

- Took the fix from `ExternalSorter` introduced in apache#14823
- If any `StringViewArray` columns are present in the repartitioned
  input, we gc them to reduce duplicate tracking of the same string view
  buffer.
- Fixes over-counting when there's a `RepartitionExec` above a partial
  agg on a `StringViewArray` column.
This is not needed for round robin repartition
@Samyak2 Samyak2 force-pushed the fix-repartition-string-view-counting branch from 4385fb4 to 0733758 Compare March 17, 2026 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Over-counting of memory in aggregation + repartition over Utf8View/StringViewArray

2 participants