Skip to content

perf: Optimize array_sort()#21083

Merged
comphead merged 6 commits intoapache:mainfrom
neilconway:neilc/array-sort-custom-kernel
Mar 22, 2026
Merged

perf: Optimize array_sort()#21083
comphead merged 6 commits intoapache:mainfrom
neilconway:neilc/array-sort-custom-kernel

Conversation

@neilconway
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

The previous array_sort implementation called the Arrow sort kernel for every row, and then used concat to produce the final results. This was quite inefficient. Instead, we employee three different techniques depending on the input:

(1) For arrays of primitives types without null elements, we copy all values into a single Vec, sort each row's slice of the Vec in-place, and then wrap the Vec in a GenericListArray.

(2) For arrays of primitives types with null elements, we use a similar approach but we need to incur some more bookkeeping to place null elements in the right place and construct the null buffer.

(3) For arrays of non-primitive types, we use RowConverter to convert the entire input into the row format in one call, sort row indices by comparing the encoded row values, and then use a single take() to construct the result of the sort.

Benchmarks (8192 rows, vs main):

int32/5 elements:          886 µs →  57 µs  (-94%)
int32/20 elements:        1.64 ms → 846 µs  (-48%)
int32/100 elements:       4.03 ms → 3.22 ms (-20%)
int32_null_elements/5:    1.17 ms → 168 µs  (-86%)
int32_null_elements/1000: 47.2 ms → 44.1 ms  (-7%)
string/5 elements:        2.12 ms → 727 µs  (-66%)
string/1000 elements:      405 ms → 293 ms  (-28%)

What changes are included in this PR?

  • New array_sort benchmark
  • Extended unit test coverage
  • Improve docs
  • Implement optimizations as described above

Are these changes tested?

No.

Are there any user-facing changes?

No.

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 20, 2026
@neilconway
Copy link
Contributor Author

FYI @Dandandan -- thanks for the suggestion about avoiding the per-row sort kernel, seems quite effective.

let values_start = offsets[0].as_usize();
let total_values = offsets[row_count].as_usize() - values_start;

let converter = RowConverter::new(vec![SortField::new_with_options(
Copy link
Contributor

@Dandandan Dandandan Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using a RowConverter for this one? I think the path in arrow-rs is something like:

  • partition on nulls
  • create the indices (begin...end)
  • sort by the strings (using the indices)
  • Add the nulls if it starts with nulls
  • Add the values (using take)
  • Add the nulls if it ends with nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly considered something like that, but I figured that all the pointer chasing would be pretty expensive. You're right that it's worth comparing though.

Here's a quick Claude-generated version -- lmk if you had something else in mind.

Benchmarking it against the RowComparator approach, RowComparator wins for medium-sized arrays (20 elements) and larger, and loses to the index-based comparison approach for small arrays:

  ┌─────────────┬──────────┬─────────────────┬─────────────────┐
  │  Benchmark  │   main   │  RowConverter   │ make_comparator │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/5    │ 2.12 ms  │ 727 µs (-66%)   │ 608 µs (-71%)   │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/20   │ 5.94 ms  │ 4.42 ms (-26%)  │ 4.76 ms (-20%)  │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/100  │ 26.8 ms  │ 22.6 ms (-16%)  │ 25.1 ms (-6%)   │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/1000 │ 404.9 ms │ 293.1 ms (-28%) │ 403.9 ms (~0%)  │
  └─────────────┴──────────┴─────────────────┴─────────────────┘

Not sure offhand which typical real-world workloads look like; lmk if you have a view.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Claude code looks close to what I had in mind except for the make_comparator and some non-null specializilation / Vec::push which can not be inlined and generate slow / branchy code.

I guess for strings you could also do the same tricks as used in arrow kernels, create: create some inlined key / small string as well for fast comparisons.

RowFilter is fine of course, but there is some higher fixed overhead upfront and has some higher space usage as well, so for single columns I think type specialization always wins.

@Dandandan
Copy link
Contributor

Woah, great!

@alamb
Copy link
Contributor

alamb commented Mar 21, 2026

Merged up from main to get the fix for cargo audit

@comphead comphead added this pull request to the merge queue Mar 22, 2026
Merged via the queue into apache:main with commit e5c69a4 Mar 22, 2026
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize array_sort with nested-list-aware sort kernel Optimize array_sort

4 participants