Conversation

@connortsui20
Contributor

@connortsui20 connortsui20 commented Nov 19, 2025

Reorganizes some of the filter implementations and adds AVX512 filter implementation and benchmarks

@connortsui20 connortsui20 requested a review from gatesn November 19, 2025 14:57
@codecov

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 75.83643% with 130 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.45%. Comparing base (c75c8a3) to head (eae0511).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| vortex-compute/src/filter/slice/simd_compress.rs | 9.57% | 85 Missing ⚠️ |
| vortex-compute/src/filter/slice/out/by_mask.rs | 55.55% | 28 Missing ⚠️ |
| vortex-compute/src/filter/slice/out/avx512.rs | 91.07% | 5 Missing ⚠️ |
| vortex-compute/src/filter/slice/out/by_bitview.rs | 87.50% | 5 Missing ⚠️ |
| vortex-compute/src/filter/slice/in_place/avx512.rs | 92.30% | 4 Missing ⚠️ |
| vortex-compute/src/filter/slice/out/mod.rs | 98.38% | 2 Missing ⚠️ |
| vortex-compute/src/filter/slice/in_place/mod.rs | 99.08% | 1 Missing ⚠️ |


@a10y
Contributor

a10y commented Nov 19, 2025

Does GitHub give us CI runners with AVX512 support? My recollection from when I did the AVX2 take kernel is that they don't.

@connortsui20 connortsui20 added the "feature" label ("Release label indicating a new feature or request") Nov 19, 2025
@connortsui20
Contributor Author

Not really sure how to deal with this. I think sometimes they have it, sometimes not?

@robert3005
Contributor

CI runs in AWS, so you can get whatever machine you want. Right now we default to m7i/m7a instances, which are Sapphire Rapids/Zen 4 and have AVX512. Benchmarks run on c6i, which is Ice Lake SP and also has AVX512.

@connortsui20 connortsui20 force-pushed the ct/avx512-filter branch 10 times, most recently from 02c00d3 to 4f8baa2 on November 19, 2025 19:16
@0ax1
Contributor

0ax1 commented Nov 20, 2025

Heads up that we can't micro-benchmark AVX512 with CodSpeed, as valgrind/callgrind does not support AVX512. That said, you'll get numbers for AVX2 if it is implemented:

      - name: Build benchmarks (shard 1)
        env:
          RUSTFLAGS: "-C target-feature=+avx2"
      ...

@connortsui20
Contributor Author

Given that we're pivoting to batch, I'm going to put a pin in this.

@gatesn
Contributor

gatesn commented Nov 20, 2025

Nooo I like this!

@gatesn gatesn reopened this Nov 20, 2025
@connortsui20 connortsui20 force-pushed the ct/avx512-filter branch 2 times, most recently from 4ae987c to be41513 on November 24, 2025 15:02
@connortsui20 connortsui20 marked this pull request as ready for review November 24, 2025 15:04
@connortsui20 connortsui20 marked this pull request as draft November 24, 2025 15:05
@connortsui20 connortsui20 force-pushed the ct/avx512-filter branch 2 times, most recently from e2529e9 to 36a28c6 on November 24, 2025 15:10
@connortsui20 connortsui20 marked this pull request as ready for review November 24, 2025 15:10
Contributor Author


The items here were moved to the `out` module in `slice`.

@connortsui20 connortsui20 requested a review from 0ax1 November 24, 2025 15:12
@codspeed-hq

codspeed-hq bot commented Nov 24, 2025

CodSpeed Performance Report

Merging #5399 will not alter performance

Comparing ct/avx512-filter (eae0511) with develop (c75c8a3)

Summary

✅ 1478 untouched
🆕 56 new
⏩ 214 skipped ¹

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- |
| 🆕 in_place_scalar[(1024, 0.0)] | N/A | 4.6 µs | N/A |
| 🆕 in_place_scalar[(1024, 0.1)] | N/A | 6.2 µs | N/A |
| 🆕 in_place_scalar[(1024, 0.25)] | N/A | 7 µs | N/A |
| 🆕 in_place_scalar[(1024, 0.5)] | N/A | 7.6 µs | N/A |
| 🆕 in_place_scalar[(1024, 0.75)] | N/A | 8.2 µs | N/A |
| 🆕 in_place_scalar[(1024, 0.9)] | N/A | 8.5 µs | N/A |
| 🆕 in_place_scalar[(1024, 1.0)] | N/A | 8.8 µs | N/A |
| 🆕 in_place_scalar[(131072, 0.0)] | N/A | 553.9 µs | N/A |
| 🆕 in_place_scalar[(131072, 0.1)] | N/A | 784.8 µs | N/A |
| 🆕 in_place_scalar[(131072, 0.25)] | N/A | 866 µs | N/A |
| 🆕 in_place_scalar[(131072, 0.5)] | N/A | 943.8 µs | N/A |
| 🆕 in_place_scalar[(131072, 0.75)] | N/A | 1 ms | N/A |
| 🆕 in_place_scalar[(131072, 0.9)] | N/A | 1.1 ms | N/A |
| 🆕 in_place_scalar[(131072, 1.0)] | N/A | 1.1 ms | N/A |
| 🆕 in_place_scalar[(16384, 0.0)] | N/A | 69.5 µs | N/A |
| 🆕 in_place_scalar[(16384, 0.1)] | N/A | 97.8 µs | N/A |
| 🆕 in_place_scalar[(16384, 0.25)] | N/A | 108.1 µs | N/A |
| 🆕 in_place_scalar[(16384, 0.5)] | N/A | 117.7 µs | N/A |
| 🆕 in_place_scalar[(16384, 0.75)] | N/A | 126.8 µs | N/A |
| 🆕 in_place_scalar[(16384, 0.9)] | N/A | 132.2 µs | N/A |
| ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Footnotes

  1. 214 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@connortsui20
Contributor Author

@0ax1 is there a way for me to disable the benchmark just for codspeed?

Comment on lines +37 to +40
if (mask[byte_idx] >> bit_idx) & 1 == 1 {
data[write_pos] = data[read_pos];
write_pos += 1;
}
Contributor

Note: I think you can do this faster by always doing a speculative write to data[write_pos], and then just incrementing write_pos by the mask bit on each loop iteration. FSST uses a similar trick in its compression loop.

Maybe save it for a FLUP (follow-up) though, so we can benchmark the difference.
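A minimal sketch of that trick (hypothetical helper names, not the PR's actual code), assuming the same byte-wise mask layout as the snippet under review: every element is written to `data[write_pos]` unconditionally, and the cursor advances by the mask bit, so the loop body has no data-dependent branch.

```rust
/// Branchless in-place filter sketch: speculatively write every element,
/// then advance the write cursor only when the mask bit is set.
/// Safe because write_pos never exceeds read_pos, so at worst we clobber
/// a slot that was not selected. (Hypothetical helper, for illustration.)
fn filter_in_place_branchless(data: &mut [u32], mask: &[u8]) -> usize {
    let mut write_pos = 0usize;
    for read_pos in 0..data.len() {
        let bit = (mask[read_pos / 8] >> (read_pos % 8)) & 1;
        // Speculative write: always store, even when the bit is 0 ...
        data[write_pos] = data[read_pos];
        // ... and keep it (advance the cursor) only when the bit is 1.
        write_pos += bit as usize;
    }
    write_pos
}

fn main() {
    let mut data = vec![10u32, 11, 12, 13, 14, 15, 16, 17];
    // Mask bits are LSB-first: keep indices 0, 2, 3, and 7.
    let mask = [0b1000_1101u8];
    let n = filter_in_place_branchless(&mut data, &mask);
    println!("{:?}", &data[..n]); // prints [10, 12, 13, 17]
}
```

Whether this beats the branchy version depends on the selectivity of the mask (the branch predictor does well at 0.0 and 1.0), which is presumably why benchmarking the difference first makes sense.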

Contributor Author

Yeah, this entire function is super inefficient, but I'm going to wait on optimizing these further until we actually start using it in operators and can get better / more realistic benchmarks.

Contributor

We probably shouldn't merge SIMD code unless it is fully optimized. It's a bit misleading

Contributor Author

I could just remove this then? This isn't SIMD code.

@0ax1
Contributor

0ax1 commented Nov 24, 2025

@0ax1 is there a way for me to disable the benchmark just for codspeed?

You shouldn't need to. We compile for AVX2, so the AVX512 version shouldn't be picked up.

Ah, I see that we're running into failed to execute the benchmark process, exit code: 132 (132 = 128 + SIGILL, i.e. an illegal instruction), which I assume means the implementation is picked based on the runtime capabilities of the CPU?

We discussed excluding the benchmark via: #[cfg(not(codspeed))]
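A sketch of that exclusion (assumptions: CodSpeed's build sets a `codspeed` cfg flag, and every name below is a hypothetical stand-in, not the repo's actual code): the AVX512 benchmark is simply not compiled when the flag is present, so the valgrind-based runner never reaches the unsupported instructions.

```rust
// Sketch: compile the AVX-512 benchmark out of CodSpeed builds.
// Assumes the `codspeed` cfg flag is set by CodSpeed's build, as
// discussed above; the benchmark name and body are placeholders.
#[cfg(not(codspeed))]
fn bench_filter_avx512() -> &'static str {
    // The real benchmark body would dispatch into the AVX-512 kernel.
    "ran avx512 filter benchmark"
}

fn main() {
    // In an ordinary build (no `codspeed` cfg), the benchmark is present.
    #[cfg(not(codspeed))]
    println!("{}", bench_filter_avx512());
}
```

With Cargo, the custom flag would also need to be declared (e.g. via `check-cfg`) to silence the `unexpected_cfgs` lint.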

@connortsui20
Contributor Author

Hmm, it seems I've done something wrong here then: https://github.com/vortex-data/vortex/actions/runs/19644046057/job/56254751182?pr=5399

@connortsui20 connortsui20 force-pushed the ct/avx512-filter branch 3 times, most recently from 4553fd8 to 723122a on November 25, 2025 16:37
Signed-off-by: Connor Tsui <[email protected]>
6 participants