add AVX512 support for filtering in place #5399
base: develop
Conversation
Does GitHub give us CI runners with AVX512 support? My recollection from when I did the AVX2 take kernel is that they don't.
Not really sure how to deal with this; I think sometimes they have it, sometimes not?
CI runs in AWS, so you can get whatever machine you want. Right now by default we use m7i/m7a, which are Sapphire Rapids/Zen 4 and have AVX512. For benchmarks we run c6i, which is Ice Lake SP and also has AVX512.
Force-pushed 02c00d3 to 4f8baa2
Heads up that we can't micro-benchmark AVX512 with CodSpeed, as valgrind/callgrind does not support AVX512. That is to say, you'll get numbers for AVX2 if it is implemented:
Given that we're pivoting to batch, I'm going to put a pin in this.
Nooo I like this! |
Force-pushed 4ae987c to be41513
Force-pushed e2529e9 to 36a28c6
Force-pushed 36a28c6 to 923ddb4
vortex-compute/src/filter/slice.rs
Outdated
The items here were moved to the out module in slice.
Force-pushed df01d9c to 6468b12
CodSpeed Performance Report: Merging #5399 will not alter performance.
@0ax1 is there a way for me to disable the benchmark just for CodSpeed?
if (mask[byte_idx] >> bit_idx) & 1 == 1 {
    data[write_pos] = data[read_pos];
    write_pos += 1;
}
note: I think you can do this faster by always doing a speculative write to data[write_pos], and then just incrementing write_pos by the mask bit on each loop iteration. FSST uses a similar trick in its compression loop.
Maybe save it for a FLUP though, so we can benchmark the difference.
Yeah, this entire function is super inefficient, but I'm going to hold off on optimizing any of these further until we actually start using them in operators and can get better, more realistic benchmarks.
We probably shouldn't merge SIMD code unless it is fully optimized; it's a bit misleading.
I could just remove this then? This isn't SIMD code.
Discussed excluding the benchmark via:
Hmm, it seems I've done something wrong here then: https://github.com/vortex-data/vortex/actions/runs/19644046057/job/56254751182?pr=5399
Force-pushed 4553fd8 to 723122a
Signed-off-by: Connor Tsui <[email protected]>
Force-pushed 723122a to eae0511
Reorganizes some of the filter implementations, and adds an AVX512 filter implementation and benchmarks.