SIMD Acceleration of Truth-Table Operations with AVX2 #142
Closed
costamag wants to merge 11 commits into lsils:master from
Hi @msoeken. I modified the functions to better align with the code structure in …
…n of scalar vs. vector operation
I will open a clean PR after refactoring.
This PR introduces vectorized bitwise operations using 256-bit AVX2 registers. Each register processes four 64-bit words in parallel, enabling efficient computation across truth tables. These operations should be preferred over traditional scalar implementations for sufficiently large truth tables.
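As a rough illustration of the idea (not the code in this PR), the sketch below ANDs two truth tables stored as arrays of 64-bit words, four words per 256-bit register, and falls back to a scalar loop for small tables or CPUs without AVX2. The names `and_scalar`, `and_avx2`, `tt_and`, and the parameter `simd_threshold_words` are hypothetical, and the default threshold of 16 words is only a placeholder. The sketch assumes GCC or Clang on x86-64, where the `target("avx2")` attribute and `__builtin_cpu_supports` are available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define TT_HAS_AVX2_PATH 1
#else
#define TT_HAS_AVX2_PATH 0
#endif

// Scalar reference: one 64-bit word per iteration.
inline void and_scalar(const uint64_t* a, const uint64_t* b, uint64_t* out, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    out[i] = a[i] & b[i];
}

#if TT_HAS_AVX2_PATH
// Vector version: four 64-bit words per 256-bit register.
// The target attribute lets this compile without passing -mavx2 globally.
__attribute__((target("avx2")))
inline void and_avx2(const uint64_t* a, const uint64_t* b, uint64_t* out, std::size_t n)
{
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4)
  {
    // Unaligned loads and stores, since std::vector gives no 32-byte alignment guarantee.
    const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
    const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), _mm256_and_si256(va, vb));
  }
  // Scalar tail for the remaining 0 to 3 words.
  for (; i < n; ++i)
    out[i] = a[i] & b[i];
}
#endif

// Dispatch: use AVX2 only at or above a word-count threshold (hypothetical
// default; the real cutoff has to be measured) and only if the CPU supports it.
inline std::vector<uint64_t> tt_and(const std::vector<uint64_t>& a,
                                    const std::vector<uint64_t>& b,
                                    std::size_t simd_threshold_words = 16)
{
  assert(a.size() == b.size());
  std::vector<uint64_t> out(a.size());
#if TT_HAS_AVX2_PATH
  if (a.size() >= simd_threshold_words && __builtin_cpu_supports("avx2"))
  {
    and_avx2(a.data(), b.data(), out.data(), a.size());
    return out;
  }
#endif
  and_scalar(a.data(), b.data(), out.data(), a.size());
  return out;
}
```

Passing a very large `simd_threshold_words` forces the scalar path, which is handy for checking that both paths produce identical results.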
The optimal cutoff (in number of bits) at which AVX2 becomes advantageous may vary. Compiler-dependent benchmarking is required to identify this threshold. Tests show consistent speedups for 10-input static_truth_tables and 12-input dynamic_truth_tables.
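To put the sizes mentioned above in perspective, an n-input truth table has 2^n bits, that is 2^(n-6) 64-bit words for n ≥ 6. The small sketch below works this out; the helper names `tt_bits`, `tt_words`, and `tt_avx2_blocks` are made up for illustration and are not part of the library.

```cpp
#include <cstdint>

// Number of bits in a truth table over num_vars inputs (assumes num_vars < 64).
constexpr uint64_t tt_bits(unsigned num_vars)
{
  return uint64_t{1} << num_vars;
}

// Number of 64-bit words needed; tables with up to 6 inputs fit in one word.
constexpr uint64_t tt_words(unsigned num_vars)
{
  return num_vars <= 6 ? 1 : (uint64_t{1} << (num_vars - 6));
}

// Number of 256-bit blocks (four words each), rounded up.
constexpr uint64_t tt_avx2_blocks(unsigned num_vars)
{
  return (tt_words(num_vars) + 3) / 4;
}

static_assert(tt_bits(10) == 1024, "10 inputs: 1024 bits");
static_assert(tt_words(10) == 16, "10 inputs: 16 words");
static_assert(tt_avx2_blocks(10) == 4, "10 inputs: 4 AVX2 blocks");
static_assert(tt_words(12) == 64, "12 inputs: 64 words");
static_assert(tt_avx2_blocks(12) == 16, "12 inputs: 16 AVX2 blocks");
```

So the 10-input case corresponds to only four 256-bit blocks per operation, which is why loop and dispatch overhead can matter near the cutoff.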
Acknowledgment: Thanks to @Michal-Atlas for suggesting the use of Single Instruction, Multiple Data (SIMD) instructions to accelerate truth-table operations.
Remark: I also experimented with a pop-count implementation based on the AVX2 Harley–Seal algorithm described in “Faster Population Counts Using AVX2 Instructions” (arXiv:1611.07612). However, on machines with sufficiently large L1 caches, the scalar version still outperforms it. A faster pop-count remains an open optimization target.
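For readers unfamiliar with Harley–Seal, its core building block is a carry-save adder (CSA) that compresses three words into a "ones" word and a "twos" word, so fewer pop-counts are needed. The sketch below is a scalar, single-level illustration of that idea next to a plain per-word baseline. It is not the AVX2 implementation from the paper and not the code tried in this PR, and all function names are hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable per-word pop-count (compilers typically map this to a popcnt instruction).
inline uint64_t popcount64(uint64_t x)
{
  return std::bitset<64>(x).count();
}

// Baseline: pop-count every word and add up the results.
inline uint64_t popcount_scalar(const std::vector<uint64_t>& words)
{
  uint64_t total = 0;
  for (uint64_t w : words)
    total += popcount64(w);
  return total;
}

// Carry-save adder: for every bit position, a + b + c == 2 * high + low.
inline void csa(uint64_t& high, uint64_t& low, uint64_t a, uint64_t b, uint64_t c)
{
  const uint64_t u = a ^ b;
  high = (a & b) | (u & c);
  low = u ^ c;
}

// One-level CSA accumulation: two input words are folded into a running
// "ones" word per step, and only the resulting "twos" word is pop-counted.
inline uint64_t popcount_csa(const std::vector<uint64_t>& words)
{
  uint64_t total = 0;
  uint64_t ones = 0;
  uint64_t twos = 0;
  std::size_t i = 0;
  for (; i + 2 <= words.size(); i += 2)
  {
    csa(twos, ones, ones, words[i], words[i + 1]);
    total += 2 * popcount64(twos); // each set bit in twos stands for two ones
  }
  total += popcount64(ones); // flush the bits still held in the accumulator
  for (; i < words.size(); ++i)
    total += popcount64(words[i]); // leftover word when the count is odd
  return total;
}
```

The full algorithm stacks more CSA levels (fours, eights, sixteens) so that the pop-count is taken even less often. Whether that pays off against a hardware pop-count instruction is exactly the kind of question the remark above leaves open.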