Optimize quaternion multiplication for SSE #1951
Force-pushed from c1c87ba to 6eb3bb3
I have restored the previous implementation when CROSS_PLATFORM_DETERMINISTIC is defined. If you're OK with changing behavior between versions of Jolt, another option would be to re-organize the factors in the scalar version to match the new SSE implementation. For example, the following gives the same floating-point results:

```cpp
float x = (lx * rw + ly * rz) + (lw * rx - lz * ry);
float y = (ly * rw + lz * rx) + (lw * ry - lx * rz);
float z = (lz * rw + lx * ry) + (lw * rz - ly * rx);
float w = -(lx * rx + ly * ry) + (lw * rw - lz * rz);
```

I have a version stashed locally for this if that is your preferred approach; I would just need to update the hashes used for the determinism tests. A third option would be to keep commented-out blocks for the scalar versions that match the optimized SSE version, which could be swapped in at a later time if you don't want to change behavior today but want to bundle it with another future breaking change.
Force-pushed from b99fcbf to 8c4dfce
Here are some results from a release build (GCC 15.2.1 on a Ryzen 9900X; 12 cores, 24 threads):

- Original implementation:
- New version (with FMADD):
- New version (without FMADD):
- Forcing the scalar version for the quaternion multiply functions only:

The updated SSE version (without FMADD) appears to be universally faster than the original implementation on my CPU across all test cases, with a larger speedup over the scalar version than the original implementation had. One interesting detail is that the FMADD version is slower for the discrete cast than both the original and scalar versions, while the linear cast seems to be slower with one thread but faster with more threads. I suspect this may be due to the instruction dependency on the previous multiply, which may be slower if fewer instructions can run in parallel, and may be CPU dependent based on specific instruction timings. I'm not sure if you wanted to test across some other CPUs, or just remove the FMADD instructions to have more consistent results altogether.
After some further experimentation, I pushed a new version that removes all but one of the FMADD operations. This gives by far the best discrete results. The linear results are a little behind the plain SSE (without FMADD) version, though still improved over the original version. I'm open to removing FMADD entirely if you'd prefer the linear savings. Alternatively, since I expect this comes down to function inlining and how the compiler and CPU can order instructions to avoid waits, it might also be worth having a special function without FMADD so you get the maximum performance for both the discrete and linear tests. This could get tricky if you start looking at every single case that may benefit from one or the other, however, and may or may not be worth the effort. Here are the results with the latest revision:
First of all, thanks for your contribution! I will do some measuring myself to decide which version I want. The problem with these measurements is that if the hash of the simulation changes, the simulation itself changes. So you may not be measuring the effect of your quaternion optimization, but instead measuring a pile of ragdolls falling over and spreading out over the floor (fewer contacts, much cheaper) vs. a pile staying a pile (more contacts, more expensive). This is probably why the measurements show such big and strange fluctuations.
Adjusted quaternion multiplication for SSE to be computed using only vertical operations. These operations were derived by taking the formula for each component, re-organizing the terms to group the portions that are subtracted, and shuffling the components as necessary to fit everything together. This uses fewer total instructions and is supported across all SSE versions. FMA operations are taken advantage of when available to further reduce the instruction count. The original implementation is kept when CROSS_PLATFORM_DETERMINISTIC is defined. While it is slower than the newer implementation, it provides floating-point results consistent with the non-SSE version to ensure cross-platform determinism.
I see, those results definitely don't have much meaning then. I have been running some performance tests on the raw multiplication within my own codebase, where I have a quaternion multiplication similar to what I submitted with this change. Here are some interesting tidbits from my experiments:
In the course of these experiments, I also found one additional FMADD opportunity in the full multiply that further compacts the instructions. I have uploaded my (hopefully last) revision, which restores all the FMADD instructions where appropriate, plus the new one that my previous revisions didn't have.
I think the new version is superior to what I had written, so I've updated the hashes of the tests and removed the old implementation. Tomorrow I'll do a final check to see if all the simulations still look ok and then I'll merge it. |
Thanks!
Adjusted quaternion multiplication for SSE to be computed using only vertical operations. These operations were derived by taking the formula for each component, re-organizing the terms to group the portions that are subtracted, and shuffling the components as necessary to fit everything together. This uses fewer total instructions and is supported across all SSE versions. FMA operations are taken advantage of when available to further reduce the instruction count.