Optimize quaternion multiplication for SSE#1951

Merged
jrouwe merged 2 commits into jrouwe:master from akb825:quaternion-opt
Mar 20, 2026

Conversation


@akb825 akb825 commented Mar 16, 2026

Adjusted quaternion multiplication for SSE to be computed using only vertical operations. These operations were derived by taking the formula for each component, re-organizing the terms to group the portions that are subtracted, and shuffling the components as necessary to fit everything together. This uses fewer total instructions and is supported across all SSE versions. FMA operations are used when available to further reduce the instruction count.


CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

@akb825 akb825 force-pushed the quaternion-opt branch 3 times, most recently from c1c87ba to 6eb3bb3 Compare March 17, 2026 01:03

akb825 commented Mar 17, 2026

I have restored the previous implementation when CROSS_PLATFORM_DETERMINISTIC is enabled so that the determinism tests pass.

If you're OK with changing behavior between versions of Jolt, another option would be to re-organize the factors in the scalar version to match the new SSE implementation. For example, the following gives the same floating-point results between the scalar and new SSE versions for the full multiplication, preserving CROSS_PLATFORM_DETERMINISTIC:

	float x =  (lx * rw + ly * rz) + (lw * rx - lz * ry);
	float y =  (ly * rw + lz * rx) + (lw * ry - lx * rz);
	float z =  (lz * rw + lx * ry) + (lw * rz - ly * rx);
	float w = -(lx * rx + ly * ry) + (lw * rw - lz * rz);

I have a version stashed locally for this if that is your preferred approach; I would just need to update the hashes used for the determinism tests.

A third option would be to keep commented-out blocks for the scalar versions that match the optimized SSE version, which could be swapped in later if you don't want to change behavior today but want to bundle it with a future breaking change.

@akb825 akb825 force-pushed the quaternion-opt branch 3 times, most recently from b99fcbf to 8c4dfce Compare March 17, 2026 02:04

akb825 commented Mar 17, 2026

Here are some results from a release build (GCC 15.2.1 on a Ryzen 9900X; 12 cores, 24 threads) running PerformanceTest with the default parameters:

Original implementation:

Discrete, 1, 89.420504, 0x69890050ad4327fa
Discrete, 2, 152.577754, 0x69890050ad4327fa
Discrete, 3, 220.815830, 0x69890050ad4327fa
Discrete, 4, 282.233927, 0x69890050ad4327fa
Discrete, 5, 332.847740, 0x69890050ad4327fa
Discrete, 6, 389.765835, 0x69890050ad4327fa
Discrete, 7, 396.118177, 0x69890050ad4327fa
Discrete, 8, 415.337240, 0x69890050ad4327fa
Discrete, 9, 420.840093, 0x69890050ad4327fa
Discrete, 10, 452.534975, 0x69890050ad4327fa
Discrete, 11, 508.241349, 0x69890050ad4327fa
Discrete, 12, 481.631455, 0x69890050ad4327fa
Discrete, 13, 480.726535, 0x69890050ad4327fa
Discrete, 14, 487.539152, 0x69890050ad4327fa
Discrete, 15, 497.805240, 0x69890050ad4327fa
Discrete, 16, 511.072168, 0x69890050ad4327fa
Discrete, 17, 521.473252, 0x69890050ad4327fa
Discrete, 18, 531.069977, 0x69890050ad4327fa
Discrete, 19, 543.295002, 0x69890050ad4327fa
Discrete, 20, 557.431558, 0x69890050ad4327fa
Discrete, 21, 568.009020, 0x69890050ad4327fa
Discrete, 22, 573.674402, 0x69890050ad4327fa
Discrete, 23, 578.013527, 0x69890050ad4327fa
Discrete, 24, 585.432075, 0x69890050ad4327fa
LinearCast, 1, 82.171574, 0x119fed3f1b34fe3b
LinearCast, 2, 140.445241, 0x119fed3f1b34fe3b
LinearCast, 3, 194.358514, 0x119fed3f1b34fe3b
LinearCast, 4, 248.066862, 0x119fed3f1b34fe3b
LinearCast, 5, 296.383733, 0x119fed3f1b34fe3b
LinearCast, 6, 343.388799, 0x119fed3f1b34fe3b
LinearCast, 7, 338.311680, 0x119fed3f1b34fe3b
LinearCast, 8, 340.977768, 0x119fed3f1b34fe3b
LinearCast, 9, 356.527980, 0x119fed3f1b34fe3b
LinearCast, 10, 377.391947, 0x119fed3f1b34fe3b
LinearCast, 11, 393.156100, 0x119fed3f1b34fe3b
LinearCast, 12, 413.506985, 0x119fed3f1b34fe3b
LinearCast, 13, 425.142888, 0x119fed3f1b34fe3b
LinearCast, 14, 437.003258, 0x119fed3f1b34fe3b
LinearCast, 15, 448.801031, 0x119fed3f1b34fe3b
LinearCast, 16, 460.942047, 0x119fed3f1b34fe3b
LinearCast, 17, 469.844633, 0x119fed3f1b34fe3b
LinearCast, 18, 483.478421, 0x119fed3f1b34fe3b
LinearCast, 19, 494.376506, 0x119fed3f1b34fe3b
LinearCast, 20, 505.239974, 0x119fed3f1b34fe3b
LinearCast, 21, 514.867039, 0x119fed3f1b34fe3b
LinearCast, 22, 521.955794, 0x119fed3f1b34fe3b
LinearCast, 23, 516.855865, 0x119fed3f1b34fe3b
LinearCast, 24, 516.976555, 0x119fed3f1b34fe3b

New version (with FMADD):

Discrete, 1, 82.818946, 0xa1e008acca2086bd
Discrete, 2, 141.476999, 0xa1e008acca2086bd
Discrete, 3, 202.216324, 0xa1e008acca2086bd
Discrete, 4, 262.442209, 0xa1e008acca2086bd
Discrete, 5, 311.198906, 0xa1e008acca2086bd
Discrete, 6, 361.799642, 0xa1e008acca2086bd
Discrete, 7, 357.549405, 0xa1e008acca2086bd
Discrete, 8, 374.795839, 0xa1e008acca2086bd
Discrete, 9, 396.291963, 0xa1e008acca2086bd
Discrete, 10, 417.960779, 0xa1e008acca2086bd
Discrete, 11, 414.997268, 0xa1e008acca2086bd
Discrete, 12, 457.054459, 0xa1e008acca2086bd
Discrete, 13, 439.371623, 0xa1e008acca2086bd
Discrete, 14, 450.904211, 0xa1e008acca2086bd
Discrete, 15, 464.623039, 0xa1e008acca2086bd
Discrete, 16, 476.186199, 0xa1e008acca2086bd
Discrete, 17, 486.335003, 0xa1e008acca2086bd
Discrete, 18, 495.902976, 0xa1e008acca2086bd
Discrete, 19, 507.791624, 0xa1e008acca2086bd
Discrete, 20, 518.069155, 0xa1e008acca2086bd
Discrete, 21, 526.678758, 0xa1e008acca2086bd
Discrete, 22, 528.729839, 0xa1e008acca2086bd
Discrete, 23, 530.595032, 0xa1e008acca2086bd
Discrete, 24, 536.089312, 0xa1e008acca2086bd
LinearCast, 1, 84.386772, 0x2215d48255f4bf13
LinearCast, 2, 143.158635, 0x2215d48255f4bf13
LinearCast, 3, 198.294181, 0x2215d48255f4bf13
LinearCast, 4, 254.643693, 0x2215d48255f4bf13
LinearCast, 5, 305.592483, 0x2215d48255f4bf13
LinearCast, 6, 349.911801, 0x2215d48255f4bf13
LinearCast, 7, 360.543161, 0x2215d48255f4bf13
LinearCast, 8, 388.597113, 0x2215d48255f4bf13
LinearCast, 9, 385.337244, 0x2215d48255f4bf13
LinearCast, 10, 406.034126, 0x2215d48255f4bf13
LinearCast, 11, 419.499122, 0x2215d48255f4bf13
LinearCast, 12, 434.407009, 0x2215d48255f4bf13
LinearCast, 13, 448.172928, 0x2215d48255f4bf13
LinearCast, 14, 457.967111, 0x2215d48255f4bf13
LinearCast, 15, 470.411544, 0x2215d48255f4bf13
LinearCast, 16, 484.168922, 0x2215d48255f4bf13
LinearCast, 17, 497.480853, 0x2215d48255f4bf13
LinearCast, 18, 511.220869, 0x2215d48255f4bf13
LinearCast, 19, 520.911553, 0x2215d48255f4bf13
LinearCast, 20, 533.697837, 0x2215d48255f4bf13
LinearCast, 21, 543.400816, 0x2215d48255f4bf13
LinearCast, 22, 549.462545, 0x2215d48255f4bf13
LinearCast, 23, 558.854979, 0x2215d48255f4bf13
LinearCast, 24, 546.814588, 0x2215d48255f4bf13

New version (without FMADD):

Discrete, 1, 91.427587, 0x1dc7c6b5a6c6cbad
Discrete, 2, 155.359789, 0x1dc7c6b5a6c6cbad
Discrete, 3, 224.865426, 0x1dc7c6b5a6c6cbad
Discrete, 4, 291.133671, 0x1dc7c6b5a6c6cbad
Discrete, 5, 349.163930, 0x1dc7c6b5a6c6cbad
Discrete, 6, 395.146402, 0x1dc7c6b5a6c6cbad
Discrete, 7, 412.204350, 0x1dc7c6b5a6c6cbad
Discrete, 8, 400.459048, 0x1dc7c6b5a6c6cbad
Discrete, 9, 430.011496, 0x1dc7c6b5a6c6cbad
Discrete, 10, 444.553664, 0x1dc7c6b5a6c6cbad
Discrete, 11, 464.429650, 0x1dc7c6b5a6c6cbad
Discrete, 12, 484.742703, 0x1dc7c6b5a6c6cbad
Discrete, 13, 499.664620, 0x1dc7c6b5a6c6cbad
Discrete, 14, 508.502668, 0x1dc7c6b5a6c6cbad
Discrete, 15, 525.789459, 0x1dc7c6b5a6c6cbad
Discrete, 16, 531.037099, 0x1dc7c6b5a6c6cbad
Discrete, 17, 542.024585, 0x1dc7c6b5a6c6cbad
Discrete, 18, 558.006324, 0x1dc7c6b5a6c6cbad
Discrete, 19, 571.356273, 0x1dc7c6b5a6c6cbad
Discrete, 20, 582.126210, 0x1dc7c6b5a6c6cbad
Discrete, 21, 586.148002, 0x1dc7c6b5a6c6cbad
Discrete, 22, 587.500739, 0x1dc7c6b5a6c6cbad
Discrete, 23, 594.150946, 0x1dc7c6b5a6c6cbad
Discrete, 24, 600.054704, 0x1dc7c6b5a6c6cbad
LinearCast, 1, 86.154322, 0x3eb3b23b61b1d232
LinearCast, 2, 146.429203, 0x3eb3b23b61b1d232
LinearCast, 3, 203.373634, 0x3eb3b23b61b1d232
LinearCast, 4, 257.642197, 0x3eb3b23b61b1d232
LinearCast, 5, 308.305477, 0x3eb3b23b61b1d232
LinearCast, 6, 364.075335, 0x3eb3b23b61b1d232
LinearCast, 7, 350.601025, 0x3eb3b23b61b1d232
LinearCast, 8, 367.717942, 0x3eb3b23b61b1d232
LinearCast, 9, 374.979824, 0x3eb3b23b61b1d232
LinearCast, 10, 393.164743, 0x3eb3b23b61b1d232
LinearCast, 11, 413.626002, 0x3eb3b23b61b1d232
LinearCast, 12, 431.024551, 0x3eb3b23b61b1d232
LinearCast, 13, 442.200789, 0x3eb3b23b61b1d232
LinearCast, 14, 455.217964, 0x3eb3b23b61b1d232
LinearCast, 15, 465.369758, 0x3eb3b23b61b1d232
LinearCast, 16, 482.210247, 0x3eb3b23b61b1d232
LinearCast, 17, 488.871184, 0x3eb3b23b61b1d232
LinearCast, 18, 498.114639, 0x3eb3b23b61b1d232
LinearCast, 19, 509.777349, 0x3eb3b23b61b1d232
LinearCast, 20, 518.619012, 0x3eb3b23b61b1d232
LinearCast, 21, 528.199441, 0x3eb3b23b61b1d232
LinearCast, 22, 534.980398, 0x3eb3b23b61b1d232
LinearCast, 23, 535.269609, 0x3eb3b23b61b1d232
LinearCast, 24, 539.237886, 0x3eb3b23b61b1d232

Forcing scalar version for Quaternion multiply functions only:

Discrete, 1, 89.109721, 0x69890050ad4327fa
Discrete, 2, 149.792058, 0x69890050ad4327fa
Discrete, 3, 217.344187, 0x69890050ad4327fa
Discrete, 4, 280.306443, 0x69890050ad4327fa
Discrete, 5, 333.202864, 0x69890050ad4327fa
Discrete, 6, 394.652192, 0x69890050ad4327fa
Discrete, 7, 399.188795, 0x69890050ad4327fa
Discrete, 8, 413.244113, 0x69890050ad4327fa
Discrete, 9, 445.083890, 0x69890050ad4327fa
Discrete, 10, 428.925573, 0x69890050ad4327fa
Discrete, 11, 471.985550, 0x69890050ad4327fa
Discrete, 12, 462.244346, 0x69890050ad4327fa
Discrete, 13, 483.428690, 0x69890050ad4327fa
Discrete, 14, 489.302400, 0x69890050ad4327fa
Discrete, 15, 494.431168, 0x69890050ad4327fa
Discrete, 16, 509.214589, 0x69890050ad4327fa
Discrete, 17, 516.411078, 0x69890050ad4327fa
Discrete, 18, 531.348185, 0x69890050ad4327fa
Discrete, 19, 539.473273, 0x69890050ad4327fa
Discrete, 20, 553.537002, 0x69890050ad4327fa
Discrete, 21, 563.049228, 0x69890050ad4327fa
Discrete, 22, 567.124582, 0x69890050ad4327fa
Discrete, 23, 575.054701, 0x69890050ad4327fa
Discrete, 24, 577.134031, 0x69890050ad4327fa
LinearCast, 1, 81.259291, 0x119fed3f1b34fe3b
LinearCast, 2, 139.062300, 0x119fed3f1b34fe3b
LinearCast, 3, 192.790377, 0x119fed3f1b34fe3b
LinearCast, 4, 244.909839, 0x119fed3f1b34fe3b
LinearCast, 5, 291.534320, 0x119fed3f1b34fe3b
LinearCast, 6, 339.598798, 0x119fed3f1b34fe3b
LinearCast, 7, 335.485280, 0x119fed3f1b34fe3b
LinearCast, 8, 342.827226, 0x119fed3f1b34fe3b
LinearCast, 9, 356.537601, 0x119fed3f1b34fe3b
LinearCast, 10, 376.640920, 0x119fed3f1b34fe3b
LinearCast, 11, 394.114402, 0x119fed3f1b34fe3b
LinearCast, 12, 406.001636, 0x119fed3f1b34fe3b
LinearCast, 13, 420.454002, 0x119fed3f1b34fe3b
LinearCast, 14, 434.806929, 0x119fed3f1b34fe3b
LinearCast, 15, 447.269276, 0x119fed3f1b34fe3b
LinearCast, 16, 461.326986, 0x119fed3f1b34fe3b
LinearCast, 17, 471.203121, 0x119fed3f1b34fe3b
LinearCast, 18, 481.739126, 0x119fed3f1b34fe3b
LinearCast, 19, 492.802182, 0x119fed3f1b34fe3b
LinearCast, 20, 500.653689, 0x119fed3f1b34fe3b
LinearCast, 21, 513.146859, 0x119fed3f1b34fe3b
LinearCast, 22, 517.535846, 0x119fed3f1b34fe3b
LinearCast, 23, 521.653395, 0x119fed3f1b34fe3b
LinearCast, 24, 525.536158, 0x119fed3f1b34fe3b

The updated SSE version (without FMADD) appears to be universally faster than the original implementation on my CPU across all test cases, with a larger speedup over the original implementation than the original shows over scalar.

One interesting detail is that the FMADD version is slower than both the original and scalar versions for the discrete test, while the linear cast seems to be slower with one thread but faster with more threads. I suspect this may be due to the instruction dependency on the previous multiply, which can be slower if fewer instructions can run in parallel, and may be CPU dependent based on specific instruction timings. I'm not sure if you want to test across some other CPUs, or just remove the FMADD instructions to have more consistent results altogether.
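
As an aside on why enabling FMADD changes the hashes at all: a fused multiply-add rounds once, while a separate multiply and add round twice, so the two can differ in the last bit. A minimal standalone demonstration (not from Jolt):

```cpp
#include <cmath>

// A fused multiply-add computes round(a*b + c) with a single rounding, while
// mul-then-add computes round(round(a*b) + c). The volatile store prevents
// the compiler from contracting the expression into an FMA on its own.
float MulThenAdd(float a, float b, float c)
{
    volatile float product = a * b; // rounded to float here
    return product + c;             // rounded again
}

float FusedMulAdd(float a, float b, float c)
{
    return std::fmaf(a, b, c);      // single rounding of the exact a*b + c
}
```

For example, with a = 1 + 2^-12 the exact square is 1 + 2^-11 + 2^-24; the 2^-24 term is lost when the intermediate product is rounded to float, so `MulThenAdd(a, a, -1.0f)` and `FusedMulAdd(a, a, -1.0f)` return different values.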


akb825 commented Mar 17, 2026

After some further experimentation, I pushed a new version that removes all but one of the FMADD operations. This gives by far the best discrete results. The linear results are a little behind the standard SSE (without FMADD) version, though still improved over the original version. I'm open to removing the FMADD entirely if you'd prefer the linear savings.

Alternatively, as I expect this comes down to function inlining and how the compiler and CPU can order instructions to avoid stalls, it might also be worth having a separate function without FMADD so you can get the maximum for both the discrete and linear tests. This could get tricky if you start examining every single case that may benefit from one or the other, however, and may or may not be worth the effort.

Here are the results with the latest revision:

Discrete, 1, 97.129831, 0xb62f83232b7a7779
Discrete, 2, 163.008412, 0xb62f83232b7a7779
Discrete, 3, 234.275376, 0xb62f83232b7a7779
Discrete, 4, 301.483261, 0xb62f83232b7a7779
Discrete, 5, 358.640717, 0xb62f83232b7a7779
Discrete, 6, 416.315781, 0xb62f83232b7a7779
Discrete, 7, 421.610221, 0xb62f83232b7a7779
Discrete, 8, 436.600114, 0xb62f83232b7a7779
Discrete, 9, 436.920641, 0xb62f83232b7a7779
Discrete, 10, 452.768044, 0xb62f83232b7a7779
Discrete, 11, 489.038318, 0xb62f83232b7a7779
Discrete, 12, 494.996932, 0xb62f83232b7a7779
Discrete, 13, 507.867674, 0xb62f83232b7a7779
Discrete, 14, 521.844299, 0xb62f83232b7a7779
Discrete, 15, 541.291443, 0xb62f83232b7a7779
Discrete, 16, 553.567839, 0xb62f83232b7a7779
Discrete, 17, 557.199359, 0xb62f83232b7a7779
Discrete, 18, 571.632297, 0xb62f83232b7a7779
Discrete, 19, 593.190831, 0xb62f83232b7a7779
Discrete, 20, 592.016995, 0xb62f83232b7a7779
Discrete, 21, 598.891325, 0xb62f83232b7a7779
Discrete, 22, 611.379437, 0xb62f83232b7a7779
Discrete, 23, 616.705418, 0xb62f83232b7a7779
Discrete, 24, 624.859727, 0xb62f83232b7a7779
LinearCast, 1, 84.438377, 0x2e33f525552d46a1
LinearCast, 2, 143.974699, 0x2e33f525552d46a1
LinearCast, 3, 195.230415, 0x2e33f525552d46a1
LinearCast, 4, 255.336736, 0x2e33f525552d46a1
LinearCast, 5, 303.979836, 0x2e33f525552d46a1
LinearCast, 6, 351.966681, 0x2e33f525552d46a1
LinearCast, 7, 352.454064, 0x2e33f525552d46a1
LinearCast, 8, 356.763275, 0x2e33f525552d46a1
LinearCast, 9, 367.048720, 0x2e33f525552d46a1
LinearCast, 10, 384.356980, 0x2e33f525552d46a1
LinearCast, 11, 404.278587, 0x2e33f525552d46a1
LinearCast, 12, 419.173513, 0x2e33f525552d46a1
LinearCast, 13, 430.803782, 0x2e33f525552d46a1
LinearCast, 14, 441.251283, 0x2e33f525552d46a1
LinearCast, 15, 453.934621, 0x2e33f525552d46a1
LinearCast, 16, 467.644574, 0x2e33f525552d46a1
LinearCast, 17, 477.435904, 0x2e33f525552d46a1
LinearCast, 18, 487.702646, 0x2e33f525552d46a1
LinearCast, 19, 499.069974, 0x2e33f525552d46a1
LinearCast, 20, 508.652380, 0x2e33f525552d46a1
LinearCast, 21, 518.977658, 0x2e33f525552d46a1
LinearCast, 22, 521.385809, 0x2e33f525552d46a1
LinearCast, 23, 518.669395, 0x2e33f525552d46a1
LinearCast, 24, 526.334996, 0x2e33f525552d46a1


jrouwe commented Mar 17, 2026

First of all, thanks for your contribution! I will do some measuring myself to decide which version I want.

The problem with these measurements is that if the hash of the simulation changes, the simulation itself changes. So you may not be measuring the effect of your quaternion optimization: you may be measuring a pile of ragdolls falling over and spreading out over the floor (fewer contacts, much cheaper) vs. a pile staying a pile (more contacts, more expensive). This is probably why the measurements show such big and strange fluctuations.

Adjusted quaternion multiplication for SSE to be computed based on
performing only vertical operations. These operations were derived from
taking the formula for each component, re-organizing them to group which
portions are subtracted, and shuffling the components as necessary to fit it
together. This uses fewer total instructions and is supported across all
SSE versions. FMA operations are taken advantage of when available to
further reduce instructions.

The original implementation is kept when CROSS_PLATFORM_DETERMINISTIC is
provided. While it is slower than the newer implementation, it does provide
consistent floating-point results with the non-SSE version to ensure
cross-platform consistent results.

akb825 commented Mar 18, 2026

I see, those results definitely don't have much meaning then.

I have been running some performance tests on the raw multiplication within my own codebase, where I have a quaternion multiplication similar to what I submitted with this change. Here are some interesting tidbits from my experiments:

  • The current version of GCC auto-vectorizes the scalar quaternion multiplication.
  • If FMADD instructions are enabled and I don't have fp-contract disabled, the compiler will optimize both the scalar and SSE versions to use FMADD instructions. At least with my own version, I was originally getting the same results when just adjusting some defines to swap between the SSE and FMADD versions, as the compiler was optimizing them to the same thing.
  • FMADD is indeed slightly faster than SSE (~1%), and the SSE version is also faster than the auto-vectorized scalar version (~5%). (I don't have an equivalent for the original implementation from Jolt that I can compare with)
  • Interestingly, the fastest was actually the scalar version auto-vectorized by GCC with FMADD, which was ~1.5% faster than the hand-written version. I tried to reproduce the disassembly; it ended up having the same instructions but in a slightly different order (the shuffles were interleaved in different places), and it ended up slower. I tried manually placing the shuffles in the same places, but the optimizer moved them around and I still wasn't able to match the performance of the auto-vectorized version. A weird quirk, but certainly not something I'd want to rely on for the "best" implementation.

With these experiments, I found one additional FMADD operation for the full multiply that further compacts the instructions. I have uploaded my (hopefully final) revision, which restores all FMADD instructions where appropriate in addition to the new one missing from my previous revisions.


jrouwe commented Mar 19, 2026

I think the new version is superior to what I had written, so I've updated the hashes of the tests and removed the old implementation. Tomorrow I'll do a final check to see if all the simulations still look ok and then I'll merge it.

@jrouwe jrouwe merged commit 244f890 into jrouwe:master Mar 20, 2026
73 checks passed

jrouwe commented Mar 20, 2026

Thanks!
