Conversation

@MarekKnapek
Contributor

Checklist

  • documentation is added or updated
  • tests are added or updated

Member

@sjaeckel sjaeckel left a comment


That looks interesting. Thanks for the next PR :)

Looking at it, it seems like we'd be trading space for time, meaning that execution should be slower with the patch applied.

So I modified the timing demo a bit to show something relevant, and the before looks as follows:

sha512              : Process at    39
sha512-256          : Process at    39
sha384              : Process at    39
sha512-224          : Process at    39
sha1                : Process at    61
sha256              : Process at   122
sha224              : Process at   122

vs. after this patch applied:

sha512              : Process at    39
sha384              : Process at    40
sha512-256          : Process at    40
sha512-224          : Process at    40
sha1                : Process at    68
sha224              : Process at   106
sha256              : Process at   106

sha1 really got worse, sha512-based stayed more or less the same (maybe a little bit slower), but sha256-based got significantly better performance!?

Not sure what to do with sha1, maybe enable this patch via a new LTC_SMALL_STACK option?
The other two I'd simply take unconditionally.

What do you think?

@MarekKnapek
Contributor Author

I think my next PR will be about x86 (and amd64) specific intrinsics, making SHA-1, SHA-256 and SHA-512 much, much faster.

  • How do I run these benchmarks myself? I could play with the code a bit more, maybe adding an if(i >= 16){ Wi(i) } condition into the loops.
  • Also what configuration option did you use? I mean with LTC_SMALL_CODE or without?
  • And how do I enable/disable this option at compile time? So far I manually edited some global header file to enable/disable this option. But I believe there might be some more kosher way of toggling this option.

* Add the option to only run for a subset of algos.
* Improve `hash` to show something meaningful.

Signed-off-by: Steffen Jaeckel <[email protected]>
@sjaeckel
Member

sjaeckel commented Dec 2, 2025

  • How do I run these benchmarks myself? I could play with the code a bit more, maybe adding an if(i >= 16){ Wi(i) } condition into the loops.

That's the timing demo in demos/timing.c. I've just pushed an update to it and ran it as ./timing hash sha, then removed the sha3 parts manually when pasting its output here.

  • And how do I enable/disable this option at compile time?

That depends on how you build the library.

I usually simply run make, so it's a matter of make -j$(($(nproc)*2+1)) timing CFLAGS="-DLTC_SMALL_CODE".

If you use CMake (and build in a folder inside the ltc folder) it'd be cmake -DLTC_CFLAGS="-DLTC_SMALL_CODE" -DCMAKE_BUILD_TYPE=Release -DBUILD_USABLE_DEMOS=On .. && make -j$(($(nproc)*2+1)), then run ./demos/ltc-timing hash sha.

Also what configuration option did you use? I mean with LTC_SMALL_CODE or without?

Those previous tests were done with the standard config. With LTC_SMALL_CODE enabled they look like this:

Before the patch:

sha512              : Process at    39
sha384              : Process at    39
sha512-256          : Process at    40
sha512-224          : Process at    40
sha1                : Process at    84
sha256              : Process at   108
sha224              : Process at   108

After the patch:

sha512-224          : Process at    45
sha512              : Process at    45
sha512-256          : Process at    45
sha384              : Process at    45
sha1                : Process at    77
sha256              : Process at   132
sha224              : Process at   134

So it seems like your patch improves the performance in the default case (LTC_SMALL_CODE undefined) for "sha256 based", but deteriorates for "sha1".

In the case LTC_SMALL_CODE is defined it improves "sha1", but deteriorates the two others.

FYI:

$ head /proc/cpuinfo | grep 'model name'
model name      : AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics

I think my next PR will be about x86 (and amd64) specific intrinsics

OK, that sounds nice. Are you also thinking about adding SHA-NI support? If so, you could have a look at #557 to see how we did it for AES-NI.

@MarekKnapek
Contributor Author

My performance measurements are different. Maybe it depends on processor cache size, branch prediction buffer size and many other things.

Before:

sha1                : Process at    98
sha224              : Process at   188
sha256              : Process at   188
sha384              : Process at    58
sha512              : Process at    58
sha512-224          : Process at    58
sha512-256          : Process at    58

After:

sha1                : Process at   100
sha224              : Process at   185
sha256              : Process at   185
sha384              : Process at    67
sha512              : Process at    67
sha512-224          : Process at    67
sha512-256          : Process at    67

My command line was:
make && make test && make docs && make timing && ./test && ./helper.pl -a && ./timing hash sha

Another measurement, this time with LTC_SMALL_CODE.

Before:

sha1                : Process at   135
sha224              : Process at   190
sha256              : Process at   190
sha384              : Process at    62
sha512              : Process at    62
sha512-224          : Process at    62
sha512-256          : Process at    62

After:

sha1                : Process at   110
sha224              : Process at   229
sha256              : Process at   229
sha384              : Process at    73
sha512              : Process at    73
sha512-224          : Process at    73
sha512-256          : Process at    73

Here I was able to improve the SHA-1 "after" speed from 110 to 121 by changing the first loop to:

    for (i = 0; i < 20; ) {
       if (i >= 16) { Wi(i); }
       FF0(a, b, c, d, e, i++);
       t = e; e = d; d = c; c = b; b = a; a = t;
    }

But it is still slower than before.

@sjaeckel
Member

sjaeckel commented Dec 3, 2025

My performance measurements are different.

For sure, since you most likely have a different CPU. But the differences between the algorithm classes themselves are comparable, and my statement from above:

[...] your patch improves the performance in the default case (LTC_SMALL_CODE undefined) for "sha256 based", but deteriorates for "sha1".

In the case LTC_SMALL_CODE is defined it improves "sha1", but deteriorates the two others.

is thereby validated.

Maybe it depends on processor cache size, branch prediction buffer size and many other things.

Absolutely.

Here I was able to improve SHA-1 after speed from 110 to 121 by changing the first loop to:

FYI: lower value = faster; the number shown is the number of CPU cycles per iteration, i.e. by changing it from 110 to 121 you made it 10% slower :-D

My command line was: make && make test && make docs && make timing && ./test && ./helper.pl -a && ./timing hash sha

No need to run all of these, especially not make test, since the timing demo already runs the self-tests of the hash algorithms and would exit with an error if they failed.

To speed up your local development cycle I'd suggest running make -j$(($(nproc)*2+1)) timing && ./timing hash sha.
You most likely have a multi-core CPU as well, so you can take advantage of that and run parallel builds via the -j option.

I never run ./helper.pl -a manually; you can execute make install_hooks once, which installs a Git pre-commit hook that checks this succeeds before committing.

