Intrinsics experimental POC #52
Blake
Beforehand - non-intrinsics version (with reduced memory usage)
Intrinsics refactor d90f3a6
Using code from @saucecontrol Blake2Fast
Adapted Blake2bSimd to use intrinsics in d90f3a6. It's arguably more readable than the Blake2Fast version but is slightly slower; both are at least 4x faster than the current version. Changed Blake2bNormal to use Blake2bNormal before stackalloc |
I'm just getting some time to look at this repo for the first time in like years. I appreciate this PR so much. This is exactly what I wanted to do with this library from the start. But I was a C++ developer working on .NET core 1.0 on Linux. I'm going to re-open this with your commits squashed while trying to add ARM64/SVE support and tidying up the CI / versioning stuff before publishing. Thank you so much for this! |
Also adding a note here that .NET 8 is beginning to introduce support for AVX512, which will require an update to use but should improve things even more, assuming newer gen chips don't suffer from the power draw issues that they historically have had with AVX512 extensions. |
Glad you liked it! 😄
|
Added the PHC compress function. It passes the tests but the old hacky version appears to be faster?
Hacky version
PHC
|
I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall. The PHC version is just easier to read and link back to the reference implementation. |
Ratio seems to be lower for the older version. In my runs PHC seemed to be 0.60-0.55 with the older one at 0.55-0.40.
100% agree, no idea why I didn't look for the official version sooner. |
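For anyone reading along who hasn't looked at the reference implementation: a minimal scalar sketch of the BlaMka round function that the PHC compress step is built on, as I understand it from the Argon2 reference code. The class and method names here (`BlaMkaSketch`, `FBlaMka`, `G`) are made up for illustration and are not the names used in this PR.

```csharp
using System.Numerics;

// Hypothetical names; a sketch of the Argon2 reference round function, not this PR's code.
public static class BlaMkaSketch
{
    // fBlaMka: x + y + 2 * lo32(x) * lo32(y), with wrap-around at 2^64.
    public static ulong FBlaMka(ulong x, ulong y)
    {
        // The product of two 32-bit values always fits in 64 bits.
        ulong xy = (x & 0xFFFFFFFFUL) * (y & 0xFFFFFFFFUL);
        return unchecked(x + y + 2 * xy);
    }

    // G: the BLAKE2b quarter-round with additions replaced by fBlaMka.
    // Rotation amounts are 32, 24, 16 and 63, as in BLAKE2b.
    public static void G(ref ulong a, ref ulong b, ref ulong c, ref ulong d)
    {
        a = FBlaMka(a, b);
        d = BitOperations.RotateRight(d ^ a, 32);
        c = FBlaMka(c, d);
        b = BitOperations.RotateRight(b ^ c, 24);
        a = FBlaMka(a, b);
        d = BitOperations.RotateRight(d ^ a, 16);
        c = FBlaMka(c, d);
        b = BitOperations.RotateRight(b ^ c, 63);
    }
}
```

The compress function applies `G` to groups of four 64-bit words taken from the rows and then the columns of the block, which is the part the intrinsics versions vectorize.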
Squashed branch is at https://github.com/kmaragon/Konscious.Security.Cryptography/tree/feature/intrinsics. I'm also calling this 2.0 and getting rid of the ability to explicitly integrate the tasks and just pushing it all into Parallel.ForEach with no async contracts. From the issues it seems like no one is able to use the async contracts. Or maybe it's just that they are the loudest bunch. Either way, I'll remove them entirely. I've implemented the modifiedblake2 stuff there for ARM. I'd like to get SSE4 in there too for good measure. It'll probably be similar to the ARM NEON implementation. I was looking at saucecontrol's blake2 work and it's strictly for x86. That said, their SSE4 implementation may serve as a reasonable base for AdvSimd support. .NET 8.0 is looking to be adding support for AVX512 but I see no word on ARMv9 SVE2 yet. I expect the latter to be the biggest bump in perf for users on the hardware. Maybe the next gen Apple chips? |
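To illustrate the multi-target idea above, here is a rough sketch of how picking between AVX2, SSE4.1, AdvSimd and a scalar fallback could look at runtime. It assumes .NET 5 or later (where the Arm intrinsics namespace is available); `CompressDispatchSketch` and `SelectBackend` are hypothetical names, not anything in the squashed branch.

```csharp
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper; only the IsSupported properties are real .NET APIs.
public static class CompressDispatchSketch
{
    // Returns a label for the widest instruction set the current CPU supports.
    public static string SelectBackend()
    {
        if (Avx2.IsSupported) return "avx2";       // 256-bit x86 path
        if (Sse41.IsSupported) return "sse4.1";    // 128-bit x86 path
        if (AdvSimd.IsSupported) return "advsimd"; // 128-bit ARM NEON path
        return "scalar";                           // portable fallback
    }
}
```

The `IsSupported` checks are treated as constants by the JIT, so branches for unsupported instruction sets should be eliminated rather than evaluated on every call.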
Very cool to see this work going on here 👍 I'm following the AVX-512 work in .NET 8 closely and will be using my blake2 project for testing once more of the instructions are available in the API. Keep an eye out for updates this year if you're interested. Arm SVE in .NET is probably a ways off. They're not prioritizing it at the moment because hardware implementing it isn't widely available. |
I've been playing around with intrinsics and thought that this project would benefit from parallelization. By adding ModifiedBlake2Intrinsics to parallelize the shuffle I experienced performance increases of 40-55%.
Argon
Without intrinsics
With Intrinsics
I've only changed Argon but from looking at Blake2Fast you could probably get some performance gains for normal blake usage.
I'm new to intrinsics and haven't added any tests but this should demonstrate the potential benefits.
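As a rough illustration of where the gains come from, here is a sketch of the multiply-add step of the Argon2 round applied to four 64-bit lanes at once with AVX2, with a scalar fallback. This is not the `ModifiedBlake2Intrinsics` code from the PR and it does not cover the shuffle; `VectorBlaMkaSketch` and its methods are hypothetical names, and the step itself (x + y + 2 * lo32(x) * lo32(y)) is taken from the Argon2 reference.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch; not the PR's implementation.
public static class VectorBlaMkaSketch
{
    // Scalar form: x + y + 2 * lo32(x) * lo32(y), wrapping at 2^64.
    public static ulong FBlaMkaScalar(ulong x, ulong y) =>
        unchecked(x + y + 2 * ((x & 0xFFFFFFFFUL) * (y & 0xFFFFFFFFUL)));

    // AVX2 form: the same step on four 64-bit lanes. Call only when Avx2.IsSupported.
    public static Vector256<ulong> FBlaMkaAvx2(Vector256<ulong> x, Vector256<ulong> y)
    {
        // Multiplies the low 32 bits of each 64-bit lane into a full 64-bit product.
        Vector256<ulong> xy = Avx2.Multiply(x.AsUInt32(), y.AsUInt32());
        return Avx2.Add(Avx2.Add(x, y), Avx2.Add(xy, xy));
    }

    // Processes four word pairs, using AVX2 when available and the scalar form otherwise.
    public static ulong[] FBlaMka4(ulong[] x, ulong[] y)
    {
        var result = new ulong[4];
        if (Avx2.IsSupported)
        {
            Vector256<ulong> v = FBlaMkaAvx2(
                Vector256.Create(x[0], x[1], x[2], x[3]),
                Vector256.Create(y[0], y[1], y[2], y[3]));
            for (int i = 0; i < 4; i++) result[i] = v.GetElement(i);
        }
        else
        {
            for (int i = 0; i < 4; i++) result[i] = FBlaMkaScalar(x[i], y[i]);
        }
        return result;
    }
}
```

Building vectors from arrays and reading lanes back with `GetElement` is only for demonstration; real code would load and store directly from the block memory.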