Intrinsics experimental POC #52

TimothyMakkison · 2022-10-18T15:41:35Z

I've been playing around with intrinsics and thought that this project would benefit from parallelization. By adding ModifiedBlake2Intrinsics to parallelize the shuffle I experienced performance increases of 40-55%.

Argon

Without intrinsics

With Intrinsics

I've only changed Argon but from looking at Blake2Fast you could probably get some performance gains for normal blake usage.

I'm new to intrinsics and haven't added any tests but this should demonstrate the potential benefits.

TimothyMakkison · 2022-10-19T20:54:57Z

Blake

Beforehand - non-intrinsics version (with reduced memory usage)

Intrinsics refactor `d90f3a6`

Using code from @saucecontrol Blake2Fast

Adapted Blake2bSimd to use intrinsics in d90f3a6. It's arguably more readable than the Blake2Fast version but slightly slower; both are at least 4x faster than the current version.

Changed Blake2bNormal to use stackalloc for v and m, reducing memory usage to a flat 416 bytes.

Blake2bNormal before stackalloc

kmaragon · 2023-04-01T22:12:16Z

I'm just getting some time to look at this repo for the first time in, like, years. I appreciate this PR so much. This is exactly what I wanted to do with this library from the start, but I was a C++ developer working on .NET Core 1.0 on Linux. I'm going to re-open this with your commits squashed while trying to add ARM64/SVE support and tidying up the CI / versioning stuff before publishing. Thank you so much for this!

kmaragon · 2023-04-01T23:08:43Z

Also adding a note here that .NET 8 is beginning to introduce support for AVX-512. Using it will require an update, but it should improve things even more, assuming newer-generation chips don't suffer from the power draw issues they have historically had with the AVX-512 extensions.

TimothyMakkison · 2023-04-01T23:47:23Z

Glad you liked it! 😄

ModifiedBlake2Intrinsics was largely done at 2 AM. It passes the tests, but I have no idea if it is safe. I'm sure the performance can be improved.
I had meant to compare this to proper SIMD Argon2id implementations in Rust, C or C++. It looks like they have two versions of the diagonalize and G functions, whereas I shuffled the vectors, repeatedly using the same G functions.
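The "shuffle, then reuse the same G" idea can be illustrated in scalar form. The sketch below is not the PR's C# code and not the reference implementation; it uses the BlaMka mixing function as specified for Argon2 and hypothetical helper names (`columns`, `diagonals_indexed`, `diagonals_shuffled`, `rotate_rows`) to show that rotating rows 1-3 of the 4x4 state, running the column step, and rotating back gives the same result as a separate diagonal step.

```python
MASK64 = (1 << 64) - 1

def rotr64(x, n):
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK64

def g(a, b, c, d):
    """BlaMka mixing function: Blake2b's G with a multiply-add, no message words."""
    def mix(x, y):
        return (x + y + 2 * (x & 0xFFFFFFFF) * (y & 0xFFFFFFFF)) & MASK64
    a = mix(a, b); d = rotr64(d ^ a, 32)
    c = mix(c, d); b = rotr64(b ^ c, 24)
    a = mix(a, b); d = rotr64(d ^ a, 16)
    c = mix(c, d); b = rotr64(b ^ c, 63)
    return a, b, c, d

def columns(v):
    """Apply G to each column of the 4x4 state (row-major list of 16 words)."""
    v = list(v)
    for i in range(4):
        v[i], v[i + 4], v[i + 8], v[i + 12] = g(v[i], v[i + 4], v[i + 8], v[i + 12])
    return v

def diagonals_indexed(v):
    """Variant 1: a second code path that indexes the diagonals directly."""
    v = list(v)
    for i in range(4):
        ia, ib, ic, idx = i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4
        v[ia], v[ib], v[ic], v[idx] = g(v[ia], v[ib], v[ic], v[idx])
    return v

def rotate_rows(v, sign):
    """Rotate row r left by r positions (sign=+1) or right by r positions (sign=-1)."""
    out = list(v)
    for row in range(1, 4):
        r = v[4 * row: 4 * row + 4]
        k = (sign * row) % 4
        out[4 * row: 4 * row + 4] = r[k:] + r[:k]
    return out

def diagonals_shuffled(v):
    """Variant 2: shuffle so diagonals become columns, reuse columns(), shuffle back."""
    return rotate_rows(columns(rotate_rows(v, +1)), -1)

state = [(0x0123456789ABCDEF * (i + 1)) & MASK64 for i in range(16)]
print(diagonals_indexed(state) == diagonals_shuffled(state))
```

In a SIMD version each row would live in vector registers and `rotate_rows` would become lane shuffles, so the trade-off is extra shuffle instructions against keeping a single G code path.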

Blake2bSimd should be solid as it's from SauceControl/Blake2Fast; the only modification is the use of spans.

TimothyMakkison · 2023-04-02T15:57:31Z

Added the PHC compress function. It passes the tests but the old hacky version appears to be faster?

Hacky version

| Method | Job | EnvironmentVariables | Iterations | RamKilobytes | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 65536 | 80.01 ms | 1.597 ms | 3.226 ms | 79.12 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.58 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 65536 | 42.65 ms | 1.023 ms | 2.967 ms | 42.10 ms | 0.55 | 0.05 | 1000.0000 | 1000.0000 | 1000.0000 | 64.59 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 73728 | 90.40 ms | 1.767 ms | 2.534 ms | 89.53 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.61 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 73728 | 46.02 ms | 0.917 ms | 2.432 ms | 45.32 ms | 0.52 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 72.6 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 81920 | 101.67 ms | 2.030 ms | 3.336 ms | 100.79 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 81920 | 52.65 ms | 1.482 ms | 4.229 ms | 51.23 ms | 0.54 | 0.06 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 90112 | 108.64 ms | 1.813 ms | 1.514 ms | 108.43 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.67 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 90112 | 54.80 ms | 1.080 ms | 2.566 ms | 54.04 ms | 0.51 | 0.02 | 1000.0000 | 1000.0000 | 1000.0000 | 88.68 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 65536 | 381.93 ms | 2.475 ms | 1.933 ms | 382.09 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 65536 | 162.35 ms | 2.922 ms | 4.374 ms | 162.10 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 73728 | 438.29 ms | 6.539 ms | 6.116 ms | 437.29 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.65 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 73728 | 184.10 ms | 3.537 ms | 4.211 ms | 182.78 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 81920 | 483.11 ms | 4.978 ms | 4.657 ms | 483.11 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 81920 | 199.99 ms | 3.996 ms | 4.602 ms | 199.00 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 90112 | 532.28 ms | 4.697 ms | 4.394 ms | 532.31 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.71 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 90112 | 218.81 ms | 3.116 ms | 2.914 ms | 218.86 ms | 0.41 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 88.71 MB | 1.00 |

PHC

| Method | Job | EnvironmentVariables | Iterations | RamKilobytes | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 65536 | 78.49 ms | 1.546 ms | 2.941 ms | 77.60 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.57 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 65536 | 46.31 ms | 0.920 ms | 2.654 ms | 45.58 ms | 0.60 | 0.05 | 1000.0000 | 1000.0000 | 1000.0000 | 64.57 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 73728 | 89.21 ms | 1.768 ms | 2.478 ms | 88.82 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.61 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 73728 | 55.02 ms | 1.091 ms | 2.614 ms | 54.52 ms | 0.62 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 72.62 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 81920 | 100.68 ms | 2.007 ms | 3.515 ms | 99.52 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 81920 | 61.48 ms | 1.227 ms | 2.393 ms | 60.59 ms | 0.61 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 80.62 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 90112 | 109.50 ms | 2.186 ms | 3.530 ms | 107.99 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.67 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 90112 | 65.04 ms | 1.291 ms | 2.361 ms | 64.42 ms | 0.60 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 88.65 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 65536 | 391.36 ms | 7.696 ms | 11.037 ms | 386.65 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.6 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 65536 | 197.04 ms | 3.842 ms | 4.111 ms | 196.55 ms | 0.50 | 0.02 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 73728 | 428.48 ms | 5.384 ms | 4.773 ms | 428.75 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 73728 | 231.74 ms | 3.331 ms | 3.420 ms | 231.15 ms | 0.54 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 81920 | 476.48 ms | 5.884 ms | 5.504 ms | 475.71 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 81920 | 262.77 ms | 3.401 ms | 3.182 ms | 263.35 ms | 0.55 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 80.65 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 90112 | 522.82 ms | 3.090 ms | 2.580 ms | 523.01 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.7 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 90112 | 290.81 ms | 3.877 ms | 3.437 ms | 290.16 ms | 0.56 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 88.73 MB | 1.00 |
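Each table only reports ratios against its own SSE2-disabled baseline, so the two intrinsics variants are never compared directly. A small sketch of that comparison follows, using the Mean values of the intrinsics-enabled (`Empty`) jobs copied from the tables above. The names `hacky`, `phc` and `slowdown` are illustrative only, and the result ignores the reported error margins.

```python
# Mean times in ms for the intrinsics-enabled ("Empty") jobs.
# Keys are (Iterations, RamKilobytes), values copied from the two tables.
hacky = {
    (1, 65536): 42.65, (1, 73728): 46.02, (1, 81920): 52.65, (1, 90112): 54.80,
    (6, 65536): 162.35, (6, 73728): 184.10, (6, 81920): 199.99, (6, 90112): 218.81,
}
phc = {
    (1, 65536): 46.31, (1, 73728): 55.02, (1, 81920): 61.48, (1, 90112): 65.04,
    (6, 65536): 197.04, (6, 73728): 231.74, (6, 81920): 262.77, (6, 90112): 290.81,
}

# PHC time divided by hacky time; values above 1.0 mean PHC is slower.
slowdown = {key: phc[key] / hacky[key] for key in hacky}

for (iterations, ram_kb), ratio in sorted(slowdown.items()):
    print(f"iterations={iterations} ram={ram_kb} KB: PHC/hacky = {ratio:.2f}")
```

On these means the PHC version is slower in every configuration, by roughly 9% to 33%, with the gap widening at 6 iterations.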

kmaragon · 2023-04-02T17:00:13Z

I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall. The PHC version is just easier to read and link back to the reference implementation.

TimothyMakkison · 2023-04-02T17:17:35Z

> I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall.

The ratio seems to be lower for the older version. In my runs PHC came out at 0.55-0.60, with the older one at 0.40-0.55.
I couldn't figure out why the two appeared to be different. After giving it some thought, it could be due to a combination of missing AggressiveInlining, loop unrolling, and possible JIT funkiness with the rotr methods.
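For context on why the rotations are a plausible suspect: the Blake2b round rotates 64-bit words right by 32, 24, 16 and 63 bits. The first three are whole-byte rotations, which SIMD code can express as a byte shuffle, and the last equals a rotate left by one. The scalar sketch below only illustrates those identities; the helper names (`rotr64_by_bytes`, `rotr63`) are hypothetical and say nothing about what the JIT emits for the PR's rotr methods.

```python
MASK64 = (1 << 64) - 1

def rotr64(x, n):
    """Generic 64-bit rotate right using two shifts and an OR."""
    return ((x >> n) | (x << (64 - n))) & MASK64

def rotr64_by_bytes(x, n):
    """Rotate right by a multiple of 8 bits as a pure byte permutation.

    This is the scalar analogue of a byte shuffle with a constant mask.
    """
    assert n % 8 == 0
    data = x.to_bytes(8, "little")
    k = n // 8
    return int.from_bytes(data[k:] + data[:k], "little")

def rotr63(x):
    """Rotate right by 63, i.e. left by 1: one shift, one add, one XOR."""
    return ((x >> 63) ^ (x + x)) & MASK64

samples = [0x0123456789ABCDEF, 0x8000000000000001, MASK64, 0]
for x in samples:
    for n in (32, 24, 16):
        assert rotr64_by_bytes(x, n) == rotr64(x, n)
    assert rotr63(x) == rotr64(x, 63)
print("rotation identities hold")
```

If a rotation helper is not inlined, or falls back to the generic shift-and-OR form inside a hot loop, that could account for part of a gap like the one measured here, though only profiling or inspecting the generated assembly would confirm it.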

> The PHC version is just easier to read and link back to the reference implementation.

100% agree, no idea why I didn't look for the official version sooner.

kmaragon · 2023-04-02T19:42:08Z

Squashed branch is at https://github.com/kmaragon/Konscious.Security.Cryptography/tree/feature/intrinsics. I'm also calling this 2.0 and getting rid of the ability to explicitly integrate the tasks and just pushing it all into Parallel.ForEach with no async contracts. From the issues it seems like no one is able to use the async contracts. Or maybe it's just that they are the loudest bunch. Either way, I'll remove them entirely.

I've implemented the modifiedblake2 stuff there for ARM. I'd like to get SSE4 in there too for good measure. It'll probably be similar to the ARM NEON implementation. I was looking at saucecontrol's blake2 work and it's strictly for x86. That said, their SSE4 implementation may serve as a reasonable base for AdvSimd support. .NET 8.0 is looking to be adding support for AVX512 but I see no word on ARMv9 SVE2 yet. I expect the latter to be the biggest bump in perf for users on the hardware. Maybe the next gen Apple chips?

saucecontrol · 2023-04-02T20:14:57Z

Very cool to see this work going on here 👍

I'm following the AVX-512 work in .NET 8 closely and will be using my blake2 project for testing once more of the instructions are available in the API. Keep an eye out for updates this year if you're interested.

Arm SVE in .NET is probably a ways off. They're not prioritizing it at the moment because hardware implementing it isn't widely available.

TimothyMakkison added 13 commits October 12, 2022 17:56

Create uninished intriniscs methods

e12786c

Added intrinsics compression function (not working)

6a0d770

Fix Compress intrinsics

77b4525

Refactor Reshuffle

59e1167

Refactor DoRoundRows

dd23f8b

Edit comments and mild changes

bc790b9

Add csproj settings

ffadcef

Add explanations and rename functions.

cde0ac7

Add intrinsics to Simd

f65eab2

Add Blake2 benchmarks

08ab66c

Refactor Blake2bSimd

d90f3a6

Refactor Blake2Simd to use saucecontrol/Blake2Fast technique

8927140

Refactor Blake2bNormal to use stackalloc reducing memory usage

10bef51

TimothyMakkison added 5 commits November 18, 2022 10:35

Add span and debug assert

6026c86

Remove unsafe keyword and fixed pointers

758941b

Add guards and code cleanup

698f90e

Inline rotation masks

6463cae

Create static method

3d480c2

TimothyMakkison mentioned this pull request Nov 18, 2022

Add Blake2b intrinsics POC bcgit/bc-csharp#398

Closed

Remove unsafe code and up requirements to .Net 6

17632ad

kmaragon pushed a commit that referenced this pull request Apr 2, 2023

(#52) Intrinsics experimental POC

2939de0

edit: added phc winner simd

7b1b698
