Use Primops-based atomic counter and distribution implementation #42

TravisWhitaker · 2020-07-19T19:54:26Z

After finding #41 I was curious whether or not a primops-based implementation of cbits could achieve competitive performance. Turns out that it can, with some caveats.

I took four measurements to compare the implementation presented here with the current tip of master:

counter benchmark with -N1
counter benchmark with -N
distribution benchmark with -N1
distribution benchmark with -N

The results were recorded on my 2018 i9 MacBook Pro, which has 12 logical cores (6 physical cores, two hyperthreads each), so -N means -N12 on this machine. I added -O2 to ghc-options for the current tip of master and the code here. These are the results for the current tip of master:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 80.14 ms   (78.19 ms .. 82.57 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 79.68 ms   (78.43 ms .. 80.69 ms)
std dev              1.949 ms   (1.070 ms .. 3.085 ms)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 206.8 ms   (187.4 ms .. 235.2 ms)
                     0.992 R²   (0.988 R² .. 1.000 R²)
mean                 197.8 ms   (191.7 ms .. 204.6 ms)
std dev              9.020 ms   (5.511 ms .. 12.86 ms)
variance introduced by outliers: 14% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 117.3 ms   (113.1 ms .. 123.7 ms)
                     0.997 R²   (0.993 R² .. 1.000 R²)
mean                 115.9 ms   (113.9 ms .. 118.1 ms)
std dev              3.398 ms   (2.475 ms .. 4.696 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 242.1 ms   (233.4 ms .. 252.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 233.4 ms   (229.9 ms .. 238.0 ms)
std dev              4.882 ms   (2.161 ms .. 7.047 ms)
variance introduced by outliers: 16% (moderately inflated)

And here are the results for the code presented here:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 64.14 ms   (61.11 ms .. 66.64 ms)
                     0.995 R²   (0.985 R² .. 0.999 R²)
mean                 64.64 ms   (62.84 ms .. 66.19 ms)
std dev              3.086 ms   (2.273 ms .. 4.978 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 187.6 ms   (156.1 ms .. 218.5 ms)
                     0.989 R²   (0.982 R² .. 1.000 R²)
mean                 193.1 ms   (182.9 ms .. 202.1 ms)
std dev              13.66 ms   (8.776 ms .. 21.37 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 346.9 ms   (340.3 ms .. 351.4 ms)
                     1.000 R²   (1.000 R² .. NaN R²)
mean                 339.8 ms   (336.3 ms .. 342.5 ms)
std dev              3.661 ms   (1.435 ms .. 4.770 ms)
variance introduced by outliers: 19% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 251.8 ms   (241.7 ms .. 266.4 ms)
                     0.997 R²   (0.987 R² .. 1.000 R²)
mean                 250.8 ms   (245.9 ms .. 256.0 ms)
std dev              6.666 ms   (5.042 ms .. 8.009 ms)
variance introduced by outliers: 16% (moderately inflated)

The new implementation yields faster atomic counter performance in the single-capability case (about 64 ms versus 80 ms) and slightly faster performance in the capability-per-core case (about 188 ms versus 207 ms, though the confidence intervals overlap). This could be because the counter is no longer in pinned memory in the new implementation, or because of the absence of withForeignPtr, which isn't free.
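For readers unfamiliar with the approach, here is a minimal sketch of what a primops-based counter can look like. It is not the code in this PR: the `Counter` type, the function names, and the single-word layout are assumptions made for illustration. Only the primops themselves (`newByteArray#`, `fetchAddIntArray#`, `atomicReadIntArray#`) come from GHC.Exts.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM, replicateM_)
import Foreign.Storable (sizeOf)
import GHC.Exts
  ( Int (I#), MutableByteArray#, RealWorld
  , atomicReadIntArray#, fetchAddIntArray#, newByteArray#, writeIntArray# )
import GHC.IO (IO (IO))

-- | Hypothetical counter: one machine word in an unpinned byte array.
data Counter = Counter (MutableByteArray# RealWorld)

-- | Allocate one word and initialise it to zero.
newCounter :: IO Counter
newCounter = IO $ \s0 ->
  case sizeOf (0 :: Int) of
    I# bytes -> case newByteArray# bytes s0 of
      (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
        s2 -> (# s2, Counter mba #)

-- | Atomically add to the counter; the previous value is discarded.
addCounter :: Counter -> Int -> IO ()
addCounter (Counter mba) (I# n) = IO $ \s0 ->
  case fetchAddIntArray# mba 0# n s0 of
    (# s1, _old #) -> (# s1, () #)

-- | Atomically read the current value.
readCounter :: Counter -> IO Int
readCounter (Counter mba) = IO $ \s0 ->
  case atomicReadIntArray# mba 0# s0 of
    (# s1, n #) -> (# s1, I# n #)

-- | Four threads each add 1 a thousand times; returns the final count.
demoCounter :: IO Int
demoCounter = do
  c <- newCounter
  dones <- replicateM 4 $ do
    done <- newEmptyMVar
    _ <- forkIO (replicateM_ 1000 (addCounter c 1) >> putMVar done ())
    pure done
  mapM_ takeMVar dones
  readCounter c
```

Loading this in GHCi and running `demoCounter` should yield 4000. No `ForeignPtr` is involved, so there is no `withForeignPtr` on the hot path.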

The distribution results are quite interesting. In the single-capability case, this implementation is about three times slower than the existing implementation, but the gap between them almost entirely disappears in the capability-per-core case. I'm not quite sure what's going on here. I thought that perhaps the fact that the unsafe FFI call can't be interrupted by GC could have something to do with it, but using -S shows that only a single major GC at the end of benchmark execution takes place in both the single-capability case and the capability-per-core case. I suspect the answer is somewhere in the core emitted for spinLock, but I haven't dug into it yet.

The most important difference between the two implementations is correctness. Because of #41, ekg sometimes reports stale or gibberish metrics on newer aarch64 chips, while this implementation produces correct results (at least the results look right to me, but a second pair of eyes on the distribution update code would be much appreciated).

I'm not too happy about having to unpack I# from a CAF to get the field array offsets in these functions, but since hsc2hs seems to always wrap parens around the results of its directives, I couldn't figure out how to generate an unboxed literal from them. One solution might be to export TH splices from a separate hsc file, then splice in the unboxed literals in the System.Metrics.Distribution module. That would also allow us to skip hsc2hs on that module entirely, which would roughly halve the number of # required.
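To make the complaint concrete, here is a small sketch of the pattern being described: a boxed `Int` constant has to be destructured with `I#` at every use site before it can be passed to a primop. The name `sumIndex`, its value, and the two-word layout are hypothetical; in the real module the constant would come from an hsc2hs directive.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import Foreign.Storable (sizeOf)
import GHC.Exts
  ( Int (I#), MutableByteArray#, RealWorld
  , newByteArray#, readIntArray#, writeIntArray# )
import GHC.IO (IO (IO))

-- | Hypothetical field index, in words. This boxed top-level constant
-- stands in for the parenthesised result of an hsc2hs directive.
sumIndex :: Int
sumIndex = 1

-- | Hypothetical two-word record stored in a byte array.
data Fields = Fields (MutableByteArray# RealWorld)

newFields :: IO Fields
newFields = IO $ \s0 ->
  case 2 * sizeOf (0 :: Int) of
    I# bytes -> case newByteArray# bytes s0 of
      (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
        s2 -> case writeIntArray# mba 1# 0# s2 of
          s3 -> (# s3, Fields mba #)

-- | The boxed index is unpacked with I# before the primop can use it.
writeSum :: Fields -> Int -> IO ()
writeSum (Fields mba) (I# v) = IO $ \s0 ->
  case sumIndex of
    I# i -> case writeIntArray# mba i v s0 of
      s1 -> (# s1, () #)

-- | The same unpacking is repeated at every other use site.
readSum :: Fields -> IO Int
readSum (Fields mba) = IO $ \s0 ->
  case sumIndex of
    I# i -> case readIntArray# mba i s0 of
      (# s1, v #) -> (# s1, I# v #)

-- | Write 42 to the field and read it back.
demoFields :: IO Int
demoFields = do
  f <- newFields
  writeSum f 42
  readSum f
```

An unboxed literal (for example `1#`) would remove the `case sumIndex of I# i -> ...` step, which is what the Template Haskell idea above aims for.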

~~Another minor difference is the change from Int64 to Int. This is necessary because GHC doesn't make available any 64 bit wide atomic primops on 32 bit chips.~~

TravisWhitaker · 2020-07-29T03:26:56Z

Turns out I mistakenly used an atomic write to release the distribution spinlock, when a weak write is sufficient (and much faster). Now the primops implementation is comparable to the existing implementation's performance (even a bit faster in the multicapability case).

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 120.9 ms   (117.6 ms .. 125.8 ms)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 117.8 ms   (116.1 ms .. 119.9 ms)
std dev              2.907 ms   (2.099 ms .. 4.170 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 210.1 ms   (199.2 ms .. 219.5 ms)
                     0.998 R²   (0.992 R² .. 1.000 R²)
mean                 225.5 ms   (219.2 ms .. 236.0 ms)
std dev              10.80 ms   (5.505 ms .. 15.14 ms)
variance introduced by outliers: 14% (moderately inflated)
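The lock discipline discussed in this comment can be sketched roughly as follows. This is an illustration, not the PR's code: the `Dist` type, the two-word layout (word 0 is the lock, word 1 is a running sum), and all function names are assumptions. The lock is acquired with `casIntArray#` and released with a plain `writeIntArray#`, as the comment describes. Whether a plain write is an adequate release on a given architecture is the claim made above, not something this sketch establishes. The sketch also does no exception masking.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import Control.Concurrent (forkIO, yield)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM, replicateM_)
import Foreign.Storable (sizeOf)
import GHC.Exts
  ( Int (I#), MutableByteArray#, RealWorld
  , casIntArray#, isTrue#, newByteArray#, readIntArray#, writeIntArray#
  , (==#), (+#) )
import GHC.IO (IO (IO))

-- | Hypothetical layout: word 0 is the lock (0 = free, 1 = held),
-- word 1 is a running sum protected by that lock.
data Dist = Dist (MutableByteArray# RealWorld)

newDist :: IO Dist
newDist = IO $ \s0 ->
  case 2 * sizeOf (0 :: Int) of
    I# bytes -> case newByteArray# bytes s0 of
      (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
        s2 -> case writeIntArray# mba 1# 0# s2 of
          s3 -> (# s3, Dist mba #)

-- | One compare-and-swap attempt; True if the lock was free and is now ours.
tryLock :: Dist -> IO Bool
tryLock (Dist mba) = IO $ \s0 ->
  case casIntArray# mba 0# 0# 1# s0 of
    (# s1, old #) -> (# s1, isTrue# (old ==# 0#) #)

-- | Spin until the lock is acquired, yielding on each failed attempt.
spinLock :: Dist -> IO ()
spinLock d = do
  ok <- tryLock d
  if ok then pure () else yield >> spinLock d

-- | Release the lock with a plain (non-atomic) write.
spinUnlock :: Dist -> IO ()
spinUnlock (Dist mba) = IO $ \s0 ->
  case writeIntArray# mba 0# 0# s0 of
    s1 -> (# s1, () #)

-- | Add a sample to the sum while holding the lock.
addSample :: Dist -> Int -> IO ()
addSample d@(Dist mba) (I# x) = do
  spinLock d
  IO $ \s0 -> case readIntArray# mba 1# s0 of
    (# s1, old #) -> case writeIntArray# mba 1# (old +# x) s1 of
      s2 -> (# s2, () #)
  spinUnlock d

-- | Read the sum while holding the lock.
readSum :: Dist -> IO Int
readSum d@(Dist mba) = do
  spinLock d
  n <- IO $ \s0 -> case readIntArray# mba 1# s0 of
    (# s1, v #) -> (# s1, I# v #)
  spinUnlock d
  pure n

-- | Four threads each add 1 a thousand times; returns the final sum.
demoDist :: IO Int
demoDist = do
  d <- newDist
  dones <- replicateM 4 $ do
    done <- newEmptyMVar
    _ <- forkIO (replicateM_ 1000 (addSample d 1) >> putMVar done ())
    pure done
  mapM_ takeMVar dones
  readSum d
```

The `yield` in the failure path matters on a single capability: without it, a spinning thread might never give the lock holder a chance to run.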

TravisWhitaker · 2020-10-05T23:09:28Z

One way or another, #41 needs to be fixed for EKG to work well on newer ARM boards. If there's no interest in this patch, we should find another way to achieve memory safety on weakly ordered machines.

This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by replacing the C code for the distribution metric with GHC prim ops. The performance of this implementation is about half that of the existing C code in a single-threaded benchmark; without masking the performance is comparable. This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of tibbe/ekg-core.

TravisWhitaker · 2021-08-23T19:45:21Z

@23Skidoo What more can I do to help get this package fixed on aarch64?

AndreasPK

The Int overflowing on 32-bit might be a concern, and I wonder if read should be locking. But it looks reasonable to me on the whole.

System/Metrics/Distribution.hsc

L0neGamer

I'm honestly very impressed; this is a pretty faithful, more Haskell-native implementation of what was previously here.

I'm worried primarily about two things: implicit casting from Int64, and the incredibly unreadable state passing style. Happy to discuss both, but I would prefer to see some cleanup in those areas.

Let me know if that wards you off taking this PR the distance and I'll see if I can develop it further.

System/Metrics.hs

benchmarks/Counter.hs

System/Metrics/Distribution.hsc

TravisWhitaker · 2025-06-17T01:32:22Z

> implicit casting from Int64

Are you talking about moving most things from Int64 to Int? Or something else? In the former case, GHC forces our hand: we only have Int versions of the required atomic operations (as 32-bit machines won't be able to atomically operate on Int64s in general). This trade makes sense for two reasons in my opinion:

The current implementation of this library is totally broken on aarch64, silently returns bogus results, and there isn't even a warning about this written in its documentation anywhere.
32-bit machines are a vanishing fraction of the wider GHC user base, let alone the users of this particular library.

> incredibly unreadable state passing style

This is what stateful primop-based code looks like; at least, I don't know of a nicer way to do these things. Perhaps someone would be interested in rewriting these in Core, or STG, or Cmm or something like that?

Data/Atomic.hs

TravisWhitaker · 2025-06-19T01:53:28Z

Ok, the current state of this PR is the best compromise I've thought of so far:

On 64-bit machines, we use the nice primops-based concurrent operations to operate on Int64s everywhere. There's a bunch of new CPP to make this work with both newer GHCs that have Int64# and older ones without it (I had completely forgotten that Int64# didn't always exist).
On 32-bit machines we fall back to a boxed IORef. The good news is that this doesn't perform as poorly as I had feared on what's probably the only relevant 32-bit target going forward: WASM. A major caveat here is that WASM is actually the only 32-bit target I've tested this with.
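The shape of this compromise can be sketched with CPP as below. This is a simplified illustration under stated assumptions, not the PR's code: the `Counter64` type and function names are hypothetical, and the 64-bit branch sidesteps the Int64# question entirely by using the Int primops and converting with `fromIntegral`, which is only sound because Int is 64 bits wide in that branch. `WORD_SIZE_IN_BITS` comes from GHC's `MachDeps.h`.

```haskell
{-# LANGUAGE CPP, MagicHash, UnboxedTuples #-}

#include "MachDeps.h"

import Data.Int (Int64)

#if WORD_SIZE_IN_BITS >= 64
import Foreign.Storable (sizeOf)
import GHC.Exts
  ( Int (I#), MutableByteArray#, RealWorld
  , atomicReadIntArray#, fetchAddIntArray#, newByteArray#, writeIntArray# )
import GHC.IO (IO (IO))

-- 64-bit path: one word in a byte array, updated with atomic primops.
data Counter64 = Counter64 (MutableByteArray# RealWorld)

newCounter64 :: IO Counter64
newCounter64 = IO $ \s0 ->
  case sizeOf (0 :: Int) of
    I# bytes -> case newByteArray# bytes s0 of
      (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
        s2 -> (# s2, Counter64 mba #)

add64 :: Counter64 -> Int64 -> IO ()
add64 (Counter64 mba) n = IO $ \s0 ->
  case fromIntegral n of
    I# k -> case fetchAddIntArray# mba 0# k s0 of
      (# s1, _old #) -> (# s1, () #)

read64 :: Counter64 -> IO Int64
read64 (Counter64 mba) = IO $ \s0 ->
  case atomicReadIntArray# mba 0# s0 of
    (# s1, v #) -> (# s1, fromIntegral (I# v) #)
#else
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

-- 32-bit path: fall back to a boxed IORef holding an Int64.
newtype Counter64 = Counter64 (IORef Int64)

newCounter64 :: IO Counter64
newCounter64 = Counter64 <$> newIORef 0

add64 :: Counter64 -> Int64 -> IO ()
add64 (Counter64 r) n = atomicModifyIORef' r (\old -> (old + n, ()))

read64 :: Counter64 -> IO Int64
read64 (Counter64 r) = readIORef r
#endif

-- | Same client code on either path: add 3 and 4, then read the total.
demoCounter64 :: IO Int64
demoCounter64 = do
  c <- newCounter64
  add64 c 3
  add64 c 4
  read64 c
```

Both branches expose the same three functions, so callers do not need their own CPP.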

TravisWhitaker · 2025-06-19T02:05:20Z

Also, it'd be great to have as many eyes and as much testing on the distribution implementation as possible. I've been using it for years now, but that doesn't mean it's bug-free.

System/Metrics/Distribution.hsc

Bodigrim

LGTM!

L0neGamer

Looks good to me!
I couldn't get an IO alternative to the primop-cases working, so this is where we are.
I'll merge this soon and look into updating the Hackage release over the next week. I'll bump the version from 0.1.1.8 to 0.1.2.0 to signify that there has been a significant backend change, though there shouldn't be any visible changes that aren't improvements.

TravisWhitaker · 2025-06-29T16:46:56Z

Thanks gents, nice to finally put this to rest.

L0neGamer · 2025-06-29T20:30:34Z

Finally published: https://hackage.haskell.org/package/ekg-core-0.1.2.0

Bump base constraint.

0433c5b

TravisWhitaker force-pushed the memory-safe branch from 8288724 to 4273d93 Compare July 19, 2020 19:56

TravisWhitaker changed the title ~~Sketch primops-based atomic counter and distribution implementation~~ WIP: Sketch primops-based atomic counter and distribution implementation Jul 19, 2020

Prototype atomic-memory-safe implementation.

420f559

TravisWhitaker force-pushed the memory-safe branch from 4273d93 to 420f559 Compare July 29, 2020 03:20

TravisWhitaker mentioned this pull request Mar 24, 2021

Atomic C functions are not atomic. #41

Open

AndreasPK reviewed Jun 19, 2023

System/Metrics/Distribution.hsc

TravisWhitaker added 12 commits December 24, 2023 13:44

Dependency version bumps.

2ac505c

Merge branch 'bumps' of github.com:TravisWhitaker/ekg-core into bumps

16954da

Merge branch 'bumps' of github.com:TravisWhitaker/ekg-core into memory-safe

a52ddad

Update .gitignore

74f2239

Fix CPP

12ab6a7

Merge branch 'master' of github.com:haskell-github-trust/ekg-core into memory-safe

a4b7ff9

Clean up a bit.

53d0986

Fix ghc-prim constraint

d76fd8f

Fix with 8.0.x

db37322

fix it harder

56ca464

fix it harder

0fc6787

fix it harder

316a206

TravisWhitaker changed the title ~~WIP: Sketch primops-based atomic counter and distribution implementation~~ Use Primops-based atomic counter and distribution implementation Jun 8, 2025

L0neGamer reviewed Jun 9, 2025

Bodigrim reviewed Jun 17, 2025

Data/Atomic.hs

TravisWhitaker changed the title ~~Use Primops-based atomic counter and distribution implementation~~ WIP: Use Primops-based atomic counter and distribution implementation Jun 18, 2025

Fast way on 64-bit, slow way on 32-bit.

0653030

TravisWhitaker added 2 commits June 18, 2025 17:19

Remove some unnecessary changes, preserve some of the old explanatory comments.

5d6ad40

Make it work on wasm32-wasi and older GHCs

e5439d2

TravisWhitaker force-pushed the memory-safe branch 6 times, most recently from cd62bd2 to 5097e63 Compare June 19, 2025 01:28

Int64# was added in GHC 9.4.x

a12f5bd

TravisWhitaker force-pushed the memory-safe branch from 5097e63 to a12f5bd Compare June 19, 2025 01:31

TravisWhitaker added 2 commits June 18, 2025 18:37

Make it build on WASM again.

3d3fc98

Clean up once more

90391c4

TravisWhitaker changed the title ~~WIP: Use Primops-based atomic counter and distribution implementation~~ Use Primops-based atomic counter and distribution implementation Jun 19, 2025

TravisWhitaker requested review from Bodigrim and L0neGamer June 19, 2025 01:54

Bodigrim reviewed Jun 19, 2025

System/Metrics/Distribution.hsc (three outdated review threads)

TravisWhitaker added 2 commits June 19, 2025 13:44

Don't have to destructure boxed constants everywhere.

2921d8c

yield in spinLock unhappy path

a89e220

Bodigrim approved these changes Jun 22, 2025

L0neGamer approved these changes Jun 23, 2025

L0neGamer merged commit 70bb7ae into haskell-github-trust:master Jun 24, 2025
14 checks passed
