Skip to content

Conversation

@TravisWhitaker
Copy link
Contributor

@TravisWhitaker TravisWhitaker commented Jul 19, 2020

After finding #41 I was curious whether or not a primops-based implementation of cbits could achieve competitive performance. Turns out that it can, with some caveats.

I took four measurements to compare the implementation presented here with the current tip of master:

  • counter benchmark with -N1
  • counter benchmark with -N
  • distribution benchmark with -N1
  • distribution benchmark with -N

The results were recorded on my 2018 i9 MacBook Pro, which has 12 logical cores (6 physical cores, two hyperthreads each), so -N means -N12 on this machine. I added -O2 to ghc-options for the current tip of master and the code here. These are the results for the current tip of master:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 80.14 ms   (78.19 ms .. 82.57 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 79.68 ms   (78.43 ms .. 80.69 ms)
std dev              1.949 ms   (1.070 ms .. 3.085 ms)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 206.8 ms   (187.4 ms .. 235.2 ms)
                     0.992 R²   (0.988 R² .. 1.000 R²)
mean                 197.8 ms   (191.7 ms .. 204.6 ms)
std dev              9.020 ms   (5.511 ms .. 12.86 ms)
variance introduced by outliers: 14% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 117.3 ms   (113.1 ms .. 123.7 ms)
                     0.997 R²   (0.993 R² .. 1.000 R²)
mean                 115.9 ms   (113.9 ms .. 118.1 ms)
std dev              3.398 ms   (2.475 ms .. 4.696 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 242.1 ms   (233.4 ms .. 252.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 233.4 ms   (229.9 ms .. 238.0 ms)
std dev              4.882 ms   (2.161 ms .. 7.047 ms)
variance introduced by outliers: 16% (moderately inflated)

And here are the results for the code presented here:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 64.14 ms   (61.11 ms .. 66.64 ms)
                     0.995 R²   (0.985 R² .. 0.999 R²)
mean                 64.64 ms   (62.84 ms .. 66.19 ms)
std dev              3.086 ms   (2.273 ms .. 4.978 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 187.6 ms   (156.1 ms .. 218.5 ms)
                     0.989 R²   (0.982 R² .. 1.000 R²)
mean                 193.1 ms   (182.9 ms .. 202.1 ms)
std dev              13.66 ms   (8.776 ms .. 21.37 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 346.9 ms   (340.3 ms .. 351.4 ms)
                     1.000 R²   (1.000 R² .. NaN R²)
mean                 339.8 ms   (336.3 ms .. 342.5 ms)
std dev              3.661 ms   (1.435 ms .. 4.770 ms)
variance introduced by outliers: 19% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 251.8 ms   (241.7 ms .. 266.4 ms)
                     0.997 R²   (0.987 R² .. 1.000 R²)
mean                 250.8 ms   (245.9 ms .. 256.0 ms)
std dev              6.666 ms   (5.042 ms .. 8.009 ms)
variance introduced by outliers: 16% (moderately inflated)

The new implementation yields slightly faster atomic counter performance in both the single-capability and capability-per-core cases. This could be due to the fact that in the new implementation the counter is not in pinned memory, or the lack of withForeignPtr, which isn't free.

The distribution results are quite interesting. In the single-capability case, this implementation is about three times slower than the existing implementation, but the gap between them almost entirely disappears in the capability-per-core case. I'm not quite sure what's going on here. I thought that perhaps the fact that the unsafe FFI call can't be interrupted by GC could have something to do with it, but using -S shows that only a single major GC at the end of benchmark execution takes place in both the single-capability case and the capability-per-core case. I suspect the answer is somewhere in the core emitted for spinLock, but I haven't dug into it yet.

The most important difference between the two implementations is correctness. Because of #41, ekg sometimes reports stale or gibberish metrics on newer aarch64 chips, while this implementation produces correct results (at least the results look right to me, but a second pair of eyes on the distribution update code would be much appreciated).

I'm not too happy about having to unpack I# from a CAF to get the field array offsets in these functions, but since hsc2hs seems to always wrap parens around the results of its directives, I couldn't figure out how to generate an unboxed literal from them. One solution might be to export TH splices from a separate hsc file, then splice in the unboxed literals in the System.Metrics.Distribution module. That would also allow us to skip hsc2hs on that module entirely, which would roughly halve the number of # required.

Another minor difference is the change from Int64 to Int. This is necessary because GHC doesn't make available any 64 bit wide atomic primops on 32 bit chips.

@TravisWhitaker TravisWhitaker changed the title Sketch primops-based atomic counter and distribution implementation WIP: Sketch primops-based atomic counter and distribution implementation Jul 19, 2020
@TravisWhitaker
Copy link
Contributor Author

Turns out I mistakenly used an atomic write to release the distribution spinlock, when a weak write is sufficient (and much faster). Now the primops implementation is comparable to the existing implementation's performance (even a bit faster in the multicapability case).

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 120.9 ms   (117.6 ms .. 125.8 ms)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 117.8 ms   (116.1 ms .. 119.9 ms)
std dev              2.907 ms   (2.099 ms .. 4.170 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 210.1 ms   (199.2 ms .. 219.5 ms)
                     0.998 R²   (0.992 R² .. 1.000 R²)
mean                 225.5 ms   (219.2 ms .. 236.0 ms)
std dev              10.80 ms   (5.505 ms .. 15.14 ms)
variance introduced by outliers: 14% (moderately inflated)

@TravisWhitaker
Copy link
Contributor Author

One way or another, #41 needs to be fixed for EKG to work well on newer ARM boards. If there's no interest in this patch, we should find another way to achieve memory safety on weakly ordered machines.

awjchen added a commit to hasura/ekg-core that referenced this pull request May 29, 2021
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by
replacing the C code for the distribution metric with GHC prim ops.

The performance of this implementation is about half that of the
existing C code in a single-threaded benchmark; without masking the
performance is comparable.

This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of
tibbe/ekg-core.
awjchen added a commit to hasura/ekg-core that referenced this pull request Jun 23, 2021
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by
replacing the C code for the distribution metric with GHC prim ops.

The performance of this implementation is about half that of the
existing C code in a single-threaded benchmark; without masking the
performance is comparable.

This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of
tibbe/ekg-core.
@TravisWhitaker
Copy link
Contributor Author

@23Skidoo What more can I do to help get this package fixed on aarch64?

Copy link

@AndreasPK AndreasPK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Int overflowing on 32bit might be a concern and I wonder if read should be locking. But looks reasonable to me on the whole.

@TravisWhitaker TravisWhitaker changed the title WIP: Sketch primops-based atomic counter and distribution implementation Use Primops-based atomic counter and distribution implementation Jun 8, 2025
Copy link
Collaborator

@L0neGamer L0neGamer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm honestly very impressed, this is a pretty faithful, more haskell implementation of what was previously here.

I'm worried primarily about two things: implicit casting from Int64, and the incredibly unreadable state passing style. Happy to discuss both, but I would prefer to see some cleanup in those areas.

Let me know if that wards you off taking this PR the distance and I'll see if I can develop it further.

@TravisWhitaker
Copy link
Contributor Author

implicit casting from Int64

Are you talking about moving most things from Int64 to Int? Or something else? In the former case, GHC forces our hand: we only have Int versions of the required atomic operations (as 32-bit machines won't be able to atomically operate on Int64s in general). This trade makes sense for two reasons in my opinion:

  • The current implementation of this library is totally broken on aarch64, silently returns bogus results, and there isn't even a warning about this written in its documentation anywhere.
  • 32-bit machines are a vanishing fraction of the wider GHC user base, yet alone this particular library.

incredibly unreadable state passing style

This is what stateful primop-based code looks like, at least I don't know of a nice way to do these things. Perhaps someone would be interested in rewriting these in Core, or STG, or Cmm or something like that?

@TravisWhitaker TravisWhitaker changed the title Use Primops-based atomic counter and distribution implementation WIP: Use Primops-based atomic counter and distribution implementation Jun 18, 2025
@TravisWhitaker TravisWhitaker force-pushed the memory-safe branch 6 times, most recently from cd62bd2 to 5097e63 Compare June 19, 2025 01:28
@TravisWhitaker TravisWhitaker changed the title WIP: Use Primops-based atomic counter and distribution implementation Use Primops-based atomic counter and distribution implementation Jun 19, 2025
@TravisWhitaker
Copy link
Contributor Author

Ok, the current state of this PR is the best compromise I've thought of so far:

  • On 64-bit machines, we use the nice primops-based concurrent operations to operate on Int64s everywhere. There's a bunch of new CPP to make this work with both newer GHC's that have Int64# and older ones without it (I had completely forgotten that Int64# didn't always exist).
  • On 32-bit machines we fall back to a boxed IORef. The good news is that this doesn't perform as poorly as I had feared on what's probably the only relevant 32-bit target going forward: WASM. A major caveat here is that WASM is actually the only 32-bit target I've tested this with.

@TravisWhitaker
Copy link
Contributor Author

Also, it'd be great to have as many eyes and as much testing on the distribution implementation as possible. I've been using it for years now, but that doesn't mean it's bug-free.

Copy link

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@L0neGamer L0neGamer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
I couldn't get an IO alternative to the primop-cases working, so this is where we are.
I'll merge this soon, and look into updating the hackage over the next week. I'll bump the version to 0.1.2.0 from 0.1.1.8, to signify that there has been a significant backend change but it shouldn't be any visible changes that aren't an improvement.

@L0neGamer L0neGamer merged commit 70bb7ae into haskell-github-trust:master Jun 24, 2025
14 checks passed
@TravisWhitaker
Copy link
Contributor Author

Thanks gents, nice to finally put this to rest.

@L0neGamer
Copy link
Collaborator

Finally published: https://hackage.haskell.org/package/ekg-core-0.1.2.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants