-
Notifications
You must be signed in to change notification settings - Fork 39
Use Primops-based atomic counter and distribution implementation #42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Primops-based atomic counter and distribution implementation #42
Conversation
8288724 to
4273d93
Compare
4273d93 to
420f559
Compare
|
Turns out I mistakenly used an atomic write to release the distribution spinlock, when a weak write is sufficient (and much faster). Now the primops implementation is comparable to the existing implementation's performance (even a bit faster in the multicapability case). |
|
One way or another, #41 needs to be fixed for EKG to work well on newer ARM boards. If there's no interest in this patch, we should find another way to achieve memory safety on weakly ordered machines. |
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by replacing the C code for the distribution metric with GHC prim ops. The performance of this implementation is about half that of the existing C code in a single-threaded benchmark; without masking the performance is comparable. This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of tibbe/ekg-core.
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by replacing the C code for the distribution metric with GHC prim ops. The performance of this implementation is about half that of the existing C code in a single-threaded benchmark; without masking the performance is comparable. This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of tibbe/ekg-core.
|
@23Skidoo What more can I do to help get this package fixed on aarch64? |
AndreasPK
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Int overflowing on 32bit might be a concern and I wonder if read should be locking. But looks reasonable to me on the whole.
L0neGamer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly very impressed, this is a pretty faithful, more haskell implementation of what was previously here.
I'm worried primarily about two things: implicit casting from Int64, and the incredibly unreadable state passing style. Happy to discuss both, but I would prefer to see some cleanup in those areas.
Let me know if that wards you off taking this PR the distance and I'll see if I can develop it further.
Are you talking about moving most things from
This is what stateful primop-based code looks like, at least I don't know of a nice way to do these things. Perhaps someone would be interested in rewriting these in Core, or STG, or Cmm or something like that? |
cd62bd2 to
5097e63
Compare
5097e63 to
a12f5bd
Compare
|
Ok, the current state of this PR is the best compromise I've thought of so far:
|
|
Also, it'd be great to have as many eyes and as much testing on the distribution implementation as possible. I've been using it for years now, but that doesn't mean it's bug-free. |
Bodigrim
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
L0neGamer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
I couldn't get an IO alternative to the primop-cases working, so this is where we are.
I'll merge this soon, and look into updating the hackage over the next week. I'll bump the version to 0.1.2.0 from 0.1.1.8, to signify that there has been a significant backend change but it shouldn't be any visible changes that aren't an improvement.
|
Thanks gents, nice to finally put this to rest. |
|
Finally published: https://hackage.haskell.org/package/ekg-core-0.1.2.0 |
After finding #41 I was curious whether or not a primops-based implementation of cbits could achieve competitive performance. Turns out that it can, with some caveats.
I took four measurements to compare the implementation presented here with the current tip of master:
counterbenchmark with-N1counterbenchmark with-Ndistributionbenchmark with-N1distributionbenchmark with-NThe results were recorded on my 2018 i9 MacBook Pro, which has 12 logical cores (6 physical cores, two hyperthreads each), so
-Nmeans-N12on this machine. I added-O2toghc-optionsfor the current tip of master and the code here. These are the results for the current tip of master:And here are the results for the code presented here:
The new implementation yields slightly faster atomic counter performance in both the single-capability and capability-per-core cases. This could be due to the fact that in the new implementation the counter is not in pinned memory, or the lack of
withForeignPtr, which isn't free.The distribution results are quite interesting. In the single-capability case, this implementation is about three times slower than the existing implementation, but the gap between them almost entirely disappears in the capability-per-core case. I'm not quite sure what's going on here. I thought that perhaps the fact that the unsafe FFI call can't be interrupted by GC could have something to do with it, but using-Sshows that only a single major GC at the end of benchmark execution takes place in both the single-capability case and the capability-per-core case. I suspect the answer is somewhere in the core emitted forspinLock, but I haven't dug into it yet.The most important difference between the two implementations is correctness. Because of #41, ekg sometimes reports stale or gibberish metrics on newer aarch64 chips, while this implementation produces correct results (at least the results look right to me, but a second pair of eyes on the distribution update code would be much appreciated).
I'm not too happy about having to unpack
I#from a CAF to get the field array offsets in these functions, but sincehsc2hsseems to always wrap parens around the results of its directives, I couldn't figure out how to generate an unboxed literal from them. One solution might be to export TH splices from a separate hsc file, then splice in the unboxed literals in the System.Metrics.Distribution module. That would also allow us to skip hsc2hs on that module entirely, which would roughly halve the number of#required.Another minor difference is the change fromInt64toInt. This is necessary because GHC doesn't make available any 64 bit wide atomic primops on 32 bit chips.