alts: receive low watermark support #8513
Conversation
It's generally useful to be on a newer, improved Go release. One specifically useful feature is `b.Loop()`, which makes benchmarking easier.
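For illustration, a minimal sketch (not code from this PR) of what a benchmark body looks like with `b.Loop()`; with it, the testing framework controls iteration counting and timer management instead of the manual `for i := 0; i < b.N; i++` loop:

```go
package sketch

import "testing"

// Illustrative only: b.Loop (Go 1.24+) replaces the classic b.N loop.
// The payload size and work done per iteration are arbitrary stand-ins.
func BenchmarkLargeMessageSketch(b *testing.B) {
	msg := make([]byte, 1<<20) // hypothetical 1 MiB payload
	b.SetBytes(int64(len(msg)))
	for b.Loop() {
		// Hypothetical work; stands in for an encrypt/write round trip.
		for i := range msg {
			msg[i]++
		}
	}
}
```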
It's only called in one place, and is effectively a method on conn. Part of grpc#8510.
Increases large write speed by 9.62% per BenchmarkLargeMessage; detailed benchmark numbers below. Rather than use different sizes for the maximum read record, write record, and write buffer, just use 1MB for all of them. Using larger records reduces the amount of payload splitting and the number of syscalls made by ALTS. Part of grpc#8510.

SO_RCVLOWAT and TCP receive zerocopy are only effective with larger payloads, so ALTS can't limit payload sizes to 4 KiB. SO_RCVLOWAT and zerocopy are on the receive side, but for benchmarking purposes we need ALTS to send large messages.

Benchmarks:

```
$ benchstat large_msg_old.txt large_msg.txt
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/credentials/alts/internal/conn
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
                 │ large_msg_old.txt │           large_msg.txt           │
                 │      sec/op       │   sec/op    vs base               │
LargeMessage-12          68.88m ± 1%   62.25m ± 0%  -9.62% (p=0.002 n=6)
```
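A minimal sketch of the shape of this change; the constant names here are assumptions for illustration, not taken from the actual diff:

```go
package sketch

// Hypothetical constants (names assumed): one shared 1 MiB limit
// replaces the separate maximum read-record, write-record, and
// write-buffer sizes, so large payloads are split into fewer records
// and require fewer syscalls.
const (
	altsMaxRecordSize      = 1 << 20 // max record size for both reads and writes
	altsWriteBufferMaxSize = 1 << 20 // cap on the buffered-write staging area
)
```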
SO_RCVLOWAT is *not* enabled by default. Users must turn it on via an option, as not everyone will want the CPU/throughput tradeoff. Part of grpc#8510.
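A hedged sketch of what opting in might look like at the call site. The option field below is assumed for illustration and commented out; it is not the PR's actual API, though `alts.DefaultClientOptions` and `alts.NewClientCreds` are the existing entry points:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/alts"
)

func main() {
	opts := alts.DefaultClientOptions()
	// Hypothetical field, assumed for illustration; the real option name
	// may differ. Off by default because of the throughput cost.
	// opts.EnableReceiveLowWatermark = true
	creds := alts.NewClientCreds(opts)
	_, _ = grpc.NewClient("example.com:443", grpc.WithTransportCredentials(creds))
}
```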
The implementation of setRcvlowat is based on the gRPC C++ library implementation. Part of grpc#8510.
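A minimal sketch, assuming Linux and golang.org/x/sys/unix, of the general technique for setting SO_RCVLOWAT on a connection's raw socket (this mirrors the approach, not necessarily this PR's exact code):

```go
package sketch

import (
	"fmt"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// setRcvlowatSketch sets SO_RCVLOWAT so that a blocking read does not
// wake the reader until at least n bytes are available. Sketch only:
// assumes Linux and a connection that exposes syscall.Conn, such as
// *net.TCPConn.
func setRcvlowatSketch(c net.Conn, n int) error {
	sc, ok := c.(syscall.Conn)
	if !ok {
		return fmt.Errorf("conn does not expose a raw socket")
	}
	raw, err := sc.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if ctlErr := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVLOWAT, n)
	}); ctlErr != nil {
		return ctlErr
	}
	return sockErr
}
```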
Part of grpc#8510.

For large payloads, we see about a 36% reduction in CPU usage and a 2.5% reduction in throughput. This is expected, and has been observed in the C++ gRPC library as well. The expectation is that with TCP receive zerocopy also enabled, we'll see both a reduction in CPU usage and an increase in throughput. For users not using zerocopy, they can choose whether the CPU/throughput tradeoff is worthwhile.

SO_RCVLOWAT is unused for small payloads, where its impact would be insignificant (but would cost cycles to make syscalls). Enabling it has no effect on CPU usage or throughput of small payloads, and so is omitted from the below benchmarks so as not to water down the impact on both CPU usage and throughput.

Benchmark numbers:

```
$ benchstat -col "/rcvlowat" -filter "/size:(64_KiB OR 512_KiB OR 1_MiB OR 4_MiB) .unit:(Mbps OR cpu-usec/op)" ~/lowat_numbers.txt
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/credentials/alts/internal/conn
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
                           │    false    │                true                │
                           │    Mbps     │    Mbps     vs base                │
Rcvlowat/size=64_KiB-12      47.44 ± 0%    47.32 ± 0%   -0.24% (p=0.015 n=6)
Rcvlowat/size=512_KiB-12     299.2 ± 0%    293.6 ± 0%   -1.90% (p=0.002 n=6)
Rcvlowat/size=1_MiB-12       482.1 ± 0%    468.1 ± 0%   -2.88% (p=0.002 n=6)
Rcvlowat/size=4_MiB-12       887.4 ± 1%    842.3 ± 0%   -5.08% (p=0.002 n=6)
geomean                      279.1         272.0        -2.54%

                           │    false    │                true                 │
                           │ cpu-usec/op │ cpu-usec/op  vs base                │
Rcvlowat/size=64_KiB-12      992.2 ± 1%     666.1 ± 1%  -32.87% (p=0.002 n=6)
Rcvlowat/size=512_KiB-12    7.431k ± 1%    4.660k ± 0%  -37.30% (p=0.002 n=6)
Rcvlowat/size=1_MiB-12     14.720k ± 1%    9.192k ± 0%  -37.56% (p=0.002 n=6)
Rcvlowat/size=4_MiB-12      59.19k ± 1%    37.50k ± 3%  -36.64% (p=0.002 n=6)
geomean                     8.953k         5.719k       -36.12%
```
Codecov Report
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #8513      +/-   ##
==========================================
- Coverage   82.40%   81.76%   -0.64%
==========================================
  Files         414      414
  Lines       40531    40575      +44
==========================================
- Hits        33399    33178     -221
- Misses       5770     6026     +256
- Partials     1362     1371       +9
```
I believe it should be possible to set this using a custom dialer, without any code changes. Have you considered that approach?
I don't think using a custom dialer would work, since we need to update the value of the socket option before every read. Maybe we should consider directly implementing this in the transport.
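For context, a hedged sketch of the dialer-based approach under discussion: `net.Dialer.Control` can set SO_RCVLOWAT once at connection time, but offers no hook to adjust the value before each read. All values here are illustrative, and the snippet assumes Linux:

```go
package main

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Sketch: the watermark is fixed at dial time (64 KiB here, chosen
	// arbitrarily) and cannot track the size of each incoming record.
	dial := func(ctx context.Context, addr string) (net.Conn, error) {
		d := net.Dialer{
			Control: func(network, address string, c syscall.RawConn) error {
				var sockErr error
				if err := c.Control(func(fd uintptr) {
					sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVLOWAT, 64*1024)
				}); err != nil {
					return err
				}
				return sockErr
			},
		}
		return d.DialContext(ctx, "tcp", addr)
	}
	_, _ = grpc.NewClient("example.com:443",
		grpc.WithContextDialer(dial),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
}
```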
The socket option should be set to the maximum of the HTTP/2 frame size and the TLS/ALTS record size. However, there are several challenges to implementing this in gRPC-Go.
I actually started by writing my own dialer.

I'm not sure I understand the need for the HTTP/2 frame and TLS record sizes. HTTP/2 is the layer above this; we care about the ALTS frame size. And IIUC ALTS requires the whole record in order to decrypt, so I'm not sure why we need the TLS record size. But please correct me if I'm wrong here; I'm new to working with ALTS.

Is there an issue with implementing this inside ALTS? ALTS has access to both the underlying socket (so it can set SO_RCVLOWAT) and the information it needs (the incoming ALTS message length).
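To make the point concrete, a self-contained sketch (all names hypothetical, not this PR's code) of how a record-oriented reader that knows the incoming message length could drive the watermark before each blocking read:

```go
package sketch

import (
	"encoding/binary"
	"io"
	"net"
)

// altsStyleReader is hypothetical and only illustrates the idea: once
// the record's length prefix has been parsed, the reader knows exactly
// how many bytes it still needs, so it can set SO_RCVLOWAT before the
// blocking read that fetches the record body.
type altsStyleReader struct {
	conn     net.Conn
	setLowat func(net.Conn, int) error // e.g. the setsockopt sketch above
}

func (r *altsStyleReader) readRecord() ([]byte, error) {
	// Read the 4-byte length prefix (ALTS-style framing; the endianness
	// here is illustrative).
	var hdr [4]byte
	if _, err := io.ReadFull(r.conn, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.LittleEndian.Uint32(hdr[:])

	// The record body has a known size: ask the kernel not to wake us
	// until all of it has arrived, avoiding repeated short reads. A real
	// implementation would clamp the watermark to the socket receive
	// buffer size to avoid stalling.
	if err := r.setLowat(r.conn, int(n)); err != nil {
		return nil, err
	}
	body := make([]byte, n)
	_, err := io.ReadFull(r.conn, body)
	return body, err
}
```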
Framing occurs at two layers in the networking stack: HTTP/2 framing on top, and TLS/ALTS record framing beneath it.
Based on my reading of the gRPC C-core, it seems the goal is to set SO_RCVLOWAT so that a wakeup delivers at least one complete frame. To correctly determine the SO_RCVLOWAT value, both framing layers have to be taken into account. @ctiller and @Vignesh2208, could you confirm if my understanding of how SO_RCVLOWAT is used in C-core is correct?

In practice, gRPC Go defaults to a maximum H2 frame size of 16KB, and ALTS also appears to use 16KB frames (with an option for larger sizes if supported by the peer). Therefore, implementing the watermark at the ALTS layer should be equivalent in practice.

Finally, it's worth noting that when an HTTPS proxy is in use, the credentials may receive a wrapped net.Conn rather than the raw socket, so the socket option may not be settable there.
Add in gRPC's own message framing to that. Messages may span multiple H2 frames if they are larger than the max frame size. But also keep in mind that applications can't send more data than our flow control allows. This flow control is optimized assuming that it is replenished while receiving, so we would have to be careful and make sure we are pre-replenishing it before using this feature.
In the gRPC Go team meeting today, we decided to proceed with implementing SO_RCVLOWAT support inside ALTS.

There was a request to run real-world benchmarks to better quantify the performance impact. I will share some resources via private message for running a GCS read benchmark to help with this.
Reduce CPU usage significantly via SO_RCVLOWAT. There is a small throughput penalty, so SO_RCVLOWAT is not enabled by default. Users must turn it on via an option, as not everyone will want the CPU/throughput tradeoff.
Part of #8510.
For large payloads, we see about a 36% reduction in CPU usage and a 2.5%
reduction in throughput. This is expected, and has been observed in the
C++ gRPC library as well. The expectation is that with TCP receive
zerocopy also enabled, we'll see both a reduction in CPU usage and an
increase in throughput. For users not using zerocopy, they can choose
whether the CPU/throughput tradeoff is worthwhile.
SO_RCVLOWAT is unused for small payloads, where its impact would be
insignificant (but would cost cycles to make syscalls). Enabling it in
grpc-go has no effect on CPU usage or throughput of small payloads, and
so is omitted from the below benchmarks so as not to water down the
impact on both CPU usage and throughput.
Benchmarks are of the ALTS layer alone.
Note that this PR includes #8512. GitHub doesn't support proper commit chains / stacked PRs, so I'm doing this in several PRs with some (annoyingly) redundant commits. Let me know if this isn't a good workflow for you and I'll change things up.
Benchmark numbers: see the benchstat output in the SO_RCVLOWAT commit message above.