
Conversation

@kevinGC (Contributor) commented Aug 13, 2025

Reduce CPU usage significantly via SO_RCVLOWAT. There is a small
throughput penalty, so SO_RCVLOWAT is not enabled by default.
Users must turn it on via an option, as not everyone will want the
CPU/throughput tradeoff.

Part of #8510.

For large payloads, we see about a 36% reduction in CPU usage and a 2.5%
reduction in throughput. This is expected, and has been observed in the
C++ gRPC library as well. The expectation is that with TCP receive
zerocopy also enabled, we'll see both a reduction in CPU usage and an
increase in throughput. Users not using zerocopy can choose whether the
CPU/throughput tradeoff is worthwhile.

SO_RCVLOWAT is left unused for small payloads, where its impact would be
insignificant but the extra setsockopt syscalls would still cost cycles.
Enabling it in grpc-go has no effect on the CPU usage or throughput of
small payloads, so small-payload results are omitted from the benchmarks
below to avoid watering down the measured impact on both CPU usage and
throughput.

Benchmarks are of the ALTS layer alone.

Note that this PR includes #8512. GitHub doesn't support proper commit chains / stacked PRs, so I'm doing this in several PRs with some (annoyingly) redundant commits. Let me know if this isn't a good workflow for you and I'll change things up.

Benchmark numbers:

$ benchstat -col "/rcvlowat" -filter "/size:(64_KiB OR 512_KiB OR 1_MiB OR 4_MiB) .unit:(Mbps OR cpu-usec/op)" ~/lowat_numbers.txt
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/credentials/alts/internal/conn
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
                         │   false    │               true               │
                         │    Mbps    │    Mbps     vs base              │
Rcvlowat/size=64_KiB-12    47.44 ± 0%   47.32 ± 0%  -0.24% (p=0.015 n=6)
Rcvlowat/size=512_KiB-12   299.2 ± 0%   293.6 ± 0%  -1.90% (p=0.002 n=6)
Rcvlowat/size=1_MiB-12     482.1 ± 0%   468.1 ± 0%  -2.88% (p=0.002 n=6)
Rcvlowat/size=4_MiB-12     887.4 ± 1%   842.3 ± 0%  -5.08% (p=0.002 n=6)
geomean                    279.1        272.0       -2.54%

                         │    false     │                true                │
                         │ cpu-usec/op  │ cpu-usec/op  vs base               │
Rcvlowat/size=64_KiB-12      992.2 ± 1%    666.1 ± 1%  -32.87% (p=0.002 n=6)
Rcvlowat/size=512_KiB-12    7.431k ± 1%   4.660k ± 0%  -37.30% (p=0.002 n=6)
Rcvlowat/size=1_MiB-12     14.720k ± 1%   9.192k ± 0%  -37.56% (p=0.002 n=6)
Rcvlowat/size=4_MiB-12      59.19k ± 1%   37.50k ± 3%  -36.64% (p=0.002 n=6)
geomean                     8.953k        5.719k       -36.12%

It's generally useful to have new-and-improved Go. One specific useful
feature is `b.Loop()`, which makes benchmarking easier.
It's only called in one place, and is effectively a method on conn.
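
For anyone unfamiliar with it, here is a minimal sketch of the `b.Loop` pattern (available since Go 1.24); the `process` helper and buffer size are made up for illustration and are not the actual ALTS benchmark:

```go
package conn_test

import "testing"

// process is a hypothetical stand-in for the work being measured.
func process(buf []byte) {
	for i := range buf {
		buf[i]++
	}
}

func BenchmarkProcess(b *testing.B) {
	buf := make([]byte, 1<<20) // setup before the loop runs once and is not timed

	// b.Loop (Go 1.24+) manages the iteration count and the timer, and keeps
	// the compiler from optimizing away the benchmarked call.
	for b.Loop() {
		process(buf)
	}
}
```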

Part of grpc#8510.
Increases large write speed by 9.62% per BenchmarkLargeMessage. Detailed
benchmarking numbers below.

Rather than use different sizes for the maximum read record, write
record, and write buffer, just use 1MB for all of them.

Using larger records reduces the amount of payload splitting and the
number of syscalls made by ALTS.

Part of grpc#8510. SO_RCVLOWAT and TCP receive zerocopy are only effective
with larger payloads, so ALTS can't keep limiting payload sizes to 4 KiB.
SO_RCVLOWAT and zerocopy operate on the receive side, but for benchmarking
purposes we need ALTS to send large messages.

Benchmarks:

$ benchstat large_msg_old.txt large_msg.txt
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/credentials/alts/internal/conn
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
                │ large_msg_old.txt │           large_msg.txt           │
                │      sec/op       │   sec/op     vs base              │
LargeMessage-12         68.88m ± 1%   62.25m ± 0%  -9.62% (p=0.002 n=6)
SO_RCVLOWAT is *not* enabled by default. Users must turn it on via an
option, as not everyone will want the CPU/throughput tradeoff.

Part of grpc#8510.
The implementation of setRcvlowat is based on the corresponding code in the
gRPC C++ library.
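
For reference, a rough Linux-only sketch of what setting SO_RCVLOWAT on a connection can look like via `golang.org/x/sys/unix`; the package name, helper name, and error handling here are illustrative assumptions, not necessarily the actual code in this PR:

```go
//go:build linux

package example

import (
	"fmt"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// setRcvlowatSketch tells the kernel not to wake a blocked reader until at
// least minBytes are available in the socket receive buffer.
func setRcvlowatSketch(c net.Conn, minBytes int) error {
	sc, ok := c.(syscall.Conn)
	if !ok {
		return fmt.Errorf("connection of type %T does not expose a raw socket", c)
	}
	raw, err := sc.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVLOWAT, minBytes)
	}); err != nil {
		return err
	}
	return sockErr
}
```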

Part of grpc#8510.

codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 49.45055% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.76%. Comparing base (55e8b90) to head (af10d77).
⚠️ Report is 83 commits behind head on master.

Files with missing lines                        Patch %   Lines
credentials/alts/internal/conn/conn_linux.go    10.81%    33 Missing ⚠️
credentials/alts/internal/conn/record.go        71.11%    11 Missing and 2 partials ⚠️
Additional details and impacted files
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8513      +/-   ##
==========================================
- Coverage   82.40%   81.76%   -0.64%     
==========================================
  Files         414      414              
  Lines       40531    40575      +44     
==========================================
- Hits        33399    33178     -221     
- Misses       5770     6026     +256     
- Partials     1362     1371       +9     
Files with missing lines                              Coverage Δ
credentials/alts/alts.go                              75.65% <100.00%> (+0.49%) ⬆️
credentials/alts/internal/conn/common.go              100.00% <ø> (ø)
credentials/alts/internal/handshaker/handshaker.go    79.29% <100.00%> (+1.74%) ⬆️
credentials/alts/internal/conn/record.go              74.47% <71.11%> (-3.73%) ⬇️
credentials/alts/internal/conn/conn_linux.go          10.81% <10.81%> (ø)

... and 40 files with indirect coverage changes


@arjan-bal (Contributor) commented Aug 21, 2025

I believe it should be possible to set this using a custom dialer, without any code changes. Have you considered that approach?

@arjan-bal (Contributor)

> I believe it should be possible to set this using a custom dialer, without any code changes. Have you considered that approach?

I don't think using a custom dialer would work since we need to update the value of the socket option before every read. Maybe we should consider directly implementing this in the transport.
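
For illustration, a minimal Linux-only sketch of the custom-dialer approach; the fixed 64 KiB watermark, insecure credentials, and function name are assumptions. Because `net.Dialer.Control` runs only once at connect time, it cannot adjust the option before each read, which is the limitation described above:

```go
package example

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dialWithFixedRcvlowat(target string) (*grpc.ClientConn, error) {
	// Dialer.Control runs exactly once, right after the socket is created
	// and before it connects. There is no later hook to raise or lower the
	// watermark as record sizes become known.
	d := &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVLOWAT, 64*1024)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return grpc.NewClient(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return d.DialContext(ctx, "tcp", addr)
		}),
	)
}
```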

@arjan-bal (Contributor)

> I don't think using a custom dialer would work since we need to update the value of the socket option before every read. Maybe we should consider directly implementing this in the transport.

The socket option should be set to the maximum of the HTTP/2 frame size and the TLS/ALTS record size: max(http/2_frame_size, tls_record_size). This would be similar to the gRPC C-core API, which passes a parameter specifically for this purpose. This value would need to be propagated down through the layers, from the framer to the credentials, and finally to the dialer (framer -> credentials -> dialer).

However, there are several challenges to implementing this in gRPC-Go:

  • net.Conn Interface: The standard net.Conn interface, which gRPC-Go uses for socket operations, does not support passing a minimum read size. A potential workaround is to define an optional interface that our connection types can implement, which we can then check for using a type assertion (a sketch follows this list).
  • HTTP/2 Framer: gRPC-Go uses the standard library's HTTP/2 framer. Exposing the frame size information would likely require forking and modifying this package.
  • TLS Record Size: Similarly, the standard library's TLS implementation doesn't easily expose the record size, which is another necessary input for this logic.
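
As a sketch of the optional-interface workaround from the first bullet; the interface and method names below are hypothetical, not an existing gRPC-Go API:

```go
package example

import (
	"log"
	"net"
)

// minReadSizer is a hypothetical optional interface that a connection type
// could implement to accept a minimum-read-size hint.
type minReadSizer interface {
	SetMinReadSize(n int) error
}

// hintMinRead passes the hint when the connection supports it and silently
// degrades to plain reads when it does not.
func hintMinRead(c net.Conn, n int) {
	if mrs, ok := c.(minReadSizer); ok {
		if err := mrs.SetMinReadSize(n); err != nil {
			log.Printf("SetMinReadSize(%d) failed, continuing without the hint: %v", n, err)
		}
	}
}
```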

@kevinGC (Contributor, Author) commented Oct 3, 2025

I actually started by writing my own net.Conn, but it didn't make sense: you have to call a bunch of special methods on it that make it confusing to use. I agree it'd be cleaner if it were possible, but right now that's not the case.

I'm not sure I understand the need for the HTTP/2 frame and TLS record sizes. HTTP/2 is the layer above this; we care about the ALTS frame size. And IIUC ALTS requires the whole record in order to decrypt, so I'm not sure why we need the TLS record size. But please correct me if I'm wrong here; I'm new to working with ALTS.

Is there an issue with implementing this inside ALTS? ALTS has access to both the underlying socket (so it can set SO_RCVLOWAT) and the necessary information (the incoming ALTS message length).

@arjan-bal (Contributor)

Framing occurs at two layers in the networking stack:

  1. Transport Security: ALTS and TLS frame records.
  2. HTTP/2 (H2): H2 interleaves frames for different streams. Most H2 frames are small, except for DATA frames that carry gRPC messages.

Based on my reading of the gRPC C-core, it seems the goal is to set SO_RCVLOWAT to the larger of the security and application frame sizes. For example, if H2 DATA frames are 16KB and ALTS records are 3KB, we should wait for 16KB to become available in the kernel's read buffer before waking the application.

To correctly determine the SO_RCVLOWAT value, both the H2 framer and the transport credentials need to be involved. Implementing this logic in the core gRPC layer would avoid duplicating it across each credential's implementation.

@ctiller and @Vignesh2208, could you confirm if my understanding of how SO_RCVLOWAT is used in C-core is correct?


In practice, gRPC Go defaults to a maximum H2 frame size of 16KB, and ALTS also appears to use 16KB frames (with an option for larger sizes if supported by the peer). Therefore, implementing the SO_RCVLOWAT logic solely within the ALTS credentials might be sufficient for performance. However, this approach could lead to code duplication later if, for example, gRPC adds support for larger H2 frames or if TLS credentials require similar functionality.


Finally, it's worth noting that when an HTTPS proxy is in use, the credentials may receive a wrapped buffconn instead of the underlying TCP connection. This can be handled with a type assertion to get the raw file descriptor.

FYI @dfawley @easwars .

@dfawley (Member) commented Oct 6, 2025

> Framing occurs at two layers in the networking stack:

Add in gRPC's own message framing to that. Messages may span multiple H2 frames if they are larger than the max frame size. But also keep in mind that applications can't send more data than our flow control allows. This flow control is optimized assuming that it is replenished while receiving, so we would have to be careful and make sure we are pre-replenishing it before using this feature, e.g.

@easwars added this to the 1.77 Release milestone on Oct 6, 2025
@arjan-bal (Contributor)

In the gRPC Go team meeting today, we decided to proceed with implementing SO_RCVLOWAT support within the ALTS credentials for now, as the potential performance benefits are significant.

There was a request to run real-world benchmarks to better quantify the performance impact. I will share some resources via private message for running a GCS read benchmark to help with this.

@easwars added the Type: Feature, Area: Auth, and Status: Requires Reporter Clarification labels on Oct 10, 2025