Skip to content

Conversation

kanurag94
Copy link
Contributor

@kanurag94 kanurag94 commented Sep 30, 2025

Description

With this change envoy will send one GOAWAY frame with a last_stream_id set to 2^31-1, as described in RFC 9113. After a reasonable interval (defaults to 1s), envoy will send a followup GOAWAY frame with last_stream_id set to the highest received stream ID at that point in time.
This interval is configurable using graceful_goaway_timeout field.


Commit Message: http2: send graceful GoAway on close
Additional Description: According to RFC 9113, envoy should send a goaway with last_stream_id set to 2^31-1 before sending a final goaway with the highest received stream ID.
Risk Level: Low
Testing: Tests added. Manually verified.
Docs Changes: NA
Release Notes: Changelog added
Platform Specific Features:
Runtime guard: envoy.reloadable_features.http2_graceful_goaway
Fixes #39876
[Optional API Considerations:]

Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #41284 was opened by kanurag94.

see: more, trace.

Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #41284 was opened by kanurag94.

see: more, trace.

Copy link

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

🐱

Caused by: #41284 was synchronize by kanurag94.

see: more, trace.

@kanurag94 kanurag94 marked this pull request as ready for review September 30, 2025 18:23
@kanurag94
Copy link
Contributor Author

/retest

@kanurag94
Copy link
Contributor Author

/retest

Signed-off-by: Anurag Aggarwal <[email protected]>
@antoniovleonti
Copy link
Contributor

My understanding is that a graceful goaway is sent at the beginning of the drain sequence via Connection::shutdownNotice() (check code here) That eventually calls this adapter function which in both nghttp2 and oghttp2 sends the goaway frame with the max stream ID as you have done here.

Can you explain why this is not sufficient?

@kanurag94
Copy link
Contributor Author

My understanding is that a graceful goaway is sent at the beginning of the drain sequence via Connection::shutdownNotice() (check code here) That eventually calls this adapter function which in both nghttp2 and oghttp2 sends the goaway frame with the max stream ID as you have done here.

Can you explain why this is not sufficient?

On connection drain the shutdown notice in both nghttp2/oghttp2 adapter sends goaways correctly. If you look at the original issue for this PR #39876, an important part is filters triggering GoAways immediately. This PR solves that part.
For nghttp2/oghttp2 connection drain and subsequent shutdown, this goaway will not be called AFAICS.

Copy link
Contributor

@antoniovleonti antoniovleonti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry! I missed the issue in the description. Thank you for the explanation.

EXPECT_EQ(1, server_stats_store_.counter("http2.goaway_sent").value());
}

TEST_P(Http2CodecImplTest, GracefulGoAwayCausesOutboundFlood) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be prudent to add some tests showing how this new feature and the existing http connection manager drain phase interact.

Could you also add a couple integration tests?

Copy link
Contributor Author

@kanurag94 kanurag94 Oct 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!, I added some integration tests. Please let me know if they are okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TY for the added testing!

I think what worries me the most is if it's possible for the drain timer to fire after the graceful shutdown timer or vice versa.

Will that crash the envoy? (e.g. the second timer fires and tries calling a function on a now null object).

Could that potentially cause us to send two final goaways (with the second having a higher stream_id-- which is not allowed according to the spec)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand the code (which is little to be honest) -- the destructor on ConnectionImpl should destroy the timers -> cancelling any pending callbacks -> preventing null object access.

Actually the behavior hasn't changed much. I reverted the change on calling gracefulGoAway on onDrainTimeout. Now only filter or load shedding invoked goaways are changed to graceful. Both goaway paths are separate and non overlapping now. It just changes the manual goaway to be more RFC compliant.

However, I will keep attempting adding test.

1. only call custom gracefulGoAway when triggered from LoadSheddding or Filter
2. Add UTs
3. Add integration tests showing interaction of connection drain and filter invoked

Signed-off-by: Anurag Aggarwal <[email protected]>
Signed-off-by: Anurag Aggarwal <[email protected]>
Signed-off-by: Anurag Aggarwal <[email protected]>
@kanurag94
Copy link
Contributor Author

While trying to fix loadShedding integration tests, I see that envoy allows new streams to be created when they shouldn't be during the GOAWAY period. See ServerConnectionImpl::onBeginHeaders(int32_t stream_id). This violates RFC 7540.
Please correct me if my understanding is wrong.

cc @birenroy, @antoniovleonti, @yanavlasov

Signed-off-by: Anurag Aggarwal <[email protected]>
Signed-off-by: Anurag Aggarwal <[email protected]>
@antoniovleonti
Copy link
Contributor

While trying to fix loadShedding integration tests, I see that envoy allows new streams to be created when they shouldn't be during the GOAWAY period. See ServerConnectionImpl::onBeginHeaders(int32_t stream_id). This violates RFC 7540. Please correct me if my understanding is wrong.

cc @birenroy, @antoniovleonti, @yanavlasov

https://github.com/envoyproxy/envoy/blob/main/envoy/http/codec.h#L599 This comment seems to imply this is working as intended, but I agree the RFC seems to say this shouldn't be allowed.

@kanurag94
Copy link
Contributor Author

While trying to fix loadShedding integration tests, I see that envoy allows new streams to be created when they shouldn't be during the GOAWAY period. See ServerConnectionImpl::onBeginHeaders(int32_t stream_id). This violates RFC 7540. Please correct me if my understanding is wrong.
cc @birenroy, @antoniovleonti, @yanavlasov

https://github.com/envoyproxy/envoy/blob/main/envoy/http/codec.h#L599 This comment seems to imply this is working as intended, but I agree the RFC seems to say this shouldn't be allowed.

Thanks for the reference. I see the behavior with nghttp2 vs oghttp2 different.

  • nghttp2 resets any new streams as soon as it goaway.
  • oghttp2 seems to accept new streams.

I think this makes the behaviour already inconsistent irrespective what we expect according to the comment, orz.

@yanavlasov
Copy link
Contributor

While trying to fix loadShedding integration tests, I see that envoy allows new streams to be created when they shouldn't be during the GOAWAY period. See ServerConnectionImpl::onBeginHeaders(int32_t stream_id). This violates RFC 7540. Please correct me if my understanding is wrong.

cc @birenroy, @antoniovleonti, @yanavlasov

Yes, Envoy does not enforce GOAWAY semantics as it should. We should fix this and make it configurable with default to RFC correct behavior.

@yanavlasov
Copy link
Contributor

@kanurag94 please resolve merge conflicts. Will wait for @antoniovleonti LGTM.

/wait

Copy link
Contributor

@antoniovleonti antoniovleonti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for the improved testing!

@birenroy
Copy link
Contributor

birenroy commented Oct 7, 2025

While trying to fix loadShedding integration tests, I see that envoy allows new streams to be created when they shouldn't be during the GOAWAY period. See ServerConnectionImpl::onBeginHeaders(int32_t stream_id). This violates RFC 7540. Please correct me if my understanding is wrong.
cc @birenroy, @antoniovleonti, @yanavlasov

https://github.com/envoyproxy/envoy/blob/main/envoy/http/codec.h#L599 This comment seems to imply this is working as intended, but I agree the RFC seems to say this shouldn't be allowed.

Thanks for the reference. I see the behavior with nghttp2 vs oghttp2 different.

  • nghttp2 resets any new streams as soon as it goaway.
  • oghttp2 seems to accept new streams.

I think this makes the behaviour already inconsistent irrespective what we expect according to the comment, orz.

If nghttp2 resets new streams after Envoy sends the initial GOAWAY with max stream ID, then that is a bug in nghttp2.

A server should continue to accept new streams that arrive until it sends the final GOAWAY.

@birenroy
Copy link
Contributor

birenroy commented Oct 7, 2025

I appreciate the work you've done here to fix the issue in question. Would it be possible to offer two options for the inter-GOAWAY timeout?

  1. time duration specified in config (already done in the current version of the PR)
  2. a round trip time, as measured by a PING frame initiated by the Envoy

Copy link
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this great contribution. I added some comments. But I think the core problems are here:

  • sendGoAwayAndClose() will reset all pending streams directly. It's make no sense to delay the final goaway. That's say, the sendGoAwayAndClose() should not be used if graceful shutdown is necessary. I think it's OK to submit a shutdown notice first for RFC. But it makes no much sense in actual practice.
  • If we need a graceful shutdown, we should enhance the sendGoAwayAndClose() to provide a graceful mode, then we can redirect the call to the startDrainSequence(). Then the shutdown notice (initial goaway) will be sent immediately. And the final goway will be sent after the drain time is end.

cc @yanavlasov cc @birenroy

/wait

Comment on lines +589 to +593
/**
* Indicate graceful "go away" to the remote. For HTTP/2, this implements RFC-compliant
* graceful shutdown. For other protocols, this falls back to regular goAway().
*/
virtual void goAwayGraceful() { goAway(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should reuse the shutdownNotice().

Comment on lines +665 to +672

// Timeout for graceful HTTP/2 GOAWAY shutdown sequence.
// When graceful shutdown is enabled, Envoy will send an initial GOAWAY frame with
// last_stream_id set to 2^31-1, wait for this timeout period, then send a final
// GOAWAY frame with the actual highest received stream ID. This allows in-flight
// requests to complete gracefully. If set to 0, graceful shutdown is disabled
// and the standard immediate GOAWAY behavior is used. Defaults to 1000ms (1 second).
google.protobuf.Duration graceful_goaway_timeout = 18;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection will be closed directly after the delayed_close_timeout, the that will finally effect the result of the graceful_goaway_timeout.

@wbpcode wbpcode assigned yanavlasov and wbpcode and unassigned markdroth Oct 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Envoy does not use graceful shutdown when sending HTTP/2 GOAWAYs

6 participants