Skip to content

Conversation

asingh-g
Copy link

To be able to load test Terminating HTTP CONNECT/CONNECT-UDP Proxies, I've implemented encapsulation by spinning up a second envoy as a subprocess and routing nighthawk's generated traffic through it.

Presently CONNECT-UDP is only supported for HTTP/2 in envoy, and CONNECT is supported for all of H1, H2 and H3 QUIC.
Therefore only following combinations are supported
H1,H2 over H1,H2,H3
and
H3 over H2
using a combination of the --protocol and --tunnel-protocol flags.

@asingh-g asingh-g force-pushed the nighthawk-connect-support branch from 46c5e51 to fbe6068 Compare June 28, 2025 17:22
Copy link
Collaborator

@mum4k mum4k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @asingh-g, glad to see this landing!

I note this is still marked as Draft, leaving a few observations below to get us started. Please mark it as ready for review once ready. I will be ooo for two weeks, but @eric846 will be able to help with review and merge.

@mum4k mum4k requested a review from eric846 July 1, 2025 17:00
@mum4k mum4k added the waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. label Jul 1, 2025
@asingh-g asingh-g force-pushed the nighthawk-connect-support branch from ca97948 to eda8939 Compare July 1, 2025 23:14
asingh-g added 25 commits July 2, 2025 00:18
Signed-off-by: asingh-g <[email protected]>
Signed-off-by: asingh-g <[email protected]>
asingh-g added 2 commits July 2, 2025 00:27
Signed-off-by: asingh-g <[email protected]>
Signed-off-by: asingh-g <[email protected]>
Copy link

@asedeno asedeno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to read over source/client/process_impl.cc and the tests, but here's some comments.

@@ -1,5 +1,7 @@
#include "source/client/process_bootstrap.h"

#include <semaphore.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the C++ semaphore header rather than the C semaphore.h header, or perhaps something from Abseil's synchronization library?

Copy link
Author

@asingh-g asingh-g Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into std::semaphore and abseil's synchronisation libs, I am not convinced if they're meant for IPC. The pshared flag (PTHREAD_PROCESS_SHARED) in semaphore.h is what allows us to do inter process synchronisation and it looks to me that both of the alternatives use PTHREAD_PROCESS_PRIVATE

@asingh-g asingh-g force-pushed the nighthawk-connect-support branch 2 times, most recently from b0b330f to 47d662f Compare September 16, 2025 00:44
Signed-off-by: asingh-g <[email protected]>
@asingh-g asingh-g marked this pull request as ready for review September 16, 2025 01:09
@asingh-g asingh-g requested a review from eric846 September 16, 2025 01:09
eric846
eric846 previously approved these changes Sep 19, 2025
@eric846
Copy link
Contributor

eric846 commented Sep 19, 2025

The tsan test timed out. Could you try something like git commit -s --allow-empty -m 'Kick CI' && git push ? Hopefully it will pass.

Signed-off-by: asingh-g <[email protected]>
@asingh-g
Copy link
Author

The tsan test timed out. Could you try something like git commit -s --allow-empty -m 'Kick CI' && git push ? Hopefully it will pass.

Hmm seems like its still timing out @eric846

@eric846
Copy link
Contributor

eric846 commented Sep 22, 2025

The tsan test timed out. Could you try something like git commit -s --allow-empty -m 'Kick CI' && git push ? Hopefully it will pass.

Hmm seems like its still timing out @eric846

We probably need to increase the timeout to fit the extra tests. But it seems nontrivial to increase the timeout.

  • I would have thought you could just set the timeout here:
    build:tsan --test_timeout=900 # unique
    . Setting it to 5 there doesn't cut the test short when I try it locally. The 1800s timeout in the log may have been two attempts each lasting 900s. Or could this 900 not actually be working?
  • Another possibility I thought could be to add --test_timeout=3600 here:
    --test_env=UBSAN_OPTIONS=print_stacktrace=1 \
    . But again, --test_timeout=5 failed to cut the test short.

@eric846
Copy link
Contributor

eric846 commented Sep 24, 2025

Sorry for the confusion. I found these conflicting timeout settings:

Could you try editing line 6 of .bazelrc to be something like

build:tsan --test_timeout=3600

and line 220 of .bazelrc to

# build:tsan --test_timeout=120,600,1500,4800 # unique

(commenting it out with our standard #unique note)

Hopefully the tsan test will succeed within 3600 seconds.

Once it's working we can reduce the 3600 to something closer to the true duration.

Why it was like this: I believe line 220 from Envoy's .bazelrc sneaked in at some point, unwittingly neutralizing our Nighthawk-specific tsan setting on line 6. But I can't explain why the tsan test timed out after 1800. If it still times out after 1800, the 1800 could be at some level we haven't looked yet.

@eric846
Copy link
Contributor

eric846 commented Sep 26, 2025

We still got a 30m timeout despite adjusting .bazelrc.

I poked around a bit more.

Check out

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //test:python_test
whoami: cannot find name for user ID 108
-----------------------------------------------------------------------------
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from PythonTest
[ RUN      ] PythonTest.IntegrationTests
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0
rootdir: /mnt/engflow/worker/work/3/exec/bazel-out/k8-dbg/bin/test/python_test.runfiles/nighthawk/test/integration
plugins: xdist-3.8.0, dependency-0.6.0
created: 1/1 worker
1 worker [4 items]
scheduling tests via LoadScheduling
test/integration/test_integration_basics.py::test_connect_tunneling[IpVersion.IPV4-nighthawk/test/integration/configurations/terminating_http1_connect_envoy.yaml-http1] -- Test timed out at 2025-09-25 16:32:37 UTC --
Execution result: https://mordenite.cluster.engflow.com/actions/executions/ChAnXXze2cZOnrhdkfocxvsoEgdkZWZhdWx0GiUKIFQ819htoPmp_o_lOsndan3e_Zqc8oQPIu8OyRUhzHLwEJ0C

(//test:python_test is a C++ wrapper that runs the Python integration tests.)

Could one of the new integration testcases be getting stuck? And could we have an implicit 30m timeout somewhere in the test?

The fact that it timed out during one of the new tests could indicate a problem with those tests.

Up to now, my assumption has been that the total duration of all the integration tests just exceeded the allocated time. But what if the new integration tests uniquely get stuck in tsan mode?

I have a couple of ideas for experiments:

  • Comment out every testcase in python_test except the test_connect_tunneling http1 case, and see if tsan still times out in 30m.
  • Set a very low timeout in .bazelrc and see if the timeout error message looks the same as the current one that timed out after 1800.2s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-changes A PR waiting for comments to be resolved and changes to be applied.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants