Skip to content

Conversation

@starius
Copy link
Contributor

@starius starius commented Nov 27, 2025

Summary

  • make RequestReadStream/RequestWriteStream wait for their respective channels (bounded by streamAcquireTimeout and respecting context) instead of immediately returning "stream occupied", removing the race that caused TestHashMailServerReturnStream to flake

  • ensure setupAperture stops its Aperture instance via t.Cleanup, so each test tears down its server even when it fails midway; this prevents one test from leaving a live HashMail stream that causes the next test's
    NewCipherBox call to hit AlreadyExists

Testing

  • go test -run 'TestHashMailServer(ReturnStream|LargeMessage)' -count=20

Register the Aperture instance created in setupAperture with t.Cleanup so
that every test stops its own server even if it fails. This keeps the global
HashMail stream map clean and prevents TestHashMailServerLargeMessage from
inheriting leftover streams from TestHashMailServerReturnStream.

This prevents cascading test failures, when a failure in one test is replicated
as many failures in many tests, complicating debugging from logs.
Fix flaky tests. Reproducer:
go test -run TestHashMailServerReturnStream -count=20

TestHashMailServerReturnStream fails because the test cancels a read stream
and immediately dials RecvStream again expecting the same stream to be handed
out once the server returns it. The hashmail server implemented
RequestReadStream/RequestWriteStream with a non-blocking channel poll and
returned "read/write stream occupied" as soon as the mailbox was busy. That
raced with the deferred ReturnStream call and the reconnect often happened
before the stream got pushed back, so clients received the occupancy error
instead of the context cancellation they triggered.

Teach RequestReadStream/RequestWriteStream to wait for the stream to become
available (or the caller's context / server shutdown) with a bounded timeout.
If the wait expires we still return the "... stream occupied" error, so callers
that legitimately pile up can see that signal. The new streamAcquireTimeout
constant documents the policy, and the blocking select removes the race, so
reconnect attempts now either succeed or surface the original context error.
@hieblmi hieblmi self-requested a review November 27, 2025 07:45
Copy link
Collaborator

@hieblmi hieblmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants