hashmail: fix flaky tests #199

starius · 2025-11-27T01:29:28Z

Summary

Make RequestReadStream/RequestWriteStream wait for their respective channels (bounded by streamAcquireTimeout and respecting the context) instead of immediately returning "stream occupied", removing the race that caused TestHashMailServerReturnStream to flake.

Ensure setupAperture stops its Aperture instance via t.Cleanup, so each test tears down its server even when it fails midway; this prevents one test from leaving a live HashMail stream that causes the next test's NewCipherBox call to hit AlreadyExists.

Testing

go test -run 'TestHashMailServer(ReturnStream|LargeMessage)' -count=20

Register the Aperture instance created in setupAperture with t.Cleanup so that every test stops its own server even if it fails. This keeps the global HashMail stream map clean and prevents TestHashMailServerLargeMessage from inheriting leftover streams from TestHashMailServerReturnStream. It also prevents cascading failures, where a failure in one test shows up as failures in many later tests and makes debugging from logs harder.
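A minimal sketch of the cleanup pattern described above, not the actual setupAperture code. The fakeServer type, the setupServer helper, and the cleanupRegistrar interface are hypothetical stand-ins so the example runs outside "go test"; a real helper would take *testing.T, whose Cleanup method has the same signature.

```go
package main

import "fmt"

// cleanupRegistrar is the subset of *testing.T this sketch needs.
// It is an assumption made so the example runs as a plain program.
type cleanupRegistrar interface {
	Cleanup(func())
}

// fakeServer is a hypothetical stand-in for the Aperture instance.
type fakeServer struct {
	running bool
}

func (s *fakeServer) Start() { s.running = true }
func (s *fakeServer) Stop()  { s.running = false }

// setupServer starts a server and registers its shutdown with the test,
// so the server is stopped even if the test fails midway.
func setupServer(t cleanupRegistrar) *fakeServer {
	srv := &fakeServer{}
	srv.Start()
	t.Cleanup(srv.Stop)
	return srv
}

// fakeT records cleanups and runs them last-in, first-out,
// which is the order testing.T uses.
type fakeT struct {
	cleanups []func()
}

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

// finish simulates the end of a test, whether it passed or failed.
func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

func main() {
	t := &fakeT{}
	srv := setupServer(t)
	fmt.Println("running during test:", srv.running)

	t.finish()
	fmt.Println("running after test:", srv.running)
}
```

Because the shutdown is registered inside the helper, a test body cannot forget it, and an early failure does not skip it the way a trailing Stop call at the end of the test would.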

Fix flaky tests.

Reproducer: go test -run TestHashMailServerReturnStream -count=20

TestHashMailServerReturnStream fails because the test cancels a read stream and immediately dials RecvStream again, expecting the same stream to be handed out once the server returns it. The hashmail server implemented RequestReadStream/RequestWriteStream with a non-blocking channel poll and returned "read/write stream occupied" as soon as the mailbox was busy. That raced with the deferred ReturnStream call: the reconnect often happened before the stream was pushed back, so clients received the occupancy error instead of the context cancellation they triggered.

Teach RequestReadStream/RequestWriteStream to wait for the stream to become available, or for the caller's context to end or the server to shut down, bounded by a timeout. If the wait expires we still return the "... stream occupied" error, so callers that legitimately pile up can see that signal. The new streamAcquireTimeout constant documents the policy, and the blocking select removes the race, so reconnect attempts now either succeed or surface the original context error.

hieblmi

Thanks for the fix, LGTM!

starius added 2 commits November 26, 2025 22:26

starius force-pushed the fix-flacky-tests branch from fcf5621 to e734b4a Compare November 27, 2025 02:53

hieblmi self-requested a review November 27, 2025 07:45

hieblmi approved these changes Nov 27, 2025
