Fix EOF retry bug that could cause data duplication when Range header is ignored#609
Fix EOF retry bug that could cause data duplication when Range header is ignored#609sarum90 wants to merge 3 commits intoapache:mainfrom
Conversation
- Add defensive checks + warning logs ensuring that partial range requests receive partial range responses (206 status + the expected content_range)
tustvold
left a comment
There was a problem hiding this comment.
Fun bug, well done for tracking this one down. Perhaps we could use the HttpServer machinery used for testing retries to test this?
f280218 to
6e17d91
Compare
|
Thanks for the suggestion! I initially tried using I've added tests for:
Let me know if the |
|
Hi @tustvold Any thoughts on this test coverage? Happy to work on it more as needed. I forgot about this PR last week, but I'd love to get this fixed. |
Which issue does this PR close?
N/A - no existing issue. Happy to create one if preferred.
Rationale for this change
If a reqwest timeout occurs after consuming all bytes from an S3 response body but before receiving the EOF signal,
retry_streamattempts to retry with aRangeheader likebytes=5-4(where 5 is the file size). This is an invalid/backwards range.Per RFC 7233, servers MUST ignore syntactically invalid Range headers. S3 follows this spec and returns
200 OKwith the full file instead of206 Partial Contentor416 Range Not Satisfiable.Without validation,
retry_streamwould read the full file again, causing data duplication (e.g.,"hellohello"instead of"hello").What changes are included in this PR?
Early EOF check: If
range.start >= range.endbefore retrying, return EOF immediately rather than sending an invalid range request.206 status validation: Verify the retry response is
206 Partial Content. If the server returns200 OK(indicating it ignored our Range header), fail the retry and return the original error.Content-Range validation: Verify the
Content-Rangeheader matches the requested range. This catches cases where the server returns a different range than requested.Warning logs: Added
warn!logs for the new validation failures to aid debugging.Are there any user-facing changes?
No breaking changes. Users may see different error behavior in edge cases where retries were previously succeeding incorrectly (by silently duplicating data).
Test Coverage
Automated test coverage for this specific bug is limited for two reasons:
Requires real S3: LocalStack returns
416 Range Not Satisfiablefor invalid ranges, which is actually non-compliant with RFC 7233. Real S3 correctly ignores the invalid range and returns200 OKwith the full file, which is what triggers the bug.Hard to trigger: The bug requires a timeout (or other error) from reqwest after all data is read but before EOF is signaled. This is timing-dependent and would require integration tests with long sleeps to force a timeout.
A manual reproduction script is available in this gist: https://gist.github.com/sarum90/cb95633750390db690ef701f08aa20ed
The script demonstrates the bug against real S3 and shows the data duplication.