@driftx driftx commented Sep 4, 2025

What is the issue

AggregationQueriesTest fails quite often because it depends on timing.

What does this PR fix and why was it fixed

Make the tests more predictable by explicitly controlling pagination behavior rather than relying on timing

github-actions bot commented Sep 4, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

driftx commented Sep 4, 2025

As evidenced by this comment, this test is very dependent on timings:

// single page read should fit in the range timeout, but multiple pages should not;
// the query should complete nevertheless because aggregate timeout is large

It was difficult to find timings that weren't flaky, and ultimately I decided there was no way to guarantee stability in other environments, even if I managed to stabilize my own. Instead, the test is now more deterministic: each case uses an appropriate sub-page size to control the number of page fetches:

  • testAggregationQueryShouldTimeoutWhenSinglePageReadExceedesReadTimeout: Uses 1KB pages to ensure multiple fetches trigger the timeout
  • testAggregationQueryShouldNotTimeoutWhenItExceedsReadTimeout: Uses 64KB pages for moderate page fetches without timing out
  • testAggregationQueryShouldTimeoutWhenSinglePageReadIsFastButAggregationExceedesTimeout: Uses 1KB pages to force many fetches that exceed the aggregation timeout
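
To illustrate why the page size pins down the number of fetches, here is a rough back-of-envelope sketch. It is not code from this PR: the class name, the method name, and the 100-byte average row size are all hypothetical, and real paging has per-row overhead that this simple model ignores.

```java
// Back-of-envelope model only; not taken from the PR.
final class PageFetchEstimate {

    /** Ceiling division: pages needed to cover all the data at a given page size. */
    static long estimatePageFetches(long rows, long rowSizeBytes, long pageSizeBytes) {
        if (rows < 0 || rowSizeBytes <= 0 || pageSizeBytes <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and sizes must be > 0");
        }
        long totalBytes = Math.multiplyExact(rows, rowSizeBytes);
        return (totalBytes + pageSizeBytes - 1) / pageSizeBytes;
    }

    public static void main(String[] args) {
        long rows = 7_500;        // rows per partition, as stated in the PR description
        long rowSizeBytes = 100;  // ASSUMPTION: hypothetical average row size

        System.out.println("1KB pages:  " + estimatePageFetches(rows, rowSizeBytes, 1024));
        System.out.println("64KB pages: " + estimatePageFetches(rows, rowSizeBytes, 64 * 1024));
    }
}
```

Under the assumed row size, 1KB pages need 733 fetches and 64KB pages need 12. The fetch count is therefore set by the configuration rather than by how fast the machine happens to be.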

Tracking of page reads was also added so that the tests can verify that multiple pages were exercised. The increased determinism allowed the timeouts and delays to be reduced. The data volume was also reduced from 40k rows per partition to 7.5k to help avoid flakiness; that is still enough data to require a guardrail adjustment.
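
The idea of tracking page reads can be sketched as a counter that increments on every page fetch, combined with a simulated clock so the outcome does not depend on wall time. This is a minimal illustration, not the test's real implementation: `CountingPager`, `countWithBudget`, the 5 ms per-page cost, and the 1000 ms budget are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only; names and numbers are assumptions, not taken from the PR.
final class PagedAggregationSketch {

    /** Serves rows page by page and counts how many pages were fetched. */
    static final class CountingPager {
        private final int totalRows;
        private final int rowsPerPage;
        private final AtomicInteger pageReads = new AtomicInteger();
        private int nextRow = 0;

        CountingPager(int totalRows, int rowsPerPage) {
            if (totalRows < 0 || rowsPerPage <= 0) {
                throw new IllegalArgumentException("invalid pager configuration");
            }
            this.totalRows = totalRows;
            this.rowsPerPage = rowsPerPage;
        }

        boolean isExhausted() {
            return nextRow >= totalRows;
        }

        /** Fetches the next page and returns the number of rows it contained. */
        int fetchPage() {
            pageReads.incrementAndGet();
            int rowsInPage = Math.min(rowsPerPage, totalRows - nextRow);
            nextRow += rowsInPage;
            return rowsInPage;
        }

        int pageReads() {
            return pageReads.get();
        }
    }

    /**
     * COUNT(*)-style aggregation over all pages using simulated time.
     * Returns the row count, or -1 if the simulated elapsed time exceeds the budget.
     */
    static long countWithBudget(CountingPager pager, long perPageCostMillis, long budgetMillis) {
        long elapsedMillis = 0;
        long count = 0;
        while (!pager.isExhausted()) {
            count += pager.fetchPage();
            elapsedMillis += perPageCostMillis; // simulated cost, no real sleeping
            if (elapsedMillis > budgetMillis) {
                return -1;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        CountingPager smallPages = new CountingPager(7_500, 10);
        System.out.println("small pages result: " + countWithBudget(smallPages, 5, 1_000)
                           + ", page reads: " + smallPages.pageReads());

        CountingPager largePages = new CountingPager(7_500, 655);
        System.out.println("large pages result: " + countWithBudget(largePages, 5, 1_000)
                           + ", page reads: " + largePages.pageReads());
    }
}
```

With these assumed numbers, the small-page run exceeds the budget after many fetches while the large-page run finishes in 12. A test can then assert on `pageReads()` to confirm that more than one page was exercised, instead of inferring it from elapsed time.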

sonarqubecloud bot commented Sep 4, 2025

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-1984 rejected by Butler


3 regressions found
See build details here


Found 3 new test failures

| Test | Explanation | Runs | Upstream |
|------|-------------|------|----------|
| o.a.c.distributed.test.NativeProtocolTest.withClientRequests | REGRESSION | 🔴 | 0 / 2 |
| o.a.c.distributed.test.repair.ForceRepairTest.forceWithDifference () | REGRESSION | 🔴 | 1 / 2 |
| o.a.c.metrics.TrieMemtableMetricsTest.testContentionMetrics (compression) | REGRESSION | 🔴 | 0 / 2 |

Found 8 known test failures

@driftx driftx requested a review from djatnieks September 4, 2025 18:44
@driftx driftx merged commit bac9230 into main-5.0 Sep 5, 2025
573 of 588 checks passed
@driftx driftx deleted the CNDB-15265 branch September 5, 2025 17:33
3 participants