@BenPope BenPope commented Nov 7, 2025

There is a class of failures that results in errors such as:

ducktape.errors.TimeoutError: Timed out waiting for status endpoint KgoVerifierConsumerGroupConsumer-0-140503148073056 to be available

This is due to:

time="2025-11-07T15:15:05Z" level=info msg="Reading with consumer group source-cg-source-topic"
time="2025-11-07T15:15:05Z" level=error msg="More partitions in valid_offsets file than in topic!"

The underlying cause is that the consumer runs on a different node than the producer.

The fix is in two parts:

  • Wait for the file before creating the consumer
  • Run the consumer on the same node as the producer
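
The first part of the fix, waiting for the producer's file before creating the consumer, can be sketched as a generic polling loop. This is an illustration only and not the code in this PR: `wait_for_file`, its parameters, and the example file name are hypothetical, and the sketch checks a local path, whereas the real test would have to inspect the file on the producer's test node.

```python
import os
import tempfile
import time


def wait_for_file(path, timeout_sec=30.0, backoff_sec=0.5):
    """Poll until `path` exists and is non-empty, or raise TimeoutError.

    Hypothetical helper: the name and defaults are assumptions made
    for this sketch, not part of the test framework.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} not available after {timeout_sec}s")
        time.sleep(backoff_sec)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the file written by the producer (name is illustrative).
        offsets = os.path.join(tmp, "valid_offsets.json")
        with open(offsets, "w") as f:
            f.write("{}")
        wait_for_file(offsets, timeout_sec=1.0, backoff_sec=0.1)
        print("file is ready; the consumer can be created now")
```

Checking for a non-empty file rather than mere existence is a design choice of this sketch: it avoids treating a file that has been created but not yet written as ready.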

Fixes https://redpandadata.atlassian.net/browse/CORE-14327
Fixes https://redpandadata.atlassian.net/browse/CORE-14518

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@BenPope BenPope self-assigned this Nov 7, 2025
Copilot AI review requested due to automatic review settings November 7, 2025 19:51

Copilot AI left a comment

Pull Request Overview

This PR fixes a race condition in the ClusterLinkingProgressVerifier test that caused timeout errors when the consumer tried to read before the producer's offset map file was available. The fix ensures the producer's offset map is ready before creating the consumer and runs the consumer on the same node as the producer to avoid file access issues.

  • Adds a wait for the producer's offset map file before creating the consumer
  • Configures the consumer to run on the same nodes as the producer
  • Reduces cluster node count from 8-10 to 7 across multiple tests (now that consumer runs on producer's node)

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/rptest/tests/cluster_linking_test_base.py | Adds wait for offset map and configures consumer to use producer's nodes |
| tests/rptest/tests/cluster_linking_e2e_test.py | Reduces num_nodes from 8-10 to 7 for multiple test methods |
| tests/rptest/scale_tests/cluster_linking_many_partitions_test.py | Reduces num_nodes from 8 to 7 for scale test |

vbotbuildovich commented Nov 7, 2025

Retry command for Build#75863

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":false}
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true}

@vbotbuildovich

CI test results

test results on build#75863
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff5-7790-4528-96ca-b125c0e2057f | FLAKY | 17/21 | upstream reliability is '93.67816091954023'. current run reliability is '80.95238095238095'. drift is 12.72578 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| DataMigrationsApiTest | test_creating_and_listing_migrations | null | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff5-7795-4a8c-8f0d-4131c19dcbf4 | FLAKY | 18/21 | upstream reliability is '95.30516431924883'. current run reliability is '85.71428571428571'. drift is 9.59088 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations |
| DataMigrationsApiTest | test_creating_and_listing_migrations | null | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff8-4b1d-4df9-92b1-4d983d9cd604 | FLAKY | 16/21 | upstream reliability is '95.30516431924883'. current run reliability is '76.19047619047619'. drift is 19.11469 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations |
| FollowerFetchingTest | test_follower_fetching_with_maintenance_mode | {"fetch_from": "fetch-from-cloud-topic"} | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff5-7793-4b8b-a870-b428cf4b39d0 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FollowerFetchingTest&test_method=test_follower_fetching_with_maintenance_mode |
| ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff5-778f-47a6-9907-d8916851dfce | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff8-4b14-44e8-8060-6bbf6f475a18 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff5-7790-4528-96ca-b125c0e2057f | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff8-4b17-44aa-b498-5c6a7da07061 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| TieredStorageTest | test_tiered_storage | {"cloud_storage_type_and_url_style": [2, "virtual_host"], "test_case": {"name": "(TS_Read == True, TS_Timequery == True)"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75863#019a5ff8-4b13-4f05-a589-d4bf6cec46bd | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TieredStorageTest&test_method=test_tiered_storage |
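
The numbers in the `reason` column appear to follow simple arithmetic: the current run reliability is the pass count as a percentage of 21 retries, and the drift is the upstream reliability minus the current run reliability. The sketch below reproduces the reported drift values under that reading. The function names and the comparison against the allowed drift are assumptions inferred from the report text, not the CI bot's actual implementation.

```python
def reliability(passed, total):
    """Percentage of passing runs, e.g. 17 of 21."""
    return 100.0 * passed / total


def drift(upstream, current):
    """Upstream reliability minus current reliability, to 5 decimals."""
    return round(upstream - current, 5)


def within_allowed_drift(upstream, current, allowed=50.0):
    # Assumption: a test "should PASS" when its drift does not
    # exceed the allowed drift quoted in the report.
    return drift(upstream, current) <= allowed


if __name__ == "__main__":
    # ShadowLinkingReplicationTests.test_replication_basic: 17/21 passed.
    upstream = 93.67816091954023
    current = reliability(17, 21)
    print(drift(upstream, current))                 # reported as 12.72578
    print(within_allowed_drift(upstream, current))  # True under the assumed rule
```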

There is a class of failures that results in errors such as:
```
ducktape.errors.TimeoutError: Timed out waiting for status endpoint KgoVerifierConsumerGroupConsumer-0-140503148073056 to be available
```

This is due to:
```
time="2025-11-07T15:15:05Z" level=info msg="Reading with consumer group source-cg-source-topic"
time="2025-11-07T15:15:05Z" level=error msg="More partitions in valid_offsets file than in topic!"
```

And the reason for that is that the consumer is running on a different
node than the producer.

The fix is in two parts:
* Wait for the file before creating the consumer
* Run the consumer on the same node as the producer

Signed-off-by: Ben Pope <[email protected]>
@BenPope BenPope force-pushed the cl/test_replication_basic/time_out_waiting_for_KgoVerifierConsumerGroupConsumer branch from 6018b22 to b916060 November 10, 2025 09:19

BenPope commented Nov 10, 2025

/ci-repeat 1
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":false}
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true}

@BenPope BenPope merged commit 27053c5 into redpanda-data:dev Nov 10, 2025
18 checks passed