@joe-redpanda
Contributor

@joe-redpanda joe-redpanda commented Nov 6, 2025

...immutable

nodes_decommissioning_test.py::NodesDecommissioningTest.test_decommissioning_node_rf_1_replica

would intermittently fail because partitions were not being reported as allocation failures. The cause was a race: a partition would NOT be reported as an allocation failure if it had a move in progress.

In this test, the node is stopped and then decommissioned. As a result, the broker could be picked up as unresponsive, which would initiate a move before the decommission was made visible to the planner.

This commit changes pbp (the partition balancer planner) to report partitions with an in-progress move and quorum loss on the original replica set as immutable.
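
As a rough illustration of the behavior change described above, here is a minimal Python sketch. It is not the planner's code (the actual change lives in Redpanda's C++ partition balancer planner); `Partition`, `Outcome`, `classify_before`, and `classify_after` are hypothetical names made up for this example, and the "before" branch is a simplification of "not reported as an allocation failure".

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    NOT_REPORTED = auto()     # partition is skipped, nothing is surfaced
    IMMUTABLE = auto()        # partition is reported as immutable
    NORMAL_PLANNING = auto()  # handled by the existing planner paths (not modelled here)


@dataclass
class Partition:
    move_in_progress: bool
    original_has_quorum: bool  # quorum on the replica set the move started from


def classify_before(p: Partition) -> Outcome:
    # Simplified old behavior: any in-progress move hides the partition from reporting.
    if p.move_in_progress:
        return Outcome.NOT_REPORTED
    return Outcome.NORMAL_PLANNING


def classify_after(p: Partition) -> Outcome:
    # Simplified new behavior: an in-progress move whose original replica set
    # has lost quorum is surfaced as immutable instead of being skipped.
    if p.move_in_progress and not p.original_has_quorum:
        return Outcome.IMMUTABLE
    if p.move_in_progress:
        return Outcome.NOT_REPORTED
    return Outcome.NORMAL_PLANNING


# The racy test case: RF=1, the only replica's node is stopped, and a move
# was started before the decommission became visible to the planner.
racy = Partition(move_in_progress=True, original_has_quorum=False)
print(classify_before(racy).name, "->", classify_after(racy).name)
```

In this model the racy partition goes from not being reported at all to being reported as immutable, which is the outcome the test waits for.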

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Improvements

  • partitions with an in-flight move and original replica quorum loss will now be reported as immutable

@joe-redpanda
Contributor Author

/ci-repeat 1

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 6, 2025

CI test results

test results on build#75724

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ShadowLinkingReplicationTests | test_topic_delete | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75724#019a57b8-c47a-440a-b737-fee55e0619b9 | FLAKY | 20/21 | upstream reliability is '99.7584541062802'. current run reliability is '95.23809523809523'. drift is 4.52036 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_topic_delete |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75724#019a57b8-c484-471b-9366-78f92601c7dd | FLAKY | 19/21 | upstream reliability is '90.0925925925926'. current run reliability is '90.47619047619048'. drift is -0.3836 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |

test results on build#75764

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75764#019a5a01-c31c-46e3-a17e-95a99496a739 | FLAKY | 19/21 | upstream reliability is '96.55581947743468'. current run reliability is '90.47619047619048'. drift is 6.07963 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75764#019a5a01-c31b-4fae-b894-8533847f43aa | FLAKY | 20/21 | upstream reliability is '97.796817625459'. current run reliability is '95.23809523809523'. drift is 2.55872 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
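
The numbers in the reason column above are consistent with a simple calculation: current run reliability is the passed fraction as a percentage, and drift is upstream reliability minus current run reliability. The Python sketch below reproduces that arithmetic. It is inferred from the reported figures only, not taken from the bot's source; the function names and the `<=` comparison against the allowed drift are assumptions.

```python
def reliability(passed: int, total: int) -> float:
    """Percentage of passing runs, e.g. 20 of 21."""
    return 100.0 * passed / total


def drift(upstream: float, current: float) -> float:
    """Positive when the current run is less reliable than upstream."""
    return upstream - current


def verdict(upstream: float, passed: int, total: int, allowed_drift: float = 50.0) -> str:
    # Assumption: the test passes while drift stays within the allowed drift.
    d = drift(upstream, reliability(passed, total))
    return "PASS" if d <= allowed_drift else "FAIL"


# First row of build#75724: upstream 99.7584541062802, 20/21 passed.
d = drift(99.7584541062802, reliability(20, 21))
print(f"{d:.5f}", verdict(99.7584541062802, 20, 21))
```

A negative drift, as in the `test_crash_all` row, means the current run was slightly more reliable than the upstream history.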

@joe-redpanda joe-redpanda marked this pull request as ready for review November 6, 2025 15:58
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a race condition in the partition balancer planner where partitions that have lost quorum during an in-progress move were not being reported as allocation failures. The fix ensures that partitions with an in-flight move and quorum loss on the original replica set are now correctly reported as immutable.

Key Changes:

  • Added quorum loss detection for partitions during in-progress moves
  • Partitions that lost quorum during moves are now reported as immutable with no_quorum reason
  • Restructured control flow to handle the new quorum loss case before attempting cancellations
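
A hedged sketch of the ordering the last bullet describes: check quorum on the original replica set first, and only consider cancellation if quorum is intact. The majority rule is the standard Raft quorum condition; `has_quorum`, `handle_in_progress_move`, and the returned tuples are hypothetical names for illustration, not the planner's actual C++ API.

```python
from typing import Iterable, Optional, Set, Tuple


def has_quorum(replicas: Iterable[int], unavailable: Set[int]) -> bool:
    """True if a strict majority of the replica set is on available nodes."""
    replicas = list(replicas)
    alive = sum(1 for node in replicas if node not in unavailable)
    return alive > len(replicas) // 2


def handle_in_progress_move(
    original_replicas: Iterable[int], unavailable: Set[int]
) -> Tuple[str, Optional[str]]:
    # New case, checked first: the original replica set cannot make progress,
    # so the partition is reported as immutable with a no_quorum reason.
    if not has_quorum(original_replicas, unavailable):
        return ("immutable", "no_quorum")
    # Otherwise fall through to the existing path, which may cancel the move.
    return ("consider_cancellation", None)


# RF=1 with the single replica on a stopped node: quorum is lost.
print(handle_in_progress_move([1], {1}))
# RF=3 with one node down: quorum holds, the existing path applies.
print(handle_in_progress_move([1, 2, 3], {1}))
```

With a replication factor of 1, stopping the only replica's node always loses quorum, which is why the RF=1 test was the one exposing the race.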

Contributor

@bharathv bharathv left a comment

link the jira?

@joe-redpanda
Contributor Author

@joe-redpanda joe-redpanda merged commit 9947b85 into redpanda-data:dev Nov 21, 2025
22 of 24 checks passed
@vbotbuildovich
Collaborator

/backport v25.3.x

@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x
