@jzhou77 commented Nov 19, 2025

Similar to #12558, this PR changes how backup workers are monitored and recruited so that their failures do not trigger recoveries. After the fully_recovered state, a new actor, monitorAndRecruitBackupWorkers, takes over monitoring and re-recruitment of the backup workers of the current generation. Note that before the fully_recovered state, monitoring is still done in the original TagPartitionedLogSystem::onError_internal() actor. Monitoring backup workers of older generations is possible, but complicated by the fact that the number of log routers can be modified via a configuration change. Even so, this PR should be strictly better for availability: backup workers spend most of their time in the fully_recovered state, so most failures no longer cause a recovery, and fewer recoveries means higher availability.
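
In outline, the new actor's main loop looks roughly like the sketch below. This is a minimal sketch, not the shipped code: BackupWorkerFailure and waitForBackupWorkerFailure() are hypothetical stand-ins for the real failure-detection plumbing, while InitializeBackupRequest and recruitFailedBackupWorkers() are the names used by this PR.

```cpp
// Minimal sketch of the new monitoring actor (not the shipped code).
// BackupWorkerFailure and waitForBackupWorkerFailure() are hypothetical;
// InitializeBackupRequest and recruitFailedBackupWorkers() are named by
// this PR.
ACTOR Future<Void> monitorAndRecruitBackupWorkers(ClusterControllerData* self) {
	loop {
		// Block until a backup worker of the current generation fails.
		state BackupWorkerFailure failed = wait(waitForBackupWorkerFailure(self));

		// Re-recruit a replacement on a healthy process instead of
		// triggering a full recovery. The request carries
		// isReplacement = true and startVersion = 0, plus the failed
		// worker's backupEpoch and routerTag.
		wait(recruitFailedBackupWorkers(self, failed.backupEpoch, failed.routerTag));
	}
}
```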

Details of the implementation:

  1. During normal operation: Backup workers periodically call saveProgress() to save their progress to the database under the key backupProgressKeyFor(workerID), with a value containing {epoch, version, tag, totalTags}.

  2. When a backup worker fails:

    • monitorAndRecruitBackupWorkers() in ClusterController detects the failure
    • Calls recruitFailedBackupWorkers(), which creates an InitializeBackupRequest with:
      • isReplacement = true
      • startVersion = 0 (a sentinel; the actual start version is determined by the new backup worker)
      • backupEpoch and routerTag identifying the failed worker
  3. New backup worker starts:

    • Checks if (req.isReplacement && req.startVersion == 0)
    • Calls getSavedVersion(cx, workerID, backupEpoch, routerTag)
    • Scans all backup progress entries to find the one matching (epoch, tag)
    • Resumes from savedVersion + 1 (see the sketch after this list)
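
Putting steps 1 and 3 together, the replacement worker's start-up path is roughly the following. This is a sketch using the names from the list above (with self->myId standing in for the worker's ID); the exact signatures in the backup worker code may differ.

```cpp
// Replacement backup worker start-up (sketch; names follow the PR text).
if (req.isReplacement && req.startVersion == 0) {
	// getSavedVersion() scans all backupProgressKeyFor(...) entries and
	// returns the saved version of the entry whose (epoch, tag) matches
	// the failed worker this request replaces.
	Version saved = wait(getSavedVersion(cx, self->myId, req.backupEpoch, req.routerTag));

	// Resume immediately after the last durably saved version; nothing
	// at or below `saved` needs to be copied again.
	self->startVersion = saved + 1;
}
```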

20251218-005403-jzhou-142b8def2f2a0af9 compressed=True data_size=34249774 duration=5068735 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:56:16 sanity=False started=100000 stopped=20251218-015019 submitted=20251218-005403 timeout=5400 username=jzhou

Code-Reviewer Section

The general pull request guidelines can be found here.

Please verify each of the following items and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 06720ba
  • Duration 0:25:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 06720ba
  • Duration 0:38:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 06720ba
  • Duration 0:46:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 06720ba
  • Duration 0:54:30
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 06720ba
  • Duration 1:02:15
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 06720ba
  • Duration 1:10:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 06720ba
  • Duration 1:14:25
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@jzhou77 changed the title from "Rerecruit backup workers to avoid recoveries" to "Re-recruit backup workers to avoid recoveries" on Nov 19, 2025
The idea is that backup workers are stateless roles that can reconstruct their
state after failures, so we don't need to trigger recoveries unnecessarily.
By removing an unused parameter.
Before fully recovered, there are backup workers for the old generations. The
APIs for tracking these old generations and updating them are not well defined.
To simplify the work, I'll limit the monitoring to the current generation only.

Before fully recovered, any backup worker failure will cause another recovery,
i.e., the old behavior.
During normal operation, backup workers periodically call saveProgress() to save
their progress to the database under the key backupProgressKeyFor(workerID), with
a value containing {epoch, version, tag, totalTags}. So for the re-recruited
worker, let the new worker take over the old worker's progress and use that as
its start version.
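
For illustration, the saved record can be pictured as below. The field names follow the PR text; the actual struct and its serialization in the codebase may differ.

```cpp
// Illustrative layout of the value stored under backupProgressKeyFor(workerID).
struct WorkerBackupStatus {
	int64_t epoch;     // recovery generation the worker belongs to
	Version version;   // highest version whose mutations are durably saved
	Tag tag;           // log router tag this worker pulls from
	int32_t totalTags; // number of log router tags in this epoch
};
// key   = backupProgressKeyFor(workerID)
// value = serialized WorkerBackupStatus{epoch, version, tag, totalTags}
```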

20251119-040547-jzhou-78a15d307baeed3f
20251119-043027-jzhou-94321348b7ca1c02
It was running an infinite loop because there were no backup workers to monitor.
Change the function to return Never() in such cases.
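
Roughly, the guard looks like this (a sketch; `backupWorkers` is a hypothetical member holding the current generation's workers):

```cpp
// If the current generation has no backup workers (e.g. backup is not
// configured), there is nothing to monitor; return a future that never
// becomes ready instead of spinning in a busy loop.
if (self->backupWorkers.empty()) {
	return Never();
}
```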

20251119-050212-jzhou-c3691b1ed4c330a3
When there is a new recovery, let the recovery process handle the initial
backup worker recruitment.

20251119-054841-jzhou-1e367b6d83439c39
There could be a new recovery, in which case the worker is popping from a
previous epoch, so this assertion doesn't make sense. The pop() is always safe
to do because the mutations have already been saved.

20251119-173704-jzhou-31139c54035a3755
@jzhou77 force-pushed the rerecruit-backup-worker branch from 06720ba to 4ab570c on December 18, 2025 00:03
@foundationdb-ci

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 4ab570c
  • Duration 0:23:59
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 4ab570c
  • Duration 0:34:54
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 4ab570c
  • Duration 0:45:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 4ab570c
  • Duration 0:48:05
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 4ab570c
  • Duration 0:50:01
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 4ab570c
  • Duration 0:56:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 4ab570c
  • Duration 2:00:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)
