Re-recruit backup workers to avoid recoveries #12564
Open: jzhou77 wants to merge 8 commits into apple:main from jzhou77:rerecruit-backup-worker
Result of foundationdb-pr-clang-ide on Linux RHEL 9
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
Result of foundationdb-pr-clang-arm on Linux CentOS 7
Result of foundationdb-pr-macos on macOS Ventura 13.x
Result of foundationdb-pr on Linux RHEL 9
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
Result of foundationdb-pr-clang on Linux RHEL 9
The idea is that backup workers are stateless roles that can reconstruct their state after a failure, so a backup worker failure does not need to trigger a recovery.
By removing an unused parameter.
Before the cluster is fully recovered, there are backup workers for the old generations. The APIs for tracking and updating these old generations are not well defined. To simplify the work, I'll limit the monitoring to the current generation only. Before full recovery, any backup worker failure still causes another recovery, i.e., the old behavior.
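The split described above can be sketched as a small decision function. This is a minimal illustration in plain C++, not FoundationDB's Flow code: RecoveryState, FailureAction, and onBackupWorkerFailure are hypothetical names, and the epoch comparison is only an assumption about how "current generation" might be checked.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins; none of these names come from FoundationDB.
enum class RecoveryState { Recruiting, AcceptingCommits, FullyRecovered };
enum class FailureAction { TriggerRecovery, RerecruitWorker };

// Decide how to react when a backup worker fails.
// Only current-generation workers in the fully recovered state are re-recruited;
// every other case falls back to the old behavior (another recovery).
FailureAction onBackupWorkerFailure(RecoveryState state,
                                    int64_t workerEpoch,
                                    int64_t currentEpoch) {
    if (state == RecoveryState::FullyRecovered && workerEpoch == currentEpoch) {
        return FailureAction::RerecruitWorker;
    }
    return FailureAction::TriggerRecovery;
}
```

Keeping the fallback as the default branch means any case the new monitoring does not understand behaves exactly as before.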
During normal operation, backup workers periodically call saveProgress() to save
their progress to the database using key backupProgressKeyFor(workerID) with
value containing {epoch, version, tag, totalTags}. So for a re-recruited worker,
let the new worker take over the old worker's progress and use that as its start
version.
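The takeover can be sketched as a lookup of the failed worker's saved record. This is a minimal sketch in plain C++ under stated assumptions: the std::map stands in for the database, BackupProgress mirrors the {epoch, version, tag, totalTags} value described above, and startVersionForReplacement plus the fallbackVersion parameter are hypothetical. Whether the real code adjusts the saved version (for example by one) is not stated here, so the sketch uses it unchanged.

```cpp
#include <cstdint>
#include <map>
#include <string>

using Version = int64_t;

// Mirrors the saved value {epoch, version, tag, totalTags}; field types are assumed.
struct BackupProgress {
    int64_t epoch;
    Version version;
    int tag;
    int totalTags;
};

// In-memory stand-in for the progress keys, indexed by worker ID (assumption).
using ProgressStore = std::map<std::string, BackupProgress>;

// Pick the start version for a worker that replaces failedWorkerId.
// If the failed worker saved progress, the replacement resumes from it;
// otherwise it uses a caller-supplied fallback (hypothetical behavior).
Version startVersionForReplacement(const ProgressStore& store,
                                   const std::string& failedWorkerId,
                                   Version fallbackVersion) {
    auto it = store.find(failedWorkerId);
    if (it == store.end()) {
        return fallbackVersion;
    }
    return it->second.version;
}
```

Because the progress is keyed by the old worker's ID, the replacement needs that ID at recruitment time in order to find the record.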
20251119-040547-jzhou-78a15d307baeed3f
20251119-043027-jzhou-94321348b7ca1c02
It was looping forever because there were no backup workers to monitor. Change the function to return Never() in such cases. 20251119-050212-jzhou-c3691b1ed4c330a3
When there is a new recovery, let the recovery process handle the initial backup worker recruitment. 20251119-054841-jzhou-1e367b6d83439c39
There could be a new recovery, so the worker may be popping from a previous epoch. As a result, this assertion doesn't make sense. The pop() is always safe because the mutations have already been saved. 20251119-173704-jzhou-31139c54035a3755
06720ba to
4ab570c
Result of foundationdb-pr-clang-ide on Linux RHEL 9
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
Result of foundationdb-pr-clang-arm on Linux CentOS 7
Result of foundationdb-pr-clang on Linux RHEL 9
Result of foundationdb-pr-macos on macOS Ventura 13.x
Result of foundationdb-pr on Linux RHEL 9
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
Similar to #12558, this PR changes how backup workers are monitored and recruited to avoid recoveries. After the fully_recovered state is reached, a new actor monitorAndRecruitBackupWorkers takes over the monitoring and re-recruitment of backup workers for the current generation. Note that before the fully_recovered state, the monitoring is still done in the original TagPartitionedLogSystem::onError_internal() actor. Monitoring backup workers for older generations is possible, but complicated by the fact that the number of log routers can be modified via a configuration change. However, this PR should be strictly better in terms of availability, because in most cases backup workers are running in the fully_recovered state, and reducing recoveries improves availability.
Details of the implementation:
During normal operation: Backup workers periodically call saveProgress() to save their progress to the database using key backupProgressKeyFor(workerID) with value containing {epoch, version, tag, totalTags}.
When a backup worker fails:
New backup worker starts:
20251218-005403-jzhou-142b8def2f2a0af9 compressed=True data_size=34249774 duration=5068735 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:56:16 sanity=False started=100000 stopped=20251218-015019 submitted=20251218-005403 timeout=5400 username=jzhou
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch or main if this is the youngest branch)