
Conversation

@frankdavid
Contributor

If a bad GuestOS version is released and nodes die, they have the option to roll back locally to the previous version. However, once the old GuestOS version boots, the node would immediately try to upgrade to the broken version again. This PR introduces a new mechanism (recalled_replica_version) that stops nodes from upgrading to the latest version if that version has been recalled. A GuestOS version can be recalled by adding it to the corresponding field in the SetSubnetOperationalLevel proposal. Unlike other fields, the recalled GuestOS versions are read from the latest registry version (not the registry version referenced by the latest CUP): a broken/stalled subnet produces no CUPs, so the registry version containing the recalled GuestOS version would otherwise never take effect.
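
For illustration, a minimal sketch of the gating idea, with plain strings standing in for the real registry types; the function name and signature are hypothetical:

// Decide whether to upgrade to the subnet's assigned version. If that version
// appears in the recalled list (taken from the latest registry version), skip
// the upgrade so the node stays on the locally rolled-back version.
fn upgrade_target(
    assigned_version: &str,
    recalled_replica_version_ids: &[String],
) -> Option<String> {
    if recalled_replica_version_ids
        .iter()
        .any(|v| v.as_str() == assigned_version)
    {
        None
    } else {
        Some(assigned_version.to_string())
    }
}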

@github-actions github-actions bot added the feat label Dec 23, 2025
ssh_backup_access: vec![],
chain_key_config: None,
canister_cycles_cost_schedule: CanisterCyclesCostSchedule::Normal as i32,
recalled_replica_version_ids: vec![],
Contributor Author

@frankdavid frankdavid Dec 23, 2025

Should we continue using the replica version naming, or use the more up-to-date recalled_guestos_version_ids name?

Contributor

@pierugo-dfinity pierugo-dfinity left a comment

Looks nice! Did a first pass and will continue later

// even if we manage to rollback the GuestOS locally, the GuestOS would automatically try
// to upgrade to the broken GuestOS again. We can use this field to prevent that.
// While nodes read the recalled_replica_version_ids from the registry version from the CUP,
// they check the latest registry version for recalled_replica_version_ids.
Contributor

Suggested change
// they check the latest registry version for recalled_replica_version_ids.
// While nodes read the replica_version_id from the registry version from the CUP,
// they check the latest registry version for recalled_replica_version_ids.

Contributor

Sorry, the formatting is off; I meant to comment on lines 86+87: replace the first recalled_replica_version_ids with replica_version_id.

@@ -185,13 +194,30 @@ fn validate_node_ssh_access(node_ssh_access: &NodeSshAccess) -> Result<(), Strin
Ok(())
}

fn validate_recalled_replica_version_ids(
recalled_replica_version_ids: &Option<Vec<String>>,
) -> Result<(), String> {
Contributor

As an additional check, should we also check that the version that we are trying to recall is currently elected?
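
A rough sketch of what that extra validation could look like; the elected_version_ids parameter and the error message are hypothetical stand-ins for however the elected versions are actually looked up:

// Reject a recall request for any version that is not currently elected.
fn validate_recalled_replica_version_ids(
    recalled_replica_version_ids: &Option<Vec<String>>,
    elected_version_ids: &[String],
) -> Result<(), String> {
    for id in recalled_replica_version_ids.iter().flatten() {
        if !elected_version_ids.contains(id) {
            return Err(format!("cannot recall version {id}: it is not currently elected"));
        }
    }
    Ok(())
}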

@@ -160,7 +161,7 @@ impl Upgrade {
/// Checks for a new release package, and if found, upgrades to this release
/// package
pub(crate) async fn check(&mut self) -> OrchestratorResult<OrchestratorControlFlow> {
let latest_registry_version = self.registry.get_latest_version();
let latest_registry_version = self.replicate_latest_registry_version().await?;
Contributor

@pierugo-dfinity pierugo-dfinity Dec 23, 2025

I think this implementation could introduce a race condition since both the registry replicator and this upgrade loop would write to disk at the same time. Also, that would mean that we are contacting the real registry canister (through the network) every time the upgrade loop runs. I would prefer to do this during the registry replicator’s initialization.

What I mean is to remove that responsibility from the upgrade loop, and instead ensure that this upgrade loop never runs if the local store has not been initialized to the latest certified registry version. I am thinking of moving the logic to RegistryReplicator::initialize_local_store, where instead of returning early only if the local store is non-empty, we would return early only if the local store already contains the latest certified registry version. And then eventually modify the rest of the function to initialize the local store up to the latest certified registry version: see my next comment, I am not sure that calling get_certified_changes_since in a loop (which is almost identical to your implementation of replicate_latest_registry_version) guarantees that we reach the latest certified registry version.
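
A minimal sketch of that suggested flow, with toy types standing in for the real RegistryCanister and local store APIs (all names here are hypothetical):

// Mock of the registry canister: knows its latest certified version and can
// return (possibly partial) batches of changes since a given version.
struct RegistryCanister {
    certified_latest: u64,
}

impl RegistryCanister {
    fn get_certified_latest_version(&self) -> u64 {
        self.certified_latest
    }

    fn get_certified_changes_since(&self, version: u64) -> Vec<u64> {
        (version + 1..=self.certified_latest).take(10).collect()
    }
}

struct LocalStore {
    latest_version: u64,
}

// Return early only once the local store has caught up with the latest
// *certified* registry version, instead of whenever the store is non-empty.
fn initialize_local_store(store: &mut LocalStore, canister: &RegistryCanister) {
    let target = canister.get_certified_latest_version();
    while store.latest_version < target {
        for v in canister.get_certified_changes_since(store.latest_version) {
            store.latest_version = v;
        }
    }
}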

}
Ok(latest_registry_version)
}

Contributor

@pierugo-dfinity pierugo-dfinity Dec 23, 2025

Question: the implementation of RegistryReplicator::poll calls get_certified_changes_since. Does calling that function in a loop until we no longer have new records guarantee that we have reached the latest certified registry version? I think we should use the get_certified_latest_version method of the registry canister instead.

new_replica_version,
subnet_id,
latest_registry_version
);
Contributor

@pierugo-dfinity pierugo-dfinity Dec 23, 2025

I think we should return from the function here, otherwise the ensure_replica_is_running call below would start the replica. AFAIU, if we hit this branch, we do not want to start the replica, but only ensure that the orchestrator keeps running so that SSH keys can be deployed and, finally, a Recovery CUP can be proposed.

Note: in the current implementation, if we return an error here, SSH keys would not be applied, but they would be if #7868 gets merged. An alternative would be to return Ok(OrchestratorControlFlow::Assigned(subnet_id)), but that feels less "right", since we would be assigned but the replica would not be running.
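
To make the two alternatives concrete, a toy sketch where the enum mirrors OrchestratorControlFlow and everything else is hypothetical:

#[derive(Debug)]
enum ControlFlow {
    Assigned(u64),
    Stop,
}

fn handle_recalled_target(subnet_id: u64) -> Result<ControlFlow, String> {
    // Option A: bail with an error; in the current implementation this would
    // also skip SSH key deployment.
    // return Err(format!("target version for subnet {subnet_id} was recalled"));

    // Option B: report "assigned" without starting the replica, so the
    // orchestrator keeps running and SSH keys / a Recovery CUP can follow.
    Ok(ControlFlow::Assigned(subnet_id))
}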

);
})
.await;
}
Contributor

Nice test, thank you! To go further, once we arrive here, we could put a Recovery CUP (that has registry version 3) in the registry at version 3. Then verify that upgrade.check().await.unwrap() returns an OrchestratorControlFlow::Stop, since it should upgrade to current_version.

Maybe you'd need to manually call upgrade.set_prepared_version beforehand, otherwise it will actually try to download the image.
