
Conversation

@frankdavid
Contributor

If a bad GuestOS version is released and nodes die, they have the option to roll back locally to the previous version. However, once the old GuestOS version boots, the node would immediately try to upgrade to the broken version again. This PR introduces a new mechanism (recalled_replica_version) that stops nodes from upgrading to the latest version if that version has been recalled. A GuestOS version can be recalled by adding it to the corresponding field in the SetSubnetOperationalLevel proposal. Unlike other fields, the recalled GuestOS versions are read from the latest registry version (not the registry version referenced by the latest CUP): a broken/stalled subnet produces no CUPs, so the registry version containing the recalled GuestOS version would otherwise never take effect.
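
For illustration, a minimal sketch of the gating idea, with plain strings standing in for the real registry types; the function name and signature are hypothetical:

// Decide whether to upgrade to the subnet's assigned version. If that version
// appears in the recalled list (taken from the latest registry version), skip
// the upgrade so the node stays on the locally rolled-back version.
fn upgrade_target(
    assigned_version: &str,
    recalled_replica_version_ids: &[String],
) -> Option<String> {
    if recalled_replica_version_ids
        .iter()
        .any(|v| v.as_str() == assigned_version)
    {
        None
    } else {
        Some(assigned_version.to_string())
    }
}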

@github-actions github-actions bot added the feat label Dec 23, 2025
ssh_backup_access: vec![],
chain_key_config: None,
canister_cycles_cost_schedule: CanisterCyclesCostSchedule::Normal as i32,
recalled_replica_version_ids: vec![],
Contributor Author

@frankdavid frankdavid Dec 23, 2025

Should we continue using the replica version naming, or use the more up-to-date recalled_guestos_version_ids name?

Contributor

@pierugo-dfinity pierugo-dfinity left a comment

Looks nice! Did a first pass and will continue later

// even if we manage to rollback the GuestOS locally, the GuestOS would automatically try
// to upgrade to the broken GuestOS again. We can use this field to prevent that.
// While nodes read the recalled_replica_version_ids from the registry version from the CUP,
// they check the latest registry version for recalled_replica_version_ids.
Contributor

Suggested change
// they check the latest registry version for recalled_replica_version_ids.
// While nodes read the replica_version_id from the registry version from the CUP,
// they check the latest registry version for recalled_replica_version_ids.

Contributor

Sorry, the formatting is off; I meant to comment on lines 86+87: replace the first recalled_replica_version_ids with replica_version_id.

@@ -185,13 +194,30 @@ fn validate_node_ssh_access(node_ssh_access: &NodeSshAccess) -> Result<(), Strin
Ok(())
}

fn validate_recalled_replica_version_ids(
recalled_replica_version_ids: &Option<Vec<String>>,
) -> Result<(), String> {
Contributor

As an additional check, should we also check that the version that we are trying to recall is currently elected?
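
A rough sketch of what that extra validation could look like; the elected_version_ids parameter and the error message are hypothetical stand-ins for however the elected versions are actually looked up:

// Reject a recall request for any version that is not currently elected.
fn validate_recalled_replica_version_ids(
    recalled_replica_version_ids: &Option<Vec<String>>,
    elected_version_ids: &[String],
) -> Result<(), String> {
    for id in recalled_replica_version_ids.iter().flatten() {
        if !elected_version_ids.contains(id) {
            return Err(format!("cannot recall version {id}: it is not currently elected"));
        }
    }
    Ok(())
}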

@@ -160,7 +161,7 @@ impl Upgrade {
/// Checks for a new release package, and if found, upgrades to this release
/// package
pub(crate) async fn check(&mut self) -> OrchestratorResult<OrchestratorControlFlow> {
let latest_registry_version = self.registry.get_latest_version();
let latest_registry_version = self.replicate_latest_registry_version().await?;
Contributor

@pierugo-dfinity pierugo-dfinity Dec 23, 2025

I think this implementation could introduce a race condition since both the registry replicator and this upgrade loop would write to disk at the same time. Also, that would mean that we are contacting the real registry canister (through the network) every time the upgrade loop runs. I would prefer to do this during the registry replicator’s initialization.

What I mean is to remove that responsibility from the upgrade loop, and instead ensure that this upgrade loop never runs if the local store has not been initialized to the latest certified registry version. I am thinking of moving the logic to RegistryReplicator::initialize_local_store, where instead of returning early only if the local store is non-empty, we would return early only if the local store already contains the latest certified registry version. And then eventually modify the rest of the function to initialize the local store up to the latest certified registry version: see my next comment, I am not sure that calling get_certified_changes_since in a loop (which is almost identical to your implementation of replicate_latest_registry_version) guarantees that we reach the latest certified registry version.
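
A minimal sketch of that suggested flow, with toy types standing in for the real RegistryCanister and local store APIs (all names here are hypothetical):

// Mock of the registry canister: knows its latest certified version and can
// return (possibly partial) batches of changes since a given version.
struct RegistryCanister {
    certified_latest: u64,
}

impl RegistryCanister {
    fn get_certified_latest_version(&self) -> u64 {
        self.certified_latest
    }

    fn get_certified_changes_since(&self, version: u64) -> Vec<u64> {
        (version + 1..=self.certified_latest).take(10).collect()
    }
}

struct LocalStore {
    latest_version: u64,
}

// Return early only once the local store has caught up with the latest
// *certified* registry version, instead of whenever the store is non-empty.
fn initialize_local_store(store: &mut LocalStore, canister: &RegistryCanister) {
    let target = canister.get_certified_latest_version();
    while store.latest_version < target {
        for v in canister.get_certified_changes_since(store.latest_version) {
            store.latest_version = v;
        }
    }
}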

}
Ok(latest_registry_version)
}

Contributor

@pierugo-dfinity pierugo-dfinity Dec 23, 2025

Question: the implementation of RegistryReplicator::poll calls get_certified_changes_since. Does calling that function in a loop until we no longer have new records guarantee that we have reached the latest certified registry version? I think we should use the get_certified_latest_version method of the registry canister instead.

new_replica_version,
subnet_id,
latest_registry_version
);
Contributor

@pierugo-dfinity pierugo-dfinity Dec 23, 2025

I think we should return from the function here, otherwise the ensure_replica_is_running call below would start the replica. AFAIU, if we hit this branch, we do not want to start the replica, but only ensure that the orchestrator keeps running so that SSH keys can be deployed and, finally, a Recovery CUP can be proposed.

Note: in the current implementation, if we return an error here, SSH keys would not be applied, but they would be if #7868 gets merged. An alternative would be to return Ok(OrchestratorControlFlow::Assigned(subnet_id)), but that feels less "right", since we would be assigned but the replica would not be running.
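
To make the two alternatives concrete, a toy sketch where the enum mirrors OrchestratorControlFlow and everything else is hypothetical:

#[derive(Debug)]
enum ControlFlow {
    Assigned(u64),
    Stop,
}

fn handle_recalled_target(subnet_id: u64) -> Result<ControlFlow, String> {
    // Option A: bail with an error; in the current implementation this would
    // also skip SSH key deployment.
    // return Err(format!("target version for subnet {subnet_id} was recalled"));

    // Option B: report "assigned" without starting the replica, so the
    // orchestrator keeps running and SSH keys / a Recovery CUP can follow.
    Ok(ControlFlow::Assigned(subnet_id))
}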

);
})
.await;
}
Contributor

Nice test, thank you! To go further, once we arrive here, we could put a Recovery CUP (that has registry version 3) in the registry at version 3. Then verify that upgrade.check().await.unwrap() returns an OrchestratorControlFlow::Stop, since it should upgrade to current_version.

Maybe you'd need to manually call upgrade.set_prepared_version beforehand, otherwise it will actually try to download the image.
