test: make WaitUntilNodeReady robust to watch disconnections #7827
Conversation
Force-pushed from 0c0cda5 to 667faf9
Force-pushed from 667faf9 to d62b821
Force-pushed from d62b821 to cf7a371
e2e/kube.go
Outdated
```go
nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
    logf(ctx, "failed to list nodes: %v", err)
} else {
    foundNode = nil
    for _, node := range nodes.Items {
        if strings.HasPrefix(node.Name, vmssName) {
            foundNode = &node
            if isNodeReady(&node) {
                return node.Name, nil
            }
            break
        }
    }
}
```
If Nodes().List keeps failing (e.g., auth/RBAC issue), this loop will only log the error and then eventually return either "node with prefix ... not found" or "node ... not ready" on context timeout, which can hide the real root cause. Consider tracking the last List error and returning it (or returning immediately for non-retryable errors) when ctx.Done() fires so the final failure message reflects the API failure.
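For illustration, the error-tracking idea could look roughly like this (a sketch only, reusing `isNodeReady`, `logf`, and the `(string, error)` return shape from the excerpt above; it drops the `foundNode` bookkeeping for brevity and is not the PR's actual code):

```go
// Remember the most recent List failure so that, when the context expires,
// the caller sees the underlying API error (e.g. auth/RBAC) instead of a
// generic "node not found / not ready" message.
var lastListErr error
for {
    nodes, listErr := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if listErr != nil {
        lastListErr = listErr
        logf(ctx, "failed to list nodes: %v", listErr)
    } else {
        lastListErr = nil
        for i := range nodes.Items {
            node := &nodes.Items[i]
            if strings.HasPrefix(node.Name, vmssName) && isNodeReady(node) {
                return node.Name, nil
            }
        }
    }

    select {
    case <-ctx.Done():
        if lastListErr != nil {
            return "", fmt.Errorf("waiting for node %q: last list error: %w", vmssName, lastListErr)
        }
        return "", fmt.Errorf("node with prefix %q not ready: %w", vmssName, ctx.Err())
    case <-time.After(5 * time.Second):
    }
}
```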
Force-pushed from cf7a371 to cab645d
Force-pushed from cab645d to 3b80171
```go
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
if err != nil {
    t.Logf("failed to start node watch: %v, retrying in 5s", err)
    select {
    case <-ctx.Done():
    case <-time.After(5 * time.Second):
    }
    return ""
}
defer watcher.Stop()

for _, cond := range node.Status.Conditions {
    if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
        t.Logf("node %s is ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
        return node.Name

for event := range watcher.ResultChan() {
```
WaitUntilNodeReady starts a watch and then only reacts to future events. If the node is already present (or already Ready) when the watch begins, there may be no subsequent events and this can block until the context times out. Consider doing an initial Nodes().List/Nodes().Get check (or switching fully to PollUntilContextTimeout + List) to evaluate current state before watching/polling again.
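A minimal sketch of the pre-check idea, reusing the `vmssName` prefix match and the readiness-condition check from the diff (illustrative only, not code from the PR):

```go
// List once before (re)starting the watch so that a node that is already
// present and Ready is returned immediately, instead of waiting for a
// future watch event that may never arrive.
if nodes, listErr := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{}); listErr == nil {
    for i := range nodes.Items {
        node := &nodes.Items[i]
        if !strings.HasPrefix(node.Name, vmssName) {
            continue
        }
        for _, cond := range node.Status.Conditions {
            if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                return node.Name // already Ready; no watch needed
            }
        }
    }
}
// Otherwise fall through to the existing Watch-based loop.
```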
```go
func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
    startTime := time.Now()
    t.Logf("waiting for node %s to be ready in k8s API", vmssName)
    t.Logf("waiting for node %s to be ready", vmssName)
    defer func() {
        t.Logf("waited for node %s to be ready in k8s API for %s", vmssName, time.Since(startTime))
        t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
    }()

    var node *corev1.Node = nil
    watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
    require.NoError(t, err, "failed to start watching nodes")
    defer watcher.Stop()

    for event := range watcher.ResultChan() {
        if event.Type != watch.Added && event.Type != watch.Modified {
            continue
        }

        var nodeFromEvent *corev1.Node
        switch v := event.Object.(type) {
        case *corev1.Node:
            nodeFromEvent = v

        default:
            t.Logf("skipping object type %T", event.Object)
            continue
        }

        if !strings.HasPrefix(nodeFromEvent.Name, vmssName) {
            continue
        }

        // found the right node. Use it!
        node = nodeFromEvent
        nodeTaints, _ := json.Marshal(node.Spec.Taints)
        nodeConditions, _ := json.Marshal(node.Status.Conditions)
    var lastNode *corev1.Node
    for ctx.Err() == nil {
        name := func() string {
            watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
            if err != nil {
```
The PR description says this was switched to ticker-based polling using List and that the function returns (string, error) instead of calling t.Fatalf, but the implementation still uses Watch and still calls t.Fatalf on failure/deleted events. Either update the implementation to match (List+polling, return error) or adjust the PR description to reflect the actual behavior.
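For comparison, the List-polling, error-returning shape the description implies might look roughly like this (a sketch assuming an `isNodeReady` helper as in the earlier excerpt; the interval and messages are illustrative, not the PR's code):

```go
func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) (string, error) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
        if err != nil {
            t.Logf("failed to list nodes: %v", err)
        } else {
            for i := range nodes.Items {
                node := &nodes.Items[i]
                if strings.HasPrefix(node.Name, vmssName) && isNodeReady(node) {
                    return node.Name, nil
                }
            }
        }
        select {
        case <-ctx.Done():
            return "", fmt.Errorf("node with prefix %q did not become ready: %w", vmssName, ctx.Err())
        case <-ticker.C:
        }
    }
}
```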
e2e/kube.go
Outdated
```go
        }

        t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
            t.Logf("node watch disconnected, restarting")
```
This logs "node watch disconnected, restarting" even when the context is done (e.g., timeout/cancel), because the log happens after the watch loop returns but before the next loop condition check. Consider guarding this log with a ctx.Err() == nil check to avoid misleading output on expected shutdown.
Suggested change:

```diff
-    t.Logf("node watch disconnected, restarting")
+    if ctx.Err() == nil {
+        t.Logf("node watch disconnected, restarting")
+    }
```
Inline single-use helpers (failNodeNotReady, isNodeReady, logNodeDetail) into WaitUntilNodeReady. Add watch retry logic for disconnections and context cancellation. Add timing log for bastion creation.
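Pieced together from the excerpts above, the retry shape described here looks roughly like the following (a simplified sketch; logging, the taint/condition dumps, and the final failure handling are condensed and may not match the PR exactly):

```go
func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
    startTime := time.Now()
    t.Logf("waiting for node %s to be ready", vmssName)
    defer func() {
        t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
    }()

    for ctx.Err() == nil {
        // One watch attempt; returns "" if the watch could not be started or
        // was disconnected before a matching node became Ready.
        name := func() string {
            watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
            if err != nil {
                t.Logf("failed to start node watch: %v, retrying in 5s", err)
                select {
                case <-ctx.Done():
                case <-time.After(5 * time.Second):
                }
                return ""
            }
            defer watcher.Stop()

            for event := range watcher.ResultChan() {
                if event.Type != watch.Added && event.Type != watch.Modified {
                    continue
                }
                node, ok := event.Object.(*corev1.Node)
                if !ok || !strings.HasPrefix(node.Name, vmssName) {
                    continue
                }
                for _, cond := range node.Status.Conditions {
                    if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                        return node.Name
                    }
                }
            }
            if ctx.Err() == nil {
                t.Logf("node watch disconnected, restarting")
            }
            return ""
        }()
        if name != "" {
            return name
        }
    }
    // The actual implementation reports the failure here (via t.Fatalf, per
    // the discussion above) once the context is done.
    t.Fatalf("node %s not ready before context was done: %v", vmssName, ctx.Err())
    return ""
}
```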
Force-pushed from 3b80171 to 1ee4c6b
Summary
- Simplify `WaitUntilNodeReady` by inlining single-use helpers (`failNodeNotReady`, `isNodeReady`, `logNodeDetail`)
- Ensure `watcher.Stop()` is always called via `defer`
- Remove the redundant timing log from `prepareAKSNode` (`WaitUntilNodeReady` already logs its own duration)

Test plan

- `go build ./...` and `go vet ./...` pass