Skip to content

test: make WaitUntilNodeReady robust to watch disconnections#7827

Open
r2k1 wants to merge 1 commit intomainfrom
r2k1/fix-wait-until-node-ready
Open

test: make WaitUntilNodeReady robust to watch disconnections#7827
r2k1 wants to merge 1 commit intomainfrom
r2k1/fix-wait-until-node-ready

Conversation

@r2k1
Copy link
Contributor

@r2k1 r2k1 commented Feb 7, 2026

Summary

  • Simplify WaitUntilNodeReady by inlining single-use helpers (failNodeNotReady, isNodeReady, logNodeDetail)
  • Add watch retry loop — reconnects on watch disconnection or error instead of silently exiting
  • Use immediately-invoked func literal so watcher.Stop() is always called via defer
  • Fail fast on node deletion events
  • Remove noisy taints/conditions from success log
  • Simplify timing in prepareAKSNode (WaitUntilNodeReady already logs its own duration)
  • Add bastion creation timing log

Test plan

  • go build ./... and go vet ./... pass
  • e2e tests pass with node provisioning

This comment was marked as resolved.

@r2k1 r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 0c0cda5 to 667faf9 Compare February 7, 2026 09:39
Copilot AI review requested due to automatic review settings February 7, 2026 09:45
@r2k1 r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 667faf9 to d62b821 Compare February 7, 2026 09:45
@r2k1 r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from d62b821 to cf7a371 Compare February 7, 2026 09:45
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

e2e/kube.go Outdated
Comment on lines 150 to 165
nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
logf(ctx, "failed to list nodes: %v", err)
} else {
foundNode = nil
for _, node := range nodes.Items {
if strings.HasPrefix(node.Name, vmssName) {
foundNode = &node
if isNodeReady(&node) {
return node.Name, nil
}
break
}
}
}

Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Nodes().List keeps failing (e.g., auth/RBAC issue), this loop will only log the error and then eventually return either "node with prefix ... not found" or "node ... not ready" on context timeout, which can hide the real root cause. Consider tracking the last List error and returning it (or returning immediately for non-retryable errors) when ctx.Done() fires so the final failure message reflects the API failure.

Copilot uses AI. Check for mistakes.
@r2k1 r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from cf7a371 to cab645d Compare February 7, 2026 09:57
@r2k1 r2k1 changed the title fix(e2e): make WaitUntilNodeReady robust to watch disconnections test: make WaitUntilNodeReady robust to watch disconnections Feb 7, 2026
Copilot AI review requested due to automatic review settings February 7, 2026 22:56
@r2k1 r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from cab645d to 3b80171 Compare February 7, 2026 22:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +155 to +166
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
if err != nil {
t.Logf("failed to start node watch: %v, retrying in 5s", err)
select {
case <-ctx.Done():
case <-time.After(5 * time.Second):
}
return ""
}
defer watcher.Stop()

for _, cond := range node.Status.Conditions {
if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
t.Logf("node %s is ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
return node.Name
for event := range watcher.ResultChan() {
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WaitUntilNodeReady starts a watch and then only reacts to future events. If the node is already present (or already Ready) when the watch begins, there may be no subsequent events and this can block until the context times out. Consider doing an initial Nodes().List/Nodes().Get check (or switching fully to PollUntilContextTimeout + List) to evaluate current state before watching/polling again.

Copilot uses AI. Check for mistakes.
Comment on lines 145 to +156
func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
startTime := time.Now()
t.Logf("waiting for node %s to be ready in k8s API", vmssName)
t.Logf("waiting for node %s to be ready", vmssName)
defer func() {
t.Logf("waited for node %s to be ready in k8s API for %s", vmssName, time.Since(startTime))
t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
}()

var node *corev1.Node = nil
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
require.NoError(t, err, "failed to start watching nodes")
defer watcher.Stop()

for event := range watcher.ResultChan() {
if event.Type != watch.Added && event.Type != watch.Modified {
continue
}

var nodeFromEvent *corev1.Node
switch v := event.Object.(type) {
case *corev1.Node:
nodeFromEvent = v

default:
t.Logf("skipping object type %T", event.Object)
continue
}

if !strings.HasPrefix(nodeFromEvent.Name, vmssName) {
continue
}

// found the right node. Use it!
node = nodeFromEvent
nodeTaints, _ := json.Marshal(node.Spec.Taints)
nodeConditions, _ := json.Marshal(node.Status.Conditions)
var lastNode *corev1.Node
for ctx.Err() == nil {
name := func() string {
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
if err != nil {
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says this was switched to ticker-based polling using List and that the function returns (string, error) instead of calling t.Fatalf, but the implementation still uses Watch and still calls t.Fatalf on failure/deleted events. Either update the implementation to match (List+polling, return error) or adjust the PR description to reflect the actual behavior.

Copilot uses AI. Check for mistakes.
e2e/kube.go Outdated
}

t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
t.Logf("node watch disconnected, restarting")
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logs "node watch disconnected, restarting" even when the context is done (e.g., timeout/cancel), because the log happens after the watch loop returns but before the next loop condition check. Consider guarding this log with a ctx.Err() == nil check to avoid misleading output on expected shutdown.

Suggested change
t.Logf("node watch disconnected, restarting")
if ctx.Err() == nil {
t.Logf("node watch disconnected, restarting")
}

Copilot uses AI. Check for mistakes.
Inline single-use helpers (failNodeNotReady, isNodeReady, logNodeDetail)
into WaitUntilNodeReady. Add watch retry logic for disconnections and
context cancellation. Add timing log for bastion creation.
@r2k1 r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 3b80171 to 1ee4c6b Compare February 7, 2026 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant