test: make WaitUntilNodeReady robust to watch disconnections #7827
Conversation
Force-pushed from 0c0cda5 to 667faf9
Force-pushed from 667faf9 to d62b821
Force-pushed from d62b821 to cf7a371
e2e/kube.go
Outdated
```go
nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
    logf(ctx, "failed to list nodes: %v", err)
} else {
    foundNode = nil
    for _, node := range nodes.Items {
        if strings.HasPrefix(node.Name, vmssName) {
            foundNode = &node
            if isNodeReady(&node) {
                return node.Name, nil
            }
            break
        }
    }
}
```
If Nodes().List keeps failing (e.g., auth/RBAC issue), this loop will only log the error and then eventually return either "node with prefix ... not found" or "node ... not ready" on context timeout, which can hide the real root cause. Consider tracking the last List error and returning it (or returning immediately for non-retryable errors) when ctx.Done() fires so the final failure message reflects the API failure.
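For illustration, the error-tracking idea could look roughly like this (a sketch only, reusing `isNodeReady`, `logf`, and the `(string, error)` return shape from the excerpt above; it drops the `foundNode` bookkeeping for brevity and is not the PR's actual code):

```go
// Remember the most recent List failure so that, when the context expires,
// the caller sees the underlying API error (e.g. auth/RBAC) instead of a
// generic "node not found / not ready" message.
var lastListErr error
for {
    nodes, listErr := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if listErr != nil {
        lastListErr = listErr
        logf(ctx, "failed to list nodes: %v", listErr)
    } else {
        lastListErr = nil
        for i := range nodes.Items {
            node := &nodes.Items[i]
            if strings.HasPrefix(node.Name, vmssName) && isNodeReady(node) {
                return node.Name, nil
            }
        }
    }

    select {
    case <-ctx.Done():
        if lastListErr != nil {
            return "", fmt.Errorf("waiting for node %q: last list error: %w", vmssName, lastListErr)
        }
        return "", fmt.Errorf("node with prefix %q not ready: %w", vmssName, ctx.Err())
    case <-time.After(5 * time.Second):
    }
}
```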
Force-pushed from cf7a371 to cab645d
Force-pushed from cab645d to 3b80171
```go
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
if err != nil {
    t.Logf("failed to start node watch: %v, retrying in 5s", err)
    select {
    case <-ctx.Done():
    case <-time.After(5 * time.Second):
    }
    return ""
}
defer watcher.Stop()

for _, cond := range node.Status.Conditions {
    if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
        t.Logf("node %s is ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
        return node.Name

for event := range watcher.ResultChan() {
```
WaitUntilNodeReady starts a watch and then only reacts to future events. If the node is already present (or already Ready) when the watch begins, there may be no subsequent events and this can block until the context times out. Consider doing an initial Nodes().List/Nodes().Get check (or switching fully to PollUntilContextTimeout + List) to evaluate current state before watching/polling again.
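A minimal sketch of the pre-check idea, reusing the `vmssName` prefix match and the readiness-condition check from the diff (illustrative only, not code from the PR):

```go
// List once before (re)starting the watch so that a node that is already
// present and Ready is returned immediately, instead of waiting for a
// future watch event that may never arrive.
if nodes, listErr := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{}); listErr == nil {
    for i := range nodes.Items {
        node := &nodes.Items[i]
        if !strings.HasPrefix(node.Name, vmssName) {
            continue
        }
        for _, cond := range node.Status.Conditions {
            if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                return node.Name // already Ready; no watch needed
            }
        }
    }
}
// Otherwise fall through to the existing Watch-based loop.
```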
```go
func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
    startTime := time.Now()
    t.Logf("waiting for node %s to be ready in k8s API", vmssName)
    t.Logf("waiting for node %s to be ready", vmssName)
    defer func() {
        t.Logf("waited for node %s to be ready in k8s API for %s", vmssName, time.Since(startTime))
        t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
    }()

    var node *corev1.Node = nil
    watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
    require.NoError(t, err, "failed to start watching nodes")
    defer watcher.Stop()

    for event := range watcher.ResultChan() {
        if event.Type != watch.Added && event.Type != watch.Modified {
            continue
        }

        var nodeFromEvent *corev1.Node
        switch v := event.Object.(type) {
        case *corev1.Node:
            nodeFromEvent = v

        default:
            t.Logf("skipping object type %T", event.Object)
            continue
        }

        if !strings.HasPrefix(nodeFromEvent.Name, vmssName) {
            continue
        }

        // found the right node. Use it!
        node = nodeFromEvent
        nodeTaints, _ := json.Marshal(node.Spec.Taints)
        nodeConditions, _ := json.Marshal(node.Status.Conditions)
    var lastNode *corev1.Node
    for ctx.Err() == nil {
        name := func() string {
            watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
            if err != nil {
```
The PR description says this was switched to ticker-based polling using List and that the function returns (string, error) instead of calling t.Fatalf, but the implementation still uses Watch and still calls t.Fatalf on failure/deleted events. Either update the implementation to match (List+polling, return error) or adjust the PR description to reflect the actual behavior.
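For comparison, the List-polling, error-returning shape the description implies might look roughly like this (a sketch assuming an `isNodeReady` helper as in the earlier excerpt; the interval and messages are illustrative, not the PR's code):

```go
func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) (string, error) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
        if err != nil {
            t.Logf("failed to list nodes: %v", err)
        } else {
            for i := range nodes.Items {
                node := &nodes.Items[i]
                if strings.HasPrefix(node.Name, vmssName) && isNodeReady(node) {
                    return node.Name, nil
                }
            }
        }
        select {
        case <-ctx.Done():
            return "", fmt.Errorf("node with prefix %q did not become ready: %w", vmssName, ctx.Err())
        case <-ticker.C:
        }
    }
}
```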
e2e/kube.go
Outdated
```go
        }

        t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
            t.Logf("node watch disconnected, restarting")
```
This logs "node watch disconnected, restarting" even when the context is done (e.g., timeout/cancel), because the log happens after the watch loop returns but before the next loop condition check. Consider guarding this log with a ctx.Err() == nil check to avoid misleading output on expected shutdown.
Suggested change:

```diff
-    t.Logf("node watch disconnected, restarting")
+    if ctx.Err() == nil {
+        t.Logf("node watch disconnected, restarting")
+    }
```
Inline single-use helpers (failNodeNotReady, isNodeReady, logNodeDetail) into WaitUntilNodeReady. Add watch retry logic for disconnections and context cancellation. Add timing log for bastion creation.
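Pieced together from the excerpts above, the retry shape described here looks roughly like the following (a simplified sketch; logging, the taint/condition dumps, and the final failure handling are condensed and may not match the PR exactly):

```go
func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
    startTime := time.Now()
    t.Logf("waiting for node %s to be ready", vmssName)
    defer func() {
        t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
    }()

    for ctx.Err() == nil {
        // One watch attempt; returns "" if the watch could not be started or
        // was disconnected before a matching node became Ready.
        name := func() string {
            watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
            if err != nil {
                t.Logf("failed to start node watch: %v, retrying in 5s", err)
                select {
                case <-ctx.Done():
                case <-time.After(5 * time.Second):
                }
                return ""
            }
            defer watcher.Stop()

            for event := range watcher.ResultChan() {
                if event.Type != watch.Added && event.Type != watch.Modified {
                    continue
                }
                node, ok := event.Object.(*corev1.Node)
                if !ok || !strings.HasPrefix(node.Name, vmssName) {
                    continue
                }
                for _, cond := range node.Status.Conditions {
                    if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                        return node.Name
                    }
                }
            }
            if ctx.Err() == nil {
                t.Logf("node watch disconnected, restarting")
            }
            return ""
        }()
        if name != "" {
            return name
        }
    }
    // The actual implementation reports the failure here (via t.Fatalf, per
    // the discussion above) once the context is done.
    t.Fatalf("node %s not ready before context was done: %v", vmssName, ctx.Err())
    return ""
}
```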
Force-pushed from 3b80171 to 1ee4c6b
Summary
- Simplify `WaitUntilNodeReady` by inlining single-use helpers (`failNodeNotReady`, `isNodeReady`, `logNodeDetail`)
- Ensure `watcher.Stop()` is always called via `defer`
- Remove the redundant timing log from `prepareAKSNode` (`WaitUntilNodeReady` already logs its own duration)

Test plan

- `go build ./...` and `go vet ./...` pass