[ci-operator]: expand the pod lifecycle metrics to include the state of the machinesets#4938
[ci-operator]: expand the pod lifecycle metrics to include the state of the machinesets#4938droslean wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
/hold |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughRegisters OpenShift machine/autoscaling schemes, adds a Machines metrics plugin and wiring, lists MachineAutoscaler resources at agent init and passes them into pod/machine plugins, extends metrics events with workload/machine-set capacity types, and adds corresponding tests and go.mod dependency updates. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/test e2e |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: droslean The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test e2e |
|
/test e2e |
|
/retest |
|
/test e2e |
|
/test e2e |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/metrics/pods_test.go (1)
21-23: Consider registering the autoscalingv1beta1 scheme for test consistency.The test
init()only registersmachinev1beta1scheme. While the current tests don't create autoscaler objects via the fake client (they're passed directly), adding the autoscaling scheme registration would improve consistency with production code.♻️ Suggested improvement
func init() { machinev1beta1.AddToScheme(scheme.Scheme) + autoscalingv1beta1.SchemeBuilder.AddToScheme(scheme.Scheme) }pkg/metrics/pods.go (1)
167-213: Consider scoping MachineSet listing to the machine-api namespace.The
client.Listfor MachineSets on line 170 fetches cluster-wide. In OpenShift clusters, MachineSets typically reside inopenshift-machine-api. Scoping the list could improve performance and avoid unexpected results from MachineSets in other namespaces.♻️ Suggested improvement
func (p *PodLifecyclePlugin) getWorkloadCounts(workload string) WorkloadNodeCount { ret := WorkloadNodeCount{Workload: workload} machineSetList := &machinev1beta1.MachineSetList{} - if err := p.client.List(p.ctx, machineSetList); err != nil { + if err := p.client.List(p.ctx, machineSetList, ctrlruntimeclient.InNamespace(MachineAPINamespace)); err != nil { p.logger.WithError(err).Warn("Failed to list MachineSets") return WorkloadNodeCount{} }
|
/test e2e |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
pkg/metrics/metrics.go (1)
83-100: Autoscaler data is fetched once and may become stale.The
MachineAutoscalerListis queried once duringNewMetricsAgentinitialization. If autoscaler configurations change (min/max replicas, new autoscalers added) during the agent's lifetime, the plugin will report outdated capacity information.This is likely acceptable for metrics purposes, but worth noting if precise real-time capacity data is important. Consider documenting this behavior or refreshing periodically if needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/metrics.go` around lines 83 - 100, The NewMetricsAgent currently builds a MachineAutoscalerList once (autoscalerList := &autoscalingv1beta1.MachineAutoscalerList{} then passed as autoscalerList.Items into NewPodLifecyclePlugin), which can make PodLifecycle metrics stale if autoscalers change; either document this behavior in NewMetricsAgent/MetricsAgent constructor comments and README, or implement periodic refresh: add a reconciler/goroutine in NewMetricsAgent that re-lists MachineAutoscaler (using client.List(ctx, autoscalerList)) on a configurable interval and updates the value passed to NewPodLifecyclePlugin (or expose a thread-safe setter on PodLifecyclePlugin to accept new autoscaler items) so plugins always use the latest autoscalerItems.pkg/metrics/pods.go (1)
179-226: API calls on everyRecordmay impact performance.
getWorkloadCountslists MachineSets and Machines from the API on each pod lifecycle event. Under high event volume, this could create significant API load and latency.Consider caching with periodic refresh, or moving this aggregation to a background routine that updates a shared cache.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/pods.go` around lines 179 - 226, getWorkloadCounts currently calls the API on every pod event which will overload the API under high event volume; instead introduce a cached aggregation updated by a background refresher and have getWorkloadCounts read from that cache. Add fields on PodLifecyclePlugin (e.g., workloadCache map[string]WorkloadNodeCount, workloadCacheMu sync.RWMutex, cacheTTL/time.Ticker) and implement a background method (e.g., refreshWorkloadCounts or startWorkloadCacheUpdater) that periodically lists MachineSets and Machines, computes WorkloadNodeCount, writes under workloadCacheMu, and logs errors without failing; change getWorkloadCounts to grab the entry under workloadCacheMu (optionally trigger a synchronous refresh if cache miss or stale) and return the cached WorkloadNodeCount immediately. Ensure the updater is started at plugin initialization and handle graceful shutdown of the ticker/context.pkg/metrics/pods_test.go (1)
21-23: Scheme registration in testinit()ignores error.Unlike
metrics.gowhich logs errors, the test silently ignores theAddToSchemereturn value.♻️ Suggested improvement
func init() { - machinev1beta1.AddToScheme(scheme.Scheme) + if err := machinev1beta1.AddToScheme(scheme.Scheme); err != nil { + panic(err) + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/pods_test.go` around lines 21 - 23, The test init() currently calls machinev1beta1.AddToScheme(scheme.Scheme) and ignores the returned error; update the test to check and handle that error instead—either replace the init() with a TestMain(m *testing.M) that calls machinev1beta1.AddToScheme(scheme.Scheme) and exits non‑zero on error, or call utilruntime.Must(machinev1beta1.AddToScheme(scheme.Scheme)) (from k8s.io/apimachinery/pkg/util/runtime) inside init() so failures are not silently ignored; reference: init(), machinev1beta1.AddToScheme, scheme.Scheme, TestMain, utilruntime.Must.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/pods.go`:
- Around line 105-106: pod.Status.StartTime can be nil and dereferencing it in
the assignment to e.StartTime will panic; update the code around the e.StartTime
assignment (the lines handling pod.CreationTimestamp and pod.Status.StartTime)
to check pod.Status.StartTime for nil before taking its Time pointer and assign
e.StartTime only when non-nil (otherwise leave it nil or explicitly set to nil)
to avoid a nil pointer dereference.
---
Nitpick comments:
In `@pkg/metrics/metrics.go`:
- Around line 83-100: The NewMetricsAgent currently builds a
MachineAutoscalerList once (autoscalerList :=
&autoscalingv1beta1.MachineAutoscalerList{} then passed as autoscalerList.Items
into NewPodLifecyclePlugin), which can make PodLifecycle metrics stale if
autoscalers change; either document this behavior in
NewMetricsAgent/MetricsAgent constructor comments and README, or implement
periodic refresh: add a reconciler/goroutine in NewMetricsAgent that re-lists
MachineAutoscaler (using client.List(ctx, autoscalerList)) on a configurable
interval and updates the value passed to NewPodLifecyclePlugin (or expose a
thread-safe setter on PodLifecyclePlugin to accept new autoscaler items) so
plugins always use the latest autoscalerItems.
In `@pkg/metrics/pods_test.go`:
- Around line 21-23: The test init() currently calls
machinev1beta1.AddToScheme(scheme.Scheme) and ignores the returned error; update
the test to check and handle that error instead—either replace the init() with a
TestMain(m *testing.M) that calls machinev1beta1.AddToScheme(scheme.Scheme) and
exits non‑zero on error, or call
utilruntime.Must(machinev1beta1.AddToScheme(scheme.Scheme)) (from
k8s.io/apimachinery/pkg/util/runtime) inside init() so failures are not silently
ignored; reference: init(), machinev1beta1.AddToScheme, scheme.Scheme, TestMain,
utilruntime.Must.
In `@pkg/metrics/pods.go`:
- Around line 179-226: getWorkloadCounts currently calls the API on every pod
event which will overload the API under high event volume; instead introduce a
cached aggregation updated by a background refresher and have getWorkloadCounts
read from that cache. Add fields on PodLifecyclePlugin (e.g., workloadCache
map[string]WorkloadNodeCount, workloadCacheMu sync.RWMutex,
cacheTTL/time.Ticker) and implement a background method (e.g.,
refreshWorkloadCounts or startWorkloadCacheUpdater) that periodically lists
MachineSets and Machines, computes WorkloadNodeCount, writes under
workloadCacheMu, and logs errors without failing; change getWorkloadCounts to
grab the entry under workloadCacheMu (optionally trigger a synchronous refresh
if cache miss or stale) and return the cached WorkloadNodeCount immediately.
Ensure the updater is started at plugin initialization and handle graceful
shutdown of the ticker/context.
| e.CreationTime = &pod.CreationTimestamp.Time | ||
| e.StartTime = &pod.Status.StartTime.Time |
There was a problem hiding this comment.
Potential nil pointer dereference on StartTime.
pod.Status.StartTime can be nil for pods that haven't started yet. Dereferencing it unconditionally will cause a panic.
🐛 Proposed fix
e.CreationTime = &pod.CreationTimestamp.Time
- e.StartTime = &pod.Status.StartTime.Time
+ if pod.Status.StartTime != nil {
+ e.StartTime = &pod.Status.StartTime.Time
+ }
e.CompletionTime = getPodCompletionTime(pod)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| e.CreationTime = &pod.CreationTimestamp.Time | |
| e.StartTime = &pod.Status.StartTime.Time | |
| e.CreationTime = &pod.CreationTimestamp.Time | |
| if pod.Status.StartTime != nil { | |
| e.StartTime = &pod.Status.StartTime.Time | |
| } | |
| e.CompletionTime = getPodCompletionTime(pod) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/pods.go` around lines 105 - 106, pod.Status.StartTime can be nil
and dereferencing it in the assignment to e.StartTime will panic; update the
code around the e.StartTime assignment (the lines handling pod.CreationTimestamp
and pod.Status.StartTime) to check pod.Status.StartTime for nil before taking
its Time pointer and assign e.StartTime only when non-nil (otherwise leave it
nil or explicitly set to nil) to avoid a nil pointer dereference.
7b826a5 to
3d3d9f0
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/metrics/metrics.go (1)
165-173:⚠️ Potential issue | 🟠 MajorInclude MachinesPlugin data in the flushed metrics artifact.
Without this, collected machine metrics are never persisted.
Proposed fix
output := map[string]any{ ma.insightsPlugin.Name(): ma.insightsPlugin.Events(), ma.eventsPlugin.Name(): ma.eventsPlugin.Events(), ma.buildPlugin.Name(): ma.buildPlugin.Events(), ma.nodesPlugin.Name(): ma.nodesPlugin.Events(), ma.leasePlugin.Name(): ma.leasePlugin.Events(), ma.imagesPlugin.Name(): ma.imagesPlugin.Events(), ma.podPlugin.Name(): ma.podPlugin.Events(), + ma.machinesPlugin.Name(): ma.machinesPlugin.Events(), }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/metrics.go` around lines 165 - 173, The metrics flush omitted the Machines plugin, so add the machines plugin entry to the output map: include ma.machinesPlugin.Name(): ma.machinesPlugin.Events() alongside the existing entries (e.g., ma.insightsPlugin, ma.eventsPlugin, ma.buildPlugin, ma.nodesPlugin, ma.leasePlugin, ma.imagesPlugin, ma.podPlugin) so machine metrics are included in the flushed artifact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/machines.go`:
- Around line 122-128: In getWorkloadCounts, the MachineSet listing is not
namespace-scoped which breaks namespace-scoped RBAC; update the p.client.List
call that populates machineSetList to include the namespace option (use the same
MachineAPINamespace option used when listing Machines) so the List is scoped to
MachineAPINamespace; modify the p.client.List invocation in getWorkloadCounts
(for variable machineSetList) to pass client.InNamespace(MachineAPINamespace)
alongside p.ctx.
In `@pkg/metrics/metrics.go`:
- Around line 85-88: The MachineAutoscaler list call is currently cluster-scoped
(autoscalerList + client.List(ctx, autoscalerList)) which breaks under
namespace-scoped RBAC; update the client.List invocation to restrict the query
to the machine API namespace (e.g., pass a namespace option such as
client.InNamespace(<machine API namespace>)) so that MachineAutoscalerList is
populated; ensure you use the correct namespace source (a constant or config
value) rather than hardcoding, and keep the same logger/error handling around
the filtered client.List call.
---
Outside diff comments:
In `@pkg/metrics/metrics.go`:
- Around line 165-173: The metrics flush omitted the Machines plugin, so add the
machines plugin entry to the output map: include ma.machinesPlugin.Name():
ma.machinesPlugin.Events() alongside the existing entries (e.g.,
ma.insightsPlugin, ma.eventsPlugin, ma.buildPlugin, ma.nodesPlugin,
ma.leasePlugin, ma.imagesPlugin, ma.podPlugin) so machine metrics are included
in the flushed artifact.
b88df5a to
195fcb1
Compare
|
/test e2e |
|
/test e2e |
|
/test e2e |
…of the machinesets Signed-off-by: Nikolaos Moraitis <nmoraiti@redhat.com>
|
/test e2e |
1 similar comment
|
/test e2e |
|
/retest |
|
@droslean: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/cc @openshift/test-platform