[ci-operator]: expand the pod lifecycle metrics to include the state of the machinesets#4938

Open
droslean wants to merge 1 commit into openshift:main from droslean:metrics-99

Conversation

@droslean
Member

/cc @openshift/test-platform

@openshift-ci-robot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot requested a review from a team February 11, 2026 13:10
@droslean
Member Author

/hold

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Registers OpenShift machine/autoscaling schemes, adds a Machines metrics plugin and wiring, lists MachineAutoscaler resources at agent init and passes them into pod/machine plugins, extends metrics events with workload/machine-set capacity types, and adds corresponding tests and go.mod dependency updates.
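
For orientation, here is a minimal sketch of the scheme registration and one-time MachineAutoscaler listing described above. The import paths follow the dependencies named in this PR, but the helper name listAutoscalers and the exact signatures are assumptions for illustration, not the PR's literal code.

// Sketch only: register the machine and autoscaling APIs so the
// controller-runtime client can list MachineSets, Machines, and MachineAutoscalers.
import (
	"context"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	autoscalingv1beta1 "github.com/openshift/cluster-autoscaler-operator/pkg/apis/autoscaling/v1beta1"
	"github.com/sirupsen/logrus"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/kubernetes/scheme"
	ctrlruntimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

func init() {
	utilruntime.Must(machinev1beta1.AddToScheme(scheme.Scheme))
	utilruntime.Must(autoscalingv1beta1.SchemeBuilder.AddToScheme(scheme.Scheme))
}

// listAutoscalers (hypothetical helper) fetches MachineAutoscalers once at agent
// init; errors are logged rather than treated as fatal, matching the walkthrough.
func listAutoscalers(ctx context.Context, client ctrlruntimeclient.Client, logger *logrus.Entry) []autoscalingv1beta1.MachineAutoscaler {
	list := &autoscalingv1beta1.MachineAutoscalerList{}
	if err := client.List(ctx, list); err != nil {
		logger.WithError(err).Warn("Failed to list MachineAutoscalers")
	}
	return list.Items
}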

Changes

  • Dependencies (go.mod): Updated OpenShift module pseudo-versions and added the github.com/openshift/cluster-autoscaler-operator dependency.
  • Metrics core & wiring (pkg/metrics/metrics.go): Registers the machinev1beta1 and autoscalingv1beta1 schemes, lists MachineAutoscaler resources during agent init (logging on error), adds constants (CIWorkloadLabel, MachineAPINamespace, MachineSetLabel), adds machinesPlugin to MetricsAgent, and wires autoscalers into plugin creation.
  • Pod lifecycle plugin (pkg/metrics/pods.go, pkg/metrics/pods_test.go): Minor formatting and use of the label constant in pods.go. The NewPodLifecyclePlugin signature now accepts autoscalers; tests are migrated to the controller-runtime fake client, register the machine schemes, and assert the new WorkloadCapacity topology fields; pointer helpers added.
  • Machines plugin, new (pkg/metrics/machines.go, pkg/metrics/machines_test.go): Adds the MachinesPlugin implementation and types (MachinesEventType, MachinesEvent, WorkloadNodeCount, MachineSetCount, MachineInfo, AutoscalerInfo), thread-safe event buffering (a rough sketch follows this list), and workload aggregation by listing MachineSets/Machines and correlating MachineAutoscalers; tests exercise aggregation and autoscaler association.
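
As a rough illustration of the thread-safe event buffering mentioned for the new plugin, a minimal sketch follows; the buffer type and method names are assumptions (MachinesEvent is the event type listed above, and a sync import is assumed).

// Hypothetical buffer; the real MachinesPlugin in pkg/metrics/machines.go may differ.
type machinesBuffer struct {
	mu     sync.Mutex
	events []MachinesEvent
}

// record appends an event under the lock.
func (b *machinesBuffer) record(ev MachinesEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.events = append(b.events, ev)
}

// snapshot returns a copy so the agent can flush events without racing writers.
func (b *machinesBuffer) snapshot() []MachinesEvent {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]MachinesEvent, len(b.events))
	copy(out, b.events)
	return out
}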

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Feb 11, 2026
@droslean
Member Author

/test e2e

@openshift-ci
Contributor

openshift-ci bot commented Feb 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droslean

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Feb 11, 2026
@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e

@droslean
Member Author

/retest

@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
pkg/metrics/pods_test.go (1)

21-23: Consider registering the autoscalingv1beta1 scheme for test consistency.

The test init() only registers machinev1beta1 scheme. While the current tests don't create autoscaler objects via the fake client (they're passed directly), adding the autoscaling scheme registration would improve consistency with production code.

♻️ Suggested improvement
 func init() {
 	machinev1beta1.AddToScheme(scheme.Scheme)
+	autoscalingv1beta1.SchemeBuilder.AddToScheme(scheme.Scheme)
 }
pkg/metrics/pods.go (1)

167-213: Consider scoping MachineSet listing to the machine-api namespace.

The client.List for MachineSets on line 170 fetches cluster-wide. In OpenShift clusters, MachineSets typically reside in openshift-machine-api. Scoping the list could improve performance and avoid unexpected results from MachineSets in other namespaces.

♻️ Suggested improvement
 func (p *PodLifecyclePlugin) getWorkloadCounts(workload string) WorkloadNodeCount {
 	ret := WorkloadNodeCount{Workload: workload}
 	machineSetList := &machinev1beta1.MachineSetList{}
-	if err := p.client.List(p.ctx, machineSetList); err != nil {
+	if err := p.client.List(p.ctx, machineSetList, ctrlruntimeclient.InNamespace(MachineAPINamespace)); err != nil {
 		p.logger.WithError(err).Warn("Failed to list MachineSets")
 		return WorkloadNodeCount{}
 	}

@droslean
Member Author

/test e2e


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
pkg/metrics/metrics.go (1)

83-100: Autoscaler data is fetched once and may become stale.

The MachineAutoscalerList is queried once during NewMetricsAgent initialization. If autoscaler configurations change (min/max replicas, new autoscalers added) during the agent's lifetime, the plugin will report outdated capacity information.

This is likely acceptable for metrics purposes, but worth noting if precise real-time capacity data is important. Consider documenting this behavior or refreshing periodically if needed.
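
A minimal sketch of the periodic-refresh option, assuming the plugin gains a thread-safe SetAutoscalers setter (which this PR does not include):

// Hypothetical refresher goroutine started from NewMetricsAgent; SetAutoscalers
// is assumed here purely for illustration and does not exist in the PR.
go func() {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			list := &autoscalingv1beta1.MachineAutoscalerList{}
			if err := client.List(ctx, list); err != nil {
				logger.WithError(err).Warn("Failed to refresh MachineAutoscalers")
				continue
			}
			podPlugin.SetAutoscalers(list.Items)
		}
	}
}()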

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 83 - 100, The NewMetricsAgent currently
builds a MachineAutoscalerList once (autoscalerList :=
&autoscalingv1beta1.MachineAutoscalerList{} then passed as autoscalerList.Items
into NewPodLifecyclePlugin), which can make PodLifecycle metrics stale if
autoscalers change; either document this behavior in
NewMetricsAgent/MetricsAgent constructor comments and README, or implement
periodic refresh: add a reconciler/goroutine in NewMetricsAgent that re-lists
MachineAutoscaler (using client.List(ctx, autoscalerList)) on a configurable
interval and updates the value passed to NewPodLifecyclePlugin (or expose a
thread-safe setter on PodLifecyclePlugin to accept new autoscaler items) so
plugins always use the latest autoscalerItems.
pkg/metrics/pods.go (1)

179-226: API calls on every Record may impact performance.

getWorkloadCounts lists MachineSets and Machines from the API on each pod lifecycle event. Under high event volume, this could create significant API load and latency.

Consider caching with periodic refresh, or moving this aggregation to a background routine that updates a shared cache.
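
Roughly, the read half of that cache could look like the sketch below; workloadCache and workloadCacheMu are hypothetical fields (they are also named in the prompt that follows), and the background refresher is not shown.

// Hypothetical cached lookup; a background routine would periodically list
// MachineSets/Machines and repopulate workloadCache under workloadCacheMu.
func (p *PodLifecyclePlugin) getWorkloadCounts(workload string) WorkloadNodeCount {
	p.workloadCacheMu.RLock()
	defer p.workloadCacheMu.RUnlock()
	if cached, ok := p.workloadCache[workload]; ok {
		return cached
	}
	return WorkloadNodeCount{Workload: workload}
}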

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/pods.go` around lines 179 - 226, getWorkloadCounts currently
calls the API on every pod event which will overload the API under high event
volume; instead introduce a cached aggregation updated by a background refresher
and have getWorkloadCounts read from that cache. Add fields on
PodLifecyclePlugin (e.g., workloadCache map[string]WorkloadNodeCount,
workloadCacheMu sync.RWMutex, cacheTTL/time.Ticker) and implement a background
method (e.g., refreshWorkloadCounts or startWorkloadCacheUpdater) that
periodically lists MachineSets and Machines, computes WorkloadNodeCount, writes
under workloadCacheMu, and logs errors without failing; change getWorkloadCounts
to grab the entry under workloadCacheMu (optionally trigger a synchronous
refresh if cache miss or stale) and return the cached WorkloadNodeCount
immediately. Ensure the updater is started at plugin initialization and handle
graceful shutdown of the ticker/context.
pkg/metrics/pods_test.go (1)

21-23: Scheme registration in test init() ignores error.

Unlike metrics.go which logs errors, the test silently ignores the AddToScheme return value.

♻️ Suggested improvement
 func init() {
-	machinev1beta1.AddToScheme(scheme.Scheme)
+	if err := machinev1beta1.AddToScheme(scheme.Scheme); err != nil {
+		panic(err)
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/pods_test.go` around lines 21 - 23, The test init() currently
calls machinev1beta1.AddToScheme(scheme.Scheme) and ignores the returned error;
update the test to check and handle that error instead—either replace the init()
with a TestMain(m *testing.M) that calls
machinev1beta1.AddToScheme(scheme.Scheme) and exits non‑zero on error, or call
utilruntime.Must(machinev1beta1.AddToScheme(scheme.Scheme)) (from
k8s.io/apimachinery/pkg/util/runtime) inside init() so failures are not silently
ignored; reference: init(), machinev1beta1.AddToScheme, scheme.Scheme, TestMain,
utilruntime.Must.

Comment on lines 105 to 106
e.CreationTime = &pod.CreationTimestamp.Time
e.StartTime = &pod.Status.StartTime.Time


⚠️ Potential issue | 🔴 Critical

Potential nil pointer dereference on StartTime.

pod.Status.StartTime can be nil for pods that haven't started yet. Dereferencing it unconditionally will cause a panic.

🐛 Proposed fix
 	e.CreationTime = &pod.CreationTimestamp.Time
-	e.StartTime = &pod.Status.StartTime.Time
+	if pod.Status.StartTime != nil {
+		e.StartTime = &pod.Status.StartTime.Time
+	}
 	e.CompletionTime = getPodCompletionTime(pod)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/pods.go` around lines 105 - 106, pod.Status.StartTime can be nil
and dereferencing it in the assignment to e.StartTime will panic; update the
code around the e.StartTime assignment (the lines handling pod.CreationTimestamp
and pod.Status.StartTime) to check pod.Status.StartTime for nil before taking
its Time pointer and assign e.StartTime only when non-nil (otherwise leave it
nil or explicitly set to nil) to avoid a nil pointer dereference.

@droslean droslean force-pushed the metrics-99 branch 2 times, most recently from 7b826a5 to 3d3d9f0 (February 17, 2026 14:45)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/metrics/metrics.go (1)

165-173: ⚠️ Potential issue | 🟠 Major

Include MachinesPlugin data in the flushed metrics artifact.

Without this, collected machine metrics are never persisted.

Proposed fix
 	output := map[string]any{
 		ma.insightsPlugin.Name(): ma.insightsPlugin.Events(),
 		ma.eventsPlugin.Name():   ma.eventsPlugin.Events(),
 		ma.buildPlugin.Name():    ma.buildPlugin.Events(),
 		ma.nodesPlugin.Name():    ma.nodesPlugin.Events(),
 		ma.leasePlugin.Name():    ma.leasePlugin.Events(),
 		ma.imagesPlugin.Name():   ma.imagesPlugin.Events(),
 		ma.podPlugin.Name():      ma.podPlugin.Events(),
+		ma.machinesPlugin.Name(): ma.machinesPlugin.Events(),
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 165 - 173, The metrics flush omitted the
Machines plugin, so add the machines plugin entry to the output map: include
ma.machinesPlugin.Name(): ma.machinesPlugin.Events() alongside the existing
entries (e.g., ma.insightsPlugin, ma.eventsPlugin, ma.buildPlugin,
ma.nodesPlugin, ma.leasePlugin, ma.imagesPlugin, ma.podPlugin) so machine
metrics are included in the flushed artifact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/metrics/machines.go`:
- Around line 122-128: In getWorkloadCounts, the MachineSet listing is not
namespace-scoped which breaks namespace-scoped RBAC; update the p.client.List
call that populates machineSetList to include the namespace option (use the same
MachineAPINamespace option used when listing Machines) so the List is scoped to
MachineAPINamespace; modify the p.client.List invocation in getWorkloadCounts
(for variable machineSetList) to pass client.InNamespace(MachineAPINamespace)
alongside p.ctx.

In `@pkg/metrics/metrics.go`:
- Around line 85-88: The MachineAutoscaler list call is currently cluster-scoped
(autoscalerList + client.List(ctx, autoscalerList)) which breaks under
namespace-scoped RBAC; update the client.List invocation to restrict the query
to the machine API namespace (e.g., pass a namespace option such as
client.InNamespace(<machine API namespace>)) so that MachineAutoscalerList is
populated; ensure you use the correct namespace source (a constant or config
value) rather than hardcoding, and keep the same logger/error handling around
the filtered client.List call.
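
Both inline prompts reduce to the same namespace-scoping pattern; a rough sketch (client, ctx, and logger stand in for whatever the surrounding code uses, and MachineAPINamespace is the constant added in pkg/metrics/metrics.go):

// Namespace-scoped listing sketch; the same option applies to MachineSetList in
// machines.go and MachineAutoscalerList in metrics.go.
autoscalerList := &autoscalingv1beta1.MachineAutoscalerList{}
if err := client.List(ctx, autoscalerList, ctrlruntimeclient.InNamespace(MachineAPINamespace)); err != nil {
	logger.WithError(err).Warn("Failed to list MachineAutoscalers")
}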


@droslean droslean force-pushed the metrics-99 branch 2 times, most recently from b88df5a to 195fcb1 (February 17, 2026 14:46)
@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e

…of the machinesets

Signed-off-by: Nikolaos Moraitis <nmoraiti@redhat.com>
@droslean
Member Author

/test e2e

1 similar comment
@droslean
Member Author

/test e2e

@droslean
Member Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Feb 19, 2026

@droslean: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Details  Required  Rerun command
ci/prow/breaking-changes   853074a  link     false     /test breaking-changes
ci/prow/lint               853074a  link     true      /test lint
ci/prow/e2e                853074a  link     true      /test e2e

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
