[ci-operator]: expand the pod lifecycle metrics to include the state of the machinesets#4938

Open
droslean wants to merge 1 commit into openshift:main from droslean:metrics-99

Conversation

@droslean
Member

/cc @openshift/test-platform

@openshift-ci-robot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot requested a review from a team February 11, 2026 13:10
@droslean
Member Author

/hold

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Registers OpenShift machine/autoscaling schemes, adds a Machines metrics plugin and wiring, lists MachineAutoscaler resources at agent init and passes them into pod/machine plugins, extends metrics events with workload/machine-set capacity types, and adds corresponding tests and go.mod dependency updates.
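
For orientation, here is a minimal sketch of the scheme registration and one-time MachineAutoscaler listing described above. The import paths follow the dependencies named in this PR, but the helper name listAutoscalers and the exact signatures are assumptions for illustration, not the PR's literal code.

// Sketch only: register the machine and autoscaling APIs so the
// controller-runtime client can list MachineSets, Machines, and MachineAutoscalers.
import (
	"context"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	autoscalingv1beta1 "github.com/openshift/cluster-autoscaler-operator/pkg/apis/autoscaling/v1beta1"
	"github.com/sirupsen/logrus"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/kubernetes/scheme"
	ctrlruntimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

func init() {
	utilruntime.Must(machinev1beta1.AddToScheme(scheme.Scheme))
	utilruntime.Must(autoscalingv1beta1.SchemeBuilder.AddToScheme(scheme.Scheme))
}

// listAutoscalers (hypothetical helper) fetches MachineAutoscalers once at agent
// init; errors are logged rather than treated as fatal, matching the walkthrough.
func listAutoscalers(ctx context.Context, client ctrlruntimeclient.Client, logger *logrus.Entry) []autoscalingv1beta1.MachineAutoscaler {
	list := &autoscalingv1beta1.MachineAutoscalerList{}
	if err := client.List(ctx, list); err != nil {
		logger.WithError(err).Warn("Failed to list MachineAutoscalers")
	}
	return list.Items
}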

Changes

  • Dependencies (go.mod): Updated OpenShift module pseudo-versions and added the github.com/openshift/cluster-autoscaler-operator dependency.
  • Metrics core & wiring (pkg/metrics/metrics.go): Registers the machinev1beta1 and autoscalingv1beta1 schemes, lists MachineAutoscaler resources during agent init (logging on error), adds constants (CIWorkloadLabel, MachineAPINamespace, MachineSetLabel), adds machinesPlugin to MetricsAgent, and wires autoscalers into plugin creation.
  • Pod lifecycle plugin (pkg/metrics/pods.go, pkg/metrics/pods_test.go): Minor formatting and use of the label constant in pods.go. The NewPodLifecyclePlugin signature now accepts autoscalers; tests are migrated to the controller-runtime fake client, register the machine schemes, and assert the new WorkloadCapacity topology fields; pointer helpers added.
  • Machines plugin, new (pkg/metrics/machines.go, pkg/metrics/machines_test.go): Adds the MachinesPlugin implementation and types (MachinesEventType, MachinesEvent, WorkloadNodeCount, MachineSetCount, MachineInfo, AutoscalerInfo), thread-safe event buffering (a rough sketch follows this list), and workload aggregation by listing MachineSets/Machines and correlating MachineAutoscalers; tests exercise aggregation and autoscaler association.
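
As a rough illustration of the thread-safe event buffering mentioned for the new plugin, a minimal sketch follows; the buffer type and method names are assumptions (MachinesEvent is the event type listed above, and a sync import is assumed).

// Hypothetical buffer; the real MachinesPlugin in pkg/metrics/machines.go may differ.
type machinesBuffer struct {
	mu     sync.Mutex
	events []MachinesEvent
}

// record appends an event under the lock.
func (b *machinesBuffer) record(ev MachinesEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.events = append(b.events, ev)
}

// snapshot returns a copy so the agent can flush events without racing writers.
func (b *machinesBuffer) snapshot() []MachinesEvent {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]MachinesEvent, len(b.events))
	copy(out, b.events)
	return out
}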

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Feb 11, 2026
@droslean
Member Author

/test e2e

@openshift-ci
Contributor

openshift-ci bot commented Feb 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droslean

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Feb 11, 2026
@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e

@droslean
Member Author

/retest

@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
pkg/metrics/pods_test.go (1)

21-23: Consider registering the autoscalingv1beta1 scheme for test consistency.

The test init() only registers machinev1beta1 scheme. While the current tests don't create autoscaler objects via the fake client (they're passed directly), adding the autoscaling scheme registration would improve consistency with production code.

♻️ Suggested improvement
 func init() {
 	machinev1beta1.AddToScheme(scheme.Scheme)
+	autoscalingv1beta1.SchemeBuilder.AddToScheme(scheme.Scheme)
 }
pkg/metrics/pods.go (1)

167-213: Consider scoping MachineSet listing to the machine-api namespace.

The client.List for MachineSets on line 170 fetches cluster-wide. In OpenShift clusters, MachineSets typically reside in openshift-machine-api. Scoping the list could improve performance and avoid unexpected results from MachineSets in other namespaces.

♻️ Suggested improvement
 func (p *PodLifecyclePlugin) getWorkloadCounts(workload string) WorkloadNodeCount {
 	ret := WorkloadNodeCount{Workload: workload}
 	machineSetList := &machinev1beta1.MachineSetList{}
-	if err := p.client.List(p.ctx, machineSetList); err != nil {
+	if err := p.client.List(p.ctx, machineSetList, ctrlruntimeclient.InNamespace(MachineAPINamespace)); err != nil {
 		p.logger.WithError(err).Warn("Failed to list MachineSets")
 		return WorkloadNodeCount{}
 	}

@droslean
Member Author

/test e2e


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
pkg/metrics/metrics.go (1)

83-100: Autoscaler data is fetched once and may become stale.

The MachineAutoscalerList is queried once during NewMetricsAgent initialization. If autoscaler configurations change (min/max replicas, new autoscalers added) during the agent's lifetime, the plugin will report outdated capacity information.

This is likely acceptable for metrics purposes, but worth noting if precise real-time capacity data is important. Consider documenting this behavior or refreshing periodically if needed.
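
A minimal sketch of the periodic-refresh option, assuming the plugin gains a thread-safe SetAutoscalers setter (which this PR does not include):

// Hypothetical refresher goroutine started from NewMetricsAgent; SetAutoscalers
// is assumed here purely for illustration and does not exist in the PR.
go func() {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			list := &autoscalingv1beta1.MachineAutoscalerList{}
			if err := client.List(ctx, list); err != nil {
				logger.WithError(err).Warn("Failed to refresh MachineAutoscalers")
				continue
			}
			podPlugin.SetAutoscalers(list.Items)
		}
	}
}()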

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 83 - 100, The NewMetricsAgent currently
builds a MachineAutoscalerList once (autoscalerList :=
&autoscalingv1beta1.MachineAutoscalerList{} then passed as autoscalerList.Items
into NewPodLifecyclePlugin), which can make PodLifecycle metrics stale if
autoscalers change; either document this behavior in
NewMetricsAgent/MetricsAgent constructor comments and README, or implement
periodic refresh: add a reconciler/goroutine in NewMetricsAgent that re-lists
MachineAutoscaler (using client.List(ctx, autoscalerList)) on a configurable
interval and updates the value passed to NewPodLifecyclePlugin (or expose a
thread-safe setter on PodLifecyclePlugin to accept new autoscaler items) so
plugins always use the latest autoscalerItems.
pkg/metrics/pods.go (1)

179-226: API calls on every Record may impact performance.

getWorkloadCounts lists MachineSets and Machines from the API on each pod lifecycle event. Under high event volume, this could create significant API load and latency.

Consider caching with periodic refresh, or moving this aggregation to a background routine that updates a shared cache.
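
Roughly, the read half of that cache could look like the sketch below; workloadCache and workloadCacheMu are hypothetical fields (they are also named in the prompt that follows), and the background refresher is not shown.

// Hypothetical cached lookup; a background routine would periodically list
// MachineSets/Machines and repopulate workloadCache under workloadCacheMu.
func (p *PodLifecyclePlugin) getWorkloadCounts(workload string) WorkloadNodeCount {
	p.workloadCacheMu.RLock()
	defer p.workloadCacheMu.RUnlock()
	if cached, ok := p.workloadCache[workload]; ok {
		return cached
	}
	return WorkloadNodeCount{Workload: workload}
}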

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/pods.go` around lines 179 - 226, getWorkloadCounts currently
calls the API on every pod event which will overload the API under high event
volume; instead introduce a cached aggregation updated by a background refresher
and have getWorkloadCounts read from that cache. Add fields on
PodLifecyclePlugin (e.g., workloadCache map[string]WorkloadNodeCount,
workloadCacheMu sync.RWMutex, cacheTTL/time.Ticker) and implement a background
method (e.g., refreshWorkloadCounts or startWorkloadCacheUpdater) that
periodically lists MachineSets and Machines, computes WorkloadNodeCount, writes
under workloadCacheMu, and logs errors without failing; change getWorkloadCounts
to grab the entry under workloadCacheMu (optionally trigger a synchronous
refresh if cache miss or stale) and return the cached WorkloadNodeCount
immediately. Ensure the updater is started at plugin initialization and handle
graceful shutdown of the ticker/context.
pkg/metrics/pods_test.go (1)

21-23: Scheme registration in test init() ignores error.

Unlike metrics.go which logs errors, the test silently ignores the AddToScheme return value.

♻️ Suggested improvement
 func init() {
-	machinev1beta1.AddToScheme(scheme.Scheme)
+	if err := machinev1beta1.AddToScheme(scheme.Scheme); err != nil {
+		panic(err)
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/pods_test.go` around lines 21 - 23, The test init() currently
calls machinev1beta1.AddToScheme(scheme.Scheme) and ignores the returned error;
update the test to check and handle that error instead—either replace the init()
with a TestMain(m *testing.M) that calls
machinev1beta1.AddToScheme(scheme.Scheme) and exits non‑zero on error, or call
utilruntime.Must(machinev1beta1.AddToScheme(scheme.Scheme)) (from
k8s.io/apimachinery/pkg/util/runtime) inside init() so failures are not silently
ignored; reference: init(), machinev1beta1.AddToScheme, scheme.Scheme, TestMain,
utilruntime.Must.

Comment on lines 105 to 106
e.CreationTime = &pod.CreationTimestamp.Time
e.StartTime = &pod.Status.StartTime.Time


⚠️ Potential issue | 🔴 Critical

Potential nil pointer dereference on StartTime.

pod.Status.StartTime can be nil for pods that haven't started yet. Dereferencing it unconditionally will cause a panic.

🐛 Proposed fix
 	e.CreationTime = &pod.CreationTimestamp.Time
-	e.StartTime = &pod.Status.StartTime.Time
+	if pod.Status.StartTime != nil {
+		e.StartTime = &pod.Status.StartTime.Time
+	}
 	e.CompletionTime = getPodCompletionTime(pod)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/pods.go` around lines 105 - 106, pod.Status.StartTime can be nil
and dereferencing it in the assignment to e.StartTime will panic; update the
code around the e.StartTime assignment (the lines handling pod.CreationTimestamp
and pod.Status.StartTime) to check pod.Status.StartTime for nil before taking
its Time pointer and assign e.StartTime only when non-nil (otherwise leave it
nil or explicitly set to nil) to avoid a nil pointer dereference.

@droslean droslean force-pushed the metrics-99 branch 2 times, most recently from 7b826a5 to 3d3d9f0 (February 17, 2026 14:45)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/metrics/metrics.go (1)

165-173: ⚠️ Potential issue | 🟠 Major

Include MachinesPlugin data in the flushed metrics artifact.

Without this, collected machine metrics are never persisted.

Proposed fix
 	output := map[string]any{
 		ma.insightsPlugin.Name(): ma.insightsPlugin.Events(),
 		ma.eventsPlugin.Name():   ma.eventsPlugin.Events(),
 		ma.buildPlugin.Name():    ma.buildPlugin.Events(),
 		ma.nodesPlugin.Name():    ma.nodesPlugin.Events(),
 		ma.leasePlugin.Name():    ma.leasePlugin.Events(),
 		ma.imagesPlugin.Name():   ma.imagesPlugin.Events(),
 		ma.podPlugin.Name():      ma.podPlugin.Events(),
+		ma.machinesPlugin.Name(): ma.machinesPlugin.Events(),
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 165 - 173, The metrics flush omitted the
Machines plugin, so add the machines plugin entry to the output map: include
ma.machinesPlugin.Name(): ma.machinesPlugin.Events() alongside the existing
entries (e.g., ma.insightsPlugin, ma.eventsPlugin, ma.buildPlugin,
ma.nodesPlugin, ma.leasePlugin, ma.imagesPlugin, ma.podPlugin) so machine
metrics are included in the flushed artifact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/metrics/machines.go`:
- Around line 122-128: In getWorkloadCounts, the MachineSet listing is not
namespace-scoped which breaks namespace-scoped RBAC; update the p.client.List
call that populates machineSetList to include the namespace option (use the same
MachineAPINamespace option used when listing Machines) so the List is scoped to
MachineAPINamespace; modify the p.client.List invocation in getWorkloadCounts
(for variable machineSetList) to pass client.InNamespace(MachineAPINamespace)
alongside p.ctx.

In `@pkg/metrics/metrics.go`:
- Around line 85-88: The MachineAutoscaler list call is currently cluster-scoped
(autoscalerList + client.List(ctx, autoscalerList)) which breaks under
namespace-scoped RBAC; update the client.List invocation to restrict the query
to the machine API namespace (e.g., pass a namespace option such as
client.InNamespace(<machine API namespace>)) so that MachineAutoscalerList is
populated; ensure you use the correct namespace source (a constant or config
value) rather than hardcoding, and keep the same logger/error handling around
the filtered client.List call.
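
Both inline prompts reduce to the same namespace-scoping pattern; a rough sketch (client, ctx, and logger stand in for whatever the surrounding code uses, and MachineAPINamespace is the constant added in pkg/metrics/metrics.go):

// Namespace-scoped listing sketch; the same option applies to MachineSetList in
// machines.go and MachineAutoscalerList in metrics.go.
autoscalerList := &autoscalingv1beta1.MachineAutoscalerList{}
if err := client.List(ctx, autoscalerList, ctrlruntimeclient.InNamespace(MachineAPINamespace)); err != nil {
	logger.WithError(err).Warn("Failed to list MachineAutoscalers")
}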


@droslean droslean force-pushed the metrics-99 branch 2 times, most recently from b88df5a to 195fcb1 (February 17, 2026 14:46)
@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e

@droslean
Member Author

/test e2e

…of the machinesets

Signed-off-by: Nikolaos Moraitis <nmoraiti@redhat.com>
@droslean
Member Author

/test e2e

1 similar comment
@droslean
Member Author

/test e2e

@droslean
Member Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Feb 19, 2026

@droslean: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Details  Required  Rerun command
ci/prow/breaking-changes   853074a  link     false     /test breaking-changes
ci/prow/lint               853074a  link     true      /test lint
ci/prow/e2e                853074a  link     true      /test e2e

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
