feat(catalogs): expose catalog controller metrics by abhijith-darshan · Pull Request #1787 · cloudoperators/greenhouse

abhijith-darshan · 2026-02-07T00:46:05Z

Description

expose Catalog Controller metrics

What type of PR is this? (check all applicable)

Related Tickets & Documents

Related [FEAT] - Catalog controller metrics #1646

Added tests?

👍 yes
🙅 no, because they aren't needed
🙋 no, because I need help
Separate ticket for tests # (issue/pr)

Added to documentation?

📜 README.md
🤝 Documentation pages updated
🙅 no documentation needed
(if applicable) generated OpenAPI docs for CRD changes

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
New and existing unit tests pass locally with my changes

Summary by CodeRabbit

New Features
- Introduced Prometheus catalog readiness metrics with labels for catalog, organization, owner and readiness reason (SecretMissing, AuthError, Unknown) to surface auth failures and missing secrets.
Improvements
- Readiness metric updates now occur after reconciliation to reflect final state; obsolete label values are cleared.
Bug Fixes
- Catalog metrics are cleaned up when a catalog is deleted.

IvoGoman

It would be cool if this also added a metric to the PluginDefinition regarding it's ready status.
e.g. greenhouse_plugindefintion_ready{"plugindefinition", "organization", "catalog", "reason"} & one for the clusterplugindefinition

Where catalog is filled when available. That way we could have a overall perses dashboard that shows PluginDefinitions managed by Catalogs as well.

internal/controller/catalog/metrics.go

IvoGoman · 2026-02-24T16:05:14Z

internal/controller/catalog/metrics.go

+				if strings.Contains(msg, "authentication") ||
+					strings.Contains(msg, "auth") ||
+					strings.Contains(msg, "unauthorized") ||
+					strings.Contains(msg, "forbidden") ||
+					strings.Contains(msg, "credentials") {


Is this list coming from Flux? If so can we link the code as reference, in case we need to update the list.

Another option is to look at the error conditions (e.g. FetchFailed with reason AuthenticationFailed for the inventory. (docs)
From Flux side both (bad credentials/non-existing secret) are considered an Auth failure.

I have checked if we get the same error message inside the Inventory. It seems that the catalog controller returns the same AuthenticationFailed error that Flux emits. This is reflected in the overall Ready status but not in the inventory:

inventory: name: kustomize-7893119400682804229 ready: "True" github-corporate-greenhouse-extensions-main-14266026803966897299: - kind: GitRepository message: stored artifact for revision 'main@sha1:79f4774956559c26b03527c9f6ac6efd065bacca' name: repository-14266026803966897299 ready: "True" - kind: ArtifactGenerator message: reconciliation succeeded, generated 1 artifact(s) name: generator-14266026803966897299 ready: "True" - kind: ExternalArtifact message: Artifact is ready name: artifact-14266026803966897299 ready: "True" - kind: Kustomization message: 'Applied revision: latest@sha256:ed3d4dd6cfa7fb5e6138d7ffc176c919a901322a1b46cba47f21155a80922607' name: kustomize-14266026803966897299 ready: "True" statusConditions: conditions: - lastTransitionTime: "2026-02-24T15:58:24Z" message: 'not all catalog objects are ready: GitRepository/repository-14266026803966897299: failed to get secret sci/12github-app-secret-sci: Secret "12github-app-secret-sci" not found' reason: CatalogNotReady status: "False" type: Ready

Yes the thing is when secret not found the github repository is not created by catalog, hence this error is available in Catalog ready status otherwise the condition message is propagated to inventory list.

we are not propagating all the conditions of GitRepository only the ready so asserting on the ConditionType is not possible hence string matching the error message.

We can create a follow up ticket to propagate all conditions to inventory so we can easily find by ConditionType and it would be more stable than string matching!

In the meantime I will link the code reference till we implement all condition propagation to inventory list.

Ok then I was mislead by the error message, since it follows the same pattern as the source controllers.

The Ready status still seems inconsistent in this case. The Ready state is False pointing to the GitRepository saying that the repo is not ready. The Inventory entry and also the GitRepo will stay ready in this case, since it is not updated. Would there be a downside of passing the secret through to the GitRepository directly.

It will save us from updating/creating a GitRepository with a non-existing secret. On the other hand we don't need to worry about the edge-case of the non-existing secret and rely solely on the Ready conditions of the Flux resources.

Just checked and the Ready condition of the GitRepository would give us what we need:

(edited the GitRepo for the Catalog with the bad secret name)

- lastTransitionTime: "2026-02-25T15:53:09Z" message: 'failed to get secret ''sci/12-github-app-secret-sci'': failed to get secret ''sci/12-github-app-secret-sci'': secrets "12-github-app-secret-sci" not found' observedGeneration: 2 reason: AuthenticationFailed status: "False" type: Ready

That would require no existence checks for secrets & adding of the reason to the inventory.

Yes sure, this can be done and we drop secrets missing from metrics and keep it to just Auth errors.

This will cover missing secrets, authentication errors, token having no permissions etc all grouped under auth.

we can then just check for AuthenticationFailed reason instead of substring matches.

coderabbitai · 2026-03-03T09:47:03Z

📝 Walkthrough

Walkthrough

Adds a Prometheus readiness metric for Catalogs, implements metric set/delete and reason determination (SecretMissing, AuthError, Unknown/OK) from inventory and Ready condition, and integrates metric updates into the catalog reconciler with deferred emission after reconciliation.

Changes

Cohort / File(s)	Summary
Metrics subsystem `internal/controller/catalog/metrics.go`	Adds a `GaugeVec` (`greenhouse_catalog_ready`) with labels (catalog, organization, owned_by, reason); registers it; implements `updateCatalogReadyMetric`, `deleteCatalogReadyMetric`, and `determineReason` to compute ownership, derive readiness reason from inventory and Ready condition, clear stale reasons, and set/remove metric series.
Controller integration `internal/controller/catalog/catalog_controller.go`	Changes `EnsureDeleted/EnsureSuspended/EnsureCreated` to accept a named `obj` parameter and perform type assertions (with `//nolint:errcheck`); defers `updateCatalogReadyMetric(catalog)` in `EnsureCreated`; calls `deleteCatalogReadyMetric` during deletion flow; adds `//nolint:unparam` to `verifyStatus`.
Minor whitespace `internal/controller/catalog/source.go`	Removes an extraneous empty line in `objectReadiness` (no behavioral change).

Sequence Diagram(s)

sequenceDiagram
    participant Controller
    participant Reconciler
    participant Metrics as "Metrics Package"
    participant Prometheus as "Prometheus Registry"

    rect rgba(200,230,255,0.5)
    Controller->>Reconciler: EnsureCreated(obj)
    Reconciler->>Reconciler: reconciliation logic (update resources)
    Reconciler-->>Controller: reconciliation result
    end

    rect rgba(200,255,200,0.5)
    Reconciler->>Metrics: updateCatalogReadyMetric(catalog) [deferred]
    Metrics->>Metrics: determineReason(catalog)
    Metrics->>Prometheus: ReadyGauge.With(labels).Set(1 or 0)
    end

    rect rgba(255,230,200,0.5)
    Controller->>Reconciler: EnsureDeleted(obj)
    Reconciler->>Metrics: deleteCatalogReadyMetric(catalog)
    Metrics->>Prometheus: ReadyGauge.DeleteLabelValues(...)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I hopped through code to plant a gauge,
I sniffed the secrets on each page.
If auth is foul or keys are missed,
I mark the reason on the list.
Small rabbit metrics, front and stage.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main feature being added: exposing catalog controller metrics.
Description check	✅ Passed	The PR description covers the main sections: provides a brief description, identifies the PR type as a feature, references a related issue, confirms tests aren't needed, indicates no documentation updates required, and completes the checklist with self-review and code quality confirmations.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/catalog_metrics

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

internal/controller/catalog/catalog_controller.go (1)

379-379: Remove unnecessary unparam suppression.

Line 379 suppresses unparam, but verifyStatus parameters are actively used. Keeping this suppression lowers lint signal quality.

Suggested cleanup

-//nolint:unparam
 func (r *CatalogReconciler) verifyStatus(ctx context.Context, catalog *greenhousev1alpha1.Catalog, allErrors []catalogError) (ctrl.Result, lifecycle.ReconcileResult, error) {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/controller/catalog/catalog_controller.go` at line 379, Remove the
unnecessary nolint:unparam directive on the verifyStatus declaration: the
function's parameters are used so the suppression should be deleted. Locate the
verifyStatus function in catalog_controller.go (the line with
"//nolint:unparam") and delete that nolint comment so the linter can properly
report real unparam issues elsewhere.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/catalog/catalog_controller.go`:
- Around line 215-217: Ensure the catalog readiness metric is cleaned up when a
Catalog CR is deleted by adding metric removal logic to EnsureDeleted: use the
same identifying fields used in UpdateCatalogReadyMetric (e.g., catalog.Name and
catalog.Namespace from the Catalog object) to remove the
greenhouse_catalog_ready series. Either call a new helper
DeleteCatalogReadyMetric(catalog) or extend UpdateCatalogReadyMetric to accept a
deletion flag, and invoke that from EnsureDeleted in the CatalogReconciler to
guarantee the metric is removed when the CR is gone.

In `@internal/controller/catalog/metrics.go`:
- Around line 46-51: The code sets reason = "OK" for ready catalogs but can also
emit "OK" for NOT-ready catalogs when determineReason(catalog) returns "OK";
update both places that compute the reason (the blocks using isReady and
determineReason) so that when isReady is false you use determineReason(catalog)
and if that returns "OK" replace it with a distinct not-ready marker (e.g.,
"UNKNOWN_NOT_READY" or empty string) instead of "OK" so not-ready catalogs never
report "OK". Ensure the same change is applied to the duplicate logic later in
the file.

---

Nitpick comments:
In `@internal/controller/catalog/catalog_controller.go`:
- Line 379: Remove the unnecessary nolint:unparam directive on the verifyStatus
declaration: the function's parameters are used so the suppression should be
deleted. Locate the verifyStatus function in catalog_controller.go (the line
with "//nolint:unparam") and delete that nolint comment so the linter can
properly report real unparam issues elsewhere.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67bbde6 and 591fb59.

📒 Files selected for processing (3)

internal/controller/catalog/catalog_controller.go
internal/controller/catalog/metrics.go
internal/controller/catalog/source.go

💤 Files with no reviewable changes (1)

internal/controller/catalog/source.go

internal/controller/catalog/catalog_controller.go

internal/controller/catalog/metrics.go

coderabbitai

♻️ Duplicate comments (1)

internal/controller/catalog/metrics.go (1)

102-110: ⚠️ Potential issue | 🟠 Major

Not-ready catalogs can still be emitted as reason="OK"

On Line 109, fallback returns "OK" even when the catalog is not ready. Since Line 50 uses this for non-ready states, you can emit value=0 with reason="OK", which breaks reason-based alerts/dashboards.

Suggested fix

- // Reason values: OK (ready), AuthError (auth issues), SecretMissing (missing secrets)
+ // Reason values: OK (ready), AuthError (auth issues), SecretMissing (missing secrets), OtherError (non-auth readiness failures)

- for _, r := range []string{"OK", "AuthError", "SecretMissing"} {
+ for _, r := range []string{"OK", "AuthError", "SecretMissing", "OtherError"} {

- return "OK"
+ return "OtherError"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/controller/catalog/metrics.go` around lines 102 - 110, The current
fallback returns "OK" even for non-ready catalogs (the block with returns for
hasMissingSecret and hasAuthError falls through to return "OK"), which causes
non-ready metrics to be emitted with reason="OK"; update the logic so that if
the catalog is not ready you return a distinct reason (e.g. "NotReady" or
"Initializing") instead of "OK" — change the final return that currently returns
"OK" to return a non-ready reason when readiness is false (use the same
readiness check used where value=0 is emitted), keeping the existing
hasMissingSecret/hasAuthError branches intact (refer to the hasMissingSecret,
hasAuthError variables and the return "OK" in this function).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/controller/catalog/metrics.go`:
- Around line 102-110: The current fallback returns "OK" even for non-ready
catalogs (the block with returns for hasMissingSecret and hasAuthError falls
through to return "OK"), which causes non-ready metrics to be emitted with
reason="OK"; update the logic so that if the catalog is not ready you return a
distinct reason (e.g. "NotReady" or "Initializing") instead of "OK" — change the
final return that currently returns "OK" to return a non-ready reason when
readiness is false (use the same readiness check used where value=0 is emitted),
keeping the existing hasMissingSecret/hasAuthError branches intact (refer to the
hasMissingSecret, hasAuthError variables and the return "OK" in this function).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 591fb59 and 01a9da9.

📒 Files selected for processing (3)

internal/controller/catalog/catalog_controller.go
internal/controller/catalog/metrics.go
internal/controller/catalog/source.go

💤 Files with no reviewable changes (1)

internal/controller/catalog/source.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/controller/catalog/catalog_controller.go

Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): set secret missing and repo auth metrics Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): set catalog ready metric Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): add license header Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): update catalog object metrics func name Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): use simple if Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): suppress unparam lint issue Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): consolidate metrics with reason grouping Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com> (chore): handle clean up of metrics when resource is deleted Signed-off-by: abhijith-darshan <abhijith.ravindra@sap.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/catalog/metrics.go`:
- Around line 44-47: The cleanup logic only deletes series for the current
ownedBy label value, leaving historical series behind; update the cleanup calls
that build label matchers (the code that computes ownedBy from catalog.Labels
and the subsequent DeleteSeries/cleanup calls) to remove/omit the owned_by label
from the matcher so deletion targets all series for the catalog+namespace
regardless of owned_by (i.e., build matchers using catalog.Name and
catalog.Namespace only, or explicitly delete across all owned_by values), and
apply the same change to the other occurrences referenced (the blocks around the
ownedBy usage at the other locations).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 93268c46-9614-4fa1-b188-07efe0a09e5a

📥 Commits

Reviewing files that changed from the base of the PR and between 01a9da9 and 9fad9d8.

📒 Files selected for processing (3)

internal/controller/catalog/catalog_controller.go
internal/controller/catalog/metrics.go
internal/controller/catalog/source.go

💤 Files with no reviewable changes (1)

internal/controller/catalog/source.go

coderabbitai · 2026-03-06T14:22:40Z

internal/controller/catalog/metrics.go

+	ownedBy := ""
+	if catalog.Labels != nil {
+		ownedBy = catalog.Labels[greenhouseapis.LabelKeyOwnedBy]
+	}


⚠️ Potential issue | 🟡 Minor

Stale metric series can remain when owned_by changes.

Cleanup uses only the current owned_by label value, so historical series for the same catalog/namespace with a previous owned_by value are never deleted and can skew dashboards/alerts.

Suggested fix

import ( + "fmt" "strings" + "sync" @@ var ( + allReasons = []string{ReasonOK, ReasonAuthError, ReasonSecretMissing, ReasonUnknown} + catalogOwnedByCache sync.Map // key: "<namespace>/<name>", value: owned_by + // readyGauge indicates whether the catalog is ready with a reason field @@ func updateCatalogReadyMetric(catalog *greenhousev1alpha1.Catalog) { ownedBy := "" if catalog.Labels != nil { ownedBy = catalog.Labels[greenhouseapis.LabelKeyOwnedBy] } + key := fmt.Sprintf("%s/%s", catalog.Namespace, catalog.Name) + if prev, ok := catalogOwnedByCache.Load(key); ok { + if prevOwnedBy, _ := prev.(string); prevOwnedBy != ownedBy { + for _, r := range allReasons { + readyGauge.DeleteLabelValues(catalog.Name, catalog.Namespace, prevOwnedBy, r) + } + } + } @@ - for _, r := range []string{ReasonOK, ReasonAuthError, ReasonSecretMissing, ReasonUnknown} { + for _, r := range allReasons { if r != reason { readyGauge.DeleteLabelValues(catalog.Name, catalog.Namespace, ownedBy, r) } } @@ readyGauge.WithLabelValues(catalog.Name, catalog.Namespace, ownedBy, reason).Set(value) + catalogOwnedByCache.Store(key, ownedBy) } @@ func deleteCatalogReadyMetric(catalog *greenhousev1alpha1.Catalog) { ownedBy := "" @@ - for _, r := range []string{ReasonOK, ReasonAuthError, ReasonSecretMissing, ReasonUnknown} { + for _, r := range allReasons { readyGauge.DeleteLabelValues(catalog.Name, catalog.Namespace, ownedBy, r) } + key := fmt.Sprintf("%s/%s", catalog.Namespace, catalog.Name) + if prev, ok := catalogOwnedByCache.LoadAndDelete(key); ok { + if prevOwnedBy, _ := prev.(string); prevOwnedBy != ownedBy { + for _, r := range allReasons { + readyGauge.DeleteLabelValues(catalog.Name, catalog.Namespace, prevOwnedBy, r) + } + } + } }

Also applies to: 60-64, 77-80, 83-84

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/controller/catalog/metrics.go` around lines 44 - 47, The cleanup logic only deletes series for the current ownedBy label value, leaving historical series behind; update the cleanup calls that build label matchers (the code that computes ownedBy from catalog.Labels and the subsequent DeleteSeries/cleanup calls) to remove/omit the owned_by label from the matcher so deletion targets all series for the catalog+namespace regardless of owned_by (i.e., build matchers using catalog.Name and catalog.Namespace only, or explicitly delete across all owned_by values), and apply the same change to the other occurrences referenced (the blocks around the ownedBy usage at the other locations).

github-actions bot added size/L feature and removed size/L labels Feb 7, 2026

abhijith-darshan force-pushed the feat/catalog_metrics branch from e00e0f8 to 8f3abe8 Compare February 7, 2026 20:53

github-actions bot added the size/L label Feb 7, 2026

abhijith-darshan marked this pull request as ready for review February 10, 2026 14:30

abhijith-darshan requested a review from a team as a code owner February 10, 2026 14:30

IvoGoman reviewed Feb 11, 2026

View reviewed changes

internal/controller/catalog/metrics.go Show resolved Hide resolved

abhijith-darshan force-pushed the feat/catalog_metrics branch 2 times, most recently from 305d140 to a1e62d7 Compare February 18, 2026 10:46

abhijith-darshan force-pushed the feat/catalog_metrics branch 2 times, most recently from 980455f to 7c3d60f Compare February 23, 2026 13:37

abhijith-darshan requested a review from IvoGoman February 23, 2026 13:40

IvoGoman reviewed Feb 24, 2026

View reviewed changes

abhijith-darshan requested a review from IvoGoman February 25, 2026 12:23

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

internal/controller/catalog/catalog_controller.go Outdated Show resolved Hide resolved

internal/controller/catalog/metrics.go Show resolved Hide resolved

abhijith-darshan force-pushed the feat/catalog_metrics branch from 591fb59 to 01a9da9 Compare March 3, 2026 14:07

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

IvoGoman previously approved these changes Mar 3, 2026

View reviewed changes

abhijith-darshan dismissed IvoGoman’s stale review via 9fad9d8 March 6, 2026 14:18

abhijith-darshan force-pushed the feat/catalog_metrics branch from 01a9da9 to 9fad9d8 Compare March 6, 2026 14:18

coderabbitai bot reviewed Mar 6, 2026

View reviewed changes

Conversation

abhijith-darshan commented Feb 7, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

What type of PR is this? (check all applicable)

Related Tickets & Documents

Added tests?

Added to documentation?

Checklist

Summary by CodeRabbit

Uh oh!

IvoGoman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

IvoGoman Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

abhijith-darshan Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

abhijith-darshan Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

IvoGoman Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

abhijith-darshan Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

abhijith-darshan commented Feb 7, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 3, 2026 •

edited

Loading