
operator is not scaling the collector when status fields are incorrect #4430

@ethanmdavidson


Component(s)

collector

What happened?

Description

We use the otel operator, deployed via the Helm chart. After upgrading from 0.129.1 (chart v0.92.0) to 0.136.0 (chart v0.97.1), we ran into an issue where autoscaling no longer worked, I believe due to #4400. After reverting to 0.129.1, autoscaling is still broken, and the operator does not appear to be setting the replica count at all: I can change spec.replicas on the collector deployment, and the operator never resets it (unlike in #4400, where it kept being set to minReplicas). I have confirmed that the HPA is still working and that the spec.replicas field is still being updated correctly on the OpenTelemetryCollector (v1beta1). However, the operator is not copying this value over to the deployment.

I think the problem might be that the status fields on the OpenTelemetryCollector have gotten into a bad state. In particular, status.scale.replicas and status.scale.statusReplicas are no longer correct, and aren't changing to match the deployment status.

status:
  image: otel/opentelemetry-collector-contrib:0.136.0
  scale:
    replicas: 2
    selector: app.kubernetes.io/component=opentelemetry-collector,app.kubernetes.io/instance=otel-collector.otel-main,app.kubernetes.io/managed-by=opentelemetry-operator,app.kubernetes.io/name=otel-main-collector,app.kubernetes.io/part-of=opentelemetry,app.kubernetes.io/version=latest,managed_by=terraform,repo=prefab,service_id=otel-collector,team=platform
    statusReplicas: 0/2
  version: 0.136.0

I have verified that the selector is correct. However, replicas and statusReplicas are not: the deployment currently has 3 replicas, all of which are healthy, yet the status still reports 2 and 0/2.
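The mismatch can be checked mechanically. A minimal sketch using the numbers observed above; the resource names are assumptions inferred from the selector labels, and the kubectl commands (shown as comments) require cluster access:

```shell
# Hypothetical names, inferred from the selector labels; adjust to your cluster.
NS=otel
NAME=otel-main

# Values observed in this issue:
STATUS_REPLICAS=2   # from status.scale.replicas on the OpenTelemetryCollector
DEPLOY_REPLICAS=3   # actual healthy replicas on the Deployment

# The live values can be fetched with:
#   kubectl get opentelemetrycollector "$NAME" -n "$NS" \
#     -o jsonpath='{.status.scale.replicas}'
#   kubectl get deployment "$NAME-collector" -n "$NS" \
#     -o jsonpath='{.status.readyReplicas}'

if [ "$STATUS_REPLICAS" -ne "$DEPLOY_REPLICAS" ]; then
  echo "stale status: scale.replicas=$STATUS_REPLICAS but deployment has $DEPLOY_REPLICAS"
fi
```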

Steps to Reproduce

  1. Deploy the operator
  2. Deploy an OpenTelemetryCollector
  3. Edit the status.scale.replicas and status.scale.statusReplicas fields to no longer match the deployment
  4. Observe that the operator stops scaling the deployment.
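Step 3 can be performed against the status subresource directly. A minimal sketch, assuming a collector named `otel-main` in namespace `otel` (both names hypothetical) and kubectl >= 1.24 for the `--subresource` flag; the cluster-side commands are shown as comments:

```shell
# Hypothetical resource names; adjust to your deployment.
NS=otel
NAME=otel-main

# Bogus scale values to inject into the collector's status (step 3):
PATCH='{"status":{"scale":{"replicas":99,"statusReplicas":"0/99"}}}'

# Sanity-check that the patch is valid JSON before applying it:
echo "$PATCH" | python3 -m json.tool >/dev/null && echo "patch ok"

# Apply it to the status subresource (requires kubectl >= 1.24 and cluster access):
#   kubectl patch opentelemetrycollector "$NAME" -n "$NS" \
#     --subresource=status --type=merge -p "$PATCH"
#
# Step 4: watch whether the operator ever reconciles the bogus values away:
#   kubectl get opentelemetrycollector "$NAME" -n "$NS" \
#     -o jsonpath='{.status.scale}'
```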

Expected Result

The operator should detect and overwrite incorrect status values during reconciliation.

Actual Result

Incorrect status values persist, and seem to prevent autoscaling.

Kubernetes Version

1.33.4

Operator version

0.129.1

Collector version

0.136.0

Environment information

Environment

  • GKE 1.33.4-gke.1350000

Log output

No relevant log output

Additional context

The collector's managedFields metadata confirms that the status is not being updated; the last status write by the operator's manager was 2025-10-09:

- apiVersion: opentelemetry.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:image: {}
        f:scale:
          .: {}
          f:replicas: {}
          f:selector: {}
          f:statusReplicas: {}
        f:version: {}
    manager: manager
    operation: Update
    subresource: status
    time: "2025-10-09T20:41:00Z"

