Handle missing Envoy metrics for active-requests scaling #4172
base: master
Conversation
Unlike the other scaling metrics, this one reports no data rather than 0 when there is no traffic, which makes the Prometheus adapter/HPA sad - these services end up stuck at max_instances during extended periods of no traffic.
| "kube_deployment", "{deployment_name}", "", "" | ||
| ) | ||
| ) by (kube_deployment) | ||
| ) by (kube_deployment) or label_replace(vector(0), "kube_deployment", "{deployment_name}", "", "") |
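To make the failure mode concrete, here is a minimal sketch of the before/after behavior in isolation (the bare metric selector and the `paasta-test-service` deployment name are hypothetical; the surrounding joins in the real templated query are elided):

```promql
# Before: with no in-flight requests anywhere, the sum matches no series
# at all, so the adapter sees an empty result instead of 0:
sum(envoy_cluster__egress_cluster_upstream_rq_active) by (kube_deployment)

# After: `or` falls back to a 0-valued series carrying the label the
# adapter/HPA expects whenever the sum returns nothing:
sum(envoy_cluster__egress_cluster_upstream_rq_active) by (kube_deployment)
  or label_replace(vector(0), "kube_deployment", "paasta-test-service", "", "")
```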
We can't just do vector(0), as then we can't join later on :)
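For anyone following along, a sketch of the difference (the deployment name is hypothetical): `vector(0)` produces a single sample with an empty label set, so nothing downstream can match it on kube_deployment; `label_replace` stamps the label on first:

```promql
vector(0)
# => {}  0            (no labels, so nothing can join on kube_deployment)

label_replace(vector(0), "kube_deployment", "paasta-test-service", "", "")
# => {kube_deployment="paasta-test-service"}  0
```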
As discussed in 1:1, I worry about this causing us to scale down if the metric disappears for some reason (maybe we break the scrape rule, or Envoy renames the metric, or something).
This seems to be a problem because we set usedonly (https://sourcegraph.yelpcorp.com/misc/eks-k8s-configs/-/blob/lib/prometheus/shard_config/envoy/additional_scrape_configs.jsonnet?L38) -- we could get rid of that, but it would cause a lot more time series to be recorded on the envoy shard. Maybe we set up a second scrape rule that allowlists specifically the envoy_cluster__egress_cluster_upstream_rq_active metric and doesn't set usedonly?
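Sketching that idea as plain Prometheus YAML (the actual config is jsonnet, and the job name and elided discovery settings are assumptions - the point is the keep-relabel plus not passing the usedonly param):

```yaml
# Hypothetical companion job to the existing envoy scrape. It does NOT pass
# Envoy's usedonly query param (so this gauge is exported even before it has
# ever been touched) and keeps only the one metric, to limit extra cardinality.
- job_name: envoy-active-requests
  metrics_path: /stats/prometheus
  # ...same service discovery and relabel_configs as the main envoy job...
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: envoy_cluster__egress_cluster_upstream_rq_active
      action: keep
```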
| "kube_deployment", "{deployment_name}", "", "" | ||
| ) | ||
| ) by (kube_deployment) | ||
| ) by (kube_deployment) or label_replace(vector(0), "kube_deployment", "{deployment_name}", "", "") |