Conversation

@nemacysts (Member)

Unlike the other scaling metrics, this one reports no data rather than 0, which makes the Prometheus adapter/HPA unhappy - these services end up stuck at max_instances during extended periods of no traffic.
@nemacysts nemacysts requested review from EvanKrall and jbuns November 26, 2025 20:58
@nemacysts nemacysts requested a review from a team as a code owner November 26, 2025 20:58
"kube_deployment", "{deployment_name}", "", ""
)
) by (kube_deployment)
) by (kube_deployment) or label_replace(vector(0), "kube_deployment", "{deployment_name}", "", "")
@nemacysts (Member Author)

can't just do vector(0), as then we can't join later on :)
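
For context on why: `vector(0)` produces a single series with no labels at all, so while it would give the HPA a value, a later join `on (kube_deployment)` would have nothing to match against. `label_replace` stamps the deployment label onto the zero series before the `or` fallback. A minimal sketch of the pattern (the deployment name `my-service` is illustrative; the real query templates in `{deployment_name}`):

```promql
# Left side: the real Envoy metric, aggregated per deployment.
# Right side: a synthetic 0 that carries the same kube_deployment label,
# so it can still participate in later on(kube_deployment) joins
# when the left side returns no data.
sum(
  envoy_cluster__egress_cluster_upstream_rq_active{kube_deployment="my-service"}
) by (kube_deployment)
or label_replace(vector(0), "kube_deployment", "my-service", "", "")
```

The `or` operator only fills in series whose label sets are absent on the left-hand side, so the 0 appears exactly when the real metric is missing.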

Member

As discussed in 1:1, I worry about this causing us to scale down if the metric disappears for some reason (maybe we break the scrape rule, or Envoy renames the metric, or something).

This seems to be a problem because we set `usedonly` (https://sourcegraph.yelpcorp.com/misc/eks-k8s-configs/-/blob/lib/prometheus/shard_config/envoy/additional_scrape_configs.jsonnet?L38) - we could get rid of that, but it would cause a lot more time series to get recorded on the envoy shard. Maybe we set up a second scrape rule that allowlists specifically the `envoy_cluster__egress_cluster_upstream_rq_active` metric and doesn't set `usedonly`?
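
One possible shape for that second scrape rule, as a hedged sketch using a standard Prometheus `metric_relabel_configs` keep rule (the job name is a placeholder and the actual jsonnet in eks-k8s-configs would look different):

```yaml
# Hypothetical second scrape job: keep only the active-request metric,
# without the usedonly filtering applied to the main Envoy job, so the
# series always exists even when it reads 0.
- job_name: envoy-rq-active
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: envoy_cluster__egress_cluster_upstream_rq_active
      action: keep
```

This keeps the extra time-series cost bounded to a single metric while guaranteeing it is always scraped.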

"kube_deployment", "{deployment_name}", "", ""
)
) by (kube_deployment)
) by (kube_deployment) or label_replace(vector(0), "kube_deployment", "{deployment_name}", "", "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in 1:1, I worry about this causing us to scale down if the metric disappears for some reason (maybe we break the scrape rule or Envoy renames the metric or something).

This seems to be a problem because we set usedonly https://sourcegraph.yelpcorp.com/misc/eks-k8s-configs/-/blob/lib/prometheus/shard_config/envoy/additional_scrape_configs.jsonnet?L38 -- we could get rid of that but it would cause a lot more time series to get recorded on the envoy shard. Maybe we set up a second scrape rule that allowlists specifically the envoy_cluster__egress_cluster_upstream_rq_active metric and doesn't set usedonly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants