
Conversation


@SoumyaRaikwar commented Aug 12, 2025

What this PR does / why we need it

Adds explicit rule-based pod affinity and anti-affinity metrics for deployments to provide granular visibility into Kubernetes scheduling constraints, addressing issue #2701.

Refactored from a count-based approach to explicit rule-based metrics following maintainer feedback, for greater operational value.

Which issue(s) this PR fixes

Fixes #2701

Metrics Added

  • kube_deployment_spec_affinity - Pod affinity and anti-affinity rules with granular labels

Labels provided:

  • affinity - podaffinity | podantiaffinity
  • type - requiredDuringSchedulingIgnoredDuringExecution | preferredDuringSchedulingIgnoredDuringExecution
  • topology_key - The topology key for the rule
  • label_selector - The formatted label selector string
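
An example series (illustrative only: the topology_key and label_selector values are hypothetical, and the usual namespace/deployment labels are shown for context):

kube_deployment_spec_affinity{namespace="default", deployment="web", affinity="podantiaffinity", type="requiredDuringSchedulingIgnoredDuringExecution", topology_key="kubernetes.io/hostname", label_selector="app=web"} 1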

- Add kube_deployment_spec_pod_affinity_required_rules metric
- Add kube_deployment_spec_pod_affinity_preferred_rules metric
- Add kube_deployment_spec_pod_anti_affinity_required_rules metric
- Add kube_deployment_spec_pod_anti_affinity_preferred_rules metric
- Update deployment metrics documentation
- Add comprehensive test coverage for all scenarios
@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Aug 12, 2025
@k8s-ci-robot added the needs-triage (an issue or PR lacks a `triage/foo` label and requires one) and size/L (a PR that changes 100-499 lines, ignoring generated files) labels on Aug 12, 2025
@mrueg (Member) commented Aug 13, 2025

How would you use this metric for alerting and/or showing information about the deployment?

@SoumyaRaikwar (Author) commented Aug 13, 2025

> How would you use this metric for alerting and/or showing information about the deployment?

These metrics enable critical alerting on scheduling constraint violations. For example: (kube_deployment_spec_pod_anti_affinity_preferred_rules > 0) and (kube_deployment_spec_pod_anti_affinity_required_rules == 0) alerts when deployments rely only on soft anti-affinity rules that can be ignored during node pressure, creating single points of failure.

They also help monitor missing protection: (kube_deployment_spec_pod_anti_affinity_required_rules == 0) and (kube_deployment_spec_pod_anti_affinity_preferred_rules == 0) identifies deployments without any anti-affinity rules.

For dashboards, you can visualize cluster-wide scheduling health with count(kube_deployment_spec_pod_anti_affinity_required_rules > 0) to show how many deployments have proper distribution protection.

During incidents, these metrics help correlate why workloads ended up co-located or why pods failed to schedule due to overly complex constraints.

This addresses #2701's core need: visibility into "preferred vs required" scheduling logic to maintain reliable workload distribution during cluster events. Thanks @mrueg for the question - these use cases demonstrate the operational value of these scheduling constraint metrics!

@CatherineF-dev (Contributor) commented

/triage accepted
/assign @mrueg

@k8s-ci-robot added the triage/accepted label (the issue or PR is ready to be actively worked on) and removed the needs-triage label on Aug 13, 2025
@k8s-ci-robot added the needs-rebase label (the PR cannot be merged because it has merge conflicts with HEAD) on Aug 14, 2025
@mrueg (Member) commented Aug 14, 2025

I think the metric should be explicit, something like:

kube_deployment_affinity{affinity="podaffinity", type="requiredDuringSchedulingIgnoredDuringExecution",topologyKey="foo",labelSelector="matchExpression foo in bar,baz"}  1

Then you can count over these and get the desired result, as well as gather exactly that information about the specific affinity setting.

I'm not sure about the labelSelector at this point: whether it should be split into subtypes as well, or whether just calling https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/helpers.go#L171 is enough.
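
For illustration, a minimal Go sketch of per-rule extraction along these lines (names and structure are hypothetical, not the PR's actual generator code; it assumes only the standard apps/v1 Deployment types and metav1.FormatLabelSelector):

// Hypothetical sketch: emit one label set per pod anti-affinity rule of a
// Deployment, rendering the selector with metav1.FormatLabelSelector.
package affinity

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// antiAffinityLabelSets returns [affinity, type, topology_key, label_selector] per rule.
func antiAffinityLabelSets(d *appsv1.Deployment) [][]string {
	var sets [][]string
	aff := d.Spec.Template.Spec.Affinity
	if aff == nil || aff.PodAntiAffinity == nil {
		return sets
	}
	for _, term := range aff.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
		sets = append(sets, []string{
			"podantiaffinity", "requiredDuringSchedulingIgnoredDuringExecution",
			term.TopologyKey, metav1.FormatLabelSelector(term.LabelSelector),
		})
	}
	for _, weighted := range aff.PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
		t := weighted.PodAffinityTerm
		sets = append(sets, []string{
			"podantiaffinity", "preferredDuringSchedulingIgnoredDuringExecution",
			t.TopologyKey, metav1.FormatLabelSelector(t.LabelSelector),
		})
	}
	return sets
}

Pod affinity rules would be walked the same way via aff.PodAffinity, with affinity="podaffinity".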

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SoumyaRaikwar
Once this PR has been reviewed and has the lgtm label, please ask for approval from mrueg. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot removed the needs-rebase label on Aug 14, 2025
@SoumyaRaikwar (Author) commented

Thanks @mrueg for the feedback! I understand you're looking for more explicit metrics that expose individual affinity rule details rather than just counts.

You're absolutely right that explicit metrics would provide much more granular visibility. Instead of simple count metrics…

…etrics

- Replace 4 count-based metrics with single kube_deployment_spec_affinity metric
- Add granular labels: affinity, type, topology_key, label_selector
- Enable individual rule visibility and flexible querying
- Update tests and documentation for new metric structure
@SoumyaRaikwar force-pushed the add-deployment-pod-affinity-metrics branch from b99f07e to d1612ee on August 14, 2025 at 20:32
@k8s-ci-robot added the needs-rebase label on Aug 14, 2025
@SoumyaRaikwar (Author) commented

Hi @mrueg,

I've successfully refactored the implementation to use explicit rule-based metrics as you requested.

Key Changes:

  • Replaced 4 count-based metrics with single kube_deployment_spec_affinity metric
  • Added granular labels for individual rule visibility and flexible querying
  • Used metav1.FormatLabelSelector() for consistent labelSelector formatting
  • Updated comprehensive tests and documentation

The new approach provides significantly more operational value while keeping cardinality low and following the individual object-level data principle from the best-practices document.
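
For instance, the per-deployment counts from the earlier approach can still be derived with a query such as count by (namespace, deployment) (kube_deployment_spec_affinity{affinity="podantiaffinity", type="requiredDuringSchedulingIgnoredDuringExecution"}) (illustrative PromQL; namespace and deployment are the usual kube-state-metrics labels).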

@k8s-ci-robot removed the needs-rebase label on Aug 14, 2025
@k8s-ci-robot added and removed the needs-rebase label on Aug 25, 2025
@SoumyaRaikwar force-pushed the add-deployment-pod-affinity-metrics branch from d826011 to a7bb15f on August 26, 2025 at 11:28

linux-foundation-easycla bot commented Aug 26, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot added the needs-rebase label and removed the cncf-cla: yes label on Aug 26, 2025
timonegk and others added 6 commits September 13, 2025 23:47
This pull request fixes a logic error in metrics_writer.go where metrics
headers are replaced when a protobuf format is requested. However, the
existing logic is never used because the content type negotiation is already
done in a previous step (in metrics_handler.go). There, the content type for
proto-based formats is changed to text/plain before passing the argument
to SanitizeHeaders.

The pull request changes the condition in SanitizeHeaders to check for
the plain-text format instead of protobuf. I changed the signature of
SanitizeHeaders to accept expfmt.Format instead of string. This makes checking
the content type a bit cleaner. If this is considered a breaking change, we
can also change it to a string prefix comparison.

I encountered the error when I tried to use native histogram parsing in
Prometheus and found errors while parsing kube-state-metrics' metrics.
The issue is already described in kubernetes#2587.

Signed-off-by: Timon Engelke <[email protected]>
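
For context, a minimal sketch of the plain-text check described above (the helper name is hypothetical and this is not the actual SanitizeHeaders code; only expfmt.Format and the text/plain media type come from the Prometheus common library):

package metricswriter

import (
	"strings"

	"github.com/prometheus/common/expfmt"
)

// isPlainText reports whether the negotiated exposition format is the text
// format (e.g. "text/plain; version=0.0.4; charset=utf-8"), which is the case
// where header sanitization is actually needed per the description above.
func isPlainText(format expfmt.Format) bool {
	return strings.HasPrefix(string(format), "text/plain")
}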
…etrics

- Replace 4 count-based metrics with single kube_deployment_spec_affinity metric
- Add granular labels: affinity, type, topology_key, label_selector
- Enable individual rule visibility and flexible querying
- Update tests and documentation for new metric structure
@SoumyaRaikwar force-pushed the add-deployment-pod-affinity-metrics branch from 6f5dd85 to 994a3d0 on September 19, 2025 at 11:06
@k8s-ci-robot added the needs-rebase label on Sep 19, 2025
@k8s-ci-robot removed the needs-rebase label on Sep 19, 2025
@SoumyaRaikwar (Author) commented

Hi @mrueg,
When you have a moment, could you please review my PR?

@SoumyaRaikwar (Author) commented

Hi @CatherineF-dev, @logicalhan, and @rexagod — could you please review this PR when you have a chance?

… stray comments; keep header spacing; docs+tests updated
@SoumyaRaikwar requested a review from mrueg on September 20, 2025 at 12:01
@SoumyaRaikwar (Author) commented

Hi @mrueg, could you please review it?

@SoumyaRaikwar requested a review from mrueg on September 21, 2025 at 22:46
@SoumyaRaikwar (Author) commented

Hi @mrueg — I’ve restored the deleted kustomization.yaml files in examples/autosharding and examples/standard.
Reverted the whitespace change in internal/store/deployment_test.go; could you please take another look?

@SoumyaRaikwar (Author) commented

@mrueg @CatherineF-dev, could you please review my PR?

@SoumyaRaikwar (Author) commented

@rexagod, could you please review my PR?
