(feat) - Add High Availability Support with Leader Election #147

Merged
InsomniaCoder merged 5 commits into main from ha-noe
Oct 6, 2025
@InsomniaCoder
Contributor

Problem Statement

The Noe controller currently runs as a single replica. If the controller pod fails, webhook admission requests cannot be served.

As a result, a pod created while the Noe pod is unavailable can end up misplaced.

Proposal

Implemented a simplified High Availability architecture.

Webhook Component (Admission Controller)

  • No leader election needed - All replicas can safely process webhook requests simultaneously, load-balanced by the Kubernetes Service and its endpoints

Controller Component (Reconciler)

  • Leader election required - Only one replica actively processes pod evictions, to prevent race conditions (for example, evicting a pod that is already being evicted)
  • Automatic failover - Followers stand by and compete for leadership when the leader fails

✅ Implemented health endpoints (/healthz, /readyz on port 8081)
✅ Updated Helm template with conditional replica count and anti-affinity
✅ Fixed the HA setup at 2 replicas to keep configuration simple
✅ Automatic PodDisruptionBudget creation for HA deployments
✅ Added comprehensive documentation to README

flag.StringVar(&schedulableArchs, "cluster-schedulable-archs", "", "Comma separated list of architectures schedulable in the cluster")
flag.StringVar(&systemOS, "system-os", "linux", "Sole OS supported by the system")
flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
flag.StringVar(&healthProbeAddr, "health-probe-addr", ":8081", "The address the health probe endpoint binds to.")
Contributor Author

Keeping this as a flag for consistency with metricsAddr.
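
A small sketch of how these four flags could be parsed and inspected in isolation. The flag names, defaults, and help strings are copied from the diff; everything else is an assumption made for the example: the `config` struct, `parseFlags`, and `splitArchs` are hypothetical, and a dedicated `flag.FlagSet` is used here (instead of the global flag set in the diff) only so the parsing can be exercised with arbitrary arguments.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// config groups the flag values; the struct itself is hypothetical.
type config struct {
	schedulableArchs string
	systemOS         string
	metricsAddr      string
	healthProbeAddr  string
}

// parseFlags registers the flags from the diff on a private FlagSet and parses args.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("noe", flag.ContinueOnError)
	fs.StringVar(&c.schedulableArchs, "cluster-schedulable-archs", "", "Comma separated list of architectures schedulable in the cluster")
	fs.StringVar(&c.systemOS, "system-os", "linux", "Sole OS supported by the system")
	fs.StringVar(&c.metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
	fs.StringVar(&c.healthProbeAddr, "health-probe-addr", ":8081", "The address the health probe endpoint binds to.")
	err := fs.Parse(args)
	return c, err
}

// splitArchs turns the comma-separated flag value into a clean slice,
// trimming whitespace and dropping empty entries.
func splitArchs(s string) []string {
	var out []string
	for _, a := range strings.Split(s, ",") {
		if a = strings.TrimSpace(a); a != "" {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	c, err := parseFlags([]string{"-cluster-schedulable-archs", "amd64,arm64"})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Println(splitArchs(c.schedulableArchs), c.metricsAddr, c.healthProbeAddr)
}
```

Keeping the metrics and health probe addresses as separate flags lets the two listeners be moved independently without code changes.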

@InsomniaCoder
Contributor Author

Tested: the reconciler runs only on the leader. The other pod emits no logs from {"controller":"*controllers.PodReconciler"} but does emit webhook processing logs:

{"image":"alpine/k8s","level":"error","method":"POST","msg":"image is registered as private. Skipping anonymous authentication","path":"/mutate","pattern":"*.containers.mpi-internal.com","preferredArch":"amd64","registry":"dockerhub.containers.mpi-internal.com","request":{"kind":{"group":"","version":"v1","kind":"Pod"},"name":"","namespace":"cre-system","operation":"CREATE"},"time":"2025-10-06T07:00:00Z"}
{"compatibleImages":{"amd64":{},"arm64":{}},"level":"info","method":"POST","msg":"updating nodeSelector to match preferred architecture","path":"/mutate","preferredArch":"amd64","request":{"kind":{"group":"","version":"v1","kind":"Pod"},"name":"","namespace":"cre-system","operation":"CREATE"},"time":"2025-10-06T07:00:00Z"}
{"level":"info","method":"POST","msg":"skipping adding node selector to pod updates","path":"/mutate","request":{"kind":{"group":"","version":"v1","kind":"Pod"},"name":"patch-kube-proxy-config-metrics-generic-cronjob-29328900-mjxwg","namespace":"cre-system","operation":"UPDATE"},"time":"2025-10-06T07:00:06Z"}
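
The check described in this comment (no reconciler logs on the follower, webhook logs on both) can be automated with a short sketch like the one below. It relies only on the two fields visible in the comment, `controller` and `path`; the `logLine` type, the `classify` helper, and the category names are hypothetical, and the sample lines in `main` are shortened stand-ins rather than real output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logLine picks out only the fields needed to tell the two components apart.
type logLine struct {
	Controller string `json:"controller"`
	Path       string `json:"path"`
}

// classify labels a JSON log line as coming from the reconciler or the webhook.
func classify(raw string) string {
	var l logLine
	if err := json.Unmarshal([]byte(raw), &l); err != nil {
		return "unparsed"
	}
	switch {
	case l.Controller != "":
		return "reconciler"
	case l.Path == "/mutate":
		return "webhook"
	default:
		return "other"
	}
}

func main() {
	// Shortened, illustrative lines; real entries carry more fields.
	lines := []string{
		`{"level":"info","method":"POST","path":"/mutate","msg":"updating nodeSelector to match preferred architecture"}`,
		`{"level":"info","controller":"*controllers.PodReconciler","msg":"example"}`,
	}
	counts := map[string]int{}
	for _, line := range lines {
		counts[classify(line)]++
	}
	fmt.Println(counts)
}
```

Running this over each pod's log stream, a follower would be expected to show a zero `reconciler` count while both replicas show non-zero `webhook` counts.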

@Fsero Fsero added this pull request to the merge queue Oct 6, 2025
@Fsero Fsero removed this pull request from the merge queue due to a manual request Oct 6, 2025
@InsomniaCoder InsomniaCoder added this pull request to the merge queue Oct 6, 2025
Merged via the queue into main with commit 5a275a3 Oct 6, 2025
3 checks passed
@alfredolopezzz alfredolopezzz deleted the ha-noe branch October 6, 2025 08:12
4 participants