Summary
Add an optional untainting controller to the Datadog Operator that removes a specific taint from a node once the Datadog agent is successfully running on it. This would allow users to enforce that the Datadog agent is the first workload scheduled on new nodes, ensuring observability coverage before any other workloads begin execution.
Use Case
In many environments, it's important to ensure observability agents like the Datadog agent are running before any application workloads are scheduled on a node. One pattern to achieve this is to apply a "startup" taint (e.g., node.datadoghq.com/startup=datadog:NoSchedule) to the node pool at provisioning time. The taint blocks all pods except those that tolerate it (e.g., the Datadog agent).
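To make the blocking behavior concrete, here is a minimal Go sketch of how a toleration is matched against a taint. The `Taint` and `Toleration` types are simplified local stand-ins rather than the real Kubernetes API types, and `schedulable` is a hypothetical helper that only considers `NoSchedule` taints. It is meant to illustrate the matching rules, not to reproduce the scheduler.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes
// core/v1 types; only the fields needed for matching are modeled.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key      string
	Operator string // "Equal" (also the default when empty) or "Exists"
	Value    string
	Effect   string // empty matches every effect
}

// tolerates reports whether a single toleration matches a taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// schedulable is a hypothetical helper: a pod may land on the node only if
// every NoSchedule taint is matched by at least one of its tolerations.
func schedulable(taints []Taint, tols []Toleration) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" {
			continue
		}
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	startup := Taint{Key: "node.datadoghq.com/startup", Value: "datadog", Effect: "NoSchedule"}
	agent := []Toleration{{Key: startup.Key, Operator: "Exists", Effect: "NoSchedule"}}

	fmt.Println("agent schedulable:", schedulable([]Taint{startup}, agent)) // true
	fmt.Println("app schedulable:", schedulable([]Taint{startup}, nil))     // false
}
```

With the startup taint in place, only the pod carrying a matching toleration passes the check, which is what keeps application workloads off the node until the taint is removed.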
Currently, there's no automated way to remove this taint once the agent is confirmed to be running. Removing it requires out-of-band scripting or external controllers, which adds operational overhead and complexity.
By having the Datadog Operator manage this behavior, the system can:
- Ensure the agent is the first workload on a node
- Automatically remove the startup taint once the agent is successfully running
- Reduce complexity and eliminate the need for custom automation
Proposal
Introduce a new optional controller in the Datadog Operator that:
- Watches nodes that carry a configurable taint (e.g., node.datadoghq.com/startup=datadog:NoSchedule)
- Detects when a healthy Datadog agent pod is running on that node
- Removes the configured taint from the node
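The three steps above could be sketched roughly as follows. This is an illustration only, not the Operator's actual code: `Node`, `Pod`, and `Taint` are simplified local types, the `app=datadog-agent` label is a hypothetical selector for agent pods, and "healthy" is assumed to mean running and ready. A real controller would use the Kubernetes client and patch the Node object instead of returning a copy.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes Node, Pod, and Taint types.
type Taint struct {
	Key, Value, Effect string
}

type Pod struct {
	NodeName string
	Labels   map[string]string
	Running  bool
	Ready    bool
}

type Node struct {
	Name   string
	Taints []Taint
}

// agentHealthyOn reports whether a running, ready agent pod exists on the
// node. The "app=datadog-agent" label is a hypothetical selector.
func agentHealthyOn(nodeName string, pods []Pod) bool {
	for _, p := range pods {
		if p.NodeName == nodeName && p.Labels["app"] == "datadog-agent" && p.Running && p.Ready {
			return true
		}
	}
	return false
}

// withoutTaint returns a copy of taints with target removed, and whether
// anything was actually removed.
func withoutTaint(taints []Taint, target Taint) ([]Taint, bool) {
	kept := make([]Taint, 0, len(taints))
	removed := false
	for _, t := range taints {
		if t == target {
			removed = true
			continue
		}
		kept = append(kept, t)
	}
	return kept, removed
}

// reconcileNode removes the startup taint only once the agent is healthy.
// It returns the (possibly updated) node and whether it changed.
func reconcileNode(node Node, pods []Pod, startup Taint) (Node, bool) {
	if !agentHealthyOn(node.Name, pods) {
		return node, false
	}
	taints, removed := withoutTaint(node.Taints, startup)
	if !removed {
		return node, false
	}
	node.Taints = taints
	return node, true
}

func main() {
	startup := Taint{Key: "node.datadoghq.com/startup", Value: "datadog", Effect: "NoSchedule"}
	node := Node{Name: "node-1", Taints: []Taint{startup, {Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}}}
	agent := Pod{NodeName: "node-1", Labels: map[string]string{"app": "datadog-agent"}, Running: true}

	// Agent is running but not yet ready: the taint stays.
	updated, changed := reconcileNode(node, []Pod{agent}, startup)
	fmt.Printf("untainted: %v, taints left: %d\n", changed, len(updated.Taints))

	// Agent becomes ready: only the startup taint is removed.
	agent.Ready = true
	updated, changed = reconcileNode(node, []Pod{agent}, startup)
	fmt.Printf("untainted: %v, taints left: %d\n", changed, len(updated.Taints))
}
```

The sketch matches the taint on key, value, and effect together, so unrelated taints on the node are left untouched and the operation is a no-op when the startup taint is already gone.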
Configuration could be introduced via the DatadogAgent CRD, for example:
spec:
  untaint:
    enabled: true
    taintKey: "node.datadoghq.com/startup"
    taintValue: "datadog"
    taintEffect: "NoSchedule"
Similar Patterns in the Wild
Istio provides a similar feature in its Operator to ensure that its control plane components are scheduled before other workloads.
Benefits
- Ensures observability is initialized before application workloads
- Simplifies node bootstrapping workflows in cluster autoscaling environments
- Reduces reliance on custom scripts or external controllers
- Aligns with patterns used in other Operators (e.g., Istio)
Thank you for considering this feature! I'd be happy to contribute or test this functionality if it's accepted.