
Add untaint controller to Datadog Operator for startup taint removal #2052

@imdevin567

Summary

Add an optional untainting controller to the Datadog Operator that removes a specific taint from a node once the Datadog agent is successfully running on it. This would allow users to enforce that the Datadog agent is the first workload scheduled on new nodes, ensuring observability coverage before any other workloads begin execution.

Use Case

In many environments, it's important to ensure observability agents like the Datadog agent are running before any application workloads are scheduled on a node. One pattern to achieve this is to apply a "startup" taint (e.g., node.datadoghq.com/startup=datadog:NoSchedule) to the node pool at provisioning time. The taint blocks all pods except those that tolerate it (e.g., the Datadog agent).
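As a rough illustration of why the startup taint keeps ordinary pods off the node while still admitting the agent, here is a minimal sketch of taint/toleration matching. The Taint and Toleration types are simplified stand-ins for the Kubernetes API types, and blocksScheduling is a hypothetical helper; the real scheduler logic handles more cases than this.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes API types
// of the same names (assumption: only the fields used here are modeled).
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration matches a single taint.
// An empty Effect or Key on the toleration acts as a wildcard, and an empty
// Operator is treated as "Equal".
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal":
		return tol.Value == t.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// blocksScheduling is a hypothetical helper: it returns true when at least one
// NoSchedule or NoExecute taint is not tolerated by any of the tolerations.
func blocksScheduling(taints []Taint, tols []Toleration) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue // PreferNoSchedule is only a soft preference
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return true
		}
	}
	return false
}

func main() {
	startup := []Taint{{Key: "node.datadoghq.com/startup", Value: "datadog", Effect: "NoSchedule"}}
	agent := []Toleration{{Key: "node.datadoghq.com/startup", Operator: "Exists", Effect: "NoSchedule"}}

	fmt.Println(blocksScheduling(startup, nil))   // application pod without tolerations: true
	fmt.Println(blocksScheduling(startup, agent)) // agent pod with the toleration: false
}
```

In this sketch the agent tolerates the taint with the Exists operator, so it would be admitted regardless of the taint value.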

Currently, there's no automated way to remove this taint once the agent is confirmed to be running. Removing it requires out-of-band scripting or an external controller, which adds operational overhead and complexity.

By having the Datadog Operator manage this behavior, the system can:

  • Ensure the agent is the first workload on a node
  • Automatically remove the startup taint once the agent is successfully running
  • Reduce complexity and eliminate the need for custom automation

Proposal

Introduce a new optional controller in the Datadog Operator that:

  1. Watches nodes that carry a configurable taint (e.g., node.datadoghq.com/startup=datadog:NoSchedule)
  2. Detects when a healthy Datadog agent pod is running on that node
  3. Removes the configured taint from the node
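The three steps above could be sketched roughly as follows. This is only an illustration of the decision logic, not the Operator's code: Node, Pod, and Taint are trimmed-down stand-ins for the Kubernetes objects, the IsAgent and Ready fields are assumptions about how an agent pod would be identified and judged healthy, and a real controller would read and update nodes through the Kubernetes API rather than in memory.

```go
package main

import "fmt"

// Taint is a simplified stand-in for the Kubernetes node taint type.
type Taint struct {
	Key, Value, Effect string
}

// Node and Pod are hypothetical, trimmed-down views of the real objects.
type Node struct {
	Name   string
	Taints []Taint
}

type Pod struct {
	NodeName string
	IsAgent  bool // assumption: a real controller would match agent pods by label
	Ready    bool // assumption: mirrors the pod's Ready condition
}

// agentReadyOn reports whether a ready agent pod is running on the named node.
func agentReadyOn(nodeName string, pods []Pod) bool {
	for _, p := range pods {
		if p.IsAgent && p.Ready && p.NodeName == nodeName {
			return true
		}
	}
	return false
}

// withoutTaint returns a copy of taints with every exact match of target
// removed, plus a flag saying whether anything was removed.
func withoutTaint(taints []Taint, target Taint) ([]Taint, bool) {
	kept := make([]Taint, 0, len(taints))
	removed := false
	for _, t := range taints {
		if t == target {
			removed = true
			continue
		}
		kept = append(kept, t)
	}
	return kept, removed
}

// reconcile returns the node as it should look after this pass and whether an
// update is needed. It leaves the node untouched until the agent is ready.
func reconcile(node Node, pods []Pod, startup Taint) (Node, bool) {
	if !agentReadyOn(node.Name, pods) {
		return node, false
	}
	taints, removed := withoutTaint(node.Taints, startup)
	if !removed {
		return node, false
	}
	node.Taints = taints
	return node, true
}

func main() {
	startup := Taint{Key: "node.datadoghq.com/startup", Value: "datadog", Effect: "NoSchedule"}
	node := Node{Name: "node-a", Taints: []Taint{startup}}
	pods := []Pod{{NodeName: "node-a", IsAgent: true, Ready: true}}

	updated, changed := reconcile(node, pods, startup)
	fmt.Println(changed, len(updated.Taints)) // prints: true 0
}
```

Because reconcile reports no change when the taint is already gone, running it repeatedly on the same node is harmless, which is the property a controller loop would need.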

Configuration could be introduced via the DatadogAgent CRD, for example:

spec:
  untaint:
    enabled: true
    taintKey: "node.datadoghq.com/startup"
    taintValue: "datadog"
    taintEffect: "NoSchedule"
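For illustration, the proposed fields might map onto a Go type along these lines. UntaintConfig and its methods are hypothetical names, not part of the existing DatadogAgent CRD; the validation simply checks against the three taint effects Kubernetes defines.

```go
package main

import (
	"errors"
	"fmt"
)

// UntaintConfig is a hypothetical Go mirror of the proposed spec.untaint block.
type UntaintConfig struct {
	Enabled     bool
	TaintKey    string
	TaintValue  string
	TaintEffect string
}

// validEffects lists the taint effects defined by Kubernetes.
var validEffects = map[string]bool{
	"NoSchedule":       true,
	"PreferNoSchedule": true,
	"NoExecute":        true,
}

// Validate checks the fields only when the feature is enabled.
func (c UntaintConfig) Validate() error {
	if !c.Enabled {
		return nil
	}
	if c.TaintKey == "" {
		return errors.New("untaint.taintKey must be set when untaint is enabled")
	}
	if !validEffects[c.TaintEffect] {
		return fmt.Errorf("untaint.taintEffect %q is not a valid taint effect", c.TaintEffect)
	}
	return nil
}

// String renders the taint in the key=value:effect form used by kubectl.
func (c UntaintConfig) String() string {
	if c.TaintValue == "" {
		return c.TaintKey + ":" + c.TaintEffect
	}
	return c.TaintKey + "=" + c.TaintValue + ":" + c.TaintEffect
}

func main() {
	cfg := UntaintConfig{
		Enabled:     true,
		TaintKey:    "node.datadoghq.com/startup",
		TaintValue:  "datadog",
		TaintEffect: "NoSchedule",
	}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(cfg.String()) // prints: node.datadoghq.com/startup=datadog:NoSchedule
}
```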

Similar Patterns in the Wild

Istio provides a similar feature in its Operator to ensure that its control plane components are prioritized before other workloads.

Benefits

  • Ensures observability is initialized before application workloads
  • Simplifies node bootstrapping workflows in cluster autoscaling environments
  • Reduces reliance on custom scripts or external controllers
  • Aligns with patterns used in other Operators (e.g., Istio)

Thank you for considering this feature! I'd be happy to contribute or test this functionality if it's accepted.
