
Conversation

@preespp (Contributor) commented on Oct 20, 2025

Added a new fault, exhaust-node-resources, and a corresponding incident spec (Incident 297) from issue #297.

@Red-GV changed the title from "Feat: new fault exhaust node resources" to "feat: add exhaust node resources fault" on Oct 20, 2025
@Red-GV linked an issue on Oct 21, 2025 that may be closed by this pull request
@rohanarora (Collaborator) left a comment

Pree (@preespp), I dropped a few comments for you to look at. Please let us know if you have any follow-up questions.

cc: @Red-GV

@preespp (Contributor, Author) commented on Oct 28, 2025

Thank you for your comments. Below is what I changed based on what we discussed in last Thursday's meeting.

  • Changed node label from 'itbench-fault: exhaust-node' to 'node-role.kubernetes.io/app: "true"'
  • Introduced anti-affinity so that observability/tool-related pods are deployed to separate nodes, preventing them from being affected by the resource hog (a rough sketch of this scheduling setup follows this list).
  • Ensured the targeted workload is scheduled on a different node to avoid pending pods.
  • Updated incident metadata and alerts to reflect resource pressure correctly.
  • Changed tasks to label/unlabel nodes before/after resource hog deployment to maintain cluster state.
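
For reference, here is a minimal sketch of one way to express the scheduling constraints from the first two bullets, assuming a resource-hog Deployment. Apart from the `node-role.kubernetes.io/app: "true"` node label, every name, image, and selector below is illustrative rather than taken from this PR.

```yaml
# Sketch only: pin the resource hog to app-labeled nodes and keep it off
# nodes running observability/tool pods via pod anti-affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: exhaust-node-resources          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: exhaust-node-resources
  template:
    metadata:
      labels:
        app: exhaust-node-resources
    spec:
      nodeSelector:
        node-role.kubernetes.io/app: "true"   # label applied by the pre-deployment task
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/part-of   # illustrative selector for observability/tool pods
                    operator: In
                    values: ["observability"]
              topologyKey: kubernetes.io/hostname
      containers:
        - name: hog
          image: polinux/stress               # illustrative stress image
          command: ["stress"]
          args: ["--vm", "2", "--vm-bytes", "1G", "--cpu", "2"]
```

The label/unlabel tasks then come down to something like `kubectl label node <node> node-role.kubernetes.io/app=true` before deploying the hog and `kubectl label node <node> node-role.kubernetes.io/app-` afterwards to restore cluster state.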

@Red-GV (Collaborator) left a comment

A few changes I'm noticing on this round of review. Sorry if I didn't mention them earlier.

@Red-GV (Collaborator) left a comment

Just one minor thing in the ground truth and I think we're good here.

@rohanarora I'm going to recommend that we table the node labeling for now until we add observability node labels to install the tools onto. That also means we should probably avoid smoke testing this one for now because it might render the cluster inoperable.

@Red-GV (Collaborator) left a comment

@preespp Not seeing any of the alerts go off for this one. I suggest doing two things for this PR:

  1. Look at Incident 26/27 for how to enable a ChaosMesh fault rather than manually orchestrating it (a rough sketch of that approach follows this list).
  2. Make a new fault that removes resource limits on a workload. Technically, you could make two incidents out of that, though I think it overlaps a bit with #336.
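
On point 1, here is a rough sketch of what a declarative Chaos Mesh version could look like, using a StressChaos resource to generate the pressure instead of a hand-rolled hog deployment. The name, namespace, selectors, stressor sizes, and duration below are placeholders; Incident 26/27 in this repo would be the authoritative pattern to copy.

```yaml
# Sketch only: let Chaos Mesh apply CPU/memory pressure to selected pods.
# All selector values below are placeholders, not taken from this PR.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: exhaust-node-resources        # hypothetical name
  namespace: chaos-mesh               # wherever chaos experiments live in this cluster
spec:
  mode: all
  selector:
    namespaces:
      - target-app                    # placeholder: namespace of the targeted workload
    labelSelectors:
      app: target-workload            # placeholder: label of the targeted workload
  stressors:
    cpu:
      workers: 2
      load: 100
    memory:
      workers: 2
      size: "256MB"
  duration: "10m"
```

For point 2, one possible way to strip limits would be a JSON patch on the workload, e.g. `kubectl patch deployment <name> --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'`, with the container index and workload kind depending on the target.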


Development

Successfully merging this pull request may close these issues.

Exhaust Node Resources
