
Conversation

@preespp (Contributor) commented on Oct 20, 2025

Added a new fault, exhaust-node-resources, and a corresponding incident spec (Incident 297) from issue #297.

@Red-GV changed the title from "Feat: new fault exhaust node resources" to "feat: add exhaust node resources fault" on Oct 20, 2025
@Red-GV linked an issue on Oct 21, 2025 that may be closed by this pull request
@rohanarora (Collaborator) left a comment

Pree (@preespp), I dropped a few comments for you to look at. Please let us know if you have any follow-up questions.

cc: @Red-GV

@preespp (Contributor, Author) commented on Oct 28, 2025

Thank you for your comments. Below is what I changed based on what we discussed in last Thursday's meeting.

  • Changed node label from 'itbench-fault: exhaust-node' to 'node-role.kubernetes.io/app: "true"'
  • Introduced anti-affinity so that observability/tool-related pods are deployed to separate nodes, preventing them from being affected by the resource hog (a rough sketch of this scheduling setup follows this list).
  • Ensured the targeted workload is scheduled on a different node to avoid pending pods.
  • Updated incident metadata and alerts to reflect resource pressure correctly.
  • Changed tasks to label/unlabel nodes before/after resource hog deployment to maintain cluster state.
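
For reference, here is a minimal sketch of one way to express the scheduling constraints from the first two bullets, assuming a resource-hog Deployment. Apart from the `node-role.kubernetes.io/app: "true"` node label, every name, image, and selector below is illustrative rather than taken from this PR.

```yaml
# Sketch only: pin the resource hog to app-labeled nodes and keep it off
# nodes running observability/tool pods via pod anti-affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: exhaust-node-resources          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: exhaust-node-resources
  template:
    metadata:
      labels:
        app: exhaust-node-resources
    spec:
      nodeSelector:
        node-role.kubernetes.io/app: "true"   # label applied by the pre-deployment task
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/part-of   # illustrative selector for observability/tool pods
                    operator: In
                    values: ["observability"]
              topologyKey: kubernetes.io/hostname
      containers:
        - name: hog
          image: polinux/stress               # illustrative stress image
          command: ["stress"]
          args: ["--vm", "2", "--vm-bytes", "1G", "--cpu", "2"]
```

The label/unlabel tasks then come down to something like `kubectl label node <node> node-role.kubernetes.io/app=true` before deploying the hog and `kubectl label node <node> node-role.kubernetes.io/app-` afterwards to restore cluster state.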

@Red-GV (Collaborator) left a comment

A few changes I'm noticing on this round of review. Sorry if I didn't mention them earlier.

@Red-GV (Collaborator) left a comment

Just one minor thing in the ground truth and I think we're good here.

@rohanarora I'm going to recommend that we table the node labeling for now until we add observability node labels to install the tools onto. That also means we should probably avoid smoke testing this one for now because it might render the cluster inoperable.

@Red-GV (Collaborator) left a comment

@preespp Not seeing any of the alerts go off for this one. I suggest doing two things for this PR:

  1. Look at Incident 26/27 for how to enable a ChaosMesh fault rather than manually orchestrating it (a rough sketch of that approach follows this list).
  2. Make a new fault that removes resource limits on a workload. Technically, you could make two incidents out of that, though I think it overlaps a bit with #336.
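
On point 1, here is a rough sketch of what a declarative Chaos Mesh version could look like, using a StressChaos resource to generate the pressure instead of a hand-rolled hog deployment. The name, namespace, selectors, stressor sizes, and duration below are placeholders; Incident 26/27 in this repo would be the authoritative pattern to copy.

```yaml
# Sketch only: let Chaos Mesh apply CPU/memory pressure to selected pods.
# All selector values below are placeholders, not taken from this PR.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: exhaust-node-resources        # hypothetical name
  namespace: chaos-mesh               # wherever chaos experiments live in this cluster
spec:
  mode: all
  selector:
    namespaces:
      - target-app                    # placeholder: namespace of the targeted workload
    labelSelectors:
      app: target-workload            # placeholder: label of the targeted workload
  stressors:
    cpu:
      workers: 2
      load: 100
    memory:
      workers: 2
      size: "256MB"
  duration: "10m"
```

For point 2, one possible way to strip limits would be a JSON patch on the workload, e.g. `kubectl patch deployment <name> --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'`, with the container index and workload kind depending on the target.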


Development

Successfully merging this pull request may close these issues.

Exhaust Node Resources
