
Conversation

@VincentCCandela

Limits on memory and CPU access which would simulate database strain.

@VincentCCandela (Author)

I'll fix the linting issues in a bit.

@Red-GV changed the title from "Memory and CPU limits (Reduce Database Resources #302)" to "feat: add database resource limit fault" on Oct 21, 2025
@Red-GV linked an issue on Oct 21, 2025 that may be closed by this pull request
@rohanarora (Collaborator)

Also, @VincentCCandela, can you please consider adding a guide to the docs directory, based on our Slack discussion, on what needs to be done to add a fault mechanism?

@Red-GV (Collaborator) commented Oct 21, 2025

Please update the section in the Contributing.md instead. There's already a section there.

Collaborator

Not sure why these files were deleted? @VincentCCandela

Author

It may have been deleted when I ran "ansible-lint roles/faults roles/incidents --fix" to address the linting issues. I did not intentionally delete it. Same with sre/ansible.cfg.

sre/ansible.cfg Outdated
Collaborator

Not sure why these files were deleted? @VincentCCandela

Collaborator

Vincent (@VincentCCandela), thank you for taking a pass at this.

The intention here would be two-fold:

  1. Bump the load such that the underlying valkey/PostgreSQL service starts to suffer
  2. Introduce limits on the workload itself (again, valkey/PostgreSQL) [and not a resource quota on the namespace]. What you have here is almost the same fault mechanism as https://github.com/itbench-hub/ITBench-Scenarios/blob/main/sre/roles/faults/tasks/inject_custom_misconfigured_resource_quota.yaml

Please let me know if that is not the case.
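For illustration, point 2 would put the limit on the database workload's own container spec rather than on the namespace. A minimal, abbreviated sketch; the workload name and values here are hypothetical, not from this PR:

```yaml
# Hypothetical, abbreviated Deployment spec: the limit sits on the
# database container itself, not in a namespace-scoped ResourceQuota.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres            # hypothetical workload name
spec:
  template:
    spec:
      containers:
        - name: postgres
          resources:
            limits:
              cpu: 100m     # deliberately low to induce CPU throttling
              memory: 128Mi # deliberately low to induce memory pressure
```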

@rohanarora (Collaborator) left a comment

Dropped a couple of comments for you to take a look at, @VincentCCandela!

Collaborator

Please remove this from the commit.

Collaborator

Please remove this file.

@Red-GV (Collaborator) left a comment

A few additional comments on this side.

sre/a.out Outdated
Collaborator

Please remove this file.

vars:
  spec: "{{ fault.custom.misconfigured_service_port }}"
when:
  - fault.custom.name == 'low-mem-cpu-constraints'
Collaborator

Maybe let's change the name to this one; it should be more in line with the suggestions proposed by @rohanarora.

Suggested change:
- - fault.custom.name == 'low-mem-cpu-constraints'
+ - fault.custom.name == 'low_resource_limits'

@VincentCCandela (Author)

Made some updates. I did not update CONTRIBUTING.md because it looks like it already has the necessary information.

Added pod-level resource limits for requirement (2).
Still need to work on requirement (1): bump the load such that the underlying valkey/PostgreSQL service starts to suffer.
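As a hedged sketch of what requirement (1) might look like, one option is a short-lived Job that benchmarks the database service; every name, image, and flag below is an assumption for illustration, not part of this PR:

```yaml
# Hypothetical sketch: generate load against PostgreSQL with pgbench.
# Assumes a reachable "postgres" Service and pre-initialized pgbench
# tables (pgbench -i); credentials handling is omitted for brevity.
- name: Inject load against the PostgreSQL service
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: postgres-load                         # hypothetical name
        namespace: "{{ faults_target_namespace }}"  # hypothetical variable
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: pgbench
                image: postgres:16
                command: ["pgbench", "-h", "postgres", "-U", "postgres",
                          "-c", "50", "-T", "300"]
```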

@Red-GV (Collaborator) left a comment

Just a few comments from me. Also, although it will probably just be a copy and paste of the same warning, please go ahead and add a remove step for this fault as well.
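A remove step for this fault might, assuming the pre-fault limits were captured at injection time, look roughly like the following; all variable names here are hypothetical:

```yaml
# Hypothetical sketch: restore the resource limits recorded before injection.
- name: Remove database resource limit fault
  kubernetes.core.k8s:
    state: patched
    kind: Deployment
    name: "{{ item.name }}"
    namespace: "{{ item.namespace }}"
    definition:
      spec:
        template:
          spec:
            containers:
              - name: "{{ item.container_name }}"
                resources: "{{ item.original_resources }}"
  loop: "{{ faults_saved_resource_limits }}"  # hypothetical fact set during injection
```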

template:
  spec:
    containers:
      - name: "{{ result.resources[0].spec.template.spec.containers[0].name }}"
Collaborator

Maybe do something like this instead. Should the container being targeted not be the first in the index, this will overwrite it (and the entire array itself), which is not desired.

- faults_workloads_info is defined
- result.resources | length == 1
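One way to sidestep the index-0 assumption (a sketch, with hypothetical variable names) is to resolve the target container by name before building the patch:

```yaml
# Hypothetical sketch: pick the container by name rather than by position,
# so the patch addresses the right entry wherever it sits in the array.
- name: Resolve the target container by name
  ansible.builtin.set_fact:
    faults_target_container: >-
      {{ result.resources[0].spec.template.spec.containers
         | selectattr('name', 'equalto', faults_target_container_name)
         | first }}
```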

- name: Restart workloads to apply new resource limits
Collaborator

Off the top of my head, I'm pretty sure this is unnecessary. Once the deployment has been patched, Kubernetes should already redeploy the pod.
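If some verification is still wanted in place of an explicit restart, a hedged alternative is to wait on the rollout that the template patch already triggers (wait options as supported by kubernetes.core.k8s_info; variable names hypothetical):

```yaml
# Hypothetical sketch: instead of restarting, wait for the rollout that
# patching spec.template already kicked off.
- name: Wait for the patched deployment to finish rolling out
  kubernetes.core.k8s_info:
    kind: Deployment
    name: "{{ faults_target_deployment }}"      # hypothetical variable
    namespace: "{{ faults_target_namespace }}"  # hypothetical variable
    wait: true
    wait_condition:
      type: Progressing
      status: "True"
      reason: NewReplicaSetAvailable
    wait_timeout: 300
```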
