feat: add database resource limit fault #336
base: main
Conversation
…ch-Scenarios into mem_cpu_limits new fault limiting cpu and memory
I'll fix the linting issues in a bit.

Also @VincentCCandela can you please consider adding a guide to the …

Please update the section in the …
sre/.ansible-lint-ignore
Outdated
Not sure why these files were deleted? @VincentCCandela
It may have been deleted when I ran "ansible-lint roles/faults roles/incidents --fix" to address the linting issues. I did not intentionally delete it. Same with sre/ansible.cfg.
sre/ansible.cfg
Outdated
Not sure why these files were deleted? @VincentCCandela
Vincent (@VincentCCandela), thank you for taking a pass at this.
The intention here is two-fold:
- Bump the load such that the underlying valkey / PostgreSQL service starts to suffer
- Introduce limits on the workload itself (again, valkey / PostgreSQL) [and not a resource quota on the namespace]. What you have here is almost the same fault mechanism as https://github.com/itbench-hub/ITBench-Scenarios/blob/main/sre/roles/faults/tasks/inject_custom_misconfigured_resource_quota.yaml
Please let me know if that is not the case.
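For illustration only, a minimal sketch of what a workload-level limit injection could look like with kubernetes.core.k8s; the deployment, namespace, and container variables and the limit values below are placeholders, not definitions from this repository:

```yaml
# Illustrative sketch only: patch the database workload's own container resources
# (a strategic-merge patch on the Deployment), rather than creating a namespace quota.
# All variable names and values here are placeholders.
- name: Inject low resource limits into the database workload
  kubernetes.core.k8s:
    state: patched
    api_version: apps/v1
    kind: Deployment
    name: "{{ faults_database_deployment_name }}"   # e.g. the valkey or PostgreSQL deployment
    namespace: "{{ faults_database_namespace }}"
    definition:
      spec:
        template:
          spec:
            containers:
              - name: "{{ faults_database_container_name }}"
                resources:
                  limits:
                    cpu: 50m
                    memory: 64Mi
```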
rohanarora
left a comment
Dropped a couple of comments for you to take a look at, @VincentCCandela!
Please remove this from the commit.
sre/playbooks/roles
Outdated
Please remove this file.
Red-GV
left a comment
A few additional comments from my side.
sre/a.out
Outdated
Please remove this file.
vars:
  spec: "{{ fault.custom.misconfigured_service_port }}"
when:
  - fault.custom.name == 'low-mem-cpu-constraints'
Maybe let's change the name to this one. Should be more in line with the suggestions proposed by @rohanarora.
Suggested change: replace
  - fault.custom.name == 'low-mem-cpu-constraints'
with
  - fault.custom.name == 'low_resource_limits'
Made some updates. Did not update CONTRIBUTING.MD because it looks like it already has the necessary information. Added pod-level resource limits for requirement (2).
Red-GV
left a comment
Just a few comments from me. Also, although it will probably just be a copy and paste of the same warning, please go ahead and add a remove step for this fault as well.
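As a rough illustration of such a remove step (a sketch, not this PR's implementation; the JSON-patch path and the container index 0 are placeholders, and a real remove step might instead restore the resources captured before injection):

```yaml
# Illustrative sketch only: strip the injected limits again with a JSON patch.
# The container index (0) and the object references are placeholders.
- name: Remove injected resource limits from the database workload
  kubernetes.core.k8s_json_patch:
    api_version: apps/v1
    kind: Deployment
    name: "{{ faults_database_deployment_name }}"
    namespace: "{{ faults_database_namespace }}"
    patch:
      - op: remove
        path: /spec/template/spec/containers/0/resources/limits
```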
template:
  spec:
    containers:
      - name: "{{ result.resources[0].spec.template.spec.containers[0].name }}"
Maybe do something like this instead. If the container being targeted is not the first in the array, this will overwrite it (and the entire array itself), which is not desired.
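One possible way to avoid relying on index 0 (a sketch under assumed variable names, not the reviewer's actual suggestion) is to look the container up by name and let the strategic-merge patch touch only that entry:

```yaml
# Illustrative sketch only: select the targeted container by name instead of by index,
# so the strategic-merge patch updates that entry without clobbering the array.
# faults_target_container_name is a placeholder variable.
- name: Patch resource limits on the targeted container only
  kubernetes.core.k8s:
    state: patched
    api_version: apps/v1
    kind: Deployment
    name: "{{ result.resources[0].metadata.name }}"
    namespace: "{{ result.resources[0].metadata.namespace }}"
    definition:
      spec:
        template:
          spec:
            containers:
              - name: "{{ target_container.name }}"
                resources:
                  limits:
                    cpu: 50m
                    memory: 64Mi
  vars:
    target_container: >-
      {{ result.resources[0].spec.template.spec.containers
         | selectattr('name', 'equalto', faults_target_container_name)
         | first }}
```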
- faults_workloads_info is defined
- result.resources | length == 1

- name: Restart workloads to apply new resource limits
Off the top of my head, I'm pretty sure this is unnecessary. Once the deployment has been patched, Kubernetes should already redeploy the pod.
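If an explicit confirmation is still desired, one alternative (a sketch, assuming the same registered result fact as above) is to wait for the rollout triggered by the patch rather than restarting the workload:

```yaml
# Illustrative sketch only: the Deployment controller rolls the pods automatically after
# the patch, so instead of restarting, just wait for the Deployment to become Available.
- name: Wait for the patched deployment to finish rolling out
  kubernetes.core.k8s_info:
    api_version: apps/v1
    kind: Deployment
    name: "{{ result.resources[0].metadata.name }}"
    namespace: "{{ result.resources[0].metadata.namespace }}"
    wait: true
    wait_condition:
      type: Available
      status: "True"
    wait_timeout: 300
```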
Adds limits on memory and CPU access, which would simulate database strain.