Description
Version of Eraser
v1.3.1
Expected Behavior
I have multiple clusters running Eraser v1.3.1. We set our success ratio fairly low (80%, down from 95%) because we couldn't get Eraser to mark the ImageJob as successful. Looking at the logs, it seems there's a bug in the success-rate math that causes Eraser to think the job is 0% successful when one or two pods fail in an unusual way.
{"level":"info","ts":1755535148.5651102,"logger":"controller","msg":"Marking job as failed","process":"imagejob-controller","success ratio":0.8,"actual ratio":0}
In reality, the job had 272 successful nodes and one node where the pod reached an outOfCpu state. The same thing happened on other clusters on nodes with memory pressure instead of CPU pressure.
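For context, 272 successful nodes out of 273 is a ratio of roughly 0.996, well above the configured 0.8. The snippet below is purely illustrative and is not Eraser's actual implementation; it just shows the check I would expect the imagejob-controller to be making with these numbers:

package main

import "fmt"

func main() {
	// Numbers from this issue: 272 nodes succeeded, 1 pod was rejected
	// with outOfCpu before it could run.
	succeeded, failed := 272.0, 1.0
	total := succeeded + failed

	successRatio := 0.8 // runtimeConfig.manager.imageJob.successRatio
	actualRatio := succeeded / total

	fmt.Printf("actual ratio %.3f >= %.2f: %v\n",
		actualRatio, successRatio, actualRatio >= successRatio)
	// Prints: actual ratio 0.996 >= 0.80: true
	// The controller instead logged "actual ratio": 0.
}

My (unverified) guess is that the rejected pod throws off the counting somewhere rather than the arithmetic itself being wrong, but I haven't traced it through the controller code.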
Expected behavior: the ImageJob is marked as successful (as it's currently >99% successful) and the pods are cleaned up (we have .runtimeConfig.manager.imageJob.cleanup.delayOnSuccess set to 0s).
Actual Behavior
Actual behavior: the ImageJob fails with a 0% success rate and the pods aren't cleaned up (we have .runtimeConfig.manager.imageJob.cleanup.delayOnFailure set to 5h).
Steps To Reproduce
K8s v1.32.6
Eraser helm chart v1.3.1
helm values:
runtimeConfig:
  manager:
    nodeFilter:
      type: exclude
      selectors:
        - eraser.sh/exclude-node # exclude nodes with this label
    scheduling:
      repeatInterval: "6h" # default is 24h
    imageJob:
      successRatio: 0.80 # 80% success ratio for image jobs to be considered 'successful'. Needs to be lower than 100% to account for cpu/memory pressure that causes the job to fail occasionally.
      cleanup:
        delayOnSuccess: "0s" # clean up pods immediately after success
        delayOnFailure: "5h" # keep the pods around for 5 hours after failure to allow for investigation

Then get a node to have enough CPU/memory pressure to cause an ImageJob pod to error with outOfCpu or outOfMemory.
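To check whether you've hit the same state, something like the following should work. The repo URL, release/namespace, and deployment name below are the defaults I believe the Eraser docs and chart use; adjust to your setup:

# install the chart with the values above
helm repo add eraser https://eraser-dev.github.io/eraser/charts
helm upgrade --install eraser eraser/eraser \
  --namespace eraser-system --create-namespace \
  --version 1.3.1 -f values.yaml

# after the next scheduled run, look for the failed ImageJob and the
# worker pod that was rejected by the kubelet
kubectl get imagejobs.eraser.sh
kubectl get pods -n eraser-system

# the manager logs the success/actual ratio when it marks the job
kubectl logs -n eraser-system deploy/eraser-controller-manager | grep "Marking job"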
Are you willing to submit PRs to contribute to this bug fix?
- Yes, I am willing to implement it.