[BUG] Success Rate Incorrect - says 0% #1163

@christensenjairus

Description

Version of Eraser

v1.3.1

Expected Behavior

I have multiple clusters running Eraser v1.3.1, and we've set our success ratio fairly low (80%, down from 95%) because we couldn't get Eraser to mark the ImageJob as successful. Looking at the logs, there appears to be a bug in the success-rate math that causes Eraser to report 0% success when one or two pods fail in an unusual way.

 {"level":"info","ts":1755535148.5651102,"logger":"controller","msg":"Marking job as failed","process":"imagejob-controller","success ratio":0.8,"actual ratio":0}

In reality, the job had 272 successful nodes and a single node whose pod ended up in an outOfCpu state. The same thing has happened on other clusters, on nodes under memory pressure instead of CPU pressure.
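
For reference, here's a minimal sketch of the arithmetic I'd expect the controller to apply (this is not Eraser's actual code; the variable names are purely illustrative), using the numbers from this job:

package main

import "fmt"

func main() {
	// Numbers from the job above: 272 nodes succeeded, 1 pod failed with outOfCpu.
	succeeded := 272.0
	failed := 1.0
	total := succeeded + failed

	actualRatio := succeeded / total // ~0.9963
	successRatio := 0.80             // runtimeConfig.manager.imageJob.successRatio

	fmt.Printf("actual ratio %.4f vs. configured success ratio %.2f\n", actualRatio, successRatio)
	if actualRatio >= successRatio {
		fmt.Println("ImageJob should be marked successful")
	} else {
		fmt.Println("ImageJob should be marked failed")
	}
}

The controller instead logged "actual ratio":0, which suggests the count of succeeded pods (or the ratio itself) gets zeroed out when a pod ends in an unexpected state such as outOfCpu.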

Expected behavior: the ImageJob is marked as successful (since the job is >99% successful) and the pods are cleaned up immediately (we have .runtimeConfig.manager.imageJob.cleanup.delayOnSuccess set to 0s).

Actual Behavior

Actual behavior: the ImageJob is marked as failed with a reported 0% success rate, and the pods aren't cleaned up (we have .runtimeConfig.manager.imageJob.cleanup.delayOnFailure set to 5h).

Steps To Reproduce

K8s v1.32.6
Eraser helm chart v1.3.1

helm values:

runtimeConfig:
  manager:
    nodeFilter:
      type: exclude
      selectors:
        - eraser.sh/exclude-node # exclude nodes with this label
    scheduling:
      repeatInterval: "6h" # default is 24h
    imageJob:
      successRatio: 0.80 # 80% success ratio for image jobs to be considered 'successful'. Needs to be lower than 100% to account for cpu/memory pressure that causes the job to fail occasionally.
      cleanup:
        delayOnSuccess: "0s" # clean up pods immediately after success
        delayOnFailure: "5h" # keep the pods around for 5 hours after failure to allow for investigation

Then put a node under enough CPU or memory pressure that an ImageJob pod errors with outOfCpu or outOfMemory.

Are you willing to submit PRs to contribute to this bug fix?

  • Yes, I am willing to implement it.
