ROX-31079: Reduce OOM kills in Konflux pipelines #17032
Conversation
Skipping CI for Draft Pull Request.
https://konflux-ui.apps.stone-prd-rh01.pg1f.p1.openshiftapps.com/ns/rh-acs-tenant/applications/acs/pipelineruns/main-on-push-bjlqd still failed after the downgrade!
Images are ready for the commit at 60a4392. To use with deploy scripts, first
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master   #17032      +/-   ##
==========================================
- Coverage   48.80%   48.78%   -0.02%
==========================================
  Files        2707     2712       +5
  Lines      202201   202407     +206
==========================================
+ Hits        98682    98743      +61
- Misses      95748    95878     +130
- Partials     7771     7786      +15
```

View full report in Codecov by Sentry.
/cancel
/retest
/test main-on-push
/test roxctl-on-push
/test operator-on-push
/test scanner-v4-on-push
/test scanner-v4-on-push
/test main-on-push
/test operator-on-push
/test roxctl-on-push
/cancel
/test scanner-v4-on-push
/test roxctl-on-push
/test operator-on-push
/test main-on-push
1 similar comment
/test main-on-push
/cancel
/test main-on-push
Force-pushed from c6ed81a to 5f633df
/test gke-nongroovy-e2e-tests gke-upgrade-tests
I mean, let's try 🤷 Not the traditional approach in `prefetch-dependencies`, but if it works, it works.
/retest-required
Force-pushed from 93ed86b to 2a440fe
/retest-required
This one also tends to be OOM killed. See https://konflux-ui.apps.stone-prd-rh01.pg1f.p1.openshiftapps.com/ns/rh-acs-tenant/applications/acs/taskruns/scanner-v4-on-push-j9qxs-build-source-image/logs

From `describe pod`:

```
step-build:
  State:          Terminated
    Reason:       OOMKilled
    Message:      [{"key":"BUILD_RESULT","value":"{\"status\": \"failure\", \"message\": \"[Errno 2] No such file or directory: '/var/workdir/source-build/bsi_output/index.json'\"}","type":1},{"key":"StartedAt","value":"2025-09-29T09:32:26.186Z","type":3}]
    Exit Code:    1
    Started:      Mon, 29 Sep 2025 11:31:53 +0200
    Finished:     Mon, 29 Sep 2025 11:44:00 +0200
  Limits:
    memory:  2Gi
  Requests:
    cpu:     250m
    memory:  512Mi
```
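For reference, a Tekton v1 PipelineRun (which is what Konflux pipelines are) can raise a single step's memory ceiling through a `taskRunSpecs` override. A minimal sketch, with illustrative names and values rather than this PR's actual change:

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: example-on-push        # hypothetical name
spec:
  pipelineRef:
    name: example-pipeline     # hypothetical pipeline reference
  taskRunSpecs:
    - pipelineTaskName: build-source-image
      stepSpecs:               # requires a Tekton release with step-level overrides
        - name: build          # the step shown as OOMKilled above
          computeResources:
            requests:
              memory: 1Gi
            limits:
              memory: 4Gi      # raised from the 2Gi visible in `describe pod`
```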
and update the corresponding comments around that. If they prefer to go without limits by default, why shouldn't we too? This could make builds faster when the cluster is less loaded and the build process can use more CPU cores.
in `main-build.yaml`.
I no longer think it's a good idea to remove them. At the very least, we'll get different behavior when nodes are fully loaded and when they are not. At worst, we may see OOM kills due to this variability. See https://redhat-internal.slack.com/archives/C04PZ7H0VA8/p1759216841405689?thread_ts=1758784462.802249&cid=C04PZ7H0VA8
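The conservative alternative this comment settles on can be sketched as pinning requests to limits, so a step's resource envelope doesn't depend on how loaded the node happens to be (fully Guaranteed QoS would additionally require every container in the pod to be pinned this way). Values are illustrative, not taken from this PR:

```yaml
# Hypothetical step-level override: requests == limits keeps behavior
# consistent between idle and busy nodes, at the cost of never borrowing
# spare node capacity.
stepSpecs:
  - name: build
    computeResources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi
```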
Force-pushed from 2a440fe to 60a4392
@msugakov: The following test failed, say
Description
prefetch-dependencies
From the Slack thread https://redhat-internal.slack.com/archives/C04PZ7H0VA8/p1758784462802249 and from my observations and trials in this PR, it's clear that OOMs happen in the `/root/.cache/hermeto/go/go1.21.0/bin/go mod download -json` command that Hermeto executes to prefetch Go dependencies.

The thread is inconclusive about the root cause. On the one hand, I'm sure little has changed on our side, especially in the `release-4.6` branch. On the other hand, what exactly happens in `go mod download`, and what to do to keep it from quickly requesting lots of memory, remains unanswered.

I used #17053 to trigger multiple runs and validate that the change is stable.
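One generic mitigation for this class of failure (not necessarily what this PR ships) is to cap the Go runtime's heap with `GOMEMLIMIT` (Go 1.19+), which `go mod download` picks up from the environment; the runtime then collects garbage more aggressively instead of growing past the cap. Whether the Konflux `prefetch-dependencies` task exposes a hook for such an env var is an assumption here, and the image reference is hypothetical:

```yaml
steps:
  - name: prefetch
    image: quay.io/example/hermeto:latest   # hypothetical image reference
    env:
      - name: GOMEMLIMIT   # soft heap limit for any Go program run in this step
        value: 1750MiB     # keep safely below the step's hard memory limit
    # ...actual Hermeto invocation elided; this only illustrates the env wiring
```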
Everything else
`build-source-image` tends to fail too. That's not news, but it started happening more frequently.

Since I touched this area, I made a couple of other cosmetic changes. See the commit messages for more details.
User-facing documentation
Testing and quality
Automated testing
No change.
How I validated my change