
Conversation

nemacysts
Member

We're still seeing task_proc/tron get stuck in pretty hot restart loops for expired resource versions - hopefully backing off a bit will help here, since one current theory we have is that hitting the apiserver so hard is causing extra load and further exacerbating the issue.
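
Roughly, the backoff looks something like this (a minimal sketch using the kubernetes Python client; `process_event`, the namespace argument, and the exact backoff numbers are illustrative placeholders, not what task_processing actually does):

```python
import random
import time

from kubernetes import client, watch


def watch_pods_with_backoff(namespace, max_backoff=60):
    """Restart the pod watch forever, sleeping longer after each failure."""
    v1 = client.CoreV1Api()  # assumes kube config was already loaded elsewhere
    backoff = 1
    while True:
        try:
            w = watch.Watch()
            for event in w.stream(v1.list_namespaced_pod, namespace=namespace):
                backoff = 1  # reset once events are flowing again
                process_event(event)  # placeholder for our event handling
        except client.exceptions.ApiException:
            # A 410 Gone here means our resourceVersion expired; instead of
            # immediately re-listing and re-watching, sleep (with jitter) so a
            # fleet of watchers doesn't hammer the apiserver in lockstep.
            time.sleep(min(backoff, max_backoff) + random.uniform(0, 1))
            backoff = min(backoff * 2, max_backoff)
```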

If this doesn't work, we'll likely want to switch to a pattern where we have a reconciliation thread/process periodically reconciling our state with k8s', on top of having the watch always restart from a resourceVersion of 0 (which skips the initial pod listing and starts the watch "now").
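
A rough sketch of that fallback pattern, again with the kubernetes Python client; `tracked_pods` and `handle_event` are hypothetical stand-ins for however we actually track pod state, and real code would need locking around the shared dict:

```python
import threading
import time

from kubernetes import client, watch


def reconcile_forever(tracked_pods, namespace, interval=300):
    """Periodically re-list pods and force our view to match k8s'."""
    v1 = client.CoreV1Api()
    while True:
        observed = {
            pod.metadata.name: pod.status.phase
            for pod in v1.list_namespaced_pod(namespace=namespace).items
        }
        # Crude reconcile: replace our view wholesale (a real implementation
        # would merge and emit events for anything that drifted).
        tracked_pods.clear()
        tracked_pods.update(observed)
        time.sleep(interval)


def watch_from_now(tracked_pods, namespace):
    """Watch with resource_version="0" so we never replay an expired version."""
    v1 = client.CoreV1Api()
    w = watch.Watch()
    for event in w.stream(
        v1.list_namespaced_pod, namespace=namespace, resource_version="0"
    ):
        handle_event(tracked_pods, event)  # placeholder event handler


tracked_pods = {}  # hypothetical shared state: pod name -> phase
threading.Thread(
    target=reconcile_forever, args=(tracked_pods, "default"), daemon=True
).start()
watch_from_now(tracked_pods, "default")
```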

nemacysts added a commit to Yelp/Tron that referenced this pull request Apr 3, 2025
This includes Yelp/task_processing#225, which
should add some backoff to watch restarts to avoid slamming the
apiserver
@nemacysts nemacysts merged commit 02e6540 into master Apr 3, 2025
2 checks passed