
Conversation

nemacysts
Member

We're still seeing task_proc/tron get stuck in pretty hot restart loops for expired resource versions - hopefully backing off a bit will help here, since one current theory we have is that hitting the apiserver so hard is causing extra load and further exacerbating the issue.
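
Roughly, the backoff looks something like this (a minimal sketch using the kubernetes Python client; `process_event`, the namespace argument, and the exact backoff numbers are illustrative placeholders, not what task_processing actually does):

```python
import random
import time

from kubernetes import client, watch


def watch_pods_with_backoff(namespace, max_backoff=60):
    """Restart the pod watch forever, sleeping longer after each failure."""
    v1 = client.CoreV1Api()  # assumes kube config was already loaded elsewhere
    backoff = 1
    while True:
        try:
            w = watch.Watch()
            for event in w.stream(v1.list_namespaced_pod, namespace=namespace):
                backoff = 1  # reset once events are flowing again
                process_event(event)  # placeholder for our event handling
        except client.exceptions.ApiException:
            # A 410 Gone here means our resourceVersion expired; instead of
            # immediately re-listing and re-watching, sleep (with jitter) so a
            # fleet of watchers doesn't hammer the apiserver in lockstep.
            time.sleep(min(backoff, max_backoff) + random.uniform(0, 1))
            backoff = min(backoff * 2, max_backoff)
```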

If this doesn't work, we'll likely want to switch to a pattern where we have a reconciliation thread/process periodically reconciling our state with k8s', on top of having the watch always restart from a resourceVersion of 0 (which skips the initial pod listing and starts the watch "now").
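
A rough sketch of that fallback pattern, again with the kubernetes Python client; `tracked_pods` and `handle_event` are hypothetical stand-ins for however we actually track pod state, and real code would need locking around the shared dict:

```python
import threading
import time

from kubernetes import client, watch


def reconcile_forever(tracked_pods, namespace, interval=300):
    """Periodically re-list pods and force our view to match k8s'."""
    v1 = client.CoreV1Api()
    while True:
        observed = {
            pod.metadata.name: pod.status.phase
            for pod in v1.list_namespaced_pod(namespace=namespace).items
        }
        # Crude reconcile: replace our view wholesale (a real implementation
        # would merge and emit events for anything that drifted).
        tracked_pods.clear()
        tracked_pods.update(observed)
        time.sleep(interval)


def watch_from_now(tracked_pods, namespace):
    """Watch with resource_version="0" so we never replay an expired version."""
    v1 = client.CoreV1Api()
    w = watch.Watch()
    for event in w.stream(
        v1.list_namespaced_pod, namespace=namespace, resource_version="0"
    ):
        handle_event(tracked_pods, event)  # placeholder event handler


tracked_pods = {}  # hypothetical shared state: pod name -> phase
threading.Thread(
    target=reconcile_forever, args=(tracked_pods, "default"), daemon=True
).start()
watch_from_now(tracked_pods, "default")
```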

nemacysts added a commit to Yelp/Tron that referenced this pull request Apr 3, 2025
This includes Yelp/task_processing#225, which
should add some backoff to watch restarts to avoid slamming the
apiserver
@nemacysts nemacysts merged commit 02e6540 into master Apr 3, 2025
2 checks passed