Detach cron job execution from canceled scheduler context #6964
Draft
Co-authored-by: olensmar <1917063+olensmar@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix cron scheduler broken state causing job failures" to "Detach cron job execution from canceled scheduler context" on Jan 15, 2026
vsukhin reviewed on Jan 15, 2026
log.Info("executing scheduled workflow")

results, err := m.executor.Execute(ctx, request)
executionCtx := context.WithoutCancel(ctx)
Collaborator
It's a bit funny to un-cancel the context; I guess the real reason is different.
Member
@greptile
Contributor
Greptile Summary
This PR fixes a persistent failure mode where cron-triggered TestWorkflow executions would immediately return
Key changes:
The fix is minimal, targeted, and directly addresses the reported bug using the standard-library primitive
Confidence Score: 5/5
Last reviewed commit: 3a01865
Pull request description
Cron-triggered TestWorkflow executions could enter a persistent failure mode where every cron run returned `context canceled` until a restart. This change isolates cron execution from scheduler context cancellation so long-lived pods keep scheduling reliably.
Example:
Checklist (choose what's happened)
Breaking changes
Changes
Fixes
`context canceled` failures on all cron jobs.
Warning
Firewall rules blocked me from connecting to one or more addresses.
I tried to connect to the following addresses, but was blocked by firewall rules:
baduri.badbadbad (dns block), while running /tmp/go-build3561596624/b2212/webhook.test
fakehost (dns block), while running /tmp/go-build3561596624/b2447/triggers.test
Original prompt
This section details on the original issue you should resolve
<issue_title>Cron Scheduler Enters Broken State - All Cron Jobs Fail with "context canceled"</issue_title>
<issue_description>Describe the bug
The internal cron scheduler can enter a permanently broken state where all cron-triggered TestWorkflow executions immediately fail with `"error": "context canceled"`. The scheduler never recovers automatically, and only a pod restart resolves the issue.
The cron reconciler continues to schedule TestWorkflows to cron jobs, but the executor component fails to execute them - the context is canceled immediately (within the same second) after scheduling.
To Reproduce
The exact trigger is unclear, but the observed conditions were:
`ENABLE_CRON_JOBS=true` and `FEATURE_NEW_ARCHITECTURE=true`
Expected behavior
The cron scheduler should reliably execute TestWorkflows according to their cron schedules without entering a broken state that requires manual intervention.
Version / Cluster
kubeshop/testkube-api-server:2.4.4)
Screenshots / Logs
Broken State - Cron Executions Failing
The executor schedules workflows but immediately fails with "context canceled":
{"level":"info","ts":"2026-01-09T08:42:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for orlop-e2e-tests/42 * * * *"}
{"level":"error","ts":"2026-01-09T08:42:00Z","caller":"cronjob/executor.go:60","msg":"cron job scheduler: executor component: error executing testworkflow for cron orlop-e2e-tests/42 * * * *","error":"context canceled","stacktrace":"github.com/kubeshop/testkube/pkg/cronjob.(*Scheduler).executeTestWorkflow\n\t/build/pkg/cronjob/executor.go:60\ngithub.com/kubeshop/testkube/pkg/cronjob.(*Scheduler).addTestWorkflowCronJob.func1\n\t/build/pkg/cronjob/testworkflow.go:191\ngithub.com/robfig/cron/v3.FuncJob.Run\n\t/go/pkg/mod/github.com/robfig/cron/v3@v3.0.1/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\t/go/pkg/mod/github.com/robfig/cron/v3@v3.0.1/cron.go:312"}
{"level":"info","ts":"2026-01-09T09:00:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for smoke-dqf-prod-dowel/0 7-17 * * *"}
{"level":"error","ts":"2026-01-09T09:00:00Z","caller":"cronjob/executor.go:60","msg":"cron job scheduler: executor component: error executing testworkflow for cron smoke-dqf-prod-dowel/0 7-17 * * *","error":"context canceled","stacktrace":"..."}
Failure rate: 156 out of 167 cron execution attempts (93%) failed with "context canceled".
Pattern: The error occurs at the same second as the scheduling - the context is canceled immediately.
After Pod Restart - Working Correctly
After restarting the pod with `kubectl rollout restart deployment/testkube-api-server -n testkube`:
{"level":"info","ts":"2026-01-09T09:59:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for smoke-onefrontend-prod-dowel-ai/59 * * * *"}
{"level":"info","ts":"2026-01-09T09:59:28Z","caller":"runner/runner.go:263","msg":"Saving execution","id":"6960d1647a09600283e61e10"}
{"level":"info","ts":"2026-01-09T10:00:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for smoke-dqf-prod-dowel/0 7-17 * * *"}
{"level":"info","ts":"2026-01-09T10:00:04Z","caller":"runner/runner.go:263","msg":"Saving execution","id":"6960d1a07a09600283e61e16"}
Pattern: Scheduling followed by successful execution (no "context canceled" errors).
Additional context
Possibly Related Logs
Around 15 days before the issue was discovered, we found logs showing various "context canceled" errors in the reconciler components. It's unclear if these are related to the broken cron state or coincidental:
{"level":"error","ts":"2025-12-24T20:03:04Z","caller":"cronjob/testworkflow.go:72","msg":"cron job scheduler: reconciler component: failed to watch TestWorkflows","error":"context canceled","stacktrace":"github.com/kubeshop/testkube/pkg/cronjob.(*Scheduler).ReconcileTestWorkflows..."}
{"level":"error","ts":"2025-12-24T20:03:04Z","caller":"cronjob/testworkflow.go:148","msg":"cron job scheduler: reconciler component: failed to watch TestWorkflowTemplates","error":"context canceled","stacktrace":...
</details>
- Fixes kubeshop/testkube#6963