Detach cron job execution from canceled scheduler context#6964

Draft
Copilot wants to merge 3 commits into main from copilot/fix-cron-scheduler-issue

Conversation

Contributor

Copilot AI commented Jan 15, 2026

Pull request description

Cron-triggered TestWorkflow executions could enter a persistent failure mode where every cron run returned context canceled until a restart. This change isolates cron execution from scheduler context cancellation so long-lived pods keep scheduling reliably.

  • Scheduler context isolation: execute cron jobs with a context that ignores scheduler cancellation.
  • Regression coverage: unit test confirms canceled scheduler context does not cancel cron execution.

Example:

executionCtx := context.WithoutCancel(ctx)
results, err := m.executor.Execute(executionCtx, request)
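
To make the effect of `context.WithoutCancel` concrete, here is a minimal standalone sketch (not code from this PR). It assumes Go 1.21 or newer, where `context.WithoutCancel` was introduced; the `ctxKey` type, the `trace` key, and the `detachedState` helper are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is a hypothetical key type used only in this illustration.
type ctxKey string

// detachedState cancels a parent context and then reports the error state of
// the parent, the error state of a context derived with WithoutCancel, and a
// value read through the derived context.
func detachedState() (parentErr, detachedErr error, value string) {
	base := context.WithValue(context.Background(), ctxKey("trace"), "abc123")
	parent, cancel := context.WithCancel(base)
	cancel() // simulate the scheduler context being canceled

	// The derived context keeps the parent's values but ignores its cancellation.
	detached := context.WithoutCancel(parent)
	v, _ := detached.Value(ctxKey("trace")).(string)
	return parent.Err(), detached.Err(), v
}

func main() {
	parentErr, detachedErr, value := detachedState()
	fmt.Println("parent:", parentErr)     // parent: context canceled
	fmt.Println("detached:", detachedErr) // detached: <nil>
	fmt.Println("value:", value)          // value: abc123
}
```

The derived context also carries no deadline, so anything that should bound the execution time has to be applied separately.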

Checklist (choose what's happened)

  • breaking change! (describe)
  • tested locally
  • tested on cluster
  • added new dependencies
  • updated the docs
  • added a test

Breaking changes

  • None.

Changes

  • Use a non-cancelable context for cron-triggered execution requests.
  • Add a unit test verifying canceled scheduler contexts do not propagate to executor calls.
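
The shape of the regression check described above can be sketched as follows. This is an illustration under stated assumptions, not the test added in this PR: `recordingExecutor`, `newCronJob`, and `runWithCanceledScheduler` are hypothetical names, and the request is reduced to a plain string.

```go
package main

import (
	"context"
	"fmt"
)

// recordingExecutor is a hypothetical stand-in for the real executor. It only
// records whether it was called and the state of the context it received.
type recordingExecutor struct {
	called bool
	ctxErr error
}

func (e *recordingExecutor) Execute(ctx context.Context, request string) error {
	e.called = true
	e.ctxErr = ctx.Err()
	return e.ctxErr
}

// newCronJob mimics a closure registered with a cron library. It captures the
// scheduler context once and, if detach is set, strips cancellation before
// calling the executor.
func newCronJob(ctx context.Context, exec *recordingExecutor, detach bool) func() error {
	return func() error {
		executionCtx := ctx
		if detach {
			executionCtx = context.WithoutCancel(ctx)
		}
		return exec.Execute(executionCtx, "scheduled-workflow")
	}
}

// runWithCanceledScheduler cancels the scheduler context first and then fires
// the job closure once.
func runWithCanceledScheduler(detach bool) (called bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	exec := &recordingExecutor{}
	err = newCronJob(ctx, exec, detach)()
	return exec.called, err
}

func main() {
	_, before := runWithCanceledScheduler(false)
	_, after := runWithCanceledScheduler(true)
	fmt.Println("without detaching:", before) // without detaching: context canceled
	fmt.Println("with detaching:", after)     // with detaching: <nil>
}
```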

Fixes

  • Prevents the cron scheduler from getting stuck in a state where every cron job fails immediately with context canceled.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • baduri.badbadbad
    • Triggering command: /tmp/go-build3561596624/b2212/webhook.test /tmp/go-build3561596624/b2212/webhook.test -test.testlogfile=/tmp/go-build3561596624/b2212/testlog.txt -test.paniconexit0 -test.gocoverdir=/tmp/go-build3561596624/b2212/gocoverdir -test.v=test2json -test.timeout=10m0s -test.coverprofile=/tmp/go-build3561596624/b2212/_cover_.out -importcfg /tmp/go-build3561596624/b2184/importcfg -pack /home/REDACTED/go/pkg/mod/k8s.io/client-go@v0.34.0/kubernetes/typed/storage/v1/fake/doc.go /home/REDACTED/go/pkg/mod/k8s.io/client-go@v0.34.0/kubernetes/typed/storage/v1/fake/fake_csidriver.go /act�� /action_handlers/tmp/go-build3561596624/b1444 /execution_conte-nolocalimports ux-amd64/pkg/tool/linux_amd64/vet /state_manager.g/home/REDACTED/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/pkg/too-trimpath -ifaceassert t ux-amd64/pkg/too-trimpath (dns block)
  • fakehost
    • Triggering command: /tmp/go-build3561596624/b2447/triggers.test /tmp/go-build3561596624/b2447/triggers.test -test.testlogfile=/tmp/go-build3561596624/b2447/testlog.txt -test.paniconexit0 -test.gocoverdir=/tmp/go-build3561596624/b2447/gocoverdir -test.v=test2json -test.timeout=10m0s -test.coverprofile=/tmp/go-build3561596624/b2447/_cover_.out -c=4 -nolocalimports -importcfg /tmp/go-build3561596624/b2382/importcfg -pack port�� d/apps/v1/fake/d/tmp/go-build3561596624/b2299 d/apps/v1/fake/f-c=4 b.com/kubeshop/testkube/pkg/mapper/executions;/tmp/go-build3561596624/b2258=> -errorsas -ifaceassert t 0.1-go1.25.0.lin-importcfg (dns block)


Original prompt

This section details the original issue you should resolve

<issue_title>Cron Scheduler Enters Broken State - All Cron Jobs Fail with "context canceled"</issue_title>
<issue_description>Describe the bug

The internal cron scheduler can enter a permanently broken state where all cron-triggered TestWorkflow executions immediately fail with "error": "context canceled". The scheduler never recovers automatically, and only a pod restart resolves the issue.

The cron reconciler continues to schedule TestWorkflows to cron jobs, but the executor component fails to execute them - the context is canceled immediately (within the same second) after scheduling.

To Reproduce

The exact trigger is unclear, but the observed conditions were:

  1. Run testkube-api-server with ENABLE_CRON_JOBS=true and FEATURE_NEW_ARCHITECTURE=true
  2. Have TestWorkflows with cron schedules configured
  3. Let the pod run for an extended period (in our case ~15 days)
  4. At some point, all cron-triggered TestWorkflow executions start failing with "context canceled"
  5. The issue persists indefinitely until the testkube-api-server pod is restarted

Expected behavior

The cron scheduler should reliably execute TestWorkflows according to their cron schedules without entering a broken state that requires manual intervention.

Version / Cluster

  • Testkube version: 2.4.4 (kubeshop/testkube-api-server:2.4.4)
  • Kubernetes cluster: AWS EKS
  • Kubernetes version: 1.33
  • Configuration:
    ENABLE_CRON_JOBS=true
    FEATURE_NEW_ARCHITECTURE=true
    NATS_EMBEDDED=false
    NATS_URI=nats://testkube-nats
    

Screenshots / Logs

Broken State - Cron Executions Failing

The executor schedules workflows but immediately fails with "context canceled":

{"level":"info","ts":"2026-01-09T08:42:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for orlop-e2e-tests/42 * * * *"}
{"level":"error","ts":"2026-01-09T08:42:00Z","caller":"cronjob/executor.go:60","msg":"cron job scheduler: executor component: error executing testworkflow for cron orlop-e2e-tests/42 * * * *","error":"context canceled","stacktrace":"github.com/kubeshop/testkube/pkg/cronjob.(*Scheduler).executeTestWorkflow\n\t/build/pkg/cronjob/executor.go:60\ngithub.com/kubeshop/testkube/pkg/cronjob.(*Scheduler).addTestWorkflowCronJob.func1\n\t/build/pkg/cronjob/testworkflow.go:191\ngithub.com/robfig/cron/v3.FuncJob.Run\n\t/go/pkg/mod/github.com/robfig/cron/v3@v3.0.1/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\t/go/pkg/mod/github.com/robfig/cron/v3@v3.0.1/cron.go:312"}

{"level":"info","ts":"2026-01-09T09:00:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for smoke-dqf-prod-dowel/0 7-17 * * *"}
{"level":"error","ts":"2026-01-09T09:00:00Z","caller":"cronjob/executor.go:60","msg":"cron job scheduler: executor component: error executing testworkflow for cron smoke-dqf-prod-dowel/0 7-17 * * *","error":"context canceled","stacktrace":"..."}

Failure rate: 156 out of 167 cron execution attempts (93%) failed with "context canceled".

Pattern: The error occurs at the same second as the scheduling - the context is canceled immediately.
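
One hypothesis consistent with this pattern is that each cron job closure captures a long-lived context, and once that context is canceled every later run fails before doing any work. The sketch below illustrates only that mechanism. It is an assumption about the cause, not a confirmed diagnosis, and `failuresAfterCancel` is a hypothetical helper.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// failuresAfterCancel registers a job that captures a long-lived context,
// cancels that context once, and then counts how many of the following runs
// fail with context.Canceled.
func failuresAfterCancel(ticks int) int {
	ctx, cancel := context.WithCancel(context.Background())

	// Stand-in for an executor call that honors the captured context.
	job := func() error {
		if err := ctx.Err(); err != nil {
			return err // returns immediately, before any work starts
		}
		return nil
	}

	cancel() // cancellation is permanent; the context never becomes usable again

	failed := 0
	for i := 0; i < ticks; i++ {
		if errors.Is(job(), context.Canceled) {
			failed++
		}
	}
	return failed
}

func main() {
	fmt.Println("failed runs:", failuresAfterCancel(5)) // failed runs: 5
}
```

Because a canceled context cannot be revived, only rebuilding the closures with a fresh context (for example, by restarting the process) clears the state in this model.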

After Pod Restart - Working Correctly

After restarting the pod with kubectl rollout restart deployment/testkube-api-server -n testkube:

{"level":"info","ts":"2026-01-09T09:59:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for smoke-onefrontend-prod-dowel-ai/59 * * * *"}
{"level":"info","ts":"2026-01-09T09:59:28Z","caller":"runner/runner.go:263","msg":"Saving execution","id":"6960d1647a09600283e61e10"}

{"level":"info","ts":"2026-01-09T10:00:00Z","caller":"cronjob/executor.go:48","msg":"cron job scheduler: executor component: scheduling testworkflow execution for smoke-dqf-prod-dowel/0 7-17 * * *"}
{"level":"info","ts":"2026-01-09T10:00:04Z","caller":"runner/runner.go:263","msg":"Saving execution","id":"6960d1a07a09600283e61e16"}

Pattern: Scheduling followed by successful execution (no "context canceled" errors).

Additional context

Possibly Related Logs

Around 15 days before the issue was discovered, we found logs showing various "context canceled" errors in the reconciler components. It's unclear if these are related to the broken cron state or coincidental:

{"level":"error","ts":"2025-12-24T20:03:04Z","caller":"cronjob/testworkflow.go:72","msg":"cron job scheduler: reconciler component: failed to watch TestWorkflows","error":"context canceled","stacktrace":"github.com/kubeshop/testkube/pkg/cronjob.(*Scheduler).ReconcileTestWorkflows..."}

{"level":"error","ts":"2025-12-24T20:03:04Z","caller":"cronjob/testworkflow.go:148","msg":"cron job scheduler: reconciler component: failed to watch TestWorkflowTemplates","error":"context canceled","stacktrace":...

</details>



- Fixes kubeshop/testkube#6963


Copilot AI and others added 2 commits January 15, 2026 17:58
Co-authored-by: olensmar <1917063+olensmar@users.noreply.github.com>
Co-authored-by: olensmar <1917063+olensmar@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix cron scheduler broken state causing job failures Detach cron job execution from canceled scheduler context Jan 15, 2026
Copilot AI requested a review from olensmar January 15, 2026 18:04
  log.Info("executing scheduled workflow")

- results, err := m.executor.Execute(ctx, request)
+ executionCtx := context.WithoutCancel(ctx)
Collaborator

It's a bit funny to un-cancel the context here. I'd guess the real cause is something different.

Member

olensmar commented Mar 3, 2026

@greptile

Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR fixes a persistent failure mode where cron-triggered TestWorkflow executions would immediately return context canceled after the scheduler's context became canceled, requiring a pod restart to recover. The fix wraps the context passed to executor.Execute with context.WithoutCancel so the execution lifetime is no longer tied to the scheduler's reconciliation context.

Key changes:

  • internal/cronjob/robfig/manager.go: Introduces executionCtx := context.WithoutCancel(ctx) before calling m.executor.Execute (line 144), preventing a canceled scheduler context from propagating into individual cron job executions.
  • internal/cronjob/robfig/manager_test.go: Adds a regression test that pre-cancels the scheduler context, fires the cron job closure, and asserts the executor receives a non-canceled context.
  • k8s/helm/testkube/charts/nats/test/go.mod / go.sum: Removes the legacy github.com/urfave/cli v1.22.14 indirect dependency in favour of the already-present v2 variant, along with the corresponding lockfile cleanup.

The fix is minimal, targeted, and directly addresses the reported bug using the standard-library primitive context.WithoutCancel. The regression test confirms the fix works as intended.
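
One property worth keeping in mind when reviewing: a context derived with `context.WithoutCancel` carries no deadline and also ignores shutdown signals from its parent. The following sketch shows one possible way to bound a detached execution with its own timeout. It is not part of this PR; `executionContext`, `detachedWithDeadline`, and the 30-minute timeout are hypothetical choices for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// executionContext detaches from the parent's cancellation but applies an
// independent timeout, so an execution cannot run unbounded.
func executionContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// detachedWithDeadline derives an execution context from an already canceled
// parent and reports whether it starts out canceled and whether it has a deadline.
func detachedWithDeadline() (canceledAtStart, hasDeadline bool) {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // simulate a canceled scheduler context

	ctx, stop := executionContext(parent, 30*time.Minute)
	defer stop() // release the timer when the execution is done

	_, hasDeadline = ctx.Deadline()
	return ctx.Err() != nil, hasDeadline
}

func main() {
	canceled, hasDeadline := detachedWithDeadline()
	fmt.Println("canceled at start:", canceled) // canceled at start: false
	fmt.Println("has deadline:", hasDeadline)   // has deadline: true
}
```

Whether a timeout like this is appropriate depends on how long TestWorkflow scheduling is expected to take, which this page does not state.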

Confidence Score: 5/5

  • Safe to merge — the fix is correct, minimal, and well-tested with a regression test that validates the core change.
  • The PR introduces a targeted, one-line fix using the standard-library context.WithoutCancel to decouple cron job execution from scheduler context cancellation. This directly addresses the reported bug where cron jobs were failing with "context canceled" due to canceled reconciler contexts. The regression test confirms the fix works correctly by ensuring a pre-canceled scheduler context does not propagate to the executor. All changes are minimal and focused on the specific issue.
  • No files require special attention.

Last reviewed commit: 3a01865
