Airbyte services stuck: no new job pods are launched

### Helm Chart Version

2.0.17

### What step the error happened?

During the Sync

### Relevant information

On version 1.8.5 I am seeing errors across `cron`, `temporal`, `worker`, and `workload launcher`, and no new pods start in the airbyte namespace. I am running 500+ syncs every two hours on this deployment.

Summary from Claude:

1. Workloads cannot be claimed (primary issue)

   `Claimed: false for workload ... via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3`

   All workloads return `Claimed: false` from the API. This suggests that another worker/dataplane has already claimed them, or that they are in an invalid state.

2. Workflows are stuck in backoff mode

   - Connections have failed multiple times (4+ failures)
   - They wait from 30s to 4+ minutes between retries
   - Source checks fail continuously

3. Temporal rate limiting (now improving)

   - Cron still hits rate limits while trying to clean up workflows
   - Temporal CPU usage dropped (807m → 492m)

4. Workflow thread leaks

   `[BUG] Workflow thread can't be destroyed in time. This will lead to a workflow cache leak`

   Workflows get stuck and cannot be cleaned up, causing memory/cache leaks.

logs from worker:
```
2025-10-15 13:26:59,390 [Workflow Executor taskQueue="CONNECTION_UPDATER", namespace="default": 115]    WARN    i.t.i.r.ReplayWorkflowTaskHandler(failureToWFTResult):302 - Workflow task processing failure. startedEventId=102, WorkflowId=connection_manager_5ffdcd75-71ff-4350-9633-549d4e1edec5, RunId=eec42b8d-c13a-48c9-86b3-40cbd47f3d39. If seen continuously the workflow might be stuck.
io.temporal.internal.statemachines.InternalWorkflowTaskException: Failure handling event 102 of type 'EVENT_TYPE_WORKFLOW_TASK_STARTED' during execution. {WorkflowTaskStartedEventId=102, CurrentStartedEventId=102}
    at io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:436)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:338)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:297)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:260)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:242)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:165)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:135)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:100)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:475)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:366)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:306)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$1(PollTaskExecutor.java:96)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
    at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:163)
    at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:103)
    at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:84)
    at io.temporal.internal.statemachines.EntityStateMachineInitialCommand.handleEvent(EntityStateMachineInitialCommand.java:70)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:487)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:336)
    ... 13 common frames omitted
Caused by: io.temporal.internal.sync.PotentialDeadlockException: [TMPRL1101] Potential deadlock detected. Workflow thread "workflow-method-connection_manager_5ffdcd75-71ff-4350-9633-549d4e1...-eec42b8d-c13a-48c9-86b3-40cbd47f3d39" didn't yield control for over a second. {detectionTimestamp=1760534816802, threadDumpTimestamp=1760534816803}
```

logs from cron:
```
2025-10-15 18:41:36,202 [scheduled-executor-thread-1]    ERROR    i.m.s.DefaultTaskExceptionHandler(handle):47 - Error invoking scheduled task for bean [io.airbyte.cron.jobs.SelfHealTemporalWorkflows@5b17a929] RESOURCE_EXHAUSTED: namespace r
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:351)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:332)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:174)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:5903)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions$lambda$0(WorkflowServiceStubsWrapped.kt:38)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.airbyte.commons.temporal.RetryHelper.withRetries(RetryHelper.kt:57)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.withRetries(WorkflowServiceStubsWrapped.kt:63)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions(WorkflowServiceStubsWrapped.kt:37)
    at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.kt:137)
    at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.kt:113)
    at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.kt:39)
    at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456)
    at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:86)
    at io.micronaut.context.bind.DefaultExecutableBeanContextBinder$ContextBoundExecutable.invoke(DefaultExecutableBeanContextBinder.java:152)
    at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$scheduleTask$2(ScheduledMethodProcessor.java:160)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
```

logs from temporal:
```
{"level":"error","ts":"2025-10-15T18:46:03.038Z","msg":"Update workflow execution operation failed.","shard-id":4,"address":"0.0.0.0:7234","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"connection_manager_4da58350-f5e0-492
{"level":"warn","ts":"2025-10-15T18:46:03.054Z","msg":"Fail to process task","shard-id":4,"address":"0.0.0.0:7234","component":"transfer-queue-processor","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"connection_manager_4d
{"level":"error","ts":"2025-10-15T18:46:03.294Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_7f4268de-48c5-42b7-a064-2ddb27b94334","wf-run-id":"01c421d
{"level":"info","ts":"2025-10-15T18:46:03.379Z","msg":"Activity task not found","component":"matching-engine","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"replication_16167","wf-run-id":"0199e91c-fb9c-7c29-a3cc-b756c1df3
{"level":"error","ts":"2025-10-15T18:46:03.642Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_8753dbaf-d92e-4bb4-8b23-425b8d3941f3","wf-run-id":"4931d68
{"level":"info","ts":"2025-10-15T18:46:04.457Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"/home/runner/wor
{"level":"error","ts":"2025-10-15T18:46:04.631Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_843c7c3d-d534-49c0-9c05-d3691f8b0657","wf-run-id":"1146a15
```
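
To see which Temporal failures dominate, the server log can be grouped by level, message, operation, and gRPC code. The following is a minimal sketch, assuming the log lines look like the excerpt above. Because the pasted lines are cut off at the terminal width and are no longer valid JSON, it extracts fields with regular expressions instead of a JSON parser. The function name `summarize_temporal_logs` and the choice of fields are illustrative only, not part of Airbyte or Temporal.

```python
import re
from collections import Counter

# Fields to extract; chosen from the excerpt above (hypothetical selection).
FIELDS = ("level", "msg", "operation", "grpc_code")
FIELD_PATTERNS = {name: re.compile(r'"%s":"([^"]*)"' % name) for name in FIELDS}


def summarize_temporal_logs(lines):
    """Count log lines grouped by (level, msg, operation, grpc_code).

    Lines may be truncated, so each field is matched independently and
    missing fields are recorded as an empty string.
    """
    counts = Counter()
    for line in lines:
        values = {}
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(line)
            values[name] = match.group(1) if match else ""
        if not values["msg"]:
            continue  # not a recognizable Temporal log line
        counts[tuple(values[name] for name in FIELDS)] += 1
    return counts


# Sample lines copied from the excerpt above (truncated as in the original).
SAMPLE_LINES = [
    '{"level":"error","ts":"2025-10-15T18:46:03.294Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_7f4268de-48c5-42b7-a064-2ddb27b94334","wf-run-id":"01c421d',
    '{"level":"info","ts":"2025-10-15T18:46:03.379Z","msg":"Activity task not found","component":"matching-engine","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"replication_16167","wf-run-id":"0199e91c-fb9c-7c29-a3cc-b756c1df3',
    '{"level":"error","ts":"2025-10-15T18:46:03.642Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_8753dbaf-d92e-4bb4-8b23-425b8d3941f3","wf-run-id":"4931d68',
]

if __name__ == "__main__":
    for key, count in summarize_temporal_logs(SAMPLE_LINES).most_common():
        print(count, key)
```

Run against the full Temporal log, a summary like this would show whether `AddWorkflowTask` / `Unavailable` is the dominant failure or only one of several.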

logs from launcher:
```
2025-10-15 18:46:35,432 [default-13]    INFO    i.a.w.l.p.h.SuccessHandler(accept):83 - Pipeline completed for workload: 9e81e5c3-48a5-4952-8f8b-b2c5fb20fdf8_15961_1_check.
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload 0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: LOAD_SHED — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: CHECK_STATUS — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,434 [default-13]    INFO    i.a.w.l.p.s.m.Stage(apply):42 - APPLY Stage: BUILD — (workloadId=cdba8436-3a0b-43aa-bcc8-0669e862340e_15799_2_check)
2025-10-15 18:46:35,434 [default-17]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload f2c2482d-74af-4beb-a2bd-6565cf8bdfaa_15995_0_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: MUTEX — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,429 [default-18]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: MUTEX — (workloadId=cbbdf29a-7217-4ca1-839b-ee8b14381e14_16095_0_sync)
2025-10-15 18:46:35,441 [default-11]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload a99c7c10-b678-47ee-aec2-433e4639a89a_15587_2_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:39,272 [default-11]    INFO    i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.
```
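
To check whether every claim is being refused or only some, the launcher log can be tallied by claim outcome. This is a rough sketch, assuming the `Claimed: <true|false> for workload <id>` format seen in the excerpt above. Two things are assumptions and not confirmed by the logs: that a successful claim is logged as `Claimed: true` by the same statement, and that the trailing `_check` / `_sync` suffix of a workload ID denotes the workload type. The function name `tally_claims` is hypothetical.

```python
import re
from collections import Counter

# Assumed format, taken from the launcher excerpt above.
CLAIM_PATTERN = re.compile(r"Claimed: (true|false) for workload (\S+)")


def tally_claims(lines):
    """Return (outcome counts, refused counts by workload type suffix)."""
    outcomes = Counter()
    refused_by_type = Counter()
    for line in lines:
        match = CLAIM_PATTERN.search(line)
        if not match:
            continue  # skip stage/pipeline lines that are not claim results
        claimed, workload_id = match.groups()
        outcomes[claimed] += 1
        if claimed == "false":
            # Assumption: the last "_"-separated token is the workload type.
            refused_by_type[workload_id.rsplit("_", 1)[-1]] += 1
    return outcomes, refused_by_type


# Sample lines copied from the excerpt above.
SAMPLE_LINES = [
    "2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload 0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)",
    "2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.",
    "2025-10-15 18:46:35,434 [default-17]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload f2c2482d-74af-4beb-a2bd-6565cf8bdfaa_15995_0_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)",
]

if __name__ == "__main__":
    outcomes, refused_by_type = tally_claims(SAMPLE_LINES)
    print("claim outcomes:", dict(outcomes))
    print("refused by type:", dict(refused_by_type))
```

If the full log shows no successful claims at all, that would support the reading that the launcher is refused across the board and not just on a few stale workloads.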





### Relevant log output

```shell

```

airbyte services stuck where no new job pods are launched #68115