Airbyte services stuck: no new job pods are launched #68115

@billy-sightline

Description

Helm Chart Version

2.0.17

What step the error happened?

During the Sync

Relevant information

On version 1.8.5 I am seeing errors across the cron, temporal, worker, and workload-launcher services, and no new pods are starting in the airbyte namespace. I am running 500+ syncs every two hours here.

Summary from Claude:

  1. Workloads Cannot Be Claimed (Primary Issue)

Claimed: false for workload ... via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3

All workloads are returning Claimed: false from the API. This means another worker/dataplane has already claimed them OR they're in an invalid state.

  2. Workflows Are Stuck in Backoff Mode
  • Connections have failed multiple times (4+ failures)
  • They're waiting 30s to 4+ minutes between retries
  • Source checks are continuously failing
  3. Temporal Rate Limiting (Now Improving)
  • Cron is still hitting rate limits trying to clean up workflows
  • Temporal CPU usage improved (807m → 492m)
  4. Workflow Thread Leaks

[BUG] Workflow thread can't be destroyed in time. This will lead to a workflow cache leak
Workflows are getting stuck and can't be cleaned up, causing memory/cache leaks.

logs from worker:

2025-10-15 13:26:59,390 [Workflow Executor taskQueue="CONNECTION_UPDATER", namespace="default": 115]    WARN    i.t.i.r.ReplayWorkflowTaskHandler(failureToWFTResult):302 - Workflow task processing failure. startedEventId=102, WorkflowId=connection_manager_5ffdcd75-71ff-4350-9633-549d4e1edec5, RunId=eec42b8d-c13a-48c9-86b3-40cbd47f3d39. If seen continuously the workflow might be stuck.
io.temporal.internal.statemachines.InternalWorkflowTaskException: Failure handling event 102 of type 'EVENT_TYPE_WORKFLOW_TASK_STARTED' during execution. {WorkflowTaskStartedEventId=102, CurrentStartedEventId=102}
    at io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:436)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:338)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:297)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:260)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:242)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:165)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:135)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:100)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:475)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:366)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:306)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$1(PollTaskExecutor.java:96)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
    at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:163)
    at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:103)
    at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:84)
    at io.temporal.internal.statemachines.EntityStateMachineInitialCommand.handleEvent(EntityStateMachineInitialCommand.java:70)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:487)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:336)
    ... 13 common frames omitted
Caused by: io.temporal.internal.sync.PotentialDeadlockException: [TMPRL1101] Potential deadlock detected. Workflow thread "workflow-method-connection_manager_5ffdcd75-71ff-4350-9633-549d4e1...-eec42b8d-c13a-48c9-86b3-40cbd47f3d39" didn't yield control for over a second. {detectionTimestamp=1760534816802, threadDumpTimestamp=1760534816803}
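
The `detectionTimestamp` and `threadDumpTimestamp` values in the deadlock message look like epoch milliseconds. A minimal sketch for converting them to UTC so they can be lined up against the worker log line above, assuming the log timestamps are also in UTC (the helper name `epoch_ms_to_utc` is made up for this example):

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime using integer math."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Values copied from the PotentialDeadlockException message in the worker log.
detection = epoch_ms_to_utc(1760534816802)
thread_dump = epoch_ms_to_utc(1760534816803)

print(detection.isoformat(timespec="milliseconds"))  # 2025-10-15T13:26:56.802+00:00
print(thread_dump - detection)                       # 0:00:00.001000
```

Under that assumption, the deadlock was detected a few seconds before the WARN line at 13:26:59,390 was written.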

logs from cron:

2025-10-15 18:41:36,202 [scheduled-executor-thread-1]    ERROR    i.m.s.DefaultTaskExceptionHandler(handle):47 - Error invoking scheduled task for bean [io.airbyte.cron.jobs.SelfHealTemporalWorkflows@5b17a929] RESOURCE_EXHAUSTED: namespace r
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:351)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:332)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:174)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:5903)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions$lambda$0(WorkflowServiceStubsWrapped.kt:38)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.airbyte.commons.temporal.RetryHelper.withRetries(RetryHelper.kt:57)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.withRetries(WorkflowServiceStubsWrapped.kt:63)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions(WorkflowServiceStubsWrapped.kt:37)
    at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.kt:137)
    at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.kt:113)
    at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.kt:39)
    at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456)
    at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:86)
    at io.micronaut.context.bind.DefaultExecutableBeanContextBinder$ContextBoundExecutable.invoke(DefaultExecutableBeanContextBinder.java:152)
    at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$scheduleTask$2(ScheduledMethodProcessor.java:160)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)

logs from temporal:

{"level":"error","ts":"2025-10-15T18:46:03.038Z","msg":"Update workflow execution operation failed.","shard-id":4,"address":"0.0.0.0:7234","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"connection_manager_4da58350-f5e0-492
{"level":"warn","ts":"2025-10-15T18:46:03.054Z","msg":"Fail to process task","shard-id":4,"address":"0.0.0.0:7234","component":"transfer-queue-processor","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"connection_manager_4d
{"level":"error","ts":"2025-10-15T18:46:03.294Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_7f4268de-48c5-42b7-a064-2ddb27b94334","wf-run-id":"01c421d
{"level":"info","ts":"2025-10-15T18:46:03.379Z","msg":"Activity task not found","component":"matching-engine","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"replication_16167","wf-run-id":"0199e91c-fb9c-7c29-a3cc-b756c1df3
{"level":"error","ts":"2025-10-15T18:46:03.642Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_8753dbaf-d92e-4bb4-8b23-425b8d3941f3","wf-run-id":"4931d68
{"level":"info","ts":"2025-10-15T18:46:04.457Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"/home/runner/wor
{"level":"error","ts":"2025-10-15T18:46:04.631Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_843c7c3d-d534-49c0-9c05-d3691f8b0657","wf-run-id":"1146a15
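
To see which Temporal errors dominate, the JSON log lines can be tallied by level, message, operation, and gRPC code. The lines pasted here are cut off at the terminal width, so they are not valid JSON. The sketch below therefore pulls out complete `"key":"value"` string pairs with a regular expression instead of calling `json.loads`. The function name `summarize` and the grouping key are choices made for this example, not anything Temporal provides.

```python
import re
from collections import Counter

# Matches complete "key":"value" string pairs; numeric and truncated values are skipped.
FIELD = re.compile(r'"([\w-]+)":"([^"]*)"')


def summarize(lines):
    """Count log lines by (level, msg, operation, grpc_code)."""
    counts = Counter()
    for line in lines:
        fields = dict(FIELD.findall(line))
        if "level" not in fields:
            continue  # not a structured Temporal log line
        key = (
            fields["level"],
            fields.get("msg", ""),
            fields.get("operation", ""),
            fields.get("grpc_code", ""),
        )
        counts[key] += 1
    return counts


# Sample lines copied from the Temporal log above, truncation included.
sample = [
    '{"level":"error","ts":"2025-10-15T18:46:03.294Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_7f4268de-48c5-42b7-a064-2ddb27b94334","wf-run-id":"01c421d',
    '{"level":"info","ts":"2025-10-15T18:46:03.379Z","msg":"Activity task not found","component":"matching-engine","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"replication_16167","wf-run-id":"0199e91c-fb9c-7c29-a3cc-b756c1df3',
    '{"level":"error","ts":"2025-10-15T18:46:03.642Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_8753dbaf-d92e-4bb4-8b23-425b8d3941f3","wf-run-id":"4931d68',
    '{"level":"error","ts":"2025-10-15T18:46:04.631Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_843c7c3d-d534-49c0-9c05-d3691f8b0657","wf-run-id":"1146a15',
]

counts = summarize(sample)
for key, n in counts.most_common():
    print(n, key)
```

On this small sample, the `AddWorkflowTask` / `Unavailable` failures account for three of the four lines.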

logs from launcher:

2025-10-15 18:46:35,432 [default-13]    INFO    i.a.w.l.p.h.SuccessHandler(accept):83 - Pipeline completed for workload: 9e81e5c3-48a5-4952-8f8b-b2c5fb20fdf8_15961_1_check.
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload 0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: LOAD_SHED — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: CHECK_STATUS — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,434 [default-13]    INFO    i.a.w.l.p.s.m.Stage(apply):42 - APPLY Stage: BUILD — (workloadId=cdba8436-3a0b-43aa-bcc8-0669e862340e_15799_2_check)
2025-10-15 18:46:35,434 [default-17]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload f2c2482d-74af-4beb-a2bd-6565cf8bdfaa_15995_0_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: MUTEX — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,429 [default-18]    INFO    i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: MUTEX — (workloadId=cbbdf29a-7217-4ca1-839b-ee8b14381e14_16095_0_sync)
2025-10-15 18:46:35,441 [default-11]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload a99c7c10-b678-47ee-aec2-433e4639a89a_15587_2_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:39,272 [default-11]    INFO    i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.
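
To check whether every claim really fails or only some workload types do, the launcher log can be reduced to a count of claim results. This is a rough sketch that rests on two assumptions: the trailing `_check` / `_sync` part of the workload ID identifies the workload type, and successful claims are logged in the same format as `Claimed: true`. The function name `claim_summary` is made up for this example.

```python
import re
from collections import Counter

# Matches the claim result lines emitted by WorkloadApiClient(claim).
CLAIM = re.compile(r"Claimed: (true|false) for workload (\S+)")


def claim_summary(lines):
    """Count claim results grouped by (workload type suffix, claimed flag)."""
    counts = Counter()
    for line in lines:
        match = CLAIM.search(line)
        if match is None:
            continue  # not a claim result line
        claimed, workload_id = match.groups()
        # Assumption: the text after the last underscore is the workload type.
        kind = workload_id.rsplit("_", 1)[-1]
        counts[(kind, claimed)] += 1
    return counts


# Sample lines copied from the launcher log above (wrapped lines joined).
sample = [
    "2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload 0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)",
    "2025-10-15 18:46:35,433 [default-14]    INFO    i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.",
    "2025-10-15 18:46:35,434 [default-17]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload f2c2482d-74af-4beb-a2bd-6565cf8bdfaa_15995_0_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)",
    "2025-10-15 18:46:35,441 [default-11]    INFO    i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload a99c7c10-b678-47ee-aec2-433e4639a89a_15587_2_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)",
]

summary = claim_summary(sample)
for (kind, claimed), n in sorted(summary.items()):
    print(f"{kind:6s} claimed={claimed:5s} count={n}")
```

Running it over the full launcher log rather than this excerpt would show whether any `Claimed: true` lines appear at all, and whether the failed claims are limited to check workloads.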

Relevant log output
