Labels: area/platform, autoteam, community, needs-triage, team/compose, team/platform-move, type/bug
Description
Helm Chart Version
2.0.17
What step the error happened?
During the Sync
Relevant information
For version 1.8.5 I am seeing errors across the cron, temporal, worker, and workload-launcher pods, and the airbyte namespace won't start new pods. I am running 500+ syncs every two hours.
Summary from Claude:
- Workloads Cannot Be Claimed (Primary Issue)
Claimed: false for workload ... via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3
All workloads are returning Claimed: false from the API, meaning either another worker/dataplane has already claimed them or they are in an invalid state.
- Workflows Are Stuck in Backoff Mode
- Connections have failed multiple times (4+ failures)
- They're waiting 30s to 4+ minutes between retries
- Source checks are continuously failing
- Temporal Rate Limiting (Now Improving)
- Cron is still hitting rate limits trying to clean up workflows (a dynamic-config sketch for this follows the list)
- Temporal resources improved (807m → 492m CPU)
- Workflow Thread Leaks
[BUG] Workflow thread can't be destroyed in time. This will lead to a workflow cache leak
Workflows are getting stuck and can't be cleaned up, causing memory/cache leaks.
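On the rate limiting: the "namespace rate limit exceeded" errors from the cron's SelfHealTemporalWorkflows job are Temporal's per-namespace frontend request limiter rejecting the listClosedWorkflowExecutions calls. One possible mitigation is to raise that limit via Temporal dynamic config. A minimal sketch, assuming the deployment still uses the dynamic-config file bundled with the airbyte/temporal image (frontend.namespaceRPS is a standard Temporal dynamic-config key; the file location and the value 4800 are assumptions, not tuned recommendations):

# Temporal dynamic config (e.g. dynamicconfig/development.yaml in the temporal container).
# frontend.namespaceRPS caps API requests per second for a single namespace; raising it
# reduces RESOURCE_EXHAUSTED "namespace rate limit exceeded" responses. Value is illustrative.
frontend.namespaceRPS:
  - value: 4800
    constraints: {}

This only relieves the rate-limit symptom; the unclaimed workloads and the TMPRL1101 deadlocks in the logs below would still need separate investigation.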
Logs from worker:
2025-10-15 13:26:59,390 [Workflow Executor taskQueue="CONNECTION_UPDATER", namespace="default": 115] WARN i.t.i.r.ReplayWorkflowTaskHandler(failureToWFTResult):302 - Workflow task processing failure. startedEventId=102, WorkflowId=connection_manager_5ffdcd75-71ff-4350-9633-549d4e1edec5, RunId=eec42b8d-c13a-48c9-86b3-40cbd47f3d39. If seen continuously the workflow might be stuck.
io.temporal.internal.statemachines.InternalWorkflowTaskException: Failure handling event 102 of type 'EVENT_TYPE_WORKFLOW_TASK_STARTED' during execution. {WorkflowTaskStartedEventId=102, CurrentStartedEventId=102}
    at io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:436)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:338)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:297)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:260)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:242)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:165)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:135)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:100)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:475)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:366)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:306)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$1(PollTaskExecutor.java:96)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
    at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:163)
    at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:103)
    at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:84)
    at io.temporal.internal.statemachines.EntityStateMachineInitialCommand.handleEvent(EntityStateMachineInitialCommand.java:70)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:487)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:336)
    ... 13 common frames omitted
Caused by: io.temporal.internal.sync.PotentialDeadlockException: [TMPRL1101] Potential deadlock detected. Workflow thread "workflow-method-connection_manager_5ffdcd75-71ff-4350-9633-549d4e1...-eec42b8d-c13a-48c9-86b3-40cbd47f3d39" didn't yield control for over a second. {detectionTimestamp=1760534816802, threadDumpTimestamp=1760534816803}
Logs from cron:
2025-10-15 18:41:36,202 [scheduled-executor-thread-1] ERROR i.m.s.DefaultTaskExceptionHandler(handle):47 - Error invoking scheduled task for bean [io.airbyte.cron.jobs.SelfHealTemporalWorkflows@5b17a929] RESOURCE_EXHAUSTED: namespace r
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:351)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:332)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:174)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:5903)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions$lambda$0(WorkflowServiceStubsWrapped.kt:38)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.airbyte.commons.temporal.RetryHelper.withRetries(RetryHelper.kt:57)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.withRetries(WorkflowServiceStubsWrapped.kt:63)
    at io.airbyte.commons.temporal.WorkflowServiceStubsWrapped.blockingStubListClosedWorkflowExecutions(WorkflowServiceStubsWrapped.kt:37)
    at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.kt:137)
    at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.kt:113)
    at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.kt:39)
    at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456)
    at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:86)
    at io.micronaut.context.bind.DefaultExecutableBeanContextBinder$ContextBoundExecutable.invoke(DefaultExecutableBeanContextBinder.java:152)
    at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$scheduleTask$2(ScheduledMethodProcessor.java:160)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Logs from temporal:
{"level":"error","ts":"2025-10-15T18:46:03.038Z","msg":"Update workflow execution operation failed.","shard-id":4,"address":"0.0.0.0:7234","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"connection_manager_4da58350-f5e0-492
{"level":"warn","ts":"2025-10-15T18:46:03.054Z","msg":"Fail to process task","shard-id":4,"address":"0.0.0.0:7234","component":"transfer-queue-processor","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"connection_manager_4d
{"level":"error","ts":"2025-10-15T18:46:03.294Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_7f4268de-48c5-42b7-a064-2ddb27b94334","wf-run-id":"01c421d
{"level":"info","ts":"2025-10-15T18:46:03.379Z","msg":"Activity task not found","component":"matching-engine","wf-namespace-id":"8a7c73a0-2c00-4a46-91e5-c2fa6037735d","wf-id":"replication_16167","wf-run-id":"0199e91c-fb9c-7c29-a3cc-b756c1df3
{"level":"error","ts":"2025-10-15T18:46:03.642Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_8753dbaf-d92e-4bb4-8b23-425b8d3941f3","wf-run-id":"4931d68
{"level":"info","ts":"2025-10-15T18:46:04.457Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"/home/runner/wor
{"level":"error","ts":"2025-10-15T18:46:04.631Z","msg":"service failures","operation":"AddWorkflowTask","wf-namespace":"default","grpc_code":"Unavailable","wf-id":"connection_manager_843c7c3d-d534-49c0-9c05-d3691f8b0657","wf-run-id":"1146a15
Logs from launcher:
2025-10-15 18:46:35,432 [default-13] INFO i.a.w.l.p.h.SuccessHandler(accept):83 - Pipeline completed for workload: 9e81e5c3-48a5-4952-8f8b-b2c5fb20fdf8_15961_1_check.
2025-10-15 18:46:35,433 [default-14] INFO i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload 0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:35,433 [default-14] INFO i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.
2025-10-15 18:46:35,433 [default-14] INFO i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: LOAD_SHED — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,433 [default-14] INFO i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: CHECK_STATUS — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,434 [default-13] INFO i.a.w.l.p.s.m.Stage(apply):42 - APPLY Stage: BUILD — (workloadId=cdba8436-3a0b-43aa-bcc8-0669e862340e_15799_2_check)
2025-10-15 18:46:35,434 [default-17] INFO i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload f2c2482d-74af-4beb-a2bd-6565cf8bdfaa_15995_0_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:35,433 [default-14] INFO i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: MUTEX — (workloadId=0032bdbb-1723-4ab8-9aa5-0d1776883400_15726_1_check)
2025-10-15 18:46:35,429 [default-18] INFO i.a.w.l.p.s.m.Stage(apply):35 - SKIP Stage: MUTEX — (workloadId=cbbdf29a-7217-4ca1-839b-ee8b14381e14_16095_0_sync)
2025-10-15 18:46:35,441 [default-11] INFO i.a.w.l.c.WorkloadApiClient(claim):79 - Claimed: false for workload a99c7c10-b678-47ee-aec2-433e4639a89a_15587_2_check via API in dataplane AUTOab3e0a00-dba4-4157-945c-a733be20cfb3 (280ed729-b9f8-41e7-8289-34e8bf4400a9)
2025-10-15 18:46:39,272 [default-11] INFO i.a.w.l.p.s.ClaimStage(applyStage):55 - Workload not claimed. Setting SKIP flag to true.
Relevant log output