Improve SUB_WORKFLOW reliability, recovery, and scalability by rajeshwar-nu · Pull Request #973 · conductor-oss/conductor

rajeshwar-nu · 2026-04-04T03:21:44Z

Summary

This PR improves the reliability, recovery behavior, and scalability of SUB_WORKFLOW execution.

It addresses a failure mode where parent workflows can be left with sub-workflow tasks that are persisted but not cleanly attached or recoverable, especially under large fanout, nested sub-workflow creation, or transient queue/persistence instability.

It also reduces the amount of heavy child-workflow startup work performed on the critical parent execution path, which makes SUB_WORKFLOW orchestration behave better under load.

Why

SUB_WORKFLOW is an orchestration primitive, not a worker-polled task.

In the previous model:

child launch and parent attachment were too tightly coupled
retries did not have a stable child identity to reattach to
partially launched SUB_WORKFLOW tasks could remain in SCHEDULED without a reliable recovery path
parent/child attachment could lag behind child creation
revisit timing for active SUB_WORKFLOW tasks could be too slow for prompt repair
large dynamic fanout of nested sub-workflows put too much synchronous pressure on orchestration

For large nested fanout workloads (dynamic fork-join -> sub-workflow -> dynamic fork-join -> sub-workflow), this made the system more fragile than it needed to be.

This PR changes SUB_WORKFLOW launch semantics to better match its actual role:

durable parent-owned child identity
safe retry and reattach
faster parent attachment
quicker revisit for unresolved active sub-workflow tasks
less synchronous orchestration pressure on the parent path

What Changed

Idempotent child launch

reserve a stable child workflow id per owning parent workflow task
reuse the same child id across retries instead of risking duplicate child creation
treat the execution store as the source of truth for child existence checks instead of falling back to the index
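
The reservation idea above can be sketched roughly as follows. This is a minimal in-memory illustration, not the PR's implementation: the class name `ChildIdReservations`, the `reserve` method, and the key layout are all hypothetical, and a real backend would persist the reservation durably.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a reservation store; not Conductor's actual API.
final class ChildIdReservations {
    // Key: parentWorkflowId + ":" + parentTaskId, value: reserved child workflow id.
    private final Map<String, String> reserved = new ConcurrentHashMap<>();

    /** Returns the child id reserved for this parent task, creating one on the first call. */
    String reserve(String parentWorkflowId, String parentTaskId) {
        String key = parentWorkflowId + ":" + parentTaskId;
        // computeIfAbsent is atomic on ConcurrentHashMap, so concurrent retries agree on one id.
        return reserved.computeIfAbsent(key, k -> UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        ChildIdReservations store = new ChildIdReservations();
        String first = store.reserve("parent-1", "task-a");
        String retry = store.reserve("parent-1", "task-a");
        // A retry of the same parent task sees the same child id.
        System.out.println(first.equals(retry)); // prints true
    }
}
```

Because the id is fixed before the child is created, a retry can check whether a workflow with that id already exists and reattach to it rather than creating a second child.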

Faster and lighter parent attachment

treat SUB_WORKFLOW launch as an async orchestration step
create or reattach the child workflow and attach the parent task as soon as the child record exists
avoid waiting for the child workflow’s initial inline expansion before persisting subWorkflowId

This reduces parent-path blocking and improves scalability for nested fanout workloads.
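
A rough sketch of the ordering described above, assuming the child record has already been persisted by the caller: the parent task gets its `subWorkflowId` first, and the child's initial expansion is handed to an executor. `AsyncAttachSketch`, `ParentTask`, and `launch` are hypothetical names used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Hypothetical sketch: attach the parent task first, expand the child later.
final class AsyncAttachSketch {
    /** Minimal stand-in for the parent's SUB_WORKFLOW task. */
    static final class ParentTask {
        String subWorkflowId; // null until the child is attached
    }

    private final Executor expansionExecutor;

    AsyncAttachSketch(Executor expansionExecutor) {
        this.expansionExecutor = expansionExecutor;
    }

    void launch(ParentTask task, String childId, Runnable expandChild) {
        // Assumption: the child workflow record with this id already exists in the store.
        task.subWorkflowId = childId;
        // The child's initial expansion is deferred, so the parent path does not wait for it.
        expansionExecutor.execute(expandChild);
    }

    public static void main(String[] args) {
        Queue<Runnable> pending = new ArrayDeque<>();
        AsyncAttachSketch sketch = new AsyncAttachSketch(pending::add);
        ParentTask task = new ParentTask();
        sketch.launch(task, "child-1", () -> System.out.println("expanding child-1"));
        System.out.println("attached: " + task.subWorkflowId);
        pending.remove().run(); // the deferred expansion runs afterwards
    }
}
```

The queue-backed executor in `main` only makes the ordering visible; a real deployment would presumably use a worker pool or queue.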

Recovery behavior

allow SCHEDULED SUB_WORKFLOW tasks without subWorkflowId to retry launch instead of dead-ending
preserve reattach behavior after partial persistence failures
make launch failures explicit on the task instead of leaving an ambiguous blank scheduled state
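
The recovery rules above might look roughly like this. It is a sketch under assumptions: the `Task` class, the `recover` method, and the choice to leave a failed task in SCHEDULED with a recorded reason are illustrative and not taken from the PR's code.

```java
import java.util.function.Supplier;

// Hypothetical sketch of retrying launch for a SCHEDULED task with no subWorkflowId.
final class LaunchRecoverySketch {
    static final class Task {
        String status = "SCHEDULED";
        String subWorkflowId;          // null means the child was never attached
        String reasonForIncompletion;  // where this sketch records launch failures
    }

    /** Returns true if the task has an attached child after this call. */
    static boolean recover(Task task, Supplier<String> launchOrReattach) {
        if (!"SCHEDULED".equals(task.status) || task.subWorkflowId != null) {
            return task.subWorkflowId != null; // nothing to recover
        }
        try {
            // Retry the launch; an idempotent launch returns the same child id every time.
            task.subWorkflowId = launchOrReattach.get();
            task.reasonForIncompletion = null;
            return true;
        } catch (RuntimeException e) {
            // Assumption: keep the task SCHEDULED so a later revisit can retry,
            // but record why the launch failed instead of leaving a blank state.
            task.reasonForIncompletion = "SUB_WORKFLOW launch failed: " + e.getMessage();
            return false;
        }
    }
}
```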

Reservation lifecycle management

add owned reservation cleanup for cancel/delete flows
support both single-task reservation removal and bulk workflow-owned cleanup
store Redis reservations in a workflow-owned hash for cheaper lookup and cleanup
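
To illustrate the workflow-owned layout, here is an in-memory model of one hash per parent workflow, with one field per task. All names are hypothetical. In Redis terms the three operations would plausibly correspond to HSETNX, HDEL, and DEL on the workflow's hash key, though the PR text does not say which commands it uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of a workflow-owned reservation hash.
final class WorkflowOwnedReservations {
    // Outer key: parent workflow id (the hash). Inner map: task id -> reserved child id.
    private final Map<String, Map<String, String>> byWorkflow = new ConcurrentHashMap<>();

    /** Reserves candidateChildId for the task unless one is already reserved; returns the winner. */
    String reserve(String workflowId, String taskId, String candidateChildId) {
        return byWorkflow
                .computeIfAbsent(workflowId, k -> new ConcurrentHashMap<>())
                .computeIfAbsent(taskId, k -> candidateChildId);
    }

    /** Single-task removal, e.g. when one task is cancelled. Returns true if something was removed. */
    boolean removeForTask(String workflowId, String taskId) {
        Map<String, String> fields = byWorkflow.get(workflowId);
        return fields != null && fields.remove(taskId) != null;
    }

    /** Bulk cleanup when the owning workflow is deleted. Returns the number of reservations dropped. */
    int removeAllForWorkflow(String workflowId) {
        Map<String, String> fields = byWorkflow.remove(workflowId);
        return fields == null ? 0 : fields.size();
    }
}
```

Grouping reservations under the owning workflow means bulk cleanup is a single removal of that workflow's hash rather than a scan over all reservations.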

Faster revisit for active SUB_WORKFLOW tasks

give active SUB_WORKFLOW tasks dedicated postpone behavior
use workflowOffsetTimeout for both SCHEDULED and IN_PROGRESS SUB_WORKFLOW tasks
avoid inheriting generic worker-oriented postpone behavior for orchestration tasks

This improves reliability by reducing the time unresolved sub-workflow tasks can sit before being revisited.
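
As a sketch of the postpone rule described above: active SUB_WORKFLOW tasks use the workflow offset timeout, and everything else keeps the generic value. The method `postponeFor` and its parameters are hypothetical; only the name `workflowOffsetTimeout` comes from the PR description.

```java
import java.time.Duration;

// Hypothetical sketch of choosing how long to postpone revisiting a task.
final class PostponeSketch {
    static Duration postponeFor(
            String taskType,
            String status,
            Duration workflowOffsetTimeout,
            Duration genericPostpone) {
        boolean active = "SCHEDULED".equals(status) || "IN_PROGRESS".equals(status);
        if ("SUB_WORKFLOW".equals(taskType) && active) {
            // Orchestration task: revisit on the workflow offset timeout.
            return workflowOffsetTimeout;
        }
        // Other tasks keep the generic, worker-oriented postpone value.
        return genericPostpone;
    }

    public static void main(String[] args) {
        Duration offset = Duration.ofSeconds(30);
        Duration generic = Duration.ofMinutes(10);
        System.out.println(postponeFor("SUB_WORKFLOW", "SCHEDULED", offset, generic)); // prints PT30S
        System.out.println(postponeFor("SIMPLE", "IN_PROGRESS", offset, generic));     // prints PT10M
    }
}
```

The durations in `main` are made-up values chosen only to show the two branches.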

Reliability and Scalability Impact

This PR improves reliability by:

making child launch retry-safe
making partial launch/attach failures recoverable
reducing ambiguous SCHEDULED states
revisiting unresolved SUB_WORKFLOW tasks sooner

This PR improves scalability by:

reducing heavy inline child startup work on the parent path
shortening the parent/child attachment gap
making nested fanout workloads less sensitive to transient backend or queue issues

Backward Compatibility

This PR intentionally changes SUB_WORKFLOW behavior:

SUB_WORKFLOW launch now follows the async/idempotent attach model implemented here
WorkflowExecutor.startWorkflowIdempotent(...) now returns a WorkflowModel
persistence implementations must support sub-workflow id reservation APIs

This is both a behavioral change and an SPI change, so mixed-version rollout should be treated carefully.

Tests

Added/updated coverage for:

idempotent create-vs-reattach behavior
execution-store-only existence checks
transient retry for SCHEDULED tasks without subWorkflowId
faster parent attach after child creation
reservation cleanup on task/workflow deletion
backend reservation behavior across persistence implementations
postpone behavior for active SUB_WORKFLOW tasks

rajeshwar-nu · 2026-04-09T06:56:45Z

Hi @vishesh-orkes, @mp-orkes, @nthmost-orkes, can I please get some thoughts on this? 🙏🏻
Thanks

rajeshwar-nu added 3 commits April 4, 2026 08:50

Recover sub-workflow launch from transient failures

df7a43f

Make sub-workflow launch idempotent

97db0ec

Run sub-workflow launch through async workers

1ea440a

rajeshwar-nu marked this pull request as ready for review April 4, 2026 07:25

rajeshwar-nu changed the title ~~Make sub-workflow launch recoverable for dyn fork-join fanout~~ Make sub-workflow launches durable for dyn fork-join fanouts Apr 4, 2026

rajeshwar-nu added 9 commits April 4, 2026 14:01

Cleanup reservations on workflow deletions/cancellations

6a3aa7f

Improves the reservation cleanup logic

2835fd4

Merge remote-tracking branch 'upstream/main' into improvments/dyn-fork-join-sw

50f52a3

prevent propagating exception on existing workflow non-found case

29c6359

Merge remote-tracking branch 'upstream/main' into improvments/dyn-fork-join-sw

d6888a5

Adding ability to use only executiondao to fetch workflow

12aaee8

async decision on created workflow of sw task

819cca2

improve reservation cleanup

5a0f280

spotless

feb2de0

rajeshwar-nu changed the title ~~Make sub-workflow launches durable for dyn fork-join fanouts~~ Improve SUB_WORKFLOW reliability, recovery, and scalability Apr 5, 2026

v1r3n requested a review from mp-orkes April 7, 2026 01:43

rajeshwar-nu added 3 commits April 7, 2026 03:25

Merge branch 'main' into improvments/dyn-fork-join-sw

ec328fa

Merge remote-tracking branch 'upstream/main' into improvments/dyn-fork-join-sw

699d623

Merge branch 'main' into improvments/dyn-fork-join-sw

a31fe29

rajeshwar-nu added 2 commits April 10, 2026 10:18

Merge branch 'main' into improvments/dyn-fork-join-sw

61a0eff

Merge branch 'main' into improvments/dyn-fork-join-sw

1630316

rajeshwar-nu wants to merge 17 commits into conductor-oss:main from steeleye:improvments/dyn-fork-join-sw
