Improve SUB_WORKFLOW reliability, recovery, and scalability#973

Open
rajeshwar-nu wants to merge 17 commits into conductor-oss:main from steeleye:improvments/dyn-fork-join-sw
Contributor

@rajeshwar-nu rajeshwar-nu commented Apr 4, 2026

Summary

This PR improves the reliability, recovery behavior, and scalability of SUB_WORKFLOW execution.

It addresses a failure mode where parent workflows can be left with sub-workflow tasks that are persisted but not cleanly attached or recoverable, especially under large fanout, nested sub-workflow creation, or transient queue/persistence instability.

It also reduces the amount of heavy child-workflow startup work performed on the critical parent execution path, which makes SUB_WORKFLOW orchestration behave better under load.

Why

SUB_WORKFLOW is an orchestration primitive, not a worker-polled task.

In the previous model:

  • child launch and parent attachment were too tightly coupled
  • retries did not have a stable child identity to reattach to
  • partially launched SUB_WORKFLOW tasks could remain in SCHEDULED without a reliable recovery path
  • parent/child attachment could lag behind child creation
  • revisit timing for active SUB_WORKFLOW tasks could be too slow for prompt repair
  • large dynamic fanout of nested sub-workflows put too much synchronous pressure on orchestration

For large nested workloads of the form dynamic fork-join -> sub-workflow -> dynamic fork-join -> sub-workflow, these issues made the system more fragile than it needed to be.

This PR changes SUB_WORKFLOW launch semantics to better match its actual role:

  • durable parent-owned child identity
  • safe retry and reattach
  • faster parent attachment
  • quicker revisit for unresolved active sub-workflow tasks
  • less synchronous orchestration pressure on the parent path

What Changed

Idempotent child launch

  • reserve a stable child workflow id per owning parent workflow task
  • reuse the same child id across retries instead of risking duplicate child creation
  • use execution-store truth for child existence checks instead of index fallback
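
The reservation idea above can be illustrated with a minimal in-memory sketch. The names `ChildIdReservations` and `reserve` are hypothetical and are not the PR's API; the actual change keeps reservations in the persistence layer. The sketch only shows the property that matters: the same parent task always maps to the same child id.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration (not the PR's API): one stable child workflow id
// per (parent workflow id, parent task id) pair.
final class ChildIdReservations {
    private final Map<String, String> reserved = new ConcurrentHashMap<>();

    // Returns the same child id on every call for the same parent task, so a
    // retried launch can reattach instead of creating a duplicate child.
    String reserve(String parentWorkflowId, String parentTaskId) {
        String key = parentWorkflowId + ":" + parentTaskId;
        return reserved.computeIfAbsent(key, k -> UUID.randomUUID().toString());
    }
}
```

Calling `reserve` twice with the same parent workflow id and task id yields the same id, while a different task id yields a different one.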

Faster and lighter parent attachment

  • treat SUB_WORKFLOW launch as an async orchestration step
  • create or reattach the child workflow and attach the parent task as soon as the child record exists
  • avoid waiting for the child workflow’s initial inline expansion before persisting subWorkflowId

This reduces parent-path blocking and improves scalability for nested fanout workloads.
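
The ordering described above (child record first, parent attachment immediately after, heavy expansion later) might look roughly like the following in-memory model. Everything here is an assumption made for illustration: the class `EarlyAttachSketch`, its fields, and the idea of a pending-expansion set are not taken from the PR.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; none of these names come from the PR.
final class EarlyAttachSketch {
    // Ids of child workflows whose record has been persisted.
    final Set<String> childRecords = new LinkedHashSet<>();
    // Parent task id -> subWorkflowId attached to that task.
    final Map<String, String> subWorkflowIdByTask = new LinkedHashMap<>();
    // Children still waiting for their initial (heavy) expansion.
    final Set<String> pendingExpansion = new LinkedHashSet<>();

    void launchOrReattach(String parentTaskId, String reservedChildId) {
        // 1. Create the child record; a no-op if an earlier attempt already did.
        childRecords.add(reservedChildId);
        // 2. Attach right away: the parent task gets its subWorkflowId now.
        subWorkflowIdByTask.put(parentTaskId, reservedChildId);
        // 3. Hand the expensive expansion off; the set dedupes repeated attempts.
        pendingExpansion.add(reservedChildId);
    }
}
```

Because every step is idempotent, running `launchOrReattach` again after a partial failure leaves one child record, one attachment, and one pending expansion.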

Recovery behavior

  • allow SCHEDULED SUB_WORKFLOW tasks without subWorkflowId to retry launch instead of dead-ending
  • preserve reattach behavior after partial persistence failures
  • make launch failures explicit on the task instead of leaving an ambiguous blank scheduled state
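
As a rough sketch of the recovery rule above, the decision for a SUB_WORKFLOW task can be written as a small pure function. The `SubWorkflowRecovery` class and its `Action` values are hypothetical names for illustration, not code from the PR, and the handling of IN_PROGRESS without a `subWorkflowId` is left open because the description does not cover it.

```java
// Hypothetical decision helper; names are illustrative, not from the PR.
final class SubWorkflowRecovery {
    enum Status { SCHEDULED, IN_PROGRESS, COMPLETED, FAILED }
    enum Action { RETRY_LAUNCH, MONITOR_CHILD, NONE }

    static Action next(Status status, String subWorkflowId) {
        boolean attached = subWorkflowId != null && !subWorkflowId.isEmpty();
        boolean active = status == Status.SCHEDULED || status == Status.IN_PROGRESS;
        if (!active) {
            return Action.NONE; // terminal tasks need no recovery step
        }
        if (attached) {
            return Action.MONITOR_CHILD; // child exists and is linked to the task
        }
        // SCHEDULED without subWorkflowId: retry the launch instead of dead-ending.
        // IN_PROGRESS without subWorkflowId is not described in the PR text,
        // so this sketch takes no action for it.
        return status == Status.SCHEDULED ? Action.RETRY_LAUNCH : Action.NONE;
    }
}
```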

Reservation lifecycle management

  • add owned reservation cleanup for cancel/delete flows
  • support both single-task reservation removal and bulk workflow-owned cleanup
  • store Redis reservations in a workflow-owned hash for cheaper lookup and cleanup
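
The workflow-owned layout can be pictured as a map of maps: one outer entry per owning workflow (comparable to one Redis hash per workflow) and one inner entry per parent task (comparable to a hash field). The class below is a hypothetical in-memory stand-in, not the PR's Redis code; it is only meant to show why single-task removal and bulk cleanup both become cheap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a workflow-owned reservation hash.
final class WorkflowOwnedReservations {
    // Owning workflow id -> (parent task id -> reserved child workflow id).
    private final Map<String, Map<String, String>> byWorkflow = new ConcurrentHashMap<>();

    void put(String workflowId, String taskId, String childId) {
        byWorkflow.computeIfAbsent(workflowId, k -> new ConcurrentHashMap<>())
                  .put(taskId, childId);
    }

    // Returns the reserved child id, or null if none exists.
    String get(String workflowId, String taskId) {
        Map<String, String> fields = byWorkflow.get(workflowId);
        return fields == null ? null : fields.get(taskId);
    }

    // Single-task removal: comparable to deleting one field of the hash.
    void removeForTask(String workflowId, String taskId) {
        Map<String, String> fields = byWorkflow.get(workflowId);
        if (fields != null) {
            fields.remove(taskId);
        }
    }

    // Bulk cleanup: comparable to deleting the whole hash key in one call.
    void removeForWorkflow(String workflowId) {
        byWorkflow.remove(workflowId);
    }
}
```

With this shape, cleanup on workflow delete does not need to scan for individual reservation keys; it drops the single workflow-owned entry.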

Faster revisit for active SUB_WORKFLOW tasks

  • give active SUB_WORKFLOW tasks dedicated postpone behavior
  • use workflowOffsetTimeout for both SCHEDULED and IN_PROGRESS SUB_WORKFLOW tasks
  • avoid inheriting generic worker-oriented postpone behavior for orchestration tasks

This improves reliability by reducing the time unresolved sub-workflow tasks can sit before being revisited.
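
A minimal sketch of the postpone selection described above, assuming the decision reduces to a choice between two durations. `PostponeSketch`, `postponeFor`, and the `genericPostpone` parameter are hypothetical; only `workflowOffsetTimeout` and the task and status names come from the PR description.

```java
import java.time.Duration;

// Hypothetical helper; only workflowOffsetTimeout is named in the PR text.
final class PostponeSketch {
    static Duration postponeFor(String taskType,
                                String status,
                                Duration workflowOffsetTimeout,
                                Duration genericPostpone) {
        boolean active = "SCHEDULED".equals(status) || "IN_PROGRESS".equals(status);
        // Active SUB_WORKFLOW tasks are revisited on the workflow offset timeout
        // rather than on the worker-oriented postpone used for other tasks.
        if ("SUB_WORKFLOW".equals(taskType) && active) {
            return workflowOffsetTimeout;
        }
        return genericPostpone;
    }
}
```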

Reliability and Scalability Impact

This PR improves reliability by:

  • making child launch retry-safe
  • making partial launch/attach failures recoverable
  • reducing ambiguous SCHEDULED states
  • revisiting unresolved SUB_WORKFLOW tasks sooner

This PR improves scalability by:

  • reducing heavy inline child startup work on the parent path
  • shortening the parent/child attachment gap
  • making nested fanout workloads less sensitive to transient backend or queue issues

Backward Compatibility

This PR intentionally changes SUB_WORKFLOW behavior:

  • SUB_WORKFLOW launch now follows the async/idempotent attach model implemented here
  • WorkflowExecutor.startWorkflowIdempotent(...) now returns a WorkflowModel
  • persistence implementations must support sub-workflow id reservation APIs

This is both a behavioral change and an SPI change, so mixed-version rollout should be treated carefully.

Tests

Added/updated coverage for:

  • idempotent create-vs-reattach behavior
  • execution-store-only existence checks
  • transient retry for SCHEDULED tasks without subWorkflowId
  • faster parent attach after child creation
  • reservation cleanup on task/workflow deletion
  • backend reservation behavior across persistence implementations
  • postpone behavior for active SUB_WORKFLOW tasks

@rajeshwar-nu rajeshwar-nu marked this pull request as ready for review April 4, 2026 07:25
@rajeshwar-nu rajeshwar-nu changed the title Make sub-workflow launch recoverable for dyn fork-join fanout Make sub-workflow launches durable for dyn fork-join fanouts Apr 4, 2026
@rajeshwar-nu rajeshwar-nu changed the title Make sub-workflow launches durable for dyn fork-join fanouts Improve SUB_WORKFLOW reliability, recovery, and scalability Apr 5, 2026
@v1r3n v1r3n requested a review from mp-orkes April 7, 2026 01:43
@rajeshwar-nu
Contributor Author

Hi @vishesh-orkes, @mp-orkes, @nthmost-orkes, could I please get your thoughts on this? 🙏🏻
Thanks
