Improve SUB_WORKFLOW reliability, recovery, and scalability #973
Open
rajeshwar-nu wants to merge 17 commits into conductor-oss:main from
Hi @vishesh-orkes, @mp-orkes, @nthmost-orkes, can I please get some thoughts on this 🙏🏻
Summary
This PR improves the reliability, recovery behavior, and scalability of `SUB_WORKFLOW` execution.
It addresses a failure mode where parent workflows can be left with sub-workflow tasks that are persisted but not cleanly attached or recoverable, especially under large fanout, nested sub-workflow creation, or transient queue/persistence instability.
It also reduces the amount of heavy child-workflow startup work performed on the critical parent execution path, which makes `SUB_WORKFLOW` orchestration behave better under load.
Why
`SUB_WORKFLOW` is an orchestration primitive, not a worker-polled task.
In the previous model:
- `SUB_WORKFLOW` tasks could remain in `SCHEDULED` without a reliable recovery path
- `SUB_WORKFLOW` tasks could be too slow for prompt repair
For large `dyn fork-join -> subworkflow -> dyn fork-join -> subworkflow` workloads, that made the system more fragile than it needed to be.
This PR changes `SUB_WORKFLOW` launch semantics to better match its actual role:
What Changed
Idempotent child launch
Faster and lighter parent attachment
- … `SUB_WORKFLOW` launch as an async orchestration step
- … `subWorkflowId`
This reduces parent-path blocking and improves scalability for nested fanout workloads.
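A minimal sketch of the idea above, assuming the parent only records a task and the child startup runs on a separate executor that later attaches `subWorkflowId`. Every name here (`AsyncAttachSketch`, `ParentTask`, `launchAsync`, the `child-of-` ID format) is hypothetical and chosen for illustration; none of it is taken from the PR or from Conductor's code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names throughout; this is not Conductor code.
final class AsyncAttachSketch {

    // Minimal parent-side task: subWorkflowId stays null until a child is attached.
    static final class ParentTask {
        final String taskId;
        final AtomicReference<String> subWorkflowId = new AtomicReference<>();

        ParentTask(String taskId) {
            this.taskId = taskId;
        }
    }

    // Submits the child startup to launchPool and returns a future right away,
    // so the caller (the parent path) is not blocked by the startup work.
    static CompletableFuture<String> launchAsync(ParentTask task, ExecutorService launchPool) {
        return CompletableFuture
                // Stand-in for the heavy child-workflow startup work.
                .supplyAsync(() -> "child-of-" + task.taskId, launchPool)
                .thenApply(childId -> {
                    // Attach only if nothing is attached yet, then report what is attached.
                    task.subWorkflowId.compareAndSet(null, childId);
                    return task.subWorkflowId.get();
                });
    }

    public static void main(String[] args) {
        ExecutorService launchPool = Executors.newSingleThreadExecutor();
        try {
            ParentTask task = new ParentTask("task-7");
            CompletableFuture<String> pending = launchAsync(task, launchPool);
            // The parent could do other work here; join() is only for the demo.
            System.out.println("attached: " + pending.join());
        } finally {
            launchPool.shutdown();
        }
    }
}
```

The `compareAndSet(null, ...)` call is one way to keep a late or duplicate launch from overwriting an ID that is already attached; whether the PR does something comparable is not stated in this description.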
Recovery behavior
- … `SCHEDULED` `SUB_WORKFLOW` tasks without `subWorkflowId` to retry launch instead of dead-ending
Reservation lifecycle management
Faster revisit for active SUB_WORKFLOW tasks
- … `SUB_WORKFLOW` tasks dedicated postpone behavior
- … `workflowOffsetTimeout` for both `SCHEDULED` and `IN_PROGRESS` `SUB_WORKFLOW` tasks
This improves reliability by reducing the time unresolved sub-workflow tasks can sit before being revisited.
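One possible reading of the dedicated postpone behavior, sketched under a stated assumption: that active `SUB_WORKFLOW` tasks (`SCHEDULED` or `IN_PROGRESS`) are revisited after `workflowOffsetTimeout`, while everything else keeps a longer generic delay. The class, the enums, the `genericPostpone` parameter, and the durations in `main` are all hypothetical; the description does not give the actual rule or values.

```java
import java.time.Duration;

// Hypothetical policy class; not part of Conductor.
final class RevisitPolicySketch {
    enum TaskType { SUB_WORKFLOW, OTHER }
    enum TaskStatus { SCHEDULED, IN_PROGRESS, COMPLETED }

    // Picks how long to wait before the task is looked at again.
    // Assumption: active SUB_WORKFLOW tasks use workflowOffsetTimeout,
    // all other tasks fall back to genericPostpone.
    static Duration postponeFor(TaskType type, TaskStatus status,
                                Duration workflowOffsetTimeout, Duration genericPostpone) {
        boolean active = status == TaskStatus.SCHEDULED || status == TaskStatus.IN_PROGRESS;
        if (type == TaskType.SUB_WORKFLOW && active) {
            return workflowOffsetTimeout;
        }
        return genericPostpone;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Duration offset = Duration.ofSeconds(30);
        Duration generic = Duration.ofMinutes(10);
        System.out.println(postponeFor(TaskType.SUB_WORKFLOW, TaskStatus.SCHEDULED, offset, generic));
        System.out.println(postponeFor(TaskType.OTHER, TaskStatus.SCHEDULED, offset, generic));
    }
}
```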
Reliability and Scalability Impact
This PR improves reliability by:
- … `SCHEDULED` states
- … `SUB_WORKFLOW` tasks sooner
This PR improves scalability by:
Backward Compatibility
This PR intentionally changes `SUB_WORKFLOW` behavior:
- `SUB_WORKFLOW` launch now follows the async/idempotent attach model implemented here
- `WorkflowExecutor.startWorkflowIdempotent(...)` now returns a `WorkflowModel`
This is both a behavioral change and an SPI change, so mixed-version rollout should be treated carefully.
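To make the idempotent attach model more concrete, here is a rough sketch of a launcher that returns the same child record no matter how often a launch is retried for one parent task. It is not the PR's `startWorkflowIdempotent(...)`: `ChildRecord`, `IdempotentLauncherSketch`, keying by parent task ID, and the ID format are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a workflow record; not Conductor's WorkflowModel.
final class ChildRecord {
    final String workflowId;
    final String parentTaskId;

    ChildRecord(String workflowId, String parentTaskId) {
        this.workflowId = workflowId;
        this.parentTaskId = parentTaskId;
    }
}

// Toy launcher: at most one child per parent task ID, however often launch is retried.
final class IdempotentLauncherSketch {
    private final Map<String, ChildRecord> childrenByParentTask = new ConcurrentHashMap<>();
    private final AtomicInteger startsPerformed = new AtomicInteger();

    // Returns the existing child if this parent task already has one, otherwise
    // creates it. computeIfAbsent makes the check-and-create step atomic.
    ChildRecord startIdempotent(String parentTaskId) {
        return childrenByParentTask.computeIfAbsent(parentTaskId, id -> {
            startsPerformed.incrementAndGet();
            return new ChildRecord("child-of-" + id, id);
        });
    }

    // Number of children actually created (not the number of calls).
    int startsPerformed() {
        return startsPerformed.get();
    }

    public static void main(String[] args) {
        IdempotentLauncherSketch launcher = new IdempotentLauncherSketch();
        ChildRecord first = launcher.startIdempotent("task-1");
        ChildRecord retry = launcher.startIdempotent("task-1");
        System.out.println(first == retry);              // same record on retry
        System.out.println(launcher.startsPerformed());  // created only once
    }
}
```

Returning the record rather than only an ID lets the caller attach the child and inspect its state in one step, which may be part of the motivation for the return-type change, though the description does not say so.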
Tests
Added/updated coverage for:
- … `SCHEDULED` tasks without `subWorkflowId`
- … `SUB_WORKFLOW` tasks
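The recovery case listed above (a `SCHEDULED` task with no `subWorkflowId`) can be pictured with a small sweep that retries launch only for such tasks. This is a sketch under assumptions, not the PR's test or recovery code: `LaunchRecoverySketch`, `PendingTask`, the `launch` callback, and the move to `IN_PROGRESS` after attaching are all hypothetical, and the callback is assumed to be idempotent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical recovery sweep; not Conductor code.
final class LaunchRecoverySketch {
    enum Status { SCHEDULED, IN_PROGRESS }

    static final class PendingTask {
        final String taskId;
        Status status;
        String subWorkflowId; // null while no child workflow is attached

        PendingTask(String taskId, Status status, String subWorkflowId) {
            this.taskId = taskId;
            this.status = status;
            this.subWorkflowId = subWorkflowId;
        }
    }

    // Retries launch for every SCHEDULED task that has no subWorkflowId and
    // returns the IDs of the tasks it repaired. `launch` maps a task ID to a
    // child workflow ID and is assumed to be safe to call more than once.
    static List<String> retryUnattached(List<PendingTask> tasks, UnaryOperator<String> launch) {
        List<String> retried = new ArrayList<>();
        for (PendingTask task : tasks) {
            if (task.status == Status.SCHEDULED && task.subWorkflowId == null) {
                task.subWorkflowId = launch.apply(task.taskId);
                task.status = Status.IN_PROGRESS; // assumption: attaching moves the task forward
                retried.add(task.taskId);
            }
        }
        return retried;
    }

    public static void main(String[] args) {
        List<PendingTask> tasks = Arrays.asList(
                new PendingTask("t1", Status.SCHEDULED, null),        // stuck: gets retried
                new PendingTask("t2", Status.IN_PROGRESS, "child-9"), // already attached
                new PendingTask("t3", Status.SCHEDULED, "child-3"));  // has an ID: left alone
        System.out.println(retryUnattached(tasks, id -> "child-of-" + id));
    }
}
```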