ProcessGroupNCCL: use _TimeoutManager to provide sane NCCL abort semantics #141
This disables the normal `ProcessGroupNCCL` timeouts and instead uses `_TimeoutManager` to provide user-space NCCL abort semantics, using a background timer thread and a CUDA event to track completion.

Notable changes:
- For each `ProcessGroupNCCL` collective we register an abort callback that fires if it doesn't complete in time.
- NCCL 2.25.1-1
- Adds `_opts_hook` and `_wrap_work` to `ProcessGroupWrapper` to provide wrappers around the base `ProcessGroupNCCL` to add in this logic.

This also updates the `process_group_test.py` resiliency tests, since they were broken and always returned success messages. They now guarantee that operations scheduled on a PG exit on error and don't block forever.

TODO: Figure out if we need to set these envs when we are also overriding the timeout.
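For readers unfamiliar with the pattern, here is a minimal sketch of the idea described above: a single background thread polls a completion check and fires a callback once a deadline passes. This is not the `_TimeoutManager` from this PR. `SketchTimeoutManager`, `register`, `shutdown`, `is_done`, and `on_timeout` are hypothetical names chosen for illustration, and the completion check is a plain callable rather than a real CUDA event.

```python
import threading
import time
from typing import Callable, List, Tuple


class SketchTimeoutManager:
    """Hypothetical illustration only; not the PR's _TimeoutManager."""

    def __init__(self, poll_interval: float = 0.01) -> None:
        self._poll_interval = poll_interval
        self._lock = threading.Lock()
        # Each entry: (deadline, is_done, on_timeout)
        self._pending: List[Tuple[float, Callable[[], bool], Callable[[], None]]] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def register(
        self,
        is_done: Callable[[], bool],
        on_timeout: Callable[[], None],
        timeout: float,
    ) -> None:
        """Track one operation; on_timeout fires if is_done stays False past the deadline."""
        with self._lock:
            self._pending.append((time.monotonic() + timeout, is_done, on_timeout))

    def shutdown(self) -> None:
        """Stop the background thread and wait for it to exit."""
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            now = time.monotonic()
            expired: List[Callable[[], None]] = []
            with self._lock:
                still_pending = []
                for deadline, is_done, on_timeout in self._pending:
                    if is_done():
                        continue  # completed in time; drop the entry
                    if now >= deadline:
                        expired.append(on_timeout)
                    else:
                        still_pending.append((deadline, is_done, on_timeout))
                self._pending = still_pending
            # Run callbacks outside the lock so they may call register() safely.
            for callback in expired:
                callback()
            self._stop.wait(self._poll_interval)
```

In a setting like the one this PR describes, `is_done` could plausibly be the `query` method of a `torch.cuda.Event` recorded after the collective, and `on_timeout` would be whatever aborts the communicator. Both of those wirings are assumptions here, not a description of the actual implementation.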
Test plan:
I haven't tested this E2E with torchtitan yet but that's next on the list.