ProcessGroupNCCL: use _TimeoutManager to provide sane NCCL abort semantics #141

d4l3k · 2025-03-18T22:55:11Z

This disables the normal ProcessGroupNCCL timeouts and instead uses _TimeoutManager to provide user-space NCCL abort semantics, using a background timer thread and a CUDA event to track completion.

Notable changes:

Every time we wait on a ProcessGroupNCCL collective, we register an abort callback that fires if the collective doesn't complete in time.
Long timeouts will result in many extra timeout entries that are never cancelled, since we don't proactively clean up the timer events.
This depends on NCCL abort working, which is known to be buggy in NCCL 2.25.1-1.
We add _opts_hook and _wrap_work to ProcessGroupWrapper to provide wrappers around the base ProcessGroupNCCL that add in this logic.
We always disable the timeout passed to the base ProcessGroupNCCL to avoid watchdog/heartbeat failures. We may also need to set other options on creation to avoid having background NCCL errors crash the process.
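
To make the timer-thread idea above concrete, here is a minimal sketch of the pattern, not torchft's actual _TimeoutManager. The names SketchTimeoutManager, TimerHandle and wait_or_abort are hypothetical, and a threading.Event stands in for both the collective and the CUDA event that tracks its completion. As in the description, a cancelled entry is only flagged and stays queued until its deadline passes.

```python
import heapq
import threading
import time
from typing import Callable, List, Tuple


class TimerHandle:
    """Returned by register(); cancel() only flags the entry, it is not removed."""

    def __init__(self) -> None:
        self.cancelled = False

    def cancel(self) -> None:
        self.cancelled = True


class SketchTimeoutManager:
    """Hypothetical stand-in: one daemon thread draining a heap of deadlines."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._heap: List[Tuple[float, int, TimerHandle, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so heapq never has to compare handles
        threading.Thread(target=self._run, daemon=True).start()

    def register(self, timeout_s: float, on_timeout: Callable[[], None]) -> TimerHandle:
        handle = TimerHandle()
        with self._cond:
            deadline = time.monotonic() + timeout_s
            heapq.heappush(self._heap, (deadline, self._seq, handle, on_timeout))
            self._seq += 1
            self._cond.notify()  # wake the timer thread to re-check the earliest deadline
        return handle

    def _run(self) -> None:
        while True:
            with self._cond:
                while not self._heap:
                    self._cond.wait()
                deadline, _, handle, callback = self._heap[0]
                remaining = deadline - time.monotonic()
                if remaining > 0:
                    self._cond.wait(remaining)
                    continue
                heapq.heappop(self._heap)
            # Run the callback outside the lock; skip entries cancelled in time.
            if not handle.cancelled:
                callback()


def wait_or_abort(
    manager: SketchTimeoutManager,
    done: threading.Event,
    abort: Callable[[], None],
    timeout_s: float,
) -> None:
    """Block on `done`, but let the timer thread call `abort` after timeout_s."""
    handle = manager.register(timeout_s, abort)
    try:
        done.wait()  # stands in for blocking on the collective
    finally:
        handle.cancel()
```

In this sketch the abort callback is responsible for unblocking the waiter (here by setting the event); in the PR that role falls to the NCCL abort, which is why the change depends on abort behaving correctly.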

This also updates the process_group_test.py resiliency tests, which were broken and always returned success messages. They now guarantee that operations scheduled on a PG exit on error and don't block forever.

TODO: Figure out whether we need to set these envs when we are also overriding the timeout:

TORCH_NCCL_RETHROW_CUDA_ERRORS=0
TORCH_NCCL_DUMP_ON_TIMEOUT=0
TORCH_NCCL_PROPAGATE_ERROR=0
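
If these variables do turn out to be needed, one way to apply them is sketched below. The helper name apply_nccl_env_overrides is hypothetical, and the sketch assumes the variables must be in the environment before the NCCL process group is created. It uses setdefault so that a value the user has already exported wins.

```python
import os
from typing import MutableMapping

# The three variables from the TODO above, with the values listed there.
NCCL_ENV_OVERRIDES = {
    "TORCH_NCCL_RETHROW_CUDA_ERRORS": "0",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "0",
    "TORCH_NCCL_PROPAGATE_ERROR": "0",
}


def apply_nccl_env_overrides(env: MutableMapping[str, str] = os.environ) -> None:
    """Set each override unless the caller's environment already defines it."""
    for name, value in NCCL_ENV_OVERRIDES.items():
        env.setdefault(name, value)
```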

Test plan:

pytest torchft/process_group_test.py

I haven't tested this E2E with torchtitan yet but that's next on the list.

fegin · 2025-03-19T00:47:34Z

Want to confirm my understanding: we now have ProcessGroupBabyNCCL and ProcessGroupNCCL. The former uses multiprocessing while the latter uses abort + timeout.

fegin

LGTM

fegin · 2025-03-19T00:34:01Z

torchft/process_group.py

+    def _wrap_work(self, work: Work, opts: object) -> Work:
+        return work


Not quite sure why we need this function if we just return the work?

Nvm, saw it is overridden later.
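
The exchange above describes a hook pattern: the base wrapper returns the work unchanged, and a subclass overrides the hook to decorate it. A rough sketch with stand-in classes follows. FakeWork, WrapperSketch, LoggedWork and NCCLSketch are hypothetical and do not mirror torchft's real classes; only the `_wrap_work(self, work, opts)` shape is taken from the diff.

```python
from typing import List, Optional


class FakeWork:
    """Stand-in for a distributed Work object; only wait() is modelled."""

    def wait(self) -> bool:
        return True


class WrapperSketch:
    """Hypothetical analogue of the base wrapper: the hook is a no-op."""

    def _wrap_work(self, work: FakeWork, opts: object) -> FakeWork:
        return work

    def allreduce(self, opts: Optional[object] = None) -> FakeWork:
        work = FakeWork()  # a real wrapper would delegate to the parent PG here
        return self._wrap_work(work, opts)


class LoggedWork(FakeWork):
    """Decorates another work object; a real override could arm a timeout here."""

    def __init__(self, inner: FakeWork, log: List[str]) -> None:
        self._inner = inner
        self._log = log

    def wait(self) -> bool:
        self._log.append("wait")
        return self._inner.wait()


class NCCLSketch(WrapperSketch):
    """Subclass that overrides the hook, so every collective returns wrapped work."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def _wrap_work(self, work: FakeWork, opts: object) -> FakeWork:
        return LoggedWork(work, self.log)
```

Keeping the default hook as an identity function means every collective in the base class can route through it without caring whether a subclass adds behavior.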


d4l3k requested review from H-Huang, allenwang28 and fegin March 18, 2025 22:55

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 18, 2025

d4l3k force-pushed the d4l3k/nccl_abort branch 2 times, most recently from ef32d98 to f53a0c8 Compare March 18, 2025 23:58

fegin approved these changes Mar 19, 2025


d4l3k force-pushed the d4l3k/nccl_abort branch from f53a0c8 to 962f220 Compare March 19, 2025 17:59

ProcessGroupNCCL: use _TimeoutManager to provide sane NCCL abort semantics

8c449c2

d4l3k force-pushed the d4l3k/nccl_abort branch from 962f220 to 8c449c2 Compare March 19, 2025 20:37

d4l3k merged commit ad0ca0a into main Mar 19, 2025
7 checks passed

d4l3k deleted the d4l3k/nccl_abort branch March 19, 2025 20:50
