Skip to content

Conversation

@d4l3k
Copy link
Member

@d4l3k d4l3k commented Mar 18, 2025

This disables normal ProcessGroupNCCL timeouts and instead uses _TimeoutManager to provide user space NCCL abort semantics using a background timer thread and a CUDA event to track completion.

Notable changes:

  • Every time we wait on ProcessGroupNCCL collective we register an abort callback that fires if it doesn't complete in time.
  • Long timeouts will result in many extra timeout entries that aren't cancelled since we don't proactively cleanup the timer events.
  • This depends on NCCL aborting which is known to be buggy in NCCL 2.25.1-1
  • We use _opts_hook and _wrap_work to ProcessGroupWrapper to provide wrappers around the base ProcessGroupNCCL to add in this logic.
  • We always disable the timeout passed to base ProcessGroupNCCL to avoid watchdog/heartbeat from failing. We may also need to set other options on creation to avoid having background NCCL errors crash the process.

This also updates the process_group_test.py resiliency tests since they were broken and always returned success messages. It now guarantees that operations scheduled on a PG on error exit and don't block forever.

TODO: Figure out if we need to set these envs when we are also overriding the timeout

  • TORCH_NCCL_RETHROW_CUDA_ERRORS=0
  • TORCH_NCCL_DUMP_ON_TIMEOUT=0
  • TORCH_NCCL_PROPAGATE_ERROR=0

Test plan:

pytest torchft/process_group_test.py

I haven't tested this E2E with torchtitan yet but that's next on the list.

@d4l3k d4l3k requested review from H-Huang, allenwang28 and fegin March 18, 2025 22:55
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 18, 2025
@d4l3k d4l3k force-pushed the d4l3k/nccl_abort branch 2 times, most recently from ef32d98 to f53a0c8 Compare March 18, 2025 23:58
@fegin
Copy link
Contributor

fegin commented Mar 19, 2025

Want to confirm my understanding. Now we have ProcessGroupBabyNCCL and ProcessGroupNCCL. The former one uses multi-processing while the later use abort + timeout.

Copy link
Contributor

@fegin fegin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +396 to +397
def _wrap_work(self, work: Work, opts: object) -> Work:
return work
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure why do we need this function if we just return the work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, saw it is overridden later.

@d4l3k d4l3k force-pushed the d4l3k/nccl_abort branch from f53a0c8 to 962f220 Compare March 19, 2025 17:59
@d4l3k d4l3k force-pushed the d4l3k/nccl_abort branch from 962f220 to 8c449c2 Compare March 19, 2025 20:37
@d4l3k d4l3k merged commit ad0ca0a into main Mar 19, 2025
7 checks passed
@d4l3k d4l3k deleted the d4l3k/nccl_abort branch March 19, 2025 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants