Skip to content

Conversation

@d4l3k
Copy link
Member

@d4l3k d4l3k commented Mar 20, 2025

This updates the TimeoutManager to use a deletion queue so we delete the CUDA events on the main thread. If we delete these on the background thread we can end up deadlocking and not calling future timeout.

This is critical for NCCL abort as cudaEventDestroy can block in scenarios when a NCCL operation is stuck.

Test plan:

Tested in torchtitan

pytest torchft/futures_test.py

@d4l3k d4l3k requested review from H-Huang and fegin March 20, 2025 21:21
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 20, 2025
@d4l3k d4l3k force-pushed the d4l3k/del_queue branch from 75350d4 to 3dae3c8 Compare March 20, 2025 23:21
while True:
try:
# get and immediately discard item
self._del_queue.get_nowait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do a sanity check to ensure there are no other reference to the event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I added a getrefcount assertion

Copy link
Contributor

@fegin fegin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@d4l3k d4l3k force-pushed the d4l3k/del_queue branch from 3dae3c8 to c1ab7d9 Compare March 20, 2025 23:54
@d4l3k d4l3k merged commit 538b219 into main Mar 21, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants