@fegin fegin commented Jan 29, 2025

ManagedDeviceMesh has several gaps that prevent it from being used in TorchTitan. This PR fixes those gaps:

  1. ManagedDeviceMesh can now be saved and loaded with torch.save()/torch.load().
  2. ManagedDeviceMesh now reports a nonzero size even when there are zero replicated group participants, because a size-0 DeviceMesh would confuse training loops.
  3. Coordinates are now returned correctly.
  4. The pg reinitialization issue is removed.
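
As a rough illustration of items 1 and 2, here is a minimal sketch in plain Python. It is not the code from this PR: `SketchManagedMesh`, its `_handle` attribute, and the choice to clamp the reported size to 1 are all hypothetical stand-ins. The only fact relied on is that torch.save() serializes objects through Python pickling, so an object holding an unpicklable handle needs `__getstate__`/`__setstate__` (or similar) to round-trip.

```python
import pickle
import threading


class SketchManagedMesh:
    """Hypothetical stand-in for a mesh wrapper; not the real ManagedDeviceMesh."""

    def __init__(self, num_participants: int) -> None:
        self.num_participants = num_participants
        # Stand-in for a live process-group handle, which cannot be pickled.
        self._handle = threading.Lock()

    def size(self) -> int:
        # Assumption: report at least 1 so a training loop never sees size 0.
        return max(self.num_participants, 1)

    def __getstate__(self) -> dict:
        # Drop the unpicklable handle before serialization.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)
        # Recreate the handle after loading.
        self._handle = threading.Lock()


mesh = SketchManagedMesh(num_participants=0)
restored = pickle.loads(pickle.dumps(mesh))
```

In this sketch `restored.size()` is 1 even though `num_participants` is 0, and the handle is rebuilt on load rather than serialized.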

@fegin fegin requested review from H-Huang and d4l3k January 29, 2025 18:49
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 29, 2025
@fegin fegin requested a review from wz337 January 29, 2025 18:49
@d4l3k d4l3k left a comment

LGTM

@fegin fegin merged commit 2a67d66 into main Jan 29, 2025
6 checks passed