@d4l3k d4l3k commented Jan 16, 2025

This adds a TorchX component for running multi-replica tests locally. It makes E2E behavior easier to validate, since running an E2E job with train_ddp.py now takes only two commands.

TorchX is a job launcher. The component added here launches one role per replica group, with torchelastic managing the workers within each group. Elastic uses ports starting at 29600: each replica group gets port 29600 + replica_id.
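As a rough illustration of the layout described above, here is a minimal sketch of how one role per replica group could be planned, each with its own port. It deliberately does not use the TorchX API: `ReplicaGroupRole`, `plan_roles`, the `replica_{id}` naming, and `workers_per_replica` are hypothetical names chosen for this example. Only the base port 29600 and the `29600 + replica_id` rule come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass

BASE_PORT = 29600  # base port stated in the PR description


@dataclass(frozen=True)
class ReplicaGroupRole:
    """Hypothetical stand-in for one launcher role (not a TorchX type)."""

    name: str
    replica_id: int
    port: int
    workers: int


def plan_roles(replicas: int, workers_per_replica: int = 1) -> list[ReplicaGroupRole]:
    """Plan one role per replica group, each on port BASE_PORT + replica_id."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [
        ReplicaGroupRole(
            name=f"replica_{replica_id}",  # naming scheme is an assumption
            replica_id=replica_id,
            port=BASE_PORT + replica_id,
            workers=workers_per_replica,
        )
        for replica_id in range(replicas)
    ]


if __name__ == "__main__":
    for role in plan_roles(replicas=3, workers_per_replica=2):
        print(f"{role.name}: port={role.port}, workers={role.workers}")
```

Because each replica group gets a distinct port, the groups can coexist on one machine without their elastic agents colliding, which is what makes a local run with many replicas practical.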

It also includes some small tweaks to logging, static types and the Lighthouse dashboard for monitoring.

Test plan:

torchft_lighthouse --min_replicas 2 --join_timeout_ms 10000 &

torchx run -- --replicas 20

Increasing the join timeout is required to avoid split-brain issues while some replicas are recovering.

From experimentation, join_timeout_ms must be longer than the recovery time; otherwise, workers will never recover.
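The constraint above can be turned into a quick sanity check before launching. The sketch below is not part of torchft: `join_timeout_ok` and the `margin` safety factor are hypothetical, and the recovery times are values you would have to measure yourself. It only encodes the rule that join_timeout_ms should exceed the slowest observed recovery.

```python
from __future__ import annotations


def join_timeout_ok(
    join_timeout_ms: int,
    recovery_times_s: list[float],
    margin: float = 1.5,
) -> bool:
    """Return True if the join timeout exceeds the worst recovery time.

    recovery_times_s: measured recovery durations in seconds (user-supplied).
    margin: hypothetical safety factor applied to the worst case.
    """
    if not recovery_times_s:
        raise ValueError("need at least one measured recovery time")
    worst_case_ms = max(recovery_times_s) * 1000.0
    # The timeout must be strictly longer than the (padded) worst case.
    return join_timeout_ms > worst_case_ms * margin


if __name__ == "__main__":
    measured = [4.0, 6.0]  # example values, not real measurements
    print(join_timeout_ok(10000, measured))
```

With the example values, the slowest recovery is 6 s, so the padded worst case is 9000 ms and a 10000 ms join timeout passes. A recovery of 8 s would fail the check at the default margin.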

[Screenshot: 20250116_14h48m54s_grim]

pyre

@facebook-github-bot facebook-github-bot added the CLA Signed label Jan 16, 2025

@H-Huang H-Huang left a comment

looks good!

@d4l3k d4l3k merged commit 39a40b2 into main Jan 18, 2025
6 checks passed
@d4l3k d4l3k deleted the d4l3k/torchx branch January 18, 2025 05:26