amirafzali
Member

This introduces a failure injector with 5 failure modes:

  • SEGFAULT: Triggers a SIGSEGV on the process
  • DEADLOCK: Deadlocks the GIL, resulting in a ProcessGroupNCCL timeout and terminal failure
  • KILL_PROC: Immediately kills the process with a non-zero exit code
  • COMMS: Forcefully aborts the ProcessGroup and NCCL communicator
  • KILL_SLURM: Kills a random replica SLURM job

It is enabled with the `--with--failure` flag and runs asynchronously every 60 seconds.
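
A minimal sketch of how such an injector could be structured, not the code in this PR. The five mode names come from the list above; `FailureType`, `injection_loop`, `inject`, and the two trigger helpers are hypothetical names chosen for illustration. DEADLOCK, COMMS, and KILL_SLURM depend on the training setup, so they are left to the `inject` callback.

```python
import asyncio
import enum
import os
import random
import signal
from typing import Awaitable, Callable, List, Optional


class FailureType(enum.Enum):
    """The five failure modes named in the description."""
    SEGFAULT = "segfault"
    DEADLOCK = "deadlock"
    KILL_PROC = "kill_proc"
    COMMS = "comms"
    KILL_SLURM = "kill_slurm"


def trigger_segfault() -> None:
    # One possible way to raise SIGSEGV on the current process (POSIX only).
    os.kill(os.getpid(), signal.SIGSEGV)


def trigger_kill_proc() -> None:
    # Exit immediately with a non-zero code, skipping cleanup handlers.
    os._exit(1)


async def injection_loop(
    inject: Callable[[FailureType], Awaitable[None]],
    interval_s: float,
    rounds: int,
    rng: Optional[random.Random] = None,
) -> List[FailureType]:
    """Every `interval_s` seconds, pick a random mode and hand it to `inject`."""
    rng = rng or random.Random()
    injected: List[FailureType] = []
    for _ in range(rounds):
        await asyncio.sleep(interval_s)
        failure = rng.choice(list(FailureType))
        await inject(failure)
        injected.append(failure)
    return injected
```

In a real run, `inject` would forward the chosen mode to a trainer process; in a dry run it can simply log the mode, which keeps the loop testable without killing anything.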

@amirafzali amirafzali requested a review from d4l3k September 25, 2025 14:47
@meta-cla meta-cla bot added the CLA Signed label Sep 25, 2025
Member

@d4l3k d4l3k left a comment

LGTM! Looks like some of the tests/lint is failing?

f"{self.uid} Injecting failure ({failure_type}) into random trainer"
)

await self.failure_actors.fail.choose(failure_type)
Member

.choose picks an arbitrary trainer?

Member Author

.choose picks an arbitrary trainer?

yup, choose will send it to one random trainer in the replica mesh
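
A toy illustration of the semantics described here, assuming nothing about Monarch's actual API: `ToyMesh` and its methods are hypothetical stand-ins that only contrast delivering a message to one random member with delivering it to all members.

```python
import random
from typing import Dict, List, Optional


class ToyMesh:
    """Hypothetical stand-in for a mesh of trainer actors (not Monarch's API)."""

    def __init__(self, names: List[str], rng: Optional[random.Random] = None):
        self.names = list(names)
        self.rng = rng or random.Random()
        self.received: Dict[str, List[str]] = {name: [] for name in self.names}

    def choose(self, message: str) -> str:
        # Deliver the message to exactly one randomly selected member.
        target = self.rng.choice(self.names)
        self.received[target].append(message)
        return target

    def broadcast(self, message: str) -> List[str]:
        # Deliver the message to every member.
        for name in self.names:
            self.received[name].append(message)
        return list(self.names)
```

With three trainers, one `choose("SEGFAULT")` call leaves a single trainer holding the message, whereas `broadcast` would fail all three at once.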

@amirafzali
Member Author

LGTM! Looks like some of the tests/lint is failing?

hm failing test is unrelated/flaky, I see it on #268
will address lint

@colin2328 colin2328 left a comment

LGTM! thanks

@amirafzali amirafzali changed the title from "add failure injector for monarch script" to "add failure injector for monarch training script" Sep 30, 2025
@facebook-github-bot
Contributor

@amirafzali has imported this pull request. If you are a Meta employee, you can view this in D83601242.

Summary:
This introduces a failure injector with 5 failure modes:
- SEGFAULT: Triggers a SIGSEGV on the process
- DEADLOCK: Deadlocks the GIL, resulting in a ProcessGroupNCCL timeout and terminal failure
- KILL_PROC: Immediately kills the process with a non-zero exit code
- COMMS: Forcefully aborts the ProcessGroup and NCCL communicator
- KILL_SLURM: Kills a random replica SLURM job

It can be enabled with the flag `--with--failure`, and it runs async every 120 seconds.


Reviewed By: tushar00jain

Differential Revision: D83601242

Pulled By: amirafzali
@facebook-github-bot
Contributor

@amirafzali has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83601242.

@facebook-github-bot
Contributor

@amirafzali merged this pull request in d596ec7.

Labels
CLA Signed, fb-exported, Merged, meta-exported
4 participants