Gracefully exit rank != 0 job steps on slurm cluster #1780
erjel wants to merge 6 commits into facebookincubator:main from
Conversation
ping @jrapin Any thoughts on this?
Seems like an application issue that should be solved in the application and not in submitit. What multiprocessing start method do you use for the dataloader processes? Do your jobs still fail to exit if you set it before doing any CUDA-related operation?

Lengthier explanation: the dataloader processes should not hold any GPU resource, not even an unused CUDA context, or they may fail to die when receiving a signal. The default multiprocessing start method is …
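For reference, a minimal sketch of what the reviewer is suggesting (the worker and its logic are illustrative, not taken from the PR): selecting the `spawn` start method before any CUDA work means worker processes start from a fresh interpreter instead of forking the parent, so they do not inherit an already-initialized CUDA context and can die cleanly when signalled.

```python
import multiprocessing as mp

def worker(x: int) -> int:
    # Dataloader-style worker: CPU-only, holds no CUDA context,
    # so it can be terminated cleanly by SIGTERM/SIGKILL.
    return x * x

if __name__ == "__main__":
    # Set the start method once, before any CUDA-related operation.
    # "spawn" launches fresh interpreters rather than forking, so the
    # workers never inherit the parent's CUDA state.
    mp.set_start_method("spawn", force=True)
    with mp.Pool(processes=2) as pool:
        print(pool.map(worker, range(4)))  # [0, 1, 4, 9]
```

With a PyTorch `DataLoader`, the corresponding knob is its `multiprocessing_context` argument.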
Hi,

first, thanks for open-sourcing submitit! It simplifies the setup of distributed cluster jobs on our in-house slurm cluster a lot.

Some context:
While requeuing 32-node jobs, I got a message from our infrastructure team that my jobs were causing nodes to be stuck in the slurm "drained" state. Even worse: the requeuing worked fine, so my job brought down the first 32 nodes and restarted on 32 different nodes (which it probably would have "drained" as well). This would eventually have brought down our entire GPU partition.
According to the infrastructure team, nodes go to the "drained" state when job processes keep running on the cluster after the final `SIGKILL` was sent (i.e. when the timeout is reached or when the job is requeued). As far as I understand, the timing of the `SIGKILL` depends on the individual slurm settings. On our cluster there is a very liberal 120 sec delay between timeout and signal delivery (apparently the default is about 30 sec).

For debugging, we had a look at the running processes on one of the nodes, with 4 GPUs each.
Normally there are multiple processes running:
Once a timeout is reached:
The remaining Python scripts keep running (even continuing to utilize GPU resources) until `SIGKILL` is sent. While we could not reproducibly trigger the drained state on the fly, we decided that I should not rely on `SIGKILL` to bring my remaining job steps down.

The solution which worked on our cluster:
After a timeout or a requeue, a `SIGTERM` is sent to all job steps. While we don't want to call `sys.exit(-1)` on rank != 0 nodes as soon as `SIGUSR2` is sent (i.e. to give the rank 0 job step time to create a checkpoint first), we can prepare the job steps for the following `SIGTERM`. `SIGTERM` is triggered by slurm after `scontrol requeue <jobid>` and is thus a strong indication that rank 0 is done with checkpointing and that we can call `sys.exit()` in the remaining job steps.

With the proposed changes we could observe that the jobs gracefully exit well before the 120 sec
`SIGKILL` time limit. Additionally, we noticed that the time spent in the slurm "completing" state is notably reduced.

Even though cluster setups can be wildly different, I decided to create a pull request in order to give feedback to the community (and hope that the workaround and information are useful for others).
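The two-signal idea described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual diff: the helper name and the way the rank is obtained are assumptions, though slurm does export the process rank as `SLURM_PROCID`.

```python
import os
import signal
import sys

def install_sigterm_exit(rank: int) -> None:
    """Let rank != 0 job steps exit promptly on SIGTERM (sent by slurm
    after `scontrol requeue <jobid>`), instead of lingering until the
    final SIGKILL and risking a "drained" node."""
    def _handler(signum, frame):
        if rank != 0:
            # Rank 0 checkpoints on SIGUSR2 and gets requeued; once
            # SIGTERM arrives, the other ranks have nothing left to do.
            sys.exit(0)

    signal.signal(signal.SIGTERM, _handler)

if __name__ == "__main__":
    # The rank would typically come from the slurm environment.
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    install_sigterm_exit(rank)
```

Note that `sys.exit()` raises `SystemExit` in the interrupted frame, so any `finally` blocks and `atexit` handlers still run, which keeps the shutdown graceful rather than abrupt.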