Adds early stopping support for Ray integration #3276
base: main
Conversation
Hi @garylvov, here is the PR as agreed. I noticed while re-testing that it sometimes works and sometimes doesn't. I think the issue is not the mechanism inside the PR, but rather that Isaac Sim does not always respond in time to the termination signal. When this happens, the next training does not start. One idea could be to check whether the subprocess has exited within a threshold and kill it if it has not, but I do not know how to do this with stoppers, as normally Ray handles this. Alternatively, would it be better to add a mechanism to …?
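For illustration, the check-and-kill idea could look roughly like the sketch below (untested; how to obtain the training subprocess's PID from within a stopper is exactly the open question):

```python
# Untested sketch, assuming `pid` is the PID of the lingering training subprocess.
import psutil


def ensure_exited(pid: int, timeout_s: float = 30.0) -> None:
    """Wait up to `timeout_s` for the process to exit, then force-kill it."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return  # already gone, nothing to do
    try:
        proc.wait(timeout=timeout_s)
    except psutil.TimeoutExpired:
        proc.kill()  # SIGKILL as a last resort
```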
@garylvov, a small update: I was wrong to say it sometimes works. By coincidence, the processes were ending anyway shortly after the early-stop signal, which is why I thought it sometimes worked. I have debugged it further and can confirm that even after a "trial" is marked as completed, the subprocess/training continues to the end. The following trial might then fail, e.g., due to a lack of GPU memory.
Hi, thanks for your further investigation.
I think this could work, but I would be a little worried about it being too "kill happy" and erroneously shutting down processes that were only experiencing an ephemeral stall. Perhaps we can just wait for a few moments and, if it's still halted, then kill it. However, it may be a suboptimal design to have a new Ray process clean up other processes before it starts, as opposed to a Ray process cleaning up its own process after it finishes.
I would assume that Ray could do this well enough out of the box to stop the rogue processes, but I guess that's wishful thinking ;) I will do some testing of this too. I think you may be onto something with a mechanism related to the Ray stopper. Maybe we can override some sort of cleanup method to aggressively SIGKILL the PID recovered by …
I agree. Nevertheless, I still wanted to try to kill it the following way:
but it did not work. Did you manage to get a lead on what might work in a robust way?
Hi @ozhanozen, I have yet another workaround that may just work ;) I think we could add … to the training scripts (RSL RL, RL Games, SKRL, etc.). Then, for each job submitted in util, we can assign an extra PID. I haven't implemented it, but I've tested it manually with the following:
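Roughly, the idea might be sketched like this (the `--ray_proc_id` argument name is a placeholder, not the actual snippet): the training script accepts an inert identifier so that it shows up in the process command line.

```python
# Hypothetical sketch: accept an opaque tag that training itself ignores,
# purely so the tag becomes visible in the process command line.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--ray_proc_id",
    type=str,
    default=None,
    help="Opaque tag assigned by the Ray tuning util; unused by training.",
)
args, _ = parser.parse_known_args()
```

The tuning util would then append something like `--ray_proc_id=<unique id>` to each command it launches.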
Then, the following has proven to be effective at shutting down all processes.
I think this will be effective since, for example, a mismatched id like the one below doesn't kill the process:
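In rough Python terms, both points could be sketched as follows (the tag string and id values are placeholders):

```python
# Sketch: SIGKILL every process whose command line carries the assigned tag;
# a mismatched tag matches nothing, so unrelated processes are left alone.
import psutil


def kill_tagged(tag: str) -> int:
    killed = 0
    for proc in psutil.process_iter(["cmdline"]):
        try:
            if tag in " ".join(proc.info["cmdline"] or []):
                proc.kill()
                killed += 1
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return killed


kill_tagged("ray_proc_id=42")   # shuts down the tagged training job
kill_tagged("ray_proc_id=999")  # mismatched id: nothing matches, nothing is killed
```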
Now we just need to add some sort of exit hook to the trials to kill the assigned ray proc id.
We could maybe use https://docs.ray.io/en/latest/tune/api/callbacks.html to have each trial clean itself up.
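A minimal sketch of such a callback (this assumes the hypothetical `ray_proc_id` tagging from above, not an existing API in the repo):

```python
# When a trial completes, SIGKILL any process whose command line still
# carries that trial's tag.
import psutil
from ray import tune


class TrialCleanupCallback(tune.Callback):
    def on_trial_complete(self, iteration, trials, trial, **info):
        tag = f"ray_proc_id={trial.trial_id}"
        for proc in psutil.process_iter(["cmdline"]):
            try:
                if tag in " ".join(proc.info["cmdline"] or []):
                    proc.kill()
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass


# The callback would be passed when launching the sweep, e.g.:
# tune.run(trainable, callbacks=[TrialCleanupCallback()])
```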
Actually, now that I've played with this more, I think it's important that we SIGKILL instead of SIGINT. Maybe … would work without the extra rid stuff.
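One guess at what might work here (an untested sketch, not what was actually tried): SIGKILL the trial process's whole subtree instead of interrupting it.

```python
import psutil


def kill_tree(pid: int) -> None:
    """SIGKILL a process and all of its descendants."""
    parent = psutil.Process(pid)
    for child in parent.children(recursive=True):
        child.kill()
    parent.kill()
```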
I think it is better at killing old jobs, but it keeps running into log extraction errors after terminating processes. Please let me know if you can figure out why!
Hi @garylvov, I just had the opportunity to check this; sorry for the delay. I had encountered two different issues (one regarding rl_games at #3531, one about Ray initialization at #3533), which might be causing the log extraction errors you faced. If I combine the fixes from both of these PRs with the current PR, I am able to do early stopping successfully without encountering any further errors. Could you also try again with these? Note: you had also used …
Awesome, thank you so much for figuring it out! Sounds good about the … I'm excited to try this soon!
Hi @kellyguo11,
Description
This PR introduces support for early stopping in Ray integration through the `Stopper` class. It enables trials to end sooner when they are unlikely to yield useful results, reducing wasted compute time and speeding up experimentation.

Previously, when running hyperparameter tuning with Ray integration, all trials would continue until the training configuration's maximum iterations were reached, even if a trial was clearly underperforming. This wasn't always efficient, since poor-performing trials could often be identified early on. With this PR, an optional early stopping mechanism is introduced, allowing Ray to terminate unpromising trials sooner and improve the overall efficiency of hyperparameter tuning.

The PR also includes a `CartpoleEarlyStopper` example in `vision_cartpole_cfg.py`. This serves as a reference implementation that halts a trial if the `out_of_bounds` metric doesn't decrease after a set number of iterations. It's meant as a usage example: users are encouraged to create their own custom stoppers tailored to their specific use cases.

Fixes #3270.
Type of change
Checklist
- I have run the `pre-commit` checks with `./isaaclab.sh --format`
- I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- I have added my name to the `CONTRIBUTORS.md` or my name already exists there