
Conversation

ozhanozen
Contributor

Description

This PR introduces support for early stopping in Ray integration through the Stopper class. It enables trials to end sooner when they are unlikely to yield useful results, reducing wasted compute time and speeding up experimentation.

Previously, when running hyperparameter tuning with Ray integration, all trials would continue until the training configuration’s maximum iterations were reached, even if a trial was clearly underperforming. This wasn’t always efficient, since poor-performing trials could often be identified early on. With this PR, an optional early stopping mechanism is introduced, allowing Ray to terminate unpromising trials sooner and improve the overall efficiency of hyperparameter tuning.

The PR also includes a CartpoleEarlyStopper example in vision_cartpole_cfg.py. This serves as a reference implementation that halts a trial if the out_of_bounds metric doesn’t reduce after a set number of iterations. It’s meant as a usage example: users are encouraged to create their own custom stoppers tailored to their specific use cases.
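
For orientation, here is a minimal sketch of what such a stopper can look like with Ray Tune's Stopper interface (hypothetical class and parameter names; see vision_cartpole_cfg.py for the actual CartpoleEarlyStopper):

from ray import tune

class PlateauStopper(tune.Stopper):
    """Stop a trial when a metric has not improved for `patience` consecutive results (sketch)."""

    def __init__(self, metric: str = "out_of_bounds", patience: int = 10):
        self._metric = metric
        self._patience = patience
        self._best = {}   # trial_id -> best (lowest) metric value seen so far
        self._stale = {}  # trial_id -> consecutive results without improvement

    def __call__(self, trial_id, result):
        value = result.get(self._metric)
        if value is None:
            return False
        best = self._best.get(trial_id)
        if best is None or value < best:  # improvement = the metric decreased
            self._best[trial_id] = value
            self._stale[trial_id] = 0
            return False
        self._stale[trial_id] = self._stale.get(trial_id, 0) + 1
        return self._stale[trial_id] >= self._patience

    def stop_all(self):
        # Only stop individual trials, never the whole experiment.
        return False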

Fixes #3270.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen
Contributor Author

Hi @garylvov, here is the PR as agreed.

While re-testing, I noticed that it sometimes works and sometimes doesn't. I think the issue is not the mechanism inside the PR, but rather that Isaac Sim does not always respond to the termination signal in time. When this happens, the next training does not start within process_response_timeout, which halts the whole Ray main process.

One idea could be to check whether the subprocess exits within a threshold and kill it if it doesn't, but I do not know how to do this with stoppers, since Ray normally handles this. Alternatively, would it be better to add a mechanism to execute_job such that, if there is a halted Isaac Sim subprocess, we kill it before starting a new subprocess no matter what?
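
For illustration only, a minimal sketch of that second idea (hypothetical helper, not part of this PR): keep a handle on the previously launched subprocess and force-kill it if it is still alive before launching the next one.

import subprocess
from typing import Optional

_previous_proc: Optional[subprocess.Popen] = None

def launch_next_job(cmd: list) -> subprocess.Popen:
    """Kill any leftover Isaac Sim subprocess before starting the next training (sketch)."""
    global _previous_proc
    if _previous_proc is not None and _previous_proc.poll() is None:
        # The previous training is still running (or halted): force-kill it first.
        _previous_proc.kill()
        _previous_proc.wait()
    _previous_proc = subprocess.Popen(cmd)
    return _previous_proc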

@ozhanozen
Contributor Author

@garylvov, a small update:

I was wrong to say it sometimes works: by coincidence, the processes were ending shortly after the early-stop signal anyway, which is why I thought it sometimes worked.

I have debugged it further and can confirm that even after a "trial" is marked as completed, the subprocess/training continues to the end. The following trial might fail, e.g., due to a lack of GPU memory.

@garylvov
Collaborator

garylvov commented Aug 28, 2025

Hi thanks for your further investigation.

Alternatively, would it be better to add a mechanism to execute_job such that if there is a halted subprocess for Isaac Sim, we kill it before starting a new subprocess no matter what?

I think this could work, but I would be a little worried about it being too "kill happy", and erroneously shutting down processes that were experiencing an ephemeral stalling period. Perhaps we can just wait for a few moments, and if it's still halted, then kill it.

However, I think it may be non-optimal design to have a new ray process try to do cleanup of other processes before starting, as opposed to a ray process doing clean-up on its own process after it finishes.

I have debugged it further and can confirm that even after a "trial" is marked as completed, the subprocess/training continues to the end. The following trial might fail, e.g., due to a lack of GPU memory.

I would assume that Ray could do this well enough out of the box to stop the rogue processes, but I guess that's wishful thinking ;)

I will do some testing of this too. I think you may be onto something with some mechanism related to the Ray stopper. Maybe we can override some sort of cleanup method to aggressively SIGKILL the PID recovered by execute_job
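
As a hedged illustration of that idea (hypothetical helper, not this PR's code), such a cleanup method could SIGKILL the PID recorded at launch time and ignore processes that have already exited:

import os
import signal

def force_kill(pid: int) -> None:
    """Aggressively SIGKILL a recorded PID during trial cleanup (sketch)."""
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process is already gone, nothing to clean up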

@ozhanozen
Contributor Author

I think this could work, but I would be a little worried about it being too "kill happy", and erroneously shutting down processes that were experiencing an ephemeral stalling period. Perhaps we can just wait for a few moments, and if it's still halted, then kill it.

However, I think it may be non-optimal design to have a new ray process try to do cleanup of other processes before starting, as opposed to a ray process doing clean-up on its own process after it finishes.

I agree. Nevertheless, I still wanted to try killing it the following way:

self.proc.terminate()
try:
    self.proc.wait(timeout=20)
except subprocess.TimeoutExpired:
    self.proc.kill()
    self.proc.wait()

but it did not work. Did you manage to get a lead on what might work in a robust way?

@garylvov
Collaborator

garylvov commented Sep 9, 2025

Hi @ozhanozen I have yet another workaround that may just work ;)

I think we could add

parser.add_argument("--ray-proc-id", "-rid", type=int, default=None,
                    help="Automatically configured by Ray integration, otherwise None.")

to the training scripts (RSL RL, RL Games, SKRL, etc.).

Then, for each job submitted in util, we can assign an extra ID. I haven't implemented it, but I've tested it manually with the following:

./IsaacLab/isaaclab.sh -p IsaacLab/scripts/reinforcement_learning/rl_games/train.py --task Isaac-Cartpole-v0 --headless -rid 31415

Then, the following has proven to be effective at shutting down all processes.

pkill -9 -f "rid 31415" && sleep 5

I think this will be effective since, for example, a mismatched id like the one below doesn't kill the process:

pkill -9 -f "rid 31414"  # does nothing, proves that kill commands wouldn't affect unrelated running training

Now we just need to add some sort of exit hook to the trials to kill the assigned ray proc id
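
To sketch how the pieces could fit together (hypothetical helper names, untested; the short -rid flag is used so the pkill pattern above still matches):

import subprocess
import time
import uuid

def make_ray_proc_id() -> int:
    # Any unique-enough integer works; here it is derived from a random UUID.
    return uuid.uuid4().int % 10**8

def tag_command(train_cmd: str, rid: int) -> str:
    # Tag the training command so every spawned process carries the id on its command line.
    return f"{train_cmd} -rid {rid}"

def kill_by_rid(rid: int) -> None:
    # Exit hook: force-kill every process whose command line contains the id.
    # pkill returns non-zero when nothing matches, so check=True is deliberately not used.
    subprocess.run(["pkill", "-9", "-f", f"rid {rid}"])
    time.sleep(5)  # give the OS a moment to reap processes and free GPU memory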

@garylvov
Collaborator

garylvov commented Sep 9, 2025

We could maybe use https://docs.ray.io/en/latest/tune/api/callbacks.html to have each trial clean itself up
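
A hedged sketch of that callback idea (hypothetical; it assumes a kill_by_rid helper like the one above and that each trial's config carries the assigned id under a ray_proc_id key):

from ray import tune

class KillLeftoverProcs(tune.Callback):
    """Force-kill any processes tagged with a trial's ray proc id once the trial ends (sketch)."""

    def on_trial_complete(self, iteration, trials, trial, **info):
        rid = trial.config.get("ray_proc_id")
        if rid is not None:
            kill_by_rid(rid)  # assumed helper, see the sketch above

    def on_trial_error(self, iteration, trials, trial, **info):
        # Clean up after failed trials too, not only completed ones.
        self.on_trial_complete(iteration, trials, trial, **info)

The callback would then be registered through the callbacks argument of the Tune run configuration.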

@garylvov
Collaborator

garylvov commented Sep 9, 2025

Actually, now that I've played with this more, I think it's important that we SIGKILL instead of SIGINT

Maybe

os.kill(self.proc.pid, signal.SIGKILL)

would work without the extra rid stuff

@garylvov
Collaborator

garylvov commented Sep 10, 2025

I looked into this even more, and I believe the RID is helpful as there are so many related processes for a single Isaac training session:

[screenshot: process list showing the many related processes of a single Isaac training session]

I believe I have a minimal version that kind of works; pushing progress now.

@garylvov
Collaborator

garylvov commented Sep 10, 2025

I think it is better at killing old jobs, but it keeps on running into log extraction errors after terminating processes. Please let me know if you can figure out why!

@github-actions bot added the isaac-lab label on Sep 16, 2025
@ozhanozen
Contributor Author

Hi @garylvov, I just had the opportunity to check this; sorry for the delay.

I had encountered two different issues (one regarding rl_games at #3531, one about Ray initialization at #3533), which might be causing the log extraction errors you faced. If I combine the fixes from both of these PRs with the current PR, I can successfully do early stopping without encountering any further errors. Could you also try again with these?

Note: you had also used self.proc.terminate(9), which I believe is an invalid call (terminate() takes no arguments), so I reverted it. Nevertheless, the self.proc.kill() that follows self.proc.terminate() should already send the kill signal you wanted.

@garylvov
Collaborator

Awesome, thank you so much for figuring it out!

Sounds good about the terminate issue, thank you for spotting that.

I'm excited to try this soon!

@garylvov garylvov marked this pull request as ready for review September 23, 2025 18:20
@garylvov garylvov requested a review from ooctipus as a code owner September 23, 2025 18:20
Collaborator

@garylvov garylvov left a comment


Hi @kellyguo11,

This LGTM, just #3531 and #3533 need to be merged first


Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Proposal] Early stopping while doing hyperparameter tuning with Ray integration

3 participants