
Conversation

ozhanozen
Contributor

Description

This PR introduces support for early stopping in Ray integration through the Stopper class. It enables trials to end sooner when they are unlikely to yield useful results, reducing wasted compute time and speeding up experimentation.

Previously, when running hyperparameter tuning with Ray integration, all trials would continue until the training configuration’s maximum iterations were reached, even if a trial was clearly underperforming. This wasn’t always efficient, since poor-performing trials could often be identified early on. With this PR, an optional early stopping mechanism is introduced, allowing Ray to terminate unpromising trials sooner and improve the overall efficiency of hyperparameter tuning.

The PR also includes a CartpoleEarlyStopper example in vision_cartpole_cfg.py. This serves as a reference implementation that halts a trial if the out_of_bounds metric doesn’t reduce after a set number of iterations. It’s meant as a usage example: users are encouraged to create their own custom stoppers tailored to their specific use cases.
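
For orientation, here is a minimal sketch of what such a stopper can look like with Ray Tune's Stopper interface (hypothetical class and parameter names; see vision_cartpole_cfg.py for the actual CartpoleEarlyStopper):

from ray import tune

class PlateauStopper(tune.Stopper):
    """Stop a trial when a metric has not improved for `patience` consecutive results (sketch)."""

    def __init__(self, metric: str = "out_of_bounds", patience: int = 10):
        self._metric = metric
        self._patience = patience
        self._best = {}   # trial_id -> best (lowest) metric value seen so far
        self._stale = {}  # trial_id -> consecutive results without improvement

    def __call__(self, trial_id, result):
        value = result.get(self._metric)
        if value is None:
            return False
        best = self._best.get(trial_id)
        if best is None or value < best:  # improvement = the metric decreased
            self._best[trial_id] = value
            self._stale[trial_id] = 0
            return False
        self._stale[trial_id] = self._stale.get(trial_id, 0) + 1
        return self._stale[trial_id] >= self._patience

    def stop_all(self):
        # Only stop individual trials, never the whole experiment.
        return False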

Fixes #3270.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen
Contributor Author

Hi @garylvov, here is the PR as agreed.

While re-testing, I noticed that it sometimes works and sometimes doesn't. I think the issue is not the mechanism inside the PR, but rather that Isaac Sim does not always respond to the termination signal in time. When this happens, the next training does not start within process_response_timeout, which halts the whole Ray main process.

One idea could be to check whether the subprocess exits within a threshold and kill it if it doesn't, but I do not know how to do this with stoppers, since Ray normally handles this. Alternatively, would it be better to add a mechanism to execute_job such that, if there is a halted Isaac Sim subprocess, we kill it before starting a new subprocess no matter what?
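
For illustration only, a minimal sketch of that second idea (hypothetical helper, not part of this PR): keep a handle on the previously launched subprocess and force-kill it if it is still alive before launching the next one.

import subprocess
from typing import Optional

_previous_proc: Optional[subprocess.Popen] = None

def launch_next_job(cmd: list) -> subprocess.Popen:
    """Kill any leftover Isaac Sim subprocess before starting the next training (sketch)."""
    global _previous_proc
    if _previous_proc is not None and _previous_proc.poll() is None:
        # The previous training is still running (or halted): force-kill it first.
        _previous_proc.kill()
        _previous_proc.wait()
    _previous_proc = subprocess.Popen(cmd)
    return _previous_proc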

@ozhanozen
Contributor Author

@garylvov, a small update:

I was wrong to say it sometimes works: by coincidence, the processes were ending shortly after the early-stop signal anyway, which is why I thought it sometimes worked.

I have debugged it further and can confirm that even after a "trial" is marked as completed, the subprocess/training continues to the end. The following trial might fail, e.g., due to a lack of GPU memory.

@garylvov
Collaborator

garylvov commented Aug 28, 2025

Hi thanks for your further investigation.

Alternatively, would it be better to add a mechanism to execute_job such that if there is a halted subprocess for Isaac Sim, we kill it before starting a new subprocess no matter what?

I think this could work, but I would be a little worried about it being too "kill happy", and erroneously shutting down processes that were experiencing an ephemeral stalling period. Perhaps we can just wait for a few moments, and if it's still halted, then kill it.

However, I think it may be non-optimal design to have a new ray process try to do cleanup of other processes before starting, as opposed to a ray process doing clean-up on its own process after it finishes.

I have debugged it further and can confirm that even after a "trial" is marked as completed, the subprocess/training continues to the end. The following trial might fail, e.g., due to a lack of GPU memory.

I would assume that Ray could do this well enough out of the box to stop the rogue processes, but I guess that's wishful thinking ;)

I will do some testing of this too. I think you may be onto something with some mechanism related to the Ray stopper. Maybe we can override some sort of cleanup method to aggressively SIGKILL the PID recovered by execute_job
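
As a hedged illustration of that idea (hypothetical helper, not this PR's code), such a cleanup method could SIGKILL the PID recorded at launch time and ignore processes that have already exited:

import os
import signal

def force_kill(pid: int) -> None:
    """Aggressively SIGKILL a recorded PID during trial cleanup (sketch)."""
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process is already gone, nothing to clean up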

@ozhanozen
Contributor Author

I think this could work, but I would be a little worried about it being too "kill happy", and erroneously shutting down processes that were experiencing an ephemeral stalling period. Perhaps we can just wait for a few moments, and if it's still halted, then kill it.

However, I think it may be non-optimal design to have a new ray process try to do cleanup of other processes before starting, as opposed to a ray process doing clean-up on its own process after it finishes.

I agree. Nevertheless, I still wanted to try killing it the following way:

self.proc.terminate()
try:
    self.proc.wait(timeout=20)
except subprocess.TimeoutExpired:
    self.proc.kill()
    self.proc.wait()

but it did not work. Did you manage to get a lead on what might work in a robust way?

@garylvov
Collaborator

garylvov commented Sep 9, 2025

Hi @ozhanozen I have yet another workaround that may just work ;)

I think we could add

parser.add_argument("--ray-proc-id", "-rid", type=int, default=None,
                    help="Automatically configured by Ray integration, otherwise None.")

to the training scripts (RSL RL, RL Games, SKRL, etc.).

Then, for each job submitted in util, we can assign an extra ID. I haven't implemented it, but I've tested it manually with the following:

./IsaacLab/isaaclab.sh -p IsaacLab/scripts/reinforcement_learning/rl_games/train.py --task Isaac-Cartpole-v0 --headless -rid 31415

Then, the following has proven to be effective at shutting down all processes.

pkill -9 -f "rid 31415" && sleep 5

I think this will be effective since, for example, a mismatched id like the one below doesn't kill the process:

pkill -9 -f "rid 31414"  # does nothing, proves that kill commands wouldn't affect unrelated running training

Now we just need to add some sort of exit hook to the trials to kill the assigned ray proc id
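
To sketch how the pieces could fit together (hypothetical helper names, untested; the short -rid flag is used so the pkill pattern above still matches):

import subprocess
import time
import uuid

def make_ray_proc_id() -> int:
    # Any unique-enough integer works; here it is derived from a random UUID.
    return uuid.uuid4().int % 10**8

def tag_command(train_cmd: str, rid: int) -> str:
    # Tag the training command so every spawned process carries the id on its command line.
    return f"{train_cmd} -rid {rid}"

def kill_by_rid(rid: int) -> None:
    # Exit hook: force-kill every process whose command line contains the id.
    # pkill returns non-zero when nothing matches, so check=True is deliberately not used.
    subprocess.run(["pkill", "-9", "-f", f"rid {rid}"])
    time.sleep(5)  # give the OS a moment to reap processes and free GPU memory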

@garylvov
Collaborator

garylvov commented Sep 9, 2025

We could maybe use https://docs.ray.io/en/latest/tune/api/callbacks.html to have each trial clean itself up
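
A hedged sketch of that callback idea (hypothetical; it assumes a kill_by_rid helper like the one above and that each trial's config carries the assigned id under a ray_proc_id key):

from ray import tune

class KillLeftoverProcs(tune.Callback):
    """Force-kill any processes tagged with a trial's ray proc id once the trial ends (sketch)."""

    def on_trial_complete(self, iteration, trials, trial, **info):
        rid = trial.config.get("ray_proc_id")
        if rid is not None:
            kill_by_rid(rid)  # assumed helper, see the sketch above

    def on_trial_error(self, iteration, trials, trial, **info):
        # Clean up after failed trials too, not only completed ones.
        self.on_trial_complete(iteration, trials, trial, **info)

The callback would then be registered through the callbacks argument of the Tune run configuration.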

@garylvov
Collaborator

garylvov commented Sep 9, 2025

Actually, now that I've played with this more, I think it's important that we SIGKILL instead of SIGINT

Maybe

os.kill(self.proc.pid, signal.SIGKILL)

would work without the extra rid stuff

@garylvov
Collaborator

garylvov commented Sep 10, 2025

I looked into this even more, and I believe the RID is helpful as there are so many related processes for a single Isaac training session:

[screenshot: process list showing the many related processes of a single Isaac training session]

I believe I have a minimal version that kind of works; pushing progress now.

@garylvov
Collaborator

garylvov commented Sep 10, 2025

I think it is better at killing old jobs, but it keeps on running into log extraction errors after terminating processes. Please let me know if you can figure out why!

@github-actions bot added the isaac-lab label on Sep 16, 2025
@ozhanozen
Contributor Author

Hi @garylvov, I just had the opportunity to check this; sorry for the delay.

I had encountered two different issues (one regarding rl_games at #3531, one about Ray initialization at #3533), which might be causing the log extraction errors you faced. If I combine the fixes from both of these PRs with the current PR, I can successfully do early stopping without encountering any further errors. Could you also try again with these?

Note: you had also used self.proc.terminate(9), which I believe is an invalid call (terminate() takes no arguments), so I reverted it. Nevertheless, the self.proc.kill() that follows self.proc.terminate() should already send the kill signal you wanted.

@garylvov
Collaborator

Awesome, thank you so much for figuring it out!

Sounds good about the terminate issue, thank you for spotting that.

I'm excited to try this soon!

@garylvov garylvov marked this pull request as ready for review September 23, 2025 18:20
@garylvov garylvov requested a review from ooctipus as a code owner September 23, 2025 18:20
Collaborator

@garylvov garylvov left a comment


Hi @kellyguo11,

This LGTM, just #3531 and #3533 need to be merged first


Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Proposal] Early stopping while doing hyperparameter tuning with Ray integration

3 participants