
Conversation

ozhanozen
Contributor

Description

When running Ray directly from tuner.py, Ray is not correctly initialized within invoke_tuning_run(). The two problems this causes are discussed in #3532. To solve them, this PR:

  1. Removes ray_init() from util.get_gpu_node_resources(). Ray now needs to be initialized before calling util.get_gpu_node_resources(). This reverses #3350 ("Fixes the missing Ray initialization"), which was merged to add the missing initialization when using tuner.py, but it is safer to explicitly initialize Ray with the correct arguments outside of util.get_gpu_node_resources().
  2. Moves the Ray initialization within invoke_tuning_run() so that it happens before util.get_gpu_node_resources(), ensuring Ray is explicitly initialized and no exception is raised later.
  3. Adds a warning when ray_init() is called and Ray was already initialized (see the sketch below).
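
For reference, here is a minimal sketch of the intended flow after these changes. The function bodies and argument names are simplified stand-ins rather than the actual Isaac Lab implementation; ray.init(), ray.is_initialized(), and ray.nodes() are the only real Ray APIs used.

```python
import logging

import ray


def ray_init(ray_address: str = "auto") -> None:
    """Initialize Ray, warning instead of re-initializing (Change 3)."""
    if ray.is_initialized():
        logging.warning("Ray is already initialized; skipping ray.init().")
        return
    ray.init(address=ray_address)


def get_gpu_node_resources() -> list:
    """Query per-node resources. This no longer calls ray_init() itself
    (Change 1), so Ray must be initialized before it is called."""
    if not ray.is_initialized():
        raise RuntimeError("Call ray_init() before get_gpu_node_resources().")
    return [node["Resources"] for node in ray.nodes() if node["Alive"]]


def invoke_tuning_run(ray_address: str = "auto") -> None:
    # Change 2: initialize Ray explicitly *before* querying node resources,
    # so no exception is raised later in the tuning run.
    ray_init(ray_address)
    resources = get_gpu_node_resources()
    print(resources)
```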

Fixes #3532

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Change 1:
Screenshot 2025-09-23 at 16 52 55

Change 2:
Screenshot 2025-09-23 at 16 52 33

Change 3:
Screenshot 2025-09-23 at 16 55 21

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen ozhanozen requested a review from ooctipus as a code owner September 23, 2025 14:59
@github-actions github-actions bot added the labels bug (Something isn't working) and isaac-lab (Related to Isaac Lab team) on Sep 23, 2025
@garylvov
Collaborator

LGTM thanks for catching this!

@PierrePeng

@ozhanozen Hi, I have tried this PR. I just ran the command:
./isaaclab.sh -p scripts/reinforcement_learning/ray/tuner.py --cfg_file scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py --cfg_class CartpoleTheiaJobCfg --run_mode local --workflow scripts/reinforcement_learning/rl_games/train.py --num_workers_per_node 1

The output log is shown below.

(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not find experiment logs within 200.0 seconds.
(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not extract experiment_name/logdir from trainer output (experiment_name=None, logdir=None).
(IsaacLabTuneTrainable pid=1414) Make sure your training script prints the following correctly:
(IsaacLabTuneTrainable pid=1414) Exact experiment name requested from command line:
(IsaacLabTuneTrainable pid=1414) [INFO] Logging experiment in directory:
(IsaacLabTuneTrainable pid=1414)
(IsaacLabTuneTrainable pid=1414)

@garylvov
Collaborator

Hi @PierrePeng I believe you also need the fix from #3531

@garylvov
Collaborator

I'd also check out #3276, which depends on this PR and the other one I linked above.

@PierrePeng

PierrePeng commented Oct 12, 2025

Hi @PierrePeng I believe you also need the fix from #3531

Thanks @garylvov! I have applied the patch fixes from #3531 and #3533.

It still didn't work, and I get the same error as before. It seems that the process hasn't invoked the scripts/reinforcement_learning/rl_games/train.py script at all.

@ozhanozen
Contributor Author

Hi @PierrePeng, could you add log_all_output = True as an argument to the

experiment = util.execute_job(

call and track what goes wrong? Assuming you already have #3531, you should be able to see some output that gives clues about the failure. If the train.py script is not executed at all, the problem is not directly linked to this PR, and it is better to create a new issue for it.
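
Something along these lines (illustrative only: the existing arguments of util.execute_job stay unchanged, and log_all_output=True is the only addition):

```python
experiment = util.execute_job(
    ...,  # existing arguments, unchanged
    log_all_output=True,  # stream the trainer's full output for debugging
)
```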

@PierrePeng

Hi @ozhanozen. Here is the log, based on #3531 and #3533, with log_all_output = True added.

log.txt

@garylvov
Collaborator

Thank you for the log @PierrePeng (and for suggesting this functionality @ozhanozen).

It looks like each training run started as needed, but the exact experiment name couldn't be extracted. I think we need to add the

"Exact experiment name requested from command line: "

line to RL Games. Previously, this was handled implicitly for RL Games (and explicitly for the rest). However, I don't see this experiment name being printed, so we need to add it explicitly.
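
For illustration, these are the kinds of marker lines the tuner's log parser searches for, printed explicitly from the workflow script; the variable names here are hypothetical and depend on how train.py resolves them:

```python
# Hypothetical addition to scripts/reinforcement_learning/rl_games/train.py:
# print the two marker lines that the tuner's log parser looks for.
print(f"Exact experiment name requested from command line: {experiment_name}")
print(f"[INFO] Logging experiment in directory: {log_dir}")
```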

@garylvov
Collaborator

Hi @PierrePeng, I believe commit fe6d188 in #3531 should resolve this issue; please let me know if it persists.

