
Conversation

@ozhanozen
Contributor

Description

When running Ray directly from tuner.py, Ray is not correctly initialized within invoke_tuning_run(). The two problems associated with this are discussed in #3532. To solve them, this PR:

  1. Removes ray_init() from util.get_gpu_node_resources(). Ray now needs to be initialized before calling util.get_gpu_node_resources(). This reverses #3350 (Fixes the missing Ray initialization), which was merged to add the missing initialization when using tuner.py; however, it is safer to explicitly initialize Ray with the correct arguments outside of util.get_gpu_node_resources().
  2. Moves the Ray initialization within invoke_tuning_run() so that it happens before util.get_gpu_node_resources(), ensuring Ray is explicitly initialized up front and no exception is raised later (a minimal sketch of the resulting call order follows this list).
  3. Adds a warning when calling ray_init() if Ray was already initialized.
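
For illustration, a minimal sketch of the resulting call order (the function bodies below are simplified assumptions, not the exact Isaac Lab implementation):

    import ray

    def ray_init(ray_address: str | None = None, log_to_driver: bool = True) -> None:
        # Change 3: warn instead of silently re-initializing.
        if ray.is_initialized():
            print("[WARNING]: Ray was already initialized; skipping ray.init().")
            return
        print("[INFO]: Initializing Ray...")
        ray.init(address=ray_address, log_to_driver=log_to_driver)

    def get_gpu_node_resources():
        # Change 1: no implicit ray.init() here anymore; the caller must initialize Ray first.
        if not ray.is_initialized():
            raise RuntimeError("Ray must be initialized before calling get_gpu_node_resources().")
        return [node["Resources"] for node in ray.nodes() if node.get("Alive")]

    def invoke_tuning_run(ray_address: str | None = None) -> None:
        # Change 2: initialize Ray explicitly (with log_to_driver=True) before querying resources.
        ray_init(ray_address=ray_address, log_to_driver=True)
        resources = get_gpu_node_resources()
        print(f"[INFO]: Available GPU node resources: {resources}")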

Fixes #3532

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Change 1:
Screenshot 2025-09-23 at 16 52 55

Change 2:
Screenshot 2025-09-23 at 16 52 33

Change 3:
Screenshot 2025-09-23 at 16 55 21

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen ozhanozen requested a review from ooctipus as a code owner September 23, 2025 14:59
@github-actions github-actions bot added the bug (Something isn't working) and isaac-lab (Related to Isaac Lab team) labels Sep 23, 2025
@garylvov
Collaborator

LGTM thanks for catching this!

@PierrePeng

@ozhanozen Hi, I have tried this PR. I just ran the command:
./isaaclab.sh -p scripts/reinforcement_learning/ray/tuner.py --cfg_file scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py --cfg_class CartpoleTheiaJobCfg --run_mode local --workflow scripts/reinforcement_learning/rl_games/train.py --num_workers_per_node 1

The output log is shown below.

(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not find experiment logs within 200.0 seconds.
(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not extract experiment_name/logdir from trainer output (experiment_name=None, logdir=None).
(IsaacLabTuneTrainable pid=1414) Make sure your training script prints the following correctly:
(IsaacLabTuneTrainable pid=1414) Exact experiment name requested from command line:
(IsaacLabTuneTrainable pid=1414) [INFO] Logging experiment in directory:
(IsaacLabTuneTrainable pid=1414)
(IsaacLabTuneTrainable pid=1414)

@garylvov
Collaborator

Hi @PierrePeng I believe you also need the fix from #3531

@garylvov
Collaborator

I'd also check out #3276, which depends on this PR and the other one I linked above.

@PierrePeng

PierrePeng commented Oct 12, 2025

Hi @PierrePeng I believe you also need the fix from #3531

Thanks @garylvov! I have applied the patch fixes from #3531 and #3533.

It still didn't work, and I get the same error as before. It seems that the process hasn't invoked the scripts/reinforcement_learning/rl_games/train.py script.

@ozhanozen
Contributor Author

Hi @PierrePeng, could you add log_all_output = True as an argument to the

experiment = util.execute_job(

call and track what is wrong? Assuming you already have #3531, you should be able to see some output that gives clues about the problem. If the train.py script is not executed at all, this problem is not directly linked to this PR, and it would be better to create a new issue for it.
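
For example (just a sketch of where the flag goes; keep your existing arguments unchanged):

    experiment = util.execute_job(
        ...,                  # your existing arguments, unchanged
        log_all_output=True,  # echo all subprocess output to help debugging
    )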

@PierrePeng

Hi @ozhanozen. Here is the log, based on #3531 and #3533, with log_all_output = True added.

log.txt

@garylvov
Collaborator

Thank you for the log @PierrePeng (and for suggesting this functionality @ozhanozen)

It looks like each training run started as needed, but the exact experiment name couldn't be extracted. I think we need to add the

"Exact experiment name requested from command line: "

To RL Games. Previously, this was handled implicitly for RL Games (and explicitly for the rest). However, I don't see that this experiment name is being printed, so we need to add it explicitly.
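
For example, a rough sketch of the kind of print that would need to be added to the RL Games train.py entry point (the variable name here is just a placeholder for whatever holds the resolved experiment name):

    print(f"Exact experiment name requested from command line: {experiment_name}")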

@garylvov
Collaborator

Hi @PierrePeng, I believe commit fe6d188 in #3531 should resolve this issue; please let me know if it persists.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR fixes a critical bug in Ray initialization for distributed training workflows. Previously, util.get_gpu_node_resources() was implicitly initializing Ray with log_to_driver=False, which prevented subprocess output (including debug messages from training steps) from reaching the main process. The fix makes Ray initialization explicit by: (1) moving ray_init() with log_to_driver=True to occur before get_gpu_node_resources() in tuner.py, (2) removing the implicit initialization from get_gpu_node_resources() and instead raising a RuntimeError if Ray isn't already initialized, and (3) adding a warning when attempting to initialize Ray when it has already been initialized. This change reverses PR #3350, which had added the problematic implicit initialization, and ensures proper control over Ray configuration parameters during the tuning workflow setup.

Important Files Changed

Filename | Score | Overview
scripts/reinforcement_learning/ray/util.py | 5/5 | Removed implicit Ray initialization from get_gpu_node_resources(); now raises RuntimeError if Ray is not initialized; added warning for double-initialization attempts
scripts/reinforcement_learning/ray/tuner.py | 5/5 | Moved explicit Ray initialization with log_to_driver=True to occur before the get_gpu_node_resources() call in invoke_tuning_run()

Confidence score: 5/5

  • This PR is safe to merge with minimal risk as it fixes a well-documented bug without introducing breaking changes
  • Score reflects clear problem identification, straightforward solution, and explicit handling of initialization edge cases with appropriate error messages
  • No files require special attention; both changes work together to solve the subprocess output routing issue

Sequence Diagram

sequenceDiagram
    participant User
    participant tuner.py
    participant util.ray_init
    participant util.get_gpu_node_resources
    participant Ray

    User->>tuner.py: "Run tuner with --run_mode local/remote"
    tuner.py->>util.ray_init: "ray_init(ray_address, log_to_driver=True)"
    
    alt Ray not initialized
        util.ray_init->>Ray: "ray.init(address, runtime_env, log_to_driver)"
        Ray-->>util.ray_init: "Initialized"
        util.ray_init->>tuner.py: "[INFO] Initializing Ray..."
    else Ray already initialized
        util.ray_init->>tuner.py: "[WARNING] Ray already initialized!"
    end
    
    tuner.py->>util.get_gpu_node_resources: "get_gpu_node_resources()"
    
    alt Ray not initialized (old behavior - removed)
        util.get_gpu_node_resources->>Ray: "ray.init() [REMOVED]"
        Note over util.get_gpu_node_resources,Ray: This caused subprocess output issues
    end
    
    alt Ray initialized (new behavior)
        util.get_gpu_node_resources->>Ray: "ray.nodes()"
        Ray-->>util.get_gpu_node_resources: "Node resources"
        util.get_gpu_node_resources-->>tuner.py: "Available resources dict"
    else Ray not initialized (error case)
        util.get_gpu_node_resources-->>tuner.py: "RuntimeError: Ray must be initialized"
    end
    
    tuner.py->>tuner.py: "Configure OptunaSearch, Repeater, RunConfig"
    tuner.py->>tuner.py: "Create Tuner with IsaacLabTuneTrainable"
    tuner.py->>Ray: "tuner.fit()"
    Ray-->>User: "Training results (Tensorboard/MLFlow)"

2 files reviewed, no comments


@PierrePeng

@garylvov I have merged branches #3531, #3533, and #3276, and the issue still persists. Here are the log and my code.

When will these fixes be merged into the main branch? These issues have made Ray completely unusable and have existed for quite a long time.

reinforcement_learning.zip
log.txt

@garylvov
Collaborator

garylvov commented Oct 29, 2025

Hi @PierrePeng, your issue appears to be partially unrelated to the Ray functionality; I'd take a look at wandb/wandb#10662

As for the open GitHub PRs that fix the Ray functionality: I apologize, I've done everything I can on my end; I don't work at NVIDIA, and I don't have merge permissions on this project. I've reviewed all three open PRs, and they are ready to merge.

@garylvov
Collaborator

garylvov commented Oct 29, 2025

Hi @kellyguo11 @ooctipus @RandomOakForest. We have three PRs open that need to be merged to fix the Ray functionality; these PRs have been open for quite some time and are ready to merge.

This PR, as well as #3531 and #3533, will fix many problems with the Ray functionality. Other than adding an extra argument to the training scripts, these PRs don't affect functionality unrelated to Ray.

Can you please help me merge these? I don't have the permissions to, but I've already reviewed all three PRs.

@PierrePeng

Hi @garylvov, thanks for your quick reply and for pointing out the wandb issue.

I tried uninstalling wandb and starting the Ray job again, but the issue still persists. I'm not sure yet whether my problem is related to running the Ray job directly on my local machine instead of inside a Docker environment.

Again, I really appreciate your effort in reviewing these PRs and pushing them forward.

log.txt

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Fixed Ray initialization order in invoke_tuning_run() to ensure subprocess output is correctly directed to the driver. Previously, get_gpu_node_resources() implicitly initialized Ray with log_to_driver=False before the explicit initialization with log_to_driver=True could run, causing debug output from training processes to be hidden.

Key changes:

  • Moved Ray initialization in tuner.py:222-225 to occur before the get_gpu_node_resources() call
  • Removed implicit Ray initialization from util.get_gpu_node_resources(), replacing it with an explicit error if Ray is not initialized
  • Added a warning in util.ray_init() when attempting to initialize an already-initialized Ray instance

Impact:
This fix ensures that debug lines within the Trainable step() function and other subprocess output are visible during hyperparameter tuning runs.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-designed and address a clear bug. The fix properly reverses the problematic PR #3350 by making Ray initialization explicit and controllable. All call sites (tuner.py lines 228 and 164, wrap_resources.py line 84) already initialize Ray before calling get_gpu_node_resources(), so the breaking change is safe. The warning added for duplicate initialization attempts provides good debugging feedback.
  • No files require special attention

Important Files Changed

File Analysis

Filename | Score | Overview
scripts/reinforcement_learning/ray/tuner.py | 5/5 | Moved Ray initialization to the correct location with the proper log_to_driver=True parameter, ensuring subprocess output is visible
scripts/reinforcement_learning/ray/util.py | 5/5 | Replaced implicit Ray initialization with an explicit error; added a warning for duplicate initialization attempts

Sequence Diagram

sequenceDiagram
    participant User
    participant tuner.py
    participant util.ray_init
    participant util.get_gpu_node_resources
    participant Ray

    User->>tuner.py: invoke_tuning_run()
    
    Note over tuner.py: BEFORE PR (Broken)
    tuner.py->>util.get_gpu_node_resources: get_gpu_node_resources()
    util.get_gpu_node_resources->>Ray: ray.is_initialized()?
    Ray-->>util.get_gpu_node_resources: False
    util.get_gpu_node_resources->>Ray: ray_init(log_to_driver=False)
    Note over Ray: Initialized with log_to_driver=False<br/>(subprocess output hidden)
    util.get_gpu_node_resources-->>tuner.py: resources
    tuner.py->>Ray: ray.is_initialized()?
    Ray-->>tuner.py: True
    Note over tuner.py: Skips initialization block<br/>(never uses log_to_driver=True)
    
    Note over tuner.py,Ray: ===================================
    
    Note over tuner.py: AFTER PR (Fixed)
    tuner.py->>util.ray_init: ray_init(log_to_driver=True)
    util.ray_init->>Ray: ray.is_initialized()?
    Ray-->>util.ray_init: False
    util.ray_init->>Ray: ray.init(log_to_driver=True)
    Note over Ray: Initialized correctly<br/>(subprocess output visible)
    util.ray_init-->>tuner.py: success
    tuner.py->>util.get_gpu_node_resources: get_gpu_node_resources()
    util.get_gpu_node_resources->>Ray: ray.is_initialized()?
    Ray-->>util.get_gpu_node_resources: True
    util.get_gpu_node_resources-->>tuner.py: resources
    
    tuner.py->>tuner.py: Configure tuner
    tuner.py->>IsaacLabTuneTrainable: default_resource_request()
    IsaacLabTuneTrainable->>util.get_gpu_node_resources: get_gpu_node_resources(one_node_only=True)
    util.get_gpu_node_resources->>Ray: ray.is_initialized()?
    Ray-->>util.get_gpu_node_resources: True
    util.get_gpu_node_resources-->>IsaacLabTuneTrainable: single node resources

2 files reviewed, no comments


@garylvov
Collaborator

garylvov commented Oct 30, 2025

Hi @PierrePeng, I was unable to recreate your issue when running within the Docker environment with the three updated PRs. I used the commands in the documentation and had Claude Code help me cherry-pick the commits from the three open PRs to get it running. It seems that there is an issue stemming from how the dependencies are installed in your specific conda environment. I recommend running from within the Docker container, especially on a Linux machine, because that was the only setup I could test on when I developed the code.

Maybe you can try installing the same dependency versions in conda, but YMMV. Unfortunately, it's unrealistic for me and @ozhanozen to really maintain/troubleshoot the Ray functionality (which is partly some Linux bash things hacked together) in environments beyond what's shown in the documentation; hence the docs disclaimer about the functionality being experimental and Linux only.

I really appreciate your help in identifying some of the other issues! Let's try to keep one issue per thread, though; if this persists, even within Docker on a Linux machine, I will be happy to help troubleshoot if you open a new issue. Hopefully we will have these PRs merged soon to simplify this. Sorry, and thank you for your patience!

@PierrePeng

Hi @garylvov. After testing, I was able to get it running locally without using Docker. It seems that my previous method of applying the three open PRs was incorrect. After using cherry-pick like you did, everything works fine now. I’m sorry for the confusion, and thank you for your support.

@garylvov
Collaborator

Glad to hear that it works! 😊


Labels

bug (Something isn't working), isaac-lab (Related to Isaac Lab team)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug Report] Wrong ray initialization within invoke_tuning_run()

3 participants