
Conversation

@ozhanozen
Contributor

Description

When running Ray directly from tuner.py, Ray is not correctly initialized within invoke_tuning_run(). The two problems associated with this are discussed in #3532. To solve them, this PR:

  1. Removes ray_init() from util.get_gpu_node_resources(). Ray now needs to be initialized before calling util.get_gpu_node_resources(). This reverses #3350 (Fixes the missing Ray initialization), which was merged to add the missing initialization when using tuner.py; however, it is safer to explicitly initialize Ray with the correct arguments outside of util.get_gpu_node_resources().
  2. Moves the Ray initialization within invoke_tuning_run() so that it happens before util.get_gpu_node_resources(), ensuring Ray is explicitly initialized up front and no exception is raised later (a minimal sketch of the resulting call order follows this list).
  3. Adds a warning when calling ray_init() if Ray was already initialized.
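
For illustration, a minimal sketch of the resulting call order (the function bodies below are simplified assumptions, not the exact Isaac Lab implementation):

    import ray

    def ray_init(ray_address: str | None = None, log_to_driver: bool = True) -> None:
        # Change 3: warn instead of silently re-initializing.
        if ray.is_initialized():
            print("[WARNING]: Ray was already initialized; skipping ray.init().")
            return
        print("[INFO]: Initializing Ray...")
        ray.init(address=ray_address, log_to_driver=log_to_driver)

    def get_gpu_node_resources():
        # Change 1: no implicit ray.init() here anymore; the caller must initialize Ray first.
        if not ray.is_initialized():
            raise RuntimeError("Ray must be initialized before calling get_gpu_node_resources().")
        return [node["Resources"] for node in ray.nodes() if node.get("Alive")]

    def invoke_tuning_run(ray_address: str | None = None) -> None:
        # Change 2: initialize Ray explicitly (with log_to_driver=True) before querying resources.
        ray_init(ray_address=ray_address, log_to_driver=True)
        resources = get_gpu_node_resources()
        print(f"[INFO]: Available GPU node resources: {resources}")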

Fixes #3532

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Change 1:
Screenshot 2025-09-23 at 16 52 55

Change 2:
Screenshot 2025-09-23 at 16 52 33

Change 3:
Screenshot 2025-09-23 at 16 55 21

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen ozhanozen requested a review from ooctipus as a code owner September 23, 2025 14:59
@github-actions github-actions bot added the bug (Something isn't working) and isaac-lab (Related to Isaac Lab team) labels Sep 23, 2025
@garylvov
Collaborator

LGTM thanks for catching this!

@PierrePeng

@ozhanozen Hi, I have tried this PR. I just ran the command:
./isaaclab.sh -p scripts/reinforcement_learning/ray/tuner.py --cfg_file scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py --cfg_class CartpoleTheiaJobCfg --run_mode local --workflow scripts/reinforcement_learning/rl_games/train.py --num_workers_per_node 1

The output log is shown below.

(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not find experiment logs within 200.0 seconds.
(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not extract experiment_name/logdir from trainer output (experiment_name=None, logdir=None).
(IsaacLabTuneTrainable pid=1414) Make sure your training script prints the following correctly:
(IsaacLabTuneTrainable pid=1414) Exact experiment name requested from command line:
(IsaacLabTuneTrainable pid=1414) [INFO] Logging experiment in directory:
(IsaacLabTuneTrainable pid=1414)
(IsaacLabTuneTrainable pid=1414)

@garylvov
Collaborator

Hi @PierrePeng I believe you also need the fix from #3531

@garylvov
Collaborator

I'd also check out #3276, which depends on this PR and the other one I linked above.

@PierrePeng

PierrePeng commented Oct 12, 2025

Hi @PierrePeng I believe you also need the fix from #3531

Thanks @garylvov! I have applied the patch fixes from #3531 and #3533.

It still didn't work, and I get the same error as before. It seems that the process hasn't invoked the scripts/reinforcement_learning/rl_games/train.py script.

@ozhanozen
Contributor Author

Hi @PierrePeng, could you add log_all_output = True as an argument to the

experiment = util.execute_job(

call and track what is wrong? Assuming you already have #3531, you should be able to see some output that gives clues about the problem. If the train.py script is not executed at all, this problem is not directly linked to this PR, and it would be better to create a new issue for it.
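
For example (just a sketch of where the flag goes; keep your existing arguments unchanged):

    experiment = util.execute_job(
        ...,                  # your existing arguments, unchanged
        log_all_output=True,  # echo all subprocess output to help debugging
    )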

@PierrePeng

Hi @ozhanozen. Here is the log, based on #3531 and #3533, with log_all_output = True added.

log.txt

@garylvov
Collaborator

Thank you for the log @PierrePeng (and for suggesting this functionality @ozhanozen)

It looks like each training run started as needed, but the exact experiment name couldn't be extracted. I think we need to add the

"Exact experiment name requested from command line: "

To RL Games. Previously, this was handled implicitly for RL Games (and explicitly for the rest). However, I don't see that this experiment name is being printed, so we need to add it explicitly.
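
For example, a rough sketch of the kind of print that would need to be added to the RL Games train.py entry point (the variable name here is just a placeholder for whatever holds the resolved experiment name):

    print(f"Exact experiment name requested from command line: {experiment_name}")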

@garylvov
Collaborator

Hi @PierrePeng, I believe commit fe6d188 in #3531 should resolve this issue; please let me know if it persists.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR fixes a critical bug in Ray initialization for distributed training workflows. Previously, util.get_gpu_node_resources() was implicitly initializing Ray with log_to_driver=False, which prevented subprocess output (including debug messages from training steps) from reaching the main process. The fix makes Ray initialization explicit by: (1) moving ray_init() with log_to_driver=True to occur before get_gpu_node_resources() in tuner.py, (2) removing the implicit initialization from get_gpu_node_resources() and instead raising a RuntimeError if Ray isn't already initialized, and (3) adding a warning when attempting to initialize Ray when it has already been initialized. This change reverses PR #3350, which had added the problematic implicit initialization, and ensures proper control over Ray configuration parameters during the tuning workflow setup.

Important Files Changed

Filename | Score | Overview
scripts/reinforcement_learning/ray/util.py | 5/5 | Removed implicit Ray initialization from get_gpu_node_resources(); now raises RuntimeError if Ray is not initialized; added warning for double-initialization attempts
scripts/reinforcement_learning/ray/tuner.py | 5/5 | Moved explicit Ray initialization with log_to_driver=True to occur before the get_gpu_node_resources() call in invoke_tuning_run()

Confidence score: 5/5

  • This PR is safe to merge with minimal risk as it fixes a well-documented bug without introducing breaking changes
  • Score reflects clear problem identification, straightforward solution, and explicit handling of initialization edge cases with appropriate error messages
  • No files require special attention; both changes work together to solve the subprocess output routing issue

Sequence Diagram

sequenceDiagram
    participant User
    participant tuner.py
    participant util.ray_init
    participant util.get_gpu_node_resources
    participant Ray

    User->>tuner.py: "Run tuner with --run_mode local/remote"
    tuner.py->>util.ray_init: "ray_init(ray_address, log_to_driver=True)"
    
    alt Ray not initialized
        util.ray_init->>Ray: "ray.init(address, runtime_env, log_to_driver)"
        Ray-->>util.ray_init: "Initialized"
        util.ray_init->>tuner.py: "[INFO] Initializing Ray..."
    else Ray already initialized
        util.ray_init->>tuner.py: "[WARNING] Ray already initialized!"
    end
    
    tuner.py->>util.get_gpu_node_resources: "get_gpu_node_resources()"
    
    alt Ray not initialized (old behavior - removed)
        util.get_gpu_node_resources->>Ray: "ray.init() [REMOVED]"
        Note over util.get_gpu_node_resources,Ray: This caused subprocess output issues
    end
    
    alt Ray initialized (new behavior)
        util.get_gpu_node_resources->>Ray: "ray.nodes()"
        Ray-->>util.get_gpu_node_resources: "Node resources"
        util.get_gpu_node_resources-->>tuner.py: "Available resources dict"
    else Ray not initialized (error case)
        util.get_gpu_node_resources-->>tuner.py: "RuntimeError: Ray must be initialized"
    end
    
    tuner.py->>tuner.py: "Configure OptunaSearch, Repeater, RunConfig"
    tuner.py->>tuner.py: "Create Tuner with IsaacLabTuneTrainable"
    tuner.py->>Ray: "tuner.fit()"
    Ray-->>User: "Training results (Tensorboard/MLFlow)"

2 files reviewed, no comments


@PierrePeng

@garylvov I have merged branches #3531, #3533, and #3276, and the issue still persists. Here are the log and my code.

When will these fixes be merged into the main branch? These issues have made Ray completely unusable and have existed for quite a long time.

reinforcement_learning.zip
log.txt

@garylvov
Collaborator

garylvov commented Oct 29, 2025

Hi @PierrePeng, your issue appears to be partially unrelated to the Ray functionality; I'd take a look at wandb/wandb#10662

As for the open GitHub PRs that fix the Ray functionality: I apologize, I've done everything I can on my end; I don't work at NVIDIA, and I don't have merge permissions on this project. I've reviewed all three open PRs, and they are ready to merge.

@garylvov
Collaborator

garylvov commented Oct 29, 2025

Hi @kellyguo11 @ooctipus @RandomOakForest. We have three PRs open that need to be merged to fix the Ray functionality; these PRs have been open for quite some time and are ready to merge.

This PR, as well as #3531 and #3533, will fix many problems with the Ray functionality. Other than adding an extra argument to the training scripts, these PRs don't affect functionality unrelated to Ray.

Can you please help me merge these? I don't have the permissions to, but I've already reviewed all three PRs.

@PierrePeng

Hi @garylvov, thanks for your quick reply and for pointing out the wandb issue.

I tried uninstalling wandb and starting the Ray job again, but the issue still persists. I'm not sure yet whether my problem is related to running the Ray job directly on my local machine instead of inside a Docker environment.

Again, I really appreciate your effort in reviewing these PRs and pushing them forward.

log.txt

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Fixed Ray initialization order in invoke_tuning_run() to ensure subprocess output is correctly directed to the driver. Previously, get_gpu_node_resources() implicitly initialized Ray with log_to_driver=False before the explicit initialization with log_to_driver=True could run, causing debug output from training processes to be hidden.

Key changes:

  • Moved Ray initialization in tuner.py:222-225 to occur before the get_gpu_node_resources() call
  • Removed implicit Ray initialization from util.get_gpu_node_resources(), replacing it with an explicit error if Ray is not initialized
  • Added a warning in util.ray_init() when attempting to initialize an already-initialized Ray instance

Impact:
This fix ensures that debug lines within the Trainable step() function and other subprocess output are visible during hyperparameter tuning runs.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-designed and address a clear bug. The fix properly reverses the problematic PR #3350 by making Ray initialization explicit and controllable. All call sites (tuner.py lines 228 and 164, wrap_resources.py line 84) already initialize Ray before calling get_gpu_node_resources(), so the breaking change is safe. The warning added for duplicate initialization attempts provides good debugging feedback.
  • No files require special attention

Important Files Changed

File Analysis

Filename | Score | Overview
scripts/reinforcement_learning/ray/tuner.py | 5/5 | Moved Ray initialization to the correct location with the proper log_to_driver=True parameter, ensuring subprocess output is visible
scripts/reinforcement_learning/ray/util.py | 5/5 | Replaced implicit Ray initialization with an explicit error; added a warning for duplicate initialization attempts

Sequence Diagram

sequenceDiagram
    participant User
    participant tuner.py
    participant util.ray_init
    participant util.get_gpu_node_resources
    participant Ray

    User->>tuner.py: invoke_tuning_run()
    
    Note over tuner.py: BEFORE PR (Broken)
    tuner.py->>util.get_gpu_node_resources: get_gpu_node_resources()
    util.get_gpu_node_resources->>Ray: ray.is_initialized()?
    Ray-->>util.get_gpu_node_resources: False
    util.get_gpu_node_resources->>Ray: ray_init(log_to_driver=False)
    Note over Ray: Initialized with log_to_driver=False<br/>(subprocess output hidden)
    util.get_gpu_node_resources-->>tuner.py: resources
    tuner.py->>Ray: ray.is_initialized()?
    Ray-->>tuner.py: True
    Note over tuner.py: Skips initialization block<br/>(never uses log_to_driver=True)
    
    Note over tuner.py,Ray: ===================================
    
    Note over tuner.py: AFTER PR (Fixed)
    tuner.py->>util.ray_init: ray_init(log_to_driver=True)
    util.ray_init->>Ray: ray.is_initialized()?
    Ray-->>util.ray_init: False
    util.ray_init->>Ray: ray.init(log_to_driver=True)
    Note over Ray: Initialized correctly<br/>(subprocess output visible)
    util.ray_init-->>tuner.py: success
    tuner.py->>util.get_gpu_node_resources: get_gpu_node_resources()
    util.get_gpu_node_resources->>Ray: ray.is_initialized()?
    Ray-->>util.get_gpu_node_resources: True
    util.get_gpu_node_resources-->>tuner.py: resources
    
    tuner.py->>tuner.py: Configure tuner
    tuner.py->>IsaacLabTuneTrainable: default_resource_request()
    IsaacLabTuneTrainable->>util.get_gpu_node_resources: get_gpu_node_resources(one_node_only=True)
    util.get_gpu_node_resources->>Ray: ray.is_initialized()?
    Ray-->>util.get_gpu_node_resources: True
    util.get_gpu_node_resources-->>IsaacLabTuneTrainable: single node resources

2 files reviewed, no comments


@garylvov
Collaborator

garylvov commented Oct 30, 2025

Hi @PierrePeng, I was unable to recreate your issue when running within the Docker environment with the three updated PRs. I used the commands in the documentation and had Claude Code help me cherry-pick the commits from the three open PRs to get it running. It seems that there is an issue stemming from how the dependencies are installed in your specific conda environment. I recommend running from within the Docker container, especially on a Linux machine, because that was the only setup I could test on when I developed the code.

Maybe you can try installing the same dependency versions in conda, but YMMV. Unfortunately, it's unrealistic for me and @ozhanozen to really maintain/troubleshoot the Ray functionality (which is partly some Linux bash things hacked together) in environments beyond what's shown in the documentation; hence the docs disclaimer about the functionality being experimental and Linux only.

I really appreciate your help in identifying some of the other issues! Let's try to keep one issue per thread, though; if this persists, even within Docker on a Linux machine, I will be happy to help troubleshoot if you open a new issue. Hopefully we will have these PRs merged soon to simplify this. Sorry, and thank you for your patience!

@PierrePeng

Hi @garylvov. After testing, I was able to get it running locally without using Docker. It seems that my previous method of applying the three open PRs was incorrect. After using cherry-pick like you did, everything works fine now. I’m sorry for the confusion, and thank you for your support.

@garylvov
Collaborator

Glad to hear that it works! 😊


Labels

bug (Something isn't working), isaac-lab (Related to Isaac Lab team)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug Report] Wrong ray initialization within invoke_tuning_run()

3 participants