Fixes ray initialization to correctly direct subprocess output #3533
Conversation
|
LGTM thanks for catching this! |
|
@ozhanozen Hi, I have tried this PR and ran the command, but the output log shows the following: (IsaacLabTuneTrainable pid=1414) [ERROR]: Could not find experiment logs within 200.0 seconds. |
|
Hi @PierrePeng I believe you also need the fix from #3531 |
|
I'd also check out #3276 which depends on this PR and the other one I linked above |
Thanks @garylvov! I have applied the patch fixes from #3531 and #3533. It still doesn't work, and I get the same error as before. It seems that the process hasn't invoked the scripts/reinforcement_learning/rl_games/train.py script. |
|
Hi @PierrePeng, could you add some debug output (see the sketch below) and track what is wrong? Assuming you already have #3531, you should be able to see some output that might give clues about the problem. If the train.py script is not executed at all, this problem is not directly linked to this PR, and it would be better to create a new issue for it.
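For illustration only, here is a minimal, hypothetical sketch of the kind of debug print one could drop into the trainable's step() method; the class name is taken from the error log above, but the printed message and the rest of the body are assumptions, not the snippet originally suggested:

```python
from ray import tune


class IsaacLabTuneTrainable(tune.Trainable):  # class name taken from the error log above
    def step(self):
        # Hypothetical debug print: with log_to_driver=True (this PR), anything
        # printed here becomes visible in the driver/console output.
        print(f"[DEBUG][step] trial {self.trial_id}: still waiting for experiment logs...")
        return {"done": False}
```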
|
|
Hi @ozhanozen. Here is the log, which is based on #3531 and #3533, with the suggested debug output added. |
|
Thank you for the log @PierrePeng (and for suggesting this functionality @ozhanozen). It looks like each training run started as needed, but the exact experiment name couldn't be extracted. I think we need to add the "Exact experiment name requested from command line: " print to RL Games. Previously, this was handled implicitly for RL Games (and explicitly for the rest). However, I don't see that this experiment name is being printed, so we need to add it explicitly (a sketch follows below). |
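For illustration, a minimal sketch of the kind of marker line that could be printed from the RL Games train.py so the tuner can extract the run name; the log_dir variable and its value are assumptions about that script, only the marker string is taken from the comment above:

```python
# Hypothetical stand-in for the resolved experiment/run directory name in
# scripts/reinforcement_learning/rl_games/train.py.
log_dir = "rl_games_cartpole_2025-01-01_00-00-00"

# Marker line the tuner looks for when extracting the exact experiment name.
print(f"Exact experiment name requested from command line: {log_dir}")
```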
|
Hi @PierrePeng I believe commit fe6d188 in #3531 should resolve this issue, please let me know if it persists |
Greptile Overview
Greptile Summary
This PR fixes a critical bug in Ray initialization for distributed training workflows. Previously, util.get_gpu_node_resources() was implicitly initializing Ray with log_to_driver=False, which prevented subprocess output (including debug messages from training steps) from reaching the main process. The fix makes Ray initialization explicit by: (1) moving ray_init() with log_to_driver=True so it occurs before get_gpu_node_resources() in tuner.py, (2) removing the implicit initialization from get_gpu_node_resources() and instead raising a RuntimeError if Ray isn't already initialized, and (3) adding a warning when attempting to initialize Ray when it has already been initialized. This change reverses PR #3350, which had added the problematic implicit initialization, and ensures proper control over Ray configuration parameters during the tuning-workflow setup.
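As a rough sketch of the call order this summary describes (the function and argument names follow the sequence diagram below; the surrounding signature and print are assumptions, not the exact code in the PR):

```python
import ray  # noqa: F401  (Ray is initialized indirectly via util.ray_init)

import util  # scripts/reinforcement_learning/ray/util.py


def invoke_tuning_run(cfg: dict, ray_address: str = "auto") -> None:
    # 1) Explicit initialization first, with subprocess logs routed to the driver.
    util.ray_init(ray_address=ray_address, log_to_driver=True)

    # 2) Only then query cluster resources; the helper no longer calls ray.init()
    #    itself and instead raises a RuntimeError if Ray is not initialized.
    resources = util.get_gpu_node_resources()
    print(f"[INFO]: Available resources: {resources}")
```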
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/ray/util.py | 5/5 | Removed implicit Ray initialization from get_gpu_node_resources(), now raises RuntimeError if Ray not initialized; added warning for double-initialization attempts |
| scripts/reinforcement_learning/ray/tuner.py | 5/5 | Moved explicit Ray initialization with log_to_driver=True to occur before get_gpu_node_resources() call in invoke_tuning_run() |
Confidence score: 5/5
- This PR is safe to merge with minimal risk as it fixes a well-documented bug without introducing breaking changes
- Score reflects clear problem identification, straightforward solution, and explicit handling of initialization edge cases with appropriate error messages
- No files require special attention; both changes work together to solve the subprocess output routing issue
Sequence Diagram
sequenceDiagram
participant User
participant tuner.py
participant util.ray_init
participant util.get_gpu_node_resources
participant Ray
User->>tuner.py: "Run tuner with --run_mode local/remote"
tuner.py->>util.ray_init: "ray_init(ray_address, log_to_driver=True)"
alt Ray not initialized
util.ray_init->>Ray: "ray.init(address, runtime_env, log_to_driver)"
Ray-->>util.ray_init: "Initialized"
util.ray_init->>tuner.py: "[INFO] Initializing Ray..."
else Ray already initialized
util.ray_init->>tuner.py: "[WARNING] Ray already initialized!"
end
tuner.py->>util.get_gpu_node_resources: "get_gpu_node_resources()"
alt Ray not initialized (old behavior - removed)
util.get_gpu_node_resources->>Ray: "ray.init() [REMOVED]"
Note over util.get_gpu_node_resources,Ray: This caused subprocess output issues
end
alt Ray initialized (new behavior)
util.get_gpu_node_resources->>Ray: "ray.nodes()"
Ray-->>util.get_gpu_node_resources: "Node resources"
util.get_gpu_node_resources-->>tuner.py: "Available resources dict"
else Ray not initialized (error case)
util.get_gpu_node_resources-->>tuner.py: "RuntimeError: Ray must be initialized"
end
tuner.py->>tuner.py: "Configure OptunaSearch, Repeater, RunConfig"
tuner.py->>tuner.py: "Create Tuner with IsaacLabTuneTrainable"
tuner.py->>Ray: "tuner.fit()"
Ray-->>User: "Training results (Tensorboard/MLFlow)"
2 files reviewed, no comments
|
@garylvov I have merged the branches from #3531, #3533, and #3276, and the issue still persists. Here are the log and my code. When will these fixes be merged into the main branch? These issues have made Ray completely unusable and have existed for quite a long time. |
|
Hi @PierrePeng, your issue appears partially unrelated to the Ray functionality; I'd take a look at wandb/wandb#10662. As for the open GitHub PRs that fix the Ray functionality, I apologize; I've done everything I can on my end. I don't work at NVIDIA, and I don't have merge permissions on this project. I've reviewed all three open PRs, and they are ready to merge. |
|
Hi @kellyguo11 @ooctipus @RandomOakForest. We have three open PRs that need to be merged to fix the Ray functionality; they have been open for quite some time and are ready to merge. This PR, as well as #3531 and #3533, will fix many problems with the Ray functionality. Other than adding an extra argument to the training scripts, these PRs don't affect functionality unrelated to Ray. Can you please help me merge these? I don't have the permissions to, but I've already reviewed all three PRs. |
|
Hi @garylvov, thanks for your quick reply and for pointing out the wandb issue. I tried uninstalling wandb and starting the Ray job again, but the issue still persists. I'm not sure yet whether my problem is related to running the Ray job directly on my local machine instead of inside a Docker environment. Again, I really appreciate your effort in reviewing these PRs and pushing them forward. |
Greptile Overview
Greptile Summary
Fixed Ray initialization order in invoke_tuning_run() to ensure subprocess output is correctly directed to the driver. Previously, get_gpu_node_resources() implicitly initialized Ray with log_to_driver=False before the explicit initialization with log_to_driver=True could run, causing debug output from training processes to be hidden.
Key changes:
- Moved Ray initialization in tuner.py:222-225 to occur before the get_gpu_node_resources() call
- Removed implicit Ray initialization from util.get_gpu_node_resources(), replacing it with an explicit error if Ray is not initialized
- Added a warning in util.ray_init() when attempting to initialize an already-initialized Ray instance
Impact:
This fix ensures that debug lines within the Trainable step() function and other subprocess output are visible during hyperparameter tuning runs.
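A minimal sketch of the guard described above, under the assumption that the helper keeps a signature like the one implied by the sequence diagram below (one_node_only flag, dict of resources); the aggregation logic here is illustrative, not the PR's exact code:

```python
import ray


def get_gpu_node_resources(one_node_only: bool = False) -> dict:
    """Sketch of the post-PR behavior: callers must initialize Ray themselves."""
    if not ray.is_initialized():
        raise RuntimeError(
            "Ray must be initialized (for example via util.ray_init) before "
            "querying node resources."
        )
    # Collect live nodes reported by the cluster, optionally just one of them.
    nodes = [node for node in ray.nodes() if node.get("Alive")]
    if one_node_only:
        nodes = nodes[:1]
    # Sum up the GPU/CPU/memory resources reported by the selected nodes.
    totals: dict[str, float] = {}
    for node in nodes:
        for key, value in node.get("Resources", {}).items():
            totals[key] = totals.get(key, 0.0) + value
    return totals
```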
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The changes are well-designed and address a clear bug. The fix properly reverses the problematic PR #3350 by making Ray initialization explicit and controllable. All call sites (tuner.py lines 228 and 164, wrap_resources.py line 84) already initialize Ray before calling get_gpu_node_resources(), so the breaking change is safe. The warning added for duplicate initialization attempts provides good debugging feedback.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/ray/tuner.py | 5/5 | Moved Ray initialization to correct location with proper log_to_driver=True parameter, ensuring subprocess output is visible |
| scripts/reinforcement_learning/ray/util.py | 5/5 | Replaced implicit Ray initialization with explicit error, added warning for duplicate initialization attempts |
Sequence Diagram
sequenceDiagram
participant User
participant tuner.py
participant util.ray_init
participant util.get_gpu_node_resources
participant Ray
User->>tuner.py: invoke_tuning_run()
Note over tuner.py: BEFORE PR (Broken)
tuner.py->>util.get_gpu_node_resources: get_gpu_node_resources()
util.get_gpu_node_resources->>Ray: ray.is_initialized()?
Ray-->>util.get_gpu_node_resources: False
util.get_gpu_node_resources->>Ray: ray_init(log_to_driver=False)
Note over Ray: Initialized with log_to_driver=False<br/>(subprocess output hidden)
util.get_gpu_node_resources-->>tuner.py: resources
tuner.py->>Ray: ray.is_initialized()?
Ray-->>tuner.py: True
Note over tuner.py: Skips initialization block<br/>(never uses log_to_driver=True)
Note over tuner.py,Ray: ===================================
Note over tuner.py: AFTER PR (Fixed)
tuner.py->>util.ray_init: ray_init(log_to_driver=True)
util.ray_init->>Ray: ray.is_initialized()?
Ray-->>util.ray_init: False
util.ray_init->>Ray: ray.init(log_to_driver=True)
Note over Ray: Initialized correctly<br/>(subprocess output visible)
util.ray_init-->>tuner.py: success
tuner.py->>util.get_gpu_node_resources: get_gpu_node_resources()
util.get_gpu_node_resources->>Ray: ray.is_initialized()?
Ray-->>util.get_gpu_node_resources: True
util.get_gpu_node_resources-->>tuner.py: resources
tuner.py->>tuner.py: Configure tuner
tuner.py->>IsaacLabTuneTrainable: default_resource_request()
IsaacLabTuneTrainable->>util.get_gpu_node_resources: get_gpu_node_resources(one_node_only=True)
util.get_gpu_node_resources->>Ray: ray.is_initialized()?
Ray-->>util.get_gpu_node_resources: True
util.get_gpu_node_resources-->>IsaacLabTuneTrainable: single node resources
2 files reviewed, no comments
|
Hi @PierrePeng, I was unable to recreate your issue when running within the Docker environment with the three updated PRs. I used the commands in the documentation, and I had Claude Code help me cherry-pick the commits from the three open PRs to get it running. It seems that there is an issue stemming from how the dependencies are installed in your specific conda environment. I recommend running from within the Docker container, especially on a Linux machine, because that was the only setup I could test on when I developed the code. You could try installing the same dependency versions in conda, but YMMV. Unfortunately, it's unrealistic for me and @ozhanozen to maintain and troubleshoot the Ray functionality (which is partly some Linux bash things hacked together) in environments beyond what's shown in the documentation; hence the docs disclaimer about the functionality being experimental and Linux-only. I really appreciate your help in identifying some of the other issues! Let's try to keep one issue per thread, though; if this persists, even within Docker on a Linux machine, I will be happy to help troubleshoot if you open a new issue. Hopefully we will have these PRs merged soon to simplify all of this. Sorry, and thank you for your patience! |
|
Hi @garylvov. After testing, I was able to get it running locally without using Docker. It seems that my previous method of applying the three open PRs was incorrect. After using cherry-pick like you did, everything works fine now. I’m sorry for the confusion, and thank you for your support. |
|
Glad to hear that it works! 😊 |
Description
When running Ray directly from tuner.py, Ray is not correctly initialized within invoke_tuning_run(). The two problems associated with this are discussed in #3532. To solve them, this PR:
- Removes ray_init() from util.get_gpu_node_resources(). Now, Ray needs to be initialized before calling util.get_gpu_node_resources(). This change actually reverses "Fixes the missing Ray initialization" #3350, which was merged to add the missing initialization when using tuner.py, but it is safer to explicitly initialize Ray with the correct arguments outside of util.get_gpu_node_resources().
- Moves the Ray initialization within invoke_tuning_run() to be before util.get_gpu_node_resources(), so we explicitly initialize it beforehand and do not raise an exception later.
- Adds a warning within ray_init() if Ray was already initialized (a sketch is shown after this list).

Fixes #3532
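For the last change, a minimal sketch of the duplicate-initialization warning, assuming parameter names such as ray_address and runtime_env (the exact signature and messages in util.py may differ):

```python
import ray


def ray_init(
    ray_address: str = "auto",
    runtime_env: dict | None = None,
    log_to_driver: bool = True,
) -> None:
    """Sketch: warn, instead of silently re-running setup, if Ray is already up."""
    if ray.is_initialized():
        print("[WARNING]: Ray is already initialized; the requested arguments are ignored.")
        return
    print("[INFO]: Initializing Ray...")
    ray.init(address=ray_address, runtime_env=runtime_env, log_to_driver=log_to_driver)
```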
Type of change
Screenshots
Change 1:

Change 2:

Change 3:

Checklist
- I have run the pre-commit checks with ./isaaclab.sh --format
- I have updated the changelog and the corresponding version in the extension's config/extension.toml file
- I have added my name to the CONTRIBUTORS.md or my name already exists there