
Conversation

ozhanozen
Contributor

Description

When running Ray directly from tuner.py, Ray is not correctly initialized within invoke_tuning_run(). The two problems this causes are discussed in #3532. To solve them, this PR:

  1. Removes ray_init() from util.get_gpu_node_resources(). Ray now needs to be initialized before calling util.get_gpu_node_resources(). This reverses #3350 ("Fixes the missing Ray initialization"), which was merged to add the missing initialization when using tuner.py, but it is safer to explicitly initialize Ray with the correct arguments outside of util.get_gpu_node_resources().
  2. Moves the Ray initialization within invoke_tuning_run() so that it happens before util.get_gpu_node_resources(), ensuring Ray is explicitly initialized and no exception is raised later.
  3. Adds a warning when ray_init() is called and Ray was already initialized (see the sketch below).
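
For reference, here is a minimal sketch of the intended flow after these changes. The function bodies and argument names are simplified stand-ins rather than the actual Isaac Lab implementation; ray.init(), ray.is_initialized(), and ray.nodes() are the only real Ray APIs used.

```python
import logging

import ray


def ray_init(ray_address: str = "auto") -> None:
    """Initialize Ray, warning instead of re-initializing (Change 3)."""
    if ray.is_initialized():
        logging.warning("Ray is already initialized; skipping ray.init().")
        return
    ray.init(address=ray_address)


def get_gpu_node_resources() -> list:
    """Query per-node resources. This no longer calls ray_init() itself
    (Change 1), so Ray must be initialized before it is called."""
    if not ray.is_initialized():
        raise RuntimeError("Call ray_init() before get_gpu_node_resources().")
    return [node["Resources"] for node in ray.nodes() if node["Alive"]]


def invoke_tuning_run(ray_address: str = "auto") -> None:
    # Change 2: initialize Ray explicitly *before* querying node resources,
    # so no exception is raised later in the tuning run.
    ray_init(ray_address)
    resources = get_gpu_node_resources()
    print(resources)
```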

Fixes #3532

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Change 1:
Screenshot 2025-09-23 at 16 52 55

Change 2:
Screenshot 2025-09-23 at 16 52 33

Change 3:
Screenshot 2025-09-23 at 16 55 21

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen ozhanozen requested a review from ooctipus as a code owner September 23, 2025 14:59
@github-actions github-actions bot added the labels bug (Something isn't working) and isaac-lab (Related to Isaac Lab team) on Sep 23, 2025
@garylvov
Collaborator

LGTM thanks for catching this!

@PierrePeng

@ozhanozen Hi, I have tried this PR. I just ran the command:
./isaaclab.sh -p scripts/reinforcement_learning/ray/tuner.py --cfg_file scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py --cfg_class CartpoleTheiaJobCfg --run_mode local --workflow scripts/reinforcement_learning/rl_games/train.py --num_workers_per_node 1

The output log is shown below.

(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not find experiment logs within 200.0 seconds.
(IsaacLabTuneTrainable pid=1414) [ERROR]: Could not extract experiment_name/logdir from trainer output (experiment_name=None, logdir=None).
(IsaacLabTuneTrainable pid=1414) Make sure your training script prints the following correctly:
(IsaacLabTuneTrainable pid=1414) Exact experiment name requested from command line:
(IsaacLabTuneTrainable pid=1414) [INFO] Logging experiment in directory:
(IsaacLabTuneTrainable pid=1414)
(IsaacLabTuneTrainable pid=1414)

@garylvov
Collaborator

Hi @PierrePeng I believe you also need the fix from #3531

@garylvov
Collaborator

I'd also check out #3276, which depends on this PR and the other one I linked above.

@PierrePeng

PierrePeng commented Oct 12, 2025

Hi @PierrePeng I believe you also need the fix from #3531

Thanks @garylvov! I have applied the patch fixes from #3531 and #3533.

It still didn't work, and I get the same error as before. It seems that the process hasn't invoked the scripts/reinforcement_learning/rl_games/train.py script at all.

@ozhanozen
Contributor Author

Hi @PierrePeng, could you add log_all_output = True as an argument to the

experiment = util.execute_job(

call and track what goes wrong? Assuming you already have #3531, you should be able to see some output that gives clues about the failure. If the train.py script is not executed at all, the problem is not directly linked to this PR, and it is better to create a new issue for it.
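
Something along these lines (illustrative only: the existing arguments of util.execute_job stay unchanged, and log_all_output=True is the only addition):

```python
experiment = util.execute_job(
    ...,  # existing arguments, unchanged
    log_all_output=True,  # stream the trainer's full output for debugging
)
```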

@PierrePeng

Hi @ozhanozen. Here is the log, based on #3531 and #3533, with log_all_output = True added.

log.txt

@garylvov
Collaborator

Thank you for the log @PierrePeng (and for suggesting this functionality @ozhanozen).

It looks like each training run started as needed, but the exact experiment name couldn't be extracted. I think we need to add the

"Exact experiment name requested from command line: "

line to RL Games. Previously, this was handled implicitly for RL Games (and explicitly for the rest). However, I don't see this experiment name being printed, so we need to add it explicitly.
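
For illustration, these are the kinds of marker lines the tuner's log parser searches for, printed explicitly from the workflow script; the variable names here are hypothetical and depend on how train.py resolves them:

```python
# Hypothetical addition to scripts/reinforcement_learning/rl_games/train.py:
# print the two marker lines that the tuner's log parser looks for.
print(f"Exact experiment name requested from command line: {experiment_name}")
print(f"[INFO] Logging experiment in directory: {log_dir}")
```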

@garylvov
Collaborator

Hi @PierrePeng, I believe commit fe6d188 in #3531 should resolve this issue; please let me know if it persists.

