Conversation


@Jackory Jackory commented Nov 30, 2023

Description

TRPO is a representative policy-gradient algorithm in reinforcement learning. Although it is rarely used in practice anymore, its ideas and mathematical principles are still worth studying. I haven't seen a single-file implementation of TRPO anywhere, so this PR adds one to help beginners understand the algorithm.
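For context, the trust-region problem that TRPO solves at each policy update (the standard formulation from the original paper) is:

\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{subject to} \quad \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta

In practice this is solved approximately with a conjugate-gradient step on a quadratic approximation of the KL constraint, followed by a backtracking line search.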

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for a performance-impacting change:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture_video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.


@vwxyzjn vwxyzjn (Owner) commented Dec 18, 2023

Hi, this is some cool stuff! Feel free to run some benchmarks with MuJoCo to see how it performs.
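For example, something along the lines of the CleanRL benchmark utility invocation below should work (the script name cleanrl/trpo_continuous_action.py is a placeholder for whatever the file ends up being called, and the environment list is just a suggestion):

python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v4 Walker2d-v4 Hopper-v4 \
    --command "poetry run python cleanrl/trpo_continuous_action.py --track" \
    --num-seeds 3 \
    --workers 1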


@sontungkieu sontungkieu left a comment


Issue: When running with num_envs > 1, this line

new_pg_loss = (advantages[mb_inds] * ratio).mean()

fails because advantages is still the unflattened per-environment rollout tensor (shape [num_steps, num_envs]), so advantages[mb_inds] does not line up with ratio, which is a flat 1-D tensor over the minibatch.

Proposed fix: Use the flattened b_advantages (shape [batch]) instead of advantages so both tensors align:

- new_pg_loss = (advantages[mb_inds] * ratio).mean()
+ mb_advantages = b_advantages[mb_inds]          # shape [batch]
+ new_pg_loss   = (mb_advantages * ratio).mean()

This ensures that mb_advantages and ratio are both 1-D tensors of length batch, resolving the error when num_envs > 1.
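For context, a minimal sketch of the flattening convention used by CleanRL-style rollouts (the shapes and the creation of b_advantages are assumptions based on how the rest of the file appears to be structured):

# rollout storage keeps advantages per environment, shape [num_steps, num_envs];
# flattening it once before the update loop gives the 1-D view used for minibatching
b_advantages = advantages.reshape(-1)        # shape [num_steps * num_envs] == [batch_size]

# inside the minibatch loop, indexing the flattened tensor keeps everything 1-D
mb_advantages = b_advantages[mb_inds]        # shape [minibatch_size], matches ratio
new_pg_loss = (mb_advantages * ratio).mean()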

_, newlogprob, entropy = actor.get_action(b_obs[mb_inds], b_actions[mb_inds])
logratio = newlogprob - b_logprobs[mb_inds]
ratio = logratio.exp()
new_pg_loss = (advantages[mb_inds] * ratio).mean()


Hello, I tried your code and it worked with the MuJoCo environments listed on Gymnasium when the number of environments is one. When I increased it, I got an error:

Traceback (most recent call last):
  File "/home/tung/practice-gymnasium/TRPO.py", line 405, in <module>
    new_pg_loss = (advantages[mb_inds] * ratio).mean()

Changing it to

new_pg_loss = (mb_advantages * ratio).mean()

solved the issue (switching to the flattened advantages so the dimensions line up 😊).
