Adding TRPO #435
Hi, this is some cool stuff! Feel free to run some benchmarks with MuJoCo to see how it performs.
Issue: When running with num_envs > 1, the line
new_pg_loss = (advantages[mb_inds] * ratio).mean()
fails because advantages[mb_inds] has shape [batch, action_dim] while ratio is [batch], causing a dimension mismatch.
Proposed fix: Use the flattened b_advantages (shape [batch]) instead of advantages so both tensors align:
- new_pg_loss = (advantages[mb_inds] * ratio).mean()
+ mb_advantages = b_advantages[mb_inds]  # shape [batch]
+ new_pg_loss = (mb_advantages * ratio).mean()
This ensures that mb_advantages and ratio are both 1-D tensors of length batch, resolving the error when num_envs > 1.
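A minimal sketch of the mismatch described above, for anyone who wants to reproduce it outside the training script. It uses NumPy rather than PyTorch (the broadcasting rules are the same for this case) and assumes the rollout buffer `advantages` has shape `[num_steps, num_envs]`, which is a guess about the script's layout and not something confirmed in this thread. The sizes, index values, and the `surrogate` helper are hypothetical.

```python
import numpy as np

num_steps, num_envs = 8, 4            # hypothetical rollout sizes

rng = np.random.default_rng(0)
advantages = rng.normal(size=(num_steps, num_envs))  # assumed buffer layout
b_advantages = advantages.reshape(-1)                # flattened to [num_steps * num_envs]

# Minibatch indices address the flattened batch. They are kept below
# num_steps here so that indexing the 2-D buffer does not raise IndexError.
mb_inds = np.array([0, 3, 5])
ratio = np.ones(len(mb_inds))         # stand-in for logratio.exp(), shape (3,)

wrong = advantages[mb_inds]           # selects whole rows: shape (3, 4)
right = b_advantages[mb_inds]         # selects scalars:    shape (3,)


def surrogate(adv, ratio):
    """Mean of advantage * ratio, as in the line under review."""
    return (adv * ratio).mean()


try:
    surrogate(wrong, ratio)           # (3, 4) * (3,) cannot broadcast
    mismatch = False
except ValueError:
    mismatch = True

loss = surrogate(right, ratio)        # (3,) * (3,) lines up element-wise
```

Indexing the flattened buffer keeps one advantage per sampled transition, which is what the element-wise product with `ratio` expects.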
_, newlogprob, entropy = actor.get_action(b_obs[mb_inds], b_actions[mb_inds])
logratio = newlogprob - b_logprobs[mb_inds]
ratio = logratio.exp()
new_pg_loss = (advantages[mb_inds] * ratio).mean()
Hello, I tried your code and it worked with the MuJoCo environments listed on Gymnasium when the number of environments is one. When I increased the number of environments, I got an error:
Traceback (most recent call last):
  File "/home/tung/practice-gymnasium/TRPO.py", line 405, in <module>
    new_pg_loss = (advantages[mb_inds] * ratio).mean()
Changing it to
new_pg_loss = (mb_advantages * ratio).mean()
solves the issue (switching to the flattened advantages so the dimensions line up 😊).
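One possible reason the single-environment run did not error, sketched with NumPy (PyTorch broadcasts the same way here): if the unflattened buffer has shape `[num_steps, 1]`, the product with a 1-D `ratio` broadcasts silently to a square matrix and changes the value of the mean. The buffer shape is an assumption about the script, and all numbers below are made up for illustration.

```python
import numpy as np

num_steps, num_envs = 8, 1            # hypothetical single-env rollout
advantages = np.arange(num_steps, dtype=float).reshape(num_steps, num_envs)
b_advantages = advantages.reshape(-1) # flattened to [num_steps]

mb_inds = np.array([1, 2, 4])
ratio = np.array([1.0, 2.0, 3.0])     # stand-in for logratio.exp()

# (3, 1) * (3,) broadcasts to (3, 3): every advantage meets every ratio.
unflattened = advantages[mb_inds] * ratio
# (3,) * (3,) stays (3,): each advantage meets only its own ratio.
flattened = b_advantages[mb_inds] * ratio

loss_unflattened = unflattened.mean() # equals mean(adv) * mean(ratio)
loss_flattened = flattened.mean()     # equals mean(adv * ratio)
```

Under that assumption, using the flattened advantages matters even when num_envs is one, because the two means generally differ.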
Description
TRPO is a representative policy gradient algorithm in reinforcement learning. Although it is rarely used in practice today, its ideas and mathematical principles are still worth studying. So far, I haven't seen a single-file implementation of TRPO, so this PR adds one to help beginners understand the algorithm.
Types of changes
Checklist:
pre-commit run --all-files passes (required).
mkdocs serve.
If you need to run benchmark experiments for a performance-impacting change:
--capture_video.
python -m openrlbenchmark.rlops.
python -m openrlbenchmark.rlops utility to the documentation.
python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.