This repository presents a comprehensive study of reinforcement learning (RL) algorithms applied to a custom MuJoCo Hopper environment. The project aims to build robust locomotion policies under uncertain dynamics using:
- Classic Policy Gradient Methods: REINFORCE, Actor-Critic
- Advanced On-Policy Algorithms: Proximal Policy Optimization (PPO)
- Robustness Techniques: Domain Randomization (UDR), Curriculum Learning (CDR), Entropy Scheduling (ES)
```
.
├── src/                      # Core code (Python package)
│   ├── agents/               # RL algorithm implementations
│   ├── env/                  # Custom MuJoCo-Hopper wrappers
│   ├── evaluation/           # Metrics, plotting, helper scripts
│   └── training/             # Training entry-points & configs
│
├── Logs/                     # Raw TensorBoard/CSV logs
│   ├── Learning_Curve/       # ⇢ learning-curve CSVs
│   ├── PPO_episode_rewards/  # ⇢ per-episode returns
│   ├── PPO_robustness/       # ⇢ domain-randomization runs
│   ├── PPO_runtime_tmp/      # ⇢ scratch & tmp logs
│   ├── actor_critic/         # ⇢ AC experiments
│   └── baseline/             # ⇢ REINFORCE baseline runs
│
├── models/
│   ├── PPO/
│   ├── actor_critic/
│   └── reinforce_baseline/
│
├── render/                   # Visual outputs (GIF/MP4/PNG)
│   └── plots/
│
├── requirements.txt          # Python dependencies
├── README.md                 # You are here 👋
├── __init__.py               # Makes repo import-able (`import rl_master`)
├── .idea/                    # IDE settings (⇢ add to .gitignore)
└── __pycache__/              # Byte-code cache (auto-generated)
```
The environment is based on a custom subclass of the MuJoCo Hopper (custom_hopper.py), extended with:
- Parameter Randomization: friction, damping, body mass, initial state
- Domain Randomization:
- Uniform DR (UDR): randomized every episode
- Curriculum DR (ES-CDR): difficulty scaled with agent performance and return entropy
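As a rough sketch of how Uniform DR resamples dynamics every episode, each parameter can be drawn uniformly from a fixed interval at reset time (the parameter names and ranges below are illustrative, not the values used in `custom_hopper.py`):

```python
import random

# Illustrative UDR ranges (not the actual values in custom_hopper.py)
UDR_RANGES = {
    "torso_mass": (2.5, 4.5),
    "friction":   (0.7, 1.3),
    "damping":    (0.8, 1.2),
}

def sample_udr_params(ranges=UDR_RANGES):
    """Draw one uniform sample per dynamics parameter; called once per episode."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
```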
| Algorithm | Description |
|---|---|
| REINFORCE | Monte Carlo policy gradient with optional baseline |
| Actor-Critic | TD-based policy/value method |
| PPO | Clipped surrogate objective with GAE (Stable-Baselines3) |
| UDR | Domain variation with uniform sampling |
| ES-CDR | Return entropy-driven difficulty adjustment |
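For reference, the generalized advantage estimate (GAE) that PPO combines with its clipped surrogate objective is a backward recursion over TD residuals. A minimal sketch (not the Stable-Baselines3 implementation; episode-termination masking is omitted for brevity):

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma*lam*A_{t+1},
    where delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value for the state after the last step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```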
Install the required packages:

```bash
pip install -r requirements.txt
```

You'll need MuJoCo 2.1+ properly installed and licensed. Refer to: 👉 https://github.com/openai/mujoco-py#install-mujoco
From the root directory, run:

```bash
# REINFORCE
python src/training/Train_Reinforce_vanila.py

# REINFORCE with baseline
python src/training/Train_Baseline.py

# Actor-Critic
python src/training/Train_Actor_Critic.py

# PPO + UDR + ES-CDR
python src/training/PPO_UDR_ES_CDR.py --Domain cdr --Entropy_Scheduling True --seed 0

# Hyperparameter sweep
python src/training/PPO_Hyperparameter_Calculation.py
```

You can adjust sweep parameters via JSON or inline config.
Training metrics (returns, entropy, etc.) are saved as CSV in the Logs/ directory.
To plot results:

```bash
python evaluation/plot_csv_scripts/plot_metrics.py
```

Or use the built-in metric utilities in src/evaluation.
Implemented in src/env/custom_hopper.py, our environment introduces:
- Dynamic randomization of:
- Mass, friction, damping, init pose
- UDR: Resampled every episode
- CDR + ES: Difficulty increases based on policy performance and entropy
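The reset-time hook can be pictured as a thin stand-in for the real environment class (the actual `custom_hopper.py` subclasses the MuJoCo Hopper and randomizes several parameters; this sketch tracks only a torso mass, with made-up ranges):

```python
import random

class RandomizedHopperSketch:
    """Simplified stand-in for the custom Hopper: under UDR the dynamics are
    resampled from a fixed range every reset; under CDR the sampling range
    widens with a curriculum level."""

    def __init__(self, mode="udr", base_mass=3.5):
        self.mode = mode
        self.base_mass = base_mass
        self.level = 1                       # curriculum level (used by CDR)
        self.torso_mass = base_mass

    def reset(self):
        if self.mode == "udr":
            half_width = 1.0                 # fixed range for UDR
        else:                                # "cdr": range grows with level
            half_width = 0.25 * self.level
        self.torso_mass = random.uniform(self.base_mass - half_width,
                                         self.base_mass + half_width)
        return self.torso_mass               # the real env returns an observation
```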
This project extends PPO with adaptive training difficulty using two mechanisms: Curriculum Domain Randomization (CDR) and Entropy Scheduling (ES).
CDR gradually increases the range of domain parameters (e.g., torso mass, friction) during training, helping the agent:
- First master simple dynamics.
- Then adapt to complex, realistic scenarios.
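One way to picture the widening schedule is a sampling interval around a nominal parameter value that grows linearly with the curriculum level (a sketch with assumed names and a linear schedule; the repo's actual schedule may differ):

```python
def cdr_range(level, nominal, max_spread, max_level=3):
    """Linearly widen the sampling interval around the nominal value as the
    curriculum level increases; at max_level the full spread is reached."""
    spread = max_spread * level / max_level
    return (nominal - spread, nominal + spread)
```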
Use it with:

```bash
--Domain cdr
```

ES monitors the policy's return entropy. When the agent is confident (low entropy), it:
- Advances the curriculum level.
- Makes the environment harder.
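The advancement rule can be sketched as a single check: when the measured return entropy falls below a threshold, the curriculum level is bumped (threshold and level cap here are illustrative, not the repo's tuned values):

```python
def maybe_advance_level(level, return_entropy, threshold=1.1, max_level=3):
    """Entropy Scheduling: if recent return entropy drops below the threshold
    (the policy is behaving consistently), harden the curriculum one step."""
    if return_entropy < threshold and level < max_level:
        return level + 1
    return level
```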
Enable it with:

```bash
--Entropy_Scheduling True
```

Full command:

```bash
python src/training/PPO_UDR_ES_CDR.py --Domain cdr --Entropy_Scheduling True --seed 0
```

| Level | Mean Return | Std Dev | Return Entropy |
|---|---|---|---|
| 1 | 820 | ±50 | 1.02 |
| 2 | 710 | ±70 | 1.30 |
| 3 | 665 | ±85 | 1.48 |
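One way to estimate the "Return Entropy" column is to histogram recent episode returns and take the Shannon entropy of the empirical distribution (a sketch of one possible estimator; the repo's exact computation may differ):

```python
import math
from collections import Counter

def return_entropy(returns, n_bins=10):
    """Shannon entropy (in nats) of a histogram over recent episode returns."""
    lo, hi = min(returns), max(returns)
    width = (hi - lo) / n_bins or 1.0  # guard against zero-width bins
    bins = Counter(min(int((r - lo) / width), n_bins - 1) for r in returns)
    n = len(returns)
    return -sum((c / n) * math.log(c / n) for c in bins.values())
```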
- OpenAI Baselines
- Stable-Baselines3 Docs
- MuJoCo Documentation
- Add evaluation over unseen dynamics
- Experiment with off-policy algorithms (e.g., SAC, DDPG)
- Integrate video rendering and performance visualizations
Please reach out via GitHub issues or LinkedIn.