This repository contains an implementation of Proximal Policy Optimization (PPO) based on the original PPO paper for continuous control tasks in MuJoCo environments. The implementation has been tested on HalfCheetah, Swimmer, Hopper, and Walker2d and compared with A2C and Vanilla Policy Gradient (VPG).
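The key difference between PPO and the A2C/VPG baselines is PPO's clipped surrogate objective. As a rough illustration (a minimal NumPy sketch of the objective from the PPO paper, not this repo's actual training code; the function name and signature are mine):

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss from the PPO paper (illustrative sketch).

    Takes per-sample log-probs under the new and old policies plus
    advantage estimates, and returns the negated clipped objective
    (negated so it can be minimized by gradient descent).
    """
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    # Clipping the ratio removes the incentive to move the policy
    # far from the old one in a single update
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()
```

Taking the elementwise minimum of the unclipped and clipped terms makes the bound pessimistic, which is what keeps updates conservative.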
- Install UV (if you don't have it already):

  ```sh
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Install dependencies:

  ```sh
  uv sync
  ```

- If you want to use W&B to track training progress: generate a W&B API key, create a `.env` file, and add your `WANDB_API_KEY`:

  ```
  WANDB_API_KEY=<YOUR_API_KEY>
  ```
Train an agent:

```sh
uv run python train.py --config config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml
```

To train without W&B logging:

```sh
uv run python train.py --config config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml --disable-wandb
```

To run a hyperparameter sweep with W&B:

```sh
wandb sweep config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml
wandb agent <AGENT_NAME>
```

To simulate a trained agent and record videos:

```sh
uv run python simulate.py --config config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml --video-dir videos --episodes <NUM_EPISODES>
```

I log returns smoothed over the last 100 episodes during training. Below are the learning curves of PPO, A2C, and Vanilla PG, averaged over 3 random seeds.
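The smoothing over the last 100 episodes amounts to a moving average with a fixed window. A minimal sketch (an illustrative helper, not the repo's actual logging code):

```python
from collections import deque

class ReturnTracker:
    """Running average of episode returns over the last `window` episodes.

    Illustrative helper: a bounded deque drops the oldest return
    automatically once the window is full.
    """

    def __init__(self, window=100):
        self.returns = deque(maxlen=window)

    def add(self, episode_return):
        # Appending to a full deque evicts the oldest entry
        self.returns.append(episode_return)

    def smoothed(self):
        # Average over however many episodes have been seen so far
        return sum(self.returns) / len(self.returns) if self.returns else 0.0
```

Using `deque(maxlen=...)` keeps the tracker O(1) per episode with no manual index bookkeeping.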
Legend: 🟩 PPO | 🟦 A2C | 🟧 Vanilla PG
Comparison of PPO, A2C, and Vanilla PG on different MuJoCo environments: average return over the last 100 episodes, trained for 1 million timesteps.







