
mradovic38/ppo-mujoco


Proximal Policy Optimization (PPO) for MuJoCo

This repository contains an implementation of Proximal Policy Optimization (PPO) based on the original PPO paper for continuous control tasks in MuJoCo environments. The implementation has been tested on HalfCheetah, Swimmer, Hopper, and Walker2d and compared with A2C and Vanilla Policy Gradient (VPG).
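The core of PPO is the clipped surrogate objective from the original paper. As a refresher, here is a minimal NumPy sketch of that objective (the function and variable names are illustrative, not taken from this repo's code):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (Schulman et al., 2017).

    Returns the per-sample objective to be *maximized* (negate it for a loss).
    """
    # Probability ratio r_t(theta) = pi_new(a|s) / pi_old(a|s)
    ratio = np.exp(log_prob_new - log_prob_old)
    # Unclipped and clipped surrogate terms
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (elementwise) minimum of the two
    return np.minimum(unclipped, clipped)

# Example: a ratio of 2.0 with a positive advantage is clipped at 1 + 0.2
obj = ppo_clip_objective(np.log(np.array([2.0])), np.log(np.array([1.0])),
                         np.array([1.0]))
print(obj)  # [1.2]
```

The elementwise minimum makes the objective a pessimistic bound: the policy gets no extra credit for pushing the ratio beyond the clip range, which keeps updates conservative.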

Prerequisites

  1. Install UV (if you don't have it already):

     ```shell
     curl -LsSf https://astral.sh/uv/install.sh | sh
     ```

  2. Install the dependencies:

     ```shell
     uv sync
     ```

  3. (Optional) To track training progress with W&B, generate a W&B API key, create a `.env` file, and add your `WANDB_API_KEY` to it:

     ```
     WANDB_API_KEY=<YOUR_API_KEY>
     ```

Training

Regular training with W&B

```shell
uv run python train.py --config config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml
```

Regular training without W&B

```shell
uv run python train.py --config config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml --disable-wandb
```

W&B sweep (runs all 3 random seeds)

```shell
wandb sweep config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml
wandb agent <AGENT_NAME>
```
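The repo's training YAMLs double as sweep configs, so the exact fields aren't reproduced here. For reference, a sweep over three random seeds follows W&B's generic sweep schema; a hypothetical sketch (the `program`, `method`, and `parameters` keys are standard W&B sweep fields, but the specific values below are illustrative, not this repo's actual configuration):

```yaml
# Hypothetical sweep sketch: grid over three random seeds
program: train.py
method: grid
parameters:
  seed:
    values: [0, 1, 2]
```

`wandb sweep` prints the agent ID to use in the `wandb agent` command above.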

Simulating

```shell
uv run python simulate.py --config config/<MUJOCO_ENV_NAME>/<a2c|ppo|vpg>.yaml --video-dir videos --episodes <NUM_EPISODES>
```

Supported Environments

The implementation has been tested on the following MuJoCo environments:

- HalfCheetah
- Swimmer
- Hopper
- Walker2d

Training Performance

I log smoothed returns over the last 100 episodes during training. Below are the learning curves of PPO, A2C, and Vanilla PG, averaged over 3 random seeds.

Legend: 🟩 PPO | 🟦 A2C | 🟧 Vanilla PG

Comparison of PPO, A2C, and Vanilla PG on different MuJoCo environments: average return over the last 100 episodes, trained for 1 million timesteps.
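The smoothing used for these curves can be reproduced with a simple running mean over the most recent 100 episode returns; a minimal sketch (illustrative, not the repo's actual logging code):

```python
from collections import deque

def smoothed_returns(episode_returns, window=100):
    """Running mean of the last `window` episode returns, one value per episode."""
    recent = deque(maxlen=window)  # drops the oldest return once the window is full
    smoothed = []
    for ret in episode_returns:
        recent.append(ret)
        smoothed.append(sum(recent) / len(recent))
    return smoothed

print(smoothed_returns([10.0, 20.0, 30.0], window=2))  # [10.0, 15.0, 25.0]
```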

Resources
