Guide to setting up PufferDrive on NYU HPC
We will walk through detailed steps to set up PufferDrive on NYU HPC. This documentation assumes you have already set up your HPC account; if you still need to do so, refer to the Lab's onboarding checklist.
It is also recommended to have access to your account code for priority access to the clusters (ask Eugene for this).
1. Clone the PufferDrive repository into your /scratch/$USER directory. (You can also set this up in your home directory; this guide follows an installation in /scratch/.)
git clone https://github.com/Emerge-Lab/PufferDrive.git
2. We will now set up an overlay image in our directory. This is a writable layer that sits on top of a read-only image (which we will set up later). For PufferDrive, it is recommended to use the following overlay file, owing to the preinstalled libraries useful for a successful setup.
cd PufferDrive
mkdir -p /scratch/$USER/images/PufferDrive
cd /scratch/$USER/images/PufferDrive
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz
This might take a few minutes, as the file is quite large. Once done, verify that the image exists:
ls /scratch/$USER/images/PufferDrive
3. We will now request a GPU node to set up our Singularity container.
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
    --time=1:00:00 --account=<account> --pty /bin/bash
Although we can start working with the Singularity container without requesting a node, the full setup needs a GPU node, because installing torch and other libraries requires GPU access.
Once successful, you might see something like this (depends on current HPC traffic) -
>>> srun: job XXXXXXX queued and waiting for resources
>>> srun: job XXXXXXX has been allocated resources
4. Launch the Singularity container by running the following command:
cd /scratch/$USER/PufferDrive
singularity exec --nv --overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
    /scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
You should now see -
Singularity>
- We will use Conda to set up Python in our container. Inside the container, download and install Miniforge to /ext3/miniforge3:
wget --no-check-certificate https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p /ext3/miniforge3
rm Miniforge3-Linux-x86_64.sh  # if you don't need this file any longer
Next, create a wrapper script /ext3/env.sh using a text editor, like nano.
touch /ext3/env.sh
nano /ext3/env.sh
The wrapper script activates your conda environment, into which you will install your packages and dependencies. The script should contain the following:
#!/bin/bash
unset -f which
source /ext3/miniforge3/etc/profile.d/conda.sh
export PATH=/ext3/miniforge3/bin:$PATH
export PYTHONPATH=/ext3/miniforge3/bin:$PATH
Activate your conda environment with the following:
source /ext3/env.sh
If you have the "defaults" channel enabled, please disable it with
conda config --remove channels defaults
Now that your environment is activated, you can update and install packages:
conda update -n base conda -y
conda clean --all --yes
conda install pip -y
conda install ipykernel -y  # Note: ipykernel is required to run as a kernel in the Open OnDemand Jupyter Notebooks
To confirm that your environment is appropriately referencing your Miniforge installation, try out the following:
unset -f which
which conda       # output: /ext3/miniforge3/bin/conda
which python      # output: /ext3/miniforge3/bin/python
python --version  # output: Python 3.8.5
which pip         # output: /ext3/miniforge3/bin/pip
For further instructions, refer to the NYU HPC Singularity documentation.
- We are now ready to install packages for PufferDrive
While still in the Singularity container, first download inih, a tiny, fast C library for reading and parsing .ini configuration files:
wget https://github.com/benhoyt/inih/archive/r62.tar.gz
We can now install the dependencies
pip install -e .
Now, we can compile the C code
python setup.py build_ext --inplace --force
We are now done with the installation 💯 . The next part of the setup is to prepare the folder with required data files and headless rendering.
You can download the WOMD data from Hugging Face in two versions:
- Mini Dataset: GPUDrive_mini contains 1,000 training files and 300 test/validation files
- Full Dataset: GPUDrive contains 100,000 unique scenes
Note: Replace 'GPUDrive_mini' with 'GPUDrive' in your download commands if you want to use the full dataset.
Here's the link to a Python file you can add at /scratch/$USER/PufferDrive to download WOMD data.
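If you prefer to script the download yourself, here is a minimal sketch using the `huggingface_hub` library. The repo ids below are assumptions based on the dataset names mentioned above; verify the actual ids on the Hugging Face dataset pages before running.

```python
# Sketch: download WOMD data via huggingface_hub.
# NOTE: the "EMERGE-lab/..." repo ids are assumptions -- check the actual
# dataset pages on Hugging Face before running.

def womd_repo_id(full: bool = False) -> str:
    """Pick the assumed dataset repo id: full dataset or the mini version."""
    return "EMERGE-lab/GPUDrive" if full else "EMERGE-lab/GPUDrive_mini"

def download_womd(local_dir: str, full: bool = False) -> str:
    """Download a snapshot of the dataset into local_dir; returns its path."""
    from huggingface_hub import snapshot_download  # deferred import
    return snapshot_download(
        repo_id=womd_repo_id(full),
        repo_type="dataset",
        local_dir=local_dir,
    )

# usage (hypothetical target directory):
# download_womd("/scratch/$USER/PufferDrive/data", full=False)
```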
Lastly, let's run the following command to convert JSON files to map binaries. (It should be run from /scratch/$USER/PufferDrive)
python pufferlib/ocean/drive/drive.py
We're almost there now 🚀 . The last step of setting up PufferDrive is to add support for headless rendering. This is essential to observe renders and videos of the training runs, which is super useful for inference and debugging.
This is also a good step to test our setup by running (from /scratch/$USER/PufferDrive) -
puffer train puffer_drive
For this, we will need to install a couple of libraries in our Conda environment -
conda install -c conda-forge xorg-x11-server-xvfb-cos6-x86_64
conda install -c conda-forge ffmpeg
where:
- xvfb: virtual display for headless environments
- ffmpeg: video processing and conversion
With this, we can visualize our training runs and automatically export renders to our Wandb runs.
- Build the application
bash scripts/build_ocean.sh drive local
(If this fails, replace drive with visualize)
- Run with virtual display
xvfb-run -s "-screen 0 1280x720x24" ./drive
(If this fails, replace ./drive with ./visualize)
We are now done with the setup 😃 . We've provided a step-by-step example showcasing one way of running Puffer.
For running training experiments, we recommend using submit_cluster.py instead of manual sbatch scripts. It handles SLURM submission, Singularity
container wrapping, code isolation (so rebuilds don't break running jobs), and wandb integration.
You need a lightweight login venv (outside the container) with submitit and pyyaml:
python -m venv /scratch/$USER/login_venv
source /scratch/$USER/login_venv/bin/activate
pip install submitit pyyaml
source /scratch/$USER/login_venv/bin/activate
cd /scratch/$USER/PufferDrive
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/experiments \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--prefix my_experiment \
--container \
--container_overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3
This submits a job named my_experiment_train_base_<hash> with wandb run name my_experiment.
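The `<hash>` suffix disambiguates jobs that share a prefix but differ in settings. As a hypothetical sketch (not the actual submit_cluster.py logic), such a name can be derived deterministically from the config:

```python
import hashlib

def job_name(prefix: str, program: str, config: dict) -> str:
    """Derive a stable job name: <prefix>_<program>_<short config hash>.

    Hashing the sorted config items means identical settings always map to
    the same name, while any changed setting produces a new hash.
    """
    digest = hashlib.sha1(repr(sorted(config.items())).encode()).hexdigest()[:8]
    return f"{prefix}_{program}_{digest}"

name = job_name("my_experiment", "train_base", {"train.seed": 42})
# e.g. "my_experiment_train_base_" followed by 8 hex characters
```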
The submit script supports wandb run naming, grouping, and project selection:
| Argument | Description | Default |
|---|---|---|
| --prefix | Job name prefix, also used as wandb run name | None |
| --wandb-name | Explicit wandb run name (overrides prefix) | Same as --prefix |
| --wandb-group | Wandb group for organizing related runs | From program config |
| --wandb-project | Wandb project name | From program config |
Example with all wandb options:
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/experiments \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--prefix single_agent_fargoals \
--wandb-group single_agent_experiments \
--wandb-project pufferdrive \
--container \
--container_overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3 \
--args 'env.num_agents=1024' 'train.batch_size=2097152'
Use --args to override any setting from the program config or drive.ini. Keys use dot notation for sections:
--args 'env.num_agents=1024' 'train.batch_size=2097152' 'train.gamma=0.999'
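Conceptually, each dotted key selects a config section and a field within it. A minimal Python sketch of applying such overrides to a nested config dict (a hypothetical helper, not the actual submit_cluster.py code):

```python
def apply_overrides(config: dict, args: list[str]) -> dict:
    """Apply 'section.key=value' overrides to a nested config dict."""
    for arg in args:
        key, value = arg.split("=", 1)       # split off the value
        section, field = key.split(".", 1)   # dot notation -> section, field
        config.setdefault(section, {})[field] = value
    return config

cfg = apply_overrides({}, ["env.num_agents=1024", "train.gamma=0.999"])
# cfg == {"env": {"num_agents": "1024"}, "train": {"gamma": "0.999"}}
```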
To run a hyperparameter sweep, use colon-separated values (one job per combination):
--args 'train.seed=42:55:1' 'train.gamma=0.98:0.999'
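Each colon-separated list contributes one axis to a Cartesian product, so the example above launches 3 seeds x 2 gammas = 6 jobs. A sketch of that expansion (hypothetical helper, not the actual submit script):

```python
from itertools import product

def expand_sweep(args: list[str]) -> list[list[str]]:
    """Expand 'key=v1:v2:...' args into one arg-list per combination."""
    choices = []
    for arg in args:
        key, values = arg.split("=", 1)
        choices.append([f"{key}={v}" for v in values.split(":")])
    # Cartesian product over all axes: one job per combination.
    return [list(combo) for combo in product(*choices)]

jobs = expand_sweep(["train.seed=42:55:1", "train.gamma=0.98:0.999"])
# 3 seeds x 2 gammas -> 6 job configurations
```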
Always pull the latest code and rebuild the C extension on a compute node before launching. Do not build on the login node — nvcc uses too much memory and gets OOM-killed.
cd /scratch/$USER/PufferDrive
git pull origin <branch>
srun --account=<account> --partition=l40s_public,h200_public \
--cpus-per-task=4 --mem=16gb --time=10 --gres=gpu:1 \
singularity exec --nv \
--overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:ro \
/share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
bash -c 'source /ext3/env.sh && export TORCH_CUDA_ARCH_LIST="8.0;9.0" && python setup.py build_ext --inplace --force'
Set TORCH_CUDA_ARCH_LIST to include 8.0 (A100/L40S) and 9.0 (H200) since SLURM may schedule on either partition.
# Check job status
squeue -u $USER

# View training output
tail -20 /scratch/$USER/experiments/<job_name>/submitit/<job_id>_0_log.out

# View errors
tail -20 /scratch/$USER/experiments/<job_name>/submitit/<job_id>_0_log.err

# Find wandb run URL
grep 'View run' /scratch/$USER/experiments/<job_name>/submitit/<job_id>_0_log.err
To preview the commands without submitting:
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/experiments \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--prefix my_test \
--container \
--container_overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3 \
--dry
- Log in to your HPC account
- Navigate to your base repository location (e.g. /scratch/$USER/PufferDrive)
- We will show an example of executing the run using sbatch. Create a .s file (e.g. script.s):
singularity exec --nv --overlay \
/scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
/bin/bash -c "
source /ext3/env.sh
python setup.py build_ext --inplace --force
bash scripts/build_ocean.sh drive local
puffer train puffer_drive --wandb --wandb-project 'puffer' --wandb-group 'test'
"
- Run the command
sbatch script.s
You should see your task in the queue (you can view this by running squeue -u $USER).
- It is important to re-run `python setup.py build_ext --inplace --force` and `bash scripts/build_ocean.sh drive local` for any changes to the build files or `build.c`, respectively.
- You can set up the PufferDrive clone in your home directory as well.
- Setting up headless rendering might require installing additional libraries (e.g. `clang`). The appropriate libraries can be identified from the terminal output.
- In case there is a version mismatch while setting up headless rendering in the base Conda environment, you can create a virtual environment activated specifically for rendering purposes. The following commands can then be used (just an example, assuming the virtual environment is named `myenv`):
singularity exec --nv --overlay \
/scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
/bin/bash -c "
source /ext3/env.sh
python setup.py build_ext --inplace --force
conda activate myenv
bash scripts/build_ocean.sh drive local
puffer train puffer_drive --wandb --wandb-project 'puffer' --wandb-group 'test'
"
- Prefer using `srun` during code development and testing; use `sbatch` for hyperparameter sweeps. To execute using `srun`, run the following:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
    --time=1:00:00 --account=<account> --pty /bin/bash
Then, once you are logged in your compute node -
singularity exec --nv --overlay \
    /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
    /scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
    /bin/bash
source /ext3/env.sh
python setup.py build_ext --inplace --force
bash scripts/build_ocean.sh drive local
puffer train puffer_drive --wandb --wandb-project "puffer" --wandb-group "test"
Do you encounter issues with one of the steps outlined above? Please reach out in the Emerge lab #code-help channel!