Guide to setting up PufferDrive on NYU HPC
We will walk through detailed steps to set up PufferDrive on NYU HPC. This documentation assumes you have already set up your HPC account; if you still need to do so, refer to the Lab's onboarding checklist.
It is also recommended to have access to your account code for priority access to the clusters (ask Eugene for this).
1. Clone the PufferDrive repository into your /scratch/$USER directory. (You can also set this up in your home directory; this guide follows an installation in /scratch/.)
git clone https://github.com/Emerge-Lab/PufferDrive.git
2. We will now set up an overlay image in our directory. This is a writable layer that sits on top of a read-only image (which we will set up later). For PufferDrive, it is recommended to use the following overlay file, owing to the preinstalled libraries useful for a successful setup.
cd PufferDrive
mkdir -p /scratch/$USER/images/PufferDrive
cd /scratch/$USER/images/PufferDrive
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz
This might take a few minutes, as the file is quite large. Once done, verify that the image exists:
ls /scratch/$USER/images/PufferDrive
3. We will now request a GPU node to set up our Singularity container.
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
    --time=1:00:00 --account=<account> --pty /bin/bash
Although we can start working with the Singularity container without requesting a node, the full setup needs a GPU node, because installing torch and other libraries requires GPU access.
Once successful, you might see something like this (depends on current HPC traffic) -
>>> srun: job XXXXXXX queued and waiting for resources
>>> srun: job XXXXXXX has been allocated resources
4. Launch the Singularity container by running the following command:
cd /scratch/$USER/PufferDrive
singularity exec --nv --overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
    /scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
You should now see -
Singularity>
- We will use Conda to set up Python in our container. Inside the container, download and install Miniforge to /ext3/miniforge3:
wget --no-check-certificate https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p /ext3/miniforge3
rm Miniforge3-Linux-x86_64.sh  # if you don't need this file any longer
Next, create a wrapper script /ext3/env.sh using a text editor, like nano.
touch /ext3/env.sh
nano /ext3/env.sh
The wrapper script activates your conda environment, into which you will install your packages and dependencies. The script should contain the following:
#!/bin/bash
unset -f which
source /ext3/miniforge3/etc/profile.d/conda.sh
export PATH=/ext3/miniforge3/bin:$PATH
export PYTHONPATH=/ext3/miniforge3/bin:$PATH
Activate your conda environment with the following:
source /ext3/env.sh
If you have the "defaults" channel enabled, please disable it with
conda config --remove channels defaults
Now that your environment is activated, you can update and install packages:
conda update -n base conda -y
conda clean --all --yes
conda install pip -y
conda install ipykernel -y  # Note: ipykernel is required to run as a kernel in the Open OnDemand Jupyter Notebooks
To confirm that your environment is appropriately referencing your Miniforge installation, try out the following:
unset -f which
which conda       # output: /ext3/miniforge3/bin/conda
which python      # output: /ext3/miniforge3/bin/python
python --version  # output: Python 3.8.5
which pip         # output: /ext3/miniforge3/bin/pip
For further instructions, refer to the NYU HPC Singularity documentation.
- We are now ready to install packages for PufferDrive
While still in the Singularity container, first download inih, a tiny, fast C library for reading and parsing .ini configuration files:
wget https://github.com/benhoyt/inih/archive/r62.tar.gz
We can now install the dependencies
pip install -e .
Now, we can compile the C code
python setup.py build_ext --inplace --force
We are now done with the installation 💯 . The next part of the setup is to prepare the folder with required data files and headless rendering.
You can download the WOMD data from Hugging Face in two versions:
- Mini Dataset: GPUDrive_mini contains 1,000 training files and 300 test/validation files
- Full Dataset: GPUDrive contains 100,000 unique scenes
Note: Replace 'GPUDrive_mini' with 'GPUDrive' in your download commands if you want to use the full dataset.
Here's the link to a Python file you can add at /scratch/$USER/PufferDrive to download WOMD data.
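If you prefer to script the download yourself, here is a minimal sketch using the `huggingface_hub` library. The repo ids below are assumptions based on the dataset names mentioned above; verify the actual ids on the Hugging Face dataset pages before running.

```python
# Sketch: download WOMD data via huggingface_hub.
# NOTE: the "EMERGE-lab/..." repo ids are assumptions -- check the actual
# dataset pages on Hugging Face before running.

def womd_repo_id(full: bool = False) -> str:
    """Pick the assumed dataset repo id: full dataset or the mini version."""
    return "EMERGE-lab/GPUDrive" if full else "EMERGE-lab/GPUDrive_mini"

def download_womd(local_dir: str, full: bool = False) -> str:
    """Download a snapshot of the dataset into local_dir; returns its path."""
    from huggingface_hub import snapshot_download  # deferred import
    return snapshot_download(
        repo_id=womd_repo_id(full),
        repo_type="dataset",
        local_dir=local_dir,
    )

# usage (hypothetical target directory):
# download_womd("/scratch/$USER/PufferDrive/data", full=False)
```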
Lastly, let's run the following command to convert JSON files to map binaries. (It should be run from /scratch/$USER/PufferDrive)
python pufferlib/ocean/drive/drive.py
We're almost there now 🚀 . The last step of setting up PufferDrive is to add support for headless rendering. This is essential to observe renders and videos of the training runs, which is super useful for inference and debugging.
This is also a good step to test our setup by running (from /scratch/$USER/PufferDrive) -
puffer train puffer_drive
For this, we will need to install a couple of libraries in our Conda environment -
conda install -c conda-forge xorg-x11-server-xvfb-cos6-x86_64
conda install -c conda-forge ffmpeg
where:
- xvfb: virtual display for headless environments
- ffmpeg: video processing and conversion
With this, we can visualize our training runs and automatically export renders to our Wandb runs.
- Build the application
bash scripts/build_ocean.sh drive local
(If this fails, replace drive with visualize)
- Run with virtual display
xvfb-run -s "-screen 0 1280x720x24" ./drive
(If this fails, replace ./drive with ./visualize)
We are now done with the setup 😃 . We've provided a step-by-step example showcasing one way of running Puffer.
For running training experiments, we recommend using submit_cluster.py instead of manual sbatch scripts. It handles SLURM submission, Singularity
container wrapping, code isolation (so rebuilds don't break running jobs), and wandb integration.
You need a lightweight login venv (outside the container) with submitit and pyyaml:
python -m venv /scratch/$USER/login_venv
source /scratch/$USER/login_venv/bin/activate
pip install submitit pyyaml
source /scratch/$USER/login_venv/bin/activate
cd /scratch/$USER/PufferDrive
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/experiments \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--prefix my_experiment \
--container \
--container_overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3
This submits a job named my_experiment_train_base_<hash> with wandb run name my_experiment.
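The `<hash>` suffix disambiguates jobs that share a prefix but differ in settings. As a hypothetical sketch (not the actual submit_cluster.py logic), such a name can be derived deterministically from the config:

```python
import hashlib

def job_name(prefix: str, program: str, config: dict) -> str:
    """Derive a stable job name: <prefix>_<program>_<short config hash>.

    Hashing the sorted config items means identical settings always map to
    the same name, while any changed setting produces a new hash.
    """
    digest = hashlib.sha1(repr(sorted(config.items())).encode()).hexdigest()[:8]
    return f"{prefix}_{program}_{digest}"

name = job_name("my_experiment", "train_base", {"train.seed": 42})
# e.g. "my_experiment_train_base_" followed by 8 hex characters
```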
The submit script supports wandb run naming, grouping, and project selection:
| Argument | Description | Default |
|---|---|---|
| --prefix | Job name prefix, also used as wandb run name | None |
| --wandb-name | Explicit wandb run name (overrides prefix) | Same as --prefix |
| --wandb-group | Wandb group for organizing related runs | From program config |
| --wandb-project | Wandb project name | From program config |
Example with all wandb options:
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/experiments \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--prefix single_agent_fargoals \
--wandb-group single_agent_experiments \
--wandb-project pufferdrive \
--container \
--container_overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3 \
--args 'env.num_agents=1024' 'train.batch_size=2097152'
Use --args to override any setting from the program config or drive.ini. Keys use dot notation for sections:
--args 'env.num_agents=1024' 'train.batch_size=2097152' 'train.gamma=0.999'
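Conceptually, each dotted key selects a config section and a field within it. A minimal Python sketch of applying such overrides to a nested config dict (a hypothetical helper, not the actual submit_cluster.py code):

```python
def apply_overrides(config: dict, args: list[str]) -> dict:
    """Apply 'section.key=value' overrides to a nested config dict."""
    for arg in args:
        key, value = arg.split("=", 1)       # split off the value
        section, field = key.split(".", 1)   # dot notation -> section, field
        config.setdefault(section, {})[field] = value
    return config

cfg = apply_overrides({}, ["env.num_agents=1024", "train.gamma=0.999"])
# cfg == {"env": {"num_agents": "1024"}, "train": {"gamma": "0.999"}}
```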
To run a hyperparameter sweep, use colon-separated values (one job per combination):
--args 'train.seed=42:55:1' 'train.gamma=0.98:0.999'
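Each colon-separated list contributes one axis to a Cartesian product, so the example above launches 3 seeds x 2 gammas = 6 jobs. A sketch of that expansion (hypothetical helper, not the actual submit script):

```python
from itertools import product

def expand_sweep(args: list[str]) -> list[list[str]]:
    """Expand 'key=v1:v2:...' args into one arg-list per combination."""
    choices = []
    for arg in args:
        key, values = arg.split("=", 1)
        choices.append([f"{key}={v}" for v in values.split(":")])
    # Cartesian product over all axes: one job per combination.
    return [list(combo) for combo in product(*choices)]

jobs = expand_sweep(["train.seed=42:55:1", "train.gamma=0.98:0.999"])
# 3 seeds x 2 gammas -> 6 job configurations
```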
Always pull the latest code and rebuild the C extension on a compute node before launching. Do not build on the login node — nvcc uses too much memory and gets OOM-killed.
cd /scratch/$USER/PufferDrive
git pull origin <branch>
srun --account=<account> --partition=l40s_public,h200_public \
--cpus-per-task=4 --mem=16gb --time=10 --gres=gpu:1 \
singularity exec --nv \
--overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:ro \
/share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
bash -c 'source /ext3/env.sh && export TORCH_CUDA_ARCH_LIST="8.0;9.0" && python setup.py build_ext --inplace --force'
Set TORCH_CUDA_ARCH_LIST to include 8.0 (A100/L40S) and 9.0 (H200) since SLURM may schedule on either partition.
# Check job status
squeue -u $USER

# View training output
tail -20 /scratch/$USER/experiments/<job_name>/submitit/<job_id>_0_log.out

# View errors
tail -20 /scratch/$USER/experiments/<job_name>/submitit/<job_id>_0_log.err

# Find wandb run URL
grep 'View run' /scratch/$USER/experiments/<job_name>/submitit/<job_id>_0_log.err
To preview the commands without submitting:
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/experiments \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--prefix my_test \
--container \
--container_overlay /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3 \
--dry
- Log in to your HPC account
- Navigate to your base repository location (e.g. /scratch/$USER/PufferDrive)
- We will show an example of executing the run using sbatch. Create a .s file (e.g. script.s):
singularity exec --nv --overlay \
/scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
/bin/bash -c "
source /ext3/env.sh
python setup.py build_ext --inplace --force
bash scripts/build_ocean.sh drive local
puffer train puffer_drive --wandb --wandb-project 'puffer' --wandb-group 'test'
"
- Run the command
sbatch script.s
You should see your task in the queue (you can view this by running squeue -u $USER).
- It is important to re-run `python setup.py build_ext --inplace --force` and `bash scripts/build_ocean.sh drive local` for any changes to the build files or `build.c`, respectively.
- You can set up the PufferDrive clone in your home directory as well.
- Setting up headless rendering might require installing additional libraries (e.g. `clang`). The appropriate libraries can be identified from the terminal output.
- In case there is a version mismatch while setting up headless rendering in the base Conda environment, you can create a virtual environment activated specifically for rendering purposes. The following commands can then be used (just an example, assuming the virtual environment is named `myenv`):
singularity exec --nv --overlay \
/scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
/bin/bash -c "
source /ext3/env.sh
python setup.py build_ext --inplace --force
conda activate myenv
bash scripts/build_ocean.sh drive local
puffer train puffer_drive --wandb --wandb-project 'puffer' --wandb-group 'test'
"
- Prefer using `srun` during code development and testing; use `sbatch` for hyperparameter sweeps. To execute using `srun`, run the following:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
    --time=1:00:00 --account=<account> --pty /bin/bash
Then, once you are logged in your compute node -
singularity exec --nv --overlay \
    /scratch/$USER/images/PufferDrive/overlay-50G-10M.ext3:rw \
    /scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
    /bin/bash
source /ext3/env.sh
python setup.py build_ext --inplace --force
bash scripts/build_ocean.sh drive local
puffer train puffer_drive --wandb --wandb-project "puffer" --wandb-group "test"
Do you encounter issues with one of the steps outlined above? Please reach out in the Emerge lab #code-help channel!