Merged
2 changes: 2 additions & 0 deletions open_vocabulary_segmentation/langsplatv2/.gitignore
@@ -1,2 +1,4 @@
*.nsys-rep
langsplatv2_logs/
frgs_logs/
*_results*/
26 changes: 14 additions & 12 deletions open_vocabulary_segmentation/langsplatv2/README.md
@@ -1,10 +1,10 @@
# LangSplatV2 (fVDB)

LangSplatV2-style open-vocabulary 3D segmentation using [fVDB](https://github.com/openvdb/fvdb-core) and pre-trained Gaussian splat reconstructions. This implementation trains per-Gaussian sparse coefficient fields and shared CLIP-aligned codebooks on an existing reconstruction; it does not train the underlying Gaussians or colors.
LangSplatV2-style open-vocabulary 3D segmentation using [fVDB](https://github.com/openvdb/fvdb-core) and pre-trained Gaussian splat reconstructions. This implementation trains per-Gaussian sparse coefficient fields and shared CLIP-aligned codebooks on an existing reconstruction.

## What this implements

- **Preprocessing**: Multi-scale SAM2 masks and OpenCLIP feature encoding for each image (cached on disk).
- **Preprocessing**: Multi-scale SAM masks (SAM1 or SAM2, configurable) and OpenCLIP feature encoding for each image (cached on disk).
- **Training**: Residual VQ codebooks and per-splat sparse logits so that rendered language features match the CLIP embeddings from SAM masks. One feature level (scale) per run; train multiple levels separately and combine at inference.
- **Compatibility**: Same feature pipeline and training setup (loss, LR, layer schedule) as the original LangSplatV2; uses fVDB for the 3D representation and rendering.

@@ -22,7 +22,7 @@ conda activate fvdb
pip install -e .
```

Dependencies (see `pyproject.toml`) include `torch`, `open-clip-torch`, `fvdb-reality-capture`, `tyro`, and optional TensorBoard for logging.
Dependencies (see `pyproject.toml`) include `fvdb-core`, `fvdb-reality-capture`, `torch`, `open-clip-torch`, `sam2`, `tyro`, and `matplotlib`.

## How to run

@@ -31,15 +31,15 @@ Training loads the SfM scene, applies preprocessing (SAM2 + CLIP) with caching,
**Minimal (COLMAP scene + PLY reconstruction):**

```bash
python train_langsplatv2.py \
python scripts/train_langsplatv2.py \
--sfm-dataset-path /path/to/colmap/scene \
--reconstruction-path /path/to/point_cloud.ply
```

**With explicit feature level and log directory:**

```bash
python train_langsplatv2.py \
python scripts/train_langsplatv2.py \
--sfm-dataset-path /path/to/colmap/scene \
--reconstruction-path /path/to/point_cloud.ply \
--config.feature-level 1 \
@@ -50,7 +50,7 @@ python train_langsplatv2.py \

```bash
for level in 1 2 3; do
python train_langsplatv2.py \
python scripts/train_langsplatv2.py \
--sfm-dataset-path /path/to/scene \
--reconstruction-path /path/to/gaussians.ply \
--config.feature-level $level \
@@ -63,8 +63,10 @@ done

- `--config.feature-level` — 0=default, 1=small, 2=medium, 3=large (default: 1).
- `--config.max-steps` — Training steps (default from max_epochs if not set).
- `--preprocess.image-downsample-factor` — Downsample images before SAM2/CLIP (e.g. 2 for speed).
- `--preprocess.image-downsample-factor` — Downsample images before SAM/CLIP (e.g. 2 for speed).
- `--preprocess.sam-model` — `sam1` or `sam2` (default: `sam2`).
- `--preprocess.sam2.checkpoint` — SAM2 size: `large`, `small`, `tiny`, `base_plus`.
- `--preprocess.sam1.checkpoint` — SAM1 variant: `vit_h`, `vit_l`, `vit_b`.
- `--log-path` — Directory for run subdirs (checkpoints, metrics). Use `None` to disable saving.
- `--io.use-tensorboard` — Log scalars (and optionally images) to TensorBoard.
- `--use-every-n-as-val` — Hold out every N-th image for validation (e.g. 5); -1 = no validation.
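
The hold-out rule behind `--use-every-n-as-val` can be illustrated with a minimal sketch. The function name `split_train_val` is hypothetical, and the choice of which index starts each group of N (here, indices divisible by N) is an assumption; the actual selection in the training script may differ.

```python
def split_train_val(num_images: int, every_n: int) -> tuple[list[int], list[int]]:
    """Split image indices into (train, val), holding out every `every_n`-th image.

    Assumption: image i goes to validation when i % every_n == 0.
    """
    indices = list(range(num_images))
    if every_n <= 0:
        # -1 (or any non-positive value) disables validation entirely.
        return indices, []
    val = [i for i in indices if i % every_n == 0]
    train = [i for i in indices if i % every_n != 0]
    return train, val


train_ids, val_ids = split_train_val(num_images=12, every_n=5)
print("train:", train_ids)
print("val:  ", val_ids)
```

With 12 images and N = 5, three images are held out and nine remain for training.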
@@ -74,7 +76,8 @@ done
With `--log-path` set (e.g. `langsplatv2_logs`), each run writes:

- `log_path/run_<timestamp>/` (or `log_path/<run_name>/` if `--run-name` is set)
- `checkpoints/<step>/langsplatv2_ckpt.pt` — Model state and config (when `io.save_checkpoints` is True).
- `final_checkpoint.pt` — Final model checkpoint saved at the run's top level for easy access.
- `checkpoints/<step>/langsplatv2_ckpt.pt` — Per-step model state and config (when `io.save_checkpoints` is True).
- `metrics_log.csv` — Step, loss, and optional validation metrics.
- `tensorboard/` — If `io.use_tensorboard` is True.
- `images/` — If `io.save_images` is True (e.g. feature visualizations at save steps).
@@ -83,7 +86,7 @@ Preprocessing caches (SAM2 masks, CLIP features) are stored under the scene’s

## Preprocessing pipeline and cache format

The pipeline (see `LangSplatV2PreprocessConfig` in `config.py`) runs in order: optional scene normalization, point filtering, image downsampling, filter images by visible points, **ComputeMultiScaleSAM2Masks**, **ComputeCLIPFeatures**, and optional cropping.
The pipeline (see `LangSplatV2PreprocessConfig` in `config.py`) runs in order: optional scene normalization, point filtering, image downsampling, filter images by visible points, **ComputeMultiScaleSAM1Masks** or **ComputeMultiScaleSAM2Masks** (controlled by `--preprocess.sam-model`), **ComputeCLIPFeatures**, and optional cropping.

### SAM2 masks (per image)

@@ -104,11 +107,10 @@ Training uses a single `feature_level` (0–3) to choose which scale’s seg map
## Training details and comparison with original LangSplatV2

- **Feature generation**: Same as original — crop mask region → pad to square → resize to 224 → OpenCLIP encode → L2-normalize. Scale order and seg-map indexing (default → s → m → l, cumulative) match.
- **Optimization**: Same language-feature LR (0.0025), layer schedule (every 10k steps), and cosine loss over valid pixels with gradient scaling via mask fraction. The scalar `train/loss` is the (mask-fraction-scaled) total loss used for backprop. For a smoother, more interpretable curve when mask coverage varies across images, use `train/cosine_loss_valid`, the mean cosine loss over valid pixels only (no mask-fraction scaling); this is the value we use for logging.
- **Data sampling**: One random permutation of all training views per “epoch” (InfiniteSampler with shuffle), one view per step when `batch_size=1`, matching the original’s viewpoint-stack behavior.
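
The relationship between a cosine loss averaged over valid pixels and a mask-fraction-scaled loss can be sketched in plain Python. This is an illustration only: the function names are hypothetical, and it assumes the scaled loss equals the valid-pixel mean multiplied by the fraction of valid pixels, which may not match the exact scaling used in the training code.

```python
import math


def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / max(norm, eps)


def masked_cosine_losses(pred_pixels, target_pixels, valid):
    """Return (mean loss over valid pixels, mask-fraction-scaled loss).

    Assumption: scaled loss = valid-pixel mean * (num valid / num pixels).
    """
    losses = [
        cosine_loss(p, t)
        for p, t, is_valid in zip(pred_pixels, target_pixels, valid)
        if is_valid
    ]
    if not losses:
        return 0.0, 0.0
    loss_valid = sum(losses) / len(losses)
    mask_fraction = len(losses) / len(valid)
    return loss_valid, loss_valid * mask_fraction


# Four pixels with 2-D features; only the first two are covered by a mask.
pred = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
target = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
valid = [True, True, False, False]

loss_valid, loss_scaled = masked_cosine_losses(pred, target, valid)
print(loss_valid, loss_scaled)
```

Under this assumption the scaled value shrinks when few pixels are covered by masks, while the valid-pixel mean stays comparable across images with different mask coverage.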

## References

- [LangSplatV2: Vision-Language Gaussian Splatting](https://arxiv.org/abs/2312.16084)
- [LangSplat: 3D Language Gaussian Splatting](https://arxiv.org/abs/2312.16084)
- [LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS](https://arxiv.org/abs/2507.07136)
- [Segment Anything 2 (SAM2)](https://github.com/facebookresearch/segment-anything-2)
- [OpenCLIP](https://github.com/mlfoundations/open_clip)
19 changes: 19 additions & 0 deletions open_vocabulary_segmentation/langsplatv2/environment.yml
@@ -0,0 +1,19 @@
name: fvdb_langsplatv2
channels:
- conda-forge
dependencies:
- python >=3.10
- numpy
- pytorch
- torchvision
- opencv
- tqdm
- scikit-learn
- matplotlib
- gdown
- open-clip-torch
- tyro
- sam2
- segment-anything
- fvdb-core
- fvdb-reality-capture
131 changes: 131 additions & 0 deletions open_vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/README.md
@@ -0,0 +1,131 @@
# LERF-OVS Evaluation

Open-vocabulary segmentation evaluation on the [LERF-OVS](https://www.lerf.io/) dataset, comparing against ground-truth labelme annotations. Computes segmentation mIoU and localization accuracy across four scenes: `ramen`, `figurines`, `teatime`, `waldo_kitchen`.

All commands below should be run from this directory (`evaluation/lerf_ovs/`).

## Prerequisites

- The `fvdb` conda environment with the `fvdb-langsplatv2` package installed (see the [parent README](../../README.md)).
- `fvdb-reality-capture` installed.

## Step 1: Download the LERF-OVS dataset

```bash
python download_data.py --dataset-root data
```

This downloads and extracts the LERF-OVS data from Google Drive into `data/lerf_ovs/`. The resulting layout is:

```
data/lerf_ovs/
  label/<scene>/frame_XXXXX.json, frame_XXXXX.jpg
  <scene>/images/, sparse/
```

## Step 2: Reconstruct Gaussian splats

```bash
bash batch_reconstruct_eval_scenes.sh
```

This runs `frgs reconstruct` on each scene with default settings. Outputs are written to:

```
reconstructions/<scene>.ply
```

To reconstruct a single scene manually:

```bash
frgs reconstruct \
--run-name teatime \
--tx.image-downsample-factor 1 \
data/lerf_ovs/teatime/ \
-uv 10 \
-o reconstructions/teatime.ply \
--cfg.batch-size 1 \
--cfg.pose_opt_start_epoch 20
```

## Step 3: Train LangSplatV2 features

```bash
bash batch_train_eval_langsplat.sh
```

For each scene, this trains three models (one per SAM scale level: 1=small, 2=medium, 3=large) for 10k steps using SAM1. The final checkpoints are collected into:

```
langsplatv2_results/<scene>_level_1.pt
langsplatv2_results/<scene>_level_2.pt
langsplatv2_results/<scene>_level_3.pt
```

To train a single scene and level manually:

```bash
python ../../scripts/train_langsplatv2.py \
--sfm-dataset-path data/lerf_ovs/teatime \
--reconstruction-path reconstructions/teatime.ply \
--config.feature-level 1 \
--run-name teatime_level_1 \
--log-path langsplatv2_logs \
--config.max-steps 10000 \
--preprocess.sam-model sam1
```

Then copy the final checkpoint:

```bash
cp langsplatv2_logs/teatime_level_1/final_checkpoint.pt \
langsplatv2_results/teatime_level_1.pt
```
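
For collecting several levels at once, the copy step can also be written as a small Python helper. This is a sketch under the directory layout described in this README (`<log-path>/<scene>_level_<n>/final_checkpoint.pt`); the function name `collect_final_checkpoints` is hypothetical and not part of the package.

```python
import shutil
from pathlib import Path


def collect_final_checkpoints(
    log_root: Path,
    results_root: Path,
    scene: str,
    levels: tuple[int, ...] = (1, 2, 3),
) -> list[Path]:
    """Copy <log_root>/<scene>_level_<n>/final_checkpoint.pt to
    <results_root>/<scene>_level_<n>.pt for each level; return the new paths."""
    results_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for level in levels:
        src = log_root / f"{scene}_level_{level}" / "final_checkpoint.pt"
        if not src.is_file():
            # Fail loudly rather than evaluating with a missing level.
            raise FileNotFoundError(f"Missing checkpoint: {src}")
        dst = results_root / f"{scene}_level_{level}.pt"
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

A call such as `collect_final_checkpoints(Path("langsplatv2_logs"), Path("langsplatv2_results"), "teatime")` mirrors the `cp` command above for all three levels.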

## Step 4: Evaluate

**All scenes (auto-discovered from checkpoints):**

```bash
python eval_lerf.py \
--lerf-root data/lerf_ovs \
--results-root langsplatv2_results \
--reconstructions-root reconstructions
```
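
How scenes might be discovered from checkpoint filenames can be sketched as follows. The internals of `eval_lerf.py` are not shown here, so this is only an assumed approach based on the `<scene>_level_<n>.pt` naming used in Step 3; `discover_scenes` and `CKPT_PATTERN` are hypothetical names.

```python
import re
from collections import defaultdict
from pathlib import Path

# Greedy scene group so names with underscores (e.g. waldo_kitchen) stay intact.
CKPT_PATTERN = re.compile(r"^(?P<scene>.+)_level_(?P<level>\d+)\.pt$")


def discover_scenes(filenames, required_levels=(1, 2, 3)):
    """Return sorted scene names that have a checkpoint for every required level."""
    levels_by_scene = defaultdict(set)
    for name in filenames:
        match = CKPT_PATTERN.match(Path(name).name)
        if match:
            levels_by_scene[match["scene"]].add(int(match["level"]))
    required = set(required_levels)
    return sorted(s for s, levels in levels_by_scene.items() if required <= levels)


names = [
    "teatime_level_1.pt", "teatime_level_2.pt", "teatime_level_3.pt",
    "waldo_kitchen_level_1.pt", "waldo_kitchen_level_2.pt", "waldo_kitchen_level_3.pt",
    "ramen_level_1.pt",  # incomplete: levels 2 and 3 are missing
    "notes.txt",         # ignored: does not match the checkpoint pattern
]
scenes = discover_scenes(names)
print(scenes)
```

In practice the filename list could come from `Path("langsplatv2_results").glob("*.pt")`.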

**Single scene:**

```bash
python eval_lerf.py \
--lerf-root data/lerf_ovs \
--results-root langsplatv2_results \
--reconstructions-root reconstructions \
--scenes teatime
```

The evaluation:
1. Loads all three level checkpoints per scene
2. Renders CLIP features from each level for each annotated frame
3. Computes OpenCLIP relevancy maps for each ground-truth text prompt
4. Selects the best level per prompt (highest max relevancy score)
5. Reports **segmentation mIoU** (thresholded relevancy vs GT masks) and **localization accuracy** (relevancy peak inside GT bounding box)
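
Steps 4 and 5 can be illustrated with a small pure-Python sketch on toy relevancy maps. The function names are hypothetical, and the details (strict `>` thresholding, inclusive bounding-box bounds) are assumptions rather than the behavior of `eval_lerf.py`.

```python
def select_best_level(relevancy_by_level):
    """Index of the level whose relevancy map has the highest peak value."""
    peaks = [max(max(row) for row in rel) for rel in relevancy_by_level]
    return peaks.index(max(peaks))


def mask_iou(relevancy, gt_mask, thresh=0.4):
    """IoU between the thresholded relevancy map and a binary GT mask."""
    inter = union = 0
    for rel_row, gt_row in zip(relevancy, gt_mask):
        for value, gt in zip(rel_row, gt_row):
            pred = value > thresh  # assumption: strict comparison
            inter += int(pred and bool(gt))
            union += int(pred or bool(gt))
    return inter / union if union else 0.0


def peak_in_bbox(relevancy, bbox):
    """True if the relevancy peak lies inside bbox = (x_min, y_min, x_max, y_max)."""
    _, y, x = max(
        (value, y, x)
        for y, row in enumerate(relevancy)
        for x, value in enumerate(row)
    )
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max


# Two levels of 3x3 relevancy maps for one text prompt.
level_a = [[0.1, 0.2, 0.1], [0.2, 0.3, 0.1], [0.1, 0.1, 0.1]]
level_b = [[0.1, 0.5, 0.6], [0.1, 0.9, 0.5], [0.0, 0.2, 0.1]]
gt_mask = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]

best = select_best_level([level_a, level_b])
iou = mask_iou([level_a, level_b][best], gt_mask, thresh=0.4)
hit = peak_in_bbox([level_a, level_b][best], bbox=(1, 0, 2, 1))
print(best, iou, hit)
```

Averaging the per-prompt IoU values gives mIoU, and the fraction of prompts with a hit gives localization accuracy.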

### Evaluation flags

- `--mask-thresh` — Relevancy threshold for binary segmentation mask (default: 0.4).
- `--eval-topk` — Number of codebook entries to combine at eval (default: 4).
- `--output-dir` — Where to write results and visualizations (default: `lerf_eval_results`).
- `--no-visualizations` — Skip saving per-frame visualization images.
- `--verbose` — Enable debug logging.
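
To give some intuition for `--eval-topk`, the sketch below combines the top-k codebook entries for a single Gaussian. It is only one plausible reading of "combine": it assumes a softmax over the k largest logits followed by a weighted sum of the corresponding codebook vectors. The function name is hypothetical and the actual evaluation code may weight or render the entries differently.

```python
import math


def topk_sparse_feature(logits, codebook, k=4):
    """Weighted sum of the k codebook vectors with the largest logits.

    Assumption: weights are a softmax restricted to the top-k logits.
    """
    k = min(k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - peak) for i in top]
    total = sum(weights)
    feature = [0.0] * len(codebook[0])
    for i, weight in zip(top, weights):
        for d, value in enumerate(codebook[i]):
            feature[d] += (weight / total) * value
    return feature


# Three codebook entries of dimension 2; the third has a very low logit.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 2.0, -5.0]
feature = topk_sparse_feature(logits, codebook, k=2)
print(feature)
```

With k = 2 the low-logit entry is dropped and the two remaining entries contribute equally.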

### Output

Results are saved to `lerf_eval_results/` (or the path given by `--output-dir`):

```
lerf_eval_results/
  lerf_results.json          # Summary across all scenes (mIoU, localization accuracy)
  <scene>/
    results.json             # Per-frame breakdown for this scene
    frame_XXXXX.jpg          # Per-frame visualizations (if enabled)
```
@@ -0,0 +1,13 @@
#! /bin/bash
export PYTHONUNBUFFERED=1

for scene in ramen figurines teatime waldo_kitchen; do
    frgs reconstruct \
        --run-name ${scene} \
        --tx.image-downsample-factor 1 \
        data/lerf_ovs/${scene}/ \
        -uv 10 \
        -o reconstructions/${scene}.ply \
        --cfg.batch-size 1 \
        --cfg.pose_opt_start_epoch 20
done
@@ -0,0 +1,21 @@
#! /bin/bash
set -ex
for scene in ramen figurines teatime waldo_kitchen; do
    for level in 1 2 3; do
        python ../../scripts/train_langsplatv2.py \
            --sfm-dataset-path data/lerf_ovs/${scene} \
            --reconstruction-path reconstructions/${scene}.ply \
            --config.feature-level $level \
            --run-name ${scene}_level_${level} \
            --log-path langsplatv2_logs \
            --config.max-steps 10000 \
            --preprocess.sam-model sam1
    done

    # Collect checkpoints (final_checkpoint.pt is saved at the run's top level)
    mkdir -p langsplatv2_results
    for level in 1 2 3; do
        cp langsplatv2_logs/${scene}_level_${level}/final_checkpoint.pt \
            langsplatv2_results/${scene}_level_${level}.pt
    done
done
@@ -0,0 +1,26 @@
# Copyright Contributors to the OpenVDB Project
# SPDX-License-Identifier: Apache-2.0
#
"""CLI script to download the LERF-OVS evaluation dataset."""
from dataclasses import dataclass
from pathlib import Path

import tyro
from langsplatv2.evaluation.datasets import set_dataset_root
from langsplatv2.evaluation.datasets.lerf import download_lerf_data


@dataclass
class DownloadLERFData:
    """Download the LERF-OVS dataset for open-vocabulary segmentation evaluation."""

    dataset_root: Path = Path("data")
    """Root directory to store downloaded datasets."""

    def main(self):
        set_dataset_root(self.dataset_root)
        download_lerf_data()


if __name__ == "__main__":
    tyro.cli(DownloadLERFData).main()