openvdb · swahtz · Mar 5, 2026 · Mar 4, 2026
diff --git a/open_vocabulary_segmentation/langsplatv2/.gitignore b/open_vocabulary_segmentation/langsplatv2/.gitignore
@@ -1,2 +1,4 @@
 *.nsys-rep
 langsplatv2_logs/
+frgs_logs/
+*_results*/
diff --git a/open_vocabulary_segmentation/langsplatv2/README.md b/open_vocabulary_segmentation/langsplatv2/README.md
@@ -1,10 +1,10 @@
 # LangSplatV2 (fVDB)
 
-LangSplatV2-style open-vocabulary 3D segmentation using [fVDB](https://github.com/openvdb/fvdb-core) and pre-trained Gaussian splat reconstructions. This implementation trains per-Gaussian sparse coefficient fields and shared CLIP-aligned codebooks on an existing reconstruction; it does not train the underlying Gaussians or colors.
+LangSplatV2-style open-vocabulary 3D segmentation using [fVDB](https://github.com/openvdb/fvdb-core) and pre-trained Gaussian splat reconstructions. This implementation trains per-Gaussian sparse coefficient fields and shared CLIP-aligned codebooks on an existing reconstruction.
 
 ## What this implements
 
-- **Preprocessing**: Multi-scale SAM2 masks and OpenCLIP feature encoding for each image (cached on disk).
+- **Preprocessing**: Multi-scale SAM masks (SAM1 or SAM2, configurable) and OpenCLIP feature encoding for each image (cached on disk).
 - **Training**: Residual VQ codebooks and per-splat sparse logits so that rendered language features match the CLIP embeddings from SAM masks. One feature level (scale) per run; train multiple levels separately and combine at inference.
 - **Compatibility**: Same feature pipeline and training setup (loss, LR, layer schedule) as the original LangSplatV2; uses fVDB for the 3D representation and rendering.
 
@@ -22,7 +22,7 @@ conda activate fvdb
 pip install -e .
 ```
 
-Dependencies (see `pyproject.toml`) include `torch`, `open-clip-torch`, `fvdb-reality-capture`, `tyro`, and optional TensorBoard for logging.
+Dependencies (see `pyproject.toml`) include `fvdb-core`, `fvdb-reality-capture`, `torch`, `open-clip-torch`, `sam2`, `tyro`, and `matplotlib`.
 
 ## How to run
 
@@ -31,15 +31,15 @@ Training loads the SfM scene, applies preprocessing (SAM2 + CLIP) with caching,
 **Minimal (COLMAP scene + PLY reconstruction):**
 
 ```bash
-python train_langsplatv2.py \
+python scripts/train_langsplatv2.py \
     --sfm-dataset-path /path/to/colmap/scene \
     --reconstruction-path /path/to/point_cloud.ply
 ```
 
 **With explicit feature level and log directory:**
 
 ```bash
-python train_langsplatv2.py \
+python scripts/train_langsplatv2.py \
     --sfm-dataset-path /path/to/colmap/scene \
     --reconstruction-path /path/to/point_cloud.ply \
     --config.feature-level 1 \
@@ -50,7 +50,7 @@ python train_langsplatv2.py \
 
 ```bash
 for level in 1 2 3; do
-  python train_langsplatv2.py \
+  python scripts/train_langsplatv2.py \
     --sfm-dataset-path /path/to/scene \
     --reconstruction-path /path/to/gaussians.ply \
     --config.feature-level $level \
@@ -63,8 +63,10 @@ done
 
 - `--config.feature-level` — 0=default, 1=small, 2=medium, 3=large (default: 1).
 - `--config.max-steps` — Training steps (default from max_epochs if not set).
-- `--preprocess.image-downsample-factor` — Downsample images before SAM2/CLIP (e.g. 2 for speed).
+- `--preprocess.image-downsample-factor` — Downsample images before SAM/CLIP (e.g. 2 for speed).
+- `--preprocess.sam-model` — `sam1` or `sam2` (default: `sam2`).
 - `--preprocess.sam2.checkpoint` — SAM2 size: `large`, `small`, `tiny`, `base_plus`.
+- `--preprocess.sam1.checkpoint` — SAM1 variant: `vit_h`, `vit_l`, `vit_b`.
 - `--log-path` — Directory for run subdirs (checkpoints, metrics). Use `None` to disable saving.
 - `--io.use-tensorboard` — Log scalars (and optionally images) to TensorBoard.
 - `--use-every-n-as-val` — Hold out every N-th image for validation (e.g. 5); -1 = no validation.
@@ -74,7 +76,8 @@ done
 With `--log-path` set (e.g. `langsplatv2_logs`), each run writes:
 
 - `log_path/run_<timestamp>/` (or `log_path/<run_name>/` if `--run-name` is set)
-  - `checkpoints/<step>/langsplatv2_ckpt.pt` — Model state and config (when `io.save_checkpoints` is True).
+  - `final_checkpoint.pt` — Final model checkpoint saved at the run's top level for easy access.
+  - `checkpoints/<step>/langsplatv2_ckpt.pt` — Per-step model state and config (when `io.save_checkpoints` is True).
   - `metrics_log.csv` — Step, loss, and optional validation metrics.
   - `tensorboard/` — If `io.use_tensorboard` is True.
   - `images/` — If `io.save_images` is True (e.g. feature visualizations at save steps).
@@ -83,7 +86,7 @@ Preprocessing caches (SAM2 masks, CLIP features) are stored under the scene’s
 
 ## Preprocessing pipeline and cache format
 
-The pipeline (see `LangSplatV2PreprocessConfig` in `config.py`) runs in order: optional scene normalization, point filtering, image downsampling, filter images by visible points, **ComputeMultiScaleSAM2Masks**, **ComputeCLIPFeatures**, and optional cropping.
+The pipeline (see `LangSplatV2PreprocessConfig` in `config.py`) runs in order: optional scene normalization, point filtering, image downsampling, filter images by visible points, **ComputeMultiScaleSAM1Masks** or **ComputeMultiScaleSAM2Masks** (controlled by `--preprocess.sam-model`), **ComputeCLIPFeatures**, and optional cropping.
 
 ### SAM2 masks (per image)
 
@@ -104,11 +107,10 @@ Training uses a single `feature_level` (0–3) to choose which scale’s seg map
 ## Training details and comparison with original LangSplatV2
 
 - **Feature generation**: Same as original — crop mask region → pad to square → resize to 224 → OpenCLIP encode → L2-normalize. Scale order and seg-map indexing (default → s → m → l, cumulative) match.
-- **Optimization**: Same language-feature LR (0.0025), layer schedule (every 10k steps), and cosine loss over valid pixels with gradient scaling via mask fraction. The scalar `train/loss` is the (mask-fraction-scaled) total loss used for backprop. For a smoother, more interpretable curve when mask coverage varies across images, use `train/cosine_loss_valid`, which is the mean cosine loss over valid pixels only (no mask-fraction scaling), we use this for logging.
-- **Data sampling**: One random permutation of all training views per “epoch” (InfiniteSampler with shuffle), one view per step when `batch_size=1`, matching the original’s viewpoint-stack behavior.
 
 ## References
 
-- [LangSplatV2: Vision-Language Gaussian Splatting](https://arxiv.org/abs/2312.16084)
+- [LangSplat: 3D Language Gaussian Splatting](https://arxiv.org/abs/2312.16084)
+- [LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS](https://arxiv.org/abs/2507.07136)
 - [Segment Anything 2 (SAM2)](https://github.com/facebookresearch/segment-anything-2)
 - [OpenCLIP](https://github.com/mlfoundations/open_clip)
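Note on the feature-generation bullet above (crop mask region → pad to square → resize to 224 → OpenCLIP encode → L2-normalize): the two purely geometric and numeric steps can be sketched with NumPy as below. This is a minimal illustration, not the code in this PR. The function names `crop_and_pad_to_square` and `l2_normalize` are hypothetical, and zeroing pixels outside the mask and centering the crop in the padded square are assumptions. Resizing and OpenCLIP encoding are omitted.

```python
import numpy as np


def crop_and_pad_to_square(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out pixels outside the mask, crop to its bounding box, pad to a square.

    image: (H, W, C) array; mask: (H, W) boolean array.
    Centering the crop in the square is an assumption of this sketch.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1

    # Keep only masked pixels, then crop to the mask's bounding box.
    masked = image * mask[..., None].astype(image.dtype)
    crop = masked[y0:y1, x0:x1]

    # Place the crop in the middle of a zero-filled square canvas.
    h, w = crop.shape[:2]
    side = max(h, w)
    square = np.zeros((side, side, image.shape[2]), dtype=image.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    square[top:top + h, left:left + w] = crop
    return square


def l2_normalize(features: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each feature vector (last axis) to unit L2 norm."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)
```

In a full pipeline the square crop would then be resized to 224×224 and encoded, with `l2_normalize` applied to the resulting embedding.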
diff --git a/open_vocabulary_segmentation/langsplatv2/environment.yml b/open_vocabulary_segmentation/langsplatv2/environment.yml
@@ -0,0 +1,19 @@
+name: fvdb_langsplatv2
+channels:
+  - conda-forge
+dependencies:
+  - python >=3.10
+  - numpy
+  - pytorch
+  - torchvision
+  - opencv
+  - tqdm
+  - scikit-learn
+  - matplotlib
+  - gdown
+  - open-clip-torch
+  - tyro
+  - sam2
+  - segment-anything
+  - fvdb-core
+  - fvdb-reality-capture
diff --git a/open_vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/README.md b/open_vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/README.md
@@ -0,0 +1,131 @@
+# LERF-OVS Evaluation
+
+Open-vocabulary segmentation evaluation on the [LERF-OVS](https://www.lerf.io/) dataset, comparing against ground-truth labelme annotations. Computes segmentation mIoU and localization accuracy across four scenes: `ramen`, `figurines`, `teatime`, `waldo_kitchen`.
+
+All commands below should be run from this directory (`evaluation/lerf_ovs/`).
+
+## Prerequisites
+
+- The `fvdb` conda environment with the `fvdb-langsplatv2` package installed (see the [parent README](../../README.md)).
+- `fvdb-reality-capture` installed.
+
+## Step 1: Download the LERF-OVS dataset
+
+```bash
+python download_data.py --dataset-root data
+```
+
+This downloads and extracts the LERF-OVS data from Google Drive into `data/lerf_ovs/`. The resulting layout is:
+
+```
+data/lerf_ovs/
+    label/<scene>/frame_XXXXX.json, frame_XXXXX.jpg
+    <scene>/images/, sparse/
+```
+
+## Step 2: Reconstruct Gaussian splats
+
+```bash
+bash batch_reconstruct_eval_scenes.sh
+```
+
+This runs `frgs reconstruct` on each scene with the same settings as the single-scene command below. Outputs are written to:
+
+```
+reconstructions/<scene>.ply
+```
+
+To reconstruct a single scene manually:
+
+```bash
+frgs reconstruct \
+    --run-name teatime \
+    --tx.image-downsample-factor 1 \
+    data/lerf_ovs/teatime/ \
+    -uv 10 \
+    -o reconstructions/teatime.ply \
+    --cfg.batch-size 1 \
+    --cfg.pose_opt_start_epoch 20
+```
+
+## Step 3: Train LangSplatV2 features
+
+```bash
+bash batch_train_eval_langsplat.sh
+```
+
+For each scene, this trains three models (one per SAM scale level: 1=small, 2=medium, 3=large) for 10k steps using SAM1. The final checkpoints are collected into:
+
+```
+langsplatv2_results/<scene>_level_1.pt
+langsplatv2_results/<scene>_level_2.pt
+langsplatv2_results/<scene>_level_3.pt
+```
+
+To train a single scene and level manually:
+
+```bash
+python ../../scripts/train_langsplatv2.py \
+    --sfm-dataset-path data/lerf_ovs/teatime \
+    --reconstruction-path reconstructions/teatime.ply \
+    --config.feature-level 1 \
+    --run-name teatime_level_1 \
+    --log-path langsplatv2_logs \
+    --config.max-steps 10000 \
+    --preprocess.sam-model sam1
+```
+
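+Then copy the final checkpoint (create `langsplatv2_results/` first if it does not exist, e.g. with `mkdir -p langsplatv2_results`):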
+
+```bash
+cp langsplatv2_logs/teatime_level_1/final_checkpoint.pt \
+   langsplatv2_results/teatime_level_1.pt
+```
+
+## Step 4: Evaluate
+
+**All scenes (auto-discovered from checkpoints):**
+
+```bash
+python eval_lerf.py \
+    --lerf-root data/lerf_ovs \
+    --results-root langsplatv2_results \
+    --reconstructions-root reconstructions
+```
+
+**Single scene:**
+
+```bash
+python eval_lerf.py \
+    --lerf-root data/lerf_ovs \
+    --results-root langsplatv2_results \
+    --reconstructions-root reconstructions \
+    --scenes teatime
+```
+
+The evaluation:
+1. Loads all three level checkpoints per scene
+2. Renders CLIP features from each level for each annotated frame
+3. Computes OpenCLIP relevancy maps for each ground-truth text prompt
+4. Selects the best level per prompt (highest max relevancy score)
+5. Reports **segmentation mIoU** (thresholded relevancy vs GT masks) and **localization accuracy** (relevancy peak inside GT bounding box)
+
+### Evaluation flags
+
+- `--mask-thresh` — Relevancy threshold for binary segmentation mask (default: 0.4).
+- `--eval-topk` — Number of codebook entries to combine at eval (default: 4).
+- `--output-dir` — Where to write results and visualizations (default: `lerf_eval_results`).
+- `--no-visualizations` — Skip saving per-frame visualization images.
+- `--verbose` — Enable debug logging.
+
+### Output
+
+Results are saved to `lerf_eval_results/` (or the path given by `--output-dir`):
+
+```
+lerf_eval_results/
+    lerf_results.json              # Summary across all scenes (mIoU, localization accuracy)
+    <scene>/
+        results.json               # Per-frame breakdown for this scene
+        frame_XXXXX.jpg            # Per-frame visualizations (if enabled)
+```
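Note on evaluation steps 4 and 5 in the README above: the level selection and the two metrics can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the logic in `eval_lerf.py`. The function names are hypothetical, the strict `>` comparison against the threshold is an assumption, and so is the inclusive `(x_min, y_min, x_max, y_max)` box format. Any normalization or smoothing of the relevancy map before thresholding is left out.

```python
import numpy as np


def select_best_level(relevancy_per_level: np.ndarray) -> int:
    """Pick the level whose relevancy map has the highest peak.

    relevancy_per_level: (num_levels, H, W) array for one text prompt.
    """
    num_levels = relevancy_per_level.shape[0]
    peaks = relevancy_per_level.reshape(num_levels, -1).max(axis=1)
    return int(np.argmax(peaks))


def mask_iou(relevancy: np.ndarray, gt_mask: np.ndarray, thresh: float = 0.4) -> float:
    """IoU between the thresholded relevancy map and a ground-truth mask."""
    pred = relevancy > thresh
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)


def peak_in_box(relevancy: np.ndarray, box: tuple) -> bool:
    """True if the relevancy peak lies inside box = (x_min, y_min, x_max, y_max)."""
    y, x = np.unravel_index(np.argmax(relevancy), relevancy.shape)
    x_min, y_min, x_max, y_max = box
    return bool(x_min <= x <= x_max and y_min <= y <= y_max)
```

Averaging `mask_iou` over prompts would give a per-scene mIoU, and the fraction of prompts for which `peak_in_box` is true would give localization accuracy.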
diff --git a/..._vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/batch_reconstruct_eval_scenes.sh b/..._vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/batch_reconstruct_eval_scenes.sh
@@ -0,0 +1,13 @@
+#! /bin/bash
+export PYTHONUNBUFFERED=1
+
+for scene in ramen figurines teatime waldo_kitchen; do
+  frgs reconstruct \
+    --run-name ${scene} \
+    --tx.image-downsample-factor 1 \
+    data/lerf_ovs/${scene}/ \
+    -uv 10 \
+    -o reconstructions/${scene}.ply \
+    --cfg.batch-size 1 \
+    --cfg.pose_opt_start_epoch 20
+done
diff --git a/open_vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/batch_train_eval_langsplat.sh b/open_vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/batch_train_eval_langsplat.sh
@@ -0,0 +1,21 @@
+#! /bin/bash
+set -ex
+for scene in ramen figurines teatime waldo_kitchen; do
+  for level in 1 2 3; do
+    python ../../scripts/train_langsplatv2.py \
+      --sfm-dataset-path data/lerf_ovs/${scene} \
+      --reconstruction-path reconstructions/${scene}.ply \
+      --config.feature-level $level \
+      --run-name ${scene}_level_${level} \
+      --log-path langsplatv2_logs \
+      --config.max-steps 10000 \
+      --preprocess.sam-model sam1
+  done
+
+  # Collect checkpoints (final_checkpoint.pt is saved at the run's top level)
+  mkdir -p langsplatv2_results
+  for level in 1 2 3; do
+    cp langsplatv2_logs/${scene}_level_${level}/final_checkpoint.pt \
+       langsplatv2_results/${scene}_level_${level}.pt
+  done
+done
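Note on the checkpoint collection above: since the evaluation loads all three level checkpoints per scene, it may be useful to confirm that every `<scene>_level_<level>.pt` file exists before running `eval_lerf.py`. The sketch below is a hypothetical helper, not part of this PR. It assumes only the file layout and the scene and level lists used by the batch script.

```python
from pathlib import Path

# Scene and level lists copied from the batch training script.
SCENES = ("ramen", "figurines", "teatime", "waldo_kitchen")
LEVELS = (1, 2, 3)


def missing_checkpoints(results_root, scenes=SCENES, levels=LEVELS):
    """Return the expected checkpoint paths that are not present on disk."""
    root = Path(results_root)
    expected = [
        root / f"{scene}_level_{level}.pt"
        for scene in scenes
        for level in levels
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints("langsplatv2_results")
    if missing:
        print("Missing checkpoints:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All checkpoints present.")
```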
diff --git a/open_vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/download_data.py b/open_vocabulary_segmentation/langsplatv2/evaluation/lerf_ovs/download_data.py
@@ -0,0 +1,26 @@
+# Copyright Contributors to the OpenVDB Project
+# SPDX-License-Identifier: Apache-2.0
+#
+"""CLI script to download the LERF-OVS evaluation dataset."""
+from dataclasses import dataclass
+from pathlib import Path
+
+import tyro
+from langsplatv2.evaluation.datasets import set_dataset_root
+from langsplatv2.evaluation.datasets.lerf import download_lerf_data
+
+
+@dataclass
+class DownloadLERFData:
+    """Download the LERF-OVS dataset for open-vocabulary segmentation evaluation."""
+
+    dataset_root: Path = Path("data")
+    """Root directory to store downloaded datasets."""
+
+    def main(self):
+        set_dataset_root(self.dataset_root)
+        download_lerf_data()
+
+
+if __name__ == "__main__":
+    tyro.cli(DownloadLERFData).main()