Hypergraph Cover Optimization for
Multi-view 3D Human Pose Estimation
Tony Danjun Wang · Tolga Birdal · Nassir Navab · Lennart Bastian
Technical University of Munich · Munich Center for Machine Learning · Imperial College London
From pairwise to higher-order: COMPOSE filters hyperedges rather than pairwise edges (consistent relations in green), resolving cross-view association as a single hypergraph cover before triangulating 3D poses.
COMPOSE is training-free: no 3D supervision, only 2D detections and camera calibration. It estimates 3D human poses from sparse, calibrated multi-view camera systems by casting cross-view association as a weighted exact-cover optimization over a hypergraph of person hypotheses. Rather than stitching pairwise matches together with post-hoc consistency checks, it unifies them into a single combinatorial objective — solved exactly with Integer Linear Programming or approximately (and fast) with loopy Belief Propagation — then triangulates the matched correspondences into 3D skeletons.
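To make the weighted exact-cover idea concrete, here is a minimal brute-force sketch on a toy instance. It is not the repository's ILP or BP solver. The detection labels, hyperedge costs, and the helper name `best_exact_cover` are hypothetical, and the objective (sum of hyperedge costs plus a penalty `lam` per chosen hyperedge) is an assumption modelled on the `--lam` flag described further down.

```python
from itertools import combinations


def best_exact_cover(detections, hyperedges, lam=0.0):
    """Return (total_cost, chosen_names) for the cheapest exact cover, or None.

    detections: iterable of hashable detection ids.
    hyperedges: dict mapping name -> (frozenset of detection ids, cost).
    lam: penalty added once per chosen hyperedge (assumed objective).
    """
    universe = frozenset(detections)
    names = sorted(hyperedges)
    best = None
    # Enumerate every subset of hyperedges (exponential, toy sizes only).
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            members = [hyperedges[n][0] for n in subset]
            # Exact cover: sizes add up to |universe| and the union equals it,
            # which together imply the chosen hyperedges are pairwise disjoint.
            if sum(len(m) for m in members) != len(universe):
                continue
            if frozenset().union(*members) != universe:
                continue
            cost = sum(hyperedges[n][1] for n in subset) + lam * k
            if best is None or cost < best[0]:
                best = (cost, subset)
    return best


# Two people (a, b) seen by three cameras (c0, c1, c2); all costs are made up.
detections = ["c0a", "c0b", "c1a", "c1b", "c2a", "c2b"]
hyperedges = {
    "A_all": (frozenset({"c0a", "c1a", "c2a"}), 1.0),
    "B_all": (frozenset({"c0b", "c1b", "c2b"}), 1.5),
    "mix1": (frozenset({"c0a", "c1b", "c2a"}), 4.0),
    "mix2": (frozenset({"c0b", "c1a", "c2b"}), 4.0),
    "A_pair": (frozenset({"c0a", "c1a"}), 0.4),
    "A_single": (frozenset({"c2a"}), 0.3),
}

no_penalty = best_exact_cover(detections, hyperedges, lam=0.0)
with_penalty = best_exact_cover(detections, hyperedges, lam=1.0)
print(no_penalty[1])    # fragmented cover wins without a penalty
print(with_penalty[1])  # the penalty favours fewer, larger hyperedges
```

In this toy setup the per-hyperedge penalty is what stops the cover from splitting one person into a pair plus a leftover singleton. Real instances are far too large for enumeration, which is why an ILP or BP solver is used instead.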
Important
Tested with Python 3.11 + CUDA 12.8 + PyTorch 2.7. Git LFS is required for the bundled 2D detections, and the Panoptic data scripts need ffmpeg and wget (or curl) on your PATH.
# Package manager: uv (https://docs.astral.sh/uv/)
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/wngTn/COMPOSE && cd COMPOSE
git lfs install && git lfs pull # pulls the bundled 2D detections
uv sync # creates the env and installs the `compose` package
The Panoptic test-split detections are bundled (data/detections/panoptic_test.pkl), so the pipeline runs end-to-end once the Panoptic images are in place (see the download steps below):
# Estimate 3D poses on the Panoptic test split
# (prints the run directory + ready-to-paste next-step commands when it finishes)
python tools/generate.py --config experiments/panoptic.yaml --split test
# Evaluate a completed run against ground truth
python tools/evaluate.py --run-dir ./output/panoptic/<run_dir>
# Visualize a run (omit --run-dir to pick the most recent run)
python tools/visualize.py --run-dir ./output/panoptic/<run_dir> --dataset panoptic
--config selects the dataset (panoptic, shelf, campus; mm_or ships as an optional, unpublished extra config that needs its own detections); --matching-solver is ilp (exact), bp (fast), or greedy (baseline). A CUDA GPU is recommended for triangulation and the BP solver.
Tip
Swap the solver, toggle video, or override any top-level config field ad hoc (cue parameters live in the YAML cues: block):
python tools/generate.py --config experiments/panoptic.yaml --split test --matching-solver bp --save-video --set lam=3.0 pixel_threshold=1024
Evaluated on CMU Panoptic, Shelf, and Campus. Only the Panoptic test split runs out-of-the-box because its 2D detections are bundled; the other datasets need their own images and a detections pickle (see below). Download the Panoptic test images (several GB of HD video; the script defaults to the 4 test sequences and the 5 CMU0 cameras):
bash scripts/data/panoptic/0_get_data.sh # download CMU Panoptic (test sequences, 5 cameras)
bash scripts/data/panoptic/1_extract.sh # extract HD frames (needs ffmpeg)
Note
COMPOSE consumes per-frame, per-camera 2D detections (keypoints + bounding boxes, stored as a pickle) alongside each dataset's images and camera calibration. To use another dataset or your own data, place a detections pickle under data/ and point the split's detections_path in experiments/<dataset>.yaml at it.
Expected directory layout
data/
├── detections/
│ └── panoptic_test.pkl # bundled via Git LFS
└── panoptic/
└── <sequence>/
├── hdImgs/ # 00_03/ 00_06/ 00_12/ 00_13/ 00_23/
├── hdPose3d_stage1_coco19/ # 3D ground truth
└── calibration_<sequence>.json
Using your own 2D detections
A detections pickle is db[sequence][frame]["2D"][camera] -> dict, where camera is a
"<node>_<panel>" string for Panoptic (e.g. "00_03") or an integer index otherwise.
Each per-camera dict holds:
- keypoints_xys: (M, J, 3) float, with M detections, J joints, and [x, y, confidence] per joint
- bbox_xywhs: (M, 5) float, as [x, y, w, h, score]
- (optional) <name>_features: (M, D) float, per-detection embeddings for a photometric cue (e.g. reid_features, dinov2_features)
J and the joint order follow the dataset's convention (compose/skeleton.py). Point the
split's detections_path at your pickle. The scripts/data/<dataset>/ stage scripts produce
these from raw images (detect → filter → optionally extract features).
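The layout above can be sketched as a small script that builds and round-trips a detections database. The nesting and the array names come from the description above. The sequence name, the integer frame key, `J = 19`, the float32 dtype, and the random placeholder values are assumptions for illustration only.

```python
import pickle

import numpy as np

M, J = 2, 19  # 2 detections per camera; J = 19 is an assumption, match your skeleton


def make_camera_entry(num_dets, num_joints, rng):
    """Build one per-camera dict holding the two required arrays."""
    keypoints = rng.random((num_dets, num_joints, 3)).astype(np.float32)  # [x, y, confidence]
    bboxes = rng.random((num_dets, 5)).astype(np.float32)                 # [x, y, w, h, score]
    return {"keypoints_xys": keypoints, "bbox_xywhs": bboxes}


rng = np.random.default_rng(0)      # placeholder values, not real detections
cameras = ["00_03", "00_06"]        # Panoptic-style "<node>_<panel>" keys
db = {
    "example_sequence": {           # hypothetical sequence name
        0: {                        # frame key; the integer type is an assumption
            "2D": {cam: make_camera_entry(M, J, rng) for cam in cameras}
        }
    }
}

# Round-trip in memory; to write a file, use pickle.dump(db, f) on a file
# opened in "wb" mode and point detections_path at it.
blob = pickle.dumps(db)
restored = pickle.loads(blob)
entry = restored["example_sequence"][0]["2D"]["00_03"]
print(entry["keypoints_xys"].shape, entry["bbox_xywhs"].shape)
```

Checking the array shapes after loading is a cheap way to catch a transposed or missing axis before running the full pipeline.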
Sanity-check a dataset's camera calibration with python tools/verify.py --dataset panoptic.
Each dataset has one YAML in experiments/ with shared base parameters and a splits: section.
cues:
  - kind: geometric
    name: reprojection
    score_type: reprojection # reprojection | epipolar | sampson | trifocal
pixel_threshold: 1024.0 # px²; -1 disables geometric pre-filtering
Precedence is defaults < YAML base < YAML split < CLI flags, all merged by PipelineConfig.from_yaml. New cues are added by registering a class in compose.cues.CUE_REGISTRY.
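The precedence rule can be illustrated with a small layered merge. This is only a sketch of the ordering, not the code inside PipelineConfig.from_yaml. The helper `merge_config`, the flat-dict representation, the field names other than `lam`, and every value shown are hypothetical.

```python
def merge_config(defaults, yaml_base, yaml_split, cli_flags):
    """Merge four flat dicts; later sources override earlier ones."""
    merged = {}
    for layer in (defaults, yaml_base, yaml_split, cli_flags):
        # Treat None as "not set" so an omitted CLI flag does not clobber YAML.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


defaults = {"lam": 1.0, "matching_solver": "ilp", "save_video": False}
yaml_base = {"lam": 2.0}                             # shared base parameters
yaml_split = {"save_video": True}                    # values from the splits: section
cli_flags = {"matching_solver": "bp", "lam": None}   # --lam was not passed

config = merge_config(defaults, yaml_base, yaml_split, cli_flags)
print(config)
```

Here `lam` comes from the YAML base, `save_video` from the split, and `matching_solver` from the command line, which matches the stated order defaults < YAML base < YAML split < CLI flags.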
Pipeline & key CLI flags
Each frame is processed by compose.pipeline.process_frame:
- Hypergraph construction: grow a $V$-partite hypergraph of person hypotheses hierarchically over the valid 2D detections (level-by-level pruning, no $O(N^V)$ enumeration).
- Cue scoring: rank every surviving hyperedge with a configurable cue chain (geometric, with optional photometric refinement).
- Correspondence matching: select a cover via ILP (exact), Belief Propagation (approximate), or a greedy baseline.
- Triangulation: algebraic SVD triangulation, Hungarian re-matching of leftover detections, and a final re-triangulation with per-joint worst-view dropout.
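For the triangulation step, the standard algebraic (DLT) solve via SVD can be sketched as below for a single joint. This is a textbook formulation, not the repository's implementation. The function names, the toy cameras, and the way the squared reprojection error is compared against the 1024 px² threshold are assumptions.

```python
import numpy as np


def triangulate_dlt(projections, points_2d):
    """Algebraic (DLT) triangulation of one 3D point from V >= 2 views.

    projections: (V, 3, 4) camera projection matrices.
    points_2d:   (V, 2) pixel coordinates of the same joint in each view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])  # constraint from the x coordinate
        rows.append(v * P[2] - P[1])  # constraint from the y coordinate
    A = np.stack(rows)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point to pixel coordinates with a 3x4 matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def squared_reprojection_errors(projections, points_2d, X):
    """Per-view squared pixel error (px²) of a triangulated point."""
    return np.array([np.sum((project(P, X) - uv) ** 2)
                     for P, uv in zip(projections, points_2d)])


# Three toy cameras sharing intrinsics K, offset along x and y (made-up values).
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
translations = [np.zeros(3), np.array([-1.0, 0.0, 0.0]), np.array([0.0, -1.0, 0.0])]
projections = np.stack([K @ np.hstack([np.eye(3), t.reshape(3, 1)]) for t in translations])

X_true = np.array([0.2, -0.1, 4.0])
points_2d = np.stack([project(P, X_true) for P in projections])

X_hat = triangulate_dlt(projections, points_2d)
errors = squared_reprojection_errors(projections, points_2d, X_hat)
pixel_threshold = 1024.0  # px², the value used in the YAML example
print(X_hat, errors.max() <= pixel_threshold)
```

With noise-free observations the point is recovered up to floating-point error. With real detections, the per-view errors are the kind of quantity a worst-view dropout step could rank before re-triangulating.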
| flag | meaning |
|---|---|
| --config | dataset YAML (required) |
| --split | dataset split (only test is shipped) |
| --matching-solver | ilp · bp · greedy |
| --cameras | camera preset (e.g. CMU0, Shelf5) |
| --lam | matching penalty per chosen hyperedge |
| --pixel-threshold | geometric pre-filter (px²) |
| --save-video / --no-save-video | toggle MP4 output |
| --set KEY=VALUE ... | override any top-level config field |
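As an illustration of what KEY=VALUE overrides involve, the sketch below turns such strings into typed values. It is a guess at the general technique, not the parser used by tools/generate.py. The helper `parse_overrides` and the `matching_solver` key are hypothetical.

```python
import ast


def parse_overrides(pairs):
    """Turn ["lam=3.0", "pixel_threshold=1024"] into a typed dict."""
    overrides = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        try:
            # Parses numbers, booleans, lists, and quoted strings safely.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # bare words such as bp stay plain strings
        overrides[key] = value
    return overrides


overrides = parse_overrides(["lam=3.0", "pixel_threshold=1024", "matching_solver=bp"])
print(overrides)
```

Using `ast.literal_eval` rather than `eval` keeps the override path from executing arbitrary expressions passed on the command line.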
Python API
The installable compose package exposes a small, stable API:
from compose import PipelineConfig, load_dataset, build_cues, process_frame
config = PipelineConfig.from_yaml("experiments/panoptic.yaml", split="test")
dataset = load_dataset(config)
cues = build_cues(config.cues, num_views=config.V, device="cuda")
# process_frame(...) runs the full per-frame pipeline and returns the 3D prediction.
@article{wang2026compose,
title = {COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation},
author = {Wang, Tony Danjun and Birdal, Tolga and Navab, Nassir and Bastian, Lennart},
journal = {arXiv preprint arXiv:2601.09698},
year = {2026},
}
Released under the MIT License.
