End-to-end walkthrough of how inference and metrics are computed.
```mermaid
sequenceDiagram
    participant C as CDBConfig
    participant D as AlpamayoDataLoader
    participant I as AlpamayoInference
    participant P as AlpamayoPostprocessor
    participant R as report_generator
    C->>D: resolve paths, qa_types
    D->>D: get_sample_ids(mode, scene)
    loop For each (scene, sample)
        D->>D: load_record_by_name(scene, sample)
        D->>D: build_image_paths(record) → [{path, time_key, camera_key}]
        D->>D: format_sample(record) → images, ego_history_xyz/rot
        loop For each question (ladder / dormant / distractor)
            D->>D: get_prompt(formatted_sample, question)
            D->>D: _format_inference_text(question), no ground truth
        end
        D-->>I: prompts.jsonl (one line per question)
        I->>I: load_model(weights_path)
        loop For each prompt in prompts.jsonl
            I->>I: run_single(prompt_payload) → {"text": cot_string}
        end
        I-->>P: outputs.jsonl (one line per question)
        P->>P: _load_sample_qa_index(bench/scene/sample), scoped per sample
        P->>P: _parse_sample_outputs(outputs.jsonl, qa_index)
        P->>P: parse_answer(raw_output, question) × N
        P->>R: generate_sample_report() → Tier 1 report.json
    end
    R->>R: generate_dataset_report() → Tier 2 report.json
    R->>R: generate_run_report() → Tier 3 report.json
    R->>R: generate_root_aggregate() → Tier 4 index.html
```
```mermaid
flowchart TD
    A[inference.py main] --> B[BenchmarkDataset\nbench_dir]
    A --> C[AlpamayoDataLoader\ndataset, model_cfg, cfg]
    B --> D[get_sample_ids\nmode / scene / subset_size]
    D --> E[iter_questions\nsample_ids]
    E --> F[load_record_by_name\nscene_id, sample_id]
    F --> G[_load_frames\nframes.json]
    F --> H[_load_qa\nqa/active_qa.json\nqa/dormant_qa.json\nqa/distractor_qa.json]
    E --> I[build_image_paths\ncamera-major ordering]
    E --> J[format_sample\nimages + ego history]
    J --> K[_compute_rotations\nheading from positions]
    E --> L[get_prompt\nego_history_xyz/rot lists]
    L --> M[_format_inference_text\nquestion + format hint\nNO ground truth]
    E --> N[save_prompts\nprompts.jsonl]
```
`BenchmarkDataset` discovers sample directories and provides `load_record_by_name()`:

```python
dataset = BenchmarkDataset(bench_dir="/causal_drive_bench/causal_nuscenes")
sample_ids = dataset.get_sample_ids(mode="subset", subset_size=10)
# [{"scene_id": "nuscenes-scene-0159", "sample_id": "SAMPLED_0", "sample_idx": 0}, ...]
```

`AlpamayoDataLoader` converts raw records to model-specific prompt payloads via three abstract methods:

```python
loader = AlpamayoDataLoader(dataset=dataset, model_cfg={...}, cfg=cfg)

# Per sample: loads PIL images and ego history
record = dataset.load_record_by_name(scene_id, sample_id)
formatted = loader.format_sample(record)
# Returns: {scene_id, sample_id, images: 16 PIL.Images (camera-major), ego_history_xyz (4,3), ego_history_rot (4,3,3)}

# Per sample: resolves image paths in camera-major order
image_paths = loader.build_image_paths(record)
# Returns: [{"path": "...", "time_key": "Tm1p5", "camera_key": "cam_front"}, ...]

# Per question: builds JSON-safe model-specific fields only
prompt_extras = loader.get_prompt(formatted, question)
# Returns: {"ego_history_xyz": [[...], ...], "ego_history_rot": [[[...], ...], ...]}
```

`iter_questions()` merges framework fields with model-specific fields into a complete prompt record and writes `prompts.jsonl`. Framework fields injected automatically:

| Field | Source |
|---|---|
| `scene_id` | `record["scene_id"]` |
| `sample_id` | `record["sample_id"]` |
| `question_id` | `question["id"]` |
| `prompt_id` | auto-generated padded int |
| `is_evaluated` | `False` |
| `question_json_file` | e.g. `"dormant_qa.json"` |
| `answer_format` | `question["answer_format"]` |
| `question_text` | question + options, no answer |
| `qa_text` | question + format hint only (no ground-truth answer) |
| `image_paths` | list of `{path, time_key, camera_key}` dicts from `build_image_paths()` |

Note: `qa_text` deliberately omits the correct answer and reasoning to prevent ground-truth leakage into the model prompt. It is generated by `_format_inference_text()`, not `format_qa_text()`. The format hint (`"Answer: Yes or No"` / `"Answer: A, B, C, or D"`) is included; the system prompt (`system_prompt.txt`) provides full output format instructions.
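
As a rough illustration of the merge described above, the sketch below builds one prompt record from framework fields plus model-specific extras. It is not the project's `iter_questions()`: the question keys `"question"`, `"options"`, `"answer"` and `"reasoning"`, the `answer_format` values `"binary"` and `"mcq"`, and the six-digit `prompt_id` padding are all assumptions made for the example.

```python
import json

# Assumed answer_format values; the real ones may differ.
FORMAT_HINTS = {"binary": "Answer: Yes or No", "mcq": "Answer: A, B, C, or D"}


def format_inference_text(question):
    """Question + options + format hint. Never reads the answer or reasoning."""
    parts = [question["question"], *question.get("options", [])]
    parts.append(FORMAT_HINTS[question["answer_format"]])
    return "\n".join(parts)


def build_prompt_record(record, question, qa_file, prompt_index, image_paths, prompt_extras):
    """Merge framework fields with model-specific fields into one JSON-safe dict."""
    framework = {
        "scene_id": record["scene_id"],
        "sample_id": record["sample_id"],
        "question_id": question["id"],
        "prompt_id": f"{prompt_index:06d}",  # padding width is an assumption
        "is_evaluated": False,
        "question_json_file": qa_file,
        "answer_format": question["answer_format"],
        "question_text": "\n".join([question["question"], *question.get("options", [])]),
        "qa_text": format_inference_text(question),
        "image_paths": image_paths,
    }
    return {**framework, **prompt_extras}


if __name__ == "__main__":
    question = {
        "id": "CI1",
        "answer_format": "binary",
        "question": "Would the ego vehicle need to brake?",
        "answer": "Yes",                 # ground truth, must not leak
        "reasoning": "hidden rationale",  # ground truth, must not leak
    }
    rec = build_prompt_record(
        {"scene_id": "nuscenes-scene-0159", "sample_id": "SAMPLED_0"},
        question, "dormant_qa.json", 0, [], {"ego_history_xyz": [[0.0, 0.0, 0.0]]},
    )
    print(json.dumps(rec, indent=2))
```

Building the record from an explicit whitelist of fields, rather than copying the question dict, is what keeps ground-truth keys out of `prompts.jsonl` in this sketch.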
Image ordering — camera-major (matches Alpamayo training format):
```text
Index  time_key  camera_key
0      Tm1p5     cam_front
1      Tm1p0     cam_front
2      Tm0p5     cam_front
3      Tp0p0     cam_front
4      Tm1p5     cam_front_left
...
15     Tp0p0     cam_back
```
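
Camera-major ordering means the camera is the outer loop and the timestep is the inner loop. The sketch below is a minimal illustration, not the project's `build_image_paths()`: the camera order is passed in as a parameter because the full camera list is not shown here, and the example paths are placeholders.

```python
TIME_KEYS = ("Tm1p5", "Tm1p0", "Tm0p5", "Tp0p0")


def build_image_paths(frames, camera_keys, time_keys=TIME_KEYS):
    """Return path entries in camera-major order: all timesteps of camera 0, then camera 1, ..."""
    return [
        {"path": frames[t][c], "time_key": t, "camera_key": c}
        for c in camera_keys   # outer loop: camera
        for t in time_keys     # inner loop: time
    ]


if __name__ == "__main__":
    # Placeholder paths for two of the cameras named in the listing above.
    frames = {
        t: {"cam_front": f"{t}/front.jpg", "cam_front_left": f"{t}/front_left.jpg"}
        for t in TIME_KEYS
    }
    for i, item in enumerate(build_image_paths(frames, ["cam_front", "cam_front_left"])):
        print(i, item["time_key"], item["camera_key"])
```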
Ego history is stored sparse (4 nuScenes timesteps at −1.5, −1.0, −0.5, 0.0 s) in
prompts.jsonl. Inference interpolates these to 16 dense steps using
interpolate_ego_history().
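
To make the sparse-to-dense step concrete, here is a minimal sketch of piecewise-linear interpolation of the positions only. It is not the actual `interpolate_ego_history()`: the function name `interpolate_xyz`, the evenly spaced dense timestamps, and the choice of linear interpolation are assumptions, and rotation interpolation (which would need something like slerp) is left out.

```python
from bisect import bisect_right

SPARSE_TIMES = (-1.5, -1.0, -0.5, 0.0)  # seconds, as stored in prompts.jsonl


def interpolate_xyz(xyz_sparse, t_sparse=SPARSE_TIMES, n_dense=16):
    """Linearly interpolate sparse xyz positions onto n_dense evenly spaced timestamps."""
    t_start, t_end = t_sparse[0], t_sparse[-1]
    dense = []
    for i in range(n_dense):
        t = t_start + (t_end - t_start) * i / (n_dense - 1)
        # Index k of the segment [t_sparse[k], t_sparse[k + 1]] that contains t.
        k = max(0, min(bisect_right(t_sparse, t) - 1, len(t_sparse) - 2))
        w = (t - t_sparse[k]) / (t_sparse[k + 1] - t_sparse[k])
        dense.append([a + w * (b - a) for a, b in zip(xyz_sparse[k], xyz_sparse[k + 1])])
    return dense


if __name__ == "__main__":
    sparse = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.5, 0.0], [3.0, 1.0, 0.0]]
    for row in interpolate_xyz(sparse):
        print([round(v, 3) for v in row])
```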
Record schema (BenchmarkDataset.load_record_by_name()):
```python
{
    "scene_id": "nuscenes-scene-0159",
    "sample_id": "SAMPLED_0",
    "state": {...},  # state.json (ego + agents at T=0)
    "frames": {
        "Tm1p5": {"cam_front": "raw_data/.../file.jpg", "cam_front_left": "...", ...},
        "Tm1p0": {...},
        "Tm0p5": {...},
        "Tp0p0": {...},
    },
    "qa": {
        "ladder": [...],      # from active_qa.json
        "dormant": [...],     # from dormant_qa.json
        "distractor": [...],  # from distractor_qa.json
    },
}
```

QA file → internal type mapping:

| File | Internal `qa_type` key | `eval_config.yaml` name |
|---|---|---|
| `active_qa.json` | `ladder` | `ladder` |
| `dormant_qa.json` | `dormant` | `dormant` |
| `distractor_qa.json` | `distractor` | `distractor` |

Important: `eval_config.yaml` must use `ladder` (not `active`) in the `qa_types` list for active causal-ladder questions to be included.
```mermaid
flowchart TD
    A[inference.py main] --> B[AlpamayoInference\nload_model]
    B --> C[AlpamayoR1.from_pretrained\nbfloat16]
    B --> D[helper.get_processor\nQwen3VL processor]
    A --> E[run_from_jsonl\nprompts.jsonl → outputs.jsonl]
    E --> F[run_single\nfor each prompt]
    F --> G[Load 16 images\nPIL → uint8 tensor\ncamera-major order]
    G --> G1{file exists?}
    G1 -->|No| G2[gray placeholder\n1600×900]
    G1 -->|Yes| G3[PILImage.open → numpy\n.permute 2,0,1]
    F --> H[interpolate_ego_history\nxyz_sparse 4×3\nrot_sparse 4×3×3\n→ dense 16 steps]
    H --> H1[unsqueeze ×2\n→ 1,1,16,3\n→ 1,1,16,3,3]
    F --> I[helper.create_message\nimages → Qwen3VL messages]
    I --> J[inject_qa_into_messages\noverride_system_prompt=True\nreplace traj instruction\nwith QA question]
    F --> K[processor.apply_chat_template\ntokenize=True\ncontinue_final_message=True\nfrom cot_start token]
    F --> L[model.sample_trajectories_\nfrom_data_with_vlm_rollout\ntop_p=0.95 temp=0.6\nmax_gen=512 tokens]
    L --> M[VLM rollout\nchain-of-thought text]
    L --> N[Diffusion head\ntrajectory waypoints]
    M --> O[extract extra cot\nunwrap batch/traj/sample dims\n→ cot_string]
    O --> P[return dict\ntext: cot_string]
```
After inject_qa_into_messages(override_system_prompt=True), the Qwen3VL message list is:
```text
[System]    → system_prompt.txt (replaces alpamayo default)
[User]      → [16 image tokens]
              <|traj_history_start|><|traj_history|>×48<|traj_history_end|>
              Question: <question text>
              [A) ... B) ... C) ... D) ...]    ← MCQ only
              Format: Answer: Yes or No        ← or "A, B, C, or D"
[Assistant] → <|cot_start|>                    ← model continues from here
```
The driving instruction "output the chain-of-thought reasoning of the driving process, then output the future trajectory." is replaced (not appended) by the
QA question so the model has a single, unambiguous instruction.
```python
inference = AlpamayoInference(model_cfg=model_cfg, device="cuda", cfg=cfg)
inference.load_model(weights_path)
n_written = inference.run_from_jsonl(
    input_path=prompts_file,
    output_path=outputs_file,
    batch_size=1,
    resume=True,  # skips question_ids already in outputs.jsonl
)
```

`run_single()` contract: it must return a dict with at least a `"text"` key:

```python
def run_single(self, prompt_payload: dict) -> dict:
    ...
    return {"text": cot_string}  # cot = chain-of-causation reasoning
```

`image_paths` lookup pattern:

```python
path_lookup = {
    (item["time_key"], item["camera_key"]): item["path"]
    for item in prompt_payload["image_paths"]
}
# e.g. ("Tm1p5", "cam_front") → "raw_data/.../file.jpg"
```

Timestamped output structure:
```text
<dir_cdb_outputs>/<MODEL_ID>_<YYYYMMDD_HHMMSS>/   ← run_dir
├── inference.log
└── causal_nuscenes/
    └── <scene_id>/
        └── <sample_id>/
            ├── prompts.jsonl
            └── outputs.jsonl
```
GPU-lazy loading: _has_pending_work() checks all sample output files before
loading the model — avoids GPU allocation when everything is already processed.
Resumption: question_id entries already present in outputs.jsonl are skipped.
Safe to Ctrl-C and resume with --run-dir <existing_run>.
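
The resume and GPU-lazy checks can be pictured with the sketch below. It is a simplified stand-in for `_has_pending_work()` and the resume logic, assuming only that every line of `prompts.jsonl` and `outputs.jsonl` is a JSON object carrying a `question_id`; the helper names `load_done_ids` and `pending_prompts` are hypothetical.

```python
import json
from pathlib import Path


def load_done_ids(outputs_path):
    """Collect question_ids already written to outputs.jsonl."""
    done = set()
    path = Path(outputs_path)
    if not path.exists():
        return done
    with path.open() as f:
        for line in f:
            if not line.strip():
                continue
            try:
                done.add(json.loads(line)["question_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # tolerate a line truncated by Ctrl-C
    return done


def pending_prompts(prompts_path, outputs_path):
    """Prompts whose question_id has no entry in outputs.jsonl yet."""
    done = load_done_ids(outputs_path)
    with Path(prompts_path).open() as f:
        prompts = [json.loads(line) for line in f if line.strip()]
    return [p for p in prompts if p["question_id"] not in done]


def has_pending_work(sample_dirs):
    """True if any sample still has unanswered prompts; check this before loading the model."""
    return any(
        pending_prompts(Path(d) / "prompts.jsonl", Path(d) / "outputs.jsonl")
        for d in sample_dirs
        if (Path(d) / "prompts.jsonl").exists()
    )
```

Because the check only reads small JSONL files, it can run before any model weights are loaded.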
```mermaid
flowchart TD
    A[postprocess.py main] --> B[_process_dataset\nrun_dir, bench_dir]
    B --> C[BenchmarkDataset\nget_sample_ids]
    B --> D{for each sample}
    D --> E[_load_sample_qa_index\nbench_dir/scene_id/sample_id\nscoped per-sample!]
    D --> F[_parse_sample_outputs\noutputs.jsonl + qa_index]
    F --> G[parse_answer\nraw_output, question]
    G --> G1[_parse_binary\ncascading regex\nAnswer: Yes/No]
    G --> G2[_parse_mcq\ncascading regex\nAnswer: A-D]
    F --> H[generate_sample_report\nTier 1 report.json]
    B --> I[generate_dataset_report\nTier 2 report.json]
    A --> J[generate_run_report\nTier 3 report.json]
    A --> K{html_enabled?}
    K -->|Yes| L[generate_root_aggregate\nTier 4]
    L --> M[generate_all_html\ncdb_reports_dir]
```
Question IDs (CI1, NI1, WC1, …) are reused across every scene in the
benchmark. The global _load_qa_index(bench_dir) approach — which rglobs all
49+ scenes at once — causes the last alphabetical scene's question to overwrite
all others, producing wrong question texts in reports.
```mermaid
flowchart LR
    subgraph old["❌ Old: global index"]
        A1[_load_qa_index\nall scenes] --> A2[CI1 → last scene wins\nwrong question text]
    end
    subgraph new["✅ New: per-sample index"]
        B1["_load_sample_qa_index\nbench_dir/scene/sample"] --> B2["CI1 → this sample only\ncorrect question text"]
    end
```
_load_sample_qa_index(sample_bench_dir) reads only the qa/ subfolder of the
specific sample being evaluated, eliminating cross-scene ID collisions.
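
A per-sample index along these lines avoids the collision. This is a sketch rather than the project's `_load_sample_qa_index()`: it assumes each QA file is a JSON list of question dicts that carry an `"id"` key, which may not match the real file layout.

```python
import json
from pathlib import Path

# File name → internal qa_type key, as in the mapping table earlier in this document.
QA_FILES = {
    "active_qa.json": "ladder",
    "dormant_qa.json": "dormant",
    "distractor_qa.json": "distractor",
}


def load_sample_qa_index(sample_bench_dir):
    """Map question id → question dict using only this sample's qa/ folder."""
    index = {}
    qa_dir = Path(sample_bench_dir) / "qa"
    for filename, qa_type in QA_FILES.items():
        path = qa_dir / filename
        if not path.exists():
            continue
        for question in json.loads(path.read_text()):
            index[question["id"]] = {**question, "qa_type": qa_type}
    return index
```

Since the index never outlives one sample, an ID such as `CI1` can only resolve to that sample's own question.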
AlpamayoPostprocessor.parse_answer() searches the model's chain-of-causation text
with cascading regex patterns:
Binary (Yes/No):

1. `Answer:\s*(Yes|No)` (highest priority)
2. `answer\s+is\s+(Yes|No)`
3. `^(Yes|No)[.,\s]` (line start)
4. `\b(Yes|No)\b` (fallback)

MCQ (A–D):

1. `Answer:\s*([A-D])`
2. `answer\s+is\s+([A-D])`
3. `Option\s+([A-D])`
4. `([A-D])[).]\s`
5. `^\s*([A-D])\s*$`
6. `\b([A-D])\b` (fallback)

The model's output may optionally be wrapped in <think>...</think> tags (reasoning
model style). Both the inner content and the full text are searched in order.
```mermaid
flowchart TD
    T1["Tier 1: sample\ncdb_outputs/‹run›/‹dataset›/‹scene›/‹sample›/report.json\nper-sample QA results + accuracy"]
    T2["Tier 2: dataset\ncdb_outputs/‹run›/‹dataset›/report.json\naggregated across all samples"]
    T3["Tier 3: run\ncdb_outputs/‹run›/report.json\naggregated across all datasets in this run"]
    T4["Tier 4: root\ncdb_reports/‹run›/index.html\naggregated across all runs in cdb_outputs"]
    T1 -->|read by generate_dataset_report| T2
    T2 -->|read by generate_run_report| T3
    T3 -->|read by generate_root_aggregate| T4
```
Tier 1 reports are incremental: skipped if their stored schema_version matches
eval_config.yaml. Bump reporting.schema_version to force full regeneration.
Tiers 2–4 always regenerate (fast: only read pre-computed JSON).
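
The incremental check for Tier 1 might look like the sketch below. The helper name `needs_regeneration` is hypothetical, and it assumes `report.json` is a JSON object with a top-level `schema_version` field.

```python
import json
from pathlib import Path


def needs_regeneration(report_path, current_schema_version):
    """True if the Tier 1 report is missing, unreadable, or has a different schema_version."""
    path = Path(report_path)
    if not path.exists():
        return True
    try:
        stored = json.loads(path.read_text()).get("schema_version")
    except (json.JSONDecodeError, AttributeError):
        return True  # corrupt or unexpected content: rebuild
    return stored != current_schema_version
```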
```python
postprocessor = AlpamayoPostprocessor(model_cfg={})

# Tier 1 + 2 generated by _process_dataset():
_process_dataset(run_dir, "causal_nuscenes", bench_dir, postprocessor, ...)

# Tier 3:
generate_run_report(run_name, run_dir, run_dir / "report.json", ...)

# Tier 4 + HTML:
root = generate_root_aggregate(cfg.dir_cdb_outputs, schema_version)
generate_all_html(run_name, cfg.dir_cdb_outputs, cfg.dir_cdb_reports, root)
```

`raw_output` format: `_extract_text()` handles both schemas:

```python
raw_output = {"text": "Answer: No\nReasoning: ..."}  # new dict format
raw_output = "Answer: No\nReasoning: ..."            # legacy plain string
```

| Mode | `--mode` | Description |
|---|---|---|
| Full | `full` | All scenes in the benchmark |
| Subset | `subset --subset-size N` | N randomly sampled scenes |
| Single | `single --scene <name>` | Exactly one scene (fastest, for debugging) |

Pass identical mode flags to both inference.py and postprocess.py so they
operate on the same set of samples.
The optional LLM judge evaluates reasoning beyond exact-match accuracy. It scores each model response on four dimensions:

| Dimension | Description | Score |
|---|---|---|
| `spatial_grounding` | Correctly identifies locations and positions of agents | 0–2 |
| `mechanism_identification` | Explains the causal mechanism driving the event | 0–2 |
| `conditionality_awareness` | Understands conditional/hypothetical nature | 0–2 |
| `answer_consistency` | Final answer matches the reasoning given | 0–2 |

Composite score = mean of all four dimensions (0–2 scale).
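
As a worked example of the composite, the sketch below averages the four dimension scores. The function name `composite_score` and the range validation are illustrative assumptions, not the project's code.

```python
DIMENSIONS = (
    "spatial_grounding",
    "mechanism_identification",
    "conditionality_awareness",
    "answer_consistency",
)


def composite_score(scores):
    """Mean of the four judge dimensions, each expected in the range 0 to 2."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise KeyError(f"missing dimension: {dim}")
        if not 0 <= scores[dim] <= 2:
            raise ValueError(f"{dim} out of range: {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS) / len(DIMENSIONS)


if __name__ == "__main__":
    example = {
        "spatial_grounding": 2,
        "mechanism_identification": 1,
        "conditionality_awareness": 1,
        "answer_consistency": 2,
    }
    print(composite_score(example))  # (2 + 1 + 1 + 2) / 4 = 1.5
```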
```python
metrics = postprocessor.compute_metrics(
    outputs_path="outputs.jsonl",
    bench_dir="...",
    use_llm_judge=True,
    judge_cfg={
        "provider": "anthropic",  # "anthropic" | "openai" | "azure_foundry"
        "model": "claude-opus-4-6",
        "api_key_env": "ANTHROPIC_API_KEY",
    },
)
```

Three supported judge backends:

| Provider | `provider` value | Notes |
|---|---|---|
| Anthropic | `"anthropic"` | Recommended; uses `claude-opus-4-6` |
| OpenAI | `"openai"` | Uses `gpt-4o` or similar |
| Azure AI Foundry | `"azure_foundry"` | Azure-hosted Claude or OpenAI models |

Configure in eval_config.yaml under evaluation.judge:
```yaml
evaluation:
  use_llm_judge: false  # set true to enable
  judge:
    provider: anthropic
    model: claude-opus-4-6
    api_key_env: ANTHROPIC_API_KEY
```

The top-level `orchestrate.py` runs the full pipeline for one or more models:
```bash
# Single model, single-scene debug
python evaluation/orchestrate.py \
    --model alpamayo_1_0 \
    --mode single --scene nuscenes-scene-0159

# Subset run
python evaluation/orchestrate.py \
    --model alpamayo_1_0 \
    --dataset causal_nuscenes \
    --mode subset --subset-size 50

# Multiple models, full benchmark
python evaluation/orchestrate.py \
    --models alpamayo_1_0,drivelm \
    --mode full

# Dry run (print commands only)
python evaluation/orchestrate.py --model alpamayo_1_0 --dry-run
```

The orchestrator reads `eval_config.yaml` for the model registry, dataset paths, and
`qa_types`. Only models with `enabled: true` are run.

| Metric | Description |
|---|---|
| `overall.accuracy` | Exact-match accuracy across all questions |
| `overall.n` | Total number of questions evaluated |
| `per_qa_type.ladder` | Accuracy for active causal-ladder questions (MCQ) |
| `per_qa_type.dormant` | Accuracy for dormant link questions (binary) |
| `per_qa_type.distractor` | Accuracy for distractor node questions (binary) |
| `confusion` | Confusion matrix and most-confused pairs |
| `reasoning_score` (optional) | LLM judge composite and per-dimension scores (0–2) |

All metrics are computed by pure-Python functions in common/metrics.py with no
GPU dependencies.
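
A pure-Python version of the exact-match metrics could be as small as the sketch below. It is not the code in `common/metrics.py`: the input layout (a list of dicts with `qa_type`, `predicted` and `expected` keys) and the rule that an unparsed answer (`None`) counts as incorrect are assumptions made for the example.

```python
from collections import Counter, defaultdict


def compute_accuracy(results):
    """Exact-match accuracy overall and per qa_type, plus (expected, predicted) counts."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    confusion = Counter()
    for r in results:
        totals[r["qa_type"]] += 1
        hit = r["predicted"] is not None and r["predicted"] == r["expected"]
        correct[r["qa_type"]] += int(hit)
        confusion[(r["expected"], r["predicted"])] += 1
    n = len(results)
    return {
        "overall": {"accuracy": sum(correct.values()) / n if n else 0.0, "n": n},
        "per_qa_type": {t: correct[t] / totals[t] for t in totals},
        "confusion": dict(confusion),
    }


if __name__ == "__main__":
    demo = [
        {"qa_type": "ladder", "predicted": "B", "expected": "B"},
        {"qa_type": "ladder", "predicted": "A", "expected": "C"},
        {"qa_type": "dormant", "predicted": "Yes", "expected": "Yes"},
        {"qa_type": "distractor", "predicted": None, "expected": "No"},
    ]
    print(compute_accuracy(demo))
```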