CausalDriveBench Model Evaluation Pipeline

End-to-end walkthrough of how inference and metrics are computed.

Pipeline Overview

sequenceDiagram
    participant C as CDBConfig
    participant D as AlpamayoDataLoader
    participant I as AlpamayoInference
    participant P as AlpamayoPostprocessor
    participant R as report_generator

    C->>D: resolve paths, qa_types
    D->>D: get_sample_ids(mode, scene)
    loop For each (scene, sample)
        D->>D: load_record_by_name(scene, sample)
        D->>D: build_image_paths(record) → [{path, time_key, camera_key}]
        D->>D: format_sample(record) → images, ego_history_xyz/rot
        loop For each question (ladder / dormant / distractor)
            D->>D: get_prompt(formatted_sample, question)
            D->>D: _format_inference_text(question) — no ground truth
        end
        D-->>I: prompts.jsonl (one line per question)
        I->>I: load_model(weights_path)
        loop For each prompt in prompts.jsonl
            I->>I: run_single(prompt_payload) → {"text": cot_string}
        end
        I-->>P: outputs.jsonl (one line per question)
        P->>P: _load_sample_qa_index(bench/scene/sample) — scoped per sample
        P->>P: _parse_sample_outputs(outputs.jsonl, qa_index)
        P->>P: parse_answer(raw_output, question) × N
        P->>R: generate_sample_report() → Tier 1 report.json
    end
    R->>R: generate_dataset_report() → Tier 2 report.json
    R->>R: generate_run_report()     → Tier 3 report.json
    R->>R: generate_root_aggregate() → Tier 4 index.html

Stage 1: Data Loading

Call Graph

flowchart TD
    A[inference.py main] --> B[BenchmarkDataset\nbench_dir]
    A --> C[AlpamayoDataLoader\ndataset, model_cfg, cfg]
    B --> D[get_sample_ids\nmode / scene / subset_size]
    D --> E[iter_questions\nsample_ids]
    E --> F[load_record_by_name\nscene_id, sample_id]
    F --> G[_load_frames\nframes.json]
    F --> H[_load_qa\nqa/active_qa.json\nqa/dormant_qa.json\nqa/distractor_qa.json]
    E --> I[build_image_paths\ncamera-major ordering]
    E --> J[format_sample\nimages + ego history]
    J --> K[_compute_rotations\nheading from positions]
    E --> L[get_prompt\nego_history_xyz/rot lists]
    L --> M[_format_inference_text\nquestion + format hint\nNO ground truth]
    E --> N[save_prompts\nprompts.jsonl]

BenchmarkDataset discovers sample directories and provides load_record_by_name():

dataset = BenchmarkDataset(bench_dir="/causal_drive_bench/causal_nuscenes")
sample_ids = dataset.get_sample_ids(mode="subset", subset_size=10)
# [{"scene_id": "nuscenes-scene-0159", "sample_id": "SAMPLED_0", "sample_idx": 0}, ...]

AlpamayoDataLoader converts raw records to model-specific prompt payloads via three abstract methods:

loader = AlpamayoDataLoader(dataset=dataset, model_cfg={...}, cfg=cfg)

# Per sample — loads PIL images and ego history
record = dataset.load_record_by_name(scene_id, sample_id)
formatted = loader.format_sample(record)
# Returns: {scene_id, sample_id, images: 16 PIL.Images (camera-major), ego_history_xyz (4,3), ego_history_rot (4,3,3)}

# Per sample — resolves image paths in camera-major order
image_paths = loader.build_image_paths(record)
# Returns: [{"path": "...", "time_key": "Tm1p5", "camera_key": "cam_front"}, ...]

# Per question — builds JSON-safe model-specific fields only
prompt_extras = loader.get_prompt(formatted, question)
# Returns: {"ego_history_xyz": [[...], ...], "ego_history_rot": [[[...], ...], ...]}

iter_questions() merges framework fields with model-specific fields into a complete prompt record and writes prompts.jsonl. Framework fields injected automatically:

Field	Source
`scene_id`	`record["scene_id"]`
`sample_id`	`record["sample_id"]`
`question_id`	`question["id"]`
`prompt_id`	auto-generated padded int
`is_evaluated`	`False`
`question_json_file`	e.g. `"dormant_qa.json"`
`answer_format`	`question["answer_format"]`
`question_text`	question + options, no answer
`qa_text`	question + format hint only — no ground truth answer
`image_paths`	list of `{path, time_key, camera_key}` dicts from `build_image_paths()`

Note: qa_text deliberately omits the correct answer and reasoning to prevent ground-truth leakage into the model prompt. It is generated by _format_inference_text(), not format_qa_text(). The format hint ("Answer: Yes or No" / "Answer: A, B, C, or D") is included; the system prompt (system_prompt.txt) provides full output format instructions.

Image ordering — camera-major (matches Alpamayo training format):

Index  time_key  camera_key
  0    Tm1p5     cam_front
  1    Tm1p0     cam_front
  2    Tm0p5     cam_front
  3    Tp0p0     cam_front
  4    Tm1p5     cam_front_left
  ...
 15    Tp0p0     cam_back

Ego history is stored sparse (4 NuScenes timesteps at −1.5, −1.0, −0.5, 0.0 s) in prompts.jsonl. Inference interpolates these to 16 dense steps using interpolate_ego_history().

Record schema (BenchmarkDataset.load_record_by_name()):

{
  "scene_id":  "nuscenes-scene-0159",
  "sample_id": "SAMPLED_0",
  "state":     {...},   # state.json (ego + agents at T=0)
  "frames": {
    "Tm1p5": {"cam_front": "raw_data/.../file.jpg", "cam_front_left": "...", ...},
    "Tm1p0": {...},
    "Tm0p5": {...},
    "Tp0p0": {...},
  },
  "qa": {
    "ladder":     [...],   # from active_qa.json
    "dormant":    [...],   # from dormant_qa.json
    "distractor": [...],   # from distractor_qa.json
  },
}

QA file → internal type mapping:

File	Internal `qa_type` key	`eval_config.yaml` name
`active_qa.json`	`ladder`	`ladder`
`dormant_qa.json`	`dormant`	`dormant`
`distractor_qa.json`	`distractor`	`distractor`

Important: eval_config.yaml must use ladder (not active) in the qa_types list for active causal-ladder questions to be included.

Stage 2: Inference (Alpamayo 1.0)

Call Graph

flowchart TD
    A[inference.py main] --> B[AlpamayoInference\nload_model]
    B --> C[AlpamayoR1.from_pretrained\nbfloat16]
    B --> D[helper.get_processor\nQwen3VL processor]
    A --> E[run_from_jsonl\nprompts.jsonl → outputs.jsonl]
    E --> F[run_single\nfor each prompt]

    F --> G[Load 16 images\nPIL → uint8 tensor\ncamera-major order]
    G --> G1{file exists?}
    G1 -->|No| G2[gray placeholder\n1600×900]
    G1 -->|Yes| G3[PILImage.open → numpy\n.permute 2,0,1]

    F --> H[interpolate_ego_history\nxyz_sparse 4×3\nrot_sparse 4×3×3\n→ dense 16 steps]
    H --> H1[unsqueeze ×2\n→ 1,1,16,3\n→ 1,1,16,3,3]

    F --> I[helper.create_message\nimages → Qwen3VL messages]
    I --> J[inject_qa_into_messages\noverride_system_prompt=True\nreplace traj instruction\nwith QA question]

    F --> K[processor.apply_chat_template\ntokenize=True\ncontinue_final_message=True\nfrom cot_start token]

    F --> L[model.sample_trajectories_\nfrom_data_with_vlm_rollout\ntop_p=0.95 temp=0.6\nmax_gen=512 tokens]
    L --> M[VLM rollout\nchain-of-thought text]
    L --> N[Diffusion head\ntrajectory waypoints]

    M --> O[extract extra cot\nunwrap batch/traj/sample dims\n→ cot_string]
    O --> P[return dict\ntext: cot_string]

Message Structure

After inject_qa_into_messages(override_system_prompt=True), the Qwen3VL message list is:

[System]    → system_prompt.txt (replaces alpamayo default)
[User]      → [16 image tokens]
              <|traj_history_start|><|traj_history|>×48<|traj_history_end|>

              Question: <question text>
              [A) ... B) ... C) ... D) ...]   ← MCQ only
              Format: Answer: Yes or No        ← or "A, B, C, or D"
[Assistant] → <|cot_start|>                  ← model continues from here

The driving instruction "output the chain-of-thought reasoning of the driving process, then output the future trajectory." is replaced (not appended) by the QA question so the model has a single, unambiguous instruction.

`run_from_jsonl()` loop

inference = AlpamayoInference(model_cfg=model_cfg, device="cuda", cfg=cfg)
inference.load_model(weights_path)
n_written = inference.run_from_jsonl(
    input_path=prompts_file,
    output_path=outputs_file,
    batch_size=1,
    resume=True,       # skips question_ids already in outputs.jsonl
)

run_single() contract — must return a dict with at least a "text" key:

def run_single(self, prompt_payload: dict) -> dict:
    ...
    return {"text": cot_string}   # cot = chain-of-causation reasoning

image_paths lookup pattern:

path_lookup = {
    (item["time_key"], item["camera_key"]): item["path"]
    for item in prompt_payload["image_paths"]
}
# e.g. ("Tm1p5", "cam_front") → "raw_data/.../file.jpg"

Timestamped output structure:

<dir_cdb_outputs>/<MODEL_ID>_<YYYYMMDD_HHMMSS>/   ← run_dir
├── inference.log
└── causal_nuscenes/
    └── <scene_id>/
        └── <sample_id>/
            ├── prompts.jsonl
            └── outputs.jsonl

GPU-lazy loading: _has_pending_work() checks all sample output files before loading the model — avoids GPU allocation when everything is already processed.

Resumption: question_id entries already present in outputs.jsonl are skipped. Safe to Ctrl-C and resume with --run-dir <existing_run>.

Stage 3: Post-processing

Call Graph

flowchart TD
    A[postprocess.py main] --> B[_process_dataset\nrun_dir, bench_dir]

    B --> C[BenchmarkDataset\nget_sample_ids]

    B --> D{for each sample}
    D --> E[_load_sample_qa_index\nbench_dir/scene_id/sample_id\nscoped per-sample!]
    D --> F[_parse_sample_outputs\noutputs.jsonl + qa_index]
    F --> G[parse_answer\nraw_output, question]
    G --> G1[_parse_binary\ncascading regex\nAnswer: Yes/No]
    G --> G2[_parse_mcq\ncascading regex\nAnswer: A-D]
    F --> H[generate_sample_report\nTier 1 report.json]

    B --> I[generate_dataset_report\nTier 2 report.json]

    A --> J[generate_run_report\nTier 3 report.json]

    A --> K{html_enabled?}
    K -->|Yes| L[generate_root_aggregate\nTier 4]
    L --> M[generate_all_html\ncdb_reports_dir]

Per-sample QA Index Scoping

Question IDs (CI1, NI1, WC1, …) are reused across every scene in the benchmark. The global _load_qa_index(bench_dir) approach — which rglobs all 49+ scenes at once — causes the last alphabetical scene's question to overwrite all others, producing wrong question texts in reports.

flowchart LR
    subgraph old["❌ Old: global index"]
        A1[_load_qa_index\nall scenes] --> A2[CI1 → last scene wins\nwrong question text]
    end
    subgraph new["✅ New: per-sample index"]
        B1["_load_sample_qa_index\nbench_dir/scene/sample"] --> B2["CI1 → this sample only\ncorrect question text"]
    end

_load_sample_qa_index(sample_bench_dir) reads only the qa/ subfolder of the specific sample being evaluated, eliminating cross-scene ID collisions.

Answer Extraction

AlpamayoPostprocessor.parse_answer() searches the model's chain-of-causation text with cascading regex patterns:

Binary (Yes/No):

Answer:\s*(Yes|No) — highest priority
answer\s+is\s+(Yes|No)
^(Yes|No)[.,\s] — line-start
\b(Yes|No)\b — fallback

MCQ (A–D):

Answer:\s*([A-D])
answer\s+is\s+([A-D])
Option\s+([A-D])
([A-D])[).]\s
^\s*([A-D])\s*$
\b([A-D])\b — fallback

The model's output may optionally be wrapped in <think>...</think> tags (reasoning model style). Both the inner content and the full text are searched in order.

Report Tiers

flowchart TD
    T1["Tier 1 — sample\ncdb_outputs/‹run›/‹dataset›/‹scene›/‹sample›/report.json\nper-sample QA results + accuracy"]
    T2["Tier 2 — dataset\ncdb_outputs/‹run›/‹dataset›/report.json\naggregated across all samples"]
    T3["Tier 3 — run\ncdb_outputs/‹run›/report.json\naggregated across all datasets in this run"]
    T4["Tier 4 — root\ncdb_reports/‹run›/index.html\naggregated across all runs in cdb_outputs"]

    T1 -->|read by generate_dataset_report| T2
    T2 -->|read by generate_run_report| T3
    T3 -->|read by generate_root_aggregate| T4

Tier 1 reports are incremental: skipped if their stored schema_version matches eval_config.yaml. Bump reporting.schema_version to force full regeneration.

Tiers 2–4 always regenerate (fast: only read pre-computed JSON).

postprocessor = AlpamayoPostprocessor(model_cfg={})
# Tier 1 + 2 generated by _process_dataset():
_process_dataset(run_dir, "causal_nuscenes", bench_dir, postprocessor, ...)

# Tier 3:
generate_run_report(run_name, run_dir, run_dir / "report.json", ...)

# Tier 4 + HTML:
root = generate_root_aggregate(cfg.dir_cdb_outputs, schema_version)
generate_all_html(run_name, cfg.dir_cdb_outputs, cfg.dir_cdb_reports, root)

raw_output format — _extract_text() handles both schemas:

raw_output = {"text": "Answer: No\nReasoning: ..."}  # new dict format
raw_output = "Answer: No\nReasoning: ..."             # legacy plain string

Evaluation Modes

Mode	`--mode`	Description
Full	`full`	All scenes in the benchmark
Subset	`subset --subset-size N`	N randomly sampled scenes
Single	`single --scene <name>`	Exactly one scene (fastest, for debugging)

Pass identical mode flags to both inference.py and postprocess.py so they operate on the same set of samples.

LLM Judge (Optional)

For reasoning evaluation beyond exact-match accuracy. Scores each model response on four dimensions:

Dimension	Description	Score
`spatial_grounding`	Correctly identifies locations and positions of agents	0–2
`mechanism_identification`	Explains the causal mechanism driving the event	0–2
`conditionality_awareness`	Understands conditional/hypothetical nature	0–2
`answer_consistency`	Final answer matches the reasoning given	0–2

Composite score = mean of all four dimensions (0–2 scale).

metrics = postprocessor.compute_metrics(
    outputs_path="outputs.jsonl",
    bench_dir="...",
    use_llm_judge=True,
    judge_cfg={
        "provider": "anthropic",          # "anthropic" | "openai" | "azure_foundry"
        "model": "claude-opus-4-6",
        "api_key_env": "ANTHROPIC_API_KEY",
    },
)

Three supported judge backends:

Provider	`provider` value	Notes
Anthropic	`"anthropic"`	Recommended; uses `claude-opus-4-6`
OpenAI	`"openai"`	Uses `gpt-4o` or similar
Azure AI Foundry	`"azure_foundry"`	Azure-hosted Claude or OpenAI models

Configure in eval_config.yaml under evaluation.judge:

evaluation:
  use_llm_judge: false   # set true to enable
  judge:
    provider: anthropic
    model: claude-opus-4-6
    api_key_env: ANTHROPIC_API_KEY

Orchestrator (Multi-model)

The top-level orchestrate.py runs the full pipeline for one or more models:

# Single model, single-scene debug
python evaluation/orchestrate.py \
    --model alpamayo_1_0 \
    --mode single --scene nuscenes-scene-0159

# Subset run
python evaluation/orchestrate.py \
    --model alpamayo_1_0 \
    --dataset causal_nuscenes \
    --mode subset --subset-size 50

# Multiple models, full benchmark
python evaluation/orchestrate.py \
    --models alpamayo_1_0,drivelm \
    --mode full

# Dry run (print commands only)
python evaluation/orchestrate.py --model alpamayo_1_0 --dry-run

The orchestrator reads eval_config.yaml for model registry, dataset paths, and qa_types. Only models with enabled: true are run.

Metrics

Metric	Description
`overall.accuracy`	Exact-match accuracy across all questions
`overall.n`	Total number of questions evaluated
`per_qa_type.ladder`	Accuracy for active causal-ladder questions (MCQ)
`per_qa_type.dormant`	Accuracy for dormant link questions (binary)
`per_qa_type.distractor`	Accuracy for distractor node questions (binary)
`confusion`	Confusion matrix and most-confused pairs
`reasoning_score` (optional)	LLM judge composite and per-dimension scores (0–2)

All metrics are computed by pure-Python functions in common/metrics.py with no GPU dependencies.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CausalDriveBench Model Evaluation Pipeline

Pipeline Overview

Stage 1: Data Loading

Call Graph

Stage 2: Inference (Alpamayo 1.0)

Call Graph

Message Structure

`run_from_jsonl()` loop

Stage 3: Post-processing

Call Graph

Per-sample QA Index Scoping

Answer Extraction

Report Tiers

Evaluation Modes

LLM Judge (Optional)

Orchestrator (Multi-model)

Metrics

FilesExpand file tree

evaluation_pipeline.md

Latest commit

History

evaluation_pipeline.md

File metadata and controls

CausalDriveBench Model Evaluation Pipeline

Pipeline Overview

Stage 1: Data Loading

Call Graph

Stage 2: Inference (Alpamayo 1.0)

Call Graph

Message Structure

run_from_jsonl() loop

Stage 3: Post-processing

Call Graph

Per-sample QA Index Scoping

Answer Extraction

Report Tiers

Evaluation Modes

LLM Judge (Optional)

Orchestrator (Multi-model)

Metrics

`run_from_jsonl()` loop