Evaluation Pipeline

This directory contains the end-to-end evaluation script for LLaVA-LE.

Overview

Evaluation questions and images are loaded automatically from the pcvlab/lucid HuggingFace dataset (evaluation config, test split — 50 images, 190 questions). No local question files or image folders are required.

The pipeline has two main stages:

1. Answer Generation — `llava/eval/model_vqa_lunar.py`

Runs a LLaVA-style model on the lunar evaluation set and writes each model's answers to a .jsonl file. This is done for three models:

Model	Description
LLaVA-LE-S1	Stage 1 fine-tuned LoRA checkpoint
LLaVA-LE-S2	Stage 2 fine-tuned LoRA checkpoint
LLaVA	Base LLaVA (no fine-tuning), vision-language baseline

GPT and Gemini text-only answer files are generated separately and must exist before running the judge step.

2. Judge Scoring — `llava/eval/caption_judge.py`

Sends all five model answers (LLaVA-LE-S1, LLaVA-LE-S2, GPT-TEXT-ONLY, GEMINI-TEXT-ONLY, LLaVA) together with the reference caption to a GPT judge model. The judge scores each answer on a 1–10 scale and produces:

judge_scores.jsonl — per-question scores for all models
judge_summary.json — aggregated statistics (mean, std, per question-type breakdown)

Usage

export OPENAI_API_KEY="sk-..."
bash scripts/eval/run_eval.sh [OPTIONS]

Common examples

Run the full pipeline with default paths:

bash scripts/eval/run_eval.sh

Override checkpoint paths:

bash scripts/eval/run_eval.sh \
  --stage1-model-path LLaVA-LE/checkpoints/stage_1/my_stage1_checkpoint \
  --stage2-model-path LLaVA-LE/checkpoints/stage_2/my_stage2_checkpoint

Skip answer generation and only run the judge (answers already exist):

bash scripts/eval/run_eval.sh --skip-generate

Use a different judge model and output directory:

bash scripts/eval/run_eval.sh \
  --judge-model gpt-4-turbo \
  --output-dir ./my_eval_outputs

All options

Flag	Default	Description
`--stage1-model-path`	`LLaVA-LE/checkpoints/stage_1/llava-v1.5-13b-task-lora-stage1_20260221_001701`	Stage 1 LoRA checkpoint
`--stage2-model-path`	`LLaVA-LE/checkpoints/stage_2/llava-v1.5-13b-task-lora-stage2_20260225_221920`	Stage 2 LoRA checkpoint
`--model-base`	`liuhaotian/llava-v1.5-13b`	Base model ID (HuggingFace or local)
`--hf-dataset`	`pcvlab/lucid`	HuggingFace dataset repo ID; evaluation questions and images are loaded from the `evaluation` / `test` split — no local files needed
`--temperature`	`0.2`	Sampling temperature for generation
`--conv-mode`	`vicuna_v1`	Conversation template
`--output-dir`	`./eval_outputs`	Root output directory
`--judge-model`	`gpt-4o`	OpenAI model used as judge
`--gpt-text-only-answers`	`<output-dir>/model_answers/gpt_new_answers.json`	GPT text-only answers file
`--gemini-text-only-answers`	`<output-dir>/model_answers/text_gemini_answers.jsonl`	Gemini text-only answers file
`--skip-generate`	—	Skip steps 1–3; only run the judge

Output structure

eval_outputs/
├── model_answers/
│   ├── stage1_answers.jsonl          # LLaVA-LE-S1
│   ├── stage2_answers.jsonl          # LLaVA-LE-S2
│   ├── basellava_answers.jsonl       # LLaVA (base)
│   ├── gpt_new_answers.json          # GPT-TEXT-ONLY  (pre-existing)
│   └── text_gemini_answers.jsonl     # GEMINI-TEXT-ONLY (pre-existing)
└── judge/
    ├── judge_scores.jsonl
    └── judge_summary.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluation Pipeline

Overview

1. Answer Generation — `llava/eval/model_vqa_lunar.py`

2. Judge Scoring — `llava/eval/caption_judge.py`

Usage

Common examples

All options

Output structure

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Evaluation Pipeline

Overview

1. Answer Generation — llava/eval/model_vqa_lunar.py

2. Judge Scoring — llava/eval/caption_judge.py

Usage

Common examples

All options

Output structure

1. Answer Generation — `llava/eval/model_vqa_lunar.py`

2. Judge Scoring — `llava/eval/caption_judge.py`