This directory contains the end-to-end evaluation script for LLaVA-LE.
Evaluation questions and images are loaded automatically from the
`pcvlab/lucid` HuggingFace dataset (evaluation config, test split: 50 images, 190 questions).
No local question files or image folders are required.
The pipeline has two main stages:

**Stage 1: answer generation.** Runs a LLaVA-style model on the lunar evaluation set and writes each model's answers to a `.jsonl` file. This is done for three models:
| Model | Description |
|---|---|
| LLaVA-LE-S1 | Stage 1 fine-tuned LoRA checkpoint |
| LLaVA-LE-S2 | Stage 2 fine-tuned LoRA checkpoint |
| LLaVA | Base LLaVA (no fine-tuning), vision-language baseline |
GPT and Gemini text-only answer files are generated separately and must exist before the judge step runs.

**Stage 2: judging.** Sends all five models' answers (LLaVA-LE-S1, LLaVA-LE-S2, GPT-TEXT-ONLY, GEMINI-TEXT-ONLY, LLaVA), together with the reference caption, to a GPT judge model. The judge scores each answer on a 1–10 scale and produces:
- `judge_scores.jsonl` — per-question scores for all models
- `judge_summary.json` — aggregated statistics (mean, std, per question-type breakdown)
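The per-question scores file can be re-aggregated offline with a few lines of Python. This is a minimal sketch: the field names `"model"` and `"score"` are assumptions, so check the schema of your actual `judge_scores.jsonl` before relying on them.

```python
import json
from collections import defaultdict
from statistics import mean, pstdev

def summarize_scores(path):
    """Aggregate per-question judge scores into per-model statistics.

    Assumes each JSONL line carries "model" and "score" fields
    (hypothetical names -- verify against your judge_scores.jsonl).
    """
    per_model = defaultdict(list)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            per_model[record["model"]].append(record["score"])
    return {
        model: {"n": len(scores), "mean": mean(scores), "std": pstdev(scores)}
        for model, scores in per_model.items()
    }
```

This reproduces the mean/std numbers the pipeline writes to `judge_summary.json`, which is handy when you want a breakdown the summary file does not include.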
```bash
export OPENAI_API_KEY="sk-..."
bash scripts/eval/run_eval.sh [OPTIONS]
```

Run the full pipeline with default paths:

```bash
bash scripts/eval/run_eval.sh
```

Override checkpoint paths:

```bash
bash scripts/eval/run_eval.sh \
    --stage1-model-path LLaVA-LE/checkpoints/stage_1/my_stage1_checkpoint \
    --stage2-model-path LLaVA-LE/checkpoints/stage_2/my_stage2_checkpoint
```

Skip answer generation and only run the judge (answers already exist):

```bash
bash scripts/eval/run_eval.sh --skip-generate
```

Use a different judge model and output directory:

```bash
bash scripts/eval/run_eval.sh \
    --judge-model gpt-4-turbo \
    --output-dir ./my_eval_outputs
```

| Flag | Default | Description |
|---|---|---|
| `--stage1-model-path` | `LLaVA-LE/checkpoints/stage_1/llava-v1.5-13b-task-lora-stage1_20260221_001701` | Stage 1 LoRA checkpoint |
| `--stage2-model-path` | `LLaVA-LE/checkpoints/stage_2/llava-v1.5-13b-task-lora-stage2_20260225_221920` | Stage 2 LoRA checkpoint |
| `--model-base` | `liuhaotian/llava-v1.5-13b` | Base model ID (HuggingFace or local) |
| `--hf-dataset` | `pcvlab/lucid` | HuggingFace dataset repo ID; evaluation questions and images are loaded from the evaluation config, test split (no local files needed) |
| `--temperature` | `0.2` | Sampling temperature for generation |
| `--conv-mode` | `vicuna_v1` | Conversation template |
| `--output-dir` | `./eval_outputs` | Root output directory |
| `--judge-model` | `gpt-4o` | OpenAI model used as judge |
| `--gpt-text-only-answers` | `<output-dir>/model_answers/gpt_new_answers.json` | GPT text-only answers file |
| `--gemini-text-only-answers` | `<output-dir>/model_answers/text_gemini_answers.jsonl` | Gemini text-only answers file |
| `--skip-generate` | — | Skip steps 1–3; only run the judge |
```
eval_outputs/
├── model_answers/
│   ├── stage1_answers.jsonl        # LLaVA-LE-S1
│   ├── stage2_answers.jsonl        # LLaVA-LE-S2
│   ├── basellava_answers.jsonl     # LLaVA (base)
│   ├── gpt_new_answers.json        # GPT-TEXT-ONLY (pre-existing)
│   └── text_gemini_answers.jsonl   # GEMINI-TEXT-ONLY (pre-existing)
└── judge/
    ├── judge_scores.jsonl
    └── judge_summary.json
```
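Since the judge step needs all five answer files present (the GPT and Gemini files are generated separately), it can help to sanity-check the directory before running with `--skip-generate`. A minimal sketch, assuming the default layout above; the file names come from the tree, but the helper itself is hypothetical:

```python
from pathlib import Path

# Answer files the judge step reads, under the default --output-dir layout.
ANSWER_FILES = [
    "stage1_answers.jsonl",       # LLaVA-LE-S1
    "stage2_answers.jsonl",       # LLaVA-LE-S2
    "basellava_answers.jsonl",    # LLaVA (base)
    "gpt_new_answers.json",       # GPT-TEXT-ONLY (pre-existing)
    "text_gemini_answers.jsonl",  # GEMINI-TEXT-ONLY (pre-existing)
]

def missing_answer_files(output_dir="eval_outputs"):
    """Return the answer files the judge step would fail to find."""
    answers_dir = Path(output_dir) / "model_answers"
    return [name for name in ANSWER_FILES if not (answers_dir / name).is_file()]
```

An empty return value means the judge step has everything it needs; otherwise, the returned names tell you which generation (or pre-existing text-only) files are still missing.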