Add vLLM OpenQA generation and LLM-as-judge scoring scripts #12

Draft
naufalso wants to merge 2 commits into main from codex/create-generate_openqa_ans-and-score_openqa_ans-scripts

Collaborator

@naufalso naufalso commented Feb 3, 2026

Motivation

  • Provide efficient offline batch generation of OpenQA model answers using vLLM to produce a JSONL of model outputs for downstream evaluation.
  • Provide an automated LLM-as-judge scoring tool to assess factual correctness and quality (0-10) for generated answers using the same vLLM backend.

Description

  • Add eval/generate_openqa_ans.py which loads a HuggingFace dataset or local JSON/JSONL, formats prompts (optional chat/system template), runs batched vLLM inference, and writes records with a generated_text field to an output JSONL file.
  • Add eval/score_openqa_ans.py which reads the generated JSONL, constructs a judge prompt using the provided system template, runs batched vLLM inference, parses <correctness> and <score> via regex, retries failed parses up to --max-trials, and appends judge_raw, judge_correctness, judge_score, and judge_parse_error when applicable.
  • Both scripts expose CLI arguments for model, batching, sampling (--max-tokens, --temperature, --top-p, --top-k, --seed), optional chat-template formatting via --use-chat-template, and an optional --max-model-len override for vLLM usage.
  • Scripts rely on datasets, transformers, and vllm for data loading, tokenizer chat templates, and inference respectively.
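
The regex parsing step described above can be sketched as follows. This is a minimal illustration, not the PR's code: the exact tag layout (`<correctness>True</correctness>`, `<score>7</score>`), the function name `parse_judge_output`, and the error strings are assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Assumed tag layout: <correctness>True</correctness> and <score>7</score>.
CORRECTNESS_RE = re.compile(
    r"<correctness>\s*(true|false)\s*</correctness>", re.IGNORECASE
)
SCORE_RE = re.compile(r"<score>\s*(\d+(?:\.\d+)?)\s*</score>", re.IGNORECASE)


def parse_judge_output(
    text: str,
) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
    """Return (correctness, score, error); error is None when parsing succeeds."""
    correctness_match = CORRECTNESS_RE.search(text)
    score_match = SCORE_RE.search(text)
    if correctness_match is None:
        return None, None, "missing <correctness> tag"
    if score_match is None:
        return None, None, "missing <score> tag"
    score = float(score_match.group(1))
    # The PR describes a 0-10 scale, so anything outside it counts as a failure.
    if not 0.0 <= score <= 10.0:
        return None, None, f"score out of range: {score}"
    correctness = correctness_match.group(1).lower() == "true"
    return correctness, score, None
```

A non-None error value would correspond to the `judge_parse_error` field and mark the record for another trial.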

Testing

  • No automated tests were executed as part of this change (no CI/test run was requested or performed).

Codex Task

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a6effc709e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread eval/score_openqa_ans.py Outdated
Comment on lines +221 to +225
    while pending_indices and trial < args.max_trials:
        next_pending = []
        for batch_indices in iter_batches(pending_indices, args.batch_size):
            prompts = []
            for idx in batch_indices:

[P2] Make retries non-deterministic for parse failures

This retry loop reuses the same sampling_params each pass, and those params include a fixed seed set once before the loop. With the default --temperature 0.0 (or any run where a fixed seed is honored), llm.generate will deterministically return the same output for every retry, so --max-trials cannot recover from malformed judge outputs and will just repeat the same parse failure. Consider varying the seed or sampling params per trial (or skipping retries when deterministic) so retries can actually produce different outputs.
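
One way to act on this suggestion is sketched below. It is a hedged illustration, not the PR's loop: `score_with_retries`, `generate_fn`, `parse_fn`, and `retry_temperature` are hypothetical names, and the generation call is injected as a plain callable so the retry logic can be read and tested without a GPU.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signatures: generate_fn(prompts, seed, temperature) returns one
# text per prompt; parse_fn(text) returns a dict, or None when parsing fails.
GenerateFn = Callable[[List[str], int, float], List[str]]
ParseFn = Callable[[str], Optional[dict]]


def score_with_retries(
    prompts: List[str],
    generate_fn: GenerateFn,
    parse_fn: ParseFn,
    base_seed: int = 0,
    temperature: float = 0.0,
    retry_temperature: float = 0.7,
    max_trials: int = 3,
) -> Dict[int, dict]:
    """Retry unparsable outputs with a different seed (and non-zero temperature)."""
    results: Dict[int, dict] = {}
    pending = list(range(len(prompts)))
    trial = 0
    while pending and trial < max_trials:
        # The first pass keeps the user's settings. Retries shift the seed and
        # raise a zero temperature so the model can produce a different output.
        seed = base_seed + trial
        temp = temperature if trial == 0 else max(temperature, retry_temperature)
        texts = generate_fn([prompts[i] for i in pending], seed, temp)
        next_pending = []
        for idx, text in zip(pending, texts):
            parsed = parse_fn(text)
            if parsed is None:
                next_pending.append(idx)
            else:
                results[idx] = parsed
        pending = next_pending
        trial += 1
    return results
```

With vLLM, `generate_fn` could build a fresh `SamplingParams(seed=seed, temperature=temp, ...)` on each call before invoking `llm.generate`. Indices missing from the returned dict are the records that still failed after `max_trials`.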

Useful? React with 👍 / 👎.

Contributor

Copilot AI left a comment

Pull request overview

Adds two standalone evaluation utilities to (1) generate OpenQA-style answers with vLLM in offline batches and (2) score those answers with an LLM-as-judge, including regex-based parsing and retry logic.

Changes:

  • Add eval/generate_openqa_ans.py for batched OpenQA answer generation to JSONL using vLLM (optionally with tokenizer chat templates).
  • Add eval/score_openqa_ans.py for batched LLM-as-judge scoring with retries and structured fields (judge_*) appended to each JSONL record.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
eval/generate_openqa_ans.py | New CLI script to load HF/local JSON(L), build prompts, run vLLM batch inference, and emit JSONL with generated_text.
eval/score_openqa_ans.py | New CLI script to read generated JSONL, run a judge model via vLLM, parse <correctness>/<score>, retry failures, and write augmented JSONL.

Comment thread eval/score_openqa_ans.py
Return exactly these three blocks in order. Do not add text outside the tags.

<analysis>
Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answe.

Copilot AI Feb 3, 2026

Typo in the SYSTEM_PROMPT: the text ends with "reference_answe." which looks truncated/misspelled and should be "reference_answer" (to match the earlier variable name and improve judge instruction clarity).

Suggested change
- Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answe.
+ Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answer.

Copilot uses AI. Check for mistakes.
Comment thread eval/score_openqa_ans.py
Comment on lines +57 to +59
- Use chain-of-thought privately, but present only a final analysis in <analysis>.
- Be strict on correctness: any factual error → correctness=False. If correctness=False, cap score at 3 or lower.
- If correct but shallow, keep correctness=True and assign a lower score.

Copilot AI Feb 3, 2026

The judge instructions are internally inconsistent: they say "Use chain-of-thought privately" but then ask for detailed, step-by-step reasoning inside the <analysis> block. This tends to elicit hidden-reasoning style outputs and makes outputs less predictable for parsing. Consider removing the chain-of-thought wording and instead requesting a concise justification (still inside <analysis>) without step-by-step reasoning.

Comment on lines +124 to +127
args = parse_args()
dataset = load_data(args)
records = list(dataset)


Copilot AI Feb 3, 2026

records = list(dataset) materializes the full dataset in memory before generation. For large HF datasets this can be prohibitively expensive; consider iterating the dataset directly (or using streaming) and batching over an iterator so memory usage stays bounded.
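
Batching over an iterator, as suggested here, could look like the sketch below. The helper name `iter_batches_lazy` is hypothetical and is not the `iter_batches` function from the PR; it only shows how memory can stay bounded by one batch at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches_lazy(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items from the shared iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Any iterable works as input, including a dataset loaded in streaming mode, so each batch can be generated and written out before the next one is read.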

Comment thread eval/score_openqa_ans.py
Comment on lines +194 to +198
def load_jsonl(path: Path) -> List[dict]:
    with path.open("r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]



Copilot AI Feb 3, 2026

load_jsonl loads the entire input file into memory at once. For large JSONL evaluation runs this can cause high memory usage; consider a streaming/iterator approach (and writing outputs incrementally) so scoring can run in bounded memory.

Suggested change
- def load_jsonl(path: Path) -> List[dict]:
-     with path.open("r", encoding="utf-8") as file:
-         return [json.loads(line) for line in file if line.strip()]
+ class JsonlRows:
+     """Lightweight, list-like wrapper for a JSONL file.
+
+     This avoids loading all JSON objects into memory at once. Instead, it
+     stores byte offsets for each non-empty line and lazily deserializes
+     items on demand.
+     """
+
+     def __init__(self, path: Path) -> None:
+         self._path = path
+         self._offsets: List[int] = []
+         # Record the byte offsets of all non-empty lines.
+         with path.open("rb") as f:
+             while True:
+                 pos = f.tell()
+                 line = f.readline()
+                 if not line:
+                     break
+                 if line.strip():
+                     self._offsets.append(pos)
+
+     def __len__(self) -> int:
+         return len(self._offsets)
+
+     def __getitem__(self, index: int) -> dict:
+         # Support negative indices like a normal sequence.
+         if index < 0:
+             index += len(self._offsets)
+         if index < 0 or index >= len(self._offsets):
+             raise IndexError("JsonlRows index out of range")
+         offset = self._offsets[index]
+         # Open in binary mode to match recorded byte offsets.
+         with self._path.open("rb") as f:
+             f.seek(offset)
+             line = f.readline()
+         # Decode and parse JSON for this single line.
+         return json.loads(line.decode("utf-8"))
+
+     def __iter__(self):
+         for i in range(len(self._offsets)):
+             yield self[i]
+
+
+ def load_jsonl(path: Path) -> JsonlRows:
+     """Return a lazy, list-like view over a JSONL file."""
+     return JsonlRows(path)
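
The suggestion above covers lazy reading but not the incremental writing mentioned in the comment. A single-pass alternative that does both might look like this sketch; `iter_jsonl` and `transform_jsonl` are hypothetical helpers, and `transform` stands in for whatever per-record scoring the script performs.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    with path.open("r", encoding="utf-8") as file:
        for line in file:
            if line.strip():
                yield json.loads(line)


def transform_jsonl(src: Path, dst: Path, transform: Callable[[dict], dict]) -> int:
    """Stream records from src, apply transform, and write each result at once."""
    count = 0
    with dst.open("w", encoding="utf-8") as out:
        for record in iter_jsonl(src):
            out.write(json.dumps(transform(record), ensure_ascii=False) + "\n")
            count += 1
    return count
```

A pure generator cannot be indexed, so a retry loop that tracks pending records by position would need to retry within each batch instead. That trade-off is presumably why the suggestion above keeps a list-like interface.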

Comment thread eval/score_openqa_ans.py
Comment on lines +163 to +172
    if use_chat_template:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ]
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )

Copilot AI Feb 3, 2026

build_prompt accepts tokenizer: Optional[AutoTokenizer], but unconditionally calls tokenizer.apply_chat_template(...) when use_chat_template is true. Since this is only safe when tokenizer is non-None, consider tightening the type (non-Optional) for that codepath or adding an explicit assertion to prevent accidental misuse/refactors from introducing a None dereference.
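
A guarded version of this codepath is sketched below. The function signature, the placeholder `SYSTEM_PROMPT`, and the plain-text fallback are assumptions for illustration, since the PR's full `build_prompt` is not shown in this thread.

```python
from typing import Any, Optional

# Placeholder for illustration only; not the PR's actual judge prompt.
SYSTEM_PROMPT = "You are a strict judge."


def build_prompt(
    user_prompt: str,
    tokenizer: Optional[Any],
    use_chat_template: bool,
) -> str:
    """Build the judge prompt, failing early if the tokenizer is missing."""
    if use_chat_template:
        if tokenizer is None:
            raise ValueError("use_chat_template=True requires a tokenizer")
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ]
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
    # Assumed fallback format; the PR's non-template branch is not shown.
    return f"{SYSTEM_PROMPT}\n\n{user_prompt}"
```

Raising `ValueError` rather than using `assert` keeps the check active when Python runs with `-O`, which strips assertions.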

@naufalso naufalso marked this pull request as draft February 4, 2026 08:05