Add vLLM OpenQA generation and LLM-as-judge scoring scripts #12
naufalso wants to merge 2 commits into
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a6effc709e

    while pending_indices and trial < args.max_trials:
        next_pending = []
        for batch_indices in iter_batches(pending_indices, args.batch_size):
            prompts = []
            for idx in batch_indices:

Make retries non-deterministic for parse failures
This retry loop reuses the same sampling_params each pass, and those params include a fixed seed set once before the loop. With the default --temperature 0.0 (or any run where a fixed seed is honored), llm.generate will deterministically return the same output for every retry, so --max-trials cannot recover from malformed judge outputs and will just repeat the same parse failure. Consider varying the seed or sampling params per trial (or skipping retries when deterministic) so retries can actually produce different outputs.
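One way to act on this suggestion is to derive the sampling settings from the trial number. The sketch below is only an illustration: `sampling_kwargs_for_trial` and `retry_temperature` are hypothetical names that do not appear in the PR, and the dictionary is assumed to hold keyword arguments (such as `seed` and `temperature`) that the script would pass to vLLM's `SamplingParams`.

```python
from typing import Any, Dict, Optional


def sampling_kwargs_for_trial(
    base_kwargs: Dict[str, Any],
    trial: int,
    retry_temperature: float = 0.7,  # assumed fallback, not a value from the PR
) -> Dict[str, Any]:
    """Return a copy of base_kwargs adjusted so that retries can differ from trial 0."""
    kwargs = dict(base_kwargs)  # never mutate the caller's settings
    if trial == 0:
        return kwargs  # first pass: honor the user's settings exactly
    seed: Optional[int] = kwargs.get("seed")
    if seed is not None:
        kwargs["seed"] = seed + trial  # a different seed for each retry
    if kwargs.get("temperature", 1.0) == 0.0:
        # Greedy decoding repeats the same output, so retries need some randomness.
        kwargs["temperature"] = retry_temperature
    return kwargs
```

Inside the retry loop, the script could then rebuild its sampling parameters from `sampling_kwargs_for_trial(base_kwargs, trial)` on each pass instead of reusing a single object. The alternative named in the review, skipping retries when the configuration is deterministic, would avoid changing the temperature at all.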
Pull request overview
Adds two standalone evaluation utilities to (1) generate OpenQA-style answers with vLLM in offline batches and (2) score those answers with an LLM-as-judge, including regex-based parsing and retry logic.
Changes:
- Add eval/generate_openqa_ans.py for batched OpenQA answer generation to JSONL using vLLM (optionally with tokenizer chat templates).
- Add eval/score_openqa_ans.py for batched LLM-as-judge scoring with retries and structured fields (judge_*) appended to each JSONL record.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| eval/generate_openqa_ans.py | New CLI script to load HF/local JSON(L), build prompts, run vLLM batch inference, and emit JSONL with generated_text. |
| eval/score_openqa_ans.py | New CLI script to read generated JSONL, run a judge model via vLLM, parse <correctness>/<score>, retry failures, and write augmented JSONL. |
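
The regex-based parsing step described for the scoring script can be sketched roughly as follows. This is not the PR's code: the function name, the exact tag layout (`<correctness>True</correctness>`, `<score>4</score>`), and the error message format are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed tag formats; the PR's actual patterns may differ.
CORRECTNESS_RE = re.compile(r"<correctness>\s*(true|false)\s*</correctness>", re.IGNORECASE)
SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")


def parse_judge_output(text: str) -> Tuple[Optional[bool], Optional[int], Optional[str]]:
    """Return (correctness, score, error); error is None when both fields parse."""
    correctness_match = CORRECTNESS_RE.search(text)
    score_match = SCORE_RE.search(text)
    if correctness_match is None or score_match is None:
        found = (("correctness", correctness_match), ("score", score_match))
        missing = [name for name, match in found if match is None]
        return None, None, "missing tag(s): " + ", ".join(missing)
    is_correct = correctness_match.group(1).lower() == "true"
    return is_correct, int(score_match.group(1)), None
```

A record whose output fails to parse would keep the error string (for example in a field like judge_parse_error) and be queued for another trial.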

    Return exactly these three blocks in order. Do not add text outside the tags.

    <analysis>
    Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answe.

Typo in the SYSTEM_PROMPT: the text ends with "reference_answe." which looks truncated/misspelled and should be "reference_answer" (to match the earlier variable name and improve judge instruction clarity).
Suggested change:

    - Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answe.
    + Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answer.


    - Use chain-of-thought privately, but present only a final analysis in <analysis>.
    - Be strict on correctness: any factual error → correctness=False. If correctness=False, cap score at 3 or lower.
    - If correct but shallow, keep correctness=True and assign a lower score.

The judge instructions are internally inconsistent: they say "Use chain-of-thought privately" but then ask for detailed, step-by-step reasoning inside the <analysis> block. This tends to elicit hidden-reasoning-style outputs and makes them less predictable to parse. Consider removing the chain-of-thought wording and instead requesting a concise justification (still inside <analysis>) without step-by-step reasoning.

    args = parse_args()
    dataset = load_data(args)
    records = list(dataset)

records = list(dataset) materializes the full dataset in memory before generation. For large HF datasets this can be prohibitively expensive; consider iterating the dataset directly (or using streaming) and batching over an iterator so memory usage stays bounded.
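A minimal sketch of batching over an iterator, so that only one batch of records is held in memory at a time. The helper name `batched_records` is hypothetical and is not the PR's own batching function; it assumes the dataset object can simply be iterated, which holds for an ordinary iterable and should also hold for a Hugging Face dataset loaded in streaming mode.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_records(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))  # pull at most batch_size items
        if not batch:
            return  # input exhausted
        yield batch
```

With this shape, generation would loop over `batched_records(dataset, args.batch_size)` and write each batch's outputs before pulling the next one, rather than calling `list(dataset)` up front.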

    def load_jsonl(path: Path) -> List[dict]:
        with path.open("r", encoding="utf-8") as file:
            return [json.loads(line) for line in file if line.strip()]

load_jsonl loads the entire input file into memory at once. For large JSONL evaluation runs this can cause high memory usage; consider a streaming/iterator approach (and writing outputs incrementally) so scoring can run in bounded memory.
Suggested change:

    -def load_jsonl(path: Path) -> List[dict]:
    -    with path.open("r", encoding="utf-8") as file:
    -        return [json.loads(line) for line in file if line.strip()]
    +class JsonlRows:
    +    """Lightweight, list-like wrapper for a JSONL file.
    +
    +    This avoids loading all JSON objects into memory at once. Instead, it
    +    stores byte offsets for each non-empty line and lazily deserializes
    +    items on demand.
    +    """
    +
    +    def __init__(self, path: Path) -> None:
    +        self._path = path
    +        self._offsets: List[int] = []
    +        # Record the byte offsets of all non-empty lines.
    +        with path.open("rb") as f:
    +            while True:
    +                pos = f.tell()
    +                line = f.readline()
    +                if not line:
    +                    break
    +                if line.strip():
    +                    self._offsets.append(pos)
    +
    +    def __len__(self) -> int:
    +        return len(self._offsets)
    +
    +    def __getitem__(self, index: int) -> dict:
    +        # Support negative indices like a normal sequence.
    +        if index < 0:
    +            index += len(self._offsets)
    +        if index < 0 or index >= len(self._offsets):
    +            raise IndexError("JsonlRows index out of range")
    +        offset = self._offsets[index]
    +        # Open in binary mode to match recorded byte offsets.
    +        with self._path.open("rb") as f:
    +            f.seek(offset)
    +            line = f.readline()
    +        # Decode and parse JSON for this single line.
    +        return json.loads(line.decode("utf-8"))
    +
    +    def __iter__(self):
    +        for i in range(len(self._offsets)):
    +            yield self[i]
    +
    +
    +def load_jsonl(path: Path) -> JsonlRows:
    +    """Return a lazy, list-like view over a JSONL file."""
    +    return JsonlRows(path)

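For comparison, the "streaming/iterator approach (and writing outputs incrementally)" mentioned in the comment could be as small as a generator plus a single-pass writer. This is a sketch under assumptions: `iter_jsonl` and `stream_transform` are hypothetical names, and `transform` stands in for whatever per-record scoring the script performs.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, without holding the file in memory."""
    with path.open("r", encoding="utf-8") as file:
        for line in file:
            if line.strip():
                yield json.loads(line)


def stream_transform(src: Path, dst: Path, transform: Callable[[dict], dict]) -> int:
    """Apply transform to each record and write it out immediately; return the count."""
    count = 0
    with dst.open("w", encoding="utf-8") as out:
        for record in iter_jsonl(src):
            out.write(json.dumps(transform(record), ensure_ascii=False) + "\n")
            count += 1
    return count
```

A single forward pass like this is simpler than the offset index, but it offers no random access. If the retry loop needs to revisit records by index, the list-like wrapper in the suggestion above is likely the better fit.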

    if use_chat_template:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ]
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )

build_prompt accepts tokenizer: Optional[AutoTokenizer], but unconditionally calls tokenizer.apply_chat_template(...) when use_chat_template is true. Since this is only safe when tokenizer is non-None, consider tightening the type (non-Optional) for that codepath or adding an explicit assertion to prevent accidental misuse/refactors from introducing a None dereference.
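The explicit check could look roughly like the sketch below. Several details are assumptions rather than the PR's code: the parameter order of `build_prompt`, the placeholder `SYSTEM_PROMPT` text, and the plain-text fallback used when no chat template is requested. The tokenizer is typed as `Any` so the example runs without `transformers`.

```python
from typing import Any, Optional

SYSTEM_PROMPT = "You are a strict judge."  # placeholder, not the PR's prompt


def build_prompt(user_prompt: str, tokenizer: Optional[Any], use_chat_template: bool) -> str:
    """Build the judge prompt, failing loudly if a chat template is requested without a tokenizer."""
    if not use_chat_template:
        # Assumed fallback format for the non-chat path.
        return f"{SYSTEM_PROMPT}\n\n{user_prompt}"
    if tokenizer is None:
        raise ValueError("use_chat_template=True requires a tokenizer")
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
```

Raising `ValueError` rather than using `assert` keeps the check active even when Python runs with optimizations enabled.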
… configuration options
Motivation
Description
- Add eval/generate_openqa_ans.py, which loads a Hugging Face dataset or local JSON/JSONL, formats prompts (optional chat/system template), runs batched vLLM inference, and writes records with a generated_text field to an output JSONL file.
- Add eval/score_openqa_ans.py, which reads the generated JSONL, constructs a judge prompt using the provided system template, runs batched vLLM inference, parses <correctness> and <score> via regex, retries failed parses up to --max-trials, and appends judge_raw, judge_correctness, judge_score, and judge_parse_error when applicable.
- … (--max-tokens, --temperature, --top-p, --top-k, --seed), optional chat-template formatting via --use-chat-template, and an optional --max-model-len override for vLLM usage.
- … datasets, transformers, and vllm for data loading, tokenizer chat templates, and inference respectively.
Testing