Add vLLM OpenQA generation and LLM-as-judge scoring scripts by naufalso · Pull Request #12 · RISys-Lab/RedSage

naufalso · 2026-02-03T13:34:59Z

Motivation

Provide efficient offline batch generation of OpenQA model answers using vLLM to produce a JSONL of model outputs for downstream evaluation.
Provide an automated LLM-as-judge scoring tool to assess factual correctness and quality (0-10) for generated answers using the same vLLM backend.

Description

Add eval/generate_openqa_ans.py which loads a HuggingFace dataset or local JSON/JSONL, formats prompts (optional chat/system template), runs batched vLLM inference, and writes records with a generated_text field to an output JSONL file.
Add eval/score_openqa_ans.py which reads the generated JSONL, constructs a judge prompt using the provided system template, runs batched vLLM inference, parses <correctness> and <score> via regex, retries failed parses up to --max-trials, and appends judge_raw, judge_correctness, judge_score, and judge_parse_error when applicable.
Both scripts expose CLI arguments for model, batching, sampling (--max-tokens, --temperature, --top-p, --top-k, --seed), optional chat-template formatting via --use-chat-template, and an optional --max-model-len override for vLLM usage.
Scripts rely on datasets, transformers, and vllm for data loading, tokenizer chat templates, and inference respectively.
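The tag-parsing step in score_openqa_ans.py can be illustrated with a small sketch. The regexes, the function name `parse_judge_output`, and the error strings below are assumptions for illustration; the PR's actual patterns are not shown in this thread.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns; the script's real regexes may differ.
CORRECTNESS_RE = re.compile(r"<correctness>\s*(true|false)\s*</correctness>", re.IGNORECASE)
SCORE_RE = re.compile(r"<score>\s*(\d+(?:\.\d+)?)\s*</score>")


def parse_judge_output(raw: str) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
    """Return (correctness, score, error); error is None when parsing succeeds."""
    correctness_match = CORRECTNESS_RE.search(raw)
    score_match = SCORE_RE.search(raw)
    if correctness_match is None:
        return None, None, "missing <correctness> tag"
    if score_match is None:
        return None, None, "missing <score> tag"
    score = float(score_match.group(1))
    # The PR describes a 0-10 quality scale, so reject anything outside it.
    if not 0 <= score <= 10:
        return None, None, f"score out of range: {score}"
    return correctness_match.group(1).lower() == "true", score, None
```

A non-None error would correspond to the judge_parse_error field and mark the record for another trial.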

Testing

No automated tests were executed as part of this change (no CI/test run was requested or performed).

Codex Task

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a6effc709e


chatgpt-codex-connector · 2026-02-03T13:36:01Z

+    while pending_indices and trial < args.max_trials:
+        next_pending = []
+        for batch_indices in iter_batches(pending_indices, args.batch_size):
+            prompts = []
+            for idx in batch_indices:


Make retries non-deterministic for parse failures

This retry loop reuses the same sampling_params each pass, and those params include a fixed seed set once before the loop. With the default --temperature 0.0 (or any run where a fixed seed is honored), llm.generate will deterministically return the same output for every retry, so --max-trials cannot recover from malformed judge outputs and will just repeat the same parse failure. Consider varying the seed or sampling params per trial (or skipping retries when deterministic) so retries can actually produce different outputs.
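One way to act on this suggestion is to derive the sampling settings from the trial number. The sketch below is not the PR's code: `params_for_trial`, `run_with_retries`, the injected `generate` callable, and the 0.3 fallback temperature are all hypothetical. With vLLM, the returned values would presumably be passed as the `seed` and `temperature` arguments of `SamplingParams`.

```python
from typing import Callable, Dict, List, Optional


def params_for_trial(base_seed: Optional[int], temperature: float, trial: int) -> Dict[str, object]:
    """Build sampling settings for one retry pass (hypothetical helper)."""
    if trial == 0:
        # First pass: honor the caller's settings exactly.
        return {"seed": base_seed, "temperature": temperature}
    # Later passes: shift the seed and leave greedy decoding so output can change.
    retry_temperature = temperature if temperature > 0 else 0.3  # assumed fallback
    retry_seed = None if base_seed is None else base_seed + trial
    return {"seed": retry_seed, "temperature": retry_temperature}


def run_with_retries(
    prompts: List[str],
    generate: Callable[[List[str], Dict[str, object]], List[str]],
    is_parsable: Callable[[str], bool],
    base_seed: Optional[int],
    temperature: float,
    max_trials: int,
) -> Dict[int, str]:
    """Retry only the prompts whose outputs failed to parse."""
    results: Dict[int, str] = {}
    pending = list(range(len(prompts)))
    trial = 0
    while pending and trial < max_trials:
        params = params_for_trial(base_seed, temperature, trial)
        outputs = generate([prompts[i] for i in pending], params)
        next_pending = []
        for idx, text in zip(pending, outputs):
            results[idx] = text  # keep the latest raw output for this prompt
            if not is_parsable(text):
                next_pending.append(idx)
        pending = next_pending
        trial += 1
    return results
```

Keeping trial 0 unchanged preserves reproducibility for the common case where the first output parses, while later trials get a real chance of producing a different result.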

Copilot

Pull request overview

Adds two standalone evaluation utilities to (1) generate OpenQA-style answers with vLLM in offline batches and (2) score those answers with an LLM-as-judge, including regex-based parsing and retry logic.

Changes:

Add eval/generate_openqa_ans.py for batched OpenQA answer generation to JSONL using vLLM (optionally with tokenizer chat templates).
Add eval/score_openqa_ans.py for batched LLM-as-judge scoring with retries and structured fields (judge_*) appended to each JSONL record.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| eval/generate_openqa_ans.py | New CLI script to load HF/local JSON(L), build prompts, run vLLM batch inference, and emit JSONL with `generated_text`. |
| eval/score_openqa_ans.py | New CLI script to read generated JSONL, run a judge model via vLLM, parse `<correctness>`/`<score>`, retry failures, and write augmented JSONL. |

Copilot · 2026-02-03T13:38:55Z

+Return exactly these three blocks in order. Do not add text outside the tags.
+
+<analysis>
+Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answe.


Typo in the SYSTEM_PROMPT: the text ends with "reference_answe." which looks truncated/misspelled and should be "reference_answer" (to match the earlier variable name and improve judge instruction clarity).

Suggested change

-Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answe.
+Free-form justification. You may write anything here such as step-by-step reasoning, comparisons, errors spotted, strengths, weaknesses, etc. between the model_answer and reference_answer.

Copilot · 2026-02-03T13:38:56Z

+- Use chain-of-thought privately, but present only a final analysis in <analysis>.  
+- Be strict on correctness: any factual error → correctness=False. If correctness=False, cap score at 3 or lower.  
+- If correct but shallow, keep correctness=True and assign a lower score.


The judge instructions are internally inconsistent: they say "Use chain-of-thought privately" but then ask for detailed, step-by-step reasoning inside the <analysis> block. This tends to elicit hidden-reasoning style outputs and makes outputs less predictable for parsing. Consider removing the chain-of-thought wording and instead requesting a concise justification (still inside <analysis>) without step-by-step reasoning.

Copilot · 2026-02-03T13:38:56Z

+    args = parse_args()
+    dataset = load_data(args)
+    records = list(dataset)
+


records = list(dataset) materializes the full dataset in memory before generation. For large HF datasets this can be prohibitively expensive; consider iterating the dataset directly (or using streaming) and batching over an iterator so memory usage stays bounded.
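A bounded-memory alternative could batch directly over an iterator, as in the minimal sketch below. The name `iter_batches_lazy` is hypothetical and is not the PR's `iter_batches`. It accepts any iterable, which should include a Hugging Face dataset opened with `streaming=True`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches_lazy(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        # Pull at most batch_size items; only one batch is held in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch can then be turned into prompts, passed to the generation call, and written out before the next batch is read.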

Copilot · 2026-02-03T13:38:56Z

+def load_jsonl(path: Path) -> List[dict]:
+    with path.open("r", encoding="utf-8") as file:
+        return [json.loads(line) for line in file if line.strip()]
+
+


load_jsonl loads the entire input file into memory at once. For large JSONL evaluation runs this can cause high memory usage; consider a streaming/iterator approach (and writing outputs incrementally) so scoring can run in bounded memory.

Suggested change

-def load_jsonl(path: Path) -> List[dict]:
-    with path.open("r", encoding="utf-8") as file:
-        return [json.loads(line) for line in file if line.strip()]
+class JsonlRows:
+    """Lightweight, list-like wrapper for a JSONL file.
+
+    This avoids loading all JSON objects into memory at once. Instead, it
+    stores byte offsets for each non-empty line and lazily deserializes
+    items on demand.
+    """
+
+    def __init__(self, path: Path) -> None:
+        self._path = path
+        self._offsets: List[int] = []
+        # Record the byte offsets of all non-empty lines.
+        with path.open("rb") as f:
+            while True:
+                pos = f.tell()
+                line = f.readline()
+                if not line:
+                    break
+                if line.strip():
+                    self._offsets.append(pos)
+
+    def __len__(self) -> int:
+        return len(self._offsets)
+
+    def __getitem__(self, index: int) -> dict:
+        # Support negative indices like a normal sequence.
+        if index < 0:
+            index += len(self._offsets)
+        if index < 0 or index >= len(self._offsets):
+            raise IndexError("JsonlRows index out of range")
+        offset = self._offsets[index]
+        # Open in binary mode to match recorded byte offsets.
+        with self._path.open("rb") as f:
+            f.seek(offset)
+            line = f.readline()
+        # Decode and parse JSON for this single line.
+        return json.loads(line.decode("utf-8"))
+
+    def __iter__(self):
+        for i in range(len(self._offsets)):
+            yield self[i]
+
+
+def load_jsonl(path: Path) -> JsonlRows:
+    """Return a lazy, list-like view over a JSONL file."""
+    return JsonlRows(path)
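If random access by index is not needed, the streaming half of this comment (read lazily, write incrementally) could be sketched with a plain generator. The helpers `iter_jsonl` and `transform_jsonl` below are hypothetical and do not appear in the PR.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    with path.open("r", encoding="utf-8") as file:
        for line in file:
            if line.strip():
                yield json.loads(line)


def transform_jsonl(src: Path, dst: Path, transform: Callable[[dict], dict]) -> int:
    """Stream records from src, apply transform, and write each result to dst immediately."""
    count = 0
    with dst.open("w", encoding="utf-8") as out:
        for record in iter_jsonl(src):
            out.write(json.dumps(transform(record), ensure_ascii=False) + "\n")
            count += 1
    return count
```

The trade-off is that a single forward pass does not support a retry loop that revisits records by index, which is what the offset-based wrapper above provides.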

Copilot · 2026-02-03T13:38:57Z

+    if use_chat_template:
+        messages = [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": user_prompt},
+        ]
+        return tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=True,
+        )


build_prompt accepts tokenizer: Optional[AutoTokenizer], but unconditionally calls tokenizer.apply_chat_template(...) when use_chat_template is true. Since this is only safe when tokenizer is non-None, consider tightening the type (non-Optional) for that codepath or adding an explicit assertion to prevent accidental misuse/refactors from introducing a None dereference.
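The explicit guard could look like the sketch below. The signature, the placeholder `SYSTEM_PROMPT`, and the plain-text fallback are assumptions rather than the PR's code; only the `apply_chat_template` call mirrors the quoted hunk.

```python
from typing import Any, Optional

SYSTEM_PROMPT = "You are a strict judge."  # placeholder; the real prompt is longer


def build_prompt(user_prompt: str, tokenizer: Optional[Any], use_chat_template: bool) -> str:
    """Format a judge prompt, failing loudly if a chat template is requested without a tokenizer."""
    if not use_chat_template:
        # Assumed fallback: concatenate system and user text.
        return f"{SYSTEM_PROMPT}\n\n{user_prompt}"
    if tokenizer is None:
        raise ValueError("use_chat_template=True requires a tokenizer")
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
```

Raising `ValueError` rather than using `assert` keeps the check active when Python runs with `-O`.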

Add OpenQA generation and scoring scripts

a6effc7

naufalso added the codex label Feb 3, 2026 — with ChatGPT Codex Connector

naufalso requested a review from Copilot February 3, 2026 13:35

Copilot started reviewing on behalf of naufalso February 3, 2026 13:35

chatgpt-codex-connector Bot reviewed Feb 3, 2026

Copilot AI reviewed Feb 3, 2026

naufalso marked this pull request as draft February 4, 2026 08:05

Fixed OpenQA generation and scoring scripts with improved logging and…

f25eba4

… configuration options

naufalso wants to merge 2 commits into main from codex/create-generate_openqa_ans-and-score_openqa_ans-scripts

2 participants

