
Commit fd7fd7f

Add arena hard v2.0 (#29)
* Update license placeholder
* Update pyproject.toml with metadata
* add smoke tests
* add publishing workflow
* remove testpypi from pyproject.toml
* change authors to maintainers
* Add ArenaHard-v2.0
* address PR comments
* remove slurmpilot script for arenahard versions
* fix precommit failure
1 parent e1726ad commit fd7fd7f

14 files changed

Lines changed: 370 additions & 73 deletions


LICENSE

Lines changed: 1 addition & 1 deletion

@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.

-      Copyright 2026 Erlis Lushtaku, David Salinas
+      Copyright 2026 Erlis Lushtaku, David Salinas, and GitHub contributors

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

README.md

Lines changed: 10 additions & 5 deletions

@@ -196,15 +196,20 @@ This override applies to all vLLM models in the run. For remote providers (OpenA
 | Dataset               | Description                                                                                      |
 |-----------------------|--------------------------------------------------------------------------------------------------|
 | `alpaca-eval`         | General instruction-following benchmark                                                          |
-| `arena-hard`          | More challenging evaluation suite                                                                |
+| `arena-hard-v2.0`     | Arena-Hard v2.0 from official `lmarena-ai/arena-hard-auto` source                                |
+| `arena-hard-v0.1`     | Legacy Arena-Hard v0.1 from official `lmarena-ai/arena-hard-auto` source                         |
 | `m-arena-hard`        | Translated version of Arena-Hard in 23 languages                                                 |
 | `m-arena-hard-{lang}` | Language-specific variants (e.g., `ar`, `cs`, `de`)                                              |
 | `m-arena-hard-EU`     | All EU languages combined                                                                        |
 | `fluency-{lang}`      | Fluency evaluation for pretrained models (`finnish`, `french`, `german`, `spanish`, `swedish`)   |

+For Arena-Hard, JudgeArena resolves baseline metadata by dataset version:
+- `arena-hard-v0.1`: `gpt-4-0314`
+- `arena-hard-v2.0`: `o3-mini-2025-01-31` (standard prompts)
+
 ## 📈 Estimating ELO Ratings

-OpenJury can estimate the ELO rating of a model by running it against opponents sampled from a human preference arena (`LMArena-100k`, `LMArena-140k`, or `ComparIA`).
+JudgeArena can estimate the ELO rating of a model by running it against opponents sampled from a human preference arena (`LMArena-100k`, `LMArena-140k`, or `ComparIA`).
 The LLM judge scores each battle, and the resulting ratings are computed using the Bradley-Terry model anchored against the human-annotated arena leaderboard.

 ### Quick start
@@ -220,7 +225,7 @@ judgearena-elo \
 Alternatively, if running directly from the repository without installing:

 ```bash
-uv run python openjury/estimate_elo_ratings.py \
+uv run python judgearena/estimate_elo_ratings.py \
     --arena ComparIA \
     --model Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
     --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
@@ -232,8 +237,8 @@ uv run python openjury/estimate_elo_ratings.py \
 | Flag | Default | Description |
 |---|---|---|
 | `--arena` | `ComparIA` | Arena to sample opponents from: `LMArena-100k`, `LMArena-140k`, or `ComparIA` |
-| `--model` | *(required)* | Model under evaluation (same format as `openjury`) |
-| `--judge_model` | *(required)* | LLM judge (same format as `openjury`) |
+| `--model` | *(required)* | Model under evaluation (same format as `judgearena`) |
+| `--judge_model` | *(required)* | LLM judge (same format as `judgearena`) |
 | `--n_instructions` | all | Number of arena battles to use for evaluation |
 | `--n_instructions_per_language` | all | Cap battles per language (useful for balanced multilingual eval) |
 | `--languages` | all | Restrict to specific language codes, e.g. `en fr de` |

TODOs.md

Lines changed: 2 additions & 2 deletions

@@ -1,14 +1,14 @@
 TODOs:
-* push on pypi
 * document on the fly evaluations with custom prompt
-* support MT-bench
 * handle errors
 * CI [high/large]
 * implement CI judge option
 * implement domain filter in CI (maybe pass a regexp by column?)
 * report cost?

 Done:
+* push on pypi
+* support MT-bench
 * support alpaca-eval
 * support arena-hard
 * test together judge

judgearena/evaluate.py

Lines changed: 8 additions & 1 deletion

@@ -10,6 +10,10 @@
 from langchain_core.prompts import ChatPromptTemplate

 from judgearena.instruction_dataset import load_instructions
+from judgearena.instruction_dataset.arena_hard import (
+    download_arena_hard,
+    is_arena_hard_dataset,
+)
 from judgearena.repro import _to_jsonable, write_run_metadata
 from judgearena.utils import (
     compute_pref_summary,
@@ -127,7 +131,10 @@ def evaluate_completions(
     """
     run_started_at = datetime.now(UTC)
     local_path_tables = data_root / "tables"
-    download_hf(name=dataset, local_path=local_path_tables)
+    if is_arena_hard_dataset(dataset):
+        download_arena_hard(dataset=dataset, local_tables_path=local_path_tables)
+    else:
+        download_hf(name=dataset, local_path=local_path_tables)

     instructions = load_instructions(
         dataset=dataset,
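The branch added to `evaluate_completions` is a dispatch-by-dataset pattern: official Arena-Hard versions take a dedicated download path, everything else falls back to the generic Hugging Face download. A standalone sketch of the shape (the stub downloaders are purely illustrative stand-ins, not JudgeArena functions):

```python
from pathlib import Path

# Illustrative stand-in for is_arena_hard_dataset from the diff above.
ARENA_HARD_NAMES = {"arena-hard-v0.1", "arena-hard-v2.0"}

def is_arena_hard_dataset(dataset: str) -> bool:
    return dataset in ARENA_HARD_NAMES

def fetch_tables(dataset: str, tables: Path) -> str:
    # Mirrors the new branch: Arena-Hard versions come from the official
    # lmarena-ai/arena-hard-auto repo, other datasets via the generic path.
    if is_arena_hard_dataset(dataset):
        return f"arena_hard:{dataset} -> {tables}"
    return f"hf:{dataset} -> {tables}"

print(fetch_tables("arena-hard-v2.0", Path("tables")))  # arena_hard:arena-hard-v2.0 -> tables
print(fetch_tables("alpaca-eval", Path("tables")))      # hf:alpaca-eval -> tables
```

The same if/else appears in three call sites in this commit; centralizing the check in `is_arena_hard_dataset` keeps the dataset registry in one module.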

judgearena/generate_and_evaluate.py

Lines changed: 11 additions & 3 deletions

@@ -15,6 +15,10 @@
 from judgearena.evaluate import judge_and_parse_prefs, resolve_judge_prompts
 from judgearena.generate import generate_base, generate_instructions
 from judgearena.instruction_dataset import load_instructions
+from judgearena.instruction_dataset.arena_hard import (
+    download_arena_hard,
+    is_arena_hard_dataset,
+)
 from judgearena.mt_bench.mt_bench_utils import run_mt_bench
 from judgearena.repro import _to_jsonable, write_run_metadata
 from judgearena.utils import (
@@ -41,7 +45,10 @@ def try_load_dataset_completions(
     or ``None`` when no pre-existing completions are found.
     """
     local_path_tables = data_root / "tables"
-    download_hf(name=dataset, local_path=local_path_tables)
+    if is_arena_hard_dataset(dataset):
+        download_arena_hard(dataset=dataset, local_tables_path=local_path_tables)
+    else:
+        download_hf(name=dataset, local_path=local_path_tables)
     output_path = local_path_tables / "model_outputs" / f"{dataset}.csv.zip"
     if not output_path.exists():
         return None
@@ -52,7 +59,7 @@ def try_load_dataset_completions(
     ).sort_index()
     if model not in df_outputs.columns:
         return None
-    print(f"Found pre-existing completions for '{model}' in {dataset} dataset.")
+    print(f"Found pre-existing completions for '{model}' in dataset '{dataset}'.")
     completions = df_outputs.loc[:, model]
     if n_instructions is not None:
         completions = completions.head(n_instructions)
@@ -97,7 +104,8 @@ def parse_args(cls):
     )
     parser.add_argument(
         "--dataset",
-        help="The dataset to use. For instance `alpaca-eval`, `arena-hard`, `m-arena-hard-EU` for instruction "
+        help="The dataset to use. For instance `alpaca-eval`, `arena-hard-v2.0`, "
+        "`arena-hard-v0.1`, `m-arena-hard-EU` for instruction "
         "tuning cases or `french-contexts`, `spanish-contexts` for base models.",
     )
     parser.add_argument(

judgearena/instruction_dataset/__init__.py

Lines changed: 13 additions & 2 deletions

@@ -1,5 +1,9 @@
 import pandas as pd

+from judgearena.instruction_dataset.arena_hard import (
+    download_arena_hard,
+    is_arena_hard_dataset,
+)
 from judgearena.instruction_dataset.m_arenahard import load_m_arenahard
 from judgearena.utils import data_root, download_hf, read_df

@@ -58,9 +62,16 @@ def load_instructions(dataset: str, n_instructions: int | None = None) -> pd.Dat
         )

     else:
-        assert dataset in ["alpaca-eval", "arena-hard"]
+        assert dataset in [
+            "alpaca-eval",
+            "arena-hard-v0.1",
+            "arena-hard-v2.0",
+        ]
         local_path_tables = data_root / "tables"
-        download_hf(name=dataset, local_path=local_path_tables)
+        if is_arena_hard_dataset(dataset):
+            download_arena_hard(dataset=dataset, local_tables_path=local_path_tables)
+        else:
+            download_hf(name=dataset, local_path=local_path_tables)
         df_instructions = read_df(local_path_tables / "instructions" / f"{dataset}.csv")

         df_instructions = df_instructions.set_index("instruction_index").sort_index()
judgearena/instruction_dataset/arena_hard.py

Lines changed: 158 additions & 0 deletions

@@ -0,0 +1,158 @@
+from dataclasses import dataclass
+from pathlib import Path
+
+import pandas as pd
+from datasets import Dataset, DatasetDict, IterableDataset, load_dataset
+
+ARENA_HARD_HF_REPO_ID = "lmarena-ai/arena-hard-auto"
+
+
+@dataclass(frozen=True)
+class ArenaHardSpec:
+    hf_variant: str
+    baseline_model: str
+
+
+ARENA_HARD_DATASETS: dict[str, ArenaHardSpec] = {
+    "arena-hard-v0.1": ArenaHardSpec(
+        hf_variant="arena-hard-v0.1",
+        baseline_model="gpt-4-0314",
+    ),
+    "arena-hard-v2.0": ArenaHardSpec(
+        hf_variant="arena-hard-v2.0",
+        baseline_model="o3-mini-2025-01-31",
+    ),
+}
+
+
+def resolve_arena_hard_spec(dataset: str) -> ArenaHardSpec | None:
+    return ARENA_HARD_DATASETS.get(dataset)
+
+
+def is_arena_hard_dataset(dataset: str) -> bool:
+    return resolve_arena_hard_spec(dataset) is not None
+
+
+def arena_hard_baseline_model(dataset: str) -> str | None:
+    spec = resolve_arena_hard_spec(dataset)
+    if spec is None:
+        return None
+    return spec.baseline_model
+
+
+def _load_official_arena_hard_dataset(spec: ArenaHardSpec) -> pd.DataFrame:
+    data = load_dataset(
+        path=ARENA_HARD_HF_REPO_ID,
+        data_dir=f"data/{spec.hf_variant}",
+    )
+    return _dataset_like_to_dataframe(data)
+
+
+def _dataset_like_to_dataframe(
+    data: Dataset | DatasetDict | IterableDataset,
+) -> pd.DataFrame:
+    if isinstance(data, DatasetDict):
+        if "train" in data:
+            return data["train"].to_pandas()
+        first_split = next(iter(data.keys()))
+        return data[first_split].to_pandas()
+    if isinstance(data, Dataset):
+        return data.to_pandas()
+    if isinstance(data, IterableDataset):
+        return pd.DataFrame(list(data))
+    raise TypeError(f"Unsupported dataset object type: {type(data)}")
+
+
+def normalize_official_arena_hard(
+    raw_df: pd.DataFrame, dataset: str
+) -> tuple[pd.DataFrame, pd.DataFrame | None]:
+    spec = resolve_arena_hard_spec(dataset)
+    if spec is None:
+        raise ValueError(f"Unsupported Arena-Hard dataset: {dataset}")
+
+    instruction_index = _pick_instruction_index(raw_df)
+    instruction = _pick_instruction(raw_df)
+    df_instructions = pd.DataFrame(
+        {
+            "instruction_index": instruction_index,
+            "instruction": instruction,
+        }
+    )
+    df_instructions = df_instructions.dropna(
+        subset=["instruction_index", "instruction"]
+    )
+    df_instructions = df_instructions.drop_duplicates(subset=["instruction_index"])
+    df_instructions = df_instructions.sort_values("instruction_index")
+
+    df_model_outputs = _build_model_outputs(raw_df)
+    return df_instructions, df_model_outputs
+
+
+def download_arena_hard(dataset: str, local_tables_path: Path) -> None:
+    """Load Arena-Hard from the Hub if instruction and model-output files are missing."""
+    spec = resolve_arena_hard_spec(dataset)
+    if spec is None:
+        return
+    instructions_path = local_tables_path / "instructions" / f"{dataset}.csv"
+    model_outputs_path = local_tables_path / "model_outputs" / f"{dataset}.csv.zip"
+    if instructions_path.exists() and model_outputs_path.exists():
+        return
+
+    raw_df = _load_official_arena_hard_dataset(spec)
+    df_instructions, df_model_outputs = normalize_official_arena_hard(
+        raw_df=raw_df, dataset=dataset
+    )
+    instructions_path.parent.mkdir(parents=True, exist_ok=True)
+    model_outputs_path.parent.mkdir(parents=True, exist_ok=True)
+    df_instructions.to_csv(instructions_path, index=False)
+    if df_model_outputs is not None:
+        df_model_outputs.to_csv(model_outputs_path, index=False)
+
+
+def _pick_instruction_index(raw_df: pd.DataFrame) -> pd.Series:
+    for col in ["instruction_index", "question_id", "id"]:
+        if col in raw_df.columns:
+            return raw_df[col].astype(str)
+    return pd.Series(range(len(raw_df)), dtype=str)
+
+
+def _pick_instruction(raw_df: pd.DataFrame) -> pd.Series:
+    for col in ["instruction", "prompt", "question", "turns"]:
+        if col in raw_df.columns:
+            if col == "turns":
+                return raw_df[col].apply(_turns_to_text)
+            return raw_df[col].astype(str)
+    raise ValueError(
+        f"Unable to infer instruction text column from Arena-Hard data. Available columns: {raw_df.columns.tolist()}"
+    )
+
+
+def _turns_to_text(turns_value) -> str:
+    if isinstance(turns_value, list):
+        if not turns_value:
+            return ""
+        first = turns_value[0]
+        if isinstance(first, dict):
+            for key in ["content", "text", "prompt"]:
+                if key in first:
+                    return str(first[key])
+        return str(first)
+    if isinstance(turns_value, dict):
+        for key in ["content", "text", "prompt"]:
+            if key in turns_value:
+                return str(turns_value[key])
+    return str(turns_value)
+
+
+def _build_model_outputs(raw_df: pd.DataFrame) -> pd.DataFrame | None:
+    if not {"model", "output"}.issubset(raw_df.columns):
+        return None
+    instruction_index = _pick_instruction_index(raw_df)
+    df_outputs = pd.DataFrame(
+        {
+            "instruction_index": instruction_index,
+            "model": raw_df["model"].astype(str),
+            "output": raw_df["output"].fillna("").astype(str),
+        }
+    )
+    return df_outputs
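The `_turns_to_text` helper in the new module flattens the varied shapes the `turns` column can take in the Hub rows (list of message dicts, bare dict, or plain value) into a single prompt string. Re-declared standalone below so the behavior can be checked without the package installed (same logic, public name for illustration):

```python
def turns_to_text(turns_value) -> str:
    # Same logic as _turns_to_text above: take the first turn, prefer its
    # content/text/prompt field, and fall back to str() for unknown shapes.
    if isinstance(turns_value, list):
        if not turns_value:
            return ""
        first = turns_value[0]
        if isinstance(first, dict):
            for key in ["content", "text", "prompt"]:
                if key in first:
                    return str(first[key])
        return str(first)
    if isinstance(turns_value, dict):
        for key in ["content", "text", "prompt"]:
            if key in turns_value:
                return str(turns_value[key])
    return str(turns_value)

print(turns_to_text([{"content": "Write a haiku."}]))  # Write a haiku.
print(turns_to_text("plain prompt"))                   # plain prompt
```

Only the first turn is kept, so multi-turn conversations are intentionally reduced to their opening instruction.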

judgearena/utils.py

Lines changed: 16 additions & 3 deletions

@@ -13,6 +13,11 @@
 from langchain_openai import ChatOpenAI
 from tqdm.asyncio import tqdm

+from judgearena.instruction_dataset.arena_hard import (
+    download_arena_hard,
+    is_arena_hard_dataset,
+)
+

 def _data_root_path() -> Path:
     raw = os.environ.get("JUDGEARENA_DATA") or os.environ.get("OPENJURY_DATA")
@@ -421,9 +426,17 @@ def make_model(model: str, max_tokens: int | None = 8192, **engine_kwargs):

 def download_all():
     print(f"Downloading all dataset in {data_root}")
-    for dataset in ["alpaca-eval", "arena-hard", "m-arena-hard"]:
-        local_path_tables = data_root / "tables"
-        download_hf(name=dataset, local_path=local_path_tables)
+    local_path_tables = data_root / "tables"
+    for dataset in [
+        "alpaca-eval",
+        "arena-hard-v0.1",
+        "arena-hard-v2.0",
+        "m-arena-hard",
+    ]:
+        if is_arena_hard_dataset(dataset):
+            download_arena_hard(dataset=dataset, local_tables_path=local_path_tables)
+        else:
+            download_hf(name=dataset, local_path=local_path_tables)

     snapshot_download(
         repo_id="geoalgo/multilingual-contexts-to-be-completed",

scripts/multilingual_arena_hard/translate_arena_hard.py

Lines changed: 1 addition & 1 deletion

@@ -77,7 +77,7 @@
 # translator_model = "OpenRouter/deepseek/deepseek-chat-v3.1"
 n_instructions = 10
 df_instructions = load_instructions(
-    "arena-hard",
+    "arena-hard-v2.0",
     n_instructions=n_instructions,
 )
 # languages = [("fra", "French")]
