
Commit d2d67d6

Support for computing Elo ratings (#25)

* current script
* update
* update
* fix swap and add test
* add scikit-learn as dependency
* fix tests with renaming
* PR feedback
* fix tests
* add missing cache argument, add hash in case of long names

1 parent 648dbeb commit d2d67d6

8 files changed

Lines changed: 1182 additions & 49 deletions

File tree

README.md

Lines changed: 54 additions & 0 deletions
@@ -202,6 +202,60 @@ This override applies to all vLLM models in the run. For remote providers (OpenA

| `m-arena-hard-EU` | All EU languages combined |
| `fluency-{lang}` | Fluency evaluation for pretrained models (`finnish`, `french`, `german`, `spanish`, `swedish`) |

## 📈 Estimating Elo Ratings

OpenJury can estimate the Elo rating of a model by running it against opponents sampled from a human preference arena (`LMArena-100k`, `LMArena-140k`, or `ComparIA`).
The LLM judge scores each battle, and the resulting ratings are computed with a Bradley-Terry model anchored to the human-annotated arena leaderboard.
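
Bradley-Terry strengths can be fit as a logistic regression over battle outcomes, which is consistent with the scikit-learn dependency this commit adds. Below is a minimal, hypothetical sketch of that idea only (it ignores ties and the leaderboard anchoring step, and `bradley_terry_elo` is an illustrative name, not the function used by `estimate_elo_ratings.py`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def bradley_terry_elo(battles, base=1000.0, scale=400.0):
    """battles: iterable of (model_a, model_b, winner), winner in {"model_a", "model_b", "tie"}."""
    # Drop ties for brevity; a fuller treatment would weight them as half-wins.
    battles = [b for b in battles if b[2] in ("model_a", "model_b")]
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    index = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for row, (a, b, winner) in enumerate(battles):
        X[row, index[a]] = 1.0   # +1 indicator for the model in position A
        X[row, index[b]] = -1.0  # -1 indicator for the model in position B
        y[row] = 1.0 if winner == "model_a" else 0.0
    # Bradley-Terry: P(A beats B) = sigmoid(theta_A - theta_B).
    # A near-unregularized logistic fit recovers the log-strengths theta.
    theta = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y).coef_[0]
    # Rescale natural-log strengths to Elo points (400 points per factor of 10 in odds).
    return {m: base + scale * theta[index[m]] / np.log(10) for m in models}
```

Bootstrap confidence intervals (the `--n_bootstraps` flag below) then amount to refitting on resampled battles and reporting the spread of each rating.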

### Quick start

```bash
judgearena-elo \
  --arena ComparIA \
  --model Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200
```

Alternatively, if running directly from the repository without installing the package:

```bash
uv run python openjury/estimate_elo_ratings.py \
  --arena ComparIA \
  --model Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200
```

### Key options

| Flag | Default | Description |
|---|---|---|
| `--arena` | `ComparIA` | Arena to sample opponents from: `LMArena-100k`, `LMArena-140k`, or `ComparIA` |
| `--model` | *(required)* | Model under evaluation (same format as `openjury`) |
| `--judge_model` | *(required)* | LLM judge (same format as `openjury`) |
| `--n_instructions` | all | Number of arena battles to use for evaluation |
| `--n_instructions_per_language` | all | Cap on battles per language (useful for balanced multilingual evaluation) |
| `--languages` | all | Restrict to specific language codes, e.g. `en fr de` |
| `--n_bootstraps` | `20` | Number of bootstrap samples for Elo confidence intervals |
| `--swap_mode` | `fixed` | `fixed`: single judge pass; `both`: judge each battle twice with positions swapped to correct for position bias |
| `--result_folder` | `results` | Directory where annotations and results are saved |
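
For example, a balanced multilingual evaluation with position-bias correction could combine the flags above as follows (an illustrative invocation, not a recommended configuration):

```bash
judgearena-elo \
  --arena LMArena-140k \
  --model Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --languages en fr de \
  --n_instructions_per_language 50 \
  --swap_mode both
```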

### Output

The script prints win/loss/tie counts, the win rate, and a ranked Elo leaderboard with confidence intervals:

```
=== Results for meta-llama/Llama-3.3-70B-Instruct-Turbo ===
Battles: 200 | Wins: 112 | Losses: 71 | Ties: 17
Win rate: 60.25%

=== ELO Ratings (Bradley-Terry, 20 bootstraps) ===
gpt-4o (12453): 1132.4 ± 3.1
meta-llama/Llama-3.3-70B-Instruct-Turbo (200) <-----: 1089.7 ± 8.2
...
```

### Offline Setup (Slurm/Air-Gapped Environments)

Pre-download all datasets before running jobs:

judgearena/arenas_utils.py

Lines changed: 163 additions & 0 deletions

@@ -0,0 +1,163 @@
import warnings
from pathlib import Path

import pandas as pd
from fast_langdetect import detect_language
from huggingface_hub import snapshot_download


def _extract_instruction_text(turn: dict) -> str:
    """Extract plain instruction text from a conversation first turn.

    Handles both the 100k schema (content is a plain string) and the 140k
    schema (content is an array of {type, text, ...} objects).
    """
    content = turn["content"]
    if isinstance(content, str):
        return content
    return " ".join(block["text"] for block in content if block.get("type") == "text")


KNOWN_ARENAS = ["LMArena-100k", "LMArena-140k", "ComparIA"]


def _load_arena_dataframe(
    arena: str, comparia_revision: str | None = None
) -> pd.DataFrame:
    assert arena in KNOWN_ARENAS
    if "LMArena" in arena:
        size = arena.split("-")[1]  # "100k" or "140k"
        path = snapshot_download(
            repo_id=f"lmarena-ai/arena-human-preference-{size}",
            repo_type="dataset",
            allow_patterns="*parquet",
            force_download=False,
        )
        parquet_files = sorted((Path(path) / "data").glob("*.parquet"))
        df = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)

        if "tstamp" in df.columns:
            # 100k: tstamp is a unix timestamp in seconds
            df["date"] = pd.to_datetime(df["tstamp"], unit="s")
        else:
            # 140k: timestamp is already a datetime
            df["tstamp"] = df["timestamp"].astype("int64") // 10**9
            df["date"] = df["timestamp"]

        if "question_id" not in df.columns:
            df["question_id"] = df["id"]

        # 140k uses "both_bad" instead of "tie (bothbad)"
        df["winner"] = df["winner"].replace("both_bad", "tie (bothbad)")

        df["benchmark"] = arena

    else:
        path = snapshot_download(
            repo_id="ministere-culture/comparia-votes",
            repo_type="dataset",
            allow_patterns="*",
            revision=comparia_revision,
            force_download=False,
        )

        df = pd.read_parquet(Path(path) / "votes.parquet")

        # unify schema
        df["tstamp"] = df["timestamp"]
        df["model_a"] = df["model_a_name"]
        df["model_b"] = df["model_b_name"]

        def get_winner(
            chosen_model_name: str,
            model_a: str,
            model_b: str,
            both_equal: bool,
            **kwargs,
        ):
            if both_equal:
                return "tie"
            if chosen_model_name is None or isinstance(chosen_model_name, float):
                return None
            if chosen_model_name not in [model_a, model_b]:
                warnings.warn(
                    f"Chosen model {chosen_model_name!r} not in model_a={model_a!r} or model_b={model_b!r}; skipping."
                )
                return None
            return "model_a" if chosen_model_name == model_a else "model_b"

        df["winner"] = df.apply(lambda row: get_winner(**row), axis=1)

        # filter out battles without an annotated winner
        df = df[~df.winner.isna()]
        df["benchmark"] = "ComparIA"
        df["question_id"] = df["id"]

    df["lang"] = df["conversation_a"].apply(
        lambda conv: detect_language(_extract_instruction_text(conv[0])).lower()
    )

    cols = [
        "question_id",
        "tstamp",
        "model_a",
        "model_b",
        "winner",
        "conversation_a",
        "conversation_b",
        "benchmark",
        "lang",
    ]
    df = df.loc[:, cols]

    # keep only single-turn conversations for now, as they are easier to evaluate
    df["turns"] = df.apply(lambda row: len(row["conversation_a"]) - 1, axis=1)
    n_before = len(df)
    df = df.loc[df.turns == 1]
    n_dropped = n_before - len(df)
    if n_dropped > 0:
        print(
            f"[{arena}] Dropped {n_dropped}/{n_before} multi-turn battles (keeping single-turn only)."
        )

    return df


def load_arena_dataframe(
    arena: str | None,
    comparia_revision: str = "7a40bce496c1f2aa3be4001da85a49cb4743042b",
) -> pd.DataFrame:
    """Load battles from one or all arenas.

    :param arena: one of "LMArena-100k", "LMArena-140k", "ComparIA", "LMArena"
        (concatenation of both LMArena variants), or None (all arenas).
    :param comparia_revision: pinned revision for the ComparIA dataset.
    :return: dataframe containing battles for the arena(s) selected.
    """
    if arena is None:
        arenas = KNOWN_ARENAS
    elif arena == "LMArena":
        arenas = ["LMArena-100k", "LMArena-140k"]
    else:
        return _load_arena_dataframe(arena, comparia_revision)
    return pd.concat(
        [_load_arena_dataframe(a, comparia_revision) for a in arenas],
        ignore_index=True,
    )


def main():
    for arena in KNOWN_ARENAS:
        print(f"Loading {arena}")
        df = _load_arena_dataframe(arena)
        n_battles = len(df)
        n_models = len(set(df["model_a"]) | set(df["model_b"]))
        n_languages = df["lang"].nunique()
        print(
            f"{arena}: {n_battles} battles, {n_models} models, {n_languages} languages"
        )


if __name__ == "__main__":
    main()
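
As a quick sanity check, the loader can be exercised directly; a minimal sketch, assuming the `judgearena` package is importable from the repository root (illustrative usage, not part of the commit):

```python
from judgearena.arenas_utils import load_arena_dataframe

# Load the pinned ComparIA snapshot and inspect the unified schema.
df = load_arena_dataframe("ComparIA")
print(df.columns.tolist())  # question_id, tstamp, model_a, model_b, winner, ...
print(df["lang"].value_counts().head())
```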

0 commit comments
