
Commit 223db1b

Add ruff checks and format codebase (#28)
* Update license placeholder
* Update pyproject.toml with metadata
* add smoke tests
* add publishing workflow
* remove testpypi from pyproject.toml
* change authors to maintainers
* add ruff checks
  - replace black with ruff
  - add ruff to pre-commit
  - add pre-commit (with ruff) to CI
* upgrade pre-commit version
* format
* ignore rules
* format
1 parent d2d67d6 commit 223db1b
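
Most of the Python churn in this commit is mechanical and consistent with running ruff's linter (`ruff check --fix`) and formatter (`ruff format`): import blocks re-sorted, assigned lambdas rewritten as `def`s, multi-line asserts reflowed, `zip(...)` calls given `strict=True`, redundant `open(..., "r")` modes dropped, and `datetime.timezone.utc` replaced by the `datetime.UTC` alias. Several hunks below remove and re-add a line with identical visible text; those are trailing-whitespace or end-of-file fixes from the pre-commit hooks.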

24 files changed: 146 additions & 107 deletions

Lines changed: 7 additions & 5 deletions
@@ -1,7 +1,7 @@
-# This workflow will install Python dependencies, run tests and lint with a single version of Python
+# Install Python dependencies, run pre-commit (lint/format), and pytest.
 # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

-name: Run pytest
+name: CI

 on:
   push:
@@ -14,7 +14,7 @@ permissions:

 jobs:
   build:
-
+
     runs-on: ubuntu-latest
     steps:
     - uses: actions/checkout@v3
@@ -24,6 +24,8 @@ jobs:
         enable-cache: true
         python-version: "3.12"
     - name: Install dependencies
-      run: uv sync --all-extras --group dev
+      run: uv sync
+    - name: Run pre-commit
+      run: uv run pre-commit run --all-files
     - name: Test with pytest
-      run: uv run pytest
+      run: uv run pytest
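
With this change the workflow installs dependencies with a plain `uv sync` (uv includes the `dev` dependency group by default), runs every configured pre-commit hook via `uv run pre-commit run --all-files`, and only then runs pytest, so lint and format failures stop CI before the test step.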

.pre-commit-config.yaml

Lines changed: 14 additions & 5 deletions
@@ -1,14 +1,23 @@
 repos:
 - repo: https://github.com/pre-commit/pre-commit-hooks
-  rev: v4.5.0
+  rev: v6.0.0
   hooks:
   - id: trailing-whitespace
   - id: end-of-file-fixer
   - id: check-yaml
   - id: check-added-large-files

-- repo: https://github.com/psf/black
-  rev: 24.1.1
+- repo: local
   hooks:
-  - id: black
-    language_version: python3
+  - id: ruff
+    name: ruff
+    entry: uv run ruff check --fix --force-exclude
+    language: system
+    types_or: [python, pyi]
+    require_serial: true
+  - id: ruff-format
+    name: ruff-format
+    entry: uv run ruff format --force-exclude
+    language: system
+    types_or: [python, pyi]
+    require_serial: true
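
Because these are `repo: local` hooks with `language: system`, pre-commit does not build its own isolated hook environment; it shells out to `uv run ruff ...`, so the ruff version is whatever the project's lockfile resolves. `--force-exclude` keeps ruff's configured excludes in effect even though pre-commit passes filenames explicitly, and `require_serial: true` runs each hook as a single process.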

README.md

Lines changed: 7 additions & 7 deletions
@@ -24,7 +24,7 @@ Compared to other libraries, here is a breakdown of features:
 | **Evalchemy** |||||||
 | **JudgeArena** | 🔜 ||||||

-The table has been done on Oct 2025, in case some libraries implemented missing features, please open an issue
+The table has been done on Oct 2025, in case some libraries implemented missing features, please open an issue
 or send a PR, we will be happy to update the information.

 ## 🚀 Quick Start
@@ -34,7 +34,7 @@ or send a PR, we will be happy to update the information.
 ```bash
 git clone https://github.com/OpenEuroLLM/JudgeArena
 cd JudgeArena
-uv sync
+uv sync
 uv sync --extra vllm # Optional: install vLLM support
 uv sync --extra llamacpp # Optional: install LlamaCpp support
 ```
@@ -49,19 +49,19 @@ python judgearena/generate_and_evaluate.py \
   --model_A gpt4_1106_preview \
   --model_B VLLM/utter-project/EuroLLM-9B \
   --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
-  --n_instructions 10
+  --n_instructions 10
 ```

 **What happens here?**
 - Use completions available for `gpt4_1106_preview` in Alpaca-Eval dataset
 - Generates completions for `model_B` if not already cached on `vLLM`
-- Compares two models using `deepseek-chat-v3.1` which the cheapest option available on `OpenRouter`
+- Compares two models using `deepseek-chat-v3.1` which the cheapest option available on `OpenRouter`

 It will then display the results of the battles:

 ```bash
 ============================================================
-🏆 MODEL BATTLE RESULTS 🏆
+🏆 MODEL BATTLE RESULTS 🏆
 📊 Dataset: alpaca-eval
 🤖 Competitors: Model A: gpt4_1106_preview vs Model B: VLLM/utter-project/EuroLLM-9B
 ⚖️ Judge: OpenRouter/deepseek/deepseek-chat-v3.1
@@ -84,7 +84,7 @@ The evaluation scripts expose four different length controls with different role

 ### Engine-Specific Configuration (`--engine_kwargs`)

-Some providers expose additional engine-level knobs (for example, vLLM allows configuring tensor parallelism or GPU memory utilization).
+Some providers expose additional engine-level knobs (for example, vLLM allows configuring tensor parallelism or GPU memory utilization).
 JudgeArena lets you forward these options directly to the underlying engine via `--engine_kwargs`, which expects a JSON object.

 For instance, to run vLLM with tensor parallelism across multiple GPUs:
@@ -123,7 +123,7 @@ python judgearena/generate_and_evaluate.py \
   --model_A VLLM/Qwen/Qwen2.5-0.5B-Instruct \
   --model_B VLLM/Qwen/Qwen2.5-1.5B-Instruct \
   --judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
-  --n_instructions 10
+  --n_instructions 10
 ```

 ### Running locally with LlamaCpp

TODOs.md

Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@ TODOs:
 * CI [high/large]
   * implement CI judge option
   * implement domain filter in CI (maybe pass a regexp by column?)
-* report cost?
+* report cost?

 Done:
 * support alpaca-eval
@@ -22,7 +22,7 @@ Done:
 * CLI launcher [medium/large]
 * put contexts in HF dataset [high/small]
 * mAH: instruction loader [DONE]
-* mAH: generate instructions for two models [DONE]
+* mAH: generate instructions for two models [DONE]
 * mAH: make comparison [DONE]
 * mAH: support using all languages at once [high/medium]
 * unit-test
@@ -37,4 +37,4 @@ Done:
 * small refactor `annotate` needs to return just the judge completion, not the parsed one
   * perhaps change to `annotate_pair` and `annotate_single`
   * then provide example
-* support evaluation with input swap
+* support evaluation with input swap

judgearena/arenas_utils.py

Lines changed: 2 additions & 1 deletion
@@ -82,7 +82,8 @@ def get_winner(
         return None
     if chosen_model_name not in [model_a, model_b]:
         warnings.warn(
-            f"Chosen model {chosen_model_name!r} not in model_a={model_a!r} or model_b={model_b!r}; skipping."
+            f"Chosen model {chosen_model_name!r} not in model_a={model_a!r} or model_b={model_b!r}; skipping.",
+            stacklevel=2,
         )
         return None
     return "model_a" if chosen_model_name == model_a else "model_b"

judgearena/criteria/io.py

Lines changed: 1 addition & 2 deletions
@@ -15,8 +15,7 @@ def _load_criteria_data(path: str | Path) -> dict:
     suffix = path.suffix.lower()
     if suffix not in {".yaml", ".yml"}:
         raise ValueError(
-            f"Unsupported criteria file format '{path.suffix}'. "
-            "Use .yaml or .yml."
+            f"Unsupported criteria file format '{path.suffix}'. Use .yaml or .yml."
         )

     data = yaml.safe_load(path.read_text())
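
The two implicitly concatenated string literals were merged into a single f-string that now fits on one line; the error message is unchanged. Implicit concatenation is what ruff's ISC rules flag, so this is likely one of the `format` cleanups mentioned in the commit message.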

judgearena/criteria/schema.py

Lines changed: 0 additions & 1 deletion
@@ -5,7 +5,6 @@
 from dataclasses import dataclass, field
 from typing import Any

-
 SCALE_MIN = 1
 SCALE_MAX = 10


judgearena/estimate_elo_ratings.py

Lines changed: 11 additions & 9 deletions
@@ -8,10 +8,10 @@
 import pandas as pd
 from sklearn.linear_model import LogisticRegression

-from judgearena.arenas_utils import load_arena_dataframe, _extract_instruction_text
+from judgearena.arenas_utils import _extract_instruction_text, load_arena_dataframe
 from judgearena.evaluate import judge_and_parse_prefs
 from judgearena.generate import generate_instructions
-from judgearena.utils import make_model, cache_function_dataframe, compute_pref_summary
+from judgearena.utils import cache_function_dataframe, compute_pref_summary, make_model


 @dataclass
@@ -38,9 +38,9 @@ class CliEloArgs:

     def __post_init__(self):
         supported_modes = ["fixed", "both"]
-        assert (
-            self.swap_mode in supported_modes
-        ), f"Only {supported_modes} modes are supported but got {self.swap_mode}."
+        assert self.swap_mode in supported_modes, (
+            f"Only {supported_modes} modes are supported but got {self.swap_mode}."
+        )

     @classmethod
     def parse_args(cls):
@@ -201,7 +201,7 @@ def parse_args(cls):
             if not isinstance(engine_kwargs, dict):
                 raise ValueError("engine_kwargs must be a JSON object")
         except Exception as e:
-            raise SystemExit(f"Failed to parse --engine_kwargs: {e}")
+            raise SystemExit(f"Failed to parse --engine_kwargs: {e}") from e

         return cls(
             arena=args.arena,
@@ -397,7 +397,9 @@ def main(args: CliEloArgs | None = None) -> dict:
         **extra_kwargs,
     )

-    replace_slash = lambda s: s.replace("/", "_")
+    def replace_slash(s: str) -> str:
+        return s.replace("/", "_")
+
     languages_str = "-".join(sorted(args.languages)) if args.languages else "all"
     extra_kwargs_str = (
         "_".join(f"{k}={v}" for k, v in sorted(extra_kwargs.items()))
@@ -511,7 +513,7 @@ def run_judge() -> pd.DataFrame:
     model_name = args.model
     battle_results = []
     for pref, is_pos_a, opp_model in zip(
-        prefs, our_model_is_position_a, opponent_models
+        prefs, our_model_is_position_a, opponent_models, strict=True
    ):
         if pref is None or pref == 0.5:
             winner = "tie"
@@ -536,7 +538,7 @@ def run_judge() -> pd.DataFrame:
     prefs_normalized = pd.Series(
         [
             p if (p is None or is_pos_a) else (1 - p)
-            for p, is_pos_a in zip(prefs, our_model_is_position_a)
+            for p, is_pos_a in zip(prefs, our_model_is_position_a, strict=True)
         ]
     )
     summary = compute_pref_summary(prefs_normalized)
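
Beyond formatting, two hardening fixes land in this file: `raise SystemExit(...) from e` (ruff's B904) keeps the original JSON parse error in the exception chain, and `zip(..., strict=True)` (B905) makes a length mismatch between the paired lists an immediate error instead of silent truncation. A standalone illustration of the `zip` change (Python 3.10+), not project code:

```python
prefs = [0.7, 0.3, None]   # e.g. one preference per battle
positions = [True, False]  # one flag short of the prefs list

# Plain zip stops at the shorter input, silently dropping the last pref:
print(list(zip(prefs, positions)))  # [(0.7, True), (0.3, False)]

# With strict=True the mismatch surfaces immediately as a ValueError:
try:
    list(zip(prefs, positions, strict=True))
except ValueError as err:
    print(err)  # "zip() argument 2 is shorter than argument 1"
```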

judgearena/evaluate.py

Lines changed: 18 additions & 17 deletions
@@ -1,22 +1,22 @@
 import json
 import re
 from dataclasses import dataclass
-from datetime import datetime, timezone
+from datetime import UTC, datetime
 from pathlib import Path

 import numpy as np
 import pandas as pd
-from langchain_core.prompts import ChatPromptTemplate
 from langchain_core.language_models.llms import LLM
+from langchain_core.prompts import ChatPromptTemplate

 from judgearena.instruction_dataset import load_instructions
-from judgearena.repro import write_run_metadata, _to_jsonable
+from judgearena.repro import _to_jsonable, write_run_metadata
 from judgearena.utils import (
     compute_pref_summary,
-    read_df,
     data_root,
-    download_hf,
     do_inference,
+    download_hf,
+    read_df,
 )


@@ -55,13 +55,13 @@ def load_judge_system_and_user_prompt(
     provide_explanation: bool = True,
 ) -> tuple[str, str]:
     # Prepare judge
-    with open(Path(__file__).parent / "prompts" / "system-prompt.txt", "r") as f:
+    with open(Path(__file__).parent / "prompts" / "system-prompt.txt") as f:
         system_prompt = str(f.read())

     prompt_filename = (
         "prompt-with-explanation.txt" if provide_explanation else "prompt.txt"
     )
-    with open(Path(__file__).parent / "prompts" / prompt_filename, "r") as f:
+    with open(Path(__file__).parent / "prompts" / prompt_filename) as f:
         user_prompt_template = str(f.read())

     return system_prompt, user_prompt_template
@@ -109,7 +109,7 @@ def evaluate_completions(
         exceeding context limit
     :return:
     """
-    run_started_at = datetime.now(timezone.utc)
+    run_started_at = datetime.now(UTC)
     local_path_tables = data_root / "tables"
     download_hf(name=dataset, local_path=local_path_tables)

@@ -140,9 +140,9 @@ def get_output(df_outputs: pd.DataFrame, dataset: str, method: str):
             return df.loc[:, "output"]
         else:
             print(f"Loading {method} from {dataset} dataset.")
-            assert (
-                method in df_outputs.columns
-            ), f"Method {method} not present, pick among {df_outputs.columns.tolist()}"
+            assert method in df_outputs.columns, (
+                f"Method {method} not present, pick among {df_outputs.columns.tolist()}"
+            )
             return df_outputs.loc[:, method].sort_index()

     completions_A = get_output(df_outputs=df_outputs, dataset=dataset, method=method_A)
@@ -151,9 +151,9 @@ def get_output(df_outputs: pd.DataFrame, dataset: str, method: str):
     instructions = instructions.head(num_annotations)
     completions_A = completions_A.head(num_annotations)
     completions_B = completions_B.head(num_annotations)
-    assert (
-        completions_A.index.tolist() == completions_B.index.tolist()
-    ), f"Index mismatch between methods {method_A} and {method_B}."
+    assert completions_A.index.tolist() == completions_B.index.tolist(), (
+        f"Index mismatch between methods {method_A} and {method_B}."
+    )

     if judge_chat_model is None:
         from langchain_together.llms import Together
@@ -303,7 +303,7 @@ def truncate(s: str, max_len: int | None = None):
                 "completion_B": truncate(completion_B, max_len=truncate_input_chars),
             }
             for user_prompt, completion_A, completion_B in zip(
-                instructions, completions_A, completions_B
+                instructions, completions_A, completions_B, strict=True
             )
         ]
     )
@@ -316,7 +316,7 @@ def truncate(s: str, max_len: int | None = None):

     annotations = []
     for judge_completion, instruction, completion_A, completion_B in zip(
-        judge_completions, instructions, completions_A, completions_B
+        judge_completions, instructions, completions_A, completions_B, strict=True
     ):
         annotations.append(
             JudgeAnnotation(
@@ -381,7 +381,8 @@ def judge_and_parse_prefs(
         use_tqdm=use_tqdm,
     )

-    _none_to_nan = lambda x: float("nan") if x is None else x
+    def _none_to_nan(x):
+        return float("nan") if x is None else x

     score_parser = PairScore()
     prefs = pd.Series(
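
The import and call-site rewrites here are pyupgrade-style changes that ruff can apply automatically: `datetime.UTC` has been an alias of `datetime.timezone.utc` since Python 3.11, and `"r"` is already `open()`'s default mode, so both edits are behavior-preserving. A quick self-contained check (Python 3.11+), independent of the project:

```python
from datetime import UTC, datetime, timezone

# UTC is the same singleton as timezone.utc, so timestamps are unchanged.
assert UTC is timezone.utc
print(datetime.now(UTC).isoformat())

# open(path) is equivalent to open(path, "r"): text-mode read is the default.
with open(__file__) as f:
    print(f.readline())
```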
