small features: add option to save cache in parquet, save judge input, improve error handling of openrouter #35

Conversation
completion_A: str  # completion of the first model
completion_B: str  # completion of the second model
judge_completion: str  # output of the judge
judge_input: str | None = None  # input that was passed to the judge
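A hypothetical sketch of how these fields might sit together in a record type (the class name `JudgeResult` and the surrounding code are assumptions; the diff only shows the fields):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class JudgeResult:  # hypothetical name; the diff only shows the fields
    completion_A: str  # completion of the first model
    completion_B: str  # completion of the second model
    judge_completion: str  # output of the judge
    judge_input: str | None = None  # input that was passed to the judge


r = JudgeResult("answer A", "answer B", "A is better")
print(r.judge_input)  # None until the caller opts in to storing it
```

Making `judge_input` default to `None` keeps the feature opt-in, which matters later in the thread when serialization of missing values comes up.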
Should this be added to the estimate_elo_ratings.py workflow as well?
It uses judge_and_parse_prefs from this file, so it is updated as well, if I am not mistaken.
Yes, but we are dropping it here because we are constructing the DataFrame manually.
Good catch, I will change the elo-rating workflow to also store the input.
return x


for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].apply(_to_python).astype(str)
I think we should be careful here if the dataframe can contain missing values. Calling .astype(str) on missing values (None or np.nan) converts them into the strings "None" and "nan". When the parquet file is read back, they would be treated as strings instead of missing values.
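This lossiness is easy to demonstrate; a minimal sketch of the concern:

```python
import numpy as np
import pandas as pd

# An object column with genuine missing values alongside a real string.
s = pd.Series(["hello", None, np.nan], dtype=object)

# .astype(str) stringifies the missing values too:
converted = s.astype(str)
print(converted.tolist())  # ['hello', 'None', 'nan']

# After a parquet round trip these cells come back as the literal
# strings "None" / "nan", and isna() can no longer detect them:
print(converted.isna().any())  # False
```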
Yes, I agree, but I don't see another way to serialize to parquet. This conversion does lose the missingness information, but I think all downstream code should probably exclude empty strings too when computing annotations.
I searched a bit and I think we have two safe solutions:
Option 1: Parquet + JSON Sidecar
We selectively stringify only the complex objects (like dicts/lists) and leave None/NaN untouched, filtering with something like this:

df_cache[col] = df_cache[col].apply(
    lambda x: x if x is None or (isinstance(x, float) and np.isnan(x)) else repr(_to_python(x))
)

and save a small .meta.json tracking which columns were altered. Then, when reading, we call ast.literal_eval only on the cells that were serialized and are not missing.
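A minimal sketch of Option 1's round trip, under two simplifying assumptions: the .meta.json bookkeeping is omitted, and repr is applied directly instead of going through the _to_python helper mentioned above:

```python
import ast

import numpy as np
import pandas as pd


def _is_missing(x):
    """Treat None and float NaN as missing, everything else as a value."""
    return x is None or (isinstance(x, float) and np.isnan(x))


# Serialize: stringify only the non-missing complex cells.
col = pd.Series([{"a": 1}, None, [1, 2]], dtype=object)
serialized = col.apply(lambda x: x if _is_missing(x) else repr(x))

# Deserialize: literal_eval only the cells that were stringified.
restored = serialized.apply(lambda x: x if _is_missing(x) else ast.literal_eval(x))
print(restored.tolist())  # [{'a': 1}, None, [1, 2]]
```

Because the missing cells are never stringified, isna() still works after reading the parquet file back, at the cost of the extra sidecar metadata in the real implementation.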
Option 2: Compressed Pickle (.pkl.gz)
We could switch to df.to_pickle(cache_file, compression="gzip"). It would be simpler, but probably less storage-efficient.
Otherwise we could leave it like this and note it for later, if you think this is not a big issue at the moment.
Option 1 seems like overkill, given that the option is for a util that is deactivated by default.
Option 2 is a strong no, as we would be replacing the issue with a security issue (and we would also lose Parquet's benefit of columnar storage compression).
I would be keen to merge as is, possibly adding a comment.
Alternatively, I am also fine with dropping the feature and keeping it in my branch.
Then we could merge it as is so we can use it too 👍
…, improve error handling of openrouter, remove compute_cohen_kappa
Reasons: