Commit 84b9235

Update notes on SBV
1 parent d7615e4 commit 84b9235

2 files changed

Lines changed: 9 additions & 5 deletions

js/leaderboardFilters.js

Lines changed: 1 addition & 1 deletion
@@ -409,7 +409,7 @@ function updateLeaderboardDescription(leaderboardName) {
 if (!textContainer) return;
 
 const descriptions = {
-'bash-only': '<em>Bash Only</em> evaluates all LMs with a <a href="https://github.com/SWE-agent/mini-swe-agent">minimal agent</a> on SWE-bench Verified (<a href="bash-only.html">details</a>)',
+'bash-only': '<em>Bash Only</em> evaluates all LMs with a <a href="https://github.com/SWE-agent/mini-swe-agent">minimal agent</a> on SWE-bench Verified.<br/>Grayed out results were obtained with a different agent version (<a href="bash-only.html">details</a>).',
 'multilingual': '<em>Multilingual</em> features 300 tasks across 9 programming languages (<a href="multilingual-leaderboard.html">details</a>)',
 'lite': '<em>Lite</em> is a subset of 300 instances for less costly evaluation (<a href="lite.html">details</a>)',
 'verified': '<em>Verified</em> is a human-filtered subset of 500 instances (<a href="https://openai.com/index/introducing-swe-bench-verified/">details</a>)',

templates/pages/bash-only.html

Lines changed: 8 additions & 4 deletions
@@ -34,14 +34,18 @@ <h2>Overview</h2>
 
 <ul>
 <li>We use <a href="https://github.com/SWE-agent/mini-swe-agent/blob/main/src/minisweagent/config/extra/swebench.yaml">this configuration</a> for all models.</li>
-<li>The LM temperature is set to 0.0 if the temperature parameter is supported.</li>
-<li><a href="https://mini-swe-agent.com/latest/usage/swebench/">This guide</a> shows how to run the evaluation yourself.</li>
+<li>The release number in the leaderboard corresponds to the version of the mini-SWE-agent used to run the evaluation.</li>
+<li>Results of release 1.x and 2.x are not necessarily comparable to each other, as 2.x uses tool calling to invoke actions, whereas 1.x parses action from the output strings. Read more about the changes in the <a href="https://mini-swe-agent.com/latest/advanced/v2_migration/">mini-SWE-agent v2 migration guide</a>.</li>
+<li>For all results of release 1.x and earlier, the LM temperature is set to 0.0 if the temperature parameter is supported. For all results of release 2.x and later, the temperature parameter is not set. </li>
 <li>
-Small changes in the setup and configuration are captured by the version number in the leaderboard.
+Other than the aforementioned notes, small changes in the setup and configuration are captured by the version number in the leaderboard.
 Version numbers correspond to tags in the mini-SWE-agent repository.
 Since the mini-SWE-agent repository contains other components as well, a new version number does not necessarily mean that anything of relevance has changed for the bash-only leaderboard setting.
 We do <em>not</em> aim to tune the configuration and setup to reach higher and higher scores.
-Instead, we only make general fixes to the framework, as well as clarifications in the prompt to provide a maximally fair evaluation setup for the LMs. </li>
+Instead, we only make general fixes to the framework, as well as clarifications in the prompt to provide a maximally fair evaluation setup for the LMs.
+Generally, everything in the minor or patch release version number should be a minor change for the purpose of the bash-only leaderboard.
+</li>
+<li><a href="https://mini-swe-agent.com/latest/usage/swebench/">This guide</a> shows how to run the evaluation yourself.</li>
 </ul>
 </details>
 
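
The release rules added in this commit (1.x and 2.x results are not necessarily comparable; temperature is 0.0 for 1.x and earlier when supported, and unset for 2.x and later) could be expressed roughly as follows. This is only an illustrative sketch, not code from this commit or from the site: the helper names `majorVersion`, `isComparable`, and `temperatureFor` are hypothetical, and it assumes release strings look like `1.17.3` or `v2.0.0`.

```javascript
// Hypothetical helpers, not part of js/leaderboardFilters.js.

// Extract the major component of a release string such as "1.17.3" or "v2.0.0".
// Returns null when the string does not start with a version number.
function majorVersion(release) {
  const match = /^v?(\d+)(?:\.|$)/.exec(String(release).trim());
  return match ? Number(match[1]) : null;
}

// Treat two results as directly comparable only when both releases share a
// major version, since 1.x parses actions from output strings and 2.x uses
// tool calling.
function isComparable(releaseA, releaseB) {
  const a = majorVersion(releaseA);
  const b = majorVersion(releaseB);
  return a !== null && a === b;
}

// Temperature per the notes: 0.0 for release 1.x and earlier if the model
// supports the parameter; otherwise the parameter is left unset (undefined).
function temperatureFor(release, supportsTemperature) {
  const major = majorVersion(release);
  if (major === null) return undefined;
  return major <= 1 && supportsTemperature ? 0.0 : undefined;
}
```

A leaderboard script could use a check like `isComparable(row.release, referenceRelease)` to decide which rows to gray out; which release serves as the reference is not stated in this commit and is left open here.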
