js/leaderboardFilters.js
Lines changed: 1 addition & 1 deletion
@@ -409,7 +409,7 @@ function updateLeaderboardDescription(leaderboardName) {
     if (!textContainer) return;

     const descriptions = {
-        'bash-only': '<em>Bash Only</em> evaluates all LMs with a <a href="https://github.com/SWE-agent/mini-swe-agent">minimal agent</a> on SWE-bench Verified (<a href="bash-only.html">details</a>)',
+        'bash-only': '<em>Bash Only</em> evaluates all LMs with a <a href="https://github.com/SWE-agent/mini-swe-agent">minimal agent</a> on SWE-bench Verified.<br/>Grayed out results were obtained with a different agent version (<a href="bash-only.html">details</a>).',
         'multilingual': '<em>Multilingual</em> features 300 tasks across 9 programming languages (<a href="multilingual-leaderboard.html">details</a>)',
         'lite': '<em>Lite</em> is a subset of 300 instances for less costly evaluation (<a href="lite.html">details</a>)',
         'verified': '<em>Verified</em> is a human-filtered subset of 500 instances (<a href="https://openai.com/index/introducing-swe-bench-verified/">details</a>)',
templates/pages/bash-only.html
Lines changed: 8 additions & 4 deletions
@@ -34,14 +34,18 @@ <h2>Overview</h2>

 <ul>
   <li>We use <a href="https://github.com/SWE-agent/mini-swe-agent/blob/main/src/minisweagent/config/extra/swebench.yaml">this configuration</a> for all models.</li>
-  <li>The LM temperature is set to 0.0 if the temperature parameter is supported.</li>
-  <li><a href="https://mini-swe-agent.com/latest/usage/swebench/">This guide</a> shows how to run the evaluation yourself.</li>
+  <li>The release number in the leaderboard corresponds to the version of the mini-SWE-agent used to run the evaluation.</li>
+  <li>Results of release 1.x and 2.x are not necessarily comparable to each other, as 2.x uses tool calling to invoke actions, whereas 1.x parses action from the output strings. Read more about the changes in the <a href="https://mini-swe-agent.com/latest/advanced/v2_migration/">mini-SWE-agent v2 migration guide</a>.</li>
+  <li>For all results of release 1.x and earlier, the LM temperature is set to 0.0 if the temperature parameter is supported. For all results of release 2.x and later, the temperature parameter is not set. </li>
   <li>
-    Small changes in the setup and configuration are captured by the version number in the leaderboard.
+    Other than the aforementioned notes, small changes in the setup and configuration are captured by the version number in the leaderboard.
     Version numbers correspond to tags in the mini-SWE-agent repository.
     Since the mini-SWE-agent repository contains other components as well, a new version number does not necessarily mean that anything of relevance has changed for the bash-only leaderboard setting.
     We do <em>not</em> aim to tune the configuration and setup to reach higher and higher scores.
-    Instead, we only make general fixes to the framework, as well as clarifications in the prompt to provide a maximally fair evaluation setup for the LMs. </li>
+    Instead, we only make general fixes to the framework, as well as clarifications in the prompt to provide a maximally fair evaluation setup for the LMs.
+    Generally, everything in the minor or patch release version number should be a minor change for the purpose of the bash-only leaderboard.
+  </li>
+  <li><a href="https://mini-swe-agent.com/latest/usage/swebench/">This guide</a> shows how to run the evaluation yourself.</li>
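Note on the versioning rules added in this change: the sketch below shows one way the 1.x versus 2.x distinction could be expressed in JavaScript. It is not code from this commit. The function names (`majorVersion`, `isComparable`, `temperatureFor`) are hypothetical, and it assumes release tags look like `1.17.3` or `v2.0.0`, which the diff does not confirm. How the leaderboard actually decides which rows to gray out is not shown in the diff either, so treating "same major version" as the criterion is an assumption.

```javascript
// Hypothetical helper: extract the major version from a release tag.
// Assumes tags look like "1.17.3" or "v2.0.0"; the real tag format may differ.
function majorVersion(release) {
  const match = /^v?(\d+)(?:\.|$)/.exec(String(release).trim());
  return match ? Number(match[1]) : null;
}

// Assumption: two results are comparable only when they share a major version,
// because 2.x uses tool calling while 1.x parses actions from output strings.
function isComparable(releaseA, releaseB) {
  const a = majorVersion(releaseA);
  const b = majorVersion(releaseB);
  return a !== null && a === b;
}

// Temperature rule from the notes: 0.0 for release 1.x and earlier
// (only if the model supports the parameter), not set for 2.x and later.
function temperatureFor(release, supportsTemperature = true) {
  const major = majorVersion(release);
  if (major === null) return undefined;
  return major <= 1 && supportsTemperature ? 0.0 : undefined;
}
```

Under these assumptions, a row could be grayed out whenever `isComparable(row.release, referenceRelease)` is false, where `row.release` and `referenceRelease` are likewise placeholder names.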