# Chronos Benchmark Leaderboard

## 🏆 SWE-bench Lite (Industry Standard Benchmark)

### Overall Performance

| Rank | System | Success Rate | Instances Resolved | Year |
|------|--------|--------------|-------------------|------|
| 🥇 **1** | **Kodezi Chronos** | **80.33%** | **241/300** | **2025** |
| 🥈 2 | ExpeRepair-v1.0 + Claude 4.5 Sonnet | 60.33% | 181/300 | 2025 |
| 3 | Claude 4.5 Sonnet (Bash Only) | ~14% | ~42/300 | 2025 |
| 4 | Claude 4.1 Opus (Bash Only) | 14.2% | 43/300 | 2025 |
| 5 | GPT-4.1 | 13.8% | 41/300 | 2025 |
| 6 | Gemini 2.0 Pro | 13.4% | 40/300 | 2025 |

**Key Achievement**: a 20-percentage-point absolute lead over the second-place system (80.33% vs 60.33%)

### Repository-Specific Performance (SWE-bench Lite)

| Repository | Domain | Chronos Success | Instances |
|------------|--------|----------------|-----------|
| **sympy** | Symbolic mathematics | **96.1%** | 51/53 |
| **sphinx** | Documentation systems | **93.8%** | 60/64 |
| **django** | Web frameworks | **90.4%** | 104/115 |
| **Overall** | Mixed | **80.33%** | **241/300** |

### The Code Generation vs Debugging Gap

| Model | SWE-bench Full (Code Gen) | SWE-bench Lite (Debug) | Gap |
|-------|---------------------------|------------------------|-----|
| Claude 4.5 Sonnet | 72.7% | ~14% | -58.7pp |
| Claude 4.1 Opus | 72.5% | 14.2% | -58.3pp |
| Claude 4.1 Opus (Bash Only) | 67.60% | 14.2% | -53.4pp |
| GPT-4.1 | 54.6% | 13.8% | -40.8pp |
| **Chronos** (Debug-Specialized) | **N/A** | **80.33%** | **Purpose-Built** |

**Key Insight**: General-purpose models that achieve 70%+ on code generation drop below 15% on debugging tasks, suggesting that debugging requires specialized architectures rather than simply larger context windows.

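The gap column is simply the difference, in percentage points, between a model's debugging and code-generation scores. A minimal sketch, with scores copied from the table above (the dictionary layout is illustrative):

```python
# Recompute the "Gap" column from the table above:
# gap (percentage points) = debug score - code-generation score.
scores = {
    # model: (SWE-bench Full code-gen %, SWE-bench Lite debug %)
    "Claude 4.1 Opus": (72.5, 14.2),
    "Claude 4.1 Opus (Bash Only)": (67.60, 14.2),
    "GPT-4.1": (54.6, 13.8),
}

for model, (codegen, debug) in scores.items():
    gap = debug - codegen
    print(f"{model}: {gap:+.1f}pp")  # e.g. "Claude 4.1 Opus: -58.3pp"
```

The `+.1f` format keeps the sign, matching the table's convention of reporting the drop as a negative gap.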
---

## 🏆 MRR Benchmark (Multi-Random Retrieval)

### Overall Performance (5,000 scenarios)

| Model | Success Rate | Precision | Recall | Avg Iterations | Cost/Fix |
|-------|--------------|-----------|--------|----------------|----------|
| **Chronos*** | 67.3%±2.1% | 92% | 85% | 7.8 | $1.36 |
| Gemini-2.0 Pro | 15.0%±1.5% | 74% | 38% | 19.2 | $4.25 |
| Claude-4.1 Opus | 14.2%±1.3% | 67% | 34% | 21.8 | $4.89 |
| GPT-4.1 | 13.8%±1.2% | 68% | 32% | 23.5 | $5.53 |
| DeepSeek-V2 | 8.7%±0.9% | 52% | 21% | 28.1 | $7.82 |
| Mistral-Large | 9.2%±0.8% | 48% | 19% | 31.7 | $8.95 |

| Model | Success Rate | Improvement vs GPT-4.1 |
|-------|--------------|---------------------|
| Chronos | 94.2% | 1.1x |
| GPT-4.1 | 82.3% | - |
| Claude-4.1 Opus | 79.8% | 0.97x |
| Gemini-2.0 | 85.1% | 1.03x |

### Logic Errors (1,200 scenarios)
| Model | Success Rate | Improvement vs GPT-4.1 |
|-------|--------------|---------------------|
| Chronos | 72.8% | 6.0x |
| GPT-4.1 | 12.1% | - |
| Claude-4.1 Opus | 10.7% | 0.88x |
| Gemini-2.0 | 15.3% | 1.26x |

### Concurrency Issues (800 scenarios)
| Model | Success Rate | Improvement vs GPT-4.1 |
|-------|--------------|---------------------|
| Chronos | 58.3% | 18.2x |
| GPT-4.1 | 3.2% | - |
| Claude-4.1 Opus | 2.8% | 0.88x |
| Gemini-2.0 | 4.1% | 1.28x |

### Memory Issues (600 scenarios)
| Model | Success Rate | Improvement vs GPT-4.1 |
|-------|--------------|---------------------|
| Chronos | 61.7% | 10.8x |
| GPT-4.1 | 5.7% | - |
| Claude-4.1 Opus | 4.3% | 0.75x |
| Gemini-2.0 | 6.9% | 1.21x |

### API Misuse (900 scenarios)
| Model | Success Rate | Improvement vs GPT-4.1 |
|-------|--------------|---------------------|
| Chronos | 79.1% | 4.2x |
| GPT-4.1 | 18.9% | - |
| Claude-4.1 Opus | 16.2% | 0.86x |
| Gemini-2.0 | 22.4% | 1.19x |

### Performance Bugs (400 scenarios)
| Model | Success Rate | Improvement vs GPT-4.1 |
|-------|--------------|---------------------|
| Chronos | 65.4% | 8.8x |
| GPT-4.1 | 7.4% | - |
| Claude-4.1 Opus | 6.1% | 0.82x |
| Gemini-2.0 | 9.8% | 1.32x |

### Cross-Category (600 scenarios)
| Model | Success Rate | Improvement vs GPT-4.1 |
|-------|--------------|---------------------|
| Chronos | 51.2% | 12.5x |
| GPT-4.1 | 4.1% | - |
| Claude-4.1 Opus | 3.7% | 0.90x |
| Gemini-2.0 | 5.2% | 1.27x |

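The improvement column in each category table is the ratio of a model's success rate to the GPT-4.1 baseline for the same category. A minimal sketch using the Logic Errors numbers (the dictionary layout is illustrative):

```python
# Recompute the improvement column for one category: each model's
# success rate divided by the GPT-4.1 baseline for that category.
# Rates are taken from the Logic Errors table above.
baseline = 12.1  # GPT-4.1 success rate (%) on Logic Errors

rates = {"Chronos": 72.8, "Claude-4.1 Opus": 10.7, "Gemini-2.0": 15.3}

for model, rate in rates.items():
    print(f"{model}: {rate / baseline:.2f}x")
```

Note that the tables round to varying precision: Chronos's ratio on Logic Errors is 6.02x to two decimals, reported as 6.0x above.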
## Repository Scale Performance