Commit 93612b3

Q4 2025 Updates
1 parent 47bf0ce commit 93612b3

316 files changed

Lines changed: 85179 additions & 216 deletions

CITATION.cff

Lines changed: 14 additions & 11 deletions
@@ -32,17 +32,20 @@ repository-code: 'https://github.com/kodezi/chronos-research'
 url: 'https://kodezi.com/chronos'
 repository: 'https://arxiv.org/abs/2507.12482'
 abstract: >-
-Large Language Models (LLMs) have advanced code generation and software automation,
-but are fundamentally constrained by limited inference-time context and lack of
-explicit code structure reasoning. We introduce Kodezi Chronos, a next-generation
-architecture for autonomous code understanding, debugging, and maintenance, designed
-to operate across ultra-long contexts comprising entire codebases, histories, and
-documentation—all without fixed window limits. Kodezi Chronos leverages a multi-level
-embedding memory engine, combining vector and graph-based indexing with continuous
-code-aware retrieval. This enables efficient and accurate reasoning over millions
-of lines of code, supporting repository-scale comprehension, multi-file refactoring,
-and real-time self-healing actions. Chronos achieves 67.3% debugging success rate,
-representing a 4-5x improvement over state-of-the-art models including Claude Opus 4 and GPT-4.1.
+Large Language Models (LLMs) have advanced code generation and software automation,
+but are fundamentally constrained by limited inference-time context and lack of
+explicit code structure reasoning. We introduce Kodezi Chronos, a next-generation
+architecture for autonomous code understanding, debugging, and maintenance, designed
+to operate across ultra-long contexts comprising entire codebases, histories, and
+documentation—all without fixed window limits. Kodezi Chronos leverages a multi-level
+embedding memory engine, combining vector and graph-based indexing with continuous
+code-aware retrieval. This enables efficient and accurate reasoning over millions
+of lines of code, supporting repository-scale comprehension, multi-file refactoring,
+and real-time self-healing actions. Chronos achieves state-of-the-art 80.33% success
+rate on SWE-bench Lite (241/300 instances), a 20 percentage point lead over the next
+best system, and 67.3% on comprehensive debugging benchmarks—representing a 4-5x
+improvement over state-of-the-art models including Claude 4.1 Opus, Claude 4.5 Sonnet,
+and GPT-4.1.
 keywords:
 - debugging
 - language models
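
Review note: the headline numbers added to the abstract can be recomputed from figures stated in this commit. The sketch below is a minimal Python check, not part of the repository; the variable names are illustrative, and the baseline rates are copied from the MRR table in LEADERBOARD.md as changed by this same commit.

```python
# Figures copied from the abstract and LEADERBOARD.md in this commit.
# All names here are illustrative; nothing below comes from the Chronos codebase.
resolved, total = 241, 300
chronos_rate = round(100 * resolved / total, 2)    # SWE-bench Lite success rate, in percent
runner_up_rate = 60.33                             # ExpeRepair-v1.0 + Claude 4.5 Sonnet
lead_pp = round(chronos_rate - runner_up_rate, 2)  # lead in percentage points

# MRR benchmark success rates (percent) for Chronos and the baselines listed.
mrr_chronos = 67.3
mrr_baselines = {"Gemini-2.0 Pro": 15.0, "Claude-4.1 Opus": 14.2, "GPT-4.1": 13.8}
ratios = {name: round(mrr_chronos / rate, 2) for name, rate in mrr_baselines.items()}

print(chronos_rate, lead_pp, ratios)
```

Under these inputs the lead comes out at 20.0 percentage points and each ratio falls between 4x and 5x, which is consistent with the abstract's wording.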

LEADERBOARD.md

Lines changed: 50 additions & 10 deletions
@@ -1,12 +1,52 @@
-# Chronos MRR Benchmark Leaderboard
+# Chronos Benchmark Leaderboard
 
-## Overall Performance (5,000 scenarios)
+## 🏆 SWE-bench Lite (Industry Standard Benchmark)
+
+### Overall Performance
+
+| Rank | System | Success Rate | Instances Resolved | Year |
+|------|--------|--------------|-------------------|------|
+| 🥇 **1** | **Kodezi Chronos** | **80.33%** | **241/300** | **2025** |
+| 🥈 2 | ExpeRepair-v1.0 + Claude 4.5 Sonnet | 60.33% | 181/300 | 2025 |
+| 3 | Claude 4.5 Sonnet (Bash Only) | ~14% | ~42/300 | 2025 |
+| 4 | Claude 4.1 Opus (Bash Only) | 14.2% | 43/300 | 2025 |
+| 5 | GPT-4.1 | 13.8% | 41/300 | 2025 |
+| 6 | Gemini 2.0 Pro | 13.4% | 40/300 | 2025 |
+
+**Key Achievement**: 20 percentage point absolute lead over second place
+
+### Repository-Specific Performance (SWE-bench Lite)
+
+| Repository | Domain | Chronos Success | Instances |
+|------------|--------|----------------|-----------|
+| **sympy** | Symbolic mathematics | **96.1%** | 51/53 |
+| **sphinx** | Documentation systems | **93.8%** | 60/64 |
+| **django** | Web frameworks | **90.4%** | 104/115 |
+| **Overall** | Mixed | **80.33%** | **241/300** |
+
+### The Code Generation vs Debugging Gap
+
+| Model | SWE-bench Full (Code Gen) | SWE-bench Lite (Debug) | Gap |
+|-------|---------------------------|------------------------|-----|
+| Claude 4.5 Sonnet | 72.7% | ~14% | -58.7pp |
+| Claude 4.1 Opus | 72.5% | 14.2% | -58.3pp |
+| Claude 4.1 Opus (Bash Only) | 67.60% | 14.2% | -53.4pp |
+| GPT-4.1 | 54.6% | 13.8% | -40.8pp |
+| **Chronos** (Debug-Specialized) | **N/A** | **80.33%** | **Purpose-Built** |
+
+**Key Insight**: General-purpose models achieving 70%+ on code generation drop to <15% on debugging tasks, revealing debugging requires specialized architectures rather than just larger context windows.
+
+---
+
+## 🏆 MRR Benchmark (Multi-Random Retrieval)
+
+### Overall Performance (5,000 scenarios)
 
 | Model | Success Rate | Precision | Recall | Avg Iterations | Cost/Fix |
 |-------|--------------|-----------|--------|----------------|----------|
 | **Chronos*** | 67.3%±2.1% | 92% | 85% | 7.8 | $1.36 |
 | Gemini-2.0 Pro | 15.0%±1.5% | 74% | 38% | 19.2 | $4.25 |
-| Claude-4 Opus | 14.2%±1.3% | 67% | 34% | 21.8 | $4.89 |
+| Claude-4.1 Opus | 14.2%±1.3% | 67% | 34% | 21.8 | $4.89 |
 | GPT-4.1 | 13.8%±1.2% | 68% | 32% | 23.5 | $5.53 |
 | DeepSeek-V2 | 8.7%±0.9% | 52% | 21% | 28.1 | $7.82 |
 | Mistral-Large | 9.2%±0.8% | 48% | 19% | 31.7 | $8.95 |
@@ -20,55 +60,55 @@
 |-------|--------------|---------------------|
 | Chronos | 94.2% | 1.1x |
 | GPT-4.1 | 82.3% | - |
-| Claude-4 | 79.8% | 0.97x |
+| Claude-4.1 Opus | 79.8% | 0.97x |
 | Gemini-2.0 | 85.1% | 1.03x |
 
 ### Logic Errors (1,200 scenarios)
 | Model | Success Rate | Improvement vs GPT-4 |
 |-------|--------------|---------------------|
 | Chronos | 72.8% | 6.0x |
 | GPT-4.1 | 12.1% | - |
-| Claude-4 | 10.7% | 0.88x |
+| Claude-4.1 Opus | 10.7% | 0.88x |
 | Gemini-2.0 | 15.3% | 1.26x |
 
 ### Concurrency Issues (800 scenarios)
 | Model | Success Rate | Improvement vs GPT-4 |
 |-------|--------------|---------------------|
 | Chronos | 58.3% | 18.2x |
 | GPT-4.1 | 3.2% | - |
-| Claude-4 | 2.8% | 0.88x |
+| Claude-4.1 Opus | 2.8% | 0.88x |
 | Gemini-2.0 | 4.1% | 1.28x |
 
 ### Memory Issues (600 scenarios)
 | Model | Success Rate | Improvement vs GPT-4 |
 |-------|--------------|---------------------|
 | Chronos | 61.7% | 10.8x |
 | GPT-4.1 | 5.7% | - |
-| Claude-4 | 4.3% | 0.75x |
+| Claude-4.1 Opus | 4.3% | 0.75x |
 | Gemini-2.0 | 6.9% | 1.21x |
 
 ### API Misuse (900 scenarios)
 | Model | Success Rate | Improvement vs GPT-4 |
 |-------|--------------|---------------------|
 | Chronos | 79.1% | 4.2x |
 | GPT-4.1 | 18.9% | - |
-| Claude-4 | 16.2% | 0.86x |
+| Claude-4.1 Opus | 16.2% | 0.86x |
 | Gemini-2.0 | 22.4% | 1.19x |
 
 ### Performance Bugs (400 scenarios)
 | Model | Success Rate | Improvement vs GPT-4 |
 |-------|--------------|---------------------|
 | Chronos | 65.4% | 8.8x |
 | GPT-4.1 | 7.4% | - |
-| Claude-4 | 6.1% | 0.82x |
+| Claude-4.1 Opus | 6.1% | 0.82x |
 | Gemini-2.0 | 9.8% | 1.32x |
 
 ### Cross-Category (600 scenarios)
 | Model | Success Rate | Improvement vs GPT-4 |
 |-------|--------------|---------------------|
 | Chronos | 51.2% | 12.5x |
 | GPT-4.1 | 4.1% | - |
-| Claude-4 | 3.7% | 0.90x |
+| Claude-4.1 Opus | 3.7% | 0.90x |
 | Gemini-2.0 | 5.2% | 1.27x |
 
 ## Repository Scale Performance
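
Review note: the leaderboard tables in this file state both raw counts and percentages, and both success rates and improvement ratios, so each pair can be cross-checked. The following is a minimal Python sketch of such a check, assuming the rows are transcribed by hand from the diff above; the function names and tolerances are illustrative choices, not anything defined by the repository. The approximate "~14%" row and the first category table, whose heading falls outside the visible hunk, are left out.

```python
# SWE-bench Lite rows as (label, resolved, total, stated_percent),
# transcribed from the tables added to LEADERBOARD.md in this commit.
SWE_ROWS = [
    ("Kodezi Chronos", 241, 300, 80.33),
    ("ExpeRepair-v1.0 + Claude 4.5 Sonnet", 181, 300, 60.33),
    ("Claude 4.1 Opus (Bash Only)", 43, 300, 14.2),
    ("GPT-4.1", 41, 300, 13.8),
    ("Gemini 2.0 Pro", 40, 300, 13.4),
    ("sympy", 51, 53, 96.1),
    ("sphinx", 60, 64, 93.8),
    ("django", 104, 115, 90.4),
]

# MRR category rows as (category, chronos_percent, gpt41_percent, stated_ratio).
CATEGORY_ROWS = [
    ("Logic Errors", 72.8, 12.1, 6.0),
    ("Concurrency Issues", 58.3, 3.2, 18.2),
    ("Memory Issues", 61.7, 5.7, 10.8),
    ("API Misuse", 79.1, 18.9, 4.2),
    ("Performance Bugs", 65.4, 7.4, 8.8),
    ("Cross-Category", 51.2, 4.1, 12.5),
]


def rate_mismatches(rows, tolerance_pp=0.1):
    """Rows whose stated percent differs from resolved/total by more than tolerance_pp."""
    flagged = []
    for label, resolved, total, stated in rows:
        computed = 100 * resolved / total
        if abs(computed - stated) > tolerance_pp:
            flagged.append((label, stated, round(computed, 2)))
    return flagged


def ratio_mismatches(rows, tolerance=0.05):
    """Rows whose stated improvement ratio differs from chronos/gpt41 by more than tolerance."""
    flagged = []
    for category, chronos, gpt41, stated in rows:
        computed = chronos / gpt41
        if abs(computed - stated) > tolerance:
            flagged.append((category, stated, round(computed, 2)))
    return flagged


print(rate_mismatches(SWE_ROWS))
print(ratio_mismatches(CATEGORY_ROWS))
```

With a tolerance of 0.1 percentage points, three SWE-bench Lite rows are flagged: Claude 4.1 Opus (Bash Only) at 14.2% stated versus 14.33% from 43/300, GPT-4.1 at 13.8% versus 13.67% from 41/300, and sympy at 96.1% versus 96.23% from 51/53. The stated 14.2% and 13.8% coincide with the MRR success rates listed for the same models, which may be where they come from. The six category improvement ratios all agree with the stated values to one decimal place.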

LICENSE

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+© Kodezi Inc. All rights reserved.
+Use is subject to Kodezi's Terms of Service.
+
 MIT License
 
 Copyright (c) 2025 Kodezi Inc.
