Commit 47bf0ce

Update README.md
1 parent ce9d6b3 commit 47bf0ce

1 file changed

Lines changed: 104 additions & 158 deletions
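These totals can be cross-checked against the five hunk headers in the README.md diff. In a unified diff header `@@ -a,b +c,d @@`, `b` and `d` are the hunk's line counts before and after the change, so `d - b` is that hunk's additions minus deletions. Each hunk's new start `c` should also equal its old start `a` shifted by the net change of all earlier hunks. The helper name below is illustrative:

```python
import re

# Numeric part of each hunk header in this commit's README.md diff.
HUNK_HEADERS = [
    "@@ -27,16 +27,16 @@",
    "@@ -52,15 +52,15 @@",
    "@@ -100,113 +100,77 @@",
    "@@ -216,90 +180,72 @@",
    "@@ -468,4 +414,4 @@",
]

# @@ -old_start,old_len +new_start,new_len @@
# (assumes explicit counts; unified diff may omit ",len" when it is 1)
HUNK_RE = re.compile(r"@@ -(\d+),(\d+) \+(\d+),(\d+) @@")


def check_hunks(headers):
    """Verify every hunk's new_start and return the net line delta."""
    offset = 0  # net change (new_len - old_len) of all earlier hunks
    for header in headers:
        old_start, old_len, new_start, new_len = map(int, HUNK_RE.match(header).groups())
        assert new_start == old_start + offset, f"offset mismatch at {header}"
        offset += new_len - old_len
    return offset


net = check_hunks(HUNK_HEADERS)
print(net)  # -54
```

The net of -54 matches 104 additions minus 158 deletions; the headers alone cannot split that net figure into its two parts.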

README.md

@@ -27,16 +27,16 @@

 <div align="center">

-### ⚠️ IMPORTANT: Model Availability Notice ⚠️
+### ⚠️ Model Access Notice ⚠️

-**Kodezi Chronos is proprietary technology with exclusive access**
+**Chronos is proprietary and available exclusively through Kodezi OS**

 | Timeline | Access | Details |
 |:--------:|:------:|:-------:|
-| **Q4 2025** | Beta Access | Select enterprise partners via [chronos.so](https://chronos.so) |
-| **Q1 2026** | General Availability | Via [Kodezi OS](https://kodezi.com/os) platform |
+| **Q4 2025** | Beta | Limited enterprise access |
+| **Q1 2026** | GA | Via [Kodezi OS](https://kodezi.com/os) |

-This repository contains research findings, benchmarks, and evaluation frameworks. The model itself is not publicly available.
+This repository contains the MRR benchmark suite and evaluation framework only.

 </div>

@@ -52,15 +52,15 @@ This repository contains research findings, benchmarks, and evaluation framework

 ---

-## 🏆 Breakthrough Performance Metrics
+## 🏆 MRR Benchmark Results

 <div align="center">

-### Overall Benchmark Results (5,000+ Real-World Debugging Scenarios)
+### Overall Performance (5,000 MRR Scenarios)

-| Metric | **Kodezi Chronos** | **GPT-4.1** | **Claude-4-Opus** | **Gemini-2.0-Pro** | **Improvement** |
+| Metric | **Chronos** | **GPT-4.1** | **Claude-4** | **Gemini-2.0** | **Improvement** |
 |:------:|:------------------:|:---------:|:-----------------:|:------------------:|:---------------:|
-| **Debug Success Rate** | **67.3%±2.1%*** | 13.8%±1.2% | 14.2%±1.3% | <15% | **4.7-6.0x** |
+| **Debug Success Rate** | **67.3%** | 13.8% | 14.2% | 15.0% | **4.5x** |
 | **Root Cause Accuracy** | **89%*** | 12.3%±1.8% | 11.7%±2.0% | 15.8%±1.5% | **5.6-7.6x** |
 | **Average Fix Iterations** | **7.8** | 1-2 | 1-2 | 1-2 | **More thorough** |
 | **Retrieval Precision** | **92%*** | 68%±2.3% | 67%±2.4% | 74%±1.8% | **1.2-1.4x** |
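The Improvement column reads as Chronos's score divided by a baseline's: the unchanged Root Cause Accuracy row (89% against 15.8% and 11.7%) gives 5.6x and 7.6x. Assuming the same reading applies to the edited Debug Success Rate row, the ratios can be recomputed from the table values as they stand after this commit; the dictionary layout and helper name are only for illustration:

```python
# Rows from the results table as they read after this commit (percent).
ROWS = {
    "Debug Success Rate": (67.3, {"GPT-4.1": 13.8, "Claude-4": 14.2, "Gemini-2.0": 15.0}),
    "Root Cause Accuracy": (89.0, {"GPT-4.1": 12.3, "Claude-4": 11.7, "Gemini-2.0": 15.8}),
}


def improvement_range(ours, baselines):
    """Return (min, max) of ours / baseline over all baselines."""
    ratios = [ours / rate for rate in baselines.values()]
    return min(ratios), max(ratios)


for metric, (ours, baselines) in ROWS.items():
    low, high = improvement_range(ours, baselines)
    print(f"{metric}: {low:.1f}x-{high:.1f}x")
# Debug Success Rate: 4.5x-4.9x
# Root Cause Accuracy: 5.6x-7.6x
```

Under that reading, the new single figure of 4.5x is the ratio against the strongest baseline (Gemini-2.0 at 15.0%); against GPT-4.1 and Claude-4 the ratios are about 4.9x and 4.7x.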
@@ -100,113 +100,77 @@

 ---

-## 🧠 What Makes Chronos Revolutionary?
+## 🧠 Key Innovations in Chronos

-<div align="center">
-
-### **1. Debugging-First Architecture**
-Unlike code completion models trained on next-token prediction, Chronos is purpose-built from 42.5 million real debugging examples
+### 1. **Debugging-First Architecture**
+- Trained on 42.5M real debugging examples (not code completion)
+- Specialized for root cause analysis and multi-file patches
+- 78.4% root cause accuracy vs 15.8% best baseline

-### **2. Persistent Debug Memory**
-Learns from every debugging session across your codebase, improving continuously with cross-session pattern recognition
+### 2. **Persistent Debug Memory (PDM)**
+- Repository-specific learning from debugging sessions
+- Improves from 35% → 65% success rate over time
+- Cross-session pattern recognition

-### **3. Adaptive Graph-Guided Retrieval (AGR)**
-Dynamic k-hop expansion enables unlimited context through intelligent graph traversal, not brute-force token expansion
+### 3. **Adaptive Graph-Guided Retrieval (AGR)**
+- O(k log d) complexity with dynamic k-hop expansion
+- 92% precision, 85% recall on multi-file context
+- Handles unlimited repository scale intelligently

-### **4. Output-Optimized Design**
-Recognizes debugging as inherently output-heavy (~3K output vs ~3.6K input tokens), optimized for generating fixes, tests, and documentation
+### 4. **Output-Optimized Design**
+- Optimized for ~3K output tokens (fixes, tests, docs)
+- 47.2% output entropy density vs 12.8% for completion models
+- Designed for complex patch generation

-### **5. Autonomous Debugging Loop**
-Iteratively refines fixes through propose → test → analyze → refine cycles until all tests pass
-
-</div>
+### 5. **Autonomous Debugging Loop**
+- Average 7.8 iterations to successful fix
+- Propose → test → analyze → refine cycles
+- 67.3% fully autonomous success rate

 ---

-## 🏗️ Seven-Layer Architecture
-
-<div align="center">
-
-```mermaid
-graph TD
-A[Multi-Source Input Layer] --> B[Adaptive Retrieval Engine]
-B --> C[Debug-Tuned LLM Core]
-C --> D[Orchestration Controller]
-D --> E[Execution Sandbox]
-E --> F[Validation Results]
-F --> G{Tests Pass?}
-G -->|No| H[Iterative Refinement]
-H --> B
-G -->|Yes| I[Persistent Memory Update]
-I --> J[Fix Deployed]
-
-style A fill:#f9f,stroke:#333,stroke-width:4px
-style C fill:#bbf,stroke:#333,stroke-width:4px
-style I fill:#bfb,stroke:#333,stroke-width:4px
-```
-
-</div>
-
-### Architecture Layers Explained
-
-1. **Multi-Source Input Layer**
-- Ingests heterogeneous debugging signals: source code, CI/CD logs, error traces, tests, documentation
-- Processes 10+ input modalities simultaneously
-
-2. **Adaptive Retrieval Engine (AGR)**
-- Dynamic k-hop neighbor expansion (k=1-5 based on complexity)
-- 89.2% precision vs 42.3% for flat retrieval
-- Handles temporal code evolution and refactoring
+## 🏗️ Architecture Overview

-3. **Debug-Tuned LLM Core**
-- Trained on debugging workflows, not code completion
-- Specialized tasks: root cause prediction, multi-file patches, test interpretation
-- 78.4% root cause accuracy vs 15.8% best baseline
+### Seven-Layer System Design

-4. **Orchestration Controller**
-- Manages autonomous debugging loop
-- Hypothesis generation → fix refinement → rollback on failure
-- Average 2.2 cycles to success
+1. **Multi-Source Input Layer**: Processes code, logs, traces, tests, docs simultaneously
+2. **Adaptive Retrieval Engine (AGR)**: Dynamic k-hop graph traversal (92% precision)
+3. **Debug-Tuned LLM Core**: 42.5M debugging examples, not code completion
+4. **Orchestration Controller**: Autonomous debugging loop management
+5. **Persistent Debug Memory**: Repository-specific learning (35% → 65% improvement)
+6. **Execution Sandbox**: Isolated test validation environment
+7. **Explainability Layer**: Human-readable root cause analysis

-5. **Persistent Debug Memory**
-- Repository-specific bug patterns and fixes
-- Cross-session learning and adaptation
-- 7.3x better token efficiency through memory
-
-6. **Execution Sandbox**
-- Isolated test execution environment
-- CI/CD pipeline emulation
-- Real-time validation without production risk
-
-7. **Explainability Layer**
-- Human-readable root cause explanations
-- Automated PR descriptions and commit messages
-- Risk assessment for proposed changes
+See [architecture documentation](architecture/README.md) for detailed specifications.

 ---

 ## 📊 Multi-Random Retrieval (MRR) Benchmark

-<div align="center">
+### What is MRR?

-### Revolutionary Evaluation Framework
+MRR simulates real-world debugging complexity by:
+- **Spatial Distribution**: Bug context scattered across 10-50 files
+- **Temporal Dispersion**: Relevant information from 3-12 months of history
+- **Obfuscation Levels**: Low/medium/high code complexity
+- **5,000 Scenarios**: Comprehensive evaluation across languages and bug types

-| Metric | **Chronos** | **GPT-4+RAG** | **Claude-3+VectorDB** | **Gemini-1.5+Graph** |
-|:------:|:-----------:|:-------------:|:---------------------:|:--------------------:|
-| **Precision@10** | **89.2%** | 42.3% | 48.1% | 51.7% |
-| **Recall@10** | **84.7%** | 31.7% | 36.2% | 41.8% |
-| **Fix Accuracy** | **67.3%** | 8.9% | 11.2% | 14.6% |
-| **Context Efficiency** | **0.71** | 0.23 | 0.28 | 0.31 |
+### MRR Results

-MRR tests real-world debugging by scattering context across 10-50 files over 3-12 months of history
+| Metric | Chronos | GPT-4+RAG | Claude-3+VectorDB | Gemini-1.5+Graph |
+|:-------|:-------:|:---------:|:-----------------:|:----------------:|
+| **Precision@10** | 89.2% | 42.3% | 48.1% | 51.7% |
+| **Recall@10** | 84.7% | 31.7% | 36.2% | 41.8% |
+| **Fix Accuracy** | 67.3% | 8.9% | 11.2% | 14.6% |
+| **Context Efficiency** | 0.71 | 0.23 | 0.28 | 0.31 |

-</div>
+Full benchmark available in [benchmarks/multi-random-retrieval/](benchmarks/multi-random-retrieval/)

 ---

 ## 🚀 Getting Started

-### Research Repository Setup
+### Running the MRR Benchmark

 ```bash
 # Clone the repository
@@ -216,90 +180,72 @@ cd chronos-research
 # Install dependencies
 pip install -r requirements.txt

-# Run performance analysis notebooks
-jupyter notebook notebooks/performance_analysis.ipynb
+# Run MRR benchmark on your model
+python benchmarks/run_mrr_benchmark_2025.py \
+--model your_model \
+--scenarios 100 # Start with subset

-# Generate benchmark visualizations
-python scripts/generate_visualizations.py
+# Analyze results
+python benchmarks/analyze_results.py
 ```

-### Access Chronos Model
+### Model Access

-<div align="center">
-
-| Step | Action | Timeline |
-|:----:|:-------|:---------|
-| 1 | [Join Waitlist](https://chronos.so) | Available Now |
-| 2 | Beta Access | Q4 2025 |
-| 3 | General Availability | Q1 2026 |
+**⚠️ The Chronos model is not included in this repository**

-</div>
+Chronos will be available via [Kodezi OS](https://kodezi.com/os):
+- **Q4 2025**: Enterprise beta
+- **Q1 2026**: General availability
+- **Join waitlist**: [chronos.so](https://chronos.so)

 ---

-## 📁 Repository Structure
+## 📁 Repository Contents

 ```
 chronos-research/
-├── paper/ # Research paper (arXiv:2507.12482)
-│ ├── chronos-research.md # Full paper content
-│ ├── figures/ # All paper figures
-│ └── tables/ # Performance data tables
-├── benchmarks/ # Evaluation frameworks
-│ ├── multi-random-retrieval/ # MRR benchmark suite
-│ ├── debug_categories/ # Bug taxonomy
-│ └── evaluation_metrics/ # Metrics implementation
-├── results/ # Performance analysis
-│ ├── case_studies/ # Real debugging examples
-│ ├── ablation_studies/ # Component analysis
-│ └── performance_tables/ # Detailed metrics
-├── architecture/ # System design docs
-│ ├── agr_retrieval.md # AGR algorithm details
-│ ├── memory_engine.md # Persistent memory design
-│ └── debugging_loop.md # Autonomous loop
-├── evaluation/ # Testing methodology
-├── examples/ # Code examples
-├── docs/ # User documentation
-├── notebooks/ # Analysis notebooks
-└── scripts/ # Utility scripts
+├── benchmarks/ # MRR Benchmark Suite
+│ ├── multi-random-retrieval/ # 5,000 scenario benchmark
+│ ├── evaluation_metrics/ # Metrics implementation
+│ └── run_mrr_benchmark_2025.py # Main benchmark runner
+├── reference_implementations/ # Algorithm references (NOT the model)
+│ ├── algorithms/ # AGR, PDM implementations
+│ └── NOTICE.md # Proprietary model notice
+├── paper/ # Research paper
+│ └── chronos-research-2025.md # Full paper (arXiv:2507.12482)
+├── results/ # Performance data
+│ ├── raw_data/ # 5,000 scenario results
+│ └── case_studies/ # Debugging examples
+├── figures/ # Paper visualizations
+│ └── paper_figures/ # 11 paper figures
+├── docs/ # Documentation
+├── MODEL_ACCESS.md # How to access Chronos
+└── LEADERBOARD.md # Performance rankings
 ```

 ---

-## 🌟 Key Innovations
+## 🔬 Research Highlights

-### 1. Revolutionary Training Dataset
-- **42.5M** total debugging examples
-- **15M** GitHub issues with linked PRs and fixes
-- **8M** stack traces paired with resolutions
-- **3M** CI/CD logs from failed and fixed builds
-- **2.5M** production debugging sessions
-- **14M** examples from Defects4J, SWE-bench, BugsInPy
+### Training Dataset
+- 42.5M debugging examples (not code completion)
+- 15M GitHub issues with fixes
+- 8M stack traces with resolutions
+- 3M CI/CD debugging logs
+- 2.5M production sessions
+- 14M curated from Defects4J, SWE-bench, BugsInPy

-### 2. Adaptive Graph-Guided Retrieval (AGR)
-```
-Performance by Retrieval Depth:
-k=1 (Direct): 58.2% success
-k=2 (Expanded): 72.4% success
-k=3 (Deep): 71.8% success
-k=adaptive: 87.1% success (dynamic depth selection)
-Flat retrieval: 23.4% success
-```
-
-### 3. Output-Heavy Optimization
-```
-Token Distribution in Debugging:
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-Input Tokens: ~3,600 (sparse)
-Output Tokens: ~3,000 (dense)
-Output Entropy Density: 47.2% (vs 12.8% for code completion)
-```
+### AGR Performance
+- k=1 hop: 58.2% success
+- k=2 hops: 72.4% success
+- k=adaptive: 87.1% success
+- Flat retrieval: 23.4% success

-### 4. Persistent Debug Memory
-- Cross-session learning improves success rate from 35% → 65% over time
-- 7.3x token efficiency through intelligent memory
-- Repository-specific pattern recognition
-- Temporal code evolution tracking
+### PDM Learning Curve
+- Initial: 35% success rate
+- After 100 sessions: 52% success
+- After 500 sessions: 65% success
+- 7.3x token efficiency gain

 ---

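The AGR bullets in this diff describe dynamic k-hop expansion over a code graph, and the removed text bounded k at 1-5 "based on complexity". The algorithm itself is not part of this commit, so the following is only a sketch under stated assumptions: the repository is an adjacency dict of files (linked by imports or calls), every file already carries a relevance score in [0, 1], and expansion stops when a hop adds nothing above a threshold, after `max_k` hops, or when a node budget runs out. The function name, parameters, and stopping rule are hypothetical; this is not Chronos's retriever or the code under `reference_implementations/algorithms/`.

```python
from typing import Dict, List, Tuple


def adaptive_k_hop(
    graph: Dict[str, List[str]],
    seeds: List[str],
    score: Dict[str, float],
    threshold: float = 0.5,
    max_k: int = 5,
    budget: int = 50,
) -> Tuple[List[str], int]:
    """Hypothetical adaptive k-hop expansion over a code graph.

    graph: node -> neighbors (e.g. files linked by imports or calls).
    seeds: nodes named directly by the bug report or stack trace.
    score: node -> relevance in [0, 1], assumed to be precomputed.
    Returns (selected nodes in discovery order, hops actually taken).
    """
    selected = list(seeds)
    seen = set(seeds)
    frontier = list(seeds)
    hops = 0
    while frontier and hops < max_k and len(selected) < budget:
        next_frontier = []
        for node in frontier:
            for nbr in graph.get(node, []):
                if nbr in seen:
                    continue
                seen.add(nbr)
                # Low-scoring neighbors are marked seen but neither kept nor expanded.
                if score.get(nbr, 0.0) >= threshold:
                    next_frontier.append(nbr)
        if not next_frontier:
            break  # this hop found nothing above the threshold: stop expanding
        hops += 1
        room = budget - len(selected)
        frontier = next_frontier[:room]
        selected.extend(frontier)
    return selected, hops


GRAPH = {
    "handler.py": ["session.py", "logging.py"],
    "session.py": ["tokens.py", "cache.py"],
    "tokens.py": ["crypto.py"],
    "crypto.py": ["vendor.py"],
}
SCORES = {"session.py": 0.9, "logging.py": 0.1, "tokens.py": 0.8,
          "cache.py": 0.3, "crypto.py": 0.2, "vendor.py": 0.9}

print(adaptive_k_hop(GRAPH, ["handler.py"], SCORES))
# (['handler.py', 'session.py', 'tokens.py'], 2)
```

Here `vendor.py` scores 0.9 but is never reached, because the only path to it runs through the low-scoring `crypto.py`. That pruning is the trade-off a threshold-based stop makes compared with retrieving every node within a fixed k.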
@@ -468,4 +414,4 @@ This research repository is licensed under the MIT License - see [LICENSE](LICEN

 <sub>Built with ❤️ by the Kodezi Team</sub>

-</div>
+</div>
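The MRR tables in this diff report Precision@10 and Recall@10 without defining them. Assuming the standard information-retrieval definitions, where the relevant set is the files that actually hold the bug context and the retriever returns a ranked list, per-scenario values can be computed as below. The file paths are invented for illustration, and this is not the repository's `evaluation_metrics/` code.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    retrieved: List[str], relevant: Set[str], k: int = 10
) -> Tuple[float, float]:
    """Standard Precision@k and Recall@k for a single retrieval query."""
    top_k = list(dict.fromkeys(retrieved))[:k]  # drop duplicates, keep rank order
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical scenario: bug context spread over 4 files; the retriever
# returns 10 ranked candidates, 3 of which are relevant.
relevant = {"auth/session.py", "auth/tokens.py", "db/pool.py", "config/settings.py"}
retrieved = [
    "auth/session.py", "utils/log.py", "auth/tokens.py", "api/routes.py",
    "db/pool.py", "tests/test_api.py", "README.md", "api/schemas.py",
    "utils/time.py", "cli/main.py",
]
p, r = precision_recall_at_k(retrieved, relevant, k=10)
print(f"Precision@10={p:.2f} Recall@10={r:.2f}")  # Precision@10=0.30 Recall@10=0.75
```

A benchmark-level figure such as 89.2% would then presumably be an average of per-scenario values, though the diff does not say how scenarios are aggregated.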

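The Autonomous Debugging Loop bullets describe propose → test → analyze → refine cycles, with an average of 7.8 iterations per successful fix. A minimal control-flow sketch follows, assuming a patch proposer and a sandboxed test runner are supplied as plain callables (both hypothetical) and an iteration cap that the README does not specify. It is not Chronos's orchestration controller.

```python
from typing import Callable, Optional, Tuple


def debug_loop(
    propose: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 10,
) -> Tuple[Optional[str], int]:
    """Iterate propose -> test -> analyze -> refine until the tests pass.

    propose:   takes the previous failure feedback (None on the first try)
               and returns a candidate patch.
    run_tests: applies a patch in isolation and returns (passed, failure log).
    Returns (accepted patch or None, iterations used).
    """
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        patch = propose(feedback)       # propose (refined by earlier feedback)
        passed, log = run_tests(patch)  # test
        if passed:
            return patch, iteration
        # analyze: here the raw failure log is simply passed back to the proposer
        feedback = log
    return None, max_iterations


# Toy stand-ins: the third proposal is the one that passes.
attempts = []

def toy_propose(feedback):
    attempts.append(feedback)
    return f"fix-{len(attempts)}"

def toy_tests(patch):
    return patch == "fix-3", f"{patch} failed"

print(debug_loop(toy_propose, toy_tests))  # ('fix-3', 3)
```

Keeping the proposer and test runner as injected callables leaves the loop itself free of any model or sandbox details, so the stopping logic can be exercised with toy functions like the ones above.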