<div align="center">

### ⚠️ Model Access Notice ⚠️

**Chronos is proprietary and available exclusively through Kodezi OS**

| Timeline | Access | Details |
|:--------:|:------:|:-------:|
| **Q4 2025** | Beta | Limited enterprise access |
| **Q1 2026** | GA | Via [Kodezi OS](https://kodezi.com/os) |

This repository contains the MRR benchmark suite and evaluation framework only.

</div>

---

## 🏆 MRR Benchmark Results

<div align="center">

### Overall Performance (5,000 MRR Scenarios)

| Metric | **Chronos** | **GPT-4.1** | **Claude-4** | **Gemini-2.0** | **Improvement** |
|:------:|:-----------:|:-----------:|:------------:|:--------------:|:---------------:|
| **Debug Success Rate** | **67.3%** | 13.8% | 14.2% | 15.0% | **4.5x** |
| **Root Cause Accuracy** | **89%*** | 12.3%±1.8% | 11.7%±2.0% | 15.8%±1.5% | **5.6-7.6x** |
| **Average Fix Iterations** | **7.8** | 1-2 | 1-2 | 1-2 | **More thorough** |
| **Retrieval Precision** | **92%*** | 68%±2.3% | 67%±2.4% | 74%±1.8% | **1.2-1.4x** |

---

## 🧠 Key Innovations in Chronos

### 1. **Debugging-First Architecture**
- Trained on 42.5M real debugging examples, not code completion
- Specialized for root cause analysis and multi-file patches
- 78.4% root cause accuracy vs 15.8% for the best baseline

### 2. **Persistent Debug Memory (PDM)**
- Repository-specific learning from debugging sessions
- Improves from 35% → 65% success rate over time
- Cross-session pattern recognition

### 3. **Adaptive Graph-Guided Retrieval (AGR)**
- O(k log d) complexity with dynamic k-hop expansion
- 92% precision, 85% recall on multi-file context
- Scales intelligently to repositories of any size

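The dynamic k-hop idea can be sketched as a hop-by-hop graph expansion that stops once the gathered context looks sufficient. This is an illustrative approximation only, not the proprietary implementation: the toy code graph, the `score_fn` confidence estimate, and the threshold are all assumptions.

```python
def adaptive_k_hop_retrieve(graph, seeds, score_fn, threshold=0.9, max_k=5):
    """Expand retrieval hop by hop over a code graph, stopping early once
    the scored context is judged sufficient (AGR-style sketch)."""
    retrieved = set(seeds)
    frontier = set(seeds)
    for k in range(1, max_k + 1):
        # One hop of expansion along code-graph edges (calls, imports, history)
        frontier = {n for node in frontier for n in graph.get(node, [])} - retrieved
        if not frontier:
            break
        retrieved |= frontier
        # Terminate early when the estimated context confidence is high enough
        if score_fn(retrieved) >= threshold:
            break
    return retrieved

# Toy code graph: edges lead from an error site to related files/functions
graph = {
    "handler.py:parse": ["utils.py:decode", "models.py:Record"],
    "utils.py:decode": ["codecs.py:utf8"],
    "models.py:Record": [],
    "codecs.py:utf8": [],
}
context = adaptive_k_hop_retrieve(
    graph,
    seeds={"handler.py:parse"},
    score_fn=lambda nodes: len(nodes) / 4,  # stand-in confidence estimate
)
```

The key contrast with flat retrieval is the early-termination check: depth adapts to the bug instead of being fixed up front.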
### 4. **Output-Optimized Design**
- Optimized for ~3K output tokens (fixes, tests, docs)
- 47.2% output entropy density vs 12.8% for completion models
- Designed for complex patch generation

### 5. **Autonomous Debugging Loop**
- Average 7.8 iterations to a successful fix
- Propose → test → analyze → refine cycles
- 67.3% fully autonomous success rate
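The propose → test → analyze → refine cycle reduces to a plain loop over candidate fixes. The sketch below is hypothetical: the stand-in callables and the toy "passes on the third attempt" condition are ours, not Chronos internals.

```python
from types import SimpleNamespace

def debugging_loop(bug, propose_fix, run_tests, analyze_failure, max_iters=10):
    """Propose -> test -> analyze -> refine until tests pass (sketch)."""
    feedback = None
    for iteration in range(1, max_iters + 1):
        patch = propose_fix(bug, feedback)   # generate a candidate fix
        result = run_tests(patch)            # validate in an isolated sandbox
        if result.passed:
            return patch, iteration
        feedback = analyze_failure(result)   # feed failures into the next round
    return None, max_iters

# Toy stand-ins: the fix "works" once two rounds of feedback have accumulated
attempts = {"n": 0}

def propose_fix(bug, feedback):
    return f"patch-{attempts['n']}"

def run_tests(patch):
    attempts["n"] += 1
    return SimpleNamespace(passed=attempts["n"] >= 3, log="AssertionError in test_parse")

def analyze_failure(result):
    return result.log

patch, iters = debugging_loop("NPE in parser", propose_fix, run_tests, analyze_failure)
```

Each failed validation feeds its analysis back into the next proposal, which is what distinguishes this loop from single-shot patch generation.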

---

## 🏗️ Architecture Overview

### Seven-Layer System Design

1. **Multi-Source Input Layer**: Processes code, logs, traces, tests, and docs simultaneously
2. **Adaptive Retrieval Engine (AGR)**: Dynamic k-hop graph traversal (92% precision)
3. **Debug-Tuned LLM Core**: Trained on 42.5M debugging examples, not code completion
4. **Orchestration Controller**: Manages the autonomous debugging loop
5. **Persistent Debug Memory**: Repository-specific learning (35% → 65% improvement)
6. **Execution Sandbox**: Isolated test validation environment
7. **Explainability Layer**: Human-readable root cause analysis

See the [architecture documentation](architecture/README.md) for detailed specifications.

---

## 📊 Multi-Random Retrieval (MRR) Benchmark

### What is MRR?

MRR simulates real-world debugging complexity through:
- **Spatial Distribution**: Bug context scattered across 10-50 files
- **Temporal Dispersion**: Relevant information drawn from 3-12 months of history
- **Obfuscation Levels**: Low/medium/high code complexity
- **5,000 Scenarios**: Comprehensive evaluation across languages and bug types
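A scenario with the properties above might be represented like this. The field names are illustrative only, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MRRScenario:
    """Hypothetical shape of one MRR debugging task."""
    bug_id: str
    scattered_files: list    # 10-50 files holding the bug's context
    history_months: int      # 3-12 months of relevant history
    obfuscation: str         # "low" | "medium" | "high"
    ground_truth_files: set  # files a correct fix must consult

scenario = MRRScenario(
    bug_id="mrr-0001",
    scattered_files=[f"src/module_{i}.py" for i in range(12)],
    history_months=6,
    obfuscation="medium",
    ground_truth_files={"src/module_3.py", "src/module_7.py"},
)
assert 10 <= len(scenario.scattered_files) <= 50
```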

### MRR Results

| Metric | Chronos | GPT-4+RAG | Claude-3+VectorDB | Gemini-1.5+Graph |
|:-------|:-------:|:---------:|:-----------------:|:----------------:|
| **Precision@10** | 89.2% | 42.3% | 48.1% | 51.7% |
| **Recall@10** | 84.7% | 31.7% | 36.2% | 41.8% |
| **Fix Accuracy** | 67.3% | 8.9% | 11.2% | 14.6% |
| **Context Efficiency** | 0.71 | 0.23 | 0.28 | 0.31 |

The full benchmark is available in [benchmarks/multi-random-retrieval/](benchmarks/multi-random-retrieval/).
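The first two rows are standard retrieval metrics; a minimal sketch of how they could be computed is shown below. The context-efficiency definition used here (fraction of retrieved tokens the final fix actually used) is our assumption, not the benchmark's stated formula.

```python
def precision_recall_at_k(retrieved, relevant, k=10):
    """Standard precision/recall over the top-k retrieved artifacts."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def context_efficiency(tokens_used_in_fix, tokens_retrieved):
    # Assumed definition: share of retrieved context the fix actually consumed
    return tokens_used_in_fix / tokens_retrieved

# Toy ranking: 10 retrieved files, 4 truly relevant (one never retrieved)
retrieved = [f"f{i}" for i in range(10)]
relevant = {"f0", "f2", "f4", "f11"}
p, r = precision_recall_at_k(retrieved, relevant)  # p = 0.3, r = 0.75
ce = context_efficiency(710, 1000)                 # 0.71
```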

---

## 🚀 Getting Started

### Running the MRR Benchmark

```bash
# Clone the repository, then enter it
cd chronos-research

# Install dependencies
pip install -r requirements.txt

# Run the MRR benchmark on your model
python benchmarks/run_mrr_benchmark_2025.py \
    --model your_model \
    --scenarios 100  # Start with a subset

# Analyze results
python benchmarks/analyze_results.py
```
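The `--model` flag implies some adapter around the system under test. The shape below is purely hypothetical — consult the benchmark scripts for the interface they actually expect:

```python
class MyModelAdapter:
    """Hypothetical adapter sketch for plugging a model into the benchmark.
    Method names and signatures are assumptions, not the runner's real API."""
    name = "your_model"

    def retrieve(self, scenario):
        # Return the files/snippets your system would inspect for this bug
        return scenario["scattered_files"][:10]

    def propose_fix(self, scenario, context):
        # Return a candidate patch (or None to skip the scenario)
        return None
```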

### Model Access

**⚠️ The Chronos model is not included in this repository.**

Chronos will be available via [Kodezi OS](https://kodezi.com/os):
- **Q4 2025**: Enterprise beta
- **Q1 2026**: General availability
- **Join the waitlist**: [chronos.so](https://chronos.so)

---

## 📁 Repository Contents

```
chronos-research/
├── benchmarks/                    # MRR benchmark suite
│   ├── multi-random-retrieval/    # 5,000-scenario benchmark
│   ├── evaluation_metrics/        # Metrics implementation
│   └── run_mrr_benchmark_2025.py  # Main benchmark runner
├── reference_implementations/     # Algorithm references (NOT the model)
│   ├── algorithms/                # AGR, PDM implementations
│   └── NOTICE.md                  # Proprietary model notice
├── paper/                         # Research paper
│   └── chronos-research-2025.md   # Full paper (arXiv:2507.12482)
├── results/                       # Performance data
│   ├── raw_data/                  # 5,000 scenario results
│   └── case_studies/              # Debugging examples
├── figures/                       # Paper visualizations
│   └── paper_figures/             # 11 paper figures
├── docs/                          # Documentation
├── MODEL_ACCESS.md                # How to access Chronos
└── LEADERBOARD.md                 # Performance rankings
```

---

## 🔬 Research Highlights

### Training Dataset
- 42.5M debugging examples (not code completion)
- 15M GitHub issues with fixes
- 8M stack traces with resolutions
- 3M CI/CD debugging logs
- 2.5M production debugging sessions
- 14M examples curated from Defects4J, SWE-bench, and BugsInPy

### AGR Performance by Retrieval Depth
- k=1 hop: 58.2% success
- k=2 hops: 72.4% success
- k=adaptive: 87.1% success
- Flat retrieval: 23.4% success

### PDM Learning Curve
- Initial: 35% success rate
- After 100 sessions: 52% success
- After 500 sessions: 65% success
- 7.3x token efficiency gain
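The learning curve above comes from accumulating debugging sessions. A PDM-style store can be sketched as a signature-indexed session log — illustrative only, not Chronos's internal design:

```python
from collections import defaultdict

class PersistentDebugMemory:
    """Minimal sketch: past sessions are indexed by a bug signature so
    similar future bugs can start from previously successful fixes."""

    def __init__(self):
        self.sessions = defaultdict(list)

    def signature(self, error_type, module):
        # Stand-in signature; a real system would fingerprint far more context
        return (error_type, module)

    def record(self, error_type, module, fix, succeeded):
        self.sessions[self.signature(error_type, module)].append((fix, succeeded))

    def recall(self, error_type, module):
        # Return previously successful fixes for similar bugs, newest first
        past = self.sessions.get(self.signature(error_type, module), [])
        return [fix for fix, ok in reversed(past) if ok]

pdm = PersistentDebugMemory()
pdm.record("KeyError", "config.py", "add default via dict.get", succeeded=True)
pdm.record("KeyError", "config.py", "broken attempt", succeeded=False)
hints = pdm.recall("KeyError", "config.py")
```

Reusing recalled fixes before fresh retrieval is one plausible way a memory like this improves success rates across sessions.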

---

This research repository is licensed under the MIT License - see [LICENSE](LICENSE).

<sub>Built with ❤️ by the Kodezi Team</sub>

</div>