Authors: John Yang, Kilian Lieret
Existing coding benchmarks evaluate Language Models (LMs) on *tasks*.
Implement a function, fix a bug, write a test.
We tell models what to do, they give it a shot, and we evaluate correctness with unit tests.
This approach has driven impressive progress in LMs' code generation capabilities over the past few years.
However, as LM scores have skyrocketed on evaluations like [HumanEval](https://github.com/openai/human-eval) and [SWE-bench](https://www.swebench.com/), this progress raises a question: Is the future of code evals just making harder tasks?
Our answer is grounded in a simple question: Why do we write code?
To achieve *goals*!
Software developers aren't just incessantly solving tickets with no aim.
We code to improve user retention, increase revenue, reduce costs, achieve higher customer satisfaction - the list is endless.
Towards these goals, we decompose objectives into steps, prioritize them, and strategically decide which solutions to pursue.
And it's a continuous, often competitive loop. Propose changes, deploy them, analyze real-world feedback (e.g., metrics, user behavior, A/B test results), then do it all again.
From this perspective, tasks are but small, isolated pieces tied together by an overarching goal.
So we posit - perhaps the next frontier in code evaluation is not harder tasks, but **goal-oriented software engineering**.
To formalize this, we're excited to share **CodeClash**!
Multiple LM systems compete to build the best codebase for achieving a high-level objective over the course of a multi-round tournament.
These codebases implement solutions that compete in a code arena.
<span class="subtext">Picture Credit to <a href="https://abehou.github.io/">Abe Hou</a></span>
</div>
Crucially, LMs do not play directly.
Instead, they iteratively refine code that competes as their proxy.
42
+
43
+
CodeClash enables us to examine models as long-running, continually improving developers:
- Objectives are open-ended (win, survive, or maximize reward)
- Arenas are diverse, so solutions and interfaces differ dramatically
- Competition rewards adaptive strategies rather than one-off correctness
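To make the loop concrete, here's a minimal sketch of what a multi-round tournament might look like. Everything here (`edit_codebase`, `run_arena`, the agent names, random scoring) is illustrative and hypothetical, not the actual CodeClash interface:

```python
# Hypothetical sketch of a CodeClash-style tournament loop.
# All names and the scoring logic are illustrative stand-ins.
import random

def edit_codebase(codebase: str, round_num: int) -> str:
    """Stand-in for an LM agent revising its codebase between rounds."""
    return codebase + f"\n# revision for round {round_num}"

def run_arena(codebases: dict) -> dict:
    """Stand-in for the arena: score each competitor's codebase."""
    return {name: random.random() for name in codebases}

def tournament(players, rounds=5):
    codebases = {p: f"# {p}'s initial codebase" for p in players}
    totals = {p: 0.0 for p in players}
    for r in range(1, rounds + 1):
        # LMs do not play directly: their codebases compete as proxies.
        scores = run_arena(codebases)
        for p, s in scores.items():
            totals[p] += s
        # Between rounds, each agent iteratively refines its code.
        codebases = {p: edit_codebase(c, r) for p, c in codebases.items()}
    return max(totals, key=totals.get)

winner = tournament(["agent_a", "agent_b"])
```

The key structural point the sketch captures: the agent's output each round is an *edit to a persistent codebase*, and the arena — not a unit test — decides who comes out ahead.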
If you're curious about models using code as the modality to learn, adapt, and improve over time, CodeClash is the playground for you.
Thanks for reading! Check out our [paper](https://arxiv.org/abs/2511.00839) for the full story. And if you're ready to dive in, here's a quick video to show you how to set up the repository and run your first CodeClash tournament!