\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage{amsmath}
\usepackage{enumitem}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{cleveref}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{minted}
\usepackage[most]{tcolorbox}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{authblk}
\usetikzlibrary{arrows.meta}
\hypersetup{
colorlinks=true,
linkcolor=blue!50!black,
urlcolor=blue!50!black,
citecolor=blue!50!black
}
\setlist[itemize]{leftmargin=1.25em}
\setlist[enumerate]{leftmargin=1.5em}
\title{\textbf{React Native Evals: A Reproducible Benchmark for Evaluating Large Language Models on Mobile App Development Tasks}\\ Whitepaper}
\author{Artur Morys-Magiera}
\author{Mike Grabowski}
\author{Lech Kalinowski PhD}
\author{\\Piotr Miłkowski}
\author{Michał Pierzchała}
\affil{Callstack, Poland}
\affil{\{artur.morys-magiera,mike,lech.kalinowski,\\piotr.milkowski,michal.pierzchala\}@callstack.com}
\date{March 5, 2026}
\begin{document}
\maketitle
\begin{abstract}
Large Language Models (LLMs) have improved rapidly on coding tasks, but most widely used benchmarks emphasize general-purpose programming and underrepresent mobile engineering; scores from generic coding benchmarks are therefore often insufficient for estimating real-world performance on domain-specific tasks. To address this gap for React Native, an open-source framework for building cross-platform applications, we present React Native Evals: a benchmark designed specifically for evaluating React Native problem solving. The framework evaluates declared requirements with optional weights, enabling fine-grained analysis of where models succeed or fail, and provides a transparent and auditable basis for comparative evaluation of LLMs on mobile app development tasks. The benchmark pipeline executes successfully end-to-end across the evaluated model set, confirming that the current eval suite is operationally usable. The results show that the framework can separate model capability in an auditable and reproducible way while preserving requirement-level interpretability through explicit weights, per-eval verdict traces, and deterministic output artifacts. Finally, we draw conclusions from the benchmark runs, comparing models both overall and within specific sub-domains of React Native code generation, and report general metrics such as token consumption.
\end{abstract}
\section{Introduction}
Large Language Models (LLMs) have improved rapidly on coding tasks, but most widely used benchmarks emphasize general-purpose programming and underrepresent mobile engineering. React Native development introduces a different difficulty profile: correctness depends not only on code syntax, but also on component behavior, navigation flow, asynchronous state transitions, platform API usage, and consistency across multiple files. As a result, scores from generic coding benchmarks are often insufficient for estimating real-world performance on React Native tasks.
To address this gap, we present \textit{React Native Evals}, a reproducible benchmark designed specifically for React Native code-generation evaluation. The benchmark uses repository-defined tasks, structured solver outputs, and requirement-level judging to assess whether generated implementations satisfy explicit engineering constraints. Rather than relying on a single pass/fail signal at the task level, the framework evaluates declared requirements with optional weights, enabling fine-grained analysis of where models succeed or fail.
This whitepaper documents the benchmark in an implementation-faithful manner relative to the current repository state. It specifies the dataset contract, end-to-end execution pipeline, solver and judge responsibilities, scoring and aggregation rules, artifact schema, and repeat-run protocol for variance-aware comparison. It also clarifies authoring conventions (including \texttt{implementation-} requirement IDs) and distinguishes those conventions from runner-enforced behavior. Together, these elements provide a transparent and auditable basis for comparative evaluation of LLMs on mobile app development tasks in React Native.
\section{React Native methodology}
The substantive focus of this benchmark is the validation of base Large Language Models without external augmentations such as RAG, documentation access, or agent skills. Its purpose is to assess how these models perform on React Native development tasks, including tasks that involve the most common third-party libraries in the React Native ecosystem.
The benchmark we propose in this work comprises evaluation cases (referred to as \textit{evals}), divided into categories; each eval consists of multiple requirements on the outputs of the evaluated model (referred to as the \textit{solver}), as described in \cref{sec:generation,sec:judgement}. The solver performs the task in each eval, and in the second stage of the benchmark another LLM (referred to as the \textit{judge}) assesses the result by assigning a verdict to each requirement, along with a natural-language justification for interpretability. The breakdown of eval categories is presented in \cref{tab:category-evals-scope}. For this experiment, the requirements were chosen so that they matter equally; accordingly, all requirement weights were set to $1.0$.
\begin{table}[htbp]
\centering
\small
\begin{tabularx}{\linewidth}{l c X}
\toprule
\textbf{Category} & \textbf{\# Evals} & \textbf{Scope} \\
\midrule
Animation & 14 & React Native animation behavior, including \texttt{Animated} and \texttt{react-native-reanimated}. \\
Asynchronous state & 14 & Async state/data-flow tasks with \texttt{TanStack Query}, \texttt{Zustand}, and \texttt{Jotai}. \\
Lists & 18 & List rendering, item interactions, virtualization patterns, and sectioned data presentation. \\
Navigation & 15 & Screen transitions, stack/modal flows, route params, and navigation state handling. \\
React Native APIs & 9 & Core platform API usage such as layout, keyboard, accessibility, and device integration primitives. \\
\bottomrule
\end{tabularx}
\caption{Benchmark categories, number of evals per category, and short scope descriptions.}
\label{tab:category-evals-scope}
\end{table}
\section{The benchmark pipeline}
\label{sec:pipeline}
This section specifies the exact models and runtime settings used in the benchmark to ensure reproducibility and fair comparison. We report model identifiers, version snapshots, assigned pipeline roles, and key execution parameters. All experiments use the same evaluation protocol, task set, and scoring logic; only the configured model changes.
\subsection{Benchmarked models and their pipeline roles}
The models used in this benchmark, together with their pipeline roles, are presented in \cref{tab:model-versions}.
\begin{table}[h]
\centering
\begin{tabular}{l c}
\toprule
\textbf{Model name} & \textbf{Role} \\
\midrule
claude-opus-4.6 & Solver \\
claude-opus-4.7 & Solver \\
claude-sonnet-4.6 & Solver, Judge \\
composer-2 & Solver \\
composer-2-fast & Solver \\
deepseek-r1-distill-qwen-32b & Solver \\
deepseek-v3.2 & Solver \\
gemini-3.1-pro-preview & Solver \\
gemma-4-31B-it & Solver \\
glm-5 & Solver \\
gpt-5.3-codex & Solver \\
gpt-5.4 & Solver \\
gpt-oss-20b & Solver \\
gpt-oss-120b & Solver \\
grok-4 & Solver \\
kimi-k2.5 & Solver \\
minimax-m2.7 & Solver \\
qwen2.5-coder-32b-instruct & Solver \\
\bottomrule
\end{tabular}
\caption{Models used in the evaluation and their pipeline roles in the sense of \cref{sec:pipeline} (solver: the evaluated LLM; judge: the LLM used to grade the solver's fulfillment of requirements).}
\label{tab:model-versions}
\end{table}
\subsection{Benchmark Unit and Dataset Contract}
The atomic benchmark unit is a single eval directory. Each eval is expected to contain files structured as described in \cref{sec:generation,sec:judgement}. The general pipeline is as follows:
\begin{enumerate}
\item \textbf{Generation CLI}: discover evals, load \texttt{app/} and \texttt{prompt.md}, run solver model, write generated files and \texttt{manifest.json} to the configured output directory.
\item \textbf{Judge CLI}: read eval entries from generation \texttt{manifest.json}, load generated files and \\\texttt{requirements.yaml}, run LLM judging, write per-eval results incrementally, and write summary at run completion.
\end{enumerate}
Eval discovery is file-driven: the runner scans for \texttt{requirements.yaml}. If that file is missing, the eval is not discovered and will not be included in the run.
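To make the discovery contract concrete, the following sketch reimplements the scan in TypeScript. It is illustrative only: the function name, the recursion strategy, and the example paths are assumptions, not the repository runner.
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: file-driven eval discovery]
\begin{minted}[breaklines]{typescript}
// Minimal sketch of file-driven eval discovery: an eval is any directory
// that contains requirements.yaml. Names and example paths are hypothetical.
import { readdirSync } from "node:fs";
import { join } from "node:path";

function discoverEvals(root: string): string[] {
  const evalDirs: string[] = [];
  for (const entry of readdirSync(root, { withFileTypes: true })) {
    const fullPath = join(root, entry.name);
    if (entry.isDirectory()) {
      evalDirs.push(...discoverEvals(fullPath));
    } else if (entry.name === "requirements.yaml") {
      // The directory holding requirements.yaml is the atomic eval unit.
      evalDirs.push(root);
    }
  }
  return evalDirs;
}

// e.g. discoverEvals("evals") -> ["evals/animation/01-fade-toggle", ...]
\end{minted}
\end{tcolorbox}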
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
data/.style={rectangle, draw, rounded corners=2pt, minimum width=1.8cm, align=center, font=\footnotesize, fill=blue!8},
process/.style={rectangle, draw, rounded corners=2pt, minimum width=2.4cm, align=center, font=\footnotesize, fill=green!8},
output/.style={rectangle, draw, rounded corners=2pt, minimum width=1.8cm, align=center, font=\footnotesize, fill=orange!8},
arrow/.style={-{Stealth[length=2pt]}}
]
% Row 1: Dataset
\node[data] (prompt) {\texttt{prompt.md}};
\node[data, anchor=west] (app) at ([xshift=0.3cm]prompt.east) {\texttt{app/}};
\node[data, anchor=west] (reqs) at ([xshift=0.3cm]app.east) {\texttt{requirements.yaml}};
\node[data, anchor=west] (ref) at ([xshift=0.3cm]reqs.east) {\texttt{reference/}};
% Row 2: Generation CLI
\node[process] (discover) at ([yshift=-0.8cm]app.south) {Discovery};
\node[process] (loadgen) at ([yshift=-0.4cm]discover.south) {Load app, prompt};
\node[process] (solver) at ([yshift=-0.4cm]loadgen.south) {Solver LLM};
\node[process] (materialize) at ([yshift=-0.4cm]solver.south) {Materialize};
\draw[arrow] (prompt.south) -- (discover);
\draw[arrow] (app.south) -- (discover);
\draw[arrow] (discover) -- (loadgen);
\draw[arrow] (loadgen) -- (solver);
\draw[arrow] (solver) -- (materialize);
% Generation outputs (row centered between materialize and read manifest)
\node[output, anchor=east] (genfiles) at ([xshift=-1cm, yshift=-0.5cm]materialize.south) {Generated source files};
\node[output, anchor=east] (manifest) at ([xshift=1.21cm, yshift=-1.5cm]materialize.south) {\texttt{manifest.json}};
\draw[arrow] (materialize.south) |- (genfiles.east);
\draw[arrow] (materialize.south) |- (manifest.north);
% Row 3: Judge CLI
\node[process] (readmanifest) at ([yshift=-2.5cm]materialize.south) {Read manifest};
\node[process] (loadjudge) at ([yshift=-0.4cm]readmanifest.south) {Load files, reqs};
\node[process] (judge) at ([yshift=-0.4cm]loadjudge.south) {Judge LLM};
\node[process] (mapreq) at ([yshift=-0.4cm]judge.south) {Map by ID};
\draw[arrow] (manifest.south) -- (readmanifest.north);
\draw[arrow] (reqs.south) |- (loadjudge.east);
\draw[arrow] (readmanifest) -- (loadjudge);
\draw[arrow] (loadjudge) -- (judge);
\draw[arrow] (judge) -- (mapreq);
% Judge outputs
\node[output, anchor=west] (summary) at ([xshift=1.8cm, yshift=0.5cm]mapreq.east) {\texttt{summary.json}};
\node[output, anchor=west] (pereval) at ([xshift=1.8cm, yshift=-0.5cm]mapreq.east) {\texttt{<eval-id>.json}};
\draw[arrow] (mapreq.east) -- (summary.west);
\draw[arrow] (mapreq.east) -- (pereval.west);
% Row 4: Scoring
\node[process, fill=yellow!6] (scoring) at ([yshift=-0.8cm]mapreq.south) {Scoring};
\draw[arrow] (mapreq) -- (scoring);
% Dataset label
\node[font=\scriptsize\itshape, anchor=south] at ([yshift=0.1cm]prompt.north) {evals/\textlangle category\textrangle/\textlangle eval-id\textrangle/};
\end{tikzpicture}
\caption{End-to-end pipeline of the React Native Evals benchmark. Dataset inputs (blue) feed the Generation CLI (described in \cref{sec:generation}), producing \texttt{manifest.json} and source files generated by the solver. The Judge CLI (described in \cref{sec:judgement}) reads the manifest and \texttt{requirements.yaml}, runs the judge LLM, maps verdicts by requirement ID, and writes \texttt{summary.json} along with telemetry data. Scoring computes metrics as described in \cref{sec:scoring}.}
\label{fig:pipeline-flow}
\end{figure}
\subsection{Generation stage inputs}
\label{sec:generation}
\begin{enumerate}
\item \textbf{Discovery}: locate eval directories by scanning for \texttt{requirements.yaml}
\item \textbf{Load \texttt{app/}}: read the solution template (the files on which the solver performs the task) and copy it to a temporary directory in which the solver LLM will operate
\item \textbf{Load \texttt{prompt.md}}: parse the task definition, run the solver model, and write the generated files and \texttt{manifest.json} to the configured output directory
\end{enumerate}
\subsubsection{\texttt{prompt.md}}
The task prompt consumed by the solver model. A representative logical template of this file is presented in \cref{lst:promptmd-template}.
\begin{listing}[h]
\begin{tcolorbox}[colback=gray!20, % light gray background
colframe=gray!60, % border color
rounded corners, % rounded edges
boxrule=0.5mm, % border thickness
left=4mm, right=4mm, top=2mm, bottom=2mm, % padding
title=\texttt{prompt.md} example] % optional title
Implement X where performing interaction Y causes Z and state A uses an animation B from package C.
\end{tcolorbox}
\caption{Example structure of a prompt in the benchmark.}
\label{lst:promptmd-template}
\end{listing}
\subsubsection{\texttt{app/}}
The \texttt{app/} directory must contain all files serving as baseline input for the solver. The files to be evaluated by the judge must be listed in the \texttt{requirements.yaml} file.
\subsection{Judgement stage inputs}
\label{sec:judgement}
The judge stage reads eval entries from the generation output and runs an LLM to assess the completion of the requirements (specified in \cref{sec:judgement-reqs}) in the generated solution.
\subsubsection{\texttt{manifest.json}}
The manifest file summarizes the generation stage (\cref{sec:generation}); its contents follow the contract shown in \cref{lst:manifest-json}.
\begin{listing}[h]
\begin{tcolorbox}[colback=gray!20, % light gray background
colframe=gray!60, % border color
rounded corners, % rounded edges
boxrule=0.5mm, % border thickness
left=4mm, right=4mm, top=2mm, bottom=2mm, % padding
title=\texttt{manifest.json} contract] % optional title
\begin{minted}[breaklines]{json}
{
"runId": "YYYY-MM-DDTHH-MM-SS-SSSZ",
"startedAt": "YYYY-MM-DDTHH:MM:SS.SSSZ",
"finishedAt": "YYYY-MM-DDTHH:MM:SS.SSSZ",
"solverModel": "<MODEL_NAME>",
"pattern": "evals/**/*",
"evalCount": 1,
"evalsProcessed": 1,
"evalsErrored": 0,
"evals": [
{
"evalId": "01-category-task",
"evalPath": "evals/category/task",
"outputFiles": [
"App.tsx",
"styles.ts"
],
"generatedPath": "category/task"
}
]
}
\end{minted}
\end{tcolorbox}
\caption{The contract of \texttt{manifest.json} file, an artifact of the generation stage.}
\label{lst:manifest-json}
\end{listing}
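For readers implementing tooling against this artifact, the contract above can be summarized as a TypeScript interface. The shape below is inferred from the listing and from the artifact description later in this document; it is not an authoritative type exported by the repository.
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: inferred manifest shape]
\begin{minted}[breaklines]{typescript}
// Inferred shape of manifest.json, derived from the contract above.
// Illustrative only; not an authoritative type from the repository.
interface GenerationManifest {
  runId: string;          // timestamp-based run identifier
  startedAt: string;      // ISO-8601 timestamp
  finishedAt: string;     // ISO-8601 timestamp
  solverModel: string;    // e.g. "openai/gpt-5.3-codex"
  pattern: string;        // discovery pattern, e.g. "evals/**/*"
  evalCount: number;
  evalsProcessed: number;
  evalsErrored: number;
  evals: Array<{
    evalId: string;
    evalPath: string;
    outputFiles: string[];
    generatedPath: string;
    // Present when OpenCode session capture succeeds (see the
    // Artifacts and Output Schema section).
    solverSessionArtifactPath?: string;
  }>;
}
\end{minted}
\end{tcolorbox}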
\subsubsection{\texttt{requirements.yaml}}
\label{sec:judgement-reqs}
The requirement specification consumed by the judge. A representative logical template of this type of file is presented in \cref{lst:requirements-yaml-template}.
The \mintinline{yaml}{version} field is a constant kept for compatibility with future versions of the project; it must currently be set to \mintinline{yaml}{1}.
The \mintinline{yaml}{inputs} list shall contain all files to be copied to the temporary workspace for the solver.
The \mintinline{yaml}{requirements} list shall contain all requirements, described in natural language. Each requirement should explicitly state the specific elements or behaviors whose presence is to be asserted in the judgement stage.
Optionally, each requirement may carry a \mintinline{yaml}{weight} set to a positive value to scale its impact on the eval score. If omitted, the weight defaults to \mintinline{yaml}{1}.
\begin{listing}[h]
\begin{tcolorbox}[colback=gray!20, % light gray background
colframe=gray!60, % border color
rounded corners, % rounded edges
boxrule=0.5mm, % border thickness
left=4mm, right=4mm, top=2mm, bottom=2mm, % padding
title=\texttt{requirements.yaml} contract] % optional title
\begin{minted}[breaklines]{yaml}
version: 1
inputs:
files:
- app/App.tsx
- app/styles.ts
requirements:
- id: implementation-use-X-api
description: Must use API `X` from package `A`.
weight: 1
- id: implementation-avoid-Y-api
description: Must NOT use deprecated API `Y` from package `A`.
weight: 1
- id: toggle-state-on-click
description: Upon click on button `B`, the visual active state of element `C` must toggle.
weight: 1
\end{minted}
\end{tcolorbox}
\caption{The contract of \texttt{requirements.yaml} file.}
\label{lst:requirements-yaml-template}
\end{listing}
\subsection{OpenCode integration in the judgement stage}
OpenCode is the execution layer used to call all solver and judge models through one interface (model IDs take the form \texttt{provider/model}, e.g., \texttt{openai/gpt-5.3-codex}). Before judging starts, the runner ensures that a local OpenCode server is available on the configured port; if a server is already running, it is reused. For each eval, the Judge CLI creates a new OpenCode session, sends the constructed judge prompt, and requests schema-constrained structured output. If strict structured output fails, a JSON fallback path is used and validated before scoring.
After each judge call, the pipeline can fetch OpenCode session metadata (conversation, token usage, and file-diff summary) and store it as \texttt{<eval-id>.opencode-session.judge.json}. Standard judge outputs are written incrementally as \texttt{runs/<input-folder>/evals/<eval-id>.json}, and the run-level aggregate is written as \texttt{summary.json}. Judging runs with bounded concurrency and retry logic; when \texttt{--fail-fast} is disabled, failed evals do not stop the run and are counted in \texttt{evalsErrored}.
The judge CLI also supports targeted re-judging over existing judge outputs using \texttt{--rerun-requirements-file} with \texttt{--output}; optionally pass \texttt{--rerun-requirement-id} to refresh only one requirement. Without \texttt{--rerun-requirement-id}, all requirements for the targeted eval are re-judged and replaced in the per-eval JSON. It also supports \texttt{--rerun-missing-judgements} with \texttt{--output} to scan for and judge all evals missing per-eval judge JSON outputs (the same missing rows counted by \texttt{evalsErrored} in the rebuilt summary). In rerun modes, the previous \texttt{summary.json} is backed up as \texttt{summary.backup.<timestamp>.json}, and a new aggregate summary is generated from the current per-eval result set.
\section{Solver Stage Methodology}
The solver methodology treats each benchmark item as a constrained code-generation task: the model receives a task prompt plus project context, produces a complete implementation artifact, and is evaluated only against task-defined requirements. The same solver protocol is applied across all tasks and runs, with outputs stored in a structured, reproducible format, ensuring fair comparison across models and direct traceability between generated code and downstream requirement-level scores.
\subsection{Input Construction}
The solver is given:
\begin{itemize}
\item the eval prompt from \texttt{prompt.md}
\item the baseline files from \texttt{app/}
\item a fixed system instruction requiring the model to return all provided files with modifications applied
\end{itemize}
The runner formats baseline files into a structured prompt with explicit file tags.
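The exact tag syntax is internal to the runner; the sketch below illustrates the idea with a hypothetical XML-style scheme.
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: baseline-file formatting]
\begin{minted}[breaklines]{typescript}
// Hypothetical sketch of baseline-file formatting. The runner's actual
// tag format is internal; the XML-style tags below are an assumption.
interface SourceFile {
  path: string;
  content: string;
}

function formatBaselineFiles(files: SourceFile[]): string {
  return files
    .map((f) => `<file path="${f.path}">\n${f.content}\n</file>`)
    .join("\n\n");
}
\end{minted}
\end{tcolorbox}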
\subsection{Output Contract}
The solver returns structured output with:
\begin{itemize}
\item \texttt{summary}: short description of performed changes
\item \texttt{files[]}: array of \texttt{\{path, content\}}
\end{itemize}
Returned paths are sanitized before writing to disk (e.g., stripping leading slashes and removing traversal segments such as \texttt{..}).
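A minimal sketch of such sanitization, assuming a Node.js runner (the helper name and the exact rules shown are illustrative):
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: output-path sanitization]
\begin{minted}[breaklines]{typescript}
// Sketch of output-path sanitization (hypothetical helper name): strip
// leading slashes and drop "." / ".." segments so writes cannot escape
// the output directory.
import { sep } from "node:path";

function sanitizeOutputPath(rawPath: string): string {
  return rawPath
    .replace(/^[/\\]+/, "")                              // strip leading slashes
    .split(/[/\\]+/)                                     // split into segments
    .filter((s) => s !== "" && s !== "." && s !== "..")  // drop traversal parts
    .join(sep);
}

// sanitizeOutputPath("/../../App.tsx") === "App.tsx"
\end{minted}
\end{tcolorbox}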
\subsection{Model Requirement}
The generation CLI requires \texttt{--model}. When set to \texttt{noop}, the generation stage copies files from each eval's \texttt{reference/} directory into the configured output and still writes a validated \texttt{manifest.json}. Any other model value runs normal solver generation.
\section{LLM Judge Methodology}
For React Native coding tasks, an LLM-based judge is an effective approach that balances evaluation quality against preparation workload. Task quality depends on semantic correctness across multiple files and layers: component logic, state flow, navigation behavior, asynchronous handling, and API usage all matter. Many valid implementations exist, and strict deterministic rules do not capture the semantic complexity of mobile apps. Deterministic checks are excellent for narrow invariants, but as a standalone judge they over-penalize acceptable variations, miss errors in higher-level reasoning, and require constant revision as frameworks and programming patterns evolve. A requirements- or specification-based LLM judge can evaluate requirements at the intent level, detect nuanced errors such as incorrect lifecycle assumptions or unstable query-invalidation logic, and assign consistent, evidence-based verdicts at the requirement level, while remaining independent of the solver model and its style. In practice, this approach fits real-world engineering tasks better and supports faster benchmark evolution than purely deterministic pipelines for React Native.
\subsection{Requirements Parsing}
\texttt{requirements.yaml} is parsed and validated at runtime. The enforced schema requires:
\begin{itemize}
\item \texttt{version}: positive integer (defaults to 1 via schema default)
\item \texttt{requirements[]}: non-empty list of objects with:
\begin{itemize}
\item \texttt{id} (non-empty string)
\item \texttt{description} (non-empty string)
\item optional positive \texttt{weight}
\end{itemize}
\end{itemize}
Extra fields are ignored by the current runtime parser.
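As an illustration, the sketch below expresses these constraints as a \texttt{zod} schema. The library choice and identifier names are assumptions made for the example; the repository's validator may be implemented differently.
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: requirements schema]
\begin{minted}[breaklines]{typescript}
// Sketch of the enforced schema using zod. The library choice is an
// assumption for illustration; the repository's validator may differ.
import { z } from "zod";

const RequirementSchema = z.object({
  id: z.string().min(1),            // non-empty string
  description: z.string().min(1),   // non-empty string
  weight: z.number().positive().optional(),
});

const RequirementsFileSchema = z.object({
  version: z.number().int().positive().default(1),
  requirements: z.array(RequirementSchema).nonempty(),
});

// zod objects strip unknown keys by default, matching the "extra fields
// are ignored" behavior described above.
type RequirementsFile = z.infer<typeof RequirementsFileSchema>;
\end{minted}
\end{tcolorbox}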
\subsection{Deterministic \texttt{implementation-} Requirement Convention}
The repository documents an authoring convention for deterministic requirements:
\begin{itemize}
\item prefix deterministic, source-verifiable requirement IDs with \texttt{implementation-}
\item use \texttt{weight: 1} by default to avoid overweighting implementation details
\end{itemize}
Examples include required imports, API usage patterns, and specific wiring that can be verified directly from files.
\textbf{Important distinction:} the runner does \emph{not} infer behavior from ID prefixes such as \texttt{implementation-}. Prefixes are an authoring convention; scoring changes only when a requirement explicitly sets \texttt{weight} in \texttt{requirements.yaml}.
\subsection{Technical Eval Authoring Controls}
For technical categories (for example navigation, animation, storage), requirement quality is governed by additional authoring controls documented under \texttt{docs/}.
\begin{itemize}
\item \textbf{Recent API shift audit:} before requirement writing, review official docs plus recent releases/changelogs for each core library, and extract concrete ``new preferred API'' and ``deprecated/removed API'' signals with dates.
\item \textbf{Evidence-gated \texttt{MUST NOT}:} use explicit \texttt{MUST NOT} clauses only when backed by primary-source deprecation/removal or correctness caveats.
\item \textbf{Implementation-level specificity:} technical diversity is measured at \\requirement/API-constraint level, not prompt wording level.
\item \textbf{Similarity budget:} within a library subgroup, keep shared baseline requirements small (target: at most two shared IDs) and maintain multiple eval-specific implementation constraints per eval.
\item \textbf{Judge input scope:} set \texttt{inputs.files} to implementation files under evaluation (for example \texttt{app/App.tsx}) rather than prompt-only artifacts.
\end{itemize}
These controls are authoring policy for dataset quality; they are not independently enforced by runner schema validation.
\subsection{Judge Prompt Construction}
The judge prompt includes:
\begin{itemize}
\item declared requirements (including weights when present)
\item generated files under evaluation
\item rules instructing the judge to use only provided files as evidence and return one result per declared requirement ID
\end{itemize}
\subsection{Structured Judge Output}
The judge returns structured output with:
\begin{itemize}
\item optional \texttt{summary}
\item \texttt{requirements[]}, where each row includes \texttt{id}, \texttt{passed}, \texttt{reason}, \texttt{evidence[]}, and optional \texttt{confidence}
\end{itemize}
\subsection{Requirement Mapping and Failure Policy}
Judge rows are mapped back to declared requirements by ID.
\begin{itemize}
\item If a declared requirement is missing from judge output, the runner marks it as failed.
\item The runner emits a fixed failure reason indicating that the judge did not return a result for that requirement.
\item Extra judge rows for undeclared IDs are ignored for scoring.
\item Requirement weights are normalized before scoring (\cref{sec:scoring}).
\end{itemize}
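A minimal sketch of this mapping and failure policy, with hypothetical type and function names and an illustrative failure-reason string:
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: verdict mapping]
\begin{minted}[breaklines]{typescript}
// Sketch of the mapping and failure policy (hypothetical names): each
// declared requirement receives exactly one verdict; requirements missing
// from the judge output fail with a fixed reason, and judge rows for
// undeclared IDs are dropped from scoring.
interface JudgeVerdict {
  id: string;
  passed: boolean;
  reason: string;
}

function mapVerdicts(
  declaredIds: string[],
  judgeRows: JudgeVerdict[],
): JudgeVerdict[] {
  const byId = new Map<string, JudgeVerdict>();
  // Rows for undeclared IDs are simply never looked up below.
  for (const row of judgeRows) byId.set(row.id, row);
  return declaredIds.map(
    (id) =>
      byId.get(id) ?? {
        id,
        passed: false,
        // Fixed failure reason (the wording here is illustrative).
        reason: "Judge did not return a result for this requirement.",
      },
  );
}
\end{minted}
\end{tcolorbox}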
The judge CLI requires \texttt{--model}; there is no judge skip/noop path.
\section{Scoring Methodology}
\label{sec:scoring}
Each model is evaluated through the identical benchmark pipeline in 10 independent runs to reduce variance from stochastic generation and judging effects and to produce a statistically more reliable performance estimate. In every run, the model receives the same task set, prompt protocol, and scoring rules, while random factors may still lead to different outputs and per-task scores. Repeating the full pipeline therefore captures both central tendency and dispersion of performance, allowing analysis to rely on aggregated results (e.g., mean and spread) rather than a potentially noisy single-run outcome. This design improves fairness in cross-model comparison and supports more robust conclusions about relative capability on React Native tasks.
\subsection{Requirement Weight Normalization}
Each requirement weight is normalized as:
\[
w_i' =
\begin{cases}
w_i & \text{if } w_i \text{ is finite and } w_i > 0 \\
1 & \text{otherwise}
\end{cases}
\]
This normalization is defensive. In normal operation, the runtime schema already requires positive weights when provided.
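Expressed in code, the normalization is a single guard (illustrative sketch):
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: weight normalization]
\begin{minted}[breaklines]{typescript}
// Defensive weight normalization, mirroring the case definition above:
// any non-finite or non-positive weight falls back to 1.
function normalizeWeight(weight: number | undefined): number {
  return weight !== undefined && Number.isFinite(weight) && weight > 0
    ? weight
    : 1;
}
\end{minted}
\end{tcolorbox}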
\subsection{Per-Eval Weighted Requirement Score}
For an eval with declared requirements \(R\), normalized weights \(w_i'\), and binary pass indicators \(p_i\):
\[
\text{totalWeight} = \sum_{i \in R} w_i'
\]
\[
\text{passedWeight} = \sum_{i \in R} p_i \cdot w_i'
\]
\[
\text{scoreRatio} =
\begin{cases}
0 & \text{if } \text{totalWeight} = 0 \\
\frac{\text{passedWeight}}{\text{totalWeight}} & \text{otherwise}
\end{cases}
\]
The runner rounds \texttt{passedWeight}, \texttt{totalWeight}, and \texttt{scoreRatio} to 4 decimal places.
\textbf{Interpretation example.} If requirements have weights \([2, 2, 1]\) and pass outcomes \([1, 0, 1]\), then the per-eval score is:
\[
\frac{2 + 0 + 1}{2 + 2 + 1} = \frac{3}{5} = 0.6
\]
This pattern is relevant only when authors intentionally assign non-default weights to specific requirements.
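The per-eval computation, including the 4-decimal rounding, can be sketched as follows (helper names are hypothetical):
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: per-eval weighted score]
\begin{minted}[breaklines]{typescript}
// Sketch of the per-eval weighted score (hypothetical names). Weights
// are assumed already normalized as in the previous subsection.
interface ScoredRequirement {
  weight: number;   // normalized weight w_i'
  passed: boolean;  // binary pass indicator p_i
}

const round4 = (x: number): number => Math.round(x * 1e4) / 1e4;

function scoreEval(reqs: ScoredRequirement[]) {
  const totalWeight = reqs.reduce((s, r) => s + r.weight, 0);
  const passedWeight = reqs.reduce(
    (s, r) => s + (r.passed ? r.weight : 0),
    0,
  );
  const scoreRatio = totalWeight === 0 ? 0 : passedWeight / totalWeight;
  return {
    totalWeight: round4(totalWeight),
    passedWeight: round4(passedWeight),
    scoreRatio: round4(scoreRatio),
  };
}

// Worked example from the text: weights [2, 2, 1], outcomes [1, 0, 1]
// -> scoreRatio = 3 / 5 = 0.6
\end{minted}
\end{tcolorbox}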
\subsection{Run-Level Aggregate Score}
Let \(E^+\) be the set of evals that complete successfully (i.e., without an unrecovered pipeline error). The summary field \texttt{weightedAverageScore} is:
\[
\texttt{weightedAverageScore} =
\begin{cases}
0 & \text{if } |E^+| = 0 \\
\frac{1}{|E^+|}\sum_{e \in E^+}\texttt{scoreRatio}_e & \text{otherwise}
\end{cases}
\]
rounded to 4 decimal places.
\textbf{Interpretation note:} the ``weighted'' part refers to weights within each eval. The run-level average itself is an unweighted mean across successful evals.
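The corresponding aggregate computation is a guarded unweighted mean (illustrative sketch):
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative sketch: run-level aggregate]
\begin{minted}[breaklines]{typescript}
// Sketch of the run-level aggregate: an unweighted mean of scoreRatio
// over successfully completed evals, rounded to 4 decimal places.
function weightedAverageScore(scoreRatios: number[]): number {
  if (scoreRatios.length === 0) return 0;
  const mean =
    scoreRatios.reduce((s, x) => s + x, 0) / scoreRatios.length;
  return Math.round(mean * 1e4) / 1e4;
}
\end{minted}
\end{tcolorbox}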
\subsection{Additional Summary Metrics}
The run summary also reports:
\begin{itemize}
\item \texttt{requirementsTotal}
\item \texttt{requirementsPassed}
\item \texttt{evalsErrored}
\end{itemize}
For successful evals, each row includes \texttt{requirementsTotal}, \texttt{requirementsPassed},
and \texttt{scoreRatio}. For pipeline errors, each row includes \texttt{status=error}
along with the eval identifier and path.
\section{Artifacts and Output Schema}
Each run writes artifacts to deterministic generation and judge directories:
\begin{itemize}
\item \texttt{runs/<input-folder>/evals/<eval-id>.json}: per-eval judge results (written as each eval completes)
\item \texttt{runs/<input-folder>/summary.json}: aggregate judge summary (written at end)
\item generated artifacts and \texttt{manifest.json}: written by generation CLI to a user-selected output directory (default: \texttt{generated/<model>-<run-id>})
\item per-eval OpenCode session snapshots:
\begin{itemize}
\item generation: \texttt{<generated-eval-dir>/opencode-session.solver.json}
\item judging: \texttt{runs/<input-folder>/evals/<eval-id>.opencode-session.judge.json}
\end{itemize}
\item optional debug artifacts when \texttt{--debug} is enabled
\end{itemize}
Per-eval results include solver and judge model identifiers, mapped requirement verdicts, per-eval score, generated file list, and optional \texttt{judgeSessionArtifactPath} when session capture succeeds. The generation \texttt{manifest.json} includes \texttt{solverSessionArtifactPath} for each processed eval.
\section{Repeat Orchestration Script}
The repository provides \texttt{scripts/bench-series.sh} to run multiple full cycles (generation + judging) for variance analysis. Configure via environment variables:
\begin{itemize}
\item \texttt{RUN\_COUNT}: number of run+judge iterations (default: 10)
\item \texttt{BENCH\_RUN\_ARGS}: arguments for the generation CLI (e.g.\ \texttt{--model}, \texttt{--pattern})
\item \texttt{BENCH\_JUDGE\_ARGS}: arguments for the judge CLI (e.g.\ \texttt{--model}); \texttt{--input} is injected per run
\item \texttt{BENCH\_OUTPUT\_BASE}: base directory for outputs (default: \texttt{/tmp/bench-series})
\end{itemize}
For each of the \texttt{RUN\_COUNT} iterations, the script runs generation then judging. If either step fails, it retries up to 3 times before failing the script. Outputs are written to \\\texttt{<BENCH\_OUTPUT\_BASE>/<timestamp>/run-1}, \texttt{run-2}, etc.
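For example, a 10-run series for one solver model might be launched as follows; the model identifiers are illustrative and follow the \texttt{provider/model} convention described earlier.
\begin{tcolorbox}[colback=gray!20, colframe=gray!60, rounded corners, boxrule=0.5mm,
left=4mm, right=4mm, top=2mm, bottom=2mm, title=Illustrative invocation: \texttt{bench-series.sh}]
\begin{minted}[breaklines]{bash}
# Illustrative invocation of the repeat-orchestration script. The model
# identifiers are examples in the provider/model form; flags follow the
# generation/judge CLI options documented above.
RUN_COUNT=10 \
BENCH_RUN_ARGS="--model openai/gpt-5.4 --pattern 'evals/**/*'" \
BENCH_JUDGE_ARGS="--model anthropic/claude-sonnet-4.6" \
BENCH_OUTPUT_BASE=/tmp/bench-series \
./scripts/bench-series.sh
\end{minted}
\end{tcolorbox}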
\section{Results}
\Cref{fig:model-comparison} summarizes both overall and category-level performance across the evaluated solver models. In overall weighted average score, \texttt{composer-2} leads (96.1\%), followed by \texttt{composer-2-fast} (94.9\%) and \texttt{gpt-5.4} (85.3\%), while \texttt{deepseek-r1-distill-qwen-32b} scores lowest (44.4\%). The category breakdown is non-uniform: navigation and React Native API tasks are generally the strongest categories across the model set, whereas animation remains the most discriminative category and lists provide additional separation among mid-tier models. Detailed per-model analysis is provided in Appendix A. \Cref{tab:tokens-by-model} summarizes token usage for each model.
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{export/comparison_plot.png}
\caption{Overall score and category comparison across all models.}
\label{fig:model-comparison}
\end{figure}
\begin{table}[htbp]
\centering
\small
\begin{tabular}{r l c c}
\toprule
\textbf{Rank} & \textbf{Model} & \textbf{Weighted average score} & \textbf{Runs} \\
\midrule
1 & composer-2 & 0.961 (96.1\%) & 10 \\
2 & composer-2-fast & 0.949 (94.9\%) & 10 \\
3 & gpt-5.4 & 0.853 (85.3\%) & 10 \\
4 & claude-opus-4.6 & 0.841 (84.1\%) & 10 \\
5 & gpt-5.3-codex & 0.831 (83.1\%) & 10 \\
6 & claude-opus-4.7 & 0.828 (82.8\%) & 10 \\
7 & claude-sonnet-4.6 & 0.806 (80.6\%) & 10 \\
8 & gemini-3.1-pro-preview & 0.789 (78.9\%) & 10 \\
9 & kimi-k2.5 & 0.772 (77.2\%) & 10 \\
10 & gemma-4-31B-it & 0.752 (75.2\%) & 10 \\
11 & glm-5 & 0.748 (74.8\%) & 10 \\
12 & grok-4 & 0.726 (72.6\%) & 10 \\
13 & gpt-oss-120b & 0.716 (71.6\%) & 10 \\
14 & deepseek-v3.2 & 0.715 (71.5\%) & 10 \\
15 & minimax-m2.7 & 0.714 (71.4\%) & 10 \\
16 & gpt-oss-20b & 0.710 (71.0\%) & 10 \\
17 & qwen2.5-coder-32b-instruct & 0.512 (51.2\%) & 10 \\
18 & deepseek-r1-distill-qwen-32b & 0.444 (44.4\%) & 10 \\
\bottomrule
\end{tabular}
\caption{Overall model comparison from packed result archives.}
\label{tab:model-comparison-overall}
\end{table}
\include{export/tokens_by_model}
The full benchmark pipeline executed successfully end-to-end across the evaluated model set, confirming that the current eval suite is operationally usable for comparative React Native model assessment.
\section{Recommended Reporting Protocol}
For comparative studies, report at minimum:
\begin{enumerate}
\item repository commit hash (dataset and runner version)
\item CLI options (run: \texttt{--pattern}, \texttt{--model}, \texttt{--timeout}, \texttt{--concurrency}, \texttt{--output}; judge: \texttt{--model}, \texttt{--timeout}, \texttt{--concurrency}, \texttt{--input}; optional rerun flags: \texttt{--rerun-requirements-file} with \texttt{--output}, optionally scoped by \texttt{--rerun-requirement-id}, and \texttt{--rerun-missing-judgements})
\item execution date and time
\item counts of discovered, processed, and errored evals
\item \texttt{weightedAverageScore} and \texttt{requirementsPassed/Total}
\item generated artifact source directory
\end{enumerate}
Repeat runs are recommended. The \texttt{bench:repeat} script (or \texttt{scripts/bench-series.sh}) automates multiple cycles. The current runner does not explicitly pin a random seed or decoding settings (e.g., temperature) in solver or judge calls, so provider defaults and model nondeterminism can affect results.
\section{Limitations and Threats to Validity}
\begin{itemize}
\item \textbf{Judge dependence:} requirement verdicts depend on the configured LLM judge model and prompt behavior.
\item \textbf{Requirement-interpretation ambiguity:} without seeded reference context, ambiguous requirement phrasing can increase variance in judge decisions.
\item \textbf{Binary requirement outcomes:} there is no partial credit within a single requirement.
\item \textbf{Run-level averaging choice:} the summary score is unweighted across evals, which can underweight larger evals.
\item \textbf{Pipeline error exclusion from mean:} errored evals are counted separately and excluded from the run-level score average.
\item \textbf{Authoring-convention variability:} conventions such as the \texttt{implementation-} ID prefix improve consistency but are not enforced by the runtime parser.
\end{itemize}
\section{Acknowledgments}
Text drafting and editorial refinement used artificial intelligence-assisted tooling. All technical claims, implementation details, equations, and reported metrics were manually verified against repository code and captured experiment artifacts before inclusion.
\section{Conflict of Interest}
The authors are affiliated with Callstack. The experiments used commercially available infrastructure acquired and operated independently; no direct sponsorship, funding, or editorial influence was provided for this work.
\section{Conclusion}
We evaluated \textbf{18} solver models on the React Native eval suite (\textbf{66} evals in total), with \textbf{10 repeated runs per model} under the same pipeline and scoring rules. Across completed runs, the highest mean \texttt{weightedAverageScore} was \textbf{0.961} (\textbf{96.1\%}, model: \texttt{composer-2}); the median across models was \textbf{0.761} (\textbf{76.1\%}). Category-level analysis (\cref{fig:model-comparison}) shows the strongest performance on \textbf{navigation} and \textbf{React Native APIs}, while \textbf{animation} remains the weakest and most discriminative category; \textbf{lists} and \textbf{async state} provide additional separation among otherwise closely clustered models. Recurring failures were concentrated in behavior-heavy requirements, especially animation dynamics and async-state consistency/invalidation logic.
Two practical implications follow. First, React Native development difficulty is not uniform: models that are competitive on navigation and structural code generation can still be unreliable on behavior-heavy work, where correctness depends on timing, gesture/animation dynamics, and state invalidation semantics. For practitioners, this suggests applying the most scrutiny (and the most automated protection via tests and runtime validation) to animation and async-state code paths, even when a model performs strongly overall. Second, requirement-level evaluation yields actionable signals: per-requirement verdict traces support targeted prompt iteration, scaffolding improvements, and dataset evolution without collapsing all behavior into a single task-level pass/fail.
From an operational standpoint, token usage varied substantially across models (\cref{tab:tokens-by-model}). Among models with available token aggregates, mean usage ranged from \textbf{254,437} tokens per run (\texttt{deepseek-r1-distill-qwen-32b}) to \textbf{5,590,132} (\texttt{deepseek-v3.2}), with additional high-consumption behavior visible for \texttt{kimi-k2.5} at \textbf{2,500,716}. The token totals for \texttt{composer-2} and \texttt{composer-2-fast} could not be determined precisely from the captured runs, so efficiency comparisons for those two models should be interpreted with caution. Overall, this highlights a cost/latency frontier that is not directly implied by score alone, and reinforces that model selection for production assistance should consider both category-specific performance and consumption characteristics.
Overall, these results indicate that React Native Evals separates model capability in a reproducible and auditable way while preserving requirement-level interpretability (via explicit per-requirement weights and per-eval verdict artifacts). They also highlight a need to extend the current evaluation set beyond the existing categories to more of the day-to-day React Native surface area, such as handling network requests (including error handling, retries, caching, and cancellation), building custom native integrations, repairing or extending faulty code in partially-correct codebases, keeping pace with the latest versions and breaking changes of common third-party libraries, and more. Future work should also explore judge-robust scoring (e.g., multi-judge consensus and calibrated partial credit) to reduce dependence on any single evaluator while maintaining transparency.
\include{export/appendix}
\end{document}