% SPDX-License-Identifier: PMPL-1.0-or-later
%
% ECHIDNA: Neurosymbolic Theorem Proving via Neural Guidance
% with Formal Verification Guarantees across 30 Prover Backends
%
% Author: Jonathan D.A. Jewell
% Submitted to: arXiv (cs.AI / cs.LO)
\documentclass[11pt,a4paper]{article}
% ─── Packages ────────────────────────────────────────────────────────────────
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{mathtools}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{cleveref}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{multirow}
\usepackage{enumitem}
\usepackage{tikz}
\usetikzlibrary{arrows.meta,positioning,shapes.geometric,calc,fit}
\usepackage{subcaption}
\usepackage[margin=1in]{geometry}
\usepackage{natbib}
\bibliographystyle{plainnat}
% ─── Theorem Environments ────────────────────────────────────────────────────
\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{invariant}[theorem]{Invariant}
% ─── Listings Configuration ──────────────────────────────────────────────────
\lstdefinelanguage{Rust}{
keywords={fn, let, mut, pub, struct, enum, impl, trait, async, await,
match, self, Self, use, mod, crate, return, if, else, for, in,
where, type, const, static, true, false, Box, Vec, Result,
Option, Some, None, Ok, Err, dyn, Send, Sync},
morecomment=[l]{//},
morecomment=[s]{/*}{*/},
morestring=[b]",
sensitive=true,
}
\lstset{
basicstyle=\ttfamily\small,
keywordstyle=\color{blue}\bfseries,
commentstyle=\color{gray}\itshape,
stringstyle=\color{red!70!black},
numbers=left,
numberstyle=\tiny\color{gray},
numbersep=5pt,
breaklines=true,
frame=single,
captionpos=b,
tabsize=2,
}
% ─── Macros ──────────────────────────────────────────────────────────────────
\newcommand{\echidna}{\textsc{Echidna}}
\newcommand{\proverbackend}{\texttt{ProverBackend}}
\newcommand{\trustlevel}[1]{\ensuremath{\mathcal{T}_{#1}}}
\newcommand{\neural}{\ensuremath{\mathcal{N}}}
\newcommand{\symbolic}{\ensuremath{\mathcal{S}}}
% ═════════════════════════════════════════════════════════════════════════════
% DOCUMENT
% ═════════════════════════════════════════════════════════════════════════════
\begin{document}
\title{%
\textsc{ECHIDNA}: Neurosymbolic Theorem Proving via Neural Guidance \\
with Formal Verification Guarantees across 30 Prover Backends%
}
\author{
Jonathan D.A.\ Jewell \\
\texttt{j.d.a.jewell@open.ac.uk}
}
\date{March 2026}
\maketitle
% ═════════════════════════════════════════════════════════════════════════════
% ABSTRACT
% ═════════════════════════════════════════════════════════════════════════════
\begin{abstract}
We present \echidna{} (\textbf{E}xtensible \textbf{C}ognitive \textbf{H}ybrid
\textbf{I}ntelligence for \textbf{D}eductive \textbf{N}eural
\textbf{A}ssistance), a trust-hardened neurosymbolic theorem proving platform
that unifies 30 heterogeneous prover backends behind a single polymorphic
interface. \echidna{} is founded on the \emph{Neural Guidance Invariant}: neural
networks---including frontier large language models---may \emph{suggest} proof
strategies, but every suggestion must pass through mechanical formal
verification before acceptance. We prove that this architecture satisfies a
critical safety property: even an adversarial or arbitrarily incorrect neural
component cannot introduce a false proof; the worst-case outcome is wasted
computation.
The 30 backends span six categories: interactive proof assistants (Agda, Coq,
Lean~4, Isabelle/HOL, Idris~2, F*), SMT solvers (Z3, CVC5, Alt-Ergo),
auto-active verifiers (Dafny, Why3), specialised provers (Metamath, HOL Light,
Mizar, HOL4, PVS, ACL2, TLAPS, Twelf, Nuprl, Minlog, Imandra), first-order
automated theorem provers (Vampire, E~Prover, SPASS), and constraint/\allowbreak
optimisation solvers (GLPK, SCIP, MiniZinc, Chuffed, OR-Tools). All are
accessed through a unified Rust trait with asynchronous proof dispatch, enabling
two novel mechanisms: \emph{Trust Hardening} via cross-prover validation, where
the same theorem is verified independently by multiple provers with
multiplicative trust accumulation; and \emph{Prover Wars}, a competitive
multi-backend proving protocol where backends race to find the fastest verified
proof, naturally exploiting prover complementarity.
Neural guidance is provided at two levels: a dedicated ML layer (logistic
regression and gradient-boosted models over proof-state embeddings) for tactic
prediction and premise selection, and a frontier LLM advisory layer for
high-level strategy, lemma decomposition, and goal reformulation. Both layers
are firewalled from the trust pipeline---the LLM advisor cannot influence trust
scores, only the search path. We evaluate \echidna{} on benchmarks drawn from
Lean's \texttt{mathlib4}, Coq's Mathematical Components, the TPTP problem
library, and SMT-LIB, demonstrating that neural guidance improves proof
discovery rates by 34--61\% across problem classes while the cross-validation
mechanism catches 100\% of injected inconsistencies in controlled fault
injection experiments. The full system is implemented in approximately 18,000
lines of Rust with Julia-based ML inference and ReScript-based UI components.
\bigskip
\noindent\textbf{Keywords:} theorem proving, neurosymbolic AI, formal
verification, cross-prover validation, neural tactic prediction, trust
hardening, proof assistants, SMT solving
\end{abstract}
% ═════════════════════════════════════════════════════════════════════════════
% 1. INTRODUCTION
% ═════════════════════════════════════════════════════════════════════════════
\section{Introduction}
\label{sec:introduction}
The mechanisation of mathematical proof has progressed along two largely
independent trajectories. Formal verification systems---Coq~\citep{coq},
Lean~\citep{lean4}, Isabelle/HOL~\citep{isabelle}, and their
peers---provide absolute logical guarantees by checking every deductive step
against a small trusted kernel. Meanwhile, machine learning for mathematics has
produced increasingly powerful neural models that can suggest proof steps,
predict useful lemmas, and even generate complete proof
sketches~\citep{alphaproof,leancopilot,baldur}. Yet these two traditions
embody a fundamental tension: formal provers demand certainty, while neural
networks offer only statistical confidence.
This tension is not merely philosophical. A neural model trained on proof
corpora will inevitably produce plausible-looking but incorrect suggestions.
If such suggestions are accepted without verification---as occurs in informal
mathematical reasoning or in systems that trust neural output
directly---the entire enterprise collapses. The gap between ``the model thinks
this is a proof'' and ``this is a mechanically verified proof'' is precisely the
gap between conjecture and theorem.
\echidna{} bridges this gap through a strict architectural separation that we
call the \emph{Neural Guidance Invariant}:
\begin{invariant}[Neural Guidance Invariant]
\label{inv:ngi}
No proof obligation is discharged by neural inference alone. Every
candidate proof term, tactic application, or lemma citation produced by
a neural component must be independently verified by at least one formal
prover backend before it is accepted into the proof state.
\end{invariant}
This invariant is enforced structurally, not by convention. The neural
components in \echidna{} produce \emph{suggestions}---ranked lists of tactics,
premise selections, or proof sketches---that are consumed by the dispatch
pipeline. The dispatch pipeline feeds these suggestions to one or more prover
backends, which either accept them (producing a verified proof certificate) or
reject them. There is no path through the system by which a neural suggestion
can bypass verification. We formalise this claim in \Cref{sec:invariant} and
prove the resulting safety theorem.
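The structural claim admits a compact illustration. The following is a minimal
sketch, not the actual \echidna{} API: the types, the toy \texttt{verify}
predicate, and the tactic names are all illustrative.
\begin{lstlisting}[language=Rust,caption={Illustrative sketch of the Neural Guidance Invariant: the only constructor of a certificate calls \texttt{verify}.},label=lst:ngi-sketch]
/// A neural suggestion: a candidate tactic plus an estimated
/// probability of success. The probability is advisory only.
pub struct Suggestion { pub tactic: String, pub confidence: f64 }

/// A certificate witnessing that a goal was formally verified.
pub struct Certificate { pub goal: String }

/// Stand-in for a sound backend: accepts only reflexivity on
/// syntactic identities. A real backend calls an external kernel.
fn verify(goal: &str, tactic: &str) -> bool {
    tactic == "rfl" && goal.ends_with("= x")
}

/// The sole way to obtain a `Certificate`. Note that
/// `s.confidence` never enters the decision.
pub fn accept(goal: &str, s: &Suggestion) -> Option<Certificate> {
    if verify(goal, &s.tactic) {
        Some(Certificate { goal: goal.to_string() })
    } else {
        None // a wrong suggestion costs only the failed attempt
    }
}
\end{lstlisting}
A high-confidence but wrong suggestion is rejected exactly as a low-confidence
one would be: confidence can influence search order, never acceptance.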
Beyond safety, \echidna{} addresses a practical challenge: the fragmentation
of the formal verification ecosystem. A mathematician working in Lean cannot
easily leverage Z3's decision procedures; an ACL2 user cannot access Isabelle's
sledgehammer. Each prover has distinct strengths---Lean~4's dependent type
theory, Isabelle's classical higher-order logic, Z3's efficient satisfiability
checking, Vampire's first-order resolution---but no single system dominates all
problem classes. By unifying 30 backends behind a common
\proverbackend{} trait (\Cref{sec:unified-backend}), \echidna{} enables
two capabilities that are impossible within any single prover:
\begin{enumerate}[label=(\roman*)]
\item \textbf{Trust Hardening via Cross-Prover Validation}
(\Cref{sec:trust-hardening}): A theorem proven in Lean can be
independently re-proven in Coq and Isabelle. Since these provers have
independently developed kernels with no shared codebase, the probability
of a shared bug decreases multiplicatively with each independent
confirmation.
\item \textbf{Prover Wars} (\Cref{sec:prover-wars}): Multiple backends
race to prove the same goal in parallel. The first verified result wins.
This naturally exploits complementarity: SMT solvers dispatch
propositional obligations instantly, while interactive provers handle
inductive arguments that SMT cannot express.
\end{enumerate}
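The multiplicative claim in (i) can be stated quantitatively under an explicit
independence assumption. If kernel $S_j$ wrongly accepts some non-theorem with
probability at most $\epsilon_j$, and kernel failures are independent
(plausible for independently developed codebases, though a shared misstatement
of the goal is not covered by this model), then
\[
\Pr\bigl[\text{a false proof survives $k$ independent confirmations}\bigr]
\;\le\; \prod_{j=1}^{k} \epsilon_j .
\]
Three kernels with $\epsilon_j = 10^{-4}$ each, for instance, bound the joint
failure probability by $10^{-12}$.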
\paragraph{Contributions.} This paper makes the following contributions:
\begin{enumerate}
\item We formalise the Neural Guidance Invariant and prove that it guarantees
the absence of false proofs regardless of neural component correctness
(\Cref{sec:invariant}).
\item We present the design and implementation of a unified prover backend
trait that supports 30 heterogeneous provers across six categories, with
asynchronous dispatch, sandboxed execution, and integrity verification
(\Cref{sec:unified-backend}).
\item We introduce Trust Hardening, a five-level trust hierarchy with
cross-prover validation, proof certificate checking, and axiom usage
tracking (\Cref{sec:trust-hardening}).
\item We describe Prover Wars, a competitive multi-backend proving protocol
with Pareto-optimal strategy selection (\Cref{sec:prover-wars}).
\item We detail the neural guidance architecture, including both a dedicated
ML layer for tactic prediction and a frontier LLM advisory layer for
high-level strategy (\Cref{sec:neural-architecture}).
\item We evaluate the system on four established benchmark suites,
demonstrating meaningful improvements in proof discovery rate while
maintaining the safety invariant (\Cref{sec:evaluation}).
\end{enumerate}
% ═════════════════════════════════════════════════════════════════════════════
% 2. BACKGROUND
% ═════════════════════════════════════════════════════════════════════════════
\section{Background and Related Context}
\label{sec:background}
\subsection{Interactive Theorem Proving}
Interactive theorem provers (ITPs) require human guidance to construct
proofs, but verify each step against a trusted kernel. The ``de Bruijn
criterion''---that a small, independently auditable kernel checks all proof
terms---is the foundation of trust in systems like Coq, Lean~4, and
Isabelle/HOL. Coq's kernel comprises approximately 10,000 lines of
OCaml~\citep{coq}; Lean~4's type-checker is implemented in approximately
30,000 lines of Lean itself~\citep{lean4}. This small-kernel property is what
distinguishes mechanically verified proofs from informal arguments.
Different ITPs adopt different foundational logics. Coq and Lean~4 use
variants of the Calculus of Inductive Constructions (CIC), a predicative
dependent type theory with inductive families. Isabelle/HOL uses classical
higher-order logic with a polymorphic type system. Agda uses a predicative
intensional Martin-L\"{o}f type theory with pattern matching. Idris~2 extends
this with quantitative type theory, tracking resource usage at the type level.
F* combines dependent types with a user-extensible effect system. These
foundational differences mean that a proof in one system is not directly
transferable to another, though proof exchange formats such as
Dedukti~\citep{dedukti} and OpenTheory~\citep{opentheory} provide partial
bridges.
\subsection{Automated Theorem Proving}
Automated theorem provers (ATPs) operate without human guidance, searching for
proofs within specified logics. First-order ATPs such as Vampire~\citep{vampire},
E~Prover~\citep{eprover}, and SPASS~\citep{spass} implement resolution and
superposition calculi and are highly effective on first-order problems.
Satisfiability Modulo Theories (SMT) solvers---Z3~\citep{z3}, CVC5~\citep{cvc5},
Alt-Ergo~\citep{altergo}---extend propositional satisfiability with
decision procedures for theories of arithmetic, arrays, bit-vectors, and
other domains. Auto-active verifiers such as Dafny~\citep{dafny} and
Why3~\citep{why3} combine user annotations with automated backend dispatch.
The complementarity between ITPs and ATPs is well-established.
Isabelle's \emph{Sledgehammer}~\citep{sledgehammer} dispatches first-order
subgoals to Vampire, E, SPASS, and Z3, translating results back into
Isabelle proof terms. Lean~4's \texttt{omega} and \texttt{decide} tactics
call decision procedures internally. However, these integrations are
typically prover-specific: Sledgehammer is an Isabelle tool, not a
Coq or Lean tool. \echidna{} generalises this cross-prover dispatch to an
arbitrary set of backends.
\subsection{Neural Theorem Proving}
The application of machine learning to theorem proving has accelerated
rapidly. GPT-f~\citep{gptf} demonstrated that language models can generate
valid Metamath proofs. AlphaProof~\citep{alphaproof} combined reinforcement
learning with Lean~4 to solve competition-level problems.
Draft, Sketch, and Prove~\citep{dsp} uses language models to generate
informal proof sketches that are then formalised in Isabelle.
Baldur~\citep{baldur} fine-tunes large language models on Isabelle proof
corpora for whole-proof generation. LeanCopilot~\citep{leancopilot} provides
in-editor neural tactic suggestions for Lean~4.
Magnushammer~\citep{magnushammer} uses transformers for premise selection in
Isabelle. ReProver~\citep{reprover} trains retrieval-augmented models for
Lean~4 premise selection.
These systems share a common pattern: a neural model generates candidate proof
steps, and a formal prover verifies them. The distinction between
\echidna{} and these prior systems is threefold. First, \echidna{} is
\emph{prover-agnostic}: the neural component is not tied to any single prover.
Second, \echidna{} supports \emph{cross-prover validation}, a capability that
requires multiple independent backends. Third, \echidna{} makes the safety
invariant---that neural suggestions cannot bypass verification---explicit,
formally stated, and structurally enforced rather than implicit in the system
design.
\subsection{Proof Exchange and Cross-Validation}
The idea of validating proofs across systems is not new.
Dedukti~\citep{dedukti} encodes proofs in the $\lambda\Pi$-calculus modulo
theory, providing a common representation for proofs from Coq, Lean, Agda, and
other systems. OpenTheory~\citep{opentheory} provides a proof exchange
format for HOL-family provers. The Alethe proof format~\citep{alethe}
standardises SMT proof certificates. TSTP (Thousands of Solutions from Theorem
Provers)~\citep{tstp} standardises output from first-order ATPs. DRAT and
LRAT~\citep{drat} provide efficiently checkable certificates for propositional
satisfiability proofs.
\echidna{} leverages these formats: its exchange module translates proofs
between OpenTheory and Dedukti representations, and its certificate checker
validates Alethe, DRAT/LRAT, and TSTP certificates. Cross-prover validation in
\echidna{} goes beyond format translation, however: it requires
\emph{independent re-proving} of the same theorem statement in multiple
systems, providing a stronger guarantee than proof translation alone.
% ═════════════════════════════════════════════════════════════════════════════
% 3. THE NEURAL GUIDANCE INVARIANT
% ═════════════════════════════════════════════════════════════════════════════
\section{The Neural Guidance Invariant}
\label{sec:invariant}
In this section we formalise the central safety property of \echidna{} and
prove that it holds by construction.
\subsection{System Model}
We model \echidna{} as a composition of three layers:
\begin{definition}[System Layers]
\label{def:layers}
The \echidna{} system consists of:
\begin{enumerate}
\item A \emph{neural layer} $\neural$, comprising the ML tactic predictor,
the premise selector, and the frontier LLM advisor. $\neural$ produces
suggestions: $\neural(s) \to \{(t_1, p_1), \ldots, (t_k, p_k)\}$ where each
$t_i$ is a candidate tactic/proof term and $p_i \in [0,1]$ is an estimated
probability of success.
\item A \emph{dispatch layer} $\mathcal{D}$, which routes suggestions to
prover backends and manages the trust pipeline (integrity checking,
sandboxing, axiom tracking, certificate validation, confidence scoring).
\item A \emph{symbolic layer} $\symbolic = \{S_1, \ldots, S_{30}\}$, the set
of prover backends. Each $S_j$ implements a verification function
$\mathit{verify}_j : \mathit{Proof} \times \mathit{Formula} \to \{\top, \bot, \mathit{timeout}\}$,
taking a candidate proof term and the obligation it claims to discharge.
\end{enumerate}
\end{definition}
\begin{definition}[Accepted Proof]
\label{def:accepted}
A proof obligation $\phi$ is \emph{accepted} by \echidna{} if and only if
there exists at least one backend $S_j \in \symbolic$ such that
$\mathit{verify}_j(\pi, \phi) = \top$, where $\pi$ is a proof term for $\phi$.
\end{definition}
\begin{definition}[Sound Prover]
A prover backend $S_j$ is \emph{sound} if for all $\phi$ and $\pi$:
$\mathit{verify}_j(\pi, \phi) = \top \implies \phi$ is a theorem of the
logical foundation of $S_j$.
\end{definition}
\subsection{Safety Theorem}
\begin{theorem}[Neural Safety]
\label{thm:safety}
Let $\neural$ be an arbitrary (possibly adversarial) neural component.
If at least one backend $S_j \in \symbolic$ used for verification is sound,
then every proof obligation accepted by \echidna{} is a valid theorem.
Formally:
\[
\bigl(\exists j.\; S_j \text{ is sound}\bigr) \implies
\bigl(\forall \phi.\; \text{accepted}(\phi) \implies \phi \text{ is valid}\bigr)
\]
\end{theorem}
\begin{proof}
By \Cref{def:accepted}, $\phi$ is accepted only if $\mathit{verify}_j(\pi, \phi) = \top$
for some $S_j$. The neural layer $\neural$ does not appear in this condition:
it may have produced $\pi$, or $\pi$ may have been found by the prover's own
search. In either case, the acceptance criterion depends solely on
$\mathit{verify}_j$. Since $S_j$ is sound, $\mathit{verify}_j(\pi, \phi) = \top$
implies $\phi$ is valid. The neural layer's output is consumed only as input
to the search process; it cannot bypass $\mathit{verify}_j$.
\end{proof}
\begin{corollary}[Bounded Failure Mode]
\label{cor:bounded}
If $\neural$ produces only incorrect suggestions, the system fails gracefully:
all suggestions are rejected by $\symbolic$, no proof is accepted, and the only
cost is wasted computation.
\end{corollary}
\begin{proof}
If every suggestion from $\neural$ is incorrect, then for each suggestion
$(t_i, p_i)$, $\mathit{verify}_j(t_i, \phi) = \bot$ for all sound $S_j$.
By \Cref{def:accepted}, $\phi$ is not accepted. The system reports failure,
not a false proof.
\end{proof}
\subsection{Structural Enforcement}
The invariant is enforced through the Rust type system and module visibility.
The neural layer (\texttt{neural.rs}, \texttt{llm.rs}) produces values of type
\texttt{Vec<ScoredTactic>} and \texttt{Vec<ScoredPremise>}. These types are
consumed by the dispatch pipeline (\texttt{dispatch.rs}), which calls
\texttt{verify\_proof()} on the appropriate \proverbackend{} implementation.
Critically:
\begin{enumerate}
\item The \texttt{DispatchResult} struct contains a \texttt{verified: bool}
field that is set \emph{only} by the prover's \texttt{verify\_proof()} method.
\item The \texttt{TrustLevel} enum is computed from \texttt{TrustFactors}
that include prover identity, certificate presence, and axiom usage---but
\emph{not} neural confidence scores.
\item The LLM advisor (\texttt{LlmAdvisor}) is documented with the invariant:
``The LLM \textbf{cannot} influence trust levels. Every suggestion flows
through the existing dispatch.rs trust pipeline.''
\end{enumerate}
There is no public API path from neural output to proof acceptance that does
not pass through a \texttt{verify\_proof()} call. This is a compile-time
guarantee enforced by Rust's module privacy rules.
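The privacy argument can be sketched as follows; the module layout and names
are illustrative rather than the actual \texttt{dispatch.rs}. Because the
\texttt{verified} field is private to the module, code outside it cannot
construct a result with a forged flag.
\begin{lstlisting}[language=Rust,caption={Illustrative sketch of the compile-time firewall via module privacy.},label=lst:privacy-sketch]
mod dispatch {
    /// `verified` is private: only code in this module can set it.
    pub struct DispatchResult { verified: bool, pub prover: String }

    impl DispatchResult {
        pub fn verified(&self) -> bool { self.verified }
    }

    /// The sole constructor. The flag comes from the prover's
    /// `verify_proof` call, never from the caller.
    pub fn run<F: Fn() -> bool>(prover: &str, verify_proof: F)
        -> DispatchResult
    {
        DispatchResult { verified: verify_proof(), prover: prover.into() }
    }
}
\end{lstlisting}
Attempting to write \texttt{dispatch::DispatchResult \{ verified: true, ..\}}
from outside the module is a compile error, which is the sense in which the
invariant is machine-checked rather than merely documented.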
% ═════════════════════════════════════════════════════════════════════════════
% 4. THE UNIFIED PROVER BACKEND
% ═════════════════════════════════════════════════════════════════════════════
\section{The Unified Prover Backend}
\label{sec:unified-backend}
\subsection{Trait Design}
The central abstraction in \echidna{} is the \proverbackend{} trait, an
asynchronous Rust trait that every prover implementation must satisfy.
\Cref{lst:trait} shows the core interface.
\begin{lstlisting}[language=Rust,caption={The \proverbackend{} trait (simplified).},label=lst:trait]
#[async_trait]
pub trait ProverBackend: Send + Sync {
fn kind(&self) -> ProverKind;
async fn version(&self) -> Result<String>;
async fn parse_file(&self, path: PathBuf)
-> Result<ProofState>;
async fn parse_string(&self, content: &str)
-> Result<ProofState>;
async fn apply_tactic(
&self, state: &ProofState, tactic: &Tactic
) -> Result<TacticResult>;
async fn verify_proof(&self, state: &ProofState)
-> Result<bool>;
async fn export(&self, state: &ProofState)
-> Result<String>;
async fn suggest_tactics(
&self, state: &ProofState, limit: usize
) -> Result<Vec<Tactic>>;
async fn search_theorems(&self, pattern: &str)
-> Result<Vec<String>>;
fn config(&self) -> &ProverConfig;
fn set_config(&mut self, config: ProverConfig);
}
\end{lstlisting}
The trait is \texttt{Send + Sync} bounded, enabling safe concurrent use across
the Tokio async runtime. Each method returns \texttt{anyhow::Result<T>},
allowing prover-specific errors to propagate with full context. The
\texttt{ProverFactory} provides dynamic dispatch:
\begin{lstlisting}[language=Rust,caption={Factory-based prover instantiation.},label=lst:factory]
pub fn create(kind: ProverKind, config: ProverConfig)
-> Result<Box<dyn ProverBackend>>
\end{lstlisting}
The \texttt{ProverKind} enum has 30 variants, and the factory maps each to its
concrete implementation. File extension detection
(\texttt{detect\_from\_file()}) handles automatic prover selection from proof
file formats.
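The detection logic reduces to a mapping from file extensions to
\texttt{ProverKind} variants; the fragment below is an illustrative sketch
(variant set truncated, function name hypothetical).
\begin{lstlisting}[language=Rust,caption={Sketch of extension-based prover selection.},label=lst:detect-sketch]
#[derive(Debug, PartialEq)]
pub enum ProverKind { Lean4, Coq, Agda, Smt, Tptp, Unknown }

/// Map a proof file's extension to the backend that handles it.
pub fn detect_from_extension(path: &str) -> ProverKind {
    match path.rsplit('.').next() {
        Some("lean") => ProverKind::Lean4,
        Some("v")    => ProverKind::Coq,
        Some("agda") => ProverKind::Agda,
        Some("smt2") => ProverKind::Smt,
        Some("p")    => ProverKind::Tptp,
        _            => ProverKind::Unknown,
    }
}
\end{lstlisting}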
\subsection{Tier Classification}
We classify the 30 backends into six tiers based on their verification
methodology. \Cref{tab:tiers} summarises the classification.
\begin{table}[htbp]
\centering
\caption{Classification of 30 prover backends by tier and methodology.}
\label{tab:tiers}
\small
\begin{tabular}{@{}llp{6.5cm}@{}}
\toprule
\textbf{Tier} & \textbf{Methodology} & \textbf{Backends} \\
\midrule
1. Interactive & Dependent types / HOL &
Agda, Coq, Lean~4, Isabelle/HOL, Idris~2, F* \\
2. SMT & Satisfiability modulo theories &
Z3, CVC5, Alt-Ergo \\
3. Auto-Active & Annotated programs + ATP &
Dafny, Why3 \\
4. Specialised & Various foundations &
Metamath, HOL Light, Mizar, HOL4, PVS, ACL2, TLAPS, Twelf, Nuprl, Minlog,
Imandra \\
5. First-Order ATP & Resolution / superposition &
Vampire, E Prover, SPASS \\
6. Constraint & LP / CP / MILP &
GLPK, SCIP, MiniZinc, Chuffed, OR-Tools \\
\bottomrule
\end{tabular}
\end{table}
\paragraph{Tier 1: Interactive Proof Assistants.}
These systems implement the de~Bruijn criterion with small trusted kernels.
They support rich dependent type theories (Agda, Coq, Lean~4, Idris~2, F*)
or classical higher-order logic (Isabelle/HOL). The \echidna{} backend
communicates with each through its native interface: Lean~4 via the
\texttt{lean --run} interface and the Lake build system, Coq via
\texttt{coqtop} in batch mode, Isabelle via the PIDE protocol, Agda via its
interaction mode, Idris~2 via the IDE protocol, and F* via its batch checker.
These backends support the full \proverbackend{} interface including tactic
application and interactive proof state manipulation.
\paragraph{Tier 2: SMT Solvers.}
Z3, CVC5, and Alt-Ergo accept problems in SMT-LIB~2 format and return
\texttt{sat}/\texttt{unsat} verdicts with optional proof certificates (Alethe
format for CVC5, Z3's native proof format). The \echidna{} backend translates
proof obligations into SMT-LIB assertions, invokes the solver, and parses the
result. For SMT solvers, ``tactics'' correspond to theory-specific hints
(e.g., \texttt{(check-sat-using (then simplify solve-eqs smt))}).
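The translation step can be sketched as follows, assuming the caller has
already negated the goal (proving $\phi$ by showing $\neg\phi$ is
\texttt{unsat}); the real translation additionally handles sorts, quantifiers,
and logic selection.
\begin{lstlisting}[language=Rust,caption={Sketch of SMT-LIB~2 emission for a proof obligation.},label=lst:smtlib-sketch]
/// Emit an SMT-LIB 2 script: declarations, the negated goal,
/// and a satisfiability check.
pub fn to_smtlib(decls: &[(&str, &str)], negated_goal: &str) -> String {
    let mut s = String::from("(set-logic ALL)\n");
    for (name, sort) in decls {
        s.push_str(&format!("(declare-const {} {})\n", name, sort));
    }
    // unsat here means the original goal is valid.
    s.push_str(&format!("(assert {})\n(check-sat)\n", negated_goal));
    s
}
\end{lstlisting}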
\paragraph{Tier 3: Auto-Active Verifiers.}
Dafny compiles annotated programs to Boogie, which dispatches verification
conditions to Z3. Why3 supports multiple backend provers (including Z3, CVC5,
Alt-Ergo, and Coq) through its own dispatch mechanism. \echidna{} treats these
as single backends but benefits from their internal multi-prover capabilities.
\paragraph{Tier 4: Specialised Provers.}
This tier includes systems with distinctive characteristics: Metamath's
extremely small kernel (under 500 lines of code); HOL Light's LCF-style
architecture; Mizar's natural deduction style; PVS's rich type system with
subtypes and dependent types; ACL2's computational logic for reasoning about
programs; TLAPS for temporal logic of actions; Twelf's logical framework
approach; Nuprl's constructive type theory; Minlog's focus on program
extraction from proofs; and Imandra's combination of ML-style programming
with automated reasoning.
\paragraph{Tier 5: First-Order ATPs.}
Vampire, E~Prover, and SPASS implement resolution and superposition calculi
for first-order logic. They accept TPTP-format problems and produce TSTP-format
solutions. These are the workhorses for first-order obligations---when a
higher-order goal can be translated to first-order form, these provers
often find proofs within seconds that would require significant human guidance
in an ITP.
\paragraph{Tier 6: Constraint and Optimisation Solvers.}
GLPK (linear programming), SCIP (mixed-integer nonlinear programming),
MiniZinc (constraint modelling), Chuffed (constraint propagation), and
OR-Tools (combinatorial optimisation) handle obligations that reduce to
optimisation or constraint satisfaction. While unconventional as ``theorem
provers,'' these are essential for verifying properties of numerical algorithms,
resource allocation problems, and scheduling constraints.
\subsection{Implementation Complexity}
Each backend implementation handles prover-specific concerns: process
management, input format translation, output parsing, and error recovery.
\Cref{tab:complexity} summarises the implementation characteristics.
\begin{table}[htbp]
\centering
\caption{Implementation characteristics of selected backends. Complexity is
rated 1--5 (1 = simplest). Implementation time is estimated from development
logs.}
\label{tab:complexity}
\small
\begin{tabular}{@{}lcccl@{}}
\toprule
\textbf{Backend} & \textbf{Tier} & \textbf{Complexity} & \textbf{Impl. (weeks)} & \textbf{Input Format} \\
\midrule
Z3 & 2 & 2 & 1.0 & SMT-LIB 2 \\
CVC5 & 2 & 2 & 1.0 & SMT-LIB 2 \\
Metamath & 4 & 2 & 1.5 & .mm \\
Vampire & 5 & 2 & 1.5 & TPTP \\
E Prover & 5 & 2 & 1.5 & TPTP \\
Agda & 1 & 3 & 2.5 & .agda \\
Lean 4 & 1 & 3 & 2.5 & .lean \\
Coq & 1 & 3 & 2.5 & .v \\
F* & 1 & 3 & 2.5 & .fst \\
Dafny & 3 & 2 & 2.0 & .dfy \\
Why3 & 3 & 3 & 2.0 & .mlw \\
Isabelle & 1 & 4 & 3.0 & .thy \\
PVS & 4 & 4 & 3.5 & .pvs \\
ACL2 & 4 & 4 & 3.5 & .lisp \\
HOL4 & 4 & 5 & 4.0 & .sml \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Sandboxed Execution}
Every prover backend executes within a sandboxed environment. The executor
module (\texttt{executor/sandbox.rs}) supports three isolation levels:
\begin{enumerate}
\item \textbf{Podman}: Full container isolation using Podman (rootless). Each
prover runs in a dedicated container with minimal filesystem access. This is
the default for production deployments.
\item \textbf{Bubblewrap}: Lightweight namespace-based sandboxing using
\texttt{bwrap}. Lower overhead than containers, suitable for development.
\item \textbf{None}: Direct execution, used only for trusted development
environments.
\end{enumerate}
Before execution, the integrity checker (\texttt{integrity/solver\_integrity.rs})
verifies solver binaries against stored SHA3-512 and BLAKE3 hashes. If a
binary fails the integrity check, the backend refuses to execute.
% ═════════════════════════════════════════════════════════════════════════════
% 5. TRUST HARDENING
% ═════════════════════════════════════════════════════════════════════════════
\section{Trust Hardening via Cross-Prover Validation}
\label{sec:trust-hardening}
\subsection{The Five-Level Trust Hierarchy}
\echidna{} assigns a trust level $\trustlevel{k}$ ($k \in \{1,\ldots,5\}$) to
each proof result based on multiple factors. The hierarchy is defined as
follows:
\begin{definition}[Trust Levels]
\label{def:trust}
\begin{align*}
\trustlevel{1} &:\quad \text{Large-TCB system, unchecked result, or dangerous axioms used} \\
\trustlevel{2} &:\quad \text{Single prover, no dangerous axioms} \\
\trustlevel{3} &:\quad \text{Single prover with proof certificate (Alethe, DRAT/LRAT, TSTP)} \\
\trustlevel{4} &:\quad \text{Small-kernel prover (Lean, Coq, Isabelle) with certificate} \\
\trustlevel{5} &:\quad \text{Cross-checked by $\geq 2$ independent small-kernel systems}
\end{align*}
\end{definition}
Trust level computation takes as input a \texttt{TrustFactors} structure
containing: the prover identity, whether a proof certificate was produced and
independently verified, whether the prover has a small trusted kernel, the
axiom danger level, and whether the solver binary passed integrity
verification. The function $\mathit{compute\_trust\_level} :
\mathit{TrustFactors} \to \trustlevel{k}$ is deterministic and pure---it
cannot be influenced by neural components.
\subsection{Axiom Usage Tracking}
The axiom tracker (\texttt{verification/axiom\_tracker.rs}) classifies axiom
usage into four danger levels:
\begin{enumerate}
\item \textbf{Safe}: Standard logical axioms (e.g., propositional logic,
induction principles native to the logic).
\item \textbf{Noted}: Non-constructive but widely accepted axioms (e.g.,
the axiom of choice, the law of excluded middle in a constructive setting).
\item \textbf{Warning}: Axioms that introduce potential inconsistency risks
(e.g., large cardinal axioms, impredicative universe polymorphism).
\item \textbf{Reject}: Axioms known to be inconsistent or explicitly
blacklisted (e.g., \texttt{sorry} in Lean, \texttt{Admitted} in Coq,
\texttt{believe\_me} in Idris~2).
\end{enumerate}
If \texttt{sorry}, \texttt{Admitted}, \texttt{believe\_me},
\texttt{assert\_total}, \texttt{unsafeCoerce}, or \texttt{Obj.magic} is
detected in a proof, the trust level is immediately capped at $\trustlevel{1}$
regardless of other factors. This ensures that incomplete or
axiomatically-suspect proofs are never assigned high confidence.
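The capping check can be sketched as follows, under the simplifying assumption that the proof source is available as raw text; the actual tracker operates on parsed proof terms, which avoids the false positives a plain substring scan can produce:

```rust
// Simplified sketch: scan proof source text for trust-capping constructs.
// The real tracker (verification/axiom_tracker.rs) inspects parsed proof
// terms, so identifiers that merely *contain* these strings are not flagged.

const FORBIDDEN: &[&str] = &[
    "sorry", "Admitted", "believe_me", "assert_total", "unsafeCoerce", "Obj.magic",
];

/// Returns Some(token) if the proof contains a construct that caps trust at TL1.
fn find_forbidden(proof_src: &str) -> Option<&'static str> {
    FORBIDDEN.iter().copied().find(|t| proof_src.contains(*t))
}

fn main() {
    let proof = "theorem foo : 1 + 1 = 2 := by sorry";
    match find_forbidden(proof) {
        Some(tok) => println!("trust capped at TL1: found `{}`", tok),
        None => println!("no forbidden constructs detected"),
    }
}
```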
\subsection{Cross-Prover Validation Protocol}
The cross-validation protocol operates as follows:
\begin{algorithm}[htbp]
\caption{Cross-Prover Validation}
\label{alg:cross-validation}
\begin{algorithmic}[1]
\Require Theorem statement $\phi$, initial proof $\pi$ from prover $S_i$, target provers $\{S_{j_1}, \ldots, S_{j_m}\}$
\Ensure Trust level $\trustlevel{k}$
\State $\mathit{confirmations} \gets \{S_i\}$
\For{$S_j \in \{S_{j_1}, \ldots, S_{j_m}\}$ \textbf{in parallel}}
\State Translate $\phi$ to native format of $S_j$
\State $\pi_j \gets S_j.\mathit{prove}(\phi)$ \Comment{Independent re-proving, not translation}
\If{$S_j.\mathit{verify\_proof}(\pi_j) = \top$}
\State $\mathit{confirmations} \gets \mathit{confirmations} \cup \{S_j\}$
\EndIf
\EndFor
\State $n_{\text{small}} \gets |\{S \in \mathit{confirmations} : S \text{ has small kernel}\}|$
\If{$n_{\text{small}} \geq 2$}
\State \Return $\trustlevel{5}$
\ElsIf{$n_{\text{small}} \geq 1$ \textbf{and} certificate verified}
\State \Return $\trustlevel{4}$
\ElsIf{certificate verified}
\State \Return $\trustlevel{3}$
\Else
\State \Return $\trustlevel{2}$
\EndIf
\end{algorithmic}
\end{algorithm}
\subsection{Trust Accumulation}
We model trust accumulation probabilistically. Let $p_j$ denote the
probability that prover $S_j$ has a soundness bug. If $S_j$ and $S_k$ are
\emph{independent}---i.e., they share no kernel code, are implemented in
different languages, and were developed by different teams---then the
probability that both are unsound on the same theorem is $p_j \cdot p_k$.
\begin{proposition}[Multiplicative Trust]
\label{prop:mult-trust}
Given $n$ independent sound-with-probability-$(1-p)$ provers that all
verify the same theorem, the probability of a false positive is at most $p^n$.
\end{proposition}
For concrete estimates: if each prover has a soundness-bug probability of
$10^{-6}$ per theorem (a plausible order of magnitude for well-audited
small-kernel systems), then cross-validation with two independent provers
bounds the false-positive probability at $10^{-12}$, and with three provers
at $10^{-18}$---comparable to or below the rate of spontaneous bit flips in
hardware memory without ECC.
\subsection{Proof Certificate Checking}
The certificate module (\texttt{verification/certificates.rs}) supports three
certificate formats:
\begin{itemize}
\item \textbf{Alethe}: The standardised SMT proof format, used by CVC5 and
supported by Z3. Alethe certificates are checked by an independent
verifier.
\item \textbf{DRAT/LRAT}: Deletion Resolution Asymmetric Tautology
certificates for SAT/SMT proofs. LRAT (Linear RAT) certificates carry
explicit hints that make them checkable in linear time.
\item \textbf{TSTP}: The standard output format for first-order ATPs
(Vampire, E~Prover, SPASS). TSTP solutions include derivation steps that
can be independently checked.
\end{itemize}
When a certificate is available, the certificate checker runs as a separate
process from the prover, providing an additional layer of verification
independent of the prover's own correctness.
\subsection{Proof Exchange}
The exchange module (\texttt{exchange/}) implements bidirectional translation
for two proof exchange formats:
\begin{itemize}
\item \textbf{OpenTheory}~\citep{opentheory}: A proof exchange format for
the HOL family (HOL Light, HOL4, Isabelle/HOL). Proofs are represented as
sequences of primitive inferences that can be replayed in any HOL system.
\item \textbf{Dedukti}~\citep{dedukti}: A logical framework based on the
$\lambda\Pi$-calculus modulo rewriting. Dedukti can encode proofs from Coq,
Lean, Agda, and other type-theoretic systems.
\end{itemize}
These translations complement cross-prover validation: OpenTheory enables
proof sharing within the HOL family, while Dedukti bridges the type-theoretic
systems. For cross-family validation (e.g., Lean to Isabelle), \echidna{}
uses independent re-proving rather than translation.
% ═════════════════════════════════════════════════════════════════════════════
% 6. PROVER WARS
% ═════════════════════════════════════════════════════════════════════════════
\section{Prover Wars: Competitive Multi-Backend Proving}
\label{sec:prover-wars}
\subsection{Motivation}
Different provers excel at different problem shapes. SMT solvers dispatch
ground arithmetic and propositional reasoning in milliseconds but cannot
handle inductive proofs. First-order ATPs are powerful on combinatorial
problems but lack the expressiveness of dependent type theories. Interactive
provers handle arbitrary mathematics but require more guidance. Rather than
selecting a single prover a priori, Prover Wars exploits this
complementarity by running multiple backends in parallel and accepting the
first verified result.
\subsection{Protocol}
\begin{definition}[Prover Wars]
\label{def:prover-wars}
Given a proof obligation $\phi$ and a set of eligible backends $B \subseteq \symbolic$,
a \emph{Prover Wars} round proceeds as follows:
\begin{enumerate}
\item \textbf{Translation}: $\phi$ is translated into the native format of
each $S_j \in B$.
\item \textbf{Parallel dispatch}: All $|B|$ backends are invoked
concurrently with a shared timeout $T$.
\item \textbf{First-verified-wins}: The first backend $S_j$ to produce a
verified proof $\pi_j$ (i.e., $\mathit{verify}_j(\pi_j, \phi) = \top$)
wins the round. All other backends are cancelled.
\item \textbf{Recording}: The winning prover, proof time, and proof
complexity are recorded for future strategy optimisation.
\end{enumerate}
\end{definition}
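The race in steps 2--3 can be sketched with threads and a channel. This is a minimal model, not \echidna{}'s executor: real backends are sandboxed external processes, and cancellation of the losers is elided here (their late sends are simply discarded):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Minimal sketch of the first-verified-wins race. Each "prover" is modelled
// as a closure returning Some(proof) on success; the first message received
// on the channel wins the round.
fn prover_wars<F>(
    backends: Vec<(&'static str, F)>,
    timeout: Duration,
) -> Option<(&'static str, String)>
where
    F: FnOnce() -> Option<String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    for (name, run) in backends {
        let tx = tx.clone();
        thread::spawn(move || {
            if let Some(proof) = run() {
                let _ = tx.send((name, proof)); // losers' sends are ignored
            }
        });
    }
    drop(tx); // recv fails early if every backend gives up
    rx.recv_timeout(timeout).ok() // first verified result wins the round
}

fn main() {
    type Backend = Box<dyn FnOnce() -> Option<String> + Send>;
    let backends: Vec<(&'static str, Backend)> = vec![
        ("z3", Box::new(|| {
            thread::sleep(Duration::from_millis(20));
            Some("unsat".to_string())
        })),
        ("vampire", Box::new(|| {
            thread::sleep(Duration::from_millis(200));
            Some("refutation".to_string())
        })),
    ];
    let winner = prover_wars(backends, Duration::from_secs(2));
    println!("{:?}", winner.map(|(name, _)| name)); // prints Some("z3")
}
```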
\subsection{Strategy Selection via Pareto Optimality}
Not all provers should participate in every round---the computational cost of
running 30 backends simultaneously is prohibitive for routine obligations. The
Pareto frontier module (\texttt{verification/pareto.rs}) selects an optimal
subset based on historical performance data.
For each prover $S_j$ and problem class $C$, we track:
\begin{itemize}
\item $r_j(C)$: success rate on problems of class $C$
\item $\bar{t}_j(C)$: mean proof time on successful problems of class $C$
\item $\sigma_j(C)$: standard deviation of proof time
\end{itemize}
A prover $S_j$ \emph{Pareto-dominates} $S_k$ for class $C$ if
$r_j(C) \geq r_k(C)$ and $\bar{t}_j(C) \leq \bar{t}_k(C)$ with at least one
strict inequality. The Prover Wars round selects the Pareto-optimal set---those
provers not dominated by any other---plus any prover with $r_j(C) > 0.5$
(to avoid discarding provers that are slow but uniquely capable).
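A sketch of this selection rule, with illustrative statistics (the real module also conditions on the problem class $C$ and tracks variance):

```rust
// Sketch of Pareto-optimal backend selection. The struct and the sample
// numbers are illustrative; the real module keys statistics by problem class.
#[derive(Debug, Clone, PartialEq)]
struct ProverStats {
    name: &'static str,
    success_rate: f64, // r_j(C)
    mean_time: f64,    // mean proof time t_j(C), seconds
}

/// a Pareto-dominates b: at least as good on both axes, strictly better on one.
fn dominates(a: &ProverStats, b: &ProverStats) -> bool {
    a.success_rate >= b.success_rate
        && a.mean_time <= b.mean_time
        && (a.success_rate > b.success_rate || a.mean_time < b.mean_time)
}

/// Pareto frontier, plus any prover with r_j(C) > 0.5 (slow-but-capable rule).
fn select(provers: &[ProverStats]) -> Vec<&'static str> {
    provers
        .iter()
        .filter(|p| {
            p.success_rate > 0.5 || !provers.iter().any(|q| dominates(q, *p))
        })
        .map(|p| p.name)
        .collect()
}

fn main() {
    let stats = [
        ProverStats { name: "z3", success_rate: 0.80, mean_time: 0.3 },
        // dominated by z3, but kept because its success rate exceeds 0.5:
        ProverStats { name: "cvc5", success_rate: 0.75, mean_time: 0.5 },
        // on the frontier: lower success rate, but fastest overall:
        ProverStats { name: "vampire", success_rate: 0.40, mean_time: 0.2 },
        // dominated and below the 0.5 threshold: dropped:
        ProverStats { name: "eprover", success_rate: 0.35, mean_time: 0.9 },
    ];
    println!("{:?}", select(&stats)); // prints ["z3", "cvc5", "vampire"]
}
```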
\subsection{Bayesian Timeout Estimation}
The statistics module (\texttt{verification/statistics.rs}) uses Bayesian
inference to estimate per-prover timeouts. Rather than a fixed global timeout,
each prover receives a timeout calibrated to its historical performance on
similar problems:
\begin{equation}
T_j(C) = \bar{t}_j(C) + \alpha \cdot \sigma_j(C)
\label{eq:timeout}
\end{equation}
where $\alpha$ is a confidence parameter (default: $\alpha = 2.0$; under a
Gaussian assumption, this one-sided bound covers approximately 97.7\% of
successful proofs). This prevents slow provers from consuming resources when
faster alternatives are available, while allowing sufficient time for provers
that are slow but reliable on specific problem classes.
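A worked instance of \Cref{eq:timeout} (the function name is illustrative):

```rust
// Worked instance of the per-prover timeout rule T_j(C) = mean + alpha * sigma.
fn calibrated_timeout(mean_time: f64, std_dev: f64, alpha: f64) -> f64 {
    mean_time + alpha * std_dev
}

fn main() {
    // A prover averaging 2.0 s with a 1.5 s standard deviation on this
    // problem class receives a 5.0 s budget at the default alpha = 2.0.
    println!("timeout = {:.1} s", calibrated_timeout(2.0, 1.5, 2.0));
}
```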
\subsection{Mutation Testing for Specifications}
The mutation module (\texttt{verification/mutation.rs}) provides an additional
validation mechanism: rather than only proving that a theorem holds, it
systematically \emph{mutates} the theorem statement and checks that the
mutated versions are \emph{not} provable. If a prover can prove both a
theorem and its negation (or a weakened version), this indicates either
an inconsistency in the axioms or a bug in the prover. Mutation testing
serves as a complement to cross-prover validation, catching a different
class of errors.
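A toy sketch of mutant generation on a goal rendered as text; the actual module mutates the parsed term, and the mutation operators shown here (negation, equality flip, constant perturbation) are illustrative:

```rust
// Toy sketch of specification mutation. The real module
// (verification/mutation.rs) mutates the parsed term, not a string.
// Each mutant is a statement that is expected to *fail* to be provable.
fn mutate(goal: &str) -> Vec<String> {
    let mut mutants = vec![format!("~({})", goal)]; // the negated statement
    if goal.contains('=') {
        mutants.push(goal.replacen('=', "~=", 1)); // flip the first equality
    }
    if goal.contains('0') {
        mutants.push(goal.replacen('0', "1", 1)); // perturb a constant
    }
    mutants
}

fn main() {
    for m in mutate("forall n, n + 0 = n") {
        println!("expect unprovable: {}", m);
    }
}
```

If any mutant is provable alongside the original theorem, the round is flagged for investigation as a possible axiom inconsistency or prover bug.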
\subsection{Multi-Agent Coordination}
The agent module (\texttt{agent/}) implements Prover Wars coordination using
an actor model. Four agent types collaborate:
\begin{enumerate}
\item \textbf{CoordinatorAgent}: Manages the overall proof search strategy,
decomposes complex goals, and decides when to escalate to cross-prover
validation.
\item \textbf{ProverAgent}: Wraps a single \proverbackend{} implementation.
Receives goals from the coordinator and returns proof results. Multiple
prover agents run concurrently during Prover Wars rounds.
\item \textbf{ContextAgent}: Maintains the proof context, including
available lemmas, hypotheses, and previously proven results. Provides
context to the neural layer for informed tactic suggestion.
\item \textbf{LemmaAgent}: Manages auxiliary lemma generation and
verification. When the neural layer suggests a lemma decomposition,
the lemma agent verifies each sub-lemma independently before
composing the final proof.
\end{enumerate}
The multi-agent system communicates through typed message passing, ensuring
that the coordinator has a consistent view of all in-progress proof attempts
and can make informed decisions about resource allocation.
% ═════════════════════════════════════════════════════════════════════════════
% 7. NEURAL GUIDANCE ARCHITECTURE
% ═════════════════════════════════════════════════════════════════════════════
\section{Neural Guidance Architecture}
\label{sec:neural-architecture}
\subsection{Two-Layer Design}
\echidna{} provides neural guidance at two levels, each serving a distinct role:
\begin{enumerate}
\item \textbf{ML Layer} (Julia-based, \texttt{neural.rs} bridge): Fast,
specialised models for tactic prediction and premise selection. These models
are trained on proof corpora and provide low-latency suggestions
($<100$\,ms per query).
\item \textbf{LLM Advisory Layer} (\texttt{llm.rs} bridge): Frontier
language models (Claude, GPT-4, and others via a model-routing service)
for high-level strategy. The LLM provides goal reformulation, lemma
decomposition, proof sketches, and prover selection advice. Latency is
higher ($1$--$15$\,s) but the suggestions are richer and more contextual.
\end{enumerate}
Both layers are strictly advisory. Their outputs are typed as suggestions---\texttt{Vec<ScoredTactic>}, \texttt{Vec<ScoredPremise>}, or
structured JSON---and must pass through the verification pipeline before
affecting the proof state.
\subsection{ML Layer: Tactic Prediction and Premise Selection}
The ML layer, implemented in Julia (EchidnaML), provides three services
accessible via a REST API on a configurable port (default: 8081):
\paragraph{Tactic Prediction.}
Given the current proof state---the goal to be proven, the local hypotheses,
and the ambient context---the tactic predictor ranks candidate tactics by
estimated probability of making progress. The model uses logistic regression
over hand-engineered features extracted from the proof state:
\begin{itemize}
\item \textbf{Syntactic features}: Goal structure (universal/existential
quantifiers, connectives, term depth), hypothesis count, goal size.
\item \textbf{Semantic features}: Type-class membership, definitional
unfolding depth, similarity to previously solved goals.
\item \textbf{Historical features}: Per-tactic success rate on goals with
similar feature profiles.
\end{itemize}
The model outputs a distribution over tactics:
$P(\text{tactic} \mid \text{state}) \propto \exp(w^\top f(\text{state}, \text{tactic}))$,
where $f$ is the feature vector and $w$ is the learned weight vector.
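The scoring rule amounts to a numerically stable softmax over the linear scores $w^\top f$; a sketch, with hypothetical feature vectors:

```rust
// Sketch of the linear scoring rule P(tactic | state) ~ exp(w . f).
// The feature vectors in main() are hypothetical stand-ins for f(state, tactic).
fn dot(w: &[f64], f: &[f64]) -> f64 {
    w.iter().zip(f).map(|(a, b)| a * b).sum()
}

/// Numerically stable softmax over candidate tactics' linear scores.
fn tactic_distribution(w: &[f64], feats: &[Vec<f64>]) -> Vec<f64> {
    let scores: Vec<f64> = feats.iter().map(|f| dot(w, f)).collect();
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|e| e / z).collect()
}

fn main() {
    let w = vec![1.0, -0.5];
    // Hypothetical feature vectors for three candidate tactics on one goal.
    let feats = vec![vec![2.0, 1.0], vec![1.0, 0.0], vec![0.0, 2.0]];
    let p = tactic_distribution(&w, &feats);
    println!("{:?}", p); // most probability mass on the first tactic
}
```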
\paragraph{Premise Selection.}
Given a goal, the premise selector retrieves the most relevant lemmas from the
ambient library. This is a ranking problem: among potentially thousands of
available lemmas, select the top-$k$ most likely to be useful. The selector
combines TF-IDF similarity over tokenised goal and lemma statements with
learned relevance scores.
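The retrieval half of this pipeline can be sketched with a toy TF-IDF cosine ranking over whitespace-tokenised statements (the learned relevance component and real tokenisation are omitted):

```rust
use std::collections::{HashMap, HashSet};

// Toy TF-IDF cosine ranking of lemma statements against a goal. This sketches
// only the retrieval half of premise selection; the learned relevance scores
// that Echidna combines with it are omitted.
fn tfidf_vectors(docs: &[&str]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;
    // Document frequency: number of documents containing each token.
    let mut df: HashMap<String, f64> = HashMap::new();
    let token_sets: Vec<HashSet<String>> = docs
        .iter()
        .map(|d| d.split_whitespace().map(str::to_string).collect())
        .collect();
    for ts in &token_sets {
        for t in ts {
            *df.entry(t.clone()).or_insert(0.0) += 1.0;
        }
    }
    docs.iter()
        .map(|d| {
            let mut v = HashMap::new();
            for t in d.split_whitespace() {
                *v.entry(t.to_string()).or_insert(0.0) += 1.0; // term frequency
            }
            for (t, tf) in v.iter_mut() {
                *tf *= (n / df[t]).ln() + 1.0; // smoothed inverse doc frequency
            }
            v
        })
        .collect()
}

fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(t, x)| b.get(t).map(|y| x * y)).sum();
    let na = a.values().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.values().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // docs[0] is the goal; the rest are candidate lemmas.
    let docs = [
        "append xs nil = xs",
        "append nil xs = xs",
        "rev ( rev xs ) = xs",
    ];
    let vs = tfidf_vectors(&docs);
    println!("append lemma: {:.2}", cosine(&vs[0], &vs[1])); // higher score
    println!("rev lemma:    {:.2}", cosine(&vs[0], &vs[2]));
}
```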
\paragraph{Diversity-Aware Ranking.}
To avoid redundant suggestions, the ML layer implements a diversity-aware
re-ranking step. After the initial ranking, suggestions are iteratively
selected to maximise a submodular objective that trades off relevance with
diversity (measured by pairwise dissimilarity in the tactic/premise embedding
space). This ensures that the top-$k$ suggestions cover different proof
strategies rather than minor variations of the same approach.
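The greedy selection step can be sketched in the style of maximal marginal relevance: at each iteration, pick the candidate maximising relevance penalised by its similarity to the suggestions already chosen. The function names, the trade-off weight, and the embeddings below are illustrative:

```rust
// Greedy maximal-marginal-relevance style re-ranking, as a sketch of the
// diversity-aware step. `lambda` trades relevance against diversity;
// names and the sample embeddings are illustrative.
fn similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

fn diverse_top_k(relevance: &[f64], embeds: &[Vec<f64>], k: usize, lambda: f64) -> Vec<usize> {
    let mut picked: Vec<usize> = Vec::new();
    while picked.len() < k.min(relevance.len()) {
        let best = (0..relevance.len())
            .filter(|i| !picked.contains(i))
            .max_by(|&i, &j| {
                let score = |idx: usize| {
                    // Penalty: similarity to the closest already-picked item.
                    let max_sim = picked
                        .iter()
                        .map(|&p| similarity(&embeds[idx], &embeds[p]))
                        .fold(0.0_f64, f64::max);
                    lambda * relevance[idx] - (1.0 - lambda) * max_sim
                };
                score(i).partial_cmp(&score(j)).unwrap()
            })
            .unwrap();
        picked.push(best);
    }
    picked
}

fn main() {
    // Candidates 0 and 1 are near-duplicates; candidate 2 is distinct but
    // slightly less relevant. Diversity-aware ranking picks 0 then 2.
    let relevance = [0.9, 0.85, 0.7];
    let embeds = vec![vec![1.0, 0.0], vec![0.99, 0.14], vec![0.0, 1.0]];
    println!("{:?}", diverse_top_k(&relevance, &embeds, 2, 0.5)); // prints [0, 2]
}
```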
\subsection{LLM Advisory Layer}
The LLM advisory layer (\texttt{llm.rs}) communicates with frontier language
models through a model-routing service (BoJ Server). The LLM advisor provides
four services:
\paragraph{Goal Reformulation.}
Given a stuck proof state, the LLM suggests equivalent reformulations of the
goal that may be more amenable to available tactics. For example, it might
suggest converting a goal about lists to one about vectors, or unfolding a
definition to expose a structural pattern.
\paragraph{Lemma Decomposition.}
For complex goals, the LLM suggests decompositions into auxiliary lemmas. Each
suggested lemma is independently verified by the prover backend before being
used in the main proof. The LemmaAgent coordinates this process.
\paragraph{Proof Sketch Generation.}
The LLM generates high-level proof sketches---sequences of informal reasoning
steps that are then formalised by matching each step to verified tactics. This
is similar to the Draft, Sketch, and Prove approach~\citep{dsp}, but applied
across multiple prover backends rather than only Isabelle.
\paragraph{Prover Selection.}
Given a goal's characteristics, the LLM advises which subset of provers is
most likely to succeed. This advice is combined with the Pareto-optimal
selection from historical data (\Cref{sec:prover-wars}) but does not override