Context
Thuki bundles a pinned llama-server build (currently llama.cpp b9590), fetched, verified, pruned, and re-signed by scripts/ensure-llama-server.ts. The pin is bumped only by hand. As llama.cpp adds support for new model architectures, a stale pin means newly released models fail to load on the built-in engine, and the failure today is opaque.
Motivating case: a user installed a model whose architecture postdates the pinned build; llama-server exits with "unknown model architecture" and the user sees only a generic engine-start failure.
We want the bundled engine to stay current automatically, but safely. A newer engine build can also break a previously-working model, change output, regress performance, or change the dylib closure in a way that breaks our prune-and-sign step. So "stay current" must be paired with "prove no regression."
Goal
Two coupled deliverables:
- Automated, scheduled, review-gated bumping of the pinned llama.cpp engine.
- An engine regression CI gate that proves a given engine build loads real models and runs real inference correctly, runs on every PR, and acts as the hard gate on bump PRs.
Requirements (the asks)
Automation / bump PRs:
- A scheduled job detects new upstream llama.cpp releases and opens a bump PR.
- Bump PRs are NEVER auto-merged. They require human review and approval before landing.
- A bump PR pins the exact new build identifier and its sha256, and re-runs the existing fetch/verify/prune/re-sign flow. Provenance stays pinned: never float to "latest" at build time.
- The PR makes the change reviewable: which build, what changed upstream, and the regression-gate result.
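To make the "detect a new release, pin the exact build" requirement concrete, here is a minimal sketch of the comparison step only. It is an illustration, not a proposed design: the EnginePin shape, the function names, and the assumption that upstream tags look like "b" followed by a build number (as in b9590) are all hypothetical, and the real pin format lives in ensure-llama-server.ts.

```typescript
// Hypothetical shape of the pin record; field names are assumptions,
// not the actual format used by scripts/ensure-llama-server.ts.
interface EnginePin {
  tag: string; // exact upstream build tag, e.g. "b9590"
  sha256: string; // hex digest of the pinned release archive
}

// Parse the numeric part of a build tag ("b9590" -> 9590).
// Returns null for tags that do not match the assumed "b<digits>" form.
function parseBuildNumber(tag: string): number | null {
  const match = /^b(\d+)$/.exec(tag);
  return match ? Number(match[1]) : null;
}

// Return the highest upstream tag that is strictly newer than the pin,
// or null when there is nothing to bump to. Unparseable tags are skipped.
function pickBumpCandidate(current: EnginePin, upstreamTags: string[]): string | null {
  const currentBuild = parseBuildNumber(current.tag);
  if (currentBuild === null) {
    throw new Error(`Pinned tag "${current.tag}" is not in the expected b<digits> form`);
  }
  let best: { tag: string; build: number } | null = null;
  for (const tag of upstreamTags) {
    const build = parseBuildNumber(tag);
    if (build === null || build <= currentBuild) continue;
    if (best === null || build > best.build) best = { tag, build };
  }
  return best ? best.tag : null;
}
```

A scheduled job could call pickBumpCandidate with the list of upstream release tags and open a PR only when the result is non-null. The new sha256 would still have to be computed from the downloaded archive and committed alongside the tag, so the build never floats to "latest".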
Engine regression CI gate:
- A new CI job loads at least one real GGUF model through the actual bundled sidecar and runs real inference end to end (not just cargo build).
- It asserts the engine starts, the model loads, inference produces correct/sane output, and there is no functional or meaningful performance regression versus the current pin.
- It exercises the real ensure-llama-server.ts fetch/verify/prune/re-sign path so a changed dylib closure or signing breakage is caught.
- This gate runs on EVERY PR (so normal changes cannot silently break the engine path), and is the mandatory hard gate on engine-bump PRs specifically.
- A failing gate blocks the merge.
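As one possible reading of the assertions above, the sketch below turns a single gate run into a list of failure reasons, where an empty list means the gate passes. Everything in it is an assumption for illustration: the GateRun and GateBaseline shapes, the substring check as the "correct output" test, and the tokens-per-second threshold as the performance check. How to assert correctness deterministically and measure performance without flakiness remains an open question for the implementer.

```typescript
// Hypothetical result of one gate run against the bundled sidecar.
interface GateRun {
  engineStarted: boolean;
  modelLoaded: boolean;
  output: string; // completion text for a fixed prompt
  tokensPerSecond: number; // measured generation throughput
}

// Hypothetical baseline recorded for the current pin.
interface GateBaseline {
  expectedSubstring: string; // text the completion must contain
  tokensPerSecond: number; // throughput measured on the current pin
  maxSlowdown: number; // allowed fractional slowdown, e.g. 0.2 for 20%
}

// Return every reason the run fails the gate; an empty array means pass.
function evaluateGate(run: GateRun, baseline: GateBaseline): string[] {
  // Startup and load failures make the later checks meaningless, so stop early.
  if (!run.engineStarted) return ["engine did not start"];
  if (!run.modelLoaded) return ["model did not load"];

  const failures: string[] = [];
  if (!run.output.includes(baseline.expectedSubstring)) {
    failures.push(`output does not contain "${baseline.expectedSubstring}"`);
  }
  // Fail only when throughput drops below the tolerated floor.
  const floor = baseline.tokensPerSecond * (1 - baseline.maxSlowdown);
  if (run.tokensPerSecond < floor) {
    failures.push(`throughput ${run.tokensPerSecond} tok/s is below floor ${floor} tok/s`);
  }
  return failures;
}
```

A CI step could exit non-zero whenever the returned array is non-empty, which is what makes a failing gate block the merge. The tolerance parameter is there because shared runners are noisy; the right value, if this approach is used at all, would need measuring.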
Explicitly open for the implementer (do your own research)
This ticket lists requirements, not a design. Whoever takes it must research, investigate, and evaluate the options before implementing, then bring a short proposal for review. Do not assume an approach from this ticket. Questions to evaluate and justify:
- Bump-automation mechanism and cadence.
- How to detect and represent "a new release," and how to avoid noisy rebuild-only bumps.
- Which model(s) and prompt(s) the regression gate uses, how it asserts "correct output" deterministically, and how it measures performance regression without flakiness.
- How the gate stays fast enough to run on every PR (caching, a small model, runner sizing).
- Where the macOS-only signing/runner constraints push the design.
Constraints to honor
- macOS-only sidecar; ad-hoc signed (signingIdentity: "-", hardenedRuntime: false); the existing ensure-llama-server.ts is the source of truth for fetch/verify/prune/sign.
- Keep pinned-revision provenance (hash = integrity, pin = provenance).
- Repo conventions: 100% coverage gates, signoff commits, no auto-merge for behavior-changing binaries.
Out of scope
- The in-app user-facing handling of an unsupported model (clearer error + switch-model nudge) is a separate change.
- Decoupled user-downloadable engine runtimes (LM Studio style) are not requested here; the existing Ollama / OpenAI-compatible provider path is the escape hatch for models the bundled engine cannot run.
Acceptance criteria
Note
Out of Phase 3 scope; separate infrastructure workstream.