Keep the bundled llama.cpp engine current: scheduled review-gated bump PRs + engine regression CI gate #238

## Context

Thuki bundles a pinned `llama-server` build (currently llama.cpp **b9590**), fetched, verified, pruned, and re-signed by `scripts/ensure-llama-server.ts`. The pin is bumped only by hand. As llama.cpp adds support for new model architectures, a stale pin means newly released models fail to load on the built-in engine, and the failure today is opaque.

Motivating case: a user installed a model whose architecture postdates the pinned build; `llama-server` exits with `unknown model architecture`, and the user sees only a generic engine-start failure.

We want the bundled engine to stay current automatically, but safely. A newer engine build can also break a previously-working model, change output, regress performance, or change the dylib closure in a way that breaks our prune-and-sign step. So "stay current" must be paired with "prove no regression."

## Goal

Two coupled deliverables:

1. Automated, scheduled, review-gated bumping of the pinned llama.cpp engine.
2. An engine regression CI gate that proves a given engine build loads real models and runs real inference correctly. It runs on every PR and acts as the hard gate on bump PRs.

## Requirements (the asks)

Automation / bump PRs:

- A scheduled job detects new upstream llama.cpp releases and opens a bump PR.
- Bump PRs are NEVER auto-merged. They require human review and approval before landing.
- A bump PR pins the exact new build identifier and its sha256, and re-runs the existing fetch/verify/prune/re-sign flow. Provenance stays pinned: never float to "latest" at build time.
- The PR makes the change reviewable: which build, what changed upstream, and the regression-gate result.
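To make the "detect a new release, pin it exactly" requirement concrete, here is a minimal sketch of the comparison step only. It is illustrative, not a chosen design: the `EnginePin` shape, `parseBuildNumber`, `needsBump`, and the `minGap` knob are all hypothetical names, and it assumes upstream build tags keep the `b<number>` form seen in the current pin (`b9590`).

```typescript
// Hypothetical shape of the pin; the real source of truth is
// scripts/ensure-llama-server.ts, whose actual fields may differ.
interface EnginePin {
  build: string;  // exact upstream build tag, e.g. "b9590"
  sha256: string; // lowercase hex digest of the release artifact
}

// Parse "b9590" into 9590. Anything that is not b<digits> yields null,
// so a floating value such as "latest" can never be treated as a pin.
function parseBuildNumber(tag: string): number | null {
  const match = /^b(\d+)$/.exec(tag.trim());
  return match ? Number(match[1]) : null;
}

// True only when the candidate is at least `minGap` builds ahead of the pin.
// `minGap` is an assumed knob for damping noisy, near-identical bumps.
function needsBump(pinned: EnginePin, candidateTag: string, minGap = 1): boolean {
  const current = parseBuildNumber(pinned.build);
  const candidate = parseBuildNumber(candidateTag);
  if (current === null || candidate === null) return false;
  return candidate - current >= minGap;
}
```

A scheduled job could call `needsBump` with the newest upstream tag and open a PR only when it returns `true`; the PR itself would still need human review and the regression gate before merging.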

Engine regression CI gate:

- A new CI job loads at least one real GGUF model through the actual bundled sidecar and runs real inference end to end (not just `cargo build`).
- It asserts the engine starts, the model loads, inference produces correct/sane output, and there is no functional or meaningful performance regression versus the current pin.
- It exercises the real `ensure-llama-server.ts` fetch/verify/prune/re-sign path so a changed dylib closure or signing breakage is caught.
- This gate runs on EVERY PR (so normal changes cannot silently break the engine path), and is the mandatory hard gate on engine-bump PRs specifically.
- A failing gate blocks the merge.
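One way to picture the gate's pass/fail decision is sketched below. It is an assumption-laden illustration rather than a proposal: `GateRun`, `evaluateGate`, the substring check for "sane output", and the 20% slowdown tolerance are all hypothetical, and the implementer's research may land on different assertions entirely.

```typescript
// Hypothetical record of one gate run against a given engine build.
interface GateRun {
  engineStarted: boolean;
  modelLoaded: boolean;
  output: string;            // completion text for a fixed prompt
  tokensPerSecond: number[]; // one throughput sample per repeated run
}

interface GateResult {
  passed: boolean;
  failures: string[];
}

// Median is less sensitive to a single slow run than the mean.
function median(samples: number[]): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compare a candidate build against the current pin's baseline run.
// `maxSlowdown` of 0.2 is an assumed tolerance, not a measured threshold.
function evaluateGate(
  baseline: GateRun,
  candidate: GateRun,
  expectedSubstring: string,
  maxSlowdown = 0.2,
): GateResult {
  const failures: string[] = [];
  if (!candidate.engineStarted) failures.push("engine did not start");
  if (!candidate.modelLoaded) failures.push("model did not load");
  if (!candidate.output.includes(expectedSubstring)) {
    failures.push("output missing expected text");
  }
  const base = median(baseline.tokensPerSecond);
  const cand = median(candidate.tokensPerSecond);
  // Written as a negated >= so that NaN (no samples) counts as a failure.
  if (!(cand >= base * (1 - maxSlowdown))) {
    failures.push("throughput regressed beyond tolerance");
  }
  return { passed: failures.length === 0, failures };
}
```

A CI job would exit non-zero when `passed` is `false`, which is what makes the gate blocking; the `failures` list could be surfaced in the bump PR so the regression-gate result is reviewable.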

## Explicitly open for the implementer (do your own research)

This ticket lists requirements, not a design. Whoever takes it must research, investigate, and evaluate the options before implementing, then bring a short proposal for review. Do not assume an approach from this ticket. Questions to evaluate and justify:

- Bump-automation mechanism and cadence.
- How to detect and represent "a new release," and how to avoid noisy rebuild-only bumps.
- Which model(s) and prompt(s) the regression gate uses, how it asserts "correct output" deterministically, and how it measures performance regression without flakiness.
- How the gate stays fast enough to run on every PR (caching, a small model, runner sizing).
- Where the macOS-only signing/runner constraints push the design.

## Constraints to honor

- macOS-only sidecar; ad-hoc signed (`signingIdentity: "-"`, `hardenedRuntime: false`); the existing `ensure-llama-server.ts` is the source of truth for fetch/verify/prune/sign.
- Keep pinned-revision provenance (hash = integrity, pin = provenance).
- Repo conventions: 100% coverage gates, signoff commits, no auto-merge for behavior-changing binaries.
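The "hash = integrity" half of the provenance constraint can be illustrated with a short check using Node's built-in `node:crypto` module. This is a sketch only: `sha256Hex` and `verifyPinnedArtifact` are hypothetical helper names, and the real verification logic in `ensure-llama-server.ts` remains the source of truth and may be structured differently.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of an in-memory artifact.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Throw unless the artifact's digest matches the pinned value.
// The pinned digest is compared case-insensitively.
function verifyPinnedArtifact(data: Uint8Array | string, pinnedSha256: string): void {
  const actual = sha256Hex(data);
  if (actual !== pinnedSha256.toLowerCase()) {
    throw new Error(`sha256 mismatch: expected ${pinnedSha256}, got ${actual}`);
  }
}
```

The pinned build identifier says which artifact was intended (provenance), while the digest check confirms the downloaded bytes are that artifact (integrity); a bump PR would change both values together.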

## Out of scope

- The in-app user-facing handling of an unsupported model (clearer error + switch-model nudge) is a separate change.
- Decoupled user-downloadable engine runtimes (LM Studio style) are not requested here; the existing Ollama / OpenAI-compatible provider path is the escape hatch for models the bundled engine cannot run.

## Acceptance criteria

- [ ] New upstream llama.cpp releases produce a review-ready bump PR automatically; nothing auto-merges.
- [ ] Bump PRs pin exact build + sha256 and pass the existing fetch/verify/prune/re-sign flow.
- [ ] An engine regression CI gate loads a real model and runs real inference, asserting correctness and no regression, and exercises the signing/prune path.
- [ ] The regression gate runs on every PR and is required (blocking) on engine-bump PRs.
- [ ] The chosen approach is documented with the trade-offs that were evaluated.

## Note

Out of Phase 3 scope; separate infrastructure workstream.

