Keep the bundled llama.cpp engine current: scheduled review-gated bump PRs + engine regression CI gate #238

@quiet-node

Context

Thuki bundles a pinned llama-server build (currently llama.cpp b9590), fetched, verified, pruned, and re-signed by scripts/ensure-llama-server.ts. The pin is bumped only by hand. As llama.cpp adds support for new model architectures, a stale pin means newly released models fail to load on the built-in engine, and the failure today is opaque.

Motivating case: a user installed a model whose architecture postdates the pinned build. llama-server exits with an "unknown model architecture" error, but the user sees only a generic engine-start failure.

We want the bundled engine to stay current automatically, but safely. A newer engine build can also break a previously-working model, change output, regress performance, or change the dylib closure in a way that breaks our prune-and-sign step. So "stay current" must be paired with "prove no regression."

Goal

Two coupled deliverables:

  1. Automated, scheduled, review-gated bumping of the pinned llama.cpp engine.
  2. An engine regression CI gate that proves a given engine build loads real models and runs real inference correctly, runs on every PR, and acts as the hard gate on bump PRs.

Requirements (the asks)

Automation / bump PRs:

  • A scheduled job detects new upstream llama.cpp releases and opens a bump PR.
  • Bump PRs are NEVER auto-merged. They require human review and approval before landing.
  • A bump PR pins the exact new build identifier and its sha256, and re-runs the existing fetch/verify/prune/re-sign flow. Provenance stays pinned: never float to "latest" at build time.
  • The PR makes the change reviewable: which build, what changed upstream, and the regression-gate result.
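To make the "exact build + sha256" requirement concrete, here is a minimal sketch of a pin record and its integrity check. The `EnginePin` shape and the function names are hypothetical, chosen for illustration; they are not taken from scripts/ensure-llama-server.ts, which remains the source of truth for how the pin is actually stored and verified.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pin record: the real script may represent this differently.
interface EnginePin {
  build: string; // exact upstream build tag, e.g. "b9590" (provenance)
  sha256: string; // hex digest of the downloaded archive (integrity)
}

// Hex-encoded SHA-256 of the given bytes.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Throws if the downloaded bytes do not match the pinned digest.
function verifyAgainstPin(bytes: Uint8Array, pin: EnginePin): void {
  const actual = sha256Hex(bytes);
  if (actual !== pin.sha256.toLowerCase()) {
    throw new Error(
      `sha256 mismatch for ${pin.build}: expected ${pin.sha256}, got ${actual}`,
    );
  }
}
```

Under this sketch a bump PR would change both fields together, so a reviewer sees the new build tag and its digest in one diff, and nothing resolves "latest" at build time.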

Engine regression CI gate:

  • A new CI job loads at least one real GGUF model through the actual bundled sidecar and runs real inference end to end (not just cargo build).
  • It asserts the engine starts, the model loads, inference produces correct/sane output, and there is no functional or meaningful performance regression versus the current pin.
  • It exercises the real ensure-llama-server.ts fetch/verify/prune/re-sign path so a changed dylib closure or signing breakage is caught.
  • This gate runs on EVERY PR (so normal changes cannot silently break the engine path), and is the mandatory hard gate on engine-bump PRs specifically.
  • A failing gate blocks the merge.
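As one possible illustration of what the gate asserts, the sketch below compares a candidate engine run against a baseline run from the current pin. Everything here is an assumption for discussion: the `GateRun` fields, the substring-based correctness check, and the `maxSlowdown` tolerance are hypothetical, and the implementer is still expected to evaluate how to assert correct output and measure performance without flakiness.

```typescript
// Hypothetical summary of one engine run (start, load, one inference).
interface GateRun {
  engineStarted: boolean;
  modelLoaded: boolean;
  output: string; // generated text, ideally from a deterministic decode
  tokensPerSecond: number;
}

// Hypothetical gate settings.
interface GateOptions {
  expectedSubstring: string; // text the output must contain
  maxSlowdown: number; // 0.2 tolerates a candidate up to 20% slower
}

// Returns the list of failures; an empty list means the gate passes.
function evaluateGate(
  baseline: GateRun,
  candidate: GateRun,
  opts: GateOptions,
): string[] {
  const failures: string[] = [];
  if (!candidate.engineStarted) failures.push("engine did not start");
  if (!candidate.modelLoaded) failures.push("model did not load");
  if (!candidate.output.includes(opts.expectedSubstring)) {
    failures.push(`output is missing "${opts.expectedSubstring}"`);
  }
  const floor = baseline.tokensPerSecond * (1 - opts.maxSlowdown);
  if (candidate.tokensPerSecond < floor) {
    failures.push(
      `throughput ${candidate.tokensPerSecond} tok/s is below floor ${floor} tok/s`,
    );
  }
  return failures;
}
```

A CI step could exit non-zero whenever the returned list is non-empty, which is what makes the gate blocking. A relative tolerance is used rather than an absolute threshold because runner hardware varies, though that choice is itself one of the open questions.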

Explicitly open for the implementer (do your own research)

This ticket lists requirements, not a design. Whoever takes it must research, investigate, and evaluate the options before implementing, then bring a short proposal for review. Do not assume an approach from this ticket. Questions to evaluate and justify:

  • Bump-automation mechanism and cadence.
  • How to detect and represent "a new release," and how to avoid noisy rebuild-only bumps.
  • Which model(s) and prompt(s) the regression gate uses, how it asserts "correct output" deterministically, and how it measures performance regression without flakiness.
  • How the gate stays fast enough to run on every PR (caching, a small model, runner sizing).
  • Where the macOS-only signing/runner constraints push the design.
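For the "detect a new release" question, one option to evaluate (not a recommendation) is comparing numeric build tags of the form `b9590`. The sketch below assumes the list of upstream tags has already been fetched by some other means; `pickBumpTarget` and the `minGap` threshold are hypothetical, and `minGap` is only a crude stand-in for filtering noisy bumps, since it says nothing about what actually changed upstream.

```typescript
// Parses a tag such as "b9590" into 9590; returns null for other shapes.
function parseBuildNumber(tag: string): number | null {
  const match = /^b(\d+)$/.exec(tag);
  return match ? Number(match[1]) : null;
}

// Returns the newest upstream tag that is at least `minGap` builds ahead
// of the pinned tag, or null when no tag qualifies.
function pickBumpTarget(
  pinnedTag: string,
  upstreamTags: string[],
  minGap: number,
): string | null {
  const pinned = parseBuildNumber(pinnedTag);
  if (pinned === null) {
    throw new Error(`unrecognized pinned tag: ${pinnedTag}`);
  }
  let bestTag: string | null = null;
  let bestNumber = -1;
  for (const tag of upstreamTags) {
    const n = parseBuildNumber(tag);
    if (n === null) continue; // skip tags that are not build tags
    if (n - pinned < minGap) continue; // too close to the pin, or older
    if (n > bestNumber) {
      bestNumber = n;
      bestTag = tag;
    }
  }
  return bestTag;
}
```

A scheduled job could call this with the current pin and open a bump PR only when the result is non-null; the cadence and the right value for the gap are left to the proposal.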

Constraints to honor

  • macOS-only sidecar; ad-hoc signed (signingIdentity: "-", hardenedRuntime: false); the existing ensure-llama-server.ts is the source of truth for fetch/verify/prune/sign.
  • Keep pinned-revision provenance (hash = integrity, pin = provenance).
  • Repo conventions: 100% coverage gates, signoff commits, no auto-merge for behavior-changing binaries.

Out of scope

  • The in-app user-facing handling of an unsupported model (clearer error + switch-model nudge) is a separate change.
  • Decoupled user-downloadable engine runtimes (LM Studio style) are not requested here; the existing Ollama / OpenAI-compatible provider path is the escape hatch for models the bundled engine cannot run.

Acceptance criteria

  • New upstream llama.cpp releases produce a review-ready bump PR automatically; nothing auto-merges.
  • Bump PRs pin exact build + sha256 and pass the existing fetch/verify/prune/re-sign flow.
  • An engine regression CI gate loads a real model and runs real inference, asserting correctness and no regression, and exercises the signing/prune path.
  • The regression gate runs on every PR and is required (blocking) on engine-bump PRs.
  • The chosen approach is documented with the trade-offs that were evaluated.

Note

Out of Phase 3 scope; separate infrastructure workstream.

Labels: enhancement, github_actions
