fix(llamacpp): pin mainline build by digest, retire dead TurboQuant fork by AlienWalker1995 · Pull Request #54 · AlienWalker1995/Ordo-AI-Stack

AlienWalker1995 · 2026-06-23T13:37:15Z

Problem (config drift)

The llamacpp image was tagged as official server-cuda but actually ran a TurboQuant fork build (0a8062d, ~April) — three disagreeing sources of truth: the .env override, the rolling tag, and llamacpp/Dockerfile still building the fork. That old build predates upstream MTP support (PR #22673), so Qwen3.6 MTP GGUFs failed to load (missing tensor blk.N.ssm_conv1d.weight).

The previous workaround was to strip the MTP head off every model — a per-artifact bandaid. The real fix is updating the engine.

Change

- Pin both llama.cpp services by DIGEST to the current mainline build, ghcr.io/ggml-org/llama.cpp@sha256:44cd0833… (build 9765 / 73618f27a), the first build that loads Qwen3.6 MTP + qwen35moe natively. No rolling tags.
- Remove the dead build: stanza and llamacpp/Dockerfile (the unused fork), leaving one source of truth.
- Reconcile .env.example.
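
The "no rolling tags" rule can be checked mechanically. The following is a minimal sketch, not part of this PR: it assumes the image references have already been extracted from the compose file into a dict, and the second service name and the all-zero digest are placeholders made up for illustration.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned by a full sha256 digest."""
    return _DIGEST_RE.search(image_ref.strip()) is not None


def unpinned_services(images: dict[str, str]) -> list[str]:
    """Return the sorted names of services whose image is not digest-pinned."""
    return sorted(name for name, ref in images.items() if not is_digest_pinned(ref))


# Placeholder digest (64 hex characters). This is NOT the real digest from this PR.
FAKE_DIGEST = "sha256:" + "0" * 64

example = {
    "llamacpp": f"ghcr.io/ggml-org/llama.cpp@{FAKE_DIGEST}",
    # Hypothetical second service, still on a rolling tag.
    "llamacpp-embed": "ghcr.io/ggml-org/llama.cpp:server-cuda",
}

print(unpinned_services(example))  # ['llamacpp-embed']
```

A truncated digest is deliberately treated as unpinned, so a copy-paste accident fails the check instead of passing silently.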

Validation (in production, before commit)

Recreated llamacpp on build 9765 via the blessed two-env-file mechanism. The A3B (qwen35moe) loads on the GPU at full 512K with the qwen35moe.context_length override + YaRN 2× + mmproj, and serves through model-gateway — system_fingerprint: b9765-73618f27a, RTX 5090 at 28.8/32.6 GB. The dense Qwen3.6-27B MTP GGUF also loads natively on this build (CPU-isolated test) — the per-model strip is retired. docker compose config parses clean with both services on the digest, no build stanza.

Follow-ups (separate, not bundled)

- Swap local-chat → dense Qwen3.6-27B (native 256K, --spec-type draft-mtp).
- Extend stack_monitor to track the pinned llama.cpp build number (currently ROLLING).
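
One way the stack_monitor follow-up might work is to compare the build number in the reported system_fingerprint against the pinned one. The sketch below is an assumption, not the stack_monitor implementation: the `b<build>-<commit>` format is inferred only from the single value quoted in the validation notes, and the function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class BuildInfo(NamedTuple):
    build: int   # llama.cpp build number, e.g. 9765
    commit: str  # short commit hash, e.g. "73618f27a"


# Assumed format: "b" + decimal build number + "-" + lowercase hex commit prefix.
_FINGERPRINT_RE = re.compile(r"^b(\d+)-([0-9a-f]+)$")


def parse_fingerprint(fingerprint: str) -> Optional[BuildInfo]:
    """Parse a fingerprint like 'b9765-73618f27a'; return None if it does not match."""
    match = _FINGERPRINT_RE.match(fingerprint.strip())
    if match is None:
        return None
    return BuildInfo(build=int(match.group(1)), commit=match.group(2))


def matches_pin(fingerprint: str, pinned_build: int) -> bool:
    """Return True only if the fingerprint parses and reports the pinned build."""
    info = parse_fingerprint(fingerprint)
    return info is not None and info.build == pinned_build


print(parse_fingerprint("b9765-73618f27a"))
print(matches_pin("b9765-73618f27a", pinned_build=9765))
```

An unparseable fingerprint counts as a mismatch, so a change in the reported format would surface as an alert rather than a silent pass.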

🤖 Generated with Claude Code

The llamacpp image was tagged as the official `server-cuda` but actually ran a TurboQuant fork build (`0a8062d`, ~April), with three disagreeing sources of truth: the `.env` override, the rolling tag, and `llamacpp/Dockerfile` still building the fork. That old build predates upstream MTP support (PR #22673), so Qwen3.6 MTP GGUFs failed to load (`missing tensor blk.N.ssm_conv1d.weight`).

- Pin both llama.cpp services to the current mainline image BY DIGEST (`ghcr.io/ggml-org/llama.cpp@sha256:44cd0833…`, build 9765 / 73618f27a), the first build that loads Qwen3.6 MTP + qwen35moe natively.
- Remove the dead `build:` stanza and `llamacpp/Dockerfile` (the unused fork).
- Reconcile `.env.example` to the single pinned source of truth.

Validated end-to-end: recreated llamacpp on build 9765; the A3B (qwen35moe) loads on GPU at full 512K with the `qwen35moe.context_length` override + YaRN + mmproj and serves via model-gateway (system_fingerprint `b9765-73618f27a`). The dense Qwen3.6-27B MTP GGUF also loads natively on this build, with no per-model tensor stripping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The fork is gone (mainline pinned by digest), so .env.example must not advertise tbq*/tbqp* KV cache types: they don't exist on mainline, and an operator setting one would crash llama-server. Replace with the mainline KV types (q8_0 default for the hybrid A3B) and drop the build-specific 0a8062d reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
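
A guard against fork-only KV cache types creeping back into the example file could look like the sketch below. It is illustrative only: the variable names and the `tbq_demo`/`tbqp_demo` values are made up, since the real keys and type names are not shown in this PR. The scan includes comment lines, because advertised options often live there.

```python
import re

# Any token starting with "tbq" (which also covers "tbqp...") is treated as fork-only.
_FORK_KV_RE = re.compile(r"\btbq\w*")


def find_fork_kv_types(env_text: str) -> list[tuple[int, str]]:
    """Return (line_number, token) for every tbq* token, comment lines included."""
    hits = []
    for lineno, line in enumerate(env_text.splitlines(), start=1):
        for token in _FORK_KV_RE.findall(line):
            hits.append((lineno, token))
    return hits


# Hypothetical .env.example content; keys and tbq* values are placeholders.
sample = """\
# Example settings for the llama.cpp services
KV_CACHE_TYPE_K=q8_0
KV_CACHE_TYPE_V=tbq_demo
# other options: tbqp_demo
"""

print(find_fork_kv_types(sample))  # [(3, 'tbq_demo'), (4, 'tbqp_demo')]
```

An empty result means the file no longer mentions any fork-only type, which is the state this commit aims for.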

Hermes Bot and others added 2 commits June 23, 2026 09:36

AlienWalker1995 merged commit e025dd8 into main Jun 23, 2026
5 checks passed

AlienWalker1995 deleted the fix/llamacpp-pin-mainline-digest branch June 23, 2026 13:48
