fix(moe-shards): guard empty q4 dense FFN + wire --metal (closes #151) #152

Merged
chrishayuk merged 1 commit into chrishayuk:main from deem0n:fix/moe-shards-pure-moe-and-metal
May 27, 2026

@deem0n deem0n commented May 27, 2026

Summary

Two fixes for larql run --moe-shards on pure-MoE models (e.g. Gemma-4 26B A4B), reported in #151:

  1. Empty interleaved_q4k.bin panic guard. resolve_ffn_weights panicked with range end index 3345408 out of range for slice of length 0 when the dense FFN mmap was 0 bytes. Pure-MoE vindexes ship no dense FFN tensor, so the fallback branch sliced into an empty mmap. Now guarded with q4_ffn_mmap.is_empty(): returns empty QuantWeight stubs — patch_pipeline_layers_for_remote_moe overwrites them downstream and moe_fn supersedes the dense FFN path during decode, so the stubs are never read.

  2. --metal not wired into run_with_moe_shards. It always used default_backend() instead of honouring the CLI flag. The fix mirrors the pattern that PR #122 ("fix: wire --metal into remote FFN path, add post-FFN norms, flush stdout (cherry-pick of #115)") already applied to run_with_remote_ffn: explicit metal_backend() with CPU fallback on init failure, and a clear error when the gpu feature isn't compiled in.

Both fixes were needed to get Gemma-4 26B A4B running with --moe-shards.
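
For readers without the diff open, the guard in fix 1 can be sketched roughly as follows. This is a minimal illustration, not the code in pipeline_layer.rs: `QuantWeightSketch`, `resolve_ffn_sketch`, and the three byte ranges are hypothetical stand-ins, since the real `QuantWeight` fields and the `resolve_ffn_weights` signature are not shown in this PR.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the real `QuantWeight`; only a byte view is modelled.
#[derive(Debug, PartialEq)]
struct QuantWeightSketch<'a> {
    bytes: &'a [u8],
}

/// Hypothetical simplification of `resolve_ffn_weights`: returns the
/// gate / up / down views into the dense FFN mmap for one layer.
fn resolve_ffn_sketch<'a>(
    q4_ffn_mmap: &'a [u8],
    gate: Range<usize>,
    up: Range<usize>,
    down: Range<usize>,
) -> [QuantWeightSketch<'a>; 3] {
    // Pure-MoE vindex: the dense FFN file is 0 bytes, so any non-empty
    // range would panic with "range end index ... out of range for slice
    // of length 0". Return empty stubs instead of slicing.
    if q4_ffn_mmap.is_empty() {
        return [
            QuantWeightSketch { bytes: &[] },
            QuantWeightSketch { bytes: &[] },
            QuantWeightSketch { bytes: &[] },
        ];
    }
    // Dense model: slice the mmap as before.
    [
        QuantWeightSketch { bytes: &q4_ffn_mmap[gate] },
        QuantWeightSketch { bytes: &q4_ffn_mmap[up] },
        QuantWeightSketch { bytes: &q4_ffn_mmap[down] },
    ]
}
```

Calling `resolve_ffn_sketch(&[], 0..3_345_408, 0..8, 0..8)` returns three empty stubs rather than panicking, which is the behaviour the new regression test is described as covering.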

Files changed

  • crates/larql-compute/src/pipeline_layer.rs: is_empty() guard + regression test
  • crates/larql-cli/src/commands/primary/run_cmd.rs: new metal: bool param on run_with_moe_shards, mirroring run_with_remote_ffn's backend init, threaded from args.metal at the call site
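
The backend selection described for run_cmd.rs can be pictured with the sketch below. It is an assumption-laden illustration only: `BackendSketch`, `select_backend_sketch`, the `gpu_feature_compiled` parameter, and the `init_metal` closure are all hypothetical. The real code presumably gates on the `gpu` Cargo feature at compile time and calls `metal_backend()` directly, and the return type of `metal_backend()` is not shown in this PR.

```rust
/// Hypothetical stand-in for the compute backend handle.
#[derive(Debug, PartialEq)]
enum BackendSketch {
    Default, // whatever `default_backend()` would pick
    Metal,
    Cpu,
}

/// Hypothetical model of the selection logic threaded from `args.metal`.
/// `gpu_feature_compiled` stands in for a compile-time feature check, and
/// `init_metal` stands in for `metal_backend()` (None = init failure).
fn select_backend_sketch(
    metal_flag: bool,
    gpu_feature_compiled: bool,
    init_metal: impl FnOnce() -> Option<BackendSketch>,
) -> Result<BackendSketch, String> {
    // No --metal: keep the previous behaviour.
    if !metal_flag {
        return Ok(BackendSketch::Default);
    }
    // --metal requested but the binary was built without GPU support:
    // fail loudly instead of silently running elsewhere.
    if !gpu_feature_compiled {
        return Err("--metal requires a build with the `gpu` feature".to_string());
    }
    // --metal requested and available: try Metal, fall back to CPU on failure.
    Ok(init_metal().unwrap_or_else(|| {
        eprintln!("Metal init failed; falling back to CPU");
        BackendSketch::Cpu
    }))
}
```

Passing the initializer as a closure keeps the sketch runnable on machines without Metal; it is not meant to suggest the real function takes one.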

Net diff: 2 files, +60 / -1.

Test plan

Local verification on macOS, branched off current chrishayuk/larql:main (post-#145):

  • cargo check -p larql-compute -p larql-cli --lib --tests — clean
  • cargo clippy -p larql-compute -p larql-cli --lib --tests --no-deps -- -D warnings — clean
  • cargo fmt -p larql-compute -p larql-cli -- --check — clean
  • cargo test -p larql-compute --lib: 657 passed (includes new resolve_ffn_weights_returns_empty_stubs_when_q4_ffn_mmap_is_empty regression)
  • cargo test -p larql-cli --tests — 3 passed (1 ignored, model-heavy)

CI verification (all green — 15/15)

  • test - ubuntu-latest (inference + cli) — pass (1m27s, 8m17s)
  • test - macos-14 (inference + cli) — pass (5m18s, 2m49s)
  • test - windows-latest (inference + cli) — pass (12m23s, 12m26s)
  • test · ubuntu-latest (compute) — pass (3m20s)
  • test · macos-14 (compute) — pass (6m28s, 1m1s)
  • test · windows-latest (compute) — pass (10m50s)
  • coverage - ubuntu × 2 — pass (6m46s, 6m29s)
  • coverage · ubuntu — pass (51s)
  • bench — pass (9m27s)
  • verify — pass (3m48s)

🤖 Generated with Claude Code

…shayuk#151)

Two fixes for `larql run --moe-shards` on pure-MoE models (e.g.
Gemma-4 26B A4B), reported in chrishayuk#151:

1. `resolve_ffn_weights` panicked with `range end index 3345408 out of
   range for slice of length 0` when `interleaved_q4k.bin` is 0 bytes.
   Pure-MoE vindexes ship no dense FFN tensor, so the fallback branch
   sliced into an empty mmap. Guard with `q4_ffn_mmap.is_empty()` and
   return empty `QuantWeight` stubs — `patch_pipeline_layers_for_remote_moe`
   overwrites them downstream and `moe_fn` supersedes the dense FFN
   path during decode, so the stubs are never read.

2. `--metal` was not wired into `run_with_moe_shards`; it always used
   `default_backend()` instead of letting the CLI flag select Metal.
   Mirror the same pattern PR chrishayuk#122 applied to `run_with_remote_ffn`:
   explicit `metal_backend()` with CPU fallback on init failure, and
   a clear error when the `gpu` feature isn't compiled in.

Both were needed to get Gemma-4 26B A4B running with `--moe-shards`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chrishayuk chrishayuk merged commit 270269c into chrishayuk:main May 27, 2026
15 checks passed