fix(moe-shards): guard empty q4 dense FFN + wire --metal (closes #151) #152
Merged
chrishayuk merged 1 commit on May 27, 2026
…shayuk#151)

Two fixes for `larql run --moe-shards` on pure-MoE models (e.g. Gemma-4 26B A4B), reported in chrishayuk#151:

1. `resolve_ffn_weights` panicked with `range end index 3345408 out of range for slice of length 0` when `interleaved_q4k.bin` is 0 bytes. Pure-MoE vindexes ship no dense FFN tensor, so the fallback branch sliced into an empty mmap. Guard with `q4_ffn_mmap.is_empty()` and return empty `QuantWeight` stubs; `patch_pipeline_layers_for_remote_moe` overwrites them downstream and `moe_fn` supersedes the dense FFN path during decode, so the stubs are never read.
2. `--metal` was not wired into `run_with_moe_shards`; it always used `default_backend()` instead of letting the CLI flag select Metal. Mirror the same pattern PR chrishayuk#122 applied to `run_with_remote_ffn`: explicit `metal_backend()` with CPU fallback on init failure, and a clear error when the `gpu` feature isn't compiled in.

Both were needed to get Gemma-4 26B A4B running with `--moe-shards`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
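For readers who have not opened the diff, the empty-mmap guard from fix 1 might look roughly like the sketch below. It is an illustration only: the `QuantWeight` struct here is a hypothetical stand-in with a single field, and the function signature, the `layer` and `bytes_per_matrix` parameters, and the three-matrices-per-layer layout are assumptions, not the crate's actual API.

```rust
/// Hypothetical stand-in for the crate's `QuantWeight`; the real type's
/// fields are not shown in this PR.
#[derive(Debug, Default, PartialEq)]
struct QuantWeight<'a> {
    data: &'a [u8],
}

/// Simplified, assumed signature: returns three FFN matrices for one layer
/// from an interleaved Q4 buffer.
fn resolve_ffn_weights<'a>(
    q4_ffn_mmap: &'a [u8],
    layer: usize,
    bytes_per_matrix: usize,
) -> [QuantWeight<'a>; 3] {
    // Guard: pure-MoE vindexes ship a 0-byte dense FFN file, so any slice
    // below would panic. Return empty stubs instead of indexing.
    if q4_ffn_mmap.is_empty() {
        return [
            QuantWeight::default(),
            QuantWeight::default(),
            QuantWeight::default(),
        ];
    }

    // Fallback branch: slice this layer's three matrices out of the buffer
    // (assumed layout: three equally sized matrices per layer, back to back).
    let base = layer * 3 * bytes_per_matrix;
    let at = |i: usize| base + i * bytes_per_matrix;
    [
        QuantWeight { data: &q4_ffn_mmap[at(0)..at(1)] },
        QuantWeight { data: &q4_ffn_mmap[at(1)..at(2)] },
        QuantWeight { data: &q4_ffn_mmap[at(2)..at(3)] },
    ]
}
```

Returning stubs rather than an `Option` or an error keeps the caller's shape unchanged, which is only safe because, per the commit message, the stubs are overwritten downstream and never read during decode.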
Summary

Two fixes for `larql run --moe-shards` on pure-MoE models (e.g. Gemma-4 26B A4B), reported in #151:

1. Empty `interleaved_q4k.bin` panic guard. `resolve_ffn_weights` panicked with `range end index 3345408 out of range for slice of length 0` when the dense FFN mmap was 0 bytes. Pure-MoE vindexes ship no dense FFN tensor, so the fallback branch sliced into an empty mmap. Now guarded with `q4_ffn_mmap.is_empty()`: returns empty `QuantWeight` stubs. `patch_pipeline_layers_for_remote_moe` overwrites them downstream and `moe_fn` supersedes the dense FFN path during decode, so the stubs are never read.
2. `--metal` not wired into `run_with_moe_shards`. It always used `default_backend()` instead of honouring the CLI flag. The fix mirrors the pattern that PR #122 ("fix: wire --metal into remote FFN path, add post-FFN norms, flush stdout (cherry-pick of #115)") already applied to `run_with_remote_ffn`: explicit `metal_backend()` with CPU fallback on init failure, and a clear error when the `gpu` feature isn't compiled in.

Both fixes were needed to get Gemma-4 26B A4B running with `--moe-shards`.

Files changed

- `crates/larql-compute/src/pipeline_layer.rs`: `is_empty()` guard + regression test
- `crates/larql-cli/src/commands/primary/run_cmd.rs`: new `metal: bool` param on `run_with_moe_shards`, mirroring `run_with_remote_ffn`'s backend init, threaded from `args.metal` at the call site

Net diff: 2 files, +60 / -1.
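The backend selection that the new `metal: bool` parameter drives can be sketched as below. This is a hedged illustration rather than the code in `run_cmd.rs`: the `Backend` enum, the `select_backend` function, the `gpu_compiled` flag and the `init_metal` closure are all hypothetical names. The real code presumably gates on the `gpu` Cargo feature at compile time and calls `metal_backend()` directly.

```rust
/// Hypothetical backend handle; the real crate's backend type is not shown
/// in this PR.
#[derive(Debug, PartialEq)]
enum Backend {
    /// Whatever `default_backend()` picks in the real crate (assumption).
    Default,
    Cpu,
    Metal,
}

/// Stand-in for the crate's `default_backend()`.
fn default_backend() -> Backend {
    Backend::Default
}

/// Assumed shape of the selection logic:
/// - `metal`: value of the `--metal` CLI flag
/// - `gpu_compiled`: models whether the `gpu` feature was compiled in
///   (real code would likely use `#[cfg(feature = "gpu")]`)
/// - `init_metal`: models `metal_backend()`, returning `None` on init failure
fn select_backend(
    metal: bool,
    gpu_compiled: bool,
    init_metal: impl FnOnce() -> Option<Backend>,
) -> Result<Backend, String> {
    if !metal {
        // Flag not set: keep the previous behaviour.
        return Ok(default_backend());
    }
    if !gpu_compiled {
        // Flag set but the build cannot honour it: fail with a clear error.
        return Err("--metal requires a build with the `gpu` feature".to_string());
    }
    match init_metal() {
        Some(backend) => Ok(backend),
        None => {
            // Metal was requested but failed to initialise: fall back to CPU.
            eprintln!("Metal init failed; falling back to CPU");
            Ok(Backend::Cpu)
        }
    }
}
```

The two failure cases are treated differently on purpose: a missing `gpu` feature is a build-time mismatch the user can fix, so it surfaces as an error, while a runtime init failure degrades to CPU so the run can still proceed.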
Test plan

Local verification on macOS, branched off current `chrishayuk/larql:main` (post-#145):

- `cargo check -p larql-compute -p larql-cli --lib --tests`: clean
- `cargo clippy -p larql-compute -p larql-cli --lib --tests --no-deps -- -D warnings`: clean
- `cargo fmt -p larql-compute -p larql-cli -- --check`: clean
- `cargo test -p larql-compute --lib`: 657 passed (includes new `resolve_ffn_weights_returns_empty_stubs_when_q4_ffn_mmap_is_empty` regression)
- `cargo test -p larql-cli --tests`: 3 passed (1 ignored, model-heavy)

CI verification (all green, 15/15)

- `test - ubuntu-latest` (inference + cli): pass (1m27s, 8m17s)
- `test - macos-14` (inference + cli): pass (5m18s, 2m49s)
- `test - windows-latest` (inference + cli): pass (12m23s, 12m26s)
- `test · ubuntu-latest` (compute): pass (3m20s)
- `test · macos-14` (compute): pass (6m28s, 1m1s)
- `test · windows-latest` (compute): pass (10m50s)
- `coverage - ubuntu` × 2: pass (6m46s, 6m29s)
- `coverage · ubuntu`: pass (51s)
- `bench`: pass (9m27s)
- `verify`: pass (3m48s)

🤖 Generated with Claude Code