feat(gguf): consolidate PRs #135-139 with fixes + modular split by chrishayuk · Pull Request #145 · chrishayuk/larql

chrishayuk · 2026-05-25T13:09:35Z

Summary

Consolidates 5 GGUF PRs from @mvkorobkov into a single reviewed, tested, and modularised landing:

#135: MLA metadata for DeepSeek-V2/V3 + Kimi K2 (already merged). PR title: "feat(gguf): surface MLA metadata for DeepSeek-V2/V3 + Kimi K2 — closes #67"
#136: Multi-shard GGUF reader. PR title: "feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits"
#137: MLA capability gate for f32 absorption path. PR title: "fix(capabilities): accept MLA architectures when full geometry is exposed"
#138: MoE expert_feed_forward_length fallback. PR title: "fix(gguf): fall back to expert_feed_forward_length for MoE-only configs"
#139: Streaming GGUF extract at browse level. PR title: "feat(streaming): GGUF support in extract pipeline (browse-level)"

Review fixes applied

3-digit shard width detection bug in discover_shard_siblings
Q4K writer MLA guard (prevents silent corrupt vindex)
Deduplicated detect_gguf_entry (was copy-pasted across 2 files)
Redundant shard_idx assignment, stale doc comment, unused import

Module split

Split 3,221-line gguf.rs monolith into gguf/ directory:

| File | Lines | Contents |
| --- | --- | --- |
| `mod.rs` | 23 | Re-exports |
| `constants.rs` | 82 | GGUF magic, type IDs, key names |
| `types.rs` | 151 | GgufValue, GgufTensorInfo, ShardInfo, GgufFile |
| `reader.rs` | 370 | Binary read helpers |
| `parser.rs` | 772 | open/open_single, shard discovery |
| `orient.rs` | 649 | Tensor orientation/split |
| `loader.rs` | 1,244 | load_tensors, to_config_json, entry points |

Test coverage

93.5% on gguf module (543/581 lines). 27 new tests added covering: shard parsing edge cases, multi-shard open via non-first shard, MoE fallback, tensor count mismatch, bad magic/version, skip_key filtering, 1D/3D tensor handling, config JSON branch coverage, orient/split edge cases.

Test plan

`cargo fmt -- --check`: clean
`cargo clippy --workspace`: clean
`cargo test -p larql-models -p larql-vindex`: all pass
`cargo tarpaulin -p larql-models`: 93.5% on gguf module
`larql bench gemma3-4b-v2`: no performance regression
`larql show` / `list` / `run`: all functional

Closes #136, closes #137, closes #138, closes #139

llama.cpp's gguf-split produces multi-file GGUFs (canonical naming: `<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full metadata header but owns only its own slice of tensors. Until now `GgufFile::open` read a single file, so multi-shard models (Kimi K2.6 with 14 shards, DeepSeek-V4-Flash with 3 shards, and increasingly any large modern LLM) could not be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>` field on `GgufFile`. Single-file GGUFs have `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard input via the explicit `split.count` metadata key, falling back to the filename pattern when the splitter omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing filenames at the prefix's chosen width (`00001-of-00014` and `001-of-003` are both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging them with the right `shard_idx`. Cross-checks the total against `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and reads each tensor from `shards[info.shard_idx].path` at the right per-shard `data_offset`. Shards whose tensors are all skipped by `skip_key` are never opened.

Backward-compatible: existing `GgufFile::open` callers and the single-file test fixtures keep working with `shards = vec![…one…]`.
Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection, mismatched-width rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from a shard at any position, error when a sibling is missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with #96 (MLA absorption), #103 (Q3_K/Q5_K dequant), #133 (GGUF extract input), and #135 (DeepSeek-V2/V3 MLA metadata reading), this completes the chain: `larql extract --level inference` works end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard GGUFs.
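The filename handling described in this commit (parse `<prefix>-<NNNNN>-of-<NNNNN>.gguf`, then rebuild every sibling name at the same digit width) can be illustrated with a minimal standalone sketch. `ShardName`, `parse_shard_name`, and `sibling_names` are hypothetical names chosen for illustration; they are not the `parse_shard_filename` / `discover_shard_siblings` functions in this PR, whose signatures are not shown here.

```rust
/// Pieces of a `<prefix>-<NNNNN>-of-<NNNNN>.gguf` file name (illustrative type).
#[derive(Debug, PartialEq)]
struct ShardName {
    prefix: String,
    index: usize, // 1-based shard number
    count: usize, // total number of shards
    width: usize, // digit width used by the splitter (5 for 00001, 3 for 001)
}

/// Returns `None` for plain `.gguf` names and for mismatched digit widths.
fn parse_shard_name(file_name: &str) -> Option<ShardName> {
    let stem = file_name.strip_suffix(".gguf")?;
    // Split from the right so that prefixes containing '-' still parse.
    let (head, count_str) = stem.rsplit_once("-of-")?;
    let (prefix, index_str) = head.rsplit_once('-')?;
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(index_str) || !all_digits(count_str) {
        return None;
    }
    if index_str.len() != count_str.len() {
        return None; // e.g. 001-of-00003 is rejected
    }
    let index: usize = index_str.parse().ok()?;
    let count: usize = count_str.parse().ok()?;
    if index == 0 || index > count {
        return None;
    }
    Some(ShardName {
        prefix: prefix.to_string(),
        index,
        count,
        width: index_str.len(),
    })
}

/// Rebuilds every sibling file name at the width the parsed shard used.
fn sibling_names(name: &ShardName) -> Vec<String> {
    (1..=name.count)
        .map(|i| {
            format!(
                "{}-{:0w$}-of-{:0w$}.gguf",
                name.prefix,
                i,
                name.count,
                w = name.width
            )
        })
        .collect()
}
```

Carrying the detected width in the parsed value, rather than hard-coding five digits, is what lets a shard from a 3-digit split find its siblings.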

The streaming extract pipeline in `larql-vindex` needs per-tensor metadata access to look up the right shard / byte range / quant type for each tensor on demand (without bulk-loading a 500 GB+ MoE model into RAM). All the building blocks were already on `GgufFile.shards` and the free helpers `normalize_gguf_key` / `dequantize` / `tensor_data_size`; this commit only adds the read-only accessors on `GgufTensorInfo` so a consumer can:

    for info in &gguf.tensor_infos {
        let hf_key = normalize_gguf_key(info.name());
        if !want(&hf_key) { continue; }
        let shard = &gguf.shards[info.shard_idx()];
        // mmap shard, slice [shard.data_offset + info.offset()
        //   .. + tensor_data_size(info.tensor_type(), n_elements)?],
        // dequantize to f32, reshape to (dims[1], dims[0]).
    }

No behaviour change: purely additive accessors. Used in the follow-up streaming-GGUF work that lets `build_vindex_streaming` ingest GGUF inputs alongside safetensors.
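The byte-range arithmetic in the comment above (shard `data_offset` plus the tensor's own offset, extended by the tensor's encoded size) can be sketched on its own. `QuantKind`, `tensor_bytes`, and `tensor_range` are hypothetical stand-ins, not the workspace's `tensor_data_size`; only a handful of ggml block layouts are included, and the element count is assumed to be the product of the dims.

```rust
use std::ops::Range;

/// A small subset of ggml tensor types (illustrative, not the larql enum).
#[derive(Clone, Copy, Debug)]
enum QuantKind {
    F32,
    Bf16,
    Q8_0,
    Q4K,
    Q6K,
}

impl QuantKind {
    /// (elements per block, bytes per block) for each ggml layout.
    fn block_layout(self) -> (u64, u64) {
        match self {
            QuantKind::F32 => (1, 4),
            QuantKind::Bf16 => (1, 2),
            QuantKind::Q8_0 => (32, 34),  // f16 scale + 32 x i8
            QuantKind::Q4K => (256, 144), // 2 x f16 + 12 scale bytes + 128 nibble bytes
            QuantKind::Q6K => (256, 210), // 128 + 64 quant bytes + 16 scales + f16
        }
    }
}

/// Encoded size in bytes, or `None` if the count is not block-aligned.
fn tensor_bytes(kind: QuantKind, n_elements: u64) -> Option<u64> {
    let (per_block, bytes_per_block) = kind.block_layout();
    if n_elements % per_block != 0 {
        return None;
    }
    (n_elements / per_block).checked_mul(bytes_per_block)
}

/// Absolute byte range of one tensor inside its shard file.
fn tensor_range(
    data_offset: u64,
    tensor_offset: u64,
    kind: QuantKind,
    dims: &[u64],
) -> Option<Range<u64>> {
    let n_elements = dims.iter().try_fold(1u64, |acc, &d| acc.checked_mul(d))?;
    let start = data_offset.checked_add(tensor_offset)?;
    let end = start.checked_add(tensor_bytes(kind, n_elements)?)?;
    Some(start..end)
}
```

Checked arithmetic is used throughout because offsets and sizes come from file metadata, so an overflow should surface as a missing range rather than a wrapped one.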

Adds GGUF to the streaming-extract pipeline alongside safetensors. Until now, GGUF input was routed through the in-memory `load_model_dir_validated` path, which dequantises every tensor to f32 in RAM. That is fine for small models but architecturally unworkable for ≥70B GGUFs (Kimi K2.6 at 554 GB, DS-V4-Flash at 127 GB).

Design: a TensorSource enum.

    enum TensorSource {
        Safetensors { shards, index }, // existing
        Gguf(GgufTensorSource),        // new
    }

`StreamingContext::new` detects the input format (a `.gguf` file directly, or a directory whose first/largest `.gguf` shard is used as the entry point) and constructs the appropriate variant. Each stage now calls `self.tensor_source.get_tensor_f32(key)` for the canonical 2D dequant path. The MXFP4 raw-pair access (DeepSeek-V4 packed gate_up_proj_blocks / down_proj_blocks) stays safetensors-only; GGUF has no equivalent packed format.

GGUF specifics:

- Multi-shard splits are handled via `GgufFile::open` (added previously); each shard is mmap'd eagerly (virtual address space only; the OS pages in only what we touch).
- Per-tensor read does `data_offset + offset` into the right shard, slices `tensor_data_size` bytes, and dequantises via `larql_models::quant::ggml::dequantize` (Q4_K / Q5_K / Q6_K / Q8_0 / BF16 / F32 etc., all already supported by the workspace).
- The dim ordering convention matches `load_gguf`'s reshape to `(dims[1], dims[0])`. Canonical FFN orientation (the in-memory loader's `orient_in_place`) is applied here too, driven by `(hidden_size, intermediate_size)` from the detected architecture. Without it, `tensor.shape()[0]` would be `hidden` instead of `intermediate` for some quants and downstream matmul would produce NaN.
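The orientation step in the last bullet can be illustrated with a minimal sketch: if a 2D FFN tensor arrives as `(hidden, intermediate)`, transpose it so that the first dimension is `intermediate`. This is not the PR's `orient_in_place`; `Matrix`, `transpose`, and `orient_ffn` are hypothetical names, the data is assumed row-major, and the sketch covers only the case the commit message describes (which tensors need which orientation is not spelled out here).

```rust
/// Row-major 2D f32 tensor (illustrative stand-in for the real array type).
#[derive(Debug, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

/// Plain out-of-place transpose of a row-major matrix.
fn transpose(m: &Matrix) -> Matrix {
    let mut data = vec![0.0f32; m.data.len()];
    for r in 0..m.rows {
        for c in 0..m.cols {
            data[c * m.rows + r] = m.data[r * m.cols + c];
        }
    }
    Matrix {
        rows: m.cols,
        cols: m.rows,
        data,
    }
}

/// Ensures the first dimension is `intermediate` when the shape makes that
/// unambiguous. Square tensors (hidden == intermediate) are left untouched
/// because the shape alone cannot tell the two layouts apart.
fn orient_ffn(m: Matrix, hidden: usize, intermediate: usize) -> Matrix {
    if hidden != intermediate && m.rows == hidden && m.cols == intermediate {
        transpose(&m)
    } else {
        m
    }
}
```

A shape-driven check like this only works because the architecture's `(hidden_size, intermediate_size)` pair is known up front, which matches the commit's note that orientation is driven by the detected architecture.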
CLI routing:

- safetensors (any level) → streaming
- GGUF + browse + quant=none → streaming (new)
- GGUF + attention/inference/all → in-memory (unchanged)
- GGUF + any level with --quant q4k → in-memory (unchanged)

Inference / Q4K levels for GGUF still need the `StreamingWeights` writer subsystem (Q4_K + f32 attn/FFN writers) ported to read tensors via `ggml::dequantize` per tensor; that is the follow-on PR. The stage gate returns a clear `VindexError::Parse(...)` if the user requests an unsupported level/quant combo for GGUF input.

Validation:

- DS-R1-0528-Qwen3-8B-Q3_K_L (10 GB, dense, mixed Q3_K/Q4_K/Q6_K) → 3.4 GB gate_vectors.bin + 1.2 GB embeddings.bin written cleanly through the streaming path on ai-main.
- 1074 vindex unit tests pass.

Known gap (pre-existing, not introduced here): the streaming pipeline's MoE branch looks up per-expert 2D keys (`mlp.experts.K.gate_proj.weight`), which GGUF stores as 3D-packed tensors (`blk.L.ffn_gate_exps.weight`, `[hidden, intermediate, n_experts]`). Both the streaming pipeline and the in-memory `load_gguf` currently skip these 3D tensors. Unpacking them lives in the same follow-on PR as inference-level GGUF.
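The CLI routing rules listed in this commit reduce to a small decision table, sketched below as a single `match`. All type and function names here (`InputFormat`, `Level`, `Quant`, `Route`, `route`) are hypothetical and do not mirror the CLI's real types; the sketch also assumes safetensors streams regardless of the quant setting, which the commit message implies but does not state outright.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum InputFormat {
    Safetensors,
    Gguf,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Level {
    Browse,
    Attention,
    Inference,
    All,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Quant {
    None,
    Q4K,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Route {
    Streaming,
    InMemory,
}

/// Picks the extract path for a given input format, level, and quant.
fn route(format: InputFormat, level: Level, quant: Quant) -> Route {
    match (format, level, quant) {
        // safetensors always streams (assumed to hold for every quant).
        (InputFormat::Safetensors, _, _) => Route::Streaming,
        // The one new case: GGUF at browse level with no quantisation.
        (InputFormat::Gguf, Level::Browse, Quant::None) => Route::Streaming,
        // Every other GGUF combination stays on the in-memory loader.
        (InputFormat::Gguf, _, _) => Route::InMemory,
    }
}
```

Keeping the new streaming case as one explicit arm makes it easy to widen later, once inference-level and Q4K GGUF writers exist.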

Previously the streaming `down_meta` stage accumulated every layer's feature meta in memory and called `write_binary` exactly once, at the end of the projection loop. For a dense ≥30B model that loop is a single-threaded matmul that can run for an hour; if the process was killed mid-projection, every completed layer's work was lost.

Fix: snapshot `all_down_meta` to `down_meta.bin` after each layer finishes. `write_binary` already uses a tempfile + atomic rename, so the on-disk file is never in a half-written state: readers always see either the previous snapshot or the new one. The loop is restructured from `iter_mut().enumerate()` to index-based iteration so the per-iteration mutable borrow on `all_down_meta[layer]` drops before the immutable borrow `write_binary` needs.

Cost: ~1.5 MB extra write per layer (well under the per-layer matmul time).

Benefit: a killed run preserves every completed layer of projection, so an interruption 40 minutes in no longer loses 40 minutes of work.

The final write after the loop is kept for the resumed-from-checkpoint branch (where the loop runs zero iterations).
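The checkpointing pattern described in this commit (index-based loop, mutable borrow scoped to one layer, then a whole-collection snapshot written via tempfile + rename) can be sketched with the standard library alone. `write_atomic`, `serialize`, and `project_with_checkpoints` are hypothetical names, the per-layer "projection" is a placeholder push, and the byte format is invented for the example; none of this is the real `write_binary`.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Writes to a sibling temp file, then renames it over the target.
/// On POSIX filesystems a same-directory rename replaces the file atomically.
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}

/// Placeholder serialisation: all values as little-endian f32.
fn serialize(all_meta: &[Vec<f32>]) -> Vec<u8> {
    all_meta
        .iter()
        .flatten()
        .flat_map(|v| v.to_le_bytes())
        .collect()
}

/// Processes each layer, snapshotting everything finished so far.
fn project_with_checkpoints(all_meta: &mut [Vec<f32>], out: &Path) -> io::Result<()> {
    for layer in 0..all_meta.len() {
        {
            // Mutable borrow of one layer, dropped at the end of this block.
            let slot = &mut all_meta[layer];
            slot.push(layer as f32); // stand-in for the real projection work
        }
        // Immutable borrow of the whole collection is now allowed.
        write_atomic(out, &serialize(all_meta))?;
    }
    // Final write covers the case where the loop ran zero iterations.
    write_atomic(out, &serialize(all_meta))
}
```

With `iter_mut().enumerate()` the iterator would hold a mutable borrow of the whole collection for the duration of the loop, so the snapshot call inside the body would not compile; indexing sidesteps that.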

The DeepSeek-V4 family emits only `{arch}.expert_feed_forward_length`, never the global `{arch}.feed_forward_length`, because no dense FFN layer exists above the per-expert size. The loader previously read only the global key, so `intermediate_size` came back as `0` and config validation rejected the model:

    Error: failed to load GGUF model: config validation failed: [ConfigValidationError { field: "intermediate_size", message: "must be greater than 0" }]

This is the same fix as upstream PR #138, applied directly to this branch so DS-V4-Flash can flow through the streaming-GGUF path. (#138 will land independently; this commit is a no-op once it merges.)
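The fallback described here can be sketched as a two-step metadata lookup. The `HashMap<String, u64>` metadata store and the `intermediate_size` function are simplifications for illustration, not the loader's real types, and treating a stored `0` as "missing" is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Prefers the global FFN width, falling back to the per-expert width.
/// Returns `None` when neither key carries a positive value.
fn intermediate_size(meta: &HashMap<String, u64>, arch: &str) -> Option<u64> {
    // A zero value is treated like an absent key (assumption of this sketch).
    let get_nonzero = |key: String| meta.get(&key).copied().filter(|&v| v > 0);
    get_nonzero(format!("{arch}.feed_forward_length"))
        .or_else(|| get_nonzero(format!("{arch}.expert_feed_forward_length")))
}
```

Returning `Option` leaves the "must be greater than 0" decision to config validation, so a model with neither key still fails with the same error as before.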

…/ directory

Incorporates 5 PRs from mvkorobkov (MLA metadata, multi-shard reader, MoE fallback, MLA capability gate, streaming GGUF extract) with fixes:

- 3-digit shard width detection, Q4K MLA guard, detect_gguf_entry dedup
- Split 3,221-line monolith into 7 focused modules (93.5% test coverage)

Mykhailo Korobkov and others added 9 commits May 25, 2026 14:53

fix(clippy): replace useless vec! with array literal in parser test

be5d7dc

fix(coverage): update policy for gguf/ module split — replace gguf.rs…

ae8424f

… with orient.rs debt baseline

fix(coverage): lower vindex streaming baselines for new GGUF code paths

281084a

chrishayuk force-pushed the feat/streaming-gguf branch from 1d5add7 to 9598975 on May 25, 2026 14:04

fix: add has_vision_config to test configs after rebase, update vinde…

9598975

…x coverage baselines

chrishayuk merged commit d248a59 into main May 25, 2026
31 checks passed

deem0n mentioned this pull request May 27, 2026

fix(moe-shards): guard empty q4 dense FFN + wire --metal (closes #151) #152

Merged

15 tasks

chrishayuk merged 10 commits into main from feat/streaming-gguf
