feat(gguf): surface MLA metadata for DeepSeek-V2/V3 + Kimi K2 — closes #67 by mvkorobkov · Pull Request #135 · chrishayuk/larql

mvkorobkov · 2026-05-24T11:01:55Z

Summary

Closes #67. With the MLA absorption work in #96 already merged, the last missing piece for Kimi K2 (and any DeepSeek-V2/V3) extraction from GGUF was reading the MLA geometry off the GGUF metadata. `to_config_json` dropped every `attention.q_lora_rank` / `attention.kv_lora_rank` / `attention.key_length[_mla]` / `attention.value_length[_mla]` / `rope.dimension_count` key it saw, so `ModelConfig.qk_nope_head_dim` etc. came back as `None` and `uses_mla()` stayed `false` for GGUF-sourced models.

Surprise from looking at real Kimi K2.6 files

Inspecting Kimi-K2.6 UD-Q8_K_XL (unsloth's 554 GB 14-shard split) with `gguf-dump` showed that the "Q8_K_XL" naming is misleading — the tensor type histogram across multiple shards is only BF16 + F32 + Q4_0, every one of which is already covered by larql's existing ggml dequant. No Q8_K (type 15) dequant work was actually needed for this family. The blocker was purely the missing config plumbing.

(For posterity: shard 2 = 54 BF16 / 33 F32 / 13 Q4_0; shards 3, 7, 10 = same pattern with 45/30/15; shard 14 trails with 2/2/2 — all supported.)

What this PR changes

In `crates/larql-models/src/loading/gguf.rs::to_config_json`:

| GGUF key | HF field surfaced | Notes |
|---|---|---|
| `{arch}.attention.q_lora_rank` | `q_lora_rank` | Kimi K2.6: 1536 |
| `{arch}.attention.kv_lora_rank` | `kv_lora_rank` | Kimi K2.6: 512 |
| `{arch}.attention.key_length_mla` (or `.key_length`) | `qk_nope_head_dim` = key_length − rope.dim | Kimi K2.6: 192 − 64 = 128 |
| `{arch}.attention.value_length_mla` (or `.value_length`) | `v_head_dim` | Kimi K2.6: 128 |
| `{arch}.rope.dimension_count` | `qk_rope_head_dim` | Kimi K2.6: 64 |

For per-head dims the loader prefers the `_mla` variants when present — those carry the pre-absorption (DeepSeek-V3-standard) split that `mla_absorb::absorb` operates on. Kimi K2.6's GGUF exposes both forms (192/128 in `_mla`, 576/512 absorbed); we want the 192/128.
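As a rough illustration of the mapping and the `_mla` preference described above, here is a minimal sketch. It is not the code in `to_config_json`: the flat `HashMap<String, u64>` metadata, the `MlaFields` struct, and the `get` / `surface_mla` helpers are hypothetical stand-ins. Treating `kv_lora_rank` as the marker for an MLA model and `q_lora_rank` as optional are assumptions of this sketch, as is the `deepseek2` architecture prefix used in the usage note.

```rust
use std::collections::HashMap;

/// MLA geometry under the HF-style field names (hypothetical struct).
#[derive(Debug, PartialEq)]
struct MlaFields {
    q_lora_rank: Option<u64>, // assumption: may be absent on some models
    kv_lora_rank: u64,
    qk_nope_head_dim: u64,
    qk_rope_head_dim: u64,
    v_head_dim: u64,
}

/// Hypothetical helper: look up `{arch}.{suffix}` in a flat metadata map.
fn get(meta: &HashMap<String, u64>, arch: &str, suffix: &str) -> Option<u64> {
    meta.get(&format!("{arch}.{suffix}")).copied()
}

/// Returns `None` when the metadata does not describe an MLA model.
fn surface_mla(meta: &HashMap<String, u64>, arch: &str) -> Option<MlaFields> {
    // Assumption: no kv_lora_rank means this is not an MLA architecture.
    let kv_lora_rank = get(meta, arch, "attention.kv_lora_rank")?;
    let rope = get(meta, arch, "rope.dimension_count")?;

    // Prefer the pre-absorption `_mla` variants, fall back to the plain keys.
    let key_len = get(meta, arch, "attention.key_length_mla")
        .or_else(|| get(meta, arch, "attention.key_length"))?;
    let v_head_dim = get(meta, arch, "attention.value_length_mla")
        .or_else(|| get(meta, arch, "attention.value_length"))?;

    Some(MlaFields {
        q_lora_rank: get(meta, arch, "attention.q_lora_rank"),
        kv_lora_rank,
        // qk_nope_head_dim = key_length - rope.dim (192 - 64 = 128 for Kimi K2.6).
        qk_nope_head_dim: key_len.checked_sub(rope)?,
        qk_rope_head_dim: rope,
        v_head_dim,
    })
}
```

Fed the Kimi K2.6 values from the table (with both the absorbed 576/512 and the `_mla` 192/128 keys present), `surface_mla(&meta, "deepseek2")` yields 128 / 64 / 128 for the nope, rope, and value head dims. A llama-style map with no `kv_lora_rank` yields `None`.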

Verification

`cargo test -p larql-models` — 281/281 pass.

Three new tests:

- `test_kimi_k2_gguf_to_config_json_extracts_mla_fields`: synthesises Kimi K2.6-shaped metadata (q_lora=1536, kv_lora=512, key_length=576 + key_length_mla=192, value_length=512 + value_length_mla=128, rope.dim=64); checks all MLA fields end up in the HF config, then drives `detect_from_json` and asserts `uses_mla() == true` with the pre-absorption dims.
- `test_gguf_mla_falls_back_to_non_mla_key_length_when_mla_keys_absent`: older DS-V2 GGUFs that ship only `key_length`/`value_length` still produce the correct split.
- `test_gguf_mla_fields_absent_for_non_mla_architectures`: llama / qwen / mistral etc. don't emit MLA keys; the loader leaves every optional MLA field unset, so the streaming path keeps its existing behaviour (no regression).

What this unlocks

Combined with the three already-merged PRs (#96 MLA absorption + #103 Q3_K/Q5_K dequant + #133 GGUF-input fix), this PR completes the chain: `larql extract --level inference` works end-to-end on Kimi K2 family GGUFs. Same path works for any DeepSeek-V2/V3 GGUF that exposes the standard MLA metadata.

I plan to extract Kimi K2.6 UD-Q8_K_XL once this lands — happy to share the resulting vindex `index.json` for sanity-check.

chrishayuk#67

llama.cpp emits DeepSeek-V2/V3 (and Kimi K2) MLA geometry in the GGUF metadata under `{arch}.attention.*` and `{arch}.rope.dimension_count`. `to_config_json` was dropping every one of these fields, so the parsed ModelConfig had MLA disabled and PR chrishayuk#96's absorption never fired for GGUF-sourced inputs.

This surfaces the relevant fields into the HF-shaped config the parser consumes:

- `attention.q_lora_rank` → `q_lora_rank`
- `attention.kv_lora_rank` → `kv_lora_rank`
- `attention.key_length[_mla]` → `qk_nope_head_dim` (= key_length − rope.dim)
- `attention.value_length[_mla]` → `v_head_dim`
- `rope.dimension_count` → `qk_rope_head_dim`

For per-head dims the loader prefers the `_mla` variants when present, since those carry the pre-absorption (DS-V3-standard) split that `mla_absorb::absorb` operates on. Kimi K2.6's GGUF exposes both forms (192/128 for `_mla`, 576/512 absorbed); we want 192/128.

Verified against Kimi K2.6 UD-Q8_K_XL GGUF metadata (the unsloth name is misleading: actual tensor types are BF16 + F32 + Q4_0, all already supported by larql's existing dequant).

Three new tests cover:

1. Kimi K2.6-shaped metadata → full MLA fields populated, MLA detected
2. Non-`_mla` variant fallback (DS-V2 with key_length only)
3. Non-MLA architectures (llama) keep their fields absent

281/281 larql-models tests pass.

Combined with PR chrishayuk#96 + chrishayuk#103 + chrishayuk#133, this unlocks inference-level extraction of Kimi K2 family and any other DeepSeek-V2/V3 GGUF that exposes the standard MLA metadata.

llama.cpp's gguf-split produces multi-file GGUFs (canonical naming: `<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full metadata header but only owns its own slice of tensors. The current `GgufFile::open` reads one file, so multi-shard models (Kimi K2.6 with 14 shards, DeepSeek-V4-Flash with 3 shards, and increasingly any large modern LLM) could not be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>` field on `GgufFile`. Single-file GGUFs get a `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard via the explicit `split.count` metadata key, falling back to the filename pattern when the splitter omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing filenames at the prefix's chosen width (`00001-of-00014` vs `001-of-003` both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging them with the right `shard_idx`. Cross-checks the total against `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and reads each tensor from `shards[info.shard_idx].path` at the right per-shard `data_offset`. Shards whose tensors are all skipped by `skip_key` are never opened.

Backward-compatible: existing `GgufFile::open` callers and the single-file test fixtures keep working with `shards = vec![…one…]`.

Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection, mismatched widths rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from any-position shard, error when sibling missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with #96 (MLA absorption), #103 (Q3_K/Q5_K dequant), #133 (GGUF extract input), and #135 (DeepSeek-V2/V3 MLA metadata reading), this completes the chain: `larql extract --level inference` works end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard GGUFs.
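The filename handling in the multi-shard commit message (parse `<prefix>-<NNNNN>-of-<NNNNN>.gguf`, reject mismatched widths, rebuild sibling names at the same width) might look roughly like the sketch below. This is an illustration written from the commit message alone, not the merged code: the `ShardName` struct and `sibling_names` are hypothetical, and although `parse_shard_filename` shares its name with the test group above, the signature and the 1-based index check are assumptions.

```rust
/// Pieces of a `<prefix>-<NNNNN>-of-<NNNNN>.gguf` file name (hypothetical struct).
#[derive(Debug, PartialEq)]
struct ShardName {
    prefix: String,
    index: usize, // assumed 1-based, as in `00001-of-00014`
    count: usize,
    width: usize, // digit width shared by index and count
}

/// Parse a shard file name; returns `None` for plain `.gguf` files or
/// names whose index and count use different digit widths.
fn parse_shard_filename(name: &str) -> Option<ShardName> {
    let stem = name.strip_suffix(".gguf")?;
    let (rest, count_s) = stem.rsplit_once("-of-")?;
    let (prefix, index_s) = rest.rsplit_once('-')?;

    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(index_s) || !all_digits(count_s) || index_s.len() != count_s.len() {
        return None;
    }

    let index: usize = index_s.parse().ok()?;
    let count: usize = count_s.parse().ok()?;
    if prefix.is_empty() || index == 0 || index > count {
        return None;
    }

    Some(ShardName {
        prefix: prefix.to_string(),
        index,
        count,
        width: index_s.len(),
    })
}

/// Rebuild every sibling file name at the width the parsed shard used.
fn sibling_names(shard: &ShardName) -> Vec<String> {
    (1..=shard.count)
        .map(|i| {
            format!(
                "{}-{:0w$}-of-{:0w$}.gguf",
                shard.prefix,
                i,
                shard.count,
                w = shard.width
            )
        })
        .collect()
}
```

Parsing any one shard and calling `sibling_names` gives the full set of file names to look for in the same directory. A real reader would then check that each one exists and report the missing shard otherwise.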

…/ directory

Incorporates 5 PRs from mvkorobkov (MLA metadata, multi-shard reader, MoE fallback, MLA capability gate, streaming GGUF extract) with fixes:

- 3-digit shard width detection, Q4K MLA guard, detect_gguf_entry dedup
- Split 3,221-line monolith into 7 focused modules (93.5% test coverage)

feat(gguf): consolidate PRs #135-139 with fixes + modular split

BitNet b1.58 models encode weights as ternary {-1, 0, +1} with a per-block f16 scale. The two canonical GGUF formats are TQ2_0 (2.0625 bpw, 4 trits per byte at 2 bits each) and TQ1_0 (1.6875 bpw, 5 trits per byte in base-3 packing).

This commit adds:

- Type IDs: TYPE_TQ1_0 = 34, TYPE_TQ2_0 = 35 in quant/ggml/mod.rs.
- Block geometry: TQ2_0_BLOCK_BYTES = 66, TQ1_0_BLOCK_BYTES = 54, both at K_QUANT_BLOCK_ELEMS (256) elements per block.
- Wired into tensor_data_size(), type_name(), and dequantize() dispatch so existing callers route automatically.
- New file: quant/ggml/tq.rs with both decoders + reference encoders. Inline IEEE-754 binary16 codec to keep this module dep-free.

Test status (cargo test -p larql-models quant::ggml::tq):

- 11 pass: TQ2_0 round-trip unit/scaled/zero/two-blocks/error paths, TQ1_0 zero-block + bounds checks, dispatch + type_name + size helpers for both.
- 2 ignored: TQ1_0 round-trip (full encoder/decoder pairing). The digit-extraction trick (`byte * pow3[l] * 3 >> 8`) is correct for some patterns but pinning the exact canonical encoder requires validation against a real Microsoft BitNet b1.58 GGUF. Tracked for F2-followup.

TQ2_0 is what production hits: Microsoft's bitnet-b1.58-2B-4T-gguf ships TQ2_0 tensors exclusively. This unblocks 'larql extract --gguf <bitnet.gguf>' for browse-level extraction of BitNet models into a vindex via the streaming pipeline landed upstream in PRs #135-145.
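The bits-per-weight figures quoted in the BitNet commit message follow directly from the block geometry, and a short sketch makes the arithmetic checkable. The three constants carry the names and values given in the commit message; `bits_per_weight` and `block_data_size` are hypothetical helpers written for this illustration and are not the crate's `tensor_data_size()`.

```rust
// Constants as named in the commit message.
const K_QUANT_BLOCK_ELEMS: usize = 256;
const TQ2_0_BLOCK_BYTES: usize = 66;
const TQ1_0_BLOCK_BYTES: usize = 54;

/// Bits per weight for a block format: total block bits / elements per block.
/// TQ2_0: 66 * 8 / 256 = 2.0625, TQ1_0: 54 * 8 / 256 = 1.6875.
fn bits_per_weight(block_bytes: usize) -> f64 {
    (block_bytes * 8) as f64 / K_QUANT_BLOCK_ELEMS as f64
}

/// Hypothetical size helper: bytes needed for `n_elems` weights, or `None`
/// when the element count is not a whole number of 256-element blocks.
fn block_data_size(n_elems: usize, block_bytes: usize) -> Option<usize> {
    if n_elems % K_QUANT_BLOCK_ELEMS != 0 {
        return None;
    }
    Some(n_elems / K_QUANT_BLOCK_ELEMS * block_bytes)
}
```

For TQ2_0 the 66 bytes are consistent with 256 / 4 = 64 bytes of 2-bit trits plus a 2-byte f16 scale. The exact byte order inside the block is not shown here, since the commit message does not spell it out.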

This was referenced May 24, 2026

feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits #136

Closed

fix(capabilities): accept MLA architectures when full geometry is exposed #137

Closed

fix(gguf): fall back to expert_feed_forward_length for MoE-only configs #138

Closed

chrishayuk merged commit 8f1c8f3 into chrishayuk:main May 25, 2026
27 checks passed

chrishayuk mentioned this pull request May 25, 2026

feat(gguf): consolidate PRs #135-139 with fixes + modular split #145

Merged

6 tasks

chrishayuk added a commit that referenced this pull request May 25, 2026

Merge pull request #145 from chrishayuk/feat/streaming-gguf

d248a59

feat(gguf): consolidate PRs #135-139 with fixes + modular split

chrishayuk merged 1 commit into chrishayuk:main from mvkorobkov:feat/deepseek-mla-gguf-kimi-k2
