feat(ggml): add Q3_K and Q5_K dequantization (types 11 and 13) #103
Hey @mvkorobkov, the Q3_K + Q5_K additions look great in isolation, but this branch is showing 920 files changed / +38.5K / −122K against current main. Looks like the base diverged significantly while the PR was open. Could you rebase against current main? If a rebase is painful, I can also cherry-pick just the Q3_K / Q5_K files onto a fresh branch from main and credit you on the commits (the pattern we used for #91→#121 and #115→#122). Let me know which you prefer.
Force-pushed: 179a07b to 6b2cd1b
Rebased onto current main (810f163). Branch is now a single 349-line commit.
Force-pushed: 6b2cd1b to f2a4c34
Implements scalar dequantize for Q3_K (110 B/block) and Q5_K (176 B/block) so that DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models can be converted via larql gguf-to-vindex.

- q3_k.rs: unpack_q3k_scales (kmask1/kmask2 per llama.cpp), two-half-block loop with m-bitmask for high bits, signed-scale centred at 32.
- q5_k.rs: reuses pub(super) unpack_q4k_scales from q4_k; u1/u2 mask walk for high bits, 4 iterations of 64 elements each.
- mod.rs: Q3_K_BLOCK_BYTES=110, Q5_K_BLOCK_BYTES=176, dispatch in tensor_data_size() and dequantize().
- q4_k.rs: unpack_q4k_scales promoted to pub(super) for Q5_K reuse.
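To illustrate the kmask1/kmask2 scale shuffle mentioned for q3_k.rs, here is a minimal Rust sketch that follows the bit layout used by `dequantize_row_q3_K` in llama.cpp. The name `unpack_q3k_scales_sketch` and its signature are hypothetical; the actual `unpack_q3k_scales` in this PR may be shaped differently.

```rust
const KMASK1: u32 = 0x0303_0303; // selects a 2-bit field in every byte
const KMASK2: u32 = 0x0f0f_0f0f; // selects the low nibble of every byte

/// Unpack 12 packed bytes into 16 signed scales (hypothetical helper).
/// Bytes 0..8 hold the low 4 bits of each scale, bytes 8..12 hold the
/// high 2 bits; the 6-bit result is re-centred by subtracting 32.
fn unpack_q3k_scales_sketch(packed: &[u8; 12]) -> [i8; 16] {
    // llama.cpp memcpy's into u32 words, which assumes little-endian order.
    let word = |i: usize| {
        u32::from_le_bytes([
            packed[4 * i],
            packed[4 * i + 1],
            packed[4 * i + 2],
            packed[4 * i + 3],
        ])
    };
    let (a0, a1, hi) = (word(0), word(1), word(2));

    let words = [
        (a0 & KMASK2) | ((hi & KMASK1) << 4),
        (a1 & KMASK2) | (((hi >> 2) & KMASK1) << 4),
        ((a0 >> 4) & KMASK2) | (((hi >> 4) & KMASK1) << 4),
        ((a1 >> 4) & KMASK2) | (((hi >> 6) & KMASK1) << 4),
    ];

    let mut scales = [0i8; 16];
    for (w, value) in words.iter().enumerate() {
        for (b, byte) in value.to_le_bytes().iter().enumerate() {
            // Each byte is a 6-bit value in 0..=63; centre it at 32.
            scales[4 * w + b] = *byte as i8 - 32;
        }
    }
    scales
}
```

With an all-zero input every scale comes out as -32, which is why a block with zero `d` is the simplest way to get all-zero output in a unit test.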
#67

llama.cpp emits DeepSeek-V2/V3 (and Kimi K2) MLA geometry in the GGUF metadata under {arch}.attention.* and {arch}.rope.dimension_count. `to_config_json` was dropping every one of these fields, so the parsed ModelConfig had MLA disabled and PR #96's absorption never fired for GGUF-sourced inputs.

This surfaces the relevant fields into the HF-shaped config the parser consumes:

- `attention.q_lora_rank` → `q_lora_rank`
- `attention.kv_lora_rank` → `kv_lora_rank`
- `attention.key_length[_mla]` → `qk_nope_head_dim` (= key_length − rope.dim)
- `attention.value_length[_mla]` → `v_head_dim`
- `rope.dimension_count` → `qk_rope_head_dim`

For per-head dims the loader prefers the `_mla` variants when present, since those carry the pre-absorption (DS-V3-standard) split that `mla_absorb::absorb` operates on. Kimi K2.6's GGUF exposes both forms (192/128 for `_mla`, 576/512 absorbed); we want 192/128.

Verified against Kimi K2.6 UD-Q8_K_XL GGUF metadata (the unsloth name is misleading: actual tensor types are BF16 + F32 + Q4_0, all already supported by larql's existing dequant).

Three new tests cover:

1. Kimi K2.6-shaped metadata → full MLA fields populated, MLA detected
2. Non-`_mla` variant fallback (DS-V2 with key_length only)
3. Non-MLA architectures (llama) keep their fields absent

281/281 larql-models tests pass.

Combined with PR #96 + #103 + #133, this unlocks inference-level extraction of Kimi K2 family and any other DeepSeek-V2/V3 GGUF that exposes the standard MLA metadata.
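The field mapping and the `_mla` preference can be sketched roughly as follows. This is not the code in `to_config_json`: the `MlaFields` struct, the `mla_fields_sketch` function, the flat `HashMap<String, u64>` metadata view, and the choice to treat a missing `kv_lora_rank` as "not MLA" are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical container for the HF-style MLA fields listed above.
#[derive(Debug, PartialEq)]
struct MlaFields {
    q_lora_rank: Option<u64>,
    kv_lora_rank: u64,
    qk_nope_head_dim: u64,
    qk_rope_head_dim: u64,
    v_head_dim: u64,
}

/// Derive MLA fields from GGUF metadata keyed as `{arch}.{suffix}`.
/// Returns `None` when the metadata does not describe an MLA model.
fn mla_fields_sketch(meta: &HashMap<String, u64>, arch: &str) -> Option<MlaFields> {
    let get = |suffix: &str| meta.get(&format!("{arch}.{suffix}")).copied();

    // Assumption: no kv_lora_rank means a non-MLA architecture (e.g. llama).
    let kv_lora_rank = get("attention.kv_lora_rank")?;

    // Prefer the pre-absorption `_mla` variants, fall back to the plain keys.
    let key_length =
        get("attention.key_length_mla").or_else(|| get("attention.key_length"))?;
    let value_length =
        get("attention.value_length_mla").or_else(|| get("attention.value_length"))?;
    let rope_dim = get("rope.dimension_count")?;

    Some(MlaFields {
        q_lora_rank: get("attention.q_lora_rank"),
        kv_lora_rank,
        // qk_nope_head_dim = key_length - rope.dim; reject inconsistent metadata.
        qk_nope_head_dim: key_length.checked_sub(rope_dim)?,
        qk_rope_head_dim: rope_dim,
        v_head_dim: value_length,
    })
}
```

For metadata carrying both forms (192/128 under `_mla`, 576/512 absorbed) and an assumed `rope.dimension_count` of 64, this sketch yields `qk_nope_head_dim = 128` and `v_head_dim = 128`, which is the pre-absorption split the commit message says is wanted.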
llama.cpp's gguf-split produces multi-file GGUFs (canonical naming: `<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full metadata header but only owns its own slice of tensors. The current `GgufFile::open` reads one file, so multi-shard models (Kimi K2.6 with 14 shards, DeepSeek-V4-Flash with 3 shards, and increasingly any large modern LLM) could not be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>` field on `GgufFile`. Single-file GGUFs get `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard via the explicit `split.count` metadata key, falling back to the filename pattern when the splitter omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing filenames at the prefix's chosen width (`00001-of-00014` vs `001-of-003` both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging them with the right `shard_idx`. Cross-checks the total against `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and reads each tensor from `shards[info.shard_idx].path` at the right per-shard `data_offset`. Shards whose tensors are all skipped by `skip_key` are never opened.

Backward-compatible: existing `GgufFile::open` callers and the single-file test fixtures keep working with `shards = vec![…one…]`.
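The filename handling in steps 2 and 3 could look something like the sketch below. It only covers parsing the `<prefix>-<NNNNN>-of-<NNNNN>.gguf` pattern and rebuilding a sibling name at the same digit width. The names `parse_shard_name_sketch` and `shard_name_sketch`, the tuple return type, and the rule that shard numbers start at 1 are assumptions, not the signatures used in this change.

```rust
/// Parse `<prefix>-<index>-of-<total>.gguf` (hypothetical helper).
/// Returns (prefix, index, total, digit width), or None when the name
/// is a plain `.gguf` or the two numeric fields disagree in width.
fn parse_shard_name_sketch(name: &str) -> Option<(String, u32, u32, usize)> {
    let stem = name.strip_suffix(".gguf")?;
    let (rest, total_str) = stem.rsplit_once("-of-")?;
    let (prefix, index_str) = rest.rsplit_once('-')?;

    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(index_str) || !all_digits(total_str) {
        return None;
    }
    // Both fields must use the same zero-padded width (5 or 3 digits, etc.).
    let width = index_str.len();
    if width != total_str.len() {
        return None;
    }

    let index: u32 = index_str.parse().ok()?;
    let total: u32 = total_str.parse().ok()?;
    // Assumption: shard numbering is 1-based and never exceeds the total.
    if index == 0 || index > total {
        return None;
    }
    Some((prefix.to_string(), index, total, width))
}

/// Rebuild the filename of shard `index` at the given digit width.
fn shard_name_sketch(prefix: &str, index: u32, total: u32, width: usize) -> String {
    format!("{}-{:0w$}-of-{:0w$}.gguf", prefix, index, total, w = width)
}
```

Given any one shard, sibling names follow by calling `shard_name_sketch` for every index from 1 to `total` with the parsed prefix and width; a missing file at that point would be the error case.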
Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection, mismatched widths rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from any-position shard, error when sibling missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with #96 (MLA absorption), #103 (Q3_K/Q5_K dequant), #133 (GGUF extract input), and #135 (DeepSeek-V2/V3 MLA metadata reading), this completes the chain: `larql extract --level inference` works end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard GGUFs.
What

Implements scalar dequantization for two missing K-quant formats:

- Q3_K (type 11): 110 bytes per block
- Q5_K (type 13): 176 bytes per block

Both `tensor_data_size()` and `dequantize()` in `mod.rs` are wired up.

Why

Without these, `larql convert gguf-to-vindex` fails with `unsupported type id 11/13` on any model that uses Q3_K or Q5_K tensors. This includes DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models.

Implementation

`q3_k.rs`: `dequantize_q3_k()`

- Block layout: `hmask[0..32]` · `qs[32..96]` · `scales[96..108]` · `d[108..110]`
- `unpack_q3k_scales()`: 12 bytes → 16 six-bit signed values using the `kmask1=0x03030303` / `kmask2=0x0F0F0F0F` shuffle from `dequantize_row_q3_K` in llama.cpp
- `m` bitmask walks through `hmask`; clear bit → subtract 4 from q2 value

`q5_k.rs`: `dequantize_q5_k()`

- Block layout: `d[0..2]` · `dmin[2..4]` · `scales[4..16]` · `qh[16..48]` · `qs[48..176]`
- Reuses `pub(super) unpack_q4k_scales()` from `q4_k.rs` (same 12-byte format as Q4_K)
- `u1`/`u2` bitmask pair walks through `qh`; set bit → add 16 to 4-bit nibble
- Follows `dequantize_row_q5_K` in llama.cpp

`q4_k.rs`: `unpack_q4k_scales` visibility changed `fn` → `pub(super)` so Q5_K can share it without duplication.

Testing

Unit tests in each module:

- `q3_k`: zero-scale all-zero output, hmask-clear subtracts 4, wrong-size error
- `q5_k`: zero-scale all-zero output, high-bit adds 16, wrong-size error

End-to-end: `larql convert gguf-to-vindex` on `DeepSeek-R1-0528-Qwen3-8B-Q3_K_L.gguf` completes through dequantization without errors (145 Q3_K + 108 Q5_K tensors dequantized cleanly).

All 82 existing larql-models tests continue to pass.
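
For readers checking the two high-bit rules that the unit tests exercise (clear `hmask` bit subtracts 4, set `qh` bit adds 16), a minimal sketch follows. The helper names and the `q5k_block_sketch` signature are hypothetical. In particular, it takes per-sub-block scales and mins that are assumed to be already multiplied by `d` and `dmin`, so it skips the f16 decode and the 12-byte scale unpack that the real `dequantize_q5_k()` performs.

```rust
/// Q3_K: 2-bit value from `qs`; a clear bit in `hmask` subtracts 4.
fn q3k_value(q2: u8, hmask_byte: u8, m: u8) -> i8 {
    let low = (q2 & 0x03) as i8;
    if hmask_byte & m != 0 { low } else { low - 4 }
}

/// Q5_K: 4-bit nibble from `qs`; a set bit in `qh` adds 16.
fn q5k_value(nibble: u8, qh_byte: u8, u: u8) -> u8 {
    let low = nibble & 0x0f;
    if qh_byte & u != 0 { low + 16 } else { low }
}

/// Walk one Q5_K block body: 4 iterations of 64 elements each.
/// `scales[i]` / `mins[i]` are the effective (already multiplied by
/// d / dmin) values for sub-block i; that pre-multiplication is an
/// assumption of this sketch.
fn q5k_block_sketch(
    qs: &[u8; 128],
    qh: &[u8; 32],
    scales: &[f32; 8],
    mins: &[f32; 8],
) -> [f32; 256] {
    let mut out = [0.0f32; 256];
    for chunk in 0..4 {
        // u1/u2 select the qh bit for the low and high nibble of this chunk.
        let u1 = 1u8 << (2 * chunk);
        let u2 = 2u8 << (2 * chunk);
        let (d1, m1) = (scales[2 * chunk], mins[2 * chunk]);
        let (d2, m2) = (scales[2 * chunk + 1], mins[2 * chunk + 1]);
        for l in 0..32 {
            let byte = qs[32 * chunk + l];
            // First 32 outputs use the low nibbles, next 32 the high nibbles.
            out[64 * chunk + l] = d1 * f32::from(q5k_value(byte, qh[l], u1)) - m1;
            out[64 * chunk + 32 + l] =
                d2 * f32::from(q5k_value(byte >> 4, qh[l], u2)) - m2;
        }
    }
    out
}
```

The same 32 `qh` bytes are reused in every iteration; only the bit selected by `u1`/`u2` moves, which is what "bitmask pair walks through `qh`" refers to.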