feat(ggml): add Q3_K and Q5_K dequantization (types 11 and 13) by mvkorobkov · Pull Request #103 · chrishayuk/larql

mvkorobkov · 2026-05-15T08:58:45Z

What

Implements scalar dequantization for two missing K-quant formats:

Type	ID	Block size	Elements/block
Q3_K	11	110 bytes	256
Q5_K	13	176 bytes	256

Both tensor_data_size() and dequantize() in mod.rs are wired up.
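
For reviewers, the size arithmetic implied by the table can be sketched as below. This is a minimal illustration, not the code in mod.rs: the function name `k_quant_data_size` and its `Result<usize, String>` signature are hypothetical; only the block sizes, the type IDs, and the 256-element block length come from this PR.

```rust
/// Elements per K-quant block (from the table above).
const QK_K: usize = 256;
const Q3_K_BLOCK_BYTES: usize = 110;
const Q5_K_BLOCK_BYTES: usize = 176;

/// Hypothetical helper: byte size of a tensor holding `n_elements` values
/// stored as GGML type `type_id` (11 = Q3_K, 13 = Q5_K).
fn k_quant_data_size(type_id: u32, n_elements: usize) -> Result<usize, String> {
    let block_bytes = match type_id {
        11 => Q3_K_BLOCK_BYTES,
        13 => Q5_K_BLOCK_BYTES,
        other => return Err(format!("unsupported type id {other}")),
    };
    // K-quants only store whole 256-element blocks.
    if n_elements % QK_K != 0 {
        return Err(format!(
            "element count {n_elements} is not a multiple of {QK_K}"
        ));
    }
    Ok(n_elements / QK_K * block_bytes)
}
```

Under these assumptions a 512-element Q5_K tensor occupies 2 × 176 = 352 bytes.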

Why

Without these, larql convert gguf-to-vindex fails with unsupported type id 11/13 on any model that uses Q3_K or Q5_K tensors. This includes:

DeepSeek-R1-0528-Qwen3-8B-Q3_K_L — 145 Q3_K tensors + 108 Q5_K tensors + 1 Q6_K
DeepSeek-V4-Flash-Q3_K_M (multi-shard, separate PR) — same types
Any model quantised with llama.cpp Q3_K_S / Q3_K_M / Q3_K_L / Q5_K_S / Q5_K_M

Implementation

q3_k.rs — dequantize_q3_k()

Block layout: hmask[0..32] · qs[32..96] · scales[96..108] · d[108..110]
unpack_q3k_scales(): 12 bytes → 16 six-bit signed values using the kmask1=0x03030303 / kmask2=0x0F0F0F0F shuffle from dequantize_row_q3_K in llama.cpp
Two-half loop (128 + 128 elements), m bitmask walks through hmask; clear bit → subtract 4 from q2 value
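
The scale shuffle and the hmask rule can be sketched as follows. This is an illustration modelled on dequantize_row_q3_K in llama.cpp, not a copy of q3_k.rs: the names `unpack_q3k_scales_sketch` and `q3k_value` are hypothetical, the 12 scale bytes are assumed to be read as little-endian 32-bit words, and the sketch subtracts the offset of 32 during unpacking, which may differ from where the PR applies it.

```rust
const KMASK1: u32 = 0x0303_0303;
const KMASK2: u32 = 0x0F0F_0F0F;

/// Unpack 12 packed bytes into 16 six-bit scales, re-centred to -32..=31.
/// Bytes 0..8 carry the low 4 bits of each scale, bytes 8..12 the high 2 bits.
fn unpack_q3k_scales_sketch(raw: &[u8; 12]) -> [i8; 16] {
    let word = |i: usize| u32::from_le_bytes([raw[i], raw[i + 1], raw[i + 2], raw[i + 3]]);
    let (a0, a1, tmp) = (word(0), word(4), word(8));
    let words = [
        (a0 & KMASK2) | ((tmp & KMASK1) << 4),
        (a1 & KMASK2) | (((tmp >> 2) & KMASK1) << 4),
        ((a0 >> 4) & KMASK2) | (((tmp >> 4) & KMASK1) << 4),
        ((a1 >> 4) & KMASK2) | (((tmp >> 6) & KMASK1) << 4),
    ];
    let mut out = [0i8; 16];
    for (w, packed) in words.iter().enumerate() {
        for (b, byte) in packed.to_le_bytes().iter().enumerate() {
            // Each byte is a 6-bit value in 0..=63; subtract 32 to make it signed.
            out[4 * w + b] = *byte as i8 - 32;
        }
    }
    out
}

/// One quantized value: 2 low bits from `qs`, high bit from `hmask`.
/// A clear hmask bit subtracts 4, so the result lies in -4..=3.
fn q3k_value(q_byte: u8, shift: u32, hmask_byte: u8, m: u8) -> i8 {
    let low2 = ((q_byte >> shift) & 3) as i8;
    if hmask_byte & m != 0 { low2 } else { low2 - 4 }
}
```

Each output element is then the block scale d times the unpacked sub-block scale times `q3k_value(..)`, with `shift` advancing by 2 and `m` shifting left by 1 for each group of 32 elements.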

q5_k.rs — dequantize_q5_k()

Block layout: d[0..2] · dmin[2..4] · scales[4..16] · qh[16..48] · qs[48..176]
Reuses pub(super) unpack_q4k_scales() from q4_k.rs (same 12-byte format as Q4_K)
u1/u2 bitmask pair walks through qh; set bit → add 16 to 4-bit nibble
4 iterations × 64 elements, matching dequantize_row_q5_K in llama.cpp
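
The inner loop described above can be sketched like this. It is a hedged illustration of the loop shape in dequantize_row_q5_K, not the code in q5_k.rs: the function name `dequantize_q5k_body` is hypothetical, and it assumes the caller has already decoded `d` and `dmin` from fp16 and unpacked the 12 scale bytes into eight scales and eight mins.

```rust
/// Dequantize one 256-element Q5_K block from already-decoded parts.
/// `scales`/`mins` are the eight 6-bit values per block, `qh` holds the
/// fifth bit of every element, `qs` holds two 4-bit nibbles per byte.
fn dequantize_q5k_body(
    d: f32,
    dmin: f32,
    scales: &[u8; 8],
    mins: &[u8; 8],
    qh: &[u8; 32],
    qs: &[u8; 128],
) -> Vec<f32> {
    let mut out = Vec::with_capacity(256);
    for j in 0..4 {
        // Each iteration consumes 32 bytes of qs and two bits of every qh byte.
        let ql = &qs[32 * j..32 * (j + 1)];
        let u1 = 1u8 << (2 * j);
        let u2 = 2u8 << (2 * j);
        let (d1, m1) = (d * scales[2 * j] as f32, dmin * mins[2 * j] as f32);
        let (d2, m2) = (d * scales[2 * j + 1] as f32, dmin * mins[2 * j + 1] as f32);
        for l in 0..32 {
            // Low nibble; a set qh bit adds 16.
            let hi = if qh[l] & u1 != 0 { 16 } else { 0 };
            out.push(d1 * ((ql[l] & 0x0F) + hi) as f32 - m1);
        }
        for l in 0..32 {
            // High nibble, using the neighbouring qh bit.
            let hi = if qh[l] & u2 != 0 { 16 } else { 0 };
            out.push(d2 * ((ql[l] >> 4) + hi) as f32 - m2);
        }
    }
    out
}
```

With all scales and mins set to zero the output is all zeros, which mirrors the zero-scale unit test listed under Testing.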

q4_k.rs — unpack_q4k_scales visibility changed fn → pub(super) so Q5_K can share it without duplication.

Testing

Unit tests in each module:

q3_k: zero-scale all-zero output, hmask-clear subtracts 4, wrong-size error
q5_k: zero-scale all-zero output, high-bit adds 16, wrong-size error

End-to-end: larql convert gguf-to-vindex on DeepSeek-R1-0528-Qwen3-8B-Q3_K_L.gguf completes through dequantization without errors (145 Q3_K + 108 Q5_K tensors dequantized cleanly).

All 82 existing larql-models tests continue to pass.

chrishayuk · 2026-05-22T15:24:47Z

Hey @mvkorobkov — the Q3_K + Q5_K additions look great in isolation, but this branch is showing 920 files changed / +38.5K / −122K against current main. Looks like the base diverged significantly while the PR was open.

Could you rebase against current main? The Q3_K/Q5_K work itself should be a clean ~330-line addition to crates/larql-models/src/quant/ggml/. Right now the diff also deletes crates/larql-models/src/quant/{fp8,half,mxfp4}.rs which live on main, so it can't merge as-is.

If a rebase is painful, I can also cherry-pick just the Q3_K / Q5_K files onto a fresh branch from main and credit you on the commits (the pattern we used for #91→#121 and #115→#122). Let me know which you prefer.

mvkorobkov · 2026-05-23T16:47:02Z

Rebased onto current main (810f163). Branch is now a single 349-line commit — feat(ggml): add Q3_K and Q5_K dequantization (types 11 and 13) — touching only crates/larql-models/src/quant/ggml/{mod.rs,q3_k.rs,q4_k.rs,q5_k.rs}. Dropped the workflow-deletion commit you flagged. cargo check -p larql-models passes. Ready for review.

Implements scalar dequantize for Q3_K (110 B/block) and Q5_K (176 B/block) so that DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models can be converted via larql gguf-to-vindex.

- q3_k.rs: unpack_q3k_scales (kmask1/kmask2 per llama.cpp), two-half-block loop with m-bitmask for high bits, signed-scale centred at 32.
- q5_k.rs: reuses pub(super) unpack_q4k_scales from q4_k; u1/u2 mask walk for high bits, 4 iterations of 64 elements each.
- mod.rs: Q3_K_BLOCK_BYTES=110, Q5_K_BLOCK_BYTES=176, dispatch in tensor_data_size() and dequantize().
- q4_k.rs: unpack_q4k_scales promoted to pub(super) for Q5_K reuse.

#67

llama.cpp emits DeepSeek-V2/V3 (and Kimi K2) MLA geometry in the GGUF metadata under {arch}.attention.* and {arch}.rope.dimension_count. `to_config_json` was dropping every one of these fields, so the parsed ModelConfig had MLA disabled and PR #96's absorption never fired for GGUF-sourced inputs.

This surfaces the relevant fields into the HF-shaped config the parser consumes:

- `attention.q_lora_rank` → `q_lora_rank`
- `attention.kv_lora_rank` → `kv_lora_rank`
- `attention.key_length[_mla]` → `qk_nope_head_dim` (= key_length − rope.dim)
- `attention.value_length[_mla]` → `v_head_dim`
- `rope.dimension_count` → `qk_rope_head_dim`

For per-head dims the loader prefers the `_mla` variants when present, since those carry the pre-absorption (DS-V3-standard) split that `mla_absorb::absorb` operates on. Kimi K2.6's GGUF exposes both forms (192/128 for `_mla`, 576/512 absorbed); we want 192/128.

Verified against Kimi K2.6 UD-Q8_K_XL GGUF metadata (the unsloth name is misleading: actual tensor types are BF16 + F32 + Q4_0, all already supported by larql's existing dequant).

Three new tests cover:

1. Kimi K2.6-shaped metadata → full MLA fields populated, MLA detected
2. Non-`_mla` variant fallback (DS-V2 with key_length only)
3. Non-MLA architectures (llama) keep their fields absent

281/281 larql-models tests pass.

Combined with PR #96 + #103 + #133, this unlocks inference-level extraction of Kimi K2 family and any other DeepSeek-V2/V3 GGUF that exposes the standard MLA metadata.
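
The field mapping and the `_mla` preference described in that commit message can be sketched roughly as follows. This is an assumption-laden illustration rather than the actual `to_config_json` code: the `MlaDims` struct, the `mla_dims` function, the flat `HashMap<String, u64>` metadata shape, and the choice to treat a missing `attention.kv_lora_rank` as "not MLA" are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical container for the per-head MLA dimensions.
#[derive(Debug, PartialEq)]
struct MlaDims {
    qk_nope_head_dim: u64,
    qk_rope_head_dim: u64,
    v_head_dim: u64,
}

/// Derive MLA per-head dims from `{arch}.*` metadata, preferring the
/// `_mla` (pre-absorption) variants when present.
fn mla_dims(arch: &str, meta: &HashMap<String, u64>) -> Option<MlaDims> {
    let get = |suffix: &str| meta.get(&format!("{arch}.{suffix}")).copied();
    // Assumption: architectures without kv_lora_rank are treated as non-MLA.
    get("attention.kv_lora_rank")?;
    let key_length = get("attention.key_length_mla")
        .or_else(|| get("attention.key_length"))?;
    let value_length = get("attention.value_length_mla")
        .or_else(|| get("attention.value_length"))?;
    let rope = get("rope.dimension_count")?;
    Some(MlaDims {
        // qk_nope_head_dim = key_length - rope.dim
        qk_nope_head_dim: key_length.checked_sub(rope)?,
        qk_rope_head_dim: rope,
        v_head_dim: value_length,
    })
}
```

With the 192/128 `_mla` values quoted above and an illustrative rope dimension of 64 (an assumed value, not taken from the PR), this yields a 128/64 nope/rope split and a value head dim of 128.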

llama.cpp's gguf-split produces multi-file GGUFs (canonical naming: `<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full metadata header but only owns its own slice of tensors. The current `GgufFile::open` reads one file, so multi-shard models (Kimi K2.6 with 14 shards, DeepSeek-V4-Flash with 3 shards, and increasingly any large modern LLM) could not be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>` field on `GgufFile`. Single-file GGUFs get a `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard via the explicit `split.count` metadata key, falling back to the filename pattern when the splitter omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing filenames at the prefix's chosen width (`00001-of-00014` vs `001-of-003` both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging them with the right `shard_idx`. Cross-checks the total against `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and reads each tensor from `shards[info.shard_idx].path` at the right per-shard `data_offset`. Shards whose tensors are all skipped by `skip_key` are never opened.

Backward-compatible: existing `GgufFile::open` callers and the single-file test fixtures keep working with `shards = vec![…one…]`.

Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection, mismatched widths rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from any-position shard, error when sibling missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with #96 (MLA absorption), #103 (Q3_K/Q5_K dequant), #133 (GGUF extract input), and #135 (DeepSeek-V2/V3 MLA metadata reading), this completes the chain: `larql extract --level inference` works end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard GGUFs.
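
The filename handling in that commit message (parse the `<prefix>-<NNNNN>-of-<NNNNN>.gguf` pattern, then rebuild sibling names at the same digit width) can be sketched as below. This is a hypothetical illustration: the tuple return type and the `sibling_name` helper are assumptions, and the real `parse_shard_filename` may differ in signature and validation rules.

```rust
/// Split a shard filename into (prefix, index, count, digit width).
/// Returns None for plain `.gguf` names or mismatched digit widths.
fn parse_shard_filename(name: &str) -> Option<(&str, usize, usize, usize)> {
    let stem = name.strip_suffix(".gguf")?;
    let (rest, count_str) = stem.rsplit_once("-of-")?;
    let (prefix, index_str) = rest.rsplit_once('-')?;
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(index_str)
        || !all_digits(count_str)
        || index_str.len() != count_str.len()
    {
        return None;
    }
    let index: usize = index_str.parse().ok()?;
    let count: usize = count_str.parse().ok()?;
    // Shard indices are 1-based and cannot exceed the shard count.
    if index == 0 || index > count {
        return None;
    }
    Some((prefix, index, count, index_str.len()))
}

/// Rebuild the filename of shard `index`, zero-padded to `width` digits.
fn sibling_name(prefix: &str, index: usize, count: usize, width: usize) -> String {
    format!("{prefix}-{index:0width$}-of-{count:0width$}.gguf")
}
```

Keeping the parsed width around is what lets `00001-of-00014` and `001-of-003` style names both round-trip when the sibling list is reconstructed.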

chrishayuk mentioned this pull request May 22, 2026

feat: MLA absorption for DeepSeek V2/V3 — fuse low-rank Q/K/V into standard dense tensors #96

Merged

4 tasks

mvkorobkov force-pushed the feat/q3k-q5k-dequant branch from 179a07b to 6b2cd1b Compare May 23, 2026 16:46

chrishayuk force-pushed the feat/q3k-q5k-dequant branch from 6b2cd1b to f2a4c34 Compare May 23, 2026 22:18

chrishayuk merged commit 08aae28 into chrishayuk:main May 23, 2026
27 checks passed

This was referenced May 24, 2026

feat(gguf): surface MLA metadata for DeepSeek-V2/V3 + Kimi K2 — closes #67 #135

Merged

feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits #136

Closed
