feat(ggml): add Q3_K and Q5_K dequantization (types 11 and 13) #103

Merged
chrishayuk merged 1 commit into chrishayuk:main from mvkorobkov:feat/q3k-q5k-dequant on May 23, 2026

Conversation

@mvkorobkov

What

Implements scalar dequantization for two missing K-quant formats:

| Type | ID | Block size | Elements/block |
|------|----|------------|----------------|
| Q3_K | 11 | 110 bytes  | 256            |
| Q5_K | 13 | 176 bytes  | 256            |

Both tensor_data_size() and dequantize() in mod.rs are wired up.
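
The size arithmetic implied by the table can be sketched as follows. This is only an illustration, not the code in mod.rs: the function name `k_quant_data_size_sketch` and its `Result<usize, String>` signature are hypothetical, while the block sizes and type IDs come from the table above.

```rust
// Hypothetical sketch of the per-type size calculation; not the PR's actual code.
const QK_K: usize = 256; // elements per K-quant super-block
const Q3_K_BLOCK_BYTES: usize = 110;
const Q5_K_BLOCK_BYTES: usize = 176;

/// Returns the number of bytes occupied by `n_elements` values of the given
/// GGML type id (only 11 = Q3_K and 13 = Q5_K are handled in this sketch).
fn k_quant_data_size_sketch(type_id: u32, n_elements: usize) -> Result<usize, String> {
    let block_bytes = match type_id {
        11 => Q3_K_BLOCK_BYTES,
        13 => Q5_K_BLOCK_BYTES,
        other => return Err(format!("unsupported type id {}", other)),
    };
    // K-quants store whole 256-element blocks, so a partial block is rejected.
    if n_elements % QK_K != 0 {
        return Err(format!("{} elements is not a multiple of {}", n_elements, QK_K));
    }
    Ok(n_elements / QK_K * block_bytes)
}
```

At these block sizes Q3_K works out to 110 × 8 / 256 = 3.4375 bits per weight and Q5_K to 176 × 8 / 256 = 5.5 bits per weight.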

Why

Without these, larql convert gguf-to-vindex fails with unsupported type id 11/13 on any model that uses Q3_K or Q5_K tensors. This includes:

  • DeepSeek-R1-0528-Qwen3-8B-Q3_K_L — 145 Q3_K tensors + 108 Q5_K tensors + 1 Q6_K
  • DeepSeek-V4-Flash-Q3_K_M (multi-shard, separate PR) — same types
  • Any model quantised with llama.cpp Q3_K_S / Q3_K_M / Q3_K_L / Q5_K_S / Q5_K_M

Implementation

q3_k.rs: dequantize_q3_k()

  • Block layout: hmask[0..32] · qs[32..96] · scales[96..108] · d[108..110]
  • unpack_q3k_scales(): 12 bytes → 16 six-bit signed values using the kmask1=0x03030303 / kmask2=0x0F0F0F0F shuffle from dequantize_row_q3_K in llama.cpp
  • Two-half loop (128 + 128 elements), m bitmask walks through hmask; clear bit → subtract 4 from q2 value
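
A minimal sketch of the two steps above, following the structure of dequantize_row_q3_K in llama.cpp. The function names are hypothetical and the real q3_k.rs may be organised differently. In particular, subtracting 32 inside the unpack helper is an assumption, and the final multiplication by `d` and the per-16-element scale is left out to keep the example short.

```rust
const KMASK1: u32 = 0x0303_0303; // selects the 2 high bits of each scale
const KMASK2: u32 = 0x0F0F_0F0F; // selects the 4 low bits of each scale

/// Sketch: expand the 12 packed scale bytes into 16 six-bit values,
/// re-centred so they land in -32..=31 (assumed to happen here).
fn unpack_q3k_scales_sketch(raw: &[u8; 12]) -> [i8; 16] {
    let a0 = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let a1 = u32::from_le_bytes([raw[4], raw[5], raw[6], raw[7]]);
    let tmp = u32::from_le_bytes([raw[8], raw[9], raw[10], raw[11]]);
    let aux = [
        (a0 & KMASK2) | ((tmp & KMASK1) << 4),
        (a1 & KMASK2) | (((tmp >> 2) & KMASK1) << 4),
        ((a0 >> 4) & KMASK2) | (((tmp >> 4) & KMASK1) << 4),
        ((a1 >> 4) & KMASK2) | (((tmp >> 6) & KMASK1) << 4),
    ];
    let mut out = [0i8; 16];
    for (w, word) in aux.iter().enumerate() {
        for (b, byte) in word.to_le_bytes().iter().enumerate() {
            out[4 * w + b] = (*byte as i8) - 32; // each byte is 0..=63
        }
    }
    out
}

/// Sketch: recover the 256 signed 3-bit quants (before scaling).
/// Element `i` would then be multiplied by d * scales[i / 16].
fn q3k_quants_sketch(hmask: &[u8; 32], qs: &[u8; 64]) -> [i8; 256] {
    let mut out = [0i8; 256];
    let mut y = 0;
    for half in 0..2 {
        let q = &qs[half * 32..half * 32 + 32];
        for j in 0..4 {
            let shift = 2 * j; // which 2-bit field of each qs byte
            let m = 1u8 << (half * 4 + j); // which hmask bit carries the high bit
            for l in 0..32 {
                let low = ((q[l] >> shift) & 3) as i8;
                // A clear hmask bit means the high bit is 0, so subtract 4.
                out[y] = if hmask[l] & m != 0 { low } else { low - 4 };
                y += 1;
            }
        }
    }
    out
}
```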

q5_k.rs: dequantize_q5_k()

  • Block layout: d[0..2] · dmin[2..4] · scales[4..16] · qh[16..48] · qs[48..176]
  • Reuses pub(super) unpack_q4k_scales() from q4_k.rs (same 12-byte format as Q4_K)
  • u1/u2 bitmask pair walks through qh; set bit → add 16 to 4-bit nibble
  • 4 iterations × 64 elements, matching dequantize_row_q5_K in llama.cpp
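
The inner loop described above can be sketched like this, mirroring dequantize_row_q5_K in llama.cpp. It is an illustration under stated assumptions: the function name is hypothetical, and `d`, `dmin`, `scales` and `mins` are taken as already decoded (f16 conversion and the 12-byte scale unpack are out of scope here), which may not match how q5_k.rs passes them around.

```rust
/// Sketch of one Q5_K block: 4 iterations x 64 elements.
/// `scales` and `mins` are the 8 six-bit sub-block scales and minimums,
/// assumed to be unpacked beforehand.
fn dequantize_q5k_block_sketch(
    d: f32,
    dmin: f32,
    scales: &[u8; 8],
    mins: &[u8; 8],
    qh: &[u8; 32],
    qs: &[u8; 128],
) -> [f32; 256] {
    let mut out = [0.0f32; 256];
    for iter in 0..4 {
        let ql = &qs[iter * 32..iter * 32 + 32];
        // u1 selects the high bit for the low nibbles, u2 for the high nibbles.
        let u1 = 1u8 << (2 * iter);
        let u2 = 2u8 << (2 * iter);
        let (d1, m1) = (d * scales[2 * iter] as f32, dmin * mins[2 * iter] as f32);
        let (d2, m2) = (d * scales[2 * iter + 1] as f32, dmin * mins[2 * iter + 1] as f32);
        let base = iter * 64;
        for l in 0..32 {
            // A set qh bit adds 16, turning the 4-bit nibble into a 5-bit value.
            let hi1: u8 = if qh[l] & u1 != 0 { 16 } else { 0 };
            let hi2: u8 = if qh[l] & u2 != 0 { 16 } else { 0 };
            out[base + l] = d1 * ((ql[l] & 0x0F) + hi1) as f32 - m1;
            out[base + 32 + l] = d2 * ((ql[l] >> 4) + hi2) as f32 - m2;
        }
    }
    out
}
```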

q4_k.rs: unpack_q4k_scales visibility changed fn → pub(super) so Q5_K can share it without duplication.
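
To illustrate how `pub(super)` lets a sibling module reuse the helper without exposing it outside the parent, here is a small sketch. The module layout and the names `unpack_scales_sketch` and `scale_and_min` are hypothetical. The bit layout follows get_scale_min_k4 in llama.cpp, and the real unpack_q4k_scales may return its values in a different shape.

```rust
mod ggml {
    pub mod q4_k {
        /// Sketch: split the shared 12-byte field into 8 six-bit scales and
        /// 8 six-bit minimums. Visible to `ggml` and its child modules only.
        pub(super) fn unpack_scales_sketch(raw: &[u8; 12]) -> ([u8; 8], [u8; 8]) {
            let mut scales = [0u8; 8];
            let mut mins = [0u8; 8];
            for j in 0..4 {
                // First four entries: plain low six bits.
                scales[j] = raw[j] & 63;
                mins[j] = raw[j + 4] & 63;
                // Last four entries: a nibble from bytes 8..12 plus the two
                // spare top bits of bytes 0..8.
                scales[j + 4] = (raw[j + 8] & 0x0F) | ((raw[j] >> 6) << 4);
                mins[j + 4] = (raw[j + 8] >> 4) | ((raw[j + 4] >> 6) << 4);
            }
            (scales, mins)
        }
    }

    pub mod q5_k {
        // Allowed because q5_k is a sibling under the same parent module.
        use super::q4_k::unpack_scales_sketch;

        /// Hypothetical accessor showing the shared helper in use.
        pub fn scale_and_min(raw: &[u8; 12], j: usize) -> (u8, u8) {
            let (scales, mins) = unpack_scales_sketch(raw);
            (scales[j], mins[j])
        }
    }
}
```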

Testing

Unit tests in each module:

  • q3_k: zero-scale all-zero output, hmask-clear subtracts 4, wrong-size error
  • q5_k: zero-scale all-zero output, high-bit adds 16, wrong-size error

End-to-end: larql convert gguf-to-vindex on DeepSeek-R1-0528-Qwen3-8B-Q3_K_L.gguf completes through dequantization without errors (145 Q3_K + 108 Q5_K tensors dequantized cleanly).

All 82 existing larql-models tests continue to pass.

@chrishayuk
Owner

Hey @mvkorobkov — the Q3_K + Q5_K additions look great in isolation, but this branch is showing 920 files changed / +38.5K / −122K against current main. Looks like the base diverged significantly while the PR was open.

Could you rebase against current main? The Q3_K/Q5_K work itself should be a clean ~330-line addition to crates/larql-models/src/quant/ggml/. Right now the diff also deletes crates/larql-models/src/quant/{fp8,half,mxfp4}.rs which live on main, so it can't merge as-is.

If a rebase is painful, I can also cherry-pick just the Q3_K / Q5_K files onto a fresh branch from main and credit you on the commits (the pattern we used for #91 → #121 and #115 → #122). Let me know which you prefer.

@mvkorobkov
Author

Rebased onto current main (810f163). Branch is now a single 349-line commit — feat(ggml): add Q3_K and Q5_K dequantization (types 11 and 13) — touching only crates/larql-models/src/quant/ggml/{mod.rs,q3_k.rs,q4_k.rs,q5_k.rs}. Dropped the workflow-deletion commit you flagged. cargo check -p larql-models passes. Ready for review.

@chrishayuk force-pushed the feat/q3k-q5k-dequant branch from 6b2cd1b to f2a4c34 on May 23, 2026 22:18
Implements scalar dequantize for Q3_K (110 B/block) and Q5_K (176 B/block)
so that DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models can be converted
via larql gguf-to-vindex.

- q3_k.rs: unpack_q3k_scales (kmask1/kmask2 per llama.cpp), two-half-block
  loop with m-bitmask for high bits, signed-scale centred at 32.
- q5_k.rs: reuses pub(super) unpack_q4k_scales from q4_k; u1/u2 mask walk
  for high bits, 4 iterations of 64 elements each.
- mod.rs: Q3_K_BLOCK_BYTES=110, Q5_K_BLOCK_BYTES=176, dispatch in
  tensor_data_size() and dequantize().
- q4_k.rs: unpack_q4k_scales promoted to pub(super) for Q5_K reuse.
@chrishayuk merged commit 08aae28 into chrishayuk:main on May 23, 2026
27 checks passed
chrishayuk pushed a commit that referenced this pull request May 25, 2026
 #67

llama.cpp emits DeepSeek-V2/V3 (and Kimi K2) MLA geometry in the GGUF
metadata under {arch}.attention.* and {arch}.rope.dimension_count.
`to_config_json` was dropping every one of these fields, so the parsed
ModelConfig had MLA disabled and PR #96's absorption never fired for
GGUF-sourced inputs.

This surfaces the relevant fields into the HF-shaped config the parser
consumes:

- `attention.q_lora_rank`       → `q_lora_rank`
- `attention.kv_lora_rank`      → `kv_lora_rank`
- `attention.key_length[_mla]`  → `qk_nope_head_dim` (= key_length − rope.dim)
- `attention.value_length[_mla]`→ `v_head_dim`
- `rope.dimension_count`        → `qk_rope_head_dim`

For per-head dims the loader prefers the `_mla` variants when present —
those carry the pre-absorption (DS-V3-standard) split that
`mla_absorb::absorb` operates on. Kimi K2.6's GGUF exposes both forms
(192/128 for `_mla`, 576/512 absorbed); we want 192/128.
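
The mapping and the `_mla` preference described above could look roughly like the following. This is a sketch, not the actual `to_config_json`: the struct, the function name, the flat `HashMap<String, u32>` metadata shape, and the choice to gate on `kv_lora_rank` being present are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical container for the HF-shaped MLA fields.
#[derive(Debug, Default, PartialEq)]
struct MlaFieldsSketch {
    q_lora_rank: Option<u32>,
    kv_lora_rank: Option<u32>,
    qk_nope_head_dim: Option<u32>,
    qk_rope_head_dim: Option<u32>,
    v_head_dim: Option<u32>,
}

/// Sketch: read `{arch}.attention.*` and `{arch}.rope.dimension_count`,
/// preferring the `_mla` variants for the per-head dimensions.
fn mla_fields_sketch(arch: &str, meta: &HashMap<String, u32>) -> MlaFieldsSketch {
    let get = |suffix: &str| meta.get(&format!("{}.{}", arch, suffix)).copied();

    // Assumption: an architecture without kv_lora_rank is treated as non-MLA.
    let kv_lora_rank = match get("attention.kv_lora_rank") {
        Some(rank) => rank,
        None => return MlaFieldsSketch::default(),
    };
    let key_len = get("attention.key_length_mla").or_else(|| get("attention.key_length"));
    let val_len = get("attention.value_length_mla").or_else(|| get("attention.value_length"));
    let rope = get("rope.dimension_count");

    MlaFieldsSketch {
        q_lora_rank: get("attention.q_lora_rank"),
        kv_lora_rank: Some(kv_lora_rank),
        // qk_nope_head_dim = key_length - rope.dim, skipped if it would underflow.
        qk_nope_head_dim: key_len.zip(rope).and_then(|(k, r)| k.checked_sub(r)),
        qk_rope_head_dim: rope,
        v_head_dim: val_len,
    }
}
```

With both forms present (192/128 for `_mla`, 576/512 absorbed), the sketch picks 192/128, which is the split the commit message says `mla_absorb::absorb` expects.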

Verified against Kimi K2.6 UD-Q8_K_XL GGUF metadata (the unsloth name
is misleading — actual tensor types are BF16 + F32 + Q4_0, all already
supported by larql's existing dequant). Three new tests cover:

1. Kimi K2.6-shaped metadata → full MLA fields populated, MLA detected
2. Non-`_mla` variant fallback (DS-V2 with key_length only)
3. Non-MLA architectures (llama) keep their fields absent

281/281 larql-models tests pass. Combined with PR #96 + #103 + #133,
this unlocks inference-level extraction of Kimi K2 family and any
other DeepSeek-V2/V3 GGUF that exposes the standard MLA metadata.
chrishayuk pushed a commit that referenced this pull request May 25, 2026
llama.cpp's gguf-split produces multi-file GGUFs (canonical naming:
`<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full
metadata header but only owns its own slice of tensors. The current
`GgufFile::open` reads one file, so multi-shard models — Kimi K2.6
(14 shards), DeepSeek-V4-Flash (3 shards), and increasingly any large
modern LLM — could not be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>`
   field on `GgufFile`. Single-file GGUFs get a `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard via the explicit `split.count`
   metadata key, falling back to the filename pattern when the splitter
   omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing
   filenames at the prefix's chosen width (`00001-of-00014` vs `001-of-003`
   both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging
   them with the right `shard_idx`. Cross-checks the total against
   `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and
   reads each tensor from `shards[info.shard_idx].path` at the right
   per-shard `data_offset`. Shards whose tensors are all skipped by
   `skip_key` are never opened.
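
The filename handling in steps 2 and 3 can be sketched as below. The struct and function names are hypothetical and the real `parse_shard_filename` and `discover_shard_siblings` may differ in signature and error handling. This version only deals with file names, not directories or metadata.

```rust
/// Hypothetical parsed form of `<prefix>-<NNNNN>-of-<NNNNN>.gguf`.
#[derive(Debug, PartialEq)]
struct ShardNameSketch {
    prefix: String,
    index: u32, // 1-based shard number
    count: u32, // total number of shards
    width: usize, // digit width used by the splitter
}

/// Sketch: returns None for plain `.gguf` names, mismatched digit widths,
/// or an index outside 1..=count.
fn parse_shard_filename_sketch(name: &str) -> Option<ShardNameSketch> {
    let stem = name.strip_suffix(".gguf")?;
    let (rest, count_str) = stem.rsplit_once("-of-")?;
    let (prefix, index_str) = rest.rsplit_once('-')?;
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(index_str) || !all_digits(count_str) {
        return None;
    }
    if index_str.len() != count_str.len() || prefix.is_empty() {
        return None;
    }
    let index: u32 = index_str.parse().ok()?;
    let count: u32 = count_str.parse().ok()?;
    if index == 0 || index > count {
        return None;
    }
    Some(ShardNameSketch {
        prefix: prefix.to_string(),
        index,
        count,
        width: index_str.len(),
    })
}

/// Sketch: rebuild every sibling filename at the same digit width.
fn sibling_filenames_sketch(shard: &ShardNameSketch) -> Vec<String> {
    (1..=shard.count)
        .map(|i| {
            format!(
                "{}-{:0w$}-of-{:0w$}.gguf",
                shard.prefix,
                i,
                shard.count,
                w = shard.width
            )
        })
        .collect()
}
```

Because the siblings are reconstructed from the parsed prefix and width, opening any shard in the set yields the same list, which is what lets the loader start from an arbitrary shard.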

Backward-compatible: existing `GgufFile::open` callers and the
single-file test fixtures keep working with `shards = vec![…one…]`.

Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection,
  mismatched widths rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from any-position
  shard, error when sibling missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real
  2-shard GGUFs with disjoint tensor sets, opens via either shard,
  verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with #96 (MLA absorption), #103 (Q3_K/Q5_K dequant), #133
(GGUF extract input), and #135 (DeepSeek-V2/V3 MLA metadata reading),
this completes the chain — `larql extract --level inference` works
end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard
GGUFs.