feat: decoder model engine — LastToken/PrePooledU8 pooling, KV-cache injection, static batch guard #237
Open
CrispStrobe wants to merge 9 commits into Anush008:main from
Conversation
…ction, static batch guard
New pooling modes
- `Pooling::LastToken`: takes the last non-padding token's embedding,
required by Qwen3-Embedding-family decoder models
- `Pooling::PrePooledU8 { scale, zero_point }`: affine dequantization
`f32 = (u8 - zero_point) × scale` for calibrated uint8 ONNX outputs
(e.g. `electroglyph/Qwen3-Embedding-0.6B-onnx-uint8`)
- `dequant_u8()` helper; `select_output_u8()` on `SingleBatchOutput`
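The two pooling modes above can be sketched in a few lines. This is illustrative, not the crate's actual implementation; the function names mirror the PR's helpers but the signatures are assumptions.

```rust
// LastToken: index of the last non-padding position (attention_mask == 1).
fn last_token_index(attention_mask: &[i64]) -> Option<usize> {
    // rposition scans from the right, so padding after the sequence is skipped
    attention_mask.iter().rposition(|&m| m == 1)
}

// PrePooledU8: affine dequantization f32 = (u8 - zero_point) * scale.
fn dequant_u8(bytes: &[u8], scale: f32, zero_point: u8) -> Vec<f32> {
    bytes
        .iter()
        .map(|&b| (f32::from(b) - f32::from(zero_point)) * scale)
        .collect()
}

fn main() {
    // right-padded sequence: real tokens at positions 0..=2, padding after
    assert_eq!(last_token_index(&[1, 1, 1, 0, 0]), Some(2));
    // zero_point = 128 maps the u8 midpoint to 0.0
    assert_eq!(dequant_u8(&[128, 130, 126], 0.5, 128), vec![0.0, 1.0, -1.0]);
}
```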
Output precedence
- `sentence_embedding` is now preferred over `last_hidden_state` when
both outputs are present; models that only expose `last_hidden_state`
fall through unchanged
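The precedence rule reduces to a name lookup over the session's output list; a minimal sketch (the helper name and signature are illustrative, not the crate's API):

```rust
// Prefer `sentence_embedding` when the model exposes it; otherwise fall
// back to `last_hidden_state`, matching the behavior described above.
fn select_output_name<'a>(names: &[&'a str]) -> Option<&'a str> {
    if names.contains(&"sentence_embedding") {
        Some("sentence_embedding")
    } else if names.contains(&"last_hidden_state") {
        Some("last_hidden_state")
    } else {
        None
    }
}

fn main() {
    // both present: the pre-pooled output wins
    assert_eq!(
        select_output_name(&["last_hidden_state", "sentence_embedding"]),
        Some("sentence_embedding")
    );
    // encoder-style models fall through unchanged
    assert_eq!(
        select_output_name(&["last_hidden_state"]),
        Some("last_hidden_state")
    );
}
```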
Auto-injection in transform()
- `position_ids [[0,1,…,seq-1],…]`: injected when session has a
`position_ids` input (dynamo-exported decoder models)
- `task_id = 1`: injected when session has a `task_id` input
(Jina-embeddings-v3 LoRA adapter selection)
- `past_key_values.N.key/value [batch, kv_heads, 0, head_dim]`:
injected for each layer when KV-cache inputs are detected
(onnx-community-style exports)
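The injected tensors can be sketched as plain shape/value builders (illustrative only — the real code builds `ort` input tensors):

```rust
// position_ids: one [0, 1, …, seq-1] row per batch item.
fn position_ids(batch: usize, seq: usize) -> Vec<Vec<i64>> {
    (0..batch).map(|_| (0..seq as i64).collect()).collect()
}

// An "empty" KV cache has 0 along the sequence axis, so the first forward
// pass behaves as if nothing has been cached yet.
fn empty_kv_shape(batch: usize, kv_heads: usize, head_dim: usize) -> [usize; 4] {
    [batch, kv_heads, 0, head_dim]
}

fn main() {
    assert_eq!(position_ids(2, 3), vec![vec![0, 1, 2], vec![0, 1, 2]]);
    assert_eq!(empty_kv_shape(2, 8, 64), [2, 8, 0, 64]);
}
```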
Static batch guard
- `new()` reads the `input_ids` shape; a positive batch dimension means
the model was exported with a fixed batch size — `transform()` now
returns a descriptive error instead of an opaque ORT shape mismatch
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
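The static batch guard boils down to a pre-flight check; a minimal sketch, assuming the guard returns a plain `Result` (the PR uses `anyhow::Error`, and the message wording here is invented):

```rust
// If the exported `input_ids` batch dimension is a positive constant,
// reject oversized batches up front with a descriptive error instead of
// letting ORT fail with an opaque shape mismatch.
fn check_batch(max_batch_size: Option<usize>, requested: usize) -> Result<(), String> {
    match max_batch_size {
        Some(max) if requested > max => Err(format!(
            "model was exported with a fixed batch size of {max}, got {requested} inputs"
        )),
        _ => Ok(()),
    }
}

fn main() {
    // fixed-batch export: oversized requests are rejected before inference
    assert!(check_batch(Some(1), 4).is_err());
    // dynamic batch dimension: no restriction
    assert!(check_batch(None, 4).is_ok());
}
```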
…_lazy_continuation

The `mean` function's doc comment was accidentally placed above `dequant_u8`, causing clippy to see `/// *` list items followed immediately by `/// Dequantize...` and flag it as a list continuation without indentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`FASTEMBED_CACHE_DIR` now accepts a colon-separated list of paths:
FASTEMBED_CACHE_DIR=/fast/ssd/models:/slow/backup/models
All three model retrieval paths (text embedding, sparse embedding,
reranking) search the list in order and use the first directory that
contains a complete hf-hub snapshot for the requested model. If no
directory has the model, it is downloaded into the first directory.
New public helpers:
- `get_cache_dirs() -> Vec<PathBuf>` — parses the env var
- `find_model_cache_dir(model_code, dirs)` — locates an existing snapshot
`get_cache_dir()` is preserved unchanged for backwards compatibility.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
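The search order above can be sketched as follows. The helper name mirrors the PR's `find_model_cache_dir`, but the "complete hf-hub snapshot" check is simplified here to a directory-existence test, and the signature is an assumption:

```rust
use std::path::PathBuf;

// First cache directory that already contains the model, in list order.
fn find_model_cache_dir(model_code: &str, dirs: &[PathBuf]) -> Option<PathBuf> {
    dirs.iter().find(|d| d.join(model_code).is_dir()).cloned()
}

// Downloads always target the first configured directory.
fn download_target(dirs: &[PathBuf]) -> Option<&PathBuf> {
    dirs.first()
}

fn main() {
    let dirs = vec![
        PathBuf::from("/fast/ssd/models"),
        PathBuf::from("/slow/backup/models"),
    ];
    // neither directory exists here, so the lookup falls through…
    assert_eq!(find_model_cache_dir("some-model", &dirs), None);
    // …and a download would land in the first configured directory
    assert_eq!(download_target(&dirs), Some(&PathBuf::from("/fast/ssd/models")));
}
```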
…minism)

ORT parallel INT8 MatMul accumulation is non-deterministic across platforms and across runs. GTELargeENV15Q uses model_quantized.onnx (INT8 ONNX). Skip the exact-sum assertion; quality is covered by the semantic-ordering test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…divergence)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cEmbedMLongQ

These are upstream models with CI-validated (x86_64) checksums that we incorrectly replaced with skip-checksum. Local ARM64 may differ for INT8 models, but CI (x86_64) must hit the reference values.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Anush008 (Owner) reviewed Mar 24, 2026
```rust
pub fn get_cache_dirs() -> Vec<std::path::PathBuf> {
    std::env::var("FASTEMBED_CACHE_DIR")
        .unwrap_or_else(|_| DEFAULT_CACHE_DIR.into())
        .split(':')
```
This'll break on Windows.
Maybe std::env::split_paths?
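The reviewer's suggestion can be sketched like this: `std::env::split_paths` splits on the platform's `PATH` separator (`:` on Unix, `;` on Windows), so the same code parses `FASTEMBED_CACHE_DIR` portably. The helper name matches the PR's `get_cache_dirs`, but taking the raw string as a parameter is an illustration:

```rust
use std::path::PathBuf;

// Portable replacement for the hard-coded `.split(':')` the review flags.
fn get_cache_dirs(raw: &str) -> Vec<PathBuf> {
    std::env::split_paths(raw).collect()
}

fn main() {
    let dirs = get_cache_dirs("/fast/ssd/models:/slow/backup/models");
    // On Unix ':' is the separator, so this yields two entries; on Windows
    // the separator is ';' and the same string would stay a single entry.
    println!("{dirs:?}");
}
```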
Anush008 (Owner) reviewed Mar 24, 2026
```rust
///
/// Used for models (e.g. calibrated uint8 quantizations) whose output tensor
/// element type is `u8` rather than `f32`.
pub fn select_output_u8(
```
Duplicates the logic of select_output.
Please consider refactoring the duplicated bit into a private helper and using it in both places.
d61b03c to
4210837
…election deduplication
4210837 to
c73c46f
Summary
This PR adds engine-level support for decoder-style ONNX embedding models (e.g. onnx-community/Qwen3-Embedding-0.6B), which require different inference wiring than encoder models.
New `Pooling` variants
- `Pooling::LastToken` — extracts the last non-padding token's hidden state (causal/decoder models use this instead of CLS or mean)
- `Pooling::PrePooledU8 { scale, zero_point }` — reads a `uint8` output tensor and applies affine dequantization (`f32 = (u8 − zero_point) × scale`), enabling models that export a pre-quantized `sentence_embedding_quantized` output
Auto-detection in `TextEmbedding::new()`
The session inputs are inspected once at load time to auto-detect and store:
- `need_position_ids` — the session has a `position_ids` input
- `need_task_id` — the session has a `task_id` input
- `kv_cache_layers` / `kv_heads` / `head_dim` — from `past_key_values.N.key/value` inputs
- `max_batch_size` — from a non-`-1` batch dim on `input_ids`
`transform()` additions
- injects `position_ids`, `task_id`, and empty KV-cache tensors when the auto-detected flags are set
- rejects batches larger than `max_batch_size` with a clear `anyhow::Error` before hitting ORT
Test coverage
- `tests/local_models.rs` with skip-if-missing guards using the `LOCAL_MODELS_DIR` env var
- `tests/quality_bench.rs` for semantic quality metrics across all local models
- `Pooling::LastToken` on Octen-0.6B-INT8, `Pooling::PrePooledU8` on electroglyph/Qwen3-Embedding-0.6B-onnx-uint8
🤖 Generated with Claude Code