feat(lance-linalg): runtime SIMD dispatch for pre-Haswell x86_64 from-source builds by tobocop2 · Pull Request #6630 · lance-format/lance

tobocop2 · 2026-04-28T04:26:52Z

Tracks #6618. On x86_64 CPUs without AVX2 — Sandy Bridge / Ivy Bridge / Westmere on Intel, Bulldozer / Piledriver / Steamroller on AMD — import lancedb SIGILLs because the wheel bakes AVX2 + FMA into every compiled function with no runtime guard. numpy and pyarrow handle the same hardware via runtime CPU dispatch.

Summary

Adds 5-tier runtime SIMD dispatch (scalar / AVX / AVX+FMA / AVX2+FMA / AVX-512) to the f32/f64 hot kernels in lance-linalg::distance::{cosine, dot, l2, norm_l2}. Same match *SIMD_SUPPORT + mod x86 { #[target_feature] pub unsafe fn ... } shape as dot_u8.rs / cosine_u8.rs / l2_u8.rs. Where the AVX2 and AVX+FMA kernel bodies use no AVX2-specific intrinsics, the dispatch matches Avx2 | AvxFma to a shared kernel.
Adds lance.simd_info() Python introspection mirroring pyarrow.runtime_info() so users can verify which tier the runtime selected.
Adds a qemu-pre-haswell CI job that builds with RUSTFLAGS="-C target-cpu=x86-64-v2" (env-var-scoped to that one job — workspace .cargo/config.toml is unchanged) and runs lance-linalg lib tests under qemu-x86_64 -cpu Nehalem.
Documents the legacy build path in CONTRIBUTING.md: RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build --release.

Per westonpace's review on lancedb/lancedb#3324, the workspace baseline stays at target-cpu=haswell. Modern wheels are unchanged; legacy users opt into the lower baseline at build time.
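The dispatch shape described in the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Tier` enum, `detect_tier`, `tier`, and the `dot_*` kernels are hypothetical names standing in for `SimdSupport`, `SIMD_SUPPORT`, and the real kernels, and the AVX-512 tier is routed to the AVX+FMA kernel here only to keep the sketch short.

```rust
use std::sync::OnceLock;

/// Hypothetical tier enum for illustration; the PR's `SimdSupport` may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Scalar,
    Avx,
    AvxFma,
    Avx2Fma,
    Avx512,
}

/// Probe the CPU, highest tier first.
fn detect_tier() -> Tier {
    #[cfg(target_arch = "x86_64")]
    {
        let fma = is_x86_feature_detected!("fma");
        if is_x86_feature_detected!("avx512f") && fma {
            return Tier::Avx512;
        }
        if is_x86_feature_detected!("avx2") && fma {
            return Tier::Avx2Fma;
        }
        if is_x86_feature_detected!("avx") {
            return if fma { Tier::AvxFma } else { Tier::Avx };
        }
    }
    Tier::Scalar
}

/// Cached result of the probe, so detection runs once per process.
pub fn tier() -> Tier {
    static TIER: OnceLock<Tier> = OnceLock::new();
    *TIER.get_or_init(detect_tier)
}

pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
mod x86 {
    use std::arch::x86_64::*;

    /// AVX only: separate multiply and add, so no FMA instruction is needed.
    #[target_feature(enable = "avx")]
    pub unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
        let n = a.len().min(b.len());
        let mut acc = _mm256_setzero_ps();
        for i in (0..n - n % 8).step_by(8) {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        }
        finish(acc, a, b, n)
    }

    /// AVX + FMA: fused multiply-add; shared by the AVX+FMA and AVX2+FMA tiers.
    #[target_feature(enable = "avx,fma")]
    pub unsafe fn dot_avx_fma(a: &[f32], b: &[f32]) -> f32 {
        let n = a.len().min(b.len());
        let mut acc = _mm256_setzero_ps();
        for i in (0..n - n % 8).step_by(8) {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            acc = _mm256_fmadd_ps(va, vb, acc);
        }
        finish(acc, a, b, n)
    }

    /// Horizontal sum of the accumulator plus the scalar tail.
    #[target_feature(enable = "avx")]
    unsafe fn finish(acc: __m256, a: &[f32], b: &[f32], n: usize) -> f32 {
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        let mut sum: f32 = lanes.iter().sum();
        for i in n - n % 8..n {
            sum += a[i] * b[i];
        }
        sum
    }
}

/// Runtime dispatch: the `unsafe` calls are justified by the detected tier.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    match tier() {
        #[cfg(target_arch = "x86_64")]
        Tier::Avx => unsafe { x86::dot_avx(a, b) },
        // A fuller version would give Avx512 its own kernel.
        #[cfg(target_arch = "x86_64")]
        Tier::AvxFma | Tier::Avx2Fma | Tier::Avx512 => unsafe { x86::dot_avx_fma(a, b) },
        _ => dot_scalar(a, b),
    }
}
```

On non-x86_64 targets the `cfg`-gated arms disappear and every call falls through to the scalar kernel.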

Benchmark

The AVX2 path on modern hardware is preserved as one of the per-tier kernels and the workspace baseline still bakes AVX2 into surrounding code, so by construction the modern compile is unchanged. Numbers still pending — Codespace's 30-min idle timeout killed my last full cargo bench -p lance-linalg --bench {cosine,dot,l2,norm_l2} run mid-suite (even with nohup — the VM itself sleeps). If anyone can recommend a free resource that holds a benchmark for ~1 hour, or a maintainer-preferred narrower bench shape, I'd appreciate the pointer.

Pre-Haswell verification on Sandy Bridge Xeon E5-2609 (the hardware the published wheel SIGILLs on) via the companion lancedb wheel build: pre-PR pip install lancedb SIGILLs at import; post-PR a from-source build with the documented RUSTFLAGS override produces a wheel where import + table-create + vector-search all work at the AVX tier. PASS output in tobocop2/lancedb#2.

Test plan

cargo test -p lance-linalg --lib — 83/83 on aarch64 dev box
cargo clippy --all-targets -- -D warnings clean; cargo fmt --check clean; Cargo.lock unchanged
11 proptest cases verifying scalar↔SIMD bit-for-bit equivalence per tier per kernel; gated on is_x86_feature_detected!() so each runs on hosts that can execute its tier
SIGILL repro confirmed gone on Sandy Bridge Xeon E5-2609 with the documented RUSTFLAGS override
qemu-pre-haswell CI gate green (lights up when this PR runs CI)
Modern-hardware bench delta posted
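For readers unfamiliar with the equivalence checks in the test plan, the idea can be sketched as below. This is an assumption-laden illustration rather than the PR's tests: it uses a small deterministic generator instead of proptest, covers only a hypothetical AVX-tier squared-L2 kernel (`l2_avx`), and relies on a scalar reference (`l2_reference`) written to accumulate in the same eight-lane order as the kernel, which is what makes a bit-for-bit comparison possible.

```rust
/// Deterministic pseudo-random f32 in [-1, 1); a stand-in for proptest's generators.
fn next_f32(state: &mut u32) -> f32 {
    *state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
    (*state >> 8) as f32 / (1u32 << 23) as f32 - 1.0
}

/// Scalar reference that mirrors the kernel's eight-lane accumulation order,
/// so results can match bit for bit rather than within a tolerance.
pub fn l2_reference(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let body = n - n % 8;
    let mut lanes = [0.0f32; 8];
    for i in 0..body {
        let d = a[i] - b[i];
        lanes[i % 8] += d * d;
    }
    let mut sum = 0.0f32;
    for lane in lanes.iter() {
        sum += lane;
    }
    for i in body..n {
        let d = a[i] - b[i];
        sum += d * d;
    }
    sum
}

/// Hypothetical AVX-tier kernel: separate multiply and add, no FMA.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn l2_avx(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let body = n - n % 8;
    let mut acc = _mm256_setzero_ps();
    for i in (0..body).step_by(8) {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        let d = _mm256_sub_ps(va, vb);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(d, d));
    }
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum = 0.0f32;
    for lane in lanes.iter() {
        sum += lane;
    }
    for i in body..n {
        let d = a[i] - b[i];
        sum += d * d;
    }
    sum
}

/// Returns `None` when the host cannot execute the AVX tier,
/// otherwise the number of cases compared.
pub fn check_l2_avx_matches_reference(cases: usize) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            let mut state = 0x1234_5678u32;
            for case in 0..cases {
                // Lengths 0..36 cover empty, tail-only and multi-chunk inputs.
                let len = case % 37;
                let a: Vec<f32> = (0..len).map(|_| next_f32(&mut state)).collect();
                let b: Vec<f32> = (0..len).map(|_| next_f32(&mut state)).collect();
                // SAFETY: AVX support was detected just above.
                let simd = unsafe { l2_avx(&a, &b) };
                assert_eq!(simd.to_bits(), l2_reference(&a, &b).to_bits(), "len {len}");
            }
            return Some(cases);
        }
    }
    let _ = cases;
    None
}
```

The feature gate means the check is skipped, not failed, on hosts that cannot run the tier. An FMA-tier variant would need a reference built on `f32::mul_add`, since fused and unfused arithmetic round differently.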

To be transparent: this isn't my domain of expertise and the implementation is AI-generated — I stuck to the existing mod x86 { #[target_feature] } precedent and verified end-to-end on the failing hardware, but wanted to be upfront. Happy to roll in feedback.

The default lancedb wheel targets target-cpu=haswell (AVX2 + FMA + F16C) and SIGILLs at import on pre-Haswell silicon (Sandy Bridge / Ivy Bridge / Westmere on Intel; Bulldozer / Piledriver / Steamroller on AMD). Per westonpace's review on lancedb#3324 (fast by default, extra steps to work on legacy), the default wheel baseline stays fast for modern users; pre-Haswell users get a separately-published 'lancedb-compat' wheel. Adds a 'lancedb-compat' matrix entry to pypi-publish.yml that builds with RUSTFLAGS=-C target-cpu=x86-64-v2 (Nehalem-class baseline). The compat wheel relies on runtime SIMD dispatch in the embedded lance crate (landing via lance-format/lance#6630) to pick the appropriate kernel tier at load time, so it still goes fast on modern hardware. Generalizes build_linux_wheel and upload_wheel composites with optional package-name and rustflags inputs (defaults preserve current behavior for the existing 4 lancedb matrix entries). Documents the choice in python/README.md: pip install lancedb-compat for pre-Haswell hosts; same import lancedb API. Maintainer setup required before this can ship: register lancedb-compat on PyPI and configure trusted publishing. Blocked on lance-format/lance#6630.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

…-source builds

Adds 5-tier runtime SIMD dispatch (scalar / AVX / AVX+FMA / AVX2+FMA / AVX-512) to the 10 hot f32/f64 distance kernels in lance-linalg, matching the per-tier dispatch shape already used by dot_u8.rs / cosine_u8.rs / l2_u8.rs and the f16/bf16 paths in norm_l2.rs.

Today, `import lancedb` SIGILLs on x86_64 CPUs without AVX2 (Intel Sandy Bridge / Ivy Bridge / Westmere; AMD Bulldozer / Piledriver / Steamroller) because the wheel bakes AVX2 into every compiled function with no runtime guard. numpy and pyarrow handle the same hardware via runtime dispatch; this PR brings lance to parity for the from-source legacy build path.

Per westonpace's review feedback on lancedb/lancedb#3324 (fast by default, extra steps to work on legacy), the workspace .cargo/config.toml baseline stays at target-cpu=haswell. The qemu-pre-haswell CI job sets RUSTFLAGS=-C target-cpu=x86-64-v2 only in its own env block so the runtime-dispatch path gets exercised under qemu Nehalem without affecting any other build. CONTRIBUTING.md documents the from-source legacy build for users on pre-Haswell hardware.

Changes:
- lance-core::utils::cpu: extend SimdSupport with Avx and AvxFma tiers, add #[non_exhaustive] for forward compatibility, add SimdInfo + simd_info() introspection API mirroring pyarrow.runtime_info().
- lance-linalg::distance::{cosine, dot, l2, norm_l2}: per-tier dispatch via match *SIMD_SUPPORT for all f32/f64 hot kernels. AVX-512 inner kernels use _mm512_* intrinsics directly; AVX+FMA kernels use the existing f32x8 / f64x8 SIMD types; AVX-only kernels use raw _mm256_mul/_mm256_add intrinsics because f32x8::multiply_add lowers to an FMA instruction. AVX2-host dispatch routes to the AvxFma kernel where the body uses no AVX2-specific intrinsics (Avx2 | AvxFma => x86::FOO_avx_fma), which eliminates ~480 lines of byte-identical AVX2/AvxFma per-tier function pairs across all 10 hot kernels. Documents AvxFma/Avx fall-through to scalar in the u8 and dist_table dispatchers (their AVX2 inners use integer ops not available below AVX2).
- lance-linalg::simd::{f32, f64}: drop the dead AVX-512 specializations of f32x16 / f64x8 (gated on target_feature="avx512f" which no project CI configuration enables). Per-tier dispatch in distance/* uses raw _mm512_* intrinsics directly. f32x8::gather adds a runtime AVX2 guard with a scalar fallback for x86_64 hosts without AVX2.
- python: expose lance.simd_info() (native binding + re-export in lance/__init__.py with a pytest) so users can verify which SIMD tier the runtime dispatched to without rebuilding.
- .github/workflows/rust.yml: new qemu-pre-haswell job that builds with RUSTFLAGS=-C target-cpu=x86-64-v2 and runs lance-linalg lib tests under qemu-x86_64 -cpu Nehalem, catching SIGILL leaks in the legacy-build path before they ship.
- CONTRIBUTING.md: documents the legacy build (RUSTFLAGS=-C target-cpu=x86-64-v2 cargo build --release).

Zero new external dependencies; Cargo.lock unchanged. lance-linalg lib tests all green (83/83). Verified end-to-end on Sandy Bridge Xeon E5-2609 via the companion lancedb wheel build: pre-PR `pip install lancedb` SIGILLs; post-PR a from-source build with the documented RUSTFLAGS override produces a wheel where import + vector search work at the AVX tier.

Closes lance-format#6618.
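The gather guard mentioned in the commit message might look something like the sketch below. It assumes a free function over plain arrays, not the PR's `f32x8` type; `gather8`, `gather8_scalar`, and `gather8_avx2` are hypothetical names. The AVX2 path uses the `_mm256_i32gather_ps` intrinsic, which requires AVX2, and hosts without it take the scalar loop.

```rust
/// Gather eight f32 values from `src` at the given element indices.
pub fn gather8(src: &[f32], idx: &[i32; 8]) -> [f32; 8] {
    // Validate up front so both paths reject out-of-range indices the same way.
    assert!(
        idx.iter().all(|&i| i >= 0 && (i as usize) < src.len()),
        "gather index out of range"
    );
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was just detected and every index is in bounds.
            return unsafe { gather8_avx2(src, idx) };
        }
    }
    gather8_scalar(src, idx)
}

/// Fallback for x86_64 hosts without AVX2 and for other architectures.
pub fn gather8_scalar(src: &[f32], idx: &[i32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        *o = src[i as usize];
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gather8_avx2(src: &[f32], idx: &[i32; 8]) -> [f32; 8] {
    use std::arch::x86_64::*;
    let vidx = _mm256_loadu_si256(idx.as_ptr() as *const __m256i);
    // SCALE = 4: indices count f32 elements, the instruction wants byte offsets.
    let v = _mm256_i32gather_ps::<4>(src.as_ptr(), vidx);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), v);
    out
}
```

Checking bounds before branching keeps the unsafe path from reading outside `src`, at the cost of one pass over the indices.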

…=avx2)

On AVX2-baseline builds (the default haswell wheel), cosine_batch uses inlined non-target_feature kernels (base-equivalent) plus a single per-batch AVX-512 check, avoiding the per-vector runtime-dispatch + target_feature tax that regressed the modern path. Sub-AVX2 builds keep the runtime dispatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

stumpylog · 2026-06-10T20:00:46Z

Benchmarked this PR on two modern x86_64 boxes (the "no slowdown on newer hardware" question). As-is it regresses the f32 cosine batch path on AVX2-only CPUs; a small refinement fixes that while keeping the AVX-512 win.

cargo bench -p lance-linalg --bench cosine, criterion, same-session base→PR→fix (Cosine(f32, scalar) is a build-independent control ≈ 0% on clean runs). auto-vectorized = cosine_distance_batch dim 1024; simd,f32x8 = dim 8.

PR as-is vs base:

| Benchmark | Broadwell (AVX2) | Zen 4 (AVX-512) |
|---|---|---|
| dim 1024 | +24% | −6% |
| dim 8 | +36% | +78% |

Cause: f32::cosine_batch does the match *SIMD_SUPPORT + a #[target_feature] call per vector. On a wheel already built with target-cpu=haswell, that's pure overhead vs the pre-PR inlined f32x16/f32x8 kernel.

Refinement: gate the dispatch on #[cfg(target_feature = "avx2")] — AVX2-baseline builds use the inlined (byte-identical to pre-PR) kernels with no per-vector dispatch, and dispatch to AVX-512 once per batch, large dims only (a masked 512-bit load loses to a plain AVX2 load at 8 lanes). Sub-AVX2 builds keep your runtime path.

After (fix vs base):

| Benchmark | Broadwell (AVX2) | Zen 4 (AVX-512) |
|---|---|---|
| dim 1024 | +6% | −7% |
| dim 8 | flat | −3% |

Fast by default, AVX-512 still wins where it helps.

Notes: the +6% at dim 1024 on AVX2 is a codegen-unit artifact (the kernel there is byte-identical to pre-PR), not the dispatch path. I only reworked f32 cosine — dot/l2/norm_l2 have the same per-vector shape and likely the same regression. Branch: https://github.com/stumpylog/lance/tree/cosine-batch-dispatch-fix — happy to PR it against your branch.
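As a rough illustration of the refinement described in this comment, the sketch below gates on the compile-time baseline and hoists any runtime check out of the per-vector loop. It is not the code on the linked branch: `cosine_body`, `cosine_avx`, `select_kernel`, and `cosine_batch` are hypothetical names, the AVX-512 per-batch check is omitted, and the sub-AVX2 path selects a kernel once per batch as a simplification.

```rust
/// Plain loop with no `#[target_feature]`: compiled for the build's baseline
/// and free to inline into its caller.
#[inline(always)]
fn cosine_body(x: &[f32], y: &[f32]) -> f32 {
    let (mut xy, mut xx, mut yy) = (0.0f32, 0.0f32, 0.0f32);
    for (a, b) in x.iter().zip(y) {
        xy += a * b;
        xx += a * a;
        yy += b * b;
    }
    1.0 - xy / (xx.sqrt() * yy.sqrt())
}

/// Same body recompiled with AVX enabled, for builds whose baseline is lower.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn cosine_avx(x: &[f32], y: &[f32]) -> f32 {
    cosine_body(x, y)
}

#[cfg(target_arch = "x86_64")]
fn cosine_avx_checked(x: &[f32], y: &[f32]) -> f32 {
    // SAFETY: only handed out by `select_kernel` after AVX was detected.
    unsafe { cosine_avx(x, y) }
}

fn cosine_portable(x: &[f32], y: &[f32]) -> f32 {
    cosine_body(x, y)
}

type Kernel = fn(&[f32], &[f32]) -> f32;

/// One runtime feature check, returning a kernel to reuse for a whole batch.
fn select_kernel() -> Kernel {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            return cosine_avx_checked;
        }
    }
    cosine_portable
}

/// Cosine distance from `query` to each `dim`-sized row of `data`.
pub fn cosine_batch(query: &[f32], data: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0 && query.len() == dim);
    if cfg!(target_feature = "avx2") {
        // AVX2-baseline build: the inlined kernel is already compiled for
        // AVX2, so no runtime dispatch is needed at all.
        data.chunks_exact(dim).map(|v| cosine_body(query, v)).collect()
    } else {
        // Lower-baseline build: pay for detection once, not once per vector.
        let kernel = select_kernel();
        data.chunks_exact(dim).map(|v| kernel(query, v)).collect()
    }
}
```

Because `cfg!` is resolved at compile time, the branch not taken is dead code the optimizer can drop, so a haswell-baseline build keeps only the inlined path.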

(Downstream context: paperless-ngx/paperless-ngx#12970.)

tobocop2 · 2026-06-10T21:08:36Z

I'd love the PR to my branch! Thank you!!!

fix(lance-linalg): gate f32 cosine dispatch behind cfg(target_feature=avx2)

tobocop2 · 2026-06-11T01:27:10Z

@westonpace

Just a friendly bump now that there are other folks with similar needs. I'm wondering if you could nudge someone on this team our way. Thank you very much.

github-actions Bot added labels enhancement (New feature or request) and A-python (Python bindings) Apr 28, 2026

tobocop2 mentioned this pull request Apr 28, 2026

feat(wheels): publish lancedb-compat for pre-Haswell x86_64 hosts lancedb/lancedb#3327

Draft

tobocop2 force-pushed the fix/runtime-simd-multiversion branch from a3df856 to 9193496 Compare April 28, 2026 04:59

tobocop2 force-pushed the fix/runtime-simd-multiversion branch 2 times, most recently from 7db5171 to 26aa7f4 Compare April 28, 2026 05:46

tobocop2 marked this pull request as ready for review May 26, 2026 23:19

claude Bot reviewed May 26, 2026

tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 8691248 to 448a275 Compare June 10, 2026 16:11

tobocop2 mentioned this pull request Jun 10, 2026

bug(python): import lancedb crashes with SIGILL on pre-Haswell x86_64 (Sandy Bridge / Ivy Bridge / FX-7500) lancedb/lancedb#3324

Open

tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 448a275 to 126a6a5 Compare June 10, 2026 16:24

github-actions Bot added labels A-index (Vector index, linalg, tokenizer) and A-ci (CI / build workflows) Jun 10, 2026

tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 126a6a5 to 26a520b Compare June 10, 2026 16:56

Merge branch 'main' into fix/runtime-simd-multiversion

30ea6f2

stumpylog mentioned this pull request Jun 10, 2026

fix(lance-linalg): gate f32 cosine dispatch behind cfg(target_feature=avx2) tobocop2/lance#3

Merged

tobocop2 added 2 commits June 10, 2026 21:25

Merge pull request #3 from stumpylog/cosine-batch-dispatch-fix

8577f3b

fix(lance-linalg): gate f32 cosine dispatch behind cfg(target_feature=avx2)

Merge branch 'main' into fix/runtime-simd-multiversion

df12632

tobocop2 added 3 commits June 12, 2026 14:06

Merge branch 'main' into fix/runtime-simd-multiversion

6a13758

Merge branch 'main' into fix/runtime-simd-multiversion

5390670

Merge branch 'main' into fix/runtime-simd-multiversion

02fd2da

tobocop2 wants to merge 8 commits into lance-format:main from tobocop2:fix/runtime-simd-multiversion
