feat(lance-linalg): runtime SIMD dispatch for pre-Haswell x86_64 from-source builds#6630

Open
tobocop2 wants to merge 8 commits into lance-format:main from tobocop2:fix/runtime-simd-multiversion

Conversation

@tobocop2 tobocop2 commented Apr 28, 2026

Tracks #6618. On x86_64 CPUs without AVX2 — Sandy Bridge / Ivy Bridge / Westmere on Intel, Bulldozer / Piledriver / Steamroller on AMD — import lancedb SIGILLs because the wheel bakes AVX2 + FMA into every compiled function with no runtime guard. numpy and pyarrow handle the same hardware via runtime CPU dispatch.

Summary

  • Adds 5-tier runtime SIMD dispatch (scalar / AVX / AVX+FMA / AVX2+FMA / AVX-512) to the f32/f64 hot kernels in lance-linalg::distance::{cosine, dot, l2, norm_l2}. Same match *SIMD_SUPPORT + mod x86 { #[target_feature] pub unsafe fn ... } shape as dot_u8.rs / cosine_u8.rs / l2_u8.rs. Where the AVX2 and AVX+FMA kernel bodies use no AVX2-specific intrinsics, the dispatch matches Avx2 | AvxFma to a shared kernel.
  • Adds lance.simd_info() Python introspection mirroring pyarrow.runtime_info() so users can verify which tier the runtime selected.
  • Adds a qemu-pre-haswell CI job that builds with RUSTFLAGS="-C target-cpu=x86-64-v2" (env-var-scoped to that one job — workspace .cargo/config.toml is unchanged) and runs lance-linalg lib tests under qemu-x86_64 -cpu Nehalem.
  • Documents the legacy build path in CONTRIBUTING.md: RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build --release.
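
As a rough illustration of the dispatch shape described above, here is a minimal sketch for a single f32 dot kernel. It is not the PR's code: `Tier`, `detect_tier`, `tier`, and the `dot_*` functions are hypothetical names standing in for `SimdSupport`, `*SIMD_SUPPORT`, and the real kernels, and the AVX-512 tier is routed to the AVX+FMA kernel only to keep the example short.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the PR's `SimdSupport` tiers.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Scalar,
    Avx,
    AvxFma,
    Avx2,
    Avx512,
}

/// Probe the CPU once and return the highest tier it supports.
pub fn detect_tier() -> Tier {
    #[cfg(target_arch = "x86_64")]
    {
        let fma = is_x86_feature_detected!("fma");
        if is_x86_feature_detected!("avx512f") && fma {
            return Tier::Avx512;
        }
        if is_x86_feature_detected!("avx2") && fma {
            return Tier::Avx2;
        }
        if is_x86_feature_detected!("avx") {
            return if fma { Tier::AvxFma } else { Tier::Avx };
        }
    }
    Tier::Scalar
}

/// Cached tier, playing the role of `*SIMD_SUPPORT`.
pub fn tier() -> Tier {
    static TIER: OnceLock<Tier> = OnceLock::new();
    *TIER.get_or_init(detect_tier)
}

pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
mod x86 {
    use std::arch::x86_64::*;

    /// Sum the 8 lanes of `acc`, then add the scalar tail starting at index `i`.
    #[target_feature(enable = "avx")]
    unsafe fn finish(acc: __m256, a: &[f32], b: &[f32], i: usize) -> f32 {
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        let tail: f32 = a[i..].iter().zip(&b[i..]).map(|(x, y)| x * y).sum();
        lanes.iter().sum::<f32>() + tail
    }

    /// AVX-only tier: separate multiply and add, so no FMA instruction is needed.
    #[target_feature(enable = "avx")]
    pub unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
        let n = a.len().min(b.len());
        let mut acc = _mm256_setzero_ps();
        let mut i = 0;
        while i + 8 <= n {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
            i += 8;
        }
        finish(acc, &a[..n], &b[..n], i)
    }

    /// AVX+FMA tier: uses no AVX2-specific intrinsics, so AVX2 hosts can share it.
    #[target_feature(enable = "avx,fma")]
    pub unsafe fn dot_avx_fma(a: &[f32], b: &[f32]) -> f32 {
        let n = a.len().min(b.len());
        let mut acc = _mm256_setzero_ps();
        let mut i = 0;
        while i + 8 <= n {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            acc = _mm256_fmadd_ps(va, vb, acc);
            i += 8;
        }
        finish(acc, &a[..n], &b[..n], i)
    }
}

/// Public entry point: pick a kernel from the cached tier.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    match tier() {
        #[cfg(target_arch = "x86_64")]
        Tier::Avx512 | Tier::Avx2 | Tier::AvxFma => unsafe { x86::dot_avx_fma(a, b) },
        #[cfg(target_arch = "x86_64")]
        Tier::Avx => unsafe { x86::dot_avx(a, b) },
        _ => dot_scalar(a, b),
    }
}
```

The `unsafe` calls are sound only because each arm is reached after the matching feature check in `detect_tier`. On non-x86_64 targets the `x86` module is compiled out and every tier falls through to the scalar kernel.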

Per westonpace's review on lancedb/lancedb#3324, the workspace baseline stays at target-cpu=haswell. Modern wheels are unchanged; legacy users opt into the lower baseline at build time.

Benchmark

The AVX2 path on modern hardware is preserved as one of the per-tier kernels and the workspace baseline still bakes AVX2 into surrounding code, so by construction the modern compile is unchanged. Numbers still pending — Codespace's 30-min idle timeout killed my last full cargo bench -p lance-linalg --bench {cosine,dot,l2,norm_l2} run mid-suite (even with nohup — the VM itself sleeps). If anyone can recommend a free resource that holds a benchmark for ~1 hour, or a maintainer-preferred narrower bench shape, I'd appreciate the pointer.

Pre-Haswell verification on Sandy Bridge Xeon E5-2609 (the hardware the published wheel SIGILLs on) via the companion lancedb wheel build: pre-PR pip install lancedb SIGILLs at import; post-PR a from-source build with the documented RUSTFLAGS override produces a wheel where import + table-create + vector-search all work at the AVX tier. PASS output in tobocop2/lancedb#2.

Test plan

  • cargo test -p lance-linalg --lib — 83/83 on aarch64 dev box
  • cargo clippy --all-targets -- -D warnings clean; cargo fmt --check clean; Cargo.lock unchanged
  • 11 proptest cases verifying scalar↔SIMD bit-for-bit equivalence per tier per kernel; gated on is_x86_feature_detected!() so each runs on hosts that can execute its tier
  • SIGILL repro confirmed gone on Sandy Bridge Xeon E5-2609 with the documented RUSTFLAGS override
  • qemu-pre-haswell CI gate green (lights up when this PR runs CI)
  • Modern-hardware bench delta posted
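
For readers unfamiliar with feature-gated equivalence tests, the sketch below shows one way such a check can be structured. It makes several assumptions that are not taken from the PR: it uses a small LCG instead of proptest so it has no dependencies, it covers only a hypothetical AVX dot kernel, and it gets bit-for-bit agreement by having the scalar reference mirror the kernel's 8-lane accumulation order.

```rust
/// Scalar reference that accumulates in the same 8-lane order as the AVX kernel,
/// so the two results can be compared bit-for-bit.
fn dot_reference(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut lanes = [0.0f32; 8];
    let mut i = 0;
    while i + 8 <= n {
        for l in 0..8 {
            lanes[l] += a[i + l] * b[i + l];
        }
        i += 8;
    }
    let mut sum = 0.0f32;
    for l in lanes {
        sum += l;
    }
    for j in i..n {
        sum += a[j] * b[j];
    }
    sum
}

/// Hypothetical AVX-tier kernel under test (multiply then add, no FMA).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let mut acc = _mm256_setzero_ps();
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        i += 8;
    }
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum = 0.0f32;
    for l in lanes {
        sum += l;
    }
    for j in i..n {
        sum += a[j] * b[j];
    }
    sum
}

/// Deterministic pseudo-random values in [-1, 1) from a 64-bit LCG.
fn lcg_vec(seed: &mut u64, len: usize) -> Vec<f32> {
    (0..len)
        .map(|_| {
            *seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Use the top 24 bits so the value is exactly representable as f32.
            ((*seed >> 40) as f32 / (1u64 << 24) as f32) * 2.0 - 1.0
        })
        .collect()
}

/// Returns `None` when the host cannot execute the AVX tier (check skipped),
/// otherwise `Some(true)` if every generated case matched bit-for-bit.
pub fn avx_matches_reference(cases: usize) -> Option<bool> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            let mut seed = 42u64;
            for case in 0..cases {
                // Lengths cycle through empty, sub-lane, and tail-carrying sizes.
                let len = case % 67;
                let a = lcg_vec(&mut seed, len);
                let b = lcg_vec(&mut seed, len);
                let simd = unsafe { dot_avx(&a, &b) };
                if simd.to_bits() != dot_reference(&a, &b).to_bits() {
                    return Some(false);
                }
            }
            return Some(true);
        }
    }
    let _ = cases;
    None
}
```

Returning `None` instead of failing lets the same check run on hosts that lack the tier, such as an aarch64 dev box, without a false failure. An FMA-tier check would need a reference built on `f32::mul_add` to stay bit-identical.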

To be transparent: this isn't my domain of expertise and the implementation is AI-generated — I stuck to the existing mod x86 { #[target_feature] } precedent and verified end-to-end on the failing hardware, but wanted to be upfront. Happy to roll in feedback.

@github-actions github-actions Bot added enhancement New feature or request A-python Python bindings labels Apr 28, 2026
@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch from a3df856 to 9193496 Compare April 28, 2026 04:59
tobocop2 added a commit to tobocop2/lancedb that referenced this pull request Apr 28, 2026
The default lancedb wheel targets target-cpu=haswell (AVX2 + FMA + F16C)
and SIGILLs at import on pre-Haswell silicon (Sandy Bridge / Ivy Bridge /
Westmere on Intel; Bulldozer / Piledriver / Steamroller on AMD). Per
westonpace's review on lancedb#3324 (fast by default, extra steps to work on
legacy), the default wheel baseline stays fast for modern users;
pre-Haswell users get a separately-published 'lancedb-compat' wheel.

Adds a 'lancedb-compat' matrix entry to pypi-publish.yml that builds
with RUSTFLAGS=-C target-cpu=x86-64-v2 (Nehalem-class baseline). The
compat wheel relies on runtime SIMD dispatch in the embedded lance crate
(landing via lance-format/lance#6630) to pick the appropriate kernel
tier at load time, so it still goes fast on modern hardware.

Generalizes build_linux_wheel and upload_wheel composites with optional
package-name and rustflags inputs (defaults preserve current behavior
for the existing 4 lancedb matrix entries). Documents the choice in
python/README.md: pip install lancedb-compat for pre-Haswell hosts;
same import lancedb API.

Maintainer setup required before this can ship: register lancedb-compat
on PyPI and configure trusted publishing.

Blocked on lance-format/lance#6630.
@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch 2 times, most recently from 7db5171 to 26aa7f4 Compare April 28, 2026 05:46
@tobocop2 tobocop2 marked this pull request as ready for review May 26, 2026 23:19

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 8691248 to 448a275 Compare June 10, 2026 16:11
tobocop2 added a commit to tobocop2/lancedb that referenced this pull request Jun 10, 2026
@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 448a275 to 126a6a5 Compare June 10, 2026 16:24
@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer A-ci CI / build workflows labels Jun 10, 2026
…-source builds

Adds 5-tier runtime SIMD dispatch (scalar / AVX / AVX+FMA / AVX2+FMA /
AVX-512) to the 10 hot f32/f64 distance kernels in lance-linalg, matching
the per-tier dispatch shape already used by dot_u8.rs / cosine_u8.rs /
l2_u8.rs and the f16/bf16 paths in norm_l2.rs.

Today, `import lancedb` SIGILLs on x86_64 CPUs without AVX2 (Intel Sandy
Bridge / Ivy Bridge / Westmere; AMD Bulldozer / Piledriver / Steamroller)
because the wheel bakes AVX2 into every compiled function with no runtime
guard. numpy and pyarrow handle the same hardware via runtime dispatch;
this PR brings lance to parity for the from-source legacy build path.

Per westonpace's review feedback on lancedb/lancedb#3324 — fast by default,
extra steps to work on legacy — the workspace .cargo/config.toml baseline
stays at target-cpu=haswell. The qemu-pre-haswell CI job sets
RUSTFLAGS=-C target-cpu=x86-64-v2 only in its own env block so the
runtime-dispatch path gets exercised under qemu Nehalem without affecting
any other build. CONTRIBUTING.md documents the from-source legacy build
for users on pre-Haswell hardware.

Changes:

- lance-core::utils::cpu: extend SimdSupport with Avx and AvxFma tiers,
  add #[non_exhaustive] for forward compatibility, add SimdInfo +
  simd_info() introspection API mirroring pyarrow.runtime_info().
- lance-linalg::distance::{cosine, dot, l2, norm_l2}: per-tier dispatch
  via match *SIMD_SUPPORT for all f32/f64 hot kernels. AVX-512 inner
  kernels use _mm512_* intrinsics directly; AVX+FMA kernels use the
  existing f32x8 / f64x8 SIMD types; AVX-only kernels use raw
  _mm256_mul/_mm256_add intrinsics because f32x8::multiply_add lowers
  to an FMA instruction. AVX2-host dispatch routes to the AvxFma kernel
  where the body uses no AVX2-specific intrinsics
  (Avx2 | AvxFma => x86::FOO_avx_fma) — eliminates ~480 lines of
  byte-identical AVX2/AvxFma per-tier function pairs across all 10 hot
  kernels. Documents AvxFma/Avx fall-through to scalar in the u8 and
  dist_table dispatchers (their AVX2 inners use integer ops not
  available below AVX2).
- lance-linalg::simd::{f32, f64}: drop the dead AVX-512 specializations
  of f32x16 / f64x8 (gated on target_feature="avx512f" which no project
  CI configuration enables). Per-tier dispatch in distance/* uses raw
  _mm512_* intrinsics directly. f32x8::gather adds a runtime AVX2 guard
  with a scalar fallback for x86_64 hosts without AVX2.
- python: expose lance.simd_info() (native binding + re-export in
  lance/__init__.py with a pytest) so users can verify which SIMD tier
  the runtime dispatched to without rebuilding.
- .github/workflows/rust.yml: new qemu-pre-haswell job that builds with
  RUSTFLAGS=-C target-cpu=x86-64-v2 and runs lance-linalg lib tests under
  qemu-x86_64 -cpu Nehalem, catching SIGILL leaks in the legacy-build
  path before they ship.
- CONTRIBUTING.md: documents the legacy build (RUSTFLAGS=-C target-cpu=x86-64-v2 cargo build --release).

Zero new external dependencies; Cargo.lock unchanged. lance-linalg lib
tests all green (83/83). Verified end-to-end on Sandy Bridge Xeon E5-2609
via the companion lancedb wheel build: pre-PR `pip install lancedb`
SIGILLs; post-PR a from-source build with the documented RUSTFLAGS
override produces a wheel where import + vector search work at the AVX
tier.

Closes lance-format#6618.
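
The `f32x8::gather` change listed above (a runtime AVX2 guard with a scalar fallback) can be pictured with the following sketch. The function names are hypothetical and the real method lives on the `f32x8` type; the bounds check before the intrinsic is a choice made for this example, not something the commit message describes.

```rust
/// Portable fallback: eight independent indexed loads (panics on a bad index).
pub fn gather8_scalar(src: &[f32], idx: &[i32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for (o, &i) in out.iter_mut().zip(idx) {
        *o = src[i as usize];
    }
    out
}

/// AVX2 path: a single hardware gather. Callers must guarantee AVX2 is
/// available and that every index is within `src`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gather8_avx2(src: &[f32], idx: &[i32; 8]) -> [f32; 8] {
    use std::arch::x86_64::*;
    let vidx = _mm256_loadu_si256(idx.as_ptr() as *const __m256i);
    // SCALE = 4: indices count f32 elements, each 4 bytes wide.
    let v = _mm256_i32gather_ps::<4>(src.as_ptr(), vidx);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), v);
    out
}

/// Guarded entry point: use the AVX2 gather only when the CPU reports AVX2.
pub fn gather8(src: &[f32], idx: &[i32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        let in_bounds = idx.iter().all(|&i| i >= 0 && (i as usize) < src.len());
        if in_bounds && is_x86_feature_detected!("avx2") {
            return unsafe { gather8_avx2(src, idx) };
        }
    }
    gather8_scalar(src, idx)
}
```

Without the guard, a build with a sub-AVX2 baseline would still emit the gather instruction inside this one function and fault on hosts that lack it.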
@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 126a6a5 to 26a520b Compare June 10, 2026 16:56
…=avx2)

On AVX2-baseline builds (the default haswell wheel), cosine_batch uses inlined
non-target_feature kernels (base-equivalent) plus a single per-batch AVX-512
check, avoiding the per-vector runtime-dispatch + target_feature tax that
regressed the modern path. Sub-AVX2 builds keep the runtime dispatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@stumpylog

Benchmarked this PR on two modern x86_64 boxes (the "no slowdown on newer hardware" question). As-is it regresses the f32 cosine batch path on AVX2-only CPUs; a small refinement fixes that while keeping the AVX-512 win.

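Method: cargo bench -p lance-linalg --bench cosine (criterion), with base, PR, and fix measured in the same session. Cosine(f32, scalar) is a build-independent control and stays at ≈ 0% on clean runs. The auto-vectorized benchmark is cosine_distance_batch at dim 1024; simd,f32x8 is dim 8. Percentages are the change in time relative to base, so positive means slower.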

PR as-is vs base:

| | Broadwell (AVX2) | Zen 4 (AVX-512) |
|---|---|---|
| dim 1024 | +24% | −6% |
| dim 8 | +36% | +78% |

Cause: f32::cosine_batch does the match *SIMD_SUPPORT + a #[target_feature] call per vector. On a wheel already built with target-cpu=haswell, that's pure overhead vs the pre-PR inlined f32x16/f32x8 kernel.

Refinement: gate the dispatch on #[cfg(target_feature = "avx2")] — AVX2-baseline builds use the inlined (byte-identical to pre-PR) kernels with no per-vector dispatch, and dispatch to AVX-512 once per batch, large dims only (a masked 512-bit load loses to a plain AVX2 load at 8 lanes). Sub-AVX2 builds keep your runtime path.
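
A minimal sketch of that gating, under stated assumptions: `cosine_plain`, `cosine_avx512_standin`, `cosine_runtime_dispatch`, `batch_impl`, and the `AVX512_MIN_DIM` threshold are all hypothetical, and the two stand-in kernels simply call the plain kernel so the example runs anywhere. The linked branch is the authoritative version.

```rust
/// Plain kernel with no #[target_feature]: it inlines into the batch loop and is
/// vectorized with whatever baseline the build was compiled for.
#[inline]
fn cosine_plain(q: &[f32], q_norm: f32, v: &[f32]) -> f32 {
    let (mut xy, mut yy) = (0.0f32, 0.0f32);
    for (x, y) in q.iter().zip(v) {
        xy += x * y;
        yy += y * y;
    }
    1.0 - xy / (q_norm * yy.sqrt())
}

/// Hypothetical stand-in for a #[target_feature(enable = "avx512f")] kernel.
#[cfg(target_feature = "avx2")]
fn cosine_avx512_standin(q: &[f32], q_norm: f32, v: &[f32]) -> f32 {
    cosine_plain(q, q_norm, v)
}

/// Assumed cut-off for "large dims only"; the real value is not given here.
#[cfg(target_feature = "avx2")]
const AVX512_MIN_DIM: usize = 64;

/// AVX2-baseline build: no per-vector dispatch, one AVX-512 check per batch.
#[cfg(target_feature = "avx2")]
fn batch_impl(q: &[f32], q_norm: f32, data: &[f32], dim: usize) -> Vec<f32> {
    let use_avx512 = dim >= AVX512_MIN_DIM && is_x86_feature_detected!("avx512f");
    if use_avx512 {
        data.chunks_exact(dim)
            .map(|v| cosine_avx512_standin(q, q_norm, v))
            .collect()
    } else {
        data.chunks_exact(dim)
            .map(|v| cosine_plain(q, q_norm, v))
            .collect()
    }
}

/// Hypothetical stand-in for the PR's per-vector `match *SIMD_SUPPORT` path.
#[cfg(not(target_feature = "avx2"))]
fn cosine_runtime_dispatch(q: &[f32], q_norm: f32, v: &[f32]) -> f32 {
    cosine_plain(q, q_norm, v)
}

/// Sub-AVX2 build: keep the runtime-dispatched path.
#[cfg(not(target_feature = "avx2"))]
fn batch_impl(q: &[f32], q_norm: f32, data: &[f32], dim: usize) -> Vec<f32> {
    data.chunks_exact(dim)
        .map(|v| cosine_runtime_dispatch(q, q_norm, v))
        .collect()
}

/// Cosine distance from `q` to each `dim`-wide row packed in `data`.
pub fn cosine_batch(q: &[f32], data: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0 && q.len() == dim && data.len() % dim == 0);
    let q_norm = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    batch_impl(q, q_norm, data, dim)
}
```

Because `cfg(target_feature = "avx2")` is resolved at compile time, the default haswell build never contains the per-vector dispatch, while a build with `-C target-cpu=x86-64-v2` never contains the inlined AVX2 path.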

After (fix vs base):

| | Broadwell (AVX2) | Zen 4 (AVX-512) |
|---|---|---|
| dim 1024 | +6% | −7% |
| dim 8 | flat | −3% |

Fast by default, AVX-512 still wins where it helps.

Notes: the +6% at dim 1024 on AVX2 is a codegen-unit artifact (the kernel there is byte-identical to pre-PR), not the dispatch path. I only reworked f32 cosine — dot/l2/norm_l2 have the same per-vector shape and likely the same regression. Branch: https://github.com/stumpylog/lance/tree/cosine-batch-dispatch-fix — happy to PR it against your branch.

(Downstream context: paperless-ngx/paperless-ngx#12970.)

@tobocop2

tobocop2 commented Jun 10, 2026 via email

@tobocop2

@westonpace

Just a friendly bump now that there are other folks with similar needs. I'm wondering if you could nudge someone on this team our way. Thank you very much.
