feat(lance-linalg): runtime SIMD dispatch for pre-Haswell x86_64 from-source builds#6630
tobocop2 wants to merge 8 commits into
The default lancedb wheel targets target-cpu=haswell (AVX2 + FMA + F16C) and SIGILLs at import on pre-Haswell silicon (Sandy Bridge / Ivy Bridge / Westmere on Intel; Bulldozer / Piledriver / Steamroller on AMD).

Per westonpace's review on lancedb#3324 (fast by default, extra steps to work on legacy), the default wheel baseline stays fast for modern users; pre-Haswell users get a separately-published 'lancedb-compat' wheel.

Adds a 'lancedb-compat' matrix entry to pypi-publish.yml that builds with RUSTFLAGS=-C target-cpu=x86-64-v2 (Nehalem-class baseline). The compat wheel relies on runtime SIMD dispatch in the embedded lance crate (landing via lance-format/lance#6630) to pick the appropriate kernel tier at load time, so it still goes fast on modern hardware.

Generalizes build_linux_wheel and upload_wheel composites with optional package-name and rustflags inputs (defaults preserve current behavior for the existing 4 lancedb matrix entries).

Documents the choice in python/README.md: pip install lancedb-compat for pre-Haswell hosts; same import lancedb API.

Maintainer setup required before this can ship: register lancedb-compat on PyPI and configure trusted publishing.

Blocked on lance-format/lance#6630.
…-source builds

Adds 5-tier runtime SIMD dispatch (scalar / AVX / AVX+FMA / AVX2+FMA / AVX-512) to the 10 hot f32/f64 distance kernels in lance-linalg, matching the per-tier dispatch shape already used by dot_u8.rs / cosine_u8.rs / l2_u8.rs and the f16/bf16 paths in norm_l2.rs.

Today, `import lancedb` SIGILLs on pre-AVX2 CPUs (Intel Sandy Bridge / Ivy Bridge / Westmere; AMD Bulldozer / Piledriver / Steamroller) because the wheel bakes AVX2 into every compiled function with no runtime guard. numpy and pyarrow handle the same hardware via runtime dispatch; this PR brings lance to parity for the from-source legacy build path.

Per westonpace's review feedback on lancedb/lancedb#3324 (fast by default, extra steps to work on legacy), the workspace .cargo/config.toml baseline stays at target-cpu=haswell. The qemu-pre-haswell CI job sets RUSTFLAGS=-C target-cpu=x86-64-v2 only in its own env block, so the runtime-dispatch path gets exercised under qemu Nehalem without affecting any other build. CONTRIBUTING.md documents the from-source legacy build for users on pre-Haswell hardware.

Changes:

- lance-core::utils::cpu: extend SimdSupport with Avx and AvxFma tiers, add #[non_exhaustive] for forward compatibility, add SimdInfo + simd_info() introspection API mirroring pyarrow.runtime_info().
- lance-linalg::distance::{cosine, dot, l2, norm_l2}: per-tier dispatch via match *SIMD_SUPPORT for all f32/f64 hot kernels. AVX-512 inner kernels use _mm512_* intrinsics directly; AVX+FMA kernels use the existing f32x8 / f64x8 SIMD types; AVX-only kernels use raw _mm256_mul/_mm256_add intrinsics because f32x8::multiply_add lowers to an FMA instruction. AVX2-host dispatch routes to the AvxFma kernel where the body uses no AVX2-specific intrinsics (Avx2 | AvxFma => x86::FOO_avx_fma), which eliminates ~480 lines of byte-identical AVX2/AvxFma per-tier function pairs across all 10 hot kernels. Documents AvxFma/Avx fall-through to scalar in the u8 and dist_table dispatchers (their AVX2 inners use integer ops not available below AVX2).
- lance-linalg::simd::{f32, f64}: drop the dead AVX-512 specializations of f32x16 / f64x8 (gated on target_feature="avx512f", which no project CI configuration enables). Per-tier dispatch in distance/* uses raw _mm512_* intrinsics directly. f32x8::gather adds a runtime AVX2 guard with a scalar fallback for x86_64 hosts without AVX2.
- python: expose lance.simd_info() (native binding + re-export in lance/__init__.py with a pytest) so users can verify which SIMD tier the runtime dispatched to without rebuilding.
- .github/workflows/rust.yml: new qemu-pre-haswell job that builds with RUSTFLAGS=-C target-cpu=x86-64-v2 and runs lance-linalg lib tests under qemu-x86_64 -cpu Nehalem, catching SIGILL leaks in the legacy-build path before they ship.
- CONTRIBUTING.md: documents the legacy build (RUSTFLAGS=-C target-cpu=x86-64-v2 cargo build --release).

Zero new external dependencies; Cargo.lock unchanged. lance-linalg lib tests all green (83/83).

Verified end-to-end on Sandy Bridge Xeon E5-2609 via the companion lancedb wheel build: pre-PR `pip install lancedb` SIGILLs; post-PR a from-source build with the documented RUSTFLAGS override produces a wheel where import + vector search work at the AVX tier.

Closes lance-format#6618.
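The per-tier dispatch shape described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `Tier`, `tier()`, `dot`, `dot_scalar`, `dot_avx`, and `dot_avx_fma` are hypothetical names, only three of the five tiers are shown, and the real kernels in lance-linalg are organized differently. The point is the structure: detect the CPU once, then `match` on the cached tier and call a `#[target_feature]` kernel, with the AVX-only kernel using a separate multiply and add because FMA may be absent.

```rust
use std::sync::OnceLock;

/// Hypothetical tier enum, loosely modeled on the tiers named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Scalar,
    Avx,
    AvxFma,
}

/// Detect the best available tier once and cache the answer.
pub fn tier() -> Tier {
    static TIER: OnceLock<Tier> = OnceLock::new();
    *TIER.get_or_init(detect)
}

fn detect() -> Tier {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            return Tier::AvxFma;
        }
        if is_x86_feature_detected!("avx") {
            return Tier::Avx;
        }
    }
    Tier::Scalar
}

/// Portable fallback, used on non-x86 hosts and on x86 hosts without AVX.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
mod x86 {
    use std::arch::x86_64::*;

    /// Sum the eight lanes of an AVX register.
    #[inline]
    #[target_feature(enable = "avx")]
    unsafe fn hsum(v: __m256) -> f32 {
        let mut lanes = [0.0f32; 8];
        unsafe { _mm256_storeu_ps(lanes.as_mut_ptr(), v) };
        lanes.iter().sum()
    }

    /// AVX only: separate multiply and add, since FMA is not guaranteed here.
    #[target_feature(enable = "avx")]
    pub unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
        let n = a.len().min(b.len());
        let chunks = n / 8;
        unsafe {
            let mut acc = _mm256_setzero_ps();
            for i in 0..chunks {
                let x = _mm256_loadu_ps(a.as_ptr().add(i * 8));
                let y = _mm256_loadu_ps(b.as_ptr().add(i * 8));
                acc = _mm256_add_ps(acc, _mm256_mul_ps(x, y));
            }
            let mut sum = hsum(acc);
            for i in chunks * 8..n {
                sum += a[i] * b[i]; // scalar tail
            }
            sum
        }
    }

    /// AVX + FMA: fused multiply-add in the inner loop.
    #[target_feature(enable = "avx,fma")]
    pub unsafe fn dot_avx_fma(a: &[f32], b: &[f32]) -> f32 {
        let n = a.len().min(b.len());
        let chunks = n / 8;
        unsafe {
            let mut acc = _mm256_setzero_ps();
            for i in 0..chunks {
                let x = _mm256_loadu_ps(a.as_ptr().add(i * 8));
                let y = _mm256_loadu_ps(b.as_ptr().add(i * 8));
                acc = _mm256_fmadd_ps(x, y, acc);
            }
            let mut sum = hsum(acc);
            for i in chunks * 8..n {
                sum += a[i] * b[i]; // scalar tail
            }
            sum
        }
    }
}

/// Dispatcher: one match on the cached tier, then a call into the matching kernel.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    match tier() {
        // Safety: each arm is only reachable after the matching feature check in `detect`.
        #[cfg(target_arch = "x86_64")]
        Tier::AvxFma => unsafe { x86::dot_avx_fma(a, b) },
        #[cfg(target_arch = "x86_64")]
        Tier::Avx => unsafe { x86::dot_avx(a, b) },
        _ => dot_scalar(a, b),
    }
}

fn main() {
    let a: Vec<f32> = (1..=19).map(|i| i as f32).collect();
    let b = vec![2.0f32; 19];
    println!("tier = {:?}, dot = {}", tier(), dot(&a, &b));
}
```

Because every kernel carries its own `#[target_feature]` attribute, a binary built with a low baseline such as `x86-64-v2` should contain the wider instructions only inside functions that are reached after the runtime check.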
…=avx2)

On AVX2-baseline builds (the default haswell wheel), cosine_batch uses inlined non-target_feature kernels (base-equivalent) plus a single per-batch AVX-512 check, avoiding the per-vector runtime-dispatch + target_feature tax that regressed the modern path. Sub-AVX2 builds keep the runtime dispatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
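The gating idea in this commit message might look something like the sketch below. It is an illustration under assumptions, not the actual patch: `cosine_inline`, `cosine_wide`, `cosine_dispatch`, `cosine_avx`, and the `WIDE_DIM` threshold are hypothetical, and `cosine_wide` is only a placeholder that forwards to the inline kernel where a real AVX-512 kernel would go. What it shows is the split: a build whose baseline already includes AVX2 compiles one `cosine_batch` that checks for the wider tier once per batch, while a lower-baseline build compiles a different `cosine_batch` that keeps per-vector runtime dispatch.

```rust
/// Cosine distance with no `#[target_feature]` attribute, so the compiler can
/// inline it into the batch loop and vectorize it for the build's baseline.
#[inline]
pub fn cosine_inline(x: &[f32], y: &[f32]) -> f32 {
    let (mut xy, mut xx, mut yy) = (0.0f32, 0.0f32, 0.0f32);
    for (a, b) in x.iter().zip(y) {
        xy += a * b;
        xx += a * a;
        yy += b * b;
    }
    1.0 - xy / (xx.sqrt() * yy.sqrt())
}

/// Hypothetical cut-off below which the wider path is not attempted.
#[cfg(target_feature = "avx2")]
const WIDE_DIM: usize = 64;

/// Placeholder for an AVX-512 kernel; in this sketch it just forwards.
#[cfg(target_feature = "avx2")]
fn cosine_wide(x: &[f32], y: &[f32]) -> f32 {
    cosine_inline(x, y)
}

/// AVX2-baseline build: one runtime check per batch, none per vector.
/// `dim` must be non-zero (`chunks_exact` panics on 0).
#[cfg(target_feature = "avx2")]
pub fn cosine_batch(query: &[f32], data: &[f32], dim: usize) -> Vec<f32> {
    if dim >= WIDE_DIM && is_x86_feature_detected!("avx512f") {
        return data.chunks_exact(dim).map(|v| cosine_wide(query, v)).collect();
    }
    data.chunks_exact(dim).map(|v| cosine_inline(query, v)).collect()
}

/// Lower-baseline build: keep runtime dispatch on every vector.
/// `dim` must be non-zero (`chunks_exact` panics on 0).
#[cfg(not(target_feature = "avx2"))]
pub fn cosine_batch(query: &[f32], data: &[f32], dim: usize) -> Vec<f32> {
    data.chunks_exact(dim).map(|v| cosine_dispatch(query, v)).collect()
}

#[cfg(not(target_feature = "avx2"))]
fn cosine_dispatch(x: &[f32], y: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Safety: AVX support was just confirmed at runtime.
            return unsafe { cosine_avx(x, y) };
        }
    }
    cosine_inline(x, y)
}

/// Same body as the inline kernel, compiled with AVX enabled.
#[cfg(all(target_arch = "x86_64", not(target_feature = "avx2")))]
#[target_feature(enable = "avx")]
unsafe fn cosine_avx(x: &[f32], y: &[f32]) -> f32 {
    cosine_inline(x, y)
}

fn main() {
    let query = [1.0f32, 0.0];
    let data = [1.0f32, 0.0, 0.0, 1.0, -1.0, 0.0];
    println!("{:?}", cosine_batch(&query, &data, 2));
}
```

Which `cosine_batch` gets compiled is decided by the build flags (for example `-C target-cpu=haswell` versus `-C target-cpu=x86-64-v2`), so the default build carries no per-vector dispatch at all.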
|
Benchmarked this PR on two modern x86_64 boxes (the "no slowdown on newer hardware" question). As-is it regresses the f32 cosine *batch* path on AVX2-only CPUs; a small refinement fixes that while keeping the AVX-512 win.

`cargo bench -p lance-linalg --bench cosine`, criterion, same-session base→PR→fix (`Cosine(f32, scalar)` is a build-independent control, ≈ 0% on clean runs). `auto-vectorized` = `cosine_distance_batch` dim 1024; `simd,f32x8` = dim 8.

**PR as-is vs base:**

| | Broadwell (AVX2) | Zen 4 (AVX-512) |
|---|---|---|
| dim 1024 | **+24%** | −6% |
| dim 8 | **+36%** | **+78%** |

Cause: `f32::cosine_batch` does the `match *SIMD_SUPPORT` + a `#[target_feature]` call *per vector*. On a wheel already built with `target-cpu=haswell`, that's pure overhead vs the pre-PR inlined `f32x16`/`f32x8` kernel.

**Refinement:** gate the dispatch on `#[cfg(target_feature = "avx2")]`. AVX2-baseline builds use the inlined (byte-identical to pre-PR) kernels with no per-vector dispatch, and dispatch to AVX-512 *once per batch*, large dims only (a masked 512-bit load loses to a plain AVX2 load at 8 lanes). Sub-AVX2 builds keep your runtime path.

**After (fix vs base):**

| | Broadwell (AVX2) | Zen 4 (AVX-512) |
|---|---|---|
| dim 1024 | +6% | **−7%** |
| dim 8 | flat | −3% |

Fast by default, AVX-512 still wins where it helps.

Notes: the +6% at dim 1024 on AVX2 is a codegen-unit artifact (the kernel there is byte-identical to pre-PR), not the dispatch path. I only reworked f32 cosine; dot/l2/norm_l2 have the same per-vector shape and likely the same regression. Branch: https://github.com/stumpylog/lance/tree/cosine-batch-dispatch-fix. Happy to PR it against your branch.

(Downstream context: paperless-ngx/paperless-ngx#12970.)
|
I'd love the PR to my branch! Thank you!!!
|
fix(lance-linalg): gate f32 cosine dispatch behind cfg(target_feature=avx2)
|
Just a friendly bump now that there are other folks with similar needs. I'm wondering if you could nudge someone on this team our way. Thank you very much. |
Tracks #6618. On x86_64 CPUs without AVX2 (Sandy Bridge / Ivy Bridge / Westmere on Intel, Bulldozer / Piledriver / Steamroller on AMD), `import lancedb` SIGILLs because the wheel bakes AVX2 + FMA into every compiled function with no runtime guard. numpy and pyarrow handle the same hardware via runtime CPU dispatch.

Summary

- … `lance-linalg::distance::{cosine, dot, l2, norm_l2}`. Same `match *SIMD_SUPPORT` + `mod x86 { #[target_feature] pub unsafe fn ... }` shape as `dot_u8.rs` / `cosine_u8.rs` / `l2_u8.rs`. Where the AVX2 and AVX+FMA kernel bodies use no AVX2-specific intrinsics, the dispatch matches `Avx2 | AvxFma` to a shared kernel.
- `lance.simd_info()` Python introspection mirroring `pyarrow.runtime_info()` so users can verify which tier the runtime selected.
- `qemu-pre-haswell` CI job that builds with `RUSTFLAGS="-C target-cpu=x86-64-v2"` (env-var-scoped to that one job; workspace `.cargo/config.toml` is unchanged) and runs `lance-linalg` lib tests under `qemu-x86_64 -cpu Nehalem`.
- `CONTRIBUTING.md`: `RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build --release`.

Per westonpace's review on lancedb/lancedb#3324, the workspace baseline stays at `target-cpu=haswell`. Modern wheels are unchanged; legacy users opt into the lower baseline at build time.

Benchmark

The AVX2 path on modern hardware is preserved as one of the per-tier kernels and the workspace baseline still bakes AVX2 into surrounding code, so by construction the modern compile is unchanged. Numbers still pending: Codespace's 30-min idle timeout killed my last full `cargo bench -p lance-linalg --bench {cosine,dot,l2,norm_l2}` run mid-suite (even with `nohup`; the VM itself sleeps). If anyone can recommend a free resource that holds a benchmark for ~1 hour, or a maintainer-preferred narrower bench shape, I'd appreciate the pointer.

Pre-Haswell verification on Sandy Bridge Xeon E5-2609 (the hardware the published wheel SIGILLs on) via the companion lancedb wheel build: pre-PR `pip install lancedb` SIGILLs at import; post-PR a from-source build with the documented `RUSTFLAGS` override produces a wheel where import + table-create + vector-search all work at the AVX tier. PASS output in tobocop2/lancedb#2.

Test plan

- `cargo test -p lance-linalg --lib`: 83/83 on aarch64 dev box
- `cargo clippy --all-targets -- -D warnings` clean; `cargo fmt --check` clean; `Cargo.lock` unchanged
- … `is_x86_feature_detected!()` so each runs on hosts that can execute its tier
- … `RUSTFLAGS` override

To be transparent: this isn't my domain of expertise and the implementation is AI-generated. I stuck to the existing `mod x86 { #[target_feature] }` precedent and verified end-to-end on the failing hardware, but wanted to be upfront. Happy to roll in feedback.
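For readers wondering what an introspection call in the spirit of `simd_info()` could report, here is a rough sketch. It makes several assumptions: `SimdReport` and `simd_report()` are hypothetical names, the fields are invented for illustration (the actual `SimdInfo` layout is not shown in this thread), and the tier labels simply follow the five tiers listed in the commit message.

```rust
/// Hypothetical report type; not the `SimdInfo` struct from the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SimdReport {
    /// Architecture the binary was compiled for.
    pub arch: &'static str,
    /// Highest tier the running CPU supports, as a label.
    pub tier: &'static str,
    /// Individual CPU features detected at runtime.
    pub features: Vec<&'static str>,
}

pub fn simd_report() -> SimdReport {
    #[allow(unused_mut)] // stays empty on non-x86 targets
    let mut features: Vec<&'static str> = Vec::new();

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            features.push("avx");
        }
        if is_x86_feature_detected!("fma") {
            features.push("fma");
        }
        if is_x86_feature_detected!("avx2") {
            features.push("avx2");
        }
        if is_x86_feature_detected!("avx512f") {
            features.push("avx512f");
        }
    }

    // Pick the highest tier whose required features are all present.
    let has = |f: &str| features.iter().any(|x| *x == f);
    let tier = if has("avx512f") {
        "avx512"
    } else if has("avx2") && has("fma") {
        "avx2+fma"
    } else if has("avx") && has("fma") {
        "avx+fma"
    } else if has("avx") {
        "avx"
    } else {
        "scalar"
    };

    SimdReport {
        arch: std::env::consts::ARCH,
        tier,
        features,
    }
}

fn main() {
    // Output depends on the host CPU, so no expected value is shown here.
    println!("{:?}", simd_report());
}
```

On a Sandy Bridge host a report like this would be expected to show the `avx` tier, which is the kind of check the description says users can make without rebuilding.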