fix(lance-linalg): gate f32 cosine dispatch behind cfg(target_feature=avx2) #3
Merged
tobocop2 merged 1 commit on Jun 11, 2026
…=avx2)
On AVX2-baseline builds (the default haswell wheel), cosine_batch uses inlined non-target_feature kernels (base-equivalent) plus a single per-batch AVX-512 check, avoiding the per-vector runtime-dispatch + target_feature tax that regressed the modern path. Sub-AVX2 builds keep the runtime dispatch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Owner
awesome, thank you!
8577f3b into tobocop2:fix/runtime-simd-multiversion
2 checks passed
Follow-up from the benchmarks I posted on lance-format#6630 (comment): the per-vector dispatch regressed the f32 cosine batch path by 24-36% on AVX2-baseline builds.
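This gates the f32 cosine dispatch behind cfg(target_feature = "avx2"): AVX2-baseline builds use the inlined non-target_feature kernels plus a single per-batch AVX-512 check, and sub-AVX2 builds keep the runtime dispatch.

Same-session benchmarks (Broadwell AVX2 + Zen4 AVX-512): dim8 is flat on both boxes; dim1024 is +6% on Broadwell (codegen-unit residual, not dispatch) and -7% on Zen4 (the AVX-512 tier engages on the default wheel for the first time).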
Scope: just the cosine path. dot/l2/norm_l2 have the same shape (l2 measured +6.4%); happy to extend the gate there in this PR or as a follow-up, whichever you prefer.
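
For readers who want the shape of the gate without opening the diff, here is a minimal sketch. It is not the code in this PR: every function name except cosine_batch (cosine_kernel, cosine_avx512_tier, cosine_avx2, cosine_dispatch) is hypothetical, the signature of cosine_batch is assumed, the AVX-512 tier is a placeholder that reuses the portable kernel, and the return value is assumed to be a cosine distance (1 - similarity).

```rust
// Illustrative sketch only; names and signatures are assumptions,
// not the lance-linalg API.

/// Portable kernel with no #[target_feature] attribute, so it can be inlined
/// into its caller and compiled with whatever the build baseline allows.
/// Returns the cosine distance (1 - similarity); assumes non-zero norms.
#[inline]
fn cosine_kernel(x: &[f32], y: &[f32]) -> f32 {
    let (mut xy, mut xx, mut yy) = (0.0f32, 0.0f32, 0.0f32);
    for (a, b) in x.iter().zip(y.iter()) {
        xy += a * b;
        xx += a * a;
        yy += b * b;
    }
    1.0 - xy / (xx.sqrt() * yy.sqrt())
}

/// Placeholder for an AVX-512 kernel (hypothetical). A real build would put
/// a wider kernel here; the sketch reuses the portable one.
#[cfg(target_feature = "avx2")]
#[inline]
fn cosine_avx512_tier(x: &[f32], y: &[f32]) -> f32 {
    cosine_kernel(x, y)
}

/// AVX2-baseline build: the kernel is inlined into the loop and the only
/// runtime feature check happens once per batch.
#[cfg(target_feature = "avx2")]
pub fn cosine_batch(query: &[f32], batch: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0, "dim must be non-zero");
    if std::is_x86_feature_detected!("avx512f") {
        batch.chunks_exact(dim).map(|v| cosine_avx512_tier(query, v)).collect()
    } else {
        batch.chunks_exact(dim).map(|v| cosine_kernel(query, v)).collect()
    }
}

/// Sub-AVX2 build on x86_64: an AVX2-compiled copy of the kernel that may
/// only be called after a runtime check.
#[cfg(all(target_arch = "x86_64", not(target_feature = "avx2")))]
#[target_feature(enable = "avx2")]
unsafe fn cosine_avx2(x: &[f32], y: &[f32]) -> f32 {
    cosine_kernel(x, y)
}

/// Sub-AVX2 build: per-vector runtime dispatch, as before the gate.
#[cfg(not(target_feature = "avx2"))]
fn cosine_dispatch(x: &[f32], y: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            return unsafe { cosine_avx2(x, y) };
        }
    }
    cosine_kernel(x, y)
}

/// Sub-AVX2 build: keep the runtime dispatch on every vector.
#[cfg(not(target_feature = "avx2"))]
pub fn cosine_batch(query: &[f32], batch: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0, "dim must be non-zero");
    batch.chunks_exact(dim).map(|v| cosine_dispatch(query, v)).collect()
}
```

Both variants expose the same cosine_batch signature, so callers do not change; which body is compiled depends only on whether the build enables avx2 (for example via -C target-cpu=haswell).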
🤖 Generated with Claude Code