Draft
Add #elif defined(__aarch64__) ARM NEON paths alongside all existing
x86-64 SSE2 paths in every translation unit. The scalar fallback is
retained for other architectures.
Intrinsic mapping applied
--------------------------
matrix_operations.cpp
__m128d -> float64x2_t
_mm_setzero_pd() -> vdupq_n_f64(0.0)
_mm_loadu_pd(p) -> vld1q_f64(p)
_mm_set_pd(hi,lo) -> local double[2] = {lo, hi} + vld1q_f64 (lane 0 is lo)
_mm_mul_pd(a,b) -> vmulq_f64(a,b)
_mm_add_pd(a,b) -> vaddq_f64(a,b)
_mm_storeu_pd / horiz-add -> vgetq_lane_f64(v,0)+vgetq_lane_f64(v,1)
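Here is a minimal sketch of how these mappings fit together in a matrix product. It assumes row-major n×n matrices, where each output element is a dot product of a contiguous row of A (`_mm_loadu_pd` / `vld1q_f64`) and a strided column of B (gathered with `_mm_set_pd` or a local array). The function `matmul_rowmajor` is hypothetical and is not the code in matrix_operations.cpp. The detail to check is lane order: `_mm_set_pd` takes the high lane first, so the equivalent NEON array is `{lo, hi}`.

```cpp
#include <cstddef>
#include <cassert>
#if defined(__x86_64__)
#include <immintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: C = A * B for row-major n x n matrices.
void matmul_rowmajor(const double* A, const double* B, double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double* row = A + i * n;  // contiguous row of A
            double sum = 0.0;
            std::size_t k = 0;
#if defined(__x86_64__)
            __m128d acc = _mm_setzero_pd();
            for (; k + 2 <= n; k += 2) {
                __m128d a = _mm_loadu_pd(row + k);
                // Column elements are n apart, so gather them; _mm_set_pd takes (hi, lo).
                __m128d b = _mm_set_pd(B[(k + 1) * n + j], B[k * n + j]);
                acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
            }
            double lanes[2];
            _mm_storeu_pd(lanes, acc);  // horizontal add via a store
            sum = lanes[0] + lanes[1];
#elif defined(__aarch64__)
            float64x2_t acc = vdupq_n_f64(0.0);
            for (; k + 2 <= n; k += 2) {
                float64x2_t a = vld1q_f64(row + k);
                // Same gather: the array is {lo, hi}, i.e. lane 0 first.
                const double col[2] = {B[k * n + j], B[(k + 1) * n + j]};
                float64x2_t b = vld1q_f64(col);
                acc = vaddq_f64(acc, vmulq_f64(a, b));
            }
            sum = vgetq_lane_f64(acc, 0) + vgetq_lane_f64(acc, 1);
#endif
            // Scalar tail for odd n, and the whole loop on other architectures.
            for (; k < n; ++k) sum += row[k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

When neither macro is defined, only the scalar loop runs, which matches the fallback described above.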
hash_operations.cpp
_mm_loadu_si128 -> vld1q_u8
_mm_extract_epi16(v, j/2) with a runtime-variable j [illegal on both SSE2 and
NEON, which require a compile-time lane index] -> fixed on BOTH paths:
_mm_storeu_si128 / vst1q_u8 to a local uint8_t[16] array, then iterate over it.
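Below is a sketch of the store-then-iterate fix. For illustration it assumes the hash is a byte-wise 32-bit FNV-1a. The source does not say which hash hash_operations.cpp computes, and `hash_bytes` is a hypothetical name. Spilling the vector to a `uint8_t[16]` array turns the lane index into an ordinary array index, and an array index may be a runtime value on both paths.

```cpp
#include <cstddef>
#include <cstdint>
#include <cassert>
#if defined(__x86_64__)
#include <immintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical stand-in hash: 32-bit FNV-1a over a byte buffer.
std::uint32_t hash_bytes(const std::uint8_t* data, std::size_t n) {
    std::uint32_t h = 2166136261u;          // FNV offset basis
    const std::uint32_t prime = 16777619u;  // FNV prime
    std::size_t i = 0;
#if defined(__x86_64__) || defined(__aarch64__)
    for (; i + 16 <= n; i += 16) {
        std::uint8_t bytes[16];
#if defined(__x86_64__)
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(bytes), v);
#else
        uint8x16_t v = vld1q_u8(data + i);
        vst1q_u8(bytes, v);
#endif
        // A runtime index j is legal on an array, unlike a lane-index immediate.
        for (int j = 0; j < 16; ++j) {
            h ^= bytes[j];
            h *= prime;
        }
    }
#endif
    for (; i < n; ++i) {  // scalar tail, or the whole input on other architectures
        h ^= data[i];
        h *= prime;
    }
    return h;
}
```

FNV-1a processes bytes serially, so the vector load and store here only show the legal access pattern. They are not expected to make this particular hash faster.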
string_search.cpp
_mm_set1_epi8(c) -> vdupq_n_u8(c)
_mm_loadu_si128 -> vld1q_u8
_mm_cmpeq_epi8(a,b) -> vceqq_u8(a,b) (0xFF/0x00 per lane — same semantics)
_mm_movemask_epi8 -> neon_movemask_epi8() helper:
vandq_u8 with power-of-2 mask + three rounds of
vpadd_u8, then vget_lane_u8 × 2
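One possible implementation of the `neon_movemask_epi8()` helper follows the recipe above and is used here in a hypothetical `find_byte` search. This is a sketch, not the PR's code. The helper relies on every input lane being 0x00 or 0xFF, which holds for `vceqq_u8` results. It does not reproduce `_mm_movemask_epi8` for arbitrary bytes. `__builtin_ctz` is a GCC/Clang builtin and is assumed to be available.

```cpp
#include <cstddef>
#include <cstdint>
#include <cassert>
#if defined(__x86_64__)
#include <immintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>

// Packs one bit per byte into a 16-bit mask, like _mm_movemask_epi8.
// Only valid when every lane is 0x00 or 0xFF (e.g. a vceqq_u8 result).
static inline int neon_movemask_epi8(uint8x16_t v) {
    static const std::uint8_t kBits[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                           1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t m = vandq_u8(v, vld1q_u8(kBits));              // one distinct bit per lane
    uint8x8_t s = vpadd_u8(vget_low_u8(m), vget_high_u8(m));  // 8 pairwise sums
    s = vpadd_u8(s, s);                                       // 4 distinct sums
    s = vpadd_u8(s, s);  // lane 0 = bytes 0..7, lane 1 = bytes 8..15
    return vget_lane_u8(s, 0) | (vget_lane_u8(s, 1) << 8);
}
#endif

// Hypothetical search helper: index of the first byte equal to c, or n if absent.
std::size_t find_byte(const char* data, std::size_t n, char c) {
    std::size_t i = 0;
#if defined(__x86_64__)
    const __m128i needle = _mm_set1_epi8(c);
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) return i + static_cast<std::size_t>(__builtin_ctz(mask));
    }
#elif defined(__aarch64__)
    const uint8x16_t needle = vdupq_n_u8(static_cast<std::uint8_t>(c));
    for (; i + 16 <= n; i += 16) {
        uint8x16_t chunk = vld1q_u8(reinterpret_cast<const std::uint8_t*>(data + i));
        int mask = neon_movemask_epi8(vceqq_u8(chunk, needle));
        if (mask != 0) return i + static_cast<std::size_t>(__builtin_ctz(mask));
    }
#endif
    for (; i < n; ++i)  // scalar tail, or the whole search on other architectures
        if (data[i] == c) return i;
    return n;
}
```

Each `vpadd_u8` round halves the number of distinct partial sums (8, then 4, then 2). Lane 0 therefore holds the mask for bytes 0 to 7, and lane 1 holds the mask for bytes 8 to 15. Every lane contributes a different power of two, so the additions never carry into each other.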
memory_operations.cpp
_mm_loadu_si128 -> vld1q_u8
_mm_storeu_si128 -> vst1q_u8
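Here is a minimal sketch of a chunked copy that uses these two mappings. `copy_bytes` is a hypothetical helper, and the buffers are assumed not to overlap. Like `_mm_loadu_si128` / `_mm_storeu_si128`, `vld1q_u8` and `vst1q_u8` need no 16-byte alignment for `uint8_t` data.

```cpp
#include <cstddef>
#include <cstdint>
#include <cassert>
#if defined(__x86_64__)
#include <immintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical copy helper; dst and src must not overlap.
void copy_bytes(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) {
    std::size_t i = 0;
#if defined(__x86_64__)
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
#elif defined(__aarch64__)
    for (; i + 16 <= n; i += 16)
        vst1q_u8(dst + i, vld1q_u8(src + i));  // unaligned access is fine for u8
#endif
    for (; i < n; ++i) dst[i] = src[i];  // tail, or the whole copy elsewhere
}
```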
polynomial_eval.cpp
__m128d -> float64x2_t
_mm_setzero_pd() -> vdupq_n_f64(0.0)
_mm_set1_pd(x) -> vdupq_n_f64(x)
_mm_set_pd(hi,lo) -> local double[2] = {lo, hi} + vld1q_f64 (lane 0 is lo)
_mm_loadu_pd / coeffs.data()+i -> vld1q_f64(coeffs.data()+i) (contiguous)
_mm_mul_pd -> vmulq_f64
_mm_add_pd -> vaddq_f64
_mm_storeu_pd / horiz-add -> vgetq_lane_f64(v,0)+vgetq_lane_f64(v,1)
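The following is a sketch of a two-lane polynomial evaluator built from these mappings. It assumes the evaluation keeps consecutive powers {x^i, x^(i+1)} in one vector and advances both by x^2 each step. `eval_poly` and this even/odd split are assumptions and are not necessarily what polynomial_eval.cpp does. The sketch also shows the `_mm_set_pd(hi, lo)` argument order next to the matching `{lo, hi}` NEON array.

```cpp
#include <cstddef>
#include <vector>
#include <cassert>
#if defined(__x86_64__)
#include <immintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical evaluator: sum of coeffs[i] * x^i, two terms per step.
// Lanes hold consecutive powers {x^i, x^(i+1)}; each step multiplies both by x^2.
double eval_poly(const std::vector<double>& coeffs, double x) {
    const std::size_t n = coeffs.size();
    std::size_t i = 0;
    double result = 0.0;
    double pw = 1.0;  // x^i for the scalar tail
#if defined(__x86_64__)
    __m128d acc = _mm_setzero_pd();
    __m128d powers = _mm_set_pd(x, 1.0);  // (hi, lo): lane 0 = 1, lane 1 = x
    const __m128d step = _mm_set1_pd(x * x);
    for (; i + 2 <= n; i += 2) {
        __m128d c = _mm_loadu_pd(coeffs.data() + i);  // coefficients are contiguous
        acc = _mm_add_pd(acc, _mm_mul_pd(c, powers));
        powers = _mm_mul_pd(powers, step);
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    result = lanes[0] + lanes[1];
    pw = _mm_cvtsd_f64(powers);  // lane 0 now holds x^i
#elif defined(__aarch64__)
    float64x2_t acc = vdupq_n_f64(0.0);
    const double init[2] = {1.0, x};  // {lo, hi}: same lanes as _mm_set_pd(x, 1.0)
    float64x2_t powers = vld1q_f64(init);
    const float64x2_t step = vdupq_n_f64(x * x);
    for (; i + 2 <= n; i += 2) {
        float64x2_t c = vld1q_f64(coeffs.data() + i);
        acc = vaddq_f64(acc, vmulq_f64(c, powers));
        powers = vmulq_f64(powers, step);
    }
    result = vgetq_lane_f64(acc, 0) + vgetq_lane_f64(acc, 1);
    pw = vgetq_lane_f64(powers, 0);
#endif
    for (; i < n; ++i) {  // odd leftover term, or every term on other architectures
        result += coeffs[i] * pw;
        pw *= x;
    }
    return result;
}
```

Summing even and odd terms in separate lanes changes the order of floating-point additions compared with a scalar Horner loop. Results may therefore differ in the last bits for non-integer inputs.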
Dockerfile
ubuntu:22.04 is a multi-arch manifest (amd64 + arm64) — no base-image
change needed. Updated comment to document dual-arch SIMD support.
Build flags are already architecture-neutral (-O2 -std=c++11).
Co-authored-by: JoeStech <4088382+JoeStech@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Migrate this repo to Arm using MCP server tools" to "Migrate x86 SSE2 SIMD to ARM64 NEON" (Apr 15, 2026).
Copilot stopped work on behalf of JoeStech due to an error (April 15, 2026 16:25).
All compute-hot paths used x86-only SSE2 intrinsics (<immintrin.h>), which made the codebase non-portable to ARM64.
Changes
matrix_operations, hash_operations, string_search, memory_operations, polynomial_eval: wrapped the existing SSE2 paths in #ifdef __x86_64__, added #elif defined(__aarch64__) NEON paths and scalar fallbacks.
<immintrin.h> → <arm_neon.h> under the aarch64 guard.
main.cpp: added an __aarch64__ NEON detection branch.
Dockerfile: ubuntu:22.04 is already multi-arch; updated the comment only.
Bug fixes (discovered during migration)
hash_operations.cpp: _mm_extract_epi16 was called with a runtime-variable lane index, which is illegal on both SSE2 and NEON (lane indices must be compile-time constants). Replaced with _mm_storeu_si128 / vst1q_u8 into a local byte array.
string_search.cpp: NEON has no direct equivalent of _mm_movemask_epi8; added a neon_movemask_epi8() helper.
Example pattern applied across all files