GPUs win at large dense GEMM. We win at **everything else**: similarity search, latency-sensitive inference, edge deployment, energy efficiency, and cost. A $35 Raspberry Pi 4 at 5 watts outperforms a $350 GPU at 170 watts for codebook inference, because table lookups don't need floating-point hardware.

**Our Near tier (2.4B lookups/s) beats an RTX 3060 by 5–12×.** Our Foveal tier (611M lookups/s) matches RTX 3060 IVF-PQ, but with 0.4% error vs. PQ's 5–10%, and at $0 hardware cost. Only an H100 ($30K, 700W) comes close, and it still pays PCIe transfer and kernel launch overhead that we don't have.

The trick: a GPU must do FP32 multiplies and FP32 divides, and transfer data over PCIe. We read one u8 from a 64KB table that lives in L1 cache. No transfer, no kernel launch, no floating point.
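To make the lookup path concrete, here is a minimal sketch of a 256×256 table of `u8` similarities (65,536 bytes, i.e. 64KB) indexed by two `u8` codes. The names `SimTable`, `toy_circle`, `lookup`, and `score` are hypothetical, and the toy codebook (code `k` stands for a unit vector at angle 2πk/256) is an assumption made only so the table has something to store. It is not the project's actual codebook or API. Floating point appears only while building the table; the query path is integer-only.

```rust
/// Hypothetical 256 x 256 similarity table: one u8 per (code_a, code_b) pair.
pub struct SimTable {
    cells: Box<[u8]>, // 256 * 256 = 65,536 bytes
}

impl SimTable {
    /// Build step (done once, up front). Floating point is fine here.
    /// Toy assumption: code k represents a unit vector at angle 2*pi*k/256,
    /// so the cosine between codes i and j is cos(2*pi*(i - j)/256).
    pub fn toy_circle() -> Self {
        let mut cells = vec![0u8; 256 * 256];
        for i in 0..256usize {
            for j in 0..256usize {
                let delta = (i as f64 - j as f64) * std::f64::consts::TAU / 256.0;
                // Map cosine in [-1, 1] to a u8 in [0, 255].
                cells[i * 256 + j] = ((delta.cos() + 1.0) * 0.5 * 255.0).round() as u8;
            }
        }
        SimTable { cells: cells.into_boxed_slice() }
    }

    /// Query step: one indexed byte read, no floating point.
    #[inline]
    pub fn lookup(&self, a: u8, b: u8) -> u8 {
        self.cells[((a as usize) << 8) | (b as usize)]
    }

    /// Sum of per-position similarities for two equal-length code sequences.
    pub fn score(&self, query: &[u8], item: &[u8]) -> u32 {
        query
            .iter()
            .zip(item)
            .map(|(&a, &b)| self.lookup(a, b) as u32)
            .sum()
    }

    /// Table footprint in bytes.
    pub fn size_bytes(&self) -> usize {
        self.cells.len()
    }
}
```

Calling `SimTable::toy_circle()` once and then `lookup(a, b)` per pair keeps the hot loop to a shift, an OR, and a byte load. Whether the whole table stays resident in L1 depends on the CPU's L1 data cache size.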
## Core Architecture

Five layers on top of upstream ndarray's array primitives:

**SIMD Polyfill** (`simd.rs`, `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`): `std::simd`-compatible types (`F32x16`, `F64x8`, `U8x64`, `I32x16`) on stable Rust via `core::arch`. CPU features are detected once via `LazyLock<SimdCaps>`; calls then dispatch through a frozen function-pointer table (0.3ns per call).
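The detect-once, dispatch-through-a-table pattern can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: the fields of `SimdCaps`, the `Kernels` struct, and the `sum_u8` kernels are hypothetical, and the "wide" kernel is a plain chunked loop standing in for a real `core::arch` implementation. Only `std::sync::LazyLock` (stable since Rust 1.80) and `is_x86_feature_detected!` are real standard-library items.

```rust
use std::sync::LazyLock;

/// Hypothetical capability flags; the real `SimdCaps` may hold more.
#[derive(Debug, Clone, Copy)]
pub struct SimdCaps {
    pub avx2: bool,
}

/// Runtime feature detection (falls back to `false` off x86_64).
fn detect() -> SimdCaps {
    #[cfg(target_arch = "x86_64")]
    let avx2 = std::arch::is_x86_feature_detected!("avx2");
    #[cfg(not(target_arch = "x86_64"))]
    let avx2 = false;
    SimdCaps { avx2 }
}

/// Hypothetical dispatch table: one function pointer per kernel.
pub struct Kernels {
    pub sum_u8: fn(&[u8]) -> u64,
}

/// Portable fallback kernel.
pub fn sum_u8_scalar(xs: &[u8]) -> u64 {
    xs.iter().map(|&x| x as u64).sum()
}

/// Stand-in for a SIMD kernel: processes 32-byte chunks, then the tail.
/// A real version would use `core::arch` intrinsics here.
pub fn sum_u8_wide(xs: &[u8]) -> u64 {
    let mut chunks = xs.chunks_exact(32);
    let mut total = 0u64;
    for chunk in chunks.by_ref() {
        total += chunk.iter().map(|&x| x as u64).sum::<u64>();
    }
    total + chunks.remainder().iter().map(|&x| x as u64).sum::<u64>()
}

/// Detection runs at most once, on first access.
static CAPS: LazyLock<SimdCaps> = LazyLock::new(|| detect());

/// The table is filled once from CAPS and never changes afterwards.
static KERNELS: LazyLock<Kernels> = LazyLock::new(|| {
    if CAPS.avx2 {
        Kernels { sum_u8: sum_u8_wide }
    } else {
        Kernels { sum_u8: sum_u8_scalar }
    }
});

/// Public entry point: one indirect call through the frozen table.
pub fn sum_u8(xs: &[u8]) -> u64 {
    (KERNELS.sum_u8)(xs)
}
```

After the first call, `sum_u8` costs an initialization check plus an indirect call, so feature detection is never repeated per call.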
611M cosine-equivalent lookups/sec using only integer operations. The 256×256 table (64KB) lives in L1 cache — no FP division, no multiplication, no PCIe transfer.