Add a way to prefetch a hash table bucket (#677) #727
Adds `prefetch(hash)` to `RawTable` and exposes it as:

- `HashTable::prefetch(hash)`
- `HashMap::prefetch(&Q)` / `HashSet::prefetch(&Q)` (hash the key, then prefetch)

A prefetch issues a software prefetch hint for the two cache lines a lookup of that hash would touch first: the control-byte group at the start of the probe sequence and the corresponding data bucket. It is a hint only: it performs no memory access, never faults (an invalid or dangling address is fine), and is a no-op in the abstract machine.

The stable path is per-architecture: `_mm_prefetch` (`_MM_HINT_T0`) on x86/x86-64, a no-op everywhere else (aarch64 has no stable prefetch intrinsic yet, and `core::intrinsics::prefetch_read_data` is unstable). The new `src/prefetch.rs` shim is `#[cfg(not(miri))]`-gated for the intrinsic, like the SIMD `Group` impls.

For now this is L1 read prefetch only; a richer locality/read-write interface can follow once the std prefetch hints (rust-lang/rust#146941) stabilize.

This only helps when the table is large enough that its control bytes spill out of cache *and* the caller can prefetch a key several lookups ahead of the one being processed (batched lookups / join probing). On a single lookup, or a cache-resident table, it does nothing useful. The new `benches/prefetch.rs` batch-lookup bench shows the crossover: roughly a slight loss on a small (4K-slot) table, ~1.1-1.15x on tables that no longer fit in L2/L3 (~1M-4M slots).
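As a rough illustration of the per-architecture stable path described above, here is a minimal sketch of what such a shim could look like. The function name `prefetch_read_l1` is taken from the later commit message; the body is an assumption for illustration, not the PR's actual `src/prefetch.rs`, and it only covers x86-64 (the PR also covers 32-bit x86).

```rust
/// Hint that the cache line containing `ptr` will be read soon.
///
/// Sketch only: the body is an assumption, not the PR's real shim.
/// The pointer is never dereferenced, so any address is acceptable.
#[inline(always)]
pub fn prefetch_read_l1(ptr: *const u8) {
    // Real intrinsic on x86-64, skipped under Miri.
    #[cfg(all(target_arch = "x86_64", not(miri)))]
    {
        use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        // SAFETY: a prefetch is only a hint; it performs no architectural
        // memory access and cannot fault, even for an invalid address.
        unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
    }
    // Everywhere else (and under Miri): do nothing.
    #[cfg(not(all(target_arch = "x86_64", not(miri))))]
    {
        let _ = ptr;
    }
}

/// Compute the address to hint with wrapping arithmetic, so the pointer
/// math itself stays defined even when the result is out of bounds.
pub fn prefetch_at_offset(base: *const u8, offset: usize) {
    prefetch_read_l1(base.wrapping_add(offset));
}
```

Because the address is computed with `wrapping_add` and never dereferenced, calling `prefetch_at_offset` with an offset far past the end of an allocation should be harmless under these assumptions.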
|
So, first, thank you for working on this! Figuring out the benchmarking and examples is most of the work for a feature like this, and I appreciate you doing it.

In terms of main remarks on the implementation, I think the first major one is that these methods should all be labelled

Additionally, while I don't think the API needs to be resolved right now, we should have some way of returning the prefetched hash/bucket index so that it can be reused later, especially if the implementation plans to only prefetch a set number of elements in advance. With a set number of elements, this could be stored in a fixed-size array and used later. (Note: you'd need to special-case the first n elements in this case.) I'm not actually sure whether that'd be a helpful API, but it feels important to point out, since the hashed data is potentially stored somewhere else in memory and you don't want to pull that page back into cache to hash it again if you don't need to.

(Speaking of which, I think this is a rare case where benchmarking with both integers and strings might be useful. Particularly heap-allocated strings, which will be scattered far more widely in memory.)
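To make the hash-reuse idea above concrete, here is a hedged sketch: each key is hashed once, at prefetch time, and the hash is parked in a fixed-size ring until the lookup for that key comes around. The names `batch_with_hash_reuse` and `AHEAD`, and the two closures standing in for "prefetch by hash" and "look up by hash", are hypothetical and not part of the PR.

```rust
use std::hash::{BuildHasher, Hash};

/// Hypothetical look-ahead distance (the PR's bench uses 8).
const AHEAD: usize = 8;

/// Hash every key exactly once: at prefetch time. The hash sits in a
/// fixed-size ring until the matching lookup runs `AHEAD` iterations later.
/// `prefetch` and `lookup` are stand-ins for table methods taking a hash.
pub fn batch_with_hash_reuse<K, S, P, L>(keys: &[K], hasher: &S, mut prefetch: P, mut lookup: L)
where
    K: Hash,
    S: BuildHasher,
    P: FnMut(u64),
    L: FnMut(u64, &K),
{
    let mut ring = [0u64; AHEAD];

    // Special case for the first AHEAD keys: nothing earlier in the loop
    // could have hashed and prefetched them, so do it up front.
    for (i, key) in keys.iter().take(AHEAD).enumerate() {
        let h = hasher.hash_one(key);
        ring[i] = h;
        prefetch(h);
    }

    for (i, key) in keys.iter().enumerate() {
        let slot = i % AHEAD;
        // Hash of `key`, computed earlier (prologue or AHEAD iterations ago).
        let h = ring[slot];

        // Reuse the same slot for the key AHEAD positions further on.
        if let Some(future) = keys.get(i + AHEAD) {
            let fh = hasher.hash_one(future);
            ring[slot] = fh;
            prefetch(fh);
        }

        lookup(h, key);
    }
}
```

The lookup closure never touches the hasher, so a heap-allocated key's bytes are read for hashing only once in this sketch.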
|
Want to make sure I'm reading the cold-branch suggestion right before coding it. Two implementations of

(a) Issue only the ctrl prefetch, skip the data prefetch entirely. The wasted-data-prefetch-on-empty-buckets concern is solved structurally: the data line is never hinted from

(b) Read ctrl during the prefetch call to probe for a matching tag, prefetch data only on a hit,

My read is that (a) gets the semantics you want without the synchronous ctrl read, but the
|
I did have (b) in mind, but I'm also much less certain about what would work best, so I'll trust whatever actually does work best.
…ntrinsics gate

Addresses clarfonthey's review on PR rust-lang#727:

* API split: rename `prefetch` to `prefetch_get` on HashMap, HashSet, HashTable, raw table; add `prefetch_insert` to signal insert intent. The two methods currently share the same implementation (`RawTableInner::prefetch_both`) because measured bench evidence on Crucible (Ryzen 9 9950X, hit-heavy AND miss-heavy workloads) shows the data-line prefetch is load-bearing for the win on lookups. A ctrl-only prefetch_get regresses 18-40% on hit-heavy and is neutral-to-slowdown on miss-heavy across the size sweep. The split expresses caller intent at the API surface; the implementations can diverge in a follow-up if a workload supports it.
* Nightly intrinsics feature gate in src/prefetch.rs: when the `nightly` feature is on, prefetch_read_l1 routes through core::intrinsics::prefetch_read_data with locality 3 (matches the stable shim's _MM_HINT_T0 on x86 so the comparison is apples-to-apples). Source comment documents the locality invariant.
* Bench module restructured into four groups: batch_lookup (integer keys, hit-heavy), batch_lookup_string (heap-string keys, hit-heavy), batch_lookup_miss (integer keys, miss-heavy), batch_insert (integer keys). Doc comments distributed through the module per the review ask. The batch_lookup_miss group exists specifically to bench the (a) ctrl-only vs (b) ctrl+data trade-off across workload regimes.
* Updated test_prefetch to exercise both methods over the same shapes (empty singleton, tiny, large, ZST, look-ahead patterns for both lookup and insert).

Tests + clippy + fmt + miri all green.
|
v2 pushed (f05a9f9). Addresses all three asks:
On the (a)-vs-(b) decision for

(a) regresses on hit-heavy and doesn't recover on miss-heavy. The data prefetch was load-bearing for v1's original

v2 ships both methods hinting both lines (behavioral parity with v1's original

Can implement (b) and re-bench if that's useful for the design conversation. My read of the evidence is that the API surface is the substantive part of your ask, the behavioral split is workload-dependent and currently doesn't pay, and the cleanest landing is named-split-only.
|
Thank you again for doing all the work on this! I am sceptical that the

I'm particularly interested in what the nightly intrinsics actually do to the benchmark, since I would assume that they don't affect the performance significantly, but intrinsics tend to be weird in a number of ways. It would be particularly helpful for considering how those intrinsics should look long-term, since this is a pretty compelling use case.
|
Bench results, stable shim vs nightly intrinsic on Crucible (Ryzen 9 9950X, taskset to core 0). Both runs use

Cleanest signal is the insert group (stable to nightly delta per size, both for naive and prefetch_insert):

All within ±2%, both for naive (which doesn't call the prefetch path at all, so it acts as a control) and prefetch_insert (which does). On x86 the intrinsic lowers to the same

Lookup groups had wider cross-run delta (~5-10%) but the delta hits naive too, indicating thermal drift across runs rather than a shim-vs-intrinsic signal. The insert group ran first, lookups second; the nightly run was warmer in the lookup phase. The within-run prefetch_get/naive ratio cancels the baseline drift:

The prefetch perf signature is preserved across both impls. Hit-heavy: prefetch starts winning around 256K, peaks at 1.18-1.26x at 1-4M. Miss-heavy: prefetch is a loss across the sweep (the data-line hint is wasted on probes that terminate on control bytes). Both behaviors hold under both shim and intrinsic.

Read: no meaningful codegen difference on x86. If the intrinsic does anything weird on aarch64 or another arch, this bench wouldn't catch it because Crucible is x86-only. That's the open question for the long-term API decision, but not one this PR can answer.

Methodology:
Summary
Closes #677 (or starts the concrete discussion of it — see the API question below).
Adds `prefetch(hash)` to `RawTable`, exposed as `HashTable::prefetch(hash)`, `HashMap::prefetch(&Q)`, and `HashSet::prefetch(&Q)` (the map/set versions hash the key first). It issues a software prefetch hint for the two cache lines a lookup of that hash would touch first (the control-byte group at the start of the probe sequence, and the corresponding data bucket), like abseil's `prefetch_hash`. It's a hint only: no memory access, never faults (an invalid/dangling address is harmless), and a no-op in the abstract machine.

Stable path, per architecture: `_mm_prefetch::<_MM_HINT_T0>` on x86 / x86-64 (`prefetcht0`), a no-op everywhere else; aarch64 has no stable prefetch intrinsic yet, and `core::intrinsics::prefetch_read_data` is unstable (rust-lang/rust#146941). The new `src/prefetch.rs` shim is `#[cfg(not(miri))]`-gated for the intrinsic, like the SIMD `Group` impls. The pointer arithmetic uses `wrapping_*` deliberately: a prefetch of an out-of-bounds address is a safe no-op, so it is the arithmetic that must not be UB. This works on the empty singleton too.

API question

For now this is L1 read prefetch only: the minimal thing, so there's something concrete to react to. If you'd rather offer the full `Locality` / read-vs-write surface, or wait for the std prefetch hints (rust-lang/rust#146941) to stabilize so this can re-use `Locality`, I'm happy to redo it; the `Locality` form is an easy extension on top of this. (Also, per the thread: this prefetches both the ctrl and data lines, and it's a `prefetch(hash)` method rather than a "raw bucket guess" getter.)

When it helps

Only when looking up many hashes in a sequence and the table is large enough that its control bytes have spilled out of cache; then the caller prefetches a key several lookups ahead. On a single lookup, or a cache-resident table, it does nothing useful (it's a slight loss: extra instructions, no missed load to hide). The new `benches/prefetch.rs` `batch_lookup` bench (16-byte keys, randomized lookup order, prefetch `i+8`) shows the crossover:

Matches the cross-architecture numbers @joshuaisaact posted in #677 (~8% @1M, ~15% @4M); batched multi-key lookups, per the thread, do better. The doc comments on the new methods spell this out so nobody reaches for it on a hot small table.
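The look-ahead pattern described in this section can be sketched on a toy table. Everything below is an assumption for illustration: `ToyTable` is a simple linear-probing table, not hashbrown's layout or API, `batch_get` is a hypothetical helper, and the prefetch shim is a stand-in that only covers x86-64. The point is only the shape of the loop: hint the lines for key `i + ahead`, then look up key `i`.

```rust
/// Stand-in for a prefetch shim (assumption; x86-64 only, no-op elsewhere).
#[inline(always)]
fn prefetch_read_l1(ptr: *const u8) {
    #[cfg(all(target_arch = "x86_64", not(miri)))]
    {
        use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        // SAFETY: a prefetch hint never dereferences or faults.
        unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
    }
    #[cfg(not(all(target_arch = "x86_64", not(miri))))]
    {
        let _ = ptr;
    }
}

/// Toy linear-probing table with separate control bytes and data slots.
pub struct ToyTable {
    ctrl: Vec<u8>, // 0 = empty, 1 = full
    slots: Vec<(u64, u64)>,
    mask: usize,
    len: usize,
}

impl ToyTable {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        ToyTable { ctrl: vec![0; capacity], slots: vec![(0, 0); capacity], mask: capacity - 1, len: 0 }
    }

    /// Multiplicative (Fibonacci) hash; good enough for a sketch.
    pub fn hash(key: u64) -> u64 {
        key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
    }

    fn start(&self, hash: u64) -> usize {
        ((hash >> 32) as usize) & self.mask
    }

    pub fn insert(&mut self, key: u64, value: u64) {
        // The toy table never grows; always keep at least one empty slot.
        assert!(self.len < self.mask, "toy table is full");
        let mut i = self.start(Self::hash(key));
        while self.ctrl[i] == 1 && self.slots[i].0 != key {
            i = (i + 1) & self.mask;
        }
        if self.ctrl[i] == 0 {
            self.len += 1;
        }
        self.ctrl[i] = 1;
        self.slots[i] = (key, value);
    }

    pub fn get(&self, key: u64) -> Option<u64> {
        let mut i = self.start(Self::hash(key));
        while self.ctrl[i] == 1 {
            if self.slots[i].0 == key {
                return Some(self.slots[i].1);
            }
            i = (i + 1) & self.mask;
        }
        None
    }

    /// Hint the control byte and data slot where a probe for `hash` starts.
    pub fn prefetch(&self, hash: u64) {
        let i = self.start(hash);
        prefetch_read_l1(self.ctrl.as_ptr().wrapping_add(i));
        prefetch_read_l1(self.slots.as_ptr().wrapping_add(i) as *const u8);
    }
}

/// Look up `keys` in order, prefetching the key `ahead` positions in front.
pub fn batch_get(table: &ToyTable, keys: &[u64], ahead: usize) -> Vec<Option<u64>> {
    let mut out = Vec::with_capacity(keys.len());
    for (i, &key) in keys.iter().enumerate() {
        if let Some(&future) = keys.get(i + ahead) {
            table.prefetch(ToyTable::hash(future));
        }
        out.push(table.get(key));
    }
    out
}
```

The results are the same with or without the hints; only timing can differ, and on a table this small the hints would be expected to cost slightly more than they save, as the section notes.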
Tests
`cargo test` green, including a new `test_prefetch` over the empty singleton / tiny / 1000-entry / ZST-value tables, present and absent keys, and the look-ahead pattern (asserts the table is intact and lookups still work after prefetching). `cargo +nightly miri test --lib` green, no UB (the pointer math runs under Miri; the intrinsic is `not(miri)`-gated, so the no-op branch runs there). `cargo clippy --all-targets` 0 warnings; `cargo fmt -- --check` clean.

+374 lines, 8 files (the `src/prefetch.rs` shim, the `RawTable` / `HashTable` / `HashMap` / `HashSet` methods, the bench, the test).