Skip to content

Add a way to prefetch a hash table bucket (#677)#727

Open
RyanJamesStewart wants to merge 2 commits into
rust-lang:mainfrom
RyanJamesStewart:feat/bucket-prefetch
Open

Add a way to prefetch a hash table bucket (#677)#727
RyanJamesStewart wants to merge 2 commits into
rust-lang:mainfrom
RyanJamesStewart:feat/bucket-prefetch

Conversation

@RyanJamesStewart
Copy link
Copy Markdown

Summary

Closes #677 (or starts the concrete discussion of it — see the API question below).

Adds prefetch(hash) to RawTable, exposed as HashTable::prefetch(hash), HashMap::prefetch(&Q), and HashSet::prefetch(&Q) (the map/set versions hash the key first). It issues a software prefetch hint for the two cache lines a lookup of that hash would touch first — the control-byte group at the start of the probe sequence, and the corresponding data bucket — like abseil's prefetch_hash. It's a hint only: no memory access, never faults (an invalid/dangling address is harmless), and a no-op in the abstract machine.

Stable path, per architecture: _mm_prefetch::<_MM_HINT_T0> on x86 / x86-64 (prefetcht0), a no-op everywhere else — aarch64 has no stable prefetch intrinsic yet, and core::intrinsics::prefetch_read_data is unstable (rust-lang/rust#146941). The new src/prefetch.rs shim is #[cfg(not(miri))]-gated for the intrinsic, like the SIMD Group impls. The pointer arithmetic uses wrapping_* deliberately (a prefetch of an out-of-bounds address is a safe no-op, so it must be the arithmetic that doesn't UB) — works on the empty singleton too.

API question

For now this is L1 read prefetch only — the minimal thing, so there's something concrete to react to. If you'd rather offer the full Locality / read-vs-write surface, or wait for the std prefetch hints (rust-lang/rust#146941) to stabilize so this can re-use Locality, I'm happy to redo it — the Locality form is an easy extension on top of this. (Also, per the thread: this prefetches both the ctrl and data lines, and it's a prefetch(hash) method rather than a "raw bucket guess" getter.)

When it helps

Only when looking up many hashes in a sequence and the table is large enough that its control bytes have spilled out of cache — then the caller prefetches a key several lookups ahead. On a single lookup, or a cache-resident table, it does nothing useful (it's a slight loss — extra instructions, no missed-load to hide). The new benches/prefetch.rs batch_lookup bench (16-byte keys, randomized lookup order, prefetch i+8) shows the crossover:

   4K slots:  ~0.7×   (cache-resident → pure overhead, a loss — the docs say so)
  64K slots:  ~0.93×
 256K slots:  ~1.15×
   1M slots:  ~1.15×
   4M slots:  ~1.12×   (ctrl array spilled L2/L3)

Matches the cross-architecture numbers @joshuaisaact posted in #677 (~8% @1m, ~15% @4m); batched multi-key lookups, per the thread, do better. The doc comments on the new methods spell this out so nobody reaches for it on a hot small table.

Tests

  • cargo test green, including a new test_prefetch over the empty singleton / tiny / 1000-entry / ZST-value tables, present and absent keys, and the look-ahead pattern (asserts the table is intact and lookups still work after prefetching).
  • cargo +nightly miri test --lib green — no UB (the pointer math runs under Miri; the intrinsic is not(miri)-gated, so the no-op branch runs there).
  • cargo clippy --all-targets 0 warnings; cargo fmt -- --check clean.

+374 lines, 8 files (the src/prefetch.rs shim, the RawTable/HashTable/HashMap/HashSet methods, the bench, the test).

Adds `prefetch(hash)` to `RawTable` and exposes it as:
- `HashTable::prefetch(hash)`
- `HashMap::prefetch(&Q)` / `HashSet::prefetch(&Q)` (hash the key, then prefetch)

A prefetch issues a software prefetch hint for the two cache lines a lookup
of that hash would touch first: the control-byte group at the start of the
probe sequence and the corresponding data bucket. It is a hint only — no
memory access, never faults (an invalid/dangling address is fine), a no-op
in the abstract machine.

The stable path is per-architecture: `_mm_prefetch` (`_MM_HINT_T0`) on
x86/x86-64, a no-op everywhere else (aarch64 has no stable prefetch
intrinsic yet, and `core::intrinsics::prefetch_read_data` is unstable).
The new `src/prefetch.rs` shim is `#[cfg(not(miri))]`-gated for the
intrinsic, like the SIMD `Group` impls. For now this is L1 read prefetch
only; a richer locality/read-write interface can follow once the std
prefetch hints (rust-lang/rust#146941) stabilize.

This only helps when the table is large enough that its control bytes spill
out of cache *and* the caller can prefetch a key several lookups ahead of
the one being processed (batched lookups / join probing). On a single
lookup, or a cache-resident table, it does nothing useful. The new
`benches/prefetch.rs` batch-lookup bench shows the crossover: roughly a
slight loss on a small (4K-slot) table, ~1.1-1.15x on tables that no longer
fit in L2/L3 (~1M-4M slots).
Comment thread src/prefetch.rs
Comment thread benches/prefetch.rs Outdated
@clarfonthey
Copy link
Copy Markdown
Contributor

So, first, thank you for working on this! Figuring out the benchmarking and examples is most of the work for a feature like this, and I appreciate you doing it.

In terms of main remarks on the implementation, I think that probably the first major one is I think these methods should all be labelled prefetch_get (or prefetch_read) instead of just prefetch. Even if the actual instructions don't care whether you're reading or writing, I'd say it matters for the kind of prefetching in this crate whether you plan to read or write, since if you end up finding an empty bucket, you have no reason to actually prefetch the place the value would be stored unless you were planning to insert there later. (This is also why I would use prefetch_get instead of prefetch_read, since it's specifically indicating the operation you'd be performing. Similarly, maybe prefetch_insert could be used instead of prefetch_write.)

Additionally, while I don't think the API needs to be resolved right now, but especially if the implementation plans to only prefetch a set number of elements in advance, we should have some way of returning the prefetched hash/bucket index such that it can be reused later. With a set number of elements, this could be stored in a fixed-size array and used later. (Note: you'd need to special case the first n elements in this case.)

Not actually sure whether that'd be a helpful API but it feels important to point out since the hashed data is stored potentially somewhere else in memory and you don't want to re-pull that page back into cache to hash it again if you don't need to. (Speaking of which, I think this is a rare case where benchmarking with both integers and strings might be useful. Particularly heap-allocated strings, which will be scattered way more in memory.)

Comment thread src/raw.rs
@RyanJamesStewart
Copy link
Copy Markdown
Author

Want to make sure I'm reading the cold-branch suggestion right before coding it. Two implementations of prefetch_get:

(a) Issue only the ctrl prefetch, skip the data prefetch entirely. The wasted-data-prefetch-on-empty-buckets concern is solved structurally: the data line is never hinted from prefetch_get. Simple, no ctrl read in the prefetch call.

(b) Read ctrl during the prefetch call to probe for a matching tag, prefetch data only on a hit, #[cold] on the empty-group branch. Matches your literal description but reads ctrl synchronously, which collapses the prefetch-vs-lookup overlap window the caller is trying to exploit.

My read is that (a) gets the semantics you want without the synchronous ctrl read, but the #[cold] annotation in your comment reads like you had (b) in mind. Which were you thinking?

@clarfonthey
Copy link
Copy Markdown
Contributor

I did have b in mind, but I also am way less certain about what would work best, so, I'll trust in whatever actually does work best.

…ntrinsics gate

Addresses clarfonthey's review on PR rust-lang#727:

* API split: rename `prefetch` to `prefetch_get` on HashMap, HashSet,
  HashTable, raw table; add `prefetch_insert` to signal insert intent.
  The two methods currently share the same implementation
  (`RawTableInner::prefetch_both`) because measured bench evidence on
  Crucible (Ryzen 9 9950X, hit-heavy AND miss-heavy workloads) shows
  the data-line prefetch is load-bearing for the win on lookups. A
  ctrl-only prefetch_get regresses 18-40% on hit-heavy and is
  neutral-to-slowdown on miss-heavy across the size sweep. The split
  expresses caller intent at the API surface; the implementations can
  diverge in a follow-up if a workload supports it.

* Nightly intrinsics feature gate in src/prefetch.rs: when the
  `nightly` feature is on, prefetch_read_l1 routes through
  core::intrinsics::prefetch_read_data with locality 3 (matches the
  stable shim's _MM_HINT_T0 on x86 so the comparison is
  apples-to-apples). Source comment documents the locality invariant.

* Bench module restructured into three groups: batch_lookup (integer
  keys, hit-heavy), batch_lookup_string (heap-string keys, hit-heavy),
  batch_lookup_miss (integer keys, miss-heavy), batch_insert (integer
  keys). Doc comments distributed through the module per the review
  ask. The batch_lookup_miss group exists specifically to bench the
  (a) ctrl-only vs (b) ctrl+data trade-off across workload regimes.

* Updated test_prefetch to exercise both methods over the same shapes
  (empty singleton, tiny, large, ZST, look-ahead patterns for both
  lookup and insert).

Tests + clippy + fmt + miri all green.
@RyanJamesStewart
Copy link
Copy Markdown
Author

v2 pushed (f05a9f9). Addresses all three asks:

  1. API split: prefetchprefetch_get + new prefetch_insert. Three layers (RawTable, HashTable, HashMap, HashSet), test_prefetch updated to exercise both. The split is named-only at this point: both methods share the same RawTableInner::prefetch_both implementation. Reason below.

  2. Nightly intrinsics feature gate in src/prefetch.rs. Under feature = "nightly", routes through core::intrinsics::prefetch_read_data with locality 3 (matches the stable shim's _MM_HINT_T0). Source comment documents the locality invariant so future readers can verify the comparison stays apples-to-apples.

  3. Bench doc comments distributed through benches/prefetch.rs. New groups: batch_lookup (integer, hit-heavy), batch_lookup_string (heap-string, hit-heavy, per the non-blocking suggestion), batch_lookup_miss (integer, 0% hit rate), batch_insert.

On the (a)-vs-(b) decision for prefetch_get's behavior. Benched (a) (ctrl-only) on Crucible (Ryzen 9 9950X):

batch_lookup (100% hit rate, integer keys):
  4K:   naive 88µs   prefetch_get 120µs  1.36x slowdown
  64K:  naive 135µs  prefetch_get 190µs  1.40x slowdown
  256K: naive 180µs  prefetch_get 215µs  1.20x slowdown
  1M:   naive 231µs  prefetch_get 282µs  1.22x slowdown
  4M:   naive 278µs  prefetch_get 327µs  1.18x slowdown

batch_lookup_miss (0% hit rate, integer keys):
  4K:   naive 159µs  prefetch_get 126µs  1.26x speedup (noisy, small table)
  64K:  naive 102µs  prefetch_get 207µs  2.03x slowdown
  256K: naive 108µs  prefetch_get 136µs  1.26x slowdown
  1M:   naive 156µs  prefetch_get 186µs  1.19x slowdown
  4M:   naive 170µs  prefetch_get 172µs  neutral

(a) regresses on hit-heavy and doesn't recover on miss-heavy. The data prefetch was load-bearing for v1's original prefetch-both; removing it loses the win without the trade-off paying off on either workload. (b)'s synchronous-ctrl-read structure would block during prefetch_get, which contradicts the overlap-window the API is designed to exploit. Didn't implement (b) because the bench evidence on (a) already showed the data prefetch isn't optional for the headline workload.

v2 ships both methods hinting both lines (behavioral parity with v1's original prefetch). Re-bench at 1M and 4M with prefetch_get-now-both-lines shows the win restored (1.36-1.46x speedup, in line with v1's reported numbers; absolute numbers shifted from the first run due to thermal variance, relative ratios are stable). The named split keeps room for a behavioral specialization in a follow-up if a workload surfaces that supports the trade-off.

Can implement (b) and re-bench if that's useful for the design conversation. My read of the evidence is that the API surface is the substantive part of your ask, the behavioral split is workload-dependent and currently doesn't pay, and the cleanest landing is named-split-only.

@clarfonthey
Copy link
Copy Markdown
Contributor

Thank you again for doing all the work on this! I am sceptical that the prefetch_read / prefetch_write distinction will be useful in the long run because I genuinely can't think of an instance where the intent would matter here, but it's worth pointing out that the nightly API technically would prefer that the prefetch_both be separated for that reason. I personally think it's fine to stay on read prefetch for now on nightly.

I'm particularly interested in what the nightly intrinsics actually do to the benchmark, since I would assume that they don't affect the performance significantly, but intrinsics tend to be weird in a number of ways. It would be particularly helpful for considering how those intrinsics should be long-term since this is a pretty compelling use case.

@RyanJamesStewart
Copy link
Copy Markdown
Author

Bench results, stable shim vs nightly intrinsic on Crucible (Ryzen 9 9950X, taskset to core 0). Both runs use cargo +nightly bench; only the --features nightly flag differs, so the compiler is identical and the only swap is _mm_prefetch::<_MM_HINT_T0> for core::intrinsics::prefetch_read_data::<_, 3>. Locality-3-equals-T0 invariant documented at the source.

Cleanest signal is the insert group (stable to nightly delta per size, both for naive and prefetch_insert):

              naive delta   prefetch_insert delta
  4K:          +0.3%         +0.8%
  64K:          0.0%         +0.2%
  256K:        -1.8%         -1.2%
  1M:          -0.1%         +0.3%
  4M:          -0.6%         -0.4%

All within ±2%, both for naive (which doesn't call the prefetch path at all, so it acts as a control) and prefetch_insert (which does). On x86 the intrinsic lowers to the same prefetcht0 as the stable shim, so equivalence is what the codegen predicts.

Lookup groups had wider cross-run delta (~5-10%) but the delta hits naive too, indicating thermal drift across runs rather than a shim-vs-intrinsic signal. The insert group ran first, lookups second; the nightly run was warmer in the lookup phase. The within-run prefetch_get/naive ratio cancels the baseline drift:

batch_lookup (prefetch_get / naive, within run):
                stable shim    nightly intrinsic
  4K:           1.361 slow     1.353 slow
  64K:          1.157 slow     1.103 slow
  256K:         1.016 ~        0.950 fast
  1M:           0.902 fast     0.851 fast
  4M:           0.844 fast     0.792 fast

batch_lookup_miss (prefetch_get / naive, within run):
                stable shim    nightly intrinsic
  4K:           1.403 slow     1.255 slow
  64K:          1.490 slow     1.341 slow
  256K:         1.537 slow     1.444 slow
  1M:           1.214 slow     1.260 slow
  4M:           1.369 slow     1.391 slow

The prefetch perf signature is preserved across both impls. Hit-heavy: prefetch starts winning around 256K, peaks at 1.18-1.26x at 1-4M. Miss-heavy: prefetch is a loss across the sweep (the data-line hint is wasted on probes that terminate on control bytes). Both behaviors hold under both shim and intrinsic.

Read: no meaningful codegen difference on x86. If the intrinsic does anything weird on aarch64 or another arch, this bench wouldn't catch it because Crucible is x86-only. That's the open question for the long-term API decision, but not one this PR can answer.

Methodology: cargo +nightly bench --bench bench -- --warm-up-time 1 --measurement-time 3 "batch_(lookup|insert)", with --features nightly added for the intrinsic run. Both pinned to core 0 via taskset. Full bench logs available on request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow to prefetch buckets

2 participants