Skip to content

opencoff/go-sieve

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

go-sieve - SIEVE cache eviction for Go

Go Reference Go Report Card Release

A Go implementation of the SIEVE cache eviction algorithm (NSDI'24, Zhang et al.), engineered from the ground up for highly concurrent, read-heavy workloads. Generic over key and value types.

The read path (Get()) is fully lock-free — a single atomic load on the key→index map plus a single atomic bit update on a shared visited bitfield. No mutex, no pointer chasing, no per-entry allocations, zero GC traffic on hits. Under parallel load this is ~90–300x faster than hashicorp/golang-lru and scales linearly with cores: measured at 1–3 ns/op across 32 goroutines on real-world cache traces. The write path uses a single short-held mutex and a pre-allocated node pool, so Add()/Probe() also avoid per-operation heap allocation in steady state.

If you need a cache that many goroutines hit simultaneously on the read path — an HTTP response cache, a DNS resolver cache, an authz decision cache, a hot-path object lookup — this implementation is built for that shape. For purely single-threaded use, hashicorp/golang-lru may be marginally faster on the sequential write path; under any concurrency, Sieve wins decisively.

SIEVE uses a FIFO queue with a roving "hand" pointer: cache hits set a visited bit (lazy promotion), and eviction scans from the hand clearing visited bits until it finds an unvisited node (quick demotion). It matches or exceeds LRU/ARC hit ratios with far less bookkeeping — validated here on ~300M requests from the MSR Cambridge and Meta Storage trace repositories.

Key Design Elements

Array-backed indexed list. Marc Brooker observed that SIEVE's mid-list removal prevents a simple circular buffer. Rather than Tobin Baker's "swap tail into hole" workaround, this implementation uses a doubly-linked list with int32 indices into a pre-allocated backing array. This preserves SIEVE's exact eviction semantics while eliminating all interior pointers — the GC sees a flat []node with no pointers to trace (for non-pointer K, V types).

xsync.MapOf for concurrent access. The key→index map uses puzpuzpuz/xsync.MapOf which stores int32 values inline in cache-line-padded buckets — no traced pointers per entry. Get() is fully lock-free; only Add()/Probe() (on miss) and Delete() take the global mutex.

Columnar slot state (slotState). Each node's lock and visited counter are hoisted out of the node struct into a separate contiguous []uint64 — one word per slot, laid out as a column alongside the []node array. This columnar split has three effects:

  1. the eviction hand scans IsVisited by walking a dense []uint64 sequentially — 8 slots per cache line, hardware-prefetch friendly — without pulling in key/val data it doesn't need;
  2. Get()'s LockAndMark writes only to the slotState word for that slot, so it never dirties the cache line holding the node's key/val — no false sharing with concurrent readers of adjacent nodes
  3. []uint64 contains no pointers, so the GC never traces it, unlike node fields that may hold pointer-typed K/V.

Within each word, bit 63 is a spinlock and the low bits are a saturating visited counter (1 bit at k=1, ⌈log₂(k+1)⌉ bits for higher k).

Pre-allocated node pool. All nodes are allocated once at cache creation in a contiguous array. A bump allocator + intrusive freelist (reusing node.next) provides O(1) alloc/free with zero heap allocations during steady-state operation.

TOCTOU-safe concurrent writes. Add() and Probe() use a double-check pattern: fast-path Load() outside the lock, re-check under mu.Lock() before inserting. This prevents duplicate nodes from concurrent writers racing on the same key.

Benchmark Results

Benchmarked against hashicorp/golang-lru v2.0.7 (LRU and ARC) on a 13th Gen Intel Core i9-13900 with 32 cores and 32GB of RAM:

  • Ubuntu/Linux 24.04 (kernel 6.8.0-106)
  • go1.26.1
  • GOMAXPROCS=32

Benchmarks live in bench/ as a separate module to avoid polluting go.mod.

Commands used (no name filter — every benchmark in every package runs):

cd bench && make bench    # synthetic comparison, count=3
cd bench && make trace    # trace replay + miss ratio + GC, count=1

Full raw results: bench-results.md.

Reading the parallel ns/op numbers

The sub-2 ns/op numbers in the parallel tables below are real but need context: they are aggregate throughput.

Go's b.RunParallel distributes b.N total operations across GOMAXPROCS goroutines and reports ns/op = wall_clock / b.N. When 32 goroutines complete 1 billion Get()s in ~1 second, the reported number is 1.0 ns/op — meaning "the system produces one completed Get every ~1 ns." The per-core latency is ~32 ns (1.0 × 32 cores), which is consistent with two L1/L2-hot atomic operations on a 5 GHz CPU.

LRU/ARC report ~200–600 ns/op under the same conditions — not because a single Get is 200x slower, but because every Get takes a mutex. The 32 goroutines serialize through one lock, so throughput is flat regardless of core count. On a single goroutine (see BenchmarkReplay, sequential), SIEVE and LRU are within 15% of each other — the ~100–300x gap is a concurrency-scaling story.

After warmup, the working data structures (map buckets, node array, visited bitfield) are resident in CPU cache — L1/L2 for small traces, L3 for larger ones. Real workloads with lower temporal locality will see higher per-core latency, but the relative advantage over mutex-bound LRU/ARC holds whenever there is any parallelism at all.

Parallel Micro-Benchmarks (count=3, medians)

Benchmark Sieve LRU ARC
Get_Parallel 2.36 ns/op, 0 B 563.2 ns/op, 0 B 606.7 ns/op, 0 B
Add_Parallel 426.9 ns/op, 8 B 527.0 ns/op, 40 B 1020 ns/op, 76 B
Probe_Parallel 378.4 ns/op, 8 B
Delete_Parallel 230.1 ns/op, 0 B 163.1 ns/op, 0 B 253.9 ns/op, 0 B
Mixed_Parallel (60/30/10) 344.1 ns/op, 2 B 602.7 ns/op, 12 B 637.9 ns/op, 24 B
Zipf_Get_Parallel (s=1.01) 16.5 ns/op 472.5 ns/op 396.7 ns/op
Memory @ 1M fill 122 MB, 1.10M allocs 156 MB, 1.01M allocs 156 MB, 1.01M allocs
GCImpact 9.22 ms/op, 9.8 KB/op 13.64 ms/op, 26.3 KB/op 14.26 ms/op, 117.2 KB/op

Probe_Parallel is SIEVE-only — LRU's PeekOrAdd and ContainsOrAdd skip recency promotion and are not semantic equivalents. Delete_Parallel is the one micro where SIEVE loses: LRU's single-lock linked-list unlink edges out SIEVE's slot-state clear (163 vs 230 ns/op). ARC is slowest (254 ns/op) because its T1/T2/B1/B2 bookkeeping doubles the work.

Trace Replay Results

This implmentation is benchmarked against real-world cache traces from the libCacheSim trace repository — 14 MSR Cambridge enterprise block I/O traces + 5 Meta Storage (Tectonic) block traces totalling ~300M requests. Each trace was replayed with a cache sized at 10% of unique keys, comparing SIEVE (k=1, k=2, k=3) against hashicorp/golang-lru (LRU and ARC).

Parallel Get throughput (warm cache, 32 goroutines, ns/op, zero allocs)

SIEVE's lock-free Get() is ~100–300x faster than LRU/ARC under concurrent read load. Every trace, every cache:

Trace SIEVE k=1 SIEVE k=3 LRU ARC
msr_web_2 1.02 1.04 182.6 377.5
meta_storage/block_traces_1 1.29 1.44 232.1 360.2
meta_storage/block_traces_2 1.30 1.51 257.2 358.9
msr_proj_4 1.31 1.23 360.7 479.1
meta_storage/block_traces_3 1.50 1.60 264.9 458.0
meta_storage/block_traces_4 1.55 1.62 234.1 449.2
msr_prn_1 1.56 1.59 321.4 381.6
meta_storage/block_traces_5 1.59 1.70 222.5 348.6
msr_prxy_0 1.76 2.27 313.5 363.3
msr_usr_2 2.03 2.12 344.5 502.0
msr_src1_1 2.19 2.28 394.8 587.4
msr_usr_1 2.32 2.26 395.0 536.9
msr_proj_2 2.79 3.03 280.8 460.6
msr_proj_1 2.91 2.89 334.6 483.3
msr_src1_0 4.85 4.89 336.4 443.6
msr_proj_0 5.24 5.67 294.6 407.5
msr_hm_0 6.61 7.12 275.6 455.2
msr_prn_0 9.21 10.41 288.0 402.3

The k=3 saturating counter adds under 10% overhead to the read path. This benchmark pre-warms the cache with a full trace replay, then hammers Get() only — the ideal scenario for a read-heavy cache in steady state.

Parallel Replay throughput (cold cache, mixed read+write, 32 goroutines, ns/op)

This is Probe() for SIEVE and Get+Add for LRU/ARC, hammered in parallel with no warmup. It measures the steady-state workload a real cache faces: reads and writes interleaved, evictions happening live. The previous table's numbers reflect the best-case read ceiling; this table reflects what throughput you actually get when your cache is doing work.

Trace SIEVE k=1 SIEVE k=3 LRU ARC
msr_prxy_0 6.45 6.61 401.3 431.0
msr_prn_0 14.90 15.49 481.7 497.1
meta_storage/block_traces_4 16.01 16.75 469.5 453.0
meta_storage/block_traces_1 16.15 18.48 440.7 486.5
msr_proj_0 16.36 15.86 422.8 471.8
msr_prn_1 16.50 18.88 421.4 491.6
meta_storage/block_traces_2 16.69 16.85 454.9 525.0
meta_storage/block_traces_3 16.74 17.71 434.4 501.3
meta_storage/block_traces_5 17.05 16.65 449.6 519.8
msr_hm_0 19.18 17.29 447.8 499.7
msr_usr_2 19.04 19.13 431.5 545.2
msr_proj_4 20.20 20.71 416.4 438.8
msr_usr_1 21.64 23.53 441.6 418.7
msr_src1_1 21.96 20.60 429.2 526.0
msr_src1_0 22.34 25.78 402.0 461.8
msr_proj_1 22.46 22.27 426.8 446.2
msr_proj_2 22.81 22.03 441.2 529.0
msr_web_2 24.35 25.41 436.6 466.3

Under concurrent cold-cache replay, SIEVE is ~18–62x faster than LRU/ARC. The best case is msr_prxy_0 (95% hits, so the lock-free fast path dominates); the worst case is msr_web_2 (98% miss ratio, almost every call takes the write lock) — and even there SIEVE is 18x faster.

Miss ratio (every trace)

Cache sized at 10% of unique keys. Bold = best in row.

Trace SIEVE k=1 SIEVE k=2 SIEVE k=3 LRU ARC
meta_storage/block_traces_1 0.4632 0.4651 0.4672 0.4602 0.4667
meta_storage/block_traces_2 0.4719 0.4743 0.4754 0.4676 0.4755
meta_storage/block_traces_3 0.4908 0.4928 0.4948 0.4885 0.4947
meta_storage/block_traces_4 0.4841 0.4870 0.4888 0.4812 0.4887
meta_storage/block_traces_5 0.4959 0.4984 0.4998 0.4927 0.5003
msr_hm_0 0.2991 0.3025 0.3025 0.3188 0.2923
msr_prn_0 0.2156 0.2194 0.2208 0.2310 0.2145
msr_prn_1 0.3908 0.3837 0.3796 0.4341 0.4148
msr_proj_0 0.2537 0.2660 0.2745 0.2375 0.2242
msr_proj_1 0.6794 0.6794 0.6794 0.7215 0.6788
msr_proj_2 0.8231 0.8231 0.8231 0.8548 0.8125
msr_proj_4 0.8463 0.8463 0.8463 0.8140 0.7173
msr_prxy_0 0.0512 0.0572 0.0594 0.0476 0.0468
msr_src1_0 0.7845 0.7845 0.7845 0.9132 0.7811
msr_src1_1 0.7939 0.7934 0.7934 0.8129 0.8231
msr_usr_1 0.3558 0.3558 0.3558 0.4007 0.3513
msr_usr_2 0.7216 0.7216 0.7216 0.7533 0.7199
msr_web_2 0.9786 0.9786 0.9786 0.9929 0.9785

Overall best (bold): ARC has the lowest miss ratio in 11 of 18 rows, LRU in 5 (all meta_storage), SIEVE k=3 in 2 (msr_prn_1 and msr_src1_1). SIEVE k=1 is never the overall best.

Head-to-head, SIEVE k=1 vs LRU (the typical deployment choice): SIEVE k=1 has lower miss ratio on 10 of 18 traces, LRU on 8. When SIEVE is better the margins are large (msr_src1_0: 12.9 points, msr_usr_1: 4.5, msr_prn_1: 4.3). When LRU is better the margins are narrow (all 5 meta_storage: 2–3 points each).

SIEVE's case rests on throughput Miss ratios are competitive (SIEVE is never dramatically worse than LRU), and SIEVE is 18–300x faster under any concurrency. SIEVE k=3 produces the single-best entry in the entire table on msr_prn_1 (0.3796 vs LRU 0.4341, ARC 0.4148).

Memory during replay

On the 13.2M-request meta_storage/block_traces_1 trace (601K-entry cache), TestGCPressure reports:

Variant TotalAlloc
SIEVE k=1 154 MB
SIEVE k=3 155 MB
LRU 418 MB
ARC 997 MB

SIEVE allocates 2.7x less than LRU and 6.5x less than ARC during replay — the array-backed node pool and inline-int32 xsync.MapOf are structural wins.

Full per-trace tables (sequential replay ns/op, B/op, miss ratio for every trace) are in bench-results.md. Methodology and trace-loading details are in bench/README.md.

Usage

import "github.com/opencoff/go-sieve"

// Create a cache mapping string keys to int values, capacity 1000.
c, err := sieve.New[string, int](1000)
if err != nil {
    log.Fatal(err) // ErrInvalidCapacity or ErrInvalidVisitClamp
}

// Or, for constant arguments, use Must to get a one-liner:
c := sieve.Must(sieve.New[string, int](1000))

c.Add("foo", 42)

if val, ok := c.Get("foo"); ok {
    fmt.Println(val) // 42
}

// Probe inserts only if absent; returns the cached value if present.
val, _, r := c.Probe("foo", 99)
// val == 42, r.Hit() == true

c.Delete("foo")
c.Purge() // reset entire cache

SIEVE-k

WithVisitClamp(k) creates a SIEVE-k cache where each entry uses a saturating counter instead of a single visited bit. An item accessed k+1 times survives k eviction passes before being evicted. k=1 is equivalent to classic SIEVE (the default). Use k=2 or k=3 for workloads with repeated access patterns where extra eviction resistance is beneficial.

// Classic SIEVE (k=1, the default)
c, err := sieve.New[string, int](1000)

// SIEVE-k=3: items survive up to 3 eviction passes
c, err := sieve.New[string, int](1000, sieve.WithVisitClamp(3))

New returns an error for capacity <= 0 (ErrInvalidCapacity) and for WithVisitClamp(k) with k > sieve.MaxVisitClamp (255 — ErrInvalidVisitClamp). Clamp values below 1 are silently rounded up to 1.

Eviction Handling

Add() and Probe() return the evicted entry (if any) along with a CacheResult bitmask. This allows callers to handle evictions synchronously without channels, goroutines, or lifecycle management.

c := sieve.Must(sieve.New[string, int](1000))

ev, r := c.Add("foo", 42)
if r.Evicted() {
    cleanupDisk(ev.Key, ev.Val)
}

// CacheResult bitmask:
//   CacheHit   — key was already present (value updated, no eviction)
//   CacheEvict — an entry was evicted to make room (mutually exclusive with CacheHit)

Purge() and Delete() do not report evictions.

API

Function / Method Description
New[K, V](capacity, ...Option) (*Sieve[K,V], error) Create a cache with fixed capacity. Returns ErrInvalidCapacity or ErrInvalidVisitClamp on bad input.
Must[K, V](*Sieve[K,V], error) *Sieve[K,V] Helper that panics on error; useful with constant arguments.
Get(key) (V, bool) Look up a key (lock-free, zero-alloc).
Add(key, val) (Evicted[K,V], CacheResult) Insert or update; returns evicted entry and result bitmask.
Probe(key, val) (V, Evicted[K,V], CacheResult) Insert-if-absent; returns cached/inserted value, evicted entry, and result bitmask.
Delete(key) bool Remove a key.
Purge() Clear the entire cache.
Len() int Current number of entries (lock-free atomic load).
Cap() int Maximum capacity.

Options

Option Description
WithVisitClamp(k) Use k-level saturating counters (default k=1 = classic SIEVE). k is capped at MaxVisitClamp (255); k < 1 is silently rounded to 1.

GC Note

When K or V is a pointer type (including string, which contains an internal pointer in Go), the node array will still contain GC-traced pointers. The GC pressure reduction is most dramatic for scalar key/value types (int, [16]byte, fixed-size structs).

License

BSD-2-Clause. See the source files for the full license text.

Packages

 
 
 

Contributors