go-sieve - SIEVE cache eviction for Go

A Go implementation of the SIEVE cache eviction algorithm (NSDI'24, Zhang et al.), engineered from the ground up for highly concurrent, read-heavy workloads. Generic over key and value types.

The read path (Get()) is fully lock-free — a single atomic load on the key→index map plus a single atomic bit update on a shared visited bitfield. No mutex, no pointer chasing, no per-entry allocations, zero GC traffic on hits. Under parallel load this is ~90–300x faster than hashicorp/golang-lru and scales linearly with cores: measured at 1–3 ns/op across 32 goroutines on real-world cache traces. The write path uses a single short-held mutex and a pre-allocated node pool, so Add()/Probe() also avoid per-operation heap allocation in steady state.

If you need a cache that many goroutines hit simultaneously on the read path — an HTTP response cache, a DNS resolver cache, an authz decision cache, a hot-path object lookup — this implementation is built for that shape. For purely single-threaded use, hashicorp/golang-lru may be marginally faster on the sequential write path; under any concurrency, Sieve wins decisively.

SIEVE uses a FIFO queue with a roving "hand" pointer: cache hits set a visited bit (lazy promotion), and eviction scans from the hand clearing visited bits until it finds an unvisited node (quick demotion). It matches or exceeds LRU/ARC hit ratios with far less bookkeeping — validated here on ~300M requests from the MSR Cambridge and Meta Storage trace repositories.

Key Design Elements

Array-backed indexed list. Marc Brooker observed that SIEVE's mid-list removal prevents a simple circular buffer. Rather than Tobin Baker's "swap tail into hole" workaround, this implementation uses a doubly-linked list with int32 indices into a pre-allocated backing array. This preserves SIEVE's exact eviction semantics while eliminating all interior pointers — the GC sees a flat []node with no pointers to trace (for non-pointer K, V types).

xsync.MapOf for concurrent access. The key→index map uses puzpuzpuz/xsync.MapOf which stores int32 values inline in cache-line-padded buckets — no traced pointers per entry. Get() is fully lock-free; only Add()/Probe() (on miss) and Delete() take the global mutex.

Columnar slot state (slotState). Each node's lock and visited counter are hoisted out of the node struct into a separate contiguous []uint64 — one word per slot, laid out as a column alongside the []node array. This columnar split has three effects:

the eviction hand scans IsVisited by walking a dense []uint64 sequentially — 8 slots per cache line, hardware-prefetch friendly — without pulling in key/val data it doesn't need;
Get()'s LockAndMark writes only to the slotState word for that slot, so it never dirties the cache line holding the node's key/val — no false sharing with concurrent readers of adjacent nodes
[]uint64 contains no pointers, so the GC never traces it, unlike node fields that may hold pointer-typed K/V.

Within each word, bit 63 is a spinlock and the low bits are a saturating visited counter (1 bit at k=1, ⌈log₂(k+1)⌉ bits for higher k).

Pre-allocated node pool. All nodes are allocated once at cache creation in a contiguous array. A bump allocator + intrusive freelist (reusing node.next) provides O(1) alloc/free with zero heap allocations during steady-state operation.

TOCTOU-safe concurrent writes. Add() and Probe() use a double-check pattern: fast-path Load() outside the lock, re-check under mu.Lock() before inserting. This prevents duplicate nodes from concurrent writers racing on the same key.

Benchmark Results

Benchmarked against hashicorp/golang-lru v2.0.7 (LRU and ARC) on a 13th Gen Intel Core i9-13900 with 32 cores and 32GB of RAM:

Ubuntu/Linux 24.04 (kernel 6.8.0-106)
go1.26.1
GOMAXPROCS=32

Benchmarks live in bench/ as a separate module to avoid polluting go.mod.

Commands used (no name filter — every benchmark in every package runs):

cd bench && make bench    # synthetic comparison, count=3
cd bench && make trace    # trace replay + miss ratio + GC, count=1

Full raw results: bench-results.md.

Reading the parallel ns/op numbers

The sub-2 ns/op numbers in the parallel tables below are real but need context: they are aggregate throughput.

Go's b.RunParallel distributes b.N total operations across GOMAXPROCS goroutines and reports ns/op = wall_clock / b.N. When 32 goroutines complete 1 billion Get()s in ~1 second, the reported number is 1.0 ns/op — meaning "the system produces one completed Get every ~1 ns." The per-core latency is ~32 ns (1.0 × 32 cores), which is consistent with two L1/L2-hot atomic operations on a 5 GHz CPU.

LRU/ARC report ~200–600 ns/op under the same conditions — not because a single Get is 200x slower, but because every Get takes a mutex. The 32 goroutines serialize through one lock, so throughput is flat regardless of core count. On a single goroutine (see BenchmarkReplay, sequential), SIEVE and LRU are within 15% of each other — the ~100–300x gap is a concurrency-scaling story.

After warmup, the working data structures (map buckets, node array, visited bitfield) are resident in CPU cache — L1/L2 for small traces, L3 for larger ones. Real workloads with lower temporal locality will see higher per-core latency, but the relative advantage over mutex-bound LRU/ARC holds whenever there is any parallelism at all.

Parallel Micro-Benchmarks (`count=3`, medians)

Benchmark	Sieve	LRU	ARC
`Get_Parallel`	2.36 ns/op, 0 B	563.2 ns/op, 0 B	606.7 ns/op, 0 B
`Add_Parallel`	426.9 ns/op, 8 B	527.0 ns/op, 40 B	1020 ns/op, 76 B
`Probe_Parallel`	378.4 ns/op, 8 B	—	—
`Delete_Parallel`	230.1 ns/op, 0 B	163.1 ns/op, 0 B	253.9 ns/op, 0 B
`Mixed_Parallel` (60/30/10)	344.1 ns/op, 2 B	602.7 ns/op, 12 B	637.9 ns/op, 24 B
`Zipf_Get_Parallel` (s=1.01)	16.5 ns/op	472.5 ns/op	396.7 ns/op
Memory @ 1M fill	122 MB, 1.10M allocs	156 MB, 1.01M allocs	156 MB, 1.01M allocs
`GCImpact`	9.22 ms/op, 9.8 KB/op	13.64 ms/op, 26.3 KB/op	14.26 ms/op, 117.2 KB/op

Probe_Parallel is SIEVE-only — LRU's PeekOrAdd and ContainsOrAdd skip recency promotion and are not semantic equivalents. Delete_Parallel is the one micro where SIEVE loses: LRU's single-lock linked-list unlink edges out SIEVE's slot-state clear (163 vs 230 ns/op). ARC is slowest (254 ns/op) because its T1/T2/B1/B2 bookkeeping doubles the work.

Trace Replay Results

This implmentation is benchmarked against real-world cache traces from the libCacheSim trace repository — 14 MSR Cambridge enterprise block I/O traces + 5 Meta Storage (Tectonic) block traces totalling ~300M requests. Each trace was replayed with a cache sized at 10% of unique keys, comparing SIEVE (k=1, k=2, k=3) against hashicorp/golang-lru (LRU and ARC).

Parallel Get throughput (warm cache, 32 goroutines, ns/op, zero allocs)

SIEVE's lock-free Get() is ~100–300x faster than LRU/ARC under concurrent read load. Every trace, every cache:

Trace	SIEVE k=1	SIEVE k=3	LRU	ARC
msr_web_2	1.02	1.04	182.6	377.5
meta_storage/block_traces_1	1.29	1.44	232.1	360.2
meta_storage/block_traces_2	1.30	1.51	257.2	358.9
msr_proj_4	1.31	1.23	360.7	479.1
meta_storage/block_traces_3	1.50	1.60	264.9	458.0
meta_storage/block_traces_4	1.55	1.62	234.1	449.2
msr_prn_1	1.56	1.59	321.4	381.6
meta_storage/block_traces_5	1.59	1.70	222.5	348.6
msr_prxy_0	1.76	2.27	313.5	363.3
msr_usr_2	2.03	2.12	344.5	502.0
msr_src1_1	2.19	2.28	394.8	587.4
msr_usr_1	2.32	2.26	395.0	536.9
msr_proj_2	2.79	3.03	280.8	460.6
msr_proj_1	2.91	2.89	334.6	483.3
msr_src1_0	4.85	4.89	336.4	443.6
msr_proj_0	5.24	5.67	294.6	407.5
msr_hm_0	6.61	7.12	275.6	455.2
msr_prn_0	9.21	10.41	288.0	402.3

The k=3 saturating counter adds under 10% overhead to the read path. This benchmark pre-warms the cache with a full trace replay, then hammers Get() only — the ideal scenario for a read-heavy cache in steady state.

Parallel Replay throughput (cold cache, mixed read+write, 32 goroutines, ns/op)

This is Probe() for SIEVE and Get+Add for LRU/ARC, hammered in parallel with no warmup. It measures the steady-state workload a real cache faces: reads and writes interleaved, evictions happening live. The previous table's numbers reflect the best-case read ceiling; this table reflects what throughput you actually get when your cache is doing work.

Trace	SIEVE k=1	SIEVE k=3	LRU	ARC
msr_prxy_0	6.45	6.61	401.3	431.0
msr_prn_0	14.90	15.49	481.7	497.1
meta_storage/block_traces_4	16.01	16.75	469.5	453.0
meta_storage/block_traces_1	16.15	18.48	440.7	486.5
msr_proj_0	16.36	15.86	422.8	471.8
msr_prn_1	16.50	18.88	421.4	491.6
meta_storage/block_traces_2	16.69	16.85	454.9	525.0
meta_storage/block_traces_3	16.74	17.71	434.4	501.3
meta_storage/block_traces_5	17.05	16.65	449.6	519.8
msr_hm_0	19.18	17.29	447.8	499.7
msr_usr_2	19.04	19.13	431.5	545.2
msr_proj_4	20.20	20.71	416.4	438.8
msr_usr_1	21.64	23.53	441.6	418.7
msr_src1_1	21.96	20.60	429.2	526.0
msr_src1_0	22.34	25.78	402.0	461.8
msr_proj_1	22.46	22.27	426.8	446.2
msr_proj_2	22.81	22.03	441.2	529.0
msr_web_2	24.35	25.41	436.6	466.3

Under concurrent cold-cache replay, SIEVE is ~18–62x faster than LRU/ARC. The best case is msr_prxy_0 (95% hits, so the lock-free fast path dominates); the worst case is msr_web_2 (98% miss ratio, almost every call takes the write lock) — and even there SIEVE is 18x faster.

Miss ratio (every trace)

Cache sized at 10% of unique keys. Bold = best in row.

Trace	SIEVE k=1	SIEVE k=2	SIEVE k=3	LRU	ARC
meta_storage/block_traces_1	0.4632	0.4651	0.4672	0.4602	0.4667
meta_storage/block_traces_2	0.4719	0.4743	0.4754	0.4676	0.4755
meta_storage/block_traces_3	0.4908	0.4928	0.4948	0.4885	0.4947
meta_storage/block_traces_4	0.4841	0.4870	0.4888	0.4812	0.4887
meta_storage/block_traces_5	0.4959	0.4984	0.4998	0.4927	0.5003
msr_hm_0	0.2991	0.3025	0.3025	0.3188	0.2923
msr_prn_0	0.2156	0.2194	0.2208	0.2310	0.2145
msr_prn_1	0.3908	0.3837	0.3796	0.4341	0.4148
msr_proj_0	0.2537	0.2660	0.2745	0.2375	0.2242
msr_proj_1	0.6794	0.6794	0.6794	0.7215	0.6788
msr_proj_2	0.8231	0.8231	0.8231	0.8548	0.8125
msr_proj_4	0.8463	0.8463	0.8463	0.8140	0.7173
msr_prxy_0	0.0512	0.0572	0.0594	0.0476	0.0468
msr_src1_0	0.7845	0.7845	0.7845	0.9132	0.7811
msr_src1_1	0.7939	0.7934	0.7934	0.8129	0.8231
msr_usr_1	0.3558	0.3558	0.3558	0.4007	0.3513
msr_usr_2	0.7216	0.7216	0.7216	0.7533	0.7199
msr_web_2	0.9786	0.9786	0.9786	0.9929	0.9785

Overall best (bold): ARC has the lowest miss ratio in 11 of 18 rows, LRU in 5 (all meta_storage), SIEVE k=3 in 2 (msr_prn_1 and msr_src1_1). SIEVE k=1 is never the overall best.

Head-to-head, SIEVE k=1 vs LRU (the typical deployment choice): SIEVE k=1 has lower miss ratio on 10 of 18 traces, LRU on 8. When SIEVE is better the margins are large (msr_src1_0: 12.9 points, msr_usr_1: 4.5, msr_prn_1: 4.3). When LRU is better the margins are narrow (all 5 meta_storage: 2–3 points each).

SIEVE's case rests on throughput Miss ratios are competitive (SIEVE is never dramatically worse than LRU), and SIEVE is 18–300x faster under any concurrency. SIEVE k=3 produces the single-best entry in the entire table on msr_prn_1 (0.3796 vs LRU 0.4341, ARC 0.4148).

Memory during replay

On the 13.2M-request meta_storage/block_traces_1 trace (601K-entry cache), TestGCPressure reports:

Variant	TotalAlloc
SIEVE k=1	154 MB
SIEVE k=3	155 MB
LRU	418 MB
ARC	997 MB

SIEVE allocates 2.7x less than LRU and 6.5x less than ARC during replay — the array-backed node pool and inline-int32 xsync.MapOf are structural wins.

Full per-trace tables (sequential replay ns/op, B/op, miss ratio for every trace) are in bench-results.md. Methodology and trace-loading details are in bench/README.md.

Usage

import "github.com/opencoff/go-sieve"

// Create a cache mapping string keys to int values, capacity 1000.
c, err := sieve.New[string, int](1000)
if err != nil {
    log.Fatal(err) // ErrInvalidCapacity or ErrInvalidVisitClamp
}

// Or, for constant arguments, use Must to get a one-liner:
c := sieve.Must(sieve.New[string, int](1000))

c.Add("foo", 42)

if val, ok := c.Get("foo"); ok {
    fmt.Println(val) // 42
}

// Probe inserts only if absent; returns the cached value if present.
val, _, r := c.Probe("foo", 99)
// val == 42, r.Hit() == true

c.Delete("foo")
c.Purge() // reset entire cache

SIEVE-k

WithVisitClamp(k) creates a SIEVE-k cache where each entry uses a saturating counter instead of a single visited bit. An item accessed k+1 times survives k eviction passes before being evicted. k=1 is equivalent to classic SIEVE (the default). Use k=2 or k=3 for workloads with repeated access patterns where extra eviction resistance is beneficial.

// Classic SIEVE (k=1, the default)
c, err := sieve.New[string, int](1000)

// SIEVE-k=3: items survive up to 3 eviction passes
c, err := sieve.New[string, int](1000, sieve.WithVisitClamp(3))

New returns an error for capacity <= 0 (ErrInvalidCapacity) and for WithVisitClamp(k) with k > sieve.MaxVisitClamp (255 — ErrInvalidVisitClamp). Clamp values below 1 are silently rounded up to 1.

Eviction Handling

Add() and Probe() return the evicted entry (if any) along with a CacheResult bitmask. This allows callers to handle evictions synchronously without channels, goroutines, or lifecycle management.

c := sieve.Must(sieve.New[string, int](1000))

ev, r := c.Add("foo", 42)
if r.Evicted() {
    cleanupDisk(ev.Key, ev.Val)
}

// CacheResult bitmask:
//   CacheHit   — key was already present (value updated, no eviction)
//   CacheEvict — an entry was evicted to make room (mutually exclusive with CacheHit)

Purge() and Delete() do not report evictions.

API

Function / Method	Description
`New[K, V](capacity, ...Option) (*Sieve[K,V], error)`	Create a cache with fixed capacity. Returns `ErrInvalidCapacity` or `ErrInvalidVisitClamp` on bad input.
`Must[K, V](Sieve[K,V], error) Sieve[K,V]`	Helper that panics on error; useful with constant arguments.
`Get(key) (V, bool)`	Look up a key (lock-free, zero-alloc).
`Add(key, val) (Evicted[K,V], CacheResult)`	Insert or update; returns evicted entry and result bitmask.
`Probe(key, val) (V, Evicted[K,V], CacheResult)`	Insert-if-absent; returns cached/inserted value, evicted entry, and result bitmask.
`Delete(key) bool`	Remove a key.
`Purge()`	Clear the entire cache.
`Len() int`	Current number of entries (lock-free atomic load).
`Cap() int`	Maximum capacity.

Options

Option	Description
`WithVisitClamp(k)`	Use k-level saturating counters (default k=1 = classic SIEVE). `k` is capped at `MaxVisitClamp` (255); `k < 1` is silently rounded to 1.

GC Note

When K or V is a pointer type (including string, which contains an internal pointer in Go), the node array will still contain GC-traced pointers. The GC pressure reduction is most dramatic for scalar key/value types (int, [16]byte, fixed-size structs).

License

BSD-2-Clause. See the source files for the full license text.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
bench		bench
exp		exp
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
assert_test.go		assert_test.go
bench-results.md		bench-results.md
go.mod		go.mod
go.sum		go.sum
invariants_test.go		invariants_test.go
options.go		options.go
pause.go		pause.go
sieve.go		sieve.go
sieve_adversarial_test.go		sieve_adversarial_test.go
sieve_bench_custom_test.go		sieve_bench_custom_test.go
sieve_evict_test.go		sieve_evict_test.go
sieve_k_test.go		sieve_k_test.go
sieve_race_test.go		sieve_race_test.go
sieve_test.go		sieve_test.go
slotstate.go		slotstate.go
slotstate_test.go		slotstate_test.go
validate_test.go		validate_test.go
zipf_bench_test.go		zipf_bench_test.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

go-sieve - SIEVE cache eviction for Go

Key Design Elements

Benchmark Results

Reading the parallel ns/op numbers

Parallel Micro-Benchmarks (`count=3`, medians)

Trace Replay Results

Parallel Get throughput (warm cache, 32 goroutines, ns/op, zero allocs)

Parallel Replay throughput (cold cache, mixed read+write, 32 goroutines, ns/op)

Miss ratio (every trace)

Memory during replay

Usage

SIEVE-k

Eviction Handling

API

Options

GC Note

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

go-sieve - SIEVE cache eviction for Go

Key Design Elements

Benchmark Results

Reading the parallel ns/op numbers

Parallel Micro-Benchmarks (count=3, medians)

Trace Replay Results

Parallel Get throughput (warm cache, 32 goroutines, ns/op, zero allocs)

Parallel Replay throughput (cold cache, mixed read+write, 32 goroutines, ns/op)

Miss ratio (every trace)

Memory during replay

Usage

SIEVE-k

Eviction Handling

API

Options

GC Note

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Parallel Micro-Benchmarks (`count=3`, medians)

Packages