
Add S3-FIFO RAM cache eviction algorithm (ram_cache.algorithm = 2)#13255

Open
phongn wants to merge 3 commits into apache:master from phongn:s3fifo-ram-cache

@phongn phongn commented Jun 10, 2026

Collaborator

Add S3-FIFO RAM cache eviction algorithm (ram_cache.algorithm = 2)

Summary

This adds S3-FIFO as a selectable RAM cache eviction policy via proxy.config.cache.ram_cache.algorithm = 2, alongside the existing CLFUS (0) and LRU (1). The default (LRU) is unchanged.

S3-FIFO (Yang, Zhang, Qiu, Yue & Rashmi, "FIFO queues are all you need for cache eviction", SOSP 2023) is a FIFO-based policy: a small admission queue and a main queue, plus a ghost queue that remembers the keys of recently evicted objects. The small queue and ghost filter one-hit-wonders, which makes the policy scan-resistant and yields strong hit rates on CDN and key-value workloads — while keeping every operation cheap, since a hit only bumps a 2-bit counter and nothing is ever reordered.

Motivation

The two current RAM cache policies split sharply on real traffic: LRU favors recency and loses on frequency-skewed CDN workloads, while CLFUS favors frequency and size, is comparatively expensive, and can collapse on recency-heavy traffic. In benchmarking, S3-FIFO beat both LRU and CLFUS on hit rate on every production trace tested (CDN and key-value). Its put cost is on par with LRU and far below CLFUS, and its get cost falls between the two, which makes it an attractive general-purpose option to offer operators.

What's in this PR

  • RamCacheS3FIFO.cc — the new RamCache implementation.
  • Cache.h, CacheProcessor.cc, RecordsConfig.cc, P_RamCache.h, CMakeLists.txt — the algorithm enum, the factory wiring (with a one-line cache_init log of the selected algorithm), the config range [0-2], the declaration, and the build entry.
  • CacheTest.cc — S3-FIFO added to the existing ram_cache regression comparison.
  • Admin docs (records.yaml.en.rst, storage/index.en.rst).

Design

S3-FIFO keeps three FIFO queues: a small admission queue (~10% of capacity), a main queue (~90%, run as a 2-bit CLOCK), and a ghost queue holding only keys of recently evicted objects. A new object enters the small queue. When an object is evicted from the small queue it is promoted to the main queue if it was reused (frequency ≥ 2) and otherwise demoted to the ghost; a subsequent miss whose key is still in the ghost is admitted straight to the main queue. This "quick demotion" of one-hit-wonders is what gives S3-FIFO its scan resistance and CDN-friendly behavior.
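The queue flow described above can be sketched in a few dozen lines. This is a simplified illustration, not the code in RamCacheS3FIFO.cc: it counts objects instead of bytes, sizes the ghost to match the main queue, and resets the counter on promotion, all of which are assumptions made for the example. Every name in it (S3FifoSketch, Entry, kPromote, and so on) is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Illustrative sketch only: counts objects rather than bytes, and every
// name below is hypothetical (nothing here is taken from RamCacheS3FIFO.cc).
class S3FifoSketch
{
public:
  explicit S3FifoSketch(std::size_t capacity)
    : small_cap_(std::max<std::size_t>(1, capacity / 10)),           // ~10% admission queue
      main_cap_(std::max<std::size_t>(1, capacity - capacity / 10)), // ~90% main queue
      ghost_cap_(main_cap_)                                          // assumed ghost bound
  {
  }

  // A hit only bumps a saturating 2-bit counter; no queue is reordered.
  bool get(const std::string &key)
  {
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    if (it->second.freq < 3) ++it->second.freq;
    return true;
  }

  void put(const std::string &key)
  {
    if (table_.count(key)) return; // already resident
    if (ghost_set_.erase(key)) {   // recently evicted key: admit straight to main
      ghost_fifo_.erase(std::find(ghost_fifo_.begin(), ghost_fifo_.end(), key));
      insert_main(key);
      return;
    }
    while (small_.size() >= small_cap_) evict_small();
    small_.push_back(key);
    table_[key] = Entry{0, false};
  }

  bool resident(const std::string &key) const { return table_.count(key) != 0; }
  bool in_main(const std::string &key) const
  {
    auto it = table_.find(key);
    return it != table_.end() && it->second.in_main;
  }
  bool in_ghost(const std::string &key) const { return ghost_set_.count(key) != 0; }

private:
  struct Entry {
    std::uint8_t freq;
    bool in_main;
  };
  static constexpr std::uint8_t kPromote = 2; // promotion threshold from the text

  void insert_main(const std::string &key)
  {
    while (main_.size() >= main_cap_) evict_main();
    main_.push_back(key);
    table_[key] = Entry{0, true}; // counter reset on entry to main (assumption)
  }

  void evict_small()
  {
    std::string key = small_.front();
    small_.pop_front();
    if (table_.at(key).freq >= kPromote) { // reused while in small: promote
      insert_main(key);
      return;
    }
    table_.erase(key); // one-hit-wonder: keep only its key in the ghost
    if (ghost_fifo_.size() >= ghost_cap_) {
      ghost_set_.erase(ghost_fifo_.front());
      ghost_fifo_.pop_front();
    }
    ghost_fifo_.push_back(key);
    ghost_set_.insert(key);
  }

  void evict_main() // 2-bit CLOCK: reinsert with freq - 1 until a zero is found
  {
    for (;;) {
      std::string key = main_.front();
      main_.pop_front();
      Entry &e = table_.at(key);
      if (e.freq > 0) {
        --e.freq;
        main_.push_back(key);
      } else {
        table_.erase(key);
        return;
      }
    }
  }

  std::size_t small_cap_, main_cap_, ghost_cap_;
  std::deque<std::string> small_, main_, ghost_fifo_;
  std::unordered_set<std::string> ghost_set_;
  std::unordered_map<std::string, Entry> table_;
};
```

With a capacity of 10, an object that is read twice while in the small queue moves to main on its way out, an unread object leaves only its key in the ghost, and a long run of unique keys passes through the small queue without disturbing main.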

The policy is byte-budgeted to fit the ATS RAM cache. The ghost stores keys, not data, but each remembered key still costs real memory, so the ghost is bounded both by its object-size footprint and by an entry-count cap, and that metadata is counted against proxy.config.cache.ram_cache.size — so total resident memory (data plus all eviction metadata) stays within the configured budget regardless of object cardinality. Access is serialized per stripe by the existing stripe lock, exactly like LRU and CLFUS; no new concurrency primitives are introduced.
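One way to picture the accounting is below. It is a rough sketch under stated assumptions: the field names, the per-entry costs, and the reading of "object-size footprint" as the summed sizes of the objects the ghost remembers are all illustrative guesses, not values or structures from the ATS implementation.

```cpp
#include <cstdint>

// Hypothetical accounting sketch; field names and per-entry costs are
// assumptions for illustration, not values from the ATS implementation.
struct RamBudgetSketch {
  std::uint64_t budget_bytes;           // stands in for ram_cache.size
  std::uint64_t entry_overhead_bytes;   // assumed metadata cost per resident object
  std::uint64_t ghost_entry_bytes;      // assumed metadata cost per remembered key
  std::uint64_t ghost_max_entries;      // entry-count cap on the ghost
  std::uint64_t ghost_max_object_bytes; // cap on summed sizes of remembered objects

  std::uint64_t data_bytes         = 0;
  std::uint64_t resident_entries   = 0;
  std::uint64_t ghost_entries      = 0;
  std::uint64_t ghost_object_bytes = 0;

  // Data plus all eviction metadata, ghost included, is charged to the budget.
  std::uint64_t charged_bytes() const
  {
    return data_bytes + resident_entries * entry_overhead_bytes + ghost_entries * ghost_entry_bytes;
  }

  bool over_budget() const { return charged_bytes() > budget_bytes; }

  // The ghost must shrink when either of its two bounds is exceeded.
  bool ghost_over_limit() const
  {
    return ghost_entries > ghost_max_entries || ghost_object_bytes > ghost_max_object_bytes;
  }
};
```

The point of charging ghost entries is visible with small numbers: 900 bytes of data and 5 resident entries fit in a 1000-byte budget, but adding 11 remembered keys at 5 bytes each tips the total to 1005.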

The implementation follows the original paper and the reference implementation in libCacheSim (S3FIFO.c).

Benchmarks

S3-FIFO was selected after an independent evaluation by @bryancall across all five candidate policies (the incumbents plus W-TinyLFU, SIEVE, and S3-FIFO), using both an in-process microbenchmark and an end-to-end h2load sweep through a real ATS proxy. S3-FIFO was the best all-rounder — top or tied-top hit rate on every workload, passing scan resistance and adaptivity, with the cheapest put. The full writeup and methodology are in @bryancall's benchmark on the evaluation PR: phongn#2. The tables below focus on S3-FIFO against the two incumbents this change adds it alongside.

Hit rate — real production traces

Full-trace replay (libCacheSim oracleGeneral traces), hit rate over the second half after warmup, the same stream for every policy; higher is better.

| Trace (cache size) | LRU | CLFUS | S3-FIFO |
|---|---|---|---|
| wiki CDN, frequency-heavy (512 MB) | 0.176 | 0.194 | 0.237 |
| wiki CDN, frequency-heavy (8 GB) | 0.521 | 0.571 | 0.585 |
| Meta CDN-prn, recency-heavy (512 MB) | 0.324 | 0.168 | 0.340 |
| Meta CDN-prn, recency-heavy (8 GB) | 0.387 | 0.219 | 0.427 |
| Meta key-value (4 GB) | 0.926 | 0.916 | 0.942 |

S3-FIFO has the best hit rate on every real trace and at every size measured.
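For readers unfamiliar with the metric, "hit rate over the second half after warmup" can be computed roughly as follows. This is a generic sketch, not the libCacheSim replay harness: it assumes a cache type exposing get and put, and that every miss is followed by a put. KeepEverythingCache is a hypothetical stand-in used only to make the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Replays the whole trace, but only counts requests in the second half,
// so the first half serves purely as warmup.
template <typename Cache>
double second_half_hit_rate(Cache &cache, const std::vector<std::uint64_t> &trace)
{
  const std::size_t warmup = trace.size() / 2;
  std::uint64_t hits = 0, measured = 0;
  for (std::size_t i = 0; i < trace.size(); ++i) {
    const bool hit = cache.get(trace[i]);
    if (!hit) cache.put(trace[i]); // assumed: admit on every miss
    if (i >= warmup) {
      ++measured;
      if (hit) ++hits;
    }
  }
  return measured ? static_cast<double>(hits) / static_cast<double>(measured) : 0.0;
}

// Hypothetical unbounded cache, present only to exercise the function.
struct KeepEverythingCache {
  std::unordered_set<std::uint64_t> seen;
  bool get(std::uint64_t key) const { return seen.count(key) != 0; }
  void put(std::uint64_t key) { seen.insert(key); }
};
```

Feeding the same trace vector to each policy in turn is what keeps the comparison fair: every policy sees an identical request stream and an identical warmup boundary.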

Microbenchmark (@bryancall, Ryzen 9950X3D)

Per-operation cost, ns/op, lower is better:

| Operation | LRU | CLFUS | S3-FIFO |
|---|---|---|---|
| get | 10.1 | 21.7 | 15.3 |
| put | 566.3 | 837.8 | 567.2 |

S3-FIFO's put is effectively tied with LRU for the cheapest measured (567.2 vs. 566.3 ns/op) and is ~32% below CLFUS; its get sits between LRU and CLFUS. On the correctness suite it passes scan resistance (hot-set retention 1.000 under a one-time scan), adaptivity after an abrupt working-set shift (new-set hit rate 1.000, stale-set retention 15/112), and gradual drift (0.978), and its synthetic-Zipf hit rate is top or tied-top at every cache size.

End-to-end throughput (@bryancall)

In the h2load sweep through a real proxy, end-to-end req/s barely moved between algorithms — 3–5% on 1 KB objects (where S3-FIFO was the top performer) and flat on bandwidth-saturated large objects. The reason is the test box's tiering: a RAM-cache miss falls through to a fast local NVMe disk-cache hit rather than an origin fetch, so the eviction policy only chooses which tier serves a hit. The policy's value is in hit ratio under memory pressure, which surfaces as throughput only where a RAM miss is expensive (origin-bound or slow-tier deployments) — so this change is offered on its hit-ratio and per-operation-cost merits, not as a throughput win on every topology.

Configuration

records:
  cache:
    ram_cache:
      algorithm: 2   # 0 = CLFUS, 1 = LRU (default), 2 = S3-FIFO

The selected algorithm is logged at cache initialization (cache_init debug tag): ram_cache algorithm = 2 (S3-FIFO).

Testing

The NIGHTLY-gated ram_cache regression test (traffic_server -R 2 -r ram_cache) now exercises S3-FIFO alongside LRU and CLFUS on the synthetic Zipfian workload. Booting with ram_cache.algorithm: 2 was verified to construct the S3-FIFO cache and log the selection.

Notes

  • Experimental opt-in; the default remains LRU.
  • S3-FIFO does not use the seen filter (it is scan-resistant by construction) and does not support in-RAM compression (which remains CLFUS-only).

🤖 Generated with Claude Code

S3-FIFO (Yang et al., SOSP 2023) is a FIFO-based eviction policy: a
small admission queue and a main queue (a 2-bit clock), plus a ghost
queue of recently evicted keys. The small queue and ghost filter
one-hit-wonders, giving scan resistance and strong hit rates on CDN
and key-value workloads at low cost -- a hit needs no list reordering.

Selectable as ram_cache.algorithm = 2 alongside CLFUS (0) and LRU (1).
The policy is byte-budgeted; its eviction metadata (the ghost included,
bounded by object size and an entry-count cap) is accounted within
ram_cache.size so total memory stays within the configured budget. Like
LRU and CLFUS it enforces one resident copy per key (a put with a new
aux key discards the stale one) and allocates entries from a per-thread
ProxyAllocator (Thread.h).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@phongn phongn marked this pull request as ready for review June 10, 2026 20:05
@masaori335 masaori335 requested a review from Copilot June 11, 2026 01:53
@masaori335 masaori335 added this to the 11.0.0 milestone Jun 11, 2026

Copilot AI left a comment

Contributor


Pull request overview

This PR adds S3-FIFO as a third selectable RAM cache eviction policy (proxy.config.cache.ram_cache.algorithm = 2) alongside existing CLFUS (0) and LRU (1), wires it into cache initialization, extends the regression test to cover it, and updates admin documentation accordingly.

Changes:

  • Introduces a new RamCache implementation (RamCacheS3FIFO) and integrates it into the RAM-cache factory selection.
  • Extends configuration validation / constants to support algorithm value 2, and logs the chosen algorithm at cache init.
  • Updates the nightly ram_cache regression test and admin docs to include S3-FIFO.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/records/RecordsConfig.cc | Extends the allowed config range for proxy.config.cache.ram_cache.algorithm to include 2. |
| src/iocore/cache/RamCacheS3FIFO.cc | Adds the S3-FIFO RAM cache policy implementation. |
| src/iocore/cache/P_RamCache.h | Declares the S3-FIFO factory function. |
| src/iocore/cache/CMakeLists.txt | Adds RamCacheS3FIFO.cc to the cache build. |
| src/iocore/cache/CacheTest.cc | Extends the existing RAM cache regression test to run S3-FIFO. |
| src/iocore/cache/CacheProcessor.cc | Wires algorithm 2 into initialization and adds a debug log of the selected policy. |
| include/iocore/eventsystem/Thread.h | Adds a per-thread allocator freelist slot for S3-FIFO entries. |
| include/iocore/cache/Cache.h | Defines RAM_CACHE_ALGORITHM_S3FIFO as 2. |
| doc/admin-guide/storage/index.en.rst | Documents S3-FIFO as a supported RAM cache eviction algorithm and clarifies seen-filter applicability. |
| doc/admin-guide/files/records.yaml.en.rst | Updates the ram_cache.algorithm record documentation to describe the new 2 = S3-FIFO option. |

Comment thread src/iocore/cache/RamCacheS3FIFO.cc Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

@masaori335 masaori335 left a comment

Contributor


Overall, looks good! I'm very excited about adding a new RAM cache algorithm.

The implementation follows the original paper and the reference implementation in libCacheSim (S3FIFO.c).

It would be good to note this in our NOTICE file.
https://github.com/apache/trafficserver/blob/master/NOTICE

Comment thread src/iocore/cache/RamCacheS3FIFO.cc Outdated
The S3-FIFO queue split, ghost bounds, and promotion threshold were
compile-time constants. Expose them as
proxy.config.cache.ram_cache.s3fifo.{main_percent,ghost_size_percent,
ghost_mem_percent,promote_threshold} so operators can tune the policy
without a rebuild; the defaults match the paper and the prior behavior.

Each setting carries a RECC_INT range in RecordsConfig.cc, so an
out-of-range value is rejected at config load with a warning and the
documented default is used in its place -- the same guard every other
RAM cache record relies on, rather than a bespoke clamp. The admin
guide and the ram_cache regression test (a non-default "tuned" pass)
are updated to cover the new settings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
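The "reject and fall back to the default, rather than clamp" behavior described in the commit message can be illustrated as follows. This is a hypothetical stand-alone helper, not the RECC_INT code path in ATS; the function names and the warning text are invented for the example, and only the "[lo-hi]" range notation is borrowed from the PR description.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helpers for illustration; not the ATS RECC_INT implementation.
// Parses a range written as "[lo-hi]" with non-negative bounds.
inline bool parse_int_range(const std::string &spec, long &lo, long &hi)
{
  return std::sscanf(spec.c_str(), "[%ld-%ld]", &lo, &hi) == 2 && lo <= hi;
}

// Returns the configured value when it is inside the range; otherwise warns
// and returns the documented default (the value is NOT clamped to lo or hi).
inline long resolve_setting(long configured, const std::string &range_spec, long documented_default)
{
  long lo = 0, hi = 0;
  if (!parse_int_range(range_spec, lo, hi)) {
    return documented_default;
  }
  if (configured < lo || configured > hi) {
    std::fprintf(stderr, "warning: %ld is outside %s, using default %ld\n", configured, range_spec.c_str(),
                 documented_default);
    return documented_default;
  }
  return configured;
}
```

The difference from a clamp shows up at the edges: with a range of [0-2] and a default of 1, a configured value of 7 resolves to 1, not to 2.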
@phongn

phongn commented Jun 16, 2026

Collaborator Author

@masaori335, in my initial private benchmark PR you asked:

I heard S3-FIFO is lock free, which means we can improve it more by relaxing Stripe Mutex lock contention before RAMCache access.

This is unfortunately not true for our internal implementation using existing ATS data structures and conventions. Quoting Claude:

The property the developer is thinking of is real: because S3-FIFO is FIFO-plus-a-counter, a hit needs no list reordering — just a 2-bit bump. That's what makes it amenable to a concurrent/lock-free implementation, unlike LRU/CLFUS which mutate list order (and CLFUS recomputes value) on every hit.

But our RamCacheS3FIFO.cc realizes none of that concurrency. As written it is not thread-safe, let alone lock-free:

  • intrusive doubly-linked segment lists and a chained hash table that rehashes on growth (_resize_hashtable);
  • plain non-atomic counters (_s_bytes, _m_bytes, _g_count, _objects, _nentries);
  • entries freed with THREAD_FREE;
  • and crucially, get() is itself a mutator — it does e->freq++ (a non-atomic read-modify-write) and assigns *ret_data = e->data. Two concurrent gets on the same bucket already race.

Getting to the theoretical property would mean rewriting the data structures: atomic counters, a concurrent/sharded hash map, sharded or lock-free FIFOs, and — the hard part — safe memory reclamation (epoch/hazard pointers) so eviction can't THREAD_FREE an entry a reader is touching. That's a project, not a property we inherit by picking the algorithm.
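To make the first item on that list concrete, here is what an atomic version of the saturating 2-bit bump could look like. It is a minimal sketch with a hypothetical function name, and it addresses only the counter race: the hash table, the queues, and safe memory reclamation would all still need their own solutions.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: saturating increment that is safe under concurrent
// callers. Returns the counter value after this call's attempt.
inline std::uint8_t bump_freq(std::atomic<std::uint8_t> &freq, std::uint8_t cap = 3)
{
  std::uint8_t cur = freq.load(std::memory_order_relaxed);
  while (cur < cap) {
    const auto next = static_cast<std::uint8_t>(cur + 1);
    if (freq.compare_exchange_weak(cur, next, std::memory_order_relaxed)) {
      return next;
    }
    // On failure `cur` now holds the value another thread wrote; retry.
  }
  return cur; // already saturated, nothing to write
}
```

Relaxed ordering is used here on the assumption that the counter is only an eviction hint and publishes no other data; a design that reads the counter to make lifetime decisions would need to revisit that choice.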

@phongn

phongn commented Jun 16, 2026

Collaborator Author

Following on from the thread-safety note above — a reviewer asked whether S3-FIFO being "lock-free" in the literature might let us relax the per-stripe mutex for this policy. Short version: not as implemented, and the stripe mutex isn't the RAM cache's lock to relax — but the FIFO structure does buy a measurably shorter critical section today, with no re-architecting. Below: the measurement, a safety-check against the existing stripe-contention work, and a sketch of what a real concurrent RAM cache would take.

1. Time under the stripe lock, per policy

Every RamCache::get()/put() runs while stripe->mutex is held, so per-call cost is exactly the marginal time each policy spends under the lock. We measured it through the real RamCache interface — so the numbers include the metric-counter atomics and Ptr refcounting that genuinely run under the lock — driving all three policies with one identical Zipf stream. RNG/sampling is hoisted out of the timed region, buffers are pre-allocated and reused so allocation doesn't pollute the figures, and the seen filter is disabled for the put comparison so every policy admits the same way. 16 MB cache, 16 KB objects, Ice Lake; absolute ns is host-dependent, the cross-policy ratio is the point.

| Policy | get (ns/op) | put (ns/op, realistic Zipf) | get hit rate |
|---|---|---|---|
| LRU | 117 | 144 | 0.807 |
| CLFUS | 117 | 180 | 0.802 |
| S3-FIFO | 51 | 85 | 0.841 |

S3-FIFO's get is ~2.3× cheaper because a hit only bumps a 2-bit counter — it never relinks a list, whereas LRU/CLFUS move the entry on every access (pointer-chasing into scattered nodes is most of their cost). It is also the cheapest put here, at a higher hit rate. This is consistent with the microbenchmark in the PR description; measuring through the live interface just confirms it holds with the real under-lock bookkeeping included.

Two caveats we want to state plainly:

  • The worst case flips the put. Under pure one-hit-wonder churn (every object unique, never reused) S3-FIFO's put rises to ~10.3 µs vs LRU ~5.6 / CLFUS ~6.3 µs — the cost of pushing every object through the small queue and ghost. That is exactly the workload where its scan-resistance pays off, and it is an unusual put pattern, but it is real.
  • Magnitude vs. the actual bottleneck. These are tens of nanoseconds. On a cache miss, the stripe lock is held for microseconds (directory probe + AIO scheduling), so the RAM-cache call is a small slice of the critical section; the saving concentrates on the RAM-hit path, where the critical section is short. So this is real aggregate contention relief at high RAM-hit rates — not a structural fix.

(The benchmark is a REGRESSION_TEST(ram_cache_timing) we kept out of this PR, since it compares all policies and belongs with the evaluation harness rather than the focused change. Happy to share it.)
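For anyone who wants to reproduce the shape of this measurement without the harness, the core idea (generate the key stream first, then time only the cache calls) can be sketched as below. This is not the ram_cache_timing test: the function names are hypothetical, the skewed stream is a simple 1/rank weighting standing in for the Zipf generator, and the operation under test is passed in as a callable.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Builds the request stream up front so RNG and sampling cost stay outside
// the timed region. Weights of 1/rank give a Zipf-like skew (an assumption,
// not the exact distribution used in the measurement above).
inline std::vector<std::uint64_t> make_zipf_keys(std::size_t n, std::size_t universe, std::uint64_t seed)
{
  std::vector<double> weights(universe);
  for (std::size_t i = 0; i < universe; ++i) {
    weights[i] = 1.0 / static_cast<double>(i + 1);
  }
  std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
  std::mt19937_64 rng(seed);
  std::vector<std::uint64_t> keys(n);
  for (auto &k : keys) {
    k = pick(rng);
  }
  return keys;
}

// Times only the calls to `op`, one per pre-generated key, and reports the
// mean cost in nanoseconds per operation.
template <typename Op>
double ns_per_op(const std::vector<std::uint64_t> &keys, Op &&op)
{
  if (keys.empty()) return 0.0;
  const auto start = std::chrono::steady_clock::now();
  for (std::uint64_t k : keys) {
    op(k);
  }
  const auto stop = std::chrono::steady_clock::now();
  const auto ns   = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  return static_cast<double>(ns) / static_cast<double>(keys.size());
}
```

Passing the same key vector to each policy's get or put wrapper gives the "one identical stream" property; the callable should write its result somewhere observable so the compiler cannot discard the work.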

2. Safety check: relaxing the stripe lock is a separate, much larger effort

Worth being explicit that this is a far bigger change than adding a policy, and the existing design work shows why it shouldn't be attached to a RAM-cache algorithm:

  • The RAM cache is called inside the stripe critical section, not alongside it. CacheVC::handleRead asserts stripe->mutex is held before load_from_ram_cache(), and put runs in the aggregation-write path under the same lock. The lock exists for the on-disk directory (and the agg buffer), and the RAM cache is nested within it, so there is nothing to relax "for S3-FIFO."
  • #12788 ("⛱️ Stripe Mutex Lock Contention") frames the real target: open_read() serializing on stripe->mutex, with a proposed two-tier StripeSM/Stripe split and an RW lock, which is a core restructuring.
  • Both prior attempts target the directory, not the RAM cache, and both hit the two hazards a concurrent RAM cache would also face. #12601 ("Add reader/writer locks to cache stripe for reduced contention"; closed, pending deeper analysis) took a shared dir_mutex, but Directory::probe() conditionally writes (it deletes invalid dir entries), so a read lock isn't sufficient. #12794 ("Avoid Stripe Mutex lock contention for RWW", the OpenDir shared_mutex approach) only stays correct by making OpenDirEntry reference-counted so an entry can't be freed under a reader holding a Ptr.

Those two lessons — "read-only" paths that conditionally write, and cross-thread object lifetime — are exactly what a concurrent RAM cache must solve too. So the directory line of work is both the higher-leverage lever and the prerequisite learning. Our suggestion is to land S3-FIFO as a policy here (for the shorter critical section now) and keep lock-relaxation on the #12788/#12794 track, where it benefits every policy.

3. If we later want a concurrent / independently-locked RAM cache

Sketching the design space — none of which should be coupled to S3-FIFO specifically; it should live at the RamCache layer so every policy benefits:

  • Sharded RAM cache with its own locks (pragmatic first step). Partition the RAM cache by key hash into N independent sub-caches, each with its own mutex, separate from stripe->mutex. No lock-free data structures required — just finer locks — and it captures most of the contention relief. The real work is byte-budget accounting across shards (the global ram_cache.size budget has to be split or coordinated) and accepting that eviction becomes per-shard rather than global (a small hit-rate effect). Lowest-risk, algorithm-agnostic.
  • Lock-free / concurrent RAM cache (larger). Atomic counters, a concurrent or sharded hash map, sharded FIFO queues, an atomic frequency counter, and (the hard part) safe reclamation (epoch/hazard pointers, or refcounted entries exactly as #12794, "Avoid Stripe Mutex lock contention for RWW", needed) so eviction can't free an entry a reader holds. Here the reviewer's intuition is right: S3-FIFO is a good substrate, because FIFO-plus-a-counter is far friendlier to concurrency than LRU's per-hit reordering, which is the basis of the paper's scalability claims. But it's a substantial, separate project.
  • Decoupling the RAM probe from the stripe lock. To serve a RAM hit without taking stripe->mutex at all, handleRead would have to probe the RAM cache before/outside the stripe critical section, which interleaves with the same OpenDir/directory ordering that #12794 is untangling. That makes it coupled to the directory work, not independent of it.
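As a rough picture of the first option, a sharded wrapper with per-shard locks and an evenly split byte budget might look like the sketch below. All names are hypothetical, the per-shard policy is a plain byte-budgeted FIFO used as a placeholder for whichever RamCache policy is configured, and the even budget split is the simplest possible choice rather than a recommendation.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a sharded RAM cache. shard_count must be >= 1.
class ShardedCacheSketch
{
public:
  ShardedCacheSketch(std::size_t shard_count, std::uint64_t total_budget_bytes) : shards_(shard_count)
  {
    for (auto &s : shards_) {
      s.budget = total_budget_bytes / shard_count; // global budget split evenly
    }
  }

  // Returns false only when the object cannot fit in a single shard.
  bool put(std::uint64_t key, std::uint64_t bytes)
  {
    Shard &s = shard_for(key);
    std::lock_guard<std::mutex> guard(s.lock); // only this shard is locked
    if (bytes > s.budget) return false;
    if (s.sizes.count(key)) return true;
    while (s.used + bytes > s.budget) { // eviction is per shard, not global
      const std::uint64_t victim = s.fifo.front();
      s.fifo.pop_front();
      s.used -= s.sizes[victim];
      s.sizes.erase(victim);
    }
    s.fifo.push_back(key);
    s.sizes[key] = bytes;
    s.used += bytes;
    return true;
  }

  bool get(std::uint64_t key)
  {
    Shard &s = shard_for(key);
    std::lock_guard<std::mutex> guard(s.lock);
    return s.sizes.count(key) != 0;
  }

private:
  struct Shard {
    std::mutex lock;
    std::uint64_t budget = 0;
    std::uint64_t used   = 0;
    std::deque<std::uint64_t> fifo;                         // placeholder policy
    std::unordered_map<std::uint64_t, std::uint64_t> sizes; // key -> bytes
  };

  Shard &shard_for(std::uint64_t key) { return shards_[key % shards_.size()]; }

  std::vector<Shard> shards_;
};
```

The hit-rate cost mentioned above is visible even here: with four shards of 100 bytes each, two 60-byte objects that hash to the same shard evict one another although the other three shards are empty.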

Recommendation: do it generically, and measure first. The time-under-lock data above says the RAM-cache op is a small fraction of the disk-miss critical section, so before investing in a concurrent RAM cache it's worth profiling where stripe->mutex is actually held (the directory PRs suggest the probe dominates). That keeps a major change evidence-driven.

@masaori335

Contributor

On lock-free/thread-safety: as you pointed out, the open_read() serialization on stripe->mutex is the main bottleneck. From my PoC testing, RAM cache lock contention only becomes a problem once that main bottleneck is solved. So I'm +1 on landing this S3-FIFO implementation as-is and making it lock-free if/when the need arises.

FWIW, here's a past effort at a lock-free RAM cache: #7351

@masaori335 masaori335 left a comment

Contributor


Please update the NOTICE file to mention the reference implementation.
https://github.com/apache/trafficserver/blob/master/NOTICE
