
perf(cache): O_DIRECT by default + explicit L1 RAM budget #63

Merged
jaredLunde merged 4 commits into main from jared/cache-ram-ownership on May 29, 2026


@jaredLunde jaredLunde commented May 29, 2026

Summary

Makes the foyer-backed caches (clean block cache + pack-index cache) own their RAM explicitly instead of leaning on the OS page cache: O_DIRECT by default (page-cache bypass) with a flat, tunable L1 memory-tier budget. Builds on the io_uring engine change (#62).

Every decision here is backed by measurement on a real NVMe host. The full investigation follows.


Why change anything?

With buffered I/O, foyer's SSD tier is read through the OS page cache. That gives fast warm reads — but it (a) double-caches every block (once in foyer's L1, once in the page cache) and (b) evicts co-resident tenants' pages under memory pressure. GlideFS runs on the same boxes as the workloads it serves, so that page-cache appetite competes with the tenants. This PR trades the page cache's "free" speed for explicit, neighborly RAM ownership — and the numbers say that's the right trade if L1 is sized correctly.


Benchmark results

All via the harnesses in examples/ (cache_ram, cache_hot_read, cache_l2_poll, interference_test.sh) on an NVMe + ext4 host. Block size 128 KiB.

1. Buffered (page-cache) vs O_DIRECT (media), SSD-tier read latency

| Engine, mode | QD1 latency | conc=16 | conc=64 |
|---|---|---|---|
| psync, buffered | 55 µs | 8.34 GiB/s | 3.07 GiB/s |
| io_uring, buffered | 34 µs | 4.82 GiB/s | 4.50 GiB/s |
| psync, O_DIRECT | 245 µs | 2.33 GiB/s | 3.38 GiB/s |
| io_uring, O_DIRECT | 218 µs | 2.42 GiB/s | 3.38 GiB/s |

  • At QD1, buffered warm reads are ~4–6× faster than O_DIRECT media reads (55 vs 245 µs with psync, 34 vs 218 µs with io_uring): page cache vs NVMe. Earlier "fast" numbers were page-cache-warm, which is real but only one regime.
  • Under O_DIRECT the device is the bottleneck and the I/O engine washes out (psync ≈ io_uring). This matters below.

2. Double-caching: page-cache footprint (fincore)

| Mode | L2 on disk | Page cache held | foyer L1 (RSS) |
|---|---|---|---|
| buffered | 265 MiB | 264 MiB (~100%) | ~38 MiB |
| O_DIRECT | 265 MiB | 0 MiB | ~38 MiB |

Buffered pins the entire L2 footprint in RAM a second time. At a production 10 GB SSD cache that's ~10 GB of host RAM duplicating data foyer already manages on disk. O_DIRECT reclaims all of it.

3. Does a bigger explicit L1 recover the buffered speed? (full get() path, hot-biased)

| Config | Hot-read median | Hit source |
|---|---|---|
| buffered, L1=2 MiB | 120.7 µs | page-cached L2 |
| O_DIRECT, L1=2 MiB | 316.6 µs (p99 5.7 ms) | NVMe media |
| O_DIRECT, L1=64 MiB | 131 ns | L1 |
| buffered, L1=64 MiB | 130 ns | L1 |

A properly-sized L1 + O_DIRECT = 131 ns, identical to buffered, with zero page-cache cost. The buffered "win" was just the page cache acting as an accidental L1. Trap: O_DIRECT + an undersized L1 is the worst case (317 µs / 5.7 ms p99) — so L1 must cover the hot set.
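
The trap can be made concrete with a rough two-tier model: if a fraction h of hot reads hit L1 and the rest fall through to the O_DIRECT SSD tier, the mean read latency is about h × 131 ns + (1 − h) × 316.6 µs. The sketch below plugs in the medians from the table. The function name is hypothetical, and treating medians as stand-ins for means is an assumption; none of this is computed by the harnesses themselves.

```rust
/// Median latencies taken from the table above, in nanoseconds.
const L1_HIT_NS: f64 = 131.0;
const SSD_DIRECT_NS: f64 = 316_600.0;

/// Hypothetical helper: mean read latency under a simple two-tier model in
/// which `l1_hit_ratio` of reads are served from L1 and the remainder from
/// the O_DIRECT SSD tier. Using medians as means is an assumption.
fn expected_read_ns(l1_hit_ratio: f64) -> f64 {
    // Clamp so that out-of-range inputs cannot produce negative latencies.
    let h = l1_hit_ratio.clamp(0.0, 1.0);
    h * L1_HIT_NS + (1.0 - h) * SSD_DIRECT_NS
}
```

Under this model even a 99% L1 hit ratio averages roughly 3.3 µs per read, about 25× the pure-L1 figure, which is why L1 has to cover essentially the whole hot set rather than most of it.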

4. Tenant interference (fio "tenant" + foyer antagonist in a memory-capped cgroup)

| Run | buffered antagonist | O_DIRECT antagonist |
|---|---|---|
| cap 3G, antag 2G | −20% IOPS | −9% |
| cap 3G, antag 2G (repeat) | −12% IOPS | −2% |

A buffered cache degrades the co-resident tenant ~2–6× more than O_DIRECT — it evicts the tenant's page-cached working set; O_DIRECT leaves it alone. (Reproducible direction; magnitude modest at these toy sizes, larger at production scale.)

5. O_DIRECT-only io_uring tuning (iopoll / sqpoll)

| Mode | QD1 | conc=16 | CPU/read |
|---|---|---|---|
| plain io_uring | 145 µs | 64 µs | baseline |
| iopoll | error -95 (unsupported) | | |
| sqpoll | 197 µs (worse) | ~same | 1.5–2.8× more CPU |

  • iopoll doesn't work on a filesystem-backed O_DIRECT cache (needs a raw block device + NVMe poll queues); it even builds fine then fails reads at runtime.
  • sqpoll is a net loss: no latency win, 1.5–2.8× the CPU — exactly backwards on a CPU-contended shared box.
  • Confirms benchmark 1: the cold path is device-bound, so completion-path tuning buys nothing. Plain io_uring stays.

Decisions

1. O_DIRECT by default (clean + pack-index caches). Reclaims ~100% of the cache's RAM footprint (#2), stops degrading tenants (#4), and keeps warm-read speed if L1 is sized (#3). Falls back to buffered automatically on filesystems without O_DIRECT support (tmpfs etc.) so it can't crash startup. CLI one-shot tools (push/bless) stay buffered (ephemeral temp dirs).
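
The probe-and-fall-back behavior can be sketched roughly as follows. This is not the PR's o_direct_supported() implementation, whose signature is not shown here: probe_o_direct, resolve_io_mode, IoMode, and the scratch-file name are hypothetical, and the O_DIRECT constants are the Linux values for arm/aarch64 and for x86/generic targets (other architectures may differ). The idea is to try opening a scratch file with O_DIRECT in the cache directory and to pick the buffered path if that fails.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

// Linux O_DIRECT flag values (assumption: only these targets matter here).
#[cfg(any(target_arch = "arm", target_arch = "aarch64"))]
const O_DIRECT: i32 = 0o200000;
#[cfg(not(any(target_arch = "arm", target_arch = "aarch64")))]
const O_DIRECT: i32 = 0o40000;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IoMode {
    Direct,
    Buffered,
}

/// Hypothetical probe: try to create a scratch file with O_DIRECT in `dir`.
/// Filesystems without O_DIRECT support typically reject the open.
fn probe_o_direct(dir: &Path) -> io::Result<()> {
    let path = dir.join(".o_direct_probe");
    let result = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .custom_flags(O_DIRECT)
        .open(&path)
        .map(|_file| ());
    // The file may exist even when the open failed, so always clean up.
    let _ = fs::remove_file(&path);
    result
}

/// Decide the I/O mode: honor `want_direct` only if the probe succeeds,
/// otherwise log a warning and fall back to buffered I/O.
fn resolve_io_mode(want_direct: bool, probe: impl FnOnce() -> io::Result<()>) -> IoMode {
    if !want_direct {
        return IoMode::Buffered;
    }
    match probe() {
        Ok(()) => IoMode::Direct,
        Err(err) => {
            eprintln!("warn: O_DIRECT unavailable ({err}); falling back to buffered I/O");
            IoMode::Buffered
        }
    }
}
```

A caller would use something like `resolve_io_mode(cfg_direct, || probe_o_direct(cache_dir))` at startup. Passing the probe as a closure keeps the fallback decision testable without depending on which filesystem the test happens to run on.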

2. L1 is a flat RAM budget, not a fraction of node RAM. With O_DIRECT, L1 is pinned, non-reclaimable, non-chargeable host RAM (unlike the page cache, which is elastic and reclaimable — its permanent claim on sellable capacity is ~0). A fraction-of-RAM would grow that unsellable reservation as nodes get bigger for no benefit — and on the bare-metal NVMe nodes we target (128–512+ GiB RAM) it always clamped anyway, so it was a hardcoded number wearing a costume.

The working set L1 must hold is set by the workload, not the node:

  • ~90% of bytes are shared (pre-fork / layered base images) and content-addressed → deduplicated to one copy — a sub-GB hot set serving the whole fleet.
  • Plus a few hot tenants (e.g. Postgres) with unique working sets — but those cache their own hot pages in guest RAM they paid for, so GlideFS only sees the warm tail.

So L1 ≈ (deduped shared base) + (a few hot tenants' warm tails) ≈ low single-digit GB, roughly constant across node sizes. Default: flat 2 GiB (DEFAULT_L1_BYTES), overridable via memory_size_gb. S3-FIFO eviction dynamically allocates the budget to whatever's hot. Blocks beyond L1 fall to the O_DIRECT SSD tier (~220 µs) — still ~1000× faster than the S3 miss it replaces.
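
A minimal sketch of how a flat budget with an override might resolve, assuming memory_size_gb is an optional whole number interpreted in GiB and direct is an optional bool. The struct name CacheSection, the field types, and the accessor names are hypothetical; only DEFAULT_L1_BYTES, memory_size_gb, direct, and the defaults (2 GiB, true) come from the text.

```rust
const GIB: u64 = 1 << 30;

/// Flat default L1 budget named in the text above.
const DEFAULT_L1_BYTES: u64 = 2 * GIB;

/// Hypothetical stand-in for the [cache] config section; types are assumed.
#[derive(Debug, Default, Clone)]
struct CacheSection {
    memory_size_gb: Option<u64>,
    direct: Option<bool>,
}

impl CacheSection {
    /// L1 budget in bytes: the explicit override if present, otherwise the
    /// flat default. Node RAM is deliberately not an input.
    fn l1_bytes(&self) -> u64 {
        match self.memory_size_gb {
            Some(gb) => gb.saturating_mul(GIB),
            None => DEFAULT_L1_BYTES,
        }
    }

    /// O_DIRECT is on unless the config explicitly turns it off.
    fn direct(&self) -> bool {
        self.direct.unwrap_or(true)
    }
}
```

Because the accessor never looks at total node memory, a 128 GiB node and a 512 GiB node resolve to the same L1 budget unless an operator overrides it.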

3. Plain io_uring — iopoll/sqpoll rejected empirically (#5).


Changes

  • config.rs: new [cache] knobs direct (default true), pack_index_memory_size_gb, and pack_index_ssd_size_gb; DEFAULT_L1_BYTES = 2 GiB; accessor methods + validation.
  • foyer_engine.rs: o_direct_supported() probe for the graceful fallback.
  • cache.rs / pack_index_cache.rs: direct threaded into FsDeviceBuilder::with_direct with probe-and-fall-back + warn log.
  • server.rs / router.rs: server builds both caches with the policy; the pack-index cache is passed via a new Option-typed RouterConfig field, pack_index_cache (None elsewhere, so behavior there is unchanged).
  • push.rs / bless.rs: buffered (ephemeral).
  • examples/: the four measurement harnesses.

Verification

  • default + ublk + test-utils builds clean; clippy clean.
  • Tests pass: cache (8, incl. new test_foyer_direct_falls_back_gracefully), pack-index (16), config (12), router (66), api (34).
  • The fallback test confirms direct: true opens successfully even on tmpfs (takes the buffered path).

Follow-ups (not in this PR)

  • Consider a pressure-aware / reclaimable L1 (shrink on cgroup memory.high/PSI) to get page-cache-like elasticity without the double-caching.

🤖 Generated with Claude Code

jaredLunde and others added 4 commits May 28, 2026 18:48
Make the foyer-backed caches own their RAM explicitly instead of leaning on
the OS page cache: default to O_DIRECT (page-cache bypass) with a flat,
tunable L1 memory-tier budget. Driven by the empirical findings below (full
detail in the PR description).

- O_DIRECT default for both clean + pack-index caches, with graceful fallback
  to buffered I/O on filesystems that don't support it (e.g. tmpfs), so the
  neighborly default can't crash startup. CLI one-shot tools stay buffered.
- Clean-cache L1 default = flat 2 GiB budget (DEFAULT_L1_BYTES), framed as
  non-chargeable RAM overhead, not a fraction of node RAM. Sized to the
  deduplicated shared base + a few hot tenants' warm tails.
- New [cache] knobs: `direct`, `pack_index_memory_size_gb`,
  `pack_index_ssd_size_gb`; pack-index tiers now configurable.
- Engine stays plain io_uring — iopoll/sqpoll measured and rejected
  (unsupported on FS-backed O_DIRECT / net CPU loss).

Includes the measurement harnesses used to make these decisions
(examples/cache_ram, cache_hot_read, cache_l2_poll, interference_test.sh).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The O_DIRECT/L1 change added `direct` to FoyerCacheConfig and
`pack_index_cache` to RouterConfig, but the initial commit only updated
sites under glidefs/src. This fixes the construction sites I missed:

- oci-distribution/src/config.rs: FoyerCacheConfig missing `direct`
  (E0063) — this broke the whole workspace build, cascading every CI job.
- glidefs/tests/**: ~26 RouterConfig literals missing `pack_index_cache`.

Workspace `cargo build --all-targets` and clippy now clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The docker_integration (docker-tests) and zc_glidefs (ublk) test targets
are feature-gated, so a plain `cargo build --tests` never compiled them —
hiding 4 more RouterConfig literals (field-init shorthand `clean_cache,`)
missing the new `pack_index_cache` field.

Verified with the check I should have run from the start:
`cargo test --no-run --all-targets --features docker-tests,ublk` → 0 errors,
clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The fio_bench, fio_verify, and blktests test targets are gated behind the
`fio-bench`/`blktests` features (which the Kernel Devices and blktests CI
jobs enable), so they weren't compiled by my earlier feature sets. Fixes
the remaining 6 RouterConfig shorthand literals.

Now verified against the COMPLETE test-feature union:
`cargo test --no-run --all-targets --features docker-tests,fio-bench,blktests`
→ 0 errors, clippy clean. (That's the check that covers every gated target.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit 1998efe into main May 29, 2026
24 checks passed
@jaredLunde jaredLunde deleted the jared/cache-ram-ownership branch May 29, 2026 03:38