perf(cache): O_DIRECT by default + explicit L1 RAM budget by jaredLunde · Pull Request #63 · beyondoss/glidefs

jaredLunde · 2026-05-29T01:49:28Z

Summary

Makes the foyer-backed caches (clean block cache + pack-index cache) own their RAM explicitly instead of leaning on the OS page cache: O_DIRECT by default (page-cache bypass) with a flat, tunable L1 memory-tier budget. Builds on the io_uring engine change (#62).

Every decision here is backed by measurement on a real NVMe host. The full investigation follows.

Why change anything?

With buffered I/O, foyer's SSD tier is read through the OS page cache. That gives fast warm reads — but it (a) double-caches every block (once in foyer's L1, once in the page cache) and (b) evicts co-resident tenants' pages under memory pressure. GlideFS runs on the same boxes as the workloads it serves, so that page-cache appetite competes with the tenants. This PR trades the page cache's "free" speed for explicit, neighborly RAM ownership — and the numbers say that's the right trade if L1 is sized correctly.

Benchmark results

All via the harnesses in examples/ (cache_ram, cache_hot_read, cache_l2_poll, interference_test.sh) on an NVMe + ext4 host. Block size 128 KiB.

1. Buffered (page-cache) vs O_DIRECT (media), SSD-tier read latency

Engine, I/O mode	QD1 latency	Throughput (conc=16)	Throughput (conc=64)
psync, buffered	55 µs	8.34 GiB/s	3.07 GiB/s
io_uring, buffered	34 µs	4.82 GiB/s	4.50 GiB/s
psync, O_DIRECT	245 µs	2.33 GiB/s	3.38 GiB/s
io_uring, O_DIRECT	218 µs	2.42 GiB/s	3.38 GiB/s

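Buffered warm reads are ~4.5–6.4× faster than O_DIRECT media reads at QD1 (55 vs 245 µs with psync, 34 vs 218 µs with io_uring): page cache vs NVMe. Earlier "fast" numbers were page-cache-warm; they are real, but cover only one regime.
Under O_DIRECT the device is the bottleneck and the I/O engine washes out (psync ≈ io_uring: 245 vs 218 µs at QD1, an identical 3.38 GiB/s at conc=64). This matters for the io_uring tuning results in section 5.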

2. Double-caching: page-cache footprint (`fincore`)

Mode	L2 on disk	Page cache held	foyer L1 (RSS)
buffered	265 MiB	264 MiB (~100%)	~38 MiB
O_DIRECT	265 MiB	0 MiB	~38 MiB

Buffered pins the entire L2 footprint in RAM a second time. At a production 10 GB SSD cache that's ~10 GB of host RAM duplicating data foyer already manages on disk. O_DIRECT reclaims all of it.

3. Does a bigger explicit L1 recover the buffered speed? (full `get()` path, hot-biased)

Config	Hot-read median	Hit source
buffered, L1=2 MiB	120.7 µs	page-cached L2
O_DIRECT, L1=2 MiB	316.6 µs (p99 5.7 ms)	NVMe media
O_DIRECT, L1=64 MiB	131 ns	L1
buffered, L1=64 MiB	130 ns	L1

With a properly sized L1, O_DIRECT hot reads take 131 ns, effectively identical to buffered (130 ns), with zero page-cache cost. The buffered "win" was just the page cache acting as an accidental L1. Trap: O_DIRECT with an undersized L1 is the worst case (317 µs median, 5.7 ms p99), so L1 must cover the hot set.
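To make the "L1 must cover the hot set" point concrete, here is a rough back-of-the-envelope model, not something taken from the PR's harnesses. It treats every read as either an L1 hit or an O_DIRECT SSD read and blends the two medians from the table above by hit ratio. The function names are hypothetical, and using medians as if they were means (and ignoring the 5.7 ms tail) is a simplifying assumption.

```rust
/// Median latencies from the table above, in nanoseconds.
pub const L1_HIT_NS: f64 = 131.0; // O_DIRECT, L1=64 MiB (served from L1)
pub const SSD_DIRECT_NS: f64 = 316_600.0; // O_DIRECT, L1=2 MiB (served from NVMe media)

/// Expected read latency for a given L1 hit ratio.
/// Assumption: each read is either an L1 hit or an O_DIRECT SSD read,
/// and the medians stand in for the mean cost of each case.
pub fn expected_latency_ns(l1_hit_ratio: f64) -> f64 {
    let h = l1_hit_ratio.clamp(0.0, 1.0);
    h * L1_HIT_NS + (1.0 - h) * SSD_DIRECT_NS
}

/// L1 hit ratio needed to keep the expected latency at or below `target_ns`.
/// Returns `None` when the target is below the L1 hit latency itself,
/// because no hit ratio can reach it under this model.
pub fn required_hit_ratio(target_ns: f64) -> Option<f64> {
    if target_ns < L1_HIT_NS {
        return None;
    }
    if target_ns >= SSD_DIRECT_NS {
        return Some(0.0);
    }
    Some((SSD_DIRECT_NS - target_ns) / (SSD_DIRECT_NS - L1_HIT_NS))
}
```

Under these assumptions, `expected_latency_ns(0.99)` comes out at roughly 3.3 µs, about 25× the all-hit figure, and `required_hit_ratio(1_000.0)` is a little above 99.7%. Even a small miss fraction dominates the average, which is why an undersized L1 is so costly once the page cache is no longer absorbing the misses.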

4. Tenant interference (fio "tenant" + foyer antagonist in a memory-capped cgroup)

Run	buffered antagonist	O_DIRECT antagonist
cap 3G, antag 2G	−20% IOPS	−9%
cap 3G, antag 2G (repeat)	−12% IOPS	−2%

A buffered cache degrades the co-resident tenant ~2–6× more than O_DIRECT (20% vs 9%, then 12% vs 2% IOPS loss): it evicts the tenant's page-cached working set, while O_DIRECT leaves it alone. (The direction is reproducible; the magnitude is modest at these toy sizes and is expected to be larger at production scale.)

5. O_DIRECT-only io_uring tuning (iopoll / sqpoll)

Mode	QD1	conc=16	CPU/read
plain io_uring	145 µs	64 µs	baseline
iopoll	❌ error -95 (unsupported)	❌	—
sqpoll	197 µs (worse)	~same	1.5–2.8× more CPU

iopoll doesn't work on a filesystem-backed O_DIRECT cache (it needs a raw block device + NVMe poll queues); worse, it builds fine and then fails reads at runtime with error -95 (EOPNOTSUPP).
sqpoll is a net loss: no latency win (197 µs vs 145 µs at QD1) and 1.5–2.8× the CPU, exactly backwards on a CPU-contended shared box.
Confirms result 1: the cold path is device-bound, so completion-path tuning buys nothing. Plain io_uring stays.

Decisions

1. O_DIRECT by default (clean + pack-index caches). Reclaims ~100% of the cache's page-cache footprint (result 2), stops degrading tenants (result 4), and keeps warm-read speed if L1 is sized to the hot set (result 3). Falls back to buffered automatically on filesystems without O_DIRECT support (tmpfs etc.) so it can't crash startup. CLI one-shot tools (push/bless) stay buffered (ephemeral temp dirs).
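The probe-and-fall-back behavior can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the PR's `o_direct_supported()` implementation: the names `probe_o_direct`, `choose_io_mode`, and `IoMode` are hypothetical, and the O_DIRECT flag values are hardcoded for Linux on x86 and ARM (other targets are simply treated as unsupported) where real code would more likely take the constant from the `libc` crate.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IoMode {
    Direct,
    Buffered,
}

// O_DIRECT is architecture-specific on Linux. Assumption: only these
// targets are handled; everything else is reported as unsupported.
#[cfg(all(target_os = "linux", any(target_arch = "x86", target_arch = "x86_64")))]
const O_DIRECT_FLAG: Option<i32> = Some(0o40000);
#[cfg(all(target_os = "linux", any(target_arch = "arm", target_arch = "aarch64")))]
const O_DIRECT_FLAG: Option<i32> = Some(0o200000);
#[cfg(not(all(
    target_os = "linux",
    any(target_arch = "x86", target_arch = "x86_64", target_arch = "arm", target_arch = "aarch64")
)))]
const O_DIRECT_FLAG: Option<i32> = None;

/// Returns true if a file in `dir` can be opened with O_DIRECT.
#[cfg(unix)]
pub fn probe_o_direct(dir: &Path) -> bool {
    use std::fs::{self, OpenOptions};
    use std::os::unix::fs::OpenOptionsExt;

    let flag = match O_DIRECT_FLAG {
        Some(flag) => flag,
        None => return false,
    };
    let probe = dir.join(format!(".o_direct_probe_{}", std::process::id()));
    let supported = OpenOptions::new()
        .write(true)
        .create(true)
        .custom_flags(flag)
        .open(&probe)
        .is_ok();
    // Remove the probe file whether or not the open succeeded: a failed
    // O_DIRECT open can still leave the file behind on some filesystems.
    let _ = fs::remove_file(&probe);
    supported
}

#[cfg(not(unix))]
pub fn probe_o_direct(_dir: &Path) -> bool {
    false
}

/// Picks the I/O mode: honor `want_direct` only when the probe succeeds,
/// otherwise warn and fall back to buffered I/O instead of failing startup.
pub fn choose_io_mode(dir: &Path, want_direct: bool) -> IoMode {
    if !want_direct {
        return IoMode::Buffered;
    }
    if probe_o_direct(dir) {
        IoMode::Direct
    } else {
        eprintln!(
            "warning: O_DIRECT unavailable for {}; falling back to buffered I/O",
            dir.display()
        );
        IoMode::Buffered
    }
}
```

Probing with a real open at startup, rather than matching on filesystem type, keeps the decision tied to what the kernel actually accepts for that directory, so the default never turns into a hard failure.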

2. L1 is a flat RAM budget, not a fraction of node RAM. With O_DIRECT, L1 is pinned, non-reclaimable, non-chargeable host RAM (unlike the page cache, which is elastic and reclaimable, so its permanent claim on sellable capacity is ~0). A fraction-of-RAM budget would grow that unsellable reservation as nodes get bigger, for no benefit. On the bare-metal NVMe nodes we target (128–512+ GiB RAM) it always clamped anyway, so it was a hardcoded number wearing a costume.

The working set L1 must hold is set by the workload, not the node:

~90% of bytes are shared (pre-fork / layered base images) and content-addressed → deduplicated to one copy — a sub-GB hot set serving the whole fleet.
Plus a few hot tenants (e.g. Postgres) with unique working sets — but those cache their own hot pages in guest RAM they paid for, so GlideFS only sees the warm tail.

So L1 ≈ (deduped shared base) + (a few hot tenants' warm tails) ≈ low single-digit GB, roughly constant across node sizes. Default: flat 2 GiB (DEFAULT_L1_BYTES), overridable via memory_size_gb. S3-FIFO eviction dynamically allocates the budget to whatever's hot. Blocks beyond L1 fall to the O_DIRECT SSD tier (~220 µs) — still ~1000× faster than the S3 miss it replaces.
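A minimal sketch of how a flat budget with an override might be resolved, for illustration only. `DEFAULT_L1_BYTES` and `memory_size_gb` come from the PR; the function name `l1_budget_bytes`, the `Option<u64>` type, the GiB interpretation of the `_gb` suffix, and the rejection of zero are all assumptions rather than the actual `config.rs` code.

```rust
pub const GIB: u64 = 1 << 30;
pub const BLOCK_BYTES: u64 = 128 * 1024; // 128 KiB block size used in the benchmarks

/// Flat default L1 budget: 2 GiB regardless of node RAM.
pub const DEFAULT_L1_BYTES: u64 = 2 * GIB;

/// Resolves the L1 budget in bytes.
/// Note that node RAM is deliberately not an input: the budget is flat.
/// Assumptions: `memory_size_gb` is an optional whole number interpreted
/// as GiB, and an explicit zero is treated as a configuration error.
pub fn l1_budget_bytes(memory_size_gb: Option<u64>) -> Result<u64, String> {
    match memory_size_gb {
        None => Ok(DEFAULT_L1_BYTES),
        Some(0) => Err("memory_size_gb must be greater than zero".to_string()),
        Some(gb) => gb
            .checked_mul(GIB)
            .ok_or_else(|| format!("memory_size_gb = {gb} overflows a u64 byte count")),
    }
}

/// Number of whole blocks that fit in a budget.
pub fn blocks_in_budget(budget_bytes: u64) -> u64 {
    budget_bytes / BLOCK_BYTES
}
```

With the default, `blocks_in_budget(DEFAULT_L1_BYTES)` is 16,384 blocks of 128 KiB, which is the pool the eviction policy has to spread across the shared base and the hot tenants' warm tails.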

3. Plain io_uring: iopoll/sqpoll rejected empirically (result 5).

Changes

config.rs — new [cache] knobs: direct (default true), pack_index_memory_size_gb, pack_index_ssd_size_gb; DEFAULT_L1_BYTES = 2 GiB; accessor methods + validation.
foyer_engine.rs — o_direct_supported() probe for the graceful fallback.
cache.rs / pack_index_cache.rs — direct threaded into FsDeviceBuilder::with_direct with probe-and-fall-back + warn log.
server.rs / router.rs — server builds both caches with the policy; the pack-index cache is passed via a new optional RouterConfig field, pack_index_cache (None elsewhere, so behavior there is unchanged).
push.rs / bless.rs — buffered (ephemeral).
examples/ — the four measurement harnesses.

Verification

default + ublk + test-utils builds clean; clippy clean.
Tests pass: cache (8, incl. new test_foyer_direct_falls_back_gracefully), pack-index (16), config (12), router (66), api (34).
The fallback test confirms direct: true opens successfully even on tmpfs (takes the buffered path).

Follow-ups (not in this PR)

Consider a pressure-aware / reclaimable L1 (shrink on cgroup memory.high/PSI) to get page-cache-like elasticity without the double-caching.

🤖 Generated with Claude Code

Make the foyer-backed caches own their RAM explicitly instead of leaning on the OS page cache: default to O_DIRECT (page-cache bypass) with a flat, tunable L1 memory-tier budget. Driven by the empirical findings below (full detail in the PR description).

- O_DIRECT default for both clean + pack-index caches, with graceful fallback to buffered I/O on filesystems that don't support it (e.g. tmpfs), so the neighborly default can't crash startup. CLI one-shot tools stay buffered.
- Clean-cache L1 default = flat 2 GiB budget (DEFAULT_L1_BYTES), framed as non-chargeable RAM overhead, not a fraction of node RAM. Sized to the deduplicated shared base + a few hot tenants' warm tails.
- New [cache] knobs: `direct`, `pack_index_memory_size_gb`, `pack_index_ssd_size_gb`; pack-index tiers now configurable.
- Engine stays plain io_uring: iopoll/sqpoll measured and rejected (unsupported on FS-backed O_DIRECT / net CPU loss).

Includes the measurement harnesses used to make these decisions (examples/cache_ram, cache_hot_read, cache_l2_poll, interference_test.sh).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The O_DIRECT/L1 change added `direct` to FoyerCacheConfig and `pack_index_cache` to RouterConfig, but the initial commit only updated sites under glidefs/src. This fixes the construction sites I missed:

- oci-distribution/src/config.rs: FoyerCacheConfig missing `direct` (E0063). This broke the whole workspace build, cascading into every CI job.
- glidefs/tests/**: ~26 RouterConfig literals missing `pack_index_cache`.

Workspace `cargo build --all-targets` and clippy now clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The docker_integration (docker-tests) and zc_glidefs (ublk) test targets are feature-gated, so a plain `cargo build --tests` never compiled them, hiding 4 more RouterConfig literals (field-init shorthand `clean_cache,`) missing the new `pack_index_cache` field.

Verified with the check I should have run from the start: `cargo test --no-run --all-targets --features docker-tests,ublk` → 0 errors, clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The fio_bench, fio_verify, and blktests test targets are gated behind the `fio-bench`/`blktests` features (which the Kernel Devices and blktests CI jobs enable), so they weren't compiled by my earlier feature sets. Fixes the remaining 6 RouterConfig shorthand literals.

Now verified against the COMPLETE test-feature union: `cargo test --no-run --all-targets --features docker-tests,fio-bench,blktests` → 0 errors, clippy clean. (That's the check that covers every gated target.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jaredLunde and others added 4 commits May 28, 2026 18:48

jaredLunde merged commit 1998efe into main May 29, 2026
24 checks passed

jaredLunde deleted the jared/cache-ram-ownership branch May 29, 2026 03:38
