perf(cache): O_DIRECT by default + explicit L1 RAM budget #63
Merged
Make the foyer-backed caches own their RAM explicitly instead of leaning on the OS page cache: default to O_DIRECT (page-cache bypass) with a flat, tunable L1 memory-tier budget. Driven by the empirical findings below (full detail in the PR description).
- O_DIRECT default for both clean + pack-index caches, with graceful fallback to buffered I/O on filesystems that don't support it (e.g. tmpfs), so the neighborly default can't crash startup. CLI one-shot tools stay buffered.
- Clean-cache L1 default = flat 2 GiB budget (`DEFAULT_L1_BYTES`), framed as non-chargeable RAM overhead, not a fraction of node RAM. Sized to the deduplicated shared base + a few hot tenants' warm tails.
- New [cache] knobs: `direct`, `pack_index_memory_size_gb`, `pack_index_ssd_size_gb`; pack-index tiers now configurable.
- Engine stays plain io_uring: iopoll/sqpoll measured and rejected (unsupported on FS-backed O_DIRECT / net CPU loss).
Includes the measurement harnesses used to make these decisions (examples/cache_ram, cache_hot_read, cache_l2_poll, interference_test.sh).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The O_DIRECT/L1 change added `direct` to FoyerCacheConfig and `pack_index_cache` to RouterConfig, but the initial commit only updated sites under glidefs/src. This fixes the construction sites I missed:
- oci-distribution/src/config.rs: FoyerCacheConfig missing `direct` (E0063). This broke the whole workspace build, cascading into every CI job.
- glidefs/tests/**: ~26 RouterConfig literals missing `pack_index_cache`.
Workspace `cargo build --all-targets` and clippy are now clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The docker_integration (docker-tests) and zc_glidefs (ublk) test targets are feature-gated, so a plain `cargo build --tests` never compiled them, hiding 4 more RouterConfig literals (field-init shorthand `clean_cache,`) missing the new `pack_index_cache` field.
Verified with the check I should have run from the start: `cargo test --no-run --all-targets --features docker-tests,ublk` → 0 errors, clippy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fio_bench, fio_verify, and blktests test targets are gated behind the `fio-bench`/`blktests` features (which the Kernel Devices and blktests CI jobs enable), so they weren't compiled by my earlier feature sets. Fixes the remaining 6 RouterConfig shorthand literals.
Now verified against the COMPLETE test-feature union: `cargo test --no-run --all-targets --features docker-tests,fio-bench,blktests` → 0 errors, clippy clean. (That's the check that covers every gated target.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Makes the foyer-backed caches (clean block cache + pack-index cache) own their RAM explicitly instead of leaning on the OS page cache: O_DIRECT by default (page-cache bypass) with a flat, tunable L1 memory-tier budget. Builds on the io_uring engine change (#62).
Every decision here is backed by measurement on a real NVMe host. The full investigation follows.
Why change anything?
With buffered I/O, foyer's SSD tier is read through the OS page cache. That gives fast warm reads — but it (a) double-caches every block (once in foyer's L1, once in the page cache) and (b) evicts co-resident tenants' pages under memory pressure. GlideFS runs on the same boxes as the workloads it serves, so that page-cache appetite competes with the tenants. This PR trades the page cache's "free" speed for explicit, neighborly RAM ownership — and the numbers say that's the right trade if L1 is sized correctly.
Benchmark results
All via the harnesses in `examples/` (`cache_ram`, `cache_hot_read`, `cache_l2_poll`, `interference_test.sh`) on an NVMe + ext4 host. Block size 128 KiB.
1. Buffered (page-cache) vs O_DIRECT (media), SSD-tier read latency
2. Double-caching: page-cache footprint (`fincore`)
Buffered pins the entire L2 footprint in RAM a second time. At a production 10 GB SSD cache that's ~10 GB of host RAM duplicating data foyer already manages on disk. O_DIRECT reclaims all of it.
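The double-caching arithmetic can be written as a tiny model. This is a sketch, not code from the PR: `IoMode` and `worst_case_cache_ram_bytes` are hypothetical names, and the model assumes the worst case in which the page cache ends up holding the whole SSD-tier footprint under buffered I/O.

```rust
/// How the SSD tier is read (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
enum IoMode {
    Buffered,
    Direct,
}

/// Worst-case host RAM attributable to the cache.
/// Assumption: L1 is always resident; under buffered I/O the page cache may
/// additionally hold up to the entire L2 (SSD) footprint.
fn worst_case_cache_ram_bytes(l1_bytes: u64, l2_bytes: u64, mode: IoMode) -> u64 {
    match mode {
        IoMode::Buffered => l1_bytes + l2_bytes,
        IoMode::Direct => l1_bytes,
    }
}

/// RAM given back to the host by switching from buffered to O_DIRECT.
fn reclaimed_by_direct(l1_bytes: u64, l2_bytes: u64) -> u64 {
    worst_case_cache_ram_bytes(l1_bytes, l2_bytes, IoMode::Buffered)
        - worst_case_cache_ram_bytes(l1_bytes, l2_bytes, IoMode::Direct)
}
```

With a 10 GB SSD tier, `reclaimed_by_direct` returns the full 10 GB regardless of the L1 size, which matches the "O_DIRECT reclaims all of it" observation.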
3. Does a bigger explicit L1 recover the buffered speed? (full `get()` path, hot-biased)
A properly-sized L1 + O_DIRECT = 131 ns, identical to buffered, with zero page-cache cost. The buffered "win" was just the page cache acting as an accidental L1. Trap: O_DIRECT + an undersized L1 is the worst case (317 µs / 5.7 ms p99), so L1 must cover the hot set.
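Why L1 must cover the hot set follows from a simple blended-latency model. The sketch below is illustrative only: the function names are hypothetical, it treats 131 ns as the L1-hit cost and the ~220 µs O_DIRECT SSD-tier figure quoted under Decisions as the L1-miss cost, and it ignores queueing, tail latency, and S3 misses.

```rust
/// Mean latency (ns) of a two-tier lookup: a fraction `l1_hit_ratio` of
/// requests is served from L1, the rest from the O_DIRECT SSD tier.
/// Simplified model; not how the PR's harnesses measure latency.
fn mean_get_latency_ns(l1_hit_ratio: f64, l1_ns: f64, ssd_ns: f64) -> f64 {
    assert!(
        (0.0..=1.0).contains(&l1_hit_ratio),
        "hit ratio must be in [0, 1]"
    );
    l1_hit_ratio * l1_ns + (1.0 - l1_hit_ratio) * ssd_ns
}

/// Smallest L1 hit ratio that keeps the mean at or below `target_ns`.
/// Solves h * l1 + (1 - h) * ssd <= target for h.
fn required_hit_ratio(target_ns: f64, l1_ns: f64, ssd_ns: f64) -> f64 {
    assert!(ssd_ns > l1_ns, "model assumes the SSD tier is slower than L1");
    ((ssd_ns - target_ns) / (ssd_ns - l1_ns)).clamp(0.0, 1.0)
}
```

Under these assumptions, `mean_get_latency_ns(0.99, 131.0, 220_000.0)` is about 2.3 µs, so even a 1% L1 miss rate dominates the mean. The budget therefore has to be sized to the hot set rather than to the node.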
4. Tenant interference (fio "tenant" + foyer antagonist in a memory-capped cgroup)
A buffered cache degrades the co-resident tenant ~2–6× more than O_DIRECT — it evicts the tenant's page-cached working set; O_DIRECT leaves it alone. (Reproducible direction; magnitude modest at these toy sizes, larger at production scale.)
5. O_DIRECT-only io_uring tuning (iopoll / sqpoll)
`iopoll` doesn't work on a filesystem-backed O_DIRECT cache (needs a raw block device + NVMe poll queues); it even builds fine then fails reads at runtime. `sqpoll` is a net loss: no latency win, 1.5–2.8× the CPU, exactly backwards on a CPU-contended shared box.
Decisions
1. O_DIRECT by default (clean + pack-index caches). Reclaims ~100% of the cache's RAM footprint (#2), stops degrading tenants (#4), and keeps warm-read speed if L1 is sized (#3). Falls back to buffered automatically on filesystems without O_DIRECT support (tmpfs etc.) so it can't crash startup. CLI one-shot tools (`push`/`bless`) stay buffered (ephemeral temp dirs).
2. L1 is a flat RAM budget, not a fraction of node RAM. With O_DIRECT, L1 is pinned, non-reclaimable, non-chargeable host RAM (unlike the page cache, which is elastic and reclaimable, so its permanent claim on sellable capacity is ~0). A fraction-of-RAM would grow that unsellable reservation as nodes get bigger for no benefit, and on the bare-metal NVMe nodes we target (128–512+ GiB RAM) it always clamped anyway, so it was a hardcoded number wearing a costume.
The working set L1 must hold is set by the workload, not the node:
So L1 ≈ (deduped shared base) + (a few hot tenants' warm tails) ≈ low single-digit GB, roughly constant across node sizes. Default: flat 2 GiB (`DEFAULT_L1_BYTES`), overridable via `memory_size_gb`. S3-FIFO eviction dynamically allocates the budget to whatever's hot. Blocks beyond L1 fall to the O_DIRECT SSD tier (~220 µs), still ~1000× faster than the S3 miss it replaces.
3. Plain io_uring: iopoll/sqpoll rejected empirically (#5).
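Decisions 1 and 2 amount to a small policy resolver: pick the L1 budget from a flat default unless overridden, and use O_DIRECT only when it is both requested and supported. The following is a minimal sketch under stated assumptions, not the PR's code. `CacheKnobs`, `IoMode`, `l1_bytes`, and `io_mode` are hypothetical names, the field types are guesses, `memory_size_gb` is assumed to be in GiB, and the support probe is passed in as a closure rather than touching the filesystem.

```rust
const GIB: u64 = 1024 * 1024 * 1024;
/// Flat default L1 budget named in the PR (2 GiB).
const DEFAULT_L1_BYTES: u64 = 2 * GIB;
/// Block size used in the benchmarks (128 KiB).
const BLOCK_BYTES: u64 = 128 * 1024;

/// Hypothetical mirror of two `[cache]` knobs; types are assumptions.
struct CacheKnobs {
    direct: bool,
    memory_size_gb: Option<u64>,
}

impl Default for CacheKnobs {
    fn default() -> Self {
        // `direct` defaults to true per the PR; no L1 override by default.
        Self { direct: true, memory_size_gb: None }
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum IoMode {
    Direct,
    Buffered,
}

impl CacheKnobs {
    /// L1 budget in bytes: the explicit override if set, else the flat default.
    /// It never depends on how much RAM the node has.
    fn l1_bytes(&self) -> u64 {
        self.memory_size_gb
            .map(|gb| gb.saturating_mul(GIB))
            .unwrap_or(DEFAULT_L1_BYTES)
    }

    /// Effective I/O mode. The probe runs only when O_DIRECT was requested;
    /// an unsupported filesystem downgrades to buffered instead of failing.
    fn io_mode(&self, o_direct_supported: impl FnOnce() -> bool) -> IoMode {
        if !self.direct {
            return IoMode::Buffered;
        }
        if o_direct_supported() {
            IoMode::Direct
        } else {
            eprintln!("warn: O_DIRECT unsupported here, falling back to buffered I/O");
            IoMode::Buffered
        }
    }
}
```

With the defaults, `l1_bytes()` is 2 GiB, which holds 16,384 blocks of 128 KiB. On a tmpfs-like filesystem where the probe returns `false`, `io_mode` yields `IoMode::Buffered` instead of an error, which is the "can't crash startup" property described in decision 1.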
Changes
- `config.rs`: new `[cache]` knobs: `direct` (default true), `pack_index_memory_size_gb`, `pack_index_ssd_size_gb`; `DEFAULT_L1_BYTES = 2 GiB`; accessor methods + validation.
- `foyer_engine.rs`: `o_direct_supported()` probe for the graceful fallback.
- `cache.rs` / `pack_index_cache.rs`: `direct` threaded into `FsDeviceBuilder::with_direct` with probe-and-fall-back + warn log.
- `server.rs` / `router.rs`: server builds both caches with the policy; pack-index passed via a new `Option` RouterConfig field (`None` elsewhere → unchanged).
- `push.rs` / `bless.rs`: buffered (ephemeral).
- `examples/`: the four measurement harnesses.
Verification
`test_foyer_direct_falls_back_gracefully`), pack-index (16), config (12), router (66), api (34).
`direct: true` opens successfully even on tmpfs (takes the buffered path).
Follow-ups (not in this PR)
`memory.high`/PSI) to get page-cache-like elasticity without the double-caching.
🤖 Generated with Claude Code