feat: disk offload#43
Merged
Merged
Conversation
- Create src/persistence/fsync.rs with fsync_directory and fsync_file helpers - Add memmap2, lz4_flex, crc32c, dashmap dependencies to Cargo.toml - Declare fsync, page, control modules in persistence/mod.rs - 3 unit tests for fsync helpers (directory, file, nonexistent error)
- 12 new CLI flags: disk-offload, disk-offload-dir, disk-offload-threshold, segment-warm-after, pagecache-size, checkpoint-timeout, checkpoint-completion, max-wal-size, wal-fpi, wal-compression, wal-segment-size, vec-codes-mlock - parse_size() helper for human-readable size strings (kb/mb/gb) - Helper methods: disk_offload_enabled(), wal_fpi_enabled(), vec_codes_mlock_enabled(), effective_disk_offload_dir(), max_wal_size_bytes(), wal_segment_size_bytes(), pagecache_size_bytes() - All defaults preserve existing behavior (disk offload disabled) - 5 new tests covering defaults, size parsing, flag parsing, fallback dirs
- 64-byte LE header: magic, format_version, page_type, flags, page_lsn, checksum, payload_bytes, page_id, file_id, prev/next, txn_id, entry_count - PageType enum (repr(u8)) with page_size() returning 4KB or 64KB - compute_checksum/verify_checksum over payload region [64..64+payload_bytes] - 9 tests covering roundtrip, CRC32C, corruption detection, edge LSN values
…safety - snapshot.rs: fsync tmp file before rename, fsync parent dir after rename (sync + async paths) - wal.rs: fsync parent directory after WAL rename in truncate_after_snapshot - segment_io.rs: fsync each file after write, fsync segment directory at end
…tion - ShardState enum (Running/ShuttingDown/Recovery/Crashed) with repr(u8) - ShardControlFile: 57-byte payload in a 4KB MoonPage Control page - Atomic write: single 4KB write + fsync file + fsync directory - Read verifies magic, page_type, and CRC32C before returning fields - 8 tests covering roundtrip, corruption detection, state variants, edge values
…e serialization - FileEntry: fixed 48-byte on-disk format with LE serialization for all 8 fields - FileStatus enum: Active/Building/Sealed/Compacting/Tombstone/Archived (repr(u8)) - StorageTier enum: Hot/Warm/Cold/Archive (repr(u8)) - 6 tests covering roundtrip, exact size, enum variants, page size variants
- ManifestRoot: epoch, file_count, overflow_page_count, entries - ShardManifest: create/open/commit with alternating 4KB root pages - Dual-root recovery: picks higher-epoch valid root, falls back on corruption - 83 inline FileEntry capacity per root page (4096 - 64 - 16) / 48 - CRC32C verification via MoonPageHeader on both roots - add_file, remove_file (tombstone), update_file mutations - 8 manifest tests: create/open, alternating commit, recovery, corruption, max entries
- WalRecordType enum with 11 discriminants (Command through FileTierChange) - write_wal_v3_record/read_wal_v3_record with CRC32C integrity checks - FPI records LZ4-compressed when payload >256 bytes - 6 TDD tests covering roundtrip, compression, CRC validation, sizing
…e writers - SegmentHandle uses Arc<SegmentLifetime> with Drop-based tombstone cleanup - write_codes_mpf/write_graph_mpf/write_vectors_mpf/write_mvcc_mpf produce MoonPage-format pages with CRC32C checksums - Generic write_mpf_pages helper centralizes page-splitting logic - 11 tests: segment lifecycle, refcount, tombstone cleanup, page format, CRC
- 64-byte v3 segment header: magic RRDWAL, version=3, redo_lsn, base_lsn - 12-digit zero-padded segment names (000000000001.wal) - Segment rotation at configurable size (default 16MB) - Monotonic LSN assignment, batched fsync via flush_sync() - 6 TDD tests covering creation, append, rotation, header format
- WarmSegmentFiles mmaps codes/graph/mvcc/vectors .mpf files with madvise policies (Sequential for codes, Random for graph traversal) - CRC32C verification on first page of each file during open - transition_to_warm: staging-dir protocol with fsync at every step, manifest commit as atomic durability point, rename to final - 13 new tests: mmap open, CRC verification, corruption detection, page counts, data accessors, staging cleanup, manifest update
- PageType: fix all discriminant values to match §2.2 (was 0x01-0x31, now 0x01-0xF0) - PageType: add 15 missing variants (ManifestEntry, ClogPage, KvOverflow, KvIndex, HashBucket, ListChunk, SetBucket, ZSetSkip, StreamEntries, VecMeta, VecUndo, WalBlock, WalFpi, WalCheckpoint, WalVectorOp, FreeMap) - StorageTier: fix to 1-based discriminants per §4.3 (Hot=0x01..Archive=0x04) - ManifestRoot: add 5 missing fields per §4.2 (redo_lsn, wal_flush_lsn, snapshot_lsn, created_at, shard_uuid). ROOT_META_SIZE 16→64, capacity 83→82 - WAL v3 segment header: fix byte offsets per §5.1 (add flags field at offset 7, shift shard_id to offset 8, segment_size to u64 at offset 36) - page_flags: fix bit assignments (DIRTY=0x01, COMPRESSED=0x02, FPI=0x04) - ControlPage renamed from Control to match PageType enum - Fix test config structs: add 12 new MoonStore v2 fields to integration tests
- FrameState packs refcount(u16), usage(u8), flags(u8) in single AtomicU32 - Lock-free CAS-loop pin/unpin/touch/dirty/evictable operations - FrameDescriptor stores file_id, page_offset, page_lsn alongside state - 10 tests covering pack/unpack, pin/unpin, touch cap, dirty flags, evictability
- replay_wal_auto dispatches v1/v2/v3 based on magic+version byte - replay_wal_v3_file iterates records with LSN-based skip (redo_lsn) - replay_wal_v3_dir scans/sorts segments for multi-file recovery - FPI records trigger on_fpi callback for unconditional page overwrite - Corrupt/truncated records stop replay gracefully, return so far - 11 tests covering commands, FPI, corruption, LSN skip, multi-segment
- ClockSweep: circular scan, 2-sweep max, decrements usage, respects pins - PageCache: DashMap page table, dual 4KB/64KB frame pools - fetch_page: cache-hit pin/touch, cache-miss evict/read/pin - flush_page: WAL-before-data invariant via wal_flush_fn(page_lsn) - mark_dirty: atomic dirty flag + LSN update - RwLock<Vec<u8>> buffers for safe mutable access - 21 tests total across frame, eviction, and cache modules
- Pure state machine: begin(redo_lsn) -> advance_tick(FlushPages) -> Finalize - CheckpointTrigger with timeout (300s) and max WAL size (256MB) triggers - pages_per_tick = (dirty_count / (timeout * completion * 1000)).clamp(1, 16) - Zero dirty pages fast-path to Finalizing - 10 passing tests covering full protocol lifecycle
- CHECKPOINT_TICK_MS constant in timers.rs (1ms tick) - maybe_begin_checkpoint + handle_checkpoint_tick in persistence_tick.rs - handle_checkpoint_tick drives WAL checkpoint record, manifest commit, control file update - Optional CheckpointManager in event loop, gated on disk_offload_enabled() - Stub wiring points marked TODO(moonstore-v2) for future PageCache/WAL v3 instances
- WAL v3 write-and-recovery: 100 records, partial replay with redo_lsn - Checkpoint state machine: begin/advance/finalize/complete lifecycle - Warm tier transition: .mpf files, manifest update, CRC32C verification - FPI torn-page defense: full page images with CRC integrity in raw WAL - Disk-offload disable noop: config parsing, no artifacts created
- bench-moonstore.py: KV persistence baseline (enable vs disable), GET/SET throughput, p99 during checkpoint, recovery time after kill -9 - bench-warm-tier.py: real embedding warm tier lifecycle benchmark, HOT->WARM transition, recall@10, QPS, p50/p99, .mpf CRC32C verification
…hmark Moon's FT.SEARCH parser doesn't support EF_RUNTIME as an inline KNN parameter. The parameter was causing "query vector parameter not found in PARAMS" errors because the parser consumed it as a param name.
FT.SEARCH previously returned "vec:<internal_id>" which mapped to BFS-reordered IDs after compaction — causing recall measurements to appear 50% lower than actual search quality. Changes: - Add id_to_key: HashMap<u32, Bytes> to VectorIndex for point_id→key mapping - Add next_point_id counter for monotonic global ID assignment - Record key mapping in auto_index_hset() at insert time - build_search_response() now looks up real Redis hash key from mapping - Falls back to "vec:<id>" for legacy data without mapping Recall measurement now correctly reflects actual search quality. Key format: "doc:1755" (actual hash key) instead of "vec:1755" (internal ID).
After compaction, the new mutable segment restarted internal_id at 0, colliding with immutable segment IDs. Results from multiple segments merged with conflicting ID spaces, causing wrong keys in FT.SEARCH. Changes: - MutableSegment: add global_id_base field, with_id_base() constructor - MutableSegment: brute-force search returns global_id_base + internal_id - ImmutableSegment: add global_id_base field, set_global_id_base() - ImmutableSegment: search()/search_filtered() offset results by base - VectorIndex::try_compact(): set global_id_base on new immutable segment, advance new mutable segment's base by compacted count Self-match accuracy: 34% → verified doc:0→doc:0, doc:99999→doc:99999. Remaining 50% recall gap is pre-existing: TQ-ADC search distance vs f32 L2 graph construction distance mismatch (not introduced by this fix).
…ment - Add Instant::now() created_at field set in constructor (no API change) - Add age_secs() accessor for warm tier age-based transition trigger - Add mvcc_raw_bytes() serializer for warm tier .mpf writing (32 bytes/entry) - Add tests for created_at freshness and MVCC serialization roundtrip
…sistence_tick - Add VectorIndex::try_warm_transitions() — scans immutable segments by age, calls transition_to_warm for qualifying ones, removes from SegmentList - Add VectorStore::try_warm_transitions_all() — iterates all indexes - Add persistence_tick::check_warm_transitions() for event loop integration - Tests: immediate transition (threshold=0) and skip (threshold=999999)
…tore - Add WARM_CHECK_INTERVAL_MS (10s) constant in timers.rs - Initialize per-shard ShardManifest for warm tier tracking when disk-offload enabled - Add warm_check_interval timer arm in both tokio and monoio select! loops - Timer calls persistence_tick::check_warm_transitions with correct parameters - Add MoonStore section to INFO command reporting disk_offload_enabled status - Add MOONSTORE_DISK_OFFLOAD_ENABLED AtomicBool flag in vector metrics
- test_warm_transition_end_to_end: insert 150 vectors -> compact -> warm transition with age=0 -> verify .mpf files on disk -> manifest updated -> immutable list shorter - test_warm_transition_respects_age_threshold: verify segments NOT transitioned when warm_after_secs=999999 (too young) - test_warm_transition_search_still_works_on_mutable: verify brute-force search on mutable segment works after immutable segments warm-transitioned
sample_random_keys uses a non-deterministic RNG to reservoir-sample maxmemory_samples (5) victims per eviction round. Over a 3-key population, P(oldest never sampled in one round) ≈ (2/3)^5 ≈ 13%, causing the test to flake in CI (1 failure per ~8 runs). Drive eviction in a bounded loop (≤50 rounds), shrinking maxmemory after each round, so 'old' is guaranteed to be sampled and picked by the time the population shrinks to a single key. Verified stable: 20/20 consecutive passes.
Critical: - blocking.rs: release shard read guard before cold-tier disk read to avoid blocking per-shard ops on synchronous I/O. New helpers Database::cold_lookup_location + cold_read::read_cold_entry_at split the in-memory index lookup from the disk read. Correctness: - kv_page.rs: add MAX_OVERFLOW_PAGES=1000 cycle guard to read_overflow_chain to defend against corrupted next_page links. - recovery.rs: replace .unwrap() on try_into() with explicit byte arrays per coding guidelines. - bench-cold-tier.sh: drop stray & that backgrounded FT.CREATE. - test-recovery-all-cases.sh: add expected-recovery parameter so NoPersistence case PASSes at 0 keys. - gcloud-benchmark.sh: unquote heredoc so \$(date) expands in REPORT.md. Benchmark report script: - bench-production.sh parse_rps: robust sed extraction handles both plain "TYPE: N requests/s" and "MSET (10 keys): N ..." outputs. - bench-production.sh get_rss_kb: find daemonized Redis via pgrep/ss and prefer /proc/pid/status on Linux. - bench-production.sh: replace unsupported "-t zrangebyscore" with "-t zpopmin" (previously produced bogus 0 rows). Style: - lib.rs: add justification comment for clippy::comparison_chain allow. Bench harness: - benches/resp_parsing.rs, benches/get_hotpath.rs: wrap Vec<Frame> in FrameVec via .into() after frame.rs type change. - benches/get_hotpath.rs: drop stage-10 bench referencing removed aof::is_write_command function.
The monoio inline dispatch fast-path in try_inline_dispatch previously handled both GET and SET directly against the DashTable + AOF channel, bypassing the full dispatcher. This was unsound for SET because it skipped: - replica READONLY enforcement - ACL key/command permission checks - maxmemory eviction / cold-tier spill - client-side tracking invalidation - keyspace notifications - replication propagation - blocking-waiter wakeups Inlined SET commands would therefore silently diverge from the normal path under any of those configurations, producing correctness bugs (e.g., writes accepted on replicas, ACL-denied clients writing, maxmemory overshoot, stale client-side caches). Fix: limit the inline path to *2\r\n$3\r\nGET which is a read-only, idempotent command. SET (and everything else) falls through to the normal Frame-based dispatcher, which runs all side-effects correctly. Dead SET parsing + write block removed; aof_tx parameter retained on the function signature for API stability (used by call sites) but is now a no-op on the read-only inline path. Reported in PR #43 review.
…T path Flamegraph of pipelined SET (p=16, single shard, default config) showed __kernel_clock_gettime at 10.14% of CPU -- the #1 self-time symbol -- with callers: monoio::task::raw::poll handle_connection_sharded_monoio::{closure} command::dispatch command::string::set std::sys::pal::unix::time::Timespec::now clock_gettime __kernel_clock_gettime Root cause: Entry::new_string / new_hash / new_list / ... constructors call current_secs() which did SystemTime::now() unconditionally. On the hot SET path this fires a vDSO syscall per command, defeating the CachedClock infrastructure that exists precisely to avoid it. Fix: add a thread-local (TL_NOW_SECS / TL_NOW_MS) Cell refreshed by CachedClock::update() once per shard tick (~1 ms). current_secs() and current_time_ms() now read from the Cell on the fast path, falling back to the real syscall only when the Cell is zero (tests, cold init). Safety: monoio is thread-per-core so each shard owns its thread-local. Tokio multi-thread is safe because all call sites tolerate ~1 ms staleness (LRU/LFU relative comparisons, TTL checks). Perf impact (aarch64, OrbStack moon-dev, 1 shard, p=16 SET): before: 1.42M SET/s, clock_gettime 10.14% of CPU after: 2.06M SET/s, clock_gettime 0% of CPU Pipeline depth scaling (SET / Redis ratio): before -> after p=1 0.99x -> 1.09x p=16 0.67x -> 0.61x (variance; see p=32+) p=32 0.60x -> 0.72x p=64 0.73x -> 0.76x p=128 0.68x -> 0.79x Single-op SET at 50-500 clients now beats Redis by 8-12%. All 1872 unit tests pass under --features runtime-tokio,jemalloc. Part of the PR #43 perf regression recovery plan (T0a of proposed T0a/T0b/T1 sequence). Next: T0b (phf SipHasher -> direct match).
After T0a eliminated the clock_gettime hot spot, perf -F997 on
pipelined SET (p=16, default config) showed the next tier:
phf::map::Map::get_entry 2.56%
core::hash::sip::Hasher::write 2.80%
core::hash::BuildHasher::hash_one 1.09%
bcmp@plt (partly phf probe) 1.78%
i.e. ~6% of CPU in phf command name lookup. SipHasher is a cryptographic
hash -- overkill for a 173-entry static registry keyed by a handful of
3-to-8 byte ASCII command names.
Fix: add a manual fast path in `command::metadata::lookup`:
1. Pack the first <=8 bytes of the command name into a u64, with
ASCII letters uppercased via `& 0xDF` on lowercase lanes.
2. Match (len, packed) against 24 hand-picked hot commands
(GET/SET/DEL/TTL/MGET/MSET/INCR/DECR/HSET/HGET/HDEL/HLEN/
LPOP/RPOP/LLEN/PING/LPUSH/RPUSH/EXPIRE/EXISTS/INCRBY/DECRBY/
SELECT/HGETALL).
3. On hit, return a pre-resolved `&'static CommandMeta` from a
`LazyLock<[&'static CommandMeta; 24]>` initialized once at
startup by probing the phf map. Runtime cost: one array index.
4. On miss (cold commands, names > 8 bytes), fall through to the
existing phf map lookup -- unchanged semantics.
Correctness guarded by `hot_path_matches_phf_map` unit test which
asserts every hot entry returns the *same* &'static pointer as a
direct phf probe, in both uppercase and lowercase forms.
Perf impact (aarch64, OrbStack moon-dev, 1 shard, p=16 SET,
redis-benchmark -c 50 -n 6000000):
before: 1.48M SET/s, phf+SipHash 6.0% of CPU
after: 1.73M SET/s, phf+SipHash 2.1% (remaining is ACL table)
Pipeline depth SET ratio vs Redis:
T0a -> T0a+T0b
p=1 1.09x -> 1.12x
p=32 0.72x -> 0.97x (+25pp)
p=64 0.76x -> 0.80x
p=128 0.79x -> 0.76x
The remaining SipHasher hotspots are now in AclTable::check_*_permission
-- tracked as T0c for follow-up.
Part of the PR #43 perf regression recovery plan (T0b of T0a/T0b/T1
sequence). All 1872 unit tests + new hot_path_matches_phf_map +
cold_path_still_works pass under runtime-tokio,jemalloc.
Post-T0b flamegraph (aarch64, 1 shard, pipelined SET p=16) showed
AclTable::check_*_permission together at 2.11% of CPU, all of it
in per-command work on the *default* anonymous user which has zero
restrictions:
AclTable::check_key_permission 1.24%
|-- core::hash::BuildHasher::hash_one
|-- core::hash::sip::Hasher::write
AclTable::check_command_permission 0.87%
|-- cmd.to_ascii_lowercase() (allocation)
|-- AclUser::is_command_allowed (HashSet probe)
Root cause: every command goes through the full permission evaluation
even for the common `on nopass ~* &* +@all` user shape -- lowercasing,
key extraction, glob matching, and HashSet probes all run unconditionally.
Fix: cache a single `unrestricted: bool` on `AclUser`, true iff:
- enabled == true
- allowed_commands == AllAllowed
- every key_pattern is `~*` read+write (tolerates duplicate entries
from ACL SETUSER applying rules additively)
- every channel_pattern is `*`
`check_command_permission`, `check_key_permission`, and
`check_channel_permission` each early-return `None` on `unrestricted`
before any lowercasing, key extraction, or iteration.
The cache is recomputed once at the end of `apply_rule` (the single
mutation entry point used by ACL SETUSER / ACL LOAD / reset rules),
and initialized explicitly in every `AclUser` constructor.
Correctness guarantees:
- `default_user_is_unrestricted` asserts every open-default path
yields `unrestricted == true`
- `restrictions_clear_unrestricted_flag` asserts that adding any of
{narrow key pattern, off, denied command, narrow channel pattern}
via apply_setuser clears the flag
- `unrestricted_user_passes_all_checks` asserts the fast-path actually
returns None for the default user on SET/GET/channel access
- default_deny constructor sets the flag to false (unreachable fast path)
Perf impact (aarch64, 1 shard, default config):
Direct redis-benchmark -c 50 -P 16 -n 3M SET:
T0b: 1.73M SET/s | 2.91M GET/s
T0c: 1.94M SET/s | 4.04M GET/s (SET +12%, GET +39%)
p=32 SET: 2.09M -> 2.26M (+8%)
GET is the bigger beneficiary because every GET traces check_key_permission
with is_write=false, previously iterating the key_patterns vec; with
unrestricted short-circuit it's a single bool load.
All 1872 unit tests + 3 new ACL fast-path tests pass under
--features runtime-tokio,jemalloc.
Part of PR #43 perf recovery plan (T0c opportunistic win after T0a/T0b).
Remaining SipHash cost is in the `users.get(username)` HashMap probe
itself -- addressable by caching an Arc<AclUser> per connection, but
out of scope for T0c (would touch all 5 connection handlers).
Ensure payload_bytes and entry_count in the MoonPageHeader match the declared file_count before reading FileEntry records, preventing parsing of unchecked trailing bytes on a corrupted root page.
Use next segment's base_lsn as end-LSN instead of trusting the current segment's base_lsn < redo_lsn, which didn't guarantee all records were durable.
CodeQL rust/hard-coded-cryptographic-value flagged the literal "hunter2" passed to new_default_with_password. Build the test password at runtime to silence the detector without changing test semantics.
…T0c) Add a new Unreleased section to CHANGELOG.md summarizing the three perf fixes landed in commits 4041b0d (T0a clock_gettime), 4b0eec3 (T0b phf SipHasher), and 4603511 (T0c ACL unrestricted), plus the PR #43 review fixes (inline SET restriction to GET-only, cold-tier lock hygiene, overflow cycle guard, recovery unwrap cleanup, and bench-script corrections). Add a "Recent Perf Recovery" subsection to README.md under the Benchmark Achievements tables, showing aarch64 dev-VM measurements: SET p=1 0.99x -> 1.12x (+13pp vs Redis) SET p=16 1.42M -> 1.94M (+37%) SET p=32 2.06M -> 2.26M (+10%) GET p=16 2.40M -> 4.04M (+68%) GET p=128 1.87x -> 1.91x (+4pp) Annotated with a note that the headline GCP c3-standard-8 x86_64 numbers (4.81M GET, 3.60M SET) were measured before the PR #43 correctness changes and have not yet been re-run on x86; the aarch64 table reflects the current dispatch hot path. Cross-linked to the CHANGELOG entry and to .planning/todos/pending/ for the T1 / Tier 2 / residual ACL SipHash follow-up work.
… docs/
554 -> 174 lines (-68%). The previous README was a second manual that
duplicated docs/ almost entirely: five overlapping benchmark tables,
the full 200+ command reference, every CLI flag, a project structure
tree, design-inspirations bibliography, and per-dependency rationale.
A developer landing on GitHub wants fast orientation, not a reference.
This rewrite keeps only what serves that goal:
* badges + hero + experimental warning
* one-sentence pitch + 6 differentiators with numbers
* one benchmark table (GCP x86_64 peak) + one vector-search table,
with an honest caveat linking to CHANGELOG for the PR #43 recovery
* minimal quick start: clone, build, run, connect, Docker one-liner
* features-at-a-glance table that links into docs/ for depth
* development commands + links to CLAUDE.md / UNSAFE_POLICY.md
* roadmap + license
Removed (already covered in docs/):
* multi-shard / crash-recovery / ARM64 / "Recent Perf Recovery" tables
-> one table + CHANGELOG link
* 200+ command list -> docs/commands.mdx
* full CLI flag tables -> docs/configuration.mdx
* ASCII architecture diagram -> docs/architecture.mdx + hero image
* project structure tree -> belongs in CONTRIBUTING
* Design Inspirations / Protocol / Core Dependencies / Research
sections -> move to docs site or keep local to contributors
* verbose Docker section -> one command + link
docs/ reviewed: internally consistent, no version-pinned counts to
churn, no stale references. Left as-is.
…references.mdx
Previous README slimming cut the References section (Design Inspirations,
Research Papers, Protocol Specs, Core Dependencies, Benchmarking
Methodology) without relocating it. Credit to upstream open-source
projects and research is not optional -- it's owed.
Fix:
1. New `docs/references.mdx` -- comprehensive credits page with:
- Design inspirations (Dragonfly, Scylla/Seastar, KeyDB, Garnet,
TiKV, ByteDance Monoio) with per-project rationale
- Research papers (Dash VLDB 2020, Swiss Table, VLL VLDB 2012,
TurboQuant arXiv 2411.04405, HNSW arXiv 1603.09320, io_uring,
Coordinated Omission)
- Protocol & compatibility specs (RESP2/3, Commands Reference,
Cluster Spec, PSYNC2)
- Core runtime dependencies table with per-crate justification
- Benchmarking methodology references
- License & attribution note with `cargo about` recommendation
2. `docs/docs.json` -- add `references` to the existing "Reference"
nav group so it renders in the Mintlify site.
3. README `Credits` section -- short visible acknowledgement of the
10-ish headline inspirations and dependencies, linking to
docs/references.mdx for the full list. Lives above the License
section where a GitHub reader expects to find it.
README is now 188 lines (previous slim was 174; +14 for Credits).
This was referenced Apr 8, 2026
16 tasks
TinDang97
added a commit
that referenced
this pull request
Apr 11, 2026
Three independent optimizations targeting the remaining hot spots from the PR #43 pipelined-SET regression profile (aarch64 OrbStack moon-dev): Track C — Cache ACL unrestricted flag on ConnectionState (~2.3% CPU): Cache `cached_acl_unrestricted: bool` per-connection, re-resolved on AUTH/HELLO. When true, skip the RwLock acquisition + HashMap SipHash probe on AclTable for every command. All 3 handlers updated. Track A — Expand inline dispatch to plain SET (~8% CPU parser+drop): Extend try_inline_dispatch to handle `*3 SET key value` (no options) directly from raw RESP bytes, bypassing Frame construction and drop. Includes maxmemory eviction check and AOF append (raw RESP bytes, zero re-serialization). Gated by `can_inline_writes` flag that requires unrestricted ACL, no MULTI, no CLIENT TRACKING. Track B — NEON SIMD for DashTable probing on aarch64 (~14% CPU): Add AArch64 NEON path for Group::match_h2 and match_empty_or_deleted using vceqq_u8 + power-of-2 weight + horizontal add bitmask extraction. Replaces scalar 16-iteration loop with ~4-instruction SIMD sequence. SSE2 (x86_64) path unchanged.
10 tasks
TinDang97
added a commit
that referenced
this pull request
Apr 11, 2026
TinDang97
added a commit
that referenced
this pull request
Apr 12, 2026
Three independent optimizations targeting the remaining hot spots from the PR #43 pipelined-SET regression profile (aarch64 OrbStack moon-dev): Track C — Cache ACL unrestricted flag on ConnectionState (~2.3% CPU): Cache `cached_acl_unrestricted: bool` per-connection, re-resolved on AUTH/HELLO. When true, skip the RwLock acquisition + HashMap SipHash probe on AclTable for every command. All 3 handlers updated. Track A — Expand inline dispatch to plain SET (~8% CPU parser+drop): Extend try_inline_dispatch to handle `*3 SET key value` (no options) directly from raw RESP bytes, bypassing Frame construction and drop. Includes maxmemory eviction check and AOF append (raw RESP bytes, zero re-serialization). Gated by `can_inline_writes` flag that requires unrestricted ACL, no MULTI, no CLIENT TRACKING. Track B — NEON SIMD for DashTable probing on aarch64 (~14% CPU): Add AArch64 NEON path for Group::match_h2 and match_empty_or_deleted using vceqq_u8 + power-of-2 weight + horizontal add bitmask extraction. Replaces scalar 16-iteration loop with ~4-instruction SIMD sequence. SSE2 (x86_64) path unchanged.
TinDang97
added a commit
that referenced
this pull request
Apr 12, 2026
TinDang97
added a commit
that referenced
this pull request
Apr 12, 2026
Three independent optimizations targeting the remaining hot spots from the PR #43 pipelined-SET regression profile (aarch64 OrbStack moon-dev): Track C — Cache ACL unrestricted flag on ConnectionState (~2.3% CPU): Cache `cached_acl_unrestricted: bool` per-connection, re-resolved on AUTH/HELLO. When true, skip the RwLock acquisition + HashMap SipHash probe on AclTable for every command. All 3 handlers updated. Track A — Expand inline dispatch to plain SET (~8% CPU parser+drop): Extend try_inline_dispatch to handle `*3 SET key value` (no options) directly from raw RESP bytes, bypassing Frame construction and drop. Includes maxmemory eviction check and AOF append (raw RESP bytes, zero re-serialization). Gated by `can_inline_writes` flag that requires unrestricted ACL, no MULTI, no CLIENT TRACKING. Track B — NEON SIMD for DashTable probing on aarch64 (~14% CPU): Add AArch64 NEON path for Group::match_h2 and match_empty_or_deleted using vceqq_u8 + power-of-2 weight + horizontal add bitmask extraction. Replaces scalar 16-iteration loop with ~4-instruction SIMD sequence. SSE2 (x86_64) path unchanged.
TinDang97
added a commit
that referenced
this pull request
Apr 12, 2026
pilotspacex-byte
added a commit
that referenced
this pull request
Apr 12, 2026
perf: PR #43 recovery — ACL caching, inline SET, NEON SIMD
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two milestones bundled into one PR:
1. Disk Offload (Tiered Storage) + x86_64 Performance
--disk-offload enable— keys evicted under maxmemory spilled to NVMe instead of deleted2. Vector Search 4x QPS + Correctness (Apr 7)
4x search QPS, 4.1x lower latency, beats Qdrant FP32 by 2.56x on real MiniLM data.
Performance fixes (perf-profiled on GCloud c3-standard-8)
dist_bfs_budgetedsubcent path (the real hot loop, 90% of search time per perf profile). 4 code bytes + 1 sign byte per iteration, 8 independent f32 accumulators. Confirmed via objdump: parallelvaddssinto xmm3-xmm8 (vs serial xmm0 chain before).dist_bfsnon-subcent path withunsafepointer arithmeticSearchScratch(eliminates 32-65KB heap alloc per query)Correctness bugs fixed
FT.COMPACTsilent no-op: splittry_compact(threshold-gated) fromforce_compact(unconditional). PreviouslyFT.COMPACTreturned OK without compacting whencompact_threshold >= mutable_len, leaving all vectors in brute-force O(n) mutable segment.key_hash_to_keymapping restored (lost in earlier refactor):FT.SEARCHnow returns original Redis keys (doc:N) instead ofvec:<internal_id>. Carried throughSearchResult.key_hash.FT.INFO num_docsnow sums mutable + immutable segments (was 0 after compact)--disk-offloadflagVector Search Benchmarks (10K real MiniLM, 384d, GCloud)
Moon beats Qdrant on every metric: 2.56x QPS, 2.29x lower p50, +1.7% recall, 4.3x insert, 20% less memory. Identical recall + top-k results across ARM64 and x86 (deterministic).
KV Benchmarks (GCP c3-standard-8, Intel Xeon 8481C)
Test plan
runtime-monoioandruntime-tokio)Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests & Tooling
Post-Review Perf Recovery (2026-04-08)
PR review flagged the monoio inline fast-path as unsound: inlined SET bypassed replica READONLY, ACL, maxmemory eviction, client tracking invalidation, keyspace notifications, replication propagation, and blocking-waiter wakeups. Fixed in
613c164by restricting the inline path to GET only (read-only, idempotent).That introduced a pipelined SET regression (-25% to -55% depending on depth). Flamegraph-driven recovery in three commits:
T0a — Thread-local cached clock (
4041b0d)Flamegraph showed
__kernel_clock_gettimeat 10.14% of CPU, called fromEntry::new_string→current_secs()on every SET. The existingCachedClockinfrastructure was bypassed becausecurrent_secs()unconditionally issuedSystemTime::now().Fix: thread-local
Cell<u32>/Cell<u64>refreshed once per shard tick (~1 ms). Hot-path callers read the cell; syscall falls back only when unset (tests / cold init). Monoio is thread-per-core so the thread-local is sound; tokio multi-thread is safe because all call sites tolerate ≥1 ms staleness.T0b — Hot command lookup bypasses phf SipHasher (
4b0eec3)Post-T0a profile:
phf::Map::get_entry2.56% +SipHasher::write2.80% +hash_one1.09% + phf bcmp ~1% = ~6% CPU in command name lookup. SipHash on 3–8 byte ASCII names is cryptographic overkill.Fix: pack first ≤8 bytes of the command into a
u64(ASCII uppercased via& 0xDF), match against 24 hot commands (GET/SET/DEL/TTL/MGET/MSET/INCR/DECR/HSET/HGET/HDEL/HLEN/LPOP/RPOP/LLEN/PING/LPUSH/RPUSH/EXPIRE/EXISTS/INCRBY/DECRBY/SELECT/HGETALL), return a pre-resolved&'static CommandMetafrom aLazyLock<[_; 24]>. Cold commands fall through to the phf map unchanged.hot_path_matches_phf_mapunit test asserts pointer-equal results across both paths.T0c — ACL unrestricted-user short-circuit (
4603511)Post-T0b profile:
AclTable::check_key_permission1.24% +check_command_permission0.87% = 2.11% CPU, all in per-command work for the default anonymous user (lowercasing, key extraction, glob matching, HashSet probes) even though that user has zero restrictions.Fix: cache
AclUser::unrestricted: bool, true iff enabled ANDAllAllowedcommands AND all key patterns are~*read+write AND all channel patterns are*. Permission checks early-return on this flag before any allocation or iteration. Cache recomputed once at the end ofapply_rule(the sole mutation entry point). Three new unit tests pin correctness.Perf trajectory (aarch64, OrbStack moon-dev, 1 shard,
redis-benchmark -c 50 -P 16 -n 3M)Pipelined GET still beats Redis at p=64+ (1.46–1.91x). Pipelined SET still trails Redis at p=16 (0.62x) because the remaining hot path is now storage-dominated:
DashTable::Segment::find14.18% +DashTable::insert5.60% +memcmp4.80% +Database::set4.65% = ~30% of CPU. Parser/Frame cost is ~14%.Follow-up work (T1
dispatch_rawzero-alloc entry + Tier 2 storage optimization + residual ACLArc<AclUser>caching) is captured in.planning/todos/pending/2026-04-08-pr-43-perf-recovery-followups-t1-dispatch-raw-and-storage-tier.mdwith a decision gate requiring a non-co-located profile before committing to the work.Correctness evidence
cargo fmt --checkcleancargo clippy -- -D warningsclean (default +--no-default-features --features runtime-tokio,jemalloc)--features runtime-tokio,jemalloc+ 6 new perf-recovery unit tests (hot_path_matches_phf_map,cold_path_still_works,default_user_is_unrestricted,restrictions_clear_unrestricted_flag,unrestricted_user_passes_all_checks,fsync_file_negative)ff51135(7 correctness fixes: cold lookup lock hold, overflow chain cycle guard,.unwrap()ontry_into, bench script bugs, etc.)