fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring by TinDang97 · Pull Request #96 · pilotspace/moon

TinDang97 · 2026-05-18T02:12:14Z

Summary

Real bug fix: gate the multi-part AOF recovery block in `main.rs` to `runtime-monoio`. Under `runtime-tokio`, BGREWRITEAOF writes a single-file `appendonly.aof` (RDB preamble) and never advances the manifest — but the unconditional multi-part block created an empty manifest on first boot, then on the next boot wiped v2-loaded state because the multi-part replay found no `base.rdb`. Net effect: every SET written under tokio was lost on restart, even after BGREWRITEAOF. Reproducer:

```sh
./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
redis-cli -p 16399 SET k v
redis-cli -p 16399 BGREWRITEAOF
kill -9 $!
./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
redis-cli -p 16399 GET k # (nil) before fix, "v" after fix
```
Test deflake: `tests/txn_kv_wiring.rs::test_txn_commit_wal_crash_recovery` was failing on every CI run because it polled for `appendonlydir/*.base.rdb` (monoio-only artifact) while CI runs tokio. Replaced with a runtime-agnostic poll that accepts either the multi-part base.rdb OR a non-empty single-file `appendonly.aof`. Also bounded the redis-rs handshake with `get_connection_with_timeout(10s)` instead of an unbounded `get_connection()` — the latter could hang the CI runner for 15 minutes when the shard accept loop briefly lagged the TCP bind.

Together the AOF gate + the test deflake restore `test_txn_commit_wal_crash_recovery` to green under `--no-default-features --features runtime-tokio,jemalloc`, which is the CI configuration. Verified locally on macOS.

Out of scope (separate follow-ups)

`test_sharded_concurrent_clients` Linux-CI flake (passes 10/10 locally on macOS — needs Linux repro before patching).
Pre-existing monoio first-boot path where manifest exists with empty base.rdb but non-empty incr produces `AOF base RDB missing ... incr is N bytes; refusing to replay`. Surfaced by the same reproducer under monoio.

Test plan

`cargo build --release` (default features = monoio) — clean
`cargo build --release --no-default-features --features runtime-tokio,jemalloc` — clean
`cargo clippy -- -D warnings` (both feature sets) — no warnings
`cargo fmt --check` — clean
Local reproducer SET/kill/restart/GET passes under tokio after the fix
`cargo test --release --no-default-features --features runtime-tokio,jemalloc --test txn_kv_wiring test_txn_commit_wal_crash_recovery` — passes

Companion to #95.

Summary by CodeRabbit

New Features
- Multi-part AOF replay is now available only when running with the monoio runtime; non-monoio runs will emit a warning if multi-part AOF artifacts are present.
Tests
- Crash-recovery tests made more robust: connection retries and artifact detection now handle both multi-part and single-file AOF scenarios to reduce flakiness.

Under runtime-tokio, BGREWRITEAOF writes a single-file appendonly.aof with an RDB preamble (legacy v2 format) — it never advances the AOF manifest or produces moon.aof.<seq>.base.rdb files. But main.rs ran the multi-part AOF block unconditionally, which: 1. On first boot under tokio, called AofManifest::initialize() to create an empty moon.aof.manifest. 2. On the next boot, the loader saw the manifest, wiped the freshly v2-loaded databases (db.clear()), then tried replay_multi_part which found no base.rdb and silently produced 0 entries. Net effect: every SET written under tokio was lost on restart, even after BGREWRITEAOF. This is an ACID-04 / ACID-11 violation that test_txn_commit_wal_crash_recovery has been catching as a hang + unexpected None on GET. Reproducer (any non-TXN write also fails): ./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes & redis-cli -p 16399 SET k v redis-cli -p 16399 BGREWRITEAOF kill -9 $! ./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes & redis-cli -p 16399 GET k # → (nil) before fix, "v" after fix Fix: wrap the entire multi-part AOF replay/initialize block in `#[cfg(feature = "runtime-monoio")]`. Under tokio, v2 single-file recovery owns the AOF path (`appendonly.aof` with RDB preamble). The monoio path is unchanged. Add a runtime-tokio-only warn that surfaces when an operator has switched from monoio: a moon.aof.manifest exists on disk but is now ignored, with guidance to switch back to monoio to load it. This keeps data unreachable rather than silently overwritten. Out of scope: monoio-side first-boot path where manifest exists with empty base.rdb but non-empty incr produces "AOF base RDB missing ... incr is N bytes; refusing to replay incr against empty state". That is a separate pre-existing bug surfaced by the same reproducer under monoio and will be fixed in a follow-up. author: Tin Dang

test_txn_commit_wal_crash_recovery was failing on every CI run because its post-BGREWRITEAOF wait polled for `appendonlydir/*.base.rdb`, the multi-part-AOF artifact only produced by runtime-monoio. CI runs under `--no-default-features --features runtime-tokio,jemalloc`, where BGREWRITEAOF writes a single-file `appendonly.aof` with RDB preamble and never produces a base.rdb. Result: the test always hit the 10-second timeout asserting "BGREWRITEAOF did not create a base RDB file" before reaching the real recovery assertion. The companion `get_connection()` calls were also unbounded — when the shard accept loop lagged the TCP bind, the redis-rs handshake could block indefinitely (the macOS "Connection reset by peer" failure mode that the previous push tried to deflake by churning wait_for_server). Two minimal changes: 1. Poll for either `appendonlydir/moon.aof.<seq>.base.rdb` (monoio) OR a non-empty `appendonly.aof` (tokio). Bump the deadline from 10s to 15s for CI runner variance. Update the assertion message to name both candidate paths so a future failure points at the right file. 2. Replace `client.get_connection()` with `client.get_connection_with_timeout(10s)` at both phase-1 and phase-2 connect sites. The handshake now fails fast instead of hanging the CI runner for 15 minutes. Companion to the runtime-monoio gate on multi-part AOF in main.rs (previous commit): together those two changes restore test_txn_commit_wal_crash_recovery to green on tokio, which is the CI configuration. author: Tin Dang

qodo-code-review · 2026-05-18T02:12:18Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-05-18T02:12:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 93ab651c-7075-41b7-8372-1a289c763c6c

📥 Commits

Reviewing files that changed from the base of the PR and between b0196d1 and 4425632.

📒 Files selected for processing (2)

CHANGELOG.md
tests/txn_kv_wiring.rs

✅ Files skipped from review due to trivial changes (1)

CHANGELOG.md

🚧 Files skipped from review as they are similar to previous changes (1)

tests/txn_kv_wiring.rs

📝 Walkthrough

Walkthrough

PR gates multi-part AOF replay to monoio builds and warns under tokio when a manifest exists; tests now use a retrying Redis connector and poll for either monoio .base.rdb artifacts or a non-empty appendonly.aof; CHANGELOG updated.

Changes

Multi-part AOF runtime support

Layer / File(s)	Summary
Runtime-gated multi-part AOF in main `src/main.rs`	Multi-part AOF replay is compiled only with `runtime-monoio`; non-monoio builds detect an existing multi-part manifest and log a warning that it will be ignored.
Test adaptations for dual-runtime AOF artifacts `tests/txn_kv_wiring.rs`	Adds `connect_redis_with_retry` and switches Phase 1/Phase 2 connections to it; post-`BGREWRITEAOF` readiness now polls for either `appendonlydir/*.base.rdb` (multipart) or a non-empty `appendonly.aof` (single-file).
Changelog entry `CHANGELOG.md`	Adds unreleased note describing the runtime gating, test deflake steps, and guidance for tokio vs monoio behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudge the code to skip the race,
Monoio replays with steady grace,
Tokio hears a gentle chime:
"AOF manifest ignored this time."
Tests now wait, and handshakes breathe.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring' clearly and concisely summarizes the two main changes: fixing a data-loss bug in AOF replay under tokio and deflaking a test.
Description check	✅ Passed	The pull request description is comprehensive and complete, covering the summary, detailed explanation of both fixes with a reproducer, test plan with verification steps, and notes on scope limitations.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/main-test-flakes-and-aof-replay

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

`test_txn_commit_wal_crash_recovery` was failing on both Linux Check (EAGAIN, os error 11) and macOS Check (Connection reset by peer, os error 54) at the redis-rs handshake. `wait_for_server` only proves TCP bind; the shard accept loop and RESP handler can lag the bind by a small window, during which the first RESP exchange races and fails. Replace the single `get_connection_with_timeout(10s)` attempt at both phase-1 and phase-2 connect sites with a `connect_redis_with_retry` helper that polls 1s attempts with 200ms backoff for up to 15s, then panics with the last captured error. Also add the PR #96 CHANGELOG entry under `[0.2.0-alpha] — Unreleased` to satisfy the Lint gate's `CHANGELOG.md not updated` check. author: Tin Dang

TinDang97 added 2 commits May 18, 2026 09:06

TinDang97 merged commit 14ca4f0 into main May 18, 2026
5 of 8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring#96

fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring#96
TinDang97 merged 3 commits into
mainfrom
fix/main-test-flakes-and-aof-replay

TinDang97 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

qodo-code-review Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TinDang97 commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Out of scope (separate follow-ups)

Test plan

Summary by CodeRabbit

Uh oh!

qodo-code-review Bot commented May 18, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TinDang97 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading