fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring #96
Under runtime-tokio, BGREWRITEAOF writes a single-file appendonly.aof
with an RDB preamble (legacy v2 format) — it never advances the AOF
manifest or produces moon.aof.<seq>.base.rdb files. But main.rs ran
the multi-part AOF block unconditionally, which:
1. On first boot under tokio, called AofManifest::initialize() to
create an empty moon.aof.manifest.
2. On the next boot, the loader saw the manifest, wiped the freshly
v2-loaded databases (db.clear()), then tried replay_multi_part
which found no base.rdb and silently produced 0 entries.
Net effect: every SET written under tokio was lost on restart, even
after BGREWRITEAOF. This is an ACID-04 / ACID-11 violation that
test_txn_commit_wal_crash_recovery has been catching as a hang +
unexpected None on GET. Reproducer (any non-TXN write also fails):
./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
redis-cli -p 16399 SET k v
redis-cli -p 16399 BGREWRITEAOF
kill -9 $!
./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
redis-cli -p 16399 GET k # → (nil) before fix, "v" after fix
Fix: wrap the entire multi-part AOF replay/initialize block in
`#[cfg(feature = "runtime-monoio")]`. Under tokio, v2 single-file
recovery owns the AOF path (`appendonly.aof` with RDB preamble). The
monoio path is unchanged.
Add a runtime-tokio-only warning that fires when an operator has
switched from monoio: a moon.aof.manifest exists on disk but is now
ignored, with guidance to switch back to monoio to load it. The
multi-part data stays on disk, unread under tokio, rather than being
silently overwritten.
Out of scope: monoio-side first-boot path where manifest exists with
empty base.rdb but non-empty incr produces "AOF base RDB missing ...
incr is N bytes; refusing to replay incr against empty state". That
is a separate pre-existing bug surfaced by the same reproducer under
monoio and will be fixed in a follow-up.
author: Tin Dang
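The boot sequence described in this commit can be modelled in a few lines. The sketch below is a simplified illustration, not the code in `main.rs`: `Disk`, `recover`, and the warning text are hypothetical names, and the boolean `multi_part_enabled` stands in for the compile-time `#[cfg(feature = "runtime-monoio")]` gate so that both behaviours can be exercised in one build.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the on-disk state seen at boot.
struct Disk {
    /// Entries recoverable from the single-file `appendonly.aof` (v2, RDB preamble).
    v2_entries: Vec<(String, String)>,
    /// Whether a `moon.aof.manifest` file is present.
    manifest_exists: bool,
    /// Entries recoverable from multi-part base/incr files.
    multi_part_entries: Vec<(String, String)>,
}

/// Simplified boot-time recovery (an assumption-laden model of the PR text).
/// `multi_part_enabled` models whether the multi-part block is compiled in:
/// always `true` before the fix, `true` only for monoio builds after it.
fn recover(disk: &Disk, multi_part_enabled: bool) -> (HashMap<String, String>, Option<String>) {
    // Step 1: the v2 single-file load populates the database.
    let mut db: HashMap<String, String> = disk.v2_entries.iter().cloned().collect();
    let mut warning = None;

    if multi_part_enabled {
        if disk.manifest_exists {
            // Step 2: the manifest is treated as authoritative, so the
            // v2-loaded state is wiped before the multi-part replay.
            db.clear();
            db.extend(disk.multi_part_entries.iter().cloned());
        }
    } else if disk.manifest_exists {
        // Gated build: leave the manifest alone and tell the operator.
        warning = Some(
            "moon.aof.manifest found but ignored under runtime-tokio; \
             switch back to runtime-monoio to load it"
                .to_string(),
        );
    }
    (db, warning)
}
```

With a disk state of one v2 entry, an existing manifest, and no multi-part files (the tokio second-boot case), `recover(&disk, true)` returns an empty map, while `recover(&disk, false)` keeps the entry and returns the warning.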
test_txn_commit_wal_crash_recovery was failing on every CI run because its post-BGREWRITEAOF wait polled for `appendonlydir/*.base.rdb`, the multi-part-AOF artifact only produced by runtime-monoio. CI runs under `--no-default-features --features runtime-tokio,jemalloc`, where BGREWRITEAOF writes a single-file `appendonly.aof` with RDB preamble and never produces a base.rdb. Result: the test always hit the 10-second timeout asserting "BGREWRITEAOF did not create a base RDB file" before reaching the real recovery assertion.
The companion `get_connection()` calls were also unbounded: when the shard accept loop lagged the TCP bind, the redis-rs handshake could block indefinitely (the macOS "Connection reset by peer" failure mode that the previous push tried to deflake by churning wait_for_server).
Two minimal changes:
1. Poll for either `appendonlydir/moon.aof.<seq>.base.rdb` (monoio) OR a non-empty `appendonly.aof` (tokio). Bump the deadline from 10s to 15s for CI runner variance. Update the assertion message to name both candidate paths so a future failure points at the right file.
2. Replace `client.get_connection()` with `client.get_connection_with_timeout(10s)` at both phase-1 and phase-2 connect sites. The handshake now fails fast instead of hanging the CI runner for 15 minutes.
Companion to the runtime-monoio gate on multi-part AOF in main.rs (previous commit): together those two changes restore test_txn_commit_wal_crash_recovery to green on tokio, which is the CI configuration.
author: Tin Dang
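A runtime-agnostic wait of the kind described in change 1 might look like the sketch below. It is not the code in `tests/txn_kv_wiring.rs`: the function names are hypothetical, and it assumes both `appendonlydir/` and `appendonly.aof` live directly under the server's `--dir`. It also only checks that a `.base.rdb` file exists, which may be weaker than what the real test requires.

```rust
use std::fs;
use std::path::Path;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// True if `<dir>/appendonlydir` holds a `moon.aof.<seq>.base.rdb` file
/// (the multi-part artifact described for runtime-monoio).
fn has_multi_part_base(dir: &Path) -> bool {
    let Ok(entries) = fs::read_dir(dir.join("appendonlydir")) else {
        return false; // directory missing or unreadable
    };
    entries.flatten().any(|e| {
        let name = e.file_name().to_string_lossy().into_owned();
        name.starts_with("moon.aof.") && name.ends_with(".base.rdb")
    })
}

/// True if `<dir>/appendonly.aof` exists and is non-empty
/// (the single-file artifact described for runtime-tokio).
fn has_single_file_aof(dir: &Path) -> bool {
    fs::metadata(dir.join("appendonly.aof"))
        .map(|m| m.len() > 0)
        .unwrap_or(false)
}

/// Polls until either artifact appears or `deadline` elapses.
fn wait_for_rewrite_artifact(dir: &Path, deadline: Duration, poll: Duration) -> bool {
    let start = Instant::now();
    loop {
        if has_multi_part_base(dir) || has_single_file_aof(dir) {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        sleep(poll);
    }
}
```

A test would call `wait_for_rewrite_artifact(dir, Duration::from_secs(15), ...)` after issuing BGREWRITEAOF and, on `false`, fail with a message naming both candidate paths.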
Walkthrough: PR gates multi-part AOF replay to monoio builds and warns under tokio when a manifest exists; tests now use a retrying Redis connector and poll for either monoio …
`test_txn_commit_wal_crash_recovery` was failing on both Linux Check (EAGAIN, os error 11) and macOS Check (Connection reset by peer, os error 54) at the redis-rs handshake. `wait_for_server` only proves TCP bind; the shard accept loop and RESP handler can lag the bind by a small window, during which the first RESP exchange races and fails.
Replace the single `get_connection_with_timeout(10s)` attempt at both phase-1 and phase-2 connect sites with a `connect_redis_with_retry` helper that makes 1s connection attempts with a 200ms backoff between them for up to 15s, then panics with the last captured error.
Also add the PR #96 CHANGELOG entry under `[0.2.0-alpha] — Unreleased` to satisfy the Lint gate's `CHANGELOG.md not updated` check.
author: Tin Dang
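The retry loop behind a helper like `connect_redis_with_retry` can be sketched generically. The version below is an illustration under assumptions, not the helper from the PR: `retry_until` is a hypothetical name, it takes the connection attempt as a closure so that it stays independent of redis-rs, and it returns the last error instead of panicking (a caller can `.expect(...)` the result to get the panic described above).

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `attempt` until it succeeds or the `total` budget is spent,
/// sleeping `backoff` between failures. Returns the last error on timeout.
fn retry_until<T, E>(
    total: Duration,
    backoff: Duration,
    mut attempt: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Give up if another backoff would exceed the budget.
                if start.elapsed() + backoff >= total {
                    return Err(err);
                }
                sleep(backoff);
            }
        }
    }
}
```

In the test, the closure would presumably wrap a bounded handshake such as `client.get_connection_with_timeout(Duration::from_secs(1))`, with `total` set to 15s and `backoff` to 200ms to match the numbers in the commit message.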
Summary
Real bug fix: gate the multi-part AOF recovery block in `main.rs` to `runtime-monoio`. Under `runtime-tokio`, BGREWRITEAOF writes a single-file `appendonly.aof` (RDB preamble) and never advances the manifest — but the unconditional multi-part block created an empty manifest on first boot, then on the next boot wiped v2-loaded state because the multi-part replay found no `base.rdb`. Net effect: every SET written under tokio was lost on restart, even after BGREWRITEAOF. Reproducer:
```sh
./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
redis-cli -p 16399 SET k v
redis-cli -p 16399 BGREWRITEAOF
kill -9 $!
./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
redis-cli -p 16399 GET k # (nil) before fix, "v" after fix
```
Test deflake: `tests/txn_kv_wiring.rs::test_txn_commit_wal_crash_recovery` was failing on every CI run because it polled for `appendonlydir/*.base.rdb` (monoio-only artifact) while CI runs tokio. Replaced with a runtime-agnostic poll that accepts either the multi-part base.rdb OR a non-empty single-file `appendonly.aof`. Also bounded the redis-rs handshake with `get_connection_with_timeout(10s)` instead of an unbounded `get_connection()` — the latter could hang the CI runner for 15 minutes when the shard accept loop briefly lagged the TCP bind.
Together the AOF gate + the test deflake restore `test_txn_commit_wal_crash_recovery` to green under `--no-default-features --features runtime-tokio,jemalloc`, which is the CI configuration. Verified locally on macOS.
Out of scope (separate follow-ups)
Test plan
Companion to #95.