Skip to content

fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring#96

Merged
TinDang97 merged 3 commits into
mainfrom
fix/main-test-flakes-and-aof-replay
May 18, 2026
Merged

fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring#96
TinDang97 merged 3 commits into
mainfrom
fix/main-test-flakes-and-aof-replay

Conversation

@TinDang97
Copy link
Copy Markdown
Collaborator

@TinDang97 TinDang97 commented May 18, 2026

Summary

  • Real bug fix: gate the multi-part AOF recovery block in `main.rs` to `runtime-monoio`. Under `runtime-tokio`, BGREWRITEAOF writes a single-file `appendonly.aof` (RDB preamble) and never advances the manifest — but the unconditional multi-part block created an empty manifest on first boot, then on the next boot wiped v2-loaded state because the multi-part replay found no `base.rdb`. Net effect: every SET written under tokio was lost on restart, even after BGREWRITEAOF. Reproducer:

    ```sh
    ./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
    redis-cli -p 16399 SET k v
    redis-cli -p 16399 BGREWRITEAOF
    kill -9 $!
    ./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
    redis-cli -p 16399 GET k # (nil) before fix, "v" after fix
    ```

  • Test deflake: `tests/txn_kv_wiring.rs::test_txn_commit_wal_crash_recovery` was failing on every CI run because it polled for `appendonlydir/*.base.rdb` (monoio-only artifact) while CI runs tokio. Replaced with a runtime-agnostic poll that accepts either the multi-part base.rdb OR a non-empty single-file `appendonly.aof`. Also bounded the redis-rs handshake with `get_connection_with_timeout(10s)` instead of an unbounded `get_connection()` — the latter could hang the CI runner for 15 minutes when the shard accept loop briefly lagged the TCP bind.

Together the AOF gate + the test deflake restore `test_txn_commit_wal_crash_recovery` to green under `--no-default-features --features runtime-tokio,jemalloc`, which is the CI configuration. Verified locally on macOS.

Out of scope (separate follow-ups)

  • `test_sharded_concurrent_clients` Linux-CI flake (passes 10/10 locally on macOS — needs Linux repro before patching).
  • Pre-existing monoio first-boot path where manifest exists with empty base.rdb but non-empty incr produces `AOF base RDB missing ... incr is N bytes; refusing to replay`. Surfaced by the same reproducer under monoio.

Test plan

  • `cargo build --release` (default features = monoio) — clean
  • `cargo build --release --no-default-features --features runtime-tokio,jemalloc` — clean
  • `cargo clippy -- -D warnings` (both feature sets) — no warnings
  • `cargo fmt --check` — clean
  • Local reproducer SET/kill/restart/GET passes under tokio after the fix
  • `cargo test --release --no-default-features --features runtime-tokio,jemalloc --test txn_kv_wiring test_txn_commit_wal_crash_recovery` — passes

Companion to #95.

Summary by CodeRabbit

  • New Features

    • Multi-part AOF replay is now available only when running with the monoio runtime; non-monoio runs will emit a warning if multi-part AOF artifacts are present.
  • Tests

    • Crash-recovery tests made more robust: connection retries and artifact detection now handle both multi-part and single-file AOF scenarios to reduce flakiness.

Review Change Stack

TinDang97 added 2 commits May 18, 2026 09:06
Under runtime-tokio, BGREWRITEAOF writes a single-file appendonly.aof
with an RDB preamble (legacy v2 format) — it never advances the AOF
manifest or produces moon.aof.<seq>.base.rdb files. But main.rs ran
the multi-part AOF block unconditionally, which:

  1. On first boot under tokio, called AofManifest::initialize() to
     create an empty moon.aof.manifest.
  2. On the next boot, the loader saw the manifest, wiped the freshly
     v2-loaded databases (db.clear()), then tried replay_multi_part
     which found no base.rdb and silently produced 0 entries.

Net effect: every SET written under tokio was lost on restart, even
after BGREWRITEAOF. This is an ACID-04 / ACID-11 violation that
test_txn_commit_wal_crash_recovery has been catching as a hang +
unexpected None on GET. Reproducer (any non-TXN write also fails):

  ./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
  redis-cli -p 16399 SET k v
  redis-cli -p 16399 BGREWRITEAOF
  kill -9 $!
  ./moon --port 16399 --shards 1 --dir /tmp/t --appendonly yes &
  redis-cli -p 16399 GET k   # → (nil) before fix, "v" after fix

Fix: wrap the entire multi-part AOF replay/initialize block in
`#[cfg(feature = "runtime-monoio")]`. Under tokio, v2 single-file
recovery owns the AOF path (`appendonly.aof` with RDB preamble). The
monoio path is unchanged.

Add a runtime-tokio-only warn that surfaces when an operator has
switched from monoio: a moon.aof.manifest exists on disk but is now
ignored, with guidance to switch back to monoio to load it. This
keeps data unreachable rather than silently overwritten.

Out of scope: monoio-side first-boot path where manifest exists with
empty base.rdb but non-empty incr produces "AOF base RDB missing ...
incr is N bytes; refusing to replay incr against empty state". That
is a separate pre-existing bug surfaced by the same reproducer under
monoio and will be fixed in a follow-up.

author: Tin Dang
test_txn_commit_wal_crash_recovery was failing on every CI run because
its post-BGREWRITEAOF wait polled for `appendonlydir/*.base.rdb`, the
multi-part-AOF artifact only produced by runtime-monoio. CI runs under
`--no-default-features --features runtime-tokio,jemalloc`, where
BGREWRITEAOF writes a single-file `appendonly.aof` with RDB preamble
and never produces a base.rdb. Result: the test always hit the
10-second timeout asserting "BGREWRITEAOF did not create a base RDB
file" before reaching the real recovery assertion.

The companion `get_connection()` calls were also unbounded — when the
shard accept loop lagged the TCP bind, the redis-rs handshake could
block indefinitely (the macOS "Connection reset by peer" failure mode
that the previous push tried to deflake by churning wait_for_server).

Two minimal changes:

1. Poll for either `appendonlydir/moon.aof.<seq>.base.rdb` (monoio) OR
   a non-empty `appendonly.aof` (tokio). Bump the deadline from 10s to
   15s for CI runner variance. Update the assertion message to name
   both candidate paths so a future failure points at the right file.

2. Replace `client.get_connection()` with
   `client.get_connection_with_timeout(10s)` at both phase-1 and
   phase-2 connect sites. The handshake now fails fast instead of
   hanging the CI runner for 15 minutes.

Companion to the runtime-monoio gate on multi-part AOF in main.rs
(previous commit): together those two changes restore test_txn_commit_wal_crash_recovery
to green on tokio, which is the CI configuration.

author: Tin Dang
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 93ab651c-7075-41b7-8372-1a289c763c6c

📥 Commits

Reviewing files that changed from the base of the PR and between b0196d1 and 4425632.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • tests/txn_kv_wiring.rs
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/txn_kv_wiring.rs

📝 Walkthrough

Walkthrough

PR gates multi-part AOF replay to monoio builds and warns under tokio when a manifest exists; tests now use a retrying Redis connector and poll for either monoio .base.rdb artifacts or a non-empty appendonly.aof; CHANGELOG updated.

Changes

Multi-part AOF runtime support

Layer / File(s) Summary
Runtime-gated multi-part AOF in main
src/main.rs
Multi-part AOF replay is compiled only with runtime-monoio; non-monoio builds detect an existing multi-part manifest and log a warning that it will be ignored.
Test adaptations for dual-runtime AOF artifacts
tests/txn_kv_wiring.rs
Adds connect_redis_with_retry and switches Phase 1/Phase 2 connections to it; post-BGREWRITEAOF readiness now polls for either appendonlydir/*.base.rdb (multipart) or a non-empty appendonly.aof (single-file).
Changelog entry
CHANGELOG.md
Adds unreleased note describing the runtime gating, test deflake steps, and guidance for tokio vs monoio behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudge the code to skip the race,
Monoio replays with steady grace,
Tokio hears a gentle chime:
"AOF manifest ignored this time."
Tests now wait, and handshakes breathe.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(persistence,tests): tokio AOF replay data loss + deflake txn_kv_wiring' clearly and concisely summarizes the two main changes: fixing a data-loss bug in AOF replay under tokio and deflaking a test.
Description check ✅ Passed The pull request description is comprehensive and complete, covering the summary, detailed explanation of both fixes with a reproducer, test plan with verification steps, and notes on scope limitations.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/main-test-flakes-and-aof-replay

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

`test_txn_commit_wal_crash_recovery` was failing on both Linux Check
(EAGAIN, os error 11) and macOS Check (Connection reset by peer, os error
54) at the redis-rs handshake. `wait_for_server` only proves TCP bind;
the shard accept loop and RESP handler can lag the bind by a small
window, during which the first RESP exchange races and fails.

Replace the single `get_connection_with_timeout(10s)` attempt at both
phase-1 and phase-2 connect sites with a `connect_redis_with_retry`
helper that polls 1s attempts with 200ms backoff for up to 15s, then
panics with the last captured error.

Also add the PR #96 CHANGELOG entry under `[0.2.0-alpha] — Unreleased`
to satisfy the Lint gate's `CHANGELOG.md not updated` check.

author: Tin Dang
@TinDang97 TinDang97 merged commit 14ca4f0 into main May 18, 2026
5 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant