chore(tests): disk_v2 data-loss research scratchbook #25524
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| // Scoped relaxations for the Antithesis research scratchbook. These are internal | ||
| // working notes (dense property tables, ad-hoc code fences), not published docs, | ||
| // so a few cosmetic rules are disabled here only — the repo-wide config still | ||
| // applies everywhere else. | ||
| { | ||
| "extends": "../../../.markdownlint.jsonc", | ||
| "MD060": false, // table-column-style: the property-catalog tables use empty-header `| | |` 2-col layout | ||
| "MD040": false, // fenced-code-language: many ad-hoc evidence snippets are intentionally language-less | ||
| "MD022": false // blanks-around-headings: relaxed for the dense note format | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,113 @@ | ||
| # External References Digest (working note for discovery agents) | ||
|
|
||
| This is scaffolding for the antithesis-research run on **disk buffers v2** | ||
| (`lib/vector-buffers/src/variants/disk_v2/`). User scope answer: *"Whatever you | ||
| have access to. You have your MCPs."* — so in-repo docs/RFCs plus Datadog | ||
| internal doc/Jira were consulted. Key findings condensed below so per-focus agents | ||
| don't need to re-fetch. | ||
|
|
||
| ## In-repo references | ||
|
|
||
| - `rfcs/2021-10-14-9477-buffer-improvements.md` — original buffer-rework RFC. | ||
| - `docs/specs/buffer.md` — buffer component spec / claimed behavior. | ||
| - `lib/vector-buffers/src/variants/disk_v2/mod.rs` — authoritative design doc | ||
| (module-level comment): on-disk format, ledger, record IDs, recovery. | ||
|
|
||
| ## Claimed guarantees (from `mod.rs` design doc + buffer spec + internal doc) | ||
|
|
||
| - Data files never exceed 128MB; ≤ 65,536 files; buffer ≤ ~8TB. | ||
| - All records checksummed with **CRC32C**; records written | ||
| sequentially/contiguously; a record never spans two data files. | ||
| - Writers create+write data files; readers read+delete them. Reader deletes a | ||
| data file **only after all records in it are acknowledged** (whole-file | ||
| deletion, never partial truncation). | ||
| - Ledger (`buffer.db`, memory-mapped) tracks `writer_next_record_id`, | ||
| `writer_current_data_file_id`, `reader_current_data_file_id`, | ||
| `reader_last_record_id`. Fields updated atomically, but **not** atomically | ||
| w.r.t. reader/writer activity. | ||
| - Record IDs are monotonic and encode event count: record ID N with next record | ||
| M means the record holds M−N events. Used to compute buffer event count and to | ||
| detect gaps / dropped events after corruption. | ||
| - **Durability:** data is fsync'd every **500ms** (`DEFAULT_FLUSH_INTERVAL`). | ||
| Page-cache flush happens on every `flush()` (readers see data immediately on | ||
| Linux); full fsync only every 500ms. **Data-loss window on crash = up to 500ms | ||
| of unsynced writes** (when e2e acks off). Graceful shutdown flushes everything | ||
| → no loss. | ||
| - Min buffer `max_size` ~256MB; `DEFAULT_MAX_DATA_FILE_SIZE` = 128MB; | ||
| `DEFAULT_MAX_RECORD_SIZE` = 128MB; `DEFAULT_WRITE_BUFFER_SIZE` = 256KB. | ||
| - Endianness: files are host-endian; not portable across architectures. | ||
| - Delivery semantics with e2e acks + disk buffer = **at-least-once**: crash after | ||
| buffer write but before downstream ack → replay on restart → **possible | ||
| duplicates** (downstream must dedup). | ||
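The record-ID accounting described in the list above (record ID N followed by record ID M means the first record holds M−N events) can be sketched as a gap check. This is a minimal illustration, not the `disk_v2` reader code: `RecordInfo` and `count_missing_events` are hypothetical names, and the sketch assumes each record's event count is known when it is read.

```rust
/// A record as a reader might see it: its ID and how many events it holds.
/// Hypothetical type for illustration; not the actual `disk_v2` record struct.
#[derive(Debug, Clone, Copy)]
struct RecordInfo {
    id: u64,
    event_count: u64,
}

/// Sums the events missing between consecutive records.
///
/// Because record IDs encode event counts, the record after `r` should carry
/// the ID `r.id + r.event_count`. A larger ID means the events in between
/// were lost (for example, skipped after a corrupted record).
fn count_missing_events(records: &[RecordInfo]) -> u64 {
    let mut missing = 0;
    for pair in records.windows(2) {
        let expected_next = pair[0].id + pair[0].event_count;
        // `saturating_sub` keeps the sketch total even if IDs were ever
        // non-monotonic; that case would be a separate invariant violation.
        missing += pair[1].id.saturating_sub(expected_next);
    }
    missing
}
```

A workload or unit test could feed this the IDs observed after a crash and compare the result against any dropped-event count the buffer reports.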
|
|
||
| ## Known bugs / incidents (HIGH-VALUE Antithesis targets) | ||
|
|
||
| 1. **Ledger `total_buffer_size` AtomicU64 underflow → permanent writer deadlock** | ||
| (Vector #21683, partially mitigated by PR #23561 on the *reporter* side only; | ||
| the ledger atomic still wraps). | ||
| - `decrement_total_buffer_size` (ledger.rs ~291-298) does raw | ||
| `fetch_sub(amount, AcqRel)` with **no saturation**. If `amount > | ||
| current_value`, the atomic wraps to ≈ 2^64. | ||
| - Then `total_buffer_size + unflushed_bytes` is always astronomical → | ||
| `is_buffer_full()` returns true forever → `can_write_record()` false forever | ||
| → writer's `ensure_ready_for_write()` (writer.rs ~1001-1020) loops on | ||
| `ledger.wait_for_reader().await` and never recovers. **Writer deadlocks | ||
| permanently.** | ||
| - Trigger: crash/reboot/abrupt-shutdown that leaves a data file whose on-disk | ||
| size and readable-record bytes disagree, combined with the reader running | ||
| through that file on restart. Partial writes at file-rotation boundaries are | ||
| the most plausible cause. Not deterministic per-restart, but not exotic. | ||
| - Reporter-side gauges use `saturating_sub` (PR #23561) so the *dashboard* | ||
| no longer shows 2^64, but the ledger control-path atomic is unfixed. | ||
|
|
||
| 2. **Disk buffer stall + silent event drops during config reload** | ||
| (Vector #24948, PR #24949; directly implicated in an **internal non-prod | ||
| config-reload incident**). | ||
| - Old writer dropped while events still in-flight → events lost without | ||
| accounting. | ||
| - `track_dropped_events` passes `0` for `byte_size` → permanent drift in | ||
| buffer-size metrics. | ||
| - `synchronize_buffer_usage()` re-seeds metrics while the old reporter may | ||
| still run → double-counted metric spikes; then a metrics gap between old | ||
| reporter teardown and the first tick (2s) of the new reporter. | ||
|
|
||
| 3. **`component_discarded_events_total` blind to buffer drops** (Vector #24606, | ||
| #24144). When a disk buffer fills and `drop_newest` fires, only | ||
| `buffer_discarded_events_total` increments; the component-level discarded | ||
| counter stays 0 → silent data loss on dashboards. `BufferEventsDropped::emit()` | ||
| in `lib/vector-buffers/src/internal_events.rs` never calls | ||
| `ComponentEventsDropped`. | ||
|
|
||
| 4. **Buffer size gauges stuck non-zero / negative** (Vector #23995, #17666, | ||
| #21683). The reporter computes `current()` as | ||
| `total_entered.saturating_sub(total_left)`, which clamps at zero and so rules | ||
| out negative readings; the stuck-at-non-zero case is still open. | ||
|
|
||
| 5. **Component tags lost for sinks using disk buffers** (OPA-5380): components | ||
| paused for IO at init time lose `component_*` labels on later-registered | ||
| metrics (utilization, etc.). | ||
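The wraparound in item 1 can be reproduced with the standard library alone. The sketch below is a simplified model, not the `ledger.rs` code: `decrement_wrapping` and `is_buffer_full` are hypothetical stand-ins, and the fullness check assumes a plain `total + unflushed >= max_size` comparison (with a saturating add so the model itself cannot overflow).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Raw decrement with no saturation, modelling the unguarded `fetch_sub`
/// described in item 1. Returns the value stored after the subtraction.
fn decrement_wrapping(total: &AtomicU64, amount: u64) -> u64 {
    // `fetch_sub` returns the previous value and wraps around on underflow.
    total.fetch_sub(amount, Ordering::AcqRel).wrapping_sub(amount)
}

/// Simplified stand-in for the fullness check: the buffer counts as full once
/// the tracked total plus unflushed bytes reaches `max_size`.
fn is_buffer_full(total: &AtomicU64, unflushed_bytes: u64, max_size: u64) -> bool {
    total
        .load(Ordering::Acquire)
        .saturating_add(unflushed_bytes)
        >= max_size
}
```

Starting from a total of 100 and decrementing by 101 leaves the atomic at `u64::MAX`, after which `is_buffer_full` returns true for any realistic `max_size`. Nothing in this model brings the value back down, which matches the permanent stall described above.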
|
|
||
| ## Existing test strategy (so we don't duplicate it) | ||
|
|
||
| - In-repo: extensive `proptest` + **model-based testing** under | ||
| `variants/disk_v2/tests/model/` (a reference model + action sequencer + | ||
| in-memory filesystem). Unit tests for acknowledgements, initialization, | ||
| known_errors, size_limits, invariants, record. | ||
| - Datadog internal: an E2E **chaos test** that SIGKILLs the worker 3× with e2e acks | ||
| enabled and asserts every event is delivered end-to-end. Antithesis should go | ||
| beyond: explore fault *timing/interleavings* (partial writes at rotation, | ||
| fsync-vs-crash windows, reader/writer races on the mmap'd ledger) that a fixed | ||
| 3×SIGKILL test cannot. | ||
| - A **major lock-contention performance issue** affected all disk-buffer users | ||
| (writer throughput ~90 MiB/s capped by contention) — points at writer/reader | ||
| coordination hot paths. | ||
|
|
||
| ## Notes on faults | ||
|
|
||
| - Crash-recovery properties require **node termination faults** (often disabled | ||
| by default in Antithesis tenants) — flag this in the catalog. | ||
| - The disk buffer is **single-process** (intra-Vector reader+writer sharing an | ||
| mmap'd ledger). Network/partition faults are largely irrelevant to the buffer | ||
| itself; the strong levers are node kill/restart, node hang, CPU throttling | ||
| (exposes the fsync/flush timing windows and lock contention), and filesystem | ||
| state across restart. | ||
| </content> |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,174 @@ | ||
| --- | ||
| sut_path: /home/ssm-user/src/vector | ||
| commit: b7aae737cef5dd37d1445915443a1eb97b584f85 | ||
| updated: 2026-05-28 | ||
| external_references: | ||
| - path: lib/vector-buffers/src/variants/disk_v2/mod.rs | ||
| why: Confirms the buffer is single-process (intra-Vector reader+writer over an mmap'd ledger) | ||
| - path: (internal design doc, not linked) | ||
| why: Disk buffer is configured per-sink; e2e acks require a supporting source; at-least-once semantics | ||
| - path: (internal design doc, not linked) | ||
| why: Existing chaos test crashes the worker with SIGKILL x3 + e2e acks — the topology must support repeated kill/restart | ||
| - path: distribution/docker/ | ||
| why: Existing Vector Dockerfiles to reuse/adapt for the SUT container | ||
| --- | ||
|
|
||
| # Deployment Topology: Disk Buffer v2 | ||
|
|
||
| ## Key fact driving the design | ||
|
|
||
| The disk buffer is **single-process**: the reader, writer, and finalizer all run | ||
| inside one Vector process, coordinating through an `mmap`'d ledger and the local | ||
| filesystem. There is **no network, no peer, no quorum**. Therefore: | ||
|
|
||
| - The strong fault levers are **node termination (kill/restart)**, **node hang**, | ||
| **CPU throttling**, **clock jitter**, and **filesystem state across restart** — | ||
| NOT network partitions or bad-node faults (those are irrelevant to the buffer). | ||
| - The topology is minimal: **one SUT container + one workload/client container.** | ||
| No dependency containers are needed (no S3/Kafka/Postgres) — the buffer's only | ||
| "dependency" is the local filesystem. | ||
|
|
||
| ## Topology | ||
|
|
||
| ```text | ||
| +-----------------------------+ events (HTTP, e2e-ack-capable source) | ||
| | workload (client) | -----------------------------------------> +-----------------------------+ | ||
| | - produces unique event IDs| | vector (SUT) | | ||
| | - HTTP collector endpoint | <----------------------------------------- | source -> sink(disk buffer)| | ||
| | - tracks produced/delivered| sink delivers here (HTTP sink) | data_dir on PERSISTENT vol | | ||
| | - emits Antithesis asserts | +-----------------------------+ | ||
| | - test template /opt/... | | Antithesis injects | ||
| +-----------------------------+ | node-kill / hang / | ||
| | CPU-throttle / clock | ||
| v faults HERE | ||
| +-----------------------------+ | ||
| | persistent volume | | ||
| | <data_dir>/buffer/v2/<id>/ | | ||
| +-----------------------------+ | ||
| ``` | ||
|
|
||
| ## Containers | ||
|
|
||
| ### 1. `vector` — Service (the SUT) | ||
|
|
||
| - **Image:** adapt an existing Dockerfile from `distribution/docker/` (Debian or | ||
| Distroless). Two build variants: | ||
| - **Baseline build:** stock Vector — exercises all workload-observable | ||
| properties (durability, at-least-once, deadlock-via-throughput-stall, metric | ||
| correctness, recovery). | ||
| - **Instrumented build (recommended for the deadlock/corruption cluster):** | ||
| Vector built with the **Antithesis Rust SDK** added as a dependency to | ||
| `lib/vector-buffers`, with the missing SUT-side assertions inserted (see | ||
| "SUT-side instrumentation" below). This is the only way to directly assert | ||
| the internal states (`total-buffer-size-never-underflows`, | ||
| `record-id-monotonicity-holds`, `partial-write-at-rotation-recovers`, | ||
| `graceful-shutdown-flushes-all`/`unflushed_bytes==0`) that are invisible from | ||
| the workload. | ||
| - **Runs:** a single `vector` process with a config: | ||
| - `source`: an e2e-ack-capable source the workload can push to. Prefer | ||
| `datadog_agent` or `http_server` with `acknowledgements: true` (needed for | ||
| `every-written-event-eventually-delivered` and the durable-survival | ||
| properties). Keep one source. | ||
| - `sink`: an `http` sink with `buffer: { type: disk, max_size: <~256MB+>, | ||
| when_full: block }`, posting to the workload's collector. A second | ||
| config/run uses `when_full: drop_newest` for `dropped-events-are-counted`. | ||
| - Internal metrics exposed (e.g. `internal_metrics` → `prometheus_exporter`) | ||
| so the workload can read `buffer_*` / `component_discarded_events_total` for | ||
| the metric-correctness properties. | ||
| - **CRITICAL — persistent buffer storage:** the disk-buffer `data_dir` MUST be on | ||
| storage that **survives the container's kill/restart**. Disk-buffer durability | ||
| is the whole point; if Antithesis node-termination recreates the container with | ||
| a fresh filesystem, the buffer is wiped and every crash-recovery property | ||
| passes vacuously (or fails spuriously). Mount `<data_dir>` on a persistent | ||
| volume. **Confirm with the user how their tenant's node-termination interacts | ||
| with filesystem persistence.** | ||
| - **Faults target this container:** node kill/restart (required by Categories | ||
| 2–6), node hang, CPU throttle (widens fsync/lock-contention windows), clock | ||
| jitter (perturbs the 500ms `should_flush` deadline). | ||
| - **Replica count:** 1. (No replication; more instances add nothing.) | ||
| - **Tuning for bug-finding:** set a small `max_data_file_size` (e.g. 1MB) and a | ||
| small `max_size` to maximize file-rotation frequency and reach the rotation/ | ||
| partial-write window faster; optionally set `flush_interval` low to widen the | ||
| durably-written set, or high to widen the loss window — test both. | ||
|
Comment on lines +89 to +92
This tuning advice points the harness at |
||
|
|
||
| ### 2. `workload` — Client (the test driver) | ||
|
|
||
| - **Image:** a small Rust container with the **Antithesis Rust SDK** (to match | ||
| the SUT language and emit assertions), or a Go container with the Go SDK. | ||
| Includes the test template at `/opt/antithesis/test/v1/{name}/`. | ||
| - **Runs:** | ||
| 1. Starts an HTTP **collector** endpoint (the sink's destination) that records | ||
| every delivered event ID (counting duplicates). | ||
| 2. Emits `setup_complete` once it and Vector are ready. | ||
| 3. Sleeps so Antithesis can run test-template commands. | ||
| - **Test-template commands** drive: produce a stream of uniquely-IDed events to | ||
| Vector's source; periodically (via `ANTITHESIS_STOP_FAULTS` quiet periods) | ||
| drain and assert liveness/at-least-once; inspect Vector's metrics; toggle the | ||
| collector to return errors (for `sink-failure-not-silently-acked`); trigger a | ||
| config reload (custom fault, for `config-reload-no-silent-loss`). | ||
| - **Assertions emitted here** (workload-observable properties): at-least-once | ||
| set-difference, no-loss-on-graceful-shutdown, drop accounting vs metric, writer | ||
| throughput resumes after recovery (deadlock signal), buffer gauges return to ~0 | ||
| on drained restart. | ||
| - **Replica count:** 1. | ||
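The at-least-once set-difference check named in the workload assertions can be sketched as follows. This is an illustrative helper under stated assumptions, not part of the scratchbook: `DeliveryReport` and `check_delivery` are hypothetical names, and event IDs are assumed to be `u64` values the producer and collector both record.

```rust
use std::collections::{HashMap, HashSet};

/// Outcome of comparing produced event IDs with what the collector received.
#[derive(Debug, PartialEq, Eq)]
struct DeliveryReport {
    /// Produced but never delivered: a violation of at-least-once.
    missing: Vec<u64>,
    /// Delivered more than once: permitted under at-least-once, still recorded.
    duplicated: Vec<u64>,
    /// Delivered but never produced: points at a harness or SUT bug.
    unexpected: Vec<u64>,
}

fn check_delivery(produced: &[u64], delivered: &[u64]) -> DeliveryReport {
    let produced_set: HashSet<u64> = produced.iter().copied().collect();

    // Count deliveries per ID so duplicates stay visible.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for id in delivered {
        *counts.entry(*id).or_insert(0) += 1;
    }

    let mut missing: Vec<u64> = produced_set
        .iter()
        .copied()
        .filter(|id| !counts.contains_key(id))
        .collect();
    let mut duplicated: Vec<u64> = counts
        .iter()
        .filter(|(_, n)| **n > 1)
        .map(|(id, _)| *id)
        .collect();
    let mut unexpected: Vec<u64> = counts
        .keys()
        .copied()
        .filter(|id| !produced_set.contains(id))
        .collect();

    // Sort so the report is deterministic regardless of hash iteration order.
    missing.sort_unstable();
    duplicated.sort_unstable();
    unexpected.sort_unstable();

    DeliveryReport { missing, duplicated, unexpected }
}
```

During a quiet period the workload would drain the buffer, call a check like this, and assert that `missing` is empty; a non-empty `duplicated` list is expected after a kill/restart with e2e acks enabled.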
|
|
||
| ## SUT-side instrumentation (for the instrumented build) | ||
|
|
||
| No Antithesis SDK exists in the repo today (`existing-assertions.md`). For the | ||
| internal-state properties, add `antithesis-sdk` to `lib/vector-buffers/Cargo.toml` | ||
| and insert (all currently MISSING): | ||
|
|
||
| - `assert_unreachable!` / `assert_always!(amount <= current)` at the two unguarded | ||
| subtraction sites: `ledger.rs:~292` and `reader.rs:~524` | ||
| (`total-buffer-size-never-underflows`). | ||
| - `assert_sometimes!(writer_unblocked_after_full)` after `ensure_ready_for_write` | ||
| exits its wait loop; `assert_unreachable!` on repeated no-progress wakeups | ||
| (`writer-eventually-makes-progress`). | ||
| - `assert_unreachable!` at the monotonicity panic `reader.rs:~482` | ||
| (`record-id-monotonicity-holds`). | ||
| - `assert_always_or_unreachable!` at the record-emission point `reader.rs:~1131` | ||
| (`no-corrupted-record-delivered`) and `assert_sometimes!` in the | ||
| `is_bad_read` branch `reader.rs:~1035` (`corruption-is-detected-and-recovered`). | ||
| - `assert_sometimes!(torn_tail_recovered)` in the `validate_last_write` | ||
| recovery branches (`partial-write-at-rotation-recovers`). | ||
| - `assert_always!(unflushed_bytes == 0)` inside `close()` | ||
| (`graceful-shutdown-flushes-all`). | ||
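For the `total-buffer-size-never-underflows` site, the invariant `amount <= current` can be made observable at the point of subtraction. The sketch below uses only the standard library so it compiles on its own; `checked_decrement` is a hypothetical helper, and the `Err` branch marks where an SDK assertion would be placed. The SDK macro invocation itself is deliberately not reproduced here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Decrements `total` by `amount` only when that cannot underflow.
///
/// Returns `Ok(new_value)` on success. Returns `Err(current)` with the
/// observed value when `amount > current`, leaving the atomic unchanged; an
/// instrumented build would report the invariant violation in that branch.
fn checked_decrement(total: &AtomicU64, amount: u64) -> Result<u64, u64> {
    total
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            // `None` aborts the update, so the stored value never wraps.
            current.checked_sub(amount)
        })
        // `fetch_update` yields the previous value; the subtraction is safe
        // because `checked_sub` succeeded for exactly that value.
        .map(|previous| previous - amount)
}
```

Whether the production code should refuse the decrement, saturate, or only assert is a design decision for the maintainers; the sketch only shows that the check and the update can happen in one atomic step.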
|
|
||
| These assertions are no-ops outside Antithesis, so the instrumented build is safe | ||
| to run normally. | ||
|
|
||
| ## Custom faults required | ||
|
|
||
| - **Config reload** (`config-reload-no-silent-loss`): a custom fault that sends | ||
| `SIGHUP` to the Vector process (or swaps the config file and triggers reload), | ||
| fired under sustained load. | ||
| - **Downstream sink error** (`sink-failure-not-silently-acked`): the workload's | ||
| collector returns 5xx for a window, or a custom fault toggles it. | ||
|
|
||
| ## SDKs | ||
|
|
||
| - **Workload:** Antithesis Rust SDK (or Go SDK) — required to emit assertions and | ||
| `setup_complete`, and to draw random numbers for the producer. | ||
| - **SUT:** Antithesis Rust SDK only for the instrumented build. | ||
|
|
||
| ## Simplicity note | ||
|
|
||
| Two containers, one network link, no external dependency services. Every | ||
| container is justified: the SUT runs the buffer; the workload produces/observes | ||
| and asserts. We deliberately exclude S3/Kafka/etc. — the disk buffer has no such | ||
| dependency. The only non-obvious requirement is the **persistent volume for the | ||
| buffer data_dir**, which is essential for crash-durability testing to be | ||
| meaningful. | ||
|
|
||
| ## Open Questions | ||
|
|
||
| - How does the target Antithesis tenant's node-termination fault interact with | ||
| container filesystem persistence? (Determines whether the buffer survives a | ||
| modeled crash — essential.) | ||
| - Are node-termination and clock faults enabled in the tenant? (Categories 2–6 | ||
| need kill/restart.) | ||
| - Which e2e-ack-capable source is easiest to drive from the workload — | ||
| `http_server`, `datadog_agent`, or `socket`? (Affects workload protocol.) | ||
|
Comment on lines +170 to +171
This open question groups |
||
| - Is config reload feasible as a custom fault (SIGHUP) in the harness, or must the | ||
| workload drive it via Vector's API? | ||
| </content> | ||
This topology puts `acknowledgements: true` on the source but leaves the HTTP sink config without acknowledgements. The current config model documents source-level acknowledgements as deprecated in favor of global/sink-level settings, and `runs.md` in this same scratchbook records that source-only acks were observed at acceptance/buffering rather than e2e delivery. If someone follows this setup, the durability and `every-written-event-eventually-delivered` properties can measure source acceptance instead of downstream delivery. Please move/add the ack setting to the sink/global config used by this topology.