Add async cancellation safety and graceful shutdown protocol (#49)#85
Merged
elizabetheonoja-art merged 5 commits intoJun 26, 2026
Conversation
…-Protocol#49) Guarantees in-flight telemetry is drained and the watermark persisted before exit on SIGTERM/SIGINT/SIGHUP, preventing dropped events / double-settlement: - lifecycle/task_group: self-contained structured concurrency over tokio -- CancelToken (hierarchical: cancelling a parent cascades to children) and StructuredTaskGroup (owns a stage's tasks + token; shutdown() cancels then drains within a deadline, reporting tasks that overran). - lifecycle/shutdown: ShutdownProtocol drains phases in strict order (Acceptors -> Reassembly -> Parsing -> Evaluation -> Storage -> Watermark -> Blockchain -> Complete), each within a configurable deadline (default 30s, capped 300s). Persists a checkpoint after each phase; enforces the 1,000,000 in-flight ceiling (CRIT + abort); writes the completion marker file. A phase timeout is logged and shutdown proceeds, returning an error so the caller can exit non-zero for a supervised restart. Plus SIGTERM/SIGINT/SIGHUP listener. - storage/checkpoint: durable ShutdownCheckpoint (timestamp, per-source watermark, drained-stage bitset) with a binary codec and a file-backed store using File::sync_all (the WriteOptions::set_sync(true) equivalent). Kept dependency-free and CI-runnable: no RocksDB (the checkpoint is file-backed), no tokio_util TaskGroup (a non-existent type -- replaced with the self-contained primitives above), and no second binary. Tests drive the protocol directly (no real signals/devnet) covering the cancellation hierarchy, ordered draining, per-phase timeout, the in-flight ceiling, and durable checkpoint persistence.
…Graceful-Shutdown # Conflicts: # tests/mod.rs
Contributor
|
@real-venus CI failed |
Contributor
Author
Now I am working on it. |
Contributor
Author
|
Fixed. |
elizabetheonoja-art
approved these changes
Jun 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Async I/O Cancellation Safety and Graceful Shutdown Protocol for Telemetry Stream Lifecycle (#49)
Closes #49
What's added
lifecycle/task_group.rsCancelToken(hierarchical — cancelling a parent cascades to children, modelling the per-stage token tree) andStructuredTaskGroup(owns a stage's tasks + its token;shutdown()cancels then drains within a deadline, returning how many tasks overran).lifecycle/shutdown.rsShutdownProtocoldrains phases in strict order — Acceptors → Reassembly → Parsing → Evaluation → Storage → Watermark → Blockchain → Complete — each within a configurable deadline (default 30 s, capped 300 s). Persists a checkpoint after every phase; enforces the 1,000,000 in-flight ceiling (CRIT log + abort); writes the/tmp/utility_shutdown_completemarker. A phase timeout is logged and shutdown proceeds, returningErr(PhaseTimeout)somaincan exit non-zero for a supervised restart. Includes the SIGTERM/SIGINT/SIGHUP signal listener.storage/checkpoint.rsShutdownCheckpoint(timestamp, per-source watermark map, drained-stage bitset) with a binary codec, plus aCheckpointStorewritten viaFile::sync_all— the durability equivalent of RocksDB'sWriteOptions::set_sync(true). Enables resume-from-last-completed-phase on restart.