fix(storage-opendal): add TimeoutLayer inside RetryLayer to bound hangs#2455
Open
AlJohri wants to merge 1 commit into
Open
fix(storage-opendal): add TimeoutLayer inside RetryLayer to bound hangs#2455AlJohri wants to merge 1 commit into
AlJohri wants to merge 1 commit into
Conversation
The current `create_operator` applies `RetryLayer` with no `TimeoutLayer`. Per opendal's docs, `TimeoutLayer` must be inside `RetryLayer`; without it, a future parked indefinitely on a silently dropped TCP connection never produces an `Err` and `RetryLayer` cannot retry — the caller hangs forever. We hit this in production: an iceberg `try_next()` Range-GET parked for 24h until a Kubernetes Job's `activeDeadlineSeconds` killed the pod. Core-dump analysis confirmed two in-flight Range-GETs sitting in heap with no live TCP connection, no error propagated, no retries attempted. This adds `TimeoutLayer::new()` with opendal's defaults (60s for non-IO ops; 10s per IO chunk) inside `RetryLayer`. Each retry attempt is now independently bounded; a hung connection times out, RetryLayer sees the error, and either succeeds on a subsequent attempt or propagates a clean error to the caller. Refs: apache#1288 (auto-closed by stale bot without engagement), apache#788 (the PR that added the unbounded RetryLayer).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
What changes are included in this PR?
iceberg-storage-opendal::OpenDalStorage::create_operatorcurrently wraps the built operator withRetryLayer::new(). That layer retries when its inner future returnsErr, which is what was needed for the fast-failing transient errors that #788 set out to fix — and it works correctly for that case.It does not bound futures that park indefinitely without producing an
Err. The canonical example is an S3 Range-GET whose underlying TCP connection is silently dropped (NAT/conntrack eviction, route flap, server-side disconnect with no RST received). The response future staysPendingforever;RetryLayerhas nothing to retry against because no error is ever produced.This PR adds
TimeoutLayer::new()insideRetryLayerto close that gap. Per opendal's docs:opendal's defaults (60 s for non-IO ops like
stat/list/delete, 10 s per IO chunk forread/write) are used; each retry attempt is now independently bounded, hung connections surface as a timeout error whichRetryLayerthen retries with backoff, and unrecoverable hangs propagate a clean error to the caller in seconds rather than the inner future parking forever.The diff is two lines in
crates/storage/opendal/src/lib.rs(the import and the layer composition) plus an updated comment explaining the ordering invariant.How we hit this
In production: a Rust application using
iceberg-storage-opendal::OpenDalStorageFactory::S3to read iceberg tables on AWS hung for 24 hours when icebergtry_next()returned aPendingfuture whose underlying opendal Range-GET against S3 never completed. Core-dump analysis showed:/proc/<pid>/net/tcphad only the OTel collector socket).Condvar::wait_until_internal, main thread inRuntime::block_on.So the response future was permanently
Pendingafter the TCP connection silently died, with no error to propagate. TheRetryLayerwas in the chain but dormant because there was no error to react to. AddingTimeoutLayerwould have produced a timeoutErrwithin seconds,RetryLayerwould have retried with backoff, and the operation would have surfaced cleanly within ~90 s instead of hanging until the pod'sactiveDeadlineSecondskilled it 24 h later.Context on the original composition
RetryLayer::new()was added in #788 (Dec 2024) to bound transient"connection closed before message completed"errors. That PR's description explicitly noted that configurability could be a follow-up. It correctly addressed the fast-failing transient case; the silent-hang case wasn't in scope. This PR extends the layer composition to also cover that second class.A user filed #1288 in May 2025 asking for IO-operation timeout support; it received no maintainer engagement and was auto-closed by the stale bot. This PR closes that issue with the minimal change: add a per-attempt bound so
RetryLayerhas a timeout error to retry against. opendal's docs explicitly document the ordering rule that applies when both layers are used together, which this PR follows.Are these changes tested?
Upstream CI on this PR (already running)
The full project CI ran on this draft and the codebase-internal tests pass — including the S3 integration suite against MinIO that I couldn't run locally:
Tests (default)— fullcargo nextest✅ pass (7m55s)Tests (doc)✅ passcheck_standalone(every crate builds in isolation) ✅ passcheck,buildon Linux + macOS ✅ passbuild_with_no_default_featureson Linux + macOS + Windows ✅ passSo healthy-path reads/writes/stats/deletes against MinIO still work with the new layer composition — opendal's defaults (60 s non-IO, 10 s per IO chunk) are not tight enough to false-positive on normal test traffic.
Local validation that the fix actually fixes the bug
The existing test suite doesn't have a "hung-connection" harness, so it can't directly validate the new behaviour. I wrote a small standalone reproducer that does: it spawns a TCP tarpit (accepts connections, never replies — exactly mimicking a silently-dropped TCP session), points opendal at it, and measures time-to-error under three layer stacks.
Result:
RetryLayerhas nothing to retry against when the inner future never produces anErr. This is the gap the PR closes.io timeout reachederror naming the timeout duration. Math checks out: 2 sio_timeout× 3 retry attempts + default exponential backoff between attempts (≈ 1+2+4 s) ≈ 13–15 s.This is reproducible in ~5 minutes by anyone with a Rust toolchain. Happy to upstream this as an integration test in
crates/storage/opendal/tests/if maintainers want — the harness is ~70 lines and depends only onopendal+tokio+anyhow.Click to expand: full tarpit harness source
Cargo.toml:src/main.rs:Notes
TimeoutLayerapplies uniformly. For in-memory/fs backends the timeout is effectively never hit; the cost is negligible. For network backends it is the actual fix.iceberg::Storagethemselves and inject their preferred layer stack — but the default path should not hang silently, which is what this PR addresses. Happy to expand into a configurable form (e.g. accept aTimeoutLayeron theOpenDalStorageFactory::S3variant) in review if maintainers prefer.