Fix Slic stream lifecycle bugs triggered by canceled writes#4765
Fix Slic stream lifecycle bugs triggered by canceled writes#4765pepone wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses two Slic transport lifecycle bugs that could leave an otherwise healthy multiplexed connection in a degraded state when an invocation is canceled at specific points (permit leak for unstarted streams; suppressed writes-closure notification when an endStream write is canceled before StreamLast is queued).
Changes:
- Release stream-count semaphore permits for local streams that are created but never started (unstarted streams) once they reach the terminal state.
- Roll back
_writesClosePendingwhen anendStreamwrite fails before theStreamLastframe is queued, allowingCloseWritesto still sendStreamWritesClosed. - Add Slic transport tests covering the above cancellation/cleanup scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/IceRpc.Tests/Transports/Slic/SlicTransportTests.cs | Adds regression tests for unstarted-stream permit release and canceled endStream write closure signaling. |
| src/IceRpc/Transports/Slic/Internal/SlicStream.cs | Rolls back _writesClosePending on failed endStream writes; releases permits for unstarted local streams once fully closed. |
| src/IceRpc/Transports/Slic/Internal/SlicConnection.cs | Adds ReleaseUnstartedStream to return the stream-count semaphore permit for unstarted local streams. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Fixes icerpc/icerpc-csharp-audit#25. Fixes icerpc/icerpc-csharp-audit#26. Two bugs left a Slic connection degraded when an invocation was canceled at the wrong time: - A stream created by CreateStreamAsync acquires a stream-count semaphore permit that was only released for started streams. A stream abandoned before its first write (e.g. invocation canceled while waiting on payload data or send credit) burned its permit forever; after MaxBidirectionalStreams or MaxUnidirectionalStreams such leaks, every new CreateStreamAsync on the connection blocked forever. The permit is now released when an unstarted local stream reaches its terminal state. - SlicStream.WriteStreamFrameAsync set _writesClosePending before writing the endStream frame and never rolled it back when that write failed before the StreamLast frame was queued (cancellation while blocked on send credit or the write semaphore). The later Complete(exception) on the stream output was then suppressed by the CloseWrites guard: neither StreamLast nor StreamWritesClosed was sent and the peer hung reading the payload until connection closure. _writesClosePending is now rolled back when the endStream write fails before the StreamLast frame was queued, so CloseWrites sends the StreamWritesClosed frame and the peer observes TruncatedData.
| // A stream that was created but never started is always local; only started streams are registered in _streams. | ||
| Debug.Assert(stream.IsStarted || !stream.IsRemote); | ||
|
|
There was a problem hiding this comment.
ReleaseStream already handles unstarted streams safely — _streams.Remove is guarded by IsStarted — so we can drop the assert and let the single caller invoke it unconditionally:
| // A stream that was created but never started is always local; only started streams are registered in _streams. | |
| Debug.Assert(stream.IsStarted || !stream.IsRemote); | |
| // Only a started stream has an Id and is registered in _streams. |
The unconditional call now also runs for an unstarted remote stream (the brief state when AddStream throws at line 1686, before it assigns the Id). That reaches the if (stream.IsRemote) decrement below, which is correct — the remote count is incremented at line 1659/1677 before AddStream — and effectively a no-op: AddStream only throws when _isClosed (line 1025), and once the connection is closed the stream counts are never read again (they only gate accepting new remote streams, line 1653). So the decrement has no observable effect; it just keeps the bookkeeping balanced.
| // The stream reads and writes are closed, it's time to release the stream to either allow creating or | ||
| // accepting a new stream. | ||
| if (IsStarted) | ||
| // accepting a new stream. We don't release an unstarted remote stream (which can briefly exist when | ||
| // AddStream fails on a closing connection) since it never acquired any resource. Releasing an unstarted | ||
| // local stream returns the stream-count semaphore permit acquired by CreateStreamAsync. | ||
| if (IsStarted || !IsRemote) | ||
| { | ||
| _connection.ReleaseStream(this); | ||
| } |
There was a problem hiding this comment.
And the call site collapses to an unconditional call (safe for the unstarted-remote case — see the ReleaseStream comment):
| // The stream reads and writes are closed, it's time to release the stream to either allow creating or | |
| // accepting a new stream. | |
| if (IsStarted) | |
| // accepting a new stream. We don't release an unstarted remote stream (which can briefly exist when | |
| // AddStream fails on a closing connection) since it never acquired any resource. Releasing an unstarted | |
| // local stream returns the stream-count semaphore permit acquired by CreateStreamAsync. | |
| if (IsStarted || !IsRemote) | |
| { | |
| _connection.ReleaseStream(this); | |
| } | |
| // The stream reads and writes are closed, it's time to release the stream to either allow creating or | |
| // accepting a new stream. | |
| _connection.ReleaseStream(this); |
Fixes icerpc/icerpc-csharp-audit#25.
Fixes icerpc/icerpc-csharp-audit#26.
Two bugs left a Slic connection degraded when an invocation was canceled at the wrong time:
Stream-count permit leak (audit Encoding not checked when resolving a well-known proxy from the locator cache #25): a stream created by
CreateStreamAsyncacquires a stream-count semaphore permit that was only released for started streams (TrySetState→ReleaseStream, gated onIsStarted). A stream abandoned before its first write — e.g. invocation canceled while waiting on payload data or send credit — burned its permit forever; afterMaxBidirectionalStreams/MaxUnidirectionalStreamssuch leaks, every newCreateStreamAsyncon an otherwise healthy connection blocked forever. The permit is now released when an unstarted local stream reaches its terminal state (newSlicConnection.ReleaseUnstartedStream)._writesClosePendingnever rolled back (audit idempotent and tuple return type #26):SlicStream.WriteStreamFrameAsyncset_writesClosePendingbefore the endStream write and never rolled it back when that write failed before theStreamLastframe was queued (cancellation while blocked on send credit or the write semaphore). The laterComplete(exception)on the stream output was then suppressed by theCloseWritesguard: neitherStreamLastnorStreamWritesClosedwas sent, and the peer hung reading the payload until connection closure._writesClosePendingis now rolled back when the endStream write fails before theStreamLastframe was queued — detected via_writesClosedTcs, whichWroteLastStreamFramecompletes exactly when the frame is queued — soCloseWritessends theStreamWritesClosedframe and the peer observesTruncatedData. The bundled_closeReadsOnWritesClosureis preserved as well since it is never cleared on this path.Tests
Completing_unstarted_stream_releases_the_stream_count_permit(bidirectional and unidirectional): with max streams = 1, completes an unstarted stream and verifies a new stream can still be created.Canceled_end_stream_write_does_not_lose_the_writes_closed_frame: cancels an endStream write blocked on send credit, completes the output with an exception, and verifies the peer observesTruncatedDatainstead of waiting for data forever.End_stream_write_canceled_after_frame_queued_preserves_graceful_peer_reads: pins the opposite case — when theStreamLastframe was already queued (write parked on the connection-level pipe pause), noStreamWritesClosedframe is sent after it and the peer reads all the data with a graceful end of stream. This guards against an over-eager rollback; it passes before and after the fix.Verified the first three fail without the fix (all time out on their 2 min fallback guards); the
WaitAsyncmargins follow the convention that CI's--blame-hang-timeout(60 s) detects and dumps a hang first.Full
IceRpc.Testssuite passes (991 passed / 0 failed),dotnet buildanddotnet format --verify-no-changesclean.