fix(sdk): raise asyncio StreamReader buffer in Python AsyncHostTransport (#2760)
* fix(sdk): raise asyncio StreamReader buffer in Python AsyncHostTransport
The Python async transport spawned the host CLI without passing a `limit=`
to `asyncio.create_subprocess_exec`, so its stdout `StreamReader` inherited
asyncio's default 64 KiB buffer. Every host response is written as a single
newline-delimited JSON line, so any `cli.invoke` whose serialized result
exceeds 64 KiB (e.g. `superdoc_get_content` on larger documents) caused
`readline()` to raise `ValueError: Separator is not found, and chunk
exceed the limit` inside `_reader_loop`. The exception was caught by the
generic reader-loop handler and pending requests were rejected with the
misleading `HOST_DISCONNECTED` error — even though the host process was
still alive and healthy.
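The failure reproduces with plain asyncio and no SDK in the loop; a minimal
sketch (the child command is just a stand-in for the host CLI):

```python
import asyncio
import sys

async def main() -> None:
    # Child emits a single ~200 KB line; the parent's StreamReader keeps
    # asyncio's default 64 KiB limit because no limit= is passed.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "print('x' * 200_000)",
        stdout=asyncio.subprocess.PIPE,
    )
    try:
        await proc.stdout.readline()
    except ValueError as exc:
        print(exc)  # Separator is not found, and chunk exceed the limit
    finally:
        if proc.returncode is None:
            proc.kill()
        await proc.wait()

asyncio.run(main())
```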
Pass `limit=` to `create_subprocess_exec` and expose it as a new
`stdout_buffer_limit_bytes` constructor option on `AsyncHostTransport`,
threaded through `SuperDocAsyncRuntime` and `AsyncSuperDocClient`. The
default of 64 MiB safely covers the host's own 32 MiB
`DEFAULT_MAX_STDIN_BYTES` input cap with room for ~2x JSON expansion.
`SyncHostTransport` is unaffected — it uses raw blocking `subprocess.Popen`
which has no asyncio buffer limit.
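The fix, sketched: `stdout_buffer_limit_bytes` and the 64 MiB default are the
PR's names, while `_spawn_host` and the argv plumbing are illustrative.

```python
import asyncio

# 64 MiB: covers the host's 32 MiB stdin cap with ~2x JSON expansion headroom.
DEFAULT_STDOUT_BUFFER_LIMIT_BYTES = 64 * 1024 * 1024

async def _spawn_host(
    argv: list[str],
    stdout_buffer_limit_bytes: int = DEFAULT_STDOUT_BUFFER_LIMIT_BYTES,
) -> asyncio.subprocess.Process:
    # limit= sizes the StreamReaders attached to the child's pipes, so a
    # single newline-delimited JSON response up to this size survives
    # readline() instead of tripping asyncio's 64 KiB default.
    return await asyncio.create_subprocess_exec(
        *argv,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        limit=stdout_buffer_limit_bytes,
    )
```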
Adds a `TestAsyncLargeResponse` regression suite that:
1. Round-trips a 200 KB response through the default-configured transport.
2. Pins that an explicitly tightened `stdout_buffer_limit_bytes` still
reproduces the original failure mode, guaranteeing the option is
wired through to `create_subprocess_exec`.
* fix(sdk): tear down host process on async reader-loop failure
AsyncHostTransport._reader_loop caught reader exceptions by rejecting
pending futures and flipping state to DISCONNECTED, but never killed
self._process. Because dispose() early-returns on DISCONNECTED, any
reader-loop failure left an orphaned host subprocess running with no public API to reap it. This is a pre-existing bug, but the previous
commit made it easier to trip by exposing stdout_buffer_limit_bytes:
any caller who sets it below their real response size hits the orphan path.
Route both the buffer-overflow and generic-error branches through a
new _schedule_cleanup helper that fires _cleanup() as a separate task
(it can't be awaited inline — _cleanup cancels and awaits the reader
task itself). _cleanup kills the process, waits on it, rejects pending,
and only then transitions to DISCONNECTED, so a subsequent dispose()
is a safe no-op instead of leaking the host.
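The scheduling shape, as a minimal sketch (only `_schedule_cleanup`,
`_cleanup`, and the slot names come from the commit; the rest is
illustrative):

```python
import asyncio

class _TransportSketch:
    def __init__(self) -> None:
        self._process: asyncio.subprocess.Process | None = None
        self._reader_task: asyncio.Task | None = None
        self._cleanup_task: asyncio.Task | None = None

    def _schedule_cleanup(self) -> None:
        # _cleanup cancels and awaits the reader task, so the reader loop
        # can never await it inline; fire it as an independent task.
        if self._cleanup_task is None or self._cleanup_task.done():
            loop = asyncio.get_running_loop()
            self._cleanup_task = loop.create_task(self._cleanup())

    async def _cleanup(self) -> None:
        if self._reader_task is not None:
            if self._reader_task is not asyncio.current_task():
                self._reader_task.cancel()
            self._reader_task = None
        if self._process is not None:
            if self._process.returncode is None:
                self._process.kill()
            await self._process.wait()  # reap: returncode gets set
            self._process = None
        # reject pending futures, then transition to DISCONNECTED (elided)
```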
Also catch asyncio.LimitOverrunError / ValueError separately and
surface HOST_PROTOCOL_ERROR with a "raise stdout_buffer_limit_bytes"
hint plus the current limit in details. The previous HOST_DISCONNECTED
code pointed users at the wrong problem since the host was still alive.
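A sketch of that classification as of this commit (a later commit in this PR
narrows the catch to `readline()` only; `_dispatch` and the
`_reject_all_pending` signature are illustrative):

```python
    async def _reader_loop(self) -> None:  # continues _TransportSketch above
        try:
            while True:
                line = await self._process.stdout.readline()
                if not line:
                    break  # EOF: the host really did exit
                self._dispatch(line)
        except (asyncio.LimitOverrunError, ValueError):
            # Buffer overflow on a still-alive host: name the tunable and
            # include the current limit so the hint is actionable.
            self._reject_all_pending(
                code="HOST_PROTOCOL_ERROR",
                message="host response exceeded the stdout buffer; "
                        "raise stdout_buffer_limit_bytes",
                details={"stdout_buffer_limit_bytes":
                         self._stdout_buffer_limit_bytes},
            )
            self._schedule_cleanup()
        except Exception:
            self._reject_all_pending(code="HOST_DISCONNECTED",
                                     message="host reader loop failed")
            self._schedule_cleanup()
```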
Extends TestAsyncLargeResponse to assert HOST_PROTOCOL_ERROR, verify
the hint is in the message, confirm the subprocess is actually reaped
(returncode set, _process cleared), and that dispose() after an
overflow is a safe no-op.
* refactor(sdk): dedupe stdout_buffer_limit default and add wiring test
Address review follow-ups on the async transport buffer-limit option;
the resulting layout is sketched after the list.
- Hoist DEFAULT_STDOUT_BUFFER_LIMIT_BYTES (64 MiB) to module scope in
transport.py and reference it from AsyncHostTransport, the async
runtime, and AsyncSuperDocClient so the default lives in one place
instead of three copies of 64 * 1024 * 1024.
- Add a short "raise if a single host response can exceed this size"
comment on the client.py parameter so callers see the guidance at
the public API boundary, not buried in transport.py.
- Rename test_response_above_default_64kb_buffer to
test_response_above_asyncio_default_streamreader_limit. 64 KiB is
asyncio's default, not the SDK's (which is now 64 MiB), so the old
name read backwards after this PR.
- Add test_client_threads_stdout_buffer_limit_to_transport: builds
AsyncSuperDocClient with a custom limit and asserts the value
reaches AsyncHostTransport. Without this, a silent drop of the arg
in client.py or runtime.py would leave the existing overflow test
passing while the public API reverts to the asyncio 64 KiB default.
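Sketched (module boundaries shown as comments; `SuperDocAsyncRuntime`'s
constructor is assumed to accept the same keyword):

```python
# transport.py -- the default lives here, once
DEFAULT_STDOUT_BUFFER_LIMIT_BYTES = 64 * 1024 * 1024

class AsyncHostTransport:
    def __init__(self, *,
                 stdout_buffer_limit_bytes: int = DEFAULT_STDOUT_BUFFER_LIMIT_BYTES):
        self._stdout_buffer_limit_bytes = stdout_buffer_limit_bytes

# client.py -- references the same constant instead of a third copy
class AsyncSuperDocClient:
    def __init__(self, *,
                 stdout_buffer_limit_bytes: int = DEFAULT_STDOUT_BUFFER_LIMIT_BYTES):
        # Raise if a single host response can exceed this size.
        self._runtime = SuperDocAsyncRuntime(
            stdout_buffer_limit_bytes=stdout_buffer_limit_bytes)
```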
* fix(sdk): mark transport DISPOSING synchronously on reader teardown
Round-2 review follow-ups (the first two items are sketched after the list):
- _schedule_cleanup now flips state to DISPOSING before scheduling the
cleanup task. Previously, between the reader returning and the async
_cleanup running, _ensure_connected's CONNECTED fast path would still
accept invoke() calls; they then blocked on a future the dead reader
could never resolve until watchdog_timeout_ms (default 30s).
- Narrow the buffer-overflow catch to readline() only and drop
asyncio.LimitOverrunError from the tuple. readline() re-raises
LimitOverrunError as ValueError (it is not a ValueError subclass on
any supported CPython), so the previous broad except could
reclassify unrelated ValueErrors from dispatch as a buffer-limit
error with a misleading remediation hint. Comment corrected to match.
- Re-export DEFAULT_STDOUT_BUFFER_LIMIT_BYTES from superdoc/__init__.py
so consumers tuning the option don't import from the implementation
module.
- Tighten test_host_crash to assert HOST_DISCONNECTED specifically and
verify process teardown via the new _schedule_cleanup path.
- Strengthen the dispose-after-overflow assertion to actually verify
the no-op claim (state stays DISCONNECTED, _process stays None, a
second dispose is also safe). Replace the timing-sensitive
process.returncode read with await process.wait().
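The first two items, sketched (`TransportState` and
`_fail_pending_with_overflow_hint` are illustrative names; the states
themselves are the commit's):

```python
    def _schedule_cleanup(self) -> None:
        if self._cleanup_task is not None and not self._cleanup_task.done():
            return
        # Flip state *before* the cleanup task ever runs, so
        # _ensure_connected's CONNECTED fast path stops admitting invoke()
        # calls that the dead reader could never resolve.
        self._state = TransportState.DISPOSING
        self._cleanup_task = asyncio.get_running_loop().create_task(self._cleanup())

    async def _reader_loop(self) -> None:
        while True:
            try:
                line = await self._process.stdout.readline()
            except ValueError:
                # Narrowed: only readline() is guarded, and only for
                # ValueError -- readline() re-raises LimitOverrunError as
                # ValueError (LimitOverrunError is not a ValueError
                # subclass), so ValueErrors from dispatch below can no
                # longer be misclassified as buffer overflow.
                self._fail_pending_with_overflow_hint()
                self._schedule_cleanup()
                return
            if not line:
                break
            self._dispatch(line)
```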
* fix(sdk): serialize teardown across reader, _kill_and_reset, and dispose
Round-2 follow-up — addresses the residual race that the synchronous
DISPOSING flip didn't cover.
Before: `_kill_and_reset()` (called from `_send_request` on stdin write
failure or watchdog timeout) `await`ed `_cleanup` directly. If a
reader-triggered `_schedule_cleanup` was in flight, both ran
concurrently and raced on `_reject_all_pending`'s read-then-clear of
`self._pending` (futures added between snapshot and clear were leaked)
and on `process.kill()`/`reader_task.cancel()`. `dispose()` similarly
short-circuited on DISPOSING without waiting for the in-flight cleanup
to finish — the caller saw "disposed" before the host was fully torn
down.
Now (sketched after the list):
- `_kill_and_reset` and `dispose` both check the cleanup-task slot and
`await` an in-flight cleanup rather than starting a parallel one.
Single-flight teardown across all three entry points.
- `_cleanup` clears `self._cleanup_task` in `finally` when it owns the
slot, so introspection doesn't surface a stale done handle and the
next teardown gets a fresh slot.
- `dispose()` after a reader-triggered cleanup now blocks until that
cleanup finishes, restoring the "host fully torn down on return"
contract.
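The single-flight shape at this stage, sketched (`_teardown_host` is an
illustrative stand-in for the cancel/kill/reap/reject sequence; the next
commit wraps the awaited task in `asyncio.shield`):

```python
    async def dispose(self) -> None:
        self._stopping = True
        task = self._cleanup_task
        if task is not None:
            await task  # join the in-flight teardown; don't start a second
            return
        await self._cleanup()

    async def _cleanup(self) -> None:
        try:
            await self._teardown_host()  # cancel reader, kill + reap, reject pending
        finally:
            # Only the owning task clears the slot, so introspection never
            # surfaces a stale done handle and the next teardown starts fresh.
            if self._cleanup_task is asyncio.current_task():
                self._cleanup_task = None
```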
Tests:
- `test_schedule_cleanup_dedupe_guard_drops_reentrant_call` — second
`_schedule_cleanup` does not replace the in-flight task slot.
- `test_overflow_during_dispose_does_not_schedule_cleanup` — `_stopping`
suppression is honored.
- `test_kill_and_reset_awaits_in_flight_cleanup` — `_kill_and_reset`
observes the existing task instead of running a parallel `_cleanup`.
- `test_dispose_waits_for_in_flight_cleanup` — `dispose()` blocks until
reader-triggered cleanup completes before returning.
95 transport tests pass; 5 consecutive runs with PYTHONASYNCIODEBUG=1
show no flakes.
* fix(sdk): close residual races in async transport teardown
Two correctness regressions and three test gaps surfaced in the
final-pass review of the cleanup-task lifecycle.
**1. _ensure_connected race (HIGH).** The synchronous DISPOSING flip
in _schedule_cleanup did not gate _ensure_connected, so a concurrent
connect()/invoke() reaching _start_host during the DISPOSING window
would reassign self._process and self._reader_task. The pending
cleanup task then read those slots after its first await and killed
the freshly-spawned process. Fix: drain self._cleanup_task at the top
of _ensure_connected via asyncio.shield (so a cancelled caller doesn't
abort the in-flight cleanup).
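Sketched (the `asyncio.shield` placement is the fix; `_start_host` and the
state check are simplified):

```python
    async def _ensure_connected(self) -> None:
        task = self._cleanup_task
        if task is not None:
            # Drain any in-flight teardown before touching _process or
            # _reader_task; shield() keeps a cancelled caller from
            # aborting the cleanup itself.
            await asyncio.shield(task)
        if self._state is TransportState.CONNECTED:
            return
        await self._start_host()  # safe now: cleanup can't clobber fresh slots
```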
**2. Cancellation propagation race (HIGH).** _kill_and_reset and
dispose() awaited the cleanup task without asyncio.shield. When the
caller (e.g. an invoke task at the watchdog branch) was cancelled,
asyncio cancelled the awaited cleanup task too — _cleanup did not
catch CancelledError around process.wait(), so teardown stopped before
clearing _process / setting state. dispose() then saw DISPOSING with
_cleanup_task=None and returned without finishing teardown, leaking
the host process. Fix: wrap the awaited cleanup in asyncio.shield in
both call sites; restructure _cleanup so it captures handles and sets
state synchronously up-front, before any awaits, so observable state
is always consistent.
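Both fixes, sketched (the exact state machine is simplified; the synchronous
prologue before the first await is the point):

```python
    async def _kill_and_reset(self) -> None:
        task = self._cleanup_task
        if task is not None:
            # shield: cancelling this caller must not cancel the teardown.
            await asyncio.shield(task)
            return
        await self._cleanup()

    async def _cleanup(self) -> None:
        # Synchronous prologue, no awaits yet: capture the handles and
        # publish state so observers stay consistent even if this task is
        # cancelled at a later await.
        process, self._process = self._process, None
        reader, self._reader_task = self._reader_task, None
        self._state = TransportState.DISCONNECTED
        self._reject_all_pending()
        try:
            if reader is not None and reader is not asyncio.current_task():
                reader.cancel()
            if process is not None:
                if process.returncode is None:
                    process.kill()
                await process.wait()
        finally:
            if self._cleanup_task is asyncio.current_task():
                self._cleanup_task = None
```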
**3. Move _stopping guard into _schedule_cleanup.** The previous
test_overflow_during_dispose_does_not_schedule_cleanup was tautological
— it set _stopping=True and then re-checked the same condition in the
test body before calling _schedule_cleanup, so the call never ran and
the assertion passed trivially. Move the guard into _schedule_cleanup
itself (it's the correct authoritative location anyway), remove the
now-redundant call-site checks in _reader_loop, and rewrite the test
to call _schedule_cleanup unconditionally with _stopping=True. The
test now actually exercises the production guard.
**4. Multi-pending-invoke overflow test.** Codex round-2 gap that
remained open. Locks down that _reject_all_pending fails ALL pending
futures with HOST_PROTOCOL_ERROR plus the actionable hint, not just
the one whose response overflowed.
**5. Async reconnect-after-buffer-overflow test.** Sync transport
already had test_reconnect_after_failure; async only covered reconnect
after explicit dispose. Validates that reader-triggered cleanup leaves
the transport reusable for a fresh invoke without wedging
_cleanup_task / _connecting / _process.
Plus: replaced asyncio.sleep(0) with asyncio.Event-based
synchronization in lifecycle tests (Codex/Opus medium — sleep(0) is
implementation-defined under uvloop / Python scheduling changes); two
new tests directly cover the round-3 races
(test_ensure_connected_drains_in_flight_cleanup_before_spawn,
test_kill_and_reset_caller_cancellation_does_not_cancel_cleanup).
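The Event-based pattern, roughly (self-contained; names are illustrative):

```python
import asyncio

async def main() -> None:
    cleanup_ran = asyncio.Event()

    async def fake_cleanup() -> None:
        cleanup_ran.set()  # explicit signal from the code under test

    asyncio.get_running_loop().create_task(fake_cleanup())
    # Deterministic on any event loop: block on the signal instead of
    # hoping a single sleep(0) yield was enough for the task to run
    # (uvloop and newer CPython schedulers make that assumption fragile).
    await asyncio.wait_for(cleanup_ran.wait(), timeout=1.0)

asyncio.run(main())
```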
99 transport tests pass; 5 consecutive runs with PYTHONASYNCIODEBUG=1
show no flakes; new tests pass under -W error::ResourceWarning.
---------
Co-authored-by: Caio Pizzol <97641911+caio-pizzol@users.noreply.github.com>
Co-authored-by: Caio Pizzol <caio@harbourshare.com>