ds4-server: keep SSE connection alive during long prefill #194
Open
Allen091080 wants to merge 1 commit into
Long prefills (large prompts, no cache hit) can take minutes on local hardware. ds4-server was silent on the socket the whole time, so HTTP and TCP idle timeouts on the client side would close the connection before the first response byte was written -- see the "sse headers failed" log line that appeared at the very end of a multi-minute prefill in real agent runs.

Stream the SSE response headers from the prefill_chunk progress callback, then emit a ":" comment line (ignored by SSE clients per the spec) at most every five seconds while prefill is still running. The keepalive is best-effort: a closed socket simply fails the writes and the outer code discovers the dead connection when it tries to stream a real event, matching the existing error path.

The tool-checkpoint rebuild path pre-arms headers_sent because it only runs after the response stream is already in flight, so it never tries to re-send the SSE header line.

Verified on macOS Metal, q2-imatrix GGUF, ctx=200000:
- ./ds4_test --server passes
- 35s fresh prefill: client receives ": prefill" lines at +6/+12/+18/+25/+31s, then SSE content events at +35s, no client disconnect
- 1s cached prompt: unchanged (sse_headers is still emitted from the request handler when prefill never fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
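The keepalive logic described in the commit message can be sketched roughly as follows. This is not the code from the PR: the `sse_keepalive` struct, the `write_fn` callback type, the function names, the exact header bytes, and the in-memory `capture` writer are all hypothetical and exist only to illustrate the idea. The sketch assumes the caller passes a monotonic timestamp in seconds each time the prefill progress callback fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical writer callback: returns false when the socket write fails. */
typedef bool (*write_fn)(void *ctx, const char *buf, size_t len);

/* Hypothetical per-request keepalive state (not the PR's actual struct). */
typedef struct {
    bool headers_sent;  /* true once the SSE headers have been attempted */
    double last_write;  /* time in seconds of the last header/keepalive write */
    double interval;    /* minimum seconds between keepalive comments */
    write_fn write;
    void *write_ctx;
} sse_keepalive;

/* Illustrative header bytes; the real server may send more fields. */
static const char SSE_HEADERS[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/event-stream\r\n"
    "\r\n";

/* A line starting with ':' is an SSE comment and carries no event. */
static const char SSE_COMMENT[] = ": prefill\n";

/* Called from the prefill progress callback. Writes are best-effort:
 * failures are ignored here and surface later on a real event write. */
static void sse_keepalive_tick(sse_keepalive *ka, double now) {
    assert(ka != NULL && ka->write != NULL);
    if (!ka->headers_sent) {
        ka->headers_sent = true;  /* set first so headers go out at most once */
        ka->last_write = now;
        (void)ka->write(ka->write_ctx, SSE_HEADERS, sizeof(SSE_HEADERS) - 1);
        return;
    }
    if (now - ka->last_write >= ka->interval) {
        ka->last_write = now;
        (void)ka->write(ka->write_ctx, SSE_COMMENT, sizeof(SSE_COMMENT) - 1);
    }
}

/* Idempotent header send for the request handler: a no-op when the
 * progress callback already sent the headers during prefill. */
static bool sse_ensure_headers(sse_keepalive *ka, double now) {
    if (ka->headers_sent) return true;
    ka->headers_sent = true;
    ka->last_write = now;
    return ka->write(ka->write_ctx, SSE_HEADERS, sizeof(SSE_HEADERS) - 1);
}

/* In-memory writer used only to exercise the sketch. */
typedef struct { char data[1024]; size_t len; int calls; } capture;

static bool capture_write(void *ctx, const char *buf, size_t len) {
    capture *c = ctx;
    if (c->len + len >= sizeof(c->data)) return false;
    memcpy(c->data + c->len, buf, len);
    c->len += len;
    c->data[c->len] = '\0';
    c->calls++;
    return true;
}
```

In this sketch a path that runs after the stream has started (such as the tool-checkpoint rebuild mentioned above) would initialize `headers_sent` to `true`, so ticks only ever produce comment lines. A fully cached request never ticks, so `sse_ensure_headers` sends the headers from the request handler as before.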
Summary
Long prefills (large prompts with no cache hit) can take minutes on local hardware. Today ds4-server is silent on the socket for that entire window, so HTTP/TCP idle timeouts on the client side close the connection before the first response byte is written. The existing log line `"sse headers failed"` is what fires at the end of a multi-minute prefill in real agent runs: the headers fail to write because the client has already given up.

This PR makes the existing `prefill_chunk` progress callback also act as an SSE keepalive source:

- The SSE response headers (`Content-Type: text/event-stream`) are sent within the first chunk.
- After that, the callback emits a `":"` SSE comment line at most every 5 seconds. Per the SSE spec, conforming clients must ignore these lines, so they keep the TCP/HTTP path alive without injecting fake events into the stream.
- `sse_headers()` in the request handler becomes idempotent: it only fires when prefill never ran (e.g. fully cached prompt), so non-streaming and short paths are unchanged.
- The tool-checkpoint rebuild path pre-arms `headers_sent = true` because by the time that rebuild runs, the response stream is already in flight, so we only want the keepalive ticks and never a second header line.

Writes are best-effort: if the client has already gone away, `send_all` returns false and the outer code discovers the closed socket the next time it tries to stream a real event, matching the existing error path.

Reproduction of the original bug
5 minutes of complete socket silence during prefill, then `sse headers failed` because the client RST'd the connection long ago. This is on `--ctx 200000`, Anthropic-API endpoint, agent client (Claude Code style) with a ~80k system+tools+CLAUDE.md prompt, i.e. a normal first turn against a local model.

Verification
Machine: MacBook Pro M5 Max, 128 GiB RAM
Backend: Metal
Model: `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` (q2-imatrix)
Server: `./ds4-server --host 0.0.0.0 --port 8000 --ctx 200000 --kv-disk-dir … --kv-disk-space-mb 204800`

- `make clean && make`: clean build, no new warnings
- `./ds4_test --server`: `server: OK` / `ds4 tests: ok`
- Fresh prefill: `prompt done 33.594s`, so the keepalive cadence really is real-time over the long prefill; the client just sees pipe-buffered output collapse without it.
- Cached prompt: `0.76s` total round trip, unchanged.

Test plan
- `./ds4_test --server`
- `: prefill` comment lines appear during prefill
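One way to check from the client side that the `: prefill` lines are harmless is to run a captured response body through a minimal SSE line processor and confirm that comment lines never produce events. The function below is a hypothetical helper, not part of ds4-server or its test suite. It is a simplified sketch that assumes `\n` or `\r\n` line endings and only counts dispatched events without retaining their payloads.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Count the events an SSE client would dispatch from a response body.
 * Simplified sketch: handles "\n" and "\r\n" line endings, ignores comment
 * lines (leading ':'), and dispatches on a blank line only when at least
 * one "data" field was seen since the previous dispatch. */
static int sse_count_events(const char *stream) {
    int events = 0;
    bool have_data = false;
    const char *p = stream;

    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        if (eol == NULL) break;              /* incomplete trailing line */

        size_t len = (size_t)(eol - p);
        if (len > 0 && p[len - 1] == '\r') len--;   /* tolerate CRLF */

        if (len == 0) {
            /* Blank line: dispatch if any data was buffered. */
            if (have_data) events++;
            have_data = false;
        } else if (p[0] == ':') {
            /* Comment line, e.g. ": prefill". Ignored. */
        } else if (len >= 4 && strncmp(p, "data", 4) == 0 &&
                   (len == 4 || p[4] == ':')) {
            have_data = true;                /* "data" or "data: ..." field */
        }
        /* Other fields (event, id, retry) do not affect the count. */

        p = eol + 1;
    }
    return events;
}
```

Feeding it a body that contains only keepalive comments should yield zero events, while the same body followed by a real `data:` block and a blank line should yield one.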