ds4-server: keep SSE connection alive during long prefill #194

Open
Allen091080 wants to merge 1 commit into antirez:main from Allen091080:sse-keepalive-prefill
@Allen091080

Summary

Long prefills (large prompts with no cache hit) can take minutes on local hardware. Today ds4-server writes nothing to the socket for that entire window, so client-side HTTP/TCP idle timeouts can close the connection before the first response byte arrives. The existing log line "sse headers failed" is what fires at the end of a multi-minute prefill in real agent runs: the header write fails because the client has already given up.

This PR makes the existing prefill_chunk progress callback also act as an SSE keepalive source:

  1. The first time the callback fires, it writes the SSE response headers immediately so the client sees an HTTP 200 + Content-Type: text/event-stream within the first chunk.
  2. On subsequent callback fires it emits a ":" SSE comment line at most every 5 seconds. Per the SSE spec these lines are required to be ignored by conforming clients, so they keep the TCP/HTTP path alive without injecting fake events into the stream.
  3. The existing call to sse_headers() in the request handler becomes idempotent — it only fires when prefill never ran (e.g. fully cached prompt), so non-streaming and short paths are unchanged.
  4. The tool-checkpoint rebuild path pre-arms headers_sent = true because by the time that rebuild runs, the response stream is already in flight, so we only want the keepalive ticks and never a second header line.

Writes are best-effort: if the client has already gone away, send_all returns false and the outer code discovers the closed socket the next time it tries to stream a real event, matching the existing error path.
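
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: keepalive_state, keepalive_tick, write_fn and capture_write are hypothetical names, the header block is a generic SSE response, and the "\n\n" terminator on the comment line is an assumption. Only headers_sent, the ": prefill" comment text and the five-second interval come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define KEEPALIVE_INTERVAL_SEC 5

/* Hypothetical stand-in for a socket writer: returns false on failure. */
typedef bool (*write_fn)(void *ctx, const char *buf, size_t len);

typedef struct {
    bool headers_sent;   /* pre-armed to true when the stream is already open */
    time_t last_write;   /* time of the last header or keepalive write */
    write_fn write;
    void *write_ctx;
} keepalive_state;

/* Generic SSE response headers (assumed, not copied from ds4-server). */
static const char SSE_HEADERS[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/event-stream\r\n"
    "Cache-Control: no-cache\r\n"
    "\r\n";

/* A line starting with ':' is an SSE comment and is ignored by clients. */
static const char SSE_KEEPALIVE[] = ": prefill\n\n";

/* Called from the prefill progress callback on every chunk. */
void keepalive_tick(keepalive_state *st, time_t now) {
    assert(st != NULL && st->write != NULL);
    if (!st->headers_sent) {
        /* First fire: send the headers once, whether or not the write works. */
        st->headers_sent = true;
        st->last_write = now;
        (void)st->write(st->write_ctx, SSE_HEADERS, sizeof(SSE_HEADERS) - 1);
        return;
    }
    /* Later fires: at most one comment line per interval. */
    if (difftime(now, st->last_write) < KEEPALIVE_INTERVAL_SEC) return;
    st->last_write = now;
    /* Best-effort: a failed write is ignored here and surfaces later. */
    (void)st->write(st->write_ctx, SSE_KEEPALIVE, sizeof(SSE_KEEPALIVE) - 1);
}

/* In-memory writer used only to exercise the sketch without a socket. */
typedef struct { char data[1024]; size_t len; } capture_buf;

bool capture_write(void *ctx, const char *buf, size_t len) {
    capture_buf *cap = ctx;
    if (cap->len + len >= sizeof(cap->data)) return false;
    memcpy(cap->data + cap->len, buf, len);
    cap->len += len;
    cap->data[cap->len] = '\0';
    return true;
}
```

In this sketch the request handler would check headers_sent before writing its own headers, and the tool-checkpoint rebuild path would set headers_sent = true before prefill starts so that only comment lines are ever emitted from the callback.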

Reproduction of the original bug

0518 21:04:34 ds4-server: chat ctx=0..81722:81722 TOOLS prefill chunk 81722/81722 (100.0%) chunk=204.64 t/s avg=270.77 t/s 301.808s
0518 21:04:34 ds4-server: chat ctx=0..81722:81722 TOOLS prompt done 301.810s
0518 21:04:34 ds4-server: chat ctx=0..81722:81722 TOOLS sse headers failed

Five minutes of complete socket silence during prefill, then "sse headers failed" because the client had reset (RST) the connection long before. This was with --ctx 200000, the Anthropic-API endpoint, and an agent client (Claude Code style) sending a ~80k-token system+tools+CLAUDE.md prompt, i.e. a normal first turn against a local model.

Verification

Machine: MacBook Pro M5 Max, 128 GiB RAM
Backend: Metal
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf (q2-imatrix)
Server: ./ds4-server --host 0.0.0.0 --port 8000 --ctx 200000 --kv-disk-dir … --kv-disk-space-mb 204800

  • make clean && make — clean build, no new warnings
  • ./ds4_test --server reports server: OK / ds4 tests: ok
  • Fresh ~10.5k-token streaming request (cache miss), client-side timing:
    + 6.27s  : prefill
    +12.46s  : prefill
    +18.73s  : prefill
    +25.08s  : prefill
    +31.70s  : prefill
    +34.99s  data: {"id":"chatcmpl-3", … role: assistant}
    +35.04s  data: {"id":"chatcmpl-3", … content: "嗯"}
    +35.04s  data: [DONE]
    
    The corresponding server log shows prompt done 33.594s, so the keepalive lines reach the client in real time throughout the prefill; without the keepalive, the client would see no output until prefill finished.
  • Short 5-token prompt: 0.76s total round trip, unchanged.
  • Verified the no-prefill path (fully cached prompt) still writes headers from the request handler, not the callback (because the callback never fires).

Test plan

  • CI runs ./ds4_test --server
  • Manual streaming smoke test with a long unique prompt to confirm ": prefill" comment lines appear during prefill
  • Confirm short / cached prompts unaffected
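
For the manual smoke test, one way to check a captured response body is to count the comment lines that arrive before the first data event. The helper below is a hypothetical sketch, not part of ds4-server or its test suite: sse_classify and sse_comments_before_data are illustrative names, the input is assumed to be the response body with the HTTP headers already stripped, and only the "data:" prefix form of the data field is recognized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { SSE_BLANK, SSE_COMMENT, SSE_DATA, SSE_OTHER } sse_kind;

/* Classify one line (without its trailing '\n') of an SSE stream. */
sse_kind sse_classify(const char *line, size_t len) {
    assert(line != NULL);
    if (len > 0 && line[len - 1] == '\r') len--;   /* tolerate CRLF endings */
    if (len == 0) return SSE_BLANK;
    if (line[0] == ':') return SSE_COMMENT;        /* ignored by SSE clients */
    if (len >= 5 && memcmp(line, "data:", 5) == 0) return SSE_DATA;
    return SSE_OTHER;
}

/* Count comment lines seen before the first data line in a captured body. */
size_t sse_comments_before_data(const char *body) {
    assert(body != NULL);
    size_t count = 0;
    const char *p = body;
    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);
        sse_kind kind = sse_classify(p, len);
        if (kind == SSE_DATA) break;
        if (kind == SSE_COMMENT) count++;
        p += len;
        if (*p == '\n') p++;                       /* step over the newline */
    }
    return count;
}
```

A nonzero count on a long cache-miss prompt and a zero count on a short or fully cached prompt would be consistent with the behavior described in this PR.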

Long prefills (large prompts, no cache hit) can take minutes on local
hardware. ds4-server was silent on the socket the whole time, so HTTP
and TCP idle timeouts on the client side would close the connection
before the first response byte was written -- see the "sse headers
failed" log line that appeared at the very end of a multi-minute
prefill in real agent runs.

Stream the SSE response headers from the prefill_chunk progress
callback, then emit a ":" comment line (ignored by SSE clients per the
spec) at most every five seconds while prefill is still running. The
keepalive is best-effort: a closed socket simply fails the writes and
the outer code discovers the dead connection when it tries to stream
a real event, matching the existing error path. The tool-checkpoint
rebuild path pre-arms headers_sent because it only runs after the
response stream is already in flight, so it never tries to re-send
the SSE header line.

Verified on macOS Metal, q2-imatrix GGUF, ctx=200000:

- ./ds4_test --server passes
- 35s fresh prefill: client receives ": prefill" lines at +6/+12/
  +18/+25/+31s, then SSE content events at +35s, no client disconnect
- 1s cached prompt: unchanged (sse_headers is still emitted from the
  request handler when prefill never fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>