From 590ef9f83e8063febbdcce1e4fec3a7be7e4f719 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sun, 31 May 2026 11:22:38 +0000
Subject: [PATCH] fix(jni): skip is_begin null-sentinel partial results in
 streaming completions

Upstream b9437 added an `is_begin` SSE marker that emits a no-content opening
event when `stream && !return_progress` so HTTP clients can receive headers
before the first delta (server-context.cpp:2835). For that event,
`server_task_result_cmpl_partial::to_json()` returns `nullptr` as a sentinel
meaning "send headers only, no body" (server-task.cpp:1425).

The JNI bridge in `Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson`
called `result->to_json()` once and assigned `response["stop"] = ...`, which
silently auto-promoted the null value to an object `{"stop": false}` via
nlohmann::json's index-assignment behaviour. Every Java streaming caller
then saw a phantom empty `LlamaOutput` at the start of each generation,
causing five `LlamaModelTest` cases (testGenerateAnswer, testGenerateChat,
testGenerateInfill, testIteratorTerminatesOnRepetitivePrompt,
testSpeculativeDecoding) to overrun their `nPredict + 1` cap by one token.

Fix: wrap the `rd->next()` call in a loop that continues past null-valued
`to_json()` results so only real events surface to Java.

Other `to_json()` sites in jllama.cpp were audited and are not affected:
`dispatch_one_shot_task` (line 338) only handles SLOT_METRICS/SAVE/RESTORE/
ERASE which never carry `is_begin`, and the embedding path (line 1162) is
non-streaming.

Verified locally with 435/435 C++ ctest passing. Java tests cannot be
exercised here (HF model download blocked in sandbox); CI will exercise
them on PR.

Updates docs/history/llama-cpp-breaking-changes.md to correct the prior
claim that the upstream `is_begin` change was "compiled-from-upstream
only, no project source changes required".
---
 docs/history/llama-cpp-breaking-changes.md |  4 +--
 src/main/cpp/jllama.cpp                    | 30 ++++++++++++++--------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
index 1a847bd2..7775e1f6 100644
--- a/docs/history/llama-cpp-breaking-changes.md
+++ b/docs/history/llama-cpp-breaking-changes.md
@@ -271,8 +271,8 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | ~b9333–b9354 | upstream CI (`.github/workflows/`) | CANN and SYCL builds disabled to save Actions resources; macOS builds moved to `build-apple.yml`; cache keys prefixed with `cache-gha-`; `[no release]` commit message token skips release pipeline; no project changes required |
 | ~b9354–b9437 | `common/common.h` + `common/arg.h` + `common/arg.cpp` | `common_params_handle_models()` return type `void` &#x2192; `bool` (caller can detect skip-download misses); new `common_params::skip_download`; `common_params::timeout_read` default raised 600 &#x2192; 3600. Project does not call `common_params_handle_models()` directly &mdash; arg parsing happens upstream; the new defaults flow through transparently |
 | ~b9354–b9437 | `common/download.h` + `common/download.cpp` | `common_download_model()` parameter list trimmed: `download_mmproj`/`download_mtp` moved into `common_download_opts`; new `common_skip_download_exception`; new opt `skip_download` returns `-2` on missing/etag mismatch. Project does not include `download.h` directly, no source changes required |
-| ~b9354–b9437 | `tools/server/server-task.h` + `server-task.cpp` | `task_params::stream` default `true` &#x2192; `false`; new `server_task_result_cmpl_partial::is_begin` bool to let HTTP layer emit SSE headers before the first delta; `to_json()` may now return `nullptr` for the begin marker. Project always sets `stream` explicitly from Java (`LlamaIterator.java`, `LlamaModel.java`) so the default change is inert; the `is_begin` &amp; nullable-`to_json` behaviour is contained inside compiled-from-upstream `server-context.cpp` &amp; `server-task.cpp` |
-| ~b9354–b9437 | `tools/server/server-context.cpp` + `server-queue.cpp` | `send_partial_response()` gained `is_begin` parameter (defaulted); SSE stream now emits a no-content opening event when `stream &amp;&amp; !return_progress` so the client sees HTTP 200 + headers before first token. `server_response_reader::next()` 30s warn-on-cancel diagnostic message updated. Compiled-from-upstream only, no project source changes required |
+| ~b9354–b9437 | `tools/server/server-task.h` + `server-task.cpp` | `task_params::stream` default `true` &#x2192; `false`; new `server_task_result_cmpl_partial::is_begin` bool to let HTTP layer emit SSE headers before the first delta; `to_json()` returns `nullptr` for the begin marker (sentinel meaning "HTTP-headers-only, no body"). Project always sets `stream` explicitly from Java (`LlamaIterator.java`, `LlamaModel.java`) so the default change is inert. The `is_begin` / nullable-`to_json` contract DOES leak into the JNI bridge &mdash; see the row below for the required fix |
+| ~b9354–b9437 | `tools/server/server-context.cpp` + `server-queue.cpp` | `send_partial_response()` gained `is_begin` parameter (defaulted); SSE stream now emits a no-content opening event when `stream &amp;&amp; !return_progress` (`server-context.cpp:2835`) so the client sees HTTP 200 + headers before first token. `server_response_reader::next()` 30s warn-on-cancel diagnostic message updated. **Required project source change**: `Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson` in `src/main/cpp/jllama.cpp` called `result->to_json()` once and assigned `response["stop"]`, which silently auto-promoted the `nullptr` to an object `{"stop": false}` and surfaced a phantom empty `LlamaOutput` to every Java streaming caller (`LlamaModelTest.testGenerateAnswer` and four sibling tests overran by +1 token). Fixed by wrapping the `rd->next()` call in a loop that skips `response.is_null()` results so only real events reach Java |
 | ~b9354–b9437 | `common/arg.cpp` (env-var renames) | `LLAMA_LOG_*` &#x2192; `LLAMA_ARG_LOG_*`, `LLAMA_OFFLINE` &#x2192; `LLAMA_ARG_OFFLINE`, `LLAMA_LOG_FILE` &#x2192; `LLAMA_ARG_LOG_FILE`, `LLAMA_CHAT_TEMPLATE_KWARGS` &#x2192; `LLAMA_ARG_CHAT_TEMPLATE_KWARGS`. CLI verbosity values relabeled (4=trace, 5=debug). The `--license` CLI flag was REMOVED and moved to the new `llama-app licenses` subcommand. Project does not expose these env vars or the `--license` flag through the Java API, no changes required |
 | ~b9354–b9437 | `src/llama.cpp` | `llama_backend_init()` device-discovery rule tightened: iGPUs are now added only when no discrete GPUs were found (was: when no devices at all). RPC servers no longer count as "found" for this purpose, so iGPU + RPC setups keep the local iGPU. Behavioural only, single-line caller in `jllama.cpp` unchanged |
 | ~b9354–b9437 | `src/llama-chat.cpp` | New `LLM_CHAT_TEMPLATE_GRANITE_4_1` enum value + "granite-4.1" template name; `granite-4.0` detection now requires the literal token `g4_default_system_message` in the template, otherwise it routes to 4.1. Project does not implement chat-template detection directly &mdash; routing happens inside compiled-from-upstream code, no source changes required |
diff --git a/src/main/cpp/jllama.cpp b/src/main/cpp/jllama.cpp
index 5498b0b1..9e5b4c2b 100644
--- a/src/main/cpp/jllama.cpp
+++ b/src/main/cpp/jllama.cpp
@@ -812,18 +812,28 @@ JNIEXPORT jstring JNICALL Java_net_ladenthin_llama_LlamaModel_receiveCompletionJ
         rd = it->second.get();
     }
 
-    server_task_result_ptr result = rd->next([] { return false; });
-
-    if (!result_ok_or_throw(env, result)) {
-        erase_reader(jctx, id_task);
-        return nullptr;
-    }
+    // Upstream b9437 added is_begin partial results whose to_json() returns
+    // a nullptr sentinel meaning "HTTP-headers-only, no body". Loop past
+    // those so the Java iterator only ever sees real events.
+    json response;
+    while (true) {
+        server_task_result_ptr result = rd->next([] { return false; });
+
+        if (!result_ok_or_throw(env, result)) {
+            erase_reader(jctx, id_task);
+            return nullptr;
+        }
 
-    json response      = result->to_json();
-    response["stop"]   = result->is_stop();
+        response = result->to_json();
+        if (response.is_null()) {
+            continue;
+        }
+        response["stop"] = result->is_stop();
 
-    if (result->is_stop()) {
-        erase_reader(jctx, id_task);
+        if (result->is_stop()) {
+            erase_reader(jctx, id_task);
+        }
+        break;
     }
 
     return json_to_jstring_impl(env, response);