From 590ef9f83e8063febbdcce1e4fec3a7be7e4f719 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 31 May 2026 11:22:38 +0000 Subject: [PATCH] fix(jni): skip is_begin null-sentinel partial results in streaming completions Upstream b9437 added an `is_begin` SSE marker that emits a no-content opening event when `stream && !return_progress` so HTTP clients can receive headers before the first delta (server-context.cpp:2835). For that event, `server_task_result_cmpl_partial::to_json()` returns `nullptr` as a sentinel meaning "send headers only, no body" (server-task.cpp:1425). The JNI bridge in `Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson` called `result->to_json()` once and assigned `response["stop"] = ...`, which silently auto-promoted the null value to an object `{"stop": false}` via nlohmann::json's index-assignment behaviour. Every Java streaming caller then saw a phantom empty `LlamaOutput` at the start of each generation, causing five `LlamaModelTest` cases (testGenerateAnswer, testGenerateChat, testGenerateInfill, testIteratorTerminatesOnRepetitivePrompt, testSpeculativeDecoding) to overrun their `nPredict + 1` cap by one token. Fix: wrap the `rd->next()` call in a loop that continues past null-valued `to_json()` results so only real events surface to Java. Other `to_json()` sites in jllama.cpp were audited and are not affected: `dispatch_one_shot_task` (line 338) only handles SLOT_METRICS/SAVE/RESTORE/ ERASE which never carry `is_begin`, and the embedding path (line 1162) is non-streaming. Verified locally with 435/435 C++ ctest passing. Java tests cannot be exercised here (HF model download blocked in sandbox); CI will exercise them on PR. Updates docs/history/llama-cpp-breaking-changes.md to correct the prior claim that the upstream `is_begin` change was "compiled-from-upstream only, no project source changes required". --- docs/history/llama-cpp-breaking-changes.md | 4 +-- src/main/cpp/jllama.cpp | 30 ++++++++++++++-------- 2 files changed, 22 insertions(+), 12 deletions(-) diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md index 1a847bd2..7775e1f6 100644 --- a/docs/history/llama-cpp-breaking-changes.md +++ b/docs/history/llama-cpp-breaking-changes.md @@ -271,8 +271,8 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r | ~b9333–b9354 | upstream CI (`.github/workflows/`) | CANN and SYCL builds disabled to save Actions resources; macOS builds moved to `build-apple.yml`; cache keys prefixed with `cache-gha-`; `[no release]` commit message token skips release pipeline; no project changes required | | ~b9354–b9437 | `common/common.h` + `common/arg.h` + `common/arg.cpp` | `common_params_handle_models()` return type `void` → `bool` (caller can detect skip-download misses); new `common_params::skip_download`; `common_params::timeout_read` default raised 600 → 3600. Project does not call `common_params_handle_models()` directly — arg parsing happens upstream; the new defaults flow through transparently | | ~b9354–b9437 | `common/download.h` + `common/download.cpp` | `common_download_model()` parameter list trimmed: `download_mmproj`/`download_mtp` moved into `common_download_opts`; new `common_skip_download_exception`; new opt `skip_download` returns `-2` on missing/etag mismatch. Project does not include `download.h` directly, no source changes required | -| ~b9354–b9437 | `tools/server/server-task.h` + `server-task.cpp` | `task_params::stream` default `true` → `false`; new `server_task_result_cmpl_partial::is_begin` bool to let HTTP layer emit SSE headers before the first delta; `to_json()` may now return `nullptr` for the begin marker. Project always sets `stream` explicitly from Java (`LlamaIterator.java`, `LlamaModel.java`) so the default change is inert; the `is_begin` & nullable-`to_json` behaviour is contained inside compiled-from-upstream `server-context.cpp` & `server-task.cpp` | -| ~b9354–b9437 | `tools/server/server-context.cpp` + `server-queue.cpp` | `send_partial_response()` gained `is_begin` parameter (defaulted); SSE stream now emits a no-content opening event when `stream && !return_progress` so the client sees HTTP 200 + headers before first token. `server_response_reader::next()` 30s warn-on-cancel diagnostic message updated. Compiled-from-upstream only, no project source changes required | +| ~b9354–b9437 | `tools/server/server-task.h` + `server-task.cpp` | `task_params::stream` default `true` → `false`; new `server_task_result_cmpl_partial::is_begin` bool to let HTTP layer emit SSE headers before the first delta; `to_json()` returns `nullptr` for the begin marker (sentinel meaning "HTTP-headers-only, no body"). Project always sets `stream` explicitly from Java (`LlamaIterator.java`, `LlamaModel.java`) so the default change is inert. The `is_begin` / nullable-`to_json` contract DOES leak into the JNI bridge — see the row below for the required fix | +| ~b9354–b9437 | `tools/server/server-context.cpp` + `server-queue.cpp` | `send_partial_response()` gained `is_begin` parameter (defaulted); SSE stream now emits a no-content opening event when `stream && !return_progress` (`server-context.cpp:2835`) so the client sees HTTP 200 + headers before first token. `server_response_reader::next()` 30s warn-on-cancel diagnostic message updated. **Required project source change**: `Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson` in `src/main/cpp/jllama.cpp` called `result->to_json()` once and assigned `response["stop"]`, which silently auto-promoted the `nullptr` to an object `{"stop": false}` and surfaced a phantom empty `LlamaOutput` to every Java streaming caller (`LlamaModelTest.testGenerateAnswer` and four sibling tests overran by +1 token). Fixed by wrapping the `rd->next()` call in a loop that skips `response.is_null()` results so only real events reach Java | | ~b9354–b9437 | `common/arg.cpp` (env-var renames) | `LLAMA_LOG_*` → `LLAMA_ARG_LOG_*`, `LLAMA_OFFLINE` → `LLAMA_ARG_OFFLINE`, `LLAMA_LOG_FILE` → `LLAMA_ARG_LOG_FILE`, `LLAMA_CHAT_TEMPLATE_KWARGS` → `LLAMA_ARG_CHAT_TEMPLATE_KWARGS`. CLI verbosity values relabeled (4=trace, 5=debug). The `--license` CLI flag was REMOVED and moved to the new `llama-app licenses` subcommand. Project does not expose these env vars or the `--license` flag through the Java API, no changes required | | ~b9354–b9437 | `src/llama.cpp` | `llama_backend_init()` device-discovery rule tightened: iGPUs are now added only when no discrete GPUs were found (was: when no devices at all). RPC servers no longer count as "found" for this purpose, so iGPU + RPC setups keep the local iGPU. Behavioural only, single-line caller in `jllama.cpp` unchanged | | ~b9354–b9437 | `src/llama-chat.cpp` | New `LLM_CHAT_TEMPLATE_GRANITE_4_1` enum value + "granite-4.1" template name; `granite-4.0` detection now requires the literal token `g4_default_system_message` in the template, otherwise it routes to 4.1. Project does not implement chat-template detection directly — routing happens inside compiled-from-upstream code, no source changes required | diff --git a/src/main/cpp/jllama.cpp b/src/main/cpp/jllama.cpp index 5498b0b1..9e5b4c2b 100644 --- a/src/main/cpp/jllama.cpp +++ b/src/main/cpp/jllama.cpp @@ -812,18 +812,28 @@ JNIEXPORT jstring JNICALL Java_net_ladenthin_llama_LlamaModel_receiveCompletionJ rd = it->second.get(); } - server_task_result_ptr result = rd->next([] { return false; }); - - if (!result_ok_or_throw(env, result)) { - erase_reader(jctx, id_task); - return nullptr; - } + // Upstream b9437 added is_begin partial results whose to_json() returns + // a nullptr sentinel meaning "HTTP-headers-only, no body". Loop past + // those so the Java iterator only ever sees real events. + json response; + while (true) { + server_task_result_ptr result = rd->next([] { return false; }); + + if (!result_ok_or_throw(env, result)) { + erase_reader(jctx, id_task); + return nullptr; + } - json response = result->to_json(); - response["stop"] = result->is_stop(); + response = result->to_json(); + if (response.is_null()) { + continue; + } + response["stop"] = result->is_stop(); - if (result->is_stop()) { - erase_reader(jctx, id_task); + if (result->is_stop()) { + erase_reader(jctx, id_task); + } + break; } return json_to_jstring_impl(env, response);