Merged
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9495**
+Current llama.cpp pinned version: **b9543**

## Upgrading CUDA Version

2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
FetchContent_Declare(
llama.cpp
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9495
+GIT_TAG b9543
)
FetchContent_MakeAvailable(llama.cpp)

2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
**Build:**
![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)
-[![llama.cpp b9495](https://img.shields.io/badge/llama.cpp-%23b9495-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9495)
+[![llama.cpp b9543](https://img.shields.io/badge/llama.cpp-%23b9543-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9543)
[![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)
![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)
[![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)
9 changes: 9 additions & 0 deletions docs/history/llama-cpp-breaking-changes.md
@@ -303,3 +303,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| ~b9490–b9495 | `gguf-py/gguf/constants.py` + `gguf-py/gguf/tensor_mapping.py` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/clip.cpp` + new `tools/mtmd/models/gemma4uv.cpp` + new `tools/mtmd/models/gemma4ua.cpp` + `tools/mtmd/mtmd-audio.{h,cpp}` + `tools/mtmd/mtmd.cpp` + `conversion/__init__.py` + `conversion/gemma.py` | New Gemma4 Unified vision + audio variant (`Gemma4UnifiedForConditionalGeneration`). Adds new projector types `PROJECTOR_TYPE_GEMMA4UV` and `PROJECTOR_TYPE_GEMMA4UA` (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New `V_ENC_EMBD_PATCH_NORM` tensor enum (`v.patch_norm.{bid}`) and 3 indexed `patch_norm_{1,2,3}_{w,b}` weights on `clip_model` (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New `mtmd_audio_preprocessor_gemma4ua` mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream `mtmd-cli` / `mtmd-debug` binaries that the project does not link; the JNI build links `libllama` + `libcommon` only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required |
| ~b9490–b9495 | `tools/ui/` (`package.json`, `src/lib/components/app/content/MarkdownContent/`, new `MermaidPreview.svelte`, new `DialogMermaidPreview.svelte`, new constants / icons / rehype plugins) | Upstream `llama-server` web UI gains Mermaid diagram rendering: new `mermaid@^11.15` dependency, lazy-loaded; new rehype plugin chain (`rehype-mermaid-pre`, `rehype-enhance-mermaid-blocks`) converts ` ```mermaid ` code fences to `<pre class="mermaid">` and wraps them with copy / preview action buttons; the existing single-file `MarkdownContent.svelte` is split into a `.svelte` + sibling `.css` / `markdown-utils.ts` / `markdown-handlers.ts` so the new mermaid renderer can share helpers. Project does not compile or ship the upstream `tools/ui` (server-only feature, classpath-only JNI build); no impact |
| ~b9490–b9495 | upstream build / verification | Local build with `GIT_TAG b9495` was verified clean: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit; `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself |
| ~b9495–b9543 | `src/llama-hparams.{h,cpp}` + every `src/models/*.cpp` (~150 files) | Field `hparams::n_layer` (uint32_t) was split: the raw count moved to `hparams::n_layer_all` and `hparams::n_layer()` is now a member **function** that returns `n_layer_all - n_layer_nextn` (the effective non-MTP layer count). Sibling rename: `hparams::nextn_predict_layers` &#x2192; `hparams::n_layer_nextn`. Every per-model TU in `src/models/*.cpp` was updated to call `hparams.n_layer()` and `hparams.n_layer_nextn`. New `hparams::set_recr_pattern()` mirror of `set_swa_pattern()` for hybrid recurrent architectures. New per-layer `hparams::deepstack_mapping_arr` (LLAMA_MAX_LAYERS, default -1) populated from new GGUF key `LLM_KV_DEEPSTACK_MAPPING` for Granite4-Vision-style per-layer deepstack injection. `hparams::kv_only_nextn` was removed (MTP heads now use a layer filter callback instead). Project does not reference any of these hparams symbols directly &mdash; verified via `grep -rn "hparams\.n_layer\|nextn_predict_layers\|n_layer_nextn\|n_layer_all\|deepstack_mapping" src/main/cpp/ src/test/cpp/` returns zero matches. All consumers are inside upstream-compiled TUs (`llama-model.cpp`, `llama-context.cpp`, model TUs); no project source changes required |
| ~b9495–b9543 | `include/llama.h` (state-seq flags) + `tools/server/server-context.cpp` + `examples/speculative-simple/speculative-simple.cpp` | The `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` flag was removed from the `llama_state_seq_flags` enum. All upstream call sites that passed `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY \| LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` were updated to pass only `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY` &mdash; the on-device path is now the default for partial saves/loads. Project does not call `llama_state_seq_get_*` / `llama_state_seq_set_*` directly from `jllama.cpp`; the only consumer in the JNI build is upstream `server-context.cpp` (speculative checkpoint helpers), which was updated upstream. Verified via `grep -rn "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE" src/` returns zero matches. No project source changes required |
| ~b9495–b9543 | new `common/imatrix-loader.{h,cpp}` + refactor of `tools/imatrix/imatrix.cpp` + `tools/quantize/quantize.cpp` | Extracted shared imatrix-loading logic into a standalone library: new `common_imatrix` struct (`entries`, `datasets`, `chunk_count`, `chunk_size`, `is_legacy`, `has_metadata`) and `common_imatrix_load(const std::string &, common_imatrix &)` reader. New GGUF metadata keys exposed as `LLM_KV_IMATRIX_DATASETS`, `LLM_KV_IMATRIX_CHUNK_COUNT`, `LLM_KV_IMATRIX_CHUNK_SIZE`. The imatrix and quantize CLIs were rewritten to consume this shared loader (the legacy in-file binary parser also moved into the shared loader). Build system: `common/CMakeLists.txt` now includes `imatrix-loader.cpp` and `imatrix-loader.h` in `libcommon`, which means the JNI build picks up the new TU automatically via FetchContent + the existing `target_link_libraries(jllama PRIVATE common)` line. Project does not use imatrix loading from Java today (no `LlamaImatrix` class); the new symbols ship as additive surface area only. No project source changes required |
| ~b9495–b9543 | `tools/mtmd/clip.{h,cpp}` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/mtmd.{h,cpp}` + `tools/mtmd/mtmd-helper.{h,cpp}` + `tools/mtmd/mtmd-image.cpp` + every `tools/mtmd/models/*.cpp` | Large MTMD subsystem refactor: (1) `clip_image_u8` and `clip_image_f32` switched from public POD-style `nx` / `ny` / `buf` fields to private members with `get_size()` / `set_size()` / `get_ro_buf()` / `cpy_buf()` / `get_pixel()` / `set_pixel()` / `is_placeholder()` getters/setters; every model TU and image helper was updated to the new API. (2) Several public helpers were removed from `tools/mtmd/clip.h`: `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_image_u8_get_data`, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`. (3) `mtmd_helper_bitmap_init_from_file()` and `mtmd_helper_bitmap_init_from_buf()` gained a required `bool placeholder` parameter (when true the bitmap reserves shape only, no pixel decode &mdash; used for token counting). (4) `mtmd_bitmap` is now a true class (private buffer + `is_placeholder()` / `can_batch_with()`); `mtmd_bitmap_init()` and `mtmd_bitmap_init_from_audio()` accept `nullptr` data to create placeholder bitmaps. (5) New Granite4 Vision projector type `PROJECTOR_TYPE_GRANITE4_VISION` and tensor enums (`V_MULTI_PROJ_*`, `V_QF_*`) for QFormer-with-window projection. (6) Qwen-VL video / temporal-merge support: `clip_graph_qwen2vl::build_inp_with_temporal_merge()` plus `n_batch_max=2` for batch-merged consecutive image frames. Project does not link any `tools/mtmd/*` TUs into the JNI build (`libllama` + `libcommon` only); the JNI vision API surfaces through `mtmd-helper.h` and was reviewed: zero `clip_image_*` / removed-helper references found across `src/main/cpp/` and `src/test/cpp/`. No project source changes required |
| ~b9495–b9543 | `tools/server/server-context.cpp` + `tools/server/server-http.cpp` + `tools/server/server.cpp` (new `/v1/responses/input_tokens` + `/v1/chat/completions/input_tokens` + `/v1/messages/count_tokens`) | New token-counting endpoints (Anthropic-compatible + OpenAI Responses-API-compatible). Implementation: `server_routes::handle_count_tokens()` consolidates the body parsing path (chat completions, responses, anthropic messages) and emits `{"input_tokens": N, "object": "response.input_tokens"}`. `process_mtmd_prompt()` signature gained a `bool is_placeholder = false` parameter so token-counting can reuse the multimodal tokenization path without decoding image/audio pixels. Server-only HTTP endpoints (the JNI build links neither `tools/server/server.cpp` nor `server-http.cpp`); the only server TU we link is `server-context.cpp`, where the only project-visible change is the new optional `process_mtmd_prompt` parameter, which is defaulted &mdash; existing project call sites compile unchanged. No project source changes required |
| ~b9495–b9543 | `common/chat-peg-parser.{h,cpp}` + `common/chat.cpp` (LFM2/2.5 unified) | LFM2.5's chat-completion parser was merged into the single `common_chat_params_init_lfm2()` (was a separate `_lfm2_5` function); a `bool tool_list_tokens` flag toggles between the two template flavours. New helper `common_chat_peg_builder::python_or_json_value()` and a new `bool allow_json_literals` parameter on `python_style_tool_calls()` so LFM2.5 can accept JSON-cased `true` / `false` / `null` alongside the Python-cased literals. Pure-Python literal normalisation in `chat-peg-parser.cpp` (`True`/`False`/`None` &#x2192; JSON during streaming). Project does not call any `common_chat_peg_*` or `common_chat_params_init_lfm2*` symbols; routing happens inside upstream-compiled `chat.cpp`. No project source changes required |
| ~b9495–b9543 | `ggml/src/ggml-cuda/mmvq.cu` + `ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c` + `ggml/src/ggml-metal/ggml-metal-device.m` + `ggml/src/ggml-opencl/*` + `ggml/src/ggml-sycl/*` + `ggml/src/ggml-vulkan/*` + `ggml/src/ggml-webgpu/*` + `ggml/src/ggml-cpu/kleidiai/kleidiai.cpp` | Per-backend numerical & performance work: (1) CUDA `mul_mat_vec_q_moe` switched to `GGML_CUDA_RESTRICT` aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (`vl128` / `vl256` / `vl512` / `vl1024` separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster `concat`/`cpy`/`get_rows` packed kernels for narrow tensors (`<32` cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; `should_reorder_tensor` gate widened from `ne[1]==1` to `ne[1]<=8`. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every `coopmat2_features.*` bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (`U32_DEQUANT_HELPERS`); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars `GGML_KLEIDIAI_CHUNK_MULTIPLIER` & `GGML_KLEIDIAI_SME` thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to `jllama.cpp`. No project source changes required |
| ~b9495–b9543 | `conversion/__init__.py` + `conversion/granite.py` + `conversion/gemma.py` + `convert_lora_to_gguf.py` + `gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py` | Python-side: new `Granite4VisionMmprojModel` (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (`hidden_size` falls back to `audio_embed_dim`; `model_patch_size` falls back to `patch_size * pooling_kernel_size`). `convert_lora_to_gguf.py` gained `--trust-remote-code`. New `LLM_KV_DEEPSTACK_MAPPING` writer (`add_deepstack_mapping`) and new clip-vision keys (`KEY_PROJ_SAMPLE_QUERY_SIDE`, `KEY_PROJ_SAMPLE_WINDOW_SIDE`, `KEY_PROJ_SPATIAL_OFFSETS`, `KEY_FEATURE_LAYERS`, `KEY_IMAGE_GRID_PINPOINTS`) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required |
| ~b9495–b9543 | upstream build / verification | Local build pending: the b9495 &#x2192; b9543 bump is expected to compile cleanly given the audit above (zero `grep` matches in `src/main/cpp/` for any of the renamed or removed symbols: `hparams.n_layer`, `nextn_predict_layers`, `n_layer_nextn`, `n_layer_all`, `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`, `clip_image_u8`/`clip_image_f32` field access, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_image_u8_get_data`, `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`, `mtmd_helper_bitmap_init_from_file`, `mtmd_helper_bitmap_init_from_buf`, `common_imatrix_load`). The only project-visible signature change &mdash; `process_mtmd_prompt()`'s new `bool is_placeholder` parameter &mdash; is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself |
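
The `hparams` row above describes a field turning into a member function. As a rough illustration of why that is source-breaking for any code that reads `hparams.n_layer` without parentheses, here is a minimal sketch. `hparams_sketch` is a hypothetical stand-in written for this note, not the upstream struct; only the member names `n_layer_all`, `n_layer_nextn`, and `n_layer()` and the subtraction itself are taken from the row, and the values 48 and 1 are made up.

```cpp
#include <cstdint>

// Hypothetical stand-in for the upstream hparams struct (assumption: only
// the three members named in the table row are modelled here).
struct hparams_sketch {
    uint32_t n_layer_all;    // raw layer count (the old `n_layer` field)
    uint32_t n_layer_nextn;  // MTP layers (the old `nextn_predict_layers`)

    // After the split, n_layer is a function: the effective non-MTP count.
    constexpr uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }
};

// Illustrative values only: 48 layers in total, 1 of them an MTP layer.
constexpr hparams_sketch hp{48, 1};
static_assert(hp.n_layer() == 47, "effective count excludes MTP layers");

// Old-style field access such as `uint32_t n = hp.n_layer;` no longer
// compiles against the new shape, which is what the grep audit guards against.
```

Under these assumptions, a consumer that needs the raw count would read `n_layer_all` instead, while one that wants the pre-split meaning of "layers to run" would call `n_layer()`.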