[pull] master from mudler:master #1188
Merged
…ults (#9852)

* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge, 2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option delegates to upstream's `common_speculative_types_from_names()`, which already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed by MTP is auto-derived inside `common_context_params_to_llama` from `params.speculative.need_n_rs_seq()`, and when no `draft_model` is set the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration guide with the new type, both load paths (MTP head embedded in the main GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by `convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and, when present and the user has not configured a `spec_type` explicitly, auto-append the upstream-recommended speculative-decoding tuple:

- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current default with a "change to 0.0f" TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps.

Detection runs in two places:

- The model importer (`POST /models/import-uri`, the `/import-model` UI) range-fetches the GGUF header for HuggingFace / direct-URL imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and non-fatal error handling. OCI/Ollama URIs are skipped because the artifact is not directly streamable; the load-time hook covers them once the file is on disk.
- The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local header on every model start and appends the same options if `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set `spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo specs cover the append, preserve-user-choice, legacy alias, and nil safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was handing it the raw `huggingface://...` URI directly (and similarly for any other custom downloader scheme). Live-test against `huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf` exposed this: the probe failed with `unsupported protocol scheme "huggingface"`, was caught by the non-fatal error path, and the MTP options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and require the resolved form to be HTTP(S). After the fix the probe successfully reads `<arch>.nextn_predict_layers=1` from the real HF GGUF and the emitted ConfigFile carries spec_type:draft-mtp, spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
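The append-unless-overridden rule described in the commit message above can be sketched roughly as follows. This is not the actual `ApplyMTPDefaults` from the PR, whose signature and internals are not shown here. The names `applyMTPDefaults`, `hasSpecType`, and `mtpDefaults` are hypothetical, and the sketch assumes that options are plain `key:value` strings and that "present" means a `nextn_predict_layers` value greater than zero.

```go
package main

import "strings"

// mtpDefaults holds the tuple named in the commit message.
// The variable itself is hypothetical, introduced for illustration.
var mtpDefaults = []string{
	"spec_type:draft-mtp",
	"spec_n_max:6",
	"spec_p_min:0.75",
}

// hasSpecType reports whether the options already carry an explicit
// spec_type or its legacy alias speculative_type.
// Assumption: each option is a "key:value" string.
func hasSpecType(options []string) bool {
	for _, opt := range options {
		key, _, found := strings.Cut(opt, ":")
		if !found {
			continue
		}
		switch strings.TrimSpace(key) {
		case "spec_type", "speculative_type":
			return true
		}
	}
	return false
}

// applyMTPDefaults appends the MTP tuple only when the GGUF header
// reported an MTP head (assumed here: nextnPredictLayers > 0) and the
// user has not chosen a speculative type. A nil slice is safe to pass.
func applyMTPDefaults(options []string, nextnPredictLayers uint32) []string {
	if nextnPredictLayers == 0 || hasSpecType(options) {
		return options
	}
	// Copy first so the caller's backing array is never mutated.
	out := make([]string, 0, len(options)+len(mtpDefaults))
	out = append(out, options...)
	return append(out, mtpDefaults...)
}
```

Under these assumptions, calling `applyMTPDefaults(nil, 1)` yields the three default options, while a slice that already contains `speculative_type:ngram-mod` is returned unchanged, matching the "YAML overrides win" behaviour the commit describes.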
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

…752c` (#9857)

⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

…db1dd3b112e9051` (#9856)

⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

…587b75e73c4b2fed3426` (#9854)

⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )