Open
678 commits
689a9a4
server-bench : add speed-bench for speculative decoding benchmarking …
ruixiang63 May 29, 2026
b22da25
ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760)
reeselevine May 29, 2026
151f3a9
ggml-webgpu: Check earlier for WebGPU required features (#23879)
reeselevine May 29, 2026
0821c5f
server: in SSE mode, send HTTP headers when slot starts (#23884)
ngxson May 29, 2026
1738129
llama : do not skip iGPU when only RPC devices are present (#23868)
rgerganov May 30, 2026
d4204b0
ci : clear cache instead of "no timestamp" keys + fix macos (#23895)
ggerganov May 30, 2026
3375285
ci : fix s390x release job (#23898)
ggerganov May 30, 2026
6e093b8
vulkan: add Flash Attention support for BFloat16 KV cache (#23420)
0cc4m May 30, 2026
d48a56e
ggml : add some lsx support (#23798)
MQ-mengqing May 30, 2026
4c4e91b
ci : update ios-xcode release job to macos-26 (#23906)
ggerganov May 30, 2026
e674b12
test: (test-llama-archs) log the config name first (#23885)
ngxson May 30, 2026
2d9b7c8
metal : restore im2col implementation for large kernels (#23901)
ggerganov May 30, 2026
8b0e0db
TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843)
JohannesGaessler May 30, 2026
d38d50e
ui: exclude generated build dirs from prettier and eslint so lint err…
ServeurpersoCom May 30, 2026
d6588da
opencl: support bf16 by converting to f16 (#23839)
lhez May 30, 2026
aa46bda
Support `-fa auto` in llama-bench (#23714)
gaugarg-nv May 30, 2026
d749821
webui: add custom CSS injection via config (#23904)
ServeurpersoCom May 30, 2026
22cadc1
llama: only use one iGPU device by default (#23897)
0cc4m May 31, 2026
e6123e2
docs : update ZenDNN docs for Q8 support (#23791)
truecoder34 May 31, 2026
3292da0
ui: fix ETag truncation with MSVC compiler (#23917)
EZForever May 31, 2026
d4c8e2c
vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756)
o7si May 31, 2026
399739d
ci : limit trigger paths for the CPU workflow (#23938)
ggerganov May 31, 2026
6f165c1
server : handle If-None-Match weak ETags (#23916)
EZForever May 31, 2026
af6528e
ci: remove redundant or duplicate jobs (#23927)
netrunnereve Jun 1, 2026
44e211c
sycl : Optimize Q3_K mul_mat by reorder (#23725)
arthw Jun 1, 2026
4162522
[SYCL] Add more types in GET_ROWS OP (#23710)
arthw Jun 1, 2026
a511424
[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812)
arthw Jun 1, 2026
e22b0de
ci : add missing Linux label to cpu-x64-high-perf runner (#23958)
ggerganov Jun 1, 2026
5254a79
common : support manually triggering the reasoning budget end sequenc…
aldehir Jun 1, 2026
f8c0a19
vulkan: Removed unused functions (#23175)
winstonma Jun 1, 2026
1962000
vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23…
TheBlueMatt Jun 1, 2026
48b88c3
model: Add EXAONE 4.5 implementations (#21733)
nuxlear Jun 1, 2026
02a5701
security : disable private disclosures (#23963)
ggerganov Jun 1, 2026
8e6fff8
TP: quantized KV cache support (#23792)
JohannesGaessler Jun 1, 2026
5aba536
vocab: add normalizer.lowercase support to WPM (#23899)
o7si Jun 1, 2026
bef69f1
vulkan: reduce host memory lock contention (#23376)
winstonma Jun 1, 2026
55ac090
vulkan: don't hold the device mutex while compiling pipelines (#23641)
jeffbolznv Jun 1, 2026
95b8b8e
metal: template GLU kernels to support f16/f32 (#23882)
shrivasshankar Jun 1, 2026
de6f727
llama: limit max outputs of `llama_context` (#23861)
am17an Jun 1, 2026
335abed
vendor : update cpp-httplib to 0.46.1 (#23980)
angt Jun 1, 2026
27d9ed8
opencl: add basic support for q5_0 and q5_1 (#23548)
shaofeiqi Jun 1, 2026
5aa3a64
nix : add nix-nodejs facilities to build Web UI (#23846)
choener Jun 1, 2026
5dcb711
speculative : fix n_outputs_max and remove draft-simple auto-enable (…
ggerganov Jun 1, 2026
b8275a8
revert to using global_invocation_id for cpy shader (#23955)
yomaytk Jun 1, 2026
210a657
opencl: fix compiler warnings for non-adreno path (#23922)
lhez Jun 2, 2026
1fd5f48
clean up unused variables warnings (#23975)
anavp-nvidia Jun 2, 2026
354ebac
server: real-time reasoning interruption via control endpoint (#23971)
ServeurpersoCom Jun 2, 2026
d178a11
hexagon: add gelu_quick (#24007)
tboinovski1 Jun 2, 2026
8f7f3bf
hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimiza…
max-krasnyansky Jun 2, 2026
4f3a4be
llama : deprecate `llama_set_warmup` (#24009)
ggerganov Jun 2, 2026
f7a0777
convert : support Step3.7-Flash (#23845)
forforever73 Jun 2, 2026
2365315
kv-cache : SWA checkpoints store only non-masked cells (#23981)
ggerganov Jun 2, 2026
f8e67fc
ui: Add Thinking mode toggle with reasoning effort levels + improveme…
allozaur Jun 2, 2026
69cea5b
ui: simplify network error handling (#23431)
socram8888 Jun 2, 2026
d5ab083
docs : update HOWTO-add-model.md (#23883)
Xarbirus Jun 2, 2026
a468b89
ci : reduce self-hosted server workflow jobs (#24012)
ggerganov Jun 2, 2026
60130d1
server: add SSE ping interval (#24013)
ngxson Jun 2, 2026
0b71540
common : fix state save in common_prompt_batch_decode (#23468)
danbev Jun 2, 2026
2187e00
StepFun 3.5 MTP (#23274)
pwilkin Jun 2, 2026
bfb4308
model : support granite multilingual embeddings R2 (ibm-granite/grani…
hansolosan Jun 2, 2026
4fb16ec
model: add Mellum architecture (#23966)
Xarbirus Jun 2, 2026
5c394fd
hexagon: profiler output fix and script updates (#24042)
max-krasnyansky Jun 2, 2026
63e66fd
opencl: use flat variants of q4_K and q6_K gemv for very large M (#24…
lhez Jun 2, 2026
e366626
arg : removed unecesary mmproj download when users pass --no-mmproj (…
ryan-mangeno Jun 3, 2026
4da6370
ci : disable ccache for msvc windows release jobs (#23911)
ggerganov Jun 3, 2026
d545a2a
update BoringSSL to 0.20260526.0 (#23794)
cabelo Jun 3, 2026
06938ac
tests : add support for qwen3 SSM archs (#24031)
ggerganov Jun 3, 2026
f8f0a47
cuda: reserve space for quantize kv-cache at startup (#23907)
am17an Jun 3, 2026
3571fa5
ggml-cpu: use runtime SVE width in FWHT (#24059)
chaxu01 Jun 3, 2026
9e58d4d
Avoid PDL race conditions by disabling __restrict__ when PDL is used …
aendk Jun 3, 2026
ee4cf70
ui: Mermaid Diagrams in chat + interactive preview (#24032)
allozaur Jun 3, 2026
a731805
mtmd, model: allow skip build_vit() (#24077)
ngxson Jun 3, 2026
c8d6a00
mtmd: enable non-causal vision for gemma 4 unified (#24082)
ngxson Jun 3, 2026
166fe29
qwen35: use post-norm hidden state for MTP (#24025)
am17an Jun 3, 2026
94a220c
mtmd: fix Gemma 4 unified FPE (#24088)
abetlen Jun 3, 2026
f478f1b
sycl : Improve SYCL doc (#23025)
malsbat Jun 4, 2026
3c7450c
ggml-cpu: extend RVV quantization vec dot to higher VLENs (#22754)
rehan-10xengineer Jun 4, 2026
e8c5489
ggml-webgpu: FlashAttention refactor + standardize quantization suppo…
reeselevine Jun 4, 2026
3d19986
metal : reduce rset heartbeat from 500ms -> 5ms (#24074)
ggerganov Jun 4, 2026
65ef50a
tests : refactor test-save-load-state to accept token input (#24073)
ggerganov Jun 4, 2026
6ddc943
readme : add status badges (#24104)
ggerganov Jun 4, 2026
e3ba22d
fix(mtmd): handle Gemma 4 audio projector embedding size (#24091)
abetlen Jun 4, 2026
7ac5a42
cmake: skip cvector-generator and export-lora when CPU backend is dis…
arichiardi Jun 4, 2026
0066404
server : add header to tools/server/server-http.h (#24089)
abawany Jun 4, 2026
4d74287
build : use umbrella Headers directory for XCFramework module map (#2…
gmarzjr Jun 4, 2026
4586479
webui: fix tool selector toggle/counter, key tools by stable identity…
ServeurpersoCom Jun 4, 2026
a121232
agents: refactor, include more guidelines (#24111)
ngxson Jun 4, 2026
6f3a9f3
server: avoid unnecessary checkpoint restore when new tokens are pres…
Abioy Jun 4, 2026
4c51309
ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (#22209)
sirohikartik Jun 4, 2026
e802356
convert: Fix Gemma 4 Unified conversion (#24118)
pcuenca Jun 4, 2026
0dbfa66
return filter to save memory (#24125)
forforever73 Jun 4, 2026
5269770
ui: added single line reasoning preview (#23601)
gugugiyu Jun 4, 2026
21444c8
ui: Fixed packages (#24119)
allozaur Jun 4, 2026
e7bcf1c
Move duplicated imatrix code into single common imatrix-loader.cpp (#…
bartowski1182 Jun 4, 2026
42b2d60
webui: [a11y] fix keyboard navigation issues in chat interface and si…
vignesh191 Jun 4, 2026
260862b
arg: fix double mtp downloads (#24128)
ngxson Jun 4, 2026
7c158fb
server : disable on-device spec checkpoints (#24108)
ggerganov Jun 4, 2026
7fe2ae4
sycl : port multi-column MMVQ from CUDA backend (#21845)
masonmilby Jun 5, 2026
46fa662
ci : build-msys job slimming [no ci] (#24157)
danbev Jun 5, 2026
2154a0f
CUDA: enroll mul_mat_vec_q_moe into pdl (#24087)
ORippler Jun 5, 2026
3ecfb15
kleidiai : dynamic chunck-based scheduling for hybrid execution (#23819)
chaxu01 Jun 5, 2026
7acb4e8
hparams : refactor `hparams.n_layer` (#24060)
ggerganov Jun 5, 2026
59917d3
minor : fix lint issues (#24165)
ggerganov Jun 5, 2026
ad1b88c
docs: Update quantization readme (#24133)
pcuenca Jun 5, 2026
cc7bef3
ui: add ignore-scripts=true to npmrc (#24149)
ngxson Jun 5, 2026
9c955c4
Fix link to available UI settings (#24169)
wariuccio Jun 5, 2026
2016bf2
ui: run npm install when package-lock.json is newer than node_modules…
ServeurpersoCom Jun 5, 2026
96fbe00
model : fix llama_model::n_gpu_layers() (#24188)
ggerganov Jun 5, 2026
86591c7
cli: fix model params not propagated (#23893)
therealkenc Jun 5, 2026
6effcec
TP: round up granularity to 128 (#24180)
JohannesGaessler Jun 5, 2026
64086f2
model, mtmd: Granite4 Vision (#23545)
gabe-l-hart Jun 5, 2026
c4a278d
model: fix build failed (#24193)
ngxson Jun 5, 2026
e82beaa
vulkan: add fwht support for Intel with shmem reduction (#23964)
0cc4m Jun 5, 2026
da87e9b
common/chat : unify and fix LFM2/LFM2.5 tool parser (#24178)
tdakhran Jun 5, 2026
308f61c
opencl: improve get_rows, cpy, concat and q6_k flat gemv (#24160)
lhez Jun 5, 2026
603300b
context : fix off-by-one comparisons to n_gpu_layers (#24208)
CISC Jun 6, 2026
5343f45
model : rename local n_layer_all variable (#24209)
CISC Jun 6, 2026
5a69c97
vulkan: check coopmat2 features before reporting support (#24186)
0cc4m Jun 6, 2026
f5c6ae1
mtmd, server: add "placeholder bitmap" for counting tokens , add */in…
ngxson Jun 6, 2026
588f0dc
completion : fix format specifier in LOG_INF (#24213)
angt Jun 6, 2026
6b80c74
completion : remove useless statics (#24226)
angt Jun 6, 2026
31e8249
mtmd: support "frame merge" for qwen-vl-based models (#21858)
ngxson Jun 6, 2026
98d5e8b
common/chat : fix LFM2/LFM2.5 reasoning round-trip and <think> leak (…
tdakhran Jun 6, 2026
3f7c79d
docker : bump cuda13 to 13.3.0 (#24228)
CISC Jun 7, 2026
f71af35
convert : fix Gemma4 with no audio encoder (#24242)
CISC Jun 7, 2026
465b1f0
arg: Skip mmproj download when user supplied mmproj (#24239)
konradmb Jun 7, 2026
8a091c4
spec : fix vocab compatibility check (#24256)
CISC Jun 7, 2026
04eb4c4
llama : add Gemma4 MTP (#23398)
am17an Jun 7, 2026
f0156d1
kv-cache: follow the source cache size when sharing cells (#24267)
ServeurpersoCom Jun 7, 2026
379ac66
kv-cache : avoid kv cells copies (#24277)
ggerganov Jun 7, 2026
8a963fc
convert : fix conversion for Mistral-Medium-3.5-128B (#24268)
dfriehs Jun 7, 2026
9e3b928
common : relax sampler name matching (#23744)
ddh0 Jun 7, 2026
d403f00
[SYCL] Update compute runtime version to 26.x in docker (#24070)
arthw Jun 8, 2026
daf6bc9
metal : fix im2col 1D case (audio models) (#24220)
ngxson Jun 8, 2026
19bba67
HIP: add gfx1152 and gfx1153 to RDNA3.5 (#24129)
harkgill-amd Jun 8, 2026
0f7fada
cuda: reset cuda context after reading memory size (#23935)
0cc4m Jun 8, 2026
c74759a
vulkan: Use cm2 decode_vector for mul_mat_id B matrix loads (#23991)
jeffbolznv Jun 8, 2026
715b86a
cli: fix spinner not show during prompt processing (#24283)
ngxson Jun 8, 2026
6a1de6f
ggml : bump version to 0.14.0 (ggml/1533)
ggerganov Jun 8, 2026
c2b1518
sync : ggml
ggerganov Jun 8, 2026
8f83d6c
mtmd : add video input support (#24269)
ngxson Jun 8, 2026
3ebe862
docker: install ffmpeg in the released image (#24302)
ngxson Jun 8, 2026
3b3da01
[ggml-webgpu] Implement 2D workgroups for scale, binary, and unary op…
nikhilJain17 Jun 8, 2026
1705d43
[ggml-webgpu] Handle buffer overlap / buffer aliasing for concat oper…
nikhilJain17 Jun 8, 2026
a66d505
graph: guard iswa kq_mask on its own buffer (#24294)
ServeurpersoCom Jun 8, 2026
42a0afd
server : do not parse when flushing http headers (#24281)
aldehir Jun 8, 2026
7d2b45b
mtp: support for gemma-4 E2B and E4B assistants (#24282)
max-krasnyansky Jun 8, 2026
1e1aca0
ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul fo…
yomaytk Jun 8, 2026
3ac3c20
ggml-webgpu: Add clang-format job (#24308)
reeselevine Jun 9, 2026
e3471b3
Remove case for GGML_TYPE_Q4_K in mvvq.cu (#23528)
ravel7524 Jun 9, 2026
fd3271e
ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#2…
devYRPauli Jun 9, 2026
f0152ef
models : fix plamo2 attention_key/value_length regression (#24317)
CISC Jun 9, 2026
961e9a3
server : do not clear slots without unified KV cache (#24190)
fiesh Jun 9, 2026
2602169
ggml : add GGML_OP_COL2IM_1D (#24206)
ServeurpersoCom Jun 9, 2026
efbacf8
ui: fix mobile chat form overflow and bust stale bundle cache (#24158)
ServeurpersoCom Jun 9, 2026
1e91256
server: log prompts to directory (#22031)
jacekpoplawski Jun 9, 2026
9682e35
mtmd: refactor video subproc handling (#24316)
ngxson Jun 9, 2026
ae735b1
ui: Fix excessive style recalculation on hover (#24243)
ntowle Jun 9, 2026
b4e3dc6
vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication …
0cc4m Jun 9, 2026
d6d0ce8
vulkan: reduce iq1 shared memory usage for mul_mm (#24287)
jeffbolznv Jun 9, 2026
49f3542
mtmd: build_vit batching (#24352)
sfallah Jun 9, 2026
4836095
ui: add opt-in run_javascript frontend tool (#24244)
ServeurpersoCom Jun 9, 2026
e25a32e
ci : fix windows release (#24369)
CISC Jun 9, 2026
d73cd07
graph: Fix granite speech model inference by applying embedding scale…
arnu515 Jun 9, 2026
76da245
webui: implement pinned conversations support (#21387)
remeh Jun 9, 2026
d2e22ed
speculative : fix "ngram-map-k4v" name in logging (#24253)
ddh0 Jun 10, 2026
039e20a
ci : bump komac version (#24396)
CISC Jun 10, 2026
fb83cc9
CUDA: Fix ssm_scan_f32 data-races (#24360)
ORippler Jun 10, 2026
d2462f8
chat: fix LFM2/LFM2.5 ignoring json_schema (#24377)
tdakhran Jun 10, 2026
e95dae1
Remove padding and multiple D2D copies for MTP (#24086)
gaugarg-nv Jun 10, 2026
ac4cdde
vendor : update LibreSSL to 4.3.2 (#24397)
angt Jun 10, 2026
db94854
server : skip checkpoints beyond pos_next (#24411)
aldehir Jun 11, 2026
68f3066
vocab : refactor normalizer flags into options struct, add strip_acce…
o7si Jun 11, 2026
1bfbdb1
vocab : adopt leading TemplateProcessing special token as BOS (#24428)
o7si Jun 11, 2026
18ef86e
server: skip unused log lines on router mode (#24463)
ngxson Jun 11, 2026
1af154a
vulkan: use medium matmul tile on Asahi Linux (#24306)
xingjianll Jun 11, 2026
fdc3db9
vulkan: add fast path for contiguous buffer transfers (#23973)
winstonma Jun 11, 2026
17e59d6
ggml : bump version to 0.15.0 (ggml/1539)
ggerganov Jun 11, 2026
263cc04
sync : ggml
ggerganov Jun 11, 2026
4c65955
vulkan: ifdef eMesaHoneykrisp (build fix) (#24479)
jeffbolznv Jun 11, 2026
1593d56
docker : support specifying the GCC version for CUDA (#24447)
wencan Jun 11, 2026
ba1df05
opencl: add q5_0/q5_1 gemm and gemv kernels for Adreno (#24319)
shaofeiqi Jun 12, 2026
099ea76
[SYCL] Fix CI build & release for SYCL backend (#24387)
arthw Jun 12, 2026
85f99dc
ggml: support concat for scalar types at cuda backend (#24011)
zihaomu Jun 12, 2026
88a3927
spec: add EAGLE3 speculative decoding support (#18039)
ruixiang63 Jun 12, 2026
6471e3c
UI/jpeg exif orientation (#24196)
ServeurpersoCom Jun 12, 2026
70b54e1
vendor : update cpp-httplib to 0.47.0 (#24395)
angt Jun 12, 2026
e08c226
ggml : bump version to 0.15.1 (ggml/1541)
ggerganov Jun 12, 2026
f532be8
sync : ggml
ggerganov Jun 12, 2026
02182fc
fit : avoid including llama-ext.h in fit.h (#24506)
ggerganov Jun 12, 2026
f7ca93d
ui: PWA support (#23871)
allozaur Jun 12, 2026
3e7bd4f
vulkan: add pipeline barriers for memcpy read operations (#23770)
0cc4m Jun 12, 2026
ebc1077
server : fix reasoning budget WebUI precedence over model.ini (#24517)
ggerganov Jun 12, 2026
cd50446
ci : unbreak release (#24544)
CISC Jun 12, 2026
f58bad4
ci : unbreak release harder (#24545)
CISC Jun 12, 2026
e37abd6
mtmd: add batching API (#24384)
ngxson Jun 12, 2026
c34b922
fix sycl links in release notes (#24527)
muhammad-salem Jun 13, 2026
d8a24cc
fit : wrap llama_device_memory_data (#24522)
ggerganov Jun 13, 2026
57fe1f0
server: clean up static assets handling (#24550)
ngxson Jun 13, 2026
597b667
ui: keep original file name and path (#24568)
ngxson Jun 13, 2026
1a7718b
vulkan: support non-contig unary/glu ops (#24215)
jeffbolznv Jun 13, 2026
341babc
jinja : fix split and replace with empty first arg (#24574)
CISC Jun 13, 2026
e8067a8
ui: build-time gzip compression (#24571)
ngxson Jun 13, 2026
f05cf46
jinja : fix negative step slice with start/stop values (#24580)
CISC Jun 13, 2026
4988f6e
Add arch support for cohere2-MoE (#24260)
michaelw9999 Jun 13, 2026
53bd47e
ui : fix llama-ui-embed crash when no asset dir is given (#24597)
aldehir Jun 13, 2026
c2ba3e4
add sycl to check-release (#24583)
CISC Jun 14, 2026
4672211
ci : use CUDA label for cuda backend (#24594)
CISC Jun 14, 2026
8ed274e
Add cohere2moe to llama-vocab for TINY_AYA (#24601)
bartowski1182 Jun 14, 2026
6e14286
cli : fix not copying preserved tokens (#24258)
michaelw9999 Jun 14, 2026
acd79d6
jinja : add count/d/e filter aliases (#24606)
CISC Jun 14, 2026
1fd6dfe
ui : fix ui clipping in mobile due to incorrect height setup (#24605)
amoshydra Jun 14, 2026
fd5869f
UI/mobile keyboard and pwa popup fixes (#24610)
ServeurpersoCom Jun 14, 2026
20c5266
docker: specify registry to simplify Podman builds (#24607)
Minoru Jun 14, 2026
8edaca9
docs : fix typos in CUDA-FEDORA.md and grammars/README.md (#24459)
m-atharkhan Jun 14, 2026
aedb2a5
chat: add dedicated Cohere2MoE (North Code) parser (#24615)
pwilkin Jun 14, 2026
5f04dc7
ui: Add HEIC/HEIF image support (#24137)
NickM-27 Jun 14, 2026
ef8268f
fix(ui): render thinking/reasoning block content as markdown (#24611)
franitel Jun 14, 2026
dd4623a
convert : fix lora base model arch retrieval (#24621)
CISC Jun 14, 2026
6e9007a
ggml-webgpu: improve i-quants mul_mat performance and speed up prefil…
yomaytk Jun 15, 2026
3686e9d
CUDA: only support F32/F16 for GGML_OP_REPEAT (#24533)
leonardHONG Jun 15, 2026
2a6c391
UI/svg block rendering (#24080)
ServeurpersoCom Jun 15, 2026
a6dff71
chat: fix whitespace problems once and for all (#24624)
pwilkin Jun 15, 2026
272088b
metal : add repeat bf16 (#24638)
ggerganov Jun 15, 2026
c035ff4
[SYCL]: Remove per-allocation Level Zero runtime checks (#23399)
sanmai Jun 15, 2026
987fbd8
[SYCL] add to support pool_1d, move pool_1d/2d code to pool.cpp/hpp (…
arthw Jun 15, 2026
8872ab5
sycl : enhance set_rows to support q1_0, mxfp4, nvfp4 (#24564)
arthw Jun 15, 2026
72be44f
sycl : fix reorder function; add fp32/fp16 in build script (#24578)
arthw Jun 15, 2026
d8a3f52
sycl: fix soft_max_f32 max reduction (#24451)
someoneinjd Jun 15, 2026
e3bb1ad
SYCL: use native subgroup size for K-quant DMMV (#21700)
PMZFX Jun 15, 2026
6eab471
wasm : fix fallback symbol collision (#24639)
abetlen Jun 15, 2026
9dbc662
vulkan: support more CONCAT types (#24579)
jeffbolznv Jun 15, 2026
e3cab40
mtmd : add post-decode callback (#24645)
ggerganov Jun 15, 2026
0ae3f45
chat: fix an "oldie but goodie" grammar generator bug that surfaced d…
pwilkin Jun 15, 2026
581e8ec
chat: harden peg-native tool call parsing (#24329)
ServeurpersoCom Jun 15, 2026
a1eb756
docs: Add instructions to install `llama.cpp` from conda-forge (#22219)
jjerphan Jun 15, 2026
38d5463
chat: include full unparsed prompt in debug (#24650)
pwilkin Jun 15, 2026
e36a602
mtmd: fix miscounting n_tokens (#24656)
ngxson Jun 15, 2026
7dad2f1
chat : fix LFM2 tool-call parsing double-escaping (#24667)
tdakhran Jun 15, 2026
ad39cca
vulkan: add col2im_1d op (#24425)
ServeurpersoCom Jun 16, 2026
4196b47
sycl : Make GGML_SYCL_F16=ON the default (#23996)
malsbat Jun 16, 2026
fdd1098
[SYCL] Support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND …
arthw Jun 16, 2026
ac79caa
sycl: support reordered Q4_K/Q5_K/Q6_K MoE MUL_MAT_ID (#24452)
newjordan Jun 16, 2026
e3a74b2
bench : add --offline (#24511)
angt Jun 16, 2026
635b65a
spec: add spec metrics mean acceptance length and acceptance rate per…
ruixiang63 Jun 16, 2026
d5fb104
vulkan: Support gated_delta_net with S_v=16 (#24581)
jeffbolznv Jun 16, 2026
32120c1
vulkan: prefer host-visible memory buffers on UMA devices (#22930)
winstonma Jun 16, 2026
a182490
spec: add backend sampling support for eagle3 (#24655)
ruixiang63 Jun 16, 2026
02810c7
Fix and restrict NVFP4 edge-cases in llama-graph (#24331)
ORippler Jun 16, 2026
c1304d7
ui: add source toggle to mermaid and svg blocks (#24652)
ServeurpersoCom Jun 16, 2026
17 changes: 17 additions & 0 deletions .devops/cann.Dockerfile
@@ -5,6 +5,9 @@
# Define the CANN base image for easier version updates later
ARG CHIP_TYPE=910b
ARG CANN_BASE_IMAGE=quay.io/ascend/cann:8.5.0-${CHIP_TYPE}-openeuler24.03-py3.11
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

# ==============================================================================
# BUILD STAGE
@@ -55,6 +58,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full && \
cp build/bin/* /app/full/ && \
cp *.py /app/full/ && \
cp -r conversion /app/full/ && \
cp -r gguf-py /app/full/ && \
cp -r requirements /app/full/ && \
cp requirements.txt /app/full/
@@ -67,6 +71,19 @@ RUN mkdir -p /app/full && \
# ==============================================================================
FROM ${CANN_BASE_IMAGE} AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

# -- Install runtime dependencies --
RUN yum install -y libgomp curl && \
yum clean all && \
23 changes: 20 additions & 3 deletions .devops/cpu.Dockerfile
@@ -1,6 +1,9 @@
ARG UBUNTU_VERSION=24.04
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

-FROM ubuntu:$UBUNTU_VERSION AS build
+FROM docker.io/ubuntu:$UBUNTU_VERSION AS build

ARG TARGETARCH

@@ -27,16 +30,30 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh

## Base image
-FROM ubuntu:$UBUNTU_VERSION AS base
+FROM docker.io/ubuntu:$UBUNTU_VERSION AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
-&& apt-get install -y libgomp1 curl \
+&& apt-get install -y libgomp1 curl ffmpeg \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
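The `BUILD_DATE`, `APP_VERSION`, and `APP_REVISION` arguments introduced in this file default to `N/A`, so the `org.opencontainers.image.*` labels only carry useful values when the caller passes them at build time. The following is a minimal sketch of how a caller might assemble those flags. The helper name `oci_build_args` and the sample values are hypothetical and are not taken from this PR or from the project's CI.

```shell
#!/bin/sh
# Hypothetical helper: prints the --build-arg flags for the three label
# inputs declared in the Dockerfile (BUILD_DATE, APP_VERSION, APP_REVISION).
oci_build_args() {
    build_date=$1      # e.g. an RFC 3339 timestamp
    app_version=$2     # e.g. a release tag
    app_revision=$3    # e.g. a commit hash
    echo "--build-arg BUILD_DATE=${build_date} --build-arg APP_VERSION=${app_version} --build-arg APP_REVISION=${app_revision}"
}

# Illustrative call with assumed values:
oci_build_args "2026-06-16T00:00:00Z" "b1234" "c1304d7"

# A build could then splice the flags in, for example:
#   docker build -f .devops/cpu.Dockerfile \
#       $(oci_build_args "$(date -u +%Y-%m-%dT%H:%M:%SZ)" b1234 "$(git rev-parse HEAD)") .
```

Because the `ARG` lines are repeated inside the `base` stage, the values are visible to the `LABEL` instruction there; arguments declared only before the first `FROM` would not be.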
30 changes: 25 additions & 5 deletions .devops/cuda.Dockerfile
@@ -1,20 +1,26 @@
ARG UBUNTU_VERSION=24.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=12.8.1
ARG GCC_VERSION=14
# Target the CUDA build image
-ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
+ARG BASE_CUDA_DEV_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}

-ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
+ARG BASE_CUDA_RUN_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

FROM ${BASE_CUDA_DEV_CONTAINER} AS build

ARG GCC_VERSION
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default

RUN apt-get update && \
-apt-get install -y gcc-14 g++-14 build-essential cmake python3 python3-pip git libssl-dev libgomp1
+apt-get install -y gcc-${GCC_VERSION} g++-${GCC_VERSION} build-essential cmake python3 python3-pip git libssl-dev libgomp1

-ENV CC=gcc-14 CXX=g++-14 CUDAHOSTCXX=g++-14
+ENV CC=gcc-${GCC_VERSION} CXX=g++-${GCC_VERSION} CUDAHOSTCXX=g++-${GCC_VERSION}

WORKDIR /app

@@ -32,6 +38,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
@@ -40,8 +47,21 @@ RUN mkdir -p /app/full \
## Base image
FROM ${BASE_CUDA_RUN_CONTAINER} AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
-&& apt-get install -y libgomp1 curl \
+&& apt-get install -y libgomp1 curl ffmpeg \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
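With `GCC_VERSION` promoted to a build argument (default `14`), the compiler package names and the `CC`/`CXX`/`CUDAHOSTCXX` variables are all derived from a single value. The sketch below mirrors that substitution in plain shell so the effect of an override is easy to see. The function name `gcc_toolchain_env` is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reproduces the ENV line of the build stage for a
# given GCC major version. Falls back to 14, matching ARG GCC_VERSION=14.
gcc_toolchain_env() {
    gcc_version=${1:-14}
    echo "CC=gcc-${gcc_version} CXX=g++-${gcc_version} CUDAHOSTCXX=g++-${gcc_version}"
}

gcc_toolchain_env        # default toolchain
gcc_toolchain_env 13     # what an override to 13 would expand to
```

An override would be passed as `--build-arg GCC_VERSION=13` on the `docker build` command line. Whether a given GCC version is installable in the chosen CUDA base image, or accepted by that CUDA release as a host compiler, is not something this diff establishes.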
60 changes: 47 additions & 13 deletions .devops/intel.Dockerfile
@@ -1,20 +1,31 @@
ARG ONEAPI_VERSION=2025.3.3-0-devel-ubuntu24.04
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

## Build Image

-FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS build
+FROM docker.io/intel/deep-learning-essentials:$ONEAPI_VERSION AS build

-ARG GGML_SYCL_F16=OFF
+ARG GGML_SYCL_F16=ON
ARG LEVEL_ZERO_VERSION=1.28.2
ARG LEVEL_ZERO_UBUNTU_VERSION=u24.04
RUN apt-get update && \
-apt-get install -y git libssl-dev
+apt-get install -y git libssl-dev wget ca-certificates && \
cd /tmp && \
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb && \
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb && \
apt-get -o Dpkg::Options::="--force-overwrite" install -y ./level-zero.deb ./level-zero-devel.deb && \
rm -f /tmp/level-zero.deb /tmp/level-zero-devel.deb

WORKDIR /app

COPY . .

RUN if [ "${GGML_SYCL_F16}" = "ON" ]; then \
echo "GGML_SYCL_F16 is set" \
-&& export OPT_SYCL_F16="-DGGML_SYCL_F16=ON"; \
+&& export OPT_SYCL_F16="-DGGML_SYCL_F16=ON" \
+&& export SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt"; \
fi && \
echo "Building with dynamic libs" && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DLLAMA_BUILD_TESTS=OFF ${OPT_SYCL_F16} && \
@@ -26,18 +37,42 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh

-FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS base
-
-ARG IGC_VERSION=v2.30.1
-ARG IGC_VERSION_FULL=2_2.30.1+20950
-ARG COMPUTE_RUNTIME_VERSION=26.09.37435.1
-ARG COMPUTE_RUNTIME_VERSION_FULL=26.09.37435.1-0
-ARG IGDGMM_VERSION=22.9.0
+FROM docker.io/intel/deep-learning-essentials:$ONEAPI_VERSION AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

#Following versions are for multiple GPUs, since 26.x has known issue:
# https://github.com/ggml-org/llama.cpp/issues/21747,
# https://github.com/intel/compute-runtime/issues/921.
#ARG IGC_VERSION=v2.20.5
#ARG IGC_VERSION_FULL=2_2.20.5+19972
#ARG COMPUTE_RUNTIME_VERSION=25.40.35563.10
#ARG COMPUTE_RUNTIME_VERSION_FULL=25.40.35563.10-0
#ARG IGDGMM_VERSION=22.8.2


ARG IGC_VERSION=v2.34.4
ARG IGC_VERSION_FULL=2_2.34.4+21428
ARG COMPUTE_RUNTIME_VERSION=26.18.38308.1
ARG COMPUTE_RUNTIME_VERSION_FULL=26.18.38308.1-0
ARG IGDGMM_VERSION=22.10.0
RUN mkdir /tmp/neo/ && cd /tmp/neo/ \
&& wget https://github.com/intel/intel-graphics-compiler/releases/download/$IGC_VERSION/intel-igc-core-${IGC_VERSION_FULL}_amd64.deb \
&& wget https://github.com/intel/intel-graphics-compiler/releases/download/$IGC_VERSION/intel-igc-opencl-${IGC_VERSION_FULL}_amd64.deb \
@@ -51,7 +86,7 @@ RUN mkdir /tmp/neo/ && cd /tmp/neo/ \
&& dpkg --install *.deb

RUN apt-get update \
-&& apt-get install -y libgomp1 curl \
+&& apt-get install -y libgomp1 curl ffmpeg \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
@@ -109,4 +144,3 @@ WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

ENTRYPOINT [ "/app/llama-server" ]
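The two `wget` lines added to the build stage assemble the Level Zero package URLs from `LEVEL_ZERO_VERSION` and `LEVEL_ZERO_UBUNTU_VERSION`, with `%2B` standing in for the `+` of the release asset name. The sketch below reproduces only that string assembly, using the defaults from the diff. The function name `level_zero_url` is hypothetical, and the sketch makes no network request.

```shell
#!/bin/sh
# Hypothetical helper: builds the download URL for one Level Zero .deb.
# Defaults mirror ARG LEVEL_ZERO_VERSION=1.28.2 and
# ARG LEVEL_ZERO_UBUNTU_VERSION=u24.04 from the Dockerfile.
level_zero_url() {
    pkg=$1                    # "level-zero" or "level-zero-devel"
    version=${2:-1.28.2}
    ubuntu=${3:-u24.04}
    # %2B is the percent-encoded "+" in "<version>+<ubuntu>".
    echo "https://github.com/oneapi-src/level-zero/releases/download/v${version}/${pkg}_${version}%2B${ubuntu}_amd64.deb"
}

level_zero_url level-zero
level_zero_url level-zero-devel
```

Keeping the version and the Ubuntu suffix as separate arguments means a version bump touches one `ARG` line rather than both URLs.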

21 changes: 19 additions & 2 deletions .devops/llama-cli-cann.Dockerfile
@@ -1,6 +1,9 @@
ARG ASCEND_VERSION=8.5.0-910b-openeuler22.03-py3.10
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

-FROM ascendai/cann:$ASCEND_VERSION AS build
+FROM docker.io/ascendai/cann:$ASCEND_VERSION AS build

WORKDIR /app

@@ -27,7 +30,21 @@ RUN echo "Building with static libs" && \
cmake --build build --config Release --target llama-completion

# TODO: use image with NNRT
-FROM ascendai/cann:$ASCEND_VERSION AS runtime
+FROM docker.io/ascendai/cann:$ASCEND_VERSION AS runtime

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

COPY --from=build /app/build/bin/llama-cli /app/build/bin/llama-completion /

ENV LC_ALL=C.utf8
24 changes: 21 additions & 3 deletions .devops/musa.Dockerfile
@@ -2,9 +2,13 @@ ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc4.3.0
# Target the MUSA build image
-ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}-amd64
+ARG BASE_MUSA_DEV_CONTAINER=docker.io/mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}-amd64

-ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}-amd64
+ARG BASE_MUSA_RUN_CONTAINER=docker.io/mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}-amd64

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

@@ -37,6 +41,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
@@ -45,8 +50,21 @@ RUN mkdir -p /app/full \
## Base image
FROM ${BASE_MUSA_RUN_CONTAINER} AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
-&& apt-get install -y libgomp1 curl \
+&& apt-get install -y libgomp1 curl ffmpeg \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
31 changes: 29 additions & 2 deletions .devops/nix/package.nix
@@ -3,6 +3,7 @@
glibc,
config,
stdenv,
stdenvNoCC,
runCommand,
cmake,
ninja,
@@ -19,6 +20,8 @@
openssl,
shaderc,
spirv-headers,
nodejs,
importNpmLock,
useBlas ?
builtins.all (x: !x) [
useCuda
@@ -103,6 +106,7 @@ let
vulkan-headers
vulkan-loader
shaderc
+spirv-headers
];
in

@@ -129,7 +133,31 @@ effectiveStdenv.mkDerivation (finalAttrs: {
src = lib.cleanSource ../../.;
};

-postPatch = ''
# Builds the webui locally, taking care not to require updating any sha256 hash.
webui = stdenvNoCC.mkDerivation {
pname = "webui";
version = llamaVersion;
src = lib.cleanSource ../../tools/ui;

nativeBuildInputs = [
nodejs
importNpmLock.linkNodeModulesHook
];

# no sha256 required when using buildNodeModules
npmDeps = importNpmLock.buildNodeModules {
npmRoot = ../../tools/ui;
inherit nodejs;
};

installPhase = ''
LLAMA_UI_OUT_DIR=$out npm run build --offline
'';
};

+postPatch = lib.optionalString useWebUi ''
cp -r ${finalAttrs.webui} tools/ui/dist
chmod -R u+w tools/ui/dist
'';

# With PR#6015 https://github.com/ggml-org/llama.cpp/pull/6015,
@@ -146,7 +174,6 @@ effectiveStdenv.mkDerivation (finalAttrs: {
ninja
pkg-config
git
-spirv-headers
]
++ optionals useCuda [
cudaPackages.cuda_nvcc