Git commit
563137a (master-691-563137a)
Operating System & Version
Debian 13, radv 25.2.6
GGML backends
Vulkan
Command-line arguments used
./sd-cli --backend Vulkan1 --diffusion-model z_image_turbo_bf16.safetensors --llm Qwen3-4B-UD-Q4_K_XL.gguf --vae ./ae_bf16.safetensors -p flower --cfg-scale 1 --steps 4 --offload-to-cpu --mmap
Steps to reproduce
With release master-691-563137a, the command line above (standard Z-Image Turbo and Flux.1 VAE bf16 weights, Qwen3-4B quant from Unsloth) fails on Vulkan. The same parameters and models work fine on the previous release, master-690-3a54597.
What you expected to happen
Offloading should work as before. This is the output from master-690-3a54597:
[INFO ] model_loader.cpp:913 - memory-mapped 606 tensors in 3 files (13856.51 MB), taking 0.00s
|====================> | 453/1095 - 17.32MB/s
|======================================> | 851/1095 - 208.93MB/s
|==================================================| 1095/1095 - 239.24MB/s
[INFO ] model_loader.cpp:1167 - loading tensors completed, taking 1.67s (read: 0.16s, memcpy: 0.00s, convert: 0.22s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:1149 - total params memory size = 1583.87MB (VRAM 0.00MB, RAM 1583.87MB): text_encoders 1483.75MB(RAM), diffusion_model 7.30MB(RAM), vae 92.82MB(RAM), controlnet 0.00MB(N/A), extensions 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1254 - running in FLOW mode
[INFO ] stable-diffusion.cpp:4407 - generate_image 512x512
[INFO ] denoiser.hpp:579 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3461 - sampling using Euler method
[INFO ] ggml_extend.hpp:2158 - qwen3 offload params (3602.16 MB, 398 tensors) to runtime backend (Vulkan1), taking 5.53s
[INFO ] stable-diffusion.cpp:4164 - get_learned_condition completed, taking 5.99s
[INFO ] stable-diffusion.cpp:4441 - generating image: 1/1 - seed 42
[INFO ] ggml_extend.hpp:2158 - z_image offload params (11743.02 MB, 453 tensors) to runtime backend (Vulkan1), taking 11.90s
|==================================================| 4/4 - 6.39s/it
[INFO ] stable-diffusion.cpp:4473 - sampling completed, taking 25.76s
[INFO ] stable-diffusion.cpp:4491 - generating 1 latent images completed, taking 25.76s
Peak VRAM usage is around 11 GB on a 16 GB card.
What actually happened
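An out-of-memory crash: the text encoder's embedding tensor is rejected as too large for the params buffer, and an assertion then aborts the run: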
[INFO ] stable-diffusion.cpp:520 - Weight type stat: f32: 145 | q4_K: 154 | q5_K: 30 | q6_K: 49 | iq4_xs: 20 | bf16: 697
[INFO ] stable-diffusion.cpp:521 - Conditioner weight type stat: f32: 145 | q4_K: 154 | q5_K: 30 | q6_K: 49 | iq4_xs: 20
[INFO ] stable-diffusion.cpp:522 - Diffusion model weight type stat: bf16: 453
[INFO ] stable-diffusion.cpp:523 - VAE weight type stat: bf16: 244
[INFO ] stable-diffusion.cpp:930 - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:525 - vae decoder: ch = 128
[INFO ] stable-diffusion.cpp:1151 - total params memory size = 15439.76MB (VRAM 0.00MB, RAM 15439.76MB): text_encoders 3602.16MB(RAM), diffusion_model 11743.02MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(N/A), extensions 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1251 - running in FLOW mode
[INFO ] stable-diffusion.cpp:4364 - generate_image 512x512
[INFO ] denoiser.hpp:579 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3417 - sampling using Euler method
[ERROR] model_manager.cpp:581 - model manager tensor 'text_encoders.llm.model.embed_tokens.weight' is too large for params buffer: 1555824640 > 1073741824
[ERROR] ggml_extend.hpp:1893 - qwen3 prepare graph weights failed
src/conditioning/conditioner.hpp:1719: GGML_ASSERT(!hidden_states.empty()) failed
[New LWP 1908342]
[New LWP 1908341]
[New LWP 1908340]
[New LWP 1908339]
[New LWP 1908334]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007fdc9d49b668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 0x00007fdc9d49b6ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007fdc9d5067c7 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x0000559eb1beb82b in ggml_print_backtrace ()
#5 0x0000559eb1beb97e in ggml_abort ()
#6 0x0000559eb0fd30db in LLMEmbedder::encode_prompt(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<int, int> const&, int, int, std::vector<std::pair<int, sd::Tensor<float> >, std::allocator<std::pair<int, sd::Tensor<float> > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int, bool, int) [clone .isra.0] ()
#7 0x0000559eb0fd3b8e in LLMEmbedder::get_learned_condition(int, ConditionerParams const&) ()
#8 0x0000559eb0e47d2e in generate_image ()
#9 0x0000559eb0d04006 in main ()
[Inferior 1 (process 1908333) detached]
master-691-563137a works with ROCm on the same card.
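The two numbers in the "too large for params buffer" error can be sanity-checked with a little arithmetic. The sketch below is only an illustration, not a diagnosis. It assumes the values are byte counts and that "MB" in the logs means MiB. It also assumes Qwen3-4B uses a vocabulary of 151936 tokens and a hidden size of 2560; neither figure appears in this report.

```python
# Values copied from the error line in the log above.
TENSOR_BYTES = 1555824640  # text_encoders.llm.model.embed_tokens.weight
LIMIT_BYTES = 1073741824   # limit printed by model_manager.cpp

MIB = 1024 * 1024

# Assumed model dimensions (not stated in the report).
VOCAB_SIZE = 151936
HIDDEN_SIZE = 2560
F32_BYTES = 4

tensor_mib = TENSOR_BYTES / MIB
limit_mib = LIMIT_BYTES / MIB
excess_mib = (TENSOR_BYTES - LIMIT_BYTES) / MIB

# Size of the embedding table if it were stored as f32.
f32_embedding_bytes = VOCAB_SIZE * HIDDEN_SIZE * F32_BYTES

print(f"tensor size : {tensor_mib:.2f} MiB")
print(f"buffer limit: {limit_mib:.2f} MiB")
print(f"over limit  : {excess_mib:.2f} MiB")
print(f"matches an f32 embedding table: {f32_embedding_bytes == TENSOR_BYTES}")
```

Under those assumptions the tensor comes out at 1483.75 MiB, the same figure the master-690 log reports for text_encoders in RAM, and the limit is exactly 1 GiB. That may point to a per-buffer size cap rather than the 16 GB card actually running out of memory, but the log alone does not confirm it.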
Logs / error messages / stack trace
No response
Additional context / environment details
No response