Efficient LLM inference on Slurm clusters.
Updated
Apr 22, 2026 - Python
Layered prefill shifts the scheduling axis from tokens to layers, removing redundant MoE weight reloads while keeping decode stall-free. The result is lower TTFT, lower end-to-end latency, and lower energy per token without hurting TBT stability.
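The layer-axis scheduling idea above can be sketched as follows. This is a minimal illustrative toy, not the project's actual implementation: the `ToyLayer` class, `load_weights`, and `layered_prefill` names are assumptions introduced here to show why loading each layer's weights once, then sweeping it over all queued prefill requests, amortizes MoE weight loads across the batch.

```python
class ToyLayer:
    """Stand-in for a transformer/MoE layer; counts weight (re)loads."""
    load_count = 0  # total weight loads across all layers

    def __init__(self, scale):
        self.scale = scale

    def load_weights(self):
        # Pretend to fetch this layer's (expert) weights into fast memory.
        ToyLayer.load_count += 1

    def forward(self, hidden):
        return [h * self.scale for h in hidden]


def layered_prefill(layers, requests):
    """Schedule prefill along the layer axis: load each layer's weights
    once, then run it over every queued request, instead of reloading
    all layers once per request (the token-axis schedule)."""
    states = dict(requests)  # request id -> hidden state after current layer
    for layer in layers:
        layer.load_weights()          # one load per layer, amortized
        for rid in states:
            states[rid] = layer.forward(states[rid])
    return states


layers = [ToyLayer(2), ToyLayer(3)]
requests = {"a": [1, 1], "b": [2]}
out = layered_prefill(layers, requests)
# Weight loads = number of layers (2), independent of the number of
# queued requests; a token-axis schedule would pay 2 loads per request.
```

Under this toy model, decode can still be interleaved between layer sweeps, which is how the stall-free property is preserved.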
eLLM can run LLM inference on CPUs faster than on GPUs.