This tutorial demonstrates how to use the CPU-GPU expert scheduling feature in KTransformers with SGLang. This feature introduces a flexible GPU expert mask system that allows intelligent placement of MoE experts across CPU and GPU, optimizing inference performance based on workload patterns.
## Table of Contents

- Hardware Requirements
- Prerequisites
- Step 1: Download Model Weights
- Step 2: Launch Server with Expert Scheduling
- Step 3: Send Inference Requests
- Performance
- Troubleshooting
- Additional Resources
## Hardware Requirements

**Minimum Configuration:**

- GPU: NVIDIA RTX 4090 24 GB (or equivalent with at least 24 GB of VRAM available)
- CPU: x86 CPU with AVX512 support (e.g., Intel Sapphire Rapids, AMD EPYC)
- RAM: At least 256 GB of system memory
- Storage: Sufficient space for model weights (for example, the FP8 80B model used below needs on the order of 80 GB)
**Tested Configuration:**

- GPU: 4 x NVIDIA GeForce RTX 4090 (24 GB)
- CPU: Intel Xeon Gold 6454S
- RAM: 512 GB DDR5
- OS: Linux (Ubuntu 20.04+ recommended)
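
You can sanity-check these requirements with standard Linux tools before proceeding:

```bash
# List GPUs and available VRAM
nvidia-smi

# Confirm the CPU advertises AVX512 (prints matching flags if present)
lscpu | grep -io 'avx512[a-z0-9_]*' | sort -u

# Check total system memory
free -h
```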
## Prerequisites

Before starting, ensure you have:
- **SGLang installed**

  Install the kvcache-ai fork of SGLang (one of):

  ```bash
  # Option A: One-click install (from the ktransformers root)
  ./install.sh

  # Option B: pip install
  pip install sglang-kt
  ```
- **KTransformers installed**

  ```bash
  git clone https://github.com/kvcache-ai/ktransformers.git
  cd ktransformers/kt-kernel
  bash ./install.sh
  ```

  After installation, verify the CLI is working:

  ```bash
  kt version
  ```
- **CUDA toolkit** - CUDA 12.0+ recommended
- **Hugging Face CLI** - for downloading models:

  ```bash
  pip install -U huggingface-hub
  ```
## Step 1: Download Model Weights

Download your preferred MoE model weights. This feature supports various MoE models, including:
- **Qwen3-Next-80B-A3B-Instruct-FP8**

  ```bash
  huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --local-dir /path/to/qwen3-next-80b
  ```
## Step 2: Launch Server with Expert Scheduling

The simplest way to start the server with expert scheduling:
```bash
python -m sglang.launch_server \
  --model /path/to/model \
  --kt-num-gpu-experts 8 \
  --kt-expert-placement-strategy uniform
```

The system provides four expert placement strategies:
| Strategy | Description | Use Case |
|---|---|---|
| `uniform` | Distributes GPU experts evenly across all MoE layers | Default; no prior statistics needed |
| `frequency` | Places the most frequently activated experts on GPU | Best performance when activation statistics are available |
| `front-loading` | Fills GPU experts from the first layer onwards | Testing or specific workload patterns |
| `random` | Randomly selects experts with a fixed seed (42) | Baseline comparison |
**Using the frequency strategy** (recommended for best performance):
```bash
python -m sglang.launch_server \
  --model /path/to/model \
  --kt-num-gpu-experts 8 \
  --kt-expert-placement-strategy frequency \
  --init-expert-location /path/to/activation_stats.pt
```

**Using dynamic expert update:**
```bash
python -m sglang.launch_server \
  --model /path/to/model \
  --kt-num-gpu-experts 8 \
  --kt-expert-placement-strategy frequency \
  --init-expert-location /path/to/activation_stats.pt \
  --kt-enable-dynamic-expert-update \
  --kt-gpu-prefill-token-threshold 512
```

| Parameter | Description |
|---|---|
| `--kt-num-gpu-experts` | Number of GPU experts per MoE layer. Internally multiplied by the number of MoE layers to get the total number of GPU experts. Ignored if `--kt-gpu-experts-ratio` is set. |
| `--kt-gpu-experts-ratio` | Ratio of total experts to place on GPU (0.0-1.0). If set, overrides `--kt-num-gpu-experts`. Example: `0.1` means 10% of all experts across all layers will be on GPU. |
| `--kt-expert-placement-strategy` | Expert placement strategy: `frequency`, `uniform`, `front-loading`, or `random`. Default: `uniform`. |
| `--init-expert-location` | Path to an activation statistics file (`.pt`) for the `frequency` strategy. |
| `--kt-enable-dynamic-expert-update` | Enable dynamic expert updates during inference. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for triggering dynamic expert redistribution during prefill. |
| `--record-kt-gpu-expert-distribution` | Enable recording of the GPU expert distribution for analysis. |
| `--expert-distribution-recorder-mode` | Recording mode: `stat` (default), `stat_approx`, `per_pass`, or `per_token`. |
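
For reference, a launch that sizes GPU experts by ratio instead of a fixed per-layer count might look like the following sketch; the `0.2` value is illustrative, and both flags are described in the table above:

```bash
# Size GPU experts as a fraction of all experts across all layers.
# --kt-gpu-experts-ratio overrides --kt-num-gpu-experts if both are given.
python -m sglang.launch_server \
  --model /path/to/model \
  --kt-gpu-experts-ratio 0.2 \
  --kt-expert-placement-strategy frequency \
  --init-expert-location /path/to/activation_stats.pt
```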
## Step 3: Send Inference Requests

Once the server is running (default: `http://localhost:30000`), you can interact with the model in several ways:
The easiest way to chat with the model:

```bash
kt chat
```

This opens an interactive terminal chat session. Type your messages and press Enter to send; press Ctrl+C to exit.
The server exposes an OpenAI-compatible API at `http://localhost:30000/v1`.
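
If you prefer a one-shot request instead of streaming, a minimal non-streaming call can look like the sketch below; the `model` value is a placeholder, and `max_tokens` is a standard OpenAI-compatible parameter:

```bash
# Non-streaming chat completion; the full response is returned as one JSON body.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```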
**curl example (streaming):**

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

## Performance

The following benchmarks were measured on Qwen3-Next-80B-A3B-Instruct-FP8 with 4 x RTX 4090 GPUs and an Intel Xeon Gold 6454S CPU, tensor parallel size 4, using the ShareGPT dataset (higher is better):
| GPU Expert Ratio | random | uniform | front-loading | frequency | dynamic-expert-update |
|---|---|---|---|---|---|
| 0% | 53.01 | 52.96 | 54.18 | 52.72 | 53.37 |
| 10% | 56.63 | 56.57 | 57.18 | 58.60 | 70.22 |
| 20% | 58.75 | 60.28 | 58.82 | 61.92 | 74.73 |
| 30% | 62.86 | 62.08 | 63.87 | 66.50 | 75.55 |
| 40% | 66.81 | 66.82 | 67.45 | 72.78 | 80.98 |
| 50% | 70.38 | 65.25 | 73.65 | 76.19 | 81.17 |
| 60% | 71.33 | 72.80 | 77.95 | 82.33 | 82.30 |
| 70% | 74.40 | 76.17 | 81.59 | 89.37 | 88.70 |
| 80% | 79.71 | 79.20 | 89.20 | 100.67 | 92.31 |
| 90% | 88.82 | 81.06 | 98.14 | 107.15 | 95.04 |
| 100% | 112.61 | 112.32 | 111.82 | 114.26 | 112.99 |
The `frequency` and dynamic-expert-update strategies show significant performance improvements over the baseline strategies: dynamic expert update delivers its largest gains at lower GPU expert ratios, while `frequency` pulls ahead as the ratio increases.
## Troubleshooting

**Out-of-memory (OOM) errors** - if you encounter OOM, adjust these parameters when launching the server:
| Parameter | VRAM Impact |
|---|---|
| `--kt-num-gpu-experts` / `--kt-gpu-experts-ratio` | Reduces expert weight VRAM usage |
| `--chunked-prefill-size` | Reduces extra VRAM allocated during prefill |
| `--max-total-tokens` | Reduces KV cache VRAM usage |
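
As an illustration, a more VRAM-conservative launch might combine all three knobs; the values below are placeholders showing the shape of the command, not tuned recommendations:

```bash
# Fewer GPU experts, smaller prefill chunks, and a smaller KV cache budget.
python -m sglang.launch_server \
  --model /path/to/model \
  --kt-gpu-experts-ratio 0.1 \
  --chunked-prefill-size 2048 \
  --max-total-tokens 8192
```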
**Dynamic expert update not triggering** - ensure all of the following conditions are met:

- `--kt-enable-dynamic-expert-update` is enabled
- `--kt-gpu-prefill-token-threshold` is set
- The prefill length is >= the threshold value
To save expert distribution statistics to a custom path, set the environment variable:

```bash
export SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/path/to/output
```
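
Combining this with the recording flags from the parameter table above, a recording-enabled launch might look like the following sketch (paths and values are illustrative):

```bash
# Write expert distribution statistics to a custom directory for later analysis.
export SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/path/to/output

python -m sglang.launch_server \
  --model /path/to/model \
  --kt-num-gpu-experts 8 \
  --record-kt-gpu-expert-distribution \
  --expert-distribution-recorder-mode stat
```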