# Unsloth + SGLang: MoE-Optimized RL Training Benchmark

Benchmark for the Unsloth + SGLang backend, which combines SGLang for inference with Unsloth for MoE training. It uses a **dedicated GPU split**, where inference and training run on separate GPUs for zero sleep/wake overhead, and a **persistent training worker** that keeps the model loaded across steps.

---

## Architecture — Dedicated GPU Split (Default)

```
┌──────────────────────────────────────────────────────────────────┐
│                    4-GPU Setup (Recommended)                     │
│                                                                  │
│  ┌─ GPUs 0, 2 (TP=2) ─────────────┐  ┌─ GPU 1 ────────────────┐  │
│  │  SGLang Server                 │  │  Unsloth Training      │  │
│  │  • Always active (no sleep)    │  │  • Dedicated GPU       │  │
│  │  • TP=2 inference              │  │  • Persistent worker   │  │
│  │                                │  │    (model loaded once) │  │
│  │  ┌──────────┐  ┌────────────┐  │  │  • LoRA + Optimizer    │  │
│  │  │ TP=2     │  │ LoRA       │  │  │  • ART loss function   │  │
│  │  │ Model    │  │ Hot-reload │  │  │                        │  │
│  │  │ Shards   │  │ < 0.1s     │  │  │  GPU 3: idle           │  │
│  │  └──────────┘  └────────────┘  │  └────────────────────────┘  │
│  └────────────────────────────────┘                              │
│                                                                  │
│  ✓ No sleep/wake overhead                                        │
│  ✓ SGLang stays active during training                           │
│  ✓ Persistent worker — model loaded once, reused across steps    │
│  ✓ TP must be power of 2 (vocab size constraint)                 │
│  ✓ Generation is 70-90% of RL time → more inference GPUs = win   │
└──────────────────────────────────────────────────────────────────┘
```

### Auto-Detected GPU Splits

TP must be a power of 2: the vocabulary is sharded across TP ranks, and vocab sizes like Qwen3's 151,936 divide evenly by 1, 2, 4, and 8 but not by 3.

| GPUs Available | Inference GPUs | TP Size | Training GPU | Mode |
|:-:|:-:|:-:|:-:|:-:|
| 8 | 0, 2, 3, 4 | 4 | 1 | **Dedicated** |
| 4 | 0, 2 | 2 | 1 | **Dedicated** |
| 3 | 0, 2 | 2 | 1 | **Dedicated** |
| 2 | 0 | 1 | 1 | **Dedicated** |
| 1 | 0 | 1 | — | Shared (sleep/wake) |

GPU 1 is chosen as the primary training GPU so that GPU 0 remains the primary SGLang rank.
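
The split rules above can be reproduced with a small helper. `pick_gpu_split` is a hypothetical sketch of the auto-detection that lives in `config.py`, not its actual API:

```python
def pick_gpu_split(num_gpus: int) -> dict:
    """Sketch of the auto-detected GPU split (hypothetical helper).

    Inference TP must be a power of 2; GPU 1 is reserved for training so
    that GPU 0 stays the primary SGLang rank.
    """
    if num_gpus <= 1:
        # Shared mode: inference and training time-share GPU 0 via sleep/wake.
        return {"inference": [0], "tp": 1, "training": None, "mode": "shared"}
    # All GPUs except the training GPU are candidates for inference.
    candidates = [g for g in range(num_gpus) if g != 1]
    # Largest power of 2 that fits in the remaining GPUs (vocab constraint).
    tp = 1
    while tp * 2 <= len(candidates):
        tp *= 2
    return {"inference": candidates[:tp], "tp": tp, "training": 1, "mode": "dedicated"}
```

With 4 GPUs this yields inference on GPUs 0 and 2 (TP=2), training on GPU 1, and GPU 3 idle, matching the table.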

### Key Features

- **Dedicated GPU split** — inference and training on separate GPUs, zero sleep/wake overhead
- **Persistent training worker** — model loaded once at step 1, reused for all subsequent steps (~0s model load overhead on steps 2+)
- **Auto-detected** — optimal split computed from available GPU count
- **~12x faster MoE training** via Unsloth Triton kernels
- **~35% less VRAM** via the Split LoRA approach
- **LoRA hot-reload** for weight sync (<0.1s)
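
The hot-reload is a single HTTP call against the running SGLang server. A minimal sketch, in which the endpoint path and payload keys are assumptions to verify against the SGLang version you run:

```python
import json
from urllib import request

def build_lora_reload(port: int, adapter_dir: str, name: str = "rl-adapter"):
    """Build the request for SGLang's dynamic-LoRA endpoint.

    The endpoint path and payload keys are assumptions; check them
    against your SGLang version.
    """
    url = f"http://localhost:{port}/load_lora_adapter"
    payload = {"lora_name": name, "lora_path": adapter_dir}
    return url, payload

def hot_reload(port: int, adapter_dir: str) -> None:
    """POST the reload request; no server restart, so sync stays under 0.1s."""
    url, payload = build_lora_reload(port, adapter_dir)
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # raises urllib.error.HTTPError on failure
```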

### Shared Mode (Single GPU Fallback)

When only 1 GPU is available, the benchmark falls back to the verl-style sleep/wake pattern: SGLang releases GPU memory before training and reclaims it afterward, adding ~5-15s of overhead per step.

---

## Files

| File | Purpose |
|------|---------|
| `run_benchmark.py` | End-to-end benchmark runner |
| `config.py` | Benchmark configuration + GPU split helper |
| `metrics_collector.py` | Metrics collection and reporting |
| `sglang_server.py` | SGLang server lifecycle management (supports GPU pinning) |
| `unsloth_sglang_service.py` | Unsloth + SGLang service with dedicated/shared GPU modes |
| `setup_environments.sh` | Environment setup script |

---

## Training Loop

### Dedicated Mode (2+ GPUs, default)

**Step 1 (cold start):**

1. **Rollout** — SGLang generates on the inference GPUs (always active, TP=2)
2. **Data pipeline** — ART preprocessing tokenizes and packs rollouts into packed tensors
3. **Spawn worker** — training worker spawned on the dedicated training GPU (pinned via `CUDA_VISIBLE_DEVICES`)
4. **Load model** — base model + LoRA adapter (~50s one-time cost)
5. **Train** — ART loss on packed tensors
6. **Save LoRA** — adapter saved to disk
7. **Load LoRA** — adapter hot-reloaded into SGLang (<0.1s)

**Steps 2+ (persistent worker):**

1. **Rollout** — SGLang generates (never stops)
2. **Data pipeline** — tokenize/pack
3. **Train** — reuse the persistent worker (model already loaded, ~0s overhead)
4. **Save LoRA** + **Load LoRA** — save and hot-reload

No sleep/wake. SGLang never stops. The worker stays alive until the benchmark ends.
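
The persistent-worker loop above can be sketched with multiprocessing queues. `training_worker` is a placeholder for the real Unsloth worker; the dict `model` stands in for the ~50s one-time load, and the real process would pin its GPU via `CUDA_VISIBLE_DEVICES` before starting:

```python
import multiprocessing as mp

def training_worker(task_q, result_q):
    """Persistent worker: load the model once, then serve train requests
    until a None sentinel arrives. The real worker runs on the dedicated
    training GPU with CUDA_VISIBLE_DEVICES pinned before launch."""
    model = {"loads": 1}  # placeholder for the one-time base model + LoRA load
    for step in iter(task_q.get, None):  # None is the shutdown sentinel
        # Real worker: train on packed tensors, save the LoRA adapter to disk.
        result_q.put({"step": step, "loads": model["loads"]})

def run_steps(num_steps):
    """Drive the worker across steps; the model is never reloaded."""
    # fork keeps this sketch self-contained; real CUDA code would use spawn.
    ctx = mp.get_context("fork")
    task_q, result_q = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=training_worker, args=(task_q, result_q))
    worker.start()
    results = []
    for step in range(1, num_steps + 1):
        task_q.put(step)                # packed tensors in the real loop
        results.append(result_q.get())  # blocks until the step finishes
    task_q.put(None)
    worker.join()
    return results
```

The key property is visible in the results: the load counter stays at 1 across every step, which is where the ~0s overhead on steps 2+ comes from.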

### Shared Mode (1 GPU fallback)

1. **Rollout** — SGLang generates completions
2. **Data pipeline** — tokenize/pack
3. **Sleep** — SGLang releases GPU memory
4. **Spawn subprocess** → **Train** → **Save LoRA** → **Kill**
5. **Wake** — SGLang restores GPU memory
6. **Load LoRA** — hot-reload
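
The sleep/wake bracket around training can be sketched as follows. The `/release_memory_occupation` and `/resume_memory_occupation` endpoint paths follow the verl-style integration and are assumptions to verify against your SGLang version; `post` is injectable so the control flow can be exercised without a server:

```python
import json
from urllib import request

def _post(port, path, payload):
    """POST a JSON payload to the local SGLang server."""
    req = request.Request(
        f"http://localhost:{port}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)

def shared_mode_step(port, run_training, post=_post):
    """One shared-mode step: sleep SGLang, train, wake SGLang.

    Endpoint paths are assumptions (verl-style); verify against the
    SGLang version you run.
    """
    post(port, "/release_memory_occupation", {})   # SGLang frees GPU memory
    try:
        run_training()  # spawn training subprocess, train, save LoRA, exit
    finally:
        post(port, "/resume_memory_occupation", {})  # SGLang reclaims memory
```

Wrapping the wake call in `finally` keeps the server usable even if a training step crashes.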

---

## Running the Benchmark

```bash
# Setup environments
bash benchmarks/sglang_vs_vllm/setup_environments.sh

# Run with auto-detected GPU split (recommended)
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
  --sglang-python ~/.venvs/sglang-bench/bin/python \
  --num-steps 10 --num-rollouts 64 --dataset gsm8k

# Explicit GPU split: inference on GPUs 0,2 (TP=2), training on GPU 1
uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
  --inference-gpus 0,2 --training-gpus 1 \
  --sglang-python ~/.venvs/sglang-bench/bin/python

# Force shared mode (sleep/wake) even with multiple GPUs
uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
  --training-gpus -1 \
  --sglang-python ~/.venvs/sglang-bench/bin/python
```

### Options

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | `Qwen/Qwen3-30B-A3B-Instruct-2507` | Model to benchmark |
| `--dataset` | `agentic` | Dataset: `gsm8k`, `sharegpt`, `agentic`, `math`, `synthetic` |
| `--num-steps` | `3` | Number of RL training steps |
| `--num-rollouts` | `16` | Rollouts per step |
| `--inference-gpus` | auto | Comma-separated GPU IDs for SGLang inference (e.g. `0,2,3`) |
| `--training-gpus` | auto | Comma-separated GPU IDs for training (e.g. `1`); `-1` for shared mode |
| `--tp` | `0` (auto) | Tensor parallel size (overridden by `--inference-gpus` count) |
| `--unsloth-lora-rank` | `1` | LoRA rank for Unsloth training |
| `--unsloth-moe-backend` | `auto` | MoE backend: `auto`, `grouped_mm` (H100+), `unsloth_triton` (A100) |
| `--unsloth-port` | `8300` | SGLang inference server port |
| `--gpu-memory-utilization` | `0.7` | GPU memory fraction for SGLang |

The GSM8K test set (1,319 questions) is downloaded automatically on first run and cached locally.

---

## Trade-Offs vs Distributed Training

| | Unsloth + SGLang (this) | Distributed (Megatron) |
|---|---|---|
| **Inference** | Up to N-1 GPUs (TP=2 on 4 GPUs, since TP must be a power of 2) | N GPUs (TP=4) |
| **Training** | 1 GPU (persistent worker) | N GPUs (tensor parallel) |
| **Model reload per step** | **0s** (steps 2+) | ~50-80s (sleep/wake + resharding) |
| **Sleep/wake overhead** | None (dedicated split) | Yes (each step) |
| **Training throughput** | Single GPU | Linear scaling across N GPUs |
| **Setup complexity** | Simple | Complex |
| **Best for** | Rapid prototyping, MoE models | Production, large-scale |

The persistent worker eliminates model reload overhead on steps 2+, which partially compensates for using fewer inference GPUs. Unsloth is single-GPU by design (no tensor parallelism for training), so the dedicated GPU split with a persistent worker is the optimal configuration for this backend.
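
A back-of-envelope model makes the trade-off concrete. The timings below are illustrative placeholders, not benchmark results; the point is that eliminating the per-step reload can offset slower single-GPU training when generation dominates the step:

```python
def step_time(gen_s, train_s, reload_s):
    """Wall-clock seconds for one RL step in a simple additive model."""
    return gen_s + train_s + reload_s

# Illustrative numbers only. Dedicated mode generates on fewer GPUs (slower
# rollouts) but pays zero reload cost on steps 2+; distributed mode generates
# faster and trains faster, but pays sleep/wake + resharding every step.
dedicated   = step_time(gen_s=140, train_s=35, reload_s=0)
distributed = step_time(gen_s=100, train_s=12, reload_s=65)
```

Under these placeholder numbers the two modes land within a few seconds of each other per step; which one wins on real hardware depends on how heavily generation dominates.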

---

## Credits

- [ART (OpenPipe)](https://github.com/OpenPipe/ART) — The codebase this is built on
- [verl (Volcano Engine)](https://github.com/volcengine/verl) — Reference for the SGLang integration pattern
- [SGLang](https://github.com/sgl-project/sglang) — Inference engine
- [Unsloth](https://unsloth.ai/) — MoE-optimized training