
Commit 062d680

feat: complete SGLang backend with multi-GPU split, benchmarks, and core fixes
- SGLang backend with dedicated GPU split (inference GPU 0, training GPU 1+)
- LoRA hot-reload via SGLang API preserves RadixAttention cache
- Two-environment architecture for torchao version isolation
- Benchmarks: SGLang vs vLLM comparison suite
- Training utils extracted for backend-agnostic use
- DeviceConfig with auto-detection
- Ruler fix for empty trajectory groups and exception preservation
- vLLM compatibility patches
1 parent 19bd069 commit 062d680

66 files changed · 8,243 additions & 1,798 deletions


CLAUDE.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

CLAUDE.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
## uv package manager by default

This project uses the `uv` package manager.

- To add a dependency, run `uv add <package>`.
- To run a script, run `uv run <script>`.
- To examine dependencies, consult the `pyproject.toml` file.

## Testing

- Always run tests before committing. The test command is `uv run prek run --all-files`.

## Releases

- If asked to help with a release, refer to the checklist in CONTRIBUTING.md. Be sure to first share a draft of the release notes with the user before actually publishing the release to GitHub.
- To trigger the release workflow via the GitHub CLI: `gh workflow run create-draft-release.yml --field version_type=patch` (use `minor` or `major` instead of `patch` as needed).

## Documentation

- All documentation is in the `docs` directory.
- If you add a new page, be sure to add it to the sidebar in `docs/docs.json`.
- If you move a page, be sure to update the sidebar in `docs/docs.json` and check for any broken links.

### Adding images

- Add images to the `docs/images` directory.
- If the image is a PNG, first convert it to WebP using `magick <input.png> <output.webp>`. Do not include the original PNG in the repo.
- Use the `<Frame>` tag to add images with captions, as seen in the page `checkpoint-forking.mdx`.

### Adding notes

- Add notes using the `<Note>` tag, as seen in the page `ruler.mdx`.

CONTRIBUTING.md

Lines changed: 0 additions & 18 deletions
@@ -133,22 +133,4 @@ If you run into any issues, the training output is set to maximum verbosity. Cop
````diff
 ### Cleaning Up
 
-When you're done, you can tear down the cluster with:
-
-```bash
-uv run sky down art
-```
-
-### Adding Docs
-
-We use Mintlify to serve our docs. Here are the steps for adding a new page:
-
-1. Clone the ART repo
-2. Open the /docs directory in your CLI and IDE
-3. Run npx mintlify dev to start serving a local version of the docs in your browser
-4. Create a new .mdx file in the relevant directory
-5. Add a title and sidebar title (see other pages for examples)
-6. In docs.json, add a link to the new page within one of the `navigation`.`groups`
-7. Ensure everything works by navigating to and viewing the page in your browser
-8. Submit a PR
 
 When you're done, shut down your GPU instance (if using a cloud VM) or stop the local training process.
````

README.md

Lines changed: 1 addition & 1 deletion
@@ -136,7 +136,7 @@ ART is in active development, and contributions are most welcome! Please see the
````diff
 ```bibtex
 @misc{hilton2025art,
-  author = {Brad Hilton and Kyle Corbitt and David Corbitt and Saumya Gandhi and Angky William and Bohdan Kovalevskyi and Andie Jones},
+  author = {Brad Hilton and Kyle Corbitt and David Corbitt and Saumya Gandhi and Angky William and Bohdan Kovalenskyi and Andie Jones},
   title = {ART: Agent Reinforcement Trainer},
   year = {2025},
   publisher = {GitHub},
````

benchmarks/__init__.py

Whitespace-only changes.
Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
# Unsloth + SGLang: MoE-Optimized RL Training Benchmark

Benchmark for the Unsloth + SGLang backend, which combines SGLang for inference with Unsloth for MoE training. It uses a **dedicated GPU split**, where inference and training run on separate GPUs for zero sleep/wake overhead, with a **persistent training worker** that keeps the model loaded across steps.

---

## Architecture — Dedicated GPU Split (Default)

```
┌──────────────────────────────────────────────────────────────────┐
│                    4-GPU Setup (Recommended)                     │
│                                                                  │
│  ┌─ GPUs 0, 2 (TP=2) ──────────┐   ┌─ GPU 1 ───────────────┐     │
│  │ SGLang Server               │   │ Unsloth Training      │     │
│  │ • Always active (no sleep)  │   │ • Dedicated GPU       │     │
│  │ • TP=2 inference            │   │ • Persistent worker   │     │
│  │                             │   │   (model loaded once) │     │
│  │ ┌─────────┐ ┌─────────────┐ │   │ • LoRA + Optimizer    │     │
│  │ │ TP=2    │ │ LoRA        │ │   │ • ART loss function   │     │
│  │ │ Model   │ │ Hot-reload  │ │   │                       │     │
│  │ │ Shards  │ │ < 0.1s      │ │   │ GPU 3: idle           │     │
│  │ └─────────┘ └─────────────┘ │   └───────────────────────┘     │
│  └─────────────────────────────┘                                 │
│                                                                  │
│  ✓ No sleep/wake overhead                                        │
│  ✓ SGLang stays active during training                           │
│  ✓ Persistent worker — model loaded once, reused across steps    │
│  ✓ TP must be power of 2 (vocab size constraint)                 │
│  ✓ Generation is 70-90% of RL time → more inference GPUs = win   │
└──────────────────────────────────────────────────────────────────┘
```

### Auto-Detected GPU Splits

TP must be a power of 2: model vocab sizes such as Qwen3's 151936 are divisible by 1, 2, 4, and 8, but NOT by 3.

| GPUs Available | Inference GPUs | TP Size | Training GPU | Mode |
|:-:|:-:|:-:|:-:|:-:|
| 8 | 0, 2, 3, 4 | 4 | 1 | **Dedicated** |
| 4 | 0, 2 | 2 | 1 | **Dedicated** |
| 3 | 0, 2 | 2 | 1 | **Dedicated** |
| 2 | 0 | 1 | 1 | **Dedicated** |
| 1 | 0 | 1 | — | Shared (sleep/wake) |

GPU 1 is chosen as the primary training GPU to keep GPU 0 as the primary SGLang rank.
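
The split rules in the table above can be sketched as a small helper. This is illustrative only — the function name and return shape are assumptions, and the real logic lives in the benchmark's `config.py`:

```python
def auto_gpu_split(num_gpus: int) -> dict:
    """Compute a dedicated inference/training GPU split.

    Rules mirroring the table: with 2+ GPUs, GPU 1 is reserved for
    training so GPU 0 stays the primary SGLang rank; TP must be a
    power of 2, so we take the largest power of 2 that fits in the
    remaining GPUs. With a single GPU, fall back to shared mode.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    if num_gpus == 1:
        return {"inference_gpus": [0], "tp": 1, "training_gpus": [], "mode": "shared"}
    # Everything except GPU 1 is available for inference.
    available = [g for g in range(num_gpus) if g != 1]
    tp = 1
    while tp * 2 <= len(available):  # largest power of 2 that fits
        tp *= 2
    return {
        "inference_gpus": available[:tp],
        "tp": tp,
        "training_gpus": [1],
        "mode": "dedicated",
    }
```

For example, `auto_gpu_split(8)` yields inference GPUs `[0, 2, 3, 4]` with TP=4, matching the first table row.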

### Key Features

- **Dedicated GPU split** — inference and training on separate GPUs, zero sleep/wake overhead
- **Persistent training worker** — model loaded once at step 1, reused for all subsequent steps (~0s model-load overhead on steps 2+)
- **Auto-detected** — optimal split computed from the available GPU count
- **~12x faster MoE training** via Unsloth Triton kernels
- **~35% less VRAM** via the Split LoRA approach
- **LoRA hot-reload** for weight sync (<0.1s)
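
The hot-reload goes through the running SGLang server's HTTP API, so the server (and its RadixAttention cache) never restarts. A minimal sketch, assuming a recent SGLang build that exposes a `/load_lora_adapter` route — the route and field names vary across SGLang versions, and `hot_reload_lora` is an illustrative helper, not this benchmark's actual API (the port matches the `--unsloth-port` default below):

```python
import json
import urllib.request


def build_lora_reload_request(adapter_dir: str, name: str = "trained-adapter") -> dict:
    """Payload for SGLang's dynamic LoRA-loading endpoint.

    ASSUMPTION: the field names (`lora_name`, `lora_path`) follow
    recent SGLang releases; verify against your installed version.
    """
    return {"lora_name": name, "lora_path": adapter_dir}


def hot_reload_lora(adapter_dir: str, base_url: str = "http://localhost:8300") -> None:
    """POST a freshly saved adapter to a running SGLang server."""
    payload = json.dumps(build_lora_reload_request(adapter_dir)).encode()
    req = urllib.request.Request(
        f"{base_url}/load_lora_adapter",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # urlopen raises HTTPError on a non-2xx response.
    with urllib.request.urlopen(req, timeout=30) as resp:
        resp.read()
```

Only the adapter weights change, which is why the reload completes in well under a second.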

### Shared Mode (Single GPU Fallback)

When only one GPU is available, the backend falls back to the verl-style sleep/wake pattern: SGLang releases GPU memory before training and reclaims it afterward. This adds roughly 5-15s of overhead per step.

---

## Files

| File | Purpose |
|------|---------|
| `run_benchmark.py` | End-to-end benchmark runner |
| `config.py` | Benchmark configuration + GPU split helper |
| `metrics_collector.py` | Metrics collection and reporting |
| `sglang_server.py` | SGLang server lifecycle management (supports GPU pinning) |
| `unsloth_sglang_service.py` | Unsloth + SGLang service with dedicated/shared GPU modes |
| `setup_environments.sh` | Environment setup script |
---

## Training Loop

### Dedicated Mode (2+ GPUs, default)

**Step 1 (cold start):**

1. **Rollout** — SGLang generates on the inference GPUs (always active, TP=2)
2. **Data pipeline** — ART preprocessing tokenizes and packs into packed tensors
3. **Spawn worker** — on the dedicated training GPU (via `CUDA_VISIBLE_DEVICES`)
4. **Load model** — base model + LoRA adapter (~50s one-time cost)
5. **Train** — ART loss on packed tensors
6. **Save LoRA** — adapter saved to disk
7. **Load LoRA** — hot-reload adapter into SGLang (<0.1s)

**Steps 2+ (persistent worker):**

1. **Rollout** — SGLang generates (never stops)
2. **Data pipeline** — tokenize/pack
3. **Train** — reuse the persistent worker (model already loaded, ~0s overhead)
4. **Save LoRA** + **Load LoRA** — save and hot-reload

No sleep/wake. SGLang never stops. The worker stays alive until the benchmark ends.
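
The difference between the two step shapes boils down to the worker caching its model between calls. A schematic sketch in plain Python — the class name, paths, and stubbed load are illustrative (the real worker is a subprocess pinned via `CUDA_VISIBLE_DEVICES`, with actual model loading in place of the stand-in):

```python
class PersistentTrainingWorker:
    """Stays alive across RL steps so the model is loaded exactly once."""

    def __init__(self) -> None:
        self._model = None
        self.load_count = 0  # how many times the base model was loaded

    def _ensure_model(self) -> None:
        # Cold start (step 1 only): load base model + LoRA adapter (~50s).
        if self._model is None:
            self._model = object()  # stand-in for the real model load
            self.load_count += 1

    def train_step(self, packed_batch: list) -> str:
        """Run one training step and return the saved adapter path."""
        self._ensure_model()
        # ... compute ART loss on the packed tensors, update LoRA weights ...
        return "/tmp/lora_adapter"  # illustrative path handed to hot-reload


worker = PersistentTrainingWorker()
for step in range(3):
    # Rollout and the data pipeline run elsewhere (inference GPUs / CPU).
    adapter_path = worker.train_step(packed_batch=[step])
# worker.load_count stays at 1: the model survived across all three steps.
```

Killing the worker after every step would reset `_model` to `None`, forcing the ~50s reload each time — which is exactly what shared mode pays for.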

### Shared Mode (1 GPU fallback)

1. **Rollout** — SGLang generates completions
2. **Data pipeline** — tokenize/pack
3. **Sleep** — SGLang releases GPU memory
4. **Spawn subprocess** → **Train** → **Save LoRA** → **Kill**
5. **Wake** — SGLang restores GPU memory
6. **Load LoRA** — hot-reload

---

## Running the Benchmark

```bash
# Setup environments
bash benchmarks/sglang_vs_vllm/setup_environments.sh

# Run with auto-detected GPU split (recommended)
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
  --sglang-python ~/.venvs/sglang-bench/bin/python \
  --num-steps 10 --num-rollouts 64 --dataset gsm8k

# Explicit GPU split: inference on GPUs 0,2 (TP=2), training on GPU 1
uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
  --inference-gpus 0,2 --training-gpus 1 \
  --sglang-python ~/.venvs/sglang-bench/bin/python

# Force shared mode (sleep/wake) even with multiple GPUs
uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
  --training-gpus -1 \
  --sglang-python ~/.venvs/sglang-bench/bin/python
```

### Options

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | `Qwen/Qwen3-30B-A3B-Instruct-2507` | Model to benchmark |
| `--dataset` | `agentic` | Dataset: gsm8k, sharegpt, agentic, math, synthetic |
| `--num-steps` | `3` | Number of RL training steps |
| `--num-rollouts` | `16` | Rollouts per step |
| `--inference-gpus` | auto | Comma-separated GPU IDs for SGLang inference (e.g. `0,2,3`) |
| `--training-gpus` | auto | Comma-separated GPU IDs for training (e.g. `1`); `-1` for shared mode |
| `--tp` | `0` (auto) | Tensor parallel size (overridden by the `--inference-gpus` count) |
| `--unsloth-lora-rank` | `1` | LoRA rank for Unsloth training |
| `--unsloth-moe-backend` | `auto` | MoE backend: auto, grouped_mm (H100+), unsloth_triton (A100) |
| `--unsloth-port` | `8300` | SGLang inference server port |
| `--gpu-memory-utilization` | `0.7` | GPU memory fraction for SGLang |

The GSM8K test set (1,319 questions) is downloaded automatically on first run and cached locally.
---

## Trade-Offs vs Distributed Training

| | Unsloth + SGLang (this) | Distributed (Megatron) |
|---|---|---|
| **Inference** | N-1 GPUs (TP=2 on 4 GPUs) | N GPUs (TP=4) |
| **Training** | 1 GPU (persistent worker) | N GPUs (tensor parallel) |
| **Model reload per step** | **0s** (steps 2+) | ~50-80s (sleep/wake + resharding) |
| **Sleep/wake overhead** | None (dedicated split) | Yes (each step) |
| **Training throughput** | Single GPU | Linear scaling across N GPUs |
| **Setup complexity** | Simple | Complex |
| **Best for** | Rapid prototyping, MoE models | Production, large-scale |

The persistent worker eliminates model-reload overhead on steps 2+, which partially compensates for using fewer inference GPUs. Unsloth is single-GPU by design (no tensor parallelism for training), so the dedicated GPU split with a persistent worker is the optimal configuration.

---

## Credits

- [ART (OpenPipe)](https://github.com/OpenPipe/ART) — the codebase this is built on
- [verl (Volcano Engine)](https://github.com/volcengine/verl) — reference for the SGLang integration pattern
- [SGLang](https://github.com/sgl-project/sglang) — inference engine
- [Unsloth](https://unsloth.ai/) — MoE-optimized training
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+"""Unsloth + SGLang benchmark suite."""

0 commit comments
