
Commit 19bd069 — Update sglang-integration.md
1 parent 13377c0

1 file changed: docs/sglang-integration.md (0 additions, 47 deletions)
@@ -203,57 +203,10 @@ await backend.register(model)
| `disk` | ~10-20s | Preserved | Large checkpoints |
| `restart` | ~30-60s | Lost | Single-GPU fallback |

## Known Issues and Workarounds

### 1. DeviceMesh Memory Imbalance Error

**Symptom**: SGLang fails to start with a memory imbalance error.

**Solution**: Set the environment variable (done automatically by SGLangBackend):
```bash
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True
```
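If you launch SGLang outside of SGLangBackend, a minimal sketch of applying the same workaround yourself in Python (illustrative only):

```python
import os

# Must be set before the SGLang engine/server process is created;
# setting it after startup has no effect on the check.
os.environ["SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"] = "True"
```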
### 2. update_weights_from_tensor Fails with TP > 1

**Reference**: [SGLang #3726](https://github.com/sgl-project/sglang/issues/3726)

**Solution**: Use `weight_sync_method="lora"` or `"disk"` instead of tensor sync.
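A hypothetical sketch of that configuration — the import path and constructor signature for SGLangBackend are assumptions and may differ in your codebase:

```python
# Illustrative only: module path and keyword arguments are assumed.
from myproject.inference import SGLangBackend  # hypothetical import path

backend = SGLangBackend(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # example model
    tp_size=2,                  # tensor parallelism > 1
    weight_sync_method="lora",  # or "disk"; avoids update_weights_from_tensor
)
```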
### 3. OOM on Weight Update

**Reference**: [SGLang #8076](https://github.com/sgl-project/sglang/issues/8076)

**Solution**: Use disk-based sync or reduce `mem_fraction_static`.
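For example, when running SGLang's offline engine directly, `mem_fraction_static` can be lowered at construction time. A sketch assuming SGLang's `Engine` accepts this server argument as a keyword; the value 0.7 is illustrative (the default is around 0.9):

```python
import sglang as sgl

# Reserve a smaller fraction of GPU memory for SGLang's static allocation
# (weights + KV-cache pool) so weight updates have headroom; tune per GPU.
engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    mem_fraction_static=0.7,
)
```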
### 4. dp_size Must Be 1 for Weight Updates

**Reference**: [SGLang #4283](https://github.com/sgl-project/sglang/issues/4283)

**Solution**: Don't use data parallelism for inference (use TP instead).
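A sketch of the corresponding engine configuration, again assuming SGLang's offline `Engine` accepts the standard server arguments as keywords (the multi-GPU values are illustrative):

```python
import sglang as sgl

# Keep data parallelism off so weight updates remain supported,
# and scale across GPUs with tensor parallelism instead.
engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=4,  # shard the model across 4 GPUs
    dp_size=1,  # weight updates require dp_size == 1
)
```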
### 5. Garbled Output with Small Tensor Buckets

**Reference**: [SGLang #14178](https://github.com/sgl-project/sglang/issues/14178)

**Solution**: Use LoRA-based sync instead of tensor sync.
## Performance Comparison

Based on external benchmarks (H100, Llama 3.1 8B); improvement is SGLang relative to vLLM:

| Metric | vLLM | SGLang | Improvement |
|--------|------|--------|-------------|
| Throughput (tok/s) | ~12,500 | ~16,200 | ~29% |
| TTFT (ms) | ~45 | ~35 | ~22% |
| P99 Latency (ms) | ~120 | ~95 | ~21% |

*Source: [aimultiple.com benchmark](https://aimultiple.com/llm-inference-benchmark)*

The performance advantage comes from:
- RadixAttention's automatic prefix caching
- Zero-overhead scheduler design
- Optimized FlashInfer kernels

## Benchmarking Your Setup
