| title | SGLang |
|---|---|
| description | Deploying DeepSeek-R1-Distill-Llama models using SGLang on NVIDIA and AMD GPUs |
This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using SGLang and dstack.
The service configurations below deploy the 8B model on an NVIDIA GPU and the 70B model on an AMD GPU.
=== "NVIDIA"
<div editor-title="examples/inference/sglang/nvidia/.dstack.yml">
```yaml
type: service
name: deepseek-r1

image: lmsysorg/sglang:latest
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B

commands:
  - python3 -m sglang.launch_server
    --model-path $MODEL_ID
    --port 8000
    --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B

resources:
  gpu: 24GB
```
</div>
=== "AMD"
<div editor-title="examples/inference/sglang/amd/.dstack.yml">
```yaml
type: service
name: deepseek-r1

image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B

commands:
  - python3 -m sglang.launch_server
    --model-path $MODEL_ID
    --port 8000
    --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: MI300x
  disk: 300GB
```
</div>
To run a configuration, use the `dstack apply` command.
$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49
Submit the run deepseek-r1? [y/n]: y
Provisioning...
---> 100%

If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"stream": true,
"max_tokens": 512
}'

!!! info "Run router and workers separately"
    To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See PD disaggregation.
If a gateway is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
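The two endpoint forms and the curl request above can be sketched in Python using only the standard library. This is a minimal sketch, not part of the dstack or SGLang APIs: the helper names (`service_base_url`, `build_chat_request`, `iter_stream_text`, `stream_chat`) are hypothetical, and the stream parser assumes the server emits OpenAI-style server-sent events (`data: {...}` lines terminated by `data: [DONE]`).

```python
import json
import urllib.request


def service_base_url(run_name, project="main",
                     server_url="http://127.0.0.1:3000", gateway_domain=None):
    """Return the service base URL, with or without a gateway."""
    if gateway_domain:
        # With a gateway: https://<run name>.<gateway domain>/
        return f"https://{run_name}.{gateway_domain}/"
    # Without a gateway: <dstack server URL>/proxy/services/<project name>/<run name>/
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}/"


def build_chat_request(base_url, token, model, prompt, max_tokens=512):
    """Build the same POST request as the curl example (streaming enabled)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "stream": True,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(request):
    """Send the request and print the streamed completion as it arrives."""
    with urllib.request.urlopen(request) as response:
        for fragment in iter_stream_text(response):
            print(fragment, end="", flush=True)
```

To try it against a running service, build a request with `build_chat_request(service_base_url("deepseek-r1"), "<dstack token>", "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "What is Deep Learning?")` and pass it to `stream_chat`. Pass `gateway_domain=...` to `service_base_url` when a gateway is configured.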
To run SGLang with PD disaggregation, run the router as a replica on a CPU-only host, while running prefill/decode workers as replicas on GPU hosts.
type: service
name: prefill-decode
image: lmsysorg/sglang:latest
env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
replicas:
  - count: 1
    # For now replica group with router must have count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4
  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200
  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200
port: 8000
model: zai-org/GLM-4.5-Air-FP8
# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s

Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
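To build intuition for how a `count` range and an `rps` target interact, here is a rough model of a target-based scaling rule. It is an assumption made for illustration, not dstack's actual autoscaler, which may measure RPS differently and apply delays or smoothing. The function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(rps, target, min_count, max_count):
    """Estimate a replica count for an observed RPS and a per-replica target.

    Assumed rule (illustrative only): one replica per `target` requests per
    second, rounded up, then clamped to the configured `count` range.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(rps / target)
    return max(min_count, min(max_count, wanted))


# Prefill group from the configuration above: count 1..4, target 3
print(desired_replicas(10, 3, 1, 4))  # 4
# Decode group: count 1..8, target 2
print(desired_replicas(7, 2, 1, 8))   # 4
```

Under this model, the lower bound of `count` keeps at least one replica running when traffic drops to zero, and the upper bound caps the group regardless of load.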
Create a fleet that can provision both a CPU node (for the router replica group) and GPU nodes (for the prefill/decode replica groups). You can use an SSH fleet, an elastic cloud fleet (`nodes: 0..`), or a Kubernetes cluster. Leave resource constraints out of the fleet configuration, and dstack will provision the right instances (both CPU and GPU, in the same fleet) based on the `resources` specified for each replica group under `replicas` in the run configuration.
The only requirement is that the router and worker replicas run in the same network. In practice, this typically means using a single fleet where the backend and region are the same, or using `placement: cluster` if the backend supports it.
!!! note "Gateway-based routing (deprecated)"
    If you create a gateway with the sglang router, you can also run SGLang with PD disaggregation. This method is deprecated and will be disallowed in a future release in favor of running the router as a replica.
The source code for these examples can be found in `examples/llms/deepseek/sglang` and `examples/inference/sglang`.
- Read about services and gateways
- Browse SGLang DeepSeek Usage and Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X