Merged
61 changes: 36 additions & 25 deletions docs/blog/posts/pd-disaggregation.md
jvstme marked this conversation as resolved.
@@ -28,7 +28,7 @@ For inference, `dstack` provides a [services](../../docs/concepts/services.md) a

## Services

With `dstack` `0.20.10`, you can define a service with separate replica groups for Prefill and Decode workers and enable PD disaggregation directly in the `router` configuration.
With `dstack` `0.20.17`, you can define a service with separate replica groups for the router, Prefill, and Decode workers and run PD-disaggregated inference.

<div editor-title="glm45air.dstack.yml">

@@ -43,6 +43,21 @@ env:
image: lmsysorg/sglang:latest

replicas:
- count: 1
  # For now, the replica group with the router must have count: 1
commands:
- pip install sglang_router
- |
python -m sglang_router.launch_router \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
router:
type: sglang
resources:
cpu: 4

- count: 1..4
scaling:
metric: rps
@@ -52,7 +67,7 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
@@ -68,7 +83,7 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000
resources:
@@ -79,12 +94,8 @@ model: zai-org/GLM-4.5-Air-FP8

probes:
- type: http
url: /health_generate
url: /health
interval: 15s

router:
type: sglang
pd_disaggregation: true
```

</div>
@@ -100,32 +111,32 @@ $ dstack apply -f glm45air.dstack.yml

</div>
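The `--prefill-policy cache_aware` flag above routes each prompt to the prefill worker most likely to already hold its KV cache, so the prefix does not have to be recomputed. A toy sketch of the idea (an illustration only, not SGLang's actual implementation; worker names and cache contents are hypothetical):

```python
def shared_prefix_len(cached: list[str], prompt: list[str]) -> int:
    """Length of the common token prefix between a cached sequence and a prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def pick_prefill_worker(caches: dict[str, list[list[str]]], prompt: list[str]) -> str:
    """Route to the worker whose cached prefixes overlap the prompt the most."""
    def best_overlap(worker: str) -> int:
        return max((shared_prefix_len(seq, prompt) for seq in caches[worker]), default=0)
    return max(caches, key=best_overlap)
```

The real router tracks served prefixes far more efficiently (e.g. in a radix tree); a plain list of token sequences stands in for that here.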

### Gateway
### SSH fleet

Just like `dstack` relies on the SGLang router for cache-aware routing, Prefill–Decode disaggregation also requires a [gateway](../../docs/concepts/gateways.md#sglang) configured with the SGLang router.
Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are on the same network.

<div editor-title="gateway-sglang.dstack.yml">
<div editor-title="pd-fleet.dstack.yml">

```yaml
type: gateway
name: inference-gateway

backends: [kubernetes]
region: any

domain: example.com

router:
type: sglang
policy: cache_aware
type: fleet
name: pd-disagg

placement: cluster

ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 89.169.108.16 # CPU Host (router)
- 89.169.123.100 # GPU Host (prefill/decode workers)
- 89.169.110.65 # GPU Host (prefill/decode workers)
```

</div>

## Limitations

* Because the SGLang router requires all workers to be on the same network, and `dstack` currently runs the router inside the gateway, the gateway and the service must be running in the same cluster.
* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).
* The router replica group is currently limited to `count: 1`; support for multiple router replicas for high availability is planned.
* Prefill–Decode disaggregation is currently available with the SGLang backend (NVIDIA Dynamo and vLLM support are coming).
* Autoscaling supports RPS as the metric for now; TTFT and ITL metrics are planned next.

With native support for inference and now Prefill–Decode disaggregation, `dstack` makes it easier to run high-throughput, low-latency model serving across GPU clouds, Kubernetes, and bare-metal clusters.
6 changes: 5 additions & 1 deletion docs/docs/concepts/gateways.md
@@ -95,7 +95,11 @@ router:

If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).

> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
!!! note "PD disaggregation"
To run services with PD disaggregation, see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).

!!! note "Deprecation"
Configuring the SGLang router in a gateway will be deprecated in a future release.

??? info "Policy"
The `policy` property allows you to configure the routing policy:
3 changes: 1 addition & 2 deletions docs/docs/concepts/services.md
@@ -107,7 +107,6 @@ If [authorization](#authorization) is not disabled, the service endpoint require
Here are cases where a service may need a [gateway](gateways.md):

* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
* To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)
@@ -234,7 +233,7 @@ Setting the minimum number of replicas to `0` allows the service to scale down t

### PD disaggregation

If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).

### Authorization

67 changes: 41 additions & 26 deletions examples/inference/sglang/README.md
@@ -108,16 +108,16 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
```
</div>

!!! info "Router policy"
If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
!!! info "Run router and workers separately"
To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See [PD disaggregation](#pd-disaggregation).

> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.

## Configuration options

### PD disaggregation

If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), run the **router as a replica** on a CPU-only host, while running **prefill/decode workers** as replicas on GPU hosts.

<div editor-title="examples/inference/sglang/pd.dstack.yml">

@@ -131,6 +131,21 @@ env:
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1
  # For now, the replica group with the router must have count: 1
commands:
- pip install sglang_router
- |
python -m sglang_router.launch_router \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
router:
type: sglang
resources:
cpu: 4

- count: 1..4
scaling:
metric: rps
@@ -140,7 +155,7 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
@@ -156,53 +171,53 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000
resources:
gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8
# SSH fleet containing both router (CPU) and workers (GPU).
fleets: [pd-disagg]

# Custom probe is required for PD disaggregation
# Custom probe is required for PD disaggregation.
> **Collaborator:** (nit) By the way, is it still required? I thought `sync_router_workers_for_run_model` could gracefully handle the router or workers not being ready, and perform the registration eventually, once they become ready.
>
> **Collaborator (author):** Yes, this is still required. The probe queries `/v1/chat/completions` to register the job, but the router fails to serve `/v1/chat/completions` until workers are registered. Meanwhile, the router–worker sync pipeline only considers RUNNING jobs that are also `registered=True`.
>
> **Collaborator:** Oh, I see, so our default probe is the problem. But I assume it's possible to work around it by either setting `probes: []` or not setting `model`. If that's the case, a custom probe is more of a recommendation, not a strict requirement. Anyway, I think we were going to improve the UX here by introducing a different default probe for services with the SGLang router. Not in this PR, of course.

probes:
- type: http
url: /health_generate
url: /health
interval: 15s

router:
type: sglang
pd_disaggregation: true
```

</div>
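The `/health` probe above simply expects an HTTP 200 from the router once it is up. Conceptually it behaves like the polling loop below (a minimal sketch; the function name, URL, and timings are placeholders, not `dstack`'s actual probe implementation):

```python
import time
import urllib.request


def wait_healthy(url: str, interval: float = 15.0, timeout: float = 300.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # endpoint not reachable yet; keep polling
        time.sleep(interval)
    return False
```

Pointing such a loop at `/v1/chat/completions` instead would fail until the workers register with the router, which is why the custom `/health` probe is used here.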

Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
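The `rps` metric maps observed load to a replica count within the configured range. A back-of-the-envelope sketch of that calculation (an illustration of the idea, not `dstack`'s actual scaler):

```python
import math


def desired_replicas(current_rps: float, target_rps: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so that per-replica RPS stays at or below the target."""
    needed = math.ceil(current_rps / target_rps)
    return max(min_replicas, min(max_replicas, needed))
```

For example, with a `target` of 3 RPS per replica and a `count: 1..4` range, a measured load of 10 RPS yields 4 replicas.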

#### Gateway

Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
#### SSH fleet

For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are on the same network.

<div editor-title="gateway.dstack.yml">
<div editor-title="pd-fleet.dstack.yml">

```yaml
type: gateway
name: gateway-name

backend: kubernetes
region: any

domain: example.com
router:
type: sglang
type: fleet
name: pd-disagg

placement: cluster

ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 89.169.108.16 # CPU Host (router)
- 89.169.123.100 # GPU Host (prefill/decode workers)
- 89.169.110.65 # GPU Host (prefill/decode workers)
```

</div>

<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
!!! note "Gateway-based routing (deprecated)"
    If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can also run SGLang with PD disaggregation. This method will be deprecated in a future release in favor of running the router as a replica.

## Source code

12 changes: 12 additions & 0 deletions examples/inference/sglang/pd-disagg.fleet.dstack.yml
@@ -0,0 +1,12 @@
type: fleet
name: pd-disagg

placement: cluster

ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 89.169.108.16 # CPU Host (router)
- 89.169.123.100 # GPU Host (prefill/decode workers)
- 89.169.110.65 # GPU Host (prefill/decode workers)
54 changes: 54 additions & 0 deletions examples/inference/sglang/pd.deprecated.dstack.yml
@@ -0,0 +1,54 @@
# DEPRECATED: Gateway-based PD disaggregation config.
# Use `pd.dstack.yml` instead (router runs as a replica).

type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
- HF_TOKEN
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1..4
scaling:
metric: rps
target: 3
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: 1

- count: 1..8
scaling:
metric: rps
target: 2
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000
resources:
gpu: 1

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
- type: http
url: /health_generate
interval: 15s

router:
type: sglang
pd_disaggregation: true