Merged
6 changes: 4 additions & 2 deletions docs/blog/posts/pd-disaggregation.md
jvstme marked this conversation as resolved.
@@ -26,6 +26,9 @@ For inference, `dstack` provides a [services](../../docs/concepts/services.md) a

> If you’re new to Prefill–Decode disaggregation, see the official [SGLang docs](https://docs.sglang.io/advanced_features/pd_disaggregation.html).

!!! note "Deprecation notice"
Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release. To run router and workers as separate replica groups, see [SGLang PD disaggregation (router as replica group)](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).

## Services

With `dstack` `0.20.10`, you can define a service with separate replica groups for Prefill and Decode workers and enable PD disaggregation directly in the `router` configuration.
@@ -123,10 +126,9 @@ router:
</div>

## Limitations

* Because the SGLang router requires all workers to be on the same network, and `dstack` currently runs the router inside the gateway, the gateway and the service must be running in the same cluster.
* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).
* Autoscaling supports RPS as the metric for now; TTFT and ITL metrics are planned next.
* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).

With native support for inference and now Prefill–Decode disaggregation, `dstack` makes it easier to run high-throughput, low-latency model serving across GPU clouds, Kubernetes, and bare-metal clusters.

6 changes: 5 additions & 1 deletion docs/docs/concepts/gateways.md
@@ -95,7 +95,11 @@ router:

If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).

> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
!!! note "PD disaggregation"
To run services with PD disaggregation, see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).

!!! note "Deprecation"
Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.

??? info "Policy"
The `policy` property allows you to configure the routing policy:
3 changes: 1 addition & 2 deletions docs/docs/concepts/services.md
@@ -107,7 +107,6 @@ If [authorization](#authorization) is not disabled, the service endpoint require
Here are cases where a service may need a [gateway](gateways.md):

* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
* To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)
@@ -234,7 +233,7 @@ Setting the minimum number of replicas to `0` allows the service to scale down t

### PD disaggregation

If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).

### Authorization

59 changes: 28 additions & 31 deletions examples/inference/sglang/README.md
@@ -108,16 +108,16 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
```
</div>

!!! info "Router policy"
If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
!!! info "Run router and workers separately"
To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See [PD disaggregation](#pd-disaggregation).

> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.

## Configuration options

### PD disaggregation

If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), run the **router as a replica** on a CPU-only host, while running **prefill/decode workers** as replicas on GPU hosts.

<div editor-title="examples/inference/sglang/pd.dstack.yml">

@@ -131,6 +131,21 @@ env:
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1
# For now, the replica group with the router must have count: 1
commands:
- pip install sglang_router
- |
python -m sglang_router.launch_router \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
router:
type: sglang
resources:
cpu: 4

- count: 1..4
scaling:
metric: rps
@@ -140,7 +155,7 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
@@ -156,7 +171,7 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000
resources:
@@ -165,44 +180,26 @@ replicas:
port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation
# Custom probe is required for PD disaggregation.
> **Collaborator:** (nit) By the way, is it still required? I thought `sync_router_workers_for_run_model` can gracefully handle the router or workers not being ready, and perform the registration eventually, once they become ready.
>
> **Author:** Yes, this is still required, because probes query `/v1/chat/completions` to register the job, but the router fails to serve `/v1/chat/completions` until workers are registered. Meanwhile, the router-worker sync pipeline only considers RUNNING jobs that are also `registered=True`.
>
> **Collaborator:** Oh, I see, so our default probe is the problem. But I assume it's possible to work around it by either setting `probes: []` or not setting `model`. If that's the case, a custom probe is more of a recommendation, not a strict requirement. Anyway, I think we were going to improve the UX here by introducing a different default probe for services with the SGLang router, though not in this PR, of course.

probes:
- type: http
url: /health_generate
url: /health
interval: 15s

router:
type: sglang
pd_disaggregation: true
```

</div>

Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
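
For reference, a minimal sketch of the worker scaling block used in the example above (the `target` value is illustrative):

```yaml
replicas:
  - count: 1..8       # decode workers scale between 1 and 8 replicas
    scaling:
      metric: rps     # requests per second; the only metric supported today
      target: 2       # scale out when average RPS per replica exceeds this
```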

#### Gateway

Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
#### Fleet

For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
Create a [fleet](https://dstack.ai/docs/concepts/fleets/) that can provision both a CPU node (for the router replica group) and GPU nodes (for the prefill/decode replica groups).
You can create an SSH fleet, an elastic cloud fleet (`nodes: 0..`), or a Kubernetes cluster. Don't specify any resource constraints in the fleet; `dstack` will automatically provision the correct instances (both CPU and GPU, in the same fleet) based on the resources specified for each replica group in the run configuration.

<div editor-title="gateway.dstack.yml">

```yaml
type: gateway
name: gateway-name

backend: kubernetes
region: any

domain: example.com
router:
type: sglang
```

</div>
The only requirement is that the router and worker replicas run on the same network. In practice, this typically means using a single fleet with the same backend and region, or `placement: cluster` if the backend supports it.
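
As a sketch, an elastic cloud fleet for this setup could look like the following (assuming the `nodes` and `placement` fleet properties; the name is illustrative):

```yaml
type: fleet
name: pd-disagg-cloud

# Provision instances on demand as replica groups require them
nodes: 0..

# Keep all instances on the same network so the router can reach workers
placement: cluster
```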

<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
!!! note "Gateway-based routing (deprecated)"
If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can also run SGLang with PD disaggregation. This method is deprecated and will be disallowed in a future release in favor of running the router as a replica.

## Source code

12 changes: 12 additions & 0 deletions examples/inference/sglang/pd-disagg.fleet.dstack.yml
@@ -0,0 +1,12 @@
type: fleet
name: pd-disagg

placement: cluster

ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 89.169.108.16 # CPU Host (router)
- 89.169.123.100 # GPU Host (prefill/decode workers)
- 89.169.110.65 # GPU Host (prefill/decode workers)
54 changes: 54 additions & 0 deletions examples/inference/sglang/pd.deprecated.dstack.yml
@@ -0,0 +1,54 @@
# DEPRECATED: Gateway-based PD disaggregation config.
# Use `pd.dstack.yml` instead (router runs as a replica).

type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
- HF_TOKEN
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1..4
scaling:
metric: rps
target: 3
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: 1

- count: 1..8
scaling:
metric: rps
target: 2
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000
resources:
gpu: 1

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
- type: http
url: /health_generate
interval: 15s

router:
type: sglang
pd_disaggregation: true
29 changes: 20 additions & 9 deletions examples/inference/sglang/pd.dstack.yml
@@ -7,6 +7,21 @@ env:
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1
# For now, the replica group with the router must have count: 1
commands:
- pip install sglang_router
- |
python -m sglang_router.launch_router \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
router:
type: sglang
resources:
cpu: 4

- count: 1..4
scaling:
metric: rps
@@ -16,12 +31,12 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: 1
gpu: H200

- count: 1..8
scaling:
@@ -32,20 +47,16 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000
resources:
gpu: 1
gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
- type: http
url: /health_generate
url: /health
interval: 15s

router:
type: sglang
pd_disaggregation: true
4 changes: 4 additions & 0 deletions src/dstack/_internal/core/compatibility/runs.py
@@ -101,6 +101,10 @@ def get_run_spec_excludes(run_spec: RunSpec) -> IncludeExcludeDictType:
if run_spec.configuration.https is None:
configuration_excludes["https"] = True

replicas = run_spec.configuration.replicas
if isinstance(replicas, list) and all(g.router is None for g in replicas):
configuration_excludes["replicas"] = {"__all__": {"router": True}}

if configuration_excludes:
spec_excludes["configuration"] = configuration_excludes
if profile_excludes:
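
The `{"__all__": {"router": True}}` shape above follows pydantic's nested-exclude convention, where an `__all__` entry applies a sub-exclude to every item of a list field. As a rough sketch, the semantics can be mimicked with plain dicts (hypothetical helper, not dstack code):

```python
def apply_excludes(value, excludes):
    """Recursively drop keys per a pydantic-style exclude mapping.

    `True` drops a key entirely; an `__all__` entry applies its
    sub-exclude to every item of a list.
    """
    if isinstance(value, list):
        sub = excludes.get("__all__", {})
        return [apply_excludes(item, sub) for item in value]
    if isinstance(value, dict):
        out = {}
        for key, val in value.items():
            sub = excludes.get(key)
            if sub is True:
                continue  # key excluded entirely
            out[key] = apply_excludes(val, sub) if sub else val
        return out
    return value


config = {"replicas": [{"count": 1, "router": None}, {"count": 2, "router": None}]}
cleaned = apply_excludes(config, {"replicas": {"__all__": {"router": True}}})
# `router` is dropped from every replica group
```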
39 changes: 38 additions & 1 deletion src/dstack/_internal/core/models/configurations.py
@@ -28,7 +28,7 @@
parse_off_duration,
)
from dstack._internal.core.models.resources import Range, ResourcesSpec
from dstack._internal.core.models.routers import AnyServiceRouterConfig
from dstack._internal.core.models.routers import AnyServiceRouterConfig, ReplicaGroupRouterConfig
from dstack._internal.core.models.services import AnyModel, OpenAIChatModel
from dstack._internal.core.models.unix import UnixUser
from dstack._internal.core.models.volumes import MountPoint, VolumeConfiguration, parse_mount_point
@@ -801,6 +801,12 @@ class ReplicaGroup(CoreModel):
CommandsList,
Field(description="The shell commands to run for replicas in this group"),
] = []
router: Annotated[
Optional[ReplicaGroupRouterConfig],
Field(
description="When set, replicas in this group run the in-service HTTP router (e.g. SGLang).",
),
] = None

@validator("name")
def validate_name(cls, v: Optional[str]) -> Optional[str]:
@@ -1032,6 +1038,37 @@ def validate_replica_groups_have_commands_or_image(cls, values):

return values

@root_validator()
def validate_at_most_one_router_replica_group(cls, values):
replicas = values.get("replicas")
if not isinstance(replicas, list):
return values
router_groups = [g for g in replicas if g.router is not None]
if len(router_groups) > 1:
raise ValueError("At most one replica group may specify `router`.")
if router_groups:
router_group = router_groups[0]
if router_group.count.min != 1 or router_group.count.max != 1:
raise ValueError("For now replica group with `router` must have `count: 1`.")
return values

@root_validator()
def validate_replica_group_router_mutex(cls, values):
"""
When a replica group sets `router:`, service-level `router` must be omitted.
(Gateway-level SGLang is rejected at service registration when a gateway is selected.)
"""
replicas = values.get("replicas")
if not isinstance(replicas, list):
return values
if not any(g.router is not None for g in replicas):
return values
if values.get("router") is not None:
raise ValueError(
"Service-Level router configuration is not allowed together with replica-group `router`."
)
return values
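
Taken together, the two validators above enforce a simple invariant: at most one router group, pinned to a single replica, with no service-level `router` alongside it. A standalone sketch of the same checks on plain dicts (names hypothetical, not dstack code):

```python
def check_router_groups(replicas, service_router=None):
    """Mimic the root validators: at most one replica group may set
    `router`, it must be pinned to count 1, and the service-level
    `router` must then be omitted."""
    router_groups = [g for g in replicas if g.get("router") is not None]
    if len(router_groups) > 1:
        raise ValueError("At most one replica group may specify `router`.")
    if router_groups:
        if router_groups[0]["count"] != (1, 1):  # (min, max) replica range
            raise ValueError("Replica group with `router` must have `count: 1`.")
        if service_router is not None:
            raise ValueError("Service-level `router` not allowed with replica-group `router`.")


# Valid: one router group pinned to a single replica, workers scale freely
check_router_groups([
    {"count": (1, 1), "router": {"type": "sglang"}},
    {"count": (1, 4), "router": None},
])
```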


class ServiceConfigurationConfig(
ProfileParamsConfig,
7 changes: 7 additions & 0 deletions src/dstack/_internal/core/models/routers.py
@@ -43,5 +43,12 @@ class SGLangServiceRouterConfig(CoreModel):
] = False


class ReplicaGroupRouterConfig(CoreModel):
type: Annotated[
Literal["sglang"],
Field(description="The router implementation for this replica group."),
] = "sglang"


AnyServiceRouterConfig = SGLangServiceRouterConfig
AnyGatewayRouterConfig = SGLangGatewayRouterConfig
@@ -47,8 +47,8 @@ server {
}
{% endfor %}

{# For SGLang router: block all requests except whitelisted locations added dynamically above #}
{% if router is not none and router.type == "sglang" %}
{# For router services: block all requests except whitelisted locations added dynamically above #}
{% if has_router_replica or (router is not none and router.type == "sglang") %}
location / {
return 403;
}
1 change: 1 addition & 0 deletions src/dstack/_internal/proxy/gateway/routers/registry.py
@@ -36,6 +36,7 @@ async def register_service(
model=body.options.openai.model if body.options.openai is not None else None,
ssh_private_key=body.ssh_private_key,
repo=repo,
has_router_replica=body.has_router_replica,
router=body.router,
nginx=nginx,
service_conn_pool=service_conn_pool,