Merged
6 changes: 4 additions & 2 deletions docs/blog/posts/pd-disaggregation.md
jvstme marked this conversation as resolved.
@@ -26,6 +26,9 @@ For inference, `dstack` provides a [services](../../docs/concepts/services.md) a

> If you’re new to Prefill–Decode disaggregation, see the official [SGLang docs](https://docs.sglang.io/advanced_features/pd_disaggregation.html).

!!! note "Deprecation notice"
Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release. To run router and workers as separate replica groups, see [SGLang PD disaggregation (router as replica group)](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).

## Services

With `dstack` `0.20.10`, you can define a service with separate replica groups for Prefill and Decode workers and enable PD disaggregation directly in the `router` configuration.
@@ -123,10 +126,9 @@ router:
</div>

## Limitations

* Because the SGLang router requires all workers to be on the same network, and `dstack` currently runs the router inside the gateway, the gateway and the service must be running in the same cluster.
* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).
* Autoscaling supports RPS as the metric for now; TTFT and ITL metrics are planned next.
* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).

With native support for inference and now Prefill–Decode disaggregation, `dstack` makes it easier to run high-throughput, low-latency model serving across GPU clouds, Kubernetes, and bare-metal clusters.

6 changes: 5 additions & 1 deletion docs/docs/concepts/gateways.md
@@ -95,7 +95,11 @@ router:

If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).

> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
!!! note "PD disaggregation"
To run services with PD disaggregation, see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).

!!! note "Deprecation"
Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.

??? info "Policy"
The `policy` property allows you to configure the routing policy:
3 changes: 1 addition & 2 deletions docs/docs/concepts/services.md
@@ -107,7 +107,6 @@ If [authorization](#authorization) is not disabled, the service endpoint require
Here are cases where a service may need a [gateway](gateways.md):

* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
* To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)
@@ -234,7 +233,7 @@ Setting the minimum number of replicas to `0` allows the service to scale down t

### PD disaggregation

If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).

### Authorization

59 changes: 28 additions & 31 deletions examples/inference/sglang/README.md
@@ -108,16 +108,16 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
```
</div>

!!! info "Router policy"
If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
!!! info "Run router and workers separately"
To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See [PD disaggregation](#pd-disaggregation).

> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.

## Configuration options

### PD disaggregation

If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), run the **router as a replica** on a CPU-only host, while running **prefill/decode workers** as replicas on GPU hosts.

<div editor-title="examples/inference/sglang/pd.dstack.yml">

@@ -131,6 +131,21 @@ env:
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1
# For now, the replica group with the router must have count: 1
commands:
- pip install sglang_router
- |
python -m sglang_router.launch_router \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
router:
type: sglang
resources:
cpu: 4

- count: 1..4
scaling:
metric: rps
@@ -140,7 +155,7 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
@@ -156,7 +171,7 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000
resources:
@@ -165,44 +180,26 @@ replicas:
port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation
# Custom probe is required for PD disaggregation.
> **Collaborator:** (nit) By the way, is it still required? I thought `sync_router_workers_for_run_model` can gracefully handle the router or workers not being ready, and perform the registration eventually, once they become ready.
>
> **Author:** Yes, this is still required, because probes query `/v1/chat/completions` to register the job, but the router fails to serve `/v1/chat/completions` until workers are registered. Meanwhile, the router-worker sync pipeline only considers RUNNING jobs that are also `registered=True`.
>
> **Collaborator:** Oh, I see, so our default probe is the problem. But I assume it's possible to work around it by either setting `probes: []` or not setting `model`. If that's the case, a custom probe is more of a recommendation, not a strict requirement. Anyway, I think we were going to improve the UX here by introducing a different default probe for services with the SGLang router, though not in this PR, of course.

probes:
- type: http
url: /health_generate
url: /health
interval: 15s

router:
type: sglang
pd_disaggregation: true
```

</div>

Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
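
For reference, a minimal sketch of the worker scaling block used in the example above (the `target` value is illustrative):

```yaml
replicas:
  - count: 1..8       # decode workers scale between 1 and 8 replicas
    scaling:
      metric: rps     # requests per second; the only metric supported today
      target: 2       # scale out when average RPS per replica exceeds this
```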

#### Gateway

Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
#### Fleet

For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
Create a [fleet](https://dstack.ai/docs/concepts/fleets/) that can provision both a CPU node (for the router replica group) and GPU nodes (for the prefill/decode replica groups).
You can create an SSH fleet, an elastic cloud fleet (`nodes: 0..`), or a Kubernetes cluster. Don't specify any resource constraints in the fleet; `dstack` will automatically provision the correct instances (both CPU and GPU, in the same fleet) based on the resources specified for each replica group in the run configuration.

<div editor-title="gateway.dstack.yml">

```yaml
type: gateway
name: gateway-name

backend: kubernetes
region: any

domain: example.com
router:
type: sglang
```

</div>
The only requirement is that the router and worker replicas run on the same network. In practice, this typically means using a single fleet with the same backend and region, or `placement: cluster` if the backend supports it.
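
As a sketch, an elastic cloud fleet for this setup could look like the following (assuming the `nodes` and `placement` fleet properties; the name is illustrative):

```yaml
type: fleet
name: pd-disagg-cloud

# Provision instances on demand as replica groups require them
nodes: 0..

# Keep all instances on the same network so the router can reach workers
placement: cluster
```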

<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
!!! note "Gateway-based routing (deprecated)"
If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can also run SGLang with PD disaggregation. This method is deprecated and will be disallowed in a future release in favor of running the router as a replica.

## Source code

12 changes: 12 additions & 0 deletions examples/inference/sglang/pd-disagg.fleet.dstack.yml
@@ -0,0 +1,12 @@
type: fleet
name: pd-disagg

placement: cluster

ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 89.169.108.16 # CPU Host (router)
- 89.169.123.100 # GPU Host (prefill/decode workers)
- 89.169.110.65 # GPU Host (prefill/decode workers)
54 changes: 54 additions & 0 deletions examples/inference/sglang/pd.deprecated.dstack.yml
@@ -0,0 +1,54 @@
# DEPRECATED: Gateway-based PD disaggregation config.
# Use `pd.dstack.yml` instead (router runs as a replica).

type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
- HF_TOKEN
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1..4
scaling:
metric: rps
target: 3
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: 1

- count: 1..8
scaling:
metric: rps
target: 2
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000
resources:
gpu: 1

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
- type: http
url: /health_generate
interval: 15s

router:
type: sglang
pd_disaggregation: true
29 changes: 20 additions & 9 deletions examples/inference/sglang/pd.dstack.yml
@@ -7,6 +7,21 @@ env:
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1
# For now, the replica group with the router must have count: 1
commands:
- pip install sglang_router
- |
python -m sglang_router.launch_router \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
router:
type: sglang
resources:
cpu: 4

- count: 1..4
scaling:
metric: rps
@@ -16,12 +31,12 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: 1
gpu: H200

- count: 1..8
scaling:
@@ -32,20 +47,16 @@ replicas:
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000
resources:
gpu: 1
gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
- type: http
url: /health_generate
url: /health
interval: 15s

router:
type: sglang
pd_disaggregation: true
4 changes: 4 additions & 0 deletions src/dstack/_internal/core/compatibility/runs.py
@@ -101,6 +101,10 @@ def get_run_spec_excludes(run_spec: RunSpec) -> IncludeExcludeDictType:
if run_spec.configuration.https is None:
configuration_excludes["https"] = True

replicas = run_spec.configuration.replicas
if isinstance(replicas, list) and all(g.router is None for g in replicas):
configuration_excludes["replicas"] = {"__all__": {"router": True}}

if configuration_excludes:
spec_excludes["configuration"] = configuration_excludes
if profile_excludes:
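
The `{"__all__": {"router": True}}` shape above follows pydantic's nested-exclude convention, where an `__all__` entry applies a sub-exclude to every item of a list field. As a rough sketch, the semantics can be mimicked with plain dicts (hypothetical helper, not dstack code):

```python
def apply_excludes(value, excludes):
    """Recursively drop keys per a pydantic-style exclude mapping.

    `True` drops a key entirely; an `__all__` entry applies its
    sub-exclude to every item of a list.
    """
    if isinstance(value, list):
        sub = excludes.get("__all__", {})
        return [apply_excludes(item, sub) for item in value]
    if isinstance(value, dict):
        out = {}
        for key, val in value.items():
            sub = excludes.get(key)
            if sub is True:
                continue  # key excluded entirely
            out[key] = apply_excludes(val, sub) if sub else val
        return out
    return value


config = {"replicas": [{"count": 1, "router": None}, {"count": 2, "router": None}]}
cleaned = apply_excludes(config, {"replicas": {"__all__": {"router": True}}})
# `router` is dropped from every replica group
```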
39 changes: 38 additions & 1 deletion src/dstack/_internal/core/models/configurations.py
@@ -28,7 +28,7 @@
parse_off_duration,
)
from dstack._internal.core.models.resources import Range, ResourcesSpec
from dstack._internal.core.models.routers import AnyServiceRouterConfig
from dstack._internal.core.models.routers import AnyServiceRouterConfig, ReplicaGroupRouterConfig
from dstack._internal.core.models.services import AnyModel, OpenAIChatModel
from dstack._internal.core.models.unix import UnixUser
from dstack._internal.core.models.volumes import MountPoint, VolumeConfiguration, parse_mount_point
@@ -801,6 +801,12 @@ class ReplicaGroup(CoreModel):
CommandsList,
Field(description="The shell commands to run for replicas in this group"),
] = []
router: Annotated[
Optional[ReplicaGroupRouterConfig],
Field(
description="When set, replicas in this group run the in-service HTTP router (e.g. SGLang).",
),
] = None

@validator("name")
def validate_name(cls, v: Optional[str]) -> Optional[str]:
@@ -1032,6 +1038,37 @@ def validate_replica_groups_have_commands_or_image(cls, values):

return values

@root_validator()
def validate_at_most_one_router_replica_group(cls, values):
replicas = values.get("replicas")
if not isinstance(replicas, list):
return values
router_groups = [g for g in replicas if g.router is not None]
if len(router_groups) > 1:
raise ValueError("At most one replica group may specify `router`.")
if router_groups:
router_group = router_groups[0]
if router_group.count.min != 1 or router_group.count.max != 1:
raise ValueError("For now replica group with `router` must have `count: 1`.")
return values

@root_validator()
def validate_replica_group_router_mutex(cls, values):
"""
When a replica group sets `router:`, service-level `router` must be omitted.
(Gateway-level SGLang is rejected at service registration when a gateway is selected.)
"""
replicas = values.get("replicas")
if not isinstance(replicas, list):
return values
if not any(g.router is not None for g in replicas):
return values
if values.get("router") is not None:
raise ValueError(
"Service-Level router configuration is not allowed together with replica-group `router`."
)
return values
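
Taken together, the two validators above enforce a simple invariant: at most one router group, pinned to a single replica, with no service-level `router` alongside it. A standalone sketch of the same checks on plain dicts (names hypothetical, not dstack code):

```python
def check_router_groups(replicas, service_router=None):
    """Mimic the root validators: at most one replica group may set
    `router`, it must be pinned to count 1, and the service-level
    `router` must then be omitted."""
    router_groups = [g for g in replicas if g.get("router") is not None]
    if len(router_groups) > 1:
        raise ValueError("At most one replica group may specify `router`.")
    if router_groups:
        if router_groups[0]["count"] != (1, 1):  # (min, max) replica range
            raise ValueError("Replica group with `router` must have `count: 1`.")
        if service_router is not None:
            raise ValueError("Service-level `router` not allowed with replica-group `router`.")


# Valid: one router group pinned to a single replica, workers scale freely
check_router_groups([
    {"count": (1, 1), "router": {"type": "sglang"}},
    {"count": (1, 4), "router": None},
])
```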


class ServiceConfigurationConfig(
ProfileParamsConfig,
7 changes: 7 additions & 0 deletions src/dstack/_internal/core/models/routers.py
@@ -43,5 +43,12 @@ class SGLangServiceRouterConfig(CoreModel):
] = False


class ReplicaGroupRouterConfig(CoreModel):
type: Annotated[
Literal["sglang"],
Field(description="The router implementation for this replica group."),
] = "sglang"


AnyServiceRouterConfig = SGLangServiceRouterConfig
AnyGatewayRouterConfig = SGLangGatewayRouterConfig
@@ -47,8 +47,8 @@ server {
}
{% endfor %}

{# For SGLang router: block all requests except whitelisted locations added dynamically above #}
{% if router is not none and router.type == "sglang" %}
{# For router services: block all requests except whitelisted locations added dynamically above #}
{% if has_router_replica or (router is not none and router.type == "sglang") %}
location / {
return 403;
}
1 change: 1 addition & 0 deletions src/dstack/_internal/proxy/gateway/routers/registry.py
@@ -36,6 +36,7 @@ async def register_service(
model=body.options.openai.model if body.options.openai is not None else None,
ssh_private_key=body.ssh_private_key,
repo=repo,
has_router_replica=body.has_router_replica,
router=body.router,
nginx=nginx,
service_conn_pool=service_conn_pool,