# Support router as replica with pipelines #3721
Changes from all commits: 2e46b95, 14bab7a, f99bdd2, 35120a3, 8481bd3, f04999e, b349f2c, 8fe01e5, c5a6716, 37a1c5a, 397cf98, cbb13f0, 59d246b, 274ad08, 0f8f1b6, 34baf13
````diff
@@ -108,16 +108,16 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
 ```
 </div>

-!!! info "Router policy"
-    If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
+!!! info "Run router and workers separately"
+    To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See [PD disaggregation](#pd-disaggregation).

 > If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.

 ## Configuration options

 ### PD disaggregation

-If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
+To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), run the **router as a replica** on a CPU-only host, while running **prefill/decode workers** as replicas on GPU hosts.

 <div editor-title="examples/inference/sglang/pd.dstack.yml">
@@ -131,6 +131,21 @@ env:
 - MODEL_ID=zai-org/GLM-4.5-Air-FP8

 replicas:
+- count: 1
+  # For now replica group with router must have count: 1
+  commands:
+  - pip install sglang_router
+  - |
+    python -m sglang_router.launch_router \
+      --host 0.0.0.0 \
+      --port 8000 \
+      --pd-disaggregation \
+      --prefill-policy cache_aware
+  router:
+    type: sglang
+  resources:
+    cpu: 4
+
 - count: 1..4
   scaling:
     metric: rps
@@ -140,7 +155,7 @@ replicas:
     python -m sglang.launch_server \
       --model-path $MODEL_ID \
       --disaggregation-mode prefill \
-      --disaggregation-transfer-backend mooncake \
+      --disaggregation-transfer-backend nixl \
       --host 0.0.0.0 \
       --port 8000 \
       --disaggregation-bootstrap-port 8998
@@ -156,7 +171,7 @@ replicas:
     python -m sglang.launch_server \
       --model-path $MODEL_ID \
       --disaggregation-mode decode \
-      --disaggregation-transfer-backend mooncake \
+      --disaggregation-transfer-backend nixl \
       --host 0.0.0.0 \
       --port 8000
   resources:
@@ -165,44 +180,26 @@ replicas:
 port: 8000
 model: zai-org/GLM-4.5-Air-FP8

-# Custom probe is required for PD disaggregation
+# Custom probe is required for PD disaggregation.
````
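The router above is launched with `--prefill-policy cache_aware`. As a rough illustration of the idea only (this is a hypothetical toy, not SGLang's actual algorithm or data structures), a cache-aware policy prefers the worker most likely to already hold the prompt's KV-cache prefix, falling back to load when no cache matches:

```python
def pick_worker(prompt: str, worker_caches: dict[str, list[str]]) -> str:
    """Toy cache-aware selection: route to the worker whose cached prompts
    share the longest prefix with the new prompt; tie-break by fewest
    cached entries (a crude stand-in for load)."""
    def prefix_len(a: str, b: str) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    return max(
        worker_caches,
        key=lambda w: (
            max((prefix_len(prompt, p) for p in worker_caches[w]), default=0),
            -len(worker_caches[w]),
        ),
    )
```

The real router tracks cached prefixes per worker in a radix tree and balances cache hits against load, but the routing decision it makes has this shape.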
**Collaborator:** (nit) By the way, is it still required? I thought […]

**Author (Collaborator):** Yes, this is still required, because the probe queries […]

**Collaborator:** Oh, I see, so our default probe is the problem. But I assume it's possible to work around it by either setting […] Anyway, I think we were going to improve the UX here by introducing a different default probe for services with the SGLang router. Not in this PR, of course.
````diff
 probes:
 - type: http
-  url: /health_generate
+  url: /health
   interval: 15s

-router:
-  type: sglang
-  pd_disaggregation: true
 ```

 </div>

 Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.

-#### Gateway
+#### Fleet

-Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
+Create a [fleet](https://dstack.ai/docs/concepts/fleets/) that can provision both a CPU node (for the router replica group) and GPU nodes (for the prefill/decode replica groups).
+You can create an SSH fleet, an elastic cloud fleet (`nodes: 0..`), or a Kubernetes cluster. Just don't specify any resource constraints in the fleet, and dstack will automatically provision the correct instances (both CPU and GPU, in the same fleet) based on the resources specified in the replicas in the run configuration.

-For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
-
-<div editor-title="gateway.dstack.yml">
-
-```yaml
-type: gateway
-name: gateway-name
-
-backend: kubernetes
-region: any
-
-domain: example.com
-router:
-  type: sglang
-```
-
-</div>
+The only requirement is that the router and worker replicas run in the same network. In practice, this typically means using a single fleet where the backend and region are the same, or using `placement: cluster` if the backend supports it.

-<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
+!!! note "Gateway-based routing (deprecated)"
+    If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can also run SGLang with PD disaggregation. This method is deprecated and will be disallowed in a future release in favor of running the router as a replica.

 ## Source code
````
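The `scaling` blocks in the config (e.g. `count: 1..4` with `metric: rps` and `target: 3`) imply a simple control loop: scale each replica group so that per-replica request rate stays near the target. A rough sketch of rps target tracking, under the assumption that it works like a standard target-tracking autoscaler (this is not dstack's actual control loop):

```python
import math

def desired_replicas(current_rps: float, target_rps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale so each replica serves ~target rps, clamped to the
    configured range (mirrors `count: 1..4` + `target: 3`)."""
    if target_rps_per_replica <= 0:
        return min_replicas
    desired = math.ceil(current_rps / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, desired))
```

For example, with `target: 3` and `count: 1..4`, a measured 7 rps would call for 3 replicas, while 100 rps is clamped to the group maximum of 4.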
New file (`@@ -0,0 +1,12 @@`), an SSH fleet with one CPU host for the router and two GPU hosts for the workers:

```yaml
type: fleet
name: pd-disagg

placement: cluster

ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
  - 89.169.108.16  # CPU host (router)
  - 89.169.123.100 # GPU host (prefill/decode workers)
  - 89.169.110.65  # GPU host (prefill/decode workers)
```
New file (`@@ -0,0 +1,54 @@`), the deprecated gateway-based configuration kept for reference:

```yaml
# DEPRECATED: Gateway-based PD disaggregation config.
# Use `pd.dstack.yml` instead (router runs as a replica).

type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
- HF_TOKEN
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1..4
  scaling:
    metric: rps
    target: 3
  commands:
  - |
    python -m sglang.launch_server \
      --model-path $MODEL_ID \
      --disaggregation-mode prefill \
      --disaggregation-transfer-backend mooncake \
      --host 0.0.0.0 \
      --port 8000 \
      --disaggregation-bootstrap-port 8998
  resources:
    gpu: 1

- count: 1..8
  scaling:
    metric: rps
    target: 2
  commands:
  - |
    python -m sglang.launch_server \
      --model-path $MODEL_ID \
      --disaggregation-mode decode \
      --disaggregation-transfer-backend mooncake \
      --host 0.0.0.0 \
      --port 8000
  resources:
    gpu: 1

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
- type: http
  url: /health_generate
  interval: 15s

router:
  type: sglang
  pd_disaggregation: true
```
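Whichever variant is deployed, clients talk to the same OpenAI-compatible endpoint shown in the curl example near the top of the diff (`/v1/chat/completions`). A minimal Python client sketch, assuming the dstack proxy URL and no extra auth headers (a hypothetical helper, not part of dstack or SGLang):

```python
import json
import urllib.request

def chat_completion(base_url: str, model: str, prompt: str) -> str:
    """POST an OpenAI-style chat request to the service and return the
    assistant's reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For example, against the proxy endpoint from the docs: `chat_completion("http://127.0.0.1:3000/proxy/services/main/deepseek-r1", "zai-org/GLM-4.5-Air-FP8", "Hello")`.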