Demonstrate data-parallel (DP) group fault tolerance and autoscaling on Ray Serve LLM deployments. Both demos use gang-scheduled data-parallel deployments (`DPServer`), where all workers in a DP group are restarted atomically on failure.
```
├── dp_group_fault_tolerance_demo.py    # Demo 1: fault tolerance via Python builders
├── dp_group_autoscaling_service.yaml   # Demo 2: autoscaling via declarative YAML config
├── locustfile.py                       # Locust load test with shaped traffic pattern
├── run_locust.py                       # CLI wrapper for running the load test
├── requirements.txt                    # Python dependencies
```
- Python 3.10+
- An Anyscale account with API access
- Anyscale CLI installed and authenticated
- A Ray cluster with GPUs (for Demo 1) or Anyscale platform access (for Demo 2)
```bash
pip install -r requirements.txt
```

You can spin up a service with either the Python builder or the declarative YAML config pattern. The DP group fault tolerance and autoscaling features are agnostic to the builder pattern; both are fully supported in Ray OSS 2.55.
This demo uses `dp_group_fault_tolerance_demo.py` to deploy a DP group locally on a Ray cluster using Python builders, send continuous traffic, kill a GPU process to simulate a real-world GPU failure, and observe the DP group recover.
The script uses `build_dp_deployment` from `ray.serve.llm` to construct a `DPServer` deployment programmatically:
```python
from ray import serve
from ray.serve.llm import LLMConfig, ModelLoadingConfig, build_dp_deployment

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="microsoft/Phi-tiny-MoE-instruct",
        model_source="microsoft/Phi-tiny-MoE-instruct",
    ),
    deployment_config=dict(
        num_replicas=2,  # number of DP groups
    ),
    engine_kwargs=dict(
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=2,  # workers per DP group
        distributed_executor_backend="ray",
        max_model_len=1024,
        max_num_seqs=32,
        enforce_eager=True,
    ),
    runtime_env={
        "env_vars": {
            "VLLM_DISABLE_COMPILE_CACHE": "1",
        },
    },
)

handle = serve.run(build_dp_deployment(llm_config), blocking=False)
```

With `data_parallel_size=2` and `num_replicas=2`, this creates 4 Ray Serve replicas in total: 2 DP groups (`num_replicas`) × 2 workers (`data_parallel_size`) each.
- Deploy — calls `serve.run(build_dp_deployment(llm_config))` and waits for all 4 replicas to be `RUNNING`.
- Send traffic — spawns a `RequestSender` Ray actor that sends 10 concurrent completion requests in a loop, then warms up for 2 minutes.
- Kill a GPU process — uses `nvidia-smi --query-compute-apps=pid` to find a GPU process and kills it with `SIGKILL` (see the sketch after this list).
- Observe gang teardown — waits for the running replica count to drop below 4 (the entire DP group containing the killed worker is torn down).
- Observe recovery — waits for all 4 replicas to return to `RUNNING` (the gang is restarted atomically).
- Report results — prints total requests sent and errors encountered during the fault.
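A minimal sketch of that kill step, assuming `nvidia-smi` is on the PATH; the demo script's actual implementation may differ:

```python
import os
import signal
import subprocess

# List the PIDs of all compute processes currently using a GPU.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
    text=True,
)
pids = [int(p) for p in out.split() if p.strip()]

# SIGKILL one engine worker; Serve should then tear down its whole DP group.
os.kill(pids[0], signal.SIGKILL)
```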
On a Ray cluster with at least 4 GPUs:
```bash
python dp_group_fault_tolerance_demo.py
```

The script keeps the service alive after recovery so you can inspect the Ray Dashboard. Press Ctrl+C to shut down.
- After killing a GPU process, the entire DP group containing that worker is torn down (replica count drops from 4 to 2).
- The surviving DP group continues serving requests.
- The killed DP group is restarted atomically — both workers come back together.
- Replica count returns to 4.
- The `RequestSender` reports errors only for requests that were in-flight on the killed group.
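To watch the teardown and recovery yourself, one option is to poll replica states from a separate shell on the cluster. A minimal sketch using `serve.status()`; the status schema fields used here (`applications`, `deployments`, `replica_states`) exist in recent Ray releases but may shift between versions:

```python
import time

import ray
from ray import serve

ray.init(address="auto")  # attach to the running cluster

# Print the RUNNING replica count per deployment every 5 seconds.
while True:
    status = serve.status()
    for app in status.applications.values():
        for name, deployment in app.deployments.items():
            running = deployment.replica_states.get("RUNNING", 0)
            print(f"{name}: {running} RUNNING")
    time.sleep(5)
```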
This demo deploys the same model on Anyscale using the declarative `dp_group_autoscaling_service.yaml` config, then uses Locust to drive shaped traffic that triggers autoscaling.
```bash
anyscale service deploy -f dp_group_autoscaling_service.yaml
```

Note the service URL and auth token from the output.
The service deploys an OpenAI-compatible LLM endpoint via `ray.serve.llm:build_dp_openai_app`, which constructs a Ray Serve application with a gang-scheduled `DPServer`. Unlike Demo 1, which uses a fixed replica count, this config sets `num_replicas: auto` to enable autoscaling. For orientation, the sketch below shows what such a config can look like.
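This is illustrative only, not the actual contents of `dp_group_autoscaling_service.yaml`: the `args` schema for `build_dp_openai_app` and the field nesting are assumptions, while the autoscaling values match the ones quoted in the expected results below.

```yaml
# Illustrative sketch -- not the real dp_group_autoscaling_service.yaml.
name: dp-group-fault-tolerance
applications:
  - import_path: ray.serve.llm:build_dp_openai_app
    args:
      llm_config:                      # assumed argument name
        model_loading_config:
          model_id: microsoft/Phi-tiny-MoE-instruct
          model_source: microsoft/Phi-tiny-MoE-instruct
        deployment_config:
          num_replicas: auto           # enables autoscaling
          autoscaling_config:
            target_ongoing_requests: 5
            upscale_delay_s: 10
            downscale_delay_s: 20
```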
```bash
anyscale service status --name dp-group-fault-tolerance
```

Wait until the service state is `RUNNING`.
curl -H "Authorization: Bearer <TOKEN>" \
-H "Content-Type: application/json" \
https://<SERVICE_URL>/v1/chat/completions \
-d '{"model": "microsoft/Phi-tiny-MoE-instruct", "messages": [{"role": "user", "content": "Hello"}]}'The load test uses a fixed 14-minute shaped traffic pattern designed to trigger autoscaling:
The load test uses a fixed 14-minute shaped traffic pattern designed to trigger autoscaling:

```
 0:00 -  2:00   baseline (steady at --baseline-users)
 2:00 -  6:00   ramp up to --peak-users
 6:00 -  8:00   peak (steady at --peak-users)
 8:00 - 12:00   ramp down to --baseline-users
12:00 - 14:00   baseline (steady at --baseline-users)
```
This shape is defined by the `TrafficShape` class in `locustfile.py`; a minimal sketch of how such a shape looks in Locust follows.
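The sketch below uses Locust's `LoadTestShape` API and is illustrative rather than the repo's actual `TrafficShape`; the constants are placeholder assumptions standing in for the `--baseline-users`, `--peak-users`, and `--spawn-rate` flags:

```python
from locust import LoadTestShape

BASELINE_USERS = 10  # placeholder for --baseline-users
PEAK_USERS = 50      # placeholder for --peak-users
SPAWN_RATE = 10      # placeholder for --spawn-rate

class TrafficShape(LoadTestShape):
    # Each phase: (end time in seconds, users at phase start, users at phase end).
    phases = [
        (120, BASELINE_USERS, BASELINE_USERS),  # 0:00-2:00   baseline
        (360, BASELINE_USERS, PEAK_USERS),      # 2:00-6:00   ramp up
        (480, PEAK_USERS, PEAK_USERS),          # 6:00-8:00   peak
        (720, PEAK_USERS, BASELINE_USERS),      # 8:00-12:00  ramp down
        (840, BASELINE_USERS, BASELINE_USERS),  # 12:00-14:00 baseline
    ]

    def tick(self):
        t = self.get_run_time()
        start = 0
        for end, u0, u1 in self.phases:
            if t < end:
                # Linearly interpolate the user count within the phase.
                frac = (t - start) / (end - start)
                return round(u0 + (u1 - u0) * frac), SPAWN_RATE
            start = end
        return None  # returning None stops the test after 14 minutes
```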
To run the load test with the default shape:

```bash
python run_locust.py \
  --host https://<SERVICE_URL> \
  --token <TOKEN> \
  --baseline-users 10 \
  --peak-users 50
```

For a heavier run that pushes autoscaling harder:

```bash
python run_locust.py \
  --host https://<SERVICE_URL> \
  --token <TOKEN> \
  --baseline-users 10 \
  --peak-users 200 \
  --max-tokens 32 \
  --spawn-rate 10
```

- During the ramp-up phase, `target_ongoing_requests: 5` is exceeded and the autoscaler adds DP groups (after `upscale_delay_s: 10`).
- During the ramp-down phase, the autoscaler removes DP groups (after `downscale_delay_s: 20`).
- Check the Anyscale console / Ray Serve dashboard for replica count changes.
```bash
anyscale service terminate --name dp-group-fault-tolerance
```