Demonstrate data-parallel (DP) group fault tolerance and autoscaling on Ray Serve LLM deployments. Both demos use gang-scheduled data-parallel deployments (`DPServer`), where all workers in a DP group are restarted atomically on failure.
```
├── dp_group_fault_tolerance_demo.py    # Demo 1: fault tolerance via Python builders
├── dp_group_autoscaling_service.yaml   # Demo 2: autoscaling via declarative YAML config
├── locustfile.py                       # Locust load test with shaped traffic pattern
├── run_locust.py                       # CLI wrapper for running the load test
├── requirements.txt                    # Python dependencies
```
- Python 3.10+
- An Anyscale account with API access
- Anyscale CLI installed and authenticated
- A Ray cluster with GPUs (for Demo 1) or Anyscale platform access (for Demo 2)
```bash
pip install -r requirements.txt
```

You can spin up a service with either the Python builder or the declarative YAML config pattern. The DP group fault tolerance and autoscaling features are agnostic to the builder pattern; both are fully supported in Ray OSS 2.55.
This demo uses `dp_group_fault_tolerance_demo.py` to deploy a DP group locally on a Ray cluster using Python builders, send continuous traffic, kill a GPU process to simulate a real-world GPU failure, and observe the DP group recover.
The script uses `build_dp_deployment` from `ray.serve.llm` to construct a `DPServer` deployment programmatically:
```python
from ray import serve
from ray.serve.llm import LLMConfig, ModelLoadingConfig, build_dp_deployment

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="microsoft/Phi-tiny-MoE-instruct",
        model_source="microsoft/Phi-tiny-MoE-instruct",
    ),
    deployment_config=dict(
        num_replicas=2,  # number of DP groups
    ),
    engine_kwargs=dict(
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=2,  # workers per DP group
        distributed_executor_backend="ray",
        max_model_len=1024,
        max_num_seqs=32,
        enforce_eager=True,
    ),
    runtime_env={
        "env_vars": {
            "VLLM_DISABLE_COMPILE_CACHE": "1",
        },
    },
)

handle = serve.run(build_dp_deployment(llm_config), blocking=False)
```

With `data_parallel_size=2` and `num_replicas=2`, this creates 4 Ray Serve replicas in total: 2 DP groups (`num_replicas`) × 2 workers (`data_parallel_size`) each.
- Deploy — calls `serve.run(build_dp_deployment(llm_config))` and waits for all 4 replicas to be `RUNNING`.
- Send traffic — spawns a `RequestSender` Ray actor that sends 10 concurrent completion requests in a loop, then warms up for 2 minutes.
- Kill a GPU process — uses `nvidia-smi --query-compute-apps=pid` to find a GPU process and kills it with `SIGKILL` (see the sketch after this list).
- Observe gang teardown — waits for the running replica count to drop below 4 (the entire DP group containing the killed worker is torn down).
- Observe recovery — waits for all 4 replicas to return to `RUNNING` (the gang is restarted atomically).
- Report results — prints total requests sent and errors encountered during the fault.
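A minimal sketch of that kill step, assuming `nvidia-smi` is on the PATH; the demo script's actual implementation may differ:

```python
import os
import signal
import subprocess

# List the PIDs of all compute processes currently using a GPU.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
    text=True,
)
pids = [int(p) for p in out.split() if p.strip()]

# SIGKILL one engine worker; Serve should then tear down its whole DP group.
os.kill(pids[0], signal.SIGKILL)
```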
On a Ray cluster with at least 4 GPUs:
```bash
python dp_group_fault_tolerance_demo.py
```

The script keeps the service alive after recovery so you can inspect the Ray Dashboard. Press Ctrl+C to shut down.
- After killing a GPU process, the entire DP group containing that worker is torn down (replica count drops from 4 to 2).
- The surviving DP group continues serving requests.
- The killed DP group is restarted atomically — both workers come back together.
- Replica count returns to 4.
- The `RequestSender` reports errors only for requests that were in-flight on the killed group.
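To watch the teardown and recovery yourself, one option is to poll replica states from a separate shell on the cluster. A minimal sketch using `serve.status()`; the status schema fields used here (`applications`, `deployments`, `replica_states`) exist in recent Ray releases but may shift between versions:

```python
import time

import ray
from ray import serve

ray.init(address="auto")  # attach to the running cluster

# Print the RUNNING replica count per deployment every 5 seconds.
while True:
    status = serve.status()
    for app in status.applications.values():
        for name, deployment in app.deployments.items():
            running = deployment.replica_states.get("RUNNING", 0)
            print(f"{name}: {running} RUNNING")
    time.sleep(5)
```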
This demo deploys the same model on Anyscale using the declarative `dp_group_autoscaling_service.yaml` config, then uses Locust to drive shaped traffic that triggers autoscaling.
```bash
anyscale service deploy -f dp_group_autoscaling_service.yaml
```

Note the service URL and auth token from the output.
The service deploys an OpenAI-compatible LLM endpoint via `ray.serve.llm:build_dp_openai_app`, which constructs a Ray Serve application with a gang-scheduled `DPServer`. Unlike Demo 1, which uses a fixed replica count, this config sets `num_replicas: auto` to enable autoscaling. For orientation, the sketch below shows what such a config can look like.
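This is illustrative only, not the actual contents of `dp_group_autoscaling_service.yaml`: the `args` schema for `build_dp_openai_app` and the field nesting are assumptions, while the autoscaling values match the ones quoted in the expected results below.

```yaml
# Illustrative sketch -- not the real dp_group_autoscaling_service.yaml.
name: dp-group-fault-tolerance
applications:
  - import_path: ray.serve.llm:build_dp_openai_app
    args:
      llm_config:                      # assumed argument name
        model_loading_config:
          model_id: microsoft/Phi-tiny-MoE-instruct
          model_source: microsoft/Phi-tiny-MoE-instruct
        deployment_config:
          num_replicas: auto           # enables autoscaling
          autoscaling_config:
            target_ongoing_requests: 5
            upscale_delay_s: 10
            downscale_delay_s: 20
```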
```bash
anyscale service status --name dp-group-fault-tolerance
```

Wait until the service state is `RUNNING`.
curl -H "Authorization: Bearer <TOKEN>" \
-H "Content-Type: application/json" \
https://<SERVICE_URL>/v1/chat/completions \
-d '{"model": "microsoft/Phi-tiny-MoE-instruct", "messages": [{"role": "user", "content": "Hello"}]}'The load test uses a fixed 14-minute shaped traffic pattern designed to trigger autoscaling:
The load test uses a fixed 14-minute shaped traffic pattern designed to trigger autoscaling:

```
 0:00 -  2:00   baseline (steady at --baseline-users)
 2:00 -  6:00   ramp up to --peak-users
 6:00 -  8:00   peak (steady at --peak-users)
 8:00 - 12:00   ramp down to --baseline-users
12:00 - 14:00   baseline (steady at --baseline-users)
```
This shape is defined by the `TrafficShape` class in `locustfile.py`; a minimal sketch of how such a shape looks in Locust follows.
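The sketch below uses Locust's `LoadTestShape` API and is illustrative rather than the repo's actual `TrafficShape`; the constants are placeholder assumptions standing in for the `--baseline-users`, `--peak-users`, and `--spawn-rate` flags:

```python
from locust import LoadTestShape

BASELINE_USERS = 10  # placeholder for --baseline-users
PEAK_USERS = 50      # placeholder for --peak-users
SPAWN_RATE = 10      # placeholder for --spawn-rate

class TrafficShape(LoadTestShape):
    # Each phase: (end time in seconds, users at phase start, users at phase end).
    phases = [
        (120, BASELINE_USERS, BASELINE_USERS),  # 0:00-2:00   baseline
        (360, BASELINE_USERS, PEAK_USERS),      # 2:00-6:00   ramp up
        (480, PEAK_USERS, PEAK_USERS),          # 6:00-8:00   peak
        (720, PEAK_USERS, BASELINE_USERS),      # 8:00-12:00  ramp down
        (840, BASELINE_USERS, BASELINE_USERS),  # 12:00-14:00 baseline
    ]

    def tick(self):
        t = self.get_run_time()
        start = 0
        for end, u0, u1 in self.phases:
            if t < end:
                # Linearly interpolate the user count within the phase.
                frac = (t - start) / (end - start)
                return round(u0 + (u1 - u0) * frac), SPAWN_RATE
            start = end
        return None  # returning None stops the test after 14 minutes
```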
To run the load test with the default shape:

```bash
python run_locust.py \
  --host https://<SERVICE_URL> \
  --token <TOKEN> \
  --baseline-users 10 \
  --peak-users 50
```

For a heavier run that pushes autoscaling harder:

```bash
python run_locust.py \
  --host https://<SERVICE_URL> \
  --token <TOKEN> \
  --baseline-users 10 \
  --peak-users 200 \
  --max-tokens 32 \
  --spawn-rate 10
```

- During the ramp-up phase, `target_ongoing_requests: 5` is exceeded and the autoscaler adds DP groups (after `upscale_delay_s: 10`).
- During the ramp-down phase, the autoscaler removes DP groups (after `downscale_delay_s: 20`).
- Check the Anyscale console / Ray Serve dashboard for replica count changes.
```bash
anyscale service terminate --name dp-group-fault-tolerance
```