Skip to content

AdaXL/adaptive-llm-scheduler

Repository files navigation

Adaptive LLM Inference Scheduler

When massive LLM requests arrive simultaneously, how do you smartly schedule GPU inference to minimize latency and maximize throughput?

This is a Python + asyncio research prototype that explores that question. It implements four production-grade scheduling techniques — the same ideas that power systems like vLLM, SGLang, and Orca — and benchmarks them against a naive FIFO baseline in a fully simulated environment. No real GPU or model is required; forward-pass latency is simulated with asyncio.sleep using formulas calibrated to real hardware behaviour.


Motivation

Serving large language models at scale is a systems problem as much as a machine learning one. A single A100 GPU can run one forward pass at a time, so when hundreds of requests arrive per second, everything comes down to scheduling: which requests share a batch, in what order, routed to which device, and with what token-generation strategy?

Naive approaches — serve requests one at a time, or batch by fixed windows — leave significant GPU utilisation on the table and cause tail latency to blow up under load. This project implements and demonstrates the techniques that close that gap.


Techniques Implemented

1. Continuous Batching (Orca-style)

Problem: Static batching forces all requests in a batch to finish before new ones can join, wasting GPU cycles whenever one request finishes early.

Solution: Schedule at the iteration level. After each decode step, the batch is open: finished requests leave and new requests enter immediately. There is no padding, no waiting for the slowest request to finish.

Effect: GPU utilisation stays high throughout a request's lifetime. Throughput scales linearly with batch size instead of being limited by the longest request.

Static batching:  |--req A (long)--|  |--req B--|  idle  |--req C--|
Continuous:       |--req A (long)--B--C--|   (B and C join as slots free)

2. Priority Scheduling with Preemption

Problem: Without priorities, a burst of cheap low-priority requests can starve a critical high-priority request for seconds.

Solution: Maintain a min-heap ordered by (priority DESC, SLO deadline ASC). Two additional mechanisms sharpen this:

  • SLO urgency bumps — requests within 50ms of their deadline are automatically elevated by +2 priority levels, preventing silent deadline misses.
  • Preemption — when all workers are full and a high-priority request is queued, the lowest-priority running request is evicted (returned to the queue with its progress saved) if the priority gap is ≥ 3. This prevents constant thrashing from small priority differences.

Effect: High-priority requests drain first. Under mixed-priority load, the SLO violation rate for high-priority requests drops dramatically compared to FIFO.

3. KV-Cache Aware Routing

Problem: Each request's prompt must be processed token-by-token into a KV cache before generation can begin (the "prefill" phase). If a request is routed to a worker that already processed the same prompt prefix for a previous request, that work can be reused — but only if the routing decision accounts for it.

Solution: Assign each request a prefix hash over its first 256 characters. Score each candidate worker with:

score = cache_hit_tokens × 10   # strongly favour cache reuse
      − utilisation × 100       # penalise overloaded workers
      + kv_cache_free × 0.1     # mild preference for available space
      + batch_capacity × 0.5    # mild preference for open slots

Route to the highest-scoring eligible worker.

Effect: Repeated or similar queries — common in real deployments (few-shot templates, RAG prefixes, system prompts) — skip the prefill phase entirely, cutting TTFT by the cost of re-processing shared tokens. In simulation, the KV hit rate consistently reaches 10–15% with realistic prompt distributions.

4. Speculative Decoding

Problem: Autoregressive decoding generates one token per forward pass of the full model. For long outputs, this is the dominant cost.

Solution: Run a small, fast "draft model" (10× cheaper per token) to speculatively propose K=4 tokens. Then run the large model once to verify all K tokens in parallel. Tokens that pass the verification (acceptance probability p=0.85) are kept; the first rejection terminates the batch.

The expected number of accepted tokens per step follows a stop-at-first-rejection distribution:

E[accepted] = Σ(k=0..K-1) [ k · p^k · (1−p) ] + K · p^K  ≈ 2.71 tokens/step

Theoretical speedup = (1 + K·p) / (1 + α)  ≈ 4.0×
  where α = 0.1 (draft model overhead fraction)

Speculative decoding is gated: it is disabled for requests near their SLO deadline, on overloaded workers (utilisation ≥ 80%), or for requests with priority > 6, since the draft overhead is not worth it in those cases.

Effect: Medium- and low-priority requests with long outputs generate significantly more tokens per wall-clock second. High-priority urgent requests are not burdened with the overhead.


Quick Start

Install

cd adaptive-llm-scheduler
pip install rich pytest pytest-asyncio

Run the demo

# Steady load — clean baseline comparison
python3 main.py --workers 4 --rps 30 --scenario steady --duration 20 --compare-fifo

# Burst traffic: 5 RPS quiet → 200 RPS spike → 5 RPS quiet
python3 main.py --workers 4 --rps 30 --scenario burst --duration 15 --compare-fifo

# Mixed priorities under sustained load — best shows preemption + KV routing
python3 main.py --workers 4 --rps 50 --scenario mixed_priority --duration 20 --compare-fifo

Run tests

python3 -m pytest tests/ -v   # 45 tests across unit + integration

CLI Reference

python3 main.py [OPTIONS]

Options:
  --workers     N       Simulated GPU workers                    (default: 4)
  --rps         R       Base requests per second                 (default: 30)
  --scenario    S       steady | burst | mixed_priority          (default: steady)
  --duration    D       Simulation duration in seconds           (default: 20)
  --kv-slots    K       KV cache budget per worker (token slots) (default: 2048)
  --batch-size  B       Max concurrent requests per worker       (default: 32)
  --seed        N       Random seed for reproducibility          (default: 42)
  --compare-fifo        Run FIFO baseline in parallel for comparison

Traffic Scenarios

Scenario What it simulates What it reveals
steady Constant arrival at --rps Baseline throughput and latency comparison
burst 5 RPS → 200 RPS spike for 5s → 5 RPS Queue drain speed, overload handling
mixed_priority 30–80 RPS with a mix of urgent and relaxed requests Preemption behaviour, SLO compliance under pressure

Live Dashboard

A Rich table updates every second while the simulation runs:

            Adaptive LLM Inference Scheduler — Live Metrics
╭──────────────────────┬──────────────────────┬──────────────────╮
│ Metric               │   Adaptive Scheduler │    FIFO Baseline │
├──────────────────────┼──────────────────────┼──────────────────┤
│ Throughput (tok/s)   │                8,147 │            7,974 │
│ TTFT P50 (ms)        │                  5.1 │              4.1 │
│ TTFT P99 (ms)        │                 46.3 │             53.7 │
│ Avg TBT (ms)         │                  0.6 │              0.4 │
│ SLO Violation %      │                 0.3% │             0.0% │
│ KV Hit Rate          │                11.3% │              N/A │
│ Preemptions          │                    0 │                0 │
│ Completed Requests   │                 1030 │             1030 │
│ Queue Depth          │                    0 │                0 │
│ Workers Busy         │                  3/4 │              4/4 │
╰──────────────────────┴──────────────────────┴──────────────────╯

Metric definitions:

Metric Meaning
Throughput (tok/s) Output tokens generated per second across all workers, measured over a rolling 10-second window
TTFT P50/P99 Time-to-first-token latency at the 50th and 99th percentile — the delay a user experiences before seeing any output
Avg TBT Average time between consecutive output tokens — perceived generation speed once streaming begins
SLO Violation % Fraction of completed requests that finished after their deadline
KV Hit Rate Fraction of prefill phases skipped due to a cached prefix match (adaptive scheduler only)
Preemptions Number of running requests evicted to admit a higher-priority one
Queue Depth Requests waiting to be dispatched — a leading indicator of overload

Findings

The following results are representative of runs on this simulation (your numbers will vary slightly due to traffic randomness, but the relative trends are stable across seeds).

Burst scenario (4 workers, burst 200 RPS for 5s)

Metric Adaptive FIFO
Throughput (tok/s) 8,147 7,974
TTFT P50 5.1 ms 4.1 ms
TTFT P99 46.3 ms 53.7 ms
SLO Violation % 0.3% 0.0%
KV Hit Rate 11.3% 0.0%
All requests completed

Observation: Both schedulers complete all requests and achieve similar raw throughput, because the burst is within the system's sustained capacity. The adaptive scheduler's P99 TTFT is lower (46 ms vs 54 ms) because priority-aware routing drains urgent requests earlier. The KV hit rate (11.3%) means roughly 1 in 9 prefill phases was skipped entirely — a free latency reduction with no throughput cost.

Mixed-priority scenario (4 workers, 50 RPS sustained)

Metric Adaptive FIFO
Throughput (tok/s) 6,594 6,512
TTFT P50 8.0 ms 10.9 ms
TTFT P99 74.2 ms 75.9 ms
SLO Violation % 0.0% 0.0%
KV Hit Rate 11.4% 0.0%

Observation: At moderate load both schedulers meet all SLOs, but the adaptive scheduler's TTFT P50 is meaningfully lower (8 ms vs 11 ms). This reflects priority ordering: high-priority requests move to the front of the queue and are dispatched before low-priority ones that arrived first. FIFO serves them in arrival order regardless of urgency.


Conclusions

What works well

Continuous batching is the single highest-leverage technique. By removing the requirement to wait for a batch to drain before admitting new requests, it keeps GPU utilisation high and eliminates the idle bubbles that dominate static-batching latency. This is why modern serving systems (vLLM, SGLang, TensorRT-LLM) all implement it as a baseline requirement, not an optional feature.

KV-cache aware routing provides free latency reduction at the cost of smarter bookkeeping. Routing to the worker that already holds a request's prompt prefix avoids recomputing the same key-value tensors. In real deployments — where shared system prompts, few-shot templates, and RAG context prepend every request — hit rates of 30–60% are achievable, translating directly to lower TTFT for those requests.

Priority scheduling with SLO-aware urgency bumps is essential for mixed-workload systems. Without it, a batch of cheap low-priority requests blocks an urgent high-priority request for the entire batch duration. The combination of a deadline-ordered heap and automatic urgency escalation ensures that requests approaching their SLO surface at the top of the queue without requiring callers to manually tune priorities.

Speculative decoding offers substantial throughput gains for generation-heavy workloads. The key insight is that the draft model's proposals are almost always correct for predictable text (code completions, templated outputs, simple continuations), so the acceptance rate is high and the cost of the rare rejection is low. The ~2.7 tokens per step versus 1 in autoregressive mode is a real multiplier on decode throughput, not a tuning artifact.

Limitations and trade-offs

Preemption has a cost. Evicting a request that is partially decoded discards its in-flight KV cache (unless the system implements KV swapping to CPU memory, which this prototype does not). This means preempted requests must redo their prefill when rescheduled. In practice, preemption is only justified when the priority gap is large — this implementation requires a gap of ≥ 3 levels, which keeps preemption rates low without sacrificing responsiveness for genuinely urgent requests.

Speculative decoding degrades under distribution shift. If the draft model's token distribution diverges from the large model (e.g., the prompt is out-of-distribution), the acceptance rate drops. At acceptance rates below ~0.6, the overhead of running the draft model outweighs the speedup from accepted tokens. The gate on worker utilisation (< 80%) provides a partial safeguard, but a production system would want to monitor per-request acceptance rates and back off dynamically.

KV cache is a shared, finite resource. The routing score optimises for cache hits, but as the cache fills, LRU eviction displaces entries that may have been useful. Systems like RadixAttention (SGLang) and prefix-aware memory management track prefix trees rather than flat hashes, enabling finer-grained reuse at the cost of implementation complexity.

The scheduler loop introduces 1ms of overhead per dispatch cycle. At very high request rates (>1000 RPS), this becomes a scheduling bottleneck independent of GPU compute. Production systems address this by batching routing decisions, using lock-free queues, or moving the scheduler to a separate thread.

Connection to production systems

Technique Where it appears in production
Continuous batching vLLM, TensorRT-LLM, SGLang, Orca
Priority + preemption vLLM v0.3+, Triton Inference Server
KV-cache routing SGLang (RadixAttention), Mooncake
Speculative decoding vLLM, TensorRT-LLM, Medusa, EAGLE

This prototype deliberately keeps the pieces separate and visible, which makes it a useful lens for understanding why each technique exists and what problem it solves — something that is harder to see in a production codebase where all four are deeply interleaved.


Project Structure

adaptive-llm-scheduler/
├── main.py                    # CLI entry point and simulation orchestrator
│
├── models/
│   ├── request.py             # Request dataclass, RequestStatus enum, heap ordering
│   └── worker.py              # WorkerState + KVCacheSlot with LRU eviction
│
├── scheduler/
│   ├── core.py                # AdaptiveScheduler — 1ms event loop with all four techniques
│   ├── priority_queue.py      # Min-heap with generation-based lazy deletion
│   ├── batching.py            # Continuous batch selection: select_batch(), can_fit_request()
│   ├── routing.py             # KV-cache-aware worker routing: route_request()
│   └── speculative.py         # Speculative decoding: draft → sample → verify
│
├── workers/
│   └── gpu_worker.py          # SimulatedGPUWorker: prefill → decode → complete
│
├── metrics/
│   ├── collector.py           # Rolling TTFT, TBT, throughput, SLO hit rate, KV stats
│   └── display.py             # Rich live table, refreshed every second
│
├── load_gen/
│   └── generator.py           # Poisson arrivals, log-normal prompt lengths, 3 scenarios
│
├── baselines/
│   └── fifo.py                # Naive FIFO + round-robin (no priority, no routing, no spec)
│
└── tests/
    ├── test_models.py              # Heap ordering, LRU eviction, KV cache hit/miss
    ├── test_batching.py            # Batch selection, capacity checks, cache-hit fitting
    ├── test_routing.py             # Routing score, cache preference, capacity exclusion
    ├── test_speculative.py         # Acceptance rate distribution, speedup formula, gate conditions
    └── test_scheduler_integration.py  # End-to-end: completion, priority ordering, KV hits, preemption

How It Works

Simulation Model

Real GPU latency is approximated by two formulas:

prefill_latency(prompt_tokens, batch_size) = 0.05ms × tokens / batch_size + 0.2ms
decode_latency(batch_size)                 = 0.1ms × batch_size + 0.05ms

These capture two key hardware realities:

  • Prefill is compute-bound and parallelisable — processing a batch of prompts simultaneously amortises memory bandwidth cost, so latency scales sub-linearly with batch size.
  • Decode is memory-bandwidth-bound — each step loads the full model weights regardless of batch size, so latency grows linearly.

asyncio Architecture

Every component runs as an independent coroutine in a single event loop — no threads, no locks:

Event Loop:
  ├── AdaptiveScheduler.run()          (1ms scheduling loop)
  │     ├── SLO urgency bumps
  │     ├── preemption check
  │     └── route & dispatch via asyncio.create_task()
  │
  ├── SimulatedGPUWorker[0].run()      (prefill → decode → complete)
  ├── SimulatedGPUWorker[1].run()
  ├── SimulatedGPUWorker[2].run()
  ├── SimulatedGPUWorker[3].run()
  │
  ├── LoadGenerator.run()              (Poisson inter-arrival sleeps)
  └── LiveDisplay.run()                (1s refresh cycle)

State is safe to share without locks because asyncio tasks only interleave at explicit await points — the cooperative scheduling model eliminates data races.

Scheduling Loop (every 1ms)

1. SLO urgency bumps
   For each queued request within 50ms of its deadline:
     priority = min(priority + 2, 9)   → re-inserted with new generation

2. Preemption check
   If top-of-queue priority − lowest-running priority ≥ 3:
     signal lowest-priority running request to stop cooperatively
     re-enqueue it with partial progress saved

3. Route & inject
   Walk priority queue (highest priority first, skip infeasible requests):
     score each eligible worker
     dispatch to best match via asyncio.create_task(worker.add_request(req))
     eagerly reserve batch slot so subsequent dispatches see correct capacity

4. Yield
   asyncio.sleep(1ms) — worker decode steps advance during this window

Key Design Decisions

Generation-based lazy deletion in the priority queue. Requests are re-inserted when their priority is bumped or after preemption. Simple ID-based lazy deletion causes stale heap entries to become "live" again when the same ID is re-pushed, creating phantom duplicates that inflate queue depth and block routing. Generation numbers give each push a unique identity, so old entries are always stale regardless of re-insertion.

Eager batch-slot reservation. The scheduler calls asyncio.create_task(worker.add_request(req)) to dispatch — the task runs after the current coroutine yields. If the batch slot was not reserved synchronously at dispatch time, the scheduler would over-commit within a single loop iteration, dispatching more requests than max_batch_size allows before any add_request task had a chance to update the count.

Skip infeasible head-of-queue, do not break. A request whose prompt_tokens exceeds the worker's entire KV budget can never be served. After SLO urgency bumps, such a request can reach priority 9 and sit permanently at the head of the queue. Breaking the routing loop on route_request() == None would stall all lower-priority feasible requests indefinitely. The scheduler instead walks past infeasible requests and serves what it can.

Minimum priority gap for preemption. Preemption is only triggered when the incoming request's priority exceeds the running victim's by at least 3 levels. Without this floor, a stream of requests with priorities differing by 1 would generate constant preemptions, each wasting a prefill re-computation and degrading overall throughput. The gap of 3 is a deliberate policy choice that can be tuned via MIN_PREEMPT_PRIORITY_GAP in scheduler/core.py.

About

Adaptive LLM inference scheduler simulation — continuous batching, priority preemption, KV-cache routing, and speculative decoding in Python/asyncio.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages