BALES - High-Throughput ML Inference Gateway

Zero-downtime inference with dynamic batching and priority scheduling.

Quick Start • Architecture • Configuration • Running & Stress Testing • Benchmarking • Security • API Reference • Development

Overview

BALES is a production-ready inference gateway designed for high-throughput CPU-based ML serving. It combines Redis-backed priority queues, dynamic request batching, and atomic model hot-swapping to deliver:

>8,000 req/s throughput on CPU
P99 latency <12ms at batch_size=32
Zero-downtime model reloads without dropping in-flight requests

Built with FastAPI, PyTorch, and asyncio, BALES is engineered for safety-first concurrency: inference never blocks the event loop, and every request future is guaranteed to resolve or time out cleanly.

Quick Start

Local Development

Prerequisites: Python 3.14+, Redis 7+, uv

# 1. Clone the repository
git clone https://github.com/ab1nv/bales.git
cd bales

# 2. Install dependencies (first time)
uv sync --extra dev

# 3. Start Redis (if not already running)
redis-server --save "" --appendonly no

# 4. Run the server
uv run python main.py

The gateway will be available at http://localhost:8000.

Docker

# Build and start everything (Redis + Bales)
docker compose up --build

# Optional: include Prometheus for metrics scraping
docker compose --profile monitoring up --build

Architecture

Data Flow

flowchart TD
    Client["Client"]
    Routes["FastAPI Routes"]
    Queue["Redis Priority Queue"]
    Consumer["Consumer Loop"]
    Batcher["Dynamic Batcher"]
    Torch["PyTorch run_in_executor"]
    Response["Response Future"]

    Client -->|POST /infer| Routes
    Routes -->|push request| Queue
    Queue -->|pop batch| Consumer
    Consumer -->|preprocess & submit| Batcher
    Batcher -->|stack tensors| Torch
    Torch -->|postprocess| Response
    Response -->|resolve future| Routes
    Routes -->|JSON response| Client

Key Invariants

PyTorch inference NEVER runs on the event loop thread -- always dispatched via run_in_executor.
A request NEVER touches a half-loaded model during hot-swap -- atomic reference replacement under an async lock.
A request NEVER gets dropped during hot-swap -- in-flight requests hold a local reference to the old model until GC cleans up.
request_id is the single source of truth linking API -> queue -> batcher -> response.
pending_futures is the ONLY place futures are stored.

Configuration

All configuration is read from environment variables (with sensible defaults). Create a .env file from the example:

cp .env.example .env

Variable	Default	Description
`REDIS_URL`	`redis://localhost:6379/0`	Redis connection string
`MAX_BATCH_SIZE`	`32`	Maximum requests per batch
`BATCH_WINDOW_MS`	`5.0`	Collection window in milliseconds
`BATCHER_TIMEOUT_S`	`5.0`	Client timeout before 504
`DEFAULT_MODEL_ID`	`stub_v1`	Default registered model
`THREAD_POOL_SIZE`	`4`	Executor threads for torch inference
`HOST`	`0.0.0.0`	Server bind host
`PORT`	`8000`	Server port
`LOG_LEVEL`	`info`	Logging level
`ENABLE_PROMETHEUS`	`true`	Enable metrics export

Note: workers must remain 1 for in-process shared state (pending_futures). Scale horizontally with Docker replicas instead.

Running and Stress Testing

Start the Server

# Local (requires Redis running)
uv run python main.py

# Or with Docker (includes Redis)
docker compose up --build

The server will start on http://localhost:8000.

Smoke Test

Verify the server is healthy and can serve inference:

# Health check
curl http://localhost:8000/health

# Single inference request
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "stub_v1",
    "model_type": "classification",
    "payload": {"input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3]}
  }'

Note: The stub model expects exactly 128 floats in the input array. The example above is truncated for readability.

Stress Test with curl

Send 1000 sequential requests and measure total time:

# Generate a valid 128-element input
input_json=$(python3 -c "import json; print(json.dumps({'input': [0.1]*128}))")

# Sequential stress test
for i in {1..1000}; do
  curl -s -X POST http://localhost:8000/infer \
    -H "Content-Type: application/json" \
    -d "{
      \"model_id\": \"stub_v1\",
      \"model_type\": \"classification\",
      \"priority\": 2,
      \"payload\": $input_json
    }" > /dev/null
done

Concurrent Load with wrk or hey

For true concurrency testing, use a load generator:

Using hey (simple, single-threaded):

# Install: go install github.com/rakyll/hey@latest
# Or: apt-get install hey

# Run 50,000 requests with 500 concurrent connections
hey -n 50000 -c 500 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model_id":"stub_v1","model_type":"classification","priority":2,"payload":{"input":[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1]}}' \
  http://localhost:8000/infer

Using wrk (more accurate, multi-threaded):

# Install wrk first: https://github.com/wg/wrk/wiki/Installing-Wrk-on-Linux

# Create a Lua script for POST body
cat > infer.lua << 'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"model_id":"stub_v1","model_type":"classification","priority":2,"payload":{"input":[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1]}}'
EOF

# Run with 12 threads, 400 connections, for 30 seconds
wrk -t12 -c400 -d30s -s infer.lua http://localhost:8000/infer

Monitor during stress test:

# Watch queue depth and pending requests
curl http://localhost:8000/health | python3 -m json.tool

# Watch Prometheus metrics
curl http://localhost:8000/metrics | grep bales_

Benchmarking

Isolated Batcher

Test pure batching throughput (no HTTP or Redis overhead). This tells you the theoretical maximum of the batching engine:

uv run python benchmarks/profile_batcher.py

Expected output:

=== Batcher Benchmark ===
Requests:          10,000
Elapsed:           1.23s
Throughput:        8,130 req/s
P50 latency:       2.45ms
P99 latency:       8.12ms
Avg batch size:    31.2
Model calls:       321  (vs 10000 individual = 31.2x reduction)
Target:            >8,000 req/s, P99 <12ms
Pass:              PASS

What to tune:

If throughput is low (< 8,000): increase BATCH_WINDOW_MS slightly (try 5ms -> 8ms) to allow more requests to accumulate per batch
If P99 is high (> 12ms): reduce BATCH_WINDOW_MS (try 5ms -> 3ms) or increase THREAD_POOL_SIZE
If avg batch size is low (< 20): increase concurrent load or reduce window

Full-Stack Load Test with Locust

Benchmark the complete HTTP -> Redis -> Batcher pipeline:

# Install locust (already in dev dependencies)
uv sync --extra dev

# Run headless load test
uv run locust -f benchmarks/locustfile.py \
  --headless -u 500 -r 100 \
  --run-time 60s --host http://localhost:8000

Parameters explained:

-u 500: spawn 500 concurrent users
-r 100: hatch 100 users per second
--run-time 60s: run for 60 seconds
--host http://localhost:8000: target the local server

After the run, Locust prints:

Total requests per second (RPS)
Average, median, and percentile latencies
Failure count and error rate

Interpreting Results

Metric	Target	What to do if failing
Throughput	> 8,000 req/s	Increase `-u` (users) in Locust. Check CPU usage with `htop`.
P99 latency	< 12ms	Reduce `BATCH_WINDOW_MS` or increase `THREAD_POOL_SIZE`. Check `/health` for queue backlog.
Error rate	0%	Check logs for timeout or Redis connection errors. Verify `pending_futures` is 0 in `/health`.
Avg batch size	> 20	Should be close to `MAX_BATCH_SIZE` (32). If low, increase load or window.

Comparison checklist:

Run profile_batcher.py first to establish the ceiling (no HTTP/Redis overhead)
Run locustfile.py to measure real-world throughput
Compare: locust RPS should be ~60-80% of profile_batcher RPS due to HTTP + Redis overhead
If gap is larger: HTTP layer or Redis is the bottleneck, not the batcher
If gap is small: batcher is the bottleneck, tune THREAD_POOL_SIZE or BATCH_WINDOW_MS

Security

BALES follows security best practices:

Input validation: All requests are validated via Pydantic v2 before entering the pipeline.
No shell execution: weights_path in hot-swap is validated with Path.exists() and never passed to shell commands.
Resource limits: Docker Compose enforces CPU (4.0) and memory (2G) caps.
No Redis persistence: Queue data is ephemeral by design (--save "" --appendonly no) to avoid I/O overhead and accidental data retention.
Single worker: Prevents shared-state corruption; horizontal scaling is done via container replicas behind a load balancer.
Healthchecks: Docker HEALTHCHECK polls /health every 10s to detect degraded state.
Non-root container: The Docker image runs as an unprivileged bales user.

API Reference

POST /infer

Submit an inference request.

Request body:

{
  "model_id": "stub_v1",
  "model_type": "classification",
  "priority": 2,
  "payload": {
    "input": [0.1, 0.2, ...]
  }
}

Response:

{
  "request_id": "uuid",
  "model_id": "stub_v1",
  "result": { "label": 3, "confidence": 0.95 },
  "latency_ms": 4.123,
  "batch_size": 16,
  "queued_ms": 1.234
}

GET /health

Returns system health, registered models, queue depths, and pending request count.

curl http://localhost:8000/health

POST /models/{model_id}/reload

Hot-swap a model's weights without dropping traffic.

Request body:

{
  "weights_path": "./weights/new_model.pt"
}

Example:

curl -X POST http://localhost:8000/models/stub_v1/reload \
  -H "Content-Type: application/json" \
  -d '{"weights_path": "./weights/stub_v2.pt"}'

GET /metrics

Prometheus scrape endpoint exposing:

bales_requests_total - Total inference requests by model_id and status
bales_request_latency_ms - End-to-end latency distribution
bales_batch_size - Number of requests in each dispatched batch
bales_queue_depth - Number of requests waiting in priority queue

curl http://localhost:8000/metrics

Development

# Run the test suite (requires Redis on localhost:6379)
uv run pytest tests/ -v

# Run a specific test file
uv run pytest tests/test_integration.py -v

# Run with coverage
uv run pytest tests/ -v --cov=. --cov-report=html

# Profile the batcher
uv run python benchmarks/profile_batcher.py

# Run the load test
uv run locust -f benchmarks/locustfile.py --headless -u 500 -r 100 --run-time 60s --host http://localhost:8000

# Lint check
uv run ruff check .

# Format check
uv run ruff format --check .

# Type check
uv run ty check

Built with FastAPI + PyTorch + Redis + uv
Buy me a coffee

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github		.github
api		api
batcher		batcher
benchmarks		benchmarks
metrics		metrics
models		models
queues		queues
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
config.py		config.py
docker-compose.yml		docker-compose.yml
main.py		main.py
prometheus.yml		prometheus.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

BALES - High-Throughput ML Inference Gateway

Overview

Table of Contents

Quick Start

Local Development

Docker

Architecture

Data Flow

Key Invariants

Configuration

Running and Stress Testing

Start the Server

Smoke Test

Stress Test with curl

Concurrent Load with wrk or hey

Benchmarking

Isolated Batcher

Full-Stack Load Test with Locust

Interpreting Results

Security

API Reference

POST /infer

GET /health

POST /models/{model_id}/reload

GET /metrics

Development

About

Uh oh!

Sponsor this project

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

BALES - High-Throughput ML Inference Gateway

Overview

Table of Contents

Quick Start

Local Development

Docker

Architecture

Data Flow

Key Invariants

Configuration

Running and Stress Testing

Start the Server

Smoke Test

Stress Test with curl

Concurrent Load with wrk or hey

Benchmarking

Isolated Batcher

Full-Stack Load Test with Locust

Interpreting Results

Security

API Reference

POST /infer

GET /health

POST /models/{model_id}/reload

GET /metrics

Development

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Sponsor this project

Uh oh!

Contributors

Uh oh!

Languages