Zero-downtime inference with dynamic batching and priority scheduling.
Quick Start • Architecture • Configuration • Running & Stress Testing • Benchmarking • Security • API Reference • Development
BALES is a production-ready inference gateway designed for high-throughput CPU-based ML serving. It combines Redis-backed priority queues, dynamic request batching, and atomic model hot-swapping to deliver:
- >8,000 req/s throughput on CPU
- P99 latency <12ms at
batch_size=32 - Zero-downtime model reloads without dropping in-flight requests
Built with FastAPI, PyTorch, and asyncio, BALES is engineered for safety-first concurrency: inference never blocks the event loop, and every request future is guaranteed to resolve or time out cleanly.
- Overview
- Quick Start
- Architecture
- Configuration
- Running and Stress Testing
- Benchmarking
- Security
- API Reference
- Development
Prerequisites: Python 3.14+, Redis 7+, uv
# 1. Clone the repository
git clone https://github.com/ab1nv/bales.git
cd bales
# 2. Install dependencies (first time)
uv sync --extra dev
# 3. Start Redis (if not already running)
redis-server --save "" --appendonly no
# 4. Run the server
uv run python main.pyThe gateway will be available at http://localhost:8000.
# Build and start everything (Redis + Bales)
docker compose up --build
# Optional: include Prometheus for metrics scraping
docker compose --profile monitoring up --buildflowchart TD
Client["Client"]
Routes["FastAPI Routes"]
Queue["Redis Priority Queue"]
Consumer["Consumer Loop"]
Batcher["Dynamic Batcher"]
Torch["PyTorch run_in_executor"]
Response["Response Future"]
Client -->|POST /infer| Routes
Routes -->|push request| Queue
Queue -->|pop batch| Consumer
Consumer -->|preprocess & submit| Batcher
Batcher -->|stack tensors| Torch
Torch -->|postprocess| Response
Response -->|resolve future| Routes
Routes -->|JSON response| Client
- PyTorch inference NEVER runs on the event loop thread -- always dispatched via
run_in_executor. - A request NEVER touches a half-loaded model during hot-swap -- atomic reference replacement under an async lock.
- A request NEVER gets dropped during hot-swap -- in-flight requests hold a local reference to the old model until GC cleans up.
request_idis the single source of truth linking API -> queue -> batcher -> response.pending_futuresis the ONLY place futures are stored.
All configuration is read from environment variables (with sensible defaults). Create a .env file from the example:
cp .env.example .env| Variable | Default | Description |
|---|---|---|
REDIS_URL |
redis://localhost:6379/0 |
Redis connection string |
MAX_BATCH_SIZE |
32 |
Maximum requests per batch |
BATCH_WINDOW_MS |
5.0 |
Collection window in milliseconds |
BATCHER_TIMEOUT_S |
5.0 |
Client timeout before 504 |
DEFAULT_MODEL_ID |
stub_v1 |
Default registered model |
THREAD_POOL_SIZE |
4 |
Executor threads for torch inference |
HOST |
0.0.0.0 |
Server bind host |
PORT |
8000 |
Server port |
LOG_LEVEL |
info |
Logging level |
ENABLE_PROMETHEUS |
true |
Enable metrics export |
Note:
workersmust remain1for in-process shared state (pending_futures). Scale horizontally with Docker replicas instead.
# Local (requires Redis running)
uv run python main.py
# Or with Docker (includes Redis)
docker compose up --buildThe server will start on http://localhost:8000.
Verify the server is healthy and can serve inference:
# Health check
curl http://localhost:8000/health
# Single inference request
curl -X POST http://localhost:8000/infer \
-H "Content-Type: application/json" \
-d '{
"model_id": "stub_v1",
"model_type": "classification",
"payload": {"input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3]}
}'Note: The stub model expects exactly 128 floats in the
inputarray. The example above is truncated for readability.
Send 1000 sequential requests and measure total time:
# Generate a valid 128-element input
input_json=$(python3 -c "import json; print(json.dumps({'input': [0.1]*128}))")
# Sequential stress test
for i in {1..1000}; do
curl -s -X POST http://localhost:8000/infer \
-H "Content-Type: application/json" \
-d "{
\"model_id\": \"stub_v1\",
\"model_type\": \"classification\",
\"priority\": 2,
\"payload\": $input_json
}" > /dev/null
doneFor true concurrency testing, use a load generator:
Using hey (simple, single-threaded):
# Install: go install github.com/rakyll/hey@latest
# Or: apt-get install hey
# Run 50,000 requests with 500 concurrent connections
hey -n 50000 -c 500 -m POST \
-H "Content-Type: application/json" \
-d '{"model_id":"stub_v1","model_type":"classification","priority":2,"payload":{"input":[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1]}}' \
http://localhost:8000/inferUsing wrk (more accurate, multi-threaded):
# Install wrk first: https://github.com/wg/wrk/wiki/Installing-Wrk-on-Linux
# Create a Lua script for POST body
cat > infer.lua << 'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"model_id":"stub_v1","model_type":"classification","priority":2,"payload":{"input":[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1]}}'
EOF
# Run with 12 threads, 400 connections, for 30 seconds
wrk -t12 -c400 -d30s -s infer.lua http://localhost:8000/inferMonitor during stress test:
# Watch queue depth and pending requests
curl http://localhost:8000/health | python3 -m json.tool
# Watch Prometheus metrics
curl http://localhost:8000/metrics | grep bales_Test pure batching throughput (no HTTP or Redis overhead). This tells you the theoretical maximum of the batching engine:
uv run python benchmarks/profile_batcher.pyExpected output:
=== Batcher Benchmark ===
Requests: 10,000
Elapsed: 1.23s
Throughput: 8,130 req/s
P50 latency: 2.45ms
P99 latency: 8.12ms
Avg batch size: 31.2
Model calls: 321 (vs 10000 individual = 31.2x reduction)
Target: >8,000 req/s, P99 <12ms
Pass: PASS
What to tune:
- If throughput is low (< 8,000): increase
BATCH_WINDOW_MSslightly (try 5ms -> 8ms) to allow more requests to accumulate per batch - If P99 is high (> 12ms): reduce
BATCH_WINDOW_MS(try 5ms -> 3ms) or increaseTHREAD_POOL_SIZE - If avg batch size is low (< 20): increase concurrent load or reduce window
Benchmark the complete HTTP -> Redis -> Batcher pipeline:
# Install locust (already in dev dependencies)
uv sync --extra dev
# Run headless load test
uv run locust -f benchmarks/locustfile.py \
--headless -u 500 -r 100 \
--run-time 60s --host http://localhost:8000Parameters explained:
-u 500: spawn 500 concurrent users-r 100: hatch 100 users per second--run-time 60s: run for 60 seconds--host http://localhost:8000: target the local server
After the run, Locust prints:
- Total requests per second (RPS)
- Average, median, and percentile latencies
- Failure count and error rate
| Metric | Target | What to do if failing |
|---|---|---|
| Throughput | > 8,000 req/s | Increase -u (users) in Locust. Check CPU usage with htop. |
| P99 latency | < 12ms | Reduce BATCH_WINDOW_MS or increase THREAD_POOL_SIZE. Check /health for queue backlog. |
| Error rate | 0% | Check logs for timeout or Redis connection errors. Verify pending_futures is 0 in /health. |
| Avg batch size | > 20 | Should be close to MAX_BATCH_SIZE (32). If low, increase load or window. |
Comparison checklist:
- Run
profile_batcher.pyfirst to establish the ceiling (no HTTP/Redis overhead) - Run
locustfile.pyto measure real-world throughput - Compare:
locust RPSshould be ~60-80% ofprofile_batcher RPSdue to HTTP + Redis overhead - If gap is larger: HTTP layer or Redis is the bottleneck, not the batcher
- If gap is small: batcher is the bottleneck, tune
THREAD_POOL_SIZEorBATCH_WINDOW_MS
BALES follows security best practices:
- Input validation: All requests are validated via Pydantic v2 before entering the pipeline.
- No shell execution:
weights_pathin hot-swap is validated withPath.exists()and never passed to shell commands. - Resource limits: Docker Compose enforces CPU (
4.0) and memory (2G) caps. - No Redis persistence: Queue data is ephemeral by design (
--save "" --appendonly no) to avoid I/O overhead and accidental data retention. - Single worker: Prevents shared-state corruption; horizontal scaling is done via container replicas behind a load balancer.
- Healthchecks: Docker
HEALTHCHECKpolls/healthevery 10s to detect degraded state. - Non-root container: The Docker image runs as an unprivileged
balesuser.
Submit an inference request.
Request body:
{
"model_id": "stub_v1",
"model_type": "classification",
"priority": 2,
"payload": {
"input": [0.1, 0.2, ...]
}
}Response:
{
"request_id": "uuid",
"model_id": "stub_v1",
"result": { "label": 3, "confidence": 0.95 },
"latency_ms": 4.123,
"batch_size": 16,
"queued_ms": 1.234
}Returns system health, registered models, queue depths, and pending request count.
curl http://localhost:8000/healthHot-swap a model's weights without dropping traffic.
Request body:
{
"weights_path": "./weights/new_model.pt"
}Example:
curl -X POST http://localhost:8000/models/stub_v1/reload \
-H "Content-Type: application/json" \
-d '{"weights_path": "./weights/stub_v2.pt"}'Prometheus scrape endpoint exposing:
bales_requests_total- Total inference requests by model_id and statusbales_request_latency_ms- End-to-end latency distributionbales_batch_size- Number of requests in each dispatched batchbales_queue_depth- Number of requests waiting in priority queue
curl http://localhost:8000/metrics# Run the test suite (requires Redis on localhost:6379)
uv run pytest tests/ -v
# Run a specific test file
uv run pytest tests/test_integration.py -v
# Run with coverage
uv run pytest tests/ -v --cov=. --cov-report=html
# Profile the batcher
uv run python benchmarks/profile_batcher.py
# Run the load test
uv run locust -f benchmarks/locustfile.py --headless -u 500 -r 100 --run-time 60s --host http://localhost:8000
# Lint check
uv run ruff check .
# Format check
uv run ruff format --check .
# Type check
uv run ty check
Built with FastAPI + PyTorch + Redis + uv
Buy me a coffee