SmarterRouter supports multiple LLM backends through a unified interface.
| Feature | Ollama | llama.cpp | OpenAI-Compatible |
|---|---|---|---|
| Local inference | ✅ Native | ✅ Server | ❌ Remote APIs |
| VRAM management | ✅ Full | ❌ None | ❌ N/A |
| Model unloading | ✅ Yes | ❌ No | ❌ N/A |
| Embeddings | ✅ Yes | ✅ Yes | ✅ Via API |
| Best for | Local production | High-performance servers | External APIs |
- Install Ollama
- Pull models:

  ```bash
  ollama pull llama3:70b
  ollama pull codellama:34b
  ollama pull phi3:mini
  ```

- Start the Ollama service:

  ```bash
  systemctl --user start ollama  # or: ollama serve
  ```
Configuration:

```bash
ROUTER_PROVIDER=ollama
ROUTER_OLLAMA_URL=http://localhost:11434
```

Advantages:

- Native integration - No API translation layer
- Full VRAM management - SmarterRouter can load/unload models dynamically
- Embeddings support - Uses Ollama's native `/api/embed` endpoint
- Production-ready - Stable, well-tested

Notes:

- Ollama must be running before SmarterRouter starts
- Models are discovered automatically via `/api/tags`
- VRAM monitoring uses `nvidia-smi` to measure actual GPU usage
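Model discovery is easy to check by hand. As a rough sketch of what parsing Ollama's `/api/tags` response involves (the helper name is illustrative, not SmarterRouter's code; the payload shape follows Ollama's documented API):

```python
import json

def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    # /api/tags returns {"models": [{"name": "llama3:70b", ...}, ...]}
    return [m["name"] for m in tags_payload.get("models", [])]

# A payload shaped like Ollama's /api/tags response:
sample = json.loads('{"models": [{"name": "llama3:70b"}, {"name": "phi3:mini"}]}')
print(model_names(sample))  # ['llama3:70b', 'phi3:mini']
```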
- Build or download llama.cpp
- Start the server:

  ```bash
  ./server -m models/llama3-70b.gguf -c 4096 --port 8080
  ```

- Add models by starting additional server instances or by using llama-swap

Configuration:

```bash
ROUTER_PROVIDER=llama.cpp
ROUTER_OLLAMA_URL=http://localhost:8080  # llama.cpp server URL
```

Advantages:

- High performance - Direct GGUF execution, no Docker overhead
- Flexible deployment - Can run on CPU or GPU
- Multiple backends - Works with llama-swap for dynamic model switching

Limitations:

- No explicit model unloading - The llama.cpp server keeps models in memory; unload requests return `False` gracefully, but models stay loaded
- Manual model management - You manage server instances; SmarterRouter can't load/unload dynamically

Tips:

- Use llama-swap to dynamically switch models on the same server
- Allocate a sufficient context buffer: `-c 8192` for long conversations
- Use `--threads` and `--gpu-layers` to optimize performance
Works with any service that implements OpenAI's API spec.
- OpenAI (`https://api.openai.com/v1`)
- Anthropic (via anthropic-openai or LiteLLM)
- vLLM (self-hosted)
- Text Generation Inference (TGI)
- LiteLLM Proxy - unified proxy for 100+ providers
- LocalAI
- Ollama with OpenAI compatibility (`OLLAMA_ORIGINS=*`)
Example: OpenAI

```bash
ROUTER_PROVIDER=openai
ROUTER_OPENAI_BASE_URL=https://api.openai.com/v1
ROUTER_OPENAI_API_KEY=sk-your-key-here
```

Example: OpenRouter

```bash
ROUTER_PROVIDER=openai
ROUTER_OPENAI_BASE_URL=https://openrouter.ai/api/v1
ROUTER_OPENAI_API_KEY=sk-or-v1-your-key
ROUTER_MODEL_PREFIX=  # leave empty, model names already include provider
```

Now you can route between OpenAI, Anthropic, and other providers through a single SmarterRouter instance!

Example: Together AI

```bash
ROUTER_PROVIDER=openai
ROUTER_OPENAI_BASE_URL=https://api.together.xyz/v1
ROUTER_OPENAI_API_KEY=your-key
```

Advantages:

- Universal compatibility - Any OpenAI-compatible endpoint works
- Multi-provider routing - Route between OpenAI, Anthropic, etc.
- Cloud scale - No local VRAM constraints

Limitations:

- No local VRAM management - Cloud APIs manage their own resources
- API costs - Pay per token
- Rate limits - Subject to provider limits
- No model unloading - Not applicable
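All of these services speak the same wire format, which is what makes the single `openai` provider setting work. A minimal sketch of building the standard chat-completions request body (model name and message are illustrative):

```python
import json

def chat_request(model: str, user_message: str, stream: bool = False) -> str:
    """Build a standard OpenAI-style /v1/chat/completions request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }
    return json.dumps(body)

# POST this body to <ROUTER_OPENAI_BASE_URL>/chat/completions with an
# "Authorization: Bearer <ROUTER_OPENAI_API_KEY>" header.
print(chat_request("llama3:70b", "Hello!"))
```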
All backends have comprehensive test suites:

```bash
# Run backend-specific tests
pytest tests/test_ollama_backend.py -v
pytest tests/test_llama_cpp_backend.py -v
pytest tests/test_openai_backend.py -v

# Run contract tests (ensures all backends behave consistently)
pytest tests/test_backend_contract.py -v
```

If SmarterRouter can't reach a backend:

- Verify the backend is running and accessible:

  ```bash
  curl http://localhost:11434/api/tags  # adjust port
  ```

- Check firewall rules if the backend is remote
- Verify `ROUTER_OLLAMA_URL` is correct
- Check Docker networking (use `host.docker.internal` or `172.17.0.1`)

If models are missing:

- Ensure the model is pulled/loaded in the backend
- Check the backend's model list endpoint manually
- Restart SmarterRouter after adding new models

If inference is slow:

- Check context size: `-c 4096` or higher is recommended
- Enable GPU layers if available: `--gpu-layers 100`
- Use quantized models (GGUF) for faster CPU inference

If a cloud provider is erroring or rate-limiting:

- Check the provider dashboard for usage
- SmarterRouter now supports built-in retry + circuit-breaker resilience controls (see configuration docs)
- Consider adding multiple API keys for load balancing (coming soon)
SmarterRouter includes configurable retry and circuit-breaker resilience for backend calls.
Retry settings:

- `ROUTER_BACKEND_RETRY_ENABLED` (default: `true`)
- `ROUTER_BACKEND_MAX_RETRIES` (default: `3`)
- `ROUTER_BACKEND_RETRY_BASE_DELAY` (default: `0.5`)
- `ROUTER_BACKEND_RETRY_MAX_DELAY` (default: `8.0`)
Retries apply to transient failures (timeouts, network errors, HTTP 429, and HTTP 5xx).
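With a base delay and a max delay, the natural schedule is exponential backoff. A sketch of the delays those defaults imply (the doubling and the cap are assumptions for illustration, not necessarily SmarterRouter's exact schedule):

```python
def retry_delay(attempt: int, base_delay: float = 0.5, max_delay: float = 8.0) -> float:
    """Delay before retry number `attempt` (0-based): double each time, capped."""
    return min(base_delay * (2 ** attempt), max_delay)

print([retry_delay(n) for n in range(5)])  # [0.5, 1.0, 2.0, 4.0, 8.0]
```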
Circuit-breaker settings:

- `ROUTER_BACKEND_CIRCUIT_BREAKER_ENABLED` (default: `true`)
- `ROUTER_BACKEND_CIRCUIT_BREAKER_FAILURE_THRESHOLD` (default: `5`)
- `ROUTER_BACKEND_CIRCUIT_BREAKER_RESET_TIMEOUT` (default: `60.0` seconds)
- `ROUTER_BACKEND_CIRCUIT_BREAKER_HALF_OPEN_MAX_ATTEMPTS` (default: `3`)
- `ROUTER_BACKEND_CIRCUIT_BREAKER_SLIDING_WINDOW_SIZE` (default: `100`)
Circuit breakers are tracked per backend operation (for example, request and stream-setup paths), so one unstable path can open independently without globally disabling all backend functionality.
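For intuition, a minimal closed/open/half-open breaker keyed to the settings above (names and details are illustrative, not SmarterRouter's implementation):

```python
import time

class CircuitBreaker:
    """Tiny breaker: trip open after N failures, probe again after a timeout."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means closed (requests flow normally)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            self.opened_at = None  # half-open: let a probe request through
            return True
        return False  # open: fail fast without calling the backend

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())  # False: breaker has tripped open
```

Because SmarterRouter tracks one breaker per backend operation, a flapping stream-setup path can open while plain requests keep flowing.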
The backend abstraction layer makes adding new providers straightforward. Potential future additions:
- HuggingFace Text Generation Inference
- AWS Bedrock
- Google Vertex AI
- Azure OpenAI
- Custom RPC protocols
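Each of these would plug in behind the common backend interface. A hypothetical sketch of what such a contract can look like (method names are illustrative, not SmarterRouter's actual classes):

```python
from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """Illustrative contract a new provider would satisfy."""

    @abstractmethod
    def list_models(self) -> list[str]: ...

    @abstractmethod
    def generate(self, model: str, prompt: str) -> str: ...

    def unload(self, model: str) -> bool:
        """Free the model's memory; backends without unloading return False."""
        return False

class EchoBackend(LLMBackend):
    """Toy backend used only to show the contract in action."""

    def list_models(self) -> list[str]:
        return ["echo"]

    def generate(self, model: str, prompt: str) -> str:
        return f"[{model}] {prompt}"

backend = EchoBackend()
print(backend.generate("echo", "hi"))  # [echo] hi
print(backend.unload("echo"))         # False, as with llama.cpp
```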
If you need a specific backend, open an issue.