This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
OptiLLM is an OpenAI API compatible optimizing inference proxy that implements state-of-the-art techniques to improve accuracy and performance of LLMs. It focuses on reasoning improvements for coding, logical, and mathematical queries through inference-time compute optimization.
- Entry Points:
  - `optillm.py` - Main Flask server with inference routing
  - `optillm/inference.py` - Local inference engine with transformer models
  - Setup via `pyproject.toml` with console script `optillm=optillm:main`
- Optimization Techniques (`optillm/`):
  - Reasoning: `cot_reflection.py`, `plansearch.py`, `leap.py`, `reread.py`
  - Sampling: `bon.py` (Best of N), `moa.py` (Mixture of Agents), `self_consistency.py`
  - Search: `mcts.py` (Monte Carlo Tree Search), `rstar.py` (R* Algorithm)
  - Verification: `pvg.py` (Prover-Verifier Game), `z3_solver.py`
  - Advanced: `cepo/` (Cerebras Planning & Optimization), `rto.py` (Round Trip Optimization)
- Decoding Techniques:
  - `cot_decoding.py` - Chain-of-thought without explicit prompting
  - `entropy_decoding.py` - Adaptive sampling based on token uncertainty
  - `thinkdeeper.py` - Reasoning effort scaling
  - `autothink/` - Query complexity classification with steering vectors
- Plugin System (`optillm/plugins/`):
  - `spl/` - System Prompt Learning (third paradigm learning)
  - `deepthink/` - Gemini-like deep thinking with inference scaling
  - `longcepo/` - Long-context processing with divide-and-conquer
  - `mcp_plugin.py` - Model Context Protocol client
  - `memory_plugin.py` - Short-term memory for unbounded context
  - `privacy_plugin.py` - PII anonymization/deanonymization
  - `executecode_plugin.py` - Code interpreter integration
  - `json_plugin.py` - Structured outputs with the `outlines` library
```bash
# Development setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Package installation
pip install optillm
```

```bash
# Basic server (auto approach detection)
python optillm.py

# With specific approach
python optillm.py --approach moa --model gpt-4o-mini

# With external endpoint
python optillm.py --base_url http://localhost:8080/v1

# Docker
docker compose up -d
```
```bash
# Run all approach tests
python test.py

# Test specific approaches
python test.py --approaches moa bon mcts

# Test with specific model/endpoint
python test.py --model gpt-4o-mini --base-url http://localhost:8080/v1

# Single test case
python test.py --single-test "specific_test_name"
```
```bash
# Math benchmark evaluation
python scripts/eval_math500_benchmark.py

# AIME benchmark
python scripts/eval_aime_benchmark.py

# Arena Hard Auto evaluation
python scripts/eval_arena_hard_auto_rtc.py

# FRAMES benchmark
python scripts/eval_frames_benchmark.py

# OptiLLM benchmark generation/evaluation
python scripts/gen_optillmbench.py
python scripts/eval_optillmbench.py
```

Approaches can be selected per request in several ways (see the client sketch after this list):

- Model prefix: `moa-gpt-4o-mini` (approach slug + model name)
- `extra_body` field: `{"optillm_approach": "bon|moa|mcts"}`
- Prompt tags: `<optillm_approach>re2</optillm_approach>` in the system or user prompt
- Pipeline (`&`): `cot_reflection&moa` - sequential processing
- Parallel (`|`): `bon|moa|mcts` - multiple responses returned as a list
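A minimal client sketch of these selection methods, assuming the proxy is running locally on port 8000 (use whatever `--port` the server was started with) and `OPENAI_API_KEY` is set:

```python
from openai import OpenAI

# Point the standard OpenAI client at the OptiLLM proxy
client = OpenAI(base_url="http://localhost:8000/v1")

# Method 1: approach slug prefixed to the model name
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",
    messages=[{"role": "user", "content": "Solve 2x + 3 = 11 for x."}],
)

# Method 2: extra_body field; the & and | operators compose approaches
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Solve 2x + 3 = 11 for x."}],
    extra_body={"optillm_approach": "cot_reflection&moa"},
)
print(response.choices[0].message.content)
```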
For local inference (see the sketch after this list):

- Set `OPTILLM_API_KEY=optillm` to enable built-in transformer inference
- Supports HuggingFace models with LoRA adapters: `model+lora1+lora2`
- Advanced decoding: `{"decoding": "cot_decoding", "k": 10}`
Plugin configuration and capabilities:

- MCP: `~/.optillm/mcp_config.json` configures Model Context Protocol servers (see the config sketch after this list)
- SPL: Built-in system prompt learning for solving strategies
- Memory: Automatic unbounded context via chunking and retrieval
- GenSelect: Quality-based selection from multiple generated candidates
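A sketch of `~/.optillm/mcp_config.json`, assuming it follows the common MCP client schema; the server name, command, and arguments are illustrative placeholders:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    }
  }
}
```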
The proxy intercepts OpenAI API calls and applies optimization techniques before forwarding to LLM providers (OpenAI, Cerebras, Azure, LiteLLM). Each technique implements specific reasoning or sampling improvements.
Plugins extend functionality via standardized interfaces. They can modify requests, process responses, add tools, or provide entirely new capabilities like code execution or structured outputs.
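A minimal sketch of that interface, assuming the convention used by the bundled plugins (a module-level `SLUG` plus a `run()` entry point returning the response text and completion-token count; check an existing plugin such as `memory_plugin.py` for the authoritative signature):

```python
# hypothetical_plugin.py - illustrative sketch, not a bundled plugin
from typing import Tuple

SLUG = "hypothetical"  # the approach name used to route requests to this plugin

def run(system_prompt: str, initial_query: str, client, model: str) -> Tuple[str, int]:
    """Rewrite the request, call the underlying model, and return
    (response_text, completion_tokens)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": initial_query},
        ],
    )
    return response.choices[0].message.content, response.usage.completion_tokens
```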
The proxy automatically detects and routes to the appropriate LLM provider based on environment variables (OPENAI_API_KEY, CEREBRAS_API_KEY, etc.), falling back to LiteLLM for broader model support.
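Roughly, the detection order behaves like this hedged sketch (simplified; the real logic also covers Azure and other providers):

```python
import os

def detect_provider() -> str:
    # Dedicated providers win when their keys are present;
    # otherwise LiteLLM is used for broad model support.
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    if os.environ.get("CEREBRAS_API_KEY"):
        return "cerebras"
    return "litellm"
```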