CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OptiLLM is an OpenAI API-compatible optimizing inference proxy that implements state-of-the-art techniques to improve the accuracy and performance of LLMs. It focuses on reasoning improvements for coding, logical, and mathematical queries through inference-time compute optimization.

Core Architecture

Main Components

  1. Entry Points:

    • optillm.py - Main Flask server with inference routing
    • optillm/inference.py - Local inference engine with transformer models
    • Setup via pyproject.toml with console script optillm=optillm:main
  2. Optimization Techniques (optillm/):

    • Reasoning: cot_reflection.py, plansearch.py, leap.py, reread.py
    • Sampling: bon.py (Best of N), moa.py (Mixture of Agents), self_consistency.py
    • Search: mcts.py (Monte Carlo Tree Search), rstar.py (R* Algorithm)
    • Verification: pvg.py (Prover-Verifier Game), z3_solver.py
    • Advanced: cepo/ (Cerebras Planning & Optimization), rto.py (Round Trip Optimization)
  3. Decoding Techniques:

    • cot_decoding.py - Chain-of-thought without explicit prompting
    • entropy_decoding.py - Adaptive sampling based on token uncertainty
    • thinkdeeper.py - Reasoning effort scaling
    • autothink/ - Query complexity classification with steering vectors
  4. Plugin System (optillm/plugins/):

    • spl/ - System Prompt Learning (a third learning paradigm for LLMs)
    • deepthink/ - Gemini-like deep thinking with inference scaling
    • longcepo/ - Long-context processing with divide-and-conquer
    • mcp_plugin.py - Model Context Protocol client
    • memory_plugin.py - Short-term memory for unbounded context
    • privacy_plugin.py - PII anonymization/deanonymization
    • executecode_plugin.py - Code interpreter integration
    • json_plugin.py - Structured outputs with outlines library

Development Commands

Installation & Setup

# Development setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Package installation
pip install optillm

Running the Server

# Basic server (auto approach detection)
python optillm.py

# With specific approach
python optillm.py --approach moa --model gpt-4o-mini

# With external endpoint
python optillm.py --base_url http://localhost:8080/v1

# Docker
docker compose up -d

Testing

# Run all approach tests
python test.py

# Test specific approaches
python test.py --approaches moa bon mcts

# Test with specific model/endpoint
python test.py --model gpt-4o-mini --base-url http://localhost:8080/v1

# Single test case
python test.py --single-test "specific_test_name"

Evaluation Scripts

# Math benchmark evaluation
python scripts/eval_math500_benchmark.py

# AIME benchmark
python scripts/eval_aime_benchmark.py

# Arena Hard Auto evaluation  
python scripts/eval_arena_hard_auto_rtc.py

# FRAMES benchmark
python scripts/eval_frames_benchmark.py

# OptiLLM benchmark generation/evaluation
python scripts/gen_optillmbench.py
python scripts/eval_optillmbench.py

Usage Patterns

Approach Selection (Priority Order)

  1. Model prefix: moa-gpt-4o-mini (approach slug + model name)
  2. extra_body field: {"optillm_approach": "bon|moa|mcts"}
  3. Prompt tags: <optillm_approach>re2</optillm_approach> in system/user prompt
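
A minimal client-side sketch of the first two selection methods (the port and API key are assumptions about a local deployment with the proxy on its default port; adjust to your setup):

# Sketch: selecting an approach from an OpenAI-client application pointed at the proxy
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # provider key is configured server-side

# 1. Approach via model prefix (approach slug + model name)
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
)

# 2. Approach via extra_body field
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    extra_body={"optillm_approach": "bon"},
)
print(response.choices[0].message.content)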

Approach Combinations

  • Pipeline (&): cot_reflection&moa - sequential processing
  • Parallel (|): bon|moa|mcts - multiple responses returned as a list
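
A sketch of both combination styles (the model name and prompt are illustrative):

# Sketch: combining approaches
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Pipeline: cot_reflection runs first, its output feeds moa
response = client.chat.completions.create(
    model="cot_reflection&moa-gpt-4o-mini",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

# Parallel: each approach answers independently; the responses come back as a list
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    extra_body={"optillm_approach": "bon|moa|mcts"},
)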

Local Inference

  • Set OPTILLM_API_KEY=optillm to enable built-in transformer inference
  • Supports HuggingFace models with LoRA adapters: model+lora1+lora2
  • Advanced decoding: {"decoding": "cot_decoding", "k": 10}
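
A sketch of requesting an advanced decoding method from the local inference engine (the model id is illustrative; any HuggingFace model id should work):

# Sketch: local inference with CoT decoding; requires OPTILLM_API_KEY=optillm on the server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative HuggingFace model id
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"decoding": "cot_decoding", "k": 10},
)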

Plugin Configuration

  • MCP: ~/.optillm/mcp_config.json for Model Context Protocol servers
  • SPL: Built-in system prompt learning of problem-solving strategies
  • Memory: Automatic unbounded context via chunking and retrieval
  • GenSelect: Quality-based selection from multiple generated candidates
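
Plugins are selected through the same mechanisms as approaches; a sketch, assuming the plugin slug matches the feature name (e.g. memory):

# Sketch: enabling the memory plugin via the approach-selection mechanism
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "<very long document> ... What does section 3 conclude?"}],
    extra_body={"optillm_approach": "memory"},
)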

Key Concepts

Inference Optimization

The proxy intercepts OpenAI API calls and applies optimization techniques before forwarding requests to LLM providers (OpenAI, Cerebras, Azure, LiteLLM). Each technique implements a specific reasoning or sampling improvement.

Plugin Architecture

Plugins extend functionality via standardized interfaces. They can modify requests, process responses, add tools, or provide entirely new capabilities like code execution or structured outputs.
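
A minimal sketch of what a plugin module might look like, assuming the module-level SLUG plus run() contract used by the bundled plugins; check an existing plugin such as optillm/plugins/memory_plugin.py for the exact signature and return value.

# Hypothetical plugin sketch; the SLUG/run() contract and (text, tokens) return value
# are assumptions modeled on the bundled plugins -- verify before relying on them.
SLUG = "shout"  # hypothetical approach name used for routing

def run(system_prompt: str, initial_query: str, client, model: str):
    """Call the underlying model, post-process the reply, return (text, completion_tokens)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": initial_query},
        ],
    )
    text = response.choices[0].message.content.upper()  # trivial post-processing step
    return text, response.usage.completion_tokens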

Multi-Provider Support

The proxy automatically detects and routes to the appropriate LLM provider based on environment variables (OPENAI_API_KEY, CEREBRAS_API_KEY, etc.), falling back to LiteLLM for broader model support.
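
An illustrative sketch of the detection idea (not the actual routing code):

# Illustrative only: pick a backend name from the environment, as the proxy does internally
import os

def detect_provider() -> str:
    if os.getenv("OPENAI_API_KEY"):
        return "openai"
    if os.getenv("CEREBRAS_API_KEY"):
        return "cerebras"
    return "litellm"  # fallback for broader model support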