An implementation of the Recursive Language Models paradigm introduced by Zhang, Kraska & Khattab (2026). Instead of one-shot retrieval + generation (RAG), an RLM gives the LLM a sandboxed Python REPL pre-loaded with the full corpus and lets it iteratively write and execute code to explore, filter, extract, and compose an answer.
This repository contains two things:
- `rlm/` -- a reusable Python library that implements the core RLM loop. You can drop it into your own RAG-style app.
- `playground/` -- a full-stack demo app (FastAPI + React) that lets you compare RLM against four traditional retrieval strategies side by side on an AP News health corpus.
A typical execution follows this pattern:
- INIT -- session created, corpus uploaded, Copilot SDK installed in sandbox
- CODE step 1 -- explore context structure (length, sections, format)
- CODE step 2 -- search and filter for query-relevant sections
- CODE step 3 -- extract structured details, optionally spawning `rlm_query()` sub-tasks
- CODE step 4 -- compose a natural language answer from extracted data
- CODE step 5 -- set `used_sections` and `Final = answer`
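The steps above can be sketched as the kind of code the root LLM emits. This is an illustrative sketch, not output from the actual system: in the real sandbox `context` and `rlm_query` are pre-injected by `rlm/repl_init.py`, and the corpus/section format shown here is hypothetical. They are stubbed below so the sketch runs standalone.

```python
# Stand-ins for the REPL globals injected by rlm/repl_init.py (hypothetical corpus).
context = "## Measles outbreak\nCases rose in 2024.\n\n## Flu season\nMild year."
def rlm_query(question, sub_context=None):   # stub for the recursive sub-task call
    return f"[sub-answer about: {sub_context.splitlines()[0]}]"

# Step 1: explore context structure
print(len(context))

# Step 2: search and filter for query-relevant sections
sections = context.split("\n\n")
hits = [s for s in sections if "measles" in s.lower()]

# Step 3: extract details, optionally spawning a recursive sub-task per hit
summaries = [rlm_query("Summarize the key finding.", sub_context=h) for h in hits]

# Steps 4-5: compose the answer and set the completion variables
used_sections = [h.splitlines()[0].lstrip("# ") for h in hits]
Final = " ".join(summaries)
```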
The root LLM (via the Copilot SDK) writes Python code, which is executed in a sandboxed REPL (ACA Dynamic Sessions). The REPL has:
- A pre-loaded `context` variable (the full corpus)
- An `rlm_query(question, sub_context=None)` function that spawns a full recursive RLM sub-task -- a new Copilot SDK session with its own code-writing loop, running inside the same sandbox
- A `Final` variable (answer slot)
The loop repeats until `Final` is set or the iteration limit is reached.
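A minimal sketch of that orchestration loop (not the actual `rlm/loop.py` code; `generate_code` and `repl` are hypothetical stand-ins for the Copilot SDK session and the ACA Dynamic Sessions client):

```python
def run_rlm_loop(generate_code, repl, max_iterations=20):
    """Iterate: code gen -> REPL exec -> check Final (hedged sketch)."""
    for step in range(max_iterations):
        code = generate_code(repl.transcript)   # root LLM writes Python
        repl.execute(code)                      # run it in the sandbox
        if repl.get("Final") is not None:       # answer slot set -> done
            return repl.get("Final"), step + 1
    return None, max_iterations                 # iteration limit reached
```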
Because rlm_query() itself has access to rlm_query(), sub-tasks can
recurse to arbitrary depth (capped by RLM_MAX_DEPTH). At max depth,
rlm_query uses a tool-less Copilot SDK session that answers directly
without code execution. Every level of recursion goes through the Copilot
SDK -- there are no raw HTTP LLM calls.
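One way to picture the depth cap (a hedged sketch, not the mechanics of `rlm/repl_init.py`; `answer_directly` and `run_code_loop` are hypothetical stand-ins for the tool-less leaf session and the full code-writing loop):

```python
def make_rlm_query(answer_directly, run_code_loop, max_depth=2, depth=0):
    """Build an rlm_query whose nested calls carry depth+1 (sketch)."""
    def rlm_query(question, sub_context=None):
        if depth >= max_depth:
            # Leaf call: tool-less session, answers directly without code execution.
            return answer_directly(question, sub_context)
        # Recursive case: a full code-writing loop whose REPL gets its own
        # rlm_query, one level deeper.
        child = make_rlm_query(answer_directly, run_code_loop,
                               max_depth, depth + 1)
        return run_code_loop(question, sub_context, rlm_query=child)
    return rlm_query
```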
| Aspect | RAG | RLM |
|---|---|---|
| Retrieval | Single-pass (BM25, vector, hybrid) | Iterative, code-driven |
| Reasoning | Generator reads top-k docs | LLM writes Python to search, filter, extract |
| Adaptability | Fixed ranking function | Each step adapts to the output of the previous step |
| Latency | Fast (~100-350 ms) | Slower (~15-60 s) |
| Best for | High-volume, latency-sensitive queries | Complex analytical questions, multi-hop reasoning |
RAG works well when relevant documents are easily identifiable by keyword or semantic similarity. RLM excels when the question requires multi-hop reasoning, aggregation across documents, or conditional filtering that a static retrieval function cannot express.
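As a concrete illustration of a question a static ranking function cannot express, consider "which topics recur across 2024 articles?" -- a conditional filter followed by cross-document aggregation. The document schema below is hypothetical; the point is that an RLM can chain these steps in code, while a single similarity search cannot:

```python
from collections import Counter

# Hypothetical corpus records (title/year/tags are illustrative fields).
docs = [
    {"title": "Measles outbreak grows",  "year": 2024, "tags": ["measles", "vaccines"]},
    {"title": "Vaccine hesitancy study", "year": 2024, "tags": ["vaccines"]},
    {"title": "Flu season recap",        "year": 2023, "tags": ["flu"]},
]

# Conditional filter (year == 2024), then aggregation across documents.
tag_counts = Counter(t for d in docs if d["year"] == 2024 for t in d["tags"])
recurring = [t for t, n in tag_counts.items() if n > 1]
print(recurring)   # -> ['vaccines']
```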
The table below maps each aspect of the original paper to what is implemented here, with links to the relevant source files.
| Paper concept | Status | Implementation |
|---|---|---|
| Root LLM drives iterative code generation | Implemented | rlm/loop.py -- rlm_query() orchestrates the full loop |
| Sandboxed REPL with `context` variable | Implemented | rlm/session_manager.py -- ACA Dynamic Sessions client; rlm/repl_init.py -- injects context, Final, rlm_query() |
| Recursive sub-task spawning from REPL | Implemented | rlm/repl_init.py -- rlm_query() creates a new Copilot SDK session inside the sandbox; sub-tasks can themselves recurse |
| `Final` variable to signal answer completion | Implemented | rlm/agents.py -- tool handler checks Final after each execution |
| run_python custom tool for the root agent | Implemented | rlm/agents.py -- _make_run_python_tool() |
| System prompt with iterative strategy and code examples | Implemented | rlm/prompts.py -- RLM_ROOT_SYSTEM_PROMPT |
| Source attribution via `used_sections` | Implemented | rlm/loop.py -- reads used_sections from REPL after completion |
| Step-by-step trace / observability | Implemented | rlm/models.py -- StepEvent emitted at each iteration |
| Corpus chunking strategies (by headers, fixed-size, recursive) | Implemented | Described in the system prompt; the LLM decides the strategy at runtime |
| Benchmarking against traditional retrieval | Extended | playground/evaluation.py -- multi-dimensional scoring (relevance, groundedness, coherence, F1, pairwise) using azure-ai-evaluation SDK |
| Multiple retrieval baselines (keyword, vector, hybrid, agentic) | Extended | playground/retrieval.py, playground/agentic_retrieval.py -- four baseline strategies for comparison |
| Interactive comparison UI | Extended | playground/frontend/ -- React app with chat, side-by-side compare, and benchmark dashboard |
| Streaming execution trace | Extended | playground/main.py -- SSE endpoint streams StepEvents in real time |
| Multi-modal corpora (images, tables, structured data) | Not implemented | The paper discusses multi-modal contexts; this implementation handles text only |
| Alternative sandbox runtimes (local Docker, etc.) | Not implemented | Only ACA Dynamic Sessions is supported as the REPL backend |
rlm/ Core library (pip-installable)
__init__.py Public API: rlm_query, RLMConfig, StepEvent, ...
loop.py RLM orchestrator (iterate: code gen -> REPL exec -> check Final)
agents.py Copilot SDK wrappers (root session, sub-agent, run_python tool)
prompts.py System prompts for root and sub agents
repl_init.py REPL startup code (context loading, rlm_query())
session_manager.py ACA Dynamic Sessions HTTP client
config.py RLMConfig (pydantic-settings, reads from .env)
models.py StepEvent, StepKind, ExecutionResult
data/ Copilot SDK wheel for sandbox installation
playground/ Demo application (FastAPI + React)
main.py API endpoints (chat, benchmark, SSE streaming)
config.py Extends RLMConfig with search/gateway settings
models.py Request/response models for the API
rag_chat.py RAG pipeline (keyword, vector, hybrid, agentic, RLM)
retrieval.py Azure AI Search driver
agentic_retrieval.py Knowledge Base agent strategy
evaluation.py Multi-dimensional evaluation (azure-ai-evaluation SDK)
benchmark_data.py Curated benchmark questions with ground truths
corpus.py AP News data loader
data/ AP News health corpus (raw JSON)
frontend/ React + TypeScript UI (Vite)
static/ Frontend build output (served by FastAPI)
infra/ Bicep modules for Azure deployment (azd)
notebooks/ Jupyter notebooks
rlm_howto.ipynb Step-by-step walkthrough of the RLM loop
- Python 3.11+
- Node.js 20+
- An Azure subscription with:
- Azure AI Foundry (Cognitive Services) with a GPT model deployed
- ACA Dynamic Sessions pool
- Azure Developer CLI (azd)
```shell
# Provision infrastructure
azd auth login
azd up

# Install dependencies (includes playground extras)
pip install -e ".[playground]"
cd playground/frontend && npm install && npm run build && cd ../..

# Start the server
bash start.sh
```

Open http://localhost:8000 in a browser.
```python
from rlm import rlm_query

corpus = "\n\n".join(f"## {d['title']}\n{d['content']}" for d in my_docs)

answer, iters, steps, refs = await rlm_query(
    question="What patterns appear across all reports?",
    corpus=corpus,
)
```

See notebooks/rlm_howto.ipynb for a full walkthrough.
All settings are read from environment variables (or .env):
| Variable | Description | Default |
|---|---|---|
| AZURE_PROJECT_ENDPOINT | Azure AI Foundry project endpoint | (required) |
| SESSION_POOL_ENDPOINT | ACA Dynamic Sessions pool endpoint | (required) |
| ROOT_MODEL | Model for the RLM root agent | gpt-5-4-mini |
| SUB_MODEL | Model used by rlm_query() sub-tasks and leaf calls | gpt-5-4-mini |
| RLM_MAX_ITERATIONS | Max RLM loop iterations (root level) | 20 |
| RLM_MAX_DEPTH | Max recursion depth for nested rlm_query() calls | 2 |
| RLM_SUB_MAX_ITERATIONS | Max iterations per recursive sub-task | 10 |
| RLM_TRUNCATE_CHARS | Truncation limit for REPL stdout in step events | 2000 |
| AZURE_SEARCH_ENDPOINT | Azure AI Search endpoint (playground only) | |
| SEARCH_INDEX_NAME | Search index name (playground only) | ap-news |
| GATEWAY_HOST | Server bind address (playground only) | 0.0.0.0 |
| GATEWAY_PORT | Server port (playground only) | 8000 |
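A minimal `.env` for the core library might look like this (endpoint values are placeholders; the playground-only variables are omitted):

```shell
# Required endpoints (placeholder values)
AZURE_PROJECT_ENDPOINT=https://my-foundry-project.example.com
SESSION_POOL_ENDPOINT=https://my-session-pool.example.com

# Optional loop tuning (shown at their defaults)
RLM_MAX_ITERATIONS=20
RLM_MAX_DEPTH=2
RLM_SUB_MAX_ITERATIONS=10
```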
```shell
pytest tests/ -v
```

Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. arXiv:2512.24601

