An implementation of the Recursive Language Models paradigm introduced by Zhang, Kraska & Khattab (2026). Instead of one-shot retrieval + generation (RAG), an RLM gives the LLM a sandboxed Python REPL pre-loaded with the full corpus and lets it iteratively write and execute code to explore, filter, extract, and compose an answer.
This repository contains two things:
- `rlm/` -- a reusable Python library that implements the core RLM loop. You can drop it into your own RAG-style app.
- `playground/` -- a full-stack demo app (FastAPI + React) that lets you compare RLM against four traditional retrieval strategies side by side on an AP News health corpus.
A typical execution follows this pattern:
- INIT -- session created, corpus uploaded, Copilot SDK installed in sandbox
- CODE step 1 -- explore context structure (length, sections, format)
- CODE step 2 -- search and filter for query-relevant sections
- CODE step 3 -- extract structured details, optionally spawning `rlm_query()` sub-tasks
- CODE step 4 -- compose a natural language answer from extracted data
- CODE step 5 -- set `used_sections` and `Final = answer`
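The steps above can be sketched as the kind of code the root LLM emits. This is an illustrative sketch, not output from the actual system: in the real sandbox `context` and `rlm_query` are pre-injected by `rlm/repl_init.py`, and the corpus/section format shown here is hypothetical. They are stubbed below so the sketch runs standalone.

```python
# Stand-ins for the REPL globals injected by rlm/repl_init.py (hypothetical corpus).
context = "## Measles outbreak\nCases rose in 2024.\n\n## Flu season\nMild year."
def rlm_query(question, sub_context=None):   # stub for the recursive sub-task call
    return f"[sub-answer about: {sub_context.splitlines()[0]}]"

# Step 1: explore context structure
print(len(context))

# Step 2: search and filter for query-relevant sections
sections = context.split("\n\n")
hits = [s for s in sections if "measles" in s.lower()]

# Step 3: extract details, optionally spawning a recursive sub-task per hit
summaries = [rlm_query("Summarize the key finding.", sub_context=h) for h in hits]

# Steps 4-5: compose the answer and set the completion variables
used_sections = [h.splitlines()[0].lstrip("# ") for h in hits]
Final = " ".join(summaries)
```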
The root LLM (via the Copilot SDK) writes Python code, which is executed in a sandboxed REPL (ACA Dynamic Sessions). The REPL has:
- A pre-loaded `context` variable (the full corpus)
- An `rlm_query(question, sub_context=None)` function that spawns a full recursive RLM sub-task -- a new Copilot SDK session with its own code-writing loop, running inside the same sandbox
- A `Final` variable (answer slot)
The loop repeats until `Final` is set or the iteration limit is reached.
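A minimal sketch of that orchestration loop (not the actual `rlm/loop.py` code; `generate_code` and `repl` are hypothetical stand-ins for the Copilot SDK session and the ACA Dynamic Sessions client):

```python
def run_rlm_loop(generate_code, repl, max_iterations=20):
    """Iterate: code gen -> REPL exec -> check Final (hedged sketch)."""
    for step in range(max_iterations):
        code = generate_code(repl.transcript)   # root LLM writes Python
        repl.execute(code)                      # run it in the sandbox
        if repl.get("Final") is not None:       # answer slot set -> done
            return repl.get("Final"), step + 1
    return None, max_iterations                 # iteration limit reached
```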
Because rlm_query() itself has access to rlm_query(), sub-tasks can
recurse to arbitrary depth (capped by RLM_MAX_DEPTH). At max depth,
rlm_query uses a tool-less Copilot SDK session that answers directly
without code execution. Every level of recursion goes through the Copilot
SDK -- there are no raw HTTP LLM calls.
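One way to picture the depth cap (a hedged sketch, not the mechanics of `rlm/repl_init.py`; `answer_directly` and `run_code_loop` are hypothetical stand-ins for the tool-less leaf session and the full code-writing loop):

```python
def make_rlm_query(answer_directly, run_code_loop, max_depth=2, depth=0):
    """Build an rlm_query whose nested calls carry depth+1 (sketch)."""
    def rlm_query(question, sub_context=None):
        if depth >= max_depth:
            # Leaf call: tool-less session, answers directly without code execution.
            return answer_directly(question, sub_context)
        # Recursive case: a full code-writing loop whose REPL gets its own
        # rlm_query, one level deeper.
        child = make_rlm_query(answer_directly, run_code_loop,
                               max_depth, depth + 1)
        return run_code_loop(question, sub_context, rlm_query=child)
    return rlm_query
```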
| Aspect | RAG | RLM |
|---|---|---|
| Retrieval | Single-pass (BM25, vector, hybrid) | Iterative, code-driven |
| Reasoning | Generator reads top-k docs | LLM writes Python to search, filter, extract |
| Adaptability | Fixed ranking function | Each step adapts to the output of the previous step |
| Latency | Fast (~100-350 ms) | Slower (~15-60 s) |
| Best for | High-volume, latency-sensitive queries | Complex analytical questions, multi-hop reasoning |
RAG works well when relevant documents are easily identifiable by keyword or semantic similarity. RLM excels when the question requires multi-hop reasoning, aggregation across documents, or conditional filtering that a static retrieval function cannot express.
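As a concrete illustration of a question a static ranking function cannot express, consider "which topics recur across 2024 articles?" -- a conditional filter followed by cross-document aggregation. The document schema below is hypothetical; the point is that an RLM can chain these steps in code, while a single similarity search cannot:

```python
from collections import Counter

# Hypothetical corpus records (title/year/tags are illustrative fields).
docs = [
    {"title": "Measles outbreak grows",  "year": 2024, "tags": ["measles", "vaccines"]},
    {"title": "Vaccine hesitancy study", "year": 2024, "tags": ["vaccines"]},
    {"title": "Flu season recap",        "year": 2023, "tags": ["flu"]},
]

# Conditional filter (year == 2024), then aggregation across documents.
tag_counts = Counter(t for d in docs if d["year"] == 2024 for t in d["tags"])
recurring = [t for t, n in tag_counts.items() if n > 1]
print(recurring)   # -> ['vaccines']
```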
The table below maps each aspect of the original paper to what is implemented here, with links to the relevant source files.
| Paper concept | Status | Implementation |
|---|---|---|
| Root LLM drives iterative code generation | Implemented | rlm/loop.py -- rlm_query() orchestrates the full loop |
| Sandboxed REPL with `context` variable | Implemented | rlm/session_manager.py -- ACA Dynamic Sessions client; rlm/repl_init.py -- injects context, Final, rlm_query() |
| Recursive sub-task spawning from REPL | Implemented | rlm/repl_init.py -- rlm_query() creates a new Copilot SDK session inside the sandbox; sub-tasks can themselves recurse |
| `Final` variable to signal answer completion | Implemented | rlm/agents.py -- tool handler checks Final after each execution |
| run_python custom tool for the root agent | Implemented | rlm/agents.py -- _make_run_python_tool() |
| System prompt with iterative strategy and code examples | Implemented | rlm/prompts.py -- RLM_ROOT_SYSTEM_PROMPT |
| Source attribution via `used_sections` | Implemented | rlm/loop.py -- reads used_sections from REPL after completion |
| Step-by-step trace / observability | Implemented | rlm/models.py -- StepEvent emitted at each iteration |
| Corpus chunking strategies (by headers, fixed-size, recursive) | Implemented | Described in the system prompt; the LLM decides the strategy at runtime |
| Benchmarking against traditional retrieval | Extended | playground/evaluation.py -- multi-dimensional scoring (relevance, groundedness, coherence, F1, pairwise) using azure-ai-evaluation SDK |
| Multiple retrieval baselines (keyword, vector, hybrid, agentic) | Extended | playground/retrieval.py, playground/agentic_retrieval.py -- four baseline strategies for comparison |
| Interactive comparison UI | Extended | playground/frontend/ -- React app with chat, side-by-side compare, and benchmark dashboard |
| Streaming execution trace | Extended | playground/main.py -- SSE endpoint streams StepEvents in real time |
| Multi-modal corpora (images, tables, structured data) | Not implemented | The paper discusses multi-modal contexts; this implementation handles text only |
| Alternative sandbox runtimes (local Docker, etc.) | Not implemented | Only ACA Dynamic Sessions is supported as the REPL backend |
rlm/ Core library (pip-installable)
__init__.py Public API: rlm_query, RLMConfig, StepEvent, ...
loop.py RLM orchestrator (iterate: code gen -> REPL exec -> check Final)
agents.py Copilot SDK wrappers (root session, sub-agent, run_python tool)
prompts.py System prompts for root and sub agents
repl_init.py REPL startup code (context loading, rlm_query())
session_manager.py ACA Dynamic Sessions HTTP client
config.py RLMConfig (pydantic-settings, reads from .env)
models.py StepEvent, StepKind, ExecutionResult
data/ Copilot SDK wheel for sandbox installation
playground/ Demo application (FastAPI + React)
main.py API endpoints (chat, benchmark, SSE streaming)
config.py Extends RLMConfig with search/gateway settings
models.py Request/response models for the API
rag_chat.py RAG pipeline (keyword, vector, hybrid, agentic, RLM)
retrieval.py Azure AI Search driver
agentic_retrieval.py Knowledge Base agent strategy
evaluation.py Multi-dimensional evaluation (azure-ai-evaluation SDK)
benchmark_data.py Curated benchmark questions with ground truths
corpus.py AP News data loader
data/ AP News health corpus (raw JSON)
frontend/ React + TypeScript UI (Vite)
static/ Frontend build output (served by FastAPI)
infra/ Bicep modules for Azure deployment (azd)
notebooks/ Jupyter notebooks
rlm_howto.ipynb Step-by-step walkthrough of the RLM loop
- Python 3.11+
- Node.js 20+
- An Azure subscription with:
- Azure AI Foundry (Cognitive Services) with a GPT model deployed
- ACA Dynamic Sessions pool
- Azure Developer CLI (azd)
```shell
# Provision infrastructure
azd auth login
azd up

# Install dependencies (includes playground extras)
pip install -e ".[playground]"
cd playground/frontend && npm install && npm run build && cd ../..

# Start the server
bash start.sh
```

Open http://localhost:8000 in a browser.
```python
from rlm import rlm_query

corpus = "\n\n".join(f"## {d['title']}\n{d['content']}" for d in my_docs)

answer, iters, steps, refs = await rlm_query(
    question="What patterns appear across all reports?",
    corpus=corpus,
)
```

See notebooks/rlm_howto.ipynb for a full walkthrough.
All settings are read from environment variables (or .env):
| Variable | Description | Default |
|---|---|---|
| AZURE_PROJECT_ENDPOINT | Azure AI Foundry project endpoint | (required) |
| SESSION_POOL_ENDPOINT | ACA Dynamic Sessions pool endpoint | (required) |
| ROOT_MODEL | Model for the RLM root agent | gpt-5-4-mini |
| SUB_MODEL | Model used by rlm_query() sub-tasks and leaf calls | gpt-5-4-mini |
| RLM_MAX_ITERATIONS | Max RLM loop iterations (root level) | 20 |
| RLM_MAX_DEPTH | Max recursion depth for nested rlm_query() calls | 2 |
| RLM_SUB_MAX_ITERATIONS | Max iterations per recursive sub-task | 10 |
| RLM_TRUNCATE_CHARS | Truncation limit for REPL stdout in step events | 2000 |
| AZURE_SEARCH_ENDPOINT | Azure AI Search endpoint (playground only) | |
| SEARCH_INDEX_NAME | Search index name (playground only) | ap-news |
| GATEWAY_HOST | Server bind address (playground only) | 0.0.0.0 |
| GATEWAY_PORT | Server port (playground only) | 8000 |
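A minimal `.env` for the core library might look like this (endpoint values are placeholders; the playground-only variables are omitted):

```shell
# Required endpoints (placeholder values)
AZURE_PROJECT_ENDPOINT=https://my-foundry-project.example.com
SESSION_POOL_ENDPOINT=https://my-session-pool.example.com

# Optional loop tuning (shown at their defaults)
RLM_MAX_ITERATIONS=20
RLM_MAX_DEPTH=2
RLM_SUB_MAX_ITERATIONS=10
```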
```shell
pytest tests/ -v
```

Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. arXiv:2512.24601

