# Recursive Language Models (RLM) on Azure

An implementation of the Recursive Language Models paradigm introduced by Zhang, Kraska & Khattab (2026). Instead of one-shot retrieval + generation (RAG), an RLM gives the LLM a sandboxed Python REPL pre-loaded with the full corpus and lets it iteratively write and execute code to explore, filter, extract, and compose an answer.

This repository contains two things:

  1. rlm/ -- a reusable Python library that implements the core RLM loop. You can drop it into your own RAG-style app.
  2. playground/ -- a full-stack demo app (FastAPI + React) that lets you compare RLM against four traditional retrieval strategies side by side on an AP News health corpus.

*(Screenshot: trace view of an RLM execution)*

## How RLM works

A typical execution follows this pattern:

  1. INIT -- session created, corpus uploaded, Copilot SDK installed in sandbox
  2. CODE step 1 -- explore context structure (length, sections, format)
  3. CODE step 2 -- search and filter for query-relevant sections
  4. CODE step 3 -- extract structured details, optionally spawning rlm_query() sub-tasks
  5. CODE step 4 -- compose a natural language answer from extracted data
  6. CODE step 5 -- set used_sections and Final = answer
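The code the root LLM writes at these steps might look like the following self-contained sketch. The corpus content and filtering logic are purely illustrative; only the `context`, `Final`, and `used_sections` variable names come from this repository's REPL contract.

```python
# Illustrative stand-in for the pre-loaded corpus variable.
context = "## Flu season\nCases are rising.\n\n## Vaccines\nNew booster approved."

# CODE step 1: explore context structure (length, sections, format)
sections = [s for s in context.split("\n\n") if s.strip()]
print(f"{len(context)} chars, {len(sections)} sections")

# CODE step 2: search and filter for query-relevant sections
relevant = [s for s in sections if "vaccine" in s.lower()]

# CODE step 3: extract structured details (here: section titles)
titles = [s.splitlines()[0].lstrip("# ") for s in relevant]

# CODE steps 4-5: compose the answer and fill the answer slot
used_sections = titles
Final = f"Relevant sections: {', '.join(titles)}"
print(Final)
```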

The root LLM (via the Copilot SDK) writes Python code, which is executed in a sandboxed REPL (ACA Dynamic Sessions). The REPL has:

  • A pre-loaded context variable (the full corpus)
  • An rlm_query(question, sub_context=None) function that spawns a full recursive RLM sub-task -- a new Copilot SDK session with its own code-writing loop, running inside the same sandbox
  • A Final variable (answer slot)

The loop repeats until Final is set or the iteration limit is reached.

Because rlm_query() itself has access to rlm_query(), sub-tasks can recurse to arbitrary depth (capped by RLM_MAX_DEPTH). At max depth, rlm_query uses a tool-less Copilot SDK session that answers directly without code execution. Every level of recursion goes through the Copilot SDK -- there are no raw HTTP LLM calls.
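The depth cap can be illustrated with a small stand-alone sketch (function bodies are stand-ins, not the code in `rlm/repl_init.py`): below the cap, `rlm_query` spawns a full sub-task that can recurse again; at the cap, it falls back to a direct, tool-less answer.

```python
RLM_MAX_DEPTH = 2

def answer_directly(question, ctx):
    # Stand-in for a tool-less Copilot SDK session (no code execution).
    return f"leaf answer to: {question}"

def rlm_query(question, sub_context=None, depth=0):
    if depth >= RLM_MAX_DEPTH:
        # At max depth: answer directly instead of spawning a REPL loop.
        return answer_directly(question, sub_context)
    # Below max depth: spawn a full sub-task whose own REPL can call
    # rlm_query at depth + 1 (shown here as one direct recursive call).
    return rlm_query(f"sub: {question}", sub_context, depth + 1)

print(rlm_query("root question"))
```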

## RLM vs RAG

| Aspect | RAG | RLM |
|---|---|---|
| Retrieval | Single-pass (BM25, vector, hybrid) | Iterative, code-driven |
| Reasoning | Generator reads top-k docs | LLM writes Python to search, filter, extract |
| Adaptability | Fixed ranking function | Each step adapts to the output of the previous step |
| Latency | Fast (~100-350 ms) | Slower (~15-60 s) |
| Best for | High-volume, latency-sensitive queries | Complex analytical questions, multi-hop reasoning |

RAG works well when relevant documents are easily identifiable by keyword or semantic similarity. RLM excels when the question requires multi-hop reasoning, aggregation across documents, or conditional filtering that a static retrieval function cannot express.

## Paper vs. this implementation

The table below maps each aspect of the original paper to what is implemented here, with links to the relevant source files.

| Paper concept | Status | Implementation |
|---|---|---|
| Root LLM drives iterative code generation | Implemented | `rlm/loop.py` -- `rlm_query()` orchestrates the full loop |
| Sandboxed REPL with `context` variable | Implemented | `rlm/session_manager.py` -- ACA Dynamic Sessions client; `rlm/repl_init.py` -- injects `context`, `Final`, `rlm_query()` |
| Recursive sub-task spawning from REPL | Implemented | `rlm/repl_init.py` -- `rlm_query()` creates a new Copilot SDK session inside the sandbox; sub-tasks can themselves recurse |
| `Final` variable to signal answer completion | Implemented | `rlm/agents.py` -- tool handler checks `Final` after each execution |
| `run_python` custom tool for the root agent | Implemented | `rlm/agents.py` -- `_make_run_python_tool()` |
| System prompt with iterative strategy and code examples | Implemented | `rlm/prompts.py` -- `RLM_ROOT_SYSTEM_PROMPT` |
| Source attribution via `used_sections` | Implemented | `rlm/loop.py` -- reads `used_sections` from REPL after completion |
| Step-by-step trace / observability | Implemented | `rlm/models.py` -- `StepEvent` emitted at each iteration |
| Corpus chunking strategies (by headers, fixed-size, recursive) | Implemented | Described in the system prompt; the LLM decides the strategy at runtime |
| Benchmarking against traditional retrieval | Extended | `playground/evaluation.py` -- multi-dimensional scoring (relevance, groundedness, coherence, F1, pairwise) using the azure-ai-evaluation SDK |
| Multiple retrieval baselines (keyword, vector, hybrid, agentic) | Extended | `playground/retrieval.py`, `playground/agentic_retrieval.py` -- four baseline strategies for comparison |
| Interactive comparison UI | Extended | `playground/frontend/` -- React app with chat, side-by-side compare, and benchmark dashboard |
| Streaming execution trace | Extended | `playground/main.py` -- SSE endpoint streams `StepEvent`s in real time |
| Multi-modal corpora (images, tables, structured data) | Not implemented | The paper discusses multi-modal contexts; this implementation handles text only |
| Alternative sandbox runtimes (local Docker, etc.) | Not implemented | Only ACA Dynamic Sessions is supported as the REPL backend |

## Project structure

```
rlm/                   Core library (pip-installable)
  __init__.py           Public API: rlm_query, RLMConfig, StepEvent, ...
  loop.py               RLM orchestrator (iterate: code gen -> REPL exec -> check Final)
  agents.py             Copilot SDK wrappers (root session, sub-agent, run_python tool)
  prompts.py            System prompts for root and sub agents
  repl_init.py          REPL startup code (context loading, rlm_query())
  session_manager.py    ACA Dynamic Sessions HTTP client
  config.py             RLMConfig (pydantic-settings, reads from .env)
  models.py             StepEvent, StepKind, ExecutionResult
  data/                 Copilot SDK wheel for sandbox installation

playground/            Demo application (FastAPI + React)
  main.py               API endpoints (chat, benchmark, SSE streaming)
  config.py             Extends RLMConfig with search/gateway settings
  models.py             Request/response models for the API
  rag_chat.py           RAG pipeline (keyword, vector, hybrid, agentic, RLM)
  retrieval.py          Azure AI Search driver
  agentic_retrieval.py  Knowledge Base agent strategy
  evaluation.py         Multi-dimensional evaluation (azure-ai-evaluation SDK)
  benchmark_data.py     Curated benchmark questions with ground truths
  corpus.py             AP News data loader
  data/                 AP News health corpus (raw JSON)
  frontend/             React + TypeScript UI (Vite)
  static/               Frontend build output (served by FastAPI)
  infra/                Bicep modules for Azure deployment (azd)

notebooks/             Jupyter notebooks
  rlm_howto.ipynb       Step-by-step walkthrough of the RLM loop
```

## Quickstart

### Prerequisites

  • Python 3.11+
  • Node.js 20+
  • An Azure subscription with:
    • Azure AI Foundry (Cognitive Services) with a GPT model deployed
    • ACA Dynamic Sessions pool
  • Azure Developer CLI (azd)

### Run the playground

```bash
# Provision infrastructure
azd auth login
azd up

# Install dependencies (includes playground extras)
pip install -e ".[playground]"
cd playground/frontend && npm install && npm run build && cd ../..

# Start the server
bash start.sh
```

Open http://localhost:8000 in a browser.

### Use as a library

```python
from rlm import rlm_query

corpus = "\n\n".join(f"## {d['title']}\n{d['content']}" for d in my_docs)

# rlm_query is async; await it from an async context (or a notebook cell)
answer, iters, steps, refs = await rlm_query(
    question="What patterns appear across all reports?",
    corpus=corpus,
)
```

See notebooks/rlm_howto.ipynb for a full walkthrough.

## Configuration

All settings are read from environment variables (or .env):

| Variable | Description | Default |
|---|---|---|
| `AZURE_PROJECT_ENDPOINT` | Azure AI Foundry project endpoint | (required) |
| `SESSION_POOL_ENDPOINT` | ACA Dynamic Sessions pool endpoint | (required) |
| `ROOT_MODEL` | Model for the RLM root agent | `gpt-5-4-mini` |
| `SUB_MODEL` | Model used by `rlm_query()` sub-tasks and leaf calls | `gpt-5-4-mini` |
| `RLM_MAX_ITERATIONS` | Max RLM loop iterations (root level) | `20` |
| `RLM_MAX_DEPTH` | Max recursion depth for nested `rlm_query()` calls | `2` |
| `RLM_SUB_MAX_ITERATIONS` | Max iterations per recursive sub-task | `10` |
| `RLM_TRUNCATE_CHARS` | Truncation limit for REPL stdout in step events | `2000` |
| `AZURE_SEARCH_ENDPOINT` | Azure AI Search endpoint (playground only) | |
| `SEARCH_INDEX_NAME` | Search index name (playground only) | `ap-news` |
| `GATEWAY_HOST` | Server bind address (playground only) | `0.0.0.0` |
| `GATEWAY_PORT` | Server port (playground only) | `8000` |
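For example, a minimal `.env` for library-only use might look like the following (the endpoint URLs are placeholders; substitute your own resources):

```ini
# Required endpoints (placeholder values)
AZURE_PROJECT_ENDPOINT=https://my-foundry-project.example.com
SESSION_POOL_ENDPOINT=https://my-session-pool.example.com

# Optional overrides (defaults shown)
RLM_MAX_ITERATIONS=20
RLM_MAX_DEPTH=2
```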

## Running tests

```bash
pytest tests/ -v
```

## References

Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. arXiv:2512.24601
