Agent-first Text-to-SQL project built around ReAct trajectories, real SQLite execution feedback, DeepSpeed training, and practical deployment on single-GPU machines.
The repository started as a Text-to-SQL SFT baseline and has now pivoted to an agent workflow:
- Direct SQL for fast first-pass generation
- Interactive ReAct for tool-using correction when direct generation fails
- ReAct SFT as the current best training recipe
- GRPO as an experimental RL branch for reward-driven improvement
On Spider validation (200 samples), the current best deployed strategy is:
- ReAct SFT v2 + Direct + Interactive Fallback: 78.6% EX
- ReAct SFT v2 direct-only: 71.7% EX
- Current GRPO branch is useful as an experiment, but does not beat ReAct SFT yet
See docs/REPORT.md for the full write-up and docs/AGENT.md for the code-level walkthrough.
- ReAct trajectory generation with teacher models and real DB observations
- ReAct SFT training with native DeepSpeed ZeRO-2
- GRPO RL training with execution-based rewards
- Agent evaluation in single-pass and interactive modes
- Baseline SQL SFT, serving, quantization, and benchmarking utilities
Teacher trajectories are generated for Spider train examples, then every Action SQL is executed against the real SQLite database and rewritten with real Observation text.
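The observation-rewriting step can be sketched as follows. This is a minimal illustration, not the code in `miniLLM/agent/env.py`: the `observe` helper, the `MAX_ROWS` cap, and the exact Observation wording are assumptions, and an in-memory table stands in for a Spider SQLite file.

```python
import sqlite3

MAX_ROWS = 5  # hypothetical cap on rows shown in one Observation


def observe(conn: sqlite3.Connection, sql: str) -> str:
    """Run one Action SQL and render the result as Observation text."""
    try:
        rows = conn.execute(sql).fetchmany(MAX_ROWS)
    except sqlite3.Error as exc:
        # Errors become observations too, so recovery steps can be learned.
        return f"Error: {exc}"
    if not rows:
        return "(no rows)"
    return "\n".join(str(row) for row in rows)


# Tiny in-memory database standing in for a real Spider database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 41)])

ok_obs = observe(conn, "SELECT name FROM singer WHERE age > 35")
err_obs = observe(conn, "SELECT nme FROM singer")
```

Replacing whatever the teacher wrote after `Observation:` with the string returned here keeps every training trajectory consistent with what the database actually returns, including error messages.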
Output record format:
{
  "schema": "...",
  "question": "...",
  "gold_sql": "...",
  "trajectory": "Thought: ...\nAction: execute_sql[\"...\"]\nObservation: ...\nAnswer: ...",
  "error_recovery": true,
  "db_id": "concert_singer",
  "difficulty": "medium"
}

The model is trained to emit full ReAct trajectories:
Thought -> Action -> Observation -> ... -> Answer
This stage is supervised. The target is the full trajectory, not just the final SQL.
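To make the trajectory format concrete, here is a small parsing sketch. It assumes the `execute_sql["..."]` action syntax shown in the record format above and a single-line `Answer:`; the regular expressions and the `parse_trajectory` name are illustrative and may differ from `miniLLM/agent/react.py`.

```python
import re

# Assumes one Action per line, written as execute_sql["<sql>"].
ACTION_RE = re.compile(r'^Action:\s*execute_sql\["(.*)"\]\s*$', re.MULTILINE)
# Assumes the final answer fits on a single line.
ANSWER_RE = re.compile(r"^Answer:\s*(.*)$", re.MULTILINE)


def parse_trajectory(text):
    """Return (list of Action SQL strings, final Answer or None)."""
    actions = ACTION_RE.findall(text)
    match = ANSWER_RE.search(text)
    answer = match.group(1).strip() if match else None
    return actions, answer


example = (
    "Thought: I need the singer count.\n"
    'Action: execute_sql["SELECT count(*) FROM singer"]\n'
    "Observation: (6,)\n"
    "Answer: SELECT count(*) FROM singer"
)
actions, answer = parse_trajectory(example)
```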
The RL stage does not consume gold trajectories. It samples multiple completions per prompt, extracts the final SQL, executes it, and optimizes a reward composed of:
- format adherence
- SQL validity
- structural overlap with gold SQL
- partial execution agreement
- final correctness
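One way to combine these five components is a weighted sum, sketched below. The weights, the token-level Jaccard overlap, and the row-recall definition of partial agreement are all assumptions made for illustration; the actual logic lives in `miniLLM/agent/reward.py` and may differ.

```python
import re
import sqlite3

# Hypothetical weights for illustration; the repository's values may differ.
WEIGHTS = {"format": 0.1, "overlap": 0.2, "valid": 0.2, "partial": 0.2, "correct": 0.3}


def run(conn, sql):
    """Return all rows, or None if SQLite rejects the query."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def tokens(sql):
    return set(re.findall(r"\w+", sql.lower()))


def reward(completion, gold_sql, conn):
    match = re.search(r"^Answer:\s*(.+)$", completion, re.MULTILINE)
    if match is None:
        return 0.0  # no parseable final answer, so no credit at all
    pred_sql = match.group(1).strip()
    score = WEIGHTS["format"]

    # Structural overlap: Jaccard similarity over lowercase word tokens.
    pred_tok, gold_tok = tokens(pred_sql), tokens(gold_sql)
    score += WEIGHTS["overlap"] * len(pred_tok & gold_tok) / max(1, len(pred_tok | gold_tok))

    pred_rows, gold_rows = run(conn, pred_sql), run(conn, gold_sql)
    if pred_rows is None or gold_rows is None:
        return score  # invalid SQL earns only format and overlap credit
    score += WEIGHTS["valid"]

    # Partial agreement: fraction of gold rows that the prediction returns.
    if gold_rows:
        hits = sum(1 for row in gold_rows if row in pred_rows)
        score += WEIGHTS["partial"] * hits / len(gold_rows)

    # Final correctness: identical result sets, ignoring row order.
    if sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows)):
        score += WEIGHTS["correct"]
    return score


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT)")
conn.executemany("INSERT INTO singer VALUES (?)", [("Ann",), ("Bob",)])

gold = "SELECT name FROM singer"
full = reward("Thought: list names\nAnswer: SELECT name FROM singer", gold, conn)
broken = reward("Answer: SELECT nme FROM singer", gold, conn)
missing = reward("Thought: I am not sure.", gold, conn)
```

A graded reward like this gives a non-zero signal to completions that are well-formed but wrong, which is the usual motivation for shaping rather than a pure 0/1 correctness reward.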
This branch is still experimental. Current results show that GRPO improves some local reward behavior but does not yet outperform the ReAct SFT policy used with fallback.
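For readers unfamiliar with GRPO, the "multiple completions per prompt" step matters because each completion is scored relative to its own group. The sketch below shows the standard group-normalized advantage; whether `miniLLM/train_grpo.py` uses exactly this normalization (population standard deviation, this `eps`) is an assumption.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions sampled from the same prompt.
advs = group_advantages([1.0, 0.22, 0.22, 0.0])
flat = group_advantages([0.5, 0.5])
```

When all completions in a group receive the same reward, every advantage is zero and that prompt contributes no gradient signal.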
Direct SQL: generate one SQL query and execute it.
Interactive ReAct: generate step by step. After each Action, execute the SQL, inject the real Observation, and continue generation.
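The step-by-step loop can be sketched like this. The `generate` callable is a stand-in for the model (here a scripted stub), and `interactive_rollout`, `max_turns`, and the Observation formatting are hypothetical; a real decoder would typically also stop generation before it writes its own `Observation:` line.

```python
import re
import sqlite3

ACTION_RE = re.compile(r'Action:\s*execute_sql\["(.*)"\]')


def interactive_rollout(generate, conn, prompt, max_turns=4):
    """Alternate model steps with real SQL execution until an Answer appears."""
    transcript = prompt
    for _ in range(max_turns):
        step = generate(transcript)  # one Thought plus an Action or an Answer
        transcript += step
        if "Answer:" in step:
            break
        match = ACTION_RE.search(step)
        if match is None:
            break  # malformed step: nothing to execute
        try:
            obs = str(conn.execute(match.group(1)).fetchmany(5))
        except sqlite3.Error as exc:
            obs = f"Error: {exc}"
        # Inject the real Observation before asking the model to continue.
        transcript += f"\nObservation: {obs}\n"
    return transcript


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT)")
conn.execute("INSERT INTO singer VALUES ('Ann')")

# Scripted stub standing in for the fine-tuned model.
script = iter([
    'Thought: try a column.\nAction: execute_sql["SELECT nme FROM singer"]',
    'Thought: fix the typo.\nAction: execute_sql["SELECT name FROM singer"]',
    "Thought: that works.\nAnswer: SELECT name FROM singer",
])
out = interactive_rollout(lambda _: next(script), conn, "Question: list singer names\n")
```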
Direct generation with interactive fallback is the current best deployment strategy:
- Try direct SQL generation first
- Execute on the real Spider DB
- If execution fails, switch to interactive ReAct correction
This preserves the speed of direct decoding on easy questions and the robustness of agent-style correction on harder ones.
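The fallback policy above reduces to a small dispatcher. In this sketch `direct` and `interactive` are stub callables standing in for the two decoding modes, and `answer_with_fallback` is a hypothetical name, not a function from the repository.

```python
import sqlite3


def executes(conn, sql):
    """True if SQLite can run the query without raising."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def answer_with_fallback(question, direct, interactive, conn):
    """Try direct decoding first; fall back to interactive ReAct on failure."""
    sql = direct(question)
    if executes(conn, sql):
        return sql, "direct"
    return interactive(question), "interactive"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT)")


# Stubs: the direct path gets the easy question right and the hard one wrong.
def direct(question):
    return "SELECT name FROM singer" if question == "easy" else "SELECT nme FROM singer"


def interactive(question):
    return "SELECT name FROM singer"


easy = answer_with_fallback("easy", direct, interactive, conn)
hard = answer_with_fallback("hard", direct, interactive, conn)
```

Note that the trigger is an execution failure, so a direct query that runs but returns the wrong rows is never sent to the interactive path.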
miniLLM/
├── miniLLM/agent/
│   ├── env.py           # SQLite execution environment
│   ├── react.py         # ReAct prompt format and trajectory parsing
│   ├── reward.py        # GRPO rewards
│   └── data_gen.py      # Teacher-generated ReAct trajectories
├── miniLLM/train.py # Baseline SQL SFT
├── miniLLM/train_react_sft.py
├── miniLLM/train_grpo.py
├── miniLLM/eval.py # Baseline SQL eval
├── miniLLM/eval_agent.py # Agent eval: single-pass / interactive
├── miniLLM/eval_spider.py
├── scripts/gen_react_data.sh
├── scripts/train_react_sft.sh
├── scripts/train_grpo.sh
├── scripts/eval_agent.sh
└── docs/
    ├── REPORT.md
    └── AGENT.md
uv sync

If you want the full agent stack:
uv pip install deepspeed openai

bash scripts/train.sh
bash scripts/eval.sh

API_KEY=... bash scripts/gen_react_data.sh

MODEL=Qwen/Qwen2.5-3B-Instruct \
DATA_PATH=outputs/react_trajectories.jsonl \
MAX_STEPS=500 \
bash scripts/train_react_sft.sh

Single-pass:
ADAPTER=outputs/react-sft bash scripts/eval_agent.sh

Interactive:
ADAPTER=outputs/react-sft INTERACTIVE=1 bash scripts/eval_agent.sh

SFT_ADAPTER=outputs/react-sft \
MAX_STEPS=300 \
bash scripts/train_grpo.sh

Use this as an experimental branch, not as the default production recipe.
- docs/REPORT.md: experimental results and failure analysis
- docs/AGENT.md: architecture, file map, and training workflow
- docs/LEARNING.md: implementation notes and follow-ups
- The repo is optimized for single-GPU workstations with 24GB VRAM class hardware
- DeepSpeed ZeRO-2 is used instead of Trainer wrappers so the training loop is explicit
- Evaluation on Spider should be interpreted carefully: attempted-only EX can overstate performance if many samples end in execution errors
- Current GRPO work is valuable mainly as a negative result and debugging surface, not as the strongest shipped model path
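The caution about attempted-only EX can be illustrated with a little arithmetic. The sketch assumes that attempted-only EX divides correct answers by the samples whose SQL executed, while overall EX divides by all samples; the counts are made up and are not results from this repository.

```python
def ex_metrics(total, exec_errors, correct):
    """Compare EX over executed samples with EX over all samples."""
    attempted = total - exec_errors
    return {
        "attempted_only_ex": correct / attempted if attempted else 0.0,
        "overall_ex": correct / total if total else 0.0,
    }


# Hypothetical run: 200 samples, 40 end in execution errors, 120 are correct.
metrics = ex_metrics(total=200, exec_errors=40, correct=120)
```

With these made-up counts the attempted-only figure is 0.75 while the overall figure is 0.60, so reporting both, together with the execution-error count, avoids overstating performance.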