SOP-Bench is a comprehensive benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Built from 2,000+ tasks across 12 industrial domains (healthcare, logistics, finance, content moderation, etc.), SOP-Bench addresses the gap between existing benchmarks and real-world procedural complexity.
🏭 Human Expert-Authored SOPs · 🤖 Human-AI Collaborative Framework · 📊 Executable Interfaces · 🔧 Two Agent Architectures · 📈 11 Frontier Models Evaluated
- [2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.
Standard Operating Procedures (SOPs) are the backbone of industrial operations—from content moderation to healthcare intake to supply chain logistics. These multi-step procedures require:
- Sequential reasoning across 10-50+ decision points
- Tool orchestration to gather information from multiple systems
- Implicit knowledge that humans learn but rarely document
- Ambiguity handling when procedures don't cover edge cases
Can LLM agents reliably execute these procedures?
Our evaluation shows they struggle: task success rates range from 26.7% to 94.3% depending on the domain.
📊 Detailed leaderboard and benchmark results coming soon.
Early findings from our evaluation of 11 frontier models:
- Function-Calling agents: ~64% average task success rate
- ReAct agents: ~55% average task success rate
- High execution rates (95%+) indicate failures are reasoning-based, not technical
- Open-source models (DeepSeek-R1, Llama 3.3) approach proprietary performance
- Architecture-model co-design matters: Newer reasoning models can degrade ReAct performance without targeted prompt engineering
2,000+ tasks across 12 industrial domains, created through human-AI collaboration:
| Domain | Description |
|---|---|
| Content Moderation | Bot detection, trust scoring, violation assessment |
| Customer Service | Issue diagnosis, system checks, resolution routing |
| Supply Chain | Safety data sheet analysis, hazard classification |
| Aviation | Pre-flight safety checks, compliance verification |
| Retail | Seller email categorization, routing decisions |
| Finance | Business entity verification, risk assessment |
| Healthcare | Insurance validation, medical history processing |
| Autonomous Driving | Object detection in driving scenarios |
| Media | Content moderation, category assignment |
| Logistics | Package damage assessment, compliance checks |
| ...and more |
Human expert-authored SOPs • Mock tools for reproducibility • Ground-truth outputs • Multiple agent architectures
Here's a simplified example from the Dangerous Goods benchmark:
SOP Instruction (excerpt):
"1. Retrieve the Safety Data Sheet for the product. 2. Check if the product contains any Class 3 flammable liquids. 3. If flash point < 23°C, classify as Packing Group I..."
Task Input:
{
"product_id": "CHEM-2847",
"shipment_type": "air_freight"
}
Expected Agent Behavior:
- Call `get_safety_data_sheet(product_id="CHEM-2847")`
- Call `check_hazard_class(sds_id="SDS-2847")`
- Call `get_flash_point(sds_id="SDS-2847")`
- Apply classification logic from SOP
- Return: `{"classification": "Class 3", "packing_group": "II"}`
Ground Truth: packing_group: II
The agent must correctly orchestrate tools AND apply the SOP's decision logic.
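As a rough illustration of the two requirements, the sketch below checks a single run for both the tool-call order and the final packing group. The `check_run` helper and its argument shapes are hypothetical and are not the framework's scorer; only the tool names and values are taken from the example above.

```python
# Hypothetical checker for the Dangerous Goods example; not SOP-Bench's scorer.
EXPECTED_TOOL_ORDER = [
    "get_safety_data_sheet",
    "check_hazard_class",
    "get_flash_point",
]
GROUND_TRUTH = {"packing_group": "II"}


def check_run(tool_calls, agent_output):
    """Report whether a run used the expected tools and reached the right decision."""
    return {
        # Tool orchestration: same tools, same order as the expected behavior.
        "tools_ok": list(tool_calls) == EXPECTED_TOOL_ORDER,
        # Decision logic: the returned packing group matches the ground truth.
        "decision_ok": agent_output.get("packing_group") == GROUND_TRUTH["packing_group"],
    }


run = check_run(
    ["get_safety_data_sheet", "check_hazard_class", "get_flash_point"],
    {"classification": "Class 3", "packing_group": "II"},
)
print(run)
```

A run that calls the right tools but returns the wrong packing group would pass the first check and fail the second, which is the failure pattern the benchmark is designed to surface.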
git clone https://github.com/amazon-science/SOP-Bench.git
cd SOP-Bench
pip install -e .
⚠️ Security Notice: This package is only available via source installation from this GitHub repository. It is not published on PyPI. Do not install any package named `amazon-sop-bench`, `amazonsopbench`, or `sop-bench` from PyPI; those are not official and may contain malicious code.
cp .env.example .env
# Edit .env with your AWS credentials and model ID

# List available benchmarks
./sop-bench list
# Evaluate on a single task
./sop-bench evaluate content_flagging --agent function_calling --max-tasks 1
# Expected output:
# Evaluating content_flagging with function_calling agent...
# Limited to 1 tasks
#
# Starting evaluation...
# Evaluating content_flagging ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:16
#
# ✓ Evaluation Complete!
# Task Success Rate: 100.0%
# Execution Completion Rate: 100.0%
# Tool Accuracy: 100.0%

Understanding the Metrics:
- Task Success Rate (TSR): Percentage of tasks where the agent made the correct decision
- Execution Completion Rate (ECR): Percentage of tasks that completed without errors
- Conditional Task Success Rate (C-TSR): Of the tasks that completed execution, how many were accurate? This measures decision accuracy for successfully executed tasks only.
- Tool Accuracy: Percentage of tool calls that were correct
Note: TSR = ECR × C-TSR (Task Success Rate equals Execution Completion Rate times Conditional Task Success Rate)
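The relationship between the three rates can be checked with a few lines of Python. This is a minimal sketch, assuming each task result is a dict with hypothetical `completed` and `correct` flags (the real result schema may differ) and that a task which fails to complete never counts as a success.

```python
def summarize(records):
    """Compute ECR, C-TSR, and TSR from per-task records.

    Each record is assumed (hypothetically) to look like
    {"completed": bool, "correct": bool}.
    """
    total = len(records)
    completed = [r for r in records if r["completed"]]
    # Only completed tasks can be correct under the assumption above.
    correct = [r for r in completed if r["correct"]]

    ecr = len(completed) / total if total else 0.0
    c_tsr = len(correct) / len(completed) if completed else 0.0
    tsr = len(correct) / total if total else 0.0
    return {"ecr": ecr, "c_tsr": c_tsr, "tsr": tsr}


records = [
    {"completed": True, "correct": True},
    {"completed": True, "correct": True},
    {"completed": True, "correct": False},
    {"completed": False, "correct": False},
]
metrics = summarize(records)
print(metrics)
```

With four tasks, three completed and two correct, ECR is 3/4, C-TSR is 2/3, and TSR is 2/4, which equals ECR × C-TSR.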
# Run on all tasks and save results
./sop-bench evaluate content_flagging --agent function_calling --output results.json
# Save execution traces for debugging
./sop-bench evaluate content_flagging --agent function_calling --save-traces
# View results
./sop-bench results results.json

By default, AmazonSOPBench runs evaluations sequentially (one task at a time). To speed up an evaluation, enable parallel execution with the --max-workers option. Start with 3 workers and increase up to 10 as long as you are not being throttled. If you hit a throttling exception, reduce the number of workers on the next run.
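Conceptually, a worker limit like --max-workers bounds how many tasks are in flight at once. The sketch below illustrates that idea with Python's standard thread pool; `run_parallel` and the `run_task` callable are hypothetical names, and this is not necessarily how the framework implements the option.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tasks, run_task, max_workers=3):
    """Run run_task over tasks with at most max_workers running concurrently.

    Results come back in the same order as the input tasks.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))


# Stand-in for a real task evaluation (hypothetical).
def fake_task(task_id):
    return {"task_id": task_id, "completed": True}


results = run_parallel(range(5), fake_task, max_workers=3)
print(len(results))
```

Because each in-flight task typically issues its own model requests, raising the worker count raises the request rate, which is why throttling is the practical ceiling.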
# Default: Sequential execution (1 worker)
./sop-bench evaluate content_flagging --agent function_calling
# Run with 10 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 10
# Run with 5 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 5

The --max-workers option is also available in batch_evaluate.py for batch runs across multiple models:
# Batch evaluate with 4 parallel workers per model run
python batch_evaluate.py --sop dangerous_goods \
--models "Claude Opus 4.5,Claude Sonnet 4.5,Llama 3.3 70B,OpenAI GPT-OSS 120B" \
--agents react --max-workers 4 --save-traces

Debugging with Traces:
When using --save-traces, execution traces are saved to results/{benchmark}_{agent}_traces/ for detailed debugging of agent behavior and tool calls.
AWS Credentials: Ensure your AWS account has Bedrock access and Claude model permissions.
# Test AWS access
aws sts get-caller-identity
aws bedrock list-foundation-models --region us-west-2

Import Errors: Make sure you're in the AmazonSOPBench directory and that the package is installed correctly.
cd /path/to/AmazonSOPBench
pip install -e . --force-reinstall

Low Task Success Rate: The framework uses automatic parser fallback (XML → JSON → Dict → Plain Text) to extract agent decisions. For best results, agents should output decisions in a structured format:
<final_decision>your_decision_value</final_decision>

Debugging Failed Tasks: Use --save-traces to save detailed execution logs:
./sop-bench evaluate content_flagging --agent function_calling --save-traces
# Check traces in: results/content_flagging_function_calling_traces/

For comprehensive testing and validation, see VALIDATION_COMMANDS.md.
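To make the parser fallback order from the Low Task Success Rate note concrete, here is a minimal sketch of an XML → JSON → Dict → Plain Text chain. The function name, the `final_decision` key used for the JSON and dict cases, and the "last non-empty line" plain-text rule are all assumptions for illustration; the framework's actual parser may behave differently.

```python
import ast
import json
import re

TAG_PATTERN = re.compile(r"<final_decision>(.*?)</final_decision>", re.DOTALL)


def extract_decision(text):
    """Try progressively looser formats and return the first decision found."""
    # 1. XML-style tag (the recommended output format).
    match = TAG_PATTERN.search(text)
    if match:
        return match.group(1).strip()

    # 2. JSON object with a (hypothetical) "final_decision" key.
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "final_decision" in obj:
            return str(obj["final_decision"])
    except json.JSONDecodeError:
        pass

    # 3. Python dict literal, e.g. single-quoted keys that JSON rejects.
    try:
        obj = ast.literal_eval(text.strip())
        if isinstance(obj, dict) and "final_decision" in obj:
            return str(obj["final_decision"])
    except (ValueError, SyntaxError, TypeError):
        pass

    # 4. Plain text: fall back to the last non-empty line (an assumption).
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else None


print(extract_decision("Thought: done\n<final_decision>approve</final_decision>"))
```

The later stages are inherently guesses about what the agent meant, which is why emitting the tagged format tends to give the most reliable scores.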
AmazonSOPBench provides multiple agent implementations for evaluating SOP execution:
The default react agent uses LangChain's create_react_agent / AgentExecutor with automatic stop-sequence handling for all Bedrock model families. This is the agent used for all SOP-Bench experiments.
# Use the ReAct agent (default)
./sop-bench evaluate content_flagging --agent react

Features:
- ✅ Works with all model families (Claude, Llama, OpenAI, DeepSeek)
- ✅ Automatic stop-sequence handling via `StopSequenceSafeChatBedrock` wrapper
- ✅ Client-side truncation and Thought sanitization for non-Claude models
- ✅ Used for all SOP-Bench paper experiments
- ✅ Recommended for all evaluations
The ReAct agent uses a StopSequenceSafeChatBedrock wrapper that automatically handles stop sequences across all model families:
- Claude: Native stop-sequence support (no wrapper needed)
- OpenAI GPT-OSS: Wrapper bypasses LangChain validation, passes stop sequences through natively
- Meta Llama/DeepSeek: Wrapper strips stop sequences, applies client-side truncation + Thought sanitization
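Client-side truncation can be pictured as cutting the model's text at the first stop sequence it contains. The sketch below shows that idea only; `truncate_at_stop` is a hypothetical helper, the `"\nObservation:"` stop string is an assumption based on the usual ReAct prompt format, and the Thought sanitization step is not covered because this README does not describe it.

```python
def truncate_at_stop(text, stop_sequences):
    """Return text cut just before the earliest stop sequence, if any is present."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # An empty stop string would match at index 0; ignore it.
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = "Thought: I need the SDS.\nAction: get_safety_data_sheet\nObservation: (model-invented result)"
print(truncate_at_stop(raw, ["\nObservation:"]))
```

Without a cut like this, a model that ignores server-side stop sequences can write its own Observation instead of waiting for the real tool result.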
Usage Examples:
# Use with Claude models
./sop-bench evaluate content_flagging \
--agent react \
--model us.anthropic.claude-3-5-sonnet-20241022-v2:0
# Use with Llama models
./sop-bench evaluate dangerous_goods \
--agent react \
--model us.meta.llama3-3-70b-instruct-v1:0
# Use with OpenAI models
./sop-bench evaluate dangerous_goods \
--agent react \
--model openai.gpt-oss-120b-1:0

Programmatic Usage:
from amazon_sop_bench import evaluate
# Evaluate with ReAct agent
results = evaluate(
benchmark_name="content_flagging",
agent_type="react",
model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0"
)
print(f"Task Success Rate: {results['task_success_rate']:.1%}")

Native function calling using Bedrock's Converse API:
./sop-bench evaluate content_flagging --agent function_calling

| Benchmark | Domain | Description | Tasks | Complexity |
|---|---|---|---|---|
| content_flagging | Content Moderation | Evaluate flagged user content through bot detection, trust scoring, and violation assessment | 226 | 9/10 |
| customer_service | Support | Diagnose and resolve customer service issues using system diagnostics | 208 | 8/10 |
| dangerous_goods | Supply Chain | Classify dangerous goods using safety data sheets and scoring systems | 327 | 7/10 |
| aircraft_inspection | Transportation | Conduct pre-flight safety inspections following aviation procedures | 150 | 9/10 |
| email_intent | Retail | Categorize seller support emails and route appropriately | 122 | 7/10 |
| know_your_business | Finance | Verify business entities for compliance and risk assessment | 122 | 9/10 |
| patient_intake | Healthcare | Register new patients with insurance and medical history validation | 90 | 7/10 |
| video_annotation | Autonomous Driving | Detect and annotate objects in driving videos | 168 | 10/10 |
| video_classification | Media | Classify and moderate user-generated video content | 198 | 9/10 |
| warehouse_inspection | Logistics | Inspect packages for damage and compliance | 200 | 9/10 |
from amazon_sop_bench import evaluate, list_benchmarks
# Run evaluation
results = evaluate(
benchmark_name="content_flagging",
agent_type="react",
max_tasks=10
)
print(f"Task Success Rate: {results['task_success_rate']:.1%}")
print(f"Tool Accuracy: {results['tool_accuracy']:.1%}")

📖 Full Getting Started Guide →
| Agent | Description | Best For |
|---|---|---|
| function_calling | Native Bedrock Converse API | Structured tool use |
| react | Custom ReAct loop (recommended) | All model families |
SOP-Bench is extensible. Create new benchmarks with:
benchmarks/data/your_benchmark/
├── sop.txt # Natural language procedure
├── tools.py # Tool implementations
├── toolspecs.json # Tool schemas for LLM
├── data.csv # Test cases with ground truth
└── metadata.json # Configuration
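Before running a new benchmark, it can help to confirm that the directory contains the five files listed above. The check below is a small sketch using only the standard library; `missing_benchmark_files` is a hypothetical helper and not part of the SOP-Bench package.

```python
from pathlib import Path

# File names taken from the directory layout above.
REQUIRED_FILES = (
    "sop.txt",
    "tools.py",
    "toolspecs.json",
    "data.csv",
    "metadata.json",
)


def missing_benchmark_files(benchmark_dir):
    """Return the required file names that are absent from benchmark_dir."""
    root = Path(benchmark_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_benchmark_files("benchmarks/data/your_benchmark")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

This only verifies that the files exist; it does not validate their contents or schemas.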
| Document | Description |
|---|---|
| Getting Started | Installation, configuration, troubleshooting |
| Agents Guide | Agent types, model compatibility, examples |
| Adding Benchmarks | Create custom SOP benchmarks |
| Architecture | Technical design and internals |
| Examples | Code samples for common use cases |
If you use SOP-Bench in your research, please cite:
@inproceedings{sopbench2026,
title={SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents},
author={Nandi, Subhrangshu and Datta, Arghya and Vichare, Nikhil and
Nama, Rohith and Patel, Udita and Bhattacharya, Indranil and
Asija, Shivam and Gupta, Arushi and Carenini, Giuseppe and
Xu, Jing and Ray, Shayan and Raja, Huzefa and Chan, Aaron and
Carbone, Francesco and Fei, Esther Xu and Du, Gaoyuan and
Akhtar, Zuhaib and Grover, Prince and Bhaduri, Sreyoshi and
Chen, Weian and Zhang, Wei and Xiong, Ming},
booktitle={KDD},
year={2026}
}

We welcome contributions! See CONTRIBUTING.md for guidelines.
CC-BY-NC-4.0 — See LICENSE
- 📄 Paper: Coming soon
- 📧 Contact: Open an issue on GitHub
Built by the Applied AI team at Amazon