amazon-science/SOP-Bench

SOP-Bench : Complex Industrial SOPs for Evaluating LLM Agents

Python 3.9+ · License: CC BY-NC 4.0 · Lint · Tests

Overview

SOP-Bench is a comprehensive benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Built from 2,000+ tasks across 12 industrial domains (healthcare, logistics, finance, content moderation, etc.), SOP-Bench addresses the gap between existing benchmarks and real-world procedural complexity.

🏭 Human Expert-Authored SOPs · 🤖 Human-AI Collaborative Framework · 📊 Executable Interfaces · 🔧 Two Agent Architectures · 📈 11 Frontier Models Evaluated

News

  • [2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.

The Problem

Standard Operating Procedures (SOPs) are the backbone of industrial operations—from content moderation to healthcare intake to supply chain logistics. These multi-step procedures require:

  • Sequential reasoning across 10-50+ decision points
  • Tool orchestration to gather information from multiple systems
  • Implicit knowledge that humans learn but rarely document
  • Ambiguity handling when procedures don't cover edge cases

Can LLM agents reliably execute these procedures?

Our research shows they struggle significantly—with performance varying dramatically across domains (26.7%-94.3% success rates).


Key Findings

📊 Detailed leaderboard and benchmark results coming soon.

Early findings from our evaluation of 11 frontier models:

  • Function-Calling agents: ~64% average task success rate
  • ReAct agents: ~55% average task success rate
  • High execution rates (95%+) indicate failures are reasoning-based, not technical
  • Open-source models (DeepSeek-R1, Llama 3.3) approach proprietary performance
  • Architecture-model co-design matters: Newer reasoning models can degrade ReAct performance without targeted prompt engineering

What's in SOP-Bench?

2,000+ tasks across 12 industrial domains, created through human-AI collaboration:

| Domain | Description |
|---|---|
| Content Moderation | Bot detection, trust scoring, violation assessment |
| Customer Service | Issue diagnosis, system checks, resolution routing |
| Supply Chain | Safety data sheet analysis, hazard classification |
| Aviation | Pre-flight safety checks, compliance verification |
| Retail | Seller email categorization, routing decisions |
| Finance | Business entity verification, risk assessment |
| Healthcare | Insurance validation, medical history processing |
| Autonomous Driving | Object detection in driving scenarios |
| Media | Content moderation, category assignment |
| Logistics | Package damage assessment, compliance checks |
...and more

Human expert-authored SOPs · Mock tools for reproducibility · Ground-truth outputs · Multiple agent architectures


Example: What Does an SOP Task Look Like?

Here's a simplified example from the Dangerous Goods benchmark:

SOP Instruction (excerpt):

"1. Retrieve the Safety Data Sheet for the product. 2. Check if the product contains any Class 3 flammable liquids. 3. If flash point < 23°C, classify as Packing Group I..."

Task Input:

{
  "product_id": "CHEM-2847",
  "shipment_type": "air_freight"
}

Expected Agent Behavior:

  1. Call get_safety_data_sheet(product_id="CHEM-2847")
  2. Call check_hazard_class(sds_id="SDS-2847")
  3. Call get_flash_point(sds_id="SDS-2847")
  4. Apply classification logic from SOP
  5. Return: {"classification": "Class 3", "packing_group": "II"}

Ground Truth: packing_group: II

The agent must correctly orchestrate tools AND apply the SOP's decision logic.
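
One way to read the final step is as a field-by-field comparison against the ground truth. The sketch below assumes, purely for illustration, that only the fields present in the ground truth are scored. The helper name `matches_ground_truth` is hypothetical and is not part of SOP-Bench.

```python
def matches_ground_truth(agent_output, ground_truth):
    """Return True if every ground-truth field has the same value in the agent output."""
    return all(agent_output.get(key) == value for key, value in ground_truth.items())


# Values taken from the example above.
agent_output = {"classification": "Class 3", "packing_group": "II"}
ground_truth = {"packing_group": "II"}

print(matches_ground_truth(agent_output, ground_truth))            # True
print(matches_ground_truth({"packing_group": "I"}, ground_truth))  # False
```

Under this assumption an agent that calls every tool correctly but applies the wrong branch of the SOP still fails the task.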


Quick Start

Install

git clone https://github.com/amazon-science/SOP-Bench.git
cd SOP-Bench
pip install -e .

⚠️ Security Notice: This package is only available via source installation from this GitHub repository. It is not published on PyPI. Do not install any package named amazon-sop-bench, amazonsopbench, or sop-bench from PyPI — those are not official and may contain malicious code.

Configure AWS (for Bedrock models)

cp .env.example .env
# Edit .env with your AWS credentials and model ID

Run Your First Evaluation

# List available benchmarks
./sop-bench list

# Evaluate on a single task
./sop-bench evaluate content_flagging --agent function_calling --max-tasks 1

# Expected output:
# Evaluating content_flagging with function_calling agent...
# Limited to 1 tasks
# 
# Starting evaluation...
# Evaluating content_flagging ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:16
# 
# ✓ Evaluation Complete!
# Task Success Rate: 100.0%
# Execution Completion Rate: 100.0%
# Tool Accuracy: 100.0%

Understanding the Metrics:

  • Task Success Rate (TSR): Percentage of tasks where the agent made the correct decision
  • Execution Completion Rate (ECR): Percentage of tasks that completed without errors
  • Conditional Task Success Rate (C-TSR): Of the tasks that completed execution, how many were accurate? This measures decision accuracy for successfully executed tasks only.
  • Tool Accuracy: Percentage of tool calls that were correct

Note: TSR = ECR × C-TSR (Task Success Rate equals Execution Completion Rate times Conditional Task Success Rate)
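
The identity in the note can be checked on a handful of outcomes. This is a minimal sketch, assuming a hypothetical per-task record with `completed` and `correct` flags; these names are for illustration and are not the framework's result schema.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    # Hypothetical per-task record; field names are assumptions for illustration.
    completed: bool  # task ran to completion without errors
    correct: bool    # final decision matched the ground truth


def summarize(outcomes):
    """Compute TSR, ECR, and C-TSR from a list of TaskOutcome records."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    completed = [o for o in outcomes if o.completed]
    # A task counts as a success only if it both completed and was correct.
    successes = sum(1 for o in completed if o.correct)
    ecr = len(completed) / total
    c_tsr = successes / len(completed) if completed else 0.0
    tsr = successes / total
    return {"tsr": tsr, "ecr": ecr, "c_tsr": c_tsr}


outcomes = [
    TaskOutcome(completed=True, correct=True),
    TaskOutcome(completed=True, correct=False),
    TaskOutcome(completed=True, correct=True),
    TaskOutcome(completed=False, correct=False),
]
metrics = summarize(outcomes)
# TSR equals ECR x C-TSR, up to floating-point rounding.
assert abs(metrics["tsr"] - metrics["ecr"] * metrics["c_tsr"]) < 1e-9
print(metrics)
```

Here three of four tasks complete (ECR = 0.75), two of those three are correct (C-TSR ≈ 0.667), and the product gives TSR = 0.5.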

Run Full Evaluation

# Run on all tasks and save results
./sop-bench evaluate content_flagging --agent function_calling --output results.json

# Save execution traces for debugging
./sop-bench evaluate content_flagging --agent function_calling --save-traces

# View results
./sop-bench results results.json

Parallel Execution

By default, SOP-Bench runs evaluations sequentially (one task at a time). Use the --max-workers option to run tasks in parallel and speed up evaluation. Start with 3 workers and increase to at most 10 as long as you are not being throttled. If you hit a throttling exception, reduce the number of workers on the next run.

# Default: Sequential execution (1 worker)
./sop-bench evaluate content_flagging --agent function_calling

# Run with 10 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 10

# Run with 5 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 5
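
As a rough mental model of what a worker cap does, the sketch below uses Python's standard `ThreadPoolExecutor`. It is not the CLI's implementation. `run_task` and `run_all` are hypothetical stand-ins for one task evaluation and the evaluation loop.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    # Hypothetical stand-in for one model call plus tool execution.
    return {"task_id": task_id, "success": True}


def run_all(task_ids, max_workers=1):
    """Run tasks with at most `max_workers` in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, task_ids))


results = run_all(range(6), max_workers=3)
print(len(results))  # 6
```

With `max_workers=1` this degenerates to sequential execution, which matches the default described above. Raising the cap increases the number of concurrent model requests and therefore the chance of throttling.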

The --max-workers option is also available in batch_evaluate.py for batch runs across multiple models:

# Batch evaluate with 4 parallel workers per model run
python batch_evaluate.py --sop dangerous_goods \
  --models "Claude Opus 4.5,Claude Sonnet 4.5,Llama 3.3 70B,OpenAI GPT-OSS 120B" \
  --agents react --max-workers 4 --save-traces

Debugging with Traces: When using --save-traces, execution traces are saved to results/{benchmark}_{agent}_traces/ for detailed debugging of agent behavior and tool calls.

Troubleshooting

Common Issues

AWS Credentials: Ensure your AWS account has Bedrock access and Claude model permissions.

# Test AWS access
aws sts get-caller-identity
aws bedrock list-foundation-models --region us-west-2

Import Errors: Make sure you are in the cloned SOP-Bench directory and that the package installed correctly.

cd /path/to/SOP-Bench
pip install -e . --force-reinstall

Low Task Success Rate: The framework uses automatic parser fallback (XML → JSON → Dict → Plain Text) to extract agent decisions. For best results, agents should output decisions in a structured format:

<final_decision>your_decision_value</final_decision>
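
To illustrate why the tagged format is the most reliable, here is a minimal sketch of a fallback chain in the same order (XML, JSON, dict literal, plain text). It is an illustration under assumptions, not the framework's parser: the function name `extract_decision`, the `final_decision` key for the JSON and dict cases, and the "last non-empty line" rule for plain text are all hypothetical.

```python
import ast
import json
import re


def extract_decision(text):
    """Try progressively looser parsers and return (decision, parser_name)."""
    # 1. XML-style tag, the structured format recommended above.
    match = re.search(r"<final_decision>(.*?)</final_decision>", text, re.DOTALL)
    if match:
        return match.group(1).strip(), "xml"

    # 2. JSON object with a "final_decision" key (key name is an assumption).
    try:
        payload = json.loads(text)
        if isinstance(payload, dict) and "final_decision" in payload:
            return str(payload["final_decision"]), "json"
    except json.JSONDecodeError:
        pass

    # 3. Python dict literal, e.g. {'final_decision': 'approve'}.
    try:
        payload = ast.literal_eval(text.strip())
        if isinstance(payload, dict) and "final_decision" in payload:
            return str(payload["final_decision"]), "dict"
    except (ValueError, SyntaxError):
        pass

    # 4. Plain text: fall back to the last non-empty line (an assumed heuristic).
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return (lines[-1] if lines else ""), "text"


print(extract_decision("Reasoning...\n<final_decision>approve</final_decision>"))
print(extract_decision("I think this should be removed."))
```

The plain-text branch has to guess which part of the response is the decision, so a free-form answer can be scored as wrong even when the reasoning was right.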

Debugging Failed Tasks: Use --save-traces to save detailed execution logs:

./sop-bench evaluate content_flagging --agent function_calling --save-traces
# Check traces in: results/content_flagging_function_calling_traces/

For comprehensive testing and validation, see VALIDATION_COMMANDS.md.

Agent Types

SOP-Bench provides two agent implementations for evaluating SOP execution:

1. ReAct Agent (Default - Recommended)

The default react agent uses LangChain's create_react_agent / AgentExecutor with automatic stop-sequence handling for all Bedrock model families. This is the agent used for all SOP-Bench experiments.

# Use the ReAct agent (default)
./sop-bench evaluate content_flagging --agent react

Features:

  • ✅ Works with all model families (Claude, Llama, OpenAI, DeepSeek)
  • ✅ Automatic stop-sequence handling via StopSequenceSafeChatBedrock wrapper
  • ✅ Client-side truncation and Thought sanitization for non-Claude models
  • ✅ Used for all SOP-Bench paper experiments
  • ✅ Recommended for all evaluations

The ReAct agent uses a StopSequenceSafeChatBedrock wrapper that automatically handles stop sequences across all model families:

  • Claude: Native stop-sequence support (no wrapper needed)
  • OpenAI GPT-OSS: Wrapper bypasses LangChain validation, passes stop sequences through natively
  • Meta Llama/DeepSeek: Wrapper strips stop sequences, applies client-side truncation + Thought sanitization
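
Client-side truncation can be pictured as cutting the model output at the first stop sequence after generation, for models that do not accept stop sequences natively. The sketch below is an assumption-laden illustration, not the code of StopSequenceSafeChatBedrock: `truncate_at_stop` is a hypothetical helper, and `"\nObservation:"` is used as the stop sequence because it is the conventional one for ReAct prompts. Thought sanitization is not covered here.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty stop string would match at index 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = (
    "Thought: I need the safety data sheet.\n"
    "Action: get_safety_data_sheet\n"
    "Action Input: CHEM-2847\n"
    "Observation: (text the model invented on its own)"
)
print(truncate_at_stop(raw, ["\nObservation:"]))
```

Truncating before the first `Observation:` keeps the model's chosen action while discarding any tool result it made up, so the real tool output can be supplied instead.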

Usage Examples:

# Use with Claude models
./sop-bench evaluate content_flagging \
  --agent react \
  --model us.anthropic.claude-3-5-sonnet-20241022-v2:0

# Use with Llama models
./sop-bench evaluate dangerous_goods \
  --agent react \
  --model us.meta.llama3-3-70b-instruct-v1:0

# Use with OpenAI models
./sop-bench evaluate dangerous_goods \
  --agent react \
  --model openai.gpt-oss-120b-1:0

Programmatic Usage:

from amazon_sop_bench import evaluate

# Evaluate with ReAct agent
results = evaluate(
    benchmark_name="content_flagging",
    agent_type="react",
    model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0"
)

print(f"Task Success Rate: {results['task_success_rate']:.1%}")

2. Function Calling Agent

Native function calling using Bedrock's Converse API:

./sop-bench evaluate content_flagging --agent function_calling

Available Benchmarks

| Benchmark | Domain | Description | Tasks | Complexity |
|---|---|---|---|---|
| content_flagging | Content Moderation | Evaluate flagged user content through bot detection, trust scoring, and violation assessment | 226 | 9/10 |
| customer_service | Support | Diagnose and resolve customer service issues using system diagnostics | 208 | 8/10 |
| dangerous_goods | Supply Chain | Classify dangerous goods using safety data sheets and scoring systems | 327 | 7/10 |
| aircraft_inspection | Transportation | Conduct pre-flight safety inspections following aviation procedures | 150 | 9/10 |
| email_intent | Retail | Categorize seller support emails and route appropriately | 122 | 7/10 |
| know_your_business | Finance | Verify business entities for compliance and risk assessment | 122 | 9/10 |
| patient_intake | Healthcare | Register new patients with insurance and medical history validation | 90 | 7/10 |
| video_annotation | Autonomous Driving | Detect and annotate objects in driving videos | 168 | 10/10 |
| video_classification | Media | Classify and moderate user-generated video content | 198 | 9/10 |
| warehouse_inspection | Logistics | Inspect packages for damage and compliance | 200 | 9/10 |

Programmatic Usage

Basic Evaluation

from amazon_sop_bench import evaluate, list_benchmarks

# Run evaluation
results = evaluate(
    benchmark_name="content_flagging",
    agent_type="react",
    max_tasks=10
)

print(f"Task Success Rate: {results['task_success_rate']:.1%}")
print(f"Tool Accuracy: {results['tool_accuracy']:.1%}")

📖 Full Getting Started Guide →


Agent Types

| Agent | Description | Best For |
|---|---|---|
| function_calling | Native Bedrock Converse API | Structured tool use |
| react | Custom ReAct loop (recommended) | All model families |

📖 Agent Documentation →


Adding Your Own Benchmarks

SOP-Bench is extensible. Create new benchmarks with:

benchmarks/data/your_benchmark/
├── sop.txt          # Natural language procedure
├── tools.py         # Tool implementations  
├── toolspecs.json   # Tool schemas for LLM
├── data.csv         # Test cases with ground truth
└── metadata.json    # Configuration
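
Before running a new benchmark, it can help to confirm that the directory contains all five files from the layout above. This is a small sketch using only the standard library; the helper `missing_benchmark_files` is hypothetical and not part of SOP-Bench, and it checks only for file presence, not for valid contents.

```python
from pathlib import Path

# File names taken from the directory layout above.
REQUIRED_FILES = ("sop.txt", "tools.py", "toolspecs.json", "data.csv", "metadata.json")


def missing_benchmark_files(benchmark_dir):
    """Return the required files that are absent from `benchmark_dir`."""
    root = Path(benchmark_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_benchmark_files("benchmarks/data/your_benchmark")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```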

📖 Adding Benchmarks Guide →


Documentation

| Document | Description |
|---|---|
| Getting Started | Installation, configuration, troubleshooting |
| Agents Guide | Agent types, model compatibility, examples |
| Adding Benchmarks | Create custom SOP benchmarks |
| Architecture | Technical design and internals |
| Examples | Code samples for common use cases |

Citation

If you use SOP-Bench in your research, please cite:

@inproceedings{sopbench2026,
  title={SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents},
  author={Nandi, Subhrangshu and Datta, Arghya and Vichare, Nikhil and 
          Nama, Rohith and Patel, Udita and Bhattacharya, Indranil and 
          Asija, Shivam and Gupta, Arushi and Carenini, Giuseppe and 
          Xu, Jing and Ray, Shayan and Raja, Huzefa and Chan, Aaron and 
          Carbone, Francesco and Fei, Esther Xu and Du, Gaoyuan and 
          Akhtar, Zuhaib and Grover, Prince and Bhaduri, Sreyoshi and 
          Chen, Weian and Zhang, Wei and Xiong, Ming},
  booktitle={KDD},
  year={2026}
}

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.


License

CC-BY-NC-4.0 — See LICENSE


Links

  • 📄 Paper: Coming soon
  • 📧 Contact: Open an issue on GitHub

Built by the Applied AI team at Amazon
