A fully automated Azure environment for demonstrating Azure SRE Agent capabilities. Deploy a breakable multi-service application on AKS and let SRE Agent diagnose and fix the issues!
- Azure Kubernetes Service (AKS) with a multi-pod e-commerce demo application
- 10 breakable scenarios for demonstrating SRE Agent diagnosis
- Azure SRE Agent deployed automatically via Bicep for AI-powered diagnostics
- SRE Agent configuration layer: Knowledge base runbooks, custom agents, connectors, and scheduled tasks
- Full observability stack: Log Analytics, Application Insights, Managed Grafana
- Ready-to-use scripts for deployment and teardown
- Dev container for consistent development experience
- Azure subscription with Owner/Contributor access
- Azure region supporting SRE Agent: East US 2, Sweden Central, or Australia East
- Azure CLI installed
- VS Code with Dev Containers extension (optional but recommended)
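Before deploying, it can help to confirm that the target region is one of the three listed above. The following is a minimal bash sketch; `is_supported_region` is a hypothetical helper and is not part of this repository's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repo): succeeds only for the
# regions this README lists as supporting SRE Agent, in Azure CLI short form.
is_supported_region() {
  case "$1" in
    eastus2|swedencentral|australiaeast) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_supported_region eastus2 && echo "region ok"` prints `region ok`, while an unlisted region makes the function return a non-zero status.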
# 1. Login to Azure
az login --use-device-code
# 2. Deploy infrastructure (~15-25 minutes)
.\scripts\deploy.ps1 -Location eastus2 -Yes
💡 Tip: Type `menu` in the terminal to see all available commands, including break scenarios, fix commands, and kubectl shortcuts.
Once deployed, you can break the application using shortcut commands:
# Out of Memory scenario
break-oom
# CrashLoopBackOff
break-crash
# Image Pull failure
break-image
# See all scenarios
menu
To restore:
fix-all
After deployment, `deploy.ps1` automatically configures the SRE Agent with:
- Knowledge base: Runbooks for each failure category (pod failures, networking, dependencies, resource exhaustion) plus app architecture and incident report templates
- Custom agents: `incident-handler` (alert investigation), `cluster-health-monitor` (proactive checks), and optionally `code-analyzer` (GitHub source code RCA)
- Connectors: Azure Monitor (incident source) and optionally GitHub MCP (source code search)
- Scheduled tasks: `daily-health-check` runs `cluster-health-monitor` every day at 08:00 UTC
- Open the SRE Agent Portal: the URL is displayed in the deployment output, or visit sre.azure.com
- Verify configuration: check Builder > Agent Canvas, Knowledge Files
- Break something: `break-oom`, `break-crash`, etc.
- Ask the agent to investigate, or create an incident response plan in the portal
- Ask it to diagnose:
  - "Why are pods crashing in the pets namespace?"
  - "Run a health check on my cluster"
  - "Trace the dependency chain: what broke first?"
To enable source code analysis and automated issue creation:
.\scripts\configure-sre-agent.ps1 `
-ResourceGroupName "rg-srelab-eastus2" `
-GitHubPat $env:GITHUB_PAT `
-GitHubRepo "owner/repo"
See docs/SRE-AGENT-SETUP.md for detailed instructions, or docs/PROMPTS-GUIDE.md for a full catalog of prompts to try.
| Configuration | Daily Cost | Monthly Cost |
|---|---|---|
| Default deployment | ~$22-28 | ~$650-850 |
| + SRE Agent | ~$32-38 | ~$950-1,150 |
See docs/COSTS.md for detailed breakdown and optimization tips.
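For a short demo it may be more useful to estimate cost by the number of days the lab stays deployed. The sketch below multiplies the approximate daily figures from the table; `estimate_cost` is a hypothetical helper, and actual charges depend on usage, so treat the result as a rough range only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a "low-high" USD range for a lab that runs
# for a given number of days. Daily figures are the approximate values
# from the cost table, passed in by the caller.
estimate_cost() {
  local days="$1" daily_low="$2" daily_high="$3"
  echo "$(( days * daily_low ))-$(( days * daily_high ))"
}

# Three-day demo with SRE Agent enabled (~$32-38 per day).
estimate_cost 3 32 38   # prints 96-114
```

Tearing the environment down with `destroy.ps1` once a demo is finished keeps the total close to the low end of the range.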
| Scenario | Description | SRE Agent Diagnoses |
|---|---|---|
| OOMKilled | Memory limit too low | Memory exhaustion, limit recommendations |
| CrashLoop | App exits immediately | Exit codes, log analysis |
| ImagePullBackOff | Invalid image reference | Registry/image troubleshooting |
| HighCPU | Resource exhaustion | Performance analysis |
| PendingPods | Insufficient cluster resources | Scheduling analysis |
| ProbeFailure | Failing health checks | Probe configuration |
| NetworkBlock | NetworkPolicy blocking traffic | Connectivity analysis |
| MissingConfig | Non-existent ConfigMap | Configuration troubleshooting |
| MongoDBDown | Database offline, cascading failure | Dependency tracing, root cause |
| ServiceMismatch | Wrong Service selector, silent failure | Endpoint/selector analysis |
Note: These PowerShell scripts deploy to Azure and can be run from the dev container, locally on Windows, or on any system with PowerShell Core installed.
| Command | Description |
|---|---|
| `.\scripts\deploy.ps1 -Location eastus2` | Deploy all infrastructure to Azure |
| `.\scripts\deploy.ps1 -WhatIf` | Preview what would be deployed |
| `.\scripts\configure-sre-agent.ps1 -ResourceGroupName <rg>` | Configure SRE Agent (KB, agents, connectors) |
| `.\scripts\validate-deployment.ps1 -ResourceGroupName <rg>` | Verify resources and app are healthy |
| `.\scripts\destroy.ps1 -ResourceGroupName <rg>` | Tear down all infrastructure |
Deploy script parameters:
- `-Location`: Azure region (`eastus2`, `swedencentral`, `australiaeast`). Default: `eastus2`
- `-WorkloadName`: Resource prefix. Default: `srelab`
- `-SkipRbac`: Skip RBAC assignments if subscription policies block them
- `-WhatIf`: Preview deployment without making changes
- `-Yes`: Skip confirmation prompts (non-interactive mode)
| Command | Description |
|---|---|
| `kubectl apply -f k8s/base/application.yaml` | Deploy healthy application |
| `kubectl apply -f k8s/scenarios/<scenario>.yaml` | Apply a break scenario |
| `kubectl get pods -n pets` | Check pod status |
| `kubectl get events -n pets --sort-by='.lastTimestamp'` | View recent events |
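The kubectl commands above can be chained into a break, observe, and restore loop. The following bash sketch assumes it is run from the repository root with `kubectl` already pointed at the lab cluster; the function names are hypothetical, and the scenario name must match a manifest that actually exists under `k8s/scenarios/`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps a scenario name to its manifest path,
# following the k8s/scenarios/<scenario>.yaml pattern from the table.
scenario_manifest() {
  printf 'k8s/scenarios/%s.yaml\n' "$1"
}

# Apply a break scenario, then show pod status and the latest events.
run_scenario() {
  local manifest
  manifest="$(scenario_manifest "$1")"
  kubectl apply -f "$manifest" || return 1
  kubectl get pods -n pets
  kubectl get events -n pets --sort-by='.lastTimestamp' | tail -n 20
}

# Re-apply the healthy baseline to undo a scenario.
restore_baseline() {
  kubectl apply -f k8s/base/application.yaml
}
```

Calling `run_scenario <scenario>` and then `restore_baseline` mirrors what the `break-*` and `fix-all` shortcuts are described as doing, although the shortcuts themselves may be implemented differently.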
The sre-config/ directory contains the SRE Agent configuration layer:
sre-config/
├── knowledge-base/                  # Runbooks uploaded to agent memory
│   ├── aks-pod-failures.md          # OOM, CrashLoop, ImagePull, Pending, Probe, Config
│   ├── network-connectivity.md      # Network policies, selector mismatches, DNS
│   ├── dependency-failures.md       # MongoDB/RabbitMQ outages, cascading analysis
│   ├── resource-exhaustion.md       # CPU, memory, scheduling, node health
│   ├── app-architecture.md          # Service map, dependencies, monitoring queries
│   └── incident-report-template.md  # Structured GitHub issue template
├── agents/                          # Custom agent YAML specifications
│   ├── incident-handler-core.yaml   # Log/metric investigation (no GitHub)
│   ├── incident-handler-full.yaml   # Full investigation + GitHub issues
│   ├── cluster-health-monitor.yaml  # Proactive health checks
│   └── code-analyzer.yaml           # Source code RCA (requires GitHub)
└── connectors/
    ├── azure-monitor.yaml           # Azure Monitor incident connector
    └── github-mcp.yaml              # GitHub MCP connector template
- SRE Agent Setup Guide: deployment, RBAC, and configuration
- Prompts Guide: prompts, agents, knowledge base, GitHub integration
- Breakable Scenarios Guide
- Cost Estimation
Contributions welcome! Feel free to open issues or submit PRs.
MIT License - see LICENSE for details.
- SRE Agent is currently in Preview
- Only available in East US 2, Sweden Central, and Australia East
- AKS cluster must NOT be a private cluster, otherwise SRE Agent cannot access it
- Firewall must allow `*.azuresre.ai`
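The private-cluster limitation can be checked against an existing AKS cluster with the Azure CLI. This is a minimal bash sketch, assuming `az` is logged in; the function names are hypothetical, and the resource group and cluster name are supplied by the caller because this README does not state the deployed cluster's name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: interprets the apiServerAccessProfile.enablePrivateCluster
# value reported by `az aks show`. Anything other than "true" (for example
# "false" or an empty value) is treated as a non-private cluster.
is_public_cluster() {
  [ "$1" != "true" ]
}

# Query the flag for a given resource group and cluster name.
check_cluster_access() {
  local rg="$1" cluster="$2" flag
  flag="$(az aks show --resource-group "$rg" --name "$cluster" \
    --query "apiServerAccessProfile.enablePrivateCluster" --output tsv)" || return 2
  if is_public_cluster "$flag"; then
    echo "cluster is not private"
  else
    echo "cluster is private, which this README lists as unsupported" >&2
    return 1
  fi
}
```

Usage would look like `check_cluster_access <rg> <cluster-name>`, which returns 0 for a non-private cluster, 1 for a private one, and 2 if the `az` query itself fails.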
