matthansen0/azure-sre-agent-sandbox


Azure SRE Agent Demo Lab 🔧

A fully automated Azure environment for demonstrating Azure SRE Agent capabilities. Deploy a breakable multi-service application on AKS and let SRE Agent diagnose and fix the issues!

🎯 What This Lab Provides

  • Azure Kubernetes Service (AKS) with a multi-pod e-commerce demo application
  • 10 breakable scenarios for demonstrating SRE Agent diagnosis
  • Azure SRE Agent deployed automatically via Bicep for AI-powered diagnostics
  • SRE Agent configuration layer: Knowledge base runbooks, custom agents, connectors, and scheduled tasks
  • Full observability stack: Log Analytics, Application Insights, Managed Grafana
  • Ready-to-use scripts for deployment and teardown
  • Dev container for consistent development experience

🚀 Quick Start

Prerequisites

  • Azure subscription with Owner/Contributor access
  • Azure region supporting SRE Agent: East US 2, Sweden Central, or Australia East
  • Azure CLI installed
  • VS Code with Dev Containers extension (optional but recommended)
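A minimal pre-flight sketch for the prerequisites above, written as POSIX shell functions. The function names are hypothetical and are not part of this repository's scripts; the region list is taken from this README and may change while SRE Agent is in Preview.

```shell
# Hypothetical pre-flight helpers; not part of the repository's scripts.

# Succeeds only for the regions this README lists as supporting SRE Agent.
is_supported_region() {
  case "$1" in
    eastus2|swedencentral|australiaeast) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if the Azure CLI is on PATH.
has_azure_cli() {
  command -v az >/dev/null 2>&1
}

# Prints one line per failed check and returns non-zero if any check fails.
preflight() {
  preflight_status=0
  if ! is_supported_region "$1"; then
    echo "unsupported region: $1"
    preflight_status=1
  fi
  if ! has_azure_cli; then
    echo "Azure CLI (az) not found on PATH"
    preflight_status=1
  fi
  return "$preflight_status"
}
```

For example, `preflight eastus2 && echo "ready to deploy"` prints the message only when both checks pass. The sketch does not verify subscription permissions.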

Deploy

# 1. Login to Azure
az login --use-device-code

# 2. Deploy infrastructure (~15-25 minutes)
.\scripts\deploy.ps1 -Location eastus2 -Yes

💡 Tip: Type menu in the terminal to see all available commands, including break scenarios, fix commands, and kubectl shortcuts.

💥 Breaking Things (The Fun Part!)

Once deployed, you can break the application using shortcut commands:

# Out of Memory scenario
break-oom

# CrashLoopBackOff
break-crash

# Image Pull failure
break-image

# See all scenarios
menu

To restore:

fix-all
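The shortcuts above are not defined in this README. As a rough sketch of what they might wrap, the functions below build on the `kubectl apply` commands listed in the Commands Reference. The function names and the `KUBECTL` override are hypothetical, and the real `break-*` and `fix-all` commands may work differently.

```shell
# Hypothetical helpers; the real break-* and fix-all shortcuts may differ.

# Path of a scenario manifest, following the k8s/scenarios/<scenario>.yaml layout.
scenario_manifest() {
  printf 'k8s/scenarios/%s.yaml\n' "$1"
}

# Apply a break scenario by name.
# Set KUBECTL=echo to print the kubectl arguments instead of running kubectl.
break_scenario() {
  "${KUBECTL:-kubectl}" apply -f "$(scenario_manifest "$1")"
}

# Re-apply the healthy baseline (an assumed approximation of fix-all).
restore_baseline() {
  "${KUBECTL:-kubectl}" apply -f k8s/base/application.yaml
}
```

Re-applying the baseline only resets objects that the baseline manifest defines. A scenario that adds extra objects, such as a NetworkPolicy, would also need those objects deleted, so treat `restore_baseline` as a starting point rather than a full restore.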

🤖 Using SRE Agent

After deployment, deploy.ps1 automatically configures the SRE Agent with:

  • Knowledge base: Runbooks for each failure category (pod failures, networking, dependencies, resource exhaustion) plus app architecture and incident report templates
  • Custom agents: incident-handler (alert investigation), cluster-health-monitor (proactive checks), and optionally code-analyzer (GitHub source code RCA)
  • Connectors: Azure Monitor (incident source) and optionally GitHub MCP (source code search)
  • Scheduled tasks: daily-health-check runs cluster-health-monitor every day at 08:00 UTC

Getting Started

  1. Open the SRE Agent Portal: the URL is displayed in the deployment output, or visit sre.azure.com
  2. Verify configuration: check Builder > Agent Canvas, Knowledge Files
  3. Break something: break-oom, break-crash, etc.
  4. Ask the agent to investigate, or create an incident response plan in the portal
  5. Ask it to diagnose:
    • "Why are pods crashing in the pets namespace?"
    • "Run a health check on my cluster"
    • "Trace the dependency chain: what broke first?"

Adding GitHub Integration

To enable source code analysis and automated issue creation:

.\scripts\configure-sre-agent.ps1 `
    -ResourceGroupName "rg-srelab-eastus2" `
    -GitHubPat $env:GITHUB_PAT `
    -GitHubRepo "owner/repo"

See docs/SRE-AGENT-SETUP.md for detailed instructions, or docs/PROMPTS-GUIDE.md for a full catalog of prompts to try.

💰 Cost Estimate

| Configuration | Daily Cost | Monthly Cost |
|---|---|---|
| Default deployment | ~$22-28 | ~$650-850 |
| + SRE Agent | ~$32-38 | ~$950-1,150 |

See docs/COSTS.md for detailed breakdown and optimization tips.

🔧 Available Scenarios

| Scenario | Description | SRE Agent Diagnoses |
|---|---|---|
| OOMKilled | Memory limit too low | Memory exhaustion, limit recommendations |
| CrashLoop | App exits immediately | Exit codes, log analysis |
| ImagePullBackOff | Invalid image reference | Registry/image troubleshooting |
| HighCPU | Resource exhaustion | Performance analysis |
| PendingPods | Insufficient cluster resources | Scheduling analysis |
| ProbeFailure | Failing health checks | Probe configuration |
| NetworkBlock | NetworkPolicy blocking traffic | Connectivity analysis |
| MissingConfig | Non-existent ConfigMap | Configuration troubleshooting |
| MongoDBDown | Database offline, cascading failure | Dependency tracing, root cause |
| ServiceMismatch | Wrong Service selector, silent failure | Endpoint/selector analysis |
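When checking a scenario by hand, the status that `kubectl get pods -n pets` reports often hints at which row applies. The sketch below maps a few standard Kubernetes status reasons to the scenario names in the table. The function name is hypothetical, the mapping is a heuristic rather than the lab's own logic, and scenarios without a distinctive pod status (for example NetworkBlock or ServiceMismatch) fall through to `unknown`.

```shell
# Hypothetical triage helper: maps a pod status reason to the scenario
# it most likely corresponds to. A heuristic, not the lab's own logic.
scenario_for_reason() {
  case "$1" in
    OOMKilled)                     echo "OOMKilled" ;;
    CrashLoopBackOff)              echo "CrashLoop" ;;
    ImagePullBackOff|ErrImagePull) echo "ImagePullBackOff" ;;
    Pending)                       echo "PendingPods" ;;
    # Assumes the ConfigMap is referenced from container env settings.
    CreateContainerConfigError)    echo "MissingConfig" ;;
    *)                             echo "unknown" ;;
  esac
}
```

Note that CrashLoopBackOff can also appear after repeated OOM kills or failed liveness probes, so the pod's events and last termination reason are worth checking before settling on a scenario.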

πŸ› οΈ Commands Reference

Deployment Scripts (PowerShell)

Note: These PowerShell scripts deploy to Azure and can be run from the dev container, locally on Windows, or on any system with PowerShell Core installed.

| Command | Description |
|---|---|
| `.\scripts\deploy.ps1 -Location eastus2` | Deploy all infrastructure to Azure |
| `.\scripts\deploy.ps1 -WhatIf` | Preview what would be deployed |
| `.\scripts\configure-sre-agent.ps1 -ResourceGroupName <rg>` | Configure SRE Agent (KB, agents, connectors) |
| `.\scripts\validate-deployment.ps1 -ResourceGroupName <rg>` | Verify resources and app are healthy |
| `.\scripts\destroy.ps1 -ResourceGroupName <rg>` | Tear down all infrastructure |

Deploy script parameters:

  • -Location: Azure region (eastus2, swedencentral, australiaeast) - Default: eastus2
  • -WorkloadName: Resource prefix - Default: srelab
  • -SkipRbac: Skip RBAC assignments if subscription policies block them
  • -WhatIf: Preview deployment without making changes
  • -Yes: Skip confirmation prompts (non-interactive mode)
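Several scripts take `-ResourceGroupName <rg>`. The sketch below derives that name from the deploy parameters, assuming the resource group follows an `rg-<WorkloadName>-<Location>` pattern. That pattern is only inferred from the `rg-srelab-eastus2` example in this README, so confirm the actual name in your deployment output. The function name is hypothetical.

```shell
# Assumption: resource groups are named rg-<WorkloadName>-<Location>,
# inferred from the rg-srelab-eastus2 example; verify against deployment output.
# Defaults mirror the deploy script defaults (srelab, eastus2).
resource_group_name() {
  workload="${1:-srelab}"
  location="${2:-eastus2}"
  printf 'rg-%s-%s\n' "$workload" "$location"
}

# Example use from a shell with PowerShell Core installed:
#   pwsh -File ./scripts/validate-deployment.ps1 \
#     -ResourceGroupName "$(resource_group_name srelab swedencentral)"
```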

Kubernetes Commands (kubectl)

| Command | Description |
|---|---|
| `kubectl apply -f k8s/base/application.yaml` | Deploy healthy application |
| `kubectl apply -f k8s/scenarios/<scenario>.yaml` | Apply a break scenario |
| `kubectl get pods -n pets` | Check pod status |
| `kubectl get events -n pets --sort-by='.lastTimestamp'` | View recent events |

SRE Agent Configuration

The sre-config/ directory contains the SRE Agent configuration layer:

sre-config/
├── knowledge-base/                 # Runbooks uploaded to agent memory
│   ├── aks-pod-failures.md         # OOM, CrashLoop, ImagePull, Pending, Probe, Config
│   ├── network-connectivity.md     # Network policies, selector mismatches, DNS
│   ├── dependency-failures.md      # MongoDB/RabbitMQ outages, cascading analysis
│   ├── resource-exhaustion.md      # CPU, memory, scheduling, node health
│   ├── app-architecture.md         # Service map, dependencies, monitoring queries
│   └── incident-report-template.md # Structured GitHub issue template
├── agents/                         # Custom agent YAML specifications
│   ├── incident-handler-core.yaml  # Log/metric investigation (no GitHub)
│   ├── incident-handler-full.yaml  # Full investigation + GitHub issues
│   ├── cluster-health-monitor.yaml # Proactive health checks
│   └── code-analyzer.yaml          # Source code RCA (requires GitHub)
└── connectors/
    ├── azure-monitor.yaml          # Azure Monitor incident connector
    └── github-mcp.yaml             # GitHub MCP connector template

📚 Documentation

🤝 Contributing

Contributions welcome! Feel free to open issues or submit PRs.

📄 License

MIT License - see LICENSE for details.


⚠️ Important Notes:

  • SRE Agent is currently in Preview
  • Only available in East US 2, Sweden Central, and Australia East
  • The AKS cluster must NOT be a private cluster; otherwise SRE Agent cannot access it
  • Firewall must allow *.azuresre.ai
