Safety, security, and reliability engineering for non-deterministic systems
- Traditional chaos: Plenty of resources (Gremlin, Chaos Mesh, Netflix Chaos Engineering)
- AI chaos: Scattered. No unified playbook for model drift, adversarial inputs, hallucinations
- Data chaos: Underexplored. How do you test your data pipeline for failure modes?
This repo bridges that gap with 50 documented patterns, 22 runnable experiments, and real incident case studies.
- 🆕 New to chaos? Start with Getting Started
- 🔍 Looking for a pattern? Browse by System Type
- 🧪 Want to run experiments? See Experiments Guide
- 🚨 Had an incident? Jump to Runbooks
- 📖 Want the theory? Read Resilience Principles
- 50 documented patterns across 7 categories (foundational, traditional, data, ML, GenAI, personalization, cross-system)
- 22 runnable chaos experiments with safety guardrails and integrations with Chaos Mesh, Locust, and Terraform
- Real incident case studies (anonymized) with lessons learned
- Unified observability checklist for monitoring resilience across all domains
- Cross-system cascade scenarios for complex multi-layer failures
- Complete runbooks for incident response (5 documented playbooks)
- Utility tools for metrics collection, data generation, and drift simulation
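To make "safety guardrails" concrete, here is a minimal sketch of a guardrailed experiment loop. It is not taken from this repo's experiments: `Guardrails`, `flaky_service`, and `run_experiment` are hypothetical names, and the thresholds are illustrative assumptions. The idea is that a run injects nothing without explicit approval and aborts early once the observed error rate crosses a limit.

```python
import random
from dataclasses import dataclass


@dataclass
class Guardrails:
    """Hypothetical safety limits for one experiment run (illustrative values)."""
    approved: bool = False        # explicit sign-off is required to inject faults
    max_error_rate: float = 0.25  # abort once the observed error rate exceeds this
    min_samples: int = 20         # do not judge the error rate on too few requests


def flaky_service(rng: random.Random, fault_probability: float) -> bool:
    """Stand-in for a real dependency: returns True on success."""
    return rng.random() >= fault_probability


def run_experiment(guardrails: Guardrails, fault_probability: float,
                   requests: int = 200, seed: int = 0) -> dict:
    """Send requests to `flaky_service` and stop early if a guardrail trips."""
    if not guardrails.approved:
        # Dry run: nothing is injected without approval.
        return {"status": "dry-run", "sent": 0, "errors": 0}

    rng = random.Random(seed)  # seeded so a run is repeatable
    errors = 0
    for sent in range(1, requests + 1):
        if not flaky_service(rng, fault_probability):
            errors += 1
        if sent >= guardrails.min_samples and errors / sent > guardrails.max_error_rate:
            return {"status": "aborted", "sent": sent, "errors": errors}
    return {"status": "completed", "sent": requests, "errors": errors}


if __name__ == "__main__":
    print(run_experiment(Guardrails(approved=False), fault_probability=0.5))
    print(run_experiment(Guardrails(approved=True), fault_probability=0.05))
    print(run_experiment(Guardrails(approved=True), fault_probability=0.9))
```

Defaulting `approved` to `False` means a forgotten flag results in a dry run rather than an unplanned fault injection. A real experiment would likely check richer abort signals (latency, saturation, customer impact) than a single error rate.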
# Clone the repo
git clone https://github.com/pristley/OpenResilience-AI
cd OpenResilience-AI
# Read the project structure guide
cat PROJECT_STRUCTURE.md
# Pick a pattern to read (cascading-failure is fully detailed)
cat patterns/0-common/cascading-failure/README.md
cat patterns/4-genai-models/hallucination-injection/README.md
# Browse all patterns by category
ls patterns/
# 0-common/ (foundational: 5 patterns)
# 1-traditional/ (APIs, databases, queues: 6 patterns)
# 2-data-pipelines/ (ETL, streaming, quality: 8 patterns)
# 3-ml-models/ (drift, inference, feedback: 9 patterns)
# 4-genai-models/ (hallucination, RAG, tokens: 11 patterns)
# 5-personalization/ (cold start, bucketing, loops: 7 patterns)
# 6-cross-system/ (multi-layer cascades: 4 patterns)
# Run a chaos experiment (with proper approvals!)
cd experiments/genai-models/llm-hallucination-test/
python run.py
# See incident response runbooks
ls runbooks/
# data-quality-incident/
# model-serving-latency/
# genai-hallucination-response/
# pipeline-sla-breach/
# multi-layer-cascade/

| System Type | Focus Areas | Status | Patterns |
|---|---|---|---|
| 0-Common | Network, cascading, resources, retries, timeouts | ✅ Ready | 5 patterns |
| 1-Traditional | APIs, databases, caches, queues, rate limiting, failover | ✅ Ready | 6 patterns |
| 2-Data Pipelines | ETL, streaming, schema evolution, SLA, quality, lineage, feature store | ✅ Ready | 8 patterns |
| 3-ML Models | Drift detection, adversarial injection, inference latency, feedback loops, retraining | ✅ Ready | 9 patterns |
| 4-GenAI Models | Hallucinations, token limits, RAG poisoning, prompt injection, embedding drift | ✅ Ready | 11 patterns |
| 5-Personalization | Cold-start, feedback loops, bucketing, cache invalidation, A/B tests | ✅ Ready | 7 patterns |
| 6-Cross-System | Multi-layer cascades, train/serve mismatches, feature store → model | ✅ Ready | 4 patterns |
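As an illustration of the drift-detection row in the table above, the sketch below simulates a shifted feature and scores it with the Population Stability Index (PSI). This is an assumption about how such a check could look, not the repo's own utility: `simulate_feature` and `psi` are hypothetical helpers, and the binning scheme (equal-width bins over the reference range) is one of several reasonable choices.

```python
import math
import random


def simulate_feature(n: int, mean: float, std: float, seed: int) -> list[float]:
    """Draw n samples from a normal distribution (stand-in for a model feature)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index using equal-width bins over the reference range.

    Assumes the reference sample is not constant (otherwise the bin width is zero).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins

    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the first or last bin.
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    training = simulate_feature(5000, mean=0.0, std=1.0, seed=1)
    stable = simulate_feature(5000, mean=0.0, std=1.0, seed=2)
    drifted = simulate_feature(5000, mean=1.5, std=1.0, seed=3)  # injected mean shift
    print(f"PSI stable:  {psi(training, stable):.4f}")
    print(f"PSI drifted: {psi(training, drifted):.4f}")
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as significant drift, but the right alert threshold depends on the feature and on how costly a false alarm is.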
- Project Structure — Complete directory overview and navigation
- Getting Started — First time? Start here
- Glossary — Chaos/ML/Data terminology
- Theory — Resilience principles & frameworks
- References — Papers, tools, talks
- Templates — Boilerplate for new patterns (pattern-template.md, experiment-template.py, runbook-template.md)
- Experiments Guide — How to safely run chaos tests
- Contributing Guide — How to submit patterns and experiments
Have a pattern, experiment, or incident to share? We'd love it!
See CONTRIBUTING.md for submission guidelines.
Licensed under the Apache License 2.0. See LICENSE for details.