
chaos-engineering

Use when planning, running, or learning from chaos engineering experiments. Triggers on "chaos experiment", "fault injection", "gameday", "resilience test", "blast radius", "steady state", "abort criteria", "Chaos Toolkit", "Chaos Mesh", "Litmus", "Gremlin", "AWS FIS", or any deliberate failure-injection question. Ships experiment designer, blast-radius calculator, and postmortem generator (all st

cd ~/.claude/skills
git clone https://github.com/alirezarezvani/claude-skills.git claude-skills

Chaos Engineering

Design experiments that surface real weaknesses in production systems — without becoming outages. Most “chaos engineering” attempts skip steady-state measurement, define no abort criteria, and have no blast-radius bound. This skill enforces the discipline that makes chaos experiments safe and useful.

When to use

  • Planning a chaos experiment (what to break, where, when, how to abort)
  • Calculating blast radius before running the experiment
  • Reviewing an existing experiment plan for safety
  • Choosing a chaos tool (Chaos Toolkit / Chaos Mesh / Litmus / Gremlin / AWS FIS)
  • Writing a chaos experiment postmortem
  • Running a Game Day exercise

When NOT to use

  • General incident response (use incident-response)
  • Threat hunting / red-team (use red-team, threat-detection)
  • Performance load testing (different goal — chaos is about failure modes, not capacity)
  • Production debugging (chaos discovers weaknesses preemptively, not after-the-fact)

Core principle: chaos without abort criteria is an outage

The 4 Principles of Chaos Engineering (Netflix, 2016):

  1. Build a hypothesis around steady-state behavior. Not “what breaks?” but “X holds; will it still hold under fault Y?”
  2. Vary real-world events. Inject realistic failures: kill nodes, slow networks, lose cache, throttle dependencies.
  3. Run experiments in production. Staging never has the same failure modes. Start small.
  4. Automate experiments to run continuously. One-off chaos is a press release; continuous chaos is engineering.

Add a fifth: Define abort criteria up front. A chaos experiment with no abort criteria is an outage by another name.
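As a minimal sketch of what "concrete" abort criteria might look like in code, the example below evaluates the condition used later in this document (p99 > 1000ms OR error_rate > baseline + 1pp). The `Sample` type and the `should_abort` function are hypothetical names chosen for illustration; they are not part of the shipped tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One monitoring reading taken during the experiment (hypothetical shape)."""
    p99_ms: float       # observed p99 latency in milliseconds
    error_rate: float   # observed error rate as a fraction, e.g. 0.004 = 0.4%


def should_abort(sample: Sample, baseline_error_rate: float,
                 p99_limit_ms: float = 1000.0,
                 error_margin: float = 0.01) -> bool:
    """Return True when either abort condition is met.

    Mirrors "p99 > 1000ms OR error_rate > baseline + 1pp";
    one percentage point is 0.01 when rates are stored as fractions.
    """
    latency_breached = sample.p99_ms > p99_limit_ms
    errors_breached = sample.error_rate > baseline_error_rate + error_margin
    return latency_breached or errors_breached


healthy = Sample(p99_ms=620.0, error_rate=0.004)
slow = Sample(p99_ms=1200.0, error_rate=0.004)
failing = Sample(p99_ms=620.0, error_rate=0.02)

print(should_abort(healthy, baseline_error_rate=0.002))  # False
print(should_abort(slow, baseline_error_rate=0.002))     # True
print(should_abort(failing, baseline_error_rate=0.002))  # True
```

Writing the criteria as a pure function of a reading and a baseline keeps them testable before the experiment starts, which is the point of defining them up front.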

Quick start

SKILL=engineering/chaos-engineering/skills/chaos-engineering

# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15

# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15

# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt

The 3 Python tools

All three use only the Python standard library. Run any of them with --help to list its options.

experiment_designer.py

Generates a structured experiment plan from inputs. Enforces the required sections (hypothesis, steady-state metric, blast radius, abort criteria, rollback).

python scripts/experiment_designer.py \
  --target "checkout-svc" \
  --hypothesis "p99 latency stays <500ms when payment-svc is slow" \
  --attack latency \
  --magnitude "+200ms" \
  --duration-min 15 \
  --blast-radius "5% of US traffic" \
  --abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"

Outputs a markdown plan with: hypothesis, steady-state, attack, magnitude, duration, blast radius, abort criteria, rollback procedure, monitoring dashboards, and learning question.
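The "required sections" check can be pictured with a short sketch. This is not the code inside experiment_designer.py; the key names in `REQUIRED_SECTIONS`, the `missing_sections` helper, and the sample plan values are assumptions made for illustration.

```python
# Hypothetical key names for the five required sections named above.
REQUIRED_SECTIONS = (
    "hypothesis",
    "steady_state_metric",
    "blast_radius",
    "abort_criteria",
    "rollback",
)


def missing_sections(plan: dict) -> list:
    """Return the required sections that are absent or blank in the plan."""
    return [
        name for name in REQUIRED_SECTIONS
        if not str(plan.get(name) or "").strip()
    ]


plan = {
    "hypothesis": "p99 latency stays <500ms when payment-svc is slow",
    "steady_state_metric": "checkout-svc p99 latency",   # illustrative value
    "blast_radius": "5% of US traffic",
    "abort_criteria": "",                                # blank on purpose
    "rollback": "remove the latency injection",          # illustrative value
}

print(missing_sections(plan))  # ['abort_criteria']
```

A plan that returns a non-empty list here would be rejected before anyone is asked to review it.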

blast_radius_calculator.py

Computes the blast radius of a planned experiment. Given traffic share + user population + duration, calculates expected affected users, expected error budget burn, and a risk score.

python scripts/blast_radius_calculator.py \
  --traffic-share 0.05 \
  --user-pop 1000000 \
  --duration-min 15 \
  --baseline-availability 0.999 \
  --expected-impact-availability 0.95

Outputs:

  • Expected affected users
  • Error budget consumed (in minutes of error budget)
  • Risk score: GREEN / YELLOW / RED
  • Recommendation: PROCEED / REDUCE / ABORT

GREEN = <1% of the error budget consumed; YELLOW = 1-10%; RED = >10%.
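The arithmetic behind these outputs can be sketched as follows. The real formula in blast_radius_calculator.py is not shown in this document, so the 30-day budget window, the burn model, and the mapping of GREEN/YELLOW/RED to PROCEED/REDUCE/ABORT are all assumptions; only the colour thresholds come from the text above.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed error-budget window


def blast_radius(traffic_share, user_pop, duration_min,
                 baseline_availability, expected_impact_availability):
    """Estimate affected users, budget burn, and a risk colour (sketch only)."""
    affected_users = round(traffic_share * user_pop)

    # Total budget for the window: minutes of allowed unavailability.
    budget_min = (1 - baseline_availability) * MINUTES_PER_30_DAYS

    # Assumed burn model: only the exposed share of traffic degrades,
    # and it degrades to expected_impact_availability for the whole duration.
    burned_min = traffic_share * (1 - expected_impact_availability) * duration_min
    burn_fraction = burned_min / budget_min

    if burn_fraction < 0.01:
        risk, recommendation = "GREEN", "PROCEED"
    elif burn_fraction <= 0.10:
        risk, recommendation = "YELLOW", "REDUCE"
    else:
        risk, recommendation = "RED", "ABORT"

    return {
        "affected_users": affected_users,
        "burned_min": burned_min,
        "burn_fraction": burn_fraction,
        "risk": risk,
        "recommendation": recommendation,
    }


result = blast_radius(0.05, 1_000_000, 15, 0.999, 0.95)
print(result["affected_users"], result["risk"], result["recommendation"])
# 50000 GREEN PROCEED
```

Under these assumptions the example burns roughly 0.04 minutes of a 43.2-minute budget, well under the 1% GREEN threshold.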

experiment_postmortem.py

Produces a structured postmortem from an experiment plan + results. Catches the common postmortem failure modes: no learning recorded, no follow-up actions, blame-laden language.

python scripts/experiment_postmortem.py --plan experiment.json --result-log results.txt

Outputs markdown with: summary, hypothesis (was it confirmed/refuted?), what we learned, what surprised us, follow-up actions with owners, and link to next experiment.
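To make the three failure modes concrete, here is a rough sketch of how such checks could be written. It is not the logic of experiment_postmortem.py: the dictionary shape, the `lint_postmortem` helper, and the blame word list are hypothetical.

```python
import re

# Hypothetical phrase list; tune it for your team's vocabulary.
BLAME_PATTERNS = re.compile(
    r"\b(fault of|blame[sd]?|careless(ly)?|negligen(t|ce)|should have known)\b",
    re.IGNORECASE,
)


def lint_postmortem(postmortem: dict) -> list:
    """Return human-readable problems found in a postmortem draft."""
    problems = []
    if not postmortem.get("learnings"):
        problems.append("no learning recorded")
    if not postmortem.get("follow_ups"):
        problems.append("no follow-up actions")
    for action in postmortem.get("follow_ups", []):
        if not action.get("owner"):
            problems.append(f"follow-up without owner: {action.get('title', '?')}")
    text = " ".join([postmortem.get("summary", ""), *postmortem.get("learnings", [])])
    if BLAME_PATTERNS.search(text):
        problems.append("blame-laden language")
    return problems


draft = {
    "summary": "Retries amplified load; this was the fault of the payments team.",
    "learnings": [],
    "follow_ups": [{"title": "Add retry budget", "owner": ""}],
}

for problem in lint_postmortem(draft):
    print(problem)
# no learning recorded
# follow-up without owner: Add retry budget
# blame-laden language
```

A keyword list will miss subtle blame and flag harmless sentences, so treat its output as a prompt for a human reviewer.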

The 7 attack types (taxonomy)

Different attacks reveal different weaknesses. See references/attack_taxonomy.md for full detail.

| Attack | What it tests | Tooling |
|---|---|---|
| Latency | Timeouts, retries, circuit breakers | tc, Chaos Mesh NetworkChaos |
| Error | Error handling, fallback paths | Chaos Mesh HTTPChaos, Toxiproxy |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh StressChaos, stress-ng |
| Network partition | Split-brain, consensus, failover | Chaos Mesh NetworkChaos partition |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh TimeChaos |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |

Pick the attack that matches the hypothesis. “What happens if X is slow?” → latency. “What happens if X loses network?” → partition.
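For the latency case, one possible way to express the attack is a Chaos Mesh NetworkChaos resource. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which kubectl can apply. The field names follow the Chaos Mesh v1alpha1 API as I understand it, so check them against the version you have installed. The `latency_manifest` helper, the `shop` namespace, and the `app` label are hypothetical.

```python
import json


def latency_manifest(name, namespace, app_label, latency, duration):
    """Build a NetworkChaos manifest that delays traffic on one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # a single randomly chosen pod, to keep the blast radius small
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


manifest = latency_manifest(
    name="payment-svc-latency",
    namespace="shop",          # hypothetical namespace
    app_label="payment-svc",   # assumes pods carry the label app=payment-svc
    latency="200ms",
    duration="15m",
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from the same inputs as the experiment plan keeps the magnitude and duration in the plan and in the cluster from drifting apart.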

Tooling chooser

| Tool | Best for | Pricing | Stack |
|---|---|---|---|
| Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any |
| Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes |
| Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes |
| Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any |
| AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS |
| Custom | Niche needs, single-cloud, low budget | None | Any |

Decision rules:

  • k8s-only stack + OSS → Chaos Mesh or Litmus (Litmus has bigger experiment library)
  • Multi-cloud + OSS → Chaos Toolkit
  • AWS-heavy + simple needs → AWS FIS
  • Enterprise + audit/compliance → Gremlin

See references/tooling_landscape.md for trade-offs.

Workflows

Workflow 1: Design and run a single experiment

1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or your team's equivalent channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.
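Steps 7 and 8 can be sketched as a small control loop: inject the fault, poll the abort criteria, and roll back no matter how the run ends. The `run_experiment` function and its callable parameters are hypothetical; a real runner would wire them to your chaos tool and monitoring system.

```python
import time
from typing import Callable


def run_experiment(inject: Callable[[], None],
                   rollback: Callable[[], None],
                   abort_hit: Callable[[], bool],
                   duration_s: float,
                   poll_s: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> str:
    """Inject a fault, poll abort criteria, and always roll back.

    Returns "aborted" if the abort check fired, otherwise "completed".
    """
    inject()
    deadline = clock() + duration_s
    try:
        while clock() < deadline:
            if abort_hit():
                return "aborted"
            sleep(poll_s)
        return "completed"
    finally:
        rollback()  # runs on abort, completion, and unexpected exceptions


# Demo with fakes: the third reading trips the abort criteria.
events = []
readings = iter([False, False, True])
outcome = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    abort_hit=lambda: next(readings),
    duration_s=60,
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(outcome, events)  # aborted ['inject', 'rollback']
```

Putting the rollback in a `finally` block means a crash in the monitoring check still removes the fault, which matters more than the cause of the crash while the experiment is live.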

Workflow 2: Game Day exercise

1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Write a single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.

Workflow 3: Continuous chaos (game days → daily)

1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.

Composition with other skills

This skill explicitly composes with three others in this library:

| Skill | Composition |
|---|---|
| feature-flags-architect | Kill switches defined there are the abort triggers here |
| kubernetes-operator | Operators are common chaos targets (test reconcile under fault) |
| incident-response | Chaos experiments that escalate become incidents |

Anti-patterns

  • No hypothesis — “let’s break things” is sabotage, not engineering
  • No steady-state metric — without a baseline, you can’t tell if X broke
  • No blast radius bound — full-prod experiment without limits = outage
  • No abort criteria — see above; this is mandatory
  • No on-call coverage — chaos without monitoring is unmonitored production
  • Chaos in staging only — staging never has prod failure modes
  • Chaos in dev — useless; dev has different failure modes from prod
  • One-off chaos — single experiment is a press release; learning requires recurrence
  • Blame-laden postmortem — record causes, not blame; teams stop running chaos otherwise

References

  • references/chaos_principles.md — the 4 principles, history, when to start
  • references/experiment_design.md — hypothesis structure, steady-state metrics, abort criteria
  • references/attack_taxonomy.md — 7 attack types with examples and tooling
  • references/tooling_landscape.md — Chaos Toolkit / Mesh / Litmus / Gremlin / FIS / DIY

Slash command

/chaos-experiment — interactive experiment design wizard that runs all 3 tools.

Asset templates

  • assets/experiment_template.md — fill-in plan template
  • assets/postmortem_template.md — structured postmortem template

Verifiable success

A team using this skill should achieve:

  • 100% of chaos experiments have a written hypothesis, abort criteria, and blast-radius calculation
  • Blast radius for any single experiment never exceeds 10% of error budget
  • Mean time between chaos experiments <14 days (continuous, not one-off)
  • Each experiment produces ≥1 follow-up action that gets shipped
  • No chaos experiment escalates to a customer-impacting incident in trailing 90 days
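The cadence target can be checked from a list of experiment dates. The sketch below uses made-up dates and a hypothetical `mean_days_between` helper to compute the mean gap and compare it with the 14-day target.

```python
from datetime import date


def mean_days_between(run_dates):
    """Average gap in days between consecutive experiment dates."""
    ordered = sorted(run_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two experiments to measure a gap")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Illustrative dates only; replace with your own experiment log.
runs = [date(2024, 3, 1), date(2024, 3, 11), date(2024, 3, 25), date(2024, 4, 6)]
mean_gap = mean_days_between(runs)
print(mean_gap)       # 12.0
print(mean_gap < 14)  # True
```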
