Post-Incident Review

intermediate 12 minutes

hedgehog fabric post-incident review continuous-improvement sre blameless-culture

Introduction

You've worked through an incident from start to finish:

Diagnosed the issue (Module 4.1)
Rolled back the problematic change (Module 4.2)
Worked with support (Module 4.3)
Resolved the outage

Most teams stop here. Incident resolved, move on to the next task.

High-performing teams add one more step: Post-incident review.

Learning From What Went Wrong

Why conduct post-incident reviews?

Incidents are expensive learning opportunities - You have already paid the cost (downtime, lost productivity), so maximize the learning
Same issues recur if root causes aren't addressed - Fixing symptoms doesn't prevent recurrence
Team knowledge improves when insights are shared - One person's lesson becomes everyone's knowledge
Process gaps become visible through reflection - See systemic issues that daily operations hide

Traditional response to incidents:

Incident occurs → Firefight → Resolve → Blame someone → Move on → Same incident recurs

High-performing team response:

Incident occurs → Firefight → Resolve → Post-Incident Review → Document learnings →
Update processes → Implement improvements → Incident prevented/easier next time

SRE Culture - Blameless Reviews

Site Reliability Engineering (SRE) teaches:

"Failure is inevitable in complex systems. Learning from failure is optional."

Post-incident reviews (PIRs) embody SRE principles:

Blameless culture:

Focus on systems and processes, not individuals
"What allowed this mistake?" not "Who made this mistake?"
Psychological safety to report issues honestly

Continuous improvement:

Every incident improves operations
Track patterns to identify systemic issues
Measure improvement over time

Shared learning:

Team knowledge grows through documentation
New operators learn from past incidents
Organizational memory prevents forgotten lessons

Forward-looking:

"How do we prevent this?" not "Who caused this?"
Create actionable improvements
Update runbooks and processes

What You'll Learn

Blameless Post-Incident Review:

Creating factual timelines (what happened, when)
Identifying root cause with 5 Whys technique
Distinguishing root cause from proximate cause
Facilitating reviews without blame

Action Item Creation:

SMART action items (Specific, Measurable, Actionable, Relevant, Time-bound)
Categorizing improvements (immediate, short-term, long-term)
Assigning ownership and due dates
Tracking completion

Operational Knowledge Management:

Updating runbooks based on incidents
Building troubleshooting guides
Creating searchable PIR repository
Sharing learnings with team

Module Scenario

You'll conduct a post-incident review for the VLAN conflict issue from Modules 4.1-4.3:

Document complete timeline from detection to resolution
Identify root cause using 5 Whys
Extract lessons learned
Create 2-3 actionable improvements with SMART criteria

By the end of this module, you'll have completed the incident lifecycle and contributed to a culture of continuous improvement.

Learning Objectives

By the end of this module, you will be able to:

Conduct blameless post-incident reviews - Facilitate reviews focused on learning, not blaming
Document lessons learned - Create actionable improvement items from incidents
Update operational runbooks - Improve documentation based on incident experiences
Identify systemic improvements - Recognize patterns that require process or tool changes
Build operational knowledge - Contribute to team learning and continuous improvement

Prerequisites

Before starting this module, you should have:

Completed Modules:

Module 4.1: Diagnosing Fabric Issues (provides incident to review)
Module 4.2: Rollback & Recovery (resolution actions documented)
Module 4.3: Coordinating with Support (escalation timeline)
All previous courses (Courses 1-3)

Understanding:

Incident lifecycle (detection → diagnosis → resolution)
Troubleshooting methodology
GitOps rollback procedures
Support escalation process

Context:

This module reviews the VLAN conflict incident from Modules 4.1-4.3
You'll use real incident data to practice PIR skills

Scenario

Incident Summary (from Modules 4.1-4.3):

What happened:

VPCAttachment customer-app-vpc-server-07 created successfully (Events: <none> — admission webhook validated, no rejections)
Server-07 had no connectivity within VPC
Root cause: VLAN mismatch (VPC expected 1025, switch configured with 1020)
Attempted fix: Update VPC to VLAN 1020 → Failed with "VLAN reserved for system use" error
Resolution: Escalated to support, updated VPC to VLAN 1030, connectivity restored

Timeline:

10:00 UTC: VPCAttachment created
10:07 UTC: Issue reported (no connectivity)
10:15-10:45 UTC: Investigation and diagnosis
11:00 UTC: Attempted self-resolution (failed)
11:30 UTC: Escalated to support
14:00 UTC: Support responded with solution
14:20 UTC: Incident resolved

Duration: 4 hours 20 minutes (creation to resolution); 4 hours 13 minutes from detection to resolution

Your Task:

Conduct a complete post-incident review to extract learnings and create improvements that prevent recurrence.

Core Concepts & Deep Dive

Concept 1: Blameless Culture

What is a Blameless Culture?

Traditional approach (blame-focused):

"Who made the mistake?"
"Why didn't they check before applying the change?"
"This person needs retraining."
Focus on individual actions

Blameless approach (systems-focused):

"Why did the system allow this mistake?"
"What process would have caught this earlier?"
"How do we make the right thing easy to do?"
Focus on systemic improvements

Key Principles

Principle 1: Systems Thinking

Incidents result from system failures, not individual failures.

System failures include:

Process gaps - No VLAN conflict checking before VPC creation
Tool limitations - VLANNamespace doesn't show which VLANs are in use
Documentation gaps - VLAN reservation ranges not documented
Design issues - No validation that prevents conflicts at creation time

Human error is a symptom, not a root cause.

When someone makes a "mistake," ask:

What allowed that mistake to reach production?
What tools or checks would have caught it earlier?
How do we make the correct action the easy action?

Example:

Blame-focused question:

"Why did Alice commit the wrong VLAN to Git?"

Systems-focused question:

"Why doesn't our Git workflow validate VLAN conflicts before accepting commits?"

The second question leads to actionable improvements (pre-commit hooks, CI/CD validation). The first question leads nowhere useful.
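
As one illustration of the kind of check a pre-commit hook or CI job might run, here is a minimal Python sketch. It assumes the set of VLANs already in use and the VLANNamespace ranges have been collected beforehand. The function name `check_vlan`, the range 1000-2999, and the sample inputs are hypothetical; only the reserved range 1020-1029 comes from this incident.

```python
from typing import Iterable, List, Tuple

# Reserved range reported by support in this incident (1020-1029 inclusive).
RESERVED_RANGES: List[Tuple[int, int]] = [(1020, 1029)]


def check_vlan(
    proposed: int,
    in_use: Iterable[int],
    namespace_ranges: Iterable[Tuple[int, int]],
    reserved_ranges: Iterable[Tuple[int, int]] = RESERVED_RANGES,
) -> List[str]:
    """Return a list of problems with the proposed VLAN (empty list = OK)."""
    problems: List[str] = []
    if not any(lo <= proposed <= hi for lo, hi in namespace_ranges):
        problems.append(f"VLAN {proposed} is outside the VLANNamespace ranges")
    if any(lo <= proposed <= hi for lo, hi in reserved_ranges):
        problems.append(f"VLAN {proposed} is in a reserved range")
    if proposed in set(in_use):
        problems.append(f"VLAN {proposed} is already in use by another VPC")
    return problems


# Hypothetical inputs: namespace range 1000-2999, VLAN 1025 already taken.
for vlan in (1025, 1020, 1030):
    issues = check_vlan(vlan, in_use={1025}, namespace_ranges=[(1000, 2999)])
    print(vlan, "OK" if not issues else "; ".join(issues))
```

A hook built on a check like this would reject the commit whenever the returned list is non-empty, which turns "remember to check" into something the workflow does automatically.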

Principle 2: Forward-Looking Questions

Avoid backward-looking blame:

"Who committed the broken YAML?"
"Why didn't you test before pushing to prod?"
"How could you not know VLAN 1025 was in use?"

Ask forward-looking improvement questions:

"What would have caught this error earlier in the process?"
"How can we make VLAN selection less error-prone?"
"What tools or checks would prevent this in the future?"

Difference:

Blame Question	Forward-Looking Question	Leads To
"Who broke it?"	"What in our process allowed this?"	Process improvement
"Why didn't you check?"	"How do we make checking automatic?"	Tool development
"You should have known."	"How do we make knowledge explicit?"	Documentation update

Principle 3: Psychological Safety

Teams with a blameless culture:

Report incidents honestly - Don't hide issues out of fear
Share near-misses - "I almost made this mistake" is valuable learning
Ask for help early - Not seen as weakness
Experiment with improvements - Failure during experimentation is acceptable
Admit mistakes quickly - Reduces mean time to resolution

Without psychological safety:

Incidents get hidden - Fear of blame prevents reporting
Knowledge isn't shared - Protective, siloed behavior
Improvements don't happen - Risk-averse culture avoids change
Mean time to resolution increases - People hesitate to escalate
Recurring incidents - Lessons learned aren't captured or shared

Building psychological safety in PIRs:

Facilitator sets blameless tone at start
Redirect blame to system focus
Celebrate learning and honesty
Thank participants for sharing insights
Document improvements, not individual mistakes

Concept 2: Post-Incident Review Template

PIR Structure (4 Sections)

Section 1: What Happened? (Timeline)

Document the incident chronologically with factual observations.

Example Timeline:

2025-10-17 10:00 UTC - VPCAttachment customer-app-vpc-server-07 created via Gitea commit
2025-10-17 10:05 UTC - ArgoCD synced VPCAttachment to cluster successfully
2025-10-17 10:07 UTC - Developer reported: server-07 no connectivity to VPC gateway
2025-10-17 10:15 UTC - Investigation started (admission webhook accepted apply; checked Agent CRD convergence)
2025-10-17 10:30 UTC - Agent CRD checked: VLAN 1020 on leaf-04/E1/8 (VPC expects 1025)
2025-10-17 10:45 UTC - Root cause identified: VLAN conflict (1025 in use, system allocated 1020)
2025-10-17 11:00 UTC - Attempted fix: Updated VPC YAML to VLAN 1020
2025-10-17 11:05 UTC - ArgoCD sync failed: "VLAN 1020 reserved for system use"
2025-10-17 11:30 UTC - Escalated to support (P2 ticket) with complete diagnostics
2025-10-17 14:00 UTC - Support responded: VLANs 1020-1029 reserved, use 1030+
2025-10-17 14:15 UTC - Updated VPC to VLAN 1030, ArgoCD synced successfully
2025-10-17 14:20 UTC - Connectivity verified, incident resolved

Goal: Factual timeline with timestamps. No interpretation, blame, or conclusions yet.

Metrics:

Mean Time to Detect (MTTD): 7 minutes (10:00 creation → 10:07 reported)
Mean Time to Resolve (MTTR): 4 hours 13 minutes (10:07 detected → 14:20 resolved)
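
The two figures above can be derived directly from the timeline timestamps. The following is a small Python sketch of that arithmetic; the helper names are illustrative, not part of any Hedgehog tooling. For a single incident these values are simply the time to detect and the time to resolve. They become means once several incidents are averaged.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timeline stamp as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline stamps."""
    return int((parse_utc(end) - parse_utc(start)).total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Render minutes as 'H h M min', or 'M min' when under an hour."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins} min" if hours else f"{mins} min"


created = "2025-10-17 10:00"   # VPCAttachment created
detected = "2025-10-17 10:07"  # developer reported no connectivity
resolved = "2025-10-17 14:20"  # connectivity verified

time_to_detect = minutes_between(created, detected)
time_to_resolve = minutes_between(detected, resolved)

print("Time to detect:", format_duration(time_to_detect))    # 7 min
print("Time to resolve:", format_duration(time_to_resolve))  # 4 h 13 min
```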

Section 2: Why Did It Happen? (Root Cause)

Identify the underlying cause using a structured technique. Don't stop at the first obvious answer.

5 Whys Technique:

Start with the symptom and ask "why" five times to reach the root cause:

1. Why did server-07 have no connectivity?

Because VPC VLAN (1025) didn't match switch VLAN (1020)

2. Why didn't VPC VLAN match switch VLAN?

Because VLAN 1025 was already in use by another VPC, so system auto-allocated 1020

3. Why wasn't VLAN conflict detected before VPC creation?

Because VLANNamespace doesn't validate VLAN availability at creation time

4. Why doesn't VLANNamespace validate VLAN conflicts?

Because it defines ranges, not tracks usage (current design limitation)

5. Why wasn't VLAN reservation (1020-1029) documented for operators?

Because system reservations aren't exposed via API or operator documentation

Root Cause:

No pre-creation validation for VLAN conflicts. VLANNamespace allows VLAN selection from range without checking current usage or communicating reserved ranges.

Contributing Factors:

Documentation gap: Reserved VLANs (1020-1029) not listed in runbook
Operator knowledge gap: Didn't know to manually check existing VPC VLANs first
Error message unclear: "Reserved for system use" but no list of reserved VLANs provided
No tooling to show available VLANs programmatically

Root Cause vs. Proximate Cause:

Type	Description	Example	Can Recur?
Proximate Cause	Immediate trigger	Operator chose VLAN 1025	Yes - next operator can make same choice
Root Cause	Systemic issue	No VLAN conflict validation	No - if fixed, prevents recurrence

Always address root cause, not just proximate cause.

Section 3: How Was It Resolved? (Actions Taken)

Document resolution path with what worked and what could improve.

Immediate Actions:

Diagnosed VLAN mismatch using Agent CRD inspection (10:30 UTC)
Attempted self-resolution by updating VPC VLAN to 1020 (11:00 UTC)
Encountered "reserved VLAN" error blocking self-resolution (11:05 UTC)
Escalated to support with complete diagnostics (11:30 UTC)
Received clarification from support: VLANs 1020-1029 reserved (14:00 UTC)
Updated VPC to VLAN 1030 and synced via ArgoCD (14:15 UTC)
Verified connectivity restored (14:20 UTC)

What Worked Well:

Systematic troubleshooting using Module 4.1 methodology identified root cause quickly
Complete diagnostic bundle attachment enabled fast support response
ArgoCD GitOps workflow made rollback/update safe and auditable

What Could Be Improved:

Earlier escalation right after the failed fix attempt (11:05) instead of at 11:30 could have saved ~25 minutes
Documentation of reserved VLANs would have prevented initial issue
VLAN validation tooling would catch conflicts before creation

Time Breakdown:

Detection: 7 minutes (fast - developer reported immediately)
Diagnosis: 30 minutes (10:15-10:45, systematic troubleshooting)
Failed self-resolution: 30 minutes (11:00-11:30)
Waiting for support: 2 hours 30 minutes (11:30-14:00)
Resolution application: 20 minutes (14:00-14:20)

Opportunities: The biggest time saver would be preventing the issue entirely through VLAN validation. Failing that, recognizing sooner that escalation was needed would shorten the incident.

Section 4: How Do We Prevent Recurrence? (Improvements)

Create actionable improvements categorized by timeline.

Immediate (Do This Week):

Update operator runbook: Document reserved VLAN ranges (1020-1029)
Create VLAN selection checklist: Check existing VPCs, avoid reserved ranges, document choices
Add VLAN conflict troubleshooting example to diagnostic guide

Short-Term (Do This Month):

File Hedgehog GitHub issue requesting documentation of reserved VLANs in official docs
Create kubectl helper script (show-available-vlans.sh) to list reserved, used, and available VLANs

Long-Term (Product Team Proposals):

Feature request: VLANNamespace API to expose available VLANs with validation
Feature request: VPC creation pre-validation with clear error messages and suggestions

Prioritization: Focus on immediate/short-term (operator-actionable). Long-term documents desired product improvements.
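
To make the proposed helper more concrete, here is a Python sketch of the set arithmetic a tool like show-available-vlans.sh might perform. It is not that script: it assumes the namespace ranges, reserved ranges, and in-use VLANs have already been gathered, and the function names and sample values are hypothetical. Only the reserved range 1020-1029 comes from the incident.

```python
from typing import Iterable, List, Set, Tuple

Range = Tuple[int, int]  # inclusive (first, last)


def expand(ranges: Iterable[Range]) -> Set[int]:
    """Expand inclusive (first, last) ranges into a set of VLAN IDs."""
    vlans: Set[int] = set()
    for first, last in ranges:
        vlans.update(range(first, last + 1))
    return vlans


def available_vlans(
    namespace_ranges: Iterable[Range],
    reserved_ranges: Iterable[Range],
    in_use: Iterable[int],
) -> List[int]:
    """VLANs inside the namespace that are neither reserved nor in use."""
    free = expand(namespace_ranges) - expand(reserved_ranges) - set(in_use)
    return sorted(free)


# Hypothetical namespace range and in-use VLANs, for illustration only.
free = available_vlans(
    namespace_ranges=[(1000, 1040)],
    reserved_ranges=[(1020, 1029)],
    in_use=[1001, 1002, 1030],
)
print("First available VLANs:", free[:5])
```

Keeping the reserved ranges as an explicit input means the list can be updated in one place if support later reports additional reservations.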

Concept 3: Creating Actionable Improvements

What Makes a Good Action Item?

Bad Action Items (Too Vague):

"Be more careful with VLAN selection" ❌ Not specific, not measurable
"Check things before creating VPCs" ❌ Not actionable ("things"?)
"Improve documentation" ❌ Not time-bound or owned

Good Action Items (Specific, Measurable, Owned):

"Document reserved VLANs 1020-1029 in operator runbook section 'VLAN Selection Guidelines' (Owner: Alice, Due: Oct 20)" ✅
"Create show-available-vlans.sh script and commit to kubectl-fabric-helpers repo (Owner: Bob, Due: Oct 25)" ✅
"File Hedgehog GitHub issue requesting VLAN validation feature with use case and examples (Owner: Charlie, Due: Oct 18)" ✅

SMART Criteria

S - Specific:

What exactly will be done?
Where will it be documented/implemented?
What content will be included?

M - Measurable:

How will we know it's complete?
What are the success criteria?
Can we verify completion objectively?

A - Actionable:

Can someone actually do this?
Do they have the tools/access needed?
Is the action clear?

R - Relevant:

Does this prevent recurrence?
Does it address root cause or contributing factor?
Is it worth the effort?

T - Time-bound:

When will it be done?
Is the deadline realistic?
Does priority match urgency?

SMART Action Item Example

Title: Document Reserved VLAN Ranges
Owner: Alice Thompson
Due Date: 2025-10-20
Priority: High

Description: Update operator runbook section "VLAN Selection Guidelines" documenting reserved VLAN ranges (1020-1029), kubectl command to check existing VLANs, recommended selection procedure, and troubleshooting tips.

Success Criteria:
- Runbook section created with reserved VLANs documented
- kubectl command example included
- Committed to docs repository

Status: In Progress

Key elements: Specific owner and deadline, measurable success criteria, actionable steps.
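
If action items are tracked as structured records, some of the SMART checks can be automated. The sketch below is one possible shape, assuming a hypothetical `ActionItem` record and a hand-picked list of vague phrases. It can flag a missing owner, due date, or success criteria, but relevance still needs human judgment.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Phrases taken from the "too vague" examples; extend as needed.
VAGUE_PHRASES = ("be more careful", "improve documentation", "check things")


@dataclass
class ActionItem:
    title: str
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    success_criteria: List[str] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """List what is still missing before the item can be called SMART."""
        missing: List[str] = []
        text = f"{self.title} {self.description}".lower()
        if any(phrase in text for phrase in VAGUE_PHRASES):
            missing.append("specific: wording is too vague")
        if not self.success_criteria:
            missing.append("measurable: no success criteria")
        if not self.owner:
            missing.append("owner: none assigned")
        if self.due is None:
            missing.append("time-bound: no due date")
        return missing


good = ActionItem(
    title="Document Reserved VLAN Ranges",
    description="Update runbook section 'VLAN Selection Guidelines'",
    owner="Alice Thompson",
    due=date(2025, 10, 20),
    success_criteria=[
        "Runbook section created with reserved VLANs documented",
        "kubectl command example included",
        "Committed to docs repository",
    ],
)
vague = ActionItem(title="Improve documentation", description="")

print(good.gaps())   # no gaps reported
print(vague.gaps())  # four gaps reported
```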

Concept 4: Operational Knowledge Management

Building Team Knowledge

Post-incident reviews create organizational memory—knowledge that persists beyond individual team members.

Knowledge Artifacts:

1. PIR Documents

Incident history and timeline
Root cause analysis
Lessons learned
Action items tracked to completion

2. Updated Runbooks

Procedures improved based on real experience
New troubleshooting steps added
Known issues documented with workarounds

3. Troubleshooting Guides

Common failure modes and solutions
Decision trees for diagnosis
Quick reference commands

4. Known Issues List

Product limitations with workarounds
Configuration gotchas
Environment-specific quirks

Knowledge Sharing Practices

1. Team PIR Review Meetings (15-30 min)

Present PIR findings to team
Discuss action items
Share insights and questions
Build shared understanding

2. Documentation Repository

Git repository for PIRs and runbooks
Searchable by date, tag, or keyword
Indexed for easy navigation

3. New Operator Onboarding

Review past PIRs as learning material
"Here's what we've learned" orientation
Understand common issues before encountering them

4. Quarterly Pattern Analysis

Review all PIRs from quarter
Identify recurring root causes
Prioritize systemic improvements
Track incident reduction metrics

Continuous Improvement Cycle

Incident Occurs
    ↓
Troubleshoot & Resolve (Modules 4.1-4.3)
    ↓
Post-Incident Review (Module 4.4) ← YOU ARE HERE
    ↓
Document Lessons Learned
    ↓
Update Runbooks & Processes
    ↓
Share Knowledge with Team
    ↓
Implement Improvements (SMART action items)
    ↓
Monitor for Recurrence
    ↓
(Fewer incidents over time, faster resolution when they occur)

Goal: Each incident makes the next one:

Less likely to occur (prevention through improvements)
Faster to detect (better monitoring)
Faster to resolve (documented procedures)

Metrics to Track:

Incident frequency (decreasing over time)
Mean time to detect (MTTD - decreasing)
Mean time to resolve (MTTR - decreasing)
Recurring incident rate (decreasing)
Runbook utilization (increasing)
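
Several of these metrics fall out of simple aggregation once each PIR records a few numbers. The following Python sketch shows one way to compute them. The `PirRecord` fields and the root-cause tags are assumptions, and every record except the first (which uses this module's incident timings) is invented purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class PirRecord:
    incident_id: str
    root_cause_tag: str
    detect_minutes: int
    resolve_minutes: int


def summarize(records: List[PirRecord]) -> dict:
    """Aggregate per-incident numbers into trend metrics."""
    if not records:
        raise ValueError("need at least one PIR record")
    counts = Counter(r.root_cause_tag for r in records)
    # An incident counts as recurring when its root-cause tag appears more than once.
    recurring = sum(n for n in counts.values() if n > 1)
    return {
        "incidents": len(records),
        "mttd_minutes": mean(r.detect_minutes for r in records),
        "mttr_minutes": mean(r.resolve_minutes for r in records),
        "recurring_rate": recurring / len(records),
        "top_root_causes": counts.most_common(3),
    }


records = [
    PirRecord("PIR-001", "vlan-validation", 7, 253),  # this module's incident
    PirRecord("PIR-002", "vlan-validation", 5, 60),   # invented example
    PirRecord("PIR-003", "cabling", 12, 47),          # invented example
]
summary = summarize(records)
print(summary)
```

Running the same summary each quarter and comparing the results gives the trend lines the list above asks for.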

Before You Begin the Lab

The hands-on exercises in this module require the Hedgehog Virtual AI Data Center (vAIDC) — a pre-configured GCP lab environment that includes a complete Hedgehog fabric, Grafana observability dashboards, and all required services ready to use.

Ensure your vAIDC is running before proceeding. If you haven't set it up yet, complete the Accessing the Hedgehog vAIDC module first — it takes about 20 minutes and only needs to be done once.

Hands-On Lab

Lab Overview

Title: Conduct Post-Incident Review

Scenario:

Conduct a PIR for the VLAN conflict incident from Modules 4.1-4.3.

Duration: 4-5 minutes

Tasks:

Document timeline (1-2 min)
Identify root cause using 5 Whys (1-2 min)
Create 2-3 SMART action items (1-2 min)
(Optional) Update personal runbook (1 min)

Task 1: Document Timeline

Estimated Time: 1-2 minutes

Objective: Create factual chronological timeline of the incident.

Step 1.1: List Key Events

Using information from Modules 4.1-4.3, list events in chronological order.

Your Timeline:

YYYY-MM-DD HH:MM UTC - [Event description]

Events to Include:

VPCAttachment created (via Gitea commit)
ArgoCD synced resource
Issue reported by user/developer
Investigation started
Root cause identified (VLAN mismatch)
Attempted self-resolution (updating VPC YAML)
Self-resolution failed (reserved VLAN error)
Escalated to support (with diagnostics)
Support response received
Resolution applied (VPC updated to VLAN 1030)
Connectivity verified
Incident resolved

Fill in Timeline: Create 10-12 timestamped entries from VPCAttachment creation through incident resolution, including investigation milestones, failed attempts, escalation, and final resolution.

Step 1.2: Calculate Metrics

Mean Time to Detect (MTTD): Time from incident start (10:00) to detection (10:07) = 7 minutes

Mean Time to Resolve (MTTR): Time from detection (10:07) to resolution (14:20) = 4 hours 13 minutes

Success Criteria

✅ At least 10 timeline entries
✅ Events in chronological order with timestamps
✅ Factual descriptions (no blame language like "Alice made mistake")
✅ Metrics calculated (MTTD and MTTR)
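
The first three criteria can be spot-checked mechanically. Here is a rough Python sketch, assuming timeline entries follow the `YYYY-MM-DD HH:MM UTC - [Event description]` format from Step 1.1. The function name and the short list of blame words are hypothetical, and a keyword match is only a prompt for a human to reread the entry.

```python
import re
from datetime import datetime
from typing import List

ENTRY = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}) UTC - (.+)$")
BLAME_WORDS = ("mistake", "fault", "careless", "should have")


def lint_timeline(lines: List[str], min_entries: int = 10) -> List[str]:
    """Return human-readable findings; an empty list means all checks passed."""
    findings: List[str] = []
    stamps: List[datetime] = []
    for number, line in enumerate(lines, start=1):
        match = ENTRY.match(line.strip())
        if not match:
            findings.append(f"entry {number}: not in the expected timestamp format")
            continue
        stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M"))
        if any(word in match.group(2).lower() for word in BLAME_WORDS):
            findings.append(f"entry {number}: possible blame language")
    if len(stamps) < min_entries:
        findings.append(f"only {len(stamps)} valid entries, need {min_entries}")
    if stamps != sorted(stamps):
        findings.append("entries are not in chronological order")
    return findings


draft = [
    "2025-10-17 10:07 UTC - Alice made mistake with VLAN",
    "2025-10-17 10:00 UTC - VPCAttachment created via Gitea commit",
]
for finding in lint_timeline(draft):
    print(finding)
```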

Task 2: Root Cause Analysis

Estimated Time: 1-2 minutes

Objective: Use 5 Whys to identify systemic root cause.

Step 2.1: Complete 5 Whys

Start with the symptom "server-07 has no connectivity" and ask "why" five times:

1. Why did server-07 have no connectivity?
2. Why [answer to #1]?
3. Why [answer to #2]?
4. Why [answer to #3]?
5. Why [answer to #4]?

Each "why" should dig deeper from the proximate cause toward the systemic root cause.

Step 2.2: State Root Cause

Based on your 5 Whys analysis, identify:

Root Cause: The systemic issue (process gap, tool limitation, or documentation gap)
Contributing Factors: 2-3 additional factors that enabled the issue

Step 2.3: Verify Root Cause

Root Cause Test: "If we fix this root cause, would the same incident be impossible or much less likely?"

Your answer: Yes / No

If "No," your root cause may be a proximate cause. Keep asking "why" until you reach a systemic issue.

Success Criteria

✅ Root cause is systemic (process gap, tool limitation, documentation gap)
✅ Root cause is NOT individual blame ("operator chose wrong VLAN")
✅ Contributing factors identified
✅ Root cause passes verification test (fixing it would prevent recurrence)

Task 3: Create SMART Action Items

Estimated Time: 1-2 minutes

Objective: Define 2-3 specific, actionable improvements.

Step 3.1: Brainstorm Improvements

What could prevent this incident from recurring?

Your Ideas:

Categories to Consider:

Documentation updates (runbook, troubleshooting guide)
Scripts or tools (kubectl helpers, validation scripts)
Process changes (checklist, peer review)
Feature requests (product improvements)

Step 3.2: Make Action Items SMART

Choose 2-3 improvements and apply SMART criteria. For each action item, define:

Title: Clear, specific description
Owner: Specific person responsible
Due Date: Realistic deadline
Description: What will be done, including specific steps
Success Criteria: 2-3 measurable, verifiable outcomes

Success Criteria

✅ Each action item has specific owner
✅ Each action item has realistic due date
✅ Description is specific (not vague like "be more careful")
✅ Success criteria are measurable and verifiable
✅ Action items address root cause or contributing factors

Task 4: Update Personal Runbook (Optional)

Estimated Time: 1 minute

Objective: Document VLAN selection procedure for future reference.

Step 4.1: Create Runbook Section

Add to your personal troubleshooting runbook:

## VLAN Selection for New VPCs

**Before Creating VPC:**
1. Check VLANNamespace range: `kubectl get vlannamespace default -o jsonpath='{.spec.ranges}'`
2. Check existing VPC VLANs: `kubectl get vpc -A -o yaml | grep "vlan:" | sort | uniq`
3. Avoid reserved VLANs: 1020-1029 (system reserved)
4. Choose available VLAN from range, not in use, not reserved
5. Document chosen VLAN in VPC design notes

**Troubleshooting:**
- "VLAN conflict": Choose different VLAN
- "VLAN reserved": Avoid 1020-1029 range
- Use show-available-vlans.sh script or escalate if unclear

Success Criteria

✅ Runbook section created with clear title
✅ Procedure has numbered steps
✅ kubectl commands included with examples
✅ Reserved VLANs documented
✅ Troubleshooting tips included

Lab Summary

What You Accomplished:

You conducted a blameless post-incident review with:

✅ Factual timeline with timestamps and metrics
✅ Root cause identified using 5 Whys technique
✅ Systemic root cause (not individual blame)
✅ 2-3 SMART action items created
✅ Personal runbook updated with learnings

Key Takeaways:

Blameless reviews focus on systems, not people - "What allowed this?" not "Who did this?"
5 Whys reveals root cause - Keep asking "why" until you reach a systemic issue
SMART action items are actionable - Specific, Measurable, Actionable, Relevant, Time-bound
Operational knowledge compounds - Each PIR improves team competency
Continuous improvement is iterative - Small improvements add up over time

This PIR Would:

Prevent VLAN conflict recurrence through documentation and tooling
Improve team knowledge about VLAN selection
Create reusable procedures for future operators
Demonstrate professional operational practices

Troubleshooting

Common Lab Challenges

Challenge: "I can't identify a systemic root cause"

Symptom: Your root cause is an individual action, not a systemic issue.

Solution: Keep asking "why" until you reach a systemic issue. Test: "If we fix this, can the same mistake happen again?" If YES, keep digging.

Challenge: "My action items are vague"

Symptom: Action items like "Improve documentation" or "Be more careful."

Solution: Apply SMART criteria: Specific owner, deadline, measurable success criteria, actionable steps.

Challenge: "Timeline is incomplete"

Solution: Include creation, detection, investigation, diagnosis, resolution attempts, escalation, and final resolution with timestamps.

Resources

Reference Documentation

Related Modules:

Module 4.1: Diagnosing Fabric Issues (incident details)
Module 4.2: Rollback & Recovery (resolution actions)
Module 4.3: Coordinating with Support (escalation timeline)

PIR Templates:

Complete template (provided in Concept 2)
SMART action item template (provided in Concept 3)

SRE Best Practices:

Google SRE Book: "Postmortem Culture: Learning from Failure"
Etsy's "Blameless PostMortems and Just Culture"
Atlassian Incident Handbook: "Post-Incident Reviews"

Quick Reference: PIR Checklist

Before PIR:

Gather incident timeline from logs, tickets, chat
Collect diagnostic evidence and resolution steps
Schedule PIR meeting within 1-3 days of resolution
Invite participants (operators, support, stakeholders)

During PIR:

Set blameless tone at opening
Document factual timeline
Conduct 5 Whys for root cause
Brainstorm improvements (defer judgment)
Create SMART action items with owners

After PIR:

Publish PIR document to team repository
Track action items to completion
Update runbooks and procedures
Share learnings with team
Monitor for recurrence

5 Whys Quick Reference

How to use:

1. Start with symptom/problem statement
2. Ask "Why did this happen?" → Record answer
3. Ask "Why [answer to 2]?" → Record answer
4. Ask "Why [answer to 3]?" → Record answer
5. Ask "Why [answer to 4]?" → Record answer
6. Ask "Why [answer to 5]?" → Record answer ← ROOT CAUSE

Root cause characteristics:

Systemic (process, tool, design issue)
Actionable (can be addressed with improvements)
Preventive (fixing it prevents recurrence)

If you reach individual blame ("operator mistake"), keep asking "why" - you haven't reached root cause yet.

SMART Action Item Checklist

Before finalizing an action item, verify:

Specific: What exactly will be done? Where?
Measurable: How will we verify completion?
Actionable: Can assigned person do this? Do they have access/tools?
Relevant: Does this address root cause or contributing factor?
Time-bound: When will this be done? Is deadline realistic?

If any criterion is not met, refine the action item until all criteria are satisfied.
