Introduction
Even with the best testing, production issues happen. Having a solid rollback and recovery strategy is critical for minimizing downtime and data loss when deployments go wrong.
This guide visualizes the complete rollback process:
- Issue Detection: Monitoring alerts and health checks
- Rollback Decision: When to rollback vs forward fix
- Rollback Execution: Different rollback strategies
- Data Recovery: Handling database changes
- Post-Incident: Learning and prevention
Part 1: Issue Detection Flow
From Healthy to Incident
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Production deployment
completed]) --> Monitor[Monitoring Systems
- Prometheus metrics
- Application logs
- User reports
- Health checks] Monitor --> Baseline[Baseline Metrics:
✓ Error rate: 0.1%
✓ Latency p95: 150ms
✓ Traffic: 10k req/min
✓ CPU: 40%
✓ Memory: 60%] Baseline --> Time[Time passes...
Minutes after deployment] Time --> Detect{Issue
detected?} Detect -->|No issue| Healthy[✅ Deployment Healthy
Continue monitoring
All metrics normal] Detect -->|Yes| IssueType{Issue
type?} IssueType --> ErrorSpike[🔴 Error Rate Spike
0.1% → 15%
Alert: HighErrorRate firing] IssueType --> LatencySpike[🟡 Latency Increase
p95: 150ms → 5000ms
Alert: HighLatency firing] IssueType --> TrafficDrop[🟠 Traffic Drop
10k → 1k req/min
Users can't access] IssueType --> ResourceIssue[🔴 Resource Exhaustion
CPU: 40% → 100%
OOMKilled events] IssueType --> DataCorruption[🔴 Data Issues
Database errors
Invalid data returned] ErrorSpike --> Severity1[Severity: CRITICAL
User impact: HIGH
Affecting all users] LatencySpike --> Severity2[Severity: WARNING
User impact: MEDIUM
Slow but functional] TrafficDrop --> Severity3[Severity: CRITICAL
User impact: HIGH
Complete outage] ResourceIssue --> Severity4[Severity: CRITICAL
User impact: HIGH
Pods crashing] DataCorruption --> Severity5[Severity: CRITICAL
User impact: CRITICAL
Data integrity at risk] Severity1 --> AutoAlert[🚨 Automated Alerts:
- PagerDuty page
- Slack notification
- Email alerts
- Status page update] Severity2 --> AutoAlert Severity3 --> AutoAlert Severity4 --> AutoAlert Severity5 --> AutoAlert AutoAlert --> OnCall[On-Call Engineer
Receives alert
Acknowledges incident] OnCall --> Investigate[Quick Investigation:
- Check deployment timeline
- Review recent changes
- Check logs
- Verify metrics] Investigate --> RootCause{Root cause
identified?} RootCause -->|Yes - Recent deployment| Decision[Go to Rollback Decision] RootCause -->|Yes - Other cause| OtherFix[Different remediation
Not deployment-related] RootCause -->|No - Time critical| Decision style Healthy fill:#064e3b,stroke:#10b981 style Severity1 fill:#7f1d1d,stroke:#ef4444 style Severity3 fill:#7f1d1d,stroke:#ef4444 style Severity4 fill:#7f1d1d,stroke:#ef4444 style Severity5 fill:#7f1d1d,stroke:#ef4444 style Severity2 fill:#78350f,stroke:#f59e0b
Part 2: Rollback Decision Tree
When to Rollback vs Forward Fix
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Production issue detected]) --> Assess[Assess situation:
- User impact
- Severity
- Time deployed
- Data changes] Assess --> Q1{Can issue be
fixed quickly?
< 5 min?} Q1 -->|Yes - Simple config| QuickFix[Forward Fix:
- Update config map
- Restart pods
- No rollback needed] Q1 -->|No| Q2{Is issue caused
by latest
deployment?} Q2 -->|No - External issue| External[External Root Cause:
- Third-party API down
- Database issue
- Infrastructure problem
→ Fix underlying issue] Q2 -->|Yes| Q3{User impact
severity?} Q3 -->|Low - Minor bugs| Q4{Time since
deployment?} Q4 -->|< 30 min| RollbackLow[Consider Rollback:
Low risk, easy rollback
Users barely affected] Q4 -->|> 30 min| ForwardFix[Forward Fix:
Deploy hotfix
More data changes
Rollback riskier] Q3 -->|Medium - Degraded| Q5{Data changes
made?} Q5 -->|No DB changes| RollbackMed[Rollback:
Safe to revert
No data migration
Quick recovery] Q5 -->|DB changes made| Q6{Can revert
DB changes?} Q6 -->|Yes - Reversible| RollbackWithDB[Rollback + DB Revert:
1. Revert application
2. Run down migration
Coordinate carefully] Q6 -->|No - Irreversible| ForwardOnly[Forward Fix ONLY:
Cannot rollback
Fix bug in new version
Data can't be reverted] Q3 -->|High - Outage| Q7{Rollback
time?} Q7 -->|< 5 min| ImmediateRollback[IMMEDIATE Rollback:
User impact too high
Rollback first
Debug later] Q7 -->|> 5 min| Q8{Forward fix
faster?} Q8 -->|Yes| HotfixDeploy[Deploy Hotfix:
If fix is obvious
and can deploy
faster than rollback] Q8 -->|No| ImmediateRollback QuickFix --> Monitor[Monitor metrics
Verify fix worked] RollbackLow --> ExecuteRollback[Execute Rollback] RollbackMed --> ExecuteRollback RollbackWithDB --> ExecuteRollback ImmediateRollback --> ExecuteRollback ForwardFix --> DeployFix[Deploy Forward Fix] HotfixDeploy --> DeployFix ForwardOnly --> DeployFix style ImmediateRollback fill:#7f1d1d,stroke:#ef4444 style RollbackWithDB fill:#78350f,stroke:#f59e0b style ForwardOnly fill:#78350f,stroke:#f59e0b style QuickFix fill:#064e3b,stroke:#10b981
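The main branches of the decision tree can be condensed into a small helper. This is a sketch only, assuming the on-call engineer supplies yes/no answers; the real decision still needs human judgment:

```shell
# Sketch of the rollback-vs-forward-fix decision tree above.
# Inputs are yes/no strings; severity is low/medium/high. Illustrative, not exhaustive.
decide() {
  local caused_by_deploy=$1 severity=$2 db_changes=$3 db_reversible=$4
  if [ "$caused_by_deploy" = "no" ]; then
    echo "fix-underlying-issue"; return    # external root cause, not a rollback case
  fi
  if [ "$db_changes" = "yes" ] && [ "$db_reversible" = "no" ]; then
    echo "forward-fix-only"; return        # data cannot be reverted
  fi
  case "$severity" in
    high)   echo "immediate-rollback" ;;   # user impact too high; debug later
    medium) echo "rollback" ;;
    low)    echo "consider-rollback-or-hotfix" ;;
  esac
}

decide yes high no x   # -> immediate-rollback
```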
Part 3: Rollback Execution Strategies
Application Rollback Methods
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Decision: Rollback]) --> Method{Deployment
strategy
used?} Method --> K8sRolling[Kubernetes Rolling Update] Method --> BlueGreen[Blue-Green Deployment] Method --> Canary[Canary Deployment] subgraph RollingRollback[Kubernetes Rolling Rollback] K8sRolling --> K8s1[kubectl rollout undo
deployment myapp] K8s1 --> K8s2[Kubernetes:
- Find previous ReplicaSet
- Rolling update to old version
- maxSurge: 1, maxUnavailable: 1] K8s2 --> K8s3[Gradual Pod Replacement:
1. Create 1 old version pod
2. Wait for ready
3. Terminate 1 new version pod
4. Repeat until all replaced] K8s3 --> K8s4[Time to rollback: 2-5 min
Downtime: None
Some users see old, some new] end subgraph BGRollback[Blue-Green Rollback] BlueGreen --> BG1[Current state:
Blue v1.0 IDLE
Green v2.0 ACTIVE 100%] BG1 --> BG2[Update Service selector:
version: green → version: blue] BG2 --> BG3[Instant Traffic Switch:
Blue v1.0 ACTIVE 100%
Green v2.0 IDLE 0%] BG3 --> BG4[Time to rollback: 1-2 sec
Downtime: ~1 sec
All users switched instantly] end subgraph CanaryRollback[Canary Rollback] Canary --> C1[Current state:
v1.0: 0 replicas
v2.0: 10 replicas 100%] C1 --> C2[Scale down v2.0:
v2.0: 10 → 0 replicas] C2 --> C3[Scale up v1.0:
v1.0: 0 → 10 replicas] C3 --> C4[Time to rollback: 1-3 min
Downtime: Minimal
Gradual traffic shift] end K8s4 --> Verify[Verification Steps] BG4 --> Verify C4 --> Verify Verify --> V1[1. Check pod status
kubectl get pods
All running?] V1 --> V2[2. Run health checks
curl /health
All healthy?] V2 --> V3[3. Monitor metrics
Error rate back to normal?
Latency improved?] V3 --> V4[4. Check user reports
Are users reporting success?] V4 --> Success{Rollback
successful?} Success -->|Yes| Complete[✅ Rollback Complete
Service restored
Monitor closely] Success -->|No| StillBroken[🚨 Still Broken!
Issue not deployment-related
Deeper investigation needed] style K8s4 fill:#1e3a8a,stroke:#3b82f6 style BG4 fill:#064e3b,stroke:#10b981 style C4 fill:#1e3a8a,stroke:#3b82f6 style Complete fill:#064e3b,stroke:#10b981 style StillBroken fill:#7f1d1d,stroke:#ef4444
Part 4: Database Rollback Complexity
Handling Database Migrations
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Need to rollback
with DB changes]) --> Analyze[Analyze migration type] Analyze --> Type{Migration
type?} Type --> AddColumn[Added Column
ALTER TABLE users
ADD COLUMN email] Type --> DropColumn[Dropped Column
ALTER TABLE users
DROP COLUMN phone] Type --> ModifyColumn[Modified Column
ALTER TABLE users
ALTER COLUMN age TYPE bigint] Type --> AddTable[Added Table
CREATE TABLE orders] AddColumn --> AC1{Column has
data?} AC1 -->|No data yet| AC2[Safe Rollback:
1. Deploy old app version
2. DROP COLUMN email
Old app doesn't use it] AC1 -->|Has data| AC3[⚠️ Data Loss Risk:
1. Backup table first
2. Consider keeping column
3. Deploy old app version
Column ignored by old app] DropColumn --> DC1[🚨 CANNOT Rollback:
Data already lost
Forward fix ONLY
Options:
1. Restore from backup
2. Accept data loss
3. Recreate from logs] ModifyColumn --> MC1{Data
compatible?} MC1 -->|Yes - reversible| MC2[Revert Column Type:
ALTER COLUMN age TYPE int
Verify no data truncation
Then deploy old app] MC1 -->|No - data loss| MC3[🚨 Cannot Revert:
bigint values exceed int range
Forward fix ONLY] AddTable --> AT1{Table has
critical data?} AT1 -->|No data| AT2[Safe Rollback:
1. Deploy old app version
2. DROP TABLE orders
No data lost] AT1 -->|Has data| AT3[Risky Rollback:
1. BACKUP TABLE orders
2. DROP TABLE orders
3. Deploy old app version
Data preserved in backup] AC2 --> SafeProcess[Safe Rollback Process:
✅ No data loss
✅ Quick rollback
✅ Reversible] AC3 --> RiskyProcess[Risky Rollback Process:
⚠️ Potential data loss
⚠️ Need backup
⚠️ Manual intervention] DC1 --> NoRollback[Forward Fix Only:
❌ Cannot rollback
❌ Data already lost
❌ Must fix forward] MC2 --> SafeProcess MC3 --> NoRollback AT2 --> SafeProcess AT3 --> RiskyProcess SafeProcess --> Execute1[Execute Safe Rollback] RiskyProcess --> Decision{Acceptable
risk?} Decision -->|Yes| Execute2[Execute with Caution] Decision -->|No| NoRollback NoRollback --> HotfixDeploy[Deploy Hotfix:
New version with fix
Keep new schema] style SafeProcess fill:#064e3b,stroke:#10b981 style RiskyProcess fill:#78350f,stroke:#f59e0b style NoRollback fill:#7f1d1d,stroke:#ef4444 style DC1 fill:#7f1d1d,stroke:#ef4444 style MC3 fill:#7f1d1d,stroke:#ef4444
Part 5: Complete Rollback Workflow
From Detection to Recovery
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
sequenceDiagram
participant Monitor as Monitoring
participant Alert as Alerting
participant Engineer as On-Call Engineer
participant Incident as Incident Channel
participant K8s as Kubernetes
participant DB as Database
participant Users as End Users
Note over Monitor: 5 minutes after deployment
Monitor->>Monitor: Detect anomaly:
Error rate: 0.1% → 18%
Latency p95: 150ms → 3000ms Monitor->>Alert: Trigger alert:
HighErrorRate FIRING Alert->>Engineer: 🚨 PagerDuty call
Critical alert
Production incident Engineer->>Alert: Acknowledge alert
Stop escalation Engineer->>Incident: Create #incident-456
"High error rate after v2.5 deployment" Note over Engineer: Open laptop
Start investigation Engineer->>Monitor: Check Grafana dashboard
When did issue start?
Which endpoints affected? Monitor-->>Engineer: Started 5 min ago
Right after deployment
All endpoints affected Engineer->>K8s: kubectl get pods
Check pod status K8s-->>Engineer: All pods Running
No crashes
Health checks passing Engineer->>K8s: kubectl logs deployment/myapp
Check application logs K8s-->>Engineer: ERROR: Cannot connect to cache
ERROR: Redis timeout
ERROR: Connection refused Note over Engineer: Root cause: New version
has Redis connection bug Engineer->>Incident: Update: Redis connection issue in v2.5
Decision: Rollback to v2.4 Note over Engineer: Check deployment history Engineer->>K8s: kubectl rollout history deployment/myapp K8s-->>Engineer: REVISION 10: v2.5 (current)
REVISION 9: v2.4 (previous) Engineer->>Incident: Starting rollback to v2.4
ETA: 3 minutes Engineer->>K8s: kubectl rollout undo deployment/myapp K8s->>K8s: Start rollback:
- Create pods with v2.4
- Wait for ready
- Terminate v2.5 pods loop Rolling Update K8s->>Users: Some users on v2.4 ✓
Some users on v2.5 ✗ Note over K8s: Pod 1: v2.4 Ready
Terminating v2.5 Pod 1 Engineer->>K8s: kubectl rollout status
deployment/myapp --watch K8s-->>Engineer: Waiting for rollout:
2/5 pods updated end K8s->>Users: All users now on v2.4 ✓ K8s-->>Engineer: Rollout complete:
deployment "myapp" successfully rolled out Engineer->>Monitor: Check metrics Note over Monitor: Wait 2 minutes
for metrics to stabilize Monitor-->>Engineer: ✅ Error rate: 0.1%
✅ Latency p95: 160ms
✅ All metrics normal Note over Alert: Metrics normalized Alert->>Engineer: ✅ Alert resolved:
HighErrorRate Engineer->>Users: Verify user experience Users-->>Engineer: No error reports
Application working Engineer->>Incident: ✅ Incident resolved
Service restored to v2.4
Duration: 12 minutes
Root cause: Redis bug in v2.5 Engineer->>Incident: Next steps:
1. Fix Redis bug
2. Add integration test
3. Post-mortem scheduled Note over Engineer: Create follow-up tasks Engineer->>Engineer: Create Jira tickets:
- BUG-789: Fix Redis connection
- TEST-123: Add cache integration test
- DOC-456: Update deployment checklist Note over Engineer,Users: Service restored ✓
Monitoring continues
Part 6: Automated Rollback
Auto-Rollback Decision Flow
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Deployment completed]) --> Monitor[Continuous Monitoring
Every 30 seconds] Monitor --> Collect[Collect Metrics:
- Error rate
- Latency p95/p99
- Success rate
- Pod health
- Resource usage] Collect --> Check1{Error rate
> 5%?} Check1 -->|Yes| Trigger1[🚨 Trigger auto-rollback
Error threshold exceeded] Check1 -->|No| Check2{Latency p95
> 2x baseline?} Check2 -->|Yes| Trigger2[🚨 Trigger auto-rollback
Latency degradation] Check2 -->|No| Check3{Pod crash
rate > 50%?} Check3 -->|Yes| Trigger3[🚨 Trigger auto-rollback
Pods failing] Check3 -->|No| Check4{Custom metric
threshold?} Check4 -->|Yes| Trigger4[🚨 Trigger auto-rollback
Business metric failed] Check4 -->|No| Healthy[✅ All checks passed
Continue monitoring] Healthy --> TimeCheck{Monitoring
duration?} TimeCheck -->|< 15 min| Monitor TimeCheck -->|>= 15 min| Stable[✅ Deployment STABLE
Passed soak period
Auto-rollback disabled] Trigger1 --> Rollback[Execute Auto-Rollback] Trigger2 --> Rollback Trigger3 --> Rollback Trigger4 --> Rollback Rollback --> R1[1. Log rollback decision
Metrics that triggered
Timestamp] R1 --> R2[2. Alert team:
PagerDuty critical
Slack notification
"Auto-rollback initiated"] R2 --> R3[3. Execute rollback:
kubectl rollout undo
deployment/myapp] R3 --> R4[4. Wait for rollback:
Monitor pod status
Wait for all pods ready] R4 --> R5[5. Verify recovery:
Check metrics again
Error rate normal?
Latency normal?] R5 --> Verify{Recovery
successful?} Verify -->|Yes| Success[✅ Auto-Rollback Success
Service restored
Notify team
Create incident report] Verify -->|No| StillFailing[🚨 Still Failing!
Issue not deployment
Page on-call immediately
Manual intervention needed] style Healthy fill:#064e3b,stroke:#10b981 style Stable fill:#064e3b,stroke:#10b981 style Success fill:#064e3b,stroke:#10b981 style Trigger1 fill:#7f1d1d,stroke:#ef4444 style Trigger2 fill:#7f1d1d,stroke:#ef4444 style Trigger3 fill:#7f1d1d,stroke:#ef4444 style Trigger4 fill:#7f1d1d,stroke:#ef4444 style StillFailing fill:#7f1d1d,stroke:#ef4444
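The threshold checks in this monitoring loop are simple comparisons. A minimal bash sketch, assuming metrics arrive as integers (error rate in percent, latencies in milliseconds); thresholds mirror the flow above:

```shell
# Minimal sketch of the auto-rollback checks: error rate > 5%,
# or latency p95 > 2x baseline. Inputs are integers; illustrative only.
should_rollback() {
  local error_pct=$1 latency_p95_ms=$2 baseline_ms=$3
  if [ "$error_pct" -gt 5 ]; then
    echo "rollback: error rate ${error_pct}% > 5%"
    return 0
  fi
  if [ "$latency_p95_ms" -gt $((2 * baseline_ms)) ]; then
    echo "rollback: p95 ${latency_p95_ms}ms > 2x baseline"
    return 0
  fi
  echo "healthy"
  return 1
}

# Example: the incident from Part 5 (18% errors, p95 3000ms vs 150ms baseline)
should_rollback 18 3000 150
```

A real controller (like Flagger, configured below) evaluates the same kind of thresholds against Prometheus queries rather than shell arguments.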
Auto-Rollback Configuration
# Flagger auto-rollback configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
  # Canary analysis
  analysis:
    interval: 30s
    threshold: 5          # Rollback after 5 failed checks
    maxWeight: 50
    stepWeight: 10
    # Metrics for auto-rollback decision
    metrics:
      # HTTP error rate
      - name: request-success-rate
        thresholdRange:
          min: 95         # Rollback if success rate < 95%
        interval: 1m
      # HTTP latency
      - name: request-duration
        thresholdRange:
          max: 500        # Rollback if p95 > 500ms
        interval: 1m
      # Custom business metric
      - name: conversion-rate
        thresholdRange:
          min: 80         # Rollback if conversion < 80% of baseline
        interval: 2m
    # Webhooks for additional checks
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          type: bash
          cmd: "hey -z 1m -q 10 http://myapp-canary:8080/"
    # Alerting on rollback
    alerts:
      - name: slack
        severity: error
        providerRef:
          name: slack
          namespace: flagger
Part 7: Post-Incident Process
Learning from Rollbacks
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Rollback completed
Service restored]) --> Timeline[Create Incident Timeline:
- Deployment time
- Issue detection time
- Rollback decision time
- Recovery time
Total duration] Timeline --> PostMortem[Schedule Post-Mortem:
Within 48 hours
All stakeholders invited
Blameless culture] PostMortem --> Analyze[Root Cause Analysis:
Why did issue occur?
Why wasn't it caught?
What can we learn?] Analyze --> Categories{Issue
category?} Categories --> Testing[Insufficient Testing:
- Missing test case
- Integration gap
- Load testing needed] Categories --> Monitoring[Monitoring Gap:
- Missing alert
- Wrong threshold
- Blind spot found] Categories --> Process[Process Issue:
- Skipped step
- Wrong timing
- Communication gap] Categories --> Code[Code Quality:
- Bug in code
- Edge case
- Dependency issue] Testing --> Actions1[Action Items:
□ Add integration test
□ Expand E2E coverage
□ Add load test
□ Test in staging first] Monitoring --> Actions2[Action Items:
□ Add new alert
□ Adjust thresholds
□ Add dashboard
□ Improve visibility] Process --> Actions3[Action Items:
□ Update runbook
□ Add checklist item
□ Change deployment time
□ Improve communication] Code --> Actions4[Action Items:
□ Fix bug
□ Add validation
□ Update dependency
□ Code review process] Actions1 --> Assign[Assign Owners:
Each action has owner
Each action has deadline
Track in project board] Actions2 --> Assign Actions3 --> Assign Actions4 --> Assign Assign --> Document[Document Learnings:
- Update wiki
- Share with team
- Add to knowledge base
- Update training] Document --> Prevent[Prevent Recurrence:
✓ Tests added
✓ Monitoring improved
✓ Process updated
✓ Team educated] Prevent --> Complete[✅ Post-Incident Complete
Stronger system
Better prepared
Continuous improvement] style Complete fill:#064e3b,stroke:#10b981
Part 8: Rollback Checklist
Pre-Deployment Rollback Readiness
Before Every Deployment:
✅ Rollback Readiness Checklist:
Database:
□ Migrations are reversible (up/down scripts)
□ Schema changes backward compatible
□ Database backup taken before deployment
□ Tested rollback procedure in staging
Application:
□ Previous version still available in registry
□ Previous version deployment manifests saved
□ Feature flags enabled for new features
□ Rollback procedure documented
Monitoring:
□ Alerts configured for key metrics
□ Dashboards updated with new metrics
□ Auto-rollback thresholds configured
□ On-call rotation confirmed
Communication:
□ Team notified of deployment
□ Status page ready for updates
□ Incident channel created
□ Rollback decision-maker identified
Testing:
□ Canary/blue-green strategy chosen
□ Smoke tests automated
□ Rollback tested in staging
□ Load tests completed
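Several of these checklist items can be captured as commands run before every deploy. A hedged sketch; database name, deployment name, registry, and paths are placeholders:

```shell
# Snapshot state before deploying so a rollback has something to return to.
# All names and paths below are illustrative placeholders.

# Database backup (PostgreSQL example)
pg_dump --format=custom mydb > backup-$(date +%Y%m%d-%H%M%S).dump

# Save the currently deployed manifests
kubectl get deployment myapp -o yaml > myapp-predeploy.yaml

# Confirm the previous image is still pullable from the registry
docker manifest inspect registry.example.com/myapp:v2.4 > /dev/null && \
  echo "previous image available"
```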
Conclusion
Effective rollback and recovery requires:
- Fast Detection: Monitoring and alerting to catch issues early
- Clear Decision Process: When to rollback vs forward fix
- Automated Rollback: Reduce manual intervention and human error
- Database Strategy: Handle schema changes carefully
- Post-Incident Learning: Improve system after each incident
Key principles:
- Have a rollback plan before deploying
- Automate rollback triggers when possible
- Maintain backward compatibility
- Test rollback procedures regularly
- Learn from every incident
Rollback strategies by deployment type:
- Rolling Update: kubectl rollout undo (2-5 min)
- Blue-Green: switch Service selector (1-2 sec)
- Canary: scale down new, scale up old (1-3 min)
The visual diagrams in this guide show the complete rollback process from detection to recovery, ensuring you can restore service quickly when things go wrong.
Further Reading
- Google SRE Book - Emergency Response
- Kubernetes Rollback Documentation
- Database Migration Best Practices
- Post-Incident Review Templates
Plan for rollback, hope you never need it!