Introduction
Even with the best testing, production issues happen. Having a solid rollback and recovery strategy is critical for minimizing downtime and data loss when deployments go wrong.
This guide visualizes the complete rollback process:
- Issue Detection: Monitoring alerts and health checks
- Rollback Decision: When to rollback vs forward fix
- Rollback Execution: Different rollback strategies
- Data Recovery: Handling database changes
- Post-Incident: Learning and prevention
Part 1: Issue Detection Flow
From Healthy to Incident
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Production deployment
completed]) --> Monitor[Monitoring Systems
- Prometheus metrics
- Application logs
- User reports
- Health checks] Monitor --> Baseline[Baseline Metrics:
✓ Error rate: 0.1%
✓ Latency p95: 150ms
✓ Traffic: 10k req/min
✓ CPU: 40%
✓ Memory: 60%] Baseline --> Time[Time passes...
Minutes after deployment] Time --> Detect{Issue
detected?} Detect -->|No issue| Healthy[✅ Deployment Healthy
Continue monitoring
All metrics normal] Detect -->|Yes| IssueType{Issue
type?} IssueType --> ErrorSpike[🔴 Error Rate Spike
0.1% → 15%
Alert: HighErrorRate firing] IssueType --> LatencySpike[🟡 Latency Increase
p95: 150ms → 5000ms
Alert: HighLatency firing] IssueType --> TrafficDrop[🟠 Traffic Drop
10k → 1k req/min
Users can't access] IssueType --> ResourceIssue[🔴 Resource Exhaustion
CPU: 40% → 100%
OOMKilled events] IssueType --> DataCorruption[🔴 Data Issues
Database errors
Invalid data returned] ErrorSpike --> Severity1[Severity: CRITICAL
User impact: HIGH
Affecting all users] LatencySpike --> Severity2[Severity: WARNING
User impact: MEDIUM
Slow but functional] TrafficDrop --> Severity3[Severity: CRITICAL
User impact: HIGH
Complete outage] ResourceIssue --> Severity4[Severity: CRITICAL
User impact: HIGH
Pods crashing] DataCorruption --> Severity5[Severity: CRITICAL
User impact: CRITICAL
Data integrity at risk] Severity1 --> AutoAlert[🚨 Automated Alerts:
- PagerDuty page
- Slack notification
- Email alerts
- Status page update] Severity2 --> AutoAlert Severity3 --> AutoAlert Severity4 --> AutoAlert Severity5 --> AutoAlert AutoAlert --> OnCall[On-Call Engineer
Receives alert
Acknowledges incident] OnCall --> Investigate[Quick Investigation:
- Check deployment timeline
- Review recent changes
- Check logs
- Verify metrics] Investigate --> RootCause{Root cause
identified?} RootCause -->|Yes - Recent deployment| Decision[Go to Rollback Decision] RootCause -->|Yes - Other cause| OtherFix[Different remediation
Not deployment-related] RootCause -->|No - Time critical| Decision style Healthy fill:#064e3b,stroke:#10b981 style Severity1 fill:#7f1d1d,stroke:#ef4444 style Severity3 fill:#7f1d1d,stroke:#ef4444 style Severity4 fill:#7f1d1d,stroke:#ef4444 style Severity5 fill:#7f1d1d,stroke:#ef4444 style Severity2 fill:#78350f,stroke:#f59e0b
Part 2: Rollback Decision Tree
When to Rollback vs Forward Fix
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Production issue detected]) --> Assess[Assess situation:
- User impact
- Severity
- Time deployed
- Data changes] Assess --> Q1{Can issue be
fixed quickly?
< 5 min?} Q1 -->|Yes - Simple config| QuickFix[Forward Fix:
- Update config map
- Restart pods
- No rollback needed] Q1 -->|No| Q2{Is issue caused
by latest
deployment?} Q2 -->|No - External issue| External[External Root Cause:
- Third-party API down
- Database issue
- Infrastructure problem
→ Fix underlying issue] Q2 -->|Yes| Q3{User impact
severity?} Q3 -->|Low - Minor bugs| Q4{Time since
deployment?} Q4 -->|< 30 min| RollbackLow[Consider Rollback:
Low risk, easy rollback
Users barely affected] Q4 -->|> 30 min| ForwardFix[Forward Fix:
Deploy hotfix
More data changes
Rollback riskier] Q3 -->|Medium - Degraded| Q5{Data changes
made?} Q5 -->|No DB changes| RollbackMed[Rollback:
Safe to revert
No data migration
Quick recovery] Q5 -->|DB changes made| Q6{Can revert
DB changes?} Q6 -->|Yes - Reversible| RollbackWithDB[Rollback + DB Revert:
1. Revert application
2. Run down migration
Coordinate carefully] Q6 -->|No - Irreversible| ForwardOnly[Forward Fix ONLY:
Cannot rollback
Fix bug in new version
Data can't be reverted] Q3 -->|High - Outage| Q7{Rollback
time?} Q7 -->|< 5 min| ImmediateRollback[IMMEDIATE Rollback:
User impact too high
Rollback first
Debug later] Q7 -->|> 5 min| Q8{Forward fix
faster?} Q8 -->|Yes| HotfixDeploy[Deploy Hotfix:
If fix is obvious
and can deploy
faster than rollback] Q8 -->|No| ImmediateRollback QuickFix --> Monitor[Monitor metrics
Verify fix worked] RollbackLow --> ExecuteRollback[Execute Rollback] RollbackMed --> ExecuteRollback RollbackWithDB --> ExecuteRollback ImmediateRollback --> ExecuteRollback ForwardFix --> DeployFix[Deploy Forward Fix] HotfixDeploy --> DeployFix ForwardOnly --> DeployFix style ImmediateRollback fill:#7f1d1d,stroke:#ef4444 style RollbackWithDB fill:#78350f,stroke:#f59e0b style ForwardOnly fill:#78350f,stroke:#f59e0b style QuickFix fill:#064e3b,stroke:#10b981
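The main branches of the decision tree can be condensed into a small helper. This is a sketch only, assuming the on-call engineer supplies yes/no answers; the real decision still needs human judgment:

```shell
# Sketch of the rollback-vs-forward-fix decision tree above.
# Inputs are yes/no strings; severity is low/medium/high. Illustrative, not exhaustive.
decide() {
  local caused_by_deploy=$1 severity=$2 db_changes=$3 db_reversible=$4
  if [ "$caused_by_deploy" = "no" ]; then
    echo "fix-underlying-issue"; return    # external root cause, not a rollback case
  fi
  if [ "$db_changes" = "yes" ] && [ "$db_reversible" = "no" ]; then
    echo "forward-fix-only"; return        # data cannot be reverted
  fi
  case "$severity" in
    high)   echo "immediate-rollback" ;;   # user impact too high; debug later
    medium) echo "rollback" ;;
    low)    echo "consider-rollback-or-hotfix" ;;
  esac
}

decide yes high no x   # -> immediate-rollback
```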
Part 3: Rollback Execution Strategies
Application Rollback Methods
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Decision: Rollback]) --> Method{Deployment
strategy
used?} Method --> K8sRolling[Kubernetes Rolling Update] Method --> BlueGreen[Blue-Green Deployment] Method --> Canary[Canary Deployment] subgraph RollingRollback[Kubernetes Rolling Rollback] K8sRolling --> K8s1[kubectl rollout undo
deployment myapp] K8s1 --> K8s2[Kubernetes:
- Find previous ReplicaSet
- Rolling update to old version
- maxSurge: 1, maxUnavailable: 1] K8s2 --> K8s3[Gradual Pod Replacement:
1. Create 1 old version pod
2. Wait for ready
3. Terminate 1 new version pod
4. Repeat until all replaced] K8s3 --> K8s4[Time to rollback: 2-5 min
Downtime: None
Some users see old, some new] end subgraph BGRollback[Blue-Green Rollback] BlueGreen --> BG1[Current state:
Blue v1.0 IDLE
Green v2.0 ACTIVE 100%] BG1 --> BG2[Update Service selector:
version: green → version: blue] BG2 --> BG3[Instant Traffic Switch:
Blue v1.0 ACTIVE 100%
Green v2.0 IDLE 0%] BG3 --> BG4[Time to rollback: 1-2 sec
Downtime: ~1 sec
All users switched instantly] end subgraph CanaryRollback[Canary Rollback] Canary --> C1[Current state:
v1.0: 0 replicas
v2.0: 10 replicas 100%] C1 --> C2[Scale down v2.0:
v2.0: 10 → 0 replicas] C2 --> C3[Scale up v1.0:
v1.0: 0 → 10 replicas] C3 --> C4[Time to rollback: 1-3 min
Downtime: Minimal
Gradual traffic shift] end K8s4 --> Verify[Verification Steps] BG4 --> Verify C4 --> Verify Verify --> V1[1. Check pod status
kubectl get pods
All running?] V1 --> V2[2. Run health checks
curl /health
All healthy?] V2 --> V3[3. Monitor metrics
Error rate back to normal?
Latency improved?] V3 --> V4[4. Check user reports
Are users reporting success?] V4 --> Success{Rollback
successful?} Success -->|Yes| Complete[✅ Rollback Complete
Service restored
Monitor closely] Success -->|No| StillBroken[🚨 Still Broken!
Issue not deployment-related
Deeper investigation needed] style K8s4 fill:#1e3a8a,stroke:#3b82f6 style BG4 fill:#064e3b,stroke:#10b981 style C4 fill:#1e3a8a,stroke:#3b82f6 style Complete fill:#064e3b,stroke:#10b981 style StillBroken fill:#7f1d1d,stroke:#ef4444
Part 4: Database Rollback Complexity
Handling Database Migrations
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Need to rollback
with DB changes]) --> Analyze[Analyze migration type] Analyze --> Type{Migration
type?} Type --> AddColumn[Added Column
ALTER TABLE users
ADD COLUMN email] Type --> DropColumn[Dropped Column
ALTER TABLE users
DROP COLUMN phone] Type --> ModifyColumn[Modified Column
ALTER TABLE users
ALTER COLUMN age TYPE bigint] Type --> AddTable[Added Table
CREATE TABLE orders] AddColumn --> AC1{Column has
data?} AC1 -->|No data yet| AC2[Safe Rollback:
1. Deploy old app version
2. DROP COLUMN email
Old app doesn't use it] AC1 -->|Has data| AC3[⚠️ Data Loss Risk:
1. Backup table first
2. Consider keeping column
3. Deploy old app version
Column ignored by old app] DropColumn --> DC1[🚨 CANNOT Rollback:
Data already lost
Forward fix ONLY
Options:
1. Restore from backup
2. Accept data loss
3. Recreate from logs] ModifyColumn --> MC1{Data
compatible?} MC1 -->|Yes - reversible| MC2[Revert Column Type:
ALTER COLUMN age TYPE int
Verify no data truncation
Then deploy old app] MC1 -->|No - data loss| MC3[🚨 Cannot Revert:
bigint values exceed int range
Forward fix ONLY] AddTable --> AT1{Table has
critical data?} AT1 -->|No data| AT2[Safe Rollback:
1. Deploy old app version
2. DROP TABLE orders
No data lost] AT1 -->|Has data| AT3[Risky Rollback:
1. BACKUP TABLE orders
2. DROP TABLE orders
3. Deploy old app version
Data preserved in backup] AC2 --> SafeProcess[Safe Rollback Process:
✅ No data loss
✅ Quick rollback
✅ Reversible] AC3 --> RiskyProcess[Risky Rollback Process:
⚠️ Potential data loss
⚠️ Need backup
⚠️ Manual intervention] DC1 --> NoRollback[Forward Fix Only:
❌ Cannot rollback
❌ Data already lost
❌ Must fix forward] MC2 --> SafeProcess MC3 --> NoRollback AT2 --> SafeProcess AT3 --> RiskyProcess SafeProcess --> Execute1[Execute Safe Rollback] RiskyProcess --> Decision{Acceptable
risk?} Decision -->|Yes| Execute2[Execute with Caution] Decision -->|No| NoRollback NoRollback --> HotfixDeploy[Deploy Hotfix:
New version with fix
Keep new schema] style SafeProcess fill:#064e3b,stroke:#10b981 style RiskyProcess fill:#78350f,stroke:#f59e0b style NoRollback fill:#7f1d1d,stroke:#ef4444 style DC1 fill:#7f1d1d,stroke:#ef4444 style MC3 fill:#7f1d1d,stroke:#ef4444
Part 5: Complete Rollback Workflow
From Detection to Recovery
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
sequenceDiagram
participant Monitor as Monitoring
participant Alert as Alerting
participant Engineer as On-Call Engineer
participant Incident as Incident Channel
participant K8s as Kubernetes
participant DB as Database
participant Users as End Users
Note over Monitor: 5 minutes after deployment
Monitor->>Monitor: Detect anomaly:
Error rate: 0.1% → 18%
Latency p95: 150ms → 3000ms Monitor->>Alert: Trigger alert:
HighErrorRate FIRING Alert->>Engineer: 🚨 PagerDuty call
Critical alert
Production incident Engineer->>Alert: Acknowledge alert
Stop escalation Engineer->>Incident: Create #incident-456
"High error rate after v2.5 deployment" Note over Engineer: Open laptop
Start investigation Engineer->>Monitor: Check Grafana dashboard
When did issue start?
Which endpoints affected? Monitor-->>Engineer: Started 5 min ago
Right after deployment
All endpoints affected Engineer->>K8s: kubectl get pods
Check pod status K8s-->>Engineer: All pods Running
No crashes
Health checks passing Engineer->>K8s: kubectl logs deployment/myapp
Check application logs K8s-->>Engineer: ERROR: Cannot connect to cache
ERROR: Redis timeout
ERROR: Connection refused Note over Engineer: Root cause: New version
has Redis connection bug Engineer->>Incident: Update: Redis connection issue in v2.5
Decision: Rollback to v2.4 Note over Engineer: Check deployment history Engineer->>K8s: kubectl rollout history deployment/myapp K8s-->>Engineer: REVISION 10: v2.5 (current)
REVISION 9: v2.4 (previous) Engineer->>Incident: Starting rollback to v2.4
ETA: 3 minutes Engineer->>K8s: kubectl rollout undo deployment/myapp K8s->>K8s: Start rollback:
- Create pods with v2.4
- Wait for ready
- Terminate v2.5 pods loop Rolling Update K8s->>Users: Some users on v2.4 ✓
Some users on v2.5 ✗ Note over K8s: Pod 1: v2.4 Ready
Terminating v2.5 Pod 1 Engineer->>K8s: kubectl rollout status
deployment/myapp --watch K8s-->>Engineer: Waiting for rollout:
2/5 pods updated end K8s->>Users: All users now on v2.4 ✓ K8s-->>Engineer: Rollout complete:
deployment "myapp" successfully rolled out Engineer->>Monitor: Check metrics Note over Monitor: Wait 2 minutes
for metrics to stabilize Monitor-->>Engineer: ✅ Error rate: 0.1%
✅ Latency p95: 160ms
✅ All metrics normal Note over Alert: Metrics normalized Alert->>Engineer: ✅ Alert resolved:
HighErrorRate Engineer->>Users: Verify user experience Users-->>Engineer: No error reports
Application working Engineer->>Incident: ✅ Incident resolved
Service restored to v2.4
Duration: 12 minutes
Root cause: Redis bug in v2.5 Engineer->>Incident: Next steps:
1. Fix Redis bug
2. Add integration test
3. Post-mortem scheduled Note over Engineer: Create follow-up tasks Engineer->>Engineer: Create Jira tickets:
- BUG-789: Fix Redis connection
- TEST-123: Add cache integration test
- DOC-456: Update deployment checklist Note over Engineer,Users: Service restored ✓
Monitoring continues
Part 6: Automated Rollback
Auto-Rollback Decision Flow
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Deployment completed]) --> Monitor[Continuous Monitoring
Every 30 seconds] Monitor --> Collect[Collect Metrics:
- Error rate
- Latency p95/p99
- Success rate
- Pod health
- Resource usage] Collect --> Check1{Error rate
> 5%?} Check1 -->|Yes| Trigger1[🚨 Trigger auto-rollback
Error threshold exceeded] Check1 -->|No| Check2{Latency p95
> 2x baseline?} Check2 -->|Yes| Trigger2[🚨 Trigger auto-rollback
Latency degradation] Check2 -->|No| Check3{Pod crash
rate > 50%?} Check3 -->|Yes| Trigger3[🚨 Trigger auto-rollback
Pods failing] Check3 -->|No| Check4{Custom metric
threshold?} Check4 -->|Yes| Trigger4[🚨 Trigger auto-rollback
Business metric failed] Check4 -->|No| Healthy[✅ All checks passed
Continue monitoring] Healthy --> TimeCheck{Monitoring
duration?} TimeCheck -->|< 15 min| Monitor TimeCheck -->|>= 15 min| Stable[✅ Deployment STABLE
Passed soak period
Auto-rollback disabled] Trigger1 --> Rollback[Execute Auto-Rollback] Trigger2 --> Rollback Trigger3 --> Rollback Trigger4 --> Rollback Rollback --> R1[1. Log rollback decision
Metrics that triggered
Timestamp] R1 --> R2[2. Alert team:
PagerDuty critical
Slack notification
"Auto-rollback initiated"] R2 --> R3[3. Execute rollback:
kubectl rollout undo
deployment/myapp] R3 --> R4[4. Wait for rollback:
Monitor pod status
Wait for all pods ready] R4 --> R5[5. Verify recovery:
Check metrics again
Error rate normal?
Latency normal?] R5 --> Verify{Recovery
successful?} Verify -->|Yes| Success[✅ Auto-Rollback Success
Service restored
Notify team
Create incident report] Verify -->|No| StillFailing[🚨 Still Failing!
Issue not deployment
Page on-call immediately
Manual intervention needed] style Healthy fill:#064e3b,stroke:#10b981 style Stable fill:#064e3b,stroke:#10b981 style Success fill:#064e3b,stroke:#10b981 style Trigger1 fill:#7f1d1d,stroke:#ef4444 style Trigger2 fill:#7f1d1d,stroke:#ef4444 style Trigger3 fill:#7f1d1d,stroke:#ef4444 style Trigger4 fill:#7f1d1d,stroke:#ef4444 style StillFailing fill:#7f1d1d,stroke:#ef4444
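The threshold checks in this monitoring loop are simple comparisons. A minimal bash sketch, assuming metrics arrive as integers (error rate in percent, latencies in milliseconds); thresholds mirror the flow above:

```shell
# Minimal sketch of the auto-rollback checks: error rate > 5%,
# or latency p95 > 2x baseline. Inputs are integers; illustrative only.
should_rollback() {
  local error_pct=$1 latency_p95_ms=$2 baseline_ms=$3
  if [ "$error_pct" -gt 5 ]; then
    echo "rollback: error rate ${error_pct}% > 5%"
    return 0
  fi
  if [ "$latency_p95_ms" -gt $((2 * baseline_ms)) ]; then
    echo "rollback: p95 ${latency_p95_ms}ms > 2x baseline"
    return 0
  fi
  echo "healthy"
  return 1
}

# Example: the incident from Part 5 (18% errors, p95 3000ms vs 150ms baseline)
should_rollback 18 3000 150
```

A real controller (like Flagger, configured below) evaluates the same kind of thresholds against Prometheus queries rather than shell arguments.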
Auto-Rollback Configuration
# Flagger auto-rollback configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
  # Canary analysis
  analysis:
    interval: 30s
    threshold: 5          # Rollback after 5 failed checks
    maxWeight: 50
    stepWeight: 10
    # Metrics for auto-rollback decision
    metrics:
      # HTTP error rate
      - name: request-success-rate
        thresholdRange:
          min: 95         # Rollback if success rate < 95%
        interval: 1m
      # HTTP latency
      - name: request-duration
        thresholdRange:
          max: 500        # Rollback if p95 > 500ms
        interval: 1m
      # Custom business metric
      - name: conversion-rate
        thresholdRange:
          min: 80         # Rollback if conversion < 80% of baseline
        interval: 2m
    # Webhooks for additional checks
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          type: bash
          cmd: "hey -z 1m -q 10 http://myapp-canary:8080/"
    # Alerting on rollback
    alerts:
      - name: slack
        severity: error
        providerRef:
          name: slack
          namespace: flagger
Part 7: Post-Incident Process
Learning from Rollbacks
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
Start([Rollback completed
Service restored]) --> Timeline[Create Incident Timeline:
- Deployment time
- Issue detection time
- Rollback decision time
- Recovery time
Total duration] Timeline --> PostMortem[Schedule Post-Mortem:
Within 48 hours
All stakeholders invited
Blameless culture] PostMortem --> Analyze[Root Cause Analysis:
Why did issue occur?
Why wasn't it caught?
What can we learn?] Analyze --> Categories{Issue
category?} Categories --> Testing[Insufficient Testing:
- Missing test case
- Integration gap
- Load testing needed] Categories --> Monitoring[Monitoring Gap:
- Missing alert
- Wrong threshold
- Blind spot found] Categories --> Process[Process Issue:
- Skipped step
- Wrong timing
- Communication gap] Categories --> Code[Code Quality:
- Bug in code
- Edge case
- Dependency issue] Testing --> Actions1[Action Items:
□ Add integration test
□ Expand E2E coverage
□ Add load test
□ Test in staging first] Monitoring --> Actions2[Action Items:
□ Add new alert
□ Adjust thresholds
□ Add dashboard
□ Improve visibility] Process --> Actions3[Action Items:
□ Update runbook
□ Add checklist item
□ Change deployment time
□ Improve communication] Code --> Actions4[Action Items:
□ Fix bug
□ Add validation
□ Update dependency
□ Code review process] Actions1 --> Assign[Assign Owners:
Each action has owner
Each action has deadline
Track in project board] Actions2 --> Assign Actions3 --> Assign Actions4 --> Assign Assign --> Document[Document Learnings:
- Update wiki
- Share with team
- Add to knowledge base
- Update training] Document --> Prevent[Prevent Recurrence:
✓ Tests added
✓ Monitoring improved
✓ Process updated
✓ Team educated] Prevent --> Complete[✅ Post-Incident Complete
Stronger system
Better prepared
Continuous improvement] style Complete fill:#064e3b,stroke:#10b981
Part 8: Rollback Checklist
Pre-Deployment Rollback Readiness
Before Every Deployment:
✅ Rollback Readiness Checklist:
Database:
□ Migrations are reversible (up/down scripts)
□ Schema changes backward compatible
□ Database backup taken before deployment
□ Tested rollback procedure in staging
Application:
□ Previous version still available in registry
□ Previous version deployment manifests saved
□ Feature flags enabled for new features
□ Rollback procedure documented
Monitoring:
□ Alerts configured for key metrics
□ Dashboards updated with new metrics
□ Auto-rollback thresholds configured
□ On-call rotation confirmed
Communication:
□ Team notified of deployment
□ Status page ready for updates
□ Incident channel created
□ Rollback decision-maker identified
Testing:
□ Canary/blue-green strategy chosen
□ Smoke tests automated
□ Rollback tested in staging
□ Load tests completed
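Several of these checklist items can be captured as commands run before every deploy. A hedged sketch; database name, deployment name, registry, and paths are placeholders:

```shell
# Snapshot state before deploying so a rollback has something to return to.
# All names and paths below are illustrative placeholders.

# Database backup (PostgreSQL example)
pg_dump --format=custom mydb > backup-$(date +%Y%m%d-%H%M%S).dump

# Save the currently deployed manifests
kubectl get deployment myapp -o yaml > myapp-predeploy.yaml

# Confirm the previous image is still pullable from the registry
docker manifest inspect registry.example.com/myapp:v2.4 > /dev/null && \
  echo "previous image available"
```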
Conclusion
Effective rollback and recovery requires:
- Fast Detection: Monitoring and alerting to catch issues early
- Clear Decision Process: When to rollback vs forward fix
- Automated Rollback: Reduce manual intervention and human error
- Database Strategy: Handle schema changes carefully
- Post-Incident Learning: Improve system after each incident
Key principles:
- Have a rollback plan before deploying
- Automate rollback triggers when possible
- Maintain backward compatibility
- Test rollback procedures regularly
- Learn from every incident
Rollback strategies by deployment type:
- Rolling Update: kubectl rollout undo (2-5 min)
- Blue-Green: switch Service selector (1-2 sec)
- Canary: scale down new, scale up old (1-3 min)
The visual diagrams in this guide show the complete rollback process from detection to recovery, ensuring you can restore service quickly when things go wrong.
Further Reading
- Google SRE Book - Emergency Response
- Kubernetes Rollback Documentation
- Database Migration Best Practices
- Post-Incident Review Templates
Plan for rollback, hope you never need it!