Rollback & Recovery: Detection to Previous Version

Introduction

Even with the best testing, production issues happen. Having a solid rollback and recovery strategy is critical for minimizing downtime and data loss when deployments go wrong.

This guide visualizes the complete rollback process:

- Issue Detection: Monitoring alerts and health checks
- Rollback Decision: When to roll back vs. fix forward
- Rollback Execution: Different rollback strategies
- Data Recovery: Handling database changes
- Post-Incident: Learning and prevention

Part 1: Issue Detection Flow

From Healthy to Incident

```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Production deployment<br/>completed]) --> Monitor[Monitoring Systems<br/>- Prometheus metrics<br/>- Application logs<br/>- User reports<br/>- Health checks]
    Monitor --> Baseline[Baseline Metrics:<br/>✓ Error rate: 0.1%<br/>✓ Latency p95: 150ms<br/>✓ Traffic: 10k req/min<br/>✓ CPU: 40%<br/>✓ Memory: 60%]
    Baseline --> Time[Time passes...<br/>Minutes after deployment]
    Time --> Detect{Issue<br/>detected?}
    Detect -->|No issue| Healthy[✅ Deployment Healthy<br/>Continue monitoring<br/>All metrics normal]
    Detect -->|Yes| IssueType{Issue<br/>type?}
    IssueType --> ErrorSpike[🔴 Error Rate Spike<br/>0.1% → 15%<br/>Alert: HighErrorRate firing]
    IssueType --> LatencySpike[🟡 Latency Increase<br/>p95: 150ms → 5000ms<br/>Alert: HighLatency firing]
    IssueType --> TrafficDrop[🟠 Traffic Drop<br/>10k → 1k req/min<br/>Users can't access]
    IssueType --> ResourceIssue[🔴 Resource Exhaustion<br/>CPU: 40% → 100%<br/>OOMKilled events]
    IssueType --> DataCorruption[🔴 Data Issues<br/>Database errors<br/>Invalid data returned]
    ErrorSpike --> Severity1[Severity: CRITICAL<br/>User impact: HIGH<br/>Affecting all users]
    LatencySpike --> Severity2[Severity: WARNING<br/>User impact: MEDIUM<br/>Slow but functional]
    TrafficDrop --> Severity3[Severity: CRITICAL<br/>User impact: HIGH<br/>Complete outage]
    ResourceIssue --> Severity4[Severity: CRITICAL<br/>User impact: HIGH<br/>Pods crashing]
    DataCorruption --> Severity5[Severity: CRITICAL<br/>User impact: CRITICAL<br/>Data integrity at risk]
    Severity1 --> AutoAlert[🚨 Automated Alerts:<br/>- PagerDuty page<br/>- Slack notification<br/>- Email alerts<br/>- Status page update]
    Severity2 --> AutoAlert
    Severity3 --> AutoAlert
    Severity4 --> AutoAlert
    Severity5 --> AutoAlert
    AutoAlert --> OnCall[On-Call Engineer<br/>Receives alert<br/>Acknowledges incident]
    OnCall --> Investigate[Quick Investigation:<br/>- Check deployment timeline<br/>- Review recent changes<br/>- Check logs<br/>- Verify metrics]
    Investigate --> RootCause{Root cause<br/>identified?}
    RootCause -->|Yes - Recent deployment| Decision[Go to Rollback Decision]
    RootCause -->|Yes - Other cause| OtherFix[Different remediation<br/>Not deployment-related]
    RootCause -->|No - Time critical| Decision
    style Healthy fill:#064e3b,stroke:#10b981
    style Severity1 fill:#7f1d1d,stroke:#ef4444
    style Severity3 fill:#7f1d1d,stroke:#ef4444
    style Severity4 fill:#7f1d1d,stroke:#ef4444
    style Severity5 fill:#7f1d1d,stroke:#ef4444
    style Severity2 fill:#78350f,stroke:#f59e0b
```

Part 2: Rollback Decision Tree

When to Rollback vs Forward Fix

```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Production issue detected]) --> Assess[Assess situation:<br/>- User impact<br/>- Severity<br/>- Time deployed<br/>- Data changes]
    Assess --> Q1{Can issue be<br/>fixed quickly?<br/>5 min}
    Q1 -->|Yes - Simple config| QuickFix[Forward Fix:<br/>- Update config map<br/>- Restart pods<br/>- No rollback needed]
    Q1 -->|No| Q2{Is issue caused<br/>by latest<br/>deployment?}
    Q2 -->|No - External issue| External[External Root Cause:<br/>- Third-party API down<br/>- Database issue<br/>- Infrastructure problem<br/>→ Fix underlying issue]
    Q2 -->|Yes| Q3{User impact<br/>severity?}
    Q3 -->|Low - Minor bugs| Q4{Time since<br/>deployment?}
    Q4 -->|< 30 min| RollbackLow[Consider Rollback:<br/>Low risk, easy rollback<br/>Users barely affected]
    Q4 -->|> 30 min| ForwardFix[Forward Fix:<br/>Deploy hotfix<br/>More data changes<br/>Rollback riskier]
    Q3 -->|Medium - Degraded| Q5{Data changes<br/>made?}
    Q5 -->|No DB changes| RollbackMed[Rollback:<br/>Safe to revert<br/>No data migration<br/>Quick recovery]
    Q5 -->|DB changes made| Q6{Can revert<br/>DB changes?}
    Q6 -->|Yes - Reversible| RollbackWithDB[Rollback + DB Revert:<br/>1. Revert application<br/>2. Run down migration<br/>Coordinate carefully]
    Q6 -->|No - Irreversible| ForwardOnly[Forward Fix ONLY:<br/>Cannot rollback<br/>Fix bug in new version<br/>Data can't be reverted]
    Q3 -->|High - Outage| Q7{Rollback<br/>time?}
    Q7 -->|< 5 min| ImmediateRollback[IMMEDIATE Rollback:<br/>User impact too high<br/>Rollback first<br/>Debug later]
    Q7 -->|> 5 min| Q8{Forward fix<br/>faster?}
    Q8 -->|Yes| HotfixDeploy[Deploy Hotfix:<br/>If fix is obvious<br/>and can deploy<br/>faster than rollback]
    Q8 -->|No| ImmediateRollback
    QuickFix --> Monitor[Monitor metrics<br/>Verify fix worked]
    RollbackLow --> ExecuteRollback[Execute Rollback]
    RollbackMed --> ExecuteRollback
    RollbackWithDB --> ExecuteRollback
    ImmediateRollback --> ExecuteRollback
    ForwardFix --> DeployFix[Deploy Forward Fix]
    HotfixDeploy --> DeployFix
    ForwardOnly --> DeployFix
    style ImmediateRollback fill:#7f1d1d,stroke:#ef4444
    style RollbackWithDB fill:#78350f,stroke:#f59e0b
    style ForwardOnly fill:#78350f,stroke:#f59e0b
    style QuickFix fill:#064e3b,stroke:#10b981
```

Part 3: Rollback Execution Strategies

Application Rollback Methods

```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Decision: Rollback]) --> Method{Deployment<br/>strategy<br/>used?}
    Method --> K8sRolling[Kubernetes Rolling Update]
    Method --> BlueGreen[Blue-Green Deployment]
    Method --> Canary[Canary Deployment]
    subgraph RollingRollback[Kubernetes Rolling Rollback]
        K8sRolling --> K8s1[kubectl rollout undo<br/>deployment myapp]
        K8s1 --> K8s2[Kubernetes:<br/>- Find previous ReplicaSet<br/>- Rolling update to old version<br/>- maxSurge: 1, maxUnavailable: 1]
        K8s2 --> K8s3[Gradual Pod Replacement:<br/>1. Create 1 old version pod<br/>2. Wait for ready<br/>3. Terminate 1 new version pod<br/>4. Repeat until all replaced]
        K8s3 --> K8s4[Time to rollback: 2-5 min<br/>Downtime: None<br/>Some users see old, some new]
    end
    subgraph BGRollback[Blue-Green Rollback]
        BlueGreen --> BG1[Current state:<br/>Blue v1.0 IDLE<br/>Green v2.0 ACTIVE 100%]
        BG1 --> BG2[Update Service selector:<br/>version: green → version: blue]
        BG2 --> BG3[Instant Traffic Switch:<br/>Blue v1.0 ACTIVE 100%<br/>Green v2.0 IDLE 0%]
        BG3 --> BG4[Time to rollback: 1-2 sec<br/>Downtime: ~1 sec<br/>All users switched instantly]
    end
    subgraph CanaryRollback[Canary Rollback]
        Canary --> C1[Current state:<br/>v1.0: 0 replicas<br/>v2.0: 10 replicas 100%]
        C1 --> C2[Scale down v2.0:<br/>v2.0: 10 → 0 replicas]
        C2 --> C3[Scale up v1.0:<br/>v1.0: 0 → 10 replicas]
        C3 --> C4[Time to rollback: 1-3 min<br/>Downtime: Minimal<br/>Gradual traffic shift]
    end
    K8s4 --> Verify[Verification Steps]
    BG4 --> Verify
    C4 --> Verify
    Verify --> V1[1. Check pod status<br/>kubectl get pods<br/>All running?]
    V1 --> V2[2. Run health checks<br/>curl /health<br/>All healthy?]
    V2 --> V3[3. Monitor metrics<br/>Error rate back to normal?<br/>Latency improved?]
    V3 --> V4[4. Check user reports<br/>Are users reporting success?]
    V4 --> Success{Rollback<br/>successful?}
    Success -->|Yes| Complete[✅ Rollback Complete<br/>Service restored<br/>Monitor closely]
    Success -->|No| StillBroken[🚨 Still Broken!<br/>Issue not deployment-related<br/>Deeper investigation needed]
    style K8s4 fill:#1e3a8a,stroke:#3b82f6
    style BG4 fill:#064e3b,stroke:#10b981
    style C4 fill:#1e3a8a,stroke:#3b82f6
    style Complete fill:#064e3b,stroke:#10b981
    style StillBroken fill:#7f1d1d,stroke:#ef4444
```

Part 4: Database Rollback Complexity

Handling Database Migrations

```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Need to rollback<br/>with DB changes]) --> Analyze[Analyze migration type]
    Analyze --> Type{Migration<br/>type?}
    Type --> AddColumn[Added Column<br/>ALTER TABLE users<br/>ADD COLUMN email]
    Type --> DropColumn[Dropped Column<br/>ALTER TABLE users<br/>DROP COLUMN phone]
    Type --> ModifyColumn[Modified Column<br/>ALTER TABLE users<br/>ALTER COLUMN age TYPE bigint]
    Type --> AddTable[Added Table<br/>CREATE TABLE orders]
    AddColumn --> AC1{Column has<br/>data?}
    AC1 -->|No data yet| AC2[Safe Rollback:<br/>1. Deploy old app version<br/>2. DROP COLUMN email<br/>Old app doesn't use it]
    AC1 -->|Has data| AC3[⚠️ Data Loss Risk:<br/>1. Backup table first<br/>2. Consider keeping column<br/>3. Deploy old app version<br/>Column ignored by old app]
    DropColumn --> DC1[🚨 CANNOT Rollback:<br/>Data already lost<br/>Forward fix ONLY<br/>Options:<br/>1. Restore from backup<br/>2. Accept data loss<br/>3. Recreate from logs]
    ModifyColumn --> MC1{Data<br/>compatible?}
    MC1 -->|Yes - reversible| MC2[Revert Column Type:<br/>ALTER COLUMN age TYPE int<br/>Verify no data truncation<br/>Then deploy old app]
    MC1 -->|No - data loss| MC3[🚨 Cannot Revert:<br/>bigint values exceed int range<br/>Forward fix ONLY]
    AddTable --> AT1{Table has<br/>critical data?}
    AT1 -->|No data| AT2[Safe Rollback:<br/>1. Deploy old app version<br/>2. DROP TABLE orders<br/>No data lost]
    AT1 -->|Has data| AT3[Risky Rollback:<br/>1. BACKUP TABLE orders<br/>2. DROP TABLE orders<br/>3. Deploy old app version<br/>Data preserved in backup]
    AC2 --> SafeProcess[Safe Rollback Process:<br/>✅ No data loss<br/>✅ Quick rollback<br/>✅ Reversible]
    AC3 --> RiskyProcess[Risky Rollback Process:<br/>⚠️ Potential data loss<br/>⚠️ Need backup<br/>⚠️ Manual intervention]
    DC1 --> NoRollback[Forward Fix Only:<br/>❌ Cannot rollback<br/>❌ Data already lost<br/>❌ Must fix forward]
    MC2 --> SafeProcess
    MC3 --> NoRollback
    AT2 --> SafeProcess
    AT3 --> RiskyProcess
    SafeProcess --> Execute1[Execute Safe Rollback]
    RiskyProcess --> Decision{Acceptable<br/>risk?}
    Decision -->|Yes| Execute2[Execute with Caution]
    Decision -->|No| NoRollback
    NoRollback --> HotfixDeploy[Deploy Hotfix:<br/>New version with fix<br/>Keep new schema]
    style SafeProcess fill:#064e3b,stroke:#10b981
    style RiskyProcess fill:#78350f,stroke:#f59e0b
    style NoRollback fill:#7f1d1d,stroke:#ef4444
    style DC1 fill:#7f1d1d,stroke:#ef4444
    style MC3 fill:#7f1d1d,stroke:#ef4444
```

Part 5: Complete Rollback Workflow

From Detection to Recovery

```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
sequenceDiagram
    participant Monitor as Monitoring
    participant Alert as Alerting
    participant Engineer as On-Call Engineer
    participant Incident as Incident Channel
    participant K8s as Kubernetes
    participant DB as Database
    participant Users as End Users
    Note over Monitor: 5 minutes after deployment
    Monitor->>Monitor: Detect anomaly:<br/>Error rate: 0.1% → 18%<br/>Latency p95: 150ms → 3000ms
    Monitor->>Alert: Trigger alert:<br/>HighErrorRate FIRING
    Alert->>Engineer: 🚨 PagerDuty call<br/>Critical alert<br/>Production incident
    Engineer->>Alert: Acknowledge alert<br/>Stop escalation
    Engineer->>Incident: Create #incident-456<br/>"High error rate after v2.5 deployment"
    Note over Engineer: Open laptop<br/>Start investigation
    Engineer->>Monitor: Check Grafana dashboard<br/>When did issue start?<br/>Which endpoints affected?
    Monitor-->>Engineer: Started 5 min ago<br/>Right after deployment<br/>All endpoints affected
    Engineer->>K8s: kubectl get pods<br/>Check pod status
    K8s-->>Engineer: All pods Running<br/>No crashes<br/>Health checks passing
    Engineer->>K8s: kubectl logs deployment/myapp<br/>Check application logs
    K8s-->>Engineer: ERROR: Cannot connect to cache<br/>ERROR: Redis timeout<br/>ERROR: Connection refused
    Note over Engineer: Root cause: New version<br/>has Redis connection bug
    Engineer->>Incident: Update: Redis connection issue in v2.5<br/>Decision: Rollback to v2.4
    Note over Engineer: Check deployment history
    Engineer->>K8s: kubectl rollout history deployment/myapp
    K8s-->>Engineer: REVISION 10: v2.5 (current)<br/>REVISION 9: v2.4 (previous)
    Engineer->>Incident: Starting rollback to v2.4<br/>ETA: 3 minutes
    Engineer->>K8s: kubectl rollout undo deployment/myapp
    K8s->>K8s: Start rollback:<br/>- Create pods with v2.4<br/>- Wait for ready<br/>- Terminate v2.5 pods
    loop Rolling Update
        K8s->>Users: Some users on v2.4 ✓<br/>Some users on v2.5 ✗
        Note over K8s: Pod 1: v2.4 Ready<br/>Terminating v2.5 Pod 1
        Engineer->>K8s: kubectl rollout status<br/>deployment/myapp --watch
        K8s-->>Engineer: Waiting for rollout:<br/>2/5 pods updated
    end
    K8s->>Users: All users now on v2.4 ✓
    K8s-->>Engineer: Rollout complete:<br/>deployment "myapp" successfully rolled out
    Engineer->>Monitor: Check metrics
    Note over Monitor: Wait 2 minutes<br/>for metrics to stabilize
    Monitor-->>Engineer: ✅ Error rate: 0.1%<br/>✅ Latency p95: 160ms<br/>✅ All metrics normal
    Note over Alert: Metrics normalized
    Alert->>Engineer: ✅ Alert resolved:<br/>HighErrorRate
    Engineer->>Users: Verify user experience
    Users-->>Engineer: No error reports<br/>Application working
    Engineer->>Incident: ✅ Incident resolved<br/>Service restored to v2.4<br/>Duration: 12 minutes<br/>Root cause: Redis bug in v2.5
    Engineer->>Incident: Next steps:<br/>1. Fix Redis bug<br/>2. Add integration test<br/>3. Post-mortem scheduled
    Note over Engineer: Create follow-up tasks
    Engineer->>Engineer: Create Jira tickets:<br/>- BUG-789: Fix Redis connection<br/>- TEST-123: Add cache integration test<br/>- DOC-456: Update deployment checklist
    Note over Engineer,Users: Service restored ✓<br/>Monitoring continues
```

Part 6: Automated Rollback

Auto-Rollback Decision Flow

```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Deployment completed]) --> Monitor[Continuous Monitoring<br/>Every 30 seconds]
    Monitor --> Collect[Collect Metrics:<br/>- Error rate<br/>- Latency p95/p99<br/>- Success rate<br/>- Pod health<br/>- Resource usage]
    Collect --> Check1{Error rate<br/>> 5%?}
    Check1 -->|Yes| Trigger1[🚨 Trigger auto-rollback<br/>Error threshold exceeded]
    Check1 -->|No| Check2{Latency p95<br/>> 2x baseline?}
    Check2 -->|Yes| Trigger2[🚨 Trigger auto-rollback<br/>Latency degradation]
    Check2 -->|No| Check3{Pod crash<br/>rate > 50%?}
    Check3 -->|Yes| Trigger3[🚨 Trigger auto-rollback<br/>Pods failing]
    Check3 -->|No| Check4{Custom metric<br/>threshold?}
    Check4 -->|Yes| Trigger4[🚨 Trigger auto-rollback<br/>Business metric failed]
    Check4 -->|No| Healthy[✅ All checks passed<br/>Continue monitoring]
    Healthy --> TimeCheck{Monitoring<br/>duration?}
    TimeCheck -->|< 15 min| Monitor
    TimeCheck -->|>= 15 min| Stable[✅ Deployment STABLE<br/>Passed soak period<br/>Auto-rollback disabled]
    Trigger1 --> Rollback[Execute Auto-Rollback]
    Trigger2 --> Rollback
    Trigger3 --> Rollback
    Trigger4 --> Rollback
    Rollback --> R1[1. Log rollback decision<br/>Metrics that triggered<br/>Timestamp]
    R1 --> R2[2. Alert team:<br/>PagerDuty critical<br/>Slack notification<br/>"Auto-rollback initiated"]
    R2 --> R3[3. Execute rollback:<br/>kubectl rollout undo<br/>deployment/myapp]
    R3 --> R4[4. Wait for rollback:<br/>Monitor pod status<br/>Wait for all pods ready]
    R4 --> R5[5. Verify recovery:<br/>Check metrics again<br/>Error rate normal?<br/>Latency normal?]
    R5 --> Verify{Recovery<br/>successful?}
    Verify -->|Yes| Success[✅ Auto-Rollback Success<br/>Service restored<br/>Notify team<br/>Create incident report]
    Verify -->|No| StillFailing[🚨 Still Failing!<br/>Issue not deployment<br/>Page on-call immediately<br/>Manual intervention needed]
    style Healthy fill:#064e3b,stroke:#10b981
    style Stable fill:#064e3b,stroke:#10b981
    style Success fill:#064e3b,stroke:#10b981
    style Trigger1 fill:#7f1d1d,stroke:#ef4444
    style Trigger2 fill:#7f1d1d,stroke:#ef4444
    style Trigger3 fill:#7f1d1d,stroke:#ef4444
    style Trigger4 fill:#7f1d1d,stroke:#ef4444
    style StillFailing fill:#7f1d1d,stroke:#ef4444
```

Auto-Rollback Configuration

```yaml
# Flagger auto-rollback configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
  # Canary analysis
  analysis:
    interval: 30s
    threshold: 5  # Rollback after 5 failed checks
    maxWeight: 50
    stepWeight: 10
    # Metrics for auto-rollback decision
    metrics:
      # HTTP error rate
      - name: request-success-rate
        thresholdRange:
          min: 95  # Rollback if success rate < 95%
        interval: 1m
      # HTTP latency
      - name: request-duration
        thresholdRange:
          max: 500  # Rollback if latency > 500ms (Flagger's built-in metric measures P99)
        interval: 1m
      # Custom business metric
      - name: conversion-rate
        thresholdRange:
          min: 80  # Rollback if conversion < 80% of baseline
        interval: 2m
    # Webhooks for additional checks
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          type: bash
          cmd: "hey -z 1m -q 10 http://myapp-canary:8080/"
    # Alerting on rollback
    alerts:
      - name: slack
        severity: error
        providerRef:
          name: slack
          namespace: flagger
```

Part 7: Post-Incident Process

Learning from Rollbacks

```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Rollback completed<br/>Service restored]) --> Timeline[Create Incident Timeline:<br/>- Deployment time<br/>- Issue detection time<br/>- Rollback decision time<br/>- Recovery time<br/>Total duration]
    Timeline --> PostMortem[Schedule Post-Mortem:<br/>Within 48 hours<br/>All stakeholders invited<br/>Blameless culture]
    PostMortem --> Analyze[Root Cause Analysis:<br/>Why did issue occur?<br/>Why wasn't it caught?<br/>What can we learn?]
    Analyze --> Categories{Issue<br/>category?}
    Categories --> Testing[Insufficient Testing:<br/>- Missing test case<br/>- Integration gap<br/>- Load testing needed]
    Categories --> Monitoring[Monitoring Gap:<br/>- Missing alert<br/>- Wrong threshold<br/>- Blind spot found]
    Categories --> Process[Process Issue:<br/>- Skipped step<br/>- Wrong timing<br/>- Communication gap]
    Categories --> Code[Code Quality:<br/>- Bug in code<br/>- Edge case<br/>- Dependency issue]
    Testing --> Actions1[Action Items:<br/>□ Add integration test<br/>□ Expand E2E coverage<br/>□ Add load test<br/>□ Test in staging first]
    Monitoring --> Actions2[Action Items:<br/>□ Add new alert<br/>□ Adjust thresholds<br/>□ Add dashboard<br/>□ Improve visibility]
    Process --> Actions3[Action Items:<br/>□ Update runbook<br/>□ Add checklist item<br/>□ Change deployment time<br/>□ Improve communication]
    Code --> Actions4[Action Items:<br/>□ Fix bug<br/>□ Add validation<br/>□ Update dependency<br/>□ Code review process]
    Actions1 --> Assign[Assign Owners:<br/>Each action has owner<br/>Each action has deadline<br/>Track in project board]
    Actions2 --> Assign
    Actions3 --> Assign
    Actions4 --> Assign
    Assign --> Document[Document Learnings:<br/>- Update wiki<br/>- Share with team<br/>- Add to knowledge base<br/>- Update training]
    Document --> Prevent[Prevent Recurrence:<br/>✓ Tests added<br/>✓ Monitoring improved<br/>✓ Process updated<br/>✓ Team educated]
    Prevent --> Complete[✅ Post-Incident Complete<br/>Stronger system<br/>Better prepared<br/>Continuous improvement]
    style Complete fill:#064e3b,stroke:#10b981
```

Part 8: Rollback Checklist

Pre-Deployment Rollback Readiness

Before Every Deployment: ...
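
As a rough illustration of the auto-rollback decision flow in Part 6, one polling iteration could be sketched in Python as below. This is a minimal sketch, not how Flagger or any particular tool implements it: the `Metrics` class, its field names, and the `rollback_reason` and `next_action` helpers are all hypothetical, and only the thresholds (error rate above 5%, p95 latency above 2x baseline, pod crash rate above 50%, 15-minute soak period) come from the flow itself.

```python
from dataclasses import dataclass
from typing import Optional

# Soak period from the Part 6 flow: after this, the deployment counts as stable.
SOAK_PERIOD_MIN = 15


@dataclass
class Metrics:
    """One metrics sample taken after a deployment (hypothetical field names)."""
    error_rate_pct: float           # 0.1 means 0.1%
    latency_p95_ms: float
    pod_crash_rate_pct: float
    business_metric_ok: bool = True  # stands in for the "custom metric" check


def rollback_reason(sample: Metrics, baseline_p95_ms: float) -> Optional[str]:
    """Run the checks in the same order as the flow; return the first failure."""
    if sample.error_rate_pct > 5:
        return "Error threshold exceeded"
    if sample.latency_p95_ms > 2 * baseline_p95_ms:
        return "Latency degradation"
    if sample.pod_crash_rate_pct > 50:
        return "Pods failing"
    if not sample.business_metric_ok:
        return "Business metric failed"
    return None


def next_action(sample: Metrics, baseline_p95_ms: float,
                minutes_since_deploy: float) -> str:
    """Decide what a single monitoring iteration should do."""
    reason = rollback_reason(sample, baseline_p95_ms)
    if reason is not None:
        return f"rollback: {reason}"
    if minutes_since_deploy >= SOAK_PERIOD_MIN:
        # Caller is expected to stop polling once this is returned.
        return "stable"
    return "continue monitoring"


if __name__ == "__main__":
    # Numbers taken from the Part 5 incident: 18% errors, p95 of 3000ms.
    incident = Metrics(error_rate_pct=18.0, latency_p95_ms=3000.0,
                       pod_crash_rate_pct=0.0)
    print(next_action(incident, baseline_p95_ms=150.0, minutes_since_deploy=5))
```

A real controller would call something like `next_action` every 30 seconds and, on a rollback result, log the triggering metrics and alert the team before running `kubectl rollout undo`, as the flow describes. Returning the first failing check keeps the logged reason unambiguous when several thresholds trip at once.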

    January 23, 2025 · 11 min · Rafiul Alam