Introduction

Even with the best testing, production issues happen. Having a solid rollback and recovery strategy is critical for minimizing downtime and data loss when deployments go wrong.

This guide visualizes the complete rollback process:

  • Issue Detection: Monitoring alerts and health checks
  • Rollback Decision: When to roll back vs. fix forward
  • Rollback Execution: Different rollback strategies
  • Data Recovery: Handling database changes
  • Post-Incident: Learning and prevention

Part 1: Issue Detection Flow

From Healthy to Incident

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Production deployment
completed]) --> Monitor[Monitoring Systems
- Prometheus metrics
- Application logs
- User reports
- Health checks] Monitor --> Baseline[Baseline Metrics:
✓ Error rate: 0.1%
✓ Latency p95: 150ms
✓ Traffic: 10k req/min
✓ CPU: 40%
✓ Memory: 60%] Baseline --> Time[Time passes...
Minutes after deployment] Time --> Detect{Issue
detected?} Detect -->|No issue| Healthy[✅ Deployment Healthy
Continue monitoring
All metrics normal] Detect -->|Yes| IssueType{Issue
type?} IssueType --> ErrorSpike[🔴 Error Rate Spike
0.1% → 15%
Alert: HighErrorRate firing] IssueType --> LatencySpike[🟡 Latency Increase
p95: 150ms → 5000ms
Alert: HighLatency firing] IssueType --> TrafficDrop[🟠 Traffic Drop
10k → 1k req/min
Users can't access] IssueType --> ResourceIssue[🔴 Resource Exhaustion
CPU: 40% → 100%
OOMKilled events] IssueType --> DataCorruption[🔴 Data Issues
Database errors
Invalid data returned] ErrorSpike --> Severity1[Severity: CRITICAL
User impact: HIGH
Affecting all users] LatencySpike --> Severity2[Severity: WARNING
User impact: MEDIUM
Slow but functional] TrafficDrop --> Severity3[Severity: CRITICAL
User impact: HIGH
Complete outage] ResourceIssue --> Severity4[Severity: CRITICAL
User impact: HIGH
Pods crashing] DataCorruption --> Severity5[Severity: CRITICAL
User impact: CRITICAL
Data integrity at risk] Severity1 --> AutoAlert[🚨 Automated Alerts:
- PagerDuty page
- Slack notification
- Email alerts
- Status page update] Severity2 --> AutoAlert Severity3 --> AutoAlert Severity4 --> AutoAlert Severity5 --> AutoAlert AutoAlert --> OnCall[On-Call Engineer
Receives alert
Acknowledges incident] OnCall --> Investigate[Quick Investigation:
- Check deployment timeline
- Review recent changes
- Check logs
- Verify metrics] Investigate --> RootCause{Root cause
identified?} RootCause -->|Yes - Recent deployment| Decision[Go to Rollback Decision] RootCause -->|Yes - Other cause| OtherFix[Different remediation
Not deployment-related] RootCause -->|No - Time critical| Decision style Healthy fill:#064e3b,stroke:#10b981 style Severity1 fill:#7f1d1d,stroke:#ef4444 style Severity3 fill:#7f1d1d,stroke:#ef4444 style Severity4 fill:#7f1d1d,stroke:#ef4444 style Severity5 fill:#7f1d1d,stroke:#ef4444 style Severity2 fill:#78350f,stroke:#f59e0b
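The detection flow above boils down to comparing current metrics against a baseline and classifying the anomaly. A minimal Python sketch, with illustrative thresholds (the function name and multipliers are assumptions for this example, not taken from any real alerting stack):

```python
# Classify a metric anomaly into (issue_type, severity), mirroring the
# detection flow above. Thresholds are illustrative only.

def classify_issue(baseline: dict, current: dict) -> tuple:
    """Return (issue_type, severity), or ("none", "ok") if metrics look normal."""
    if current["error_rate"] > 10 * baseline["error_rate"]:
        return ("error_spike", "CRITICAL")
    if current["latency_p95_ms"] > 4 * baseline["latency_p95_ms"]:
        return ("latency_increase", "WARNING")
    if current["traffic_rpm"] < 0.5 * baseline["traffic_rpm"]:
        return ("traffic_drop", "CRITICAL")
    if current["cpu_pct"] >= 95:
        return ("resource_exhaustion", "CRITICAL")
    return ("none", "ok")

baseline = {"error_rate": 0.001, "latency_p95_ms": 150,
            "traffic_rpm": 10_000, "cpu_pct": 40}
incident = {"error_rate": 0.15, "latency_p95_ms": 180,
            "traffic_rpm": 9_500, "cpu_pct": 45}
print(classify_issue(baseline, incident))  # error rate spike: 0.1% -> 15%
```

In practice these checks live in Prometheus alerting rules rather than application code, but encoding them once as a function makes the severity mapping testable.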

Part 2: Rollback Decision Tree

When to Roll Back vs. Fix Forward

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Production issue detected]) --> Assess[Assess situation:
- User impact
- Severity
- Time deployed
- Data changes] Assess --> Q1{Can issue be
fixed quickly?
< 5 min} Q1 -->|Yes - Simple config| QuickFix[Forward Fix:
- Update config map
- Restart pods
- No rollback needed] Q1 -->|No| Q2{Is issue caused
by latest
deployment?} Q2 -->|No - External issue| External[External Root Cause:
- Third-party API down
- Database issue
- Infrastructure problem
→ Fix underlying issue] Q2 -->|Yes| Q3{User impact
severity?} Q3 -->|Low - Minor bugs| Q4{Time since
deployment?} Q4 -->|< 30 min| RollbackLow[Consider Rollback:
Low risk, easy rollback
Users barely affected] Q4 -->|> 30 min| ForwardFix[Forward Fix:
Deploy hotfix
More data changes
Rollback riskier] Q3 -->|Medium - Degraded| Q5{Data changes
made?} Q5 -->|No DB changes| RollbackMed[Rollback:
Safe to revert
No data migration
Quick recovery] Q5 -->|DB changes made| Q6{Can revert
DB changes?} Q6 -->|Yes - Reversible| RollbackWithDB[Rollback + DB Revert:
1. Revert application
2. Run down migration
Coordinate carefully] Q6 -->|No - Irreversible| ForwardOnly[Forward Fix ONLY:
Cannot rollback
Fix bug in new version
Data can't be reverted] Q3 -->|High - Outage| Q7{Rollback
time?} Q7 -->|< 5 min| ImmediateRollback[IMMEDIATE Rollback:
User impact too high
Rollback first
Debug later] Q7 -->|> 5 min| Q8{Forward fix
faster?} Q8 -->|Yes| HotfixDeploy[Deploy Hotfix:
If fix is obvious
and can deploy
faster than rollback] Q8 -->|No| ImmediateRollback QuickFix --> Monitor[Monitor metrics
Verify fix worked] RollbackLow --> ExecuteRollback[Execute Rollback] RollbackMed --> ExecuteRollback RollbackWithDB --> ExecuteRollback ImmediateRollback --> ExecuteRollback ForwardFix --> DeployFix[Deploy Forward Fix] HotfixDeploy --> DeployFix ForwardOnly --> DeployFix style ImmediateRollback fill:#7f1d1d,stroke:#ef4444 style RollbackWithDB fill:#78350f,stroke:#f59e0b style ForwardOnly fill:#78350f,stroke:#f59e0b style QuickFix fill:#064e3b,stroke:#10b981
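The decision tree above can be collapsed into a single function that a runbook (or an automation hook) can call. This is a hedged sketch: the inputs, branch names, and the 5-minute cutoff are illustrative simplifications of the full tree, not a complete encoding of it:

```python
# Simplified sketch of the rollback-vs-forward-fix decision tree above.

def rollback_decision(caused_by_deploy: bool, impact: str,
                      db_changes: bool, db_reversible: bool,
                      rollback_minutes: float) -> str:
    if not caused_by_deploy:
        return "fix-underlying-issue"      # external root cause
    if impact == "high":
        # Outage: rollback first, debug later -- unless rollback is slow
        return "immediate-rollback" if rollback_minutes < 5 else "fastest-path"
    if impact == "medium":
        if not db_changes:
            return "rollback"              # safe, no data migration
        return "rollback-with-db-revert" if db_reversible else "forward-fix-only"
    return "consider-rollback"             # low impact: weigh time since deploy

print(rollback_decision(True, "medium", True, False, 3))  # forward-fix-only
```

The key property to preserve from the tree is that irreversible database changes always force a forward fix, regardless of how easy the application rollback is.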

Part 3: Rollback Execution Strategies

Application Rollback Methods

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Decision: Rollback]) --> Method{Deployment
strategy
used?} Method --> K8sRolling[Kubernetes Rolling Update] Method --> BlueGreen[Blue-Green Deployment] Method --> Canary[Canary Deployment] subgraph RollingRollback[Kubernetes Rolling Rollback] K8sRolling --> K8s1[kubectl rollout undo
deployment myapp] K8s1 --> K8s2[Kubernetes:
- Find previous ReplicaSet
- Rolling update to old version
- maxSurge: 1, maxUnavailable: 1] K8s2 --> K8s3[Gradual Pod Replacement:
1. Create 1 old version pod
2. Wait for ready
3. Terminate 1 new version pod
4. Repeat until all replaced] K8s3 --> K8s4[Time to rollback: 2-5 min
Downtime: None
Some users see old, some new] end subgraph BGRollback[Blue-Green Rollback] BlueGreen --> BG1[Current state:
Blue v1.0 IDLE
Green v2.0 ACTIVE 100%] BG1 --> BG2[Update Service selector:
version: green → version: blue] BG2 --> BG3[Instant Traffic Switch:
Blue v1.0 ACTIVE 100%
Green v2.0 IDLE 0%] BG3 --> BG4[Time to rollback: 1-2 sec
Downtime: ~1 sec
All users switched instantly] end subgraph CanaryRollback[Canary Rollback] Canary --> C1[Current state:
v1.0: 0 replicas
v2.0: 10 replicas 100%] C1 --> C2[Scale down v2.0:
v2.0: 10 → 0 replicas] C2 --> C3[Scale up v1.0:
v1.0: 0 → 10 replicas] C3 --> C4[Time to rollback: 1-3 min
Downtime: Minimal
Gradual traffic shift] end K8s4 --> Verify[Verification Steps] BG4 --> Verify C4 --> Verify Verify --> V1[1. Check pod status
kubectl get pods
All running?] V1 --> V2[2. Run health checks
curl /health
All healthy?] V2 --> V3[3. Monitor metrics
Error rate back to normal?
Latency improved?] V3 --> V4[4. Check user reports
Are users reporting success?] V4 --> Success{Rollback
successful?} Success -->|Yes| Complete[✅ Rollback Complete
Service restored
Monitor closely] Success -->|No| StillBroken[🚨 Still Broken!
Issue not deployment-related
Deeper investigation needed] style K8s4 fill:#1e3a8a,stroke:#3b82f6 style BG4 fill:#064e3b,stroke:#10b981 style C4 fill:#1e3a8a,stroke:#3b82f6 style Complete fill:#064e3b,stroke:#10b981 style StillBroken fill:#7f1d1d,stroke:#ef4444
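The verification steps after any of the three rollback methods amount to polling health signals until they pass or a deadline expires. A small sketch with the check injected as a callable, so it could be wired to `curl /health`, `kubectl get pods`, or a metrics query (the function name and retry counts are assumptions):

```python
# Poll an injected health check until it passes or attempts run out,
# mirroring the post-rollback verification steps above.
import time

def verify_rollback(check, attempts: int = 5, delay_s: float = 0.0) -> bool:
    for _ in range(attempts):
        if check():
            return True          # rollback verified healthy
        time.sleep(delay_s)
    return False                 # still broken: not deployment-related?

# Simulated health check that succeeds on the third poll:
polls = iter([False, False, True])
print(verify_rollback(lambda: next(polls)))  # True
```

Returning False here corresponds to the "Still Broken!" branch: the rollback completed but the issue persists, which points to a root cause outside the deployment.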

Part 4: Database Rollback Complexity

Handling Database Migrations

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Need to rollback
with DB changes]) --> Analyze[Analyze migration type] Analyze --> Type{Migration
type?} Type --> AddColumn[Added Column
ALTER TABLE users
ADD COLUMN email] Type --> DropColumn[Dropped Column
ALTER TABLE users
DROP COLUMN phone] Type --> ModifyColumn[Modified Column
ALTER TABLE users
ALTER COLUMN age TYPE bigint] Type --> AddTable[Added Table
CREATE TABLE orders] AddColumn --> AC1{Column has
data?} AC1 -->|No data yet| AC2[Safe Rollback:
1. Deploy old app version
2. DROP COLUMN email
Old app doesn't use it] AC1 -->|Has data| AC3[⚠️ Data Loss Risk:
1. Backup table first
2. Consider keeping column
3. Deploy old app version
Column ignored by old app] DropColumn --> DC1[🚨 CANNOT Rollback:
Data already lost
Forward fix ONLY
Options:
1. Restore from backup
2. Accept data loss
3. Recreate from logs] ModifyColumn --> MC1{Data
compatible?} MC1 -->|Yes - reversible| MC2[Revert Column Type:
ALTER COLUMN age TYPE int
Verify no data truncation
Then deploy old app] MC1 -->|No - data loss| MC3[🚨 Cannot Revert:
bigint values exceed int range
Forward fix ONLY] AddTable --> AT1{Table has
critical data?} AT1 -->|No data| AT2[Safe Rollback:
1. Deploy old app version
2. DROP TABLE orders
No data lost] AT1 -->|Has data| AT3[Risky Rollback:
1. BACKUP TABLE orders
2. DROP TABLE orders
3. Deploy old app version
Data preserved in backup] AC2 --> SafeProcess[Safe Rollback Process:
✅ No data loss
✅ Quick rollback
✅ Reversible] AC3 --> RiskyProcess[Risky Rollback Process:
⚠️ Potential data loss
⚠️ Need backup
⚠️ Manual intervention] DC1 --> NoRollback[Forward Fix Only:
❌ Cannot rollback
❌ Data already lost
❌ Must fix forward] MC2 --> SafeProcess MC3 --> NoRollback AT2 --> SafeProcess AT3 --> RiskyProcess SafeProcess --> Execute1[Execute Safe Rollback] RiskyProcess --> Decision{Acceptable
risk?} Decision -->|Yes| Execute2[Execute with Caution] Decision -->|No| NoRollback NoRollback --> HotfixDeploy[Deploy Hotfix:
New version with fix
Keep new schema] style SafeProcess fill:#064e3b,stroke:#10b981 style RiskyProcess fill:#78350f,stroke:#f59e0b style NoRollback fill:#7f1d1d,stroke:#ef4444 style DC1 fill:#7f1d1d,stroke:#ef4444 style MC3 fill:#7f1d1d,stroke:#ef4444
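The migration analysis above is essentially a lookup from (migration type, data present) to a rollback verdict. A hedged sketch of that mapping; the verdict strings are illustrative and not tied to any migration framework:

```python
# Map the migration types above to a rollback verdict. Illustrative only;
# a real tool would inspect the actual schema and data.

def migration_rollback_plan(kind: str, has_data: bool) -> str:
    if kind == "add_column":
        return "risky: backup first" if has_data else "safe: drop column"
    if kind == "drop_column":
        return "forward-fix-only: data already lost"
    if kind == "modify_column":
        # Reversible only if current values still fit the old type
        return "risky: verify no truncation" if has_data else "safe: revert type"
    if kind == "add_table":
        return "risky: backup table, then drop" if has_data else "safe: drop table"
    return "unknown: manual review"

print(migration_rollback_plan("drop_column", True))
```

Note the asymmetry the diagram highlights: additive changes (new column, new table) are usually reversible, while destructive changes (dropped column, lossy type change) are not, which is why expand-then-contract migrations are the safer default.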

Part 5: Complete Rollback Workflow

From Detection to Recovery

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant Monitor as Monitoring participant Alert as Alerting participant Engineer as On-Call Engineer participant Incident as Incident Channel participant K8s as Kubernetes participant DB as Database participant Users as End Users Note over Monitor: 5 minutes after deployment Monitor->>Monitor: Detect anomaly:
Error rate: 0.1% → 18%
Latency p95: 150ms → 3000ms Monitor->>Alert: Trigger alert:
HighErrorRate FIRING Alert->>Engineer: 🚨 PagerDuty call
Critical alert
Production incident Engineer->>Alert: Acknowledge alert
Stop escalation Engineer->>Incident: Create #incident-456
"High error rate after v2.5 deployment" Note over Engineer: Open laptop
Start investigation Engineer->>Monitor: Check Grafana dashboard
When did issue start?
Which endpoints affected? Monitor-->>Engineer: Started 5 min ago
Right after deployment
All endpoints affected Engineer->>K8s: kubectl get pods
Check pod status K8s-->>Engineer: All pods Running
No crashes
Health checks passing Engineer->>K8s: kubectl logs deployment/myapp
Check application logs K8s-->>Engineer: ERROR: Cannot connect to cache
ERROR: Redis timeout
ERROR: Connection refused Note over Engineer: Root cause: New version
has Redis connection bug Engineer->>Incident: Update: Redis connection issue in v2.5
Decision: Rollback to v2.4 Note over Engineer: Check deployment history Engineer->>K8s: kubectl rollout history deployment/myapp K8s-->>Engineer: REVISION 10: v2.5 (current)
REVISION 9: v2.4 (previous) Engineer->>Incident: Starting rollback to v2.4
ETA: 3 minutes Engineer->>K8s: kubectl rollout undo deployment/myapp K8s->>K8s: Start rollback:
- Create pods with v2.4
- Wait for ready
- Terminate v2.5 pods loop Rolling Update K8s->>Users: Some users on v2.4 ✓
Some users on v2.5 ✗ Note over K8s: Pod 1: v2.4 Ready
Terminating v2.5 Pod 1 Engineer->>K8s: kubectl rollout status
deployment/myapp --watch K8s-->>Engineer: Waiting for rollout:
2/5 pods updated end K8s->>Users: All users now on v2.4 ✓ K8s-->>Engineer: Rollout complete:
deployment "myapp" successfully rolled out Engineer->>Monitor: Check metrics Note over Monitor: Wait 2 minutes
for metrics to stabilize Monitor-->>Engineer: ✅ Error rate: 0.1%
✅ Latency p95: 160ms
✅ All metrics normal Note over Alert: Metrics normalized Alert->>Engineer: ✅ Alert resolved:
HighErrorRate Engineer->>Users: Verify user experience Users-->>Engineer: No error reports
Application working Engineer->>Incident: ✅ Incident resolved
Service restored to v2.4
Duration: 12 minutes
Root cause: Redis bug in v2.5 Engineer->>Incident: Next steps:
1. Fix Redis bug
2. Add integration test
3. Post-mortem scheduled Note over Engineer: Create follow-up tasks Engineer->>Engineer: Create Jira tickets:
- BUG-789: Fix Redis connection
- TEST-123: Add cache integration test
- DOC-456: Update deployment checklist Note over Engineer,Users: Service restored ✓
Monitoring continues
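The walkthrough above resolved in 12 minutes end to end. It is worth recording the key timestamps during an incident so that time-to-detect and time-to-recover can be computed for the post-mortem. A tiny sketch (the timestamps are illustrative):

```python
# Build an incident timeline and derive time-to-detect / time-to-recover,
# matching the 12-minute incident in the walkthrough above.
from datetime import datetime

events = {
    "deployed":  datetime(2024, 1, 10, 14, 0),
    "detected":  datetime(2024, 1, 10, 14, 5),   # alert fired
    "rollback":  datetime(2024, 1, 10, 14, 9),   # kubectl rollout undo
    "recovered": datetime(2024, 1, 10, 14, 12),  # metrics normal
}

ttd = (events["detected"] - events["deployed"]).total_seconds() / 60
ttr = (events["recovered"] - events["detected"]).total_seconds() / 60
print(f"time to detect: {ttd:.0f} min, time to recover: {ttr:.0f} min")
```

Tracking these two numbers across incidents shows whether monitoring (detection) or the rollback process (recovery) is the bottleneck.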

Part 6: Automated Rollback

Auto-Rollback Decision Flow

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Deployment completed]) --> Monitor[Continuous Monitoring
Every 30 seconds] Monitor --> Collect[Collect Metrics:
- Error rate
- Latency p95/p99
- Success rate
- Pod health
- Resource usage] Collect --> Check1{Error rate
> 5%?} Check1 -->|Yes| Trigger1[🚨 Trigger auto-rollback
Error threshold exceeded] Check1 -->|No| Check2{Latency p95
> 2x baseline?} Check2 -->|Yes| Trigger2[🚨 Trigger auto-rollback
Latency degradation] Check2 -->|No| Check3{Pod crash
rate > 50%?} Check3 -->|Yes| Trigger3[🚨 Trigger auto-rollback
Pods failing] Check3 -->|No| Check4{Custom metric
threshold?} Check4 -->|Yes| Trigger4[🚨 Trigger auto-rollback
Business metric failed] Check4 -->|No| Healthy[✅ All checks passed
Continue monitoring] Healthy --> TimeCheck{Monitoring
duration?} TimeCheck -->|< 15 min| Monitor TimeCheck -->|>= 15 min| Stable[✅ Deployment STABLE
Passed soak period
Auto-rollback disabled] Trigger1 --> Rollback[Execute Auto-Rollback] Trigger2 --> Rollback Trigger3 --> Rollback Trigger4 --> Rollback Rollback --> R1[1. Log rollback decision
Metrics that triggered
Timestamp] R1 --> R2[2. Alert team:
PagerDuty critical
Slack notification
"Auto-rollback initiated"] R2 --> R3[3. Execute rollback:
kubectl rollout undo
deployment/myapp] R3 --> R4[4. Wait for rollback:
Monitor pod status
Wait for all pods ready] R4 --> R5[5. Verify recovery:
Check metrics again
Error rate normal?
Latency normal?] R5 --> Verify{Recovery
successful?} Verify -->|Yes| Success[✅ Auto-Rollback Success
Service restored
Notify team
Create incident report] Verify -->|No| StillFailing[🚨 Still Failing!
Issue not deployment
Page on-call immediately
Manual intervention needed] style Healthy fill:#064e3b,stroke:#10b981 style Stable fill:#064e3b,stroke:#10b981 style Success fill:#064e3b,stroke:#10b981 style Trigger1 fill:#7f1d1d,stroke:#ef4444 style Trigger2 fill:#7f1d1d,stroke:#ef4444 style Trigger3 fill:#7f1d1d,stroke:#ef4444 style Trigger4 fill:#7f1d1d,stroke:#ef4444 style StillFailing fill:#7f1d1d,stroke:#ef4444
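The check sequence in the flow above can be sketched as a function that evaluates each threshold in order and returns the first trigger, if any. The thresholds mirror the diagram but are illustrative; a real controller such as Flagger evaluates these from its own metric queries:

```python
# Evaluate the auto-rollback checks from the flow above in order,
# returning the first trigger reason or None if all checks pass.
from typing import Optional

def auto_rollback_trigger(m: dict, baseline_p95_ms: float) -> Optional[str]:
    if m["error_rate"] > 0.05:                     # error rate > 5%
        return "error threshold exceeded"
    if m["latency_p95_ms"] > 2 * baseline_p95_ms:  # p95 > 2x baseline
        return "latency degradation"
    if m["pod_crash_rate"] > 0.5:                  # > 50% pods crashing
        return "pods failing"
    return None  # all checks passed: keep monitoring until the soak period ends

print(auto_rollback_trigger(
    {"error_rate": 0.18, "latency_p95_ms": 300, "pod_crash_rate": 0.0},
    baseline_p95_ms=150))  # error threshold exceeded
```

Ordering matters only for the reported reason; any single failing check is enough to trigger the rollback.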

Auto-Rollback Configuration

# Flagger auto-rollback configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp

  service:
    port: 8080

  # Canary analysis
  analysis:
    interval: 30s
    threshold: 5  # Rollback after 5 failed checks
    maxWeight: 50
    stepWeight: 10

    # Metrics for auto-rollback decision
    metrics:
    # HTTP error rate
    - name: request-success-rate
      thresholdRange:
        min: 95  # Rollback if success rate < 95%
      interval: 1m

    # HTTP latency
    - name: request-duration
      thresholdRange:
        max: 500  # Rollback if p95 > 500ms
      interval: 1m

    # Custom business metric
    - name: conversion-rate
      thresholdRange:
        min: 80  # Rollback if conversion < 80% of baseline
      interval: 2m

    # Webhooks for additional checks
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        type: bash
        cmd: "hey -z 1m -q 10 http://myapp-canary:8080/"

  # Alerting on rollback
  alerts:
  - name: slack
    severity: error
    providerRef:
      name: slack
      namespace: flagger

Part 7: Post-Incident Process

Learning from Rollbacks

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Rollback completed
Service restored]) --> Timeline[Create Incident Timeline:
- Deployment time
- Issue detection time
- Rollback decision time
- Recovery time
Total duration] Timeline --> PostMortem[Schedule Post-Mortem:
Within 48 hours
All stakeholders invited
Blameless culture] PostMortem --> Analyze[Root Cause Analysis:
Why did issue occur?
Why wasn't it caught?
What can we learn?] Analyze --> Categories{Issue
category?} Categories --> Testing[Insufficient Testing:
- Missing test case
- Integration gap
- Load testing needed] Categories --> Monitoring[Monitoring Gap:
- Missing alert
- Wrong threshold
- Blind spot found] Categories --> Process[Process Issue:
- Skipped step
- Wrong timing
- Communication gap] Categories --> Code[Code Quality:
- Bug in code
- Edge case
- Dependency issue] Testing --> Actions1[Action Items:
□ Add integration test
□ Expand E2E coverage
□ Add load test
□ Test in staging first] Monitoring --> Actions2[Action Items:
□ Add new alert
□ Adjust thresholds
□ Add dashboard
□ Improve visibility] Process --> Actions3[Action Items:
□ Update runbook
□ Add checklist item
□ Change deployment time
□ Improve communication] Code --> Actions4[Action Items:
□ Fix bug
□ Add validation
□ Update dependency
□ Code review process] Actions1 --> Assign[Assign Owners:
Each action has owner
Each action has deadline
Track in project board] Actions2 --> Assign Actions3 --> Assign Actions4 --> Assign Assign --> Document[Document Learnings:
- Update wiki
- Share with team
- Add to knowledge base
- Update training] Document --> Prevent[Prevent Recurrence:
✓ Tests added
✓ Monitoring improved
✓ Process updated
✓ Team educated] Prevent --> Complete[✅ Post-Incident Complete
Stronger system
Better prepared
Continuous improvement] style Complete fill:#064e3b,stroke:#10b981

Part 8: Rollback Checklist

Pre-Deployment Rollback Readiness

Before Every Deployment:

✅ Rollback Readiness Checklist:

Database:
□ Migrations are reversible (up/down scripts)
□ Schema changes backward compatible
□ Database backup taken before deployment
□ Tested rollback procedure in staging

Application:
□ Previous version still available in registry
□ Previous version deployment manifests saved
□ Feature flags enabled for new features
□ Rollback procedure documented

Monitoring:
□ Alerts configured for key metrics
□ Dashboards updated with new metrics
□ Auto-rollback thresholds configured
□ On-call rotation confirmed

Communication:
□ Team notified of deployment
□ Status page ready for updates
□ Incident channel created
□ Rollback decision-maker identified

Testing:
□ Canary/blue-green strategy chosen
□ Smoke tests automated
□ Rollback tested in staging
□ Load tests completed
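A checklist like the one above is most useful when it gates the deployment automatically. A minimal sketch, assuming the items are tracked as booleans (the item names are abbreviated from the checklist and purely illustrative):

```python
# Gate a deployment on the rollback-readiness checklist above:
# every item must be ticked before shipping.

checklist = {
    "migrations_reversible": True,
    "db_backup_taken": True,
    "previous_image_in_registry": True,
    "alerts_configured": True,
    "rollback_tested_in_staging": False,
}

missing = [item for item, done in checklist.items() if not done]
ready = not missing
print(f"ready to deploy: {ready}; missing: {missing}")
```

Wiring a check like this into the CI pipeline turns the checklist from documentation into an enforced precondition.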

Conclusion

Effective rollback and recovery requires:

  • Fast Detection: Monitoring and alerting to catch issues early
  • Clear Decision Process: When to roll back vs. fix forward
  • Automated Rollback: Reduce manual intervention and human error
  • Database Strategy: Handle schema changes carefully
  • Post-Incident Learning: Improve system after each incident

Key principles:

  • Have a rollback plan before deploying
  • Automate rollback triggers when possible
  • Maintain backward compatibility
  • Test rollback procedures regularly
  • Learn from every incident

Rollback strategies by deployment type:

  • Rolling Update: kubectl rollout undo (2-5 min)
  • Blue-Green: Switch service selector (1-2 sec)
  • Canary: Scale down new, scale up old (1-3 min)

The diagrams in this guide trace the complete rollback process from detection to recovery, so that you can restore service quickly when things go wrong.


Plan for rollback, and hope you never need it!