# Monitoring & Alerting: Metrics to Action Flow

## Introduction

Effective monitoring and alerting are critical for maintaining reliable systems. Without proper observability, you're flying blind when issues occur in production.

This guide visualizes the complete monitoring and alerting flow:

- **Metrics Collection**: from instrumentation to storage
- **Alert Evaluation**: when metrics cross thresholds
- **Notification Routing**: getting alerts to the right people
- **Incident Response**: from alert to resolution
- **The Three Pillars**: metrics, logs, and traces

## Part 1: Complete Monitoring & Alerting Flow

### End-to-End Overview

- **Application layer**: each application exposes a `/metrics` endpoint.
- **Metrics collection**: the Prometheus server scrapes every `/metrics` endpoint every 15s and stores the samples as time-series data.
- **Alert rules engine**: Prometheus evaluates alert rules every 1m, for example high error rate (error rate > 5%), high latency (p95 > 500ms), and low availability (uptime < 99%).
- **Alertmanager**: rules that evaluate to true are sent to Alertmanager, which routes alerts (grouping similar alerts, deduplicating, applying silences) and throttles them (rate limiting, grouping window, repeat interval).
- **Notification channels**: critical alerts page the on-call engineer via PagerDuty, warnings go to a Slack team channel, and informational alerts go to an email distribution list.
- **Incident response**: the on-call engineer receives the alert, investigates (dashboards, logs, traces), applies a fix (deploy a patch, scale resources, restart a service), and the alert resolves once metrics return to normal in Prometheus.

## Part 2: Metrics Collection Process

### Prometheus Scrape Flow

1. The running application updates its in-memory metrics, e.g. incrementing `http_requests_total` counters and recording `http_request_duration_seconds` histograms.
2. Every 15 seconds, Prometheus issues an `HTTP GET /metrics` and the endpoint returns the current values, e.g. the `http_requests_total` counter with `http_requests_total{method="GET",status="200"} 1523` and `http_requests_total{method="GET",status="500"} 12` (see the scrape config sketch after this list).
3. Prometheus parses the response and attaches labels such as `job="myapp"`, `instance="pod-1:8080"`, and a timestamp.
4. The samples are appended to the time-series database (TSDB), creating new series where needed, and compressed, e.g. `http_requests_total{job="myapp", instance="pod-1:8080", method="GET", status="200"} = 1523 @ timestamp`.
5. Data is retained for 15 days; older data is deleted automatically.
6. Grafana sends a PromQL query such as `rate(http_requests_total[5m])`; Prometheus fetches the raw data points for the last 5 minutes from the TSDB, computes the rate (Δ value / Δ time), and returns the computed values, which Grafana renders as a graph on the dashboard.
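On the Prometheus side, the scrape targets and the 15-second interval come from the server configuration. A minimal `prometheus.yml` sketch, assuming a job named `myapp` and illustrative host:port targets:

```yaml
# prometheus.yml (sketch; the job name and targets are assumptions)
global:
  scrape_interval: 15s      # how often to scrape each target
  evaluation_interval: 60s  # how often to evaluate alerting/recording rules

scrape_configs:
  - job_name: "myapp"
    metrics_path: /metrics
    static_configs:
      - targets: ["app1:8080", "app2:8080", "app3:8080"]
```

In Kubernetes you would usually replace `static_configs` with service discovery, but the interval and job/instance labeling work the same way.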
### Metrics Instrumentation Example

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Define metrics
var (
	// Counter - only goes up
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	// Histogram - for request durations
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets, // 0.005, 0.01, 0.025, 0.05, ...
		},
		[]string{"method", "endpoint"},
	)

	// Gauge - current value (can go up or down)
	activeConnections = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

func init() {
	// Register metrics with Prometheus
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
	prometheus.MustRegister(activeConnections)
}

func trackMetrics(method, endpoint string, statusCode int, duration time.Duration) {
	// Increment request counter
	httpRequestsTotal.WithLabelValues(
		method, endpoint, fmt.Sprintf("%d", statusCode),
	).Inc()

	// Record request duration
	httpRequestDuration.WithLabelValues(
		method, endpoint,
	).Observe(duration.Seconds())
}

// processRequest stands in for your real handler logic.
func processRequest(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte(`{"status":"ok"}`))
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	// Increment active connections
	activeConnections.Inc()
	defer activeConnections.Dec()

	// Your application logic here
	processRequest(w, r)

	// Track metrics
	duration := time.Since(start)
	trackMetrics(r.Method, r.URL.Path, http.StatusOK, duration)
}

func main() {
	// Expose /metrics endpoint for Prometheus
	http.Handle("/metrics", promhttp.Handler())

	// Application endpoints
	http.HandleFunc("/api/users", handleRequest)

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
## Part 3: Alert Evaluation and Firing

### Alert Rule Decision Tree

Prometheus evaluates alert rules every 1m by executing each rule's PromQL expression (e.g. a 5-minute rate compared against a threshold):

- **No data returned**: the alert is *Inactive* — no time-series matched, no notification.
- **Data exists but the condition is false**: if the alert was firing, it becomes *Resolved* and a resolved ("green") notification is sent to the channel; otherwise it stays *Inactive*.
- **Condition is true but not yet for the rule's `for` duration**: the alert is *Pending* (e.g. waiting 5 minutes), no notification yet; it is re-checked on the next evaluation.
- **Condition is true for the full `for` duration**: the alert is *Firing* 🔥 and is sent to Alertmanager.
- **Already firing**: the `repeat_interval` is respected (e.g. every 4 hours) so the alert doesn't spam — a reminder is sent only once the repeat interval has elapsed. A newly firing alert is notified immediately.
- Alertmanager then routes the alert based on its labels and routing rules.

### Alert Rule Configuration

```yaml
# prometheus-rules.yaml
groups:
  - name: application_alerts
    interval: 60s  # Evaluate every 60 seconds
    rules:
      # High Error Rate Alert
      - alert: HighErrorRate
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[5m])
            /
            rate(http_requests_total[5m])
          ) > 0.05
        for: 5m  # Must be true for 5 minutes before firing
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          dashboard: "https://grafana.example.com/d/app"

      # High Latency Alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.instance }}"
          description: "P95 latency is {{ $value }}s (threshold: 0.5s)"

      # Service Down Alert
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Cannot scrape metrics from {{ $labels.instance }}"

      # Memory Usage Alert
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{pod=~"myapp-.*"}
            /
            container_spec_memory_limit_bytes{pod=~"myapp-.*"}
          ) > 0.90
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }} of limit"
```
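To keep the error-rate expression consistent between the HighErrorRate alert and your Grafana dashboards, the ratio can be precomputed with a recording rule. A minimal sketch, assuming the same metric names as above (the recorded series name is an assumed naming convention):

```yaml
# recording-rules.yaml (sketch; the recorded series name is illustrative)
groups:
  - name: application_recording_rules
    interval: 60s
    rules:
      # Precompute the 5m error ratio so alerts and dashboards query one series
      - record: job:http_requests:error_ratio_rate5m
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          /
          rate(http_requests_total[5m])
```

The alert expression could then simply be `job:http_requests:error_ratio_rate5m > 0.05`, and any threshold change stays in one place.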
## Part 4: Alert Routing and Notification

### Alertmanager Processing Flow

When an alert arrives from Prometheus, Alertmanager processes it in stages:

- **Inhibition**: if an inhibition rule matches, the alert is suppressed because a higher-priority alert is already firing (e.g. NodeDown inhibits all pod alerts on that node).
- **Silences**: if a silence matches (manual suppression, e.g. during a maintenance window), no notification is sent.
- **Grouping**: remaining alerts are grouped (by cluster and alertname), Alertmanager waits for `group_wait` (default 30s) to collect more alerts, then builds a single notification batch for the group.
- **Routing**: the batch is matched against the routing tree. Critical alerts for `team: backend` go to PagerDuty (page the on-call engineer, escalate if there is no ack within 5 minutes); other teams go to their own PagerDuty services. Warnings go to Slack (#backend-alerts with an @here mention, or the owning team's channel). Everything else falls through to the default route: a low-priority email to a mailing list.
- **Tracking**: each notification starts a `repeat_interval` timer (e.g. 4 hours until resolved). If the alert is still firing when the interval elapses, a reminder is resent; once it resolves, a resolved notification ("all is well ✓") goes out.

### Alertmanager Configuration

```yaml
# alertmanager.yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/XXX'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Inhibition rules - suppress alerts when a higher priority alert is firing
inhibit_rules:
  # If a node is down, don't alert on pods on that node
  - source_match:
      alertname: 'NodeDown'
    target_match:
      alertname: 'PodDown'
    equal: ['node']

  # If the entire cluster is down, don't alert on individual services
  - source_match:
      severity: 'critical'
      alertname: 'ClusterDown'
    target_match_re:
      severity: 'warning|info'
    equal: ['cluster']

# Route tree - how to send alerts
route:
  receiver: 'default-email'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # Wait 30s to collect more alerts
  group_interval: 5m    # Send updates every 5m for grouped alerts
  repeat_interval: 4h   # Resend if still firing after 4h

  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s   # Page quickly for critical
      continue: true    # Also send to Slack

    - match:
        severity: critical
      receiver: 'slack-critical'

    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m

    # Team-specific routing
    - match:
        team: backend
      receiver: 'backend-team'

    - match:
        team: frontend
      receiver: 'frontend-team'

# Receivers - where to send alerts
receivers:
  - name: 'default-email'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: '{{ .GroupLabels.alertname }}: {{ .Status | toUpper }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: 'critical'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Dashboard:* {{ .Annotations.dashboard }}
          {{ end }}
        color: 'danger'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        title: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        color: 'warning'

  - name: 'backend-team'
    slack_configs:
      - channel: '#backend-alerts'
```
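For recurring maintenance windows, silences can also be declared in the configuration instead of being created by hand each time. A hedged sketch using Alertmanager's time-interval muting (available in recent Alertmanager releases; the interval name and schedule are assumptions):

```yaml
# Additional alertmanager.yaml keys for a recurring maintenance window (schedule is illustrative)
time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ['sunday']
        times:
          - start_time: '02:00'
            end_time: '04:00'
```

A route can then reference it with `mute_time_intervals: ['weekly-maintenance']`, so for example warning-level notifications stay quiet during that window while the underlying alerts keep being evaluated.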
## Part 5: Incident Response Workflow

### From Alert to Resolution

A typical response involving the alert system, PagerDuty, the on-call engineer, the Grafana dashboard, the log aggregator, the tracing system, Kubernetes, and an incident channel:

1. The alert system sends a critical alert to PagerDuty: HighErrorRate is firing on service `myapp` with a 12% error rate. PagerDuty calls, texts, and pushes the on-call engineer and creates an incident (the engineer is woken up at 3 AM 😴).
2. The engineer acknowledges the incident to stop escalation and creates #incident-123 with an initial status update.
3. On the Grafana dashboard, the error-rate graph shows a spike that started 5 minutes ago and only affects `/api/payment`.
4. A log query (`level=error AND path=/api/payment`) returns "Database connection timeout" and "Cannot connect to db:5432" — a database issue is suspected.
5. `kubectl get pods -n database` shows `postgres-0` in CrashLoopBackOff with 8 restarts.
6. `kubectl describe pod postgres-0` shows failed liveness probes and OOMKilled events: memory usage 2.1Gi against a 2Gi limit. The database was OOMKilled and needs more memory.
7. The engineer posts an update ("Database OOM, increasing memory limit") and edits the postgres StatefulSet, raising the memory limit from 2Gi to 4Gi (see the sketch after this walkthrough). Kubernetes performs a rolling restart and recreates `postgres-0` with 4Gi.
8. `kubectl get pods -n database -w` is watched until `postgres-0` is Running and Ready 1/1, then the engineer waits for metrics to normalize.
9. The refreshed dashboard shows the error rate back at 0.3% and latency at baseline. The alert conditions become false, the alert resolves, and PagerDuty auto-resolves the incident.
10. The engineer closes the incident (root cause: database OOM; fix: increased memory; duration: 23 minutes) and creates follow-up tasks: set memory alerts, review query performance, consider connection pooling. Then back to sleep 😴 — post-mortem tomorrow.
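The fix in step 7 amounts to raising the Postgres container's memory request and limit. A minimal StatefulSet sketch showing the relevant fields, using the names from the walkthrough (the image tag and the rest of the spec are illustrative assumptions):

```yaml
# Sketch of the memory bump applied in step 7 (image and other details are assumptions)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          resources:
            requests:
              memory: "4Gi"   # raised from 2Gi so the pod is no longer OOMKilled
            limits:
              memory: "4Gi"
```

Setting the request equal to the limit keeps the pod in the Guaranteed QoS class, which makes it less likely to be evicted under node memory pressure.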
## Part 6: The Three Pillars of Observability

### Metrics, Logs, and Traces Integration

When a production issue is detected, you can start from any pillar and pivot between them:

- 1️⃣ **Metrics — what is broken?** Check Grafana: is the error rate spiking, is latency up, which service and endpoint? Here that identifies service `payment-api`, endpoint `/checkout`, p95 latency 5000ms, starting 10 minutes ago. If you need to see the request flow, move to traces.
- 2️⃣ **Logs — why is it broken?** Search ELK/Loki for `service=payment-api AND path=/checkout AND level=error`. The errors show a database query timeout: `SELECT * FROM orders WHERE user_id=123` taking 5200ms. Context gained: a specific query is slow, it affects `user_id=123`, and there may be no index on `user_id`. Verify against metrics if needed.
- 3️⃣ **Traces — where is it broken?** Open Jaeger/Tempo and search by trace_id or `service=payment-api`. The distributed trace shows payment-api at 5100ms, auth-svc 20ms ✓, inventory-svc 30ms ✓, and the database at 5000ms ❌ on the orders query. The database is the bottleneck, only `/checkout` is affected, other services are healthy. Drill into logs for error details if needed.

Combining the three (metrics: latency spike on `/checkout`; logs: a specific query timing out; traces: the database is the bottleneck) gives the root cause: a missing index on `orders.user_id`. Fix: `CREATE INDEX idx_user_id ON orders (user_id)`.

### When to Use Each Pillar

| Pillar | Best For | Example Questions | Tools |
| --- | --- | --- | --- |
| Metrics | Detecting issues, trends | Is the service up? What's the error rate? Is latency increasing? | Prometheus, Grafana, Datadog |
| Logs | Understanding what happened | What was the error message? Which user was affected? What was the input? | ELK, Loki, Splunk |
| Traces | Finding bottlenecks | Which service is slow? Where is the delay? How do requests flow? | Jaeger, Tempo, Zipkin |

## Part 7: Setting Up Effective Alerts

### Alert Quality Framework

Before deploying a new alert, walk it through these questions:

- **Does it require immediate action?** If not, create a ticket instead and review it during business hours — it is not an alert.
- **Can it be automated away?** If yes, build the automation (auto-scaling, auto-healing, self-recovery) rather than paging a human.
- **Is it actionable?** If not, rethink it: what action should the engineer take? If there is none, it is not an alert.
- **Is the signal clear?** If not, refine the threshold, add a `for` duration, adjust sensitivity, and reduce false positives.
- **Does it provide enough context?** If not, add a dashboard link, a runbook link, a query to debug, and recent changes.
- **Is the severity correct?** Critical = page, warning = Slack, info = email.

If every answer is yes, you have a good alert ✅ — actionable, clear signal, right severity, good context. Deploy it and keep watching for false positives, alert fatigue, and resolution time.

## Part 8: Best Practices

### DO's and DON'Ts

✅ DO: ...

January 23, 2025 · 12 min · Rafiul Alam

# Rollback & Recovery: Detection to Previous Version

## Introduction

Even with the best testing, production issues happen. Having a solid rollback and recovery strategy is critical for minimizing downtime and data loss when deployments go wrong.

This guide visualizes the complete rollback process:

- **Issue Detection**: monitoring alerts and health checks
- **Rollback Decision**: when to roll back vs. fix forward
- **Rollback Execution**: different rollback strategies
- **Data Recovery**: handling database changes
- **Post-Incident**: learning and prevention

## Part 1: Issue Detection Flow

### From Healthy to Incident

After a production deployment completes, the monitoring systems (Prometheus metrics, application logs, user reports, health checks) watch the service against its baseline: error rate 0.1%, latency p95 150ms, traffic 10k req/min, CPU 40%, memory 60%. In the minutes after the deployment, one of several issue types may surface:

- 🔴 **Error rate spike** (0.1% → 15%, HighErrorRate firing): severity CRITICAL, high user impact, all users affected.
- 🟡 **Latency increase** (p95 150ms → 5000ms, HighLatency firing): severity WARNING, medium impact, slow but functional.
- 🟠 **Traffic drop** (10k → 1k req/min, users can't access the service): severity CRITICAL, complete outage.
- 🔴 **Resource exhaustion** (CPU 40% → 100%, OOMKilled events): severity CRITICAL, pods crashing.
- 🔴 **Data issues** (database errors, invalid data returned): severity CRITICAL, data integrity at risk.

If no issue is detected, the deployment is healthy ✅ and monitoring continues. Otherwise automated alerts fire (PagerDuty page, Slack notification, email, status page update), the on-call engineer acknowledges the incident and runs a quick investigation: check the deployment timeline, review recent changes, check logs, verify metrics. If the root cause is the recent deployment — or the situation is time-critical and the cause is unclear — move on to the rollback decision; if the root cause lies elsewhere, remediate that instead.
## Part 2: Rollback Decision Tree

### When to Rollback vs Forward Fix

Once a production issue is detected, assess the situation: user impact, severity, time since deployment, and whether data has changed.

- **Can the issue be fixed quickly (within ~5 minutes)?** A simple config change — update the ConfigMap, restart the pods — is a forward fix with no rollback needed; just monitor to verify it worked.
- **Is the issue actually caused by the latest deployment?** If not (third-party API down, database issue, infrastructure problem), fix the underlying external cause instead.
- **If it is the deployment, weigh user impact:**
  - *Low (minor bugs)*: deployed less than 30 minutes ago → consider rolling back (low risk, easy, users barely affected). More than 30 minutes ago → a forward hotfix is usually safer, since more data has changed and rollback is riskier.
  - *Medium (degraded)*: with no database changes, roll back — it's safe to revert, there is no data migration, and recovery is quick. With database changes, roll back only if they are reversible (revert the application and run the down migration, coordinated carefully); if they are irreversible, forward fix only — the data can't be reverted, so fix the bug in the new version.
  - *High (outage)*: if rollback takes under 5 minutes, roll back immediately — user impact is too high, debug later. If rollback would take longer, deploy a hotfix only when the fix is obvious and can ship faster than the rollback; otherwise roll back immediately.

## Part 3: Rollback Execution Strategies

### Application Rollback Methods

The rollback mechanics depend on the deployment strategy in use:

- **Kubernetes rolling rollback**: `kubectl rollout undo deployment myapp`. Kubernetes finds the previous ReplicaSet and rolls back with a rolling update (maxSurge: 1, maxUnavailable: 1), replacing pods gradually: create one old-version pod, wait for it to become ready, terminate one new-version pod, repeat. Time to roll back: 2–5 min; no downtime; some users see the old version and some the new during the transition.
- **Blue-green rollback**: with Blue v1.0 idle and Green v2.0 serving 100% of traffic, change the Service selector from `version: green` to `version: blue` for an instant traffic switch back to Blue v1.0 (see the Service sketch after this list). Time to roll back: 1–2 seconds; roughly a second of disruption; all users switch at once.
- **Canary rollback**: with v1.0 at 0 replicas and v2.0 at 10 replicas (100%), scale v2.0 from 10 down to 0 and v1.0 from 0 up to 10. Time to roll back: 1–3 min; minimal downtime; traffic shifts gradually.
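A minimal sketch of the blue-green traffic switch, assuming the two versions are labeled `version: blue` and `version: green` (the Service name, labels, and ports are illustrative):

```yaml
# Service selector switch for a blue-green rollback (names/ports are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue   # was "green"; pointing back at blue routes 100% of traffic to v1.0
  ports:
    - port: 80
      targetPort: 8080
```

Because only the selector changes, the rollback is a single `kubectl apply` (or `kubectl patch`) and takes effect as soon as the endpoints update.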
Whichever method you use, verify the rollback:

1. Check pod status (`kubectl get pods`): are all pods running?
2. Run health checks (`curl /health`): all healthy?
3. Monitor metrics: is the error rate back to normal? Has latency improved?
4. Check user reports: are users reporting success?

If the rollback succeeded ✅, the service is restored — keep monitoring closely. If it is still broken 🚨, the issue is probably not deployment-related and needs deeper investigation.
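The rolling rollback above relies on Kubernetes retaining previous ReplicaSets for `kubectl rollout undo` to target. A minimal Deployment sketch showing the relevant knobs (replica count, image reference, and history limit are illustrative assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5
  revisionHistoryLimit: 10        # keep old ReplicaSets so rollbacks have something to roll back to
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                 # at most one extra pod during the roll
      maxUnavailable: 1           # at most one pod below desired capacity
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:v2.4   # hypothetical image reference
          ports:
            - containerPort: 8080
```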
## Part 4: Database Rollback Complexity

### Handling Database Migrations

When a rollback involves database changes, analyze the migration type before touching anything:

- **Added column** (`ALTER TABLE users ADD COLUMN email`): if the column holds no data yet, rollback is safe — deploy the old app version and drop the column; the old app doesn't use it. If it already has data, there is data-loss risk ⚠️ — back up the table first, consider keeping the column, and deploy the old app version, which will simply ignore it.
- **Dropped column** (`ALTER TABLE users DROP COLUMN phone`): 🚨 cannot roll back — the data is already gone, forward fix only. Options: restore from backup, accept the data loss, or recreate the data from logs.
- **Modified column** (`ALTER TABLE users ALTER COLUMN age TYPE bigint`): if the data is still compatible, revert the type (`ALTER COLUMN age TYPE int`), verify nothing is truncated, then deploy the old app. If bigint values now exceed the int range, the change cannot be reverted 🚨 — forward fix only.
- **Added table** (`CREATE TABLE orders`): with no data, rollback is safe — deploy the old app and drop the table. With critical data, it is risky — back up the table, drop it, deploy the old app, and keep the data preserved in the backup.

In summary:

- **Safe rollback** ✅: no data loss, quick, reversible — execute it.
- **Risky rollback** ⚠️: potential data loss, needs a backup and manual intervention — execute with caution only if the risk is acceptable; otherwise treat it as forward-fix-only.
- **Forward fix only** ❌: rollback is impossible or data is already lost — deploy a hotfix on the new schema.
## Part 5: Complete Rollback Workflow

### From Detection to Recovery

1. Five minutes after the deployment, monitoring detects an anomaly: error rate 0.1% → 18%, latency p95 150ms → 3000ms. HighErrorRate fires and PagerDuty calls the on-call engineer with a critical production incident 🚨.
2. The engineer acknowledges the alert to stop escalation and opens #incident-456: "High error rate after v2.5 deployment."
3. The Grafana dashboard shows the problem started 5 minutes ago, right after the deployment, and affects all endpoints.
4. `kubectl get pods` shows every pod Running with passing health checks, but `kubectl logs deployment/myapp` is full of "Cannot connect to cache", "Redis timeout", and "Connection refused" — the new version has a Redis connection bug.
5. The engineer posts the finding and the decision to roll back to v2.4, checks `kubectl rollout history deployment/myapp` (revision 10: v2.5, current; revision 9: v2.4), and announces the rollback with an ETA of 3 minutes.
6. `kubectl rollout undo deployment/myapp` starts the rollback: v2.4 pods are created, become ready, and v2.5 pods are terminated. During the rolling update some users are on v2.4 and some still on v2.5; `kubectl rollout status deployment/myapp --watch` reports progress (e.g. 2/5 pods updated) until `deployment "myapp" successfully rolled out` and all users are on v2.4 ✓.
7. After about two minutes the metrics stabilize: error rate 0.1% ✅, latency p95 160ms ✅. The HighErrorRate alert resolves and users report no further errors.
8. The engineer resolves the incident ✅ (service restored to v2.4, duration 12 minutes, root cause: Redis bug in v2.5) and lists next steps: fix the Redis bug, add an integration test, schedule a post-mortem. Follow-up tickets: BUG-789 (fix Redis connection), TEST-123 (add cache integration test), DOC-456 (update deployment checklist). Monitoring continues.
## Part 6: Automated Rollback

### Auto-Rollback Decision Flow

- After a deployment completes, continuous monitoring runs every 30 seconds, collecting error rate, latency p95/p99, success rate, pod health, and resource usage.
- Auto-rollback triggers 🚨 if any check fails: error rate > 5%, latency p95 > 2× baseline, pod crash rate > 50%, or a custom business-metric threshold breached.
- If all checks pass, monitoring continues; once the deployment has been healthy for 15 minutes, it is considered stable ✅ — it passed the soak period and auto-rollback is disabled.
- When a trigger fires, the automation: (1) logs the rollback decision, the metrics that triggered it, and a timestamp; (2) alerts the team (PagerDuty critical, Slack "Auto-rollback initiated"); (3) executes `kubectl rollout undo deployment/myapp`; (4) waits for all pods to become ready; (5) verifies recovery by re-checking error rate and latency. If recovery succeeds ✅, the service is restored, the team is notified, and an incident report is created. If it is still failing 🚨, the issue isn't the deployment — page the on-call engineer immediately for manual intervention.

### Auto-Rollback Configuration

```yaml
# Flagger auto-rollback configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp

  service:
    port: 8080

  # Canary analysis
  analysis:
    interval: 30s
    threshold: 5      # Rollback after 5 failed checks
    maxWeight: 50
    stepWeight: 10

    # Metrics for auto-rollback decision
    metrics:
      # HTTP error rate
      - name: request-success-rate
        thresholdRange:
          min: 95     # Rollback if success rate < 95%
        interval: 1m

      # HTTP latency
      - name: request-duration
        thresholdRange:
          max: 500    # Rollback if p95 > 500ms
        interval: 1m

      # Custom business metric
      - name: conversion-rate
        thresholdRange:
          min: 80     # Rollback if conversion < 80% of baseline
        interval: 2m

    # Webhooks for additional checks
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          type: bash
          cmd: "hey -z 1m -q 10 http://myapp-canary:8080/"

    # Alerting on rollback
    alerts:
      - name: slack
        severity: error
        providerRef:
          name: slack
          namespace: flagger
```
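The `request-success-rate` and `request-duration` checks are built into Flagger, but a business metric such as `conversion-rate` would typically be backed by a custom metric template. A hedged sketch, assuming Flagger's Prometheus provider and a hypothetical `orders_conversion_rate_percent` series:

```yaml
# MetricTemplate backing the custom conversion-rate check (query and metric name are assumptions)
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: conversion-rate
  namespace: production
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    avg(orders_conversion_rate_percent{namespace="production"})
```

The Canary's `conversion-rate` metric entry would then reference this template via a `templateRef`, so Flagger evaluates the query at each analysis interval against the configured `thresholdRange`.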
## Part 7: Post-Incident Process

### Learning from Rollbacks

- Once the rollback is complete and the service is restored, build an incident timeline: deployment time, issue detection time, rollback decision time, recovery time, and total duration.
- Schedule a blameless post-mortem within 48 hours with all stakeholders, and run a root cause analysis: why did the issue occur, why wasn't it caught, and what can we learn?
- Categorize the issue and derive action items:
  - **Insufficient testing** (missing test case, integration gap, load testing needed) → add an integration test, expand E2E coverage, add a load test, test in staging first.
  - **Monitoring gap** (missing alert, wrong threshold, blind spot found) → add a new alert, adjust thresholds, add a dashboard, improve visibility.
  - **Process issue** (skipped step, wrong timing, communication gap) → update the runbook, add a checklist item, change the deployment timing, improve communication.
  - **Code quality** (bug in code, edge case, dependency issue) → fix the bug, add validation, update the dependency, tighten the code review process.
- Give every action item an owner and a deadline, track them on a project board, and document the learnings: update the wiki, share with the team, add to the knowledge base, update training.
- The goal is preventing recurrence: tests added, monitoring improved, process updated, team educated ✅ — a stronger, better-prepared system and continuous improvement.

## Part 8: Rollback Checklist

### Pre-Deployment Rollback Readiness

Before Every Deployment: ...

January 23, 2025 · 11 min · Rafiul Alam