Introduction

Effective monitoring and alerting are critical for maintaining reliable systems. Without proper observability, you’re flying blind when issues occur in production.

This guide visualizes the complete monitoring and alerting flow:

  • Metrics Collection: From instrumentation to storage
  • Alert Evaluation: When metrics cross thresholds
  • Notification Routing: Getting alerts to the right people
  • Incident Response: From alert to resolution
  • The Three Pillars: Metrics, Logs, and Traces

Part 1: Complete Monitoring & Alerting Flow

End-to-End Overview

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    subgraph Apps[Application Layer]
        App1[Application 1<br/>Exposes /metrics]
        App2[Application 2<br/>Exposes /metrics]
        App3[Application 3<br/>Exposes /metrics]
    end
    subgraph Collection[Metrics Collection]
        Prometheus[Prometheus Server<br/>Scrapes metrics every 15s<br/>Stores time-series data]
    end
    subgraph Rules[Alert Rules Engine]
        Rules1["Alert Rule 1:<br/>High Error Rate<br/>rate > 5%"]
        Rules2["Alert Rule 2:<br/>High Latency<br/>p95 > 500ms"]
        Rules3["Alert Rule 3:<br/>Low Availability<br/>uptime < 99%"]
    end
    subgraph AlertMgr[Alert Manager]
        Routing[Alert Routing<br/>- Group similar alerts<br/>- Deduplicate<br/>- Apply silences]
        Throttle[Throttling<br/>- Rate limiting<br/>- Grouping window<br/>- Repeat interval]
    end
    subgraph Notification[Notification Channels]
        PagerDuty[PagerDuty<br/>Critical alerts<br/>On-call engineer]
        Slack[Slack<br/>Warning alerts<br/>Team channel]
        Email[Email<br/>Info alerts<br/>Distribution list]
    end
    subgraph Response[Incident Response]
        OnCall[On-Call Engineer<br/>Receives alert]
        Investigate[Investigate Issue<br/>- Check dashboards<br/>- Review logs<br/>- Analyze traces]
        Fix[Apply Fix<br/>- Deploy patch<br/>- Scale resources<br/>- Restart service]
        Resolve[Resolve Alert<br/>Metrics return to normal]
    end

    App1 -->|Scrape /metrics| Prometheus
    App2 -->|Scrape /metrics| Prometheus
    App3 -->|Scrape /metrics| Prometheus
    Prometheus -->|Evaluate every 1m| Rules1
    Prometheus -->|Evaluate every 1m| Rules2
    Prometheus -->|Evaluate every 1m| Rules3
    Rules1 -->|Trigger if true| Routing
    Rules2 -->|Trigger if true| Routing
    Rules3 -->|Trigger if true| Routing
    Routing --> Throttle
    Throttle -->|Severity: Critical| PagerDuty
    Throttle -->|Severity: Warning| Slack
    Throttle -->|Severity: Info| Email
    PagerDuty --> OnCall
    Slack --> OnCall
    OnCall --> Investigate
    Investigate --> Fix
    Fix --> Resolve
    Resolve -.->|Metrics normalized| Prometheus

    style Prometheus fill:#1e3a8a,stroke:#3b82f6
    style Routing fill:#1e3a8a,stroke:#3b82f6
    style PagerDuty fill:#7f1d1d,stroke:#ef4444
    style Resolve fill:#064e3b,stroke:#10b981

Part 2: Metrics Collection Process

Prometheus Scrape Flow

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
sequenceDiagram
    participant App as Application
    participant Metrics as /metrics Endpoint
    participant Prom as Prometheus
    participant TSDB as Time-Series Database
    participant Grafana as Grafana Dashboard

    Note over App: Application running<br/>Incrementing counters<br/>Recording histograms
    App->>Metrics: Update in-memory metrics<br/>http_requests_total++<br/>http_request_duration_seconds

    loop Every 15 seconds
        Prom->>Metrics: HTTP GET /metrics
        Metrics-->>Prom: Return current metrics<br/># TYPE http_requests_total counter<br/>http_requests_total{method="GET",status="200"} 1523<br/>http_requests_total{method="GET",status="500"} 12
        Note over Prom: Parse metrics<br/>Add labels:<br/>- job="myapp"<br/>- instance="pod-1:8080"<br/>- timestamp
        Prom->>TSDB: Store time-series data<br/>Append to existing series<br/>Create new series if needed
        Note over TSDB: Compress and store:<br/>http_requests_total{<br/>job="myapp",<br/>instance="pod-1:8080",<br/>method="GET",<br/>status="200"<br/>} = 1523 @ timestamp
    end

    Note over Prom,TSDB: Data retained for 15 days<br/>Older data deleted automatically

    Grafana->>Prom: PromQL Query:<br/>rate(http_requests_total[5m])
    Prom->>TSDB: Fetch time-series data<br/>for last 5 minutes
    TSDB-->>Prom: Return raw data points
    Note over Prom: Calculate rate:<br/>Δ value / Δ time
    Prom-->>Grafana: Return computed values
    Grafana->>Grafana: Render graph<br/>Display on dashboard
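
The scrape side of this flow is configured in Prometheus itself. Below is a minimal prometheus.yml sketch matching the diagram's 15-second scrape interval; the job name and target addresses are illustrative (in Kubernetes you would typically use kubernetes_sd_configs instead of static targets).

# prometheus.yml (illustrative)
global:
  scrape_interval: 15s      # How often to scrape /metrics
  evaluation_interval: 1m   # How often to evaluate alert rules

scrape_configs:
  - job_name: 'myapp'                          # Becomes the job="myapp" label
    metrics_path: /metrics
    static_configs:
      - targets: ['pod-1:8080', 'pod-2:8080']  # Hypothetical instances

# The 15-day retention shown in the diagram is set with the
# --storage.tsdb.retention.time=15d flag, not in this file.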

Metrics Instrumentation Example

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Define metrics
var (
    // Counter - only goes up
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Histogram - for request durations
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets, // 0.005, 0.01, 0.025, 0.05, ...
        },
        []string{"method", "endpoint"},
    )

    // Gauge - current value (can go up or down)
    activeConnections = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
)

func init() {
    // Register metrics with Prometheus
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
    prometheus.MustRegister(activeConnections)
}

func trackMetrics(method, endpoint string, statusCode int, duration time.Duration) {
    // Increment request counter
    httpRequestsTotal.WithLabelValues(
        method,
        endpoint,
        fmt.Sprintf("%d", statusCode),
    ).Inc()

    // Record request duration
    httpRequestDuration.WithLabelValues(
        method,
        endpoint,
    ).Observe(duration.Seconds())
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // Increment active connections
    activeConnections.Inc()
    defer activeConnections.Dec()

    // Your application logic here; this stub just returns a static response
    w.Write([]byte(`{"status":"ok"}`))

    // Track metrics
    duration := time.Since(start)
    trackMetrics(r.Method, r.URL.Path, http.StatusOK, duration)
}

func main() {
    // Expose /metrics endpoint for Prometheus
    http.Handle("/metrics", promhttp.Handler())

    // Application endpoints
    http.HandleFunc("/api/users", handleRequest)

    log.Fatal(http.ListenAndServe(":8080", nil))
}
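
Once these metrics are scraped, the dashboards and alert rules in the rest of this guide are just PromQL over them. A few representative queries using the metric names from the instrumentation above:

# Requests per second, broken down by endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Error ratio: 5xx responses divided by all responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# 95th percentile request latency from the histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Current number of active connections (gauge)
active_connections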

Part 3: Alert Evaluation and Firing

Alert Rule Decision Tree

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Prometheus evaluates<br/>alert rules every 1m]) --> Query["Execute PromQL query:<br/>rate5m > threshold"]
    Query --> Result{Query<br/>returns data?}
    Result -->|No data| Inactive[Alert: Inactive<br/>No time-series match<br/>No notification]
    Result -->|Data exists| CheckCondition{Condition<br/>true?}
    CheckCondition -->|False| Resolved{Alert was<br/>firing?}
    Resolved -->|Yes| SendResolved[Alert: Resolved<br/>Send resolved notification<br/>Green alert to channel]
    Resolved -->|No| Inactive
    CheckCondition -->|True| Duration{"Condition true<br/>for 'for' duration?"}
    Duration -->|No| Pending[Alert: Pending<br/>Waiting for duration<br/>e.g., 5 minutes<br/>No notification yet]
    Pending -.->|Check again| Start
    Duration -->|Yes| Firing[Alert: Firing 🔥<br/>Send to Alertmanager]
    Firing --> Dedupe{Already<br/>firing?}
    Dedupe -->|Yes| Throttle["Respect repeat_interval<br/>e.g., every 4 hours<br/>Don't spam"]
    Dedupe -->|No| NewAlert[New alert!<br/>Send notification immediately]
    Throttle --> TimeCheck{Repeat interval<br/>elapsed?}
    TimeCheck -->|No| Wait["Wait...<br/>Don't send yet"]
    TimeCheck -->|Yes| Reminder[Send reminder<br/>notification]
    NewAlert --> AlertManager[Send to Alertmanager]
    Reminder --> AlertManager
    AlertManager --> Route[Route based on labels<br/>Apply routing rules]

    style Inactive fill:#1e3a8a,stroke:#3b82f6
    style Pending fill:#78350f,stroke:#f59e0b
    style Firing fill:#7f1d1d,stroke:#ef4444
    style SendResolved fill:#064e3b,stroke:#10b981
    style NewAlert fill:#7f1d1d,stroke:#ef4444
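
While tuning rules, it helps to know that Prometheus exposes this state machine as a synthetic ALERTS time series, so you can query which alerts are pending or firing directly:

# All alerts currently in the firing state
ALERTS{alertstate="firing"}

# Pending alerts: condition true, but the 'for' duration has not yet elapsed
ALERTS{alertstate="pending"}

# Is a specific alert firing right now?
ALERTS{alertname="HighErrorRate", alertstate="firing"}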

Alert Rule Configuration

# prometheus-rules.yaml
groups:
- name: application_alerts
  interval: 60s  # Evaluate every 60 seconds

  rules:
  # High Error Rate Alert
  - alert: HighErrorRate
    expr: |
      (
        sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
        /
        sum by (instance) (rate(http_requests_total[5m]))
      ) > 0.05
    for: 5m  # Must be true for 5 minutes before firing
    labels:
      severity: critical
      team: backend
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
      dashboard: "https://grafana.example.com/d/app"

  # High Latency Alert
  - alert: HighLatency
    expr: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket[5m])
      ) > 0.5      
    for: 10m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "High latency on {{ $labels.instance }}"
      description: "P95 latency is {{ $value }}s (threshold: 0.5s)"

  # Service Down Alert
  - alert: ServiceDown
    expr: up{job="myapp"} == 0
    for: 1m
    labels:
      severity: critical
      team: sre
    annotations:
      summary: "Service {{ $labels.instance }} is down"
      description: "Cannot scrape metrics from {{ $labels.instance }}"

  # Memory Usage Alert
  - alert: HighMemoryUsage
    expr: |
      (
        container_memory_usage_bytes{pod=~"myapp-.*"}
        /
        container_spec_memory_limit_bytes{pod=~"myapp-.*"}
      ) > 0.90      
    for: 5m
    labels:
      severity: warning
      team: platform
    annotations:
      summary: "High memory usage on {{ $labels.pod }}"
      description: "Memory usage is {{ $value | humanizePercentage }} of limit"

Part 4: Alert Routing and Notification

Alertmanager Processing Flow

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([Alert received from<br/>Prometheus]) --> Inhibit{Inhibition<br/>rules match?}
    Inhibit -->|Yes| Suppress[Alert suppressed<br/>Higher priority alert<br/>already firing<br/>e.g., NodeDown inhibits<br/>all pod alerts on that node]
    Inhibit -->|No| Silence{Silence<br/>matches?}
    Silence -->|Yes| Silenced[Alert silenced<br/>Manual suppression<br/>During maintenance window<br/>No notification sent]
    Silence -->|No| Group[Group alerts<br/>By: cluster, alertname<br/>Combine similar alerts]
    Group --> Wait[Wait for group_wait<br/>Default: 30s<br/>Collect more alerts]
    Wait --> Batch[Create notification batch<br/>Multiple alerts grouped<br/>Single notification]
    Batch --> Route{Match<br/>routing tree?}
    Route --> Critical{severity:<br/>critical?}
    Route --> Warning{severity:<br/>warning?}
    Route --> Default[Default route]
    Critical --> Team1{team:<br/>backend?}
    Team1 -->|Yes| PagerDuty[PagerDuty<br/>Page on-call engineer<br/>Escalate if no ack<br/>in 5 minutes]
    Team1 -->|No| Team2[Other team's PagerDuty]
    Warning --> SlackRoute{team:<br/>backend?}
    SlackRoute -->|Yes| Slack["Slack #backend-alerts<br/>Post message<br/>@here mention"]
    SlackRoute -->|No| SlackOther[Other team's Slack]
    Default --> Email[Email<br/>Send to mailing list<br/>Low priority]
    PagerDuty --> Track[Track notification<br/>Set repeat_interval timer<br/>4 hours until resolved]
    Slack --> Track
    Email --> Track
    Track --> Resolved{Alert<br/>resolved?}
    Resolved -->|No| RepeatCheck{repeat_interval<br/>elapsed?}
    RepeatCheck -->|Yes| Resend[Resend notification<br/>Reminder that alert<br/>still firing]
    Resend -.-> Track
    RepeatCheck -->|No| Wait2[Wait...]
    Wait2 -.-> Resolved
    Resolved -->|Yes| SendResolved[Send resolved notification<br/>All is well ✓]

    style Suppress fill:#1e3a8a,stroke:#3b82f6
    style Silenced fill:#1e3a8a,stroke:#3b82f6
    style PagerDuty fill:#7f1d1d,stroke:#ef4444
    style SendResolved fill:#064e3b,stroke:#10b981

Alertmanager Configuration

# alertmanager.yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/XXX'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Inhibition rules - suppress alerts when higher priority alert is firing
inhibit_rules:
  # If node is down, don't alert on pods on that node
  - source_match:
      alertname: 'NodeDown'
    target_match:
      alertname: 'PodDown'
    equal: ['node']

  # If entire cluster is down, don't alert on individual services
  - source_match:
      severity: 'critical'
      alertname: 'ClusterDown'
    target_match_re:
      severity: 'warning|info'
    equal: ['cluster']

# Route tree - how to send alerts
route:
  receiver: 'default-email'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # Wait 30s to collect more alerts
  group_interval: 5m     # Send updates every 5m for grouped alerts
  repeat_interval: 4h    # Resend if still firing after 4h

  routes:
  # Critical alerts to PagerDuty
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    group_wait: 10s      # Page quickly for critical
    continue: true       # Also send to Slack

  - match:
      severity: critical
    receiver: 'slack-critical'

  # Warning alerts to Slack
  - match:
      severity: warning
    receiver: 'slack-warnings'
    group_wait: 1m

  # Team-specific routing
  - match:
      team: backend
    receiver: 'backend-team'

  - match:
      team: frontend
    receiver: 'frontend-team'

# Receivers - where to send alerts
receivers:
- name: 'default-email'
  email_configs:
  - to: '[email protected]'
    headers:
      Subject: '{{ .GroupLabels.alertname }}: {{ .Status | toUpper }}'

- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: 'your-pagerduty-key'
    description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
    severity: 'critical'

- name: 'slack-critical'
  slack_configs:
  - channel: '#alerts-critical'
    title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }}
      *Description:* {{ .Annotations.description }}
      *Severity:* {{ .Labels.severity }}
      *Dashboard:* {{ .Annotations.dashboard }}
      {{ end }}      
    color: 'danger'
    send_resolved: true

- name: 'slack-warnings'
  slack_configs:
  - channel: '#alerts-warning'
    title: '⚠️  WARNING: {{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    color: 'warning'

- name: 'backend-team'
  slack_configs:
  - channel: '#backend-alerts'
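
The inline {{ range .Alerts }} … {{ end }} blocks above use Alertmanager's Go templating. Once several receivers share the same formatting, it is common to move it into a template file loaded through the top-level templates: key and reference it by name. A sketch, with a hypothetical file path and template name:

# alertmanager.yaml (addition)
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# /etc/alertmanager/templates/myapp.tmpl would then contain, for example:
#   {{ define "slack.myapp.text" }}{{ range .Alerts }}*{{ .Annotations.summary }}* ({{ .Labels.severity }})
#   {{ end }}{{ end }}
#
# and a Slack receiver references it with:
#   text: '{{ template "slack.myapp.text" . }}'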

Part 5: Incident Response Workflow

From Alert to Resolution

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
sequenceDiagram
    participant Alert as Alert System
    participant PD as PagerDuty
    participant Eng as On-Call Engineer
    participant Dash as Grafana Dashboard
    participant Logs as Log Aggregator
    participant Trace as Tracing System
    participant K8s as Kubernetes
    participant Incident as Incident Channel

    Alert->>PD: 🚨 Critical Alert<br/>HighErrorRate firing<br/>Service: myapp<br/>Error rate: 12%
    PD->>Eng: 📱 Phone call + SMS + Push<br/>Incident created
    Note over Eng: Engineer woken up<br/>at 3 AM 😴
    Eng->>PD: Acknowledge incident<br/>Stop escalation
    Eng->>Incident: Create #incident-123<br/>Post initial status
    Note over Eng: Open laptop<br/>Start investigation
    Eng->>Dash: Open dashboard<br/>Check error rate graph
    Dash-->>Eng: Graph shows spike<br/>Started 5 minutes ago<br/>Only affects /api/payment
    Eng->>Logs: Query logs:<br/>level=error AND<br/>path=/api/payment
    Logs-->>Eng: Errors:<br/>"Database connection timeout"<br/>"Cannot connect to db:5432"
    Note over Eng: Database issue suspected
    Eng->>K8s: kubectl get pods -n database
    K8s-->>Eng: postgres-0: CrashLoopBackOff<br/>Restart count: 8
    Eng->>K8s: kubectl describe pod postgres-0
    K8s-->>Eng: Event: Liveness probe failed<br/>Event: OOMKilled<br/>Memory: 2.1Gi / 2Gi limit
    Note over Eng: Database OOMKilled!<br/>Need more memory
    Eng->>Incident: Update: Database OOM<br/>Action: Increasing memory limit
    Eng->>K8s: kubectl edit statefulset postgres<br/>Change: 2Gi → 4Gi memory
    K8s-->>Eng: Statefulset updated
    Note over K8s: Rolling restart<br/>postgres-0 recreated<br/>with 4Gi memory
    Eng->>K8s: kubectl get pods -n database -w<br/>Watch pod status
    K8s-->>Eng: postgres-0: Running ✓<br/>Ready: 1/1
    Note over Eng: Wait for metrics<br/>to normalize
    Eng->>Dash: Refresh dashboard
    Dash-->>Eng: Error rate: 0.3% ✓<br/>Latency: normal ✓<br/>Back to baseline
    Note over Alert: Metrics normalized<br/>Alert conditions false
    Alert->>PD: ✅ Alert resolved
    PD->>Eng: Incident auto-resolved
    Eng->>Incident: Incident resolved ✓<br/>Root cause: DB OOM<br/>Fix: Increased memory<br/>Duration: 23 minutes
    Eng->>Eng: Create follow-up tasks:<br/>1. Set memory alerts<br/>2. Review query performance<br/>3. Consider connection pooling
    Note over Eng: Back to sleep 😴<br/>Post-mortem tomorrow
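
The fix in this walkthrough is a one-line resource change. For reference, the relevant part of the StatefulSet manifest would look roughly like this (the names and values follow the hypothetical incident above):

# StatefulSet postgres (excerpt) - raising the memory limit after the OOMKill
spec:
  template:
    spec:
      containers:
        - name: postgres
          resources:
            requests:
              memory: "4Gi"
            limits:
              memory: "4Gi"   # was 2Gi; the pod was OOMKilled at the old limit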

Part 6: The Three Pillars of Observability

Metrics, Logs, and Traces Integration

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Issue([Production Issue Detected]) --> Which{Which pillar<br/>to start with?}
    Which --> Metrics[1️⃣ METRICS<br/>What is broken?]
    Which --> Logs[2️⃣ LOGS<br/>Why is it broken?]
    Which --> Traces[3️⃣ TRACES<br/>Where is it broken?]

    Metrics --> M1[Check Grafana<br/>- Error rate spiking?<br/>- Latency increased?<br/>- Which service?<br/>- Which endpoint?]
    M1 --> M2[Identify:<br/>✓ Service: payment-api<br/>✓ Endpoint: /checkout<br/>✓ Metric: p95 latency 5000ms<br/>✓ Time: Started 10m ago]
    M2 --> UseTrace{Need to see<br/>request flow?}
    UseTrace -->|Yes| Traces

    Logs --> L1[Search logs in ELK/Loki<br/>service=payment-api AND<br/>path=/checkout AND<br/>level=error]
    L1 --> L2[Find errors:<br/>"Database query timeout"<br/>"SELECT * FROM orders<br/>WHERE user_id=123<br/>execution time: 5200ms"]
    L2 --> L3[Context found:<br/>✓ Specific query is slow<br/>✓ Affecting user_id=123<br/>✓ No index on user_id?]
    L3 --> UseMetrics{Verify with<br/>metrics?}
    UseMetrics -->|Yes| Metrics

    Traces --> T1[Open Jaeger/Tempo<br/>Search trace_id or<br/>service=payment-api]
    T1 --> T2[View distributed trace:<br/>┌─ payment-api: 5100ms<br/>│ ├─ auth-svc: 20ms ✓<br/>│ ├─ inventory-svc: 30ms ✓<br/>│ └─ database: 5000ms ❌<br/>│ └─ query: SELECT * FROM orders]
    T2 --> T3[Identify bottleneck:<br/>✓ Database query is slow<br/>✓ Affects only /checkout<br/>✓ Other services healthy]
    T3 --> UseLogs{Need error<br/>details?}
    UseLogs -->|Yes| Logs

    M2 --> RootCause[Combine insights:<br/>METRICS: Latency spike on /checkout<br/>LOGS: Specific query timeout<br/>TRACES: Database is bottleneck]
    L3 --> RootCause
    T3 --> RootCause
    RootCause --> Fix[Root Cause Found:<br/>Missing database index<br/>on orders.user_id<br/><br/>Fix: CREATE INDEX<br/>idx_user_id ON orders]

    style Metrics fill:#1e3a8a,stroke:#3b82f6
    style Logs fill:#78350f,stroke:#f59e0b
    style Traces fill:#064e3b,stroke:#10b981
    style RootCause fill:#064e3b,stroke:#10b981
    style Fix fill:#064e3b,stroke:#10b981

When to Use Each Pillar

Pillar   | Best For                    | Example Questions                                                         | Tools
---------|-----------------------------|---------------------------------------------------------------------------|------------------------------
Metrics  | Detecting issues, trends    | Is the service up? What’s the error rate? Is latency increasing?          | Prometheus, Grafana, Datadog
Logs     | Understanding what happened | What was the error message? Which user was affected? What was the input?  | ELK, Loki, Splunk
Traces   | Finding bottlenecks         | Which service is slow? Where is the delay? How do requests flow?          | Jaeger, Tempo, Zipkin
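
The three pillars pay off when they share identifiers: the same service, endpoint, and trace ID should appear in your metrics labels, log fields, and spans. A minimal Go sketch of a structured log line carrying those fields, using the standard library's log/slog (the field names are conventions, not requirements):

package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON logs are easy to index in Loki/ELK and to filter by field
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // One log line per request error, carrying the same identifiers used in
    // metrics labels and trace spans, so the three pillars can be joined
    logger.Error("database query timeout",
        "service", "payment-api",
        "path", "/checkout",
        "trace_id", "4bf92f3577b34da6a3ce929d0e0e4736", // example value
        "duration_ms", 5200,
    )
}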

Part 7: Setting Up Effective Alerts

Alert Quality Framework

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
flowchart TD
    Start([New Alert Idea]) --> Question1{Does this require<br/>immediate action?}
    Question1 -->|No| Ticket[Create ticket instead<br/>Not an alert<br/>Review during business hours]
    Question1 -->|Yes| Question2{Can it be<br/>automated away?}
    Question2 -->|Yes| Automate[Build automation<br/>Auto-scaling<br/>Auto-healing<br/>Self-recovery]
    Question2 -->|No| Question3{Is it actionable?}
    Question3 -->|No| Rethink[Rethink the alert<br/>What action should<br/>the engineer take?<br/>If none, not an alert]
    Question3 -->|Yes| Question4{Is the signal<br/>clear?}
    Question4 -->|No| Refine["Refine the threshold<br/>Add 'for' duration<br/>Adjust sensitivity<br/>Reduce false positives"]
    Question4 -->|Yes| Question5{Provides enough<br/>context?}
    Question5 -->|No| AddContext[Add context:<br/>- Dashboard link<br/>- Runbook link<br/>- Query to debug<br/>- Recent changes]
    Question5 -->|Yes| Question6{Correct<br/>severity?}
    Question6 -->|No| Severity[Adjust severity:<br/>Critical = Page<br/>Warning = Slack<br/>Info = Email]
    Question6 -->|Yes| GoodAlert[✅ Good Alert!<br/>- Actionable<br/>- Clear signal<br/>- Right severity<br/>- Good context]
    GoodAlert --> Deploy[Deploy alert<br/>Monitor for:<br/>- False positives<br/>- Alert fatigue<br/>- Resolution time]

    style Ticket fill:#1e3a8a,stroke:#3b82f6
    style Automate fill:#064e3b,stroke:#10b981
    style GoodAlert fill:#064e3b,stroke:#10b981
    style Rethink fill:#7f1d1d,stroke:#ef4444
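
In Prometheus terms, the "add context" step usually means richer annotations on the rule. A sketch extending the HighErrorRate alert from Part 3 (the runbook URL and debug query are placeholders):

    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      dashboard: "https://grafana.example.com/d/app"
      runbook: "https://wiki.example.com/runbooks/high-error-rate"                 # placeholder URL
      query: 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)'     # where to start debugging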

Part 8: Best Practices

DO’s and DON’Ts

✅ DO:

  • Alert on symptoms (user-facing issues), not causes
  • Include links to dashboards and runbooks
  • Use a 'for' duration to avoid flapping alerts
  • Set appropriate severity levels
  • Group related alerts
  • Send resolved notifications
  • Review and prune unused alerts

❌ DON’T:

  • Alert on everything “just in case”
  • Page for issues that can wait until morning
  • Create alerts without clear action items
  • Ignore alert fatigue
  • Alert on metrics you don’t understand
  • Forget to test alerts before deploying

Alert Checklist

Before deploying an alert, ask:

□ Is this symptom-based or user-impacting?
□ Does it require immediate action?
□ Can the on-call engineer fix it?
□ Is the threshold appropriate?
□ Is the 'for' duration set correctly?
□ Does it include dashboard/runbook links?
□ Is the severity level correct?
□ Will this wake someone at 3 AM? Should it?
□ Have we tested it?
□ Is there a plan to reduce these over time?

Conclusion

Effective monitoring and alerting requires:

  • Comprehensive Metrics: Instrument your applications properly
  • Smart Alerting: Alert on symptoms that require action
  • Proper Routing: Get alerts to the right people at the right time
  • Good Context: Include dashboards, runbooks, and debugging info
  • Three Pillars: Use metrics, logs, and traces together

Key principles:

  • Alerts should be actionable
  • Reduce noise, increase signal
  • Automate what you can
  • Page only when necessary
  • Iterate and improve continuously

The visual diagrams in this guide show the complete flow from metrics collection to incident resolution, helping you build robust observability systems.


Build observability into your systems from day one!