HTTP Polling Patterns in Go: From Simple Polling to Server Push

    What are HTTP Polling Patterns? HTTP polling patterns are techniques for achieving near-real-time communication between clients and servers using the standard HTTP request-response model. While not truly real-time like WebSockets, these patterns are simpler to implement, easier to debug, and work reliably across all networks and proxies. ...
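As a rough illustration of the simplest of these patterns (fixed-interval short polling), here is a minimal Go sketch. The names `pollUntil`, `newStatusServer`, and `runDemo`, and the "pending"/"ready" bodies, are assumptions made for this example and do not come from the post. The server is an in-process `httptest` server so the sketch runs on its own.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// pollUntil sends a GET to url, then repeats every interval until done
// reports true for a response body or ctx is cancelled.
func pollUntil(ctx context.Context, client *http.Client, url string, interval time.Duration, done func(body string) bool) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return "", err
		}
		resp, err := client.Do(req)
		if err != nil {
			return "", err
		}
		data, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			return "", err
		}
		if body := string(data); done(body) {
			return body, nil
		}
		// Wait for the next tick, or give up when the context ends.
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-ticker.C:
		}
	}
}

// newStatusServer is a hypothetical endpoint that answers "pending" until
// it has been called readyAfter times, then answers "ready".
func newStatusServer(readyAfter int32) *httptest.Server {
	var calls int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) >= readyAfter {
			fmt.Fprint(w, "ready")
			return
		}
		fmt.Fprint(w, "pending")
	}))
}

// runDemo polls the test server until it reports "ready".
func runDemo(readyAfter int32) (string, error) {
	srv := newStatusServer(readyAfter)
	defer srv.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return pollUntil(ctx, srv.Client(), srv.URL, 10*time.Millisecond, func(body string) bool {
		return body == "ready"
	})
}

func main() {
	body, err := runDemo(3)
	if err != nil {
		fmt.Println("polling failed:", err)
		return
	}
	fmt.Println("final status:", body)
}
```

The interval is the main trade-off in this pattern: a shorter interval lowers latency but adds request load, which is the cost that long polling and server push try to avoid.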

    January 23, 2025 · 12 min · Rafiul Alam

    Kubernetes Pod Lifecycle: Pending → Running → Succeeded

    Introduction

    Kubernetes Pods are the smallest deployable units in Kubernetes, representing one or more containers that share resources. Understanding the Pod lifecycle is crucial for debugging, monitoring, and managing applications in Kubernetes.

    This guide visualizes the complete Pod lifecycle:

    - Pod Creation: From YAML manifest to scheduling
    - State Transitions: Pending → Running → Succeeded/Failed
    - Init Containers: Pre-application setup
    - Container Restart Policies: How Kubernetes handles failures
    - Termination: Graceful shutdown process

    Part 1: Pod Lifecycle Overview

    Complete Pod State Machine

        %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
        stateDiagram-v2
            [*] --> Pending: Pod created
            Pending --> Running: All containers started
            Pending --> Failed: Scheduling failed<br/>Image pull failed<br/>Invalid config
            Running --> Succeeded: All containers<br/>completed successfully<br/>(restartPolicy: Never/OnFailure)
            Running --> Failed: Container failed<br/>and won't restart<br/>Pod deleted during run
            Running --> Running: Container restarted<br/>(restartPolicy: Always/OnFailure)
            Succeeded --> [*]: Pod cleanup
            Failed --> [*]: Pod cleanup
            Running --> Terminating: Delete request<br/>received
            Terminating --> Succeeded: Graceful shutdown<br/>successful
            Terminating --> Failed: Force termination<br/>after grace period

            note right of Pending
                Pod accepted by cluster
                - Waiting for scheduling
                - Pulling images
                - Starting init containers
                - Creating container runtime
            end note

            note right of Running
                Pod is executing
                - At least 1 container running
                - Could be starting/restarting
                - Application serving traffic
                - Health checks active
            end note

            note right of Succeeded
                All containers terminated successfully
                - Exit code 0
                - Will not be restarted
                - Job/CronJob completed
            end note

            note right of Failed
                Pod terminated in failure
                - Non-zero exit code
                - OOMKilled
                - Exceeded restart limit
                - Node failure
            end note

            note right of Terminating
                Pod shutting down
                - SIGTERM sent
                - Grace period active
                - Endpoints removed
                - Cleanup in progress
            end note

    Pod Creation to Running Flow

        flowchart TD
            Start([kubectl apply -f pod.yaml]) --> APIServer[API Server<br/>Validates YAML<br/>Writes to etcd]
            APIServer --> Scheduler{Scheduler finds<br/>suitable node?}
            Scheduler -->|No| PendingNoNode[Status: Pending<br/>Reason: Unschedulable<br/>- Insufficient resources<br/>- Node selector mismatch<br/>- Taints/tolerations]
            Scheduler -->|Yes| AssignNode[Pod assigned to Node<br/>Update: spec.nodeName]
            AssignNode --> Kubelet[Kubelet on target node<br/>receives Pod spec]
            Kubelet --> PullImages{Pull container<br/>images}
            PullImages -->|Failed| ImagePullError[Status: Pending<br/>Reason: ImagePullBackOff<br/>- Image doesn't exist<br/>- Registry auth failed<br/>- Network issues]
            PullImages -->|Success| InitContainers{Init containers<br/>defined?}
            InitContainers -->|Yes| RunInit[Run init containers<br/>sequentially]
            InitContainers -->|No| CreateContainers
            RunInit --> InitSuccess{All init<br/>containers<br/>succeeded?}
            InitSuccess -->|No| InitFailed[Status: Init:Error<br/>or Init:CrashLoopBackOff]
            InitSuccess -->|Yes| CreateContainers[Create main containers<br/>Setup networking<br/>Mount volumes]
            CreateContainers --> StartContainers[Start all containers<br/>in Pod]
            StartContainers --> HealthChecks{Startup probe<br/>defined?}
            HealthChecks -->|Yes| StartupProbe[Execute startup probe]
            HealthChecks -->|No| Running
            StartupProbe --> StartupResult{Probe<br/>passed?}
            StartupResult -->|No| ProbeFailed[Container not ready<br/>If fails too long:<br/>CrashLoopBackOff]
            StartupResult -->|Yes| Running[Status: Running<br/>- Container ready<br/>- Liveness probe active<br/>- Readiness probe active]
            Running --> ServingTraffic[Pod receives traffic<br/>Added to Service endpoints]

            style PendingNoNode fill:#78350f,stroke:#f59e0b
            style ImagePullError fill:#7f1d1d,stroke:#ef4444
            style InitFailed fill:#7f1d1d,stroke:#ef4444
            style Running fill:#064e3b,stroke:#10b981
            style ServingTraffic fill:#064e3b,stroke:#10b981

    Part 2: Pod Creation Sequence

    API Server to Kubelet Communication

        sequenceDiagram
            participant User as Developer
            participant API as API Server
            participant ETCD as etcd
            participant Sched as Scheduler
            participant Kubelet as Kubelet (Node)
            participant Runtime as Container Runtime
            participant Reg as Container Registry

            User->>API: kubectl apply -f pod.yaml
            Note over API: Validate Pod spec<br/>- Required fields<br/>- Resource limits<br/>- Security context
            API->>ETCD: Write Pod object<br/>Status: Pending<br/>nodeName:
            ETCD-->>API: Acknowledged
            API-->>User: Pod created

            Note over Sched: Watch for unscheduled Pods
            Sched->>API: List Pods with nodeName=""
            API-->>Sched: Pod list
            Note over Sched: Score nodes:<br/>- CPU/Memory available<br/>- Affinity rules<br/>- Taints/Tolerations<br/>Best node: node-1
            Sched->>API: Bind Pod to node-1
            API->>ETCD: Update Pod.spec.nodeName = "node-1"

            Note over Kubelet: Watch for Pods on node-1
            Kubelet->>API: Get Pod specifications
            API-->>Kubelet: Pod details
            Kubelet->>Runtime: Pull image: nginx:1.21
            Runtime->>Reg: Pull nginx:1.21
            Reg-->>Runtime: Image layers
            Note over Runtime: Extract and cache image
            Kubelet->>Runtime: Create container<br/>with Pod spec config
            Runtime-->>Kubelet: Container created
            Kubelet->>Runtime: Start container
            Runtime-->>Kubelet: Container started
            Kubelet->>API: Update Pod Status:<br/>Phase: Running<br/>containerStatuses: ready
            API->>ETCD: Save Pod status
            Kubelet->>Kubelet: Start health checks<br/>- Startup probe<br/>- Readiness probe<br/>- Liveness probe
            Note over Kubelet,Runtime: Continuous monitoring<br/>and health checking

    Part 3: Init Containers

    Init containers run before app containers and must complete successfully before the main containers start. ...
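The ordering rule for init containers (run sequentially, and start the main containers only if every init step succeeds) can be modelled in a few lines of Go. This is a toy simulation with hypothetical names (`step`, `startPod`, `succeed`, `fail`), not kubelet code; in particular it ignores restart policies and back-off, which a real cluster applies when an init container fails.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a stand-in for a container: a name plus a function that returns
// nil when the container exits with code 0.
type step struct {
	name string
	run  func() error
}

func succeed() error { return nil }
func fail() error    { return errors.New("exit code 1") }

// startPod runs init steps one at a time, in order. The first failure stops
// the sequence and the app steps are never started.
func startPod(initSteps, appSteps []step) (status string, ran []string) {
	for _, s := range initSteps {
		ran = append(ran, s.name)
		if err := s.run(); err != nil {
			return "Init:Error", ran
		}
	}
	for _, s := range appSteps {
		ran = append(ran, s.name)
		if err := s.run(); err != nil {
			return "Error", ran
		}
	}
	return "Running", ran
}

func main() {
	status, ran := startPod(
		[]step{{"wait-for-db", succeed}, {"run-migrations", fail}},
		[]step{{"app", succeed}},
	)
	fmt.Println(status, ran) // prints "Init:Error [wait-for-db run-migrations]"
}
```

Note that "app" never appears in the list of steps that ran, which mirrors the "All init containers succeeded?" decision in the creation flow.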

    January 23, 2025 · 11 min · Rafiul Alam

    Monitoring & Alerting: Metrics to Action Flow

    Introduction

    Effective monitoring and alerting are critical for maintaining reliable systems. Without proper observability, you’re flying blind when issues occur in production.

    This guide visualizes the complete monitoring and alerting flow:

    - Metrics Collection: From instrumentation to storage
    - Alert Evaluation: When metrics cross thresholds
    - Notification Routing: Getting alerts to the right people
    - Incident Response: From alert to resolution
    - The Three Pillars: Metrics, Logs, and Traces

    Part 1: Complete Monitoring & Alerting Flow

    End-to-End Overview

        %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%%
        flowchart TD
            subgraph Apps[Application Layer]
                App1[Application 1<br/>Exposes /metrics]
                App2[Application 2<br/>Exposes /metrics]
                App3[Application 3<br/>Exposes /metrics]
            end
            subgraph Collection[Metrics Collection]
                Prometheus[Prometheus Server<br/>Scrapes metrics every 15s<br/>Stores time-series data]
            end
            subgraph Rules[Alert Rules Engine]
                Rules1[Alert Rule 1:<br/>High Error Rate<br/>rate > 5%]
                Rules2[Alert Rule 2:<br/>High Latency<br/>p95 > 500ms]
                Rules3[Alert Rule 3:<br/>Low Availability<br/>uptime < 99%]
            end
            subgraph AlertMgr[Alert Manager]
                Routing[Alert Routing<br/>- Group similar alerts<br/>- Deduplicate<br/>- Apply silences]
                Throttle[Throttling<br/>- Rate limiting<br/>- Grouping window<br/>- Repeat interval]
            end
            subgraph Notification[Notification Channels]
                PagerDuty[PagerDuty<br/>Critical alerts<br/>On-call engineer]
                Slack[Slack<br/>Warning alerts<br/>Team channel]
                Email[Email<br/>Info alerts<br/>Distribution list]
            end
            subgraph Response[Incident Response]
                OnCall[On-Call Engineer<br/>Receives alert]
                Investigate[Investigate Issue<br/>- Check dashboards<br/>- Review logs<br/>- Analyze traces]
                Fix[Apply Fix<br/>- Deploy patch<br/>- Scale resources<br/>- Restart service]
                Resolve[Resolve Alert<br/>Metrics return to normal]
            end

            App1 --> |Scrape /metrics| Prometheus
            App2 --> |Scrape /metrics| Prometheus
            App3 --> |Scrape /metrics| Prometheus
            Prometheus --> |Evaluate every 1m| Rules1
            Prometheus --> |Evaluate every 1m| Rules2
            Prometheus --> |Evaluate every 1m| Rules3
            Rules1 --> |Trigger if true| Routing
            Rules2 --> |Trigger if true| Routing
            Rules3 --> |Trigger if true| Routing
            Routing --> Throttle
            Throttle --> |Severity: Critical| PagerDuty
            Throttle --> |Severity: Warning| Slack
            Throttle --> |Severity: Info| Email
            PagerDuty --> OnCall
            Slack --> OnCall
            OnCall --> Investigate
            Investigate --> Fix
            Fix --> Resolve
            Resolve -.->|Metrics normalized| Prometheus

            style Prometheus fill:#1e3a8a,stroke:#3b82f6
            style Routing fill:#1e3a8a,stroke:#3b82f6
            style PagerDuty fill:#7f1d1d,stroke:#ef4444
            style Resolve fill:#064e3b,stroke:#10b981

    Part 2: Metrics Collection Process

    Prometheus Scrape Flow

        sequenceDiagram
            participant App as Application
            participant Metrics as /metrics Endpoint
            participant Prom as Prometheus
            participant TSDB as Time-Series Database
            participant Grafana as Grafana Dashboard

            Note over App: Application running<br/>Incrementing counters<br/>Recording histograms
            App->>Metrics: Update in-memory metrics<br/>http_requests_total++<br/>http_request_duration_seconds

            loop Every 15 seconds
                Prom->>Metrics: HTTP GET /metrics
                Metrics-->>Prom: Return current metrics<br/># TYPE http_requests_total counter<br/>http_requests_total{method="GET",status="200"} 1523<br/>http_requests_total{method="GET",status="500"} 12
                Note over Prom: Parse metrics<br/>Add labels:<br/>- job="myapp"<br/>- instance="pod-1:8080"<br/>- timestamp
                Prom->>TSDB: Store time-series data<br/>Append to existing series<br/>Create new series if needed
                Note over TSDB: Compress and store:<br/>http_requests_total{job="myapp", instance="pod-1:8080", method="GET", status="200"} = 1523 @ timestamp
            end

            Note over Prom,TSDB: Data retained for 15 days<br/>Older data deleted automatically

            Grafana->>Prom: PromQL Query:<br/>rate(http_requests_total[5m])
            Prom->>TSDB: Fetch time-series data<br/>for last 5 minutes
            TSDB-->>Prom: Return raw data points
            Note over Prom: Calculate rate:<br/>Δ value / Δ time
            Prom-->>Grafana: Return computed values
            Grafana->>Grafana: Render graph<br/>Display on dashboard

    Metrics Instrumentation Example

        package main

        import (
            "log"
            "net/http"
            "strconv"
            "time"

            "github.com/prometheus/client_golang/prometheus"
            "github.com/prometheus/client_golang/prometheus/promhttp"
        )

        // Define metrics
        var (
            // Counter - only goes up
            httpRequestsTotal = prometheus.NewCounterVec(
                prometheus.CounterOpts{
                    Name: "http_requests_total",
                    Help: "Total number of HTTP requests",
                },
                []string{"method", "endpoint", "status"},
            )

            // Histogram - for request durations
            httpRequestDuration = prometheus.NewHistogramVec(
                prometheus.HistogramOpts{
                    Name:    "http_request_duration_seconds",
                    Help:    "HTTP request duration in seconds",
                    Buckets: prometheus.DefBuckets, // 0.005, 0.01, 0.025, 0.05, ...
                },
                []string{"method", "endpoint"},
            )

            // Gauge - current value (can go up or down)
            activeConnections = prometheus.NewGauge(
                prometheus.GaugeOpts{
                    Name: "active_connections",
                    Help: "Number of active connections",
                },
            )
        )

        func init() {
            // Register metrics with Prometheus
            prometheus.MustRegister(httpRequestsTotal)
            prometheus.MustRegister(httpRequestDuration)
            prometheus.MustRegister(activeConnections)
        }

        func trackMetrics(method, endpoint string, statusCode int, duration time.Duration) {
            // Increment request counter
            httpRequestsTotal.WithLabelValues(
                method,
                endpoint,
                strconv.Itoa(statusCode),
            ).Inc()

            // Record request duration
            httpRequestDuration.WithLabelValues(
                method,
                endpoint,
            ).Observe(duration.Seconds())
        }

        // processRequest is a placeholder for the application logic, which the
        // example does not define.
        func processRequest(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
        }

        func handleRequest(w http.ResponseWriter, r *http.Request) {
            start := time.Now()

            // Increment active connections
            activeConnections.Inc()
            defer activeConnections.Dec()

            // Your application logic here
            processRequest(w, r)

            // Track metrics. The status is recorded as 200 unconditionally here;
            // wrap the ResponseWriter to capture the status actually written.
            duration := time.Since(start)
            trackMetrics(r.Method, r.URL.Path, http.StatusOK, duration)
        }

        func main() {
            // Expose /metrics endpoint for Prometheus
            http.Handle("/metrics", promhttp.Handler())

            // Application endpoints
            http.HandleFunc("/api/users", handleRequest)

            log.Fatal(http.ListenAndServe(":8080", nil))
        }

    Part 3: Alert Evaluation and Firing

    Alert Rule Decision Tree

        flowchart TD
            Start([Prometheus evaluates<br/>alert rules every 1m]) --> Query[Execute PromQL query:<br/>rate5m > threshold]
            Query --> Result{Query<br/>returns data?}
            Result -->|No data| Inactive[Alert: Inactive<br/>No time-series match<br/>No notification]
            Result -->|Data exists| CheckCondition{Condition<br/>true?}
            CheckCondition -->|False| Resolved{Alert was<br/>firing?}
            Resolved -->|Yes| SendResolved[Alert: Resolved<br/>Send resolved notification<br/>Green alert to channel]
            Resolved -->|No| Inactive
            CheckCondition -->|True| Duration{Condition true<br/>for 'for' duration?}
            Duration -->|No| Pending[Alert: Pending<br/>Waiting for duration<br/>e.g., 5 minutes<br/>No notification yet]
            Pending -.->|Check again| Start
            Duration -->|Yes| Firing[Alert: Firing 🔥<br/>Send to Alertmanager]
            Firing --> Dedupe{Already<br/>firing?}
            Dedupe -->|Yes| Throttle[Respect repeat_interval<br/>e.g., every 4 hours<br/>Don't spam]
            Dedupe -->|No| NewAlert[New alert!<br/>Send notification immediately]
            Throttle --> TimeCheck{Repeat interval<br/>elapsed?}
            TimeCheck -->|No| Wait[Wait...<br/>Don't send yet]
            TimeCheck -->|Yes| Reminder[Send reminder<br/>notification]
            NewAlert --> AlertManager[Send to Alertmanager]
            Reminder --> AlertManager
            AlertManager --> Route[Route based on labels<br/>Apply routing rules]

            style Inactive fill:#1e3a8a,stroke:#3b82f6
            style Pending fill:#78350f,stroke:#f59e0b
            style Firing fill:#7f1d1d,stroke:#ef4444
            style SendResolved fill:#064e3b,stroke:#10b981
            style NewAlert fill:#7f1d1d,stroke:#ef4444

    Alert Rule Configuration

        # prometheus-rules.yaml
        groups:
          - name: application_alerts
            interval: 60s  # Evaluate every 60 seconds
            rules:
              # High Error Rate Alert
              # Both sides are aggregated per instance before dividing. Without the
              # sum, each 5xx series is divided by itself and the ratio is always 1.
              - alert: HighErrorRate
                expr: |
                  (
                    sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
                    /
                    sum by (instance) (rate(http_requests_total[5m]))
                  ) > 0.05
                for: 5m  # Must be true for 5 minutes before firing
                labels:
                  severity: critical
                  team: backend
                annotations:
                  summary: "High error rate on {{ $labels.instance }}"
                  description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
                  dashboard: "https://grafana.example.com/d/app"

              # High Latency Alert
              - alert: HighLatency
                expr: |
                  histogram_quantile(0.95,
                    rate(http_request_duration_seconds_bucket[5m])
                  ) > 0.5
                for: 10m
                labels:
                  severity: warning
                  team: backend
                annotations:
                  summary: "High latency on {{ $labels.instance }}"
                  description: "P95 latency is {{ $value }}s (threshold: 0.5s)"

              # Service Down Alert
              - alert: ServiceDown
                expr: up{job="myapp"} == 0
                for: 1m
                labels:
                  severity: critical
                  team: sre
                annotations:
                  summary: "Service {{ $labels.instance }} is down"
                  description: "Cannot scrape metrics from {{ $labels.instance }}"

              # Memory Usage Alert
              - alert: HighMemoryUsage
                expr: |
                  (
                    container_memory_usage_bytes{pod=~"myapp-.*"}
                    /
                    container_spec_memory_limit_bytes{pod=~"myapp-.*"}
                  ) > 0.90
                for: 5m
                labels:
                  severity: warning
                  team: platform
                annotations:
                  summary: "High memory usage on {{ $labels.pod }}"
                  description: "Memory usage is {{ $value | humanizePercentage }} of limit"

    Part 4: Alert Routing and Notification

    Alertmanager Processing Flow

        flowchart TD
            Start([Alert received from<br/>Prometheus]) --> Inhibit{Inhibition<br/>rules match?}
            Inhibit -->|Yes| Suppress[Alert suppressed<br/>Higher priority alert<br/>already firing<br/>e.g., NodeDown inhibits<br/>all pod alerts on that node]
            Inhibit -->|No| Silence{Silence<br/>matches?}
            Silence -->|Yes| Silenced[Alert silenced<br/>Manual suppression<br/>During maintenance window<br/>No notification sent]
            Silence -->|No| Group[Group alerts<br/>By: cluster, alertname<br/>Combine similar alerts]
            Group --> Wait[Wait for group_wait<br/>Default: 30s<br/>Collect more alerts]
            Wait --> Batch[Create notification batch<br/>Multiple alerts grouped<br/>Single notification]
            Batch --> Route{Match<br/>routing tree?}
            Route --> Critical{severity:critical?}
            Route --> Warning{severity:warning?}
            Route --> Default[Default route]
            Critical --> Team1{team:backend?}
            Team1 -->|Yes| PagerDuty[PagerDuty<br/>Page on-call engineer<br/>Escalate if no ack<br/>in 5 minutes]
            Team1 -->|No| Team2[Other team's PagerDuty]
            Warning --> SlackRoute{team:backend?}
            SlackRoute -->|Yes| Slack[Slack #backend-alerts<br/>Post message<br/>@here mention]
            SlackRoute -->|No| SlackOther[Other team's Slack]
            Default --> Email[Email<br/>Send to mailing list<br/>Low priority]
            PagerDuty --> Track[Track notification<br/>Set repeat_interval timer<br/>4 hours until resolved]
            Slack --> Track
            Email --> Track
            Track --> Resolved{Alert<br/>resolved?}
            Resolved -->|No| RepeatCheck{repeat_interval<br/>elapsed?}
            RepeatCheck -->|Yes| Resend[Resend notification<br/>Reminder that alert<br/>still firing]
            Resend -.-> Track
            RepeatCheck -->|No| Wait2[Wait...]
            Wait2 -.-> Resolved
            Resolved -->|Yes| SendResolved[Send resolved notification<br/>All is well ✓]

            style Suppress fill:#1e3a8a,stroke:#3b82f6
            style Silenced fill:#1e3a8a,stroke:#3b82f6
            style PagerDuty fill:#7f1d1d,stroke:#ef4444
            style SendResolved fill:#064e3b,stroke:#10b981

    Alertmanager Configuration

        # alertmanager.yaml
        global:
          resolve_timeout: 5m
          slack_api_url: 'https://hooks.slack.com/services/XXX'
          pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

        # Inhibition rules - suppress alerts when higher priority alert is firing
        inhibit_rules:
          # If node is down, don't alert on pods on that node
          - source_match:
              alertname: 'NodeDown'
            target_match:
              alertname: 'PodDown'
            equal: ['node']

          # If entire cluster is down, don't alert on individual services
          - source_match:
              severity: 'critical'
              alertname: 'ClusterDown'
            target_match_re:
              severity: 'warning|info'
            equal: ['cluster']

        # Route tree - how to send alerts
        route:
          receiver: 'default-email'
          group_by: ['alertname', 'cluster', 'service']
          group_wait: 30s       # Wait 30s to collect more alerts
          group_interval: 5m    # Send updates every 5m for grouped alerts
          repeat_interval: 4h   # Resend if still firing after 4h

          routes:
            # Critical alerts to PagerDuty
            - match:
                severity: critical
              receiver: 'pagerduty-critical'
              group_wait: 10s   # Page quickly for critical
              continue: true    # Also send to Slack

            - match:
                severity: critical
              receiver: 'slack-critical'

            # Warning alerts to Slack
            - match:
                severity: warning
              receiver: 'slack-warnings'
              group_wait: 1m

            # Team-specific routing
            - match:
                team: backend
              receiver: 'backend-team'

            - match:
                team: frontend
              receiver: 'frontend-team'

        # Receivers - where to send alerts
        receivers:
          - name: 'default-email'
            email_configs:
              - to: '[email protected]'
                headers:
                  Subject: '{{ .GroupLabels.alertname }}: {{ .Status | toUpper }}'

          - name: 'pagerduty-critical'
            pagerduty_configs:
              - service_key: 'your-pagerduty-key'
                description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
                severity: 'critical'

          - name: 'slack-critical'
            slack_configs:
              - channel: '#alerts-critical'
                title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
                text: |
                  {{ range .Alerts }}
                  *Alert:* {{ .Annotations.summary }}
                  *Description:* {{ .Annotations.description }}
                  *Severity:* {{ .Labels.severity }}
                  *Dashboard:* {{ .Annotations.dashboard }}
                  {{ end }}
                color: 'danger'
                send_resolved: true

          - name: 'slack-warnings'
            slack_configs:
              - channel: '#alerts-warning'
                title: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
                text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
                color: 'warning'

          - name: 'backend-team'
            slack_configs:
              - channel: '#backend-alerts'

          # Assumed receiver, added so the 'frontend-team' route above resolves;
          # the channel name is a placeholder.
          - name: 'frontend-team'
            slack_configs:
              - channel: '#frontend-alerts'

    Part 5: Incident Response Workflow

    From Alert to Resolution

        sequenceDiagram
            participant Alert as Alert System
            participant PD as PagerDuty
            participant Eng as On-Call Engineer
            participant Dash as Grafana Dashboard
            participant Logs as Log Aggregator
            participant Trace as Tracing System
            participant K8s as Kubernetes
            participant Incident as Incident Channel

            Alert->>PD: 🚨 Critical Alert<br/>HighErrorRate firing<br/>Service: myapp<br/>Error rate: 12%
            PD->>Eng: 📱 Phone call + SMS + Push<br/>Incident created
            Note over Eng: Engineer woken up<br/>at 3 AM 😴
            Eng->>PD: Acknowledge incident<br/>Stop escalation
            Eng->>Incident: Create #incident-123<br/>Post initial status
            Note over Eng: Open laptop<br/>Start investigation

            Eng->>Dash: Open dashboard<br/>Check error rate graph
            Dash-->>Eng: Graph shows spike<br/>Started 5 minutes ago<br/>Only affects /api/payment
            Eng->>Logs: Query logs:<br/>level=error AND<br/>path=/api/payment
            Logs-->>Eng: Errors:<br/>"Database connection timeout"<br/>"Cannot connect to db:5432"
            Note over Eng: Database issue suspected

            Eng->>K8s: kubectl get pods -n database
            K8s-->>Eng: postgres-0: CrashLoopBackOff<br/>Restart count: 8
            Eng->>K8s: kubectl describe pod postgres-0
            K8s-->>Eng: Event: Liveness probe failed<br/>Event: OOMKilled<br/>Memory: 2.1Gi / 2Gi limit
            Note over Eng: Database OOMKilled!<br/>Need more memory

            Eng->>Incident: Update: Database OOM<br/>Action: Increasing memory limit
            Eng->>K8s: kubectl edit statefulset postgres<br/>Change: 2Gi → 4Gi memory
            K8s-->>Eng: Statefulset updated
            Note over K8s: Rolling restart<br/>postgres-0 recreated<br/>with 4Gi memory
            Eng->>K8s: kubectl get pods -n database -w<br/>Watch pod status
            K8s-->>Eng: postgres-0: Running ✓<br/>Ready: 1/1

            Note over Eng: Wait for metrics<br/>to normalize
            Eng->>Dash: Refresh dashboard
            Dash-->>Eng: Error rate: 0.3% ✓<br/>Latency: normal ✓<br/>Back to baseline
            Note over Alert: Metrics normalized<br/>Alert conditions false
            Alert->>PD: ✅ Alert resolved
            PD->>Eng: Incident auto-resolved
            Eng->>Incident: Incident resolved ✓<br/>Root cause: DB OOM<br/>Fix: Increased memory<br/>Duration: 23 minutes
            Eng->>Eng: Create follow-up tasks:<br/>1. Set memory alerts<br/>2. Review query performance<br/>3. Consider connection pooling
            Note over Eng: Back to sleep 😴<br/>Post-mortem tomorrow

    Part 6: The Three Pillars of Observability

    Metrics, Logs, and Traces Integration

        flowchart TD
            Issue([Production Issue Detected]) --> Which{Which pillar<br/>to start with?}
            Which --> Metrics[1️⃣ METRICS<br/>What is broken?]
            Which --> Logs[2️⃣ LOGS<br/>Why is it broken?]
            Which --> Traces[3️⃣ TRACES<br/>Where is it broken?]

            Metrics --> M1[Check Grafana<br/>- Error rate spiking?<br/>- Latency increased?<br/>- Which service?<br/>- Which endpoint?]
            M1 --> M2[Identify:<br/>✓ Service: payment-api<br/>✓ Endpoint: /checkout<br/>✓ Metric: p95 latency 5000ms<br/>✓ Time: Started 10m ago]
            M2 --> UseTrace{Need to see<br/>request flow?}
            UseTrace -->|Yes| Traces

            Logs --> L1[Search logs in ELK/Loki<br/>service=payment-api AND<br/>path=/checkout AND<br/>level=error]
            L1 --> L2[Find errors:<br/>"Database query timeout"<br/>"SELECT * FROM orders<br/>WHERE user_id=123<br/>execution time: 5200ms"]
            L2 --> L3[Context found:<br/>✓ Specific query is slow<br/>✓ Affecting user_id=123<br/>✓ No index on user_id?]
            L3 --> UseMetrics{Verify with<br/>metrics?}
            UseMetrics -->|Yes| Metrics

            Traces --> T1[Open Jaeger/Tempo<br/>Search trace_id or<br/>service=payment-api]
            T1 --> T2[View distributed trace:<br/>┌─ payment-api: 5100ms<br/>│ ├─ auth-svc: 20ms ✓<br/>│ ├─ inventory-svc: 30ms ✓<br/>│ └─ database: 5000ms ❌<br/>│ └─ query: SELECT * FROM orders]
            T2 --> T3[Identify bottleneck:<br/>✓ Database query is slow<br/>✓ Affects only /checkout<br/>✓ Other services healthy]
            T3 --> UseLogs{Need error<br/>details?}
            UseLogs -->|Yes| Logs

            M2 --> RootCause[Combine insights:<br/>METRICS: Latency spike on /checkout<br/>LOGS: Specific query timeout<br/>TRACES: Database is bottleneck]
            L3 --> RootCause
            T3 --> RootCause
            RootCause --> Fix[Root Cause Found:<br/>Missing database index<br/>on orders.user_id<br/>Fix: CREATE INDEX<br/>idx_user_id ON orders]

            style Metrics fill:#1e3a8a,stroke:#3b82f6
            style Logs fill:#78350f,stroke:#f59e0b
            style Traces fill:#064e3b,stroke:#10b981
            style RootCause fill:#064e3b,stroke:#10b981
            style Fix fill:#064e3b,stroke:#10b981

    When to Use Each Pillar

    | Pillar  | Best For                    | Example Questions                                                                 | Tools                       |
    | Metrics | Detecting issues, trends    | Is the service up? What’s the error rate? Is latency increasing?                  | Prometheus, Grafana, Datadog |
    | Logs    | Understanding what happened | What was the error message? Which user was affected? What was the input?          | ELK, Loki, Splunk           |
    | Traces  | Finding bottlenecks         | Which service is slow? Where is the delay? How do requests flow?                  | Jaeger, Tempo, Zipkin       |

    Part 7: Setting Up Effective Alerts

    Alert Quality Framework

        flowchart TD
            Start([New Alert Idea]) --> Question1{Does this require<br/>immediate action?}
            Question1 -->|No| Ticket[Create ticket instead<br/>Not an alert<br/>Review during business hours]
            Question1 -->|Yes| Question2{Can it be<br/>automated away?}
            Question2 -->|Yes| Automate[Build automation<br/>Auto-scaling<br/>Auto-healing<br/>Self-recovery]
            Question2 -->|No| Question3{Is it actionable?}
            Question3 -->|No| Rethink[Rethink the alert<br/>What action should<br/>the engineer take?<br/>If none, not an alert]
            Question3 -->|Yes| Question4{Is the signal<br/>clear?}
            Question4 -->|No| Refine[Refine the threshold<br/>Add 'for' duration<br/>Adjust sensitivity<br/>Reduce false positives]
            Question4 -->|Yes| Question5{Provides enough<br/>context?}
            Question5 -->|No| AddContext[Add context:<br/>- Dashboard link<br/>- Runbook link<br/>- Query to debug<br/>- Recent changes]
            Question5 -->|Yes| Question6{Correct<br/>severity?}
            Question6 -->|No| Severity[Adjust severity:<br/>Critical = Page<br/>Warning = Slack<br/>Info = Email]
            Question6 -->|Yes| GoodAlert[✅ Good Alert!<br/>- Actionable<br/>- Clear signal<br/>- Right severity<br/>- Good context]
            GoodAlert --> Deploy[Deploy alert<br/>Monitor for:<br/>- False positives<br/>- Alert fatigue<br/>- Resolution time]

            style Ticket fill:#1e3a8a,stroke:#3b82f6
            style Automate fill:#064e3b,stroke:#10b981
            style GoodAlert fill:#064e3b,stroke:#10b981
            style Rethink fill:#7f1d1d,stroke:#ef4444

    Part 8: Best Practices

    DO’s and DON’Ts

    ✅ DO: ...
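The Inactive → Pending → Firing progression that the `for` clause produces can be sketched as a small state machine in Go. This is a simplified model under stated assumptions, not Prometheus source code: the type `alert`, the method `evaluate`, and the helper `simulate` are hypothetical names, evaluations are assumed to arrive exactly one minute apart, and resolved-notification handling is left out.

```go
package main

import (
	"fmt"
	"time"
)

type alertState string

const (
	Inactive alertState = "inactive"
	Pending  alertState = "pending"
	Firing   alertState = "firing"
)

// alert tracks one rule whose condition must hold for holdFor before firing.
type alert struct {
	holdFor     time.Duration
	state       alertState
	activeSince time.Time
}

// evaluate applies one rule evaluation at time now and returns the new state.
func (a *alert) evaluate(now time.Time, conditionTrue bool) alertState {
	if !conditionTrue {
		// Any false evaluation resets the alert, including the hold timer.
		a.state = Inactive
		return a.state
	}
	if a.state == Inactive || a.state == "" {
		a.state = Pending
		a.activeSince = now
	}
	if now.Sub(a.activeSince) >= a.holdFor {
		a.state = Firing
	}
	return a.state
}

// simulate feeds one sample per minute into a fresh alert and records the
// state after each evaluation.
func simulate(forMinutes int, samples []bool) []alertState {
	a := &alert{holdFor: time.Duration(forMinutes) * time.Minute, state: Inactive}
	now := time.Date(2025, 1, 23, 3, 0, 0, 0, time.UTC)
	states := make([]alertState, 0, len(samples))
	for _, s := range samples {
		states = append(states, a.evaluate(now, s))
		now = now.Add(time.Minute)
	}
	return states
}

func main() {
	// Condition is false, then true for three evaluations, then false again.
	fmt.Println(simulate(2, []bool{false, true, true, true, false}))
	// prints "[inactive pending pending firing inactive]"
}
```

A single false evaluation drops the alert back to inactive, which is why a `for` duration filters out short spikes at the cost of a delayed first notification.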

    January 23, 2025 · 12 min · Rafiul Alam

    Multi-Environment Pipeline: Dev → Staging → Production

    Introduction Multi-environment pipelines enable safe, progressive deployment of code changes through isolated environments. Each environment serves a specific purpose in validating changes before they reach production users. This guide visualizes the multi-environment deployment flow: Environment Hierarchy: Dev → Staging → Production Environment Isolation: Separate configs, databases, resources Progressive Promotion: Automated testing at each stage Approval Gates: Manual checkpoints for production Configuration Management: Environment-specific settings Part 1: Multi-Environment Architecture Complete Environment Flow %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Dev([👨‍💻 Developer]) --> LocalDev[Local DevelopmentLaptop/Docker DesktopFast iteration] LocalDev --> Push[git push origin feature/new-api] Push --> CI[CI Pipeline TriggeredBuild + Test + Lint] CI --> CIPass{CIPassed?} CIPass -->|No| FixLocal[❌ Fix locallyCheck logsRun tests] FixLocal -.-> LocalDev CIPass -->|Yes| FeatureBranch{Branchtype?} FeatureBranch -->|feature/*| DevEnv[🔧 Dev EnvironmentNamespace: devAuto-deploy on push] FeatureBranch -->|main| StagingEnv[🎯 Staging EnvironmentNamespace: stagingAuto-deploy on merge] subgraph DevEnvironment[Development Environment] DevEnv --> DevConfig[Configuration:- Debug mode ON- Verbose logging- Mock external APIs- Dev database- Minimal replicas: 1] DevConfig --> DevTest[Basic Tests:- Smoke tests- Health checks- Manual QA] DevTest --> DevDone[✅ Dev validatedReady for staging] end DevDone --> MergePR[Merge Pull Requestto main branch] MergePR --> StagingEnv subgraph StagingEnvironment[Staging Environment] StagingEnv --> StagingConfig[Configuration:- Production-like setup- Staging database- Real external APIs test- Replicas: 2-3- 
Resource limits] StagingConfig --> StagingTest[Comprehensive Tests:- Integration tests- E2E tests- Performance tests- Security scans] StagingTest --> StagingResult{All testspassed?} StagingResult -->|No| StagingFail[❌ Staging failedRollback stagingFix issues] StagingFail -.-> FixLocal StagingResult -->|Yes| StagingMonitor[Monitor staging:- Error rates- Performance metrics- User acceptance testing] StagingMonitor --> StagingReady[✅ Staging validatedReady for production] end StagingReady --> ApprovalGate{ManualApprovalRequired} ApprovalGate --> ReviewTeam[Team Lead Review:- Code changes- Test results- Risk assessment- Deployment timing] ReviewTeam --> Approved{Approved?} Approved -->|No| Rejected[❌ RejectedMore testing neededor wrong timing] Approved -->|Yes| ProdEnv[🚀 Production EnvironmentNamespace: productionManual trigger only] subgraph ProductionEnvironment[Production Environment] ProdEnv --> ProdConfig[Configuration:- Production settings- Production database- High availability- Replicas: 5-10- Strict resource limits- Auto-scaling enabled] ProdConfig --> ProdDeploy[Deployment Strategy:- Blue-green or- Canary or- Rolling update] ProdDeploy --> ProdHealth{Productionhealthy?} ProdHealth -->|No| AutoRollback[🚨 Auto-rollbackRevert to previousAlert on-call team] ProdHealth -->|Yes| ProdMonitor[Monitor Production:- Real user metrics- Error rates- Business KPIs- SLO compliance] ProdMonitor --> ProdStable{Stable for15 minutes?} ProdStable -->|No| AutoRollback ProdStable -->|Yes| Success[✅ Deployment Complete!New version liveMonitor continues] end style DevEnv fill:#064e3b,stroke:#10b981 style StagingEnv fill:#78350f,stroke:#f59e0b style ProdEnv fill:#1e3a8a,stroke:#3b82f6 style Success fill:#064e3b,stroke:#10b981 style StagingFail fill:#7f1d1d,stroke:#ef4444 style AutoRollback fill:#7f1d1d,stroke:#ef4444 style Rejected fill:#7f1d1d,stroke:#ef4444 Part 2: Environment Comparison Environment Characteristics %%{init: {'theme':'dark', 'themeVariables': 
{'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% graph TB subgraph Local[🏠 Local Development] LocalProps[Properties:✓ Fast iteration✓ Developer's laptop✓ Docker Compose✓ Mock services✓ Hot reload enabled] LocalData[Data:- SQLite or local DB- Seed data- No real user data- Quick reset] LocalAccess[Access:- localhost only- No authentication- Debug tools enabled] end subgraph Dev[🔧 Development Environment] DevProps[Properties:✓ Shared team env✓ Kubernetes cluster✓ Continuous deployment✓ Latest features✓ Can be unstable] DevData[Data:- Dev database- Synthetic test data- Reset weekly- No PII] DevAccess[Access:- VPN required- Basic auth- All developers- Debug mode ON] end subgraph Staging[🎯 Staging Environment] StagingProps[Properties:✓ Production mirror✓ Same infrastructure✓ Pre-production testing✓ Stable builds only✓ Performance testing] StagingData[Data:- Staging database- Anonymized prod data- Or realistic test data- Refreshed monthly] StagingAccess[Access:- VPN required- OAuth/SSO- Developers + QA- Debug mode OFF] end subgraph Prod[🚀 Production Environment] ProdProps[Properties:✓ Live customer traffic✓ High availability✓ Auto-scaling✓ Disaster recovery✓ Maximum stability] ProdData[Data:- Production database- Real user data- Encrypted at rest- Regular backups] ProdAccess[Access:- Public internet- Full authentication- Limited admin access- Audit logging enabled] end Local --> |git push feature/*| Dev Dev --> |Merge to main| Staging Staging --> |Manual approval| Prod style Local fill:#064e3b,stroke:#10b981 style Dev fill:#064e3b,stroke:#10b981 style Staging fill:#78350f,stroke:#f59e0b style Prod fill:#1e3a8a,stroke:#3b82f6

Environment Configuration Matrix

| Aspect | Local | Dev | Staging | Production |
|---|---|---|---|---|
| Purpose | Development | Feature testing | Pre-production validation | Live users |
| Deployment | Manual | Auto on push | Auto on merge | Manual approval |
| Replicas | 1 | 1-2 | 2-3 | 5-10+ |
| Database | Local SQLite | Shared dev DB | Staging DB (prod-like) | Production DB |
| Resources | Minimal | Low | Medium (prod-like) | High |
| Monitoring | None | Basic | Full | Full + Alerts |
| Debug Mode | Yes | Yes | No | No |
| Logging Level | DEBUG | DEBUG | INFO | WARN/ERROR |
| External APIs | Mocked | Test endpoints | Test endpoints | Production endpoints |
| Data | Seed data | Synthetic | Anonymized | Real user data |
| Access | localhost | VPN + Basic auth | VPN + SSO | Public + Full auth |
| Uptime SLA | N/A | None | None | 99.9%+ |

Part 3: Progressive Promotion Pipeline Promotion Flow with Quality Gates %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart LR subgraph LocalStage[Local Stage] L1[Write Code] L2[Run Unit Tests] L3[Manual Testing] L1 --> L2 --> L3 end subgraph DevStage[Dev Stage] D1[Auto Deploy] D2[Smoke Tests] D3{TestsPass?} D4[Dev Validated ✓] D1 --> D2 --> D3 D3 -->|Yes| D4 D3 -->|No| D5[❌ Fix] D5 -.-> L1 end subgraph StagingStage[Staging Stage] S1[Auto Deploy] S2[Integration Tests] S3[E2E Tests] S4[Performance Tests] S5{All Pass?} S6[Staging Validated ✓] S1 --> S2 --> S3 --> S4 --> S5 S5 -->|Yes| S6 S5 -->|No| S7[❌ Fix] S7 -.-> L1 end subgraph ApprovalStage[Approval Gate] A1[Create Release] A2[Code Review] A3[Change Advisory] A4{Approved?} A1 --> A2 --> A3 --> A4 A4 -->|No| A5[❌ Rejected] A5 -.-> L1 end subgraph ProdStage[Production Stage] P1[Manual Deploy] P2[Canary 10%] P3{Healthy?} P4[Increase to 50%] P5{Healthy?} P6[Complete 100%] P7[Monitor] P8[Success ✓] P1 --> P2 --> P3 P3 -->|Yes| P4 --> P5 P5 -->|Yes| P6 --> P7 --> P8 P3 -->|No| P9[🚨 Rollback] P5 -->|No| P9 end L3 --> |git push| D1 D4 --> |Merge PR| S1 S6 --> A1 A4 -->|Yes| P1 style L3 fill:#064e3b,stroke:#10b981 style D4 fill:#064e3b,stroke:#10b981 style S6 fill:#064e3b,stroke:#10b981 style P8 
fill:#064e3b,stroke:#10b981 style D5 fill:#7f1d1d,stroke:#ef4444 style S7 fill:#7f1d1d,stroke:#ef4444 style P9 fill:#7f1d1d,stroke:#ef4444 Part 4: Environment-Specific Configuration Configuration Management Strategy %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Application needs config]) --> Method{ConfigMethod?} Method --> EnvVars[Environment Variables] Method --> ConfigMaps[Kubernetes ConfigMaps] Method --> Secrets[Kubernetes Secrets] EnvVars --> EnvExample[Examples:- NODE_ENV=production- LOG_LEVEL=info- FEATURE_FLAGS=true] ConfigMaps --> CMExample[Examples:- app-config.yaml- nginx.conf- application.properties] Secrets --> SecretExample[Examples:- DATABASE_PASSWORD- API_KEYS- TLS certificates] EnvExample --> Override{Override perenvironment?} CMExample --> Override SecretExample --> Override Override --> DevOverride[Dev Environment:DEBUG=trueDB_HOST=dev-dbREPLICAS=1CACHE_TTL=60s] Override --> StagingOverride[Staging Environment:DEBUG=falseDB_HOST=staging-dbREPLICAS=3CACHE_TTL=300s] Override --> ProdOverride[Production Environment:DEBUG=falseDB_HOST=prod-dbREPLICAS=10CACHE_TTL=600s] DevOverride --> Inject[Inject at deployment:kubectl apply -f k8s/dev/- deployment.yaml- configmap.yaml- secrets.yaml] StagingOverride --> Inject ProdOverride --> Inject style EnvVars fill:#1e3a8a,stroke:#3b82f6 style ConfigMaps fill:#1e3a8a,stroke:#3b82f6 style Secrets fill:#7f1d1d,stroke:#ef4444

Kubernetes Configuration Example

    # k8s/base/deployment.yaml (Common base)
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
            - name: myapp
              image: myapp:latest  # Overridden per environment
              ports:
                - containerPort: 8080
              envFrom:
                - configMapRef:
                    name: myapp-config
                - secretRef:
                    name: myapp-secrets
              resources:  # Overridden per environment
                requests:
                  memory: "128Mi"
                  cpu: "100m"
                limits:
                  memory: "256Mi"
                  cpu: "200m"
    ---
    # k8s/base/kustomization.yaml
    # Lets the overlays reference the base as a directory
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - deployment.yaml
    ---
    # k8s/dev/configmap.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: myapp-config
      namespace: dev
    data:
      NODE_ENV: "development"
      LOG_LEVEL: "debug"
      DATABASE_HOST: "postgres.dev.svc.cluster.local"
      REDIS_HOST: "redis.dev.svc.cluster.local"
      FEATURE_NEW_UI: "true"
      FEATURE_BETA_API: "true"
    ---
    # k8s/staging/configmap.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: myapp-config
      namespace: staging
    data:
      NODE_ENV: "staging"
      LOG_LEVEL: "info"
      DATABASE_HOST: "postgres.staging.svc.cluster.local"
      REDIS_HOST: "redis.staging.svc.cluster.local"
      FEATURE_NEW_UI: "true"
      FEATURE_BETA_API: "false"
    ---
    # k8s/production/configmap.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: myapp-config
      namespace: production
    data:
      NODE_ENV: "production"
      LOG_LEVEL: "warn"
      DATABASE_HOST: "postgres.production.svc.cluster.local"
      REDIS_HOST: "redis.production.svc.cluster.local"
      FEATURE_NEW_UI: "false"  # Gradual rollout
      FEATURE_BETA_API: "false"
    ---
    # k8s/dev/kustomization.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    namespace: dev
    resources:
      - ../base  # Directory reference: kustomize rejects files outside the overlay root by default
      - configmap.yaml
      - secrets.yaml
    images:
      - name: myapp
        newTag: dev-abc123
    replicas:
      - name: myapp
        count: 1
    patches:
      - patch: |-
          - op: replace
            path: /spec/template/spec/containers/0/resources/requests/memory
            value: 128Mi
          - op: replace
            path: /spec/template/spec/containers/0/resources/limits/memory
            value: 256Mi
        target:
          kind: Deployment
          name: myapp
    ---
    # k8s/production/kustomization.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    namespace: production
    resources:
      - ../base
      - configmap.yaml
      - secrets.yaml
    images:
      - name: myapp
        newTag: v1.2.3
    replicas:
      - name: myapp
        count: 10
    patches:
      - patch: |-
          - op: replace
            path: /spec/template/spec/containers/0/resources/requests/memory
            value: 512Mi
          - op: replace
            path: /spec/template/spec/containers/0/resources/limits/memory
            value: 1Gi
          - op: replace
            path: /spec/template/spec/containers/0/resources/requests/cpu
            value: 500m
          - op: replace
            path: /spec/template/spec/containers/0/resources/limits/cpu
            value: 1000m
        target:
          kind: Deployment
          name: myapp

Part 5: Database Migration Strategy Multi-Environment Database Flow %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant Dev as Developer participant DevDB as Dev Database participant StagingDB as Staging Database participant ProdDB as Production Database participant Migration as Migration Tool Note over Dev: Write migration:001_add_users_table.sql Dev->>DevDB: Run migration locallyCREATE TABLE users... DevDB-->>Dev: Migration applied ✓ Dev->>Dev: Test applicationwith new schema Dev->>Dev: git push feature/add-users Note over DevDB: CI/CD Pipeline triggered Dev->>DevDB: Auto-run migrationsin dev environment DevDB-->>Dev: Dev DB updated ✓ Note over Dev: Create Pull RequestMerge to main Dev->>StagingDB: Trigger staging deployment Note over Migration,StagingDB: Pre-deployment hook Migration->>StagingDB: Backup databasepg_dump > backup.sql Migration->>StagingDB: Run migrations001_add_users_table.sql StagingDB-->>Migration: Migration applied ✓ Note over StagingDB: Deploy applicationTest with new schema alt Migration Failed Migration->>StagingDB: Rollback migrationRestore from backup StagingDB-->>Migration: Rolled back end Note over Dev: Manual approvalfor production Dev->>ProdDB: Trigger production deployment Note over Migration,ProdDB: Pre-deployment steps Migration->>ProdDB: Full database backupSnapshot created Migration->>ProdDB: Check migration statusSELECT version FROM schema_migrations ProdDB-->>Migration: Current version: 000 Migration->>ProdDB: 
Run migrationsin transaction Note over Migration,ProdDB: BEGIN;CREATE TABLE users;INSERT INTO schema_migrationsVALUES ('001');COMMIT; ProdDB-->>Migration: Migration successful ✓ Note over ProdDB: Deploy new applicationversion alt Production Issues Migration->>ProdDB: Rollback migrationRun down migration:DROP TABLE users; Note over ProdDB: Deploy previousapplication version end Migration->>ProdDB: Verify data integrityCheck constraints ProdDB-->>Migration: All checks passed ✓ Note over Dev,ProdDB: Production updated successfully

Part 6: Multi-Environment CI/CD Pipeline

Complete Pipeline Configuration

    # .github/workflows/multi-env-deploy.yml
    name: Multi-Environment Deployment

    on:
      push:
        branches:
          - main
          - develop
      pull_request:
        branches:
          - main

    env:
      REGISTRY: ghcr.io
      IMAGE_NAME: ${{ github.repository }}

    jobs:
      # CI - Same for all environments
      build-and-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - name: Run linting
            run: npm run lint
          - name: Run unit tests
            run: npm test
          - name: Build Docker image
            # Tag with the registry prefix so the push below targets ghcr.io
            run: docker build -t $REGISTRY/$IMAGE_NAME:${{ github.sha }} .
          - name: Run integration tests
            run: docker-compose -f docker-compose.test.yml up --abort-on-container-exit
          - name: Push image
            run: |
              echo "${{ secrets.GITHUB_TOKEN }}" | docker login $REGISTRY -u ${{ github.actor }} --password-stdin
              docker push $REGISTRY/$IMAGE_NAME:${{ github.sha }}

      # Deploy to Dev - Auto on feature branches
      deploy-dev:
        needs: build-and-test
        if: github.ref != 'refs/heads/main'
        runs-on: ubuntu-latest
        environment:
          name: development
          url: https://dev.example.com
        steps:
          - uses: actions/checkout@v3
          - name: Deploy to Dev
            run: |
              kubectl config set-cluster dev --server="${{ secrets.DEV_K8S_SERVER }}"
              kubectl config set-credentials admin --token="${{ secrets.DEV_K8S_TOKEN }}"
              # A context ties the cluster and credentials together; kubectl needs one selected
              kubectl config set-context dev --cluster=dev --user=admin
              kubectl config use-context dev
              kubectl set image deployment/myapp myapp=$REGISTRY/$IMAGE_NAME:${{ github.sha }} -n dev
              kubectl rollout status deployment/myapp -n dev
          - name: Run smoke tests
            run: |
              # --fail makes curl exit non-zero on HTTP errors so the step actually fails
              curl --fail https://dev.example.com/health
              npm run test:smoke -- --env=dev

      # Deploy to Staging - Auto on main branch
      deploy-staging:
        needs: build-and-test
        if: github.ref == 'refs/heads/main'
        runs-on: ubuntu-latest
        environment:
          name: staging
          url: https://staging.example.com
        steps:
          - uses: actions/checkout@v3
          - name: Configure cluster access
            # Must run before any other kubectl command in this job
            run: |
              kubectl config set-cluster staging --server="${{ secrets.STAGING_K8S_SERVER }}"
              kubectl config set-credentials admin --token="${{ secrets.STAGING_K8S_TOKEN }}"
              kubectl config set-context staging --cluster=staging --user=admin
              kubectl config use-context staging
          - name: Run database migrations
            run: |
              kubectl exec -n staging deployment/postgres -- \
                psql -U postgres -d app -f /migrations/migrate.sql
          - name: Deploy to Staging
            run: |
              kubectl apply -k k8s/staging/
              kubectl rollout status deployment/myapp -n staging --timeout=5m
          - name: Run E2E tests
            run: npm run test:e2e -- --env=staging
          - name: Run performance tests
            run: |
              k6 run --vus 10 --duration 30s tests/performance.js
          - name: Check staging health
            run: |
              # jq -e exits non-zero unless the status is exactly "healthy"
              # (grep "healthy" would also match "unhealthy")
              curl --fail -s https://staging.example.com/health | jq -e '.status == "healthy"'

      # Deploy to Production - Manual approval required
      deploy-production:
        needs: deploy-staging
        runs-on: ubuntu-latest
        environment:
          name: production
          url: https://example.com
        steps:
          - uses: actions/checkout@v3
          - name: Configure cluster access
            # Must run before the backup and migration steps, which already use kubectl
            run: |
              kubectl config set-cluster prod --server="${{ secrets.PROD_K8S_SERVER }}"
              kubectl config set-credentials admin --token="${{ secrets.PROD_K8S_TOKEN }}"
              kubectl config set-context prod --cluster=prod --user=admin
              kubectl config use-context prod
          - name: Backup production database
            run: |
              kubectl exec -n production deployment/postgres -- \
                pg_dump -U postgres app > backup-$(date +%Y%m%d-%H%M%S).sql
          - name: Run database migrations
            run: |
              kubectl exec -n production deployment/postgres -- \
                psql -U postgres -d app -f /migrations/migrate.sql
          - name: Deploy to Production (Blue-Green)
            run: |
              # Deploy green version
              kubectl apply -k k8s/production/
              kubectl rollout status deployment/myapp-green -n production --timeout=10m
              # Switch traffic to green
              kubectl patch service myapp -n production -p '{"spec":{"selector":{"version":"green"}}}'
          - name: Monitor production metrics
            run: |
              sleep 300  # Wait 5 minutes
              # The query API returns JSON, so extract the sample value before comparing
              ERROR_RATE=$(curl -s "prometheus.example.com/api/v1/query?query=rate5m" | jq -r '.data.result[0].value[1]')
              # [ -gt ] only compares integers; use awk for the floating-point check
              if awk -v rate="$ERROR_RATE" 'BEGIN { exit !(rate > 0.01) }'; then
                echo "Error rate too high, rolling back"
                kubectl patch service myapp -n production -p '{"spec":{"selector":{"version":"blue"}}}'
                exit 1
              fi
          - name: Notify team
            if: success()
            uses: slackapi/slack-github-action@v1
            with:
              payload: |
                {
                  "text": "✅ Production deployment successful!",
                  "version": "${{ github.sha }}",
                  "deployed_by": "${{ github.actor }}"
                }

Part 7: Best Practices Environment Management Checklist ✅ DO: ...
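The "Monitor production metrics" step in the pipeline above boils down to one decision: read an error rate from Prometheus and roll back if it exceeds 1%. A minimal Go sketch of that gate follows. It assumes the standard Prometheus instant-query response shape; the names `parseErrorRate` and `shouldRollback`, and the choice to roll back when the response cannot be read, are illustrative assumptions rather than part of the original pipeline.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
)

// promResponse mirrors only the fields of a Prometheus instant-query
// response that this sketch needs.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			// Value is [unix timestamp, "sample value encoded as a string"].
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// parseErrorRate extracts the first sample value from the response body.
// (Hypothetical helper, named for this example only.)
func parseErrorRate(body []byte) (float64, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("decode response: %w", err)
	}
	if resp.Status != "success" || len(resp.Data.Result) == 0 {
		return 0, errors.New("query returned no samples")
	}
	raw, ok := resp.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, errors.New("sample value is not a string")
	}
	return strconv.ParseFloat(raw, 64)
}

// shouldRollback reports whether the deployment should be reverted.
// Assumption: an unreadable response is treated as unhealthy (fail closed).
func shouldRollback(body []byte, threshold float64) bool {
	rate, err := parseErrorRate(body)
	if err != nil {
		return true
	}
	return rate > threshold
}

func main() {
	// Sample payload standing in for the HTTP response; no network call is made.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1737600000,"0.025"]}]}}`)
	fmt.Println(shouldRollback(sample, 0.01)) // prints "true": 2.5% exceeds the 1% threshold
}
```

Doing the comparison on a parsed float avoids the integer-only comparison pitfalls of shell tests, and failing closed means a broken metrics endpoint cannot silently approve a deployment.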

    January 23, 2025 · 11 min · Rafiul Alam

    Rollback & Recovery: Detection to Previous Version

    Introduction Even with the best testing, production issues happen. Having a solid rollback and recovery strategy is critical for minimizing downtime and data loss when deployments go wrong. This guide visualizes the complete rollback process: Issue Detection: Monitoring alerts and health checks Rollback Decision: When to rollback vs forward fix Rollback Execution: Different rollback strategies Data Recovery: Handling database changes Post-Incident: Learning and prevention Part 1: Issue Detection Flow From Healthy to Incident %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Production deploymentcompleted]) --> Monitor[Monitoring Systems- Prometheus metrics- Application logs- User reports- Health checks] Monitor --> Baseline[Baseline Metrics:✓ Error rate: 0.1%✓ Latency p95: 150ms✓ Traffic: 10k req/min✓ CPU: 40%✓ Memory: 60%] Baseline --> Time[Time passes...Minutes after deployment] Time --> Detect{Issuedetected?} Detect -->|No issue| Healthy[✅ Deployment HealthyContinue monitoringAll metrics normal] Detect -->|Yes| IssueType{Issuetype?} IssueType --> ErrorSpike[🔴 Error Rate Spike0.1% → 15%Alert: HighErrorRate firing] IssueType --> LatencySpike[🟡 Latency Increasep95: 150ms → 5000msAlert: HighLatency firing] IssueType --> TrafficDrop[🟠 Traffic Drop10k → 1k req/minUsers can't access] IssueType --> ResourceIssue[🔴 Resource ExhaustionCPU: 40% → 100%OOMKilled events] IssueType --> DataCorruption[🔴 Data IssuesDatabase errorsInvalid data returned] ErrorSpike --> Severity1[Severity: CRITICALUser impact: HIGHAffecting all users] LatencySpike --> Severity2[Severity: WARNINGUser impact: MEDIUMSlow but functional] TrafficDrop --> Severity3[Severity: CRITICALUser impact: HIGHComplete outage] ResourceIssue --> Severity4[Severity: 
CRITICALUser impact: HIGHPods crashing] DataCorruption --> Severity5[Severity: CRITICALUser impact: CRITICALData integrity at risk] Severity1 --> AutoAlert[🚨 Automated Alerts:- PagerDuty page- Slack notification- Email alerts- Status page update] Severity2 --> AutoAlert Severity3 --> AutoAlert Severity4 --> AutoAlert Severity5 --> AutoAlert AutoAlert --> OnCall[On-Call EngineerReceives alertAcknowledges incident] OnCall --> Investigate[Quick Investigation:- Check deployment timeline- Review recent changes- Check logs- Verify metrics] Investigate --> RootCause{Root causeidentified?} RootCause -->|Yes - Recent deployment| Decision[Go to Rollback Decision] RootCause -->|Yes - Other cause| OtherFix[Different remediationNot deployment-related] RootCause -->|No - Time critical| Decision style Healthy fill:#064e3b,stroke:#10b981 style Severity1 fill:#7f1d1d,stroke:#ef4444 style Severity3 fill:#7f1d1d,stroke:#ef4444 style Severity4 fill:#7f1d1d,stroke:#ef4444 style Severity5 fill:#7f1d1d,stroke:#ef4444 style Severity2 fill:#78350f,stroke:#f59e0b Part 2: Rollback Decision Tree When to Rollback vs Forward Fix %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Production issue detected]) --> Assess[Assess situation:- User impact- Severity- Time deployed- Data changes] Assess --> Q1{Can issue befixed quickly?5 min} Q1 -->|Yes - Simple config| QuickFix[Forward Fix:- Update config map- Restart pods- No rollback needed] Q1 -->|No| Q2{Is issue causedby latestdeployment?} Q2 -->|No - External issue| External[External Root Cause:- Third-party API down- Database issue- Infrastructure problem→ Fix underlying issue] Q2 -->|Yes| Q3{User impactseverity?} Q3 -->|Low - Minor bugs| Q4{Time sincedeployment?} Q4 -->|< 30 min| RollbackLow[Consider 
Rollback:Low risk, easy rollbackUsers barely affected] Q4 -->|> 30 min| ForwardFix[Forward Fix:Deploy hotfixMore data changesRollback riskier] Q3 -->|Medium - Degraded| Q5{Data changesmade?} Q5 -->|No DB changes| RollbackMed[Rollback:Safe to revertNo data migrationQuick recovery] Q5 -->|DB changes made| Q6{Can revertDB changes?} Q6 -->|Yes - Reversible| RollbackWithDB[Rollback + DB Revert:1. Revert application2. Run down migrationCoordinate carefully] Q6 -->|No - Irreversible| ForwardOnly[Forward Fix ONLY:Cannot rollbackFix bug in new versionData can't be reverted] Q3 -->|High - Outage| Q7{Rollbacktime?} Q7 -->|< 5 min| ImmediateRollback[IMMEDIATE Rollback:User impact too highRollback firstDebug later] Q7 -->|> 5 min| Q8{Forward fixfaster?} Q8 -->|Yes| HotfixDeploy[Deploy Hotfix:If fix is obviousand can deployfaster than rollback] Q8 -->|No| ImmediateRollback QuickFix --> Monitor[Monitor metricsVerify fix worked] RollbackLow --> ExecuteRollback[Execute Rollback] RollbackMed --> ExecuteRollback RollbackWithDB --> ExecuteRollback ImmediateRollback --> ExecuteRollback ForwardFix --> DeployFix[Deploy Forward Fix] HotfixDeploy --> DeployFix ForwardOnly --> DeployFix style ImmediateRollback fill:#7f1d1d,stroke:#ef4444 style RollbackWithDB fill:#78350f,stroke:#f59e0b style ForwardOnly fill:#78350f,stroke:#f59e0b style QuickFix fill:#064e3b,stroke:#10b981 Part 3: Rollback Execution Strategies Application Rollback Methods %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Decision: Rollback]) --> Method{Deploymentstrategyused?} Method --> K8sRolling[Kubernetes Rolling Update] Method --> BlueGreen[Blue-Green Deployment] Method --> Canary[Canary Deployment] subgraph RollingRollback[Kubernetes Rolling Rollback] K8sRolling --> K8s1[kubectl 
rollout undodeployment myapp] K8s1 --> K8s2[Kubernetes:- Find previous ReplicaSet- Rolling update to old version- maxSurge: 1, maxUnavailable: 1] K8s2 --> K8s3[Gradual Pod Replacement:1. Create 1 old version pod2. Wait for ready3. Terminate 1 new version pod4. Repeat until all replaced] K8s3 --> K8s4[Time to rollback: 2-5 minDowntime: NoneSome users see old, some new] end subgraph BGRollback[Blue-Green Rollback] BlueGreen --> BG1[Current state:Blue v1.0 IDLEGreen v2.0 ACTIVE 100%] BG1 --> BG2[Update Service selector:version: green → version: blue] BG2 --> BG3[Instant Traffic Switch:Blue v1.0 ACTIVE 100%Green v2.0 IDLE 0%] BG3 --> BG4[Time to rollback: 1-2 secDowntime: ~1 secAll users switched instantly] end subgraph CanaryRollback[Canary Rollback] Canary --> C1[Current state:v1.0: 0 replicasv2.0: 10 replicas 100%] C1 --> C2[Scale down v2.0:v2.0: 10 → 0 replicas] C2 --> C3[Scale up v1.0:v1.0: 0 → 10 replicas] C3 --> C4[Time to rollback: 1-3 minDowntime: MinimalGradual traffic shift] end K8s4 --> Verify[Verification Steps] BG4 --> Verify C4 --> Verify Verify --> V1[1. Check pod statuskubectl get podsAll running?] V1 --> V2[2. Run health checkscurl /healthAll healthy?] V2 --> V3[3. Monitor metricsError rate back to normal?Latency improved?] V3 --> V4[4. Check user reportsAre users reporting success?] 
V4 --> Success{Rollbacksuccessful?} Success -->|Yes| Complete[✅ Rollback CompleteService restoredMonitor closely] Success -->|No| StillBroken[🚨 Still Broken!Issue not deployment-relatedDeeper investigation needed] style K8s4 fill:#1e3a8a,stroke:#3b82f6 style BG4 fill:#064e3b,stroke:#10b981 style C4 fill:#1e3a8a,stroke:#3b82f6 style Complete fill:#064e3b,stroke:#10b981 style StillBroken fill:#7f1d1d,stroke:#ef4444 Part 4: Database Rollback Complexity Handling Database Migrations %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Need to rollbackwith DB changes]) --> Analyze[Analyze migration type] Analyze --> Type{Migrationtype?} Type --> AddColumn[Added ColumnALTER TABLE usersADD COLUMN email] Type --> DropColumn[Dropped ColumnALTER TABLE usersDROP COLUMN phone] Type --> ModifyColumn[Modified ColumnALTER TABLE usersALTER COLUMN age TYPE bigint] Type --> AddTable[Added TableCREATE TABLE orders] AddColumn --> AC1{Column hasdata?} AC1 -->|No data yet| AC2[Safe Rollback:1. Deploy old app version2. DROP COLUMN emailOld app doesn't use it] AC1 -->|Has data| AC3[⚠️ Data Loss Risk:1. Backup table first2. Consider keeping column3. Deploy old app versionColumn ignored by old app] DropColumn --> DC1[🚨 CANNOT Rollback:Data already lostForward fix ONLYOptions:1. Restore from backup2. Accept data loss3. Recreate from logs] ModifyColumn --> MC1{Datacompatible?} MC1 -->|Yes - reversible| MC2[Revert Column Type:ALTER COLUMN age TYPE intVerify no data truncationThen deploy old app] MC1 -->|No - data loss| MC3[🚨 Cannot Revert:bigint values exceed int rangeForward fix ONLY] AddTable --> AT1{Table hascritical data?} AT1 -->|No data| AT2[Safe Rollback:1. Deploy old app version2. DROP TABLE ordersNo data lost] AT1 -->|Has data| AT3[Risky Rollback:1. 
BACKUP TABLE orders2. DROP TABLE orders3. Deploy old app versionData preserved in backup] AC2 --> SafeProcess[Safe Rollback Process:✅ No data loss✅ Quick rollback✅ Reversible] AC3 --> RiskyProcess[Risky Rollback Process:⚠️ Potential data loss⚠️ Need backup⚠️ Manual intervention] DC1 --> NoRollback[Forward Fix Only:❌ Cannot rollback❌ Data already lost❌ Must fix forward] MC2 --> SafeProcess MC3 --> NoRollback AT2 --> SafeProcess AT3 --> RiskyProcess SafeProcess --> Execute1[Execute Safe Rollback] RiskyProcess --> Decision{Acceptablerisk?} Decision -->|Yes| Execute2[Execute with Caution] Decision -->|No| NoRollback NoRollback --> HotfixDeploy[Deploy Hotfix:New version with fixKeep new schema] style SafeProcess fill:#064e3b,stroke:#10b981 style RiskyProcess fill:#78350f,stroke:#f59e0b style NoRollback fill:#7f1d1d,stroke:#ef4444 style DC1 fill:#7f1d1d,stroke:#ef4444 style MC3 fill:#7f1d1d,stroke:#ef4444 Part 5: Complete Rollback Workflow From Detection to Recovery %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant Monitor as Monitoring participant Alert as Alerting participant Engineer as On-Call Engineer participant Incident as Incident Channel participant K8s as Kubernetes participant DB as Database participant Users as End Users Note over Monitor: 5 minutes after deployment Monitor->>Monitor: Detect anomaly:Error rate: 0.1% → 18%Latency p95: 150ms → 3000ms Monitor->>Alert: Trigger alert:HighErrorRate FIRING Alert->>Engineer: 🚨 PagerDuty callCritical alertProduction incident Engineer->>Alert: Acknowledge alertStop escalation Engineer->>Incident: Create #incident-456"High error rate after v2.5 deployment" Note over Engineer: Open laptopStart investigation Engineer->>Monitor: Check Grafana dashboardWhen did issue start?Which 
endpoints affected? Monitor-->>Engineer: Started 5 min agoRight after deploymentAll endpoints affected Engineer->>K8s: kubectl get podsCheck pod status K8s-->>Engineer: All pods RunningNo crashesHealth checks passing Engineer->>K8s: kubectl logs deployment/myappCheck application logs K8s-->>Engineer: ERROR: Cannot connect to cacheERROR: Redis timeoutERROR: Connection refused Note over Engineer: Root cause: New versionhas Redis connection bug Engineer->>Incident: Update: Redis connection issue in v2.5Decision: Rollback to v2.4 Note over Engineer: Check deployment history Engineer->>K8s: kubectl rollout history deployment/myapp K8s-->>Engineer: REVISION 10: v2.5 (current)REVISION 9: v2.4 (previous) Engineer->>Incident: Starting rollback to v2.4ETA: 3 minutes Engineer->>K8s: kubectl rollout undo deployment/myapp K8s->>K8s: Start rollback:- Create pods with v2.4- Wait for ready- Terminate v2.5 pods loop Rolling Update K8s->>Users: Some users on v2.4 ✓Some users on v2.5 ✗ Note over K8s: Pod 1: v2.4 ReadyTerminating v2.5 Pod 1 Engineer->>K8s: kubectl rollout statusdeployment/myapp --watch K8s-->>Engineer: Waiting for rollout:2/5 pods updated end K8s->>Users: All users now on v2.4 ✓ K8s-->>Engineer: Rollout complete:deployment "myapp" successfully rolled out Engineer->>Monitor: Check metrics Note over Monitor: Wait 2 minutesfor metrics to stabilize Monitor-->>Engineer: ✅ Error rate: 0.1%✅ Latency p95: 160ms✅ All metrics normal Note over Alert: Metrics normalized Alert->>Engineer: ✅ Alert resolved:HighErrorRate Engineer->>Users: Verify user experience Users-->>Engineer: No error reportsApplication working Engineer->>Incident: ✅ Incident resolvedService restored to v2.4Duration: 12 minutesRoot cause: Redis bug in v2.5 Engineer->>Incident: Next steps:1. Fix Redis bug2. Add integration test3. 
Post-mortem scheduled Note over Engineer: Create follow-up tasks Engineer->>Engineer: Create Jira tickets:- BUG-789: Fix Redis connection- TEST-123: Add cache integration test- DOC-456: Update deployment checklist Note over Engineer,Users: Service restored ✓Monitoring continues Part 6: Automated Rollback Auto-Rollback Decision Flow %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Deployment completed]) --> Monitor[Continuous MonitoringEvery 30 seconds] Monitor --> Collect[Collect Metrics:- Error rate- Latency p95/p99- Success rate- Pod health- Resource usage] Collect --> Check1{Error rate> 5%?} Check1 -->|Yes| Trigger1[🚨 Trigger auto-rollbackError threshold exceeded] Check1 -->|No| Check2{Latency p95> 2x baseline?} Check2 -->|Yes| Trigger2[🚨 Trigger auto-rollbackLatency degradation] Check2 -->|No| Check3{Pod crashrate > 50%?} Check3 -->|Yes| Trigger3[🚨 Trigger auto-rollbackPods failing] Check3 -->|No| Check4{Custom metricthreshold?} Check4 -->|Yes| Trigger4[🚨 Trigger auto-rollbackBusiness metric failed] Check4 -->|No| Healthy[✅ All checks passedContinue monitoring] Healthy --> TimeCheck{Monitoringduration?} TimeCheck -->|< 15 min| Monitor TimeCheck -->|>= 15 min| Stable[✅ Deployment STABLEPassed soak periodAuto-rollback disabled] Trigger1 --> Rollback[Execute Auto-Rollback] Trigger2 --> Rollback Trigger3 --> Rollback Trigger4 --> Rollback Rollback --> R1[1. Log rollback decisionMetrics that triggeredTimestamp] R1 --> R2[2. Alert team:PagerDuty criticalSlack notification"Auto-rollback initiated"] R2 --> R3[3. Execute rollback:kubectl rollout undodeployment/myapp] R3 --> R4[4. Wait for rollback:Monitor pod statusWait for all pods ready] R4 --> R5[5. Verify recovery:Check metrics againError rate normal?Latency normal?] 
R5 --> Verify{Recoverysuccessful?} Verify -->|Yes| Success[✅ Auto-Rollback SuccessService restoredNotify teamCreate incident report] Verify -->|No| StillFailing[🚨 Still Failing!Issue not deploymentPage on-call immediatelyManual intervention needed] style Healthy fill:#064e3b,stroke:#10b981 style Stable fill:#064e3b,stroke:#10b981 style Success fill:#064e3b,stroke:#10b981 style Trigger1 fill:#7f1d1d,stroke:#ef4444 style Trigger2 fill:#7f1d1d,stroke:#ef4444 style Trigger3 fill:#7f1d1d,stroke:#ef4444 style Trigger4 fill:#7f1d1d,stroke:#ef4444 style StillFailing fill:#7f1d1d,stroke:#ef4444

Auto-Rollback Configuration

    # Flagger auto-rollback configuration
    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: myapp
      namespace: production
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: myapp
      service:
        port: 8080
      # Canary analysis
      analysis:
        interval: 30s
        threshold: 5  # Rollback after 5 failed checks
        maxWeight: 50
        stepWeight: 10
        # Metrics for auto-rollback decision
        metrics:
          # HTTP error rate
          - name: request-success-rate
            thresholdRange:
              min: 95  # Rollback if success rate < 95%
            interval: 1m
          # HTTP latency
          - name: request-duration
            thresholdRange:
              max: 500  # Rollback if p95 > 500ms
            interval: 1m
          # Custom business metric
          - name: conversion-rate
            thresholdRange:
              min: 80  # Rollback if conversion < 80% of baseline
            interval: 2m
        # Webhooks for additional checks
        webhooks:
          - name: load-test
            url: http://flagger-loadtester/
            timeout: 5s
            metadata:
              type: bash
              cmd: "hey -z 1m -q 10 http://myapp-canary:8080/"
        # Alerting on rollback
        alerts:
          - name: slack
            severity: error
            providerRef:
              name: slack
              namespace: flagger

Part 7: Post-Incident Process Learning from Rollbacks %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Rollback completedService restored]) --> Timeline[Create Incident Timeline:- Deployment time- Issue detection time- Rollback decision time- Recovery timeTotal duration] Timeline --> PostMortem[Schedule Post-Mortem:Within 48 hoursAll stakeholders invitedBlameless culture] PostMortem --> Analyze[Root Cause Analysis:Why did issue occur?Why wasn't it caught?What can we learn?] Analyze --> Categories{Issuecategory?} Categories --> Testing[Insufficient Testing:- Missing test case- Integration gap- Load testing needed] Categories --> Monitoring[Monitoring Gap:- Missing alert- Wrong threshold- Blind spot found] Categories --> Process[Process Issue:- Skipped step- Wrong timing- Communication gap] Categories --> Code[Code Quality:- Bug in code- Edge case- Dependency issue] Testing --> Actions1[Action Items:□ Add integration test□ Expand E2E coverage□ Add load test□ Test in staging first] Monitoring --> Actions2[Action Items:□ Add new alert□ Adjust thresholds□ Add dashboard□ Improve visibility] Process --> Actions3[Action Items:□ Update runbook□ Add checklist item□ Change deployment time□ Improve communication] Code --> Actions4[Action Items:□ Fix bug□ Add validation□ Update dependency□ Code review process] Actions1 --> Assign[Assign Owners:Each action has ownerEach action has deadlineTrack in project board] Actions2 --> Assign Actions3 --> Assign Actions4 --> Assign Assign --> Document[Document Learnings:- Update wiki- Share with team- Add to knowledge base- Update training] Document --> Prevent[Prevent Recurrence:✓ Tests added✓ Monitoring improved✓ Process updated✓ Team educated] Prevent --> Complete[✅ Post-Incident CompleteStronger systemBetter preparedContinuous improvement] style Complete fill:#064e3b,stroke:#10b981 Part 8: Rollback Checklist Pre-Deployment Rollback Readiness Before Every Deployment: ...

    January 23, 2025 · 11 min · Rafiul Alam

    Domain-Driven Design in Go: Building Complex Business Systems

    Go Architecture Patterns Series: ← Hexagonal Architecture | Series Overview | Next: Modular Monolith → What is Domain-Driven Design? Domain-Driven Design (DDD) is a software development approach introduced by Eric Evans that focuses on creating software that matches the business domain. It emphasizes collaboration between technical and domain experts using a common language (Ubiquitous Language) and strategic/tactical patterns to handle complex business logic. Key Principles: Ubiquitous Language: Common language shared by developers and domain experts Bounded Contexts: Explicit boundaries where a particular domain model applies Domain Model: Rich model that captures business rules and behavior Strategic Design: High-level patterns for organizing large systems Tactical Design: Building blocks for implementing domain models DDD Strategic Patterns %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% graph TB subgraph "E-commerce System" subgraph "Ordering Context" OC[Order Aggregate] OL[Order Line Items] OP[Order Payment] end subgraph "Inventory Context" IC[Product Catalog] IS[Stock Management] IW[Warehouse] end subgraph "Shipping Context" SC[Shipment] SD[Delivery] ST[Tracking] end subgraph "Customer Context" CC[Customer Profile] CA[Address] CP[Preferences] end end OC -.->|Anti-Corruption Layer| IC OC -.->|Shared Kernel| CC SC -.->|Published Language| OC style OC fill:#78350f,color:#fff style IC fill:#1e3a5f,color:#fff style SC fill:#134e4a,color:#fff style CC fill:#4c1d95,color:#fff Bounded Context Map %%{init: {'theme':'dark', 'themeVariables': 
{'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% graph LR subgraph "Sales Context" S[Sales DomainCustomer, Order, Product] end subgraph "Support Context" SP[Support DomainTicket, Customer, Issue] end subgraph "Billing Context" B[Billing DomainInvoice, Payment, Customer] end subgraph "Shared Kernel" SK[Customer Identity] end S -.->|Conformist| SK SP -.->|Customer/Supplier| S B -.->|Partnership| S style S fill:#78350f,color:#fff style SP fill:#1e3a5f,color:#fff style B fill:#134e4a,color:#fff style SK fill:#4c1d95,color:#fff DDD Tactical Patterns %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% graph TD subgraph "Aggregate Root" AR[OrderAggregate Root] E1[Order LineEntity] E2[Payment InfoEntity] VO1[MoneyValue Object] VO2[AddressValue Object] end subgraph "Domain Services" DS[Pricing Service] DS2[Shipping Calculator] end subgraph "Repositories" R[Order Repository] end AR --> E1 AR --> E2 E1 --> VO1 AR --> VO2 AR -.->|uses| DS R -.->|persists| AR style AR fill:#78350f,stroke:#fb923c,stroke-width:3px,color:#fff style VO1 fill:#1e3a5f,color:#fff style DS fill:#134e4a,color:#fff style R fill:#4c1d95,color:#fff Real-World Use Cases E-commerce Platforms: Complex ordering, inventory, and payment systems Financial Systems: Banking, trading, and payment processing Healthcare Systems: Patient records, appointments, and billing Supply Chain Management: Inventory, shipping, and logistics Enterprise Resource Planning: Multi-domain business systems Insurance Systems: Policy management, claims, and underwriting Project Structure ├── cmd/ │ └── api/ │ └── main.go ├── internal/ │ ├── domain/ │ 
│ ├── order/ # Order Bounded Context │ │ │ ├── aggregate/ │ │ │ │ └── order.go # Order aggregate root │ │ │ ├── entity/ │ │ │ │ └── order_line.go │ │ │ ├── valueobject/ │ │ │ │ ├── money.go │ │ │ │ └── quantity.go │ │ │ ├── service/ │ │ │ │ └── pricing_service.go │ │ │ ├── repository/ │ │ │ │ └── order_repository.go │ │ │ └── event/ │ │ │ └── order_placed.go │ │ ├── customer/ # Customer Bounded Context │ │ │ ├── aggregate/ │ │ │ └── valueobject/ │ │ └── shared/ # Shared Kernel │ │ └── valueobject/ │ ├── application/ # Application Services │ │ ├── order/ │ │ │ └── order_service.go │ │ └── customer/ │ │ └── customer_service.go │ └── infrastructure/ │ ├── persistence/ │ └── messaging/ └── go.mod Building Blocks: Value Objects package valueobject import ( "errors" "fmt" ) // Money represents a monetary value (Value Object) // Value objects are immutable and compared by value, not identity type Money struct { amount int64 // Amount in smallest currency unit (cents) currency string } // NewMoney creates a new Money value object func NewMoney(amount int64, currency string) (Money, error) { if currency == "" { return Money{}, errors.New("currency cannot be empty") } if amount < 0 { return Money{}, errors.New("amount cannot be negative") } return Money{amount: amount, currency: currency}, nil } // Amount returns the amount func (m Money) Amount() int64 { return m.amount } // Currency returns the currency func (m Money) Currency() string { return m.currency } // Add adds two money values func (m Money) Add(other Money) (Money, error) { if m.currency != other.currency { return Money{}, errors.New("cannot add different currencies") } return Money{ amount: m.amount + other.amount, currency: m.currency, }, nil } // Multiply multiplies money by a quantity func (m Money) Multiply(multiplier int) Money { return Money{ amount: m.amount * int64(multiplier), currency: m.currency, } } // IsZero checks if money is zero func (m Money) IsZero() bool { return m.amount == 0 } // Equals 
checks equality (value objects compare by value) func (m Money) Equals(other Money) bool { return m.amount == other.amount && m.currency == other.currency } // String returns string representation func (m Money) String() string { return fmt.Sprintf("%d %s", m.amount, m.currency) } // Email represents an email address (Value Object) type Email struct { value string } // NewEmail creates a new Email value object func NewEmail(email string) (Email, error) { if !isValidEmail(email) { return Email{}, errors.New("invalid email format") } return Email{value: email}, nil } // String returns the email string func (e Email) String() string { return e.value } // Equals checks equality func (e Email) Equals(other Email) bool { return e.value == other.value } func isValidEmail(email string) bool { // Simplified validation return len(email) > 3 && contains(email, "@") && contains(email, ".") } func contains(s, substr string) bool { for i := 0; i <= len(s)-len(substr); i++ { if s[i:i+len(substr)] == substr { return true } } return false } // Address represents a physical address (Value Object) type Address struct { street string city string state string postalCode string country string } // NewAddress creates a new Address value object func NewAddress(street, city, state, postalCode, country string) (Address, error) { if street == "" || city == "" || country == "" { return Address{}, errors.New("street, city, and country are required") } return Address{ street: street, city: city, state: state, postalCode: postalCode, country: country, }, nil } // Street returns the street func (a Address) Street() string { return a.street } // City returns the city func (a Address) City() string { return a.city } // Country returns the country func (a Address) Country() string { return a.country } // Quantity represents a product quantity (Value Object) type Quantity struct { value int } // NewQuantity creates a new Quantity value object func NewQuantity(value int) (Quantity, error) { if value < 
0 { return Quantity{}, errors.New("quantity cannot be negative") } return Quantity{value: value}, nil } // Value returns the quantity value func (q Quantity) Value() int { return q.value } // Add adds two quantities func (q Quantity) Add(other Quantity) Quantity { return Quantity{value: q.value + other.value} } // IsZero checks if quantity is zero func (q Quantity) IsZero() bool { return q.value == 0 } Building Blocks: Entities package entity import ( "time" "myapp/internal/domain/order/valueobject" ) // OrderLine is an entity (has identity) // Entities have identity and lifecycle type OrderLine struct { id string productID string product string quantity valueobject.Quantity unitPrice valueobject.Money createdAt time.Time } // NewOrderLine creates a new order line entity func NewOrderLine(id, productID, product string, quantity valueobject.Quantity, unitPrice valueobject.Money) *OrderLine { return &OrderLine{ id: id, productID: productID, product: product, quantity: quantity, unitPrice: unitPrice, createdAt: time.Now(), } } // ID returns the order line ID (identity) func (ol *OrderLine) ID() string { return ol.id } // ProductID returns the product ID func (ol *OrderLine) ProductID() string { return ol.productID } // Quantity returns the quantity func (ol *OrderLine) Quantity() valueobject.Quantity { return ol.quantity } // UnitPrice returns the unit price func (ol *OrderLine) UnitPrice() valueobject.Money { return ol.unitPrice } // TotalPrice calculates the total price for this line func (ol *OrderLine) TotalPrice() valueobject.Money { return ol.unitPrice.Multiply(ol.quantity.Value()) } // UpdateQuantity updates the quantity func (ol *OrderLine) UpdateQuantity(newQuantity valueobject.Quantity) { ol.quantity = newQuantity } // Payment is an entity representing payment information type Payment struct { id string method PaymentMethod amount valueobject.Money transactionID string status PaymentStatus paidAt time.Time } // PaymentMethod represents payment method type 
PaymentMethod string const ( PaymentMethodCreditCard PaymentMethod = "credit_card" PaymentMethodDebitCard PaymentMethod = "debit_card" PaymentMethodPayPal PaymentMethod = "paypal" ) // PaymentStatus represents payment status type PaymentStatus string const ( PaymentStatusPending PaymentStatus = "pending" PaymentStatusCompleted PaymentStatus = "completed" PaymentStatusFailed PaymentStatus = "failed" ) Building Blocks: Aggregates package aggregate import ( "errors" "time" "myapp/internal/domain/order/entity" "myapp/internal/domain/order/event" "myapp/internal/domain/order/valueobject" ) // Order is an aggregate root // Aggregate roots maintain consistency boundaries and control access to entities type Order struct { id string customerID string orderLines []*entity.OrderLine shippingAddress valueobject.Address billingAddress valueobject.Address payment *entity.Payment status OrderStatus total valueobject.Money createdAt time.Time updatedAt time.Time // Domain events events []event.DomainEvent } // OrderStatus represents the status of an order type OrderStatus string const ( OrderStatusDraft OrderStatus = "draft" OrderStatusPlaced OrderStatus = "placed" OrderStatusConfirmed OrderStatus = "confirmed" OrderStatusShipped OrderStatus = "shipped" OrderStatusDelivered OrderStatus = "delivered" OrderStatusCancelled OrderStatus = "cancelled" ) // NewOrder creates a new order aggregate func NewOrder(id, customerID string, shippingAddress, billingAddress valueobject.Address) (*Order, error) { if id == "" { return nil, errors.New("order ID is required") } if customerID == "" { return nil, errors.New("customer ID is required") } return &Order{ id: id, customerID: customerID, orderLines: make([]*entity.OrderLine, 0), shippingAddress: shippingAddress, billingAddress: billingAddress, status: OrderStatusDraft, createdAt: time.Now(), updatedAt: time.Now(), events: make([]event.DomainEvent, 0), }, nil } // ID returns the order ID func (o *Order) ID() string { return o.id } // CustomerID 
returns the customer ID func (o *Order) CustomerID() string { return o.customerID } // Status returns the order status func (o *Order) Status() OrderStatus { return o.status } // Total returns the total amount func (o *Order) Total() valueobject.Money { return o.total } // AddOrderLine adds an order line to the order (aggregate invariant) func (o *Order) AddOrderLine(orderLine *entity.OrderLine) error { // Business rule: Cannot add lines to non-draft orders if o.status != OrderStatusDraft { return errors.New("cannot add items to non-draft order") } // Business rule: Check for duplicate products for _, line := range o.orderLines { if line.ProductID() == orderLine.ProductID() { return errors.New("product already exists in order") } } o.orderLines = append(o.orderLines, orderLine) o.recalculateTotal() o.updatedAt = time.Now() return nil } // RemoveOrderLine removes an order line (aggregate invariant) func (o *Order) RemoveOrderLine(orderLineID string) error { // Business rule: Cannot remove lines from non-draft orders if o.status != OrderStatusDraft { return errors.New("cannot remove items from non-draft order") } for i, line := range o.orderLines { if line.ID() == orderLineID { o.orderLines = append(o.orderLines[:i], o.orderLines[i+1:]...) 
o.recalculateTotal() o.updatedAt = time.Now() return nil } } return errors.New("order line not found") } // PlaceOrder places the order (state transition) func (o *Order) PlaceOrder() error { // Business rule: Can only place draft orders if o.status != OrderStatusDraft { return errors.New("can only place draft orders") } // Business rule: Order must have at least one line if len(o.orderLines) == 0 { return errors.New("order must have at least one item") } // Business rule: Order must have payment if o.payment == nil { return errors.New("order must have payment information") } o.status = OrderStatusPlaced o.updatedAt = time.Now() // Raise domain event o.addEvent(event.NewOrderPlacedEvent(o.id, o.customerID, o.total)) return nil } // ConfirmOrder confirms the order func (o *Order) ConfirmOrder() error { if o.status != OrderStatusPlaced { return errors.New("can only confirm placed orders") } o.status = OrderStatusConfirmed o.updatedAt = time.Now() o.addEvent(event.NewOrderConfirmedEvent(o.id)) return nil } // ShipOrder marks the order as shipped func (o *Order) ShipOrder() error { if o.status != OrderStatusConfirmed { return errors.New("can only ship confirmed orders") } o.status = OrderStatusShipped o.updatedAt = time.Now() o.addEvent(event.NewOrderShippedEvent(o.id, o.shippingAddress)) return nil } // CancelOrder cancels the order func (o *Order) CancelOrder(reason string) error { // Business rule: Cannot cancel shipped or delivered orders if o.status == OrderStatusShipped || o.status == OrderStatusDelivered { return errors.New("cannot cancel shipped or delivered orders") } if o.status == OrderStatusCancelled { return errors.New("order is already cancelled") } o.status = OrderStatusCancelled o.updatedAt = time.Now() o.addEvent(event.NewOrderCancelledEvent(o.id, reason)) return nil } // AddPayment adds payment to the order func (o *Order) AddPayment(payment *entity.Payment) error { if o.payment != nil { return errors.New("payment already exists") } // Business rule: 
Payment amount must match order total (assumes entity.Payment exposes an Amount() accessor; the excerpt does not show one) if !payment.Amount().Equals(o.total) { return errors.New("payment amount must match order total") } o.payment = payment o.updatedAt = time.Now() return nil } // recalculateTotal recalculates the order total func (o *Order) recalculateTotal() { if len(o.orderLines) == 0 { o.total = valueobject.Money{} return } total := o.orderLines[0].TotalPrice() for i := 1; i < len(o.orderLines); i++ { var err error total, err = total.Add(o.orderLines[i].TotalPrice()) if err != nil { // Handle error - in production, log this return } } o.total = total } // GetDomainEvents returns all domain events func (o *Order) GetDomainEvents() []event.DomainEvent { return o.events } // ClearDomainEvents clears all domain events func (o *Order) ClearDomainEvents() { o.events = make([]event.DomainEvent, 0) } // addEvent adds a domain event func (o *Order) addEvent(e event.DomainEvent) { o.events = append(o.events, e) } // OrderLines returns a copy of the order line slice func (o *Order) OrderLines() []*entity.OrderLine { // Copy the slice so callers cannot add or remove lines; the *OrderLine pointers are still shared with the aggregate lines := make([]*entity.OrderLine, len(o.orderLines)) copy(lines, o.orderLines) return lines } Building Blocks: Domain Events package event import ( "time" "myapp/internal/domain/order/valueobject" ) // DomainEvent is the base interface for all domain events type DomainEvent interface { OccurredAt() time.Time EventType() string } // OrderPlacedEvent is raised when an order is placed type OrderPlacedEvent struct { orderID string customerID string total valueobject.Money occurredAt time.Time } // NewOrderPlacedEvent creates a new OrderPlacedEvent func NewOrderPlacedEvent(orderID, customerID string, total valueobject.Money) *OrderPlacedEvent { return &OrderPlacedEvent{ orderID: orderID, customerID: customerID, total: total, occurredAt: time.Now(), } } // OrderID returns the order ID func (e *OrderPlacedEvent) OrderID() string { return e.orderID } // CustomerID returns the customer ID func (e 
*OrderPlacedEvent) CustomerID() string { return e.customerID } // Total returns the total amount func (e *OrderPlacedEvent) Total() valueobject.Money { return e.total } // OccurredAt returns when the event occurred func (e *OrderPlacedEvent) OccurredAt() time.Time { return e.occurredAt } // EventType returns the event type func (e *OrderPlacedEvent) EventType() string { return "OrderPlaced" } // OrderConfirmedEvent is raised when an order is confirmed type OrderConfirmedEvent struct { orderID string occurredAt time.Time } // NewOrderConfirmedEvent creates a new OrderConfirmedEvent func NewOrderConfirmedEvent(orderID string) *OrderConfirmedEvent { return &OrderConfirmedEvent{ orderID: orderID, occurredAt: time.Now(), } } // OrderID returns the order ID func (e *OrderConfirmedEvent) OrderID() string { return e.orderID } // OccurredAt returns when the event occurred func (e *OrderConfirmedEvent) OccurredAt() time.Time { return e.occurredAt } // EventType returns the event type func (e *OrderConfirmedEvent) EventType() string { return "OrderConfirmed" } // OrderShippedEvent is raised when an order is shipped type OrderShippedEvent struct { orderID string shippingAddress valueobject.Address occurredAt time.Time } // NewOrderShippedEvent creates a new OrderShippedEvent func NewOrderShippedEvent(orderID string, shippingAddress valueobject.Address) *OrderShippedEvent { return &OrderShippedEvent{ orderID: orderID, shippingAddress: shippingAddress, occurredAt: time.Now(), } } // EventType returns the event type func (e *OrderShippedEvent) EventType() string { return "OrderShipped" } // OccurredAt returns when the event occurred func (e *OrderShippedEvent) OccurredAt() time.Time { return e.occurredAt } // OrderCancelledEvent is raised when an order is cancelled type OrderCancelledEvent struct { orderID string reason string occurredAt time.Time } // NewOrderCancelledEvent creates a new OrderCancelledEvent func NewOrderCancelledEvent(orderID, reason string) *OrderCancelledEvent 
{ return &OrderCancelledEvent{ orderID: orderID, reason: reason, occurredAt: time.Now(), } } // EventType returns the event type func (e *OrderCancelledEvent) EventType() string { return "OrderCancelled" } // OccurredAt returns when the event occurred func (e *OrderCancelledEvent) OccurredAt() time.Time { return e.occurredAt } Building Blocks: Domain Services package service import ( "errors" "myapp/internal/domain/order/valueobject" ) // PricingService is a domain service for calculating prices // Domain services contain business logic that doesn't belong to any entity type PricingService struct { taxRate float64 } // NewPricingService creates a new pricing service func NewPricingService(taxRate float64) *PricingService { return &PricingService{taxRate: taxRate} } // CalculateOrderTotal calculates the total price with tax and shipping func (s *PricingService) CalculateOrderTotal( subtotal valueobject.Money, shippingCost valueobject.Money, ) (valueobject.Money, error) { // Add shipping to subtotal total, err := subtotal.Add(shippingCost) if err != nil { return valueobject.Money{}, err } // Calculate tax taxAmount := int64(float64(total.Amount()) * s.taxRate) tax, err := valueobject.NewMoney(taxAmount, total.Currency()) if err != nil { return valueobject.Money{}, err } // Add tax to total return total.Add(tax) } // ApplyDiscount applies discount to money func (s *PricingService) ApplyDiscount(amount valueobject.Money, discountPercent int) (valueobject.Money, error) { if discountPercent < 0 || discountPercent > 100 { return valueobject.Money{}, errors.New("invalid discount percentage") } discountAmount := amount.Amount() * int64(discountPercent) / 100 finalAmount := amount.Amount() - discountAmount return valueobject.NewMoney(finalAmount, amount.Currency()) } // ShippingCalculator is a domain service for calculating shipping costs type ShippingCalculator struct{} // NewShippingCalculator creates a new shipping calculator func NewShippingCalculator() 
*ShippingCalculator { return &ShippingCalculator{} } // CalculateShippingCost calculates shipping cost based on weight and distance func (s *ShippingCalculator) CalculateShippingCost( weightKg float64, distanceKm float64, currency string, ) (valueobject.Money, error) { // Simple formula: base rate + weight factor + distance factor baseRate := int64(500) // $5.00 base weightFactor := int64(weightKg * 100) distanceFactor := int64(distanceKm * 10) totalCost := baseRate + weightFactor + distanceFactor return valueobject.NewMoney(totalCost, currency) } Building Blocks: Repositories package repository import ( "context" "myapp/internal/domain/order/aggregate" ) // OrderRepository defines the repository interface for orders // Repositories provide collection-like interface for aggregates type OrderRepository interface { // Save saves an order aggregate Save(ctx context.Context, order *aggregate.Order) error // FindByID finds an order by ID FindByID(ctx context.Context, id string) (*aggregate.Order, error) // FindByCustomerID finds orders by customer ID FindByCustomerID(ctx context.Context, customerID string) ([]*aggregate.Order, error) // Update updates an existing order Update(ctx context.Context, order *aggregate.Order) error // Delete deletes an order Delete(ctx context.Context, id string) error // NextIdentity generates the next order identity NextIdentity() string } Application Service package application import ( "context" "fmt" "myapp/internal/domain/order/aggregate" "myapp/internal/domain/order/entity" "myapp/internal/domain/order/event" "myapp/internal/domain/order/repository" "myapp/internal/domain/order/service" "myapp/internal/domain/order/valueobject" ) // OrderService is an application service that orchestrates use cases // Application services coordinate domain objects and infrastructure type OrderService struct { orderRepo repository.OrderRepository pricingService *service.PricingService shippingCalculator *service.ShippingCalculator eventPublisher EventPublisher } // EventPublisher 
publishes domain events type EventPublisher interface { Publish(ctx context.Context, events []event.DomainEvent) error } // NewOrderService creates a new order service func NewOrderService( orderRepo repository.OrderRepository, pricingService *service.PricingService, shippingCalculator *service.ShippingCalculator, eventPublisher EventPublisher, ) *OrderService { return &OrderService{ orderRepo: orderRepo, pricingService: pricingService, shippingCalculator: shippingCalculator, eventPublisher: eventPublisher, } } // CreateOrderCommand represents the command to create an order type CreateOrderCommand struct { CustomerID string ShippingAddress AddressDTO BillingAddress AddressDTO Items []OrderItemDTO } // AddressDTO is a data transfer object for address type AddressDTO struct { Street string City string State string PostalCode string Country string } // OrderItemDTO is a data transfer object for order items type OrderItemDTO struct { ProductID string Product string Quantity int UnitPrice MoneyDTO } // MoneyDTO is a data transfer object for money type MoneyDTO struct { Amount int64 Currency string } // CreateOrder creates a new order (use case) func (s *OrderService) CreateOrder(ctx context.Context, cmd CreateOrderCommand) (string, error) { // Convert DTOs to value objects shippingAddr, err := valueobject.NewAddress( cmd.ShippingAddress.Street, cmd.ShippingAddress.City, cmd.ShippingAddress.State, cmd.ShippingAddress.PostalCode, cmd.ShippingAddress.Country, ) if err != nil { return "", fmt.Errorf("invalid shipping address: %w", err) } billingAddr, err := valueobject.NewAddress( cmd.BillingAddress.Street, cmd.BillingAddress.City, cmd.BillingAddress.State, cmd.BillingAddress.PostalCode, cmd.BillingAddress.Country, ) if err != nil { return "", fmt.Errorf("invalid billing address: %w", err) } // Create order aggregate orderID := s.orderRepo.NextIdentity() order, err := aggregate.NewOrder(orderID, cmd.CustomerID, shippingAddr, billingAddr) if err != nil { return "", err } // 
Add order lines for _, item := range cmd.Items { quantity, err := valueobject.NewQuantity(item.Quantity) if err != nil { return "", err } unitPrice, err := valueobject.NewMoney(item.UnitPrice.Amount, item.UnitPrice.Currency) if err != nil { return "", err } orderLine := entity.NewOrderLine( fmt.Sprintf("%s-line-%s", orderID, item.ProductID), item.ProductID, item.Product, quantity, unitPrice, ) if err := order.AddOrderLine(orderLine); err != nil { return "", err } } // Save order if err := s.orderRepo.Save(ctx, order); err != nil { return "", fmt.Errorf("failed to save order: %w", err) } return orderID, nil } // PlaceOrder places an order (use case) func (s *OrderService) PlaceOrder(ctx context.Context, orderID string, payment PaymentDTO) error { // Load order aggregate order, err := s.orderRepo.FindByID(ctx, orderID) if err != nil { return fmt.Errorf("order not found: %w", err) } // Create payment entity paymentAmount, err := valueobject.NewMoney(payment.Amount, payment.Currency) if err != nil { return err } paymentEntity := &entity.Payment{ // Payment entity fields } // Add payment to order if err := order.AddPayment(paymentEntity); err != nil { return err } // Place order (this enforces business rules) if err := order.PlaceOrder(); err != nil { return err } // Save order if err := s.orderRepo.Update(ctx, order); err != nil { return fmt.Errorf("failed to update order: %w", err) } // Publish domain events events := order.GetDomainEvents() if err := s.eventPublisher.Publish(ctx, events); err != nil { // Log error but don't fail the transaction fmt.Printf("failed to publish events: %v\n", err) } order.ClearDomainEvents() return nil } // PaymentDTO is a data transfer object for payment type PaymentDTO struct { Amount int64 Currency string Method string } Best Practices Ubiquitous Language: Use domain language in code and conversations Bounded Contexts: Define clear boundaries between contexts Aggregate Boundaries: Keep aggregates small and focused Immutable Value 
Objects: Make value objects immutable Domain Events: Use events to communicate between aggregates Repository per Aggregate: One repository per aggregate root Anemic Models: Avoid anemic domain models - put logic in entities Common Pitfalls Large Aggregates: Creating aggregates that are too large Breaking Invariants: Modifying entities without going through aggregate root Transaction Boundaries: Spanning transactions across multiple aggregates Ignoring Ubiquitous Language: Not collaborating with domain experts Over-engineering: Applying DDD to simple CRUD applications Missing Bounded Contexts: Not identifying context boundaries When to Use DDD Use When: ...
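The "Breaking Invariants" pitfall listed above (modifying entities without going through the aggregate root) can be sketched as follows. This is a minimal, self-contained illustration rather than the post's own code: the `line` and `order` types, the `ChangeQuantity` method, and the plain integer cents are simplified, hypothetical stand-ins for the `OrderLine` entity, the `Order` aggregate, and the `Money` value object.

```go
package main

import (
	"errors"
	"fmt"
)

// line is a simplified, hypothetical stand-in for the OrderLine entity.
type line struct {
	productID string
	quantity  int
	unitCents int64 // unit price in the smallest currency unit
}

// order is a simplified, hypothetical stand-in for the Order aggregate root.
type order struct {
	draft      bool
	lines      []*line
	totalCents int64
}

// ChangeQuantity is the single entry point for editing a line, so the
// draft-only rule is checked and the total is refreshed in one place.
func (o *order) ChangeQuantity(productID string, qty int) error {
	if !o.draft {
		return errors.New("cannot change items on non-draft order")
	}
	if qty <= 0 {
		return errors.New("quantity must be positive")
	}
	for _, l := range o.lines {
		if l.productID == productID {
			l.quantity = qty
			o.recalculateTotal()
			return nil
		}
	}
	return errors.New("order line not found")
}

// recalculateTotal recomputes the total from the current lines.
func (o *order) recalculateTotal() {
	var sum int64
	for _, l := range o.lines {
		sum += l.unitCents * int64(l.quantity)
	}
	o.totalCents = sum
}

func main() {
	o := &order{
		draft: true,
		lines: []*line{{productID: "p1", quantity: 1, unitCents: 500}},
	}
	o.recalculateTotal()

	if err := o.ChangeQuantity("p1", 3); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(o.totalCents) // prints 1500 (3 units at 500 cents each)
}
```

Routing the change through the root keeps the stored total consistent with the lines; mutating a line directly through a shared pointer would skip both the status check and the recalculation.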

    January 22, 2025 · 18 min · Rafiul Alam

    Attention is All You Need: Visualized and Explained

    Introduction: The Paper That Changed Everything In 2017, Google researchers published “Attention is All You Need”, introducing the Transformer architecture. This single paper: eliminated recurrence in sequence modeling; introduced an architecture built purely on attention; enabled massive parallelization; and became the foundation for GPT, BERT, and most modern LLMs. Let’s visualize and demystify this revolutionary architecture, piece by piece. The Problem: Sequential Processing is Slow Before Transformers: RNNs and LSTMs graph LR A[Word 1: The] --> B[Hidden h1] B --> C[Word 2: cat] C --> D[Hidden h2] D --> E[Word 3: sat] E --> F[Hidden h3] style B fill:#e74c3c style D fill:#e74c3c style F fill:#e74c3c Problem: Processing is sequential: each step depends on the previous one, so the steps can’t run in parallel. ...
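The sequential dependency described above can be made concrete with a toy recurrence. The sketch below is only an illustration, not code from the post: it assumes a scalar hidden state and the update h_t = tanh(w*h_(t-1) + u*x_t), and the name `rnnStates` and the weights `w` and `u` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// rnnStates runs a toy scalar recurrence h_t = tanh(w*h_(t-1) + u*x_t)
// and returns every hidden state. The loop cannot be split across workers,
// because iteration t needs the value of h produced by iteration t-1.
func rnnStates(xs []float64, w, u float64) []float64 {
	hs := make([]float64, len(xs))
	h := 0.0 // initial hidden state
	for t, x := range xs {
		h = math.Tanh(w*h + u*x) // depends on the previous h
		hs[t] = h
	}
	return hs
}

func main() {
	// Three "words" encoded as made-up scalar inputs.
	fmt.Println(rnnStates([]float64{1, 0.5, -1}, 0.5, 1.0))
}
```

Changing the first input changes every later hidden state, which is why the time steps of such a model have to be computed one after another.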

    January 21, 2025 · 11 min · Rafiul Alam

    Go Concurrency Pattern: The Collatz Explorer

    The Problem: The Simplest Unsolved Math Problem The Collatz conjecture (3n+1 problem) is deceptively simple: (1) Start with any positive integer n. (2) If n is even, divide it by 2. (3) If n is odd, multiply it by 3 and add 1. (4) Repeat until you reach 1. The conjecture: Every positive integer eventually reaches 1. Example (n=12): 12 → 6 → 3 → 10 → 5 → 16 → 8 → 4 → 2 → 1 The mystery: This has been verified by computer for every starting value up to 2^68, but never proven in general. It’s one of mathematics’ most famous unsolved problems. ...
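The rules above translate almost directly into a loop. The following is a minimal sequential sketch, not the concurrent explorer the post goes on to build; the function name `collatzSteps` and the overflow guard are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// collatzSteps counts how many applications of the Collatz rule it takes
// for n to reach 1. It returns an error for n == 0 or if 3n+1 would not
// fit in a uint64.
func collatzSteps(n uint64) (int, error) {
	if n == 0 {
		return 0, errors.New("n must be a positive integer")
	}
	steps := 0
	for n != 1 {
		if n%2 == 0 {
			n /= 2
		} else {
			if n > (math.MaxUint64-1)/3 {
				return steps, errors.New("3n+1 would overflow uint64")
			}
			n = 3*n + 1
		}
		steps++
	}
	return steps, nil
}

func main() {
	steps, err := collatzSteps(12)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// 12 → 6 → 3 → 10 → 5 → 16 → 8 → 4 → 2 → 1
	fmt.Println(steps) // prints 9
}
```

Each starting value is independent of every other, which is what makes the problem a natural fit for fanning work out across goroutines.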

    January 20, 2025 · 12 min · Rafiul Alam

    Modular Monolith Architecture in Go: Scaling Without Microservices

    What is Modular Monolith Architecture? Modular Monolith Architecture is an approach that combines the simplicity of monolithic deployment with the modularity of microservices. It organizes code into independent, loosely coupled modules with well-defined boundaries, all deployed as a single application. Key Principles: Module Independence: each module is self-contained with its own domain logic; Clear Boundaries: modules communicate through well-defined interfaces; Shared Deployment: all modules are deployed together in a single process; Domain Alignment: modules are organized around business capabilities; Internal APIs: modules expose APIs for inter-module communication; Data Ownership: each module owns its data and database schema. Architecture Overview %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% graph TD subgraph "Modular Monolith Application" API[API Gateway/Router] subgraph "User Module" U1[User Service] U2[User Repository] U3[(User DB Schema)] end subgraph "Order Module" O1[Order Service] O2[Order Repository] O3[(Order DB Schema)] end subgraph "Product Module" P1[Product Service] P2[Product Repository] P3[(Product DB Schema)] end subgraph "Payment Module" PA1[Payment Service] PA2[Payment Repository] PA3[(Payment DB Schema)] end API --> U1 API --> O1 API --> P1 API --> PA1 U1 --> U2 O1 --> O2 P1 --> P2 PA1 --> PA2 U2 --> U3 O2 --> O3 P2 --> P3 PA2 --> PA3 O1 -.->|Module API| U1 O1 -.->|Module API| P1 O1 -.->|Module API| PA1 end style U1 fill:#1e3a5f,color:#fff style O1 fill:#78350f,color:#fff style P1 fill:#134e4a,color:#fff style PA1 fill:#4c1d95,color:#fff Module Communication Patterns %%{init: {'theme':'dark', 'themeVariables': 
{'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant Client participant OrderModule participant UserModule participant ProductModule participant PaymentModule Client->>OrderModule: Create Order OrderModule->>UserModule: Validate User UserModule-->>OrderModule: User Valid OrderModule->>ProductModule: Check Stock ProductModule-->>OrderModule: Stock Available OrderModule->>ProductModule: Reserve Items ProductModule-->>OrderModule: Items Reserved OrderModule->>PaymentModule: Process Payment PaymentModule-->>OrderModule: Payment Success OrderModule->>ProductModule: Confirm Reservation ProductModule-->>OrderModule: Confirmed OrderModule-->>Client: Order Created Real-World Use Cases E-commerce Platforms: Product, order, inventory, and payment management SaaS Applications: Multi-tenant applications with distinct features Content Management Systems: Content, media, user, and workflow modules Banking Systems: Account, transaction, loan, and reporting modules Healthcare Systems: Patient, appointment, billing, and medical records Enterprise Applications: HR, finance, inventory, and CRM modules Modular Monolith Implementation Project Structure ├── cmd/ │ └── app/ │ └── main.go ├── internal/ │ ├── modules/ │ │ ├── user/ │ │ │ ├── domain/ │ │ │ │ ├── user.go │ │ │ │ └── repository.go │ │ │ ├── application/ │ │ │ │ └── service.go │ │ │ ├── infrastructure/ │ │ │ │ └── postgres_repository.go │ │ │ ├── api/ │ │ │ │ └── http_handler.go │ │ │ └── module.go │ │ ├── order/ │ │ │ ├── domain/ │ │ │ │ ├── order.go │ │ │ │ └── repository.go │ │ │ ├── application/ │ │ │ │ └── service.go │ │ │ ├── infrastructure/ │ │ │ │ └── postgres_repository.go │ │ │ ├── api/ │ │ │ │ └── http_handler.go │ │ │ └── module.go │ │ ├── product/ │ │ │ ├── domain/ │ │ │ │ ├── product.go │ │ │ │ └── repository.go │ │ 
│ ├── application/ │ │ │ │ └── service.go │ │ │ ├── infrastructure/ │ │ │ │ └── postgres_repository.go │ │ │ ├── api/ │ │ │ │ └── http_handler.go │ │ │ └── module.go │ │ └── payment/ │ │ ├── domain/ │ │ │ ├── payment.go │ │ │ └── repository.go │ │ ├── application/ │ │ │ └── service.go │ │ ├── infrastructure/ │ │ │ └── postgres_repository.go │ │ ├── api/ │ │ │ └── http_handler.go │ │ └── module.go │ └── shared/ │ ├── database/ │ │ └── postgres.go │ └── events/ │ └── event_bus.go └── go.mod Module 1: User Module // internal/modules/user/domain/user.go package domain import ( "context" "errors" "time" ) type UserID string type User struct { ID UserID Email string Name string Active bool CreatedAt time.Time UpdatedAt time.Time } var ( ErrUserNotFound = errors.New("user not found") ErrUserAlreadyExists = errors.New("user already exists") ErrInvalidEmail = errors.New("invalid email") ) // Repository defines the interface for user storage type Repository interface { Create(ctx context.Context, user *User) error GetByID(ctx context.Context, id UserID) (*User, error) GetByEmail(ctx context.Context, email string) (*User, error) Update(ctx context.Context, user *User) error Delete(ctx context.Context, id UserID) error } // internal/modules/user/application/service.go package application import ( "context" "fmt" "regexp" "time" "app/internal/modules/user/domain" ) type Service struct { repo domain.Repository } func NewService(repo domain.Repository) *Service { return &Service{repo: repo} } func (s *Service) CreateUser(ctx context.Context, email, name string) (*domain.User, error) { if !isValidEmail(email) { return nil, domain.ErrInvalidEmail } // Check if user exists existing, _ := s.repo.GetByEmail(ctx, email) if existing != nil { return nil, domain.ErrUserAlreadyExists } user := &domain.User{ ID: domain.UserID(generateID()), Email: email, Name: name, Active: true, CreatedAt: time.Now(), UpdatedAt: time.Now(), } if err := s.repo.Create(ctx, user); err != nil { return nil, 
fmt.Errorf("failed to create user: %w", err) } return user, nil } func (s *Service) GetUser(ctx context.Context, id domain.UserID) (*domain.User, error) { return s.repo.GetByID(ctx, id) } func (s *Service) ValidateUser(ctx context.Context, id domain.UserID) (bool, error) { user, err := s.repo.GetByID(ctx, id) if err != nil { return false, err } return user.Active, nil } func isValidEmail(email string) bool { emailRegex := regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`) return emailRegex.MatchString(email) } func generateID() string { return fmt.Sprintf("user_%d", time.Now().UnixNano()) } // internal/modules/user/infrastructure/postgres_repository.go package infrastructure import ( "context" "database/sql" "app/internal/modules/user/domain" ) type PostgresRepository struct { db *sql.DB } func NewPostgresRepository(db *sql.DB) *PostgresRepository { return &PostgresRepository{db: db} } func (r *PostgresRepository) Create(ctx context.Context, user *domain.User) error { query := ` INSERT INTO users.users (id, email, name, active, created_at, updated_at) VALUES ($1, $2, $3, $4, $5, $6) ` _, err := r.db.ExecContext(ctx, query, user.ID, user.Email, user.Name, user.Active, user.CreatedAt, user.UpdatedAt) return err } func (r *PostgresRepository) GetByID(ctx context.Context, id domain.UserID) (*domain.User, error) { query := ` SELECT id, email, name, active, created_at, updated_at FROM users.users WHERE id = $1 ` user := &domain.User{} err := r.db.QueryRowContext(ctx, query, id).Scan( &user.ID, &user.Email, &user.Name, &user.Active, &user.CreatedAt, &user.UpdatedAt, ) if err == sql.ErrNoRows { return nil, domain.ErrUserNotFound } return user, err } func (r *PostgresRepository) GetByEmail(ctx context.Context, email string) (*domain.User, error) { query := ` SELECT id, email, name, active, created_at, updated_at FROM users.users WHERE email = $1 ` user := &domain.User{} err := r.db.QueryRowContext(ctx, query, email).Scan( &user.ID, &user.Email, 
&user.Name, &user.Active, &user.CreatedAt, &user.UpdatedAt, ) if err == sql.ErrNoRows { return nil, domain.ErrUserNotFound } return user, err } func (r *PostgresRepository) Update(ctx context.Context, user *domain.User) error { query := ` UPDATE users.users SET email = $2, name = $3, active = $4, updated_at = $5 WHERE id = $1 ` _, err := r.db.ExecContext(ctx, query, user.ID, user.Email, user.Name, user.Active, user.UpdatedAt) return err } func (r *PostgresRepository) Delete(ctx context.Context, id domain.UserID) error { query := `DELETE FROM users.users WHERE id = $1` _, err := r.db.ExecContext(ctx, query, id) return err } // internal/modules/user/module.go package user import ( "database/sql" "app/internal/modules/user/application" "app/internal/modules/user/infrastructure" ) type Module struct { Service *application.Service } func NewModule(db *sql.DB) *Module { repo := infrastructure.NewPostgresRepository(db) service := application.NewService(repo) return &Module{ Service: service, } } Module 2: Product Module // internal/modules/product/domain/product.go package domain import ( "context" "errors" "time" ) type ProductID string type Product struct { ID ProductID Name string Description string Price float64 Stock int CreatedAt time.Time UpdatedAt time.Time } var ( ErrProductNotFound = errors.New("product not found") ErrInsufficientStock = errors.New("insufficient stock") ErrInvalidPrice = errors.New("invalid price") ) type Repository interface { Create(ctx context.Context, product *Product) error GetByID(ctx context.Context, id ProductID) (*Product, error) Update(ctx context.Context, product *Product) error ReserveStock(ctx context.Context, id ProductID, quantity int) error ReleaseStock(ctx context.Context, id ProductID, quantity int) error } // internal/modules/product/application/service.go package application import ( "context" "fmt" "time" "app/internal/modules/product/domain" ) type Service struct { repo domain.Repository } func NewService(repo domain.Repository) 
*Service { return &Service{repo: repo} } func (s *Service) CreateProduct(ctx context.Context, name, description string, price float64, stock int) (*domain.Product, error) { if price <= 0 { return nil, domain.ErrInvalidPrice } product := &domain.Product{ ID: domain.ProductID(generateID()), Name: name, Description: description, Price: price, Stock: stock, CreatedAt: time.Now(), UpdatedAt: time.Now(), } if err := s.repo.Create(ctx, product); err != nil { return nil, fmt.Errorf("failed to create product: %w", err) } return product, nil } func (s *Service) GetProduct(ctx context.Context, id domain.ProductID) (*domain.Product, error) { return s.repo.GetByID(ctx, id) } func (s *Service) CheckStock(ctx context.Context, id domain.ProductID, quantity int) (bool, error) { product, err := s.repo.GetByID(ctx, id) if err != nil { return false, err } return product.Stock >= quantity, nil } func (s *Service) ReserveStock(ctx context.Context, id domain.ProductID, quantity int) error { available, err := s.CheckStock(ctx, id, quantity) if err != nil { return err } if !available { return domain.ErrInsufficientStock } return s.repo.ReserveStock(ctx, id, quantity) } func (s *Service) ConfirmReservation(ctx context.Context, id domain.ProductID, quantity int) error { // In a real implementation, this would mark the reservation as confirmed return nil } func generateID() string { return fmt.Sprintf("product_%d", time.Now().UnixNano()) } // internal/modules/product/infrastructure/postgres_repository.go package infrastructure import ( "context" "database/sql" "app/internal/modules/product/domain" ) type PostgresRepository struct { db *sql.DB } func NewPostgresRepository(db *sql.DB) *PostgresRepository { return &PostgresRepository{db: db} } func (r *PostgresRepository) Create(ctx context.Context, product *domain.Product) error { query := ` INSERT INTO products.products (id, name, description, price, stock, created_at, updated_at) VALUES ($1, $2, $3, $4, $5, $6, $7) ` _, err := 
r.db.ExecContext(ctx, query, product.ID, product.Name, product.Description, product.Price, product.Stock, product.CreatedAt, product.UpdatedAt) return err } func (r *PostgresRepository) GetByID(ctx context.Context, id domain.ProductID) (*domain.Product, error) { query := ` SELECT id, name, description, price, stock, created_at, updated_at FROM products.products WHERE id = $1 ` product := &domain.Product{} err := r.db.QueryRowContext(ctx, query, id).Scan( &product.ID, &product.Name, &product.Description, &product.Price, &product.Stock, &product.CreatedAt, &product.UpdatedAt, ) if err == sql.ErrNoRows { return nil, domain.ErrProductNotFound } return product, err } func (r *PostgresRepository) Update(ctx context.Context, product *domain.Product) error { query := ` UPDATE products.products SET name = $2, description = $3, price = $4, stock = $5, updated_at = $6 WHERE id = $1 ` _, err := r.db.ExecContext(ctx, query, product.ID, product.Name, product.Description, product.Price, product.Stock, product.UpdatedAt) return err } func (r *PostgresRepository) ReserveStock(ctx context.Context, id domain.ProductID, quantity int) error { query := ` UPDATE products.products SET stock = stock - $2 WHERE id = $1 AND stock >= $2 ` result, err := r.db.ExecContext(ctx, query, id, quantity) if err != nil { return err } rows, err := result.RowsAffected() if err != nil { return err } if rows == 0 { return domain.ErrInsufficientStock } return nil } func (r *PostgresRepository) ReleaseStock(ctx context.Context, id domain.ProductID, quantity int) error { query := ` UPDATE products.products SET stock = stock + $2 WHERE id = $1 ` _, err := r.db.ExecContext(ctx, query, id, quantity) return err } // internal/modules/product/module.go package product import ( "database/sql" "app/internal/modules/product/application" "app/internal/modules/product/infrastructure" ) type Module struct { Service *application.Service } func NewModule(db *sql.DB) *Module { repo := 
infrastructure.NewPostgresRepository(db) service := application.NewService(repo) return &Module{ Service: service, } } Module 3: Order Module (Coordinates Other Modules) // internal/modules/order/domain/order.go package domain import ( "context" "errors" "time" productdomain "app/internal/modules/product/domain" userdomain "app/internal/modules/user/domain" ) type OrderID string type OrderStatus string const ( OrderStatusPending OrderStatus = "pending" OrderStatusConfirmed OrderStatus = "confirmed" OrderStatusCancelled OrderStatus = "cancelled" ) type OrderItem struct { ProductID productdomain.ProductID Quantity int Price float64 } type Order struct { ID OrderID UserID userdomain.UserID Items []OrderItem Total float64 Status OrderStatus CreatedAt time.Time UpdatedAt time.Time } var ( ErrOrderNotFound = errors.New("order not found") ErrInvalidOrder = errors.New("invalid order") ) type Repository interface { Create(ctx context.Context, order *Order) error GetByID(ctx context.Context, id OrderID) (*Order, error) Update(ctx context.Context, order *Order) error } // internal/modules/order/application/service.go package application import ( "context" "fmt" "time" orderdomain "app/internal/modules/order/domain" productapp "app/internal/modules/product/application" userapp "app/internal/modules/user/application" userdomain "app/internal/modules/user/domain" ) // Service coordinates between modules type Service struct { repo orderdomain.Repository userService *userapp.Service productService *productapp.Service } func NewService( repo orderdomain.Repository, userService *userapp.Service, productService *productapp.Service, ) *Service { return &Service{ repo: repo, userService: userService, productService: productService, } } func (s *Service) CreateOrder(ctx context.Context, userID userdomain.UserID, items []orderdomain.OrderItem) (*orderdomain.Order, error) { // Validate user through User module valid, err := s.userService.ValidateUser(ctx, userID) if err != nil { return nil, fmt.Errorf("failed to validate user: %w", err) } if !valid { return nil, 
fmt.Errorf("user is not active") } // Calculate total and validate products var total float64 for i, item := range items { product, err := s.productService.GetProduct(ctx, item.ProductID) if err != nil { return nil, fmt.Errorf("failed to get product: %w", err) } // Check stock availability available, err := s.productService.CheckStock(ctx, item.ProductID, item.Quantity) if err != nil { return nil, fmt.Errorf("failed to check stock: %w", err) } if !available { return nil, fmt.Errorf("insufficient stock for product %s", item.ProductID) } items[i].Price = product.Price total += product.Price * float64(item.Quantity) } // Reserve stock for all items for _, item := range items { if err := s.productService.ReserveStock(ctx, item.ProductID, item.Quantity); err != nil { // Rollback reservations on failure return nil, fmt.Errorf("failed to reserve stock: %w", err) } } order := &orderdomain.Order{ ID: orderdomain.OrderID(generateID()), UserID: userID, Items: items, Total: total, Status: orderdomain.OrderStatusPending, CreatedAt: time.Now(), UpdatedAt: time.Now(), } if err := s.repo.Create(ctx, order); err != nil { return nil, fmt.Errorf("failed to create order: %w", err) } return order, nil } func (s *Service) GetOrder(ctx context.Context, id orderdomain.OrderID) (*orderdomain.Order, error) { return s.repo.GetByID(ctx, id) } func (s *Service) ConfirmOrder(ctx context.Context, id orderdomain.OrderID) error { order, err := s.repo.GetByID(ctx, id) if err != nil { return err } // Confirm stock reservations for _, item := range order.Items { if err := s.productService.ConfirmReservation(ctx, item.ProductID, item.Quantity); err != nil { return fmt.Errorf("failed to confirm reservation: %w", err) } } order.Status = orderdomain.OrderStatusConfirmed order.UpdatedAt = time.Now() return s.repo.Update(ctx, order) } func generateID() string { return fmt.Sprintf("order_%d", time.Now().UnixNano()) } // internal/modules/order/module.go package order import ( "database/sql" 
"app/internal/modules/order/application" "app/internal/modules/order/infrastructure" productapp "app/internal/modules/product/application" userapp "app/internal/modules/user/application" ) type Module struct { Service *application.Service } func NewModule(db *sql.DB, userService *userapp.Service, productService *productapp.Service) *Module { repo := infrastructure.NewPostgresRepository(db) service := application.NewService(repo, userService, productService) return &Module{ Service: service, } } Main Application // cmd/app/main.go package main import ( "database/sql" "log" "net/http" _ "github.com/lib/pq" "app/internal/modules/user" "app/internal/modules/product" "app/internal/modules/order" ) func main() { // Initialize database db, err := sql.Open("postgres", "postgres://user:pass@localhost/modular_monolith?sslmode=disable") if err != nil { log.Fatal(err) } defer db.Close() // Initialize modules userModule := user.NewModule(db) productModule := product.NewModule(db) orderModule := order.NewModule(db, userModule.Service, productModule.Service) // Setup HTTP routes mux := http.NewServeMux() // User endpoints mux.HandleFunc("POST /users", func(w http.ResponseWriter, r *http.Request) { // Handle user creation }) // Product endpoints mux.HandleFunc("POST /products", func(w http.ResponseWriter, r *http.Request) { // Handle product creation }) // Order endpoints mux.HandleFunc("POST /orders", func(w http.ResponseWriter, r *http.Request) { // Handle order creation using orderModule.Service (the blank assignment keeps this stub compiling until the handler is written) _ = orderModule.Service }) log.Println("Server starting on :8080") if err := http.ListenAndServe(":8080", mux); err != nil { log.Fatal(err) } } Best Practices Module Boundaries: Keep modules independent with clear interfaces Shared Database: Use schemas or table prefixes to separate module data Module APIs: Define explicit APIs for inter-module communication Dependency Direction: Modules should depend on interfaces, not implementations Event-Driven Communication: Use events for async inter-module communication 
Transaction Management: Handle cross-module transactions carefully Testing: Test modules independently with mocked dependencies Documentation: Document module APIs and boundaries clearly Common Pitfalls Shared Models: Sharing domain models between modules creates tight coupling Direct Database Access: Modules accessing other modules’ database tables Circular Dependencies: Modules depending on each other directly Anemic Modules: Modules with no business logic, just CRUD operations God Modules: Modules that know too much about other modules Ignoring Boundaries: Calling internal implementations instead of module APIs Synchronous Coupling: Over-reliance on synchronous inter-module calls When to Use Modular Monolith Use When: ...

    January 19, 2025 · 13 min · Rafiul Alam

    Go Concurrency Pattern: The Bank Account Drama

    The Problem: Money Vanishing Into Thin Air Two people share a bank account with $100. Both check the balance at the same time, see $100, and both withdraw $100. The bank just lost $100. This isn’t hypothetical: race conditions in financial systems have caused real monetary losses. The bank account drama illustrates the fundamental challenge of concurrent programming: read-modify-write operations are not atomic. What seems like a simple operation actually involves multiple steps, and when multiple goroutines execute these steps concurrently, chaos ensues. ...
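    One common way to make the check-then-withdraw sequence atomic is to guard the balance with a mutex. The sketch below is an illustration under that assumption, not necessarily the approach the full post takes; Account, Withdraw, and simulate are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrInsufficientFunds = errors.New("insufficient funds")

// Account guards its balance with a mutex so that the balance check and the
// deduction happen as one step.
type Account struct {
	mu      sync.Mutex
	balance int
}

// Withdraw deducts amount only if the balance covers it.
func (a *Account) Withdraw(amount int) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.balance < amount {
		return ErrInsufficientFunds
	}
	a.balance -= amount
	return nil
}

// Balance returns the current balance.
func (a *Account) Balance() int {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.balance
}

// simulate runs n concurrent withdrawals of amount against an account holding
// initial, and reports how many succeeded plus the final balance.
func simulate(initial, amount, n int) (succeeded, final int) {
	acct := &Account{balance: initial}
	var wg sync.WaitGroup
	var countMu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := acct.Withdraw(amount); err == nil {
				countMu.Lock()
				succeeded++
				countMu.Unlock()
			}
		}()
	}
	wg.Wait()
	return succeeded, acct.Balance()
}

func main() {
	ok, final := simulate(100, 100, 2)
	// prints "successful withdrawals: 1, final balance: 0"
	fmt.Printf("successful withdrawals: %d, final balance: %d\n", ok, final)
}
```

    Whichever goroutine takes the lock first withdraws the $100, and the second sees a balance of 0 and is rejected, so the two-withdrawals-from-$100 scenario cannot occur.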

    January 18, 2025 · 7 min · Rafiul Alam