Release Management: From Semantic Versioning to Production Deployment

    Introduction
    Release management is the process of planning, scheduling, and controlling software releases through different stages and environments. It ensures that software is released reliably, predictably, and with minimal disruption. This guide visualizes key release management concepts:
    Semantic Versioning: Deciding when to bump major, minor, or patch versions
    Release Train: Structured release cadence with quality gates
    Hotfix Process: Fast-track critical fixes to production
    Release Checklist: Ensuring nothing is missed during deployment
    Environment Promotion: Moving code through dev, staging, and production
    Part 1: Semantic Versioning Decision Tree
    Understanding Version Numbers: MAJOR.MINOR.PATCH
    Semantic versioning (SemVer) uses a three-part version number: MAJOR.MINOR.PATCH ...
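    A minimal sketch of the bump decision the post's decision tree encodes (the helper name and flags below are illustrative, not taken from the article): a breaking change bumps MAJOR and resets the rest, a backwards-compatible feature bumps MINOR, and anything else bumps PATCH.

    package main

    import "fmt"

    // nextVersion applies the SemVer rules: breaking change -> MAJOR,
    // new backwards-compatible feature -> MINOR, bug fix only -> PATCH.
    func nextVersion(major, minor, patch int, breaking, feature bool) (int, int, int) {
        switch {
        case breaking:
            return major + 1, 0, 0
        case feature:
            return major, minor + 1, 0
        default:
            return major, minor, patch + 1
        }
    }

    func main() {
        maj, minr, pat := nextVersion(1, 4, 2, false, true)
        fmt.Printf("v%d.%d.%d\n", maj, minr, pat) // prints v1.5.0
    }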

    January 24, 2025 · 19 min · Rafiul Alam

    Deployment Strategies: Blue-Green, Canary, Rolling Updates

    Introduction
    Choosing the right deployment strategy is critical for minimizing downtime and risk when releasing new versions of your application. Different strategies offer different trade-offs between speed, safety, and resource usage. This guide visualizes three essential deployment strategies:
    Rolling Updates: Gradual replacement of instances
    Blue-Green Deployments: Instant cutover between versions
    Canary Deployments: Progressive rollout with traffic splitting
    Comparison and Use Cases: When to use each strategy
    Part 1: Rolling Update Deployment
    Rolling updates gradually replace old version pods with new version pods, ensuring continuous availability. ...
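    For intuition on the canary approach, here is an illustrative weighted router in Go (an assumption for this listing, not code from the post): a fixed percentage of requests goes to the canary build while the rest stays on the stable version, and the weight is raised as confidence grows.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // canaryRouter sends roughly canaryPercent of requests to the canary version.
    type canaryRouter struct {
        canaryPercent int // 0-100
    }

    func (r canaryRouter) pick() string {
        if rand.Intn(100) < r.canaryPercent {
            return "canary"
        }
        return "stable"
    }

    func main() {
        r := canaryRouter{canaryPercent: 10} // start the rollout at 10% of traffic
        counts := map[string]int{}
        for i := 0; i < 10000; i++ {
            counts[r.pick()]++
        }
        fmt.Println(counts) // roughly 90% stable, 10% canary
    }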

    January 23, 2025 · 11 min · Rafiul Alam

    Multi-Environment Pipeline: Dev → Staging → Production

    Introduction Multi-environment pipelines enable safe, progressive deployment of code changes through isolated environments. Each environment serves a specific purpose in validating changes before they reach production users. This guide visualizes the multi-environment deployment flow: Environment Hierarchy: Dev → Staging → Production Environment Isolation: Separate configs, databases, resources Progressive Promotion: Automated testing at each stage Approval Gates: Manual checkpoints for production Configuration Management: Environment-specific settings Part 1: Multi-Environment Architecture Complete Environment Flow %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Dev([👨‍💻 Developer]) --> LocalDev[Local DevelopmentLaptop/Docker DesktopFast iteration] LocalDev --> Push[git push origin feature/new-api] Push --> CI[CI Pipeline TriggeredBuild + Test + Lint] CI --> CIPass{CIPassed?} CIPass -->|No| FixLocal[❌ Fix locallyCheck logsRun tests] FixLocal -.-> LocalDev CIPass -->|Yes| FeatureBranch{Branchtype?} FeatureBranch -->|feature/*| DevEnv[🔧 Dev EnvironmentNamespace: devAuto-deploy on push] FeatureBranch -->|main| StagingEnv[🎯 Staging EnvironmentNamespace: stagingAuto-deploy on merge] subgraph DevEnvironment[Development Environment] DevEnv --> DevConfig[Configuration:- Debug mode ON- Verbose logging- Mock external APIs- Dev database- Minimal replicas: 1] DevConfig --> DevTest[Basic Tests:- Smoke tests- Health checks- Manual QA] DevTest --> DevDone[✅ Dev validatedReady for staging] end DevDone --> MergePR[Merge Pull Requestto main branch] MergePR --> StagingEnv subgraph StagingEnvironment[Staging Environment] StagingEnv --> StagingConfig[Configuration:- Production-like setup- Staging database- Real external APIs test- Replicas: 2-3- Resource limits] StagingConfig --> StagingTest[Comprehensive Tests:- Integration tests- E2E tests- Performance tests- Security scans] StagingTest --> StagingResult{All testspassed?} StagingResult -->|No| StagingFail[❌ Staging failedRollback stagingFix issues] StagingFail -.-> FixLocal StagingResult -->|Yes| StagingMonitor[Monitor staging:- Error rates- Performance metrics- User acceptance testing] StagingMonitor --> StagingReady[✅ Staging validatedReady for production] end StagingReady --> ApprovalGate{ManualApprovalRequired} ApprovalGate --> ReviewTeam[Team Lead Review:- Code changes- Test results- Risk assessment- Deployment timing] ReviewTeam --> Approved{Approved?} Approved -->|No| Rejected[❌ RejectedMore testing neededor wrong timing] Approved -->|Yes| ProdEnv[🚀 Production EnvironmentNamespace: productionManual trigger only] subgraph ProductionEnvironment[Production Environment] ProdEnv --> ProdConfig[Configuration:- Production settings- Production database- High availability- Replicas: 5-10- Strict resource limits- Auto-scaling enabled] ProdConfig --> ProdDeploy[Deployment Strategy:- Blue-green or- Canary or- Rolling update] ProdDeploy --> ProdHealth{Productionhealthy?} ProdHealth -->|No| AutoRollback[🚨 Auto-rollbackRevert to previousAlert on-call team] ProdHealth -->|Yes| ProdMonitor[Monitor Production:- Real user metrics- Error rates- Business KPIs- SLO compliance] ProdMonitor --> ProdStable{Stable for15 minutes?} ProdStable -->|No| AutoRollback ProdStable -->|Yes| Success[✅ Deployment Complete!New version liveMonitor 
continues] end style DevEnv fill:#064e3b,stroke:#10b981 style StagingEnv fill:#78350f,stroke:#f59e0b style ProdEnv fill:#1e3a8a,stroke:#3b82f6 style Success fill:#064e3b,stroke:#10b981 style StagingFail fill:#7f1d1d,stroke:#ef4444 style AutoRollback fill:#7f1d1d,stroke:#ef4444 style Rejected fill:#7f1d1d,stroke:#ef4444 Part 2: Environment Comparison Environment Characteristics %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% graph TB subgraph Local[🏠 Local Development] LocalProps[Properties:✓ Fast iteration✓ Developer's laptop✓ Docker Compose✓ Mock services✓ Hot reload enabled] LocalData[Data:- SQLite or local DB- Seed data- No real user data- Quick reset] LocalAccess[Access:- localhost only- No authentication- Debug tools enabled] end subgraph Dev[🔧 Development Environment] DevProps[Properties:✓ Shared team env✓ Kubernetes cluster✓ Continuous deployment✓ Latest features✓ Can be unstable] DevData[Data:- Dev database- Synthetic test data- Reset weekly- No PII] DevAccess[Access:- VPN required- Basic auth- All developers- Debug mode ON] end subgraph Staging[🎯 Staging Environment] StagingProps[Properties:✓ Production mirror✓ Same infrastructure✓ Pre-production testing✓ Stable builds only✓ Performance testing] StagingData[Data:- Staging database- Anonymized prod data- Or realistic test data- Refreshed monthly] StagingAccess[Access:- VPN required- OAuth/SSO- Developers + QA- Debug mode OFF] end subgraph Prod[🚀 Production Environment] ProdProps[Properties:✓ Live customer traffic✓ High availability✓ Auto-scaling✓ Disaster recovery✓ Maximum stability] ProdData[Data:- Production database- Real user data- Encrypted at rest- Regular backups] ProdAccess[Access:- Public internet- Full authentication- Limited admin access- Audit logging enabled] end Local --> |git push feature/*| Dev Dev --> |Merge to main| Staging Staging --> |Manual approval| Prod style Local fill:#064e3b,stroke:#10b981 style Dev fill:#064e3b,stroke:#10b981 style Staging fill:#78350f,stroke:#f59e0b style Prod fill:#1e3a8a,stroke:#3b82f6 Environment Configuration Matrix Aspect Local Dev Staging Production Purpose Development Feature testing Pre-production validation Live users Deployment Manual Auto on push Auto on merge Manual approval Replicas 1 1-2 2-3 5-10+ Database Local SQLite Shared dev DB Staging DB (prod-like) Production DB Resources Minimal Low Medium (prod-like) High Monitoring None Basic Full Full + Alerts Debug Mode Yes Yes No No Logging Level DEBUG DEBUG INFO WARN/ERROR External APIs Mocked Test endpoints Test endpoints Production endpoints Data Seed data Synthetic Anonymized Real user data Access localhost VPN + Basic auth VPN + SSO Public + Full auth Uptime SLA N/A None None 99.9%+ Part 3: Progressive Promotion Pipeline Promotion Flow with Quality Gates %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart LR subgraph LocalStage[Local Stage] L1[Write Code] L2[Run Unit Tests] L3[Manual Testing] L1 --> L2 --> L3 end subgraph DevStage[Dev Stage] D1[Auto Deploy] D2[Smoke Tests] D3{TestsPass?} D4[Dev Validated ✓] D1 --> D2 --> D3 D3 -->|Yes| D4 D3 -->|No| D5[❌ Fix] D5 -.-> L1 end 
subgraph StagingStage[Staging Stage] S1[Auto Deploy] S2[Integration Tests] S3[E2E Tests] S4[Performance Tests] S5{All Pass?} S6[Staging Validated ✓] S1 --> S2 --> S3 --> S4 --> S5 S5 -->|Yes| S6 S5 -->|No| S7[❌ Fix] S7 -.-> L1 end subgraph ApprovalStage[Approval Gate] A1[Create Release] A2[Code Review] A3[Change Advisory] A4{Approved?} A1 --> A2 --> A3 --> A4 A4 -->|No| A5[❌ Rejected] A5 -.-> L1 end subgraph ProdStage[Production Stage] P1[Manual Deploy] P2[Canary 10%] P3{Healthy?} P4[Increase to 50%] P5{Healthy?} P6[Complete 100%] P7[Monitor] P8[Success ✓] P1 --> P2 --> P3 P3 -->|Yes| P4 --> P5 P5 -->|Yes| P6 --> P7 --> P8 P3 -->|No| P9[🚨 Rollback] P5 -->|No| P9 end L3 --> |git push| D1 D4 --> |Merge PR| S1 S6 --> A1 A4 -->|Yes| P1 style L3 fill:#064e3b,stroke:#10b981 style D4 fill:#064e3b,stroke:#10b981 style S6 fill:#064e3b,stroke:#10b981 style P8 fill:#064e3b,stroke:#10b981 style D5 fill:#7f1d1d,stroke:#ef4444 style S7 fill:#7f1d1d,stroke:#ef4444 style P9 fill:#7f1d1d,stroke:#ef4444 Part 4: Environment-Specific Configuration Configuration Management Strategy %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Application needs config]) --> Method{ConfigMethod?} Method --> EnvVars[Environment Variables] Method --> ConfigMaps[Kubernetes ConfigMaps] Method --> Secrets[Kubernetes Secrets] EnvVars --> EnvExample[Examples:- NODE_ENV=production- LOG_LEVEL=info- FEATURE_FLAGS=true] ConfigMaps --> CMExample[Examples:- app-config.yaml- nginx.conf- application.properties] Secrets --> SecretExample[Examples:- DATABASE_PASSWORD- API_KEYS- TLS certificates] EnvExample --> Override{Override perenvironment?} CMExample --> Override SecretExample --> Override Override --> DevOverride[Dev Environment:DEBUG=trueDB_HOST=dev-dbREPLICAS=1CACHE_TTL=60s] Override --> StagingOverride[Staging Environment:DEBUG=falseDB_HOST=staging-dbREPLICAS=3CACHE_TTL=300s] Override --> ProdOverride[Production Environment:DEBUG=falseDB_HOST=prod-dbREPLICAS=10CACHE_TTL=600s] DevOverride --> Inject[Inject at deployment:kubectl apply -f k8s/dev/- deployment.yaml- configmap.yaml- secrets.yaml] StagingOverride --> Inject ProdOverride --> Inject style EnvVars fill:#1e3a8a,stroke:#3b82f6 style ConfigMaps fill:#1e3a8a,stroke:#3b82f6 style Secrets fill:#7f1d1d,stroke:#ef4444 Kubernetes Configuration Example # k8s/base/deployment.yaml (Common base) apiVersion: apps/v1 kind: Deployment metadata: name: myapp spec: selector: matchLabels: app: myapp template: metadata: labels: app: myapp spec: containers: - name: myapp image: myapp:latest # Overridden per environment ports: - containerPort: 8080 envFrom: - configMapRef: name: myapp-config - secretRef: name: myapp-secrets resources: # Overridden per environment requests: memory: "128Mi" cpu: "100m" limits: memory: "256Mi" cpu: "200m" --- # k8s/dev/configmap.yaml apiVersion: v1 kind: ConfigMap metadata: name: myapp-config namespace: dev data: NODE_ENV: "development" LOG_LEVEL: "debug" DATABASE_HOST: "postgres.dev.svc.cluster.local" REDIS_HOST: "redis.dev.svc.cluster.local" FEATURE_NEW_UI: "true" FEATURE_BETA_API: "true" --- # k8s/staging/configmap.yaml apiVersion: v1 kind: ConfigMap metadata: name: myapp-config namespace: staging data: NODE_ENV: "staging" LOG_LEVEL: "info" DATABASE_HOST: "postgres.staging.svc.cluster.local" REDIS_HOST: 
"redis.staging.svc.cluster.local" FEATURE_NEW_UI: "true" FEATURE_BETA_API: "false" --- # k8s/production/configmap.yaml apiVersion: v1 kind: ConfigMap metadata: name: myapp-config namespace: production data: NODE_ENV: "production" LOG_LEVEL: "warn" DATABASE_HOST: "postgres.production.svc.cluster.local" REDIS_HOST: "redis.production.svc.cluster.local" FEATURE_NEW_UI: "false" # Gradual rollout FEATURE_BETA_API: "false" --- # k8s/dev/kustomization.yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization namespace: dev resources: - ../base/deployment.yaml - configmap.yaml - secrets.yaml images: - name: myapp newTag: dev-abc123 replicas: - name: myapp count: 1 patches: - patch: |- - op: replace path: /spec/template/spec/containers/0/resources/requests/memory value: 128Mi - op: replace path: /spec/template/spec/containers/0/resources/limits/memory value: 256Mi target: kind: Deployment name: myapp --- # k8s/production/kustomization.yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization namespace: production resources: - ../base/deployment.yaml - configmap.yaml - secrets.yaml images: - name: myapp newTag: v1.2.3 replicas: - name: myapp count: 10 patches: - patch: |- - op: replace path: /spec/template/spec/containers/0/resources/requests/memory value: 512Mi - op: replace path: /spec/template/spec/containers/0/resources/limits/memory value: 1Gi - op: replace path: /spec/template/spec/containers/0/resources/requests/cpu value: 500m - op: replace path: /spec/template/spec/containers/0/resources/limits/cpu value: 1000m target: kind: Deployment name: myapp Part 5: Database Migration Strategy Multi-Environment Database Flow %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant Dev as Developer participant DevDB as Dev Database participant StagingDB as Staging Database participant ProdDB as Production Database participant Migration as Migration Tool Note over Dev: Write migration:001_add_users_table.sql Dev->>DevDB: Run migration locallyCREATE TABLE users... 
DevDB-->>Dev: Migration applied ✓ Dev->>Dev: Test applicationwith new schema Dev->>Dev: git push feature/add-users Note over DevDB: CI/CD Pipeline triggered Dev->>DevDB: Auto-run migrationsin dev environment DevDB-->>Dev: Dev DB updated ✓ Note over Dev: Create Pull RequestMerge to main Dev->>StagingDB: Trigger staging deployment Note over Migration,StagingDB: Pre-deployment hook Migration->>StagingDB: Backup databasepg_dump > backup.sql Migration->>StagingDB: Run migrations001_add_users_table.sql StagingDB-->>Migration: Migration applied ✓ Note over StagingDB: Deploy applicationTest with new schema alt Migration Failed Migration->>StagingDB: Rollback migrationRestore from backup StagingDB-->>Migration: Rolled back end Note over Dev: Manual approvalfor production Dev->>ProdDB: Trigger production deployment Note over Migration,ProdDB: Pre-deployment steps Migration->>ProdDB: Full database backupSnapshot created Migration->>ProdDB: Check migration statusSELECT version FROM schema_migrations ProdDB-->>Migration: Current version: 000 Migration->>ProdDB: Run migrationsin transaction Note over Migration,ProdDB: BEGIN;CREATE TABLE users;INSERT INTO schema_migrationsVALUES ('001');COMMIT; ProdDB-->>Migration: Migration successful ✓ Note over ProdDB: Deploy new applicationversion alt Production Issues Migration->>ProdDB: Rollback migrationRun down migration:DROP TABLE users; Note over ProdDB: Deploy previousapplication version end Migration->>ProdDB: Verify data integrityCheck constraints ProdDB-->>Migration: All checks passed ✓ Note over Dev,ProdDB: Production updated successfully Part 6: Multi-Environment CI/CD Pipeline Complete Pipeline Configuration # .github/workflows/multi-env-deploy.yml name: Multi-Environment Deployment on: push: branches: - main - develop pull_request: branches: - main env: REGISTRY: ghcr.io IMAGE_NAME: ${{ github.repository }} jobs: # CI - Same for all environments build-and-test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Run linting run: npm run lint - name: Run unit tests run: npm test - name: Build Docker image run: docker build -t $IMAGE_NAME:${{ github.sha }} . 
- name: Run integration tests run: docker-compose -f docker-compose.test.yml up --abort-on-container-exit - name: Push image run: | echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin docker push $IMAGE_NAME:${{ github.sha }} # Deploy to Dev - Auto on feature branches deploy-dev: needs: build-and-test if: github.ref != 'refs/heads/main' runs-on: ubuntu-latest environment: name: development url: https://dev.example.com steps: - uses: actions/checkout@v3 - name: Deploy to Dev run: | kubectl config set-cluster dev --server="${{ secrets.DEV_K8S_SERVER }}" kubectl config set-credentials admin --token="${{ secrets.DEV_K8S_TOKEN }}" kubectl set image deployment/myapp myapp=$IMAGE_NAME:${{ github.sha }} -n dev kubectl rollout status deployment/myapp -n dev - name: Run smoke tests run: | curl https://dev.example.com/health npm run test:smoke -- --env=dev # Deploy to Staging - Auto on main branch deploy-staging: needs: build-and-test if: github.ref == 'refs/heads/main' runs-on: ubuntu-latest environment: name: staging url: https://staging.example.com steps: - uses: actions/checkout@v3 - name: Run database migrations run: | kubectl exec -n staging deployment/postgres -- \ psql -U postgres -d app -f /migrations/migrate.sql - name: Deploy to Staging run: | kubectl config set-cluster staging --server="${{ secrets.STAGING_K8S_SERVER }}" kubectl config set-credentials admin --token="${{ secrets.STAGING_K8S_TOKEN }}" kubectl apply -k k8s/staging/ kubectl rollout status deployment/myapp -n staging --timeout=5m - name: Run E2E tests run: npm run test:e2e -- --env=staging - name: Run performance tests run: | k6 run --vus 10 --duration 30s tests/performance.js - name: Check staging health run: | curl https://staging.example.com/health | jq '.status' | grep -q "healthy" # Deploy to Production - Manual approval required deploy-production: needs: deploy-staging runs-on: ubuntu-latest environment: name: production url: https://example.com steps: - uses: actions/checkout@v3 - name: Backup production database run: | kubectl exec -n production deployment/postgres -- \ pg_dump -U postgres app > backup-$(date +%Y%m%d-%H%M%S).sql - name: Run database migrations run: | kubectl exec -n production deployment/postgres -- \ psql -U postgres -d app -f /migrations/migrate.sql - name: Deploy to Production (Blue-Green) run: | kubectl config set-cluster prod --server="${{ secrets.PROD_K8S_SERVER }}" kubectl config set-credentials admin --token="${{ secrets.PROD_K8S_TOKEN }}" # Deploy green version kubectl apply -k k8s/production/ kubectl rollout status deployment/myapp-green -n production --timeout=10m # Switch traffic to green kubectl patch service myapp -n production -p '{"spec":{"selector":{"version":"green"}}}' - name: Monitor production metrics run: | sleep 300 # Wait 5 minutes ERROR_RATE=$(curl -s prometheus.example.com/api/v1/query?query=rate5m) if [ "$ERROR_RATE" -gt "0.01" ]; then echo "Error rate too high, rolling back" kubectl patch service myapp -n production -p '{"spec":{"selector":{"version":"blue"}}}' exit 1 fi - name: Notify team if: success() uses: slackapi/slack-github-action@v1 with: payload: | { "text": "✅ Production deployment successful!", "version": "${{ github.sha }}", "deployed_by": "${{ github.actor }}" } Part 7: Best Practices Environment Management Checklist ✅ DO: ...
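    To make the configuration-management idea concrete, here is a small Go sketch (the loader itself is an assumption, though the variable names mirror the ConfigMap keys above) that reads the environment-specific settings injected per namespace and falls back to development defaults:

    package main

    import (
        "fmt"
        "os"
    )

    type Config struct {
        Env      string
        LogLevel string
        DBHost   string
        Debug    bool
    }

    // getenv returns the environment variable's value, or a fallback default.
    func getenv(key, fallback string) string {
        if v := os.Getenv(key); v != "" {
            return v
        }
        return fallback
    }

    // Load builds the runtime configuration from the injected environment.
    func Load() Config {
        env := getenv("NODE_ENV", "development")
        return Config{
            Env:      env,
            LogLevel: getenv("LOG_LEVEL", "debug"),
            DBHost:   getenv("DATABASE_HOST", "localhost"),
            Debug:    env != "production",
        }
    }

    func main() {
        fmt.Printf("%+v\n", Load())
    }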

    January 23, 2025 · 11 min · Rafiul Alam

    Rollback & Recovery: Detection to Previous Version

    Introduction Even with the best testing, production issues happen. Having a solid rollback and recovery strategy is critical for minimizing downtime and data loss when deployments go wrong. This guide visualizes the complete rollback process: Issue Detection: Monitoring alerts and health checks Rollback Decision: When to rollback vs forward fix Rollback Execution: Different rollback strategies Data Recovery: Handling database changes Post-Incident: Learning and prevention Part 1: Issue Detection Flow From Healthy to Incident %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Production deploymentcompleted]) --> Monitor[Monitoring Systems- Prometheus metrics- Application logs- User reports- Health checks] Monitor --> Baseline[Baseline Metrics:✓ Error rate: 0.1%✓ Latency p95: 150ms✓ Traffic: 10k req/min✓ CPU: 40%✓ Memory: 60%] Baseline --> Time[Time passes...Minutes after deployment] Time --> Detect{Issuedetected?} Detect -->|No issue| Healthy[✅ Deployment HealthyContinue monitoringAll metrics normal] Detect -->|Yes| IssueType{Issuetype?} IssueType --> ErrorSpike[🔴 Error Rate Spike0.1% → 15%Alert: HighErrorRate firing] IssueType --> LatencySpike[🟡 Latency Increasep95: 150ms → 5000msAlert: HighLatency firing] IssueType --> TrafficDrop[🟠 Traffic Drop10k → 1k req/minUsers can't access] IssueType --> ResourceIssue[🔴 Resource ExhaustionCPU: 40% → 100%OOMKilled events] IssueType --> DataCorruption[🔴 Data IssuesDatabase errorsInvalid data returned] ErrorSpike --> Severity1[Severity: CRITICALUser impact: HIGHAffecting all users] LatencySpike --> Severity2[Severity: WARNINGUser impact: MEDIUMSlow but functional] TrafficDrop --> Severity3[Severity: CRITICALUser impact: HIGHComplete outage] ResourceIssue --> Severity4[Severity: CRITICALUser impact: HIGHPods crashing] DataCorruption --> Severity5[Severity: CRITICALUser impact: CRITICALData integrity at risk] Severity1 --> AutoAlert[🚨 Automated Alerts:- PagerDuty page- Slack notification- Email alerts- Status page update] Severity2 --> AutoAlert Severity3 --> AutoAlert Severity4 --> AutoAlert Severity5 --> AutoAlert AutoAlert --> OnCall[On-Call EngineerReceives alertAcknowledges incident] OnCall --> Investigate[Quick Investigation:- Check deployment timeline- Review recent changes- Check logs- Verify metrics] Investigate --> RootCause{Root causeidentified?} RootCause -->|Yes - Recent deployment| Decision[Go to Rollback Decision] RootCause -->|Yes - Other cause| OtherFix[Different remediationNot deployment-related] RootCause -->|No - Time critical| Decision style Healthy fill:#064e3b,stroke:#10b981 style Severity1 fill:#7f1d1d,stroke:#ef4444 style Severity3 fill:#7f1d1d,stroke:#ef4444 style Severity4 fill:#7f1d1d,stroke:#ef4444 style Severity5 fill:#7f1d1d,stroke:#ef4444 style Severity2 fill:#78350f,stroke:#f59e0b Part 2: Rollback Decision Tree When to Rollback vs Forward Fix %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Production issue detected]) --> Assess[Assess situation:- User impact- Severity- Time deployed- Data changes] Assess --> Q1{Can issue befixed quickly?5 min} Q1 
-->|Yes - Simple config| QuickFix[Forward Fix:- Update config map- Restart pods- No rollback needed] Q1 -->|No| Q2{Is issue causedby latestdeployment?} Q2 -->|No - External issue| External[External Root Cause:- Third-party API down- Database issue- Infrastructure problem→ Fix underlying issue] Q2 -->|Yes| Q3{User impactseverity?} Q3 -->|Low - Minor bugs| Q4{Time sincedeployment?} Q4 -->|< 30 min| RollbackLow[Consider Rollback:Low risk, easy rollbackUsers barely affected] Q4 -->|> 30 min| ForwardFix[Forward Fix:Deploy hotfixMore data changesRollback riskier] Q3 -->|Medium - Degraded| Q5{Data changesmade?} Q5 -->|No DB changes| RollbackMed[Rollback:Safe to revertNo data migrationQuick recovery] Q5 -->|DB changes made| Q6{Can revertDB changes?} Q6 -->|Yes - Reversible| RollbackWithDB[Rollback + DB Revert:1. Revert application2. Run down migrationCoordinate carefully] Q6 -->|No - Irreversible| ForwardOnly[Forward Fix ONLY:Cannot rollbackFix bug in new versionData can't be reverted] Q3 -->|High - Outage| Q7{Rollbacktime?} Q7 -->|< 5 min| ImmediateRollback[IMMEDIATE Rollback:User impact too highRollback firstDebug later] Q7 -->|> 5 min| Q8{Forward fixfaster?} Q8 -->|Yes| HotfixDeploy[Deploy Hotfix:If fix is obviousand can deployfaster than rollback] Q8 -->|No| ImmediateRollback QuickFix --> Monitor[Monitor metricsVerify fix worked] RollbackLow --> ExecuteRollback[Execute Rollback] RollbackMed --> ExecuteRollback RollbackWithDB --> ExecuteRollback ImmediateRollback --> ExecuteRollback ForwardFix --> DeployFix[Deploy Forward Fix] HotfixDeploy --> DeployFix ForwardOnly --> DeployFix style ImmediateRollback fill:#7f1d1d,stroke:#ef4444 style RollbackWithDB fill:#78350f,stroke:#f59e0b style ForwardOnly fill:#78350f,stroke:#f59e0b style QuickFix fill:#064e3b,stroke:#10b981 Part 3: Rollback Execution Strategies Application Rollback Methods %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Decision: Rollback]) --> Method{Deploymentstrategyused?} Method --> K8sRolling[Kubernetes Rolling Update] Method --> BlueGreen[Blue-Green Deployment] Method --> Canary[Canary Deployment] subgraph RollingRollback[Kubernetes Rolling Rollback] K8sRolling --> K8s1[kubectl rollout undodeployment myapp] K8s1 --> K8s2[Kubernetes:- Find previous ReplicaSet- Rolling update to old version- maxSurge: 1, maxUnavailable: 1] K8s2 --> K8s3[Gradual Pod Replacement:1. Create 1 old version pod2. Wait for ready3. Terminate 1 new version pod4. Repeat until all replaced] K8s3 --> K8s4[Time to rollback: 2-5 minDowntime: NoneSome users see old, some new] end subgraph BGRollback[Blue-Green Rollback] BlueGreen --> BG1[Current state:Blue v1.0 IDLEGreen v2.0 ACTIVE 100%] BG1 --> BG2[Update Service selector:version: green → version: blue] BG2 --> BG3[Instant Traffic Switch:Blue v1.0 ACTIVE 100%Green v2.0 IDLE 0%] BG3 --> BG4[Time to rollback: 1-2 secDowntime: ~1 secAll users switched instantly] end subgraph CanaryRollback[Canary Rollback] Canary --> C1[Current state:v1.0: 0 replicasv2.0: 10 replicas 100%] C1 --> C2[Scale down v2.0:v2.0: 10 → 0 replicas] C2 --> C3[Scale up v1.0:v1.0: 0 → 10 replicas] C3 --> C4[Time to rollback: 1-3 minDowntime: MinimalGradual traffic shift] end K8s4 --> Verify[Verification Steps] BG4 --> Verify C4 --> Verify Verify --> V1[1. Check pod statuskubectl get podsAll running?] 
V1 --> V2[2. Run health checkscurl /healthAll healthy?] V2 --> V3[3. Monitor metricsError rate back to normal?Latency improved?] V3 --> V4[4. Check user reportsAre users reporting success?] V4 --> Success{Rollbacksuccessful?} Success -->|Yes| Complete[✅ Rollback CompleteService restoredMonitor closely] Success -->|No| StillBroken[🚨 Still Broken!Issue not deployment-relatedDeeper investigation needed] style K8s4 fill:#1e3a8a,stroke:#3b82f6 style BG4 fill:#064e3b,stroke:#10b981 style C4 fill:#1e3a8a,stroke:#3b82f6 style Complete fill:#064e3b,stroke:#10b981 style StillBroken fill:#7f1d1d,stroke:#ef4444 Part 4: Database Rollback Complexity Handling Database Migrations %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Need to rollbackwith DB changes]) --> Analyze[Analyze migration type] Analyze --> Type{Migrationtype?} Type --> AddColumn[Added ColumnALTER TABLE usersADD COLUMN email] Type --> DropColumn[Dropped ColumnALTER TABLE usersDROP COLUMN phone] Type --> ModifyColumn[Modified ColumnALTER TABLE usersALTER COLUMN age TYPE bigint] Type --> AddTable[Added TableCREATE TABLE orders] AddColumn --> AC1{Column hasdata?} AC1 -->|No data yet| AC2[Safe Rollback:1. Deploy old app version2. DROP COLUMN emailOld app doesn't use it] AC1 -->|Has data| AC3[⚠️ Data Loss Risk:1. Backup table first2. Consider keeping column3. Deploy old app versionColumn ignored by old app] DropColumn --> DC1[🚨 CANNOT Rollback:Data already lostForward fix ONLYOptions:1. Restore from backup2. Accept data loss3. Recreate from logs] ModifyColumn --> MC1{Datacompatible?} MC1 -->|Yes - reversible| MC2[Revert Column Type:ALTER COLUMN age TYPE intVerify no data truncationThen deploy old app] MC1 -->|No - data loss| MC3[🚨 Cannot Revert:bigint values exceed int rangeForward fix ONLY] AddTable --> AT1{Table hascritical data?} AT1 -->|No data| AT2[Safe Rollback:1. Deploy old app version2. DROP TABLE ordersNo data lost] AT1 -->|Has data| AT3[Risky Rollback:1. BACKUP TABLE orders2. DROP TABLE orders3. 
Deploy old app versionData preserved in backup] AC2 --> SafeProcess[Safe Rollback Process:✅ No data loss✅ Quick rollback✅ Reversible] AC3 --> RiskyProcess[Risky Rollback Process:⚠️ Potential data loss⚠️ Need backup⚠️ Manual intervention] DC1 --> NoRollback[Forward Fix Only:❌ Cannot rollback❌ Data already lost❌ Must fix forward] MC2 --> SafeProcess MC3 --> NoRollback AT2 --> SafeProcess AT3 --> RiskyProcess SafeProcess --> Execute1[Execute Safe Rollback] RiskyProcess --> Decision{Acceptablerisk?} Decision -->|Yes| Execute2[Execute with Caution] Decision -->|No| NoRollback NoRollback --> HotfixDeploy[Deploy Hotfix:New version with fixKeep new schema] style SafeProcess fill:#064e3b,stroke:#10b981 style RiskyProcess fill:#78350f,stroke:#f59e0b style NoRollback fill:#7f1d1d,stroke:#ef4444 style DC1 fill:#7f1d1d,stroke:#ef4444 style MC3 fill:#7f1d1d,stroke:#ef4444 Part 5: Complete Rollback Workflow From Detection to Recovery %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant Monitor as Monitoring participant Alert as Alerting participant Engineer as On-Call Engineer participant Incident as Incident Channel participant K8s as Kubernetes participant DB as Database participant Users as End Users Note over Monitor: 5 minutes after deployment Monitor->>Monitor: Detect anomaly:Error rate: 0.1% → 18%Latency p95: 150ms → 3000ms Monitor->>Alert: Trigger alert:HighErrorRate FIRING Alert->>Engineer: 🚨 PagerDuty callCritical alertProduction incident Engineer->>Alert: Acknowledge alertStop escalation Engineer->>Incident: Create #incident-456"High error rate after v2.5 deployment" Note over Engineer: Open laptopStart investigation Engineer->>Monitor: Check Grafana dashboardWhen did issue start?Which endpoints affected? 
Monitor-->>Engineer: Started 5 min agoRight after deploymentAll endpoints affected Engineer->>K8s: kubectl get podsCheck pod status K8s-->>Engineer: All pods RunningNo crashesHealth checks passing Engineer->>K8s: kubectl logs deployment/myappCheck application logs K8s-->>Engineer: ERROR: Cannot connect to cacheERROR: Redis timeoutERROR: Connection refused Note over Engineer: Root cause: New versionhas Redis connection bug Engineer->>Incident: Update: Redis connection issue in v2.5Decision: Rollback to v2.4 Note over Engineer: Check deployment history Engineer->>K8s: kubectl rollout history deployment/myapp K8s-->>Engineer: REVISION 10: v2.5 (current)REVISION 9: v2.4 (previous) Engineer->>Incident: Starting rollback to v2.4ETA: 3 minutes Engineer->>K8s: kubectl rollout undo deployment/myapp K8s->>K8s: Start rollback:- Create pods with v2.4- Wait for ready- Terminate v2.5 pods loop Rolling Update K8s->>Users: Some users on v2.4 ✓Some users on v2.5 ✗ Note over K8s: Pod 1: v2.4 ReadyTerminating v2.5 Pod 1 Engineer->>K8s: kubectl rollout statusdeployment/myapp --watch K8s-->>Engineer: Waiting for rollout:2/5 pods updated end K8s->>Users: All users now on v2.4 ✓ K8s-->>Engineer: Rollout complete:deployment "myapp" successfully rolled out Engineer->>Monitor: Check metrics Note over Monitor: Wait 2 minutesfor metrics to stabilize Monitor-->>Engineer: ✅ Error rate: 0.1%✅ Latency p95: 160ms✅ All metrics normal Note over Alert: Metrics normalized Alert->>Engineer: ✅ Alert resolved:HighErrorRate Engineer->>Users: Verify user experience Users-->>Engineer: No error reportsApplication working Engineer->>Incident: ✅ Incident resolvedService restored to v2.4Duration: 12 minutesRoot cause: Redis bug in v2.5 Engineer->>Incident: Next steps:1. Fix Redis bug2. Add integration test3. Post-mortem scheduled Note over Engineer: Create follow-up tasks Engineer->>Engineer: Create Jira tickets:- BUG-789: Fix Redis connection- TEST-123: Add cache integration test- DOC-456: Update deployment checklist Note over Engineer,Users: Service restored ✓Monitoring continues Part 6: Automated Rollback Auto-Rollback Decision Flow %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Deployment completed]) --> Monitor[Continuous MonitoringEvery 30 seconds] Monitor --> Collect[Collect Metrics:- Error rate- Latency p95/p99- Success rate- Pod health- Resource usage] Collect --> Check1{Error rate> 5%?} Check1 -->|Yes| Trigger1[🚨 Trigger auto-rollbackError threshold exceeded] Check1 -->|No| Check2{Latency p95> 2x baseline?} Check2 -->|Yes| Trigger2[🚨 Trigger auto-rollbackLatency degradation] Check2 -->|No| Check3{Pod crashrate > 50%?} Check3 -->|Yes| Trigger3[🚨 Trigger auto-rollbackPods failing] Check3 -->|No| Check4{Custom metricthreshold?} Check4 -->|Yes| Trigger4[🚨 Trigger auto-rollbackBusiness metric failed] Check4 -->|No| Healthy[✅ All checks passedContinue monitoring] Healthy --> TimeCheck{Monitoringduration?} TimeCheck -->|< 15 min| Monitor TimeCheck -->|>= 15 min| Stable[✅ Deployment STABLEPassed soak periodAuto-rollback disabled] Trigger1 --> Rollback[Execute Auto-Rollback] Trigger2 --> Rollback Trigger3 --> Rollback Trigger4 --> Rollback Rollback --> R1[1. Log rollback decisionMetrics that triggeredTimestamp] R1 --> R2[2. 
Alert team:PagerDuty criticalSlack notification"Auto-rollback initiated"] R2 --> R3[3. Execute rollback:kubectl rollout undodeployment/myapp] R3 --> R4[4. Wait for rollback:Monitor pod statusWait for all pods ready] R4 --> R5[5. Verify recovery:Check metrics againError rate normal?Latency normal?] R5 --> Verify{Recoverysuccessful?} Verify -->|Yes| Success[✅ Auto-Rollback SuccessService restoredNotify teamCreate incident report] Verify -->|No| StillFailing[🚨 Still Failing!Issue not deploymentPage on-call immediatelyManual intervention needed] style Healthy fill:#064e3b,stroke:#10b981 style Stable fill:#064e3b,stroke:#10b981 style Success fill:#064e3b,stroke:#10b981 style Trigger1 fill:#7f1d1d,stroke:#ef4444 style Trigger2 fill:#7f1d1d,stroke:#ef4444 style Trigger3 fill:#7f1d1d,stroke:#ef4444 style Trigger4 fill:#7f1d1d,stroke:#ef4444 style StillFailing fill:#7f1d1d,stroke:#ef4444 Auto-Rollback Configuration # Flagger auto-rollback configuration apiVersion: flagger.app/v1beta1 kind: Canary metadata: name: myapp namespace: production spec: targetRef: apiVersion: apps/v1 kind: Deployment name: myapp service: port: 8080 # Canary analysis analysis: interval: 30s threshold: 5 # Rollback after 5 failed checks maxWeight: 50 stepWeight: 10 # Metrics for auto-rollback decision metrics: # HTTP error rate - name: request-success-rate thresholdRange: min: 95 # Rollback if success rate < 95% interval: 1m # HTTP latency - name: request-duration thresholdRange: max: 500 # Rollback if p95 > 500ms interval: 1m # Custom business metric - name: conversion-rate thresholdRange: min: 80 # Rollback if conversion < 80% of baseline interval: 2m # Webhooks for additional checks webhooks: - name: load-test url: http://flagger-loadtester/ timeout: 5s metadata: type: bash cmd: "hey -z 1m -q 10 http://myapp-canary:8080/" # Alerting on rollback alerts: - name: slack severity: error providerRef: name: slack namespace: flagger Part 7: Post-Incident Process Learning from Rollbacks %%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TD Start([Rollback completedService restored]) --> Timeline[Create Incident Timeline:- Deployment time- Issue detection time- Rollback decision time- Recovery timeTotal duration] Timeline --> PostMortem[Schedule Post-Mortem:Within 48 hoursAll stakeholders invitedBlameless culture] PostMortem --> Analyze[Root Cause Analysis:Why did issue occur?Why wasn't it caught?What can we learn?] 
Analyze --> Categories{Issuecategory?} Categories --> Testing[Insufficient Testing:- Missing test case- Integration gap- Load testing needed] Categories --> Monitoring[Monitoring Gap:- Missing alert- Wrong threshold- Blind spot found] Categories --> Process[Process Issue:- Skipped step- Wrong timing- Communication gap] Categories --> Code[Code Quality:- Bug in code- Edge case- Dependency issue] Testing --> Actions1[Action Items:□ Add integration test□ Expand E2E coverage□ Add load test□ Test in staging first] Monitoring --> Actions2[Action Items:□ Add new alert□ Adjust thresholds□ Add dashboard□ Improve visibility] Process --> Actions3[Action Items:□ Update runbook□ Add checklist item□ Change deployment time□ Improve communication] Code --> Actions4[Action Items:□ Fix bug□ Add validation□ Update dependency□ Code review process] Actions1 --> Assign[Assign Owners:Each action has ownerEach action has deadlineTrack in project board] Actions2 --> Assign Actions3 --> Assign Actions4 --> Assign Assign --> Document[Document Learnings:- Update wiki- Share with team- Add to knowledge base- Update training] Document --> Prevent[Prevent Recurrence:✓ Tests added✓ Monitoring improved✓ Process updated✓ Team educated] Prevent --> Complete[✅ Post-Incident CompleteStronger systemBetter preparedContinuous improvement] style Complete fill:#064e3b,stroke:#10b981 Part 8: Rollback Checklist Pre-Deployment Rollback Readiness Before Every Deployment: ...
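    The auto-rollback loop from Part 6 boils down to a watch-and-revert cycle. A hedged sketch in Go (the metric source, thresholds, and deployment name are assumptions): poll the error rate every 30 seconds during a 15-minute soak period and run kubectl rollout undo the moment the threshold is breached.

    package main

    import (
        "fmt"
        "os/exec"
        "time"
    )

    const (
        errorRateThreshold = 0.05             // 5% error rate triggers rollback
        soakPeriod         = 15 * time.Minute // deployment considered stable after this
        checkInterval      = 30 * time.Second
    )

    // fetchErrorRate would query a metrics backend such as Prometheus; stubbed here.
    func fetchErrorRate() float64 { return 0.001 }

    func main() {
        deadline := time.Now().Add(soakPeriod)
        for time.Now().Before(deadline) {
            if rate := fetchErrorRate(); rate > errorRateThreshold {
                fmt.Printf("error rate %.1f%% exceeds threshold, rolling back\n", rate*100)
                // kubectl rollout undo reverts the Deployment to its previous ReplicaSet
                out, err := exec.Command("kubectl", "rollout", "undo", "deployment/myapp").CombinedOutput()
                fmt.Println(string(out), err)
                return
            }
            time.Sleep(checkInterval)
        }
        fmt.Println("deployment stable after soak period; auto-rollback disarmed")
    }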

    January 23, 2025 · 11 min · Rafiul Alam

    Graceful Shutdown Patterns in Go

    Go Concurrency Patterns Series: ← Go Memory Model | Series Overview | Generics Patterns → What is Graceful Shutdown? Graceful shutdown is the process of cleanly stopping a running application by: Receiving shutdown signals (SIGTERM, SIGINT) Stopping acceptance of new requests Finishing in-flight requests Closing database connections and other resources Flushing logs and metrics Exiting with appropriate status code Why It Matters: Zero-downtime deployments: No dropped requests during rollouts Data integrity: Complete ongoing transactions Resource cleanup: Prevent leaks and corruption Observability: Flush pending logs and metrics Container orchestration: Proper Kubernetes pod termination Real-World Use Cases HTTP/gRPC servers: Drain active connections before shutdown Background workers: Complete current jobs, reject new ones Message consumers: Finish processing messages, commit offsets Database connections: Close pools cleanly Caching layers: Persist in-memory state Kubernetes deployments: Respect termination grace period Basic Signal Handling Simple Shutdown Handler package main import ( "context" "fmt" "os" "os/signal" "syscall" "time" ) func main() { // Create signal channel sigChan := make(chan os.Signal, 1) // Register for SIGINT and SIGTERM signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM) // Simulate application work done := make(chan bool) go func() { fmt.Println("Application running...") time.Sleep(30 * time.Second) done <- true }() // Wait for signal or completion select { case sig := <-sigChan: fmt.Printf("\nReceived signal: %v\n", sig) fmt.Println("Initiating graceful shutdown...") // Perform cleanup cleanup() fmt.Println("Shutdown complete") os.Exit(0) case <-done: fmt.Println("Application completed normally") } } func cleanup() { fmt.Println("Cleaning up resources...") time.Sleep(2 * time.Second) fmt.Println("Cleanup complete") } HTTP Server Graceful Shutdown Basic HTTP Server Shutdown package main import ( "context" "fmt" "log" "net/http" "os" "os/signal" "syscall" "time" ) func main() { // Create HTTP server server := &http.Server{ Addr: ":8080", Handler: setupRoutes(), } // Channel to listen for errors from the server serverErrors := make(chan error, 1) // Start HTTP server in goroutine go func() { log.Printf("Server starting on %s", server.Addr) serverErrors <- server.ListenAndServe() }() // Channel to listen for interrupt signals shutdown := make(chan os.Signal, 1) signal.Notify(shutdown, syscall.SIGINT, syscall.SIGTERM) // Block until we receive a signal or server error select { case err := <-serverErrors: log.Fatalf("Server error: %v", err) case sig := <-shutdown: log.Printf("Received signal: %v. 
Starting graceful shutdown...", sig) // Create context with timeout for shutdown ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) defer cancel() // Attempt graceful shutdown if err := server.Shutdown(ctx); err != nil { log.Printf("Graceful shutdown failed: %v", err) // Force close if graceful shutdown fails if err := server.Close(); err != nil { log.Fatalf("Force close failed: %v", err) } } log.Println("Server shutdown complete") } } func setupRoutes() http.Handler { mux := http.NewServeMux() mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) { fmt.Fprintf(w, "Hello, World!") }) mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) { // Simulate slow endpoint time.Sleep(5 * time.Second) fmt.Fprintf(w, "Slow response complete") }) return mux } Advanced HTTP Server with Multiple Resources package main import ( "context" "database/sql" "fmt" "log" "net/http" "os" "os/signal" "sync" "syscall" "time" _ "github.com/lib/pq" ) type Application struct { server *http.Server db *sql.DB logger *log.Logger shutdown chan struct{} wg sync.WaitGroup } func NewApplication() (*Application, error) { // Initialize database db, err := sql.Open("postgres", "postgres://localhost/mydb") if err != nil { return nil, fmt.Errorf("database connection failed: %w", err) } app := &Application{ db: db, logger: log.New(os.Stdout, "APP: ", log.LstdFlags), shutdown: make(chan struct{}), } // Setup HTTP server app.server = &http.Server{ Addr: ":8080", Handler: app.routes(), ReadTimeout: 10 * time.Second, WriteTimeout: 30 * time.Second, IdleTimeout: 60 * time.Second, } return app, nil } func (app *Application) routes() http.Handler { mux := http.NewServeMux() mux.HandleFunc("/health", app.healthHandler) mux.HandleFunc("/api/data", app.dataHandler) return mux } func (app *Application) healthHandler(w http.ResponseWriter, r *http.Request) { select { case <-app.shutdown: // Signal shutdown in progress w.WriteHeader(http.StatusServiceUnavailable) fmt.Fprintf(w, "Shutting down") default: w.WriteHeader(http.StatusOK) fmt.Fprintf(w, "OK") } } func (app *Application) dataHandler(w http.ResponseWriter, r *http.Request) { // Check if shutdown initiated select { case <-app.shutdown: http.Error(w, "Service shutting down", http.StatusServiceUnavailable) return default: } // Simulate database query ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second) defer cancel() var count int err := app.db.QueryRowContext(ctx, "SELECT COUNT(*) FROM items").Scan(&count) if err != nil { http.Error(w, "Database error", http.StatusInternalServerError) return } fmt.Fprintf(w, "Count: %d", count) } func (app *Application) Run() error { // Start HTTP server app.wg.Add(1) go func() { defer app.wg.Done() app.logger.Printf("Starting server on %s", app.server.Addr) if err := app.server.ListenAndServe(); err != http.ErrServerClosed { app.logger.Printf("Server error: %v", err) } }() // Start background worker app.wg.Add(1) go func() { defer app.wg.Done() app.backgroundWorker() }() return nil } func (app *Application) backgroundWorker() { ticker := time.NewTicker(10 * time.Second) defer ticker.Stop() for { select { case <-ticker.C: app.logger.Println("Background job executing...") // Do work case <-app.shutdown: app.logger.Println("Background worker shutting down...") return } } } func (app *Application) Shutdown(ctx context.Context) error { app.logger.Println("Starting graceful shutdown...") // Signal all components to stop close(app.shutdown) // Shutdown HTTP server app.logger.Println("Shutting down HTTP 
server...") if err := app.server.Shutdown(ctx); err != nil { return fmt.Errorf("HTTP server shutdown failed: %w", err) } // Wait for background workers to finish app.logger.Println("Waiting for background workers...") done := make(chan struct{}) go func() { app.wg.Wait() close(done) }() select { case <-done: app.logger.Println("All workers stopped") case <-ctx.Done(): return fmt.Errorf("shutdown timeout exceeded") } // Close database connections app.logger.Println("Closing database connections...") if err := app.db.Close(); err != nil { return fmt.Errorf("database close failed: %w", err) } app.logger.Println("Graceful shutdown complete") return nil } func main() { app, err := NewApplication() if err != nil { log.Fatalf("Application initialization failed: %v", err) } if err := app.Run(); err != nil { log.Fatalf("Application run failed: %v", err) } // Wait for shutdown signal sigChan := make(chan os.Signal, 1) signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM) sig := <-sigChan log.Printf("Received signal: %v", sig) // Graceful shutdown with 30 second timeout ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) defer cancel() if err := app.Shutdown(ctx); err != nil { log.Fatalf("Shutdown failed: %v", err) } log.Println("Application stopped") } Worker Pool Graceful Shutdown package main import ( "context" "fmt" "log" "sync" "time" ) type Job struct { ID int Data string } type WorkerPool struct { jobs chan Job results chan int numWorkers int wg sync.WaitGroup shutdown chan struct{} } func NewWorkerPool(numWorkers, queueSize int) *WorkerPool { return &WorkerPool{ jobs: make(chan Job, queueSize), results: make(chan int, queueSize), numWorkers: numWorkers, shutdown: make(chan struct{}), } } func (wp *WorkerPool) Start() { for i := 0; i < wp.numWorkers; i++ { wp.wg.Add(1) go wp.worker(i) } log.Printf("Started %d workers", wp.numWorkers) } func (wp *WorkerPool) worker(id int) { defer wp.wg.Done() log.Printf("Worker %d started", id) for { select { case job, ok := <-wp.jobs: if !ok { log.Printf("Worker %d: job channel closed, exiting", id) return } // Process job log.Printf("Worker %d processing job %d", id, job.ID) result := wp.processJob(job) wp.results <- result case <-wp.shutdown: // Drain remaining jobs before shutdown log.Printf("Worker %d: shutdown signal received, draining jobs...", id) for job := range wp.jobs { log.Printf("Worker %d processing remaining job %d", id, job.ID) result := wp.processJob(job) wp.results <- result } log.Printf("Worker %d: shutdown complete", id) return } } } func (wp *WorkerPool) processJob(job Job) int { // Simulate work time.Sleep(1 * time.Second) return job.ID * 2 } func (wp *WorkerPool) Submit(job Job) bool { select { case <-wp.shutdown: return false // Pool is shutting down case wp.jobs <- job: return true } } func (wp *WorkerPool) Shutdown(ctx context.Context) error { log.Println("WorkerPool: initiating shutdown...") // Signal workers to start draining close(wp.shutdown) // Close job channel to signal no more jobs close(wp.jobs) // Wait for workers with timeout done := make(chan struct{}) go func() { wp.wg.Wait() close(done) close(wp.results) }() select { case <-done: log.Println("WorkerPool: all workers completed") return nil case <-ctx.Done(): return fmt.Errorf("shutdown timeout: %w", ctx.Err()) } } func main() { pool := NewWorkerPool(3, 10) pool.Start() // Submit jobs go func() { for i := 1; i <= 20; i++ { job := Job{ID: i, Data: fmt.Sprintf("Job %d", i)} if !pool.Submit(job) { log.Printf("Failed to submit job %d (pool shutting 
down)", i) return } time.Sleep(200 * time.Millisecond) } }() // Collect results go func() { for result := range pool.results { log.Printf("Result: %d", result) } log.Println("All results collected") }() // Wait a bit then shutdown time.Sleep(5 * time.Second) ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) defer cancel() if err := pool.Shutdown(ctx); err != nil { log.Fatalf("Shutdown failed: %v", err) } log.Println("Main: shutdown complete") } Kubernetes-Aware Graceful Shutdown package main import ( "context" "fmt" "log" "net/http" "os" "os/signal" "syscall" "time" ) type KubernetesServer struct { server *http.Server shutdownDelay time.Duration terminationPeriod time.Duration } func NewKubernetesServer() *KubernetesServer { ks := &KubernetesServer{ // Delay before starting shutdown to allow load balancer de-registration shutdownDelay: 5 * time.Second, // Total Kubernetes termination grace period terminationPeriod: 30 * time.Second, } mux := http.NewServeMux() mux.HandleFunc("/health", ks.healthHandler) mux.HandleFunc("/readiness", ks.readinessHandler) mux.HandleFunc("/", ks.requestHandler) ks.server = &http.Server{ Addr: ":8080", Handler: mux, } return ks } var ( isHealthy = true isReady = true ) func (ks *KubernetesServer) healthHandler(w http.ResponseWriter, r *http.Request) { if isHealthy { w.WriteHeader(http.StatusOK) fmt.Fprintf(w, "healthy") } else { w.WriteHeader(http.StatusServiceUnavailable) fmt.Fprintf(w, "unhealthy") } } func (ks *KubernetesServer) readinessHandler(w http.ResponseWriter, r *http.Request) { if isReady { w.WriteHeader(http.StatusOK) fmt.Fprintf(w, "ready") } else { w.WriteHeader(http.StatusServiceUnavailable) fmt.Fprintf(w, "not ready") } } func (ks *KubernetesServer) requestHandler(w http.ResponseWriter, r *http.Request) { fmt.Fprintf(w, "Request processed at %s", time.Now().Format(time.RFC3339)) } func (ks *KubernetesServer) Run() error { go func() { log.Printf("Server starting on %s", ks.server.Addr) if err := ks.server.ListenAndServe(); err != http.ErrServerClosed { log.Fatalf("Server error: %v", err) } }() return nil } func (ks *KubernetesServer) GracefulShutdown() error { // Step 1: Mark as not ready (stop receiving new traffic from load balancer) log.Println("Step 1: Marking pod as not ready...") isReady = false // Step 2: Wait for load balancer to de-register log.Printf("Step 2: Waiting %v for load balancer de-registration...", ks.shutdownDelay) time.Sleep(ks.shutdownDelay) // Step 3: Stop accepting new connections and drain existing ones log.Println("Step 3: Shutting down HTTP server...") // Calculate remaining time for shutdown shutdownTimeout := ks.terminationPeriod - ks.shutdownDelay - (2 * time.Second) ctx, cancel := context.WithTimeout(context.Background(), shutdownTimeout) defer cancel() if err := ks.server.Shutdown(ctx); err != nil { log.Printf("Server shutdown error: %v", err) return err } log.Println("Step 4: Graceful shutdown complete") return nil } func main() { server := NewKubernetesServer() if err := server.Run(); err != nil { log.Fatalf("Server run failed: %v", err) } // Wait for termination signal sigChan := make(chan os.Signal, 1) signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM) sig := <-sigChan log.Printf("Received signal: %v", sig) if err := server.GracefulShutdown(); err != nil { log.Fatalf("Graceful shutdown failed: %v", err) os.Exit(1) } log.Println("Server stopped successfully") } Database Connection Cleanup package main import ( "context" "database/sql" "log" "time" _ "github.com/lib/pq" ) type 
DatabaseManager struct { db *sql.DB } func NewDatabaseManager(connString string) (*DatabaseManager, error) { db, err := sql.Open("postgres", connString) if err != nil { return nil, err } // Configure connection pool db.SetMaxOpenConns(25) db.SetMaxIdleConns(5) db.SetConnMaxLifetime(5 * time.Minute) db.SetConnMaxIdleTime(10 * time.Minute) return &DatabaseManager{db: db}, nil } func (dm *DatabaseManager) Shutdown(ctx context.Context) error { log.Println("Closing database connections...") // Wait for active queries to complete or timeout done := make(chan error, 1) go func() { // Close will wait for all connections to be returned to pool done <- dm.db.Close() }() select { case err := <-done: if err != nil { return err } log.Println("Database connections closed successfully") return nil case <-ctx.Done(): log.Println("Database close timeout exceeded") return ctx.Err() } } // QueryWithShutdown performs query with shutdown awareness func (dm *DatabaseManager) QueryWithShutdown(ctx context.Context, query string) error { // Check if context is already cancelled (shutdown initiated) if ctx.Err() != nil { return ctx.Err() } rows, err := dm.db.QueryContext(ctx, query) if err != nil { return err } defer rows.Close() // Process rows for rows.Next() { // Check for shutdown during processing if ctx.Err() != nil { return ctx.Err() } // Process row... } return rows.Err() } Message Queue Consumer Shutdown package main import ( "context" "fmt" "log" "sync" "time" ) type Message struct { ID string Payload string } type Consumer struct { messages chan Message shutdown chan struct{} wg sync.WaitGroup } func NewConsumer() *Consumer { return &Consumer{ messages: make(chan Message, 100), shutdown: make(chan struct{}), } } func (c *Consumer) Start() { c.wg.Add(1) go c.consume() log.Println("Consumer started") } func (c *Consumer) consume() { defer c.wg.Done() for { select { case msg := <-c.messages: // Process message if err := c.processMessage(msg); err != nil { log.Printf("Error processing message %s: %v", msg.ID, err) // In production: send to dead letter queue } case <-c.shutdown: log.Println("Consumer: shutdown initiated, processing remaining messages...") // Drain remaining messages for msg := range c.messages { log.Printf("Consumer: processing remaining message %s", msg.ID) if err := c.processMessage(msg); err != nil { log.Printf("Error processing remaining message %s: %v", msg.ID, err) } } log.Println("Consumer: all messages processed") return } } } func (c *Consumer) processMessage(msg Message) error { log.Printf("Processing message: %s", msg.ID) // Simulate processing time.Sleep(500 * time.Millisecond) // Acknowledge message log.Printf("Message %s processed successfully", msg.ID) return nil } func (c *Consumer) Shutdown(ctx context.Context) error { log.Println("Consumer: initiating shutdown...") // Stop accepting new messages close(c.shutdown) close(c.messages) // Wait for processing to complete done := make(chan struct{}) go func() { c.wg.Wait() close(done) }() select { case <-done: log.Println("Consumer: shutdown complete") return nil case <-ctx.Done(): return fmt.Errorf("consumer shutdown timeout: %w", ctx.Err()) } } func main() { consumer := NewConsumer() consumer.Start() // Simulate receiving messages go func() { for i := 1; i <= 10; i++ { msg := Message{ ID: fmt.Sprintf("msg-%d", i), Payload: fmt.Sprintf("Payload %d", i), } consumer.messages <- msg time.Sleep(300 * time.Millisecond) } }() // Wait then shutdown time.Sleep(3 * time.Second) ctx, cancel := context.WithTimeout(context.Background(), 
10*time.Second) defer cancel() if err := consumer.Shutdown(ctx); err != nil { log.Fatalf("Shutdown failed: %v", err) } log.Println("Main: shutdown complete") } Best Practices 1. Use Context for Timeouts // Set realistic shutdown timeout ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) defer cancel() if err := server.Shutdown(ctx); err != nil { // Handle timeout } 2. Implement Shutdown Sequence func (app *Application) Shutdown(ctx context.Context) error { // 1. Stop health checks (remove from load balancer) app.setUnhealthy() // 2. Wait for de-registration time.Sleep(5 * time.Second) // 3. Stop accepting new requests app.server.Shutdown(ctx) // 4. Finish in-flight requests app.wg.Wait() // 5. Close resources app.db.Close() return nil } 3. Test Graceful Shutdown func TestGracefulShutdown(t *testing.T) { app := NewApplication() app.Run() // Start long-running request go func() { resp, err := http.Get("http://localhost:8080/slow") if err != nil { t.Errorf("Request failed: %v", err) } defer resp.Body.Close() }() // Wait for request to start time.Sleep(100 * time.Millisecond) // Initiate shutdown ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) defer cancel() if err := app.Shutdown(ctx); err != nil { t.Fatalf("Shutdown failed: %v", err) } } Kubernetes Configuration Pod Termination Lifecycle apiVersion: apps/v1 kind: Deployment metadata: name: myapp spec: template: spec: containers: - name: myapp image: myapp:latest ports: - containerPort: 8080 # Liveness probe (restart if unhealthy) livenessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 30 periodSeconds: 10 # Readiness probe (remove from service if not ready) readinessProbe: httpGet: path: /readiness port: 8080 initialDelaySeconds: 5 periodSeconds: 5 # Graceful shutdown configuration lifecycle: preStop: exec: command: ["/bin/sh", "-c", "sleep 5"] # Termination grace period (must be > shutdown delay + max request time) terminationGracePeriodSeconds: 30 Common Pitfalls Not Handling SIGTERM: Container orchestrators send SIGTERM for graceful shutdown Insufficient Timeout: Set timeout longer than longest request duration Ignoring In-Flight Requests: Always wait for active requests to complete Not Closing Resources: Explicitly close databases, files, connections Immediate Exit: Don’t call os.Exit(0) without cleanup Performance Considerations Shutdown Delay: Balance between deployment speed and zero downtime Timeout Values: Consider 95th percentile request duration Resource Cleanup: Close in order of dependency Logging: Flush logs before exit Metrics: Report shutdown metrics for monitoring Conclusion Graceful shutdown is essential for production Go services to ensure zero-downtime deployments and data integrity. ...

    June 27, 2024 · 13 min · Rafiul Alam