Update Existing Infrastructure
Goal: Safely update running infrastructure with minimal downtime
Time: 15-30 minutes
Difficulty: Intermediate
Overview
This guide covers:
- Checking for updates
- Planning update strategies
- Updating task services
- Rolling updates
- Rollback procedures
- Verification
Update Strategies
Strategy 1: In-Place Updates (Fastest)
Best for: Non-critical environments, development, staging
# Direct update without downtime consideration
provisioning t create <taskserv> --infra <project>
Strategy 2: Rolling Updates (Recommended)
Best for: Production environments, high availability
# Update servers one by one
provisioning s update --infra <project> --rolling
Strategy 3: Blue-Green Deployment (Safest)
Best for: Critical production, zero-downtime requirements
# Create new infrastructure, switch traffic, remove old
provisioning ws init <project>-green
# ... configure and deploy
# ... switch traffic
provisioning ws delete <project>-blue
Step 1: Check for Updates
1.1 Check All Task Services
# Check all taskservs for updates
provisioning t check-updates
Expected Output:
📦 Task Service Update Check:
NAME        CURRENT  LATEST  STATUS
kubernetes  1.29.0   1.30.0  ⬆️ update available
containerd  1.7.13   1.7.13  ✅ up-to-date
cilium      1.14.5   1.15.0  ⬆️ update available
postgres    15.5     16.1    ⬆️ update available
redis       7.2.3    7.2.3   ✅ up-to-date
Updates available: 3
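When scripting around this check, the names of the taskservs with pending updates can be pulled out of the table. The helper below is a minimal sketch that assumes the output keeps the column layout and the "update available" wording shown above; `list_pending_updates` is a hypothetical name, not part of the provisioning CLI.

```sh
# Hypothetical helper: print the NAME column of every row whose status
# reads "update available". Assumes the table layout shown above;
# the real output format may differ.
list_pending_updates() {
  awk '/update available/ { print $1 }'
}

# Usage (reads the check output from stdin):
#   provisioning t check-updates | list_pending_updates
```

The summary line ("Updates available: 3") is not matched because the pattern is case-sensitive and looks for the singular "update".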
1.2 Check Specific Task Service
# Check specific taskserv
provisioning t check-updates kubernetes
Expected Output:
📦 Kubernetes Update Check:
Current: 1.29.0
Latest: 1.30.0
Status: ⬆️ Update available
Changelog:
• Enhanced security features
• Performance improvements
• Bug fixes in kube-apiserver
• New workload resource types
Breaking Changes:
• None
Recommended: ✅ Safe to update
1.3 Check Version Status
# Show detailed version information
provisioning version show
Expected Output:
Component Versions:
COMPONENT   CURRENT  LATEST  DAYS OLD  STATUS
kubernetes  1.29.0   1.30.0  45        ⬆️ update
containerd  1.7.13   1.7.13  0         ✅ current
cilium      1.14.5   1.15.0  30        ⬆️ update
postgres    15.5     16.1    60        ⬆️ update (major)
redis       7.2.3    7.2.3   0         ✅ current
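The "(major)" marker on the postgres row matters because major upgrades usually need more planning than patch releases. As a rough illustration of how such a label could be derived, the sketch below compares the first two components of dotted numeric versions. It is not the tool's actual logic, and `classify_update` is a hypothetical helper.

```sh
# Hypothetical helper: classify the jump between two dotted numeric
# versions as "current", "major", "minor", or "patch".
# Assumes plain numeric versions such as 1.29.0 or 15.5 (no suffixes).
classify_update() {
  cur_major=${1%%.*}               # text before the first dot
  new_major=${2%%.*}
  cur_rest=${1#*.}                 # text after the first dot
  new_rest=${2#*.}
  cur_minor=${cur_rest%%.*}
  new_minor=${new_rest%%.*}
  if [ "$1" = "$2" ]; then
    echo current
  elif [ "$cur_major" != "$new_major" ]; then
    echo major
  elif [ "$cur_minor" != "$new_minor" ]; then
    echo minor
  else
    echo patch
  fi
}

# Usage:
#   classify_update 15.5 16.1     # prints "major"
#   classify_update 7.2.3 7.2.3   # prints "current"
```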
1.4 Check for Security Updates
# Check for security-related updates
provisioning version updates --security-only
Step 2: Plan Your Update
2.1 Review Current Configuration
# Show current infrastructure
provisioning show settings --infra my-production
2.2 Backup Configuration
# Create configuration backup
cp -r workspace/infra/my-production workspace/infra/my-production.backup-$(date +%Y%m%d)
# Or use built-in backup
provisioning ws backup my-production
Expected Output:
✅ Backup created: workspace/backups/my-production-20250930.tar.gz
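One caveat with the manual `cp -r` approach: if the dated destination already exists (for example, on a second run the same day), `cp -r` copies the source into the existing directory instead of replacing it. The sketch below guards against that. `backup_infra_dir` is a hypothetical helper, and the `.backup-YYYYMMDD` naming simply mirrors the command above.

```sh
# Hypothetical helper: copy an infra directory to <dir>.backup-<stamp>,
# refusing to touch an existing backup.
#   $1 = infra directory, $2 = optional date stamp (defaults to today)
backup_infra_dir() {
  src=${1%/}                       # strip one trailing slash, if any
  stamp=${2:-$(date +%Y%m%d)}
  dest="${src}.backup-${stamp}"
  if [ ! -d "$src" ]; then
    echo "no such directory: $src" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "backup already exists: $dest" >&2
    return 1
  fi
  # Print the backup path on success so callers can record it.
  cp -r "$src" "$dest" && echo "$dest"
}

# Usage:
#   backup_infra_dir workspace/infra/my-production
```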
2.3 Create Update Plan
# Generate update plan
provisioning plan update --infra my-production
Expected Output:
Update Plan for my-production:
Phase 1: Minor Updates (Low Risk)
• containerd: No update needed
• redis: No update needed
Phase 2: Patch Updates (Medium Risk)
• cilium: 1.14.5 → 1.15.0 (estimated 5 minutes)
Phase 3: Major Updates (High Risk - Requires Testing)
• kubernetes: 1.29.0 → 1.30.0 (estimated 15 minutes)
• postgres: 15.5 → 16.1 (estimated 10 minutes, may require data migration)
Recommended Order:
1. Update cilium (low risk)
2. Update kubernetes (test in staging first)
3. Update postgres (requires maintenance window)
Total Estimated Time: 30 minutes
Recommended: Test in staging environment first
Step 3: Update Task Services
3.1 Update Non-Critical Service (Cilium Example)
Dry-Run Update
# Test update without applying
provisioning t create cilium --infra my-production --check
Expected Output:
CHECK MODE: Simulating Cilium update
Current: 1.14.5
Target: 1.15.0
Would perform:
1. Download Cilium 1.15.0
2. Update configuration
3. Rolling restart of Cilium pods
4. Verify connectivity
Estimated downtime: <1 minute per node
No errors detected. Ready to update.
Generate Updated Configuration
# Generate new configuration
provisioning t generate cilium --infra my-production
Expected Output:
✅ Generated Cilium configuration (version 1.15.0)
Saved to: workspace/infra/my-production/taskservs/cilium.ncl
Apply Update
# Apply update
provisioning t create cilium --infra my-production
Expected Output:
Updating Cilium on my-production...
Downloading Cilium 1.15.0... ⏳
✅ Downloaded
Updating configuration... ⏳
✅ Configuration updated
Rolling restart: web-01... ⏳
✅ web-01 updated (Cilium 1.15.0)
Rolling restart: web-02... ⏳
✅ web-02 updated (Cilium 1.15.0)
Verifying connectivity... ⏳
✅ All nodes connected
Cilium update complete!
Version: 1.14.5 → 1.15.0
Downtime: 0 minutes
Verify Update
# Verify updated version
provisioning version taskserv cilium
Expected Output:
📦 Cilium Version Info:
Installed: 1.15.0
Latest: 1.15.0
Status: ✅ Up-to-date
Nodes:
✅ web-01: 1.15.0 (running)
✅ web-02: 1.15.0 (running)
3.2 Update Critical Service (Kubernetes Example)
Test in Staging First
# If you have a staging environment
provisioning t create kubernetes --infra my-staging --check
provisioning t create kubernetes --infra my-staging
# Run integration tests
provisioning test kubernetes --infra my-staging
Backup Current State
# Backup Kubernetes state (note: `get all` omits ConfigMaps, Secrets, and custom resources)
kubectl get all -A -o yaml > k8s-backup-$(date +%Y%m%d).yaml
# Backup etcd (if using external etcd)
provisioning t backup kubernetes --infra my-production
Schedule Maintenance Window
# Set maintenance mode (optional, if supported)
provisioning maintenance enable --infra my-production --duration 30m
Update Kubernetes
# Update control plane first
provisioning t create kubernetes --infra my-production --control-plane-only
Expected Output:
Updating Kubernetes control plane on my-production...
Draining control plane: web-01... ⏳
✅ web-01 drained
Updating control plane: web-01... ⏳
✅ web-01 updated (Kubernetes 1.30.0)
Uncordoning: web-01... ⏳
✅ web-01 ready
Verifying control plane... ⏳
✅ Control plane healthy
Control plane update complete!
# Update worker nodes one by one
provisioning t create kubernetes --infra my-production --workers-only --rolling
Expected Output:
Updating Kubernetes workers on my-production...
Rolling update: web-02...
Draining... ⏳
✅ Drained (pods rescheduled)
Updating... ⏳
✅ Updated (Kubernetes 1.30.0)
Uncordoning... ⏳
✅ Ready
Waiting for pods to stabilize... ⏳
✅ All pods running
Worker update complete!
Updated: web-02
Version: 1.30.0
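The drain, update, uncordon sequence in the output above can be sketched with plain `kubectl` for readers who want to see what a rolling pass involves. This is an illustration under assumptions, not what `--rolling` does internally: `rolling_update_nodes` is a hypothetical helper, the per-node update command is supplied by the caller, and the 300-second readiness timeout is an arbitrary choice.

```sh
# Hypothetical helper: update nodes one at a time, stopping at the
# first failure so the remaining nodes keep serving traffic.
#   $1    = command that updates a single node (caller-supplied)
#   $2... = node names, in the order they should be updated
rolling_update_nodes() {
  update_cmd=$1
  shift
  for node in "$@"; do
    # Evict pods; DaemonSet pods stay, emptyDir data is discarded.
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    "$update_cmd" "$node" || return 1
    kubectl uncordon "$node" || return 1
    # Wait for the node to report Ready before moving on.
    kubectl wait --for=condition=Ready "node/$node" --timeout=300s || return 1
  done
}

# Usage (my_node_update is a placeholder for your own update step):
#   rolling_update_nodes my_node_update web-02
```

If the update step fails, the node is left cordoned on purpose so that it can be inspected before workloads return to it.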
Verify Update
# Verify Kubernetes cluster
kubectl get nodes
provisioning version taskserv kubernetes
Expected Output:
NAME STATUS ROLES AGE VERSION
web-01 Ready control-plane 30d v1.30.0
web-02 Ready <none> 30d v1.30.0
# Run smoke tests
provisioning test kubernetes --infra my-production
3.3 Update Database (PostgreSQL Example)
⚠️ WARNING: Database updates may require data migration. Always back up first!
Backup Database
# Backup PostgreSQL database
provisioning t backup postgres --infra my-production
Expected Output:
Backing up PostgreSQL...
Creating dump: my-production-postgres-20250930.sql... ⏳
✅ Dump created (2.3 GB)
Compressing... ⏳
✅ Compressed (450 MB)
Saved to: workspace/backups/postgres/my-production-20250930.sql.gz
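Before starting a migration that depends on this dump, it is worth confirming that the archive is at least readable. The sketch below is one minimal way to do that with `gzip -t`; `verify_backup` is a hypothetical helper. It only checks that the file is non-empty and is a valid gzip stream, not that the SQL inside restores cleanly.

```sh
# Hypothetical helper: sanity-check a gzip-compressed dump.
#   $1 = path to the .sql.gz file
verify_backup() {
  if [ ! -s "$1" ]; then
    echo "missing or empty: $1" >&2
    return 1
  fi
  # gzip -t tests archive integrity without writing any output.
  if ! gzip -t "$1" 2>/dev/null; then
    echo "corrupt gzip archive: $1" >&2
    return 1
  fi
  echo "ok: $1"
}

# Usage:
#   verify_backup workspace/backups/postgres/my-production-20250930.sql.gz
```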
Check Compatibility
# Check if data migration is needed
provisioning t check-migration postgres --from 15.5 --to 16.1
Expected Output:
PostgreSQL Migration Check:
From: 15.5
To: 16.1
Migration Required: ✅ Yes (major version change)
Steps Required:
1. Dump database with pg_dump
2. Stop PostgreSQL 15.5
3. Install PostgreSQL 16.1
4. Initialize new data directory
5. Restore from dump
Estimated Time: 15-30 minutes (depending on data size)
Estimated Downtime: 15-30 minutes
Recommended: Use logical replication for a zero-downtime upgrade
Perform Update
# Update PostgreSQL (with automatic migration)
provisioning t create postgres --infra my-production --migrate
Expected Output:
Updating PostgreSQL on my-production...
⚠️ Major version upgrade detected (15.5 → 16.1)
Automatic migration will be performed
Dumping database... ⏳
✅ Database dumped (2.3 GB)
Stopping PostgreSQL 15.5... ⏳
✅ Stopped
Installing PostgreSQL 16.1... ⏳
✅ Installed
Initializing new data directory... ⏳
✅ Initialized
Restoring database... ⏳
✅ Restored (2.3 GB)
Starting PostgreSQL 16.1... ⏳
✅ Started
Verifying data integrity... ⏳
✅ All tables verified
PostgreSQL update complete!
Version: 15.5 → 16.1
Downtime: 18 minutes
Verify Update
# Verify PostgreSQL
provisioning version taskserv postgres
ssh db-01 "psql --version"
Step 4: Update Multiple Services
4.1 Batch Update (Sequentially)
# Update multiple taskservs one by one
provisioning t update --infra my-production --taskservs cilium,containerd,redis
Expected Output:
Updating 3 taskservs on my-production...
[1/3] Updating cilium... ⏳
✅ cilium updated (1.15.0)
[2/3] Updating containerd... ⏳
✅ containerd updated (1.7.14)
[3/3] Updating redis... ⏳
✅ redis updated (7.2.4)
All updates complete!
Updated: 3 taskservs
Total time: 8 minutes
4.2 Parallel Update (Non-Dependent Services)
# Update taskservs in parallel (if they don't depend on each other)
provisioning t update --infra my-production --taskservs redis,postgres --parallel
Expected Output:
Updating 2 taskservs in parallel on my-production...
redis: Updating... ⏳
postgres: Updating... ⏳
redis: ✅ Updated (7.2.4)
postgres: ✅ Updated (16.1)
All updates complete!
Updated: 2 taskservs
Total time: 3 minutes (parallel)
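For comparison, the same idea can be expressed with ordinary shell job control: start one update per taskserv in the background, then wait for each and report failure if any of them failed. This is only a sketch of the pattern, not how `--parallel` is implemented. `update_in_parallel` is a hypothetical helper, and it assumes `provisioning t create` exits non-zero on failure.

```sh
# Hypothetical helper: run independent taskserv updates concurrently.
#   $1    = infra name
#   $2... = taskserv names that do not depend on each other
update_in_parallel() {
  infra=$1
  shift
  pids=""
  for svc in "$@"; do
    provisioning t create "$svc" --infra "$infra" &
    pids="$pids $!"                # remember each background job
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=1        # collect every exit status
  done
  return "$failed"
}

# Usage:
#   update_in_parallel my-production redis postgres
```

Waiting on every job before returning means one failed update does not leave the others running unobserved.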
Step 5: Update Server Configuration
5.1 Update Server Resources
# Edit server configuration
provisioning sops workspace/infra/my-production/servers.ncl
Example: Upgrade server plan
# Before
{
  name = "web-01",
  plan = "1xCPU-2GB", # Old plan
}
# After
{
  name = "web-01",
  plan = "2xCPU-4GB", # New plan
}
# Apply server update
provisioning s update --infra my-production --check
provisioning s update --infra my-production
5.2 Update Server OS
# Update operating system packages
provisioning s update --infra my-production --os-update
Expected Output:
Updating OS packages on my-production servers...
web-01: Updating packages... ⏳
✅ web-01: 24 packages updated
web-02: Updating packages... ⏳
✅ web-02: 24 packages updated
db-01: Updating packages... ⏳
✅ db-01: 24 packages updated
OS updates complete!
Step 6: Rollback Procedures
6.1 Rollback Task Service
If an update fails or causes issues:
# Rollback to previous version
provisioning t rollback cilium --infra my-production
Expected Output:
Rolling back Cilium on my-production...
Current: 1.15.0
Target: 1.14.5 (previous version)
Rolling back: web-01... ⏳
✅ web-01 rolled back
Rolling back: web-02... ⏳
✅ web-02 rolled back
Verifying connectivity... ⏳
✅ All nodes connected
Rollback complete!
Version: 1.15.0 → 1.14.5
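Update, verification, and rollback can be tied together so that a failed update or a failed health check triggers the rollback automatically. The wrapper below is a sketch using only commands that appear in this guide; `update_or_rollback` is a hypothetical name, and the approach assumes the `provisioning` commands signal failure through a non-zero exit status.

```sh
# Hypothetical helper: apply an update, verify health, and roll back
# if either step fails.
#   $1 = taskserv name, $2 = infra name
update_or_rollback() {
  if provisioning t create "$1" --infra "$2" &&
     provisioning health --infra "$2"; then
    echo "updated: $1"
  else
    echo "update failed, rolling back: $1" >&2
    provisioning t rollback "$1" --infra "$2"
    return 1                       # report failure even if rollback worked
  fi
}

# Usage:
#   update_or_rollback cilium my-production
```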
6.2 Rollback from Backup
# Restore configuration from backup
provisioning ws restore my-production --from workspace/backups/my-production-20250930.tar.gz
6.3 Emergency Rollback
# Complete infrastructure rollback
provisioning rollback --infra my-production --to-snapshot <snapshot-id>
Step 7: Post-Update Verification
7.1 Verify All Components
# Check overall health
provisioning health --infra my-production
Expected Output:
🏥 Health Check: my-production
Servers:
✅ web-01: Healthy
✅ web-02: Healthy
✅ db-01: Healthy
Task Services:
✅ kubernetes: 1.30.0 (healthy)
✅ containerd: 1.7.13 (healthy)
✅ cilium: 1.15.0 (healthy)
✅ postgres: 16.1 (healthy)
Clusters:
✅ buildkit: 2/2 replicas (healthy)
Overall Status: ✅ All systems healthy
7.2 Verify Version Updates
# Verify all versions are updated
provisioning version show
7.3 Run Integration Tests
# Run comprehensive tests
provisioning test all --infra my-production
Expected Output:
🧪 Running Integration Tests...
[1/5] Server connectivity... ⏳
✅ All servers reachable
[2/5] Kubernetes health... ⏳
✅ All nodes ready, all pods running
[3/5] Network connectivity... ⏳
✅ All services reachable
[4/5] Database connectivity... ⏳
✅ PostgreSQL responsive
[5/5] Application health... ⏳
✅ All applications healthy
All tests passed!
7.4 Monitor for Issues
# Monitor logs for errors
provisioning logs --infra my-production --follow --level error
Update Checklist
Use this checklist for production updates:
- Check for available updates
- Review changelog and breaking changes
- Create configuration backup
- Test update in staging environment
- Schedule maintenance window
- Notify team/users of maintenance
- Update non-critical services first
- Verify each update before proceeding
- Update critical services with rolling updates
- Backup database before major updates
- Verify all components after update
- Run integration tests
- Monitor for issues (30 minutes minimum)
- Document any issues encountered
- Close maintenance window
Common Update Scenarios
Scenario 1: Minor Security Patch
# Quick security update
provisioning t check-updates --security-only
provisioning t update --infra my-production --security-patches --yes
Scenario 2: Major Version Upgrade
# Careful major version update
provisioning ws backup my-production
provisioning t check-migration <service> --from X.Y --to X+1.Y
provisioning t create <service> --infra my-production --migrate
provisioning test all --infra my-production
Scenario 3: Emergency Hotfix
# Apply critical hotfix immediately
provisioning t create <service> --infra my-production --hotfix --yes
Troubleshooting Updates
Issue: Update fails mid-process
Solution:
# Check update status
provisioning t status <taskserv> --infra my-production
# Resume failed update
provisioning t update <taskserv> --infra my-production --resume
# Or rollback
provisioning t rollback <taskserv> --infra my-production
Issue: Service not starting after update
Solution:
# Check logs
provisioning logs <taskserv> --infra my-production
# Verify configuration
provisioning t validate <taskserv> --infra my-production
# Rollback if necessary
provisioning t rollback <taskserv> --infra my-production
Issue: Data migration fails
Solution:
# Check migration logs
provisioning t migration-logs <taskserv> --infra my-production
# Restore from backup
provisioning t restore <taskserv> --infra my-production --from <backup-file>
Best Practices
- Always Test First: Test updates in staging before production
- Backup Everything: Create backups before any update
- Update Gradually: Update one service at a time
- Monitor Closely: Watch for errors after each update
- Have a Rollback Plan: Always have a rollback strategy
- Document Changes: Keep update logs for reference
- Schedule Wisely: Update during low-traffic periods
- Verify Thoroughly: Run tests after each update
Next Steps
- Customize Guide - Customize your infrastructure
- From Scratch Guide - Deploy new infrastructure
- Workflow Guide - Automate with workflows
Quick Reference
# Update workflow
provisioning t check-updates
provisioning ws backup my-production
provisioning t create <taskserv> --infra my-production --check
provisioning t create <taskserv> --infra my-production
provisioning version taskserv <taskserv>
provisioning health --infra my-production
provisioning test all --infra my-production
This guide is part of the provisioning project documentation. Last updated: 2025-09-30