# Update Existing Infrastructure

**Goal**: Safely update running infrastructure with minimal downtime
**Time**: 15-30 minutes
**Difficulty**: Intermediate

## Overview

This guide covers:

1. Checking for updates
2. Planning update strategies
3. Updating task services
4. Rolling updates
5. Rollback procedures
6. Verification

## Update Strategies

### Strategy 1: In-Place Updates (Fastest)

**Best for**: Non-critical environments, development, staging

```bash
# Direct update without downtime consideration
provisioning t create <taskserv> --infra <project>
```

### Strategy 2: Rolling Updates (Recommended)

**Best for**: Production environments, high availability

```bash
# Update servers one by one
provisioning s update --infra <project> --rolling
```

### Strategy 3: Blue-Green Deployment (Safest)

**Best for**: Critical production, zero-downtime requirements

```bash
# Create new infrastructure, switch traffic, remove old
provisioning ws init <project>-green
# ... configure and deploy
# ... switch traffic
provisioning ws delete <project>-blue
```

## Step 1: Check for Updates

### 1.1 Check All Task Services

```bash
# Check all taskservs for updates
provisioning t check-updates
```

**Expected Output:**

```plaintext
📦 Task Service Update Check:

NAME         CURRENT   LATEST    STATUS
kubernetes   1.29.0    1.30.0    ⬆️  update available
containerd   1.7.13    1.7.13    ✅ up-to-date
cilium       1.14.5    1.15.0    ⬆️  update available
postgres     15.5      16.1      ⬆️  update available
redis        7.2.3     7.2.3     ✅ up-to-date

Updates available: 3
```
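
When the update check feeds a script rather than a person, the table can be reduced to a plain list of names. The following is a minimal sketch, assuming the column layout shown in the sample output (`NAME CURRENT LATEST STATUS`); `list_outdated` is a hypothetical helper, not part of the `provisioning` CLI, and the real output format may differ.

```bash
#!/usr/bin/env bash
# Sketch: print the name of every taskserv whose CURRENT and LATEST differ.
# Reads the check-updates table on stdin. The layout is an assumption taken
# from the sample output in this guide.
list_outdated() {
    awk '
        $1 == "NAME" { in_table = 1; next }    # header row starts the table
        in_table && NF == 0 { in_table = 0 }   # first blank line ends it
        # Appending "" forces a string comparison, so 1.10 and 1.1 differ.
        in_table && NF >= 3 && ($2 "") != ($3 "") { print $1 }
    '
}

# Usage (hypothetical pipeline):
#   provisioning t check-updates | list_outdated
```

Comparing the two version columns instead of matching the status text keeps the helper independent of the emoji and wording in the last column.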

### 1.2 Check Specific Task Service

```bash
# Check specific taskserv
provisioning t check-updates kubernetes
```

**Expected Output:**

```plaintext
📦 Kubernetes Update Check:

Current:  1.29.0
Latest:   1.30.0
Status:   ⬆️  Update available

Changelog:
  • Enhanced security features
  • Performance improvements
  • Bug fixes in kube-apiserver
  • New workload resource types

Breaking Changes:
  • None

Recommended: ✅ Safe to update
```

### 1.3 Check Version Status

```bash
# Show detailed version information
provisioning version show
```

**Expected Output:**

```plaintext
📋 Component Versions:

COMPONENT    CURRENT   LATEST    DAYS OLD  STATUS
kubernetes   1.29.0    1.30.0    45        ⬆️  update
containerd   1.7.13    1.7.13    0         ✅ current
cilium       1.14.5    1.15.0    30        ⬆️  update
postgres     15.5      16.1      60        ⬆️  update (major)
redis        7.2.3     7.2.3     0         ✅ current
```
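
The sample output flags the postgres change as `(major)`. A rough sketch of how such a classification can be derived from two version strings is shown here, assuming purely numeric `MAJOR.MINOR[.PATCH]` versions; `bump_kind` is a hypothetical helper and is not how the `provisioning` CLI is known to do it.

```bash
#!/usr/bin/env bash
# Sketch: report which version component changed between two versions.
# Missing components are treated as 0, so "16" and "16.0.0" compare equal.
# It detects a difference only; it does not check upgrade vs downgrade.
bump_kind() {
    local cur_major cur_minor cur_patch new_major new_minor new_patch
    IFS=. read -r cur_major cur_minor cur_patch <<< "$1"
    IFS=. read -r new_major new_minor new_patch <<< "$2"
    if [ "$cur_major" != "$new_major" ]; then
        echo major
    elif [ "${cur_minor:-0}" != "${new_minor:-0}" ]; then
        echo minor
    elif [ "${cur_patch:-0}" != "${new_patch:-0}" ]; then
        echo patch
    else
        echo none
    fi
}

# Usage:
#   bump_kind 15.5 16.1      # prints: major
#   bump_kind 1.29.0 1.30.0  # prints: minor
```

A result of `major` is a reasonable trigger for the staging test and maintenance window described later in this guide.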

### 1.4 Check for Security Updates

```bash
# Check for security-related updates
provisioning version updates --security-only
```

## Step 2: Plan Your Update

### 2.1 Review Current Configuration

```bash
# Show current infrastructure
provisioning show settings --infra my-production
```

### 2.2 Backup Configuration

```bash
# Create configuration backup
cp -r workspace/infra/my-production workspace/infra/my-production.backup-$(date +%Y%m%d)

# Or use built-in backup
provisioning ws backup my-production
```

**Expected Output:**

```plaintext
✅ Backup created: workspace/backups/my-production-20250930.tar.gz
```
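
One caveat with the manual `cp -r` approach: if the dated target directory already exists (for example, a second backup on the same day), `cp -r` copies the source into it as a nested directory instead of failing. A small guard avoids that. This is a sketch only; `backup_infra_dir` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: make a dated copy of an infra directory, refusing to overwrite
# or nest into an existing backup. Prints the backup path on success.
backup_infra_dir() {
    local src="${1%/}"                     # strip a trailing slash, if any
    local stamp="${2:-$(date +%Y%m%d)}"    # optional explicit date stamp
    local dest="${src}.backup-${stamp}"
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "backup already exists: $dest" >&2
        return 1
    fi
    cp -r "$src" "$dest" && echo "$dest"
}

# Usage:
#   backup_infra_dir workspace/infra/my-production
```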

### 2.3 Create Update Plan

```bash
# Generate update plan
provisioning plan update --infra my-production
```

**Expected Output:**

```plaintext
📝 Update Plan for my-production:

Phase 1: Minor Updates (Low Risk)
  • containerd: No update needed
  • redis: No update needed

Phase 2: Patch Updates (Medium Risk)
  • cilium: 1.14.5 → 1.15.0 (estimated 5 minutes)

Phase 3: Major Updates (High Risk - Requires Testing)
  • kubernetes: 1.29.0 → 1.30.0 (estimated 15 minutes)
  • postgres: 15.5 → 16.1 (estimated 10 minutes, may require data migration)

Recommended Order:
  1. Update cilium (low risk)
  2. Update kubernetes (test in staging first)
  3. Update postgres (requires maintenance window)

Total Estimated Time: 30 minutes
Recommended: Test in staging environment first
```

## Step 3: Update Task Services

### 3.1 Update Non-Critical Service (Cilium Example)

#### Dry-Run Update

```bash
# Test update without applying
provisioning t create cilium --infra my-production --check
```

**Expected Output:**

```plaintext
🔍 CHECK MODE: Simulating Cilium update

Current: 1.14.5
Target:  1.15.0

Would perform:
  1. Download Cilium 1.15.0
  2. Update configuration
  3. Rolling restart of Cilium pods
  4. Verify connectivity

Estimated downtime: <1 minute per node
No errors detected. Ready to update.
```

#### Generate Updated Configuration

```bash
# Generate new configuration
provisioning t generate cilium --infra my-production
```

**Expected Output:**

```plaintext
✅ Generated Cilium configuration (version 1.15.0)
   Saved to: workspace/infra/my-production/taskservs/cilium.k
```

#### Apply Update

```bash
# Apply update
provisioning t create cilium --infra my-production
```

**Expected Output:**

```plaintext
🚀 Updating Cilium on my-production...

Downloading Cilium 1.15.0... ⏳
✅ Downloaded

Updating configuration... ⏳
✅ Configuration updated

Rolling restart: web-01... ⏳
✅ web-01 updated (Cilium 1.15.0)

Rolling restart: web-02... ⏳
✅ web-02 updated (Cilium 1.15.0)

Verifying connectivity... ⏳
✅ All nodes connected

🎉 Cilium update complete!
   Version: 1.14.5 → 1.15.0
   Downtime: 0 minutes
```

#### Verify Update

```bash
# Verify updated version
provisioning version taskserv cilium
```

**Expected Output:**

```plaintext
📦 Cilium Version Info:

Installed: 1.15.0
Latest:    1.15.0
Status:    ✅ Up-to-date

Nodes:
  ✅ web-01: 1.15.0 (running)
  ✅ web-02: 1.15.0 (running)
```

### 3.2 Update Critical Service (Kubernetes Example)

#### Test in Staging First

```bash
# If you have a staging environment
provisioning t create kubernetes --infra my-staging --check
provisioning t create kubernetes --infra my-staging

# Run integration tests
provisioning test kubernetes --infra my-staging
```

#### Backup Current State

```bash
# Export workload resources (note: `get all` omits ConfigMaps, Secrets, CRDs, and other types)
kubectl get all -A -o yaml > k8s-backup-$(date +%Y%m%d).yaml

# Back up etcd (if using external etcd)
provisioning t backup kubernetes --infra my-production
```

#### Schedule Maintenance Window

```bash
# Set maintenance mode (optional, if supported)
provisioning maintenance enable --infra my-production --duration 30m
```

#### Update Kubernetes

```bash
# Update control plane first
provisioning t create kubernetes --infra my-production --control-plane-only
```

**Expected Output:**

```plaintext
🚀 Updating Kubernetes control plane on my-production...

Draining control plane: web-01... ⏳
✅ web-01 drained

Updating control plane: web-01... ⏳
✅ web-01 updated (Kubernetes 1.30.0)

Uncordoning: web-01... ⏳
✅ web-01 ready

Verifying control plane... ⏳
✅ Control plane healthy

🎉 Control plane update complete!
```

```bash
# Update worker nodes one by one
provisioning t create kubernetes --infra my-production --workers-only --rolling
```

**Expected Output:**

```plaintext
🚀 Updating Kubernetes workers on my-production...

Rolling update: web-02...
  Draining... ⏳
  ✅ Drained (pods rescheduled)

  Updating... ⏳
  ✅ Updated (Kubernetes 1.30.0)

  Uncordoning... ⏳
  ✅ Ready

  Waiting for pods to stabilize... ⏳
  ✅ All pods running

🎉 Worker update complete!
   Updated: web-02
   Version: 1.30.0
```

#### Verify Update

```bash
# Verify Kubernetes cluster
kubectl get nodes
provisioning version taskserv kubernetes
```

**Expected Output:**

```plaintext
NAME     STATUS   ROLES           AGE   VERSION
web-01   Ready    control-plane   30d   v1.30.0
web-02   Ready    <none>          30d   v1.30.0
```
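
Reading the node table by eye works for two nodes but not for twenty. The check can be scripted along these lines, assuming the default `kubectl get nodes` column layout with a header row (`NAME STATUS ROLES AGE VERSION`); `all_nodes_on_version` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if every node is exactly "Ready" and reports the
# expected kubelet version. Reads `kubectl get nodes` output on stdin and
# prints the name of each node that fails the check.
all_nodes_on_version() {
    awk -v want="$1" '
        NR == 1 || NF == 0 { next }    # skip the header row and blank lines
        { seen++ }
        # A cordoned node shows "Ready,SchedulingDisabled" and is flagged too.
        $2 != "Ready" || ($NF "") != want { bad++; print $1 }
        END { if (seen == 0 || bad > 0) exit 1 }
    '
}

# Usage:
#   kubectl get nodes | all_nodes_on_version v1.30.0
```

An empty node list counts as a failure, so a broken `kubectl` context does not pass silently.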

```bash
# Run smoke tests
provisioning test kubernetes --infra my-production
```

### 3.3 Update Database (PostgreSQL Example)

โš ๏ธ **WARNING**: Database updates may require data migration. Always backup first!

#### Backup Database

```bash
# Backup PostgreSQL database
provisioning t backup postgres --infra my-production
```

**Expected Output:**

```plaintext
🗄️  Backing up PostgreSQL...

Creating dump: my-production-postgres-20250930.sql... ⏳
✅ Dump created (2.3 GB)

Compressing... ⏳
✅ Compressed (450 MB)

Saved to: workspace/backups/postgres/my-production-20250930.sql.gz
```
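
Before stopping the old database, it is worth confirming that the dump file is at least readable. The sketch below uses `gzip -t`, which checks the integrity of the compressed stream; `verify_dump` is a hypothetical helper. Note that this does not prove the SQL restores cleanly, which only a test restore can show.

```bash
#!/usr/bin/env bash
# Sketch: basic sanity checks on a gzip-compressed dump.
# Fails if the file is missing, empty, or not a valid gzip stream.
verify_dump() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    if ! gzip -t "$file" 2>/dev/null; then
        echo "corrupt gzip archive: $file" >&2
        return 1
    fi
    echo "ok: $file"
}

# Usage:
#   verify_dump workspace/backups/postgres/my-production-20250930.sql.gz
```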

#### Check Compatibility

```bash
# Check if data migration is needed
provisioning t check-migration postgres --from 15.5 --to 16.1
```

**Expected Output:**

```plaintext
🔍 PostgreSQL Migration Check:

From: 15.5
To:   16.1

Migration Required: ✅ Yes (major version change)

Steps Required:
  1. Dump database with pg_dump
  2. Stop PostgreSQL 15.5
  3. Install PostgreSQL 16.1
  4. Initialize new data directory
  5. Restore from dump

Estimated Time: 15-30 minutes (depending on data size)
Estimated Downtime: 15-30 minutes

Recommended: Use streaming replication for zero-downtime upgrade
```

#### Perform Update

```bash
# Update PostgreSQL (with automatic migration)
provisioning t create postgres --infra my-production --migrate
```

**Expected Output:**

```plaintext
🚀 Updating PostgreSQL on my-production...

⚠️  Major version upgrade detected (15.5 → 16.1)
   Automatic migration will be performed

Dumping database... ⏳
✅ Database dumped (2.3 GB)

Stopping PostgreSQL 15.5... ⏳
✅ Stopped

Installing PostgreSQL 16.1... ⏳
✅ Installed

Initializing new data directory... ⏳
✅ Initialized

Restoring database... ⏳
✅ Restored (2.3 GB)

Starting PostgreSQL 16.1... ⏳
✅ Started

Verifying data integrity... ⏳
✅ All tables verified

🎉 PostgreSQL update complete!
   Version: 15.5 → 16.1
   Downtime: 18 minutes
```

#### Verify Update

```bash
# Verify PostgreSQL
provisioning version taskserv postgres
ssh db-01 "psql --version"
```

## Step 4: Update Multiple Services

### 4.1 Batch Update (Sequentially)

```bash
# Update multiple taskservs one by one
provisioning t update --infra my-production --taskservs cilium,containerd,redis
```

**Expected Output:**

```plaintext
🚀 Updating 3 taskservs on my-production...

[1/3] Updating cilium... ⏳
✅ cilium updated (1.15.0)

[2/3] Updating containerd... ⏳
✅ containerd updated (1.7.14)

[3/3] Updating redis... ⏳
✅ redis updated (7.2.4)

🎉 All updates complete!
   Updated: 3 taskservs
   Total time: 8 minutes
```
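
If you prefer explicit control over each step of a sequential update, the per-service commands from Step 3 can be wrapped in a loop that dry-runs every service first and stops at the first failure. This is a sketch, assuming the `provisioning` commands exit non-zero on failure (not confirmed by this guide); `update_in_order` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: update taskservs one at a time. Each service gets a --check
# dry run before the real update; the loop stops at the first failure
# so later services are left untouched.
update_in_order() {
    local infra="$1"; shift
    local name
    for name in "$@"; do
        echo "checking $name"
        provisioning t create "$name" --infra "$infra" --check || {
            echo "dry run failed for $name; stopping" >&2
            return 1
        }
        echo "updating $name"
        provisioning t create "$name" --infra "$infra" || {
            echo "update failed for $name; stopping" >&2
            return 1
        }
    done
}

# Usage:
#   update_in_order my-production cilium containerd redis
```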

### 4.2 Parallel Update (Non-Dependent Services)

```bash
# Update taskservs in parallel (if they don't depend on each other)
provisioning t update --infra my-production --taskservs redis,postgres --parallel
```

**Expected Output:**

```plaintext
🚀 Updating 2 taskservs in parallel on my-production...

redis: Updating... ⏳
postgres: Updating... ⏳

redis: ✅ Updated (7.2.4)
postgres: ✅ Updated (16.1)

🎉 All updates complete!
   Updated: 2 taskservs
   Total time: 3 minutes (parallel)
```
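
The same effect can be approximated with plain shell job control, which makes the failure handling explicit: every update runs as a background job, and the script waits for each one and reports which services failed. This is a sketch under the assumption that `provisioning t create` exits non-zero on failure; `update_in_parallel` is a hypothetical helper, not the implementation behind `--parallel`.

```bash
#!/usr/bin/env bash
# Sketch: start one background update per taskserv, then wait for each
# job and collect failures. Returns 1 if any update failed.
update_in_parallel() {
    local infra="$1"; shift
    local name i failed=0
    local -a pids=() names=()
    for name in "$@"; do
        provisioning t create "$name" --infra "$infra" &
        pids+=("$!")
        names+=("$name")
    done
    for i in "${!pids[@]}"; do
        if ! wait "${pids[$i]}"; then
            echo "update failed: ${names[$i]}" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Usage (only for services that do not depend on each other):
#   update_in_parallel my-production redis cilium
```

Unlike the sequential case, a failure here does not stop the other jobs, so each failed service has to be checked or rolled back individually.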

## Step 5: Update Server Configuration

### 5.1 Update Server Resources

```bash
# Edit server configuration
provisioning sops workspace/infra/my-production/servers.k
```

**Example: Upgrade server plan**

```kcl
# Before
{
    name = "web-01"
    plan = "1xCPU-2GB"  # Old plan
}

# After
{
    name = "web-01"
    plan = "2xCPU-4GB"  # New plan
}
```
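
A typo in a plan name can turn an intended upgrade into a downgrade. The sketch below compares two plan names before anything is applied, assuming every plan follows the `<N>xCPU-<M>GB` pattern seen in the example (real provider plan names may not); `plan_change` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: classify a plan change as upgrade, downgrade, or unchanged.
# Any reduction in CPU or memory counts as a downgrade.
# Returns 2 if either name does not match the assumed pattern.
plan_change() {
    local re='^([0-9]+)xCPU-([0-9]+)GB$'
    local old_cpu old_mem new_cpu new_mem
    [[ "$1" =~ $re ]] || { echo "unrecognized plan: $1" >&2; return 2; }
    old_cpu="${BASH_REMATCH[1]}"; old_mem="${BASH_REMATCH[2]}"
    [[ "$2" =~ $re ]] || { echo "unrecognized plan: $2" >&2; return 2; }
    new_cpu="${BASH_REMATCH[1]}"; new_mem="${BASH_REMATCH[2]}"
    if [ "$new_cpu" -lt "$old_cpu" ] || [ "$new_mem" -lt "$old_mem" ]; then
        echo downgrade
    elif [ "$new_cpu" -eq "$old_cpu" ] && [ "$new_mem" -eq "$old_mem" ]; then
        echo unchanged
    else
        echo upgrade
    fi
}

# Usage:
#   plan_change 1xCPU-2GB 2xCPU-4GB   # prints: upgrade
```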

```bash
# Apply server update
provisioning s update --infra my-production --check
provisioning s update --infra my-production
```

### 5.2 Update Server OS

```bash
# Update operating system packages
provisioning s update --infra my-production --os-update
```

**Expected Output:**

```plaintext
🚀 Updating OS packages on my-production servers...

web-01: Updating packages... ⏳
✅ web-01: 24 packages updated

web-02: Updating packages... ⏳
✅ web-02: 24 packages updated

db-01: Updating packages... ⏳
✅ db-01: 24 packages updated

🎉 OS updates complete!
```

## Step 6: Rollback Procedures

### 6.1 Rollback Task Service

If an update fails or causes issues:

```bash
# Rollback to previous version
provisioning t rollback cilium --infra my-production
```

**Expected Output:**

```plaintext
🔄 Rolling back Cilium on my-production...

Current: 1.15.0
Target:  1.14.5 (previous version)

Rolling back: web-01... ⏳
✅ web-01 rolled back

Rolling back: web-02... ⏳
✅ web-02 rolled back

Verifying connectivity... ⏳
✅ All nodes connected

🎉 Rollback complete!
   Version: 1.15.0 → 1.14.5
```
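
Update, health check, and rollback can be chained so that a failed update never leaves the service in a half-updated state waiting for someone to notice. This is a minimal sketch using only commands shown in this guide, and it assumes they exit non-zero on failure, which the guide does not confirm; `update_or_rollback` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: apply an update, then run the health check. If either step
# fails, roll the taskserv back and report failure to the caller.
update_or_rollback() {
    local name="$1" infra="$2"
    if provisioning t create "$name" --infra "$infra" &&
       provisioning health --infra "$infra"; then
        echo "$name updated and healthy"
        return 0
    fi
    echo "$name update failed or unhealthy; rolling back" >&2
    provisioning t rollback "$name" --infra "$infra"
    return 1
}

# Usage:
#   update_or_rollback cilium my-production
```

The function returns 1 even when the rollback itself succeeds, so a calling script still stops and the failure gets investigated.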

### 6.2 Rollback from Backup

```bash
# Restore configuration from backup
provisioning ws restore my-production --from workspace/backups/my-production-20250930.tar.gz
```

### 6.3 Emergency Rollback

```bash
# Complete infrastructure rollback
provisioning rollback --infra my-production --to-snapshot <snapshot-id>
```

## Step 7: Post-Update Verification

### 7.1 Verify All Components

```bash
# Check overall health
provisioning health --infra my-production
```

**Expected Output:**

```plaintext
🏥 Health Check: my-production

Servers:
  ✅ web-01: Healthy
  ✅ web-02: Healthy
  ✅ db-01: Healthy

Task Services:
  ✅ kubernetes: 1.30.0 (healthy)
  ✅ containerd: 1.7.13 (healthy)
  ✅ cilium: 1.15.0 (healthy)
  ✅ postgres: 16.1 (healthy)

Clusters:
  ✅ buildkit: 2/2 replicas (healthy)

Overall Status: ✅ All systems healthy
```

### 7.2 Verify Version Updates

```bash
# Verify all versions are updated
provisioning version show
```

### 7.3 Run Integration Tests

```bash
# Run comprehensive tests
provisioning test all --infra my-production
```

**Expected Output:**

```plaintext
🧪 Running Integration Tests...

[1/5] Server connectivity... ⏳
✅ All servers reachable

[2/5] Kubernetes health... ⏳
✅ All nodes ready, all pods running

[3/5] Network connectivity... ⏳
✅ All services reachable

[4/5] Database connectivity... ⏳
✅ PostgreSQL responsive

[5/5] Application health... ⏳
✅ All applications healthy

🎉 All tests passed!
```

### 7.4 Monitor for Issues

```bash
# Monitor logs for errors
provisioning logs --infra my-production --follow --level error
```

## Update Checklist

Use this checklist for production updates:

- [ ] Check for available updates
- [ ] Review changelog and breaking changes
- [ ] Create configuration backup
- [ ] Test update in staging environment
- [ ] Schedule maintenance window
- [ ] Notify team/users of maintenance
- [ ] Update non-critical services first
- [ ] Verify each update before proceeding
- [ ] Update critical services with rolling updates
- [ ] Backup database before major updates
- [ ] Verify all components after update
- [ ] Run integration tests
- [ ] Monitor for issues (30 minutes minimum)
- [ ] Document any issues encountered
- [ ] Close maintenance window

## Common Update Scenarios

### Scenario 1: Minor Security Patch

```bash
# Quick security update
provisioning t check-updates --security-only
provisioning t update --infra my-production --security-patches --yes
```

### Scenario 2: Major Version Upgrade

```bash
# Careful major version update
provisioning ws backup my-production
provisioning t check-migration <service> --from X.Y --to X+1.Y
provisioning t create <service> --infra my-production --migrate
provisioning test all --infra my-production
```

### Scenario 3: Emergency Hotfix

```bash
# Apply critical hotfix immediately
provisioning t create <service> --infra my-production --hotfix --yes
```

## Troubleshooting Updates

### Issue: Update fails mid-process

**Solution:**

```bash
# Check update status
provisioning t status <taskserv> --infra my-production

# Resume failed update
provisioning t update <taskserv> --infra my-production --resume

# Or rollback
provisioning t rollback <taskserv> --infra my-production
```

### Issue: Service not starting after update

**Solution:**

```bash
# Check logs
provisioning logs <taskserv> --infra my-production

# Verify configuration
provisioning t validate <taskserv> --infra my-production

# Rollback if necessary
provisioning t rollback <taskserv> --infra my-production
```

### Issue: Data migration fails

**Solution:**

```bash
# Check migration logs
provisioning t migration-logs <taskserv> --infra my-production

# Restore from backup
provisioning t restore <taskserv> --infra my-production --from <backup-file>
```

## Best Practices

1. **Always Test First**: Test updates in staging before production
2. **Back Up Everything**: Create backups before any update
3. **Update Gradually**: Update one service at a time
4. **Monitor Closely**: Watch for errors after each update
5. **Have a Rollback Plan**: Always have a rollback strategy
6. **Document Changes**: Keep update logs for reference
7. **Schedule Wisely**: Update during low-traffic periods
8. **Verify Thoroughly**: Run tests after each update

## Next Steps

- **[Customize Guide](customize-infrastructure.md)** - Customize your infrastructure
- **[From Scratch Guide](from-scratch.md)** - Deploy new infrastructure
- **[Workflow Guide](../development/workflow.md)** - Automate with workflows

## Quick Reference

```bash
# Update workflow
provisioning t check-updates
provisioning ws backup my-production
provisioning t create <taskserv> --infra my-production --check
provisioning t create <taskserv> --infra my-production
provisioning version taskserv <taskserv>
provisioning health --infra my-production
provisioning test all --infra my-production
```

---

*This guide is part of the provisioning project documentation. Last updated: 2025-09-30*