Multi-Region High Availability Workspace
This workspace demonstrates a production-ready global high availability deployment spanning three cloud providers across three geographic regions:
- US East (DigitalOcean NYC): Primary region - active serving, primary database
- EU Central (Hetzner Germany): Secondary region - active serving, read replicas
- Asia Pacific (AWS Singapore): Tertiary region - active serving, read replicas
Why Multi-Region High Availability?
Business Benefits
- 99.99% Uptime Target: Automatic failover across regions
- Low Latency: Users served from geographically closest region
- Compliance: Data residency in specific regions (GDPR for EU)
- Disaster Recovery: Complete regional failure tolerance
Technical Benefits
- Load Distribution: Traffic spread across 3 regions
- Cost Optimization: Pay only for actual usage (~$311/month)
- Provider Diversity: Reduces vendor lock-in risk
- Capacity Planning: Scale independently per region
Architecture Overview
┌───────────────────────────────────────┐
│          Global Route53 DNS           │
│  Geographic Routing + Health Checks   │
└───────────────────┬───────────────────┘
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│    US     │ │    EU     │ │   APAC    │
│  Primary  │ │ Secondary │ │ Tertiary  │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
      │             │             │
┌─────▼─────────────▼─────────────▼─────┐
│    Primary + Read-Replica Database    │
│        Replication (300s lag)         │
└─────┬─────────────┬─────────────┬─────┘
      │             │             │
┌─────▼─────────────▼─────────────▼─────┐
│DO Droplets     Hetzner         AWS    │
│ 3 x nyc3       3 x nbg1      3 x sgp1 │
│      Load Balancer (per region)       │
└─────────┬───────────────────┬─────────┘
          │VPN Tunnels (IPSec)│
          └───────────────────┘
Regional Components
US East (DigitalOcean) - Primary
Region: nyc3 (New York)
Compute: 3x Droplets (s-2vcpu-4gb)
Load Balancer: Round-robin with health checks
Database: PostgreSQL (3-node cluster, Multi-AZ)
Network: VPC 10.0.0.0/16
Cost: ~$102/month
EU Central (Hetzner) - Secondary
Region: nbg1 (Nuremberg, Germany)
Compute: 3x CPX21 servers (4 vCPU, 8GB RAM)
Load Balancer: Hetzner Load Balancer
Database: Read-only replica (lag: 300s)
Network: vSwitch 10.1.0.0/16
Cost: ~$79/month (€72.70)
Asia Pacific (AWS) - Tertiary
Region: ap-southeast-1 (Singapore)
Compute: 3x EC2 t3.medium instances
Load Balancer: Application Load Balancer (ALB)
Database: RDS read-only replica (lag: 300s)
Network: VPC 10.2.0.0/16
Cost: ~$130/month
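Because the regions are linked by VPN tunnels, their private address ranges must not overlap, or traffic between them cannot be routed. The following is a minimal sketch, not part of the workspace itself, that checks the three ranges listed above with Python's standard `ipaddress` module. The dictionary keys and the `find_overlaps` helper are illustrative names.

```python
import ipaddress
from itertools import combinations

# Regional networks as listed in the component summaries above.
REGION_NETWORKS = {
    "us-east (DigitalOcean VPC)": "10.0.0.0/16",
    "eu-central (Hetzner vSwitch)": "10.1.0.0/16",
    "asia-southeast (AWS VPC)": "10.2.0.0/16",
}

def find_overlaps(networks):
    """Return a list of (name_a, name_b) pairs whose CIDR ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]

overlaps = find_overlaps(REGION_NETWORKS)
print("overlapping networks:", overlaps or "none")
```

If you change any of the ranges in workspace.ncl, rerunning a check like this before deploying can catch a conflict early.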
Prerequisites
1. Cloud Accounts & Credentials
DigitalOcean
# Create API token
# Dashboard → API → Tokens/Keys → Generate New Token
# Scopes: read, write
export DIGITALOCEAN_TOKEN="dop_v1_abc123def456ghi789jkl012mno"
Hetzner
# Create API token
# Dashboard → Security → API Tokens → Generate Token
export HCLOUD_TOKEN="MC4wNTI1YmE1M2E4YmE0YTQzMTQyZTdlODYy"
AWS
# Create IAM user with programmatic access
# IAM → Users → Add User → Check "Programmatic access"
# Attach policies: AmazonEC2FullAccess, AmazonRDSFullAccess, AmazonRoute53FullAccess
export AWS_ACCESS_KEY_ID="AKIA1234567890ABCDEF"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG+j/zI0m1234567890ab"
2. CLI Tools
# Verify all CLIs are installed
which doctl
which hcloud
which aws
which nickel
# Versions
doctl version # >= 1.94.0
hcloud version # >= 1.35.0
aws --version # >= 2.0
nickel --version # >= 1.0
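Comparing version numbers as plain strings gives wrong answers (for example, "1.100.0" sorts before "1.94.0"). The sketch below compares dotted versions numerically against the minimums listed above. It assumes you have already extracted the bare version string from each CLI's output, since the output format differs per tool. `parse_version` and `meets_minimum` are illustrative helpers, not part of the workspace.

```python
# Minimum versions from the list above.
MINIMUM_VERSIONS = {
    "doctl": "1.94.0",
    "hcloud": "1.35.0",
    "aws": "2.0",
    "nickel": "1.0",
}

def parse_version(text):
    """Turn a dotted version such as '1.94.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)

# Example with made-up installed versions (assumptions for illustration only).
installed = {"doctl": "1.100.0", "hcloud": "1.34.2", "aws": "2.15.30", "nickel": "1.4.1"}
for tool, minimum in MINIMUM_VERSIONS.items():
    status = "ok" if meets_minimum(installed[tool], minimum) else "too old"
    print(f"{tool}: {installed[tool]} (needs >= {minimum}) {status}")
```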
3. SSH Keys
DigitalOcean
# Upload SSH key
doctl compute ssh-key create provisioning-key \
--public-key-from-file ~/.ssh/id_rsa.pub
# Note the key ID
doctl compute ssh-key list
Hetzner
# Upload SSH key
hcloud ssh-key create \
--name provisioning-key \
--public-key-from-file ~/.ssh/id_rsa.pub
# List keys
hcloud ssh-key list
AWS
# Create or import EC2 key pair
aws ec2 create-key-pair \
--key-name provisioning-key \
--query 'KeyMaterial' --output text > provisioning-key.pem
chmod 600 provisioning-key.pem
4. Domain and DNS
You need a domain hosted in Route53, or the ability to create equivalent DNS records with another provider:
# Create hosted zone in Route53
aws route53 create-hosted-zone \
--name api.example.com \
--caller-reference $(date +%s)
# Note the Zone ID for updates
aws route53 list-hosted-zones
Deployment
Step 1: Configure the Workspace
Edit workspace.ncl to customize:
# Update SSH key references
droplets = digitalocean.Droplet & {
ssh_keys = ["YOUR_DO_KEY_ID"],
name = "us-app",
region = "nyc3"
}
# Update AWS AMI IDs for your region
app_servers = aws.EC2 & {
image_id = "ami-09d56f8956ab235b7",
instance_type = "t3.medium",
region = "ap-southeast-1"
}
# Update certificate ID
load_balancer = digitalocean.LoadBalancer & {
forwarding_rules = [{
certificate_id = "your-certificate-id",
entry_protocol = "https",
entry_port = 443
}]
}
Edit config.toml:
# Update regional names if different
[providers.digitalocean]
region_name = "us-east"
[providers.hetzner]
region_name = "eu-central"
[providers.aws]
region_name = "asia-southeast"
# Update domain
[dns]
domain = "api.example.com"
Step 2: Validate Configuration
# Validate Nickel syntax
nickel export workspace.ncl | jq . > /dev/null
# Verify credentials per provider
doctl auth init --access-token "$DIGITALOCEAN_TOKEN"
hcloud context use default
aws sts get-caller-identity
# Check connectivity
doctl account get
hcloud server list
aws ec2 describe-regions
Step 3: Deploy
# Make script executable
chmod +x deploy.nu
# Execute deployment (step-by-step)
./deploy.nu
# Or with debug output
./deploy.nu --debug
# Or deploy per region
./deploy.nu --region us-east
./deploy.nu --region eu-central
./deploy.nu --region asia-southeast
Step 4: Verify Global Deployment
# List resources per region
echo "=== US EAST (DigitalOcean) ==="
doctl compute droplet list --format Name,Region,Status,PublicIPv4
doctl compute load-balancer list
echo "=== EU CENTRAL (Hetzner) ==="
hcloud server list
echo "=== ASIA PACIFIC (AWS) ==="
aws ec2 describe-instances --region ap-southeast-1 \
--query 'Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name,PublicIpAddress]' \
--output table
aws elbv2 describe-load-balancers --region ap-southeast-1
Post-Deployment Configuration
1. SSL/TLS Certificates
AWS Certificate Manager
# Request a certificate for the ALB (ACM certificates are regional, so use the ALB's region)
aws acm request-certificate \
--domain-name api.example.com \
--subject-alternative-names "*.api.example.com" \
--validation-method DNS \
--region ap-southeast-1
# Get certificate ARN
aws acm list-certificates --region ap-southeast-1
# Note the ARN for workspace.ncl
2. Database Primary/Replica Setup
# Connect to US East primary
PGPASSWORD=admin psql -h us-db-primary.abc123.us-east-1.rds.amazonaws.com -U admin -d postgres
# Create the replication role used by the EU and APAC read replicas
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'replica_password';
# On the US East primary - confirm each replica has an active replication slot
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
# On each read replica (EU on Hetzner, APAC on AWS RDS) - confirm it is in recovery and replaying WAL
SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();
3. Global DNS Setup
# Create Route53 records for each region
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "us.api.example.com",
"Type": "A",
"TTL": 60,
"ResourceRecords": [{"Value": "198.51.100.15"}]
}
},
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "eu.api.example.com",
"Type": "A",
"TTL": 60,
"ResourceRecords": [{"Value": "192.0.2.100"}]
}
},
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "asia.api.example.com",
"Type": "A",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.50"}]
}
}
]
}'
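Hand-writing the change batch becomes error-prone as regions are added. As a rough sketch, the same JSON can be generated from a region-to-IP map. The `build_change_batch` helper and the `REGION_IPS` name are illustrative, and the IPs are the documentation-range placeholders used above.

```python
import json

# Regional endpoints from the example above (placeholder IPs).
REGION_IPS = {
    "us": "198.51.100.15",
    "eu": "192.0.2.100",
    "asia": "203.0.113.50",
}

def build_change_batch(domain, region_ips, ttl=60, action="CREATE"):
    """Build a Route53 change batch with one A record per region."""
    return {
        "Changes": [
            {
                "Action": action,
                "ResourceRecordSet": {
                    "Name": f"{prefix}.{domain}",
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ip}],
                },
            }
            for prefix, ip in region_ips.items()
        ]
    }

batch = build_change_batch("api.example.com", REGION_IPS)
print(json.dumps(batch, indent=2))
```

The printed JSON can be saved to a file and passed to the CLI with `--change-batch file://changes.json` instead of an inline string.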
# Health checks per region
aws route53 create-health-check \
--caller-reference "us-health-$(date +%s)" \
--health-check-config '{
"Type": "HTTPS",
"ResourcePath": "/health",
"FullyQualifiedDomainName": "us.api.example.com",
"Port": 443,
"RequestInterval": 30,
"FailureThreshold": 3
}'
4. Application Deployment
SSH to web servers in each region:
# US East
US_IP=$(doctl compute droplet get us-app-1 --format PublicIPv4 --no-header)
ssh root@$US_IP
# Deploy application
cd /var/www
git clone https://github.com/your-org/app.git
cd app
./deploy.sh
# EU Central
EU_IP=$(hcloud server list --selector region=eu-central -o noheader -o columns=id | head -1 | xargs -I {} hcloud server ip {})
ssh root@$EU_IP
# Asia Pacific
ASIA_IP=$(aws ec2 describe-instances \
--region ap-southeast-1 \
--filters "Name=tag:Name,Values=asia-app-1" \
--query 'Reservations[0].Instances[0].PublicIpAddress' \
--output text)
ssh -i provisioning-key.pem ec2-user@$ASIA_IP
Monitoring and Health Checks
Regional Monitoring
Each region reports metrics to its own provider's monitoring service (CloudWatch on AWS):
# DigitalOcean metrics
doctl monitoring metrics list droplet \
--droplet-id 123456789 \
--metric cpu
# Hetzner metrics (manual monitoring)
hcloud server list
# AWS CloudWatch
aws cloudwatch get-metric-statistics \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-02T00:00:00Z \
--period 300 \
--statistics Average
Global Health Checks
Route53 health checks verify all regions are healthy:
# List health checks
aws route53 list-health-checks
# Get detailed status
aws route53 get-health-check-status --health-check-id abc123
# Verify replication lag
# Run on a read replica (EU or APAC); this query returns NULL on the primary
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
# Should be less than 300 seconds
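To turn the measured lag into an actionable status, the two thresholds used in this guide (the 300-second expected lag and the 600-second alert level from Alert Configuration) can be combined. This is a minimal sketch. `classify_lag` and the status labels are illustrative, and the lag value is assumed to have been read from the query above.

```python
from datetime import timedelta

TARGET_LAG = timedelta(seconds=300)    # expected lag stated in this guide
CRITICAL_LAG = timedelta(seconds=600)  # alert threshold from Alert Configuration

def classify_lag(lag):
    """Map a measured replication lag to 'ok', 'warning', 'critical', or 'unknown'."""
    if lag is None:
        # The lag query yields NULL when run on a primary, so there is nothing to judge.
        return "unknown"
    if lag > CRITICAL_LAG:
        return "critical"
    if lag > TARGET_LAG:
        return "warning"
    return "ok"

for seconds in (120, 450, 900):
    print(f"{seconds}s -> {classify_lag(timedelta(seconds=seconds))}")
```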
Alert Configuration
Configure alerts for critical metrics:
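# CPU > 80% (period and evaluation-periods are example values)
aws cloudwatch put-metric-alarm \
--alarm-name us-east-high-cpu \
--alarm-actions arn:aws:sns:us-east-1:123456:ops-alerts \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 80 \
--comparison-operator GreaterThanThreshold
# Replication lag > 600s (ReplicaLag is the RDS replica lag metric, in seconds)
aws cloudwatch put-metric-alarm \
--alarm-name replication-lag-critical \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=asia-db-replica \
--statistic Average \
--period 60 \
--evaluation-periods 5 \
--threshold 600 \
--comparison-operator GreaterThanThreshold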
Failover Testing
Planned Failover - US East to EU Central
# 1. Stop traffic to US East
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"TTL": 60,
"ResourceRecords": [{"Value": "192.0.2.100"}]
}
}]
}'
# 2. Promote EU Central to primary
# Connect to EU read replica and promote
psql -h hetzner-eu-db.netz.de -U admin -d postgres \
-c "SELECT pg_promote();"
# 3. Verify failover
curl https://api.example.com/health
# 4. Monitor replication lag on the remaining replicas once they follow the new EU primary
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
Automatic Failover - Health Check Failure
Route53 automatically fails over when health checks fail:
# Simulate US East failure (for testing only)
# Stop web servers temporarily
doctl compute droplet-action power-off us-app-1 us-app-2 us-app-3
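# Wait for the health check to fail: 3 failed checks x 30s interval = ~90s,
# plus up to 60s for cached DNS answers (TTL) to expire
sleep 150
# Verify traffic now routes to EU/APAC (curl -v writes response headers to stderr)
curl -sv -o /dev/null https://api.example.com/ 2>&1 | grep -iE "^< server"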
# Restore US East
doctl compute droplet-action power-on us-app-1 us-app-2 us-app-3
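How long to wait during this test follows from the health check settings and record TTL used earlier in this guide (30-second interval, failure threshold of 3, 60-second TTL). The sketch below works out a rough upper bound. It is only an estimate, since Route53 runs checks from several locations and resolvers may cache differently. `worst_case_failover_seconds` is an illustrative helper.

```python
def worst_case_failover_seconds(request_interval, failure_threshold, dns_ttl):
    """Rough upper bound: time to mark the endpoint unhealthy plus cached DNS expiry."""
    detection = request_interval * failure_threshold
    return detection + dns_ttl

estimate = worst_case_failover_seconds(request_interval=30, failure_threshold=3, dns_ttl=60)
print(f"allow roughly {estimate}s before expecting traffic to shift")
```

Lowering the request interval or the TTL shortens this window at the cost of more health check traffic and more DNS queries.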
Scaling and Upgrades
Add More Web Servers
Edit workspace.ncl:
# Increase droplet count
region_us_east.app_servers = digitalocean.Droplet & {
count = 5,
name = "us-app",
region = "nyc3"
}
# Increase Hetzner servers
region_eu_central.app_servers = hetzner.Server & {
count = 5,
server_type = "cpx21",
location = "nbg1"
}
# Increase AWS EC2 instances
region_asia_southeast.app_servers = aws.EC2 & {
count = 5,
instance_type = "t3.medium",
region = "ap-southeast-1"
}
Redeploy:
./deploy.nu --region us-east
./deploy.nu --region eu-central
./deploy.nu --region asia-southeast
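Before scaling out, it helps to estimate the new compute bill. The sketch below assumes compute cost grows linearly with server count and uses the 3-server compute figures from the cost table in this guide. Actual provider pricing may differ, and the EU figure stays in EUR. `scaled_cost` is an illustrative helper.

```python
# Current monthly compute cost per region for 3 servers (from the cost table in this guide).
# US and APAC figures are in USD; the EU figure is in EUR.
CURRENT_COMPUTE = {
    "us-east": (72.00, "USD"),
    "eu-central": (62.70, "EUR"),
    "asia-southeast": (80.00, "USD"),
}

def scaled_cost(monthly_cost, current_count, target_count):
    """Estimate compute cost after scaling, assuming cost is linear in server count."""
    return monthly_cost / current_count * target_count

for region, (cost, currency) in CURRENT_COMPUTE.items():
    estimate = scaled_cost(cost, current_count=3, target_count=5)
    print(f"{region}: {cost:.2f} -> {estimate:.2f} {currency}/month")
```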
Upgrade Database Instance Class
Edit workspace.ncl:
# US East primary
database = digitalocean.Database & {
size = "db-s-4vcpu-8gb",
name = "us-db-primary",
engine = "pg"
}
DigitalOcean handles the upgrade with minimal downtime.
Upgrade EC2 Instances
# Stop instances for upgrade (rolling)
aws ec2 stop-instances --region ap-southeast-1 --instance-ids i-1234567890abcdef0
# Wait for stop
aws ec2 wait instance-stopped --region ap-southeast-1 --instance-ids i-1234567890abcdef0
# Modify instance type
aws ec2 modify-instance-attribute \
--region ap-southeast-1 \
--instance-id i-1234567890abcdef0 \
--instance-type Value=t3.large
# Start instance
aws ec2 start-instances --region ap-southeast-1 --instance-ids i-1234567890abcdef0
Cost Optimization
Monthly Cost Breakdown
| Component | US East | EU Central | Asia Pacific | Total |
|---|---|---|---|---|
| Compute | $72 | €62.70 | $80 | ~$220 |
| Database | $30 | Read Replica | $30 | $60 |
| Load Balancer | Free | ~$10 | ~$20 | ~$30 |
| Total | $102 | ~$79 | $130 | ~$311 |
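The table mixes USD and EUR, so a quick cross-check of the arithmetic is useful. The sketch below sums the regional totals and converts the EU compute figure using the exchange rate implied by the EU Central summary (€72.70 shown as about $79). That rate is derived from this guide's own figures, not a live quote, and the variable names are illustrative.

```python
# Regional monthly totals in USD, from the Total row of the table.
REGION_TOTALS_USD = {"us-east": 102, "eu-central": 79, "asia-pacific": 130}
grand_total = sum(REGION_TOTALS_USD.values())

# Exchange rate implied by the EU Central summary: EUR 72.70 is listed as ~USD 79.
implied_rate = 79 / 72.70

# Compute row: USD figures for US and APAC, EUR figure for EU converted to USD.
eu_compute_usd = 62.70 * implied_rate
compute_total = 72 + eu_compute_usd + 80

print(f"grand total: ~${grand_total}/month")
print(f"implied EUR->USD rate: {implied_rate:.3f}")
print(f"compute total: ~${compute_total:.0f}/month")
```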
Optimization Strategies
- Reduce instance count from 3 to 2 (saves ~$30-40/month)
- Downsize compute to s-1vcpu-2gb (saves ~$20-30/month)
- Use Reserved Instances on AWS (saves ~20-30%)
- Optimize data transfer between regions
- Review backups and retention settings
Monitor Costs
# DigitalOcean
doctl balance get
# AWS Cost Explorer (the End date is exclusive, so this covers all of January)
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
# Hetzner (manual via console)
# https://console.hetzner.cloud/billing
Troubleshooting
Issue: One Region Not Responding
Diagnosis:
# Check health checks
aws route53 get-health-check-status --health-check-id abc123
# Test regional endpoints
curl -v https://us.api.example.com/health
curl -v https://eu.api.example.com/health
curl -v https://asia.api.example.com/health
Solution:
- Check web server status in affected region
- Verify load balancer is healthy
- Review security groups/firewall rules
- Check application logs on web servers
Issue: High Replication Lag
Diagnosis:
# Check replication status from the primary (replay_lag is reported per connected replica)
psql -h us-db-primary.abc123.us-east-1.rds.amazonaws.com -U admin -d postgres \
-c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"
# Check replication slots
psql -h us-db-primary.abc123.us-east-1.rds.amazonaws.com -U admin -d postgres \
-c "SELECT * FROM pg_replication_slots;"
Solution:
- Check network connectivity between regions
- Verify VPN tunnels are operational
- Reduce write load on primary
- Monitor network bandwidth
- May need larger database instance
Issue: VPN Tunnel Down
Diagnosis:
# Check VPN connection status (AWS side of the tunnels)
aws ec2 describe-vpn-connections --region ap-southeast-1
# Test connectivity between regions
ssh hetzner-server "ping -c 4 10.0.0.1"
Solution:
- Reconnect VPN tunnel manually
- Verify tunnel configuration
- Check security groups allow necessary ports
- Review ISP routing
Cleanup
To destroy all resources (use carefully):
# DigitalOcean
doctl compute droplet delete --force us-app-1 us-app-2 us-app-3
doctl compute load-balancer delete --force us-lb
doctl databases delete --force YOUR_DB_CLUSTER_ID # find the ID with: doctl databases list
# Hetzner
hcloud server delete hetzner-eu-1 hetzner-eu-2 hetzner-eu-3
hcloud load-balancer delete eu-lb
hcloud volume delete eu-backups
# AWS
aws ec2 terminate-instances --region ap-southeast-1 --instance-ids i-xxxxx
aws elbv2 delete-load-balancer --region ap-southeast-1 --load-balancer-arn arn:aws:elasticloadbalancing:ap-southeast-1:123456789:loadbalancer/app/asia-lb/1234567890abcdef
aws rds delete-db-instance --region ap-southeast-1 --db-instance-identifier asia-db-replica --skip-final-snapshot
# Route53
aws route53 delete-health-check --health-check-id abc123
aws route53 delete-hosted-zone --id Z1234567890ABC
Next Steps
- Disaster Recovery Testing: Regular failover drills
- Auto-scaling: Add provider-specific autoscaling
- Monitoring Integration: Connect to centralized monitoring (Datadog, New Relic, Prometheus)
- Backup Automation: Implement cross-region backups
- Cost Optimization: Review and tune resource sizing
- Security Hardening: Implement WAF, DDoS protection
- Load Testing: Validate performance across regions
Support
For issues or questions:
- Review the multi-provider networking guide
- Check provider-specific documentation
- Review regional deployment logs:
./deploy.nu --debug
- Test regional endpoints independently
Files
- workspace.ncl: Global infrastructure definition (Nickel)
- config.toml: Provider credentials and regional settings
- deploy.nu: Multi-region deployment orchestration (Nushell)
- README.md: This file