provisioning/docs/src/infrastructure/batch-workflow-multi-provider.md
2026-01-14 03:09:18 +00:00

# Multi-Provider Batch Workflow Examples

This document provides practical examples of orchestrating complex deployments and operations across multiple cloud providers using the batch workflow system.

## Table of Contents

- Overview
- Workflow 1: Coordinated Multi-Provider Deployment
- Workflow 2: Multi-Provider Disaster Recovery Failover
- Workflow 3: Cost Optimization Workload Migration
- Workflow 4: Multi-Region Database Replication
- Best Practices
- Troubleshooting
- Summary

## Overview

The batch workflow system enables declarative orchestration of operations across multiple providers with:

- Dependency Tracking: Declare which phases and operations must complete before others start
- Error Handling: Automatic rollback on failure
- Idempotency: Workflows are safe to re-run
- Status Tracking: Real-time progress monitoring
- Recovery Checkpoints: Resume from the point of failure

## Workflow 1: Coordinated Multi-Provider Deployment

Use Case: Deploy a web application across DigitalOcean, AWS, and Hetzner with proper sequencing and dependencies.

Workflow Characteristics:
- Database is created before the web tier that depends on it
- Backup storage is ready before compute
- Web servers are created once the database is ready
- Health checks must pass before the deployment is considered complete

### Workflow Definition

```yaml
# file: workflows/multi-provider-deployment.yml

name: multi-provider-app-deployment
version: "1.0"
description: "Deploy web app across three cloud providers"

parameters:
  do_region: "nyc3"
  aws_region: "us-east-1"
  hetzner_location: "nbg1"
  web_server_count: 3

phases:
  # Phase 1: Create backup storage first (independent)
  - name: "provision-backup-storage"
    provider: "hetzner"
    description: "Create backup storage volume in Hetzner"

    operations:
      - id: "create-backup-volume"
        action: "create-volume"
        config:
          name: "webapp-backups"
          size: 500
          location: "{{ hetzner_location }}"
          format: "ext4"

        tags: ["storage", "backup"]

    on_failure: "alert"
    on_success: "proceed"

  # Phase 2: Create database (independent, but must complete before app)
  - name: "provision-database"
    provider: "aws"
    description: "Create managed PostgreSQL database"
    depends_on: []  # Can run in parallel with Phase 1

    operations:
      - id: "create-rds-instance"
        action: "create-db-instance"
        config:
          identifier: "webapp-db"
          engine: "postgres"
          engine_version: "14.6"
          instance_class: "db.t3.medium"
          allocated_storage: 100
          multi_az: true
          backup_retention_days: 30

        tags: ["database", "primary"]

      - id: "create-security-group"
        action: "create-security-group"
        config:
          name: "webapp-db-sg"
          description: "Security group for RDS"

        depends_on: ["create-rds-instance"]

      - id: "configure-db-access"
        action: "authorize-security-group"
        config:
          group_id: "{{ create-security-group.id }}"
          protocol: "tcp"
          port: 5432
          cidr: "10.0.0.0/8"

        depends_on: ["create-security-group"]

        timeout: 60

  # Phase 3: Create web tier (depends on backup storage and database being ready)
  - name: "provision-web-tier"
    provider: "digitalocean"
    description: "Create web servers and load balancer"
    depends_on: ["provision-backup-storage", "provision-database"]  # Wait for storage and database

    operations:
      - id: "create-droplets"
        action: "create-droplet"
        config:
          name: "web-server"
          size: "s-2vcpu-4gb"
          region: "{{ do_region }}"
          image: "ubuntu-22-04-x64"
          count: "{{ web_server_count }}"
          backups: true
          monitoring: true

        tags: ["web", "production"]

        timeout: 300
        retry:
          max_attempts: 3
          backoff: exponential

      - id: "create-firewall"
        action: "create-firewall"
        config:
          name: "web-firewall"
          inbound_rules:
            - protocol: "tcp"
              ports: "22"
              sources: ["0.0.0.0/0"]
            - protocol: "tcp"
              ports: "80"
              sources: ["0.0.0.0/0"]
            - protocol: "tcp"
              ports: "443"
              sources: ["0.0.0.0/0"]

        depends_on: ["create-droplets"]

      - id: "create-load-balancer"
        action: "create-load-balancer"
        config:
          name: "web-lb"
          algorithm: "round_robin"
          region: "{{ do_region }}"
          forwarding_rules:
            - entry_protocol: "http"
              entry_port: 80
              target_protocol: "http"
              target_port: 80
            - entry_protocol: "https"
              entry_port: 443
              target_protocol: "http"
              target_port: 80
          health_check:
            protocol: "http"
            port: 80
            path: "/health"
            interval: 10

        depends_on: ["create-droplets"]

  # Phase 4: Network configuration (depends on all resources)
  - name: "configure-networking"
    description: "Setup VPN tunnels and security between providers"
    depends_on: ["provision-web-tier"]

    operations:
      - id: "setup-vpn-tunnel-do-aws"
        action: "create-vpn-tunnel"
        config:
          source_provider: "digitalocean"
          destination_provider: "aws"
          protocol: "ipsec"
          encryption: "aes-256"

        timeout: 120

      - id: "setup-vpn-tunnel-aws-hetzner"
        action: "create-vpn-tunnel"
        config:
          source_provider: "aws"
          destination_provider: "hetzner"
          protocol: "ipsec"
          encryption: "aes-256"

  # Phase 5: Validation and verification
  - name: "verify-deployment"
    description: "Verify all resources are operational"
    depends_on: ["configure-networking"]

    operations:
      - id: "health-check-droplets"
        action: "run-health-check"
        config:
          targets: "{{ create-droplets.ips }}"
          endpoint: "/health"
          expected_status: 200
          timeout: 30

        timeout: 300

      - id: "health-check-database"
        action: "verify-database"
        config:
          host: "{{ create-rds-instance.endpoint }}"
          port: 5432
          database: "postgres"
          timeout: 30

      - id: "health-check-backup"
        action: "verify-volume"
        config:
          volume_id: "{{ create-backup-volume.id }}"
          status: "available"

# Rollback strategy: if any phase fails
rollback:
  strategy: "automatic"
  on_phase_failure: "rollback-previous-phases"
  preserve_data: true

# Notifications
notifications:
  on_start: "slack:#deployments"
  on_phase_complete: "slack:#deployments"
  on_failure: "slack:#alerts"
  on_success: "slack:#deployments"

# Validation checks
pre_flight:
  - check: "credentials"
    description: "Verify all provider credentials"
  - check: "quotas"
    description: "Verify sufficient quotas in each provider"
  - check: "dependencies"
    description: "Verify all dependencies are available"
```

### Execution Flow

```text
┌─────────────────────────────────────────────────────────┐
│                    Start Deployment                     │
└──────────────────┬──────────────────────────────────────┘
                   │
        ┌──────────┴──────────┐
        │                     │
        ▼                     ▼
 ┌─────────────┐     ┌──────────────────┐
 │   Hetzner   │     │       AWS        │
 │   Backup    │     │     Database     │
 │  (Phase 1)  │     │    (Phase 2)     │
 └──────┬──────┘     └────────┬─────────┘
        │                     │
        │ Ready               │ Ready
        └──────────┬──────────┘
                   │
                   ▼
          ┌──────────────────┐
          │   DigitalOcean   │
          │     Web Tier     │
          │    (Phase 3)     │
          │ - Droplets       │
          │ - Firewall       │
          │ - Load Balancer  │
          └────────┬─────────┘
                   │
                   ▼
          ┌──────────────────┐
          │  Network Setup   │
          │    (Phase 4)     │
          │ - VPN Tunnels    │
          └────────┬─────────┘
                   │
                   ▼
          ┌──────────────────┐
          │   Verification   │
          │    (Phase 5)     │
          │ - Health Checks  │
          └────────┬─────────┘
                   │
                   ▼
          ┌──────────────────┐
          │  Deployment OK   │
          │  (Ready to use)  │
          └──────────────────┘
```

## Workflow 2: Multi-Provider Disaster Recovery Failover

Use Case: Automated failover from the primary provider (DigitalOcean) to the backup provider (Hetzner) when a failure is detected.

Workflow Characteristics:
- Continuous health monitoring
- Automatic failover trigger
- Database promotion
- DNS update
- Verification before the failover is considered complete

### Workflow Definition

```yaml
# file: workflows/multi-provider-dr-failover.yml

name: multi-provider-dr-failover
version: "1.0"
description: "Automated failover from DigitalOcean to Hetzner"

parameters:
  primary_provider: "digitalocean"
  backup_provider: "hetzner"
  dns_provider: "aws"
  health_check_threshold: 3

phases:
  # Phase 1: Monitor primary provider
  - name: "monitor-primary"
    description: "Continuous health monitoring of primary"

    operations:
      - id: "health-check-primary"
        action: "run-health-check"
        config:
          provider: "{{ primary_provider }}"
          resources: ["web-servers", "load-balancer"]
          checks:
            - type: "http"
              endpoint: "/health"
              expected_status: 200
            - type: "database"
              host: "db.primary.example.com"
              query: "SELECT 1"
            - type: "connectivity"
              test: "ping"
          interval: 30  # Check every 30 seconds

        timeout: 300

      - id: "aggregate-health"
        action: "aggregate-metrics"
        config:
          source: "{{ health-check-primary.results }}"
          failure_threshold: 3  # 3 consecutive failures trigger failover

  # Phase 2: Trigger failover (conditional on failure)
  - name: "trigger-failover"
    description: "Activate disaster recovery if primary fails"
    depends_on: ["monitor-primary"]
    condition: "{{ aggregate-health.status }} == 'FAILED'"

    operations:
      - id: "alert-on-failure"
        action: "send-notification"
        config:
          type: "critical"
          message: "Primary provider ({{ primary_provider }}) has failed. Initiating failover..."
          recipients: ["ops-team@example.com", "slack:#alerts"]

      - id: "enable-backup-infrastructure"
        action: "scale-up"
        config:
          provider: "{{ backup_provider }}"
          target: "warm-standby-servers"
          desired_count: 3
          instance_type: "cx31"

        timeout: 300
        retry:
          max_attempts: 3

      - id: "promote-database-replica"
        action: "promote-read-replica"
        config:
          provider: "aws"
          replica_identifier: "backup-db-replica"
          to_master: true

        timeout: 600  # Allow time for promotion

  # Phase 3: Network failover
  - name: "network-failover"
    description: "Switch traffic to backup provider"
    depends_on: ["trigger-failover"]

    operations:
      - id: "update-load-balancer"
        action: "reconfigure-load-balancer"
        config:
          provider: "{{ dns_provider }}"
          record: "api.example.com"
          old_backend: "do-lb-{{ primary_provider }}"
          new_backend: "hz-lb-{{ backup_provider }}"

      - id: "update-dns"
        action: "update-dns-record"
        config:
          provider: "route53"
          record: "example.com"
          old_value: "do-lb-ip"
          new_value: "hz-lb-ip"
          ttl: 60

      - id: "update-cdn"
        action: "update-cdn-origin"
        config:
          cdn_provider: "cloudfront"
          distribution_id: "E123456789ABCDEF"
          new_origin: "backup-lb.hetzner.com"

  # Phase 4: Verify failover
  - name: "verify-failover"
    description: "Verify backup provider is operational"
    depends_on: ["network-failover"]

    operations:
      - id: "health-check-backup"
        action: "run-health-check"
        config:
          provider: "{{ backup_provider }}"
          resources: ["backup-servers"]
          endpoint: "/health"
          expected_status: 200
          timeout: 30

        timeout: 300

      - id: "verify-database"
        action: "verify-database"
        config:
          provider: "aws"
          database: "backup-db-promoted"
          query: "SELECT COUNT(*) FROM users"
          expected_rows: "> 0"

      - id: "verify-traffic"
        action: "verify-traffic-flow"
        config:
          endpoint: "https://example.com"
          expected_response_time: "< 500 ms"
          expected_status: 200

  # Phase 5: Activate backup fully
  - name: "activate-backup"
    description: "Run at full capacity on backup provider"
    depends_on: ["verify-failover"]

    operations:
      - id: "scale-to-production"
        action: "scale-up"
        config:
          provider: "{{ backup_provider }}"
          target: "all-backup-servers"
          desired_count: 6

        timeout: 600

      - id: "configure-persistence"
        action: "enable-persistence"
        config:
          provider: "{{ backup_provider }}"
          resources: ["backup-servers"]
          persistence_type: "volume"

# Recovery strategy for primary restoration
recovery:
  description: "Restore primary provider when recovered"
  phases:
    - name: "detect-primary-recovery"
      operation: "health-check"
      target: "primary-provider"
      success_criteria: "3 consecutive successful checks"

    - name: "resync-data"
      operation: "database-resync"
      direction: "backup-to-primary"
      timeout: 3600

    - name: "failback"
      operation: "switch-traffic"
      target: "primary-provider"
      verification: "100% traffic restored"

# Notifications
notifications:
  on_failover_start: "pagerduty:critical"
  on_failover_complete: "slack:#ops"
  on_failover_failed: ["pagerduty:critical", "email:cto@example.com"]
  on_recovery_start: "slack:#ops"
  on_recovery_complete: "slack:#ops"
```

### Failover Timeline

```text
Time      Event
────────────────────────────────────────────────────
00:00     Health check detects failure (3 consecutive failures)
00:01     Alert sent to ops team
00:02     Backup infrastructure scaled to 3 servers
00:05     Database replica promoted to master
00:10     DNS updated (TTL=60s, propagation ~2 minutes)
00:12     Load balancer reconfigured
00:15     Traffic verified flowing through backup
00:20     Backup scaled to full production capacity (6 servers)
00:25     Fully operational on backup provider

Total RTO: 25 minutes (including DNS propagation)
Data loss (RPO): < 5 minutes (database replication lag)
```

## Workflow 3: Cost Optimization Workload Migration

Use Case: Migrate running workloads to a cheaper provider (DigitalOcean to Hetzner) to reduce cost.

Workflow Characteristics:
- Parallel deployment on target provider
- Gradual traffic migration
- Rollback capability
- Cost tracking

### Workflow Definition

```yaml
# file: workflows/cost-optimization-migration.yml

name: cost-optimization-migration
version: "1.0"
description: "Migrate workload from DigitalOcean to Hetzner for cost savings"

parameters:
  source_provider: "digitalocean"
  target_provider: "hetzner"
  migration_speed: "gradual"  # or "aggressive"
  traffic_split: [10, 25, 50, 75, 100]  # Gradual percentages

phases:
  # Phase 1: Create target infrastructure
  - name: "create-target-infrastructure"
    description: "Deploy identical workload on Hetzner"

    operations:
      - id: "provision-servers"
        action: "create-server"
        config:
          provider: "{{ target_provider }}"
          name: "migration-app"
          server_type: "cpx21"  # Better price/performance than DO
          count: 3

        timeout: 300

  # Phase 2: Verify target is ready
  - name: "verify-target"
    description: "Health checks on target infrastructure"
    depends_on: ["create-target-infrastructure"]

    operations:
      - id: "health-check"
        action: "run-health-check"
        config:
          provider: "{{ target_provider }}"
          endpoint: "/health"

        timeout: 300

  # Phase 3: Gradual traffic migration
  - name: "migrate-traffic"
    description: "Gradually shift traffic to target provider"
    depends_on: ["verify-target"]

    operations:
      - id: "set-traffic-10"
        action: "set-traffic-split"
        config:
          source: "{{ source_provider }}"
          target: "{{ target_provider }}"
          percentage: 10
          duration: 300

      - id: "verify-10"
        action: "verify-traffic-flow"
        config:
          target_percentage: 10
          error_rate_threshold: 0.1

      - id: "set-traffic-25"
        action: "set-traffic-split"
        config:
          percentage: 25
          duration: 600

      - id: "set-traffic-50"
        action: "set-traffic-split"
        config:
          percentage: 50
          duration: 900

      - id: "set-traffic-75"
        action: "set-traffic-split"
        config:
          percentage: 75
          duration: 900

      - id: "set-traffic-100"
        action: "set-traffic-split"
        config:
          percentage: 100
          duration: 600

  # Phase 4: Cleanup source
  - name: "cleanup-source"
    description: "Remove old infrastructure from source provider"
    depends_on: ["migrate-traffic"]

    operations:
      - id: "verify-final"
        action: "run-health-check"
        config:
          provider: "{{ target_provider }}"
          duration: 3600  # Monitor for 1 hour

      - id: "decommission-source"
        action: "delete-resources"
        config:
          provider: "{{ source_provider }}"
          resources: ["droplets", "load-balancer"]
          preserve_backups: true

# Cost tracking
cost_tracking:
  before:
    provider: "{{ source_provider }}"
    estimated_monthly: "$72"

  after:
    provider: "{{ target_provider }}"
    estimated_monthly: "$42"

  savings:
    monthly: "$30"
    annual: "$360"
    percentage: "42%"
```

## Workflow 4: Multi-Region Database Replication

Use Case: Set up database replication across multiple providers and regions for disaster recovery.

Workflow Characteristics:
- Create primary database
- Set up read replicas in other providers
- Configure replication
- Monitor lag

### Workflow Definition

```yaml
# file: workflows/multi-region-replication.yml

name: multi-region-replication
version: "1.0"
description: "Setup database replication across providers"

phases:
  # Primary database
  - name: "create-primary"
    provider: "aws"
    operations:
      - id: "create-rds"
        action: "create-db-instance"
        config:
          identifier: "app-db-primary"
          engine: "postgres"
          instance_class: "db.t3.medium"
          region: "us-east-1"

  # Secondary replica
  - name: "create-secondary-replica"
    depends_on: ["create-primary"]
    provider: "aws"
    operations:
      - id: "create-replica"
        action: "create-read-replica"
        config:
          source: "app-db-primary"
          region: "eu-west-1"
          identifier: "app-db-secondary"

  # Tertiary replica in different provider
  - name: "create-tertiary-replica"
    depends_on: ["create-primary"]
    operations:
      - id: "setup-replication"
        action: "setup-external-replication"
        config:
          source_provider: "aws"
          source_db: "app-db-primary"
          target_provider: "hetzner"
          replication_slot: "hetzner_replica"
          replication_type: "logical"

  # Monitor replication
  - name: "monitor-replication"
    depends_on: ["create-tertiary-replica"]
    operations:
      - id: "check-lag"
        action: "monitor-replication-lag"
        config:
          replicas:
            - name: "secondary"
              warning_threshold: 300
              critical_threshold: 600
            - name: "tertiary"
              warning_threshold: 1000
              critical_threshold: 2000
          interval: 60
```

## Best Practices

### 1. Workflow Design

- Define Clear Dependencies: Explicitly state which steps must finish before others start
- Use Idempotent Operations: Workflows should be safe to re-run
- Set Realistic Timeouts: Account for cloud provider delays
- Plan for Failures: Define rollback strategies
- Test Workflows: Run in staging before production

### 2. Orchestration

- Parallel Execution: Run independent phases in parallel for speed
- Checkpoints: Add verification at each phase
- Progressive Deployment: Use gradual traffic shifting
- Monitoring Integration: Track metrics during the workflow
- Notifications: Alert the team at key points

### 3. Cost Management

- Calculate ROI: Track cost savings from optimizations
- Monitor Resource Usage: Watch for over-provisioning
- Implement Cleanup: Remove old resources after migration
- Review Regularly: Reassess provider choices

## Troubleshooting

### Issue: Workflow Stuck in Phase

Diagnosis (replace workflow-id with the ID of the stuck workflow):

```bash
provisioning workflow status workflow-id --verbose
```

Solution:
- Increase the timeout if the operation is legitimately long-running
- Check provider logs for the actual status
- Manually intervene if necessary
- Use `--skip-phase` to skip the problematic phase

### Issue: Rollback Failed

Diagnosis:

```bash
provisioning workflow rollback workflow-id --dry-run
```

Solution:
- Review which resources were created
- Manually delete resources if needed
- Fix the root cause of the failure
- Re-run the workflow

### Issue: Data Inconsistency After Failover

Diagnosis:

```bash
provisioning database verify-consistency
```

Solution:
- Check replication lag before failover
- Manually resync if necessary
- Use a backup to restore consistency
- Run validation queries

## Summary

Batch workflows enable complex multi-provider orchestration with:

- Coordinated deployment across providers
- Automated failover and recovery
- Gradual workload migration
- Cost optimization
- Disaster recovery

Start with simple workflows and gradually add complexity as you gain confidence.
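
When starting with a simple workflow, it can help to check by hand which phases are allowed to run in parallel. The sketch below is illustrative only and is not the batch workflow system's scheduler: `execution_waves` and `WORKFLOW_1` are hypothetical names, and the dependency map follows the Workflow 1 execution-flow diagram, where the web tier waits for both backup storage and the database. It groups phases into waves so that every phase in a wave has all of its `depends_on` entries satisfied by earlier waves.

```python
def execution_waves(phases):
    """Group phases into ordered waves of mutually independent phases.

    `phases` maps a phase name to the list of phase names it depends on.
    Raises ValueError on unknown dependencies or dependency cycles.
    """
    remaining = {name: set(deps) for name, deps in phases.items()}

    # Reject dependencies that point at phases that were never declared.
    declared = set(remaining)
    unknown = {dep for deps in remaining.values() for dep in deps} - declared
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")

    done = set()
    waves = []
    while remaining:
        # A phase is ready when all of its dependencies have completed.
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Hypothetical dependency map mirroring the Workflow 1 execution-flow diagram.
WORKFLOW_1 = {
    "provision-backup-storage": [],
    "provision-database": [],
    "provision-web-tier": ["provision-backup-storage", "provision-database"],
    "configure-networking": ["provision-web-tier"],
    "verify-deployment": ["configure-networking"],
}

if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(WORKFLOW_1), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Under these assumptions the two provisioning phases with no dependencies land in the first wave together, and each later phase gets a wave of its own. A cycle or a misspelled phase name in `depends_on` surfaces as a `ValueError` before anything is provisioned.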