Vapora/provisioning/vapora-wrksp/workflows/disaster-recovery.yaml

apiVersion: provisioning.vapora.io/v1
kind: Workflow
metadata:
  name: disaster-recovery
  description: VAPORA disaster recovery and restoration from backups
spec:
  version: "0.2.0"
  namespace: vapora-system
  timeout: 3600s # 1 hour max
  inputs:
    - name: backup_source
      type: string
      required: true
      description: "Backup identifier or timestamp to restore from"
    - name: partial_restore
      type: boolean
      required: false
      default: false
      description: "Restore only specific components instead of the full system"
    - name: components_to_restore
      type: array
      required: false
      description: "List of components to restore: database, services, configuration, data"
    - name: verify_only
      type: boolean
      required: false
      default: false
      description: "Verify backup integrity without restoring"
  phases:
    # Phase 1: Pre-recovery assessment
    - name: "Assess Damage and Backup Status"
      description: "Evaluate current cluster state and available backups"
      retryable: true
      steps:
        - name: "Check cluster connectivity"
          command: "kubectl cluster-info"
          timeout: 30s
          continueOnError: true
        - name: "List available backups"
          command: |
            provisioning backup list --detailed \
              | grep -E "id|timestamp|size|status"
          timeout: 60s
        - name: "Verify backup integrity"
          command: |
            provisioning backup verify --id $BACKUP_SOURCE
          timeout: 300s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Estimate recovery time"
          command: |
            provisioning backup estimate-restore-time --id $BACKUP_SOURCE \
              | tee /tmp/restore_estimate.txt
          timeout: 60s
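    # Note: connectivity and backup checks run before any destructive step, so
    # a missing or corrupt backup fails the workflow here rather than
    # mid-restore. "Check cluster connectivity" sets continueOnError because
    # the cluster may be degraded or unreachable during a disaster.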
    # Phase 2: Drain system
    - name: "Prepare System for Recovery"
      description: "Stop services and prepare for restoration"
      retryable: false
      steps:
        - name: "Stop accepting new tasks"
          command: |
            kubectl patch configmap vapora-config \
              -n vapora-system \
              -p '{"data":{"recovery_mode":"true","accept_requests":"false"}}'
          timeout: 60s
          continueOnError: true
        - name: "Drain task queue"
          command: "provisioning agents drain --timeout 300s"
          timeout: 320s
          continueOnError: true
        - name: "Stop agent runtime"
          command: |
            kubectl scale deployment vapora-agents \
              -n vapora-system \
              --replicas 0
          timeout: 120s
          continueOnError: true
        - name: "Stop backend services"
          command: |
            kubectl scale deployment vapora-backend \
              -n vapora-system \
              --replicas 0
            kubectl scale deployment vapora-llm-router \
              -n vapora-system \
              --replicas 0
            kubectl scale deployment vapora-mcp-gateway \
              -n vapora-system \
              --replicas 0
          timeout: 180s
          continueOnError: true
        - name: "Stop frontend"
          command: |
            kubectl scale deployment vapora-frontend \
              -n vapora-system \
              --replicas 0
          timeout: 120s
          continueOnError: true
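    # Note: the task queue is drained before workloads are scaled to zero so
    # in-flight work can finish, and every step in this phase sets
    # continueOnError: true so a partially broken cluster cannot block the
    # restore itself.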
    # Phase 3: Restore database
    - name: "Restore Database"
      description: "Restore SurrealDB from backup"
      retryable: false
      steps:
        - name: "Scale down SurrealDB replicas"
          command: |
            kubectl scale statefulset surrealdb \
              -n vapora-system \
              --replicas 1
          timeout: 300s
        - name: "Wait for single replica"
          command: |
            kubectl wait --for=condition=Ready pod \
              -l app=surrealdb \
              -n vapora-system \
              --timeout=300s
          timeout: 320s
        - name: "Restore database data"
          command: |
            provisioning db restore \
              --database surrealdb \
              --backup $BACKUP_SOURCE \
              --wait-for-completion
          timeout: 1200s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Verify database integrity"
          command: |
            provisioning db verify --database surrealdb \
              --check-tables \
              --check-indexes \
              --check-constraints
          timeout: 300s
        - name: "Scale up SurrealDB replicas"
          command: |
            kubectl scale statefulset surrealdb \
              -n vapora-system \
              --replicas 3
          timeout: 300s
        - name: "Wait for all replicas"
          command: |
            kubectl wait --for=condition=Ready pod \
              -l app=surrealdb \
              -n vapora-system \
              --timeout=600s
          timeout: 620s
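    # Note: the restore runs against a single SurrealDB replica before scaling
    # back to 3, presumably so the remaining replicas resync from the restored
    # node instead of reintroducing stale state.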
    # Phase 4: Restore configuration and secrets
    - name: "Restore Configuration"
      description: "Restore ConfigMaps, Secrets, and application configuration"
      retryable: true
      steps:
        - name: "Restore ConfigMaps"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type ConfigMap \
              --namespace vapora-system
          timeout: 180s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Restore Secrets"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Secret \
              --namespace vapora-system
          timeout: 180s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Verify configuration"
          command: |
            kubectl get configmap -n vapora-system
            kubectl get secrets -n vapora-system
          timeout: 60s
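    # Note: ConfigMaps and Secrets are restored before any Deployment so that
    # pods created in the later phases start with the recovered configuration.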
    # Phase 5: Restore service configurations
    - name: "Restore Services"
      description: "Restore service deployments and Istio configurations"
      retryable: true
      steps:
        - name: "Restore backend deployment"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-backend
          timeout: 300s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Restore LLM Router"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-llm-router
          timeout: 300s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Restore MCP Gateway"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-mcp-gateway
          timeout: 300s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Restore frontend"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-frontend
          timeout: 300s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Restore Istio configuration"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Gateway,VirtualService \
              --namespace vapora-system
          timeout: 180s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
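    # Note: the backend, LLM Router, MCP Gateway, and frontend Deployments are
    # restored before the Istio Gateway/VirtualService objects, so routing only
    # comes back once its target Deployments exist.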
    # Phase 6: Restore agents
    - name: "Restore Agent Runtime"
      description: "Restore agent deployment and state"
      retryable: true
      steps:
        - name: "Restore agent deployment"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-agents
          timeout: 300s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Wait for agents to be ready"
          command: |
            kubectl wait --for=condition=Ready pod \
              -l app=vapora-agents \
              -n vapora-system \
              --timeout=600s
          timeout: 620s
        - name: "Verify agent communication"
          command: "provisioning agents health-check --nats nats://nats-0.vapora-system:4222"
          timeout: 120s
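    # Note: agents are restored only after the database, configuration, and
    # backend services, and the phase ends with a health check against NATS
    # (nats://nats-0.vapora-system:4222) to confirm the message bus is reachable.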
    # Phase 7: Post-recovery verification
    - name: "Verify Recovery"
      description: "Comprehensive verification of recovered system"
      retryable: false
      steps:
        - name: "Health check cluster"
          command: "provisioning health-check --cluster"
          timeout: 300s
        - name: "Health check all services"
          command: "provisioning health-check --services all --strict"
          timeout: 300s
        - name: "Test database connectivity"
          command: "provisioning db test-connection --database surrealdb"
          timeout: 120s
        - name: "Verify data consistency"
          command: |
            provisioning db verify --database surrealdb \
              --check-integrity \
              --sample-size 1000
          timeout: 300s
        - name: "Run smoke tests"
          command: |
            provisioning test smoke \
              --api http://vapora-backend.vapora-system:8080 \
              --frontend http://vapora-frontend.vapora-system:3000 \
              --timeout 600s
          timeout: 620s
        - name: "Test agent communication"
          command: |
            provisioning agents test \
              --send-test-message \
              --verify-delivery \
              --timeout 120s
          timeout: 140s
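    # Note: step timeouts throughout this workflow sit slightly above each
    # command's own --timeout (for example 620s vs. 600s here), giving the tool
    # a chance to exit and report on its own before the engine kills the step.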
    # Phase 8: Re-enable system
    - name: "Resume Operations"
      description: "Re-enable system for normal operation"
      retryable: false
      steps:
        - name: "Disable recovery mode"
          command: |
            kubectl patch configmap vapora-config \
              -n vapora-system \
              -p '{"data":{"recovery_mode":"false","accept_requests":"true"}}'
          timeout: 60s
        - name: "Scale up services to previous state"
          command: |
            provisioning taskserv scale-to-previous \
              --namespace vapora-system
          timeout: 300s
        - name: "Resume agent work"
          command: "provisioning agents drain --disable"
          timeout: 60s
        - name: "Final health check"
          command: "provisioning health-check --cluster"
          timeout: 300s
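    # Note: the ConfigMap patch here reverses the recovery_mode/accept_requests
    # flags set in Phase 2, and scale-to-previous presumably relies on the
    # provisioning tool having recorded pre-recovery replica counts.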
    # Phase 9: Documentation and reporting
    - name: "Generate Recovery Report"
      description: "Document recovery operation"
      retryable: false
      steps:
        - name: "Create recovery report"
          command: |
            provisioning report generate \
              --type disaster-recovery \
              --backup-id $BACKUP_SOURCE \
              --output "recovery-report-$(date +%Y%m%d-%H%M%S).md"
          timeout: 120s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Create git commit for recovery"
          command: |
            git add -A
            git commit -m "Disaster recovery: Restored from backup $BACKUP_SOURCE at $(date)"
          timeout: 60s
          continueOnError: true
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Log recovery event"
          command: |
            kubectl logs -n vapora-system -l app=vapora-backend \
              | grep -E "recovery|restore|initialize" \
              | tail -20
          timeout: 60s
  outputs:
    - name: recovery_status
      command: "echo 'Disaster recovery completed'"
    - name: restored_services
      command: "kubectl get deployment -n vapora-system -o wide"
    - name: database_status
      command: "provisioning db status --database surrealdb"
  # Error handling
  onFailure:
    procedure:
      - name: "Gather diagnostic information"
        command: |
          provisioning debug collect \
            --output "debug-logs-$(date +%s).tar.gz"
      - name: "Alert operations team"
        command: "slack: #alerts"
  notifications:
    onStart:
      - "slack: #deployment"
      - "email: devops@example.com"
      - "severity: critical"
    onSuccess:
      - "slack: #deployment"
      - "slack: notify: Disaster recovery successful"
      - "email: devops@example.com"
    onFailure:
      - "slack: #deployment"
      - "slack: #alerts"
      - "email: devops@example.com"
      - "severity: critical"
      - "page: on-call"