Vapora/provisioning/vapora-wrksp/workflows/disaster-recovery.yaml
Jesús Pérez d14150da75 feat: Phase 5.3 - Multi-Agent Learning Infrastructure
Implement intelligent agent learning from Knowledge Graph execution history
with per-task-type expertise tracking, recency bias, and learning curves.

## Phase 5.3 Implementation

### Learning Infrastructure (Complete)
- LearningProfileService with per-task-type expertise metrics
- TaskTypeExpertise model tracking success_rate, confidence, learning curves
- Recency bias weighting: recent 7 days weighted 3x higher (exponential decay)
- Confidence scoring prevents overfitting: min(1.0, executions / 20)
- Learning curves computed from daily execution windows
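
The recency weighting and confidence cap above can be sketched as follows. Function names are illustrative assumptions, not the actual `vapora-agents` API; the commit describes both a "recent 7 days weighted 3x higher" rule and a "7-day half-life", and this sketch uses the half-life reading of the exponential decay:

```rust
/// Exponential decay with a 7-day half-life: a sample 7 days old
/// counts half as much as one from today. (Illustrative form.)
fn recency_weight(age_days: f64) -> f64 {
    0.5_f64.powf(age_days / 7.0)
}

/// Confidence grows linearly with sample count and saturates at 20
/// executions, so sparse data cannot dominate scoring:
/// min(1.0, executions / 20).
fn confidence(total_executions: u32) -> f64 {
    (total_executions as f64 / 20.0).min(1.0)
}

/// Recency-weighted success rate over (age_days, succeeded) samples.
fn recent_success_rate(samples: &[(f64, bool)]) -> f64 {
    let mut num = 0.0;
    let mut den = 0.0;
    for &(age, ok) in samples {
        let w = recency_weight(age);
        den += w;
        if ok {
            num += w;
        }
    }
    if den == 0.0 { 0.0 } else { num / den }
}

fn main() {
    // 10 of 20 executions seen: half confidence.
    println!("{:.2}", confidence(10)); // prints 0.50
    // A fresh success outweighs a two-week-old failure.
    println!("{:.2}", recent_success_rate(&[(0.0, true), (14.0, false)])); // prints 0.80
}
```

With these weights a 14-day-old sample carries 0.25 of a fresh sample's weight, which is how old failures fade without being discarded outright.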

### Agent Scoring Service (Complete)
- Unified AgentScore combining SwarmCoordinator + learning profiles
- Scoring formula: 0.3*base + 0.5*expertise + 0.2*confidence
- Rank agents by combined score for intelligent assignment
- Support for recency-biased scoring (recent_success_rate)
- Methods: rank_agents, select_best, rank_agents_with_recency
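
The stated 0.3/0.5/0.2 combination and ranking can be sketched as below; the tuple-based signature is a simplification of the real `AgentScoringService`, used here only to make the weighting concrete:

```rust
/// Mirrors the documented formula:
/// score = 0.3*base + 0.5*expertise + 0.2*confidence
fn combined_score(base: f64, expertise: f64, confidence: f64) -> f64 {
    0.3 * base + 0.5 * expertise + 0.2 * confidence
}

/// Rank candidates (id, base, expertise, confidence), best first.
fn rank_agents(cands: &[(&str, f64, f64, f64)]) -> Vec<(String, f64)> {
    let mut scored: Vec<(String, f64)> = cands
        .iter()
        .map(|&(id, b, e, c)| (id.to_string(), combined_score(b, e, c)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}

fn main() {
    let ranked = rank_agents(&[
        ("generalist", 0.9, 0.2, 0.1), // low load, little task-type history
        ("specialist", 0.5, 0.9, 1.0), // busier, but proven on this task type
    ]);
    // Expertise carries the largest weight (0.5), so the specialist wins
    // (0.80 vs 0.39) despite the higher load.
    println!("{}", ranked[0].0);
}
```

The 50% expertise weight is what makes the assignment "intelligent": a lightly loaded generalist only wins when no candidate has meaningful history for the task type.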

### KG Integration (Complete)
- KGPersistence::get_executions_for_task_type() - query by agent + task type
- KGPersistence::get_agent_executions() - all executions for agent
- Coordinator::load_learning_profile_from_kg() - core KG→Learning integration
- Coordinator::load_all_learning_profiles() - batch load for multiple agents
- Convert PersistedExecution → ExecutionData for learning calculations
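
The `PersistedExecution` → `ExecutionData` conversion might look like the following; these struct shapes are illustrative assumptions (the real fields live in `vapora-knowledge-graph` and `vapora-agents`), shown only to make the KG→learning hand-off concrete:

```rust
// Illustrative shapes, not the actual vapora definitions.
struct PersistedExecution {
    task_type: String,
    success: bool,
    timestamp_secs: u64, // unix seconds when the execution finished
}

struct ExecutionData {
    task_type: String,
    succeeded: bool,
    age_days: f64, // input to recency-bias weighting
}

/// Convert a KG record into the form the learning calculations expect,
/// turning the absolute timestamp into an age relative to `now_secs`.
fn to_execution_data(p: &PersistedExecution, now_secs: u64) -> ExecutionData {
    ExecutionData {
        task_type: p.task_type.clone(),
        succeeded: p.success,
        age_days: now_secs.saturating_sub(p.timestamp_secs) as f64 / 86_400.0,
    }
}

fn main() {
    let p = PersistedExecution {
        task_type: "testing".into(),
        success: true,
        timestamp_secs: 0,
    };
    let e = to_execution_data(&p, 86_400);
    println!("{} ok={} age={:.1}d", e.task_type, e.succeeded, e.age_days);
}
```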

### Agent Assignment Integration (Complete)
- AgentCoordinator uses learning profiles for task assignment
- extract_task_type() infers task type from title/description
- assign_task() scores candidates using AgentScoringService
- Fallback to load-based selection if no learning data available
- Learning profiles stored in coordinator.learning_profiles RwLock
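
A keyword-matching sketch in the spirit of `extract_task_type()`; the keyword table and fallback type are hypothetical, not the actual mapping used by `AgentCoordinator`:

```rust
/// Infer a task type from title + description by scanning for known
/// markers, falling back to a generic type when nothing matches.
fn extract_task_type(title: &str, description: &str) -> String {
    let text = format!("{} {}", title, description).to_lowercase();
    // Hypothetical keyword -> task-type table; first match wins.
    for (keyword, task_type) in [
        ("deploy", "deployment"),
        ("test", "testing"),
        ("refactor", "refactoring"),
        ("review", "code-review"),
    ] {
        if text.contains(keyword) {
            return task_type.to_string();
        }
    }
    "general".to_string()
}

fn main() {
    println!("{}", extract_task_type("Deploy v2", "roll out the backend")); // deployment
    println!("{}", extract_task_type("Fix docs", "update the readme"));     // general
}
```

The `"general"` fallback pairs with the load-based selection fallback above: tasks with no inferable type simply never benefit from task-type expertise.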

### Profile Adapter Enhancements (Complete)
- create_learning_profile() - initialize empty profiles
- add_task_type_expertise() - set task-type expertise
- update_profile_with_learning() - update swarm profiles from learning

## Files Modified

### vapora-knowledge-graph/src/persistence.rs (+30 lines)
- get_executions_for_task_type(agent_id, task_type, limit)
- get_agent_executions(agent_id, limit)

### vapora-agents/src/coordinator.rs (+100 lines)
- load_learning_profile_from_kg() - core KG integration method
- load_all_learning_profiles() - batch loading for agents
- assign_task() already uses learning-based scoring via AgentScoringService

### Existing Complete Implementation
- vapora-knowledge-graph/src/learning.rs - calculation functions
- vapora-agents/src/learning_profile.rs - data structures and expertise
- vapora-agents/src/scoring.rs - unified scoring service
- vapora-agents/src/profile_adapter.rs - adapter methods

## Tests Passing
- learning_profile: 7 tests
- scoring: 5 tests
- profile_adapter: 6 tests
- coordinator: learning-specific tests

## Data Flow
1. Task arrives → AgentCoordinator::assign_task()
2. Extract task_type from description
3. Query KG for task-type executions (load_learning_profile_from_kg)
4. Calculate expertise with recency bias
5. Score candidates (SwarmCoordinator + learning)
6. Assign to top-scored agent
7. Execution result → KG → Update learning profiles

## Key Design Decisions
- Recency bias: 7-day half-life with 3x weight for recent performance
- Confidence scoring: min(1.0, total_executions / 20) prevents overfitting
- Hierarchical scoring: 30% base load, 50% expertise, 20% confidence
- KG query limit: 100 recent executions per task-type for performance
- Async loading: load_learning_profile_from_kg supports concurrent loads

## Next: Phase 5.4 - Cost Optimization
Ready to implement budget enforcement and cost-aware provider selection.
2026-01-11 13:03:53 +00:00


apiVersion: provisioning.vapora.io/v1
kind: Workflow
metadata:
  name: disaster-recovery
  description: VAPORA disaster recovery and restoration from backups
spec:
  version: "0.2.0"
  namespace: vapora-system
  timeout: 3600s # 1 hour max
  inputs:
    - name: backup_source
      type: string
      required: true
      description: "Backup identifier or timestamp to restore from"
    - name: partial_restore
      type: boolean
      required: false
      default: false
      description: "Restore only specific components instead of full system"
    - name: components_to_restore
      type: array
      required: false
      description: "List of components to restore: database, services, configuration, data"
    - name: verify_only
      type: boolean
      required: false
      default: false
      description: "Verify backup integrity without restoring"
  phases:
    # Phase 1: Pre-recovery assessment
    - name: "Assess Damage and Backup Status"
      description: "Evaluate current cluster state and available backups"
      retryable: true
      steps:
        - name: "Check cluster connectivity"
          command: "kubectl cluster-info"
          timeout: 30s
          continueOnError: true
        - name: "List available backups"
          command: |
            provisioning backup list --detailed \
              | grep -E "id|timestamp|size|status"
          timeout: 60s
        - name: "Verify backup integrity"
          command: |
            provisioning backup verify --id $BACKUP_SOURCE
          timeout: 300s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Estimate recovery time"
          command: |
            provisioning backup estimate-restore-time --id $BACKUP_SOURCE \
              | tee /tmp/restore_estimate.txt
          timeout: 60s
    # Phase 2: Drain system
    - name: "Prepare System for Recovery"
      description: "Stop services and prepare for restoration"
      retryable: false
      steps:
        - name: "Stop accepting new tasks"
          command: |
            kubectl patch configmap vapora-config \
              -n vapora-system \
              -p '{"data":{"recovery_mode":"true","accept_requests":"false"}}'
          timeout: 60s
          continueOnError: true
        - name: "Drain task queue"
          command: "provisioning agents drain --timeout 300s"
          timeout: 320s
          continueOnError: true
        - name: "Stop agent runtime"
          command: |
            kubectl scale deployment vapora-agents \
              -n vapora-system \
              --replicas 0
          timeout: 120s
          continueOnError: true
        - name: "Stop backend services"
          command: |
            kubectl scale deployment vapora-backend \
              -n vapora-system \
              --replicas 0
            kubectl scale deployment vapora-llm-router \
              -n vapora-system \
              --replicas 0
            kubectl scale deployment vapora-mcp-gateway \
              -n vapora-system \
              --replicas 0
          timeout: 180s
          continueOnError: true
        - name: "Stop frontend"
          command: |
            kubectl scale deployment vapora-frontend \
              -n vapora-system \
              --replicas 0
          timeout: 120s
          continueOnError: true
    # Phase 3: Restore database
    - name: "Restore Database"
      description: "Restore SurrealDB from backup"
      retryable: false
      steps:
        - name: "Scale down SurrealDB replicas"
          command: |
            kubectl scale statefulset surrealdb \
              -n vapora-system \
              --replicas 1
          timeout: 300s
        - name: "Wait for single replica"
          command: |
            kubectl wait --for=condition=Ready pod \
              -l app=surrealdb \
              -n vapora-system \
              --timeout=300s
          timeout: 320s
        - name: "Restore database data"
          command: |
            provisioning db restore \
              --database surrealdb \
              --backup $BACKUP_SOURCE \
              --wait-for-completion
          timeout: 1200s
          env:
            - name: BACKUP_SOURCE
              value: "${backup_source}"
        - name: "Verify database integrity"
          command: |
            provisioning db verify --database surrealdb \
              --check-tables \
              --check-indexes \
              --check-constraints
          timeout: 300s
        - name: "Scale up SurrealDB replicas"
          command: |
            kubectl scale statefulset surrealdb \
              -n vapora-system \
              --replicas 3
          timeout: 300s
        - name: "Wait for all replicas"
          command: |
            kubectl wait --for=condition=Ready pod \
              -l app=surrealdb \
              -n vapora-system \
              --timeout=600s
          timeout: 620s
    # Phase 4: Restore configuration and secrets
    - name: "Restore Configuration"
      description: "Restore ConfigMaps, Secrets, and application configuration"
      retryable: true
      steps:
        - name: "Restore ConfigMaps"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type ConfigMap \
              --namespace vapora-system
          timeout: 180s
        - name: "Restore Secrets"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Secret \
              --namespace vapora-system
          timeout: 180s
        - name: "Verify configuration"
          command: |
            kubectl get configmap -n vapora-system
            kubectl get secrets -n vapora-system
          timeout: 60s
    # Phase 5: Restore service configurations
    - name: "Restore Services"
      description: "Restore service deployments and Istio configurations"
      retryable: true
      steps:
        - name: "Restore backend deployment"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-backend
          timeout: 300s
        - name: "Restore LLM Router"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-llm-router
          timeout: 300s
        - name: "Restore MCP Gateway"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-mcp-gateway
          timeout: 300s
        - name: "Restore frontend"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-frontend
          timeout: 300s
        - name: "Restore Istio configuration"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Gateway,VirtualService \
              --namespace vapora-system
          timeout: 180s
    # Phase 6: Restore agents
    - name: "Restore Agent Runtime"
      description: "Restore agent deployment and state"
      retryable: true
      steps:
        - name: "Restore agent deployment"
          command: |
            provisioning backup restore \
              --id $BACKUP_SOURCE \
              --resource-type Deployment \
              --name vapora-agents
          timeout: 300s
        - name: "Wait for agents to be ready"
          command: |
            kubectl wait --for=condition=Ready pod \
              -l app=vapora-agents \
              -n vapora-system \
              --timeout=600s
          timeout: 620s
        - name: "Verify agent communication"
          command: "provisioning agents health-check --nats nats://nats-0.vapora-system:4222"
          timeout: 120s
    # Phase 7: Post-recovery verification
    - name: "Verify Recovery"
      description: "Comprehensive verification of recovered system"
      retryable: false
      steps:
        - name: "Health check cluster"
          command: "provisioning health-check --cluster"
          timeout: 300s
        - name: "Health check all services"
          command: "provisioning health-check --services all --strict"
          timeout: 300s
        - name: "Test database connectivity"
          command: "provisioning db test-connection --database surrealdb"
          timeout: 120s
        - name: "Verify data consistency"
          command: |
            provisioning db verify --database surrealdb \
              --check-integrity \
              --sample-size 1000
          timeout: 300s
        - name: "Run smoke tests"
          command: |
            provisioning test smoke \
              --api http://vapora-backend.vapora-system:8080 \
              --frontend http://vapora-frontend.vapora-system:3000 \
              --timeout 600s
          timeout: 620s
        - name: "Test agent communication"
          command: |
            provisioning agents test \
              --send-test-message \
              --verify-delivery \
              --timeout 120s
          timeout: 140s
    # Phase 8: Re-enable system
    - name: "Resume Operations"
      description: "Re-enable system for normal operation"
      retryable: false
      steps:
        - name: "Disable recovery mode"
          command: |
            kubectl patch configmap vapora-config \
              -n vapora-system \
              -p '{"data":{"recovery_mode":"false","accept_requests":"true"}}'
          timeout: 60s
        - name: "Scale up services to previous state"
          command: |
            provisioning taskserv scale-to-previous \
              --namespace vapora-system
          timeout: 300s
        - name: "Resume agent work"
          command: "provisioning agents drain --disable"
          timeout: 60s
        - name: "Final health check"
          command: "provisioning health-check --cluster"
          timeout: 300s
    # Phase 9: Documentation and reporting
    - name: "Generate Recovery Report"
      description: "Document recovery operation"
      retryable: false
      steps:
        - name: "Create recovery report"
          command: |
            provisioning report generate \
              --type disaster-recovery \
              --backup-id $BACKUP_SOURCE \
              --output "recovery-report-$(date +%Y%m%d-%H%M%S).md"
          timeout: 120s
        - name: "Create git commit for recovery"
          command: |
            git add -A
            git commit -m "Disaster recovery: Restored from backup $BACKUP_SOURCE at $(date)"
          timeout: 60s
          continueOnError: true
        - name: "Log recovery event"
          command: |
            kubectl logs -n vapora-system -l app=vapora-backend \
              | grep -E "recovery|restore|initialize" \
              | tail -20
          timeout: 60s
  outputs:
    - name: recovery_status
      value: "echo 'Disaster recovery completed'"
    - name: restored_services
      command: "kubectl get deployment -n vapora-system -o wide"
    - name: database_status
      command: "provisioning db status --database surrealdb"
  # Error handling
  onFailure:
    procedure:
      - name: "Gather diagnostic information"
        command: |
          provisioning debug collect \
            --output "debug-logs-$(date +%s).tar.gz"
      - name: "Alert operations team"
        command: "slack: #alerts"
  notifications:
    onStart:
      - "slack: #deployment"
      - "email: devops@example.com"
      - "severity: critical"
    onSuccess:
      - "slack: #deployment"
      - "slack: notify: Disaster recovery successful"
      - "email: devops@example.com"
    onFailure:
      - "slack: #deployment"
      - "slack: #alerts"
      - "email: devops@example.com"
      - "severity: critical"
      - "page: on-call"