# VAPORA Provisioning Integration
Integration documentation for deploying VAPORA v1.0 using Provisioning.
## Overview
VAPORA can be deployed using **Provisioning**, a Rust-based infrastructure-as-code platform that manages Kubernetes clusters, services, and workflows.
The Provisioning workspace is located at: `provisioning/vapora-wrksp/` (relative to repository root)
## Provisioning Workspace Structure
```
provisioning/vapora-wrksp/
├── workspace.toml              # Master configuration
├── kcl/                        # Infrastructure schemas (KCL)
│   ├── cluster.k               # Cluster definition
│   ├── namespace.k             # Namespace configuration
│   ├── backend.k               # Backend deployment
│   ├── frontend.k              # Frontend deployment
│   └── agents.k                # Agent deployment
├── taskservs/                  # Service definitions (TOML)
│   ├── surrealdb.toml          # SurrealDB service
│   ├── nats.toml               # NATS service
│   ├── backend.toml            # Backend service
│   ├── frontend.toml           # Frontend service
│   └── agents.toml             # Agents service
└── workflows/                  # Batch operations (YAML)
    ├── deploy-full-stack.yaml
    ├── deploy-infra.yaml
    ├── deploy-services.yaml
    └── health-check.yaml
```
## Integration Points
### 1. Cluster Management
Provisioning creates and manages the Kubernetes cluster:
```bash
cd provisioning/vapora-wrksp
provisioning cluster create --config workspace.toml
```
This creates:
- K3s/RKE2 cluster
- Storage class (Rook Ceph or local-path)
- Ingress controller (nginx)
- Service mesh (optional Istio)
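Once the cluster is up, a quick sanity check with plain `kubectl` confirms the pieces above; this is a minimal sketch assuming your kubeconfig already points at the new cluster, and the namespaces shown are the upstream defaults for nginx ingress and Istio, which may differ in your setup:
```bash
# Nodes should be Ready and a default storage class should exist
kubectl get nodes -o wide
kubectl get storageclass

# Ingress controller and (optional) service mesh pods
kubectl get pods -n ingress-nginx
kubectl get pods -n istio-system   # only if Istio was enabled
```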
### 2. Service Deployment
Services are defined in `taskservs/` and deployed via workflows:
```bash
provisioning workflow run workflows/deploy-full-stack.yaml
```
This deploys all VAPORA components in order:
1. SurrealDB (StatefulSet)
2. NATS JetStream (Deployment)
3. Backend API (Deployment)
4. Frontend UI (Deployment)
5. Agents (Deployment)
6. MCP Server (Deployment)
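To follow the rollout while the workflow runs, standard `kubectl` works alongside the provisioning CLI. The resource names below are assumptions inferred from the taskserv examples later in this document and may not match what your workspace actually generates:
```bash
# Watch pods come up in the VAPORA namespace
kubectl get pods -n vapora --watch

# Confirm individual rollouts finished (resource names assumed, see note above)
kubectl rollout status statefulset/surrealdb -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout status deployment/vapora-frontend -n vapora
```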
### 3. Infrastructure as Code (KCL)
KCL schemas in `kcl/` define infrastructure resources:
**Example: `kcl/backend.k`**
```python
schema BackendDeployment:
    name: str = "vapora-backend"
    namespace: str = "vapora"
    replicas: int = 2
    image: str = "vapora/backend:latest"
    port: int = 8080
    env: {str:str} = {
        SURREALDB_URL = "http://surrealdb:8000"
        NATS_URL = "nats://nats:4222"
        JWT_SECRET = "${SECRET:jwt-secret}"
    }
```
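If the upstream KCL CLI is installed (see the KCL language guide linked under Documentation References), the schema file can be format-checked and compiled on its own. This is optional and independent of the provisioning tool; it only verifies that the schema itself is well-formed:
```bash
# Format and compile the schema with the upstream KCL CLI
kcl fmt kcl/backend.k
kcl run kcl/backend.k
```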
### 4. Taskserv Definitions
Taskservs define how services are deployed and managed:
**Example: `taskservs/backend.toml`**
```toml
[service]
name = "vapora-backend"
type = "deployment"
namespace = "vapora"

[deployment]
replicas = 2
image = "vapora/backend:latest"
port = 8080

[health]
liveness = "/health"
readiness = "/health"

[dependencies]
requires = ["surrealdb", "nats"]
```
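The `[health]` endpoints map to ordinary HTTP probes, so they can be exercised by hand from inside the cluster. This sketch assumes a `vapora-backend` Service exposing port 8080, as in the example above; adjust the name and port to your deployment:
```bash
# Hit the /health endpoint declared under [health] from a throwaway pod
kubectl run curl-probe --rm -it --restart=Never -n vapora \
  --image=curlimages/curl -- curl -sf http://vapora-backend:8080/health
```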
### 5. Workflows
Workflows orchestrate complex deployment tasks:
**Example: `workflows/deploy-full-stack.yaml`**
```yaml
name: deploy-full-stack
description: Deploy complete VAPORA stack
steps:
  - name: create-namespace
    taskserv: namespace
    action: create

  - name: deploy-database
    taskserv: surrealdb
    action: deploy
    wait: true

  - name: deploy-messaging
    taskserv: nats
    action: deploy
    wait: true

  - name: deploy-services
    parallel: true
    tasks:
      - taskserv: backend
      - taskserv: frontend
      - taskserv: agents
      - taskserv: mcp-server

  - name: health-check
    action: validate
```
## Provisioning vs. Vanilla K8s
| Aspect | Provisioning | Vanilla K8s |
|--------|-------------|-------------|
| Cluster Creation | Automated (RKE2/K3s) | Manual |
| Service Mesh | Optional Istio | Manual |
| Secrets | RustyVault integration | kubectl create secret |
| Workflows | Declarative YAML | Manual kubectl |
| Rollback | Built-in | Manual |
| Monitoring | Prometheus auto-configured | Manual |
## Advantages of Provisioning
1. **Unified Management**: Single tool for cluster, services, and workflows
2. **Type Safety**: KCL schemas provide compile-time validation
3. **Reproducibility**: Infrastructure and services defined as code
4. **Dependency Management**: Automatic service ordering
5. **Secret Management**: Integration with RustyVault
6. **Rollback**: Automatic rollback on failure
## Migration from Vanilla K8s
If you have an existing K8s deployment using `/kubernetes/` manifests:
1. **Import existing manifests**:

   ```bash
   provisioning import kubernetes/*.yaml --output kcl/
   ```

2. **Generate taskservs**:

   ```bash
   provisioning taskserv generate --from-kcl kcl/*.k
   ```

3. **Create workflow**:

   ```bash
   provisioning workflow create --interactive
   ```

4. **Deploy**:

   ```bash
   provisioning workflow run workflows/deploy-full-stack.yaml
   ```
## Deployment Workflow
### Using Provisioning (Recommended for Production)
```bash
# 1. Navigate to workspace
cd provisioning/vapora-wrksp

# 2. Validate configuration
provisioning validate --all

# 3. Create cluster
provisioning cluster create --config workspace.toml

# 4. Deploy infrastructure
provisioning workflow run workflows/deploy-infra.yaml

# 5. Deploy services
provisioning workflow run workflows/deploy-services.yaml

# 6. Health check
provisioning workflow run workflows/health-check.yaml

# 7. Monitor
provisioning health-check --all
```
### Using Vanilla K8s (Manual)
```bash
# Use vanilla K8s manifests (from repository root)
nu scripts/deploy-k8s.nu
```
## Validation
To validate Provisioning configuration without executing:
```bash
# From project root
nu scripts/validate-provisioning.nu
```
This checks:
- Workspace exists
- KCL schemas are valid
- Taskserv definitions exist
- Workflows are well-formed
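For a rough manual approximation of those checks, the same paths can be inspected directly; this is a hedged sketch using the locations from this document, not what `validate-provisioning.nu` actually runs:
```bash
# Workspace and master configuration present
test -f provisioning/vapora-wrksp/workspace.toml && echo "workspace.toml found"

# Schemas, taskservs, and workflows exist
ls provisioning/vapora-wrksp/kcl/*.k
ls provisioning/vapora-wrksp/taskservs/*.toml
ls provisioning/vapora-wrksp/workflows/*.yaml
```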
## Next Steps
1. **Review Configuration**:
   - Update `workspace.toml` with your cluster details
   - Modify KCL schemas for your environment
   - Adjust resource limits in taskservs
2. **Test Locally**:
   - Use K3s for local testing (see the sketch after this list)
   - Validate with the `--dry-run` flag
3. **Deploy to Production**:
   - Use RKE2 for the production cluster
   - Enable the Istio service mesh
   - Configure an external load balancer
4. **Monitor**:
   - Use the built-in Prometheus/Grafana stack
   - Configure alerting
   - Set up log aggregation
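For step 2 above, a minimal local test flow might look like the following. The K3s install command and kubeconfig path are the upstream defaults from the K3s project; the validation command reuses what this document already shows:
```bash
# Single-node K3s for local testing (upstream default install and kubeconfig path)
curl -sfL https://get.k3s.io | sh -
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Validate the workspace before deploying anything
cd provisioning/vapora-wrksp
provisioning validate --all
```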
## Troubleshooting
### Provisioning not installed
```bash
# Install Provisioning (Rust-based)
cargo install provisioning-cli
```
### Workspace validation fails
```bash
cd provisioning/vapora-wrksp
provisioning validate --verbose
```
### Deployment stuck
```bash
# Check workflow status
provisioning workflow status <workflow-id>
# View logs
provisioning logs --taskserv backend
# Rollback
provisioning rollback --to-version <version>
```
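If the provisioning CLI itself is unresponsive, standard `kubectl` gives the same picture. The `app=vapora-backend` label is an assumption; replace it with whatever labels your taskservs actually apply:
```bash
# Recent events in the namespace, newest last
kubectl get events -n vapora --sort-by=.lastTimestamp

# Why a pod is stuck (label selector assumed, see note above)
kubectl describe pod -l app=vapora-backend -n vapora

# Tail backend logs
kubectl logs deployment/vapora-backend -n vapora --tail=100
```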
## Documentation References
- **Provisioning Documentation**: See `provisioning/vapora-wrksp/README.md`
- **KCL Language Guide**: https://kcl-lang.io/docs/
- **Taskserv Specification**: `provisioning/vapora-wrksp/taskservs/README.md`
- **Workflow Syntax**: `provisioning/vapora-wrksp/workflows/README.md`
## Notes
- **IMPORTANT**: Provisioning integration is **validated** but not executed in this phase
- All configuration files exist and are valid
- Deployment using Provisioning is deferred for manual production deployment
- For immediate testing, use vanilla K8s deployment: `nu scripts/deploy-k8s.nu`
- Provisioning provides advanced features (service mesh, auto-scaling, rollback)
- Vanilla K8s deployment is simpler and requires less infrastructure
## Support
For issues related to:
- **VAPORA deployment**: Check `/kubernetes/README.md` and `DEPLOYMENT.md`
- **Provisioning workspace**: See `provisioning/vapora-wrksp/README.md`
- **Scripts**: Run `nu scripts/<script-name>.nu --help`