# VAPORA Architecture
## Multi-Agent Multi-IA Cloud-Native Platform

**Status**: Production Ready (v1.2.0)
**Date**: January 2026

---

## 📊 Executive Summary

**VAPORA** is a **cloud-native platform for multi-agent software development**:
- ✅ **12 specialized agents** working in parallel (Architect, Developer, Reviewer, Tester, Documenter, etc.)
- ✅ **Multi-IA routing** (Claude, OpenAI, Gemini, Ollama) optimized per task
- ✅ **Full-stack Rust** (Backend, Frontend, Agents, Infrastructure)
- ✅ **Kubernetes-native** deployment via Provisioning
- ✅ **Self-hosted** - no SaaS dependencies
- ✅ **Cedar-based RBAC** for teams and access control
- ✅ **NATS JetStream** for inter-agent coordination
- ✅ **Learning-based agent selection** with task-type expertise
- ✅ **Budget-enforced LLM routing** with automatic fallback
- ✅ **Knowledge Graph** for execution history and learning curves

---

## 🏗️ 4-Layer Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                           Frontend Layer                            │
│              Leptos CSR (WASM) + UnoCSS Glassmorphism               │
│                                                                     │
│       Kanban Board │ Projects │ Agents Marketplace │ Settings       │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                     Istio Ingress (mTLS)
                               │
┌──────────────────────────────┴──────────────────────────────────────┐
│                              API Layer                              │
│               Axum REST API + WebSocket (Async Rust)                │
│                                                                     │
│          /tasks │ /agents │ /workflows │ /auth │ /projects          │
│              Rate Limiting │ Auth (JWT) │ Compression               │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
┌─────────▼────────┐  ┌────────▼────────┐  ┌────────▼─────────┐
│  Agent Service   │  │   LLM Router    │  │   MCP Gateway    │
│  Orchestration   │  │   (Multi-IA)    │  │ (Plugin System)  │
└─────────┬────────┘  └────────┬────────┘  └────────┬─────────┘
          │                    │                    │
          └────────────────────┼────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
     ┌────▼─────┐       ┌──────▼──────┐        ┌────▼──────┐
     │SurrealDB │       │NATS Jet     │        │RustyVault │
     │(MultiTen)│       │Stream (Jobs)│        │(Secrets)  │
     └──────────┘       └─────────────┘        └───────────┘
                               │
                     ┌─────────▼─────────┐
                     │   Observability   │
                     │ Prometheus/Grafana│
                     │ Loki/Tempo (Logs) │
                     └───────────────────┘
```

---

## 📋 Component Overview

### Frontend (Leptos WASM)

- **Kanban Board**: Drag-drop task management with real-time updates
- **Project Dashboard**: Project overview, metrics, team stats
- **Agent Marketplace**: Browse, install, configure agent plugins
- **Settings**: User preferences, workspace configuration

**Tech**: Leptos (reactive), UnoCSS (styling), WebSocket (real-time)

### API Layer (Axum)

- **REST Endpoints** (40+): Full CRUD for projects, tasks, agents, workflows
- **WebSocket API**: Real-time task updates, agent status changes
- **Authentication**: JWT tokens, refresh rotation
- **Rate Limiting**: Per-user/IP throttling
- **Compression**: gzip for bandwidth optimization

**Tech**: Axum (async), Tokio (runtime), Tower middleware

### Service Layer

**Agent Orchestration**:
- Agent registry with capability-based discovery
- Task assignment via SwarmCoordinator with load balancing
- Learning profiles for task-type expertise
- Health checking with automatic agent removal
- NATS JetStream integration for async coordination
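
The registry behaviour listed above (capability-based discovery plus removal of agents that fail health checks) can be pictured with a small in-memory sketch. This is an illustration under stated assumptions, not VAPORA's actual registry: `AgentRegistry`, `AgentEntry`, and all method names are hypothetical, and a real deployment would drive `report_health` from periodic probes.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical registry entry; field names are illustrative only.
#[derive(Debug, Clone)]
pub struct AgentEntry {
    pub capabilities: HashSet<String>,
    pub healthy: bool,
}

/// Hypothetical in-memory registry keyed by agent ID.
#[derive(Default)]
pub struct AgentRegistry {
    agents: HashMap<String, AgentEntry>,
}

impl AgentRegistry {
    /// Register an agent with the capabilities it advertises.
    pub fn register(&mut self, id: &str, capabilities: &[&str]) {
        let entry = AgentEntry {
            capabilities: capabilities.iter().map(|c| c.to_string()).collect(),
            healthy: true,
        };
        self.agents.insert(id.to_string(), entry);
    }

    /// Record the result of the latest health check for one agent.
    pub fn report_health(&mut self, id: &str, healthy: bool) {
        if let Some(agent) = self.agents.get_mut(id) {
            agent.healthy = healthy;
        }
    }

    /// Remove every agent whose last health check failed; returns the removed IDs, sorted.
    pub fn evict_unhealthy(&mut self) -> Vec<String> {
        let mut removed: Vec<String> = self
            .agents
            .iter()
            .filter(|(_, agent)| !agent.healthy)
            .map(|(id, _)| id.clone())
            .collect();
        removed.sort();
        for id in &removed {
            self.agents.remove(id);
        }
        removed
    }

    /// Capability-based discovery: IDs of agents advertising `capability`, sorted.
    pub fn discover(&self, capability: &str) -> Vec<String> {
        let mut ids: Vec<String> = self
            .agents
            .iter()
            .filter(|(_, agent)| agent.capabilities.contains(capability))
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }
}
```

Sorting the returned IDs keeps results deterministic despite `HashMap` iteration order, which makes the sketch easier to test.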

**LLM Router** (Multi-Provider):
- Claude (Opus, Sonnet, Haiku)
- OpenAI (GPT-4, GPT-4o)
- Google Gemini (2.0 Pro, Flash)
- Ollama (Local open-source models)

**Provider Selection Strategy**:
- Rules-based routing by task complexity/type
- Learning-based selection by agent expertise
- Budget-aware routing with automatic fallback
- Cost efficiency ranking (quality/cost ratio)
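
As a rough illustration of the quality/cost ranking mentioned in the last bullet, the sketch below sorts providers by a quality-per-cost ratio. The `ProviderStats` type, its fields, and the 0.01 cost floor are assumptions made for this example; the source does not say how quality is measured or how free local models are handled.

```rust
/// Hypothetical per-provider statistics; the numbers fed in are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderStats {
    pub name: &'static str,
    /// Assumed quality score in the range 0.0..=1.0.
    pub quality: f64,
    /// Assumed cost in cents per 1k tokens (0.0 for a local model).
    pub cost_cents_per_1k: f64,
}

/// Quality/cost ratio. The floor of 0.01 is an assumption that avoids
/// dividing by zero for free local providers.
pub fn efficiency(p: &ProviderStats) -> f64 {
    p.quality / p.cost_cents_per_1k.max(0.01)
}

/// Returns the providers ordered from most to least cost-efficient.
pub fn rank_by_efficiency(mut providers: Vec<ProviderStats>) -> Vec<ProviderStats> {
    providers.sort_by(|a, b| efficiency(b).total_cmp(&efficiency(a)));
    providers
}
```

`f64::total_cmp` gives a total ordering, so the sort does not need to special-case NaN values.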

**MCP Gateway**:
- Plugin protocol for external tools
- Code analysis, RAG, GitHub, Jira integrations
- Tool calling and resource management

### Data Layer

**SurrealDB**:
- Multi-tenant scopes for workspace isolation
- Nested tables for relational data
- Full-text search for task/doc indexing
- Versioning for audit trails

**NATS JetStream**:
- Reliable message queue for agent jobs
- Consumer groups for load balancing
- At-least-once delivery guarantee

**RustyVault**:
- API key storage (OpenAI, Anthropic, Google)
- Encryption at rest
- Audit logging

---

## 🔄 Data Flow: Task Execution

```
1. User creates task in Kanban → API POST /tasks
2. Backend validates and persists to SurrealDB
3. Task published to NATS subject: tasks.{type}.{priority}
4. SwarmCoordinator subscribes, selects best agent:
   - Learning profile lookup (task-type expertise)
   - Load balancing (success_rate / (1 + load))
   - Scoring: 0.3*base_score + 0.5*expertise_score + 0.2*confidence
5. Agent receives job, calls LLMRouter.select_provider():
   - Check budget status (monthly/weekly limits)
   - If budget exceeded: fallback to cheap provider (Ollama/Gemini)
   - If near threshold: prefer cost-efficient provider
   - Otherwise: rule-based routing
6. LLM generates response
7. Agent processes result, stores execution in KG
8. Result persisted to SurrealDB
9. Learning profiles updated (background sync, 30s interval)
10. Budget tracker updated
11. WebSocket pushes update to frontend
12. Kanban board updates in real-time
```
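
Step 3 publishes each task to a subject of the form `tasks.{type}.{priority}`. The sketch below shows one way such a subject could be built and validated. The `Priority` variants and the `task_subject` helper are hypothetical; the only NATS facts relied on are that `.` separates subject tokens and that `*` and `>` are wildcards, so none of them may appear inside a token.

```rust
/// Hypothetical priority levels; the source does not list the real ones.
#[derive(Debug, Clone, Copy)]
pub enum Priority {
    Low,
    Normal,
    High,
}

impl Priority {
    fn as_token(self) -> &'static str {
        match self {
            Priority::Low => "low",
            Priority::Normal => "normal",
            Priority::High => "high",
        }
    }
}

/// Builds `tasks.{type}.{priority}`.
/// The task type is restricted to ASCII alphanumerics, `_` and `-`, which
/// keeps separators (`.`), wildcards (`*`, `>`) and whitespace out of the token.
pub fn task_subject(task_type: &str, priority: Priority) -> Result<String, String> {
    let valid = !task_type.is_empty()
        && task_type
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    if !valid {
        return Err(format!("invalid task type token: {task_type:?}"));
    }
    Ok(format!(
        "tasks.{}.{}",
        task_type.to_ascii_lowercase(),
        priority.as_token()
    ))
}
```

With subjects shaped like this, a coordinator could subscribe to `tasks.>` to receive every task, or to `tasks.review.*` to receive review tasks of any priority.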

---

## 🔐 Security & Multi-Tenancy

**Tenant Isolation**:
- SurrealDB scopes: `workspace:123`, `team:456`
- Row-level filtering in all queries
- No cross-tenant data leakage

**Authentication**:
- JWT tokens (HS256)
- Token TTL: 15 minutes
- Refresh token rotation (7 days)
- HTTPS/mTLS enforced

**Authorization** (Cedar Policy Engine):
- Fine-grained RBAC per workspace
- Roles: Owner, Admin, Member, Viewer
- Resource-scoped permissions: create_task, edit_workflow, etc.

**Audit Logging**:
- All significant actions logged: task creation, agent assignment, provider selection
- Timestamp, actor, action, resource, result
- Searchable in SurrealDB

---

## 🚀 Learning & Cost Optimization

### Multi-Agent Learning (Phase 5.3)

**Learning Profiles**:
- Per-agent, per-task-type expertise tracking
- Success rate calculation with recency bias (7-day window, 3× weight)
- Confidence scoring to prevent overfitting
- Learning curves for trend analysis
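
One possible reading of "7-day window, 3× weight" is a step weighting: executions from the last seven days count three times as much as older ones. The sketch below implements that reading only. The `Execution` type and function name are hypothetical, and the document also describes the recency bias as an exponential decay, so the real weighting curve may be smoother than this step function.

```rust
/// One finished execution; `age_days` is how long ago it completed.
/// Hypothetical type for illustration.
pub struct Execution {
    pub age_days: f64,
    pub success: bool,
}

pub const RECENT_WINDOW_DAYS: f64 = 7.0;
pub const RECENT_WEIGHT: f64 = 3.0;

/// Success rate in which executions inside the recent window carry 3x weight.
/// Returns `None` when there is no history to learn from.
pub fn weighted_success_rate(history: &[Execution]) -> Option<f64> {
    let mut weighted_successes = 0.0;
    let mut total_weight = 0.0;
    for e in history {
        let weight = if e.age_days <= RECENT_WINDOW_DAYS {
            RECENT_WEIGHT
        } else {
            1.0
        };
        total_weight += weight;
        if e.success {
            weighted_successes += weight;
        }
    }
    if total_weight == 0.0 {
        None
    } else {
        Some(weighted_successes / total_weight)
    }
}
```

For example, two old failures and two recent successes give (3 + 3) / (1 + 1 + 3 + 3) = 0.75, compared with an unweighted rate of 0.5.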

**Agent Scoring Formula**:
```
final_score = 0.3*base_score + 0.5*expertise_score + 0.2*confidence
```
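
The formula translates directly into code. In the sketch below, the `Candidate` type and `select_best` helper are hypothetical, and all three inputs are assumed to be normalized to the range 0.0 to 1.0; because the weights sum to 1.0, the final score then stays in that range as well.

```rust
pub const W_BASE: f64 = 0.3;
pub const W_EXPERTISE: f64 = 0.5;
pub const W_CONFIDENCE: f64 = 0.2;

/// Hypothetical candidate record; inputs are assumed normalized to 0.0..=1.0.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub id: &'static str,
    pub base_score: f64,
    pub expertise_score: f64,
    pub confidence: f64,
}

/// final_score = 0.3*base_score + 0.5*expertise_score + 0.2*confidence
pub fn final_score(c: &Candidate) -> f64 {
    W_BASE * c.base_score + W_EXPERTISE * c.expertise_score + W_CONFIDENCE * c.confidence
}

/// Picks the candidate with the highest final score, if any.
pub fn select_best(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .max_by(|a, b| final_score(a).total_cmp(&final_score(b)))
}
```

A well-tested agent with scores (0.5, 0.8, 1.0) reaches 0.75, while a newcomer with (0.6, 0.9, 0.1) reaches 0.65, so the low confidence term keeps the newcomer from winning on expertise alone.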

### Cost Optimization (Phase 5.4)

**Budget Enforcement**:
- Per-role budget limits (monthly/weekly in cents)
- Three-tier policy:
  1. Normal: Rule-based routing
  2. Near-threshold (>80%): Prefer cheaper providers
  3. Budget exceeded: Automatic fallback to cheapest provider
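
The three tiers can be expressed as a small classification function. This is a sketch under assumptions: the names are hypothetical, spending exactly at the limit is treated as exceeded, a missing limit (`None`) is treated as "no budget configured", and when both weekly and monthly limits exist the stricter tier is assumed to win.

```rust
/// Budget tiers, ordered from least to most restrictive.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum BudgetTier {
    Normal,
    NearThreshold,
    Exceeded,
}

/// Classifies spending (in cents) against an optional limit (in cents).
/// Integer arithmetic avoids floating-point edge cases at exactly 80%.
pub fn classify(spent_cents: u64, limit_cents: Option<u64>) -> BudgetTier {
    let limit = match limit_cents {
        Some(limit) => limit,
        // Assumption: no configured budget means routing is unrestricted.
        None => return BudgetTier::Normal,
    };
    if spent_cents >= limit {
        return BudgetTier::Exceeded;
    }
    // "Near threshold" means strictly more than 80% of the limit is spent.
    if (spent_cents as u128) * 100 > (limit as u128) * 80 {
        BudgetTier::NearThreshold
    } else {
        BudgetTier::Normal
    }
}

/// Assumption: the stricter of the weekly and monthly tiers applies.
pub fn effective_tier(weekly: BudgetTier, monthly: BudgetTier) -> BudgetTier {
    weekly.max(monthly)
}
```

Deriving `Ord` on the enum lets `effective_tier` rely on the declaration order of the variants.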

**Provider Fallback Chain** (cost-ordered):
1. Ollama (free local)
2. Gemini (cheap cloud)
3. OpenAI (mid-tier)
4. Claude (premium)
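
Walking the chain amounts to taking the first provider, in cost order, that is currently usable. In this sketch the `Provider` enum and `first_available` helper are hypothetical, and the availability check is passed in as a closure because the source does not describe how provider health is determined.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Ollama,
    Gemini,
    OpenAi,
    Claude,
}

/// Fallback order from cheapest to most expensive, as listed above.
pub const FALLBACK_CHAIN: [Provider; 4] = [
    Provider::Ollama,
    Provider::Gemini,
    Provider::OpenAi,
    Provider::Claude,
];

/// Returns the cheapest provider for which `is_available` returns true.
pub fn first_available(is_available: impl Fn(Provider) -> bool) -> Option<Provider> {
    FALLBACK_CHAIN.iter().copied().find(|p| is_available(*p))
}
```

If the local Ollama instance is down, `first_available(|p| p != Provider::Ollama)` yields `Some(Provider::Gemini)`; if nothing is reachable the result is `None` and the caller has to decide how to fail.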

**Cost Tracking**:
- Per-provider costs
- Per-task-type costs
- Real-time budget utilization
- Prometheus metrics: `vapora_llm_budget_utilization{role}`
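
To make the metric concrete, the sketch below renders one sample of `vapora_llm_budget_utilization` in the Prometheus text exposition format, with `role` as the label. Treating the value as a 0.0 to 1.0 ratio, formatting it to three decimals, and reporting 0.0 when no limit is set are all assumptions; a real service would normally use a metrics library rather than formatting strings by hand.

```rust
/// Escapes a label value per the Prometheus text format
/// (backslash, double quote, and newline must be escaped).
fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Renders one sample line such as:
/// vapora_llm_budget_utilization{role="developer"} 0.450
pub fn render_budget_gauge(role: &str, spent_cents: u64, limit_cents: u64) -> String {
    // Assumption: with no limit configured (0), utilization is reported as 0.0.
    let utilization = if limit_cents == 0 {
        0.0
    } else {
        spent_cents as f64 / limit_cents as f64
    };
    format!(
        "vapora_llm_budget_utilization{{role=\"{}\"}} {:.3}",
        escape_label_value(role),
        utilization
    )
}
```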

---

## 📊 Monitoring & Observability

**Prometheus Metrics**:
- HTTP request latencies (p50, p95, p99)
- Agent task execution times
- LLM token usage per provider
- Database query performance
- Budget utilization per role
- Fallback trigger rates

**Grafana Dashboards**:
- VAPORA Overview: Request rates, errors, latencies
- Agent Metrics: Job queue depth, execution times, token usage
- LLM Routing: Provider distribution, cost per role
- Istio Mesh: Traffic flows, mTLS status

**Structured Logging** (via tracing):
- JSON output in production
- Human-readable in development
- Searchable in Loki

---

## 🔄 Deployment

**Development**:
- `docker compose up` starts all services locally
- SurrealDB, NATS, Redis included
- Hot reload for backend changes

**Kubernetes**:
- Istio service mesh for mTLS and traffic management
- Horizontal Pod Autoscaling (HPA) for agents
- Rook Ceph for persistent storage
- Sealed secrets for credentials

**Provisioning** (Infrastructure as Code):
- Nickel KCL for declarative K8s manifests
- Taskservs for service definitions
- Workflows for multi-step deployments
- GitOps-friendly (version-controlled configs)

---

## 🎯 Key Design Patterns

### 1. Hierarchical Decision Making
- Level 1: Agent Selection (WHO) → Learning profiles
- Level 2: Provider Selection (HOW) → Budget manager

### 2. Graceful Degradation
- Works without budget config (learning still active)
- Fallback providers ensure task completion even when the budget is exhausted
- NATS optional (in-memory fallback available)

### 3. Recency Bias in Learning
- 7-day exponential decay prevents "permanent reputation"
- Allows agents to recover from bad periods
- Reflects current capability, not historical average

### 4. Confidence Weighting
- `min(1.0, executions/20)` prevents overfitting
- New agents won't be preferred on the strength of a lucky streak
- Balances exploration vs. exploitation
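
The confidence expression is small enough to write out directly. In this sketch the function and constant names are hypothetical; only the `min(1.0, executions/20)` expression comes from the text above.

```rust
/// Number of executions after which an agent's record is fully trusted.
pub const FULL_CONFIDENCE_EXECUTIONS: f64 = 20.0;

/// confidence = min(1.0, executions / 20)
pub fn confidence(executions: u32) -> f64 {
    (f64::from(executions) / FULL_CONFIDENCE_EXECUTIONS).min(1.0)
}
```

An agent with 3 executions gets a confidence of 0.15 even if all 3 succeeded, while any agent with 20 or more executions is capped at 1.0.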

---

## 📚 Related Documentation

- **[Agent Registry & Coordination](agent-registry-coordination.md)** — Agent orchestration patterns
- **[Multi-Agent Workflows](multi-agent-workflows.md)** — Workflow execution and coordination
- **[Multi-IA Router](multi-ia-router.md)** — Provider selection and routing
- **[Roles, Permissions & Profiles](roles-permissions-profiles.md)** — RBAC implementation
- **[Task, Agent & Doc Manager](task-agent-doc-manager.md)** — Task orchestration and docs sync

---

**Status**: ✅ Production Ready
**Version**: 1.2.0
**Last Updated**: January 2026