

# 🧠 Multi-IA Router
## Intelligent Routing across Multiple LLM Providers
**Version**: 0.1.0
**Status**: Specification (VAPORA v1.0 - Multi-Agent Multi-IA)
**Purpose**: Dynamic routing system that selects the optimal LLM for each context
---
## 🎯 Objective
**Problem**:
- Each task needs a different LLM (code ≠ embeddings ≠ review)
- Costs vary enormously (Ollama is free vs Claude Opus $$$)
- Availability varies (rate limits, latency)
- Automatic fallback is needed
**Solution**: An intelligent routing system that decides which LLM to use based on:
1. **Task context** (type, domain, complexity)
2. **Predefined rules** (static mappings)
3. **Dynamic decision** (availability, cost, load)
4. **Manual override** (user specifies the required LLM)
---
## 🏗️ Architecture
### Layer 1: LLM Providers (Trait Pattern)
```rust
use async_trait::async_trait;

#[derive(Debug, Clone)]
pub enum LLMProvider {
    Claude {
        api_key: String,
        model: String, // "opus-4", "sonnet-4", "haiku-3"
        max_tokens: usize,
    },
    OpenAI {
        api_key: String,
        model: String, // "gpt-4", "gpt-4-turbo", "gpt-3.5-turbo"
        max_tokens: usize,
    },
    Gemini {
        api_key: String,
        model: String, // "gemini-2.0-pro", "gemini-pro", "gemini-flash"
        max_tokens: usize,
    },
    Ollama {
        endpoint: String, // "http://localhost:11434"
        model: String,    // "llama3.2", "mistral", "neural-chat"
        max_tokens: usize,
    },
}

// The async_trait crate keeps the trait object-safe, so the router
// can hold providers as `Box<dyn LLMClient>`.
#[async_trait]
pub trait LLMClient: Send + Sync {
    async fn complete(
        &self,
        prompt: String,
        context: Option<String>,
    ) -> anyhow::Result<String>;

    async fn stream(
        &self,
        prompt: String,
    ) -> anyhow::Result<tokio::sync::mpsc::Receiver<String>>;

    fn cost_per_1k_tokens(&self) -> f64;
    fn latency_ms(&self) -> u32;
    fn available(&self) -> bool;
}
```
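To make the trait concrete, here is a minimal sketch of an Ollama-backed client. The `OllamaClient` struct, its hard-coded cost/latency values, and the context-prepending strategy are illustrative assumptions rather than VAPORA's actual implementation; the request payload follows Ollama's public `/api/generate` HTTP API.
```rust
use async_trait::async_trait;
use serde_json::json;

/// Hypothetical local-inference client; a sketch, not the real implementation.
pub struct OllamaClient {
    endpoint: String,
    model: String,
    http: reqwest::Client,
}

#[async_trait]
impl LLMClient for OllamaClient {
    async fn complete(
        &self,
        prompt: String,
        context: Option<String>,
    ) -> anyhow::Result<String> {
        // Assumption: retrieved context is simply prepended to the prompt.
        let full_prompt = match context {
            Some(ctx) => format!("{ctx}\n\n{prompt}"),
            None => prompt,
        };
        let resp: serde_json::Value = self
            .http
            .post(format!("{}/api/generate", self.endpoint))
            .json(&json!({ "model": self.model, "prompt": full_prompt, "stream": false }))
            .send()
            .await?
            .json()
            .await?;
        Ok(resp["response"].as_str().unwrap_or_default().to_string())
    }

    async fn stream(
        &self,
        prompt: String,
    ) -> anyhow::Result<tokio::sync::mpsc::Receiver<String>> {
        // Sketch only: deliver the full completion as a single chunk.
        let (tx, rx) = tokio::sync::mpsc::channel(16);
        let text = self.complete(prompt, None).await?;
        tx.send(text).await.ok();
        Ok(rx)
    }

    fn cost_per_1k_tokens(&self) -> f64 { 0.0 } // Local inference is free
    fn latency_ms(&self) -> u32 { 200 }          // Rough local estimate
    fn available(&self) -> bool { true }
}
```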
### Layer 2: Task Context Classifier
```rust
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum TaskType {
    // Code tasks
    CodeGeneration,
    CodeReview,
    CodeRefactor,
    UnitTest,
    IntegrationTest,
    // Analysis tasks
    ArchitectureDesign,
    SecurityAnalysis,
    PerformanceAnalysis,
    // Documentation
    DocumentGeneration,
    CodeDocumentation,
    APIDocumentation,
    // Search/RAG
    Embeddings,
    SemanticSearch,
    ContextRetrieval,
    // General
    GeneralQuery,
    Summarization,
    Translation,
}

#[derive(Debug, Clone)]
pub struct TaskContext {
    pub task_type: TaskType,
    pub domain: String,               // "backend", "frontend", "infra"
    pub complexity: Complexity,       // Low, Medium, High, Critical
    pub quality_requirement: Quality, // Low, Medium, High, Critical
    pub latency_required_ms: u32,     // 500 = response required in <500ms
    pub budget_cents: Option<u32>,    // Cost limit in cents per task
}

#[derive(Debug, Clone, PartialEq, PartialOrd)]
pub enum Complexity {
    Low,
    Medium,
    High,
    Critical,
}

#[derive(Debug, Clone, PartialEq, PartialOrd)]
pub enum Quality {
    Low,      // Quick & cheap
    Medium,   // Balanced
    High,     // Good quality
    Critical, // Best possible
}
```
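The spec does not fix a classification algorithm. As a sketch, a simple keyword heuristic could map a free-form task description onto a `TaskType`; the `classify_task` helper below is hypothetical.
```rust
/// Hypothetical keyword-based classifier. A production classifier could
/// use an LLM or learned model; this only illustrates the interface.
pub fn classify_task(description: &str) -> TaskType {
    let d = description.to_lowercase();
    if d.contains("review") {
        TaskType::CodeReview
    } else if d.contains("test") {
        TaskType::UnitTest
    } else if d.contains("document") || d.contains("docs") {
        TaskType::DocumentGeneration
    } else if d.contains("embed") || d.contains("index") {
        TaskType::Embeddings
    } else if d.contains("implement") || d.contains("code") {
        TaskType::CodeGeneration
    } else {
        TaskType::GeneralQuery
    }
}
```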
### Layer 3: Mapping Engine (Predefined Rules)
```rust
use std::sync::LazyLock;

pub struct IAMapping {
    pub task_type: TaskType,
    pub primary: LLMProvider,
    pub fallback_order: Vec<LLMProvider>,
    pub reasoning: String,
    pub cost_estimate_per_task: f64,
}

// `String` fields cannot be constructed in a plain `static`, so the
// default mappings are initialized lazily on first access.
pub static DEFAULT_MAPPINGS: LazyLock<Vec<IAMapping>> = LazyLock::new(|| {
    vec![
        // Embeddings → Ollama (local, free)
        IAMapping {
            task_type: TaskType::Embeddings,
            primary: LLMProvider::Ollama {
                endpoint: "http://localhost:11434".to_string(),
                model: "nomic-embed-text".to_string(),
                max_tokens: 8192,
            },
            fallback_order: vec![
                LLMProvider::OpenAI {
                    api_key: "".to_string(),
                    model: "text-embedding-3-small".to_string(),
                    max_tokens: 8192,
                },
            ],
            reasoning: "Local Ollama is free and fast for embeddings. Falls back to OpenAI if Ollama is unavailable".to_string(),
            cost_estimate_per_task: 0.0, // Free locally
        },
        // Code Generation → Claude Opus (highest quality)
        IAMapping {
            task_type: TaskType::CodeGeneration,
            primary: LLMProvider::Claude {
                api_key: "".to_string(),
                model: "opus-4".to_string(),
                max_tokens: 8000,
            },
            fallback_order: vec![
                LLMProvider::OpenAI {
                    api_key: "".to_string(),
                    model: "gpt-4".to_string(),
                    max_tokens: 8000,
                },
            ],
            reasoning: "Claude Opus is best for complex code. GPT-4 as fallback".to_string(),
            cost_estimate_per_task: 0.06, // ~6 cents per 1k tokens
        },
        // Code Review → Claude Sonnet (quality/cost balance)
        IAMapping {
            task_type: TaskType::CodeReview,
            primary: LLMProvider::Claude {
                api_key: "".to_string(),
                model: "sonnet-4".to_string(),
                max_tokens: 4000,
            },
            fallback_order: vec![
                LLMProvider::Gemini {
                    api_key: "".to_string(),
                    model: "gemini-pro".to_string(),
                    max_tokens: 4000,
                },
            ],
            reasoning: "Sonnet is the ideal balance. Gemini as fallback".to_string(),
            cost_estimate_per_task: 0.015,
        },
        // Documentation → GPT-4 (best formatting)
        IAMapping {
            task_type: TaskType::DocumentGeneration,
            primary: LLMProvider::OpenAI {
                api_key: "".to_string(),
                model: "gpt-4".to_string(),
                max_tokens: 4000,
            },
            fallback_order: vec![
                LLMProvider::Claude {
                    api_key: "".to_string(),
                    model: "sonnet-4".to_string(),
                    max_tokens: 4000,
                },
            ],
            reasoning: "GPT-4 formats docs best. Claude as fallback".to_string(),
            cost_estimate_per_task: 0.03,
        },
        // Quick Queries → Gemini Flash (speed)
        IAMapping {
            task_type: TaskType::GeneralQuery,
            primary: LLMProvider::Gemini {
                api_key: "".to_string(),
                model: "gemini-flash-2.0".to_string(),
                max_tokens: 1000,
            },
            fallback_order: vec![
                LLMProvider::Ollama {
                    endpoint: "http://localhost:11434".to_string(),
                    model: "llama3.2".to_string(),
                    max_tokens: 1000,
                },
            ],
            reasoning: "Gemini Flash is very fast. Ollama as fallback".to_string(),
            cost_estimate_per_task: 0.002,
        },
    ]
});
```
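Resolving a task type into its candidate list (primary first, then fallbacks in declared order) might look like the hypothetical helper below; the actual router keeps these lists in its `mappings` field.
```rust
// Sketch: build the ordered candidate list for a task type from the
// default mappings. Returns None for unmapped task types.
fn candidates_for(task_type: &TaskType) -> Option<Vec<LLMProvider>> {
    DEFAULT_MAPPINGS
        .iter()
        .find(|m| m.task_type == *task_type)
        .map(|m| {
            let mut list = vec![m.primary.clone()];
            list.extend(m.fallback_order.iter().cloned());
            list
        })
}
```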
### Layer 4: Routing Engine (Dynamic Decisions)
```rust
use std::collections::HashMap;

pub struct LLMRouter {
    pub mappings: HashMap<TaskType, Vec<LLMProvider>>,
    pub providers: HashMap<String, Box<dyn LLMClient>>,
    pub cost_tracker: CostTracker,
    pub rate_limiter: RateLimiter,
}

impl LLMRouter {
    /// Routing decision: hybrid (rules + dynamic + override)
    pub async fn route(
        &mut self,
        context: TaskContext,
        override_llm: Option<LLMProvider>,
    ) -> anyhow::Result<LLMProvider> {
        // 1. If there is a manual override, use it
        if let Some(llm) = override_llm {
            self.cost_tracker.log_usage(&llm, &context);
            return Ok(llm);
        }

        // 2. Fetch the predefined mappings (looks up self.mappings)
        let mut candidates = self.get_mapping(&context.task_type)?;

        // 3. Filter by availability (rate limits, latency)
        candidates = self.filter_by_availability(candidates).await?;

        // 4. Filter by budget if one is set
        if let Some(budget) = context.budget_cents {
            candidates = candidates
                .into_iter()
                // Rough per-task cost estimate (~10k tokens) vs budget
                .filter(|llm| llm.cost_per_1k_tokens() * 10.0 < budget as f64)
                .collect();
        }

        // 5. Select by quality/cost/latency balance
        let selected = self.select_optimal(candidates, &context)?;
        self.cost_tracker.log_usage(&selected, &context);
        Ok(selected)
    }

    async fn filter_by_availability(
        &self,
        candidates: Vec<LLMProvider>,
    ) -> anyhow::Result<Vec<LLMProvider>> {
        let mut available = Vec::new();
        for llm in &candidates {
            if self.rate_limiter.can_use(llm).await? {
                available.push(llm.clone());
            }
        }
        // If everything is rate-limited, keep the full candidate list
        // rather than failing outright.
        if available.is_empty() {
            Ok(candidates)
        } else {
            Ok(available)
        }
    }

    fn select_optimal(
        &self,
        candidates: Vec<LLMProvider>,
        context: &TaskContext,
    ) -> anyhow::Result<LLMProvider> {
        // Scoring: quality * 0.4 + cost * 0.3 + latency * 0.3
        let best = candidates.iter().max_by(|a, b| {
            let score_a = self.score_llm(a, context);
            let score_b = self.score_llm(b, context);
            score_a.partial_cmp(&score_b).unwrap()
        });
        Ok(best.ok_or_else(|| anyhow::anyhow!("No LLM available"))?.clone())
    }

    fn score_llm(&self, llm: &LLMProvider, context: &TaskContext) -> f64 {
        let quality_score = match context.quality_requirement {
            Quality::Critical => 1.0,
            Quality::High => 0.9,
            Quality::Medium => 0.7,
            Quality::Low => 0.5,
        };
        let cost = llm.cost_per_1k_tokens();
        let cost_score = 1.0 / (1.0 + cost); // Inverse: lower cost = higher score
        let latency = llm.latency_ms();
        let latency_score = 1.0 / (1.0 + latency as f64);
        quality_score * 0.4 + cost_score * 0.3 + latency_score * 0.3
    }
}
```
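As a worked example of `score_llm`, assume `Quality::Medium` (0.7), a cost of 0.015 per 1k tokens, and 800 ms latency (all illustrative numbers):
```rust
let quality_score = 0.7;                 // Quality::Medium
let cost_score = 1.0 / (1.0 + 0.015);    // ≈ 0.985
let latency_score = 1.0 / (1.0 + 800.0); // ≈ 0.00125
let score = 0.4 * quality_score + 0.3 * cost_score + 0.3 * latency_score;
assert!((score - 0.576).abs() < 1e-3);   // ≈ 0.576
```
Note that with latency expressed in raw milliseconds the latency term contributes almost nothing; normalizing it (e.g. against `latency_required_ms`) would give it the intended 30% weight.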
### Layer 5: Cost Tracking & Monitoring
```rust
use std::collections::HashMap;

pub struct CostTracker {
    pub tasks_completed: HashMap<TaskType, u32>,
    pub total_tokens_used: u64,
    pub total_cost_cents: u32,
    pub cost_by_provider: HashMap<String, u32>,
    pub cost_by_task_type: HashMap<TaskType, u32>,
}

impl CostTracker {
    pub fn log_usage(&mut self, llm: &LLMProvider, context: &TaskContext) {
        let provider_name = llm.provider_name();
        // Rough per-task estimate: cost per 1k tokens × ~10k tokens
        let cost = (llm.cost_per_1k_tokens() * 10.0) as u32;
        *self.cost_by_provider.entry(provider_name).or_insert(0) += cost;
        *self.cost_by_task_type.entry(context.task_type.clone()).or_insert(0) += cost;
        self.total_cost_cents += cost;
        *self.tasks_completed.entry(context.task_type.clone()).or_insert(0) += 1;
    }

    pub fn monthly_cost_estimate(&self) -> f64 {
        self.total_cost_cents as f64 / 100.0 // Convert cents to dollars
    }

    pub fn generate_report(&self) -> String {
        format!(
            "Cost Report:\n Total: ${:.2}\n By Provider: {:?}\n By Task: {:?}",
            self.monthly_cost_estimate(),
            self.cost_by_provider,
            self.cost_by_task_type
        )
    }
}
```
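`log_usage` relies on a `provider_name()` helper (and the routing code calls `cost_per_1k_tokens()` / `latency_ms()` directly on the enum) that this spec never defines on `LLMProvider`. A minimal sketch of the naming helper, assuming each variant maps to the provider key used in vapora.toml:
```rust
impl LLMProvider {
    /// Stable name used as a cost-tracking key (assumed helper,
    /// mirroring the provider names in the config below).
    pub fn provider_name(&self) -> String {
        match self {
            LLMProvider::Claude { .. } => "claude".to_string(),
            LLMProvider::OpenAI { .. } => "openai".to_string(),
            LLMProvider::Gemini { .. } => "gemini".to_string(),
            LLMProvider::Ollama { .. } => "ollama".to_string(),
        }
    }
}
```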
---
## 🔧 Routing: Three Modes
### Mode 1: Static Rules (Default)
```rust
// Automatic: uses DEFAULT_MAPPINGS
let mut router = LLMRouter::new();
let llm = router.route(
    TaskContext {
        task_type: TaskType::CodeGeneration,
        domain: "backend".to_string(),
        complexity: Complexity::High,
        quality_requirement: Quality::High,
        latency_required_ms: 5000,
        budget_cents: None,
    },
    None, // No override
).await?;
// Result: Claude Opus (predefined rule)
```
### Mode 2: Dynamic Decision (Smart)
```rust
// The router weighs availability, latency, and cost
let mut router = LLMRouter::with_tracking();
let llm = router.route(
    TaskContext {
        task_type: TaskType::CodeReview,
        domain: "frontend".to_string(),
        complexity: Complexity::Medium,
        quality_requirement: Quality::Medium,
        latency_required_ms: 2000,
        budget_cents: Some(20), // Max 20 cents per task
    },
    None,
).await?;
// The router picks Sonnet vs Gemini based on availability and budget
```
### Mode 3: Manual Override (Full Control)
```rust
// The user specifies exactly which LLM to use
let llm = router.route(
    context,
    Some(LLMProvider::Claude {
        api_key: "sk-...".to_string(),
        model: "opus-4".to_string(),
        max_tokens: 8000,
    }),
).await?;
// Uses exactly what was specified and records it in the cost tracker
```
---
## 📊 Configuration (vapora.toml)
```toml
[llm_router]
# Custom mappings (override DEFAULT_MAPPINGS)
[[llm_router.custom_mapping]]
task_type = "CodeGeneration"
primary_provider = "claude"
primary_model = "opus-4"
fallback_providers = ["openai:gpt-4"]
# Available providers
[[llm_router.providers]]
name = "claude"
api_key = "${ANTHROPIC_API_KEY}"
model_variants = ["opus-4", "sonnet-4", "haiku-3"]
rate_limit = { tokens_per_minute = 1000000 }
[[llm_router.providers]]
name = "openai"
api_key = "${OPENAI_API_KEY}"
model_variants = ["gpt-4", "gpt-4-turbo"]
rate_limit = { tokens_per_minute = 500000 }
[[llm_router.providers]]
name = "gemini"
api_key = "${GEMINI_API_KEY}"
model_variants = ["gemini-pro", "gemini-flash-2.0"]
[[llm_router.providers]]
name = "ollama"
endpoint = "http://localhost:11434"
model_variants = ["llama3.2", "mistral", "neural-chat"]
rate_limit = { tokens_per_minute = 10000000 } # Local, no real limits
# Cost tracking
[llm_router.cost_tracking]
enabled = true
warn_when_exceeds_cents = 1000 # Warn if daily cost > $10
```
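Loading this section can be a straightforward serde/toml deserialization. The struct names below are hypothetical, and the `${...}` placeholders would still need environment-variable expansion after parsing:
```rust
use serde::Deserialize;

// Hypothetical structs mirroring the [llm_router] keys above.
#[derive(Debug, Deserialize)]
pub struct VaporaConfig {
    pub llm_router: LlmRouterConfig,
}

#[derive(Debug, Deserialize)]
pub struct LlmRouterConfig {
    #[serde(default)]
    pub custom_mapping: Vec<CustomMapping>,
    #[serde(default)]
    pub providers: Vec<ProviderConfig>,
    pub cost_tracking: Option<CostTrackingConfig>,
}

#[derive(Debug, Deserialize)]
pub struct CustomMapping {
    pub task_type: String,
    pub primary_provider: String,
    pub primary_model: String,
    #[serde(default)]
    pub fallback_providers: Vec<String>,
}

#[derive(Debug, Deserialize)]
pub struct ProviderConfig {
    pub name: String,
    pub api_key: Option<String>,  // Absent for local providers like Ollama
    pub endpoint: Option<String>, // Only set for Ollama
    #[serde(default)]
    pub model_variants: Vec<String>,
    pub rate_limit: Option<RateLimitConfig>,
}

#[derive(Debug, Deserialize)]
pub struct RateLimitConfig {
    pub tokens_per_minute: u64,
}

#[derive(Debug, Deserialize)]
pub struct CostTrackingConfig {
    pub enabled: bool,
    pub warn_when_exceeds_cents: u32,
}

// Usage sketch:
// let cfg: VaporaConfig = toml::from_str(&std::fs::read_to_string("vapora.toml")?)?;
```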
---
## 🎯 Implementation Checklist
- [ ] `LLMClient` trait + implementations (Claude, OpenAI, Gemini, Ollama)
- [ ] `TaskContext` and task classification
- [ ] `IAMapping` and DEFAULT_MAPPINGS
- [ ] `LLMRouter` with hybrid routing
- [ ] Automatic fallback + error handling
- [ ] `CostTracker` for monitoring
- [ ] Config loading from vapora.toml
- [ ] CLI: `vapora llm-router status` (view providers, costs)
- [ ] Unit tests (routing logic)
- [ ] Integration tests (real providers)
---
## 📈 Success Metrics
✅ Routing decision < 100ms
✅ Automatic fallback works
✅ Accurate cost tracking
✅ Per-task cost documentation
✅ Manual override always works
✅ Rate limiting respected
---
**Version**: 0.1.0
**Status**: Specification Complete (VAPORA v1.0)
**Purpose**: Multi-IA routing system for agent orchestration