# ADR-018: Swarm Load-Balanced Task Assignment

**Status**: Accepted | Implemented
**Date**: 2024-11-01
**Deciders**: Swarm Coordination Team
**Technical Story**: Distributing tasks across agents considering both capability and current load

---

## Decision

Implement **load-balanced task assignment** using the formula `assignment_score = success_rate / (1 + load)`.

---

## Rationale

1. **Success Rate**: Select agents that have succeeded on similar tasks
2. **Load Factor**: Balance expertise against availability (do not overload agents)
3. **Single Formula**: Combines both dimensions into one comparable metric
4. **Prevents Concentration**: Keeps all tasks from going to a single agent

---

## Alternatives Considered

### ❌ Success Rate Only

- **Pros**: Selects the best performer
- **Cons**: Concentrates all tasks on one agent, which becomes overloaded

### ❌ Round-Robin (Equal Distribution)

- **Pros**: Simple, fair distribution
- **Cons**: Ignores capability; poorly performing agents receive the same load

### ✅ Success Rate / (1 + Load) (CHOSEN)

- Balances expertise with availability

---

## Trade-offs

**Pros**:

- ✅ Considers both capability and availability
- ✅ Simple, single metric for comparison
- ✅ Prevents overloading high-performing agents
- ✅ Encourages fair distribution

**Cons**:

- ⚠️ Formula is simplified: the divisor grows linearly with load, so an agent at capacity (load = 1.0) only has its score halved
- ⚠️ May route a task to a less capable agent for the sake of load balance
- ⚠️ Requires real-time load tracking

---

## Implementation

**Agent Load Tracking**:

```rust
// crates/vapora-swarm/src/coordinator.rs

pub struct AgentState {
    pub id: String,
    pub role: AgentRole,
    pub status: AgentStatus, // Ready, Busy, Offline
    pub in_flight_tasks: u32,
    pub max_concurrent: u32,
    pub success_rate: f32, // [0.0, 1.0]
    pub avg_latency_ms: u32,
}

impl AgentState {
    /// Current load (0.0 = idle, 1.0 = at capacity)
    pub fn current_load(&self) -> f32 {
        (self.in_flight_tasks as f32) / (self.max_concurrent as f32)
    }

    /// Assignment score: success_rate / (1 + load)
    /// Higher = better candidate for task
    pub fn assignment_score(&self) -> f32 {
        self.success_rate / (1.0 + self.current_load())
    }
}
```

**Task Assignment Logic**:

```rust
pub async fn assign_task_to_best_agent(
    task: &Task,
    agents: &mut [AgentState], // mutable: the in-flight counter is updated below
) -> Result<String, Error> {
    // Filter eligible agents (online: Ready or Busy).
    // Role matching against `task` is not applied in this snippet.
    let eligible: Vec<_> = agents
        .iter()
        .filter(|a| {
            a.status == AgentStatus::Ready || a.status == AgentStatus::Busy
        })
        .collect();

    if eligible.is_empty() {
        return Err(Error::NoAgentsAvailable);
    }

    // Score each agent
    let mut scored: Vec<_> = eligible
        .iter()
        .map(|agent| {
            let score = agent.assignment_score();
            (agent.id.clone(), score)
        })
        .collect();

    // Sort by score descending (incomparable NaN scores are treated as equal)
    scored.sort_by(|a, b| {
        b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
    });

    // Assign to highest scoring agent
    let selected_agent_id = scored[0].0.clone();

    // Increment in-flight counter
    if let Some(agent) = agents.iter_mut().find(|a| a.id == selected_agent_id) {
        agent.in_flight_tasks += 1;
    }

    Ok(selected_agent_id)
}
```

**Load Calculation Examples**:

```
Agent A: success_rate = 0.95, in_flight = 2, max_concurrent = 5
  load  = 2/5 = 0.4
  score = 0.95 / (1 + 0.4) = 0.95 / 1.4 = 0.68

Agent B: success_rate = 0.85, in_flight = 0, max_concurrent = 5
  load  = 0/5 = 0.0
  score = 0.85 / (1 + 0.0) = 0.85 / 1.0 = 0.85 ← Selected

Agent C: success_rate = 0.90, in_flight = 5, max_concurrent = 5
  load  = 5/5 = 1.0
  score = 0.90 / (1 + 1.0) = 0.90 / 2.0 = 0.45
```

**Real-Time Metrics**:

```rust
pub async fn collect_swarm_metrics(
    agents: &[AgentState],
) -> SwarmMetrics {
    // Note: the averages are NaN when `agents` is empty.
    SwarmMetrics {
        total_agents: agents.len(),
        idle_agents: agents.iter().filter(|a| a.in_flight_tasks == 0).count(),
        busy_agents: agents.iter().filter(|a| a.in_flight_tasks > 0).count(),
        offline_agents: agents.iter().filter(|a| a.status == AgentStatus::Offline).count(),
        total_in_flight: agents.iter().map(|a| a.in_flight_tasks).sum::<u32>(),
        avg_success_rate: agents.iter().map(|a| a.success_rate).sum::<f32>() / agents.len() as f32,
        avg_load: agents.iter().map(|a| a.current_load()).sum::<f32>() / agents.len() as f32,
    }
}
```

**Prometheus Metrics**:

```rust
use prometheus::{Counter, Gauge, Histogram, HistogramOpts};

// Register metrics
lazy_static::lazy_static! {
    static ref TASK_ASSIGNMENTS: Counter = Counter::new(
        "vapora_task_assignments_total",
        "Total task assignments"
    ).unwrap();

    static ref AGENT_LOAD: Gauge = Gauge::new(
        "vapora_agent_current_load",
        "Current agent load (0-1)"
    ).unwrap();

    // Histograms are built from HistogramOpts rather than a (name, help) pair
    static ref ASSIGNMENT_SCORE: Histogram = Histogram::with_opts(HistogramOpts::new(
        "vapora_assignment_score",
        "Assignment score distribution"
    )).unwrap();
}

// Record metrics (`best_agent` is the agent selected for the task);
// Gauge::set and Histogram::observe take f64, so the f32 values are widened
TASK_ASSIGNMENTS.inc();
AGENT_LOAD.set(f64::from(best_agent.current_load()));
ASSIGNMENT_SCORE.observe(f64::from(best_agent.assignment_score()));
```

**Key Files**:

- `/crates/vapora-swarm/src/coordinator.rs` (assignment logic)
- `/crates/vapora-swarm/src/metrics.rs` (Prometheus metrics)
- `/crates/vapora-backend/src/api/` (task creation triggers assignment)

---

## Verification

```bash
# Test assignment score calculation
cargo test -p vapora-swarm test_assignment_score_calculation

# Test load factor impact
cargo test -p vapora-swarm test_load_factor_impact

# Test best agent selection
cargo test -p vapora-swarm test_select_best_agent

# Test fair distribution (no concentration)
cargo test -p vapora-swarm test_fair_distribution

# Integration: assign multiple tasks sequentially
cargo test -p vapora-swarm test_assignment_sequence

# Load balancing under stress
cargo test -p vapora-swarm test_load_balancing_stress
```

**Expected Output**:

- Agents with high success_rate and low load are selected first
- Load increases after each assignment
- Fair distribution across agents
- No single agent receives all tasks
- Metrics are tracked accurately
- Scores reflect the trade-off between success rate and load

---

## Consequences

### Fairness

- High-performing agents get more tasks (deserved)
- Overloaded agents get fewer tasks (protection)
- Fair distribution emerges automatically

### Performance

- Task latency depends on agent load (tasks may queue)
- Peak concurrency = sum of `max_concurrent` across all agents
- SLA contracts respect per-agent limits

### Scaling

- Adding agents increases total capacity
- Load automatically redistributes
- Horizontal scaling works naturally

### Monitoring

- Track assignment distribution
- Alert if concentration is detected
- Identify bottleneck agents

---

## References

- `/crates/vapora-swarm/src/coordinator.rs` (implementation)
- `/crates/vapora-swarm/src/metrics.rs` (metrics collection)
- ADR-014 (Learning Profiles)

---

**Related ADRs**: ADR-014 (Learning Profiles), ADR-020 (Audit Trail)
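
---

As a worked illustration of the scoring formula `success_rate / (1 + load)` and the Load Calculation Examples, the following minimal sketch assigns four tasks in sequence to Agents A, B, and C. The `Agent` struct and the functions `pick`, `assign_sequence`, and `example_agents` are hypothetical names introduced for this illustration; they are not the contents of `coordinator.rs`. Like the assignment snippet in the Implementation section, the sketch scores every agent, including one that is already at capacity.

```rust
/// Hypothetical, pared-down agent record holding only what the formula needs.
#[derive(Debug, Clone)]
struct Agent {
    id: &'static str,
    success_rate: f32,
    in_flight: u32,
    max_concurrent: u32,
}

impl Agent {
    /// Load as a fraction of capacity (0.0 = idle, 1.0 = at capacity).
    fn load(&self) -> f32 {
        self.in_flight as f32 / self.max_concurrent as f32
    }

    /// assignment_score = success_rate / (1 + load)
    fn score(&self) -> f32 {
        self.success_rate / (1.0 + self.load())
    }
}

/// Index of the highest-scoring agent, or None if the slice is empty.
fn pick(agents: &[Agent]) -> Option<usize> {
    agents
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.score().total_cmp(&b.score()))
        .map(|(i, _)| i)
}

/// Assign `n` tasks one at a time; returns the chosen agent ids in order.
fn assign_sequence(agents: &mut [Agent], n: usize) -> Vec<&'static str> {
    let mut order = Vec::with_capacity(n);
    for _ in 0..n {
        if let Some(i) = pick(agents) {
            agents[i].in_flight += 1; // load rises, so the next score drops
            order.push(agents[i].id);
        }
    }
    order
}

/// Agents A, B, and C with the values from the Load Calculation Examples.
fn example_agents() -> Vec<Agent> {
    vec![
        Agent { id: "A", success_rate: 0.95, in_flight: 2, max_concurrent: 5 },
        Agent { id: "B", success_rate: 0.85, in_flight: 0, max_concurrent: 5 },
        Agent { id: "C", success_rate: 0.90, in_flight: 5, max_concurrent: 5 },
    ]
}

fn main() {
    let mut agents = example_agents();
    let order = assign_sequence(&mut agents, 4);
    println!("{:?}", order); // prints ["B", "B", "A", "B"]
}
```

Agent B takes the first two tasks because it starts idle. Its score then falls to 0.85 / 1.4 ≈ 0.61, below Agent A's 0.68, so the third task goes to A. This alternation is what keeps the strongest performer from absorbing every task.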