# ADR-018: Swarm Load-Balanced Task Assignment
**Status**: Accepted | Implemented
**Date**: 2024-11-01
**Deciders**: Swarm Coordination Team
**Technical Story**: Distributing tasks across agents considering both capability and current load
---
## Decision
Implement **load-balanced task assignment** using the formula `assignment_score = success_rate / (1 + load)`.
---
## Rationale
1. **Success Rate**: Prefer agents that have succeeded on similar tasks
2. **Load Factor**: Balance expertise against availability (avoid overloading agents)
3. **Single Formula**: Combines both dimensions into one comparable metric
4. **Prevents Concentration**: Keeps all tasks from going to a single agent
---
## Alternatives Considered
### ❌ Success Rate Only
- **Pros**: Selects the best performer
- **Cons**: Concentrates all tasks on one agent, which becomes overloaded
### ❌ Round-Robin (Equal Distribution)
- **Pros**: Simple, fair distribution
- **Cons**: No considera capability, bad agents get same load
### ✅ Success Rate / (1 + Load) (CHOSEN)
- Balances expertise with availability
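
To make the comparison concrete, the following sketch simulates the three strategies on a small swarm. The `Agent` struct, `Strategy` enum, `simulate`, `best_by`, and `sample_swarm` are hypothetical names introduced for illustration and are not part of `vapora-swarm`. It assumes a non-empty swarm and that no task completes during the run. With nine tasks and the sample swarm, success-rate-only sends all nine to the first agent (beyond its `max_concurrent` of 5), round-robin gives the weakest agent an equal share, and the chosen formula keeps most work on the two stronger agents while staying under capacity.

```rust
/// Hypothetical, simplified agent used only for this comparison.
#[derive(Clone)]
struct Agent {
    success_rate: f32,
    in_flight: u32,
    max_concurrent: u32,
}

impl Agent {
    /// success_rate / (1 + load), as defined in this ADR.
    fn score(&self) -> f32 {
        let load = self.in_flight as f32 / self.max_concurrent as f32;
        self.success_rate / (1.0 + load)
    }
}

#[derive(Clone, Copy)]
enum Strategy {
    SuccessRateOnly,
    RoundRobin,
    LoadBalanced,
}

/// Index of the first agent with the highest key (assumes a non-empty slice).
fn best_by(agents: &[Agent], key: impl Fn(&Agent) -> f32) -> usize {
    let mut best = 0;
    for (i, a) in agents.iter().enumerate() {
        if key(a) > key(&agents[best]) {
            best = i;
        }
    }
    best
}

/// Assigns `tasks` tasks one by one and returns how many each agent received.
/// Tasks never complete here, so `in_flight` only grows.
fn simulate(strategy: Strategy, mut agents: Vec<Agent>, tasks: usize) -> Vec<u32> {
    let mut counts = vec![0u32; agents.len()];
    for t in 0..tasks {
        let idx = match strategy {
            Strategy::RoundRobin => t % agents.len(),
            Strategy::SuccessRateOnly => best_by(&agents, |a| a.success_rate),
            Strategy::LoadBalanced => best_by(&agents, |a| a.score()),
        };
        agents[idx].in_flight += 1;
        counts[idx] += 1;
    }
    counts
}

/// Three idle agents with illustrative success rates.
fn sample_swarm() -> Vec<Agent> {
    vec![
        Agent { success_rate: 0.95, in_flight: 0, max_concurrent: 5 },
        Agent { success_rate: 0.85, in_flight: 0, max_concurrent: 5 },
        Agent { success_rate: 0.60, in_flight: 0, max_concurrent: 5 },
    ]
}
```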
---
## Trade-offs
**Pros**:
- ✅ Considers both capability and availability
- ✅ Simple, single metric for comparison
- ✅ Prevents overloading high-performing agents
- ✅ Encourages fair distribution
**Cons**:
- ⚠️ Formula is simplified (the denominator grows linearly with load, so an agent at full capacity loses only half its score)
- ⚠️ May sacrifice quality for load balance
- ⚠️ Requires real-time load tracking
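
The first two caveats can be quantified. While `in_flight_tasks` stays within `max_concurrent`, the load is at most 1.0, so the score is at most halved. A busy agent therefore still outranks an idle agent with a lower success rate until its load reaches `busy_success_rate / idle_success_rate - 1`. For example, an agent with success rate 0.9 outranks an idle agent at 0.6 until its load reaches 0.5. The sketch below is a minimal illustration; `break_even_load` is a hypothetical helper, not part of the codebase.

```rust
/// Score as defined in this ADR.
fn assignment_score(success_rate: f32, load: f32) -> f32 {
    success_rate / (1.0 + load)
}

/// Load at which a busy agent's score drops to that of an idle agent
/// (load = 0) with a lower success rate. Derived from
/// busy / (1 + load) = idle, which gives load = busy / idle - 1.
fn break_even_load(busy_success_rate: f32, idle_success_rate: f32) -> f32 {
    busy_success_rate / idle_success_rate - 1.0
}
```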
---
## Implementation
**Agent Load Tracking**:
```rust
// crates/vapora-swarm/src/coordinator.rs
pub struct AgentState {
    pub id: String,
    pub role: AgentRole,
    pub status: AgentStatus, // Ready, Busy, Offline
    pub in_flight_tasks: u32,
    pub max_concurrent: u32,
    pub success_rate: f32, // [0.0, 1.0]
    pub avg_latency_ms: u32,
}

impl AgentState {
    /// Current load (0.0 = idle, 1.0 = at capacity)
    pub fn current_load(&self) -> f32 {
        // Treat an agent with no capacity as fully loaded rather than dividing by zero.
        if self.max_concurrent == 0 {
            return 1.0;
        }
        (self.in_flight_tasks as f32) / (self.max_concurrent as f32)
    }

    /// Assignment score: success_rate / (1 + load)
    /// Higher = better candidate for task
    pub fn assignment_score(&self) -> f32 {
        self.success_rate / (1.0 + self.current_load())
    }
}
```
**Task Assignment Logic**:
```rust
pub async fn assign_task_to_best_agent(
    task: &Task,
    // Mutable so the selected agent's in-flight counter can be updated.
    agents: &mut [AgentState],
) -> Result<String> {
    // Filter eligible agents (online only; role matching against `task` is not shown here)
    let eligible: Vec<_> = agents
        .iter()
        .filter(|a| {
            a.status == AgentStatus::Ready || a.status == AgentStatus::Busy
        })
        .collect();

    if eligible.is_empty() {
        return Err(Error::NoAgentsAvailable);
    }

    // Score each agent
    let mut scored: Vec<_> = eligible
        .iter()
        .map(|agent| {
            let score = agent.assignment_score();
            (agent.id.clone(), score)
        })
        .collect();

    // Sort by score descending (NaN scores compare as equal)
    scored.sort_by(|a, b| {
        b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
    });

    // Assign to highest scoring agent
    let selected_agent_id = scored[0].0.clone();

    // Increment in-flight counter
    if let Some(agent) = agents.iter_mut().find(|a| a.id == selected_agent_id) {
        agent.in_flight_tasks += 1;
    }

    Ok(selected_agent_id)
}
```
**Load Calculation Examples**:
```
Agent A: success_rate = 0.95, in_flight = 2, max_concurrent = 5
  load  = 2/5 = 0.4
  score = 0.95 / (1 + 0.4) = 0.95 / 1.4 ≈ 0.68

Agent B: success_rate = 0.85, in_flight = 0, max_concurrent = 5
  load  = 0/5 = 0.0
  score = 0.85 / (1 + 0.0) = 0.85 / 1.0 = 0.85  ← Selected

Agent C: success_rate = 0.90, in_flight = 5, max_concurrent = 5
  load  = 5/5 = 1.0
  score = 0.90 / (1 + 1.0) = 0.90 / 2.0 = 0.45
```
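
Continuing from the example above, the sketch below traces what happens when several tasks arrive in a row and none complete in between. It uses a hypothetical `Agent` stand-in that holds only the fields needed for scoring, plus illustrative `assign_sequence` and `example_agents` helpers; none of these names come from the codebase. Each assignment raises the chosen agent's load and lowers its score, so selection alternates between B and A, while C (already at capacity) is never picked.

```rust
/// Simplified stand-in for `AgentState`, holding only the fields used in scoring.
struct Agent {
    name: &'static str,
    success_rate: f32,
    in_flight: u32,
    max_concurrent: u32,
}

impl Agent {
    /// success_rate / (1 + load)
    fn score(&self) -> f32 {
        let load = self.in_flight as f32 / self.max_concurrent as f32;
        self.success_rate / (1.0 + load)
    }
}

/// Assigns `n` tasks in sequence and returns the name of the agent chosen
/// each time. Assumes `agents` is non-empty whenever `n > 0`.
fn assign_sequence(agents: &mut [Agent], n: usize) -> Vec<&'static str> {
    let mut chosen = Vec::with_capacity(n);
    for _ in 0..n {
        // Pick the highest-scoring agent (the first one wins on ties).
        let mut best = 0;
        for i in 1..agents.len() {
            if agents[i].score() > agents[best].score() {
                best = i;
            }
        }
        // Taking the task raises this agent's load for the next round.
        agents[best].in_flight += 1;
        chosen.push(agents[best].name);
    }
    chosen
}

/// The three agents from the load calculation examples.
fn example_agents() -> Vec<Agent> {
    vec![
        Agent { name: "A", success_rate: 0.95, in_flight: 2, max_concurrent: 5 },
        Agent { name: "B", success_rate: 0.85, in_flight: 0, max_concurrent: 5 },
        Agent { name: "C", success_rate: 0.90, in_flight: 5, max_concurrent: 5 },
    ]
}
```

Calling `assign_sequence(&mut example_agents(), 5)` yields the order B, B, A, B, A: B keeps winning until its score (0.85 / 1.4 ≈ 0.61) falls below A's 0.68.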
**Real-Time Metrics**:
```rust
pub async fn collect_swarm_metrics(
    agents: &[AgentState],
) -> SwarmMetrics {
    // Avoid 0.0 / 0.0 (NaN) in the averages when the swarm is empty.
    let n = agents.len().max(1) as f32;

    SwarmMetrics {
        total_agents: agents.len(),
        // Idle/busy are counted by in-flight tasks, so an Offline agent with none also counts as idle.
        idle_agents: agents.iter().filter(|a| a.in_flight_tasks == 0).count(),
        busy_agents: agents.iter().filter(|a| a.in_flight_tasks > 0).count(),
        offline_agents: agents.iter().filter(|a| a.status == AgentStatus::Offline).count(),
        total_in_flight: agents.iter().map(|a| a.in_flight_tasks).sum::<u32>(),
        avg_success_rate: agents.iter().map(|a| a.success_rate).sum::<f32>() / n,
        avg_load: agents.iter().map(|a| a.current_load()).sum::<f32>() / n,
    }
}
```
**Prometheus Metrics**:
```rust
use prometheus::{Counter, Gauge, Histogram, HistogramOpts};

// Define metrics (registration with a `prometheus::Registry` is not shown)
lazy_static::lazy_static! {
    static ref TASK_ASSIGNMENTS: Counter = Counter::new(
        "vapora_task_assignments_total",
        "Total task assignments"
    ).unwrap();

    static ref AGENT_LOAD: Gauge = Gauge::new(
        "vapora_agent_current_load",
        "Current agent load (0-1)"
    ).unwrap();

    // Histograms are constructed from `HistogramOpts`.
    static ref ASSIGNMENT_SCORE: Histogram = Histogram::with_opts(HistogramOpts::new(
        "vapora_assignment_score",
        "Assignment score distribution"
    )).unwrap();
}

// Record metrics (the prometheus crate takes f64 values)
TASK_ASSIGNMENTS.inc();
AGENT_LOAD.set(f64::from(best_agent.current_load()));
ASSIGNMENT_SCORE.observe(f64::from(best_agent.assignment_score()));
```
**Key Files**:
- `/crates/vapora-swarm/src/coordinator.rs` (assignment logic)
- `/crates/vapora-swarm/src/metrics.rs` (Prometheus metrics)
- `/crates/vapora-backend/src/api/` (task creation triggers assignment)
---
## Verification
```bash
# Test assignment score calculation
cargo test -p vapora-swarm test_assignment_score_calculation
# Test load factor impact
cargo test -p vapora-swarm test_load_factor_impact
# Test best agent selection
cargo test -p vapora-swarm test_select_best_agent
# Test fair distribution (no concentration)
cargo test -p vapora-swarm test_fair_distribution
# Integration: assign multiple tasks sequentially
cargo test -p vapora-swarm test_assignment_sequence
# Load balancing under stress
cargo test -p vapora-swarm test_load_balancing_stress
```
**Expected Output**:
- Agents with high success_rate + low load selected first
- Load increases after each assignment
- Fair distribution across agents
- No single agent receiving all tasks
- Metrics tracked accurately
- Scores properly reflect trade-off
---
## Consequences
### Fairness
- High-performing agents get more tasks (deserved)
- Overloaded agents get fewer tasks (protection)
- Fair distribution emerges automatically
### Performance
- Task latency depends on agent load (may queue)
- Peak concurrency = sum of `max_concurrent` across all agents
- SLA contracts respect per-agent limits
### Scaling
- Adding agents increases total capacity
- Load automatically redistributes
- Horizontal scaling works naturally
### Monitoring
- Track assignment distribution
- Alert if concentration detected
- Identify bottleneck agents
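
One way to detect concentration is to track how many assignments each agent has received and alert when a single agent's share passes a threshold. The sketch below is a minimal illustration under that assumption; `max_assignment_share`, `is_concentrated`, and the threshold value are hypothetical and do not come from `vapora-swarm`.

```rust
use std::collections::HashMap;

/// Largest share of assignments received by any single agent, in [0.0, 1.0].
/// Returns 0.0 when nothing has been assigned yet.
fn max_assignment_share(counts: &HashMap<String, u64>) -> f64 {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return 0.0;
    }
    let max = counts.values().copied().max().unwrap_or(0);
    max as f64 / total as f64
}

/// True when one agent holds more than `threshold` of all assignments.
/// The threshold is a hypothetical tuning knob, not a value from this ADR.
fn is_concentrated(counts: &HashMap<String, u64>, threshold: f64) -> bool {
    max_assignment_share(counts) > threshold
}
```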
---
## References
- `/crates/vapora-swarm/src/coordinator.rs` (implementation)
- `/crates/vapora-swarm/src/metrics.rs` (metrics collection)
- ADR-014 (Learning Profiles)
---
**Related ADRs**: ADR-014 (Learning Profiles), ADR-020 (Audit Trail)