ADR-018: Swarm Load-Balanced Task Assignment
Status: Accepted | Implemented
Date: 2024-11-01
Deciders: Swarm Coordination Team
Technical Story: Distributing tasks across agents considering both capability and current load
Decision
Implement load-balanced task assignment using the formula assignment_score = success_rate / (1 + load).
Rationale
- Success Rate: prefer agents that have succeeded at similar tasks
- Load Factor: balance expertise against availability (avoid overloading any agent)
- Single Formula: combines both dimensions into one comparable metric
- Prevents Concentration: stops all tasks from flowing to a single agent
Alternatives Considered
❌ Success Rate Only
- Pros: always selects the best performer
- Cons: concentrates all tasks on one agent, which becomes overloaded
❌ Round-Robin (Equal Distribution)
- Pros: simple, fair distribution
- Cons: ignores capability; low-performing agents receive the same load
✅ Success Rate / (1 + Load) (CHOSEN)
- Balances expertise with availability
Trade-offs
Pros:
- ✅ Considers both capability and availability
- ✅ Simple, single metric for comparison
- ✅ Prevents overloading high-performing agents
- ✅ Encourages fair distribution
Cons:
- ⚠️ Formula is simplified (linear load penalty)
- ⚠️ May sacrifice quality for load balance
- ⚠️ Requires real-time load tracking
Implementation
Agent Load Tracking:

```rust
// crates/vapora-swarm/src/coordinator.rs
pub struct AgentState {
    pub id: String,
    pub role: AgentRole,
    pub status: AgentStatus, // Ready, Busy, Offline
    pub in_flight_tasks: u32,
    pub max_concurrent: u32,
    pub success_rate: f32, // [0.0, 1.0]
    pub avg_latency_ms: u32,
}

impl AgentState {
    /// Current load (0.0 = idle, 1.0 = at capacity)
    pub fn current_load(&self) -> f32 {
        (self.in_flight_tasks as f32) / (self.max_concurrent as f32)
    }

    /// Assignment score: success_rate / (1 + load)
    /// Higher = better candidate for task
    pub fn assignment_score(&self) -> f32 {
        self.success_rate / (1.0 + self.current_load())
    }
}
```
Task Assignment Logic (note: the slice must be mutable so the in-flight counter can be updated after selection):

```rust
pub async fn assign_task_to_best_agent(
    task: &Task,
    agents: &mut [AgentState],
) -> Result<String> {
    // Filter eligible agents (online: Ready or Busy).
    // Role matching against the task would also be applied here.
    let eligible: Vec<_> = agents
        .iter()
        .filter(|a| {
            a.status == AgentStatus::Ready || a.status == AgentStatus::Busy
        })
        .collect();

    if eligible.is_empty() {
        return Err(Error::NoAgentsAvailable);
    }

    // Score each agent
    let mut scored: Vec<_> = eligible
        .iter()
        .map(|agent| (agent.id.clone(), agent.assignment_score()))
        .collect();

    // Sort by score descending
    scored.sort_by(|a, b| {
        b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
    });

    // Assign to highest scoring agent
    let selected_agent_id = scored[0].0.clone();

    // Increment in-flight counter
    if let Some(agent) = agents.iter_mut().find(|a| a.id == selected_agent_id) {
        agent.in_flight_tasks += 1;
    }

    Ok(selected_agent_id)
}
```
Load Calculation Examples:

```
Agent A: success_rate = 0.95, in_flight = 2, max_concurrent = 5
  load  = 2/5 = 0.4
  score = 0.95 / (1 + 0.4) = 0.95 / 1.4 ≈ 0.68

Agent B: success_rate = 0.85, in_flight = 0, max_concurrent = 5
  load  = 0/5 = 0.0
  score = 0.85 / (1 + 0.0) = 0.85 / 1.0 = 0.85 ← Selected

Agent C: success_rate = 0.90, in_flight = 5, max_concurrent = 5
  load  = 5/5 = 1.0
  score = 0.90 / (1 + 1.0) = 0.90 / 2.0 = 0.45
```
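These three worked examples can be checked directly with a standalone sketch of the scoring function (not the production AgentState impl):

```rust
fn assignment_score(success_rate: f32, in_flight: u32, max_concurrent: u32) -> f32 {
    let load = in_flight as f32 / max_concurrent as f32;
    success_rate / (1.0 + load)
}

fn main() {
    let a = assignment_score(0.95, 2, 5);
    let b = assignment_score(0.85, 0, 5);
    let c = assignment_score(0.90, 5, 5);
    assert!((a - 0.6786).abs() < 1e-3); // 0.95 / 1.4
    assert!((b - 0.85).abs() < 1e-6);   // 0.85 / 1.0
    assert!((c - 0.45).abs() < 1e-6);   // 0.90 / 2.0
    assert!(b > a && a > c); // Agent B is selected despite the lowest success rate
}
```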
Real-Time Metrics:

```rust
pub async fn collect_swarm_metrics(agents: &[AgentState]) -> SwarmMetrics {
    let n = agents.len().max(1) as f32; // guard against division by zero
    SwarmMetrics {
        total_agents: agents.len(),
        idle_agents: agents.iter().filter(|a| a.in_flight_tasks == 0).count(),
        busy_agents: agents.iter().filter(|a| a.in_flight_tasks > 0).count(),
        offline_agents: agents.iter().filter(|a| a.status == AgentStatus::Offline).count(),
        total_in_flight: agents.iter().map(|a| a.in_flight_tasks).sum::<u32>(),
        avg_success_rate: agents.iter().map(|a| a.success_rate).sum::<f32>() / n,
        avg_load: agents.iter().map(|a| a.current_load()).sum::<f32>() / n,
    }
}
```
Prometheus Metrics:

```rust
use prometheus::{Counter, Gauge, Histogram, HistogramOpts};

// Define metrics (each must also be registered with the registry at startup)
lazy_static::lazy_static! {
    static ref TASK_ASSIGNMENTS: Counter = Counter::new(
        "vapora_task_assignments_total",
        "Total task assignments"
    ).unwrap();
    static ref AGENT_LOAD: Gauge = Gauge::new(
        "vapora_agent_current_load",
        "Current agent load (0-1)"
    ).unwrap();
    static ref ASSIGNMENT_SCORE: Histogram = Histogram::with_opts(HistogramOpts::new(
        "vapora_assignment_score",
        "Assignment score distribution"
    )).unwrap();
}

// Record metrics after each assignment
TASK_ASSIGNMENTS.inc();
AGENT_LOAD.set(best_agent.current_load() as f64);
ASSIGNMENT_SCORE.observe(best_agent.assignment_score() as f64);
```
Key Files:
- /crates/vapora-swarm/src/coordinator.rs (assignment logic)
- /crates/vapora-swarm/src/metrics.rs (Prometheus metrics)
- /crates/vapora-backend/src/api/ (task creation triggers assignment)
Verification
```bash
# Test assignment score calculation
cargo test -p vapora-swarm test_assignment_score_calculation

# Test load factor impact
cargo test -p vapora-swarm test_load_factor_impact

# Test best agent selection
cargo test -p vapora-swarm test_select_best_agent

# Test fair distribution (no concentration)
cargo test -p vapora-swarm test_fair_distribution

# Integration: assign multiple tasks sequentially
cargo test -p vapora-swarm test_assignment_sequence

# Load balancing under stress
cargo test -p vapora-swarm test_load_balancing_stress
```
Expected Output:
- Agents with high success_rate + low load selected first
- Load increases after each assignment
- Fair distribution across agents
- No single agent receiving all tasks
- Metrics tracked accurately
- Scores properly reflect trade-off
Consequences
Fairness
- High-performing agents get more tasks (deserved)
- Overloaded agents get fewer tasks (protection)
- Fair distribution emerges automatically
Performance
- Task latency depends on agent load (may queue)
- Peak throughput = sum of all agent max_concurrent
- SLA contracts respect per-agent limits
Scaling
- Adding agents increases total capacity
- Load automatically redistributes
- Horizontal scaling works naturally
Monitoring
- Track assignment distribution
- Alert if concentration detected
- Identify bottleneck agents
References
- /crates/vapora-swarm/src/coordinator.rs (implementation)
- /crates/vapora-swarm/src/metrics.rs (metrics collection)
- ADR-014 (Learning Profiles)

Related ADRs: ADR-014 (Learning Profiles), ADR-020 (Audit Trail)