# ADR-018: Swarm Load-Balanced Task Assignment

**Status**: Accepted | Implemented
**Date**: 2024-11-01
**Deciders**: Swarm Coordination Team
**Technical Story**: Distributing tasks across agents considering both capability and current load

---

## Decision

Implement **load-balanced task assignment** using the formula `assignment_score = success_rate / (1 + load)`.

---

## Rationale

1. **Success Rate**: Select agents that have succeeded on similar tasks
2. **Load Factor**: Balance expertise against availability (avoid overloading)
3. **Single Formula**: Combines both dimensions into one comparable metric
4. **Prevents Concentration**: Keeps all tasks from going to a single agent

---

## Alternatives Considered

### ❌ Success Rate Only
- **Pros**: Selects the best performer
- **Cons**: Concentrates all tasks on one agent, which becomes overloaded

### ❌ Round-Robin (Equal Distribution)
- **Pros**: Simple, fair distribution
- **Cons**: Ignores capability; weak agents get the same load as strong ones

### ✅ Success Rate / (1 + Load) (CHOSEN)
- Balances expertise with availability
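
The three alternatives can be compared side by side with a small simulation. This is a minimal sketch, not the project's code: `Agent`, `Strategy`, `pick`, and `distribute` are hypothetical names introduced for illustration, and every agent is assumed to have `max_concurrent > 0`. With two agents (success rates 0.95 and 0.70, capacity 5 each) and six tasks, success-rate-only sends all six tasks to the first agent, round-robin splits them 3/3, and the chosen formula splits them 4/2.

```rust
#[derive(Debug, Clone)]
struct Agent {
    success_rate: f32,
    in_flight_tasks: u32,
    max_concurrent: u32,
}

#[derive(Debug, Clone, Copy)]
enum Strategy {
    SuccessRateOnly,
    RoundRobin,
    LoadBalanced,
}

/// Load as a fraction of capacity (assumes `max_concurrent > 0`).
fn load(a: &Agent) -> f32 {
    a.in_flight_tasks as f32 / a.max_concurrent as f32
}

/// Index of the agent with the highest `key` value.
fn argmax(agents: &[Agent], key: impl Fn(&Agent) -> f32) -> usize {
    (0..agents.len())
        .max_by(|&i, &j| key(&agents[i]).total_cmp(&key(&agents[j])))
        .expect("at least one agent")
}

/// Picks the agent that receives task number `task_no` (0-based).
fn pick(agents: &[Agent], strategy: Strategy, task_no: usize) -> usize {
    match strategy {
        Strategy::SuccessRateOnly => argmax(agents, |a| a.success_rate),
        Strategy::RoundRobin => task_no % agents.len(),
        Strategy::LoadBalanced => argmax(agents, |a| a.success_rate / (1.0 + load(a))),
    }
}

/// Assigns `n` tasks and returns how many each agent received.
fn distribute(mut agents: Vec<Agent>, strategy: Strategy, n: usize) -> Vec<u32> {
    let mut counts = vec![0; agents.len()];
    for task_no in 0..n {
        let i = pick(&agents, strategy, task_no);
        agents[i].in_flight_tasks += 1;
        counts[i] += 1;
    }
    counts
}
```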

---

## Trade-offs

**Pros**:
- ✅ Considers both capability and availability
- ✅ Simple, single metric for comparison
- ✅ Prevents overloading high-performing agents
- ✅ Encourages fair distribution

**Cons**:
- ⚠️ Formula is simplified (linear load penalty)
- ⚠️ May sacrifice quality for load balance
- ⚠️ Requires real-time load tracking
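
The first two cons can be made concrete. While an agent stays within capacity, `load` lies in [0, 1], so its score ranges from `success_rate` (idle) down to `success_rate / 2` (at capacity). The sketch below is illustrative only: `score` and `idle_agent_wins` are hypothetical free-standing helpers, not part of the project's API. It shows an idle agent with a lower success rate outranking a busier, better one, and an agent at capacity still outranking an idle but much weaker one.

```rust
/// assignment_score = success_rate / (1 + load), written as a free function.
fn score(success_rate: f32, load: f32) -> f32 {
    success_rate / (1.0 + load)
}

/// True when an idle agent with `idle_rate` outranks a competitor that has
/// success rate `busy_rate` and load `busy_load`.
fn idle_agent_wins(idle_rate: f32, busy_rate: f32, busy_load: f32) -> bool {
    score(idle_rate, 0.0) > score(busy_rate, busy_load)
}

/// Quality traded for balance: idle 0.85 outranks 0.95 at 40% load.
fn quality_is_traded_for_balance() -> bool {
    idle_agent_wins(0.85, 0.95, 0.4)
}

/// Limited penalty: an agent at capacity (0.90) still outranks an idle 0.40,
/// because a full load only halves the score.
fn full_agent_can_still_win() -> bool {
    !idle_agent_wins(0.40, 0.90, 1.0)
}
```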

---

## Implementation

**Agent Load Tracking**:
```rust
// crates/vapora-swarm/src/coordinator.rs

pub struct AgentState {
    pub id: String,
    pub role: AgentRole,
    pub status: AgentStatus, // Ready, Busy, Offline
    pub in_flight_tasks: u32,
    pub max_concurrent: u32,
    pub success_rate: f32, // [0.0, 1.0]
    pub avg_latency_ms: u32,
}

impl AgentState {
    /// Current load (0.0 = idle, 1.0 = at capacity)
    pub fn current_load(&self) -> f32 {
        // Guard against 0/0 = NaN: an agent with no capacity is treated
        // as fully loaded.
        if self.max_concurrent == 0 {
            return 1.0;
        }
        (self.in_flight_tasks as f32) / (self.max_concurrent as f32)
    }

    /// Assignment score: success_rate / (1 + load)
    /// Higher = better candidate for task
    pub fn assignment_score(&self) -> f32 {
        self.success_rate / (1.0 + self.current_load())
    }
}
```

**Task Assignment Logic**:
```rust
pub async fn assign_task_to_best_agent(
    task: &Task,
    agents: &mut [AgentState],
) -> Result<String> {
    // Filter eligible agents (online: Ready or Busy).
    // Role matching against `task` is not shown in this excerpt.
    let eligible: Vec<_> = agents
        .iter()
        .filter(|a| {
            a.status == AgentStatus::Ready || a.status == AgentStatus::Busy
        })
        .collect();

    if eligible.is_empty() {
        return Err(Error::NoAgentsAvailable);
    }

    // Score each agent
    let mut scored: Vec<_> = eligible
        .iter()
        .map(|agent| {
            let score = agent.assignment_score();
            (agent.id.clone(), score)
        })
        .collect();

    // Sort by score descending
    scored.sort_by(|a, b| {
        b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
    });

    // Assign to highest scoring agent
    let selected_agent_id = scored[0].0.clone();

    // Increment in-flight counter (needs `&mut` access to `agents`)
    if let Some(agent) = agents.iter_mut().find(|a| a.id == selected_agent_id) {
        agent.in_flight_tasks += 1;
    }

    Ok(selected_agent_id)
}
```
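
To see how incrementing the in-flight counter steers later picks, the select-then-increment loop can be run several times in a row. This is a simplified, synchronous sketch under stated assumptions: `Agent` and `assign_sequence` are hypothetical stand-ins for the real types, status and role filtering are left out, and `max_concurrent` is assumed to be non-zero.

```rust
#[derive(Debug, Clone)]
struct Agent {
    id: String,
    success_rate: f32,
    in_flight_tasks: u32,
    max_concurrent: u32,
}

impl Agent {
    /// success_rate / (1 + load), where load = in_flight / max_concurrent.
    fn score(&self) -> f32 {
        let load = self.in_flight_tasks as f32 / self.max_concurrent as f32;
        self.success_rate / (1.0 + load)
    }
}

/// Assigns `n` tasks one at a time, incrementing the winner's in-flight
/// counter after each pick, and returns the chosen agent ids in order.
fn assign_sequence(agents: &mut [Agent], n: usize) -> Vec<String> {
    let mut picks = Vec::with_capacity(n);
    for _ in 0..n {
        // Highest score wins; the score is recomputed on every round.
        let best = agents
            .iter_mut()
            .max_by(|a, b| a.score().total_cmp(&b.score()));
        if let Some(agent) = best {
            agent.in_flight_tasks += 1;
            picks.push(agent.id.clone());
        }
    }
    picks
}
```

Starting from two idle agents with success rates 0.95 and 0.85 and capacity 5 each, four assignments alternate between them, because each pick lowers the winner's score below the other agent's. Like the excerpt above, this sketch does not cap `in_flight_tasks` at `max_concurrent`.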

**Load Calculation Examples**:
```
Agent A: success_rate = 0.95, in_flight = 2, max_concurrent = 5
  load  = 2/5 = 0.4
  score = 0.95 / (1 + 0.4) = 0.95 / 1.4 = 0.68

Agent B: success_rate = 0.85, in_flight = 0, max_concurrent = 5
  load  = 0/5 = 0.0
  score = 0.85 / (1 + 0.0) = 0.85 / 1.0 = 0.85  ← Selected

Agent C: success_rate = 0.90, in_flight = 5, max_concurrent = 5
  load  = 5/5 = 1.0
  score = 0.90 / (1 + 1.0) = 0.90 / 2.0 = 0.45
```
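
The same three calculations can be reproduced in code. This is a minimal sketch using free functions with hypothetical names (`current_load`, `assignment_score`, `best_index`) rather than the `AgentState` methods, and it assumes `max_concurrent > 0`.

```rust
/// Load as a fraction of capacity (0.0 = idle, 1.0 = at capacity).
fn current_load(in_flight: u32, max_concurrent: u32) -> f32 {
    in_flight as f32 / max_concurrent as f32
}

/// assignment_score = success_rate / (1 + load)
fn assignment_score(success_rate: f32, in_flight: u32, max_concurrent: u32) -> f32 {
    success_rate / (1.0 + current_load(in_flight, max_concurrent))
}

/// Index of the highest-scoring candidate, if any.
/// Each candidate is (success_rate, in_flight, max_concurrent).
fn best_index(candidates: &[(f32, u32, u32)]) -> Option<usize> {
    candidates
        .iter()
        .map(|&(rate, in_flight, max)| assignment_score(rate, in_flight, max))
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```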

**Real-Time Metrics**:
```rust
pub async fn collect_swarm_metrics(
    agents: &[AgentState],
) -> SwarmMetrics {
    // Avoid 0/0 = NaN averages when the swarm is empty.
    let n = agents.len().max(1) as f32;

    SwarmMetrics {
        total_agents: agents.len(),
        // Counted by in-flight tasks only, so an offline agent with no
        // tasks is also counted as idle.
        idle_agents: agents.iter().filter(|a| a.in_flight_tasks == 0).count(),
        busy_agents: agents.iter().filter(|a| a.in_flight_tasks > 0).count(),
        offline_agents: agents.iter().filter(|a| a.status == AgentStatus::Offline).count(),
        total_in_flight: agents.iter().map(|a| a.in_flight_tasks).sum::<u32>(),
        avg_success_rate: agents.iter().map(|a| a.success_rate).sum::<f32>() / n,
        avg_load: agents.iter().map(|a| a.current_load()).sum::<f32>() / n,
    }
}
```

**Prometheus Metrics**:
```rust
use prometheus::{Counter, Gauge, Histogram, HistogramOpts};

// Define metrics (each must also be registered with a
// `prometheus::Registry` before it is exported)
lazy_static::lazy_static! {
    static ref TASK_ASSIGNMENTS: Counter = Counter::new(
        "vapora_task_assignments_total",
        "Total task assignments"
    ).unwrap();

    static ref AGENT_LOAD: Gauge = Gauge::new(
        "vapora_agent_current_load",
        "Current agent load (0-1)"
    ).unwrap();

    // Histograms are built from `HistogramOpts`.
    static ref ASSIGNMENT_SCORE: Histogram = Histogram::with_opts(HistogramOpts::new(
        "vapora_assignment_score",
        "Assignment score distribution"
    )).unwrap();
}

// Record metrics (`best_agent` is the agent chosen by the assignment logic;
// gauges and histograms take f64 values)
TASK_ASSIGNMENTS.inc();
AGENT_LOAD.set(f64::from(best_agent.current_load()));
ASSIGNMENT_SCORE.observe(f64::from(best_agent.assignment_score()));
```

**Key Files**:
- `/crates/vapora-swarm/src/coordinator.rs` (assignment logic)
- `/crates/vapora-swarm/src/metrics.rs` (Prometheus metrics)
- `/crates/vapora-backend/src/api/` (task creation triggers assignment)

---

## Verification

```bash
# Test assignment score calculation
cargo test -p vapora-swarm test_assignment_score_calculation

# Test load factor impact
cargo test -p vapora-swarm test_load_factor_impact

# Test best agent selection
cargo test -p vapora-swarm test_select_best_agent

# Test fair distribution (no concentration)
cargo test -p vapora-swarm test_fair_distribution

# Integration: assign multiple tasks sequentially
cargo test -p vapora-swarm test_assignment_sequence

# Load balancing under stress
cargo test -p vapora-swarm test_load_balancing_stress
```

**Expected Output**:
- Agents with high success_rate + low load selected first
- Load increases after each assignment
- Fair distribution across agents
- No single agent receiving all tasks
- Metrics tracked accurately
- Scores properly reflect trade-off

---

## Consequences

### Fairness
- High-performing agents get more tasks (deserved)
- Overloaded agents get fewer tasks (protection)
- Fair distribution emerges automatically, because each assignment raises the chosen agent's load and lowers its score

### Performance
- Task latency depends on agent load (tasks may queue on busy agents)
- Peak concurrent capacity = sum of `max_concurrent` across all agents
- SLA contracts respect per-agent limits
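
The capacity bound in the list above, the sum of every agent's `max_concurrent`, can be sketched as follows. `Capacity`, `peak_capacity`, and `free_slots` are hypothetical names used only for illustration.

```rust
/// Hypothetical minimal view of an agent's capacity fields.
struct Capacity {
    in_flight_tasks: u32,
    max_concurrent: u32,
}

/// Upper bound on the number of tasks the swarm can run at once.
fn peak_capacity(agents: &[Capacity]) -> u32 {
    agents.iter().map(|a| a.max_concurrent).sum()
}

/// Slots still free across the swarm. Saturating subtraction keeps an
/// over-capacity agent from producing an underflow.
fn free_slots(agents: &[Capacity]) -> u32 {
    agents
        .iter()
        .map(|a| a.max_concurrent.saturating_sub(a.in_flight_tasks))
        .sum()
}
```

For the three agents from the load calculation examples (2/5, 0/5, and 5/5 in flight), this gives a peak capacity of 15 with 8 slots free.
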

### Scaling
- Adding agents increases total capacity
- Load automatically redistributes
- Horizontal scaling works naturally

### Monitoring
- Track assignment distribution
- Alert if concentration detected
- Identify bottleneck agents
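
One way to check for concentration is to compute each agent's share of recent assignments and flag any share above a threshold. This is a minimal sketch, not the project's monitoring code: `detect_concentration`, the assignment log passed as a slice of agent ids, and the `threshold` parameter are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Returns the agent with the largest share of assignments, together with
/// that share, if the share exceeds `threshold` (e.g. 0.5 = more than half
/// of all tasks). Returns `None` for an empty log or a balanced one.
fn detect_concentration<'a>(assigned: &[&'a str], threshold: f64) -> Option<(&'a str, f64)> {
    if assigned.is_empty() {
        return None;
    }

    // Count assignments per agent id.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for id in assigned {
        *counts.entry(*id).or_insert(0) += 1;
    }

    // Convert counts to shares and keep the largest one above the threshold.
    let total = assigned.len() as f64;
    counts
        .into_iter()
        .map(|(id, n)| (id, n as f64 / total))
        .filter(|&(_, share)| share > threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```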

---

## References

- `/crates/vapora-swarm/src/coordinator.rs` (implementation)
- `/crates/vapora-swarm/src/metrics.rs` (metrics collection)
- ADR-014 (Learning Profiles)
- ADR-018 (This ADR)

---

**Related ADRs**: ADR-014 (Learning Profiles), ADR-020 (Audit Trail)