ADR-017: Confidence Weighting in Learning Profiles
Status: Accepted | Implemented
Date: 2024-11-01
Deciders: Agent Architecture Team
Technical Story: Preventing new agents from being preferred on lucky first runs
Decision
Implement Confidence Weighting with the formula confidence = min(1.0, total_executions / 20).
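As a minimal standalone sketch of the decided formula (the production method on TaskTypeLearning appears under Implementation below):

```rust
/// Sketch of the decided formula: linear ramp in executions, capped at 1.0.
fn confidence(total_executions: u32) -> f32 {
    ((total_executions as f32) / 20.0).min(1.0)
}
```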
Rationale
- Prevents Overfitting: New agents with a single success must not become preferred
- Statistical Significance: 20 executions provide statistical confidence
- Gradual Increase: Confidence rises as the agent executes more tasks
- Prevents Lucky Streaks: Requires accumulated evidence before an agent is preferred
Alternatives Considered
❌ No Confidence Weighting
- Pros: Simple
- Cons: New agent with 1 success could be selected
❌ Higher Threshold (e.g., 50 executions)
- Pros: More statistical rigor
- Cons: Worsens the cold-start problem; new agents are effectively never selected
✅ Confidence = min(1.0, executions/20) (CHOSEN)
- Reasonable threshold; balances learning speed against lucky-streak risk (see the sketch below)
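To make the trade-off concrete, a small illustrative sketch comparing how fast confidence ramps under the chosen threshold versus the rejected one (the generalized helper is hypothetical, not part of the codebase):

```rust
/// Hypothetical helper generalizing the formula over a configurable threshold.
fn confidence_with_threshold(executions: u32, threshold: u32) -> f32 {
    ((executions as f32) / (threshold as f32)).min(1.0)
}

fn main() {
    for execs in [1u32, 5, 10, 20, 50] {
        println!(
            "{:>2} execs -> threshold 20: {:.2} | threshold 50: {:.2}",
            execs,
            confidence_with_threshold(execs, 20),
            confidence_with_threshold(execs, 50),
        );
    }
    // At 10 executions an agent is already at 0.50 confidence under the
    // chosen threshold, but only 0.20 under the stricter alternative.
}
```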
Trade-offs
Pros:
- ✅ Prevents overfitting on single success
- ✅ Reasonable learning curve (20 executions)
- ✅ Simple formula
- ✅ Transparent and explainable
Cons:
- ⚠️ Cold-start: new agents need 20 runs to reach full confidence
- ⚠️ Not adaptive (same threshold for all task types)
- ⚠️ May still reward short lucky streaks before 20 runs (see the sketch below)
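A quick illustration of that residual lucky-streak window, with hypothetical expertise values:

```rust
/// Adjusted score under the decided formula (expertise values hypothetical).
fn adjusted(expertise: f32, executions: u32) -> f32 {
    expertise * ((executions as f32) / 20.0).min(1.0)
}

fn main() {
    // Five perfect runs in a row: 0.95 expertise at 0.25 confidence = 0.2375.
    let lucky_newcomer = adjusted(0.95, 5);
    // A genuinely weak but fully ramped agent: 0.20 * 1.0 = 0.20.
    let weak_veteran = adjusted(0.20, 40);
    // Before 20 runs, the streak can still win this particular match-up.
    assert!(lucky_newcomer > weak_veteran);
}
```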
Implementation
Confidence Model:
```rust
// crates/vapora-agents/src/learning_profile.rs
impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's score.
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+.
    /// Note: f32 is not Ord, so we use f32::min rather than std::cmp::min.
    pub fn confidence(&self) -> f32 {
        ((self.executions_total as f32) / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence.
    /// Even with perfect expertise, low confidence reduces the score.
    pub fn adjusted_score(&self) -> f32 {
        let expertise = self.expertise_score();
        let confidence = self.confidence();
        expertise * confidence
    }

    // Confidence progression examples:
    //  1 exec:  confidence = 0.05 (5%)
    //  5 execs: confidence = 0.25 (25%)
    // 10 execs: confidence = 0.50 (50%)
    // 20 execs: confidence = 1.00 (100%)
}
```
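A minimal test sketch for the documented milestones; the helper constructor is hypothetical and mirrors the struct fields used in the examples later in this ADR:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    /// Hypothetical test helper: builds a profile with the given execution
    /// count and placeholder values for the remaining fields.
    fn profile_with(executions_total: u32) -> TaskTypeLearning {
        TaskTypeLearning {
            agent_id: "test-agent".to_string(),
            task_type: "code_generation".to_string(),
            executions_total,
            executions_successful: executions_total,
            avg_quality_score: 0.9,
            records: vec![],
        }
    }

    #[test]
    fn confidence_matches_documented_milestones() {
        assert!((profile_with(1).confidence() - 0.05).abs() < 1e-6);
        assert!((profile_with(5).confidence() - 0.25).abs() < 1e-6);
        assert!((profile_with(10).confidence() - 0.50).abs() < 1e-6);
        assert!((profile_with(20).confidence() - 1.0).abs() < 1e-6);
        // Capped at 1.0 past the threshold.
        assert!((profile_with(40).confidence() - 1.0).abs() < 1e-6);
    }
}
```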
Agent Selection with Confidence:
```rust
// Note: the connected client type is Surreal<Client>, with Client from
// surrealdb::engine::remote::ws.
pub async fn select_best_agent_with_confidence(
    db: &Surreal<Client>,
    task_type: &str,
) -> Result<String> {
    // Query all learning profiles for this task type. SurrealDB uses named
    // bind parameters ($task_type), not positional ones.
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;
    let mut profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence), highest first.
    // expertise_score() and confidence() are Rust methods, so ranking
    // happens here rather than in the query.
    profiles.sort_by(|a, b| {
        b.adjusted_score()
            .partial_cmp(&a.adjusted_score())
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    let best = profiles.first().ok_or(Error::NoAgentsAvailable)?;

    // Log selection with confidence for debugging.
    tracing::info!(
        "Selected agent {} with confidence {:.2}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );

    Ok(best.agent_id.clone())
}
```
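A hypothetical call site; the connection endpoint, namespace, and database names are illustrative, and Result is the same alias used above:

```rust
use surrealdb::engine::remote::ws::{Client, Ws};
use surrealdb::Surreal;

async fn route_task() -> Result<()> {
    // Illustrative connection details; adjust to the real deployment.
    let db: Surreal<Client> = Surreal::new::<Ws>("127.0.0.1:8000").await?;
    db.use_ns("vapora").use_db("agents").await?;

    let agent_id = select_best_agent_with_confidence(&db, "code_generation").await?;
    println!("Routing task to {agent_id}");
    Ok(())
}
```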
Preventing Lucky Streaks:
```rust
// Example: agent with 1 success but only 5% confidence.
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95, // Perfect on first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05.
let score = agent_1_success.adjusted_score(); // 0.95 * 0.05 = 0.0475

// This agent scores far below an established agent with
// 0.80 expertise and 0.50 confidence: 0.80 * 0.50 = 0.40 > 0.0475.

// The agent needs ~20 executions before reaching full confidence.
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ... */
};
let score = agent_20_success.adjusted_score(); // 0.95 * 1.0 = 0.95
```
Confidence Visualization:
```rust
pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0..=40)
        .map(|execs| {
            // f32::min again, since f32 is not Ord.
            let confidence = ((execs as f32) / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}

// Output:
//  0 execs: 0.00
//  1 exec:  0.05
//  2 execs: 0.10
//  5 execs: 0.25
// 10 execs: 0.50
// 20 execs: 1.00 ← Full confidence reached
// 30 execs: 1.00 ← Capped at 1.0
// 40 execs: 1.00 ← Still capped
```
Key Files:
- /crates/vapora-agents/src/learning_profile.rs (confidence calculation)
- /crates/vapora-agents/src/selector.rs (agent selection logic)
- /crates/vapora-agents/src/scoring.rs (score calculations)
Verification
```bash
# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1

# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention

# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation

# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
```
Expected Output:
- 1 execution: confidence = 0.05 (5%)
- 5 executions: confidence = 0.25 (25%)
- 10 executions: confidence = 0.50 (50%)
- 20 executions: confidence = 1.0 (100%)
- New agent with 1 success not selected over established agent
- Confidence gradually increases as agent executes more
- Adjusted score properly combines expertise and confidence
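As a sketch of what test_lucky_streak_prevention might assert; the veteran profile's values are hypothetical, and its adjusted score is stated only approximately since expertise_score() is defined elsewhere:

```rust
#[test]
fn test_lucky_streak_prevention() {
    // One perfect run: expertise high, confidence only 0.05.
    let newcomer = TaskTypeLearning {
        agent_id: "new-agent-1".to_string(),
        task_type: "code_generation".to_string(),
        executions_total: 1,
        executions_successful: 1,
        avg_quality_score: 0.95,
        records: vec![],
    };
    // Hypothetical established agent: solid but imperfect track record.
    let veteran = TaskTypeLearning {
        agent_id: "veteran-agent".to_string(),
        task_type: "code_generation".to_string(),
        executions_total: 10,
        executions_successful: 8,
        avg_quality_score: 0.80,
        records: vec![],
    };
    // Roughly 0.80 * 0.50 = 0.40 versus 0.95 * 0.05 = 0.0475.
    assert!(veteran.adjusted_score() > newcomer.adjusted_score());
}
```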
Consequences
Agent Cold-Start
- New agents require ~20 executions before reaching full score
- Longer ramp-up but prevents bad deployments
- Users understand why new agents aren't immediately selected
Agent Ranking
- Established agents (20+ executions) ranked by expertise only
- Developing agents (< 20 executions) ranked by expertise * confidence
- Creates natural progression for agent improvement
Learning Curve
- First 20 executions critical for agent adoption
- After 20, confidence no longer a limiting factor
- Encourages testing new agents early
Monitoring
- Track which agents reach 20 executions
- Identify agents stuck below 20 (poor performance)
- Celebrate agents reaching full confidence
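A hypothetical monitoring helper for the first two points, reusing the table and field names from the selection query above:

```rust
/// Hypothetical helper: list profiles still below the confidence threshold.
pub async fn agents_below_threshold(db: &Surreal<Client>) -> Result<Vec<TaskTypeLearning>> {
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE executions_total < 20")
        .await?;
    Ok(response.take(0)?)
}
```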
References
- /crates/vapora-agents/src/learning_profile.rs (implementation)
- /crates/vapora-agents/src/selector.rs (usage)
- ADR-014 (Learning Profiles)
- ADR-018 (Swarm Load Balancing)
Related ADRs: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)