ADR-017: Confidence Weighting in Learning Profiles
Status: Accepted | Implemented
Date: 2024-11-01
Deciders: Agent Architecture Team
Technical Story: Preventing new agents from being preferred on lucky first runs
Decision
Implement confidence weighting with the formula confidence = min(1.0, total_executions / 20).
Rationale
- Prevents Overfitting: new agents with a single success should not be preferred
- Statistical Significance: 20 executions provide statistical confidence in the score
- Gradual Increase: confidence rises as the agent executes more tasks
- Prevents Lucky Streaks: requires evidence before an agent is preferred
Alternatives Considered
❌ No Confidence Weighting
- Pros: Simple
- Cons: New agent with 1 success could be selected
❌ Higher Threshold (e.g., 50 executions)
- Pros: More statistical rigor
- Cons: Cold-start problem worse, new agents never selected
✅ Confidence = min(1.0, executions/20) (CHOSEN)
- Reasonable threshold that balances learning speed against lucky streaks (see the comparison sketch below)
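To make the trade-off concrete, here is a small, hypothetical comparison of how confidence ramps under the chosen 20-execution threshold versus the rejected 50-execution one. The `confidence_with_threshold` helper exists only for this illustration and is not part of the crate.

```rust
// Hypothetical helper for this comparison only; the production code
// hard-codes the 20-execution threshold (see Implementation below).
fn confidence_with_threshold(executions: u32, threshold: u32) -> f32 {
    (executions as f32 / threshold as f32).min(1.0)
}

fn main() {
    for execs in [1u32, 5, 10, 20, 50] {
        println!(
            "{:>2} execs -> threshold 20: {:.2} | threshold 50: {:.2}",
            execs,
            confidence_with_threshold(execs, 20),
            confidence_with_threshold(execs, 50),
        );
    }
    // At 20 executions the chosen formula already grants full confidence,
    // while a 50-execution threshold would still discount the agent to 0.40.
}
```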
Trade-offs
Pros:
- ✅ Prevents overfitting on single success
- ✅ Reasonable learning curve (20 executions)
- ✅ Simple formula
- ✅ Transparent and explainable
Cons:
- ⚠️ Cold-start: new agents take 20 runs to full confidence
- ⚠️ Not adaptive (same threshold for all task types)
- ⚠️ May still allow partial lucky streaks before 20 runs (quantified in the sketch below)
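A rough illustration of that last point, using invented expertise numbers: an agent on a 10-for-10 streak already carries 0.50 confidence, which can be enough to outrank a mediocre but fully established agent.

```rust
// Illustrative numbers only; the expertise values are invented for this example.
fn adjusted(expertise: f32, executions: u32) -> f32 {
    expertise * (executions as f32 / 20.0).min(1.0)
}

fn main() {
    let streaky = adjusted(0.95, 10);     // 0.95 * 0.50 = 0.475
    let established = adjusted(0.45, 40); // 0.45 * 1.00 = 0.45
    assert!(streaky > established); // the streaky agent still wins here
    println!("streaky: {streaky:.3}, established: {established:.3}");
}
```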
Implementation
Confidence Model:
```rust
// crates/vapora-agents/src/learning_profile.rs
impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's score.
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+.
    pub fn confidence(&self) -> f32 {
        (self.executions_total as f32 / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence.
    /// Even with perfect expertise, low confidence reduces the score.
    pub fn adjusted_score(&self) -> f32 {
        let expertise = self.expertise_score();
        let confidence = self.confidence();
        expertise * confidence
    }

    // Confidence progression examples:
    //  1 exec:  confidence = 0.05 (5%)
    //  5 execs: confidence = 0.25 (25%)
    // 10 execs: confidence = 0.50 (50%)
    // 20 execs: confidence = 1.00 (100%)
}
```
Agent Selection with Confidence:
```rust
pub async fn select_best_agent_with_confidence(
    db: &Surreal<Ws>,
    task_type: &str,
) -> Result<String> {
    // Query all agents for this task type, ranked by expertise * confidence
    let mut response = db
        .query(
            "SELECT agent_id, executions_total, expertise_score, confidence \
             FROM task_type_learning \
             WHERE task_type = $task_type \
             ORDER BY (expertise_score * confidence) DESC \
             LIMIT 5",
        )
        .bind(("task_type", task_type.to_string()))
        .await?;

    let candidates: Vec<TaskTypeLearning> = response.take(0)?;
    let best = candidates.first().ok_or(Error::NoAgentsAvailable)?;

    // Log selection with confidence for debugging
    tracing::info!(
        "Selected agent {} with confidence {:.2}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );

    Ok(best.agent_id.clone())
}
```
Preventing Lucky Streaks:
```rust
// Example: agent with 1 success but only 5% confidence
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95, // Perfect on first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05
let score = agent_1_success.adjusted_score(); // 0.95 * 0.05 = 0.0475

// This agent scores much lower than an established agent with
// 0.80 expertise and 0.50 confidence: 0.80 * 0.50 = 0.40 > 0.0475

// An agent needs ~20 successes before reaching full confidence
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ... */
};
let score = agent_20_success.adjusted_score(); // 0.95 * 1.0 = 0.95
```
Confidence Visualization:
```rust
pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0..=40)
        .map(|execs| {
            let confidence = (execs as f32 / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}

// Output:
//  0 execs: 0.00
//  1 exec:  0.05
//  2 execs: 0.10
//  5 execs: 0.25
// 10 execs: 0.50
// 20 execs: 1.00 ← Full confidence reached
// 30 execs: 1.00 ← Capped at 1.0
// 40 execs: 1.00 ← Still capped
```
Key Files:
- /crates/vapora-agents/src/learning_profile.rs (confidence calculation)
- /crates/vapora-agents/src/selector.rs (agent selection logic)
- /crates/vapora-agents/src/scoring.rs (score calculations)
Verification
```bash
# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1

# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention

# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation

# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
```
Expected Output:
- 1 execution: confidence = 0.05 (5%)
- 5 executions: confidence = 0.25 (25%)
- 10 executions: confidence = 0.50 (50%)
- 20 executions: confidence = 1.0 (100%)
- New agent with 1 success not selected over established agent
- Confidence gradually increases as agent executes more
- Adjusted score properly combines expertise and confidence
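As a sketch of what the milestone tests above might assert, written against a standalone copy of the formula rather than the actual TaskTypeLearning type:

```rust
// Standalone copy of the formula for illustration; the real tests would call
// TaskTypeLearning::confidence() in crates/vapora-agents.
fn confidence(executions_total: u32) -> f32 {
    (executions_total as f32 / 20.0).min(1.0)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn confidence_at_key_milestones() {
        assert!((confidence(1) - 0.05).abs() < 1e-6);
        assert!((confidence(5) - 0.25).abs() < 1e-6);
        assert!((confidence(10) - 0.50).abs() < 1e-6);
        assert!((confidence(20) - 1.00).abs() < 1e-6);
    }

    #[test]
    fn confidence_caps_at_one() {
        assert_eq!(confidence(30), 1.0);
        assert_eq!(confidence(40), 1.0);
    }
}
```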
Consequences
Agent Cold-Start
- New agents require ~20 successful executions before reaching full score
- Longer ramp-up but prevents bad deployments
- Users understand why new agents aren't immediately selected
Agent Ranking
- Established agents (20+ executions) ranked by expertise only
- Developing agents (< 20 executions) ranked by expertise * confidence
- Creates natural progression for agent improvement
Learning Curve
- First 20 executions critical for agent adoption
- After 20, confidence no longer a limiting factor
- Encourages testing new agents early
Monitoring
- Track which agents reach 20 executions
- Identify agents stuck below 20 executions (poor performance); a minimal filter for this is sketched below
- Celebrate agents reaching full confidence
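One way to support that monitoring, sketched with a minimal stand-in type (the real profile struct lives in crates/vapora-agents/src/learning_profile.rs):

```rust
// Minimal stand-in for the fields this check needs; field names follow the
// Implementation section above.
struct ProfileSummary {
    agent_id: String,
    executions_total: u32,
}

/// Agents still ramping toward full confidence (fewer than 20 executions).
fn agents_below_full_confidence(profiles: &[ProfileSummary]) -> Vec<&ProfileSummary> {
    profiles
        .iter()
        .filter(|p| p.executions_total < 20)
        .collect()
}

fn main() {
    let profiles = vec![
        ProfileSummary { agent_id: "established".into(), executions_total: 35 },
        ProfileSummary { agent_id: "stuck".into(), executions_total: 7 },
    ];
    for p in agents_below_full_confidence(&profiles) {
        println!("{} has only {} executions", p.agent_id, p.executions_total);
    }
}
```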
References
- /crates/vapora-agents/src/learning_profile.rs (implementation)
- /crates/vapora-agents/src/selector.rs (usage)
- ADR-014 (Learning Profiles)
- ADR-018 (Swarm Load Balancing)
Related ADRs: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)