# ADR-017: Confidence Weighting in Learning Profiles

**Status**: Accepted | Implemented
**Date**: 2024-11-01
**Deciders**: Agent Architecture Team
**Technical Story**: Preventing new agents from being preferred on lucky first runs

---

## Decision

Implement **Confidence Weighting** with the formula `confidence = min(1.0, total_executions / 20)`.

---

## Rationale

1. **Prevents Overfitting**: New agents with a single success should not become preferred
2. **Statistical Significance**: 20 executions provide reasonable statistical confidence
3. **Gradual Increase**: Confidence rises as the agent executes more tasks
4. **Prevents Lucky Streaks**: Requires accumulated evidence before preference

---

## Alternatives Considered

### ❌ No Confidence Weighting
- **Pros**: Simple
- **Cons**: New agent with 1 success could be selected

### ❌ Higher Threshold (e.g., 50 executions)
- **Pros**: More statistical rigor
- **Cons**: Worse cold-start problem; new agents would rarely be selected

### ✅ Confidence = min(1.0, executions / 20) (CHOSEN)
- Reasonable threshold; balances learning speed against lucky streaks

---

## Trade-offs

**Pros**:
- ✅ Prevents overfitting on a single success
- ✅ Reasonable learning curve (20 executions)
- ✅ Simple formula
- ✅ Transparent and explainable

**Cons**:
- ⚠️ Cold-start: new agents take 20 runs to reach full confidence
- ⚠️ Not adaptive (same threshold for all task types)
- ⚠️ May still allow lucky streaks (before 20 runs)

---

## Implementation

**Confidence Model**:

```rust
// crates/vapora-agents/src/learning_profile.rs
impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's score.
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+
    pub fn confidence(&self) -> f32 {
        ((self.executions_total as f32) / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence.
    /// Even with perfect expertise, low confidence reduces the score.
    pub fn adjusted_score(&self) -> f32 {
        let expertise = self.expertise_score();
        let confidence = self.confidence();
        expertise * confidence
    }

    // Confidence progression examples:
    //  1 exec:   confidence = 0.05 (5%)
    //  5 execs:  confidence = 0.25 (25%)
    // 10 execs:  confidence = 0.50 (50%)
    // 20 execs:  confidence = 1.0  (100%)
}
```

**Agent Selection with Confidence**:

```rust
// crates/vapora-agents/src/selector.rs
use surrealdb::{engine::any::Any, Surreal};

pub async fn select_best_agent_with_confidence(
    db: &Surreal<Any>,
    task_type: &str,
) -> Result<String> {
    // Query all learning profiles for this task type
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;

    let profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence) and pick the best
    let best = profiles
        .iter()
        .max_by(|a, b| {
            a.adjusted_score()
                .partial_cmp(&b.adjusted_score())
                .unwrap_or(std::cmp::Ordering::Equal)
        })
        .ok_or(Error::NoAgentsAvailable)?;

    // Log selection with confidence for debugging
    tracing::info!(
        "Selected agent {} with confidence {:.2}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );

    Ok(best.agent_id.clone())
}
```

**Preventing Lucky Streaks**:

```rust
// Example: agent with 1 success but only 5% confidence
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95, // Perfect on first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05
let score = agent_1_success.adjusted_score();
// 0.95 * 0.05 = 0.0475

// This agent scores much lower than an established agent with
// 0.80 expertise and 0.50 confidence: 0.80 * 0.50 = 0.40 > 0.0475

// An agent needs ~20 executions before reaching full confidence
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ... */
};
let score = agent_20_success.adjusted_score();
// 0.95 * 1.0 = 0.95
```

**Confidence Visualization**:

```rust
pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0u32..=40)
        .map(|execs| {
            let confidence = ((execs as f32) / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}

// Output:
//  0 execs: 0.00
//  1 exec:  0.05
//  2 execs: 0.10
//  5 execs: 0.25
// 10 execs: 0.50
// 20 execs: 1.00 ← Full confidence reached
// 30 execs: 1.00 ← Capped at 1.0
// 40 execs: 1.00 ← Still capped
```

**Key Files**:
- `/crates/vapora-agents/src/learning_profile.rs` (confidence calculation)
- `/crates/vapora-agents/src/selector.rs` (agent selection logic)
- `/crates/vapora-agents/src/scoring.rs` (score calculations)

---

## Verification

```bash
# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1

# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention

# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation

# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
```

**Expected Output**:
- 1 execution: confidence = 0.05 (5%)
- 5 executions: confidence = 0.25 (25%)
- 10 executions: confidence = 0.50 (50%)
- 20 executions: confidence = 1.0 (100%)
- New agent with 1 success not selected over established agent
- Confidence gradually increases as the agent executes more
- Adjusted score properly combines expertise and confidence
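The milestone tests invoked above are not reproduced in this ADR. The following is a minimal sketch of what they could look like, assuming `TaskTypeLearning` has exactly the fields shown in the "Preventing Lucky Streaks" example and that `confidence()` is the method from the Confidence Model block; the `profile_with` helper is hypothetical.

```rust
#[cfg(test)]
mod confidence_milestone_tests {
    use super::*;

    // Hypothetical helper: a profile with `n` total executions and
    // otherwise neutral values, matching the fields used in this ADR.
    fn profile_with(n: u32) -> TaskTypeLearning {
        TaskTypeLearning {
            agent_id: "agent-under-test".to_string(),
            task_type: "code_generation".to_string(),
            executions_total: n,
            executions_successful: n,
            avg_quality_score: 0.90,
            records: vec![],
        }
    }

    #[test]
    fn test_confidence_at_1_exec() {
        // 1 / 20 = 0.05
        assert!((profile_with(1).confidence() - 0.05).abs() < 1e-6);
    }

    #[test]
    fn test_confidence_at_20_execs() {
        // 20 / 20 = 1.0: full confidence reached
        assert_eq!(profile_with(20).confidence(), 1.0);
    }

    #[test]
    fn test_confidence_cap_at_1() {
        // 40 executions must not push confidence above 1.0
        assert_eq!(profile_with(40).confidence(), 1.0);
    }
}
```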

---

## Consequences

### Agent Cold-Start
- New agents require ~20 executions before reaching their full score
- Longer ramp-up, but prevents bad deployments
- Users understand why new agents aren't immediately selected

### Agent Ranking
- Established agents (20+ executions) are effectively ranked by expertise alone
- Developing agents (< 20 executions) are ranked by expertise * confidence
- Creates a natural progression for agent improvement

### Learning Curve
- The first 20 executions are critical for agent adoption
- After 20, confidence is no longer a limiting factor
- Encourages testing new agents early

### Monitoring
- Track which agents reach 20 executions
- Identify agents stuck below 20 (poor performance)
- Celebrate agents reaching full confidence

---

## References

- `/crates/vapora-agents/src/learning_profile.rs` (implementation)
- `/crates/vapora-agents/src/selector.rs` (usage)
- ADR-014 (Learning Profiles)
- ADR-018 (Swarm Load Balancing)

---

**Related ADRs**: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)