# ADR-017: Confidence Weighting in Learning Profiles
**Status**: Accepted | Implemented
**Date**: 2024-11-01
**Deciders**: Agent Architecture Team
**Technical Story**: Preventing new agents from being preferred on lucky first runs
---
## Decision
Implement **Confidence Weighting** with the formula `confidence = min(1.0, total_executions / 20)`.
---
## Rationale
1. **Prevents Overfitting**: A new agent with a single success should not become the preferred agent
2. **Statistical Significance**: 20 executions give the observed success rate a meaningful statistical footing (see the back-of-envelope sketch below)
3. **Gradual Increase**: Confidence rises as the agent completes more tasks
4. **Prevents Lucky Streaks**: Requires accumulated evidence before an agent is preferred
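To put the "statistical significance" point on rough footing, here is a back-of-envelope sketch (illustrative only, not part of the codebase): the standard error of an observed success rate shrinks with the square root of the number of independent runs, and at 20 runs it is already down to about 0.11.
```rust
/// Illustrative: standard error of an observed success rate p_hat
/// after n runs, treating runs as independent Bernoulli trials.
fn standard_error(p_hat: f64, n: u32) -> f64 {
    (p_hat * (1.0 - p_hat) / n as f64).sqrt()
}

fn main() {
    // Worst case p_hat = 0.5: uncertainty shrinks as n grows.
    for n in [1u32, 5, 10, 20] {
        println!("n = {:>2}: standard error ~ {:.2}", n, standard_error(0.5, n));
    }
    // n =  1: 0.50 (a single run says almost nothing)
    // n = 20: 0.11 (the success rate is meaningfully constrained)
}
```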
---
## Alternatives Considered
### ❌ No Confidence Weighting
- **Pros**: Simple
- **Cons**: New agent with 1 success could be selected
### ❌ Higher Threshold (e.g., 50 executions)
- **Pros**: More statistical rigor
- **Cons**: Worsens the cold-start problem; new agents may effectively never be selected
### ✅ Confidence = min(1.0, executions/20) (CHOSEN)
- Reasonable threshold: balances learning speed against lucky streaks (compare the ramp sketch below)
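A small sketch (same formula, hypothetical standalone function) comparing the chosen divisor of 20 against the rejected 50 makes the cold-start trade-off concrete:
```rust
/// Sketch: confidence under two candidate thresholds.
/// The divisor is the only free parameter in the chosen formula.
fn confidence(executions: u32, threshold: f32) -> f32 {
    (executions as f32 / threshold).min(1.0)
}

fn main() {
    for execs in [1u32, 5, 10, 20, 50] {
        println!(
            "{:>2} execs -> threshold 20: {:.2}, threshold 50: {:.2}",
            execs,
            confidence(execs, 20.0),
            confidence(execs, 50.0)
        );
    }
    // At 10 executions an agent already carries 50% confidence under
    // the chosen threshold, but only 20% under a threshold of 50.
}
```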
---
## Trade-offs
**Pros**:
- ✅ Prevents overfitting on single success
- ✅ Reasonable learning curve (20 executions)
- ✅ Simple formula
- ✅ Transparent and explainable
**Cons**:
- ⚠️ Cold-start: new agents take 20 runs to full confidence
- ⚠️ Not adaptive (same threshold for all task types)
- ⚠️ Lucky streaks are dampened, not eliminated, below 20 runs
---
## Implementation
**Confidence Model**:
```rust
// crates/vapora-agents/src/learning_profile.rs
impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's expertise.
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+.
    /// f32 is not Ord, so this uses f32::min instead of std::cmp::min.
    pub fn confidence(&self) -> f32 {
        (self.executions_total as f32 / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence.
    /// Even with perfect expertise, low confidence reduces the score.
    pub fn adjusted_score(&self) -> f32 {
        self.expertise_score() * self.confidence()
    }

    // Confidence progression examples:
    //  1 exec:  confidence = 0.05 (5%)
    //  5 execs: confidence = 0.25 (25%)
    // 10 execs: confidence = 0.50 (50%)
    // 20 execs: confidence = 1.00 (100%)
}
```
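A minimal unit-test sketch for these milestones; the `profile_with` helper and its field values are illustrative, and the real struct may carry additional fields:
```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Helper: a profile where everything but the execution count is fixed.
    fn profile_with(executions_total: u32) -> TaskTypeLearning {
        TaskTypeLearning {
            agent_id: "agent-under-test".to_string(),
            task_type: "code_generation".to_string(),
            executions_total,
            executions_successful: executions_total,
            avg_quality_score: 0.9,
            records: vec![],
        }
    }

    #[test]
    fn test_confidence_milestones() {
        assert!((profile_with(1).confidence() - 0.05).abs() < f32::EPSILON);
        assert!((profile_with(5).confidence() - 0.25).abs() < f32::EPSILON);
        assert!((profile_with(10).confidence() - 0.50).abs() < f32::EPSILON);
        assert!((profile_with(20).confidence() - 1.00).abs() < f32::EPSILON);
        // Capped at 1.0 past the threshold.
        assert_eq!(profile_with(40).confidence(), 1.0);
    }
}
```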
**Agent Selection with Confidence**:
```rust
pub async fn select_best_agent_with_confidence(
    db: &Surreal<Ws>,
    task_type: &str,
) -> Result<String> {
    // Fetch all learning profiles for this task type. SurrealDB uses
    // named parameters bound as ("name", value) tuples, not positional $1.
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;
    let mut profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence), best first.
    profiles.sort_by(|a, b| {
        b.adjusted_score()
            .partial_cmp(&a.adjusted_score())
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    let best = profiles.first().ok_or(Error::NoAgentsAvailable)?;

    // Log the selection with its confidence for debugging.
    tracing::info!(
        "Selected agent {} with confidence {:.0}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );
    Ok(best.agent_id.clone())
}
```
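Scoring is applied client-side because `expertise_score()` and `confidence()` are methods on `TaskTypeLearning`, not database functions; pushing the formula into the query would mean duplicating it as a database-side expression and keeping the two copies in sync.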
**Preventing Lucky Streaks**:
```rust
// Example: an agent with one perfect run, but only 5% confidence.
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95, // perfect on the first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05.
let score = agent_1_success.adjusted_score(); // 0.95 * 0.05 = 0.0475

// This scores far below an established agent with 0.80 expertise
// and 0.50 confidence: 0.80 * 0.50 = 0.40 > 0.0475.

// The agent needs ~20 executions before reaching full confidence.
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ...remaining fields as above... */
};
let score = agent_20_success.adjusted_score(); // 0.95 * 1.0 = 0.95
```
**Confidence Visualization**:
```rust
pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0u32..=40)
        .map(|execs| {
            // f32::min, since floats are not Ord.
            let confidence = (execs as f32 / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}
// Output:
//  0 execs: 0.00
//  1 exec:  0.05
//  2 execs: 0.10
//  5 execs: 0.25
// 10 execs: 0.50
// 20 execs: 1.00 ← Full confidence reached
// 30 execs: 1.00 ← Capped at 1.0
// 40 execs: 1.00 ← Still capped
```
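A quick way to eyeball the ramp, using only the `confidence_ramp()` helper above:
```rust
fn main() {
    for (execs, confidence) in confidence_ramp() {
        // One '#' per 5% of confidence, so the cap at 1.0 is visible.
        let bar = "#".repeat((confidence * 20.0) as usize);
        println!("{:>2} execs | {:<20} | {:.2}", execs, bar, confidence);
    }
}
```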
**Key Files**:
- `/crates/vapora-agents/src/learning_profile.rs` (confidence calculation)
- `/crates/vapora-agents/src/selector.rs` (agent selection logic)
- `/crates/vapora-agents/src/scoring.rs` (score calculations)
---
## Verification
```bash
# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1
# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention
# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation
# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
```
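As a sketch of what `test_lucky_streak_prevention` might assert (the `profile` fixture helper is hypothetical, and the expected numbers assume `expertise_score()` roughly tracks `avg_quality_score`):
```rust
#[test]
fn test_lucky_streak_prevention() {
    // Hypothetical fixtures: profile(executions, avg_quality_score).
    let newcomer = profile(1, 0.95); // one perfect run
    let veteran = profile(10, 0.80); // solid, imperfect history

    // Roughly 0.95 * 0.05 = 0.0475 vs 0.80 * 0.50 = 0.40:
    // the veteran wins despite the newcomer's perfect first run.
    assert!(veteran.adjusted_score() > newcomer.adjusted_score());
}
```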
**Expected Output**:
- 1 execution: confidence = 0.05 (5%)
- 5 executions: confidence = 0.25 (25%)
- 10 executions: confidence = 0.50 (50%)
- 20 executions: confidence = 1.0 (100%)
- New agent with 1 success not selected over established agent
- Confidence gradually increases as agent executes more
- Adjusted score properly combines expertise and confidence
---
## Consequences
### Agent Cold-Start
- New agents require ~20 executions before reaching full confidence (the formula counts total executions, not successes)
- Longer ramp-up but prevents bad deployments
- Users understand why new agents aren't immediately selected
### Agent Ranking
- Established agents (20+ executions) ranked by expertise only
- Developing agents (< 20 executions) ranked by expertise * confidence
- Creates natural progression for agent improvement
### Learning Curve
- First 20 executions critical for agent adoption
- After 20, confidence no longer a limiting factor
- Encourages testing new agents early
### Monitoring
- Track which agents reach 20 executions
- Identify agents stuck below 20 executions (poor performance); a query sketch follows below
- Celebrate agents reaching full confidence
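A query along these lines could back that monitoring (a sketch reusing the table and fields from the Implementation section; the function name is hypothetical):
```rust
/// Sketch: list agents still below full confidence for a task type,
/// so operators can spot agents stuck under 20 executions.
pub async fn agents_below_full_confidence(
    db: &Surreal<Ws>,
    task_type: &str,
) -> Result<Vec<TaskTypeLearning>> {
    let mut response = db
        .query(
            "SELECT * FROM task_type_learning \
             WHERE task_type = $task_type AND executions_total < 20 \
             ORDER BY executions_total DESC",
        )
        .bind(("task_type", task_type.to_string()))
        .await?;
    Ok(response.take(0)?)
}
```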
---
## References
- `/crates/vapora-agents/src/learning_profile.rs` (implementation)
- `/crates/vapora-agents/src/selector.rs` (usage)
- ADR-014 (Learning Profiles)
- ADR-018 (Swarm Load Balancing)
---
**Related ADRs**: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)