# ADR-017: Confidence Weighting in Learning Profiles

**Status**: Accepted | Implemented

**Date**: 2024-11-01

**Deciders**: Agent Architecture Team

**Technical Story**: Preventing new agents from being preferred on lucky first runs

---
## Decision

Implement **Confidence Weighting** with the formula `confidence = min(1.0, total_executions / 20)`.

---
## Rationale

1. **Prevents Overfitting**: New agents with a single success should not be preferred
2. **Statistical Significance**: 20 executions provides reasonable statistical confidence
3. **Gradual Increase**: Confidence rises as the agent completes more tasks
4. **Prevents Lucky Streaks**: Requires accumulated evidence before an agent is preferred

---
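As a rough back-of-envelope check on point 2 (our own sanity check, not from the ADR): the standard error of an observed success rate after `n` trials is `sqrt(p(1-p)/n)`, which in the worst case (`p = 0.5`) falls to about ±0.11 at `n = 20` — small enough to rank agents meaningfully, while `n = 1` leaves ±0.5 uncertainty.

```rust
// Back-of-envelope: worst-case standard error of a success-rate estimate
// after n trials. sqrt(p(1-p)/n) is maximized at p = 0.5.
fn standard_error(p: f64, n: u32) -> f64 {
    (p * (1.0 - p) / n as f64).sqrt()
}

fn main() {
    for n in [1u32, 5, 20, 50] {
        // n = 20 gives ~0.112; n = 1 gives 0.5
        println!("n = {:2}: worst-case SE = {:.3}", n, standard_error(0.5, n));
    }
}
```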
## Alternatives Considered

### ❌ No Confidence Weighting

- **Pros**: Simple
- **Cons**: A new agent with one lucky success could be selected over proven agents

### ❌ Higher Threshold (e.g., 50 executions)

- **Pros**: More statistical rigor
- **Cons**: Worsens the cold-start problem; new agents would take far too long to be selected

### ✅ Confidence = min(1.0, executions/20) (CHOSEN)

- Reasonable threshold that balances learning speed against the risk of lucky streaks

---
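To make the threshold comparison concrete, here is a small standalone sketch (the `confidence_with` helper and the numbers are illustrative, not project code) counting how many executions a strong new agent (expertise 0.95) needs before its adjusted score overtakes an established agent scoring 0.40, under each threshold:

```rust
// Illustrative comparison of confidence thresholds (20 vs. 50).
// `confidence_with` is a hypothetical helper, not part of the codebase.
fn confidence_with(threshold: f32, executions: u32) -> f32 {
    (executions as f32 / threshold).min(1.0)
}

fn main() {
    let expertise = 0.95_f32; // new agent's expertise
    let established = 0.40_f32; // established agent's adjusted score

    for threshold in [20.0_f32, 50.0] {
        // First execution count where the new agent overtakes the established one
        let needed = (1u32..)
            .find(|&n| expertise * confidence_with(threshold, n) > established)
            .unwrap();
        println!("threshold {threshold}: overtakes after {needed} executions");
    }
}
```

With the chosen threshold of 20 the overtake happens after 9 executions; at 50 it would take 22, more than doubling the ramp-up.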
## Trade-offs

**Pros**:

- ✅ Prevents overfitting on a single success
- ✅ Reasonable learning curve (20 executions)
- ✅ Simple formula
- ✅ Transparent and explainable

**Cons**:

- ⚠️ Cold-start: new agents take 20 runs to reach full confidence
- ⚠️ Not adaptive (same threshold for all task types)
- ⚠️ May still allow lucky streaks (before 20 runs)

---
## Implementation

**Confidence Model**:

```rust
// crates/vapora-agents/src/learning_profile.rs

impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's score.
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+.
    pub fn confidence(&self) -> f32 {
        // `std::cmp::min` requires `Ord` and does not work on floats;
        // use `f32::min` instead.
        ((self.executions_total as f32) / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence.
    /// Even with perfect expertise, low confidence reduces the score.
    pub fn adjusted_score(&self) -> f32 {
        let expertise = self.expertise_score();
        let confidence = self.confidence();
        expertise * confidence
    }

    // Confidence progression examples:
    //  1 exec:  confidence = 0.05 (5%)
    //  5 execs: confidence = 0.25 (25%)
    // 10 execs: confidence = 0.50 (50%)
    // 20 execs: confidence = 1.0  (100%)
}
```
**Agent Selection with Confidence**:

```rust
pub async fn select_best_agent_with_confidence(
    db: &Surreal<Ws>,
    task_type: &str,
) -> Result<String> {
    // Query all learning profiles for this task type.
    // expertise_score() and confidence() are Rust methods, not database
    // functions, so ranking happens in Rust after fetching the records.
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;

    let profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence)
    let best = profiles
        .into_iter()
        .max_by(|a, b| {
            a.adjusted_score()
                .partial_cmp(&b.adjusted_score())
                .unwrap_or(std::cmp::Ordering::Equal)
        })
        .ok_or(Error::NoAgentsAvailable)?;

    // Log selection with confidence for debugging
    tracing::info!(
        "Selected agent {} with confidence {:.2}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );

    Ok(best.agent_id)
}
```
**Preventing Lucky Streaks**:

```rust
// Example: an agent with 1 success gets only 5% confidence
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95, // Perfect on first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05
let score = agent_1_success.adjusted_score(); // 0.95 * 0.05 = 0.0475
// This agent scores much lower than an established agent with
// 0.80 expertise and 0.50 confidence: 0.80 * 0.50 = 0.40 > 0.0475

// An agent needs ~20 executions before reaching full confidence
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ... */
};

let score = agent_20_success.adjusted_score(); // 0.95 * 1.0 = 0.95
```
**Confidence Visualization**:

```rust
pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0..=40)
        .map(|execs| {
            // f32::min, since std::cmp::min does not accept floats
            let confidence = ((execs as f32) / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}

// Output:
//  0 execs: 0.00
//  1 exec:  0.05
//  2 execs: 0.10
//  5 execs: 0.25
// 10 execs: 0.50
// 20 execs: 1.00 ← Full confidence reached
// 30 execs: 1.00 ← Capped at 1.0
// 40 execs: 1.00 ← Still capped
```
**Key Files**:

- `/crates/vapora-agents/src/learning_profile.rs` (confidence calculation)
- `/crates/vapora-agents/src/selector.rs` (agent selection logic)
- `/crates/vapora-agents/src/scoring.rs` (score calculations)

---
## Verification

```bash
# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1

# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention

# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation

# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
```

**Expected Output**:

- 1 execution: confidence = 0.05 (5%)
- 5 executions: confidence = 0.25 (25%)
- 10 executions: confidence = 0.50 (50%)
- 20 executions: confidence = 1.0 (100%)
- New agent with 1 success not selected over established agent
- Confidence gradually increases as agent executes more
- Adjusted score properly combines expertise and confidence

---
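A self-contained sketch of what the milestone tests might look like (the struct is reduced to the single field the formula needs; the real tests live in `vapora-agents`):

```rust
// Hypothetical, minimal sketch of the confidence milestone tests.
// Only the field the formula depends on is modeled here.
struct TaskTypeLearning {
    executions_total: u32,
}

impl TaskTypeLearning {
    fn confidence(&self) -> f32 {
        (self.executions_total as f32 / 20.0).min(1.0)
    }
}

#[test]
fn test_confidence_milestones() {
    // (executions, expected confidence) per the ramp table above
    for (execs, expected) in [(1, 0.05), (5, 0.25), (10, 0.50), (20, 1.0), (40, 1.0)] {
        let profile = TaskTypeLearning { executions_total: execs };
        assert!((profile.confidence() - expected).abs() < f32::EPSILON);
    }
}
```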
## Consequences

### Agent Cold-Start

- New agents require ~20 successful executions before reaching full score
- Longer ramp-up, but prevents bad deployments
- Users understand why new agents aren't immediately selected

### Agent Ranking

- Established agents (20+ executions) are ranked by expertise only
- Developing agents (< 20 executions) are ranked by expertise * confidence
- Creates a natural progression for agent improvement

### Learning Curve

- First 20 executions are critical for agent adoption
- After 20, confidence is no longer a limiting factor
- Encourages testing new agents early

### Monitoring

- Track which agents reach 20 executions
- Identify agents stuck below 20 (poor performance)
- Celebrate agents reaching full confidence

---
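The monitoring split above can be sketched as a simple partition over profiles (the `AgentProgress` struct and helper are illustrative, not project code):

```rust
// Illustrative: partition agents into established vs. still-ramping,
// per the 20-execution full-confidence threshold. Not project code.
struct AgentProgress {
    agent_id: String,
    executions_total: u32,
}

fn partition_by_confidence(profiles: Vec<AgentProgress>) -> (Vec<String>, Vec<String>) {
    let (established, ramping): (Vec<_>, Vec<_>) =
        profiles.into_iter().partition(|p| p.executions_total >= 20);
    (
        established.into_iter().map(|p| p.agent_id).collect(),
        ramping.into_iter().map(|p| p.agent_id).collect(),
    )
}
```

Agents that stay in the ramping bucket for a long time are the ones worth investigating for poor performance.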
## References

- `/crates/vapora-agents/src/learning_profile.rs` (implementation)
- `/crates/vapora-agents/src/selector.rs` (usage)
- ADR-014 (Learning Profiles)
- ADR-018 (Swarm Load Balancing)

---

**Related ADRs**: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)