# ADR-017: Confidence Weighting in Learning Profiles
**Status**: Accepted | Implemented
**Date**: 2024-11-01
**Deciders**: Agent Architecture Team
**Technical Story**: Preventing new agents from being preferred on lucky first runs
---
## Decision
Implement **Confidence Weighting** with the formula `confidence = min(1.0, total_executions / 20)`.
---
## Rationale
1. **Prevents Overfitting**: A new agent with a single success should not become the preferred agent
2. **Statistical Significance**: 20 executions give the observed success rate a meaningful statistical footing (see the back-of-envelope sketch below)
3. **Gradual Increase**: Confidence rises as the agent completes more tasks
4. **Prevents Lucky Streaks**: Requires accumulated evidence before an agent is preferred
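To put the "statistical significance" point on rough footing, here is a back-of-envelope sketch (illustrative only, not part of the codebase): the standard error of an observed success rate shrinks with the square root of the number of independent runs, and at 20 runs it is already down to about 0.11.
```rust
/// Illustrative: standard error of an observed success rate p_hat
/// after n runs, treating runs as independent Bernoulli trials.
fn standard_error(p_hat: f64, n: u32) -> f64 {
    (p_hat * (1.0 - p_hat) / n as f64).sqrt()
}

fn main() {
    // Worst case p_hat = 0.5: uncertainty shrinks as n grows.
    for n in [1u32, 5, 10, 20] {
        println!("n = {:>2}: standard error ~ {:.2}", n, standard_error(0.5, n));
    }
    // n =  1: 0.50 (a single run says almost nothing)
    // n = 20: 0.11 (the success rate is meaningfully constrained)
}
```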
---
## Alternatives Considered
### ❌ No Confidence Weighting
- **Pros**: Simple
- **Cons**: New agent with 1 success could be selected
### ❌ Higher Threshold (e.g., 50 executions)
- **Pros**: More statistical rigor
- **Cons**: Worsens the cold-start problem; new agents may effectively never be selected
### ✅ Confidence = min(1.0, executions/20) (CHOSEN)
- Reasonable threshold: balances learning speed against lucky streaks (compare the ramp sketch below)
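A small sketch (same formula, hypothetical standalone function) comparing the chosen divisor of 20 against the rejected 50 makes the cold-start trade-off concrete:
```rust
/// Sketch: confidence under two candidate thresholds.
/// The divisor is the only free parameter in the chosen formula.
fn confidence(executions: u32, threshold: f32) -> f32 {
    (executions as f32 / threshold).min(1.0)
}

fn main() {
    for execs in [1u32, 5, 10, 20, 50] {
        println!(
            "{:>2} execs -> threshold 20: {:.2}, threshold 50: {:.2}",
            execs,
            confidence(execs, 20.0),
            confidence(execs, 50.0)
        );
    }
    // At 10 executions an agent already carries 50% confidence under
    // the chosen threshold, but only 20% under a threshold of 50.
}
```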
---
## Trade-offs
**Pros**:
- ✅ Prevents overfitting on single success
- ✅ Reasonable learning curve (20 executions)
- ✅ Simple formula
- ✅ Transparent and explainable
**Cons**:
- ⚠️ Cold-start: new agents take 20 runs to full confidence
- ⚠️ Not adaptive (same threshold for all task types)
- ⚠️ Lucky streaks are dampened, not eliminated, below 20 runs
---
## Implementation
**Confidence Model**:
```rust
// crates/vapora-agents/src/learning_profile.rs
impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's expertise.
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+.
    /// f32 is not Ord, so this uses f32::min instead of std::cmp::min.
    pub fn confidence(&self) -> f32 {
        (self.executions_total as f32 / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence.
    /// Even with perfect expertise, low confidence reduces the score.
    pub fn adjusted_score(&self) -> f32 {
        self.expertise_score() * self.confidence()
    }

    // Confidence progression examples:
    //  1 exec:  confidence = 0.05 (5%)
    //  5 execs: confidence = 0.25 (25%)
    // 10 execs: confidence = 0.50 (50%)
    // 20 execs: confidence = 1.00 (100%)
}
```
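A minimal unit-test sketch for these milestones; the `profile_with` helper and its field values are illustrative, and the real struct may carry additional fields:
```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Helper: a profile where everything but the execution count is fixed.
    fn profile_with(executions_total: u32) -> TaskTypeLearning {
        TaskTypeLearning {
            agent_id: "agent-under-test".to_string(),
            task_type: "code_generation".to_string(),
            executions_total,
            executions_successful: executions_total,
            avg_quality_score: 0.9,
            records: vec![],
        }
    }

    #[test]
    fn test_confidence_milestones() {
        assert!((profile_with(1).confidence() - 0.05).abs() < f32::EPSILON);
        assert!((profile_with(5).confidence() - 0.25).abs() < f32::EPSILON);
        assert!((profile_with(10).confidence() - 0.50).abs() < f32::EPSILON);
        assert!((profile_with(20).confidence() - 1.00).abs() < f32::EPSILON);
        // Capped at 1.0 past the threshold.
        assert_eq!(profile_with(40).confidence(), 1.0);
    }
}
```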
**Agent Selection with Confidence**:
```rust
pub async fn select_best_agent_with_confidence(
    db: &Surreal<Ws>,
    task_type: &str,
) -> Result<String> {
    // Fetch all learning profiles for this task type. SurrealDB uses
    // named parameters bound as ("name", value) tuples, not positional $1.
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;
    let mut profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence), best first.
    profiles.sort_by(|a, b| {
        b.adjusted_score()
            .partial_cmp(&a.adjusted_score())
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    let best = profiles.first().ok_or(Error::NoAgentsAvailable)?;

    // Log the selection with its confidence for debugging.
    tracing::info!(
        "Selected agent {} with confidence {:.0}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );
    Ok(best.agent_id.clone())
}
```
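Scoring is applied client-side because `expertise_score()` and `confidence()` are methods on `TaskTypeLearning`, not database functions; pushing the formula into the query would mean duplicating it as a database-side expression and keeping the two copies in sync.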
**Preventing Lucky Streaks**:
```rust
// Example: an agent with one perfect run, but only 5% confidence.
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95, // perfect on the first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05.
let score = agent_1_success.adjusted_score(); // 0.95 * 0.05 = 0.0475

// This scores far below an established agent with 0.80 expertise
// and 0.50 confidence: 0.80 * 0.50 = 0.40 > 0.0475.

// The agent needs ~20 executions before reaching full confidence.
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ...remaining fields as above... */
};
let score = agent_20_success.adjusted_score(); // 0.95 * 1.0 = 0.95
```
**Confidence Visualization**:
```rust
pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0u32..=40)
        .map(|execs| {
            // f32::min, since floats are not Ord.
            let confidence = (execs as f32 / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}
// Output:
//  0 execs: 0.00
//  1 exec:  0.05
//  2 execs: 0.10
//  5 execs: 0.25
// 10 execs: 0.50
// 20 execs: 1.00 ← Full confidence reached
// 30 execs: 1.00 ← Capped at 1.0
// 40 execs: 1.00 ← Still capped
```
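A quick way to eyeball the ramp, using only the `confidence_ramp()` helper above:
```rust
fn main() {
    for (execs, confidence) in confidence_ramp() {
        // One '#' per 5% of confidence, so the cap at 1.0 is visible.
        let bar = "#".repeat((confidence * 20.0) as usize);
        println!("{:>2} execs | {:<20} | {:.2}", execs, bar, confidence);
    }
}
```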
**Key Files**:
- `/crates/vapora-agents/src/learning_profile.rs` (confidence calculation)
- `/crates/vapora-agents/src/selector.rs` (agent selection logic)
- `/crates/vapora-agents/src/scoring.rs` (score calculations)
---
## Verification
```bash
# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1
# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention
# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation
# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
```
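As a sketch of what `test_lucky_streak_prevention` might assert (the `profile` fixture helper is hypothetical, and the expected numbers assume `expertise_score()` roughly tracks `avg_quality_score`):
```rust
#[test]
fn test_lucky_streak_prevention() {
    // Hypothetical fixtures: profile(executions, avg_quality_score).
    let newcomer = profile(1, 0.95); // one perfect run
    let veteran = profile(10, 0.80); // solid, imperfect history

    // Roughly 0.95 * 0.05 = 0.0475 vs 0.80 * 0.50 = 0.40:
    // the veteran wins despite the newcomer's perfect first run.
    assert!(veteran.adjusted_score() > newcomer.adjusted_score());
}
```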
**Expected Output**:
- 1 execution: confidence = 0.05 (5%)
- 5 executions: confidence = 0.25 (25%)
- 10 executions: confidence = 0.50 (50%)
- 20 executions: confidence = 1.0 (100%)
- New agent with 1 success not selected over established agent
- Confidence gradually increases as agent executes more
- Adjusted score properly combines expertise and confidence
---
## Consequences
### Agent Cold-Start
- New agents require ~20 executions before reaching full confidence (the formula counts total executions, not successes)
- Longer ramp-up but prevents bad deployments
- Users understand why new agents aren't immediately selected
### Agent Ranking
- Established agents (20+ executions) ranked by expertise only
- Developing agents (< 20 executions) ranked by expertise * confidence
- Creates natural progression for agent improvement
### Learning Curve
- First 20 executions critical for agent adoption
- After 20, confidence no longer a limiting factor
- Encourages testing new agents early
### Monitoring
- Track which agents reach 20 executions
- Identify agents stuck below 20 executions (poor performance); a query sketch follows below
- Celebrate agents reaching full confidence
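A query along these lines could back that monitoring (a sketch reusing the table and fields from the Implementation section; the function name is hypothetical):
```rust
/// Sketch: list agents still below full confidence for a task type,
/// so operators can spot agents stuck under 20 executions.
pub async fn agents_below_full_confidence(
    db: &Surreal<Ws>,
    task_type: &str,
) -> Result<Vec<TaskTypeLearning>> {
    let mut response = db
        .query(
            "SELECT * FROM task_type_learning \
             WHERE task_type = $task_type AND executions_total < 20 \
             ORDER BY executions_total DESC",
        )
        .bind(("task_type", task_type.to_string()))
        .await?;
    Ok(response.take(0)?)
}
```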
---
## References
- `/crates/vapora-agents/src/learning_profile.rs` (implementation)
- `/crates/vapora-agents/src/selector.rs` (usage)
- ADR-014 (Learning Profiles)
- ADR-018 (Swarm Load Balancing)
---
**Related ADRs**: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)