# ADR-017: Confidence Weighting in Learning Profiles

**Status**: Accepted | Implemented

**Date**: 2024-11-01

**Deciders**: Agent Architecture Team

**Technical Story**: Preventing new agents from being preferred on lucky first runs

---
## Decision

Implement **Confidence Weighting** with the formula `confidence = min(1.0, total_executions / 20)`.

---
## Rationale

1. **Prevents Overfitting**: New agents with a single success should not be preferred
2. **Statistical Significance**: 20 executions provides reasonable statistical confidence
3. **Gradual Increase**: Confidence rises as the agent completes more tasks
4. **Prevents Lucky Streaks**: Requires accumulated evidence before an agent is preferred

---
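As a rough back-of-envelope check on point 2 (our own sanity check, not from the ADR): the standard error of an observed success rate after `n` trials is `sqrt(p(1-p)/n)`, which in the worst case (`p = 0.5`) falls to about ±0.11 at `n = 20` — small enough to rank agents meaningfully, while `n = 1` leaves ±0.5 uncertainty.

```rust
// Back-of-envelope: worst-case standard error of a success-rate estimate
// after n trials. sqrt(p(1-p)/n) is maximized at p = 0.5.
fn standard_error(p: f64, n: u32) -> f64 {
    (p * (1.0 - p) / n as f64).sqrt()
}

fn main() {
    for n in [1u32, 5, 20, 50] {
        // n = 20 gives ~0.112; n = 1 gives 0.5
        println!("n = {:2}: worst-case SE = {:.3}", n, standard_error(0.5, n));
    }
}
```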
## Alternatives Considered

### ❌ No Confidence Weighting

- **Pros**: Simple
- **Cons**: A new agent with one lucky success could be selected over proven agents

### ❌ Higher Threshold (e.g., 50 executions)

- **Pros**: More statistical rigor
- **Cons**: Worsens the cold-start problem; new agents would take far too long to be selected

### ✅ Confidence = min(1.0, executions/20) (CHOSEN)

- Reasonable threshold that balances learning speed against the risk of lucky streaks

---
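To make the threshold comparison concrete, here is a small standalone sketch (the `confidence_with` helper and the numbers are illustrative, not project code) counting how many executions a strong new agent (expertise 0.95) needs before its adjusted score overtakes an established agent scoring 0.40, under each threshold:

```rust
// Illustrative comparison of confidence thresholds (20 vs. 50).
// `confidence_with` is a hypothetical helper, not part of the codebase.
fn confidence_with(threshold: f32, executions: u32) -> f32 {
    (executions as f32 / threshold).min(1.0)
}

fn main() {
    let expertise = 0.95_f32; // new agent's expertise
    let established = 0.40_f32; // established agent's adjusted score

    for threshold in [20.0_f32, 50.0] {
        // First execution count where the new agent overtakes the established one
        let needed = (1u32..)
            .find(|&n| expertise * confidence_with(threshold, n) > established)
            .unwrap();
        println!("threshold {threshold}: overtakes after {needed} executions");
    }
}
```

With the chosen threshold of 20 the overtake happens after 9 executions; at 50 it would take 22, more than doubling the ramp-up.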
## Trade-offs

**Pros**:

- ✅ Prevents overfitting on a single success
- ✅ Reasonable learning curve (20 executions)
- ✅ Simple formula
- ✅ Transparent and explainable

**Cons**:

- ⚠️ Cold-start: new agents take 20 runs to reach full confidence
- ⚠️ Not adaptive (same threshold for all task types)
- ⚠️ May still allow lucky streaks (before 20 runs)

---
## Implementation

**Confidence Model**:

```rust
// crates/vapora-agents/src/learning_profile.rs

impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's score.
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+.
    pub fn confidence(&self) -> f32 {
        // `std::cmp::min` requires `Ord` and does not work on floats;
        // use `f32::min` instead.
        ((self.executions_total as f32) / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence.
    /// Even with perfect expertise, low confidence reduces the score.
    pub fn adjusted_score(&self) -> f32 {
        let expertise = self.expertise_score();
        let confidence = self.confidence();
        expertise * confidence
    }

    // Confidence progression examples:
    //  1 exec:  confidence = 0.05 (5%)
    //  5 execs: confidence = 0.25 (25%)
    // 10 execs: confidence = 0.50 (50%)
    // 20 execs: confidence = 1.0  (100%)
}
```
**Agent Selection with Confidence**:

```rust
pub async fn select_best_agent_with_confidence(
    db: &Surreal<Ws>,
    task_type: &str,
) -> Result<String> {
    // Query all learning profiles for this task type.
    // expertise_score() and confidence() are Rust methods, not database
    // functions, so ranking happens in Rust after fetching the records.
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;

    let profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence)
    let best = profiles
        .into_iter()
        .max_by(|a, b| {
            a.adjusted_score()
                .partial_cmp(&b.adjusted_score())
                .unwrap_or(std::cmp::Ordering::Equal)
        })
        .ok_or(Error::NoAgentsAvailable)?;

    // Log selection with confidence for debugging
    tracing::info!(
        "Selected agent {} with confidence {:.2}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );

    Ok(best.agent_id)
}
```
**Preventing Lucky Streaks**:

```rust
// Example: an agent with 1 success gets only 5% confidence
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95, // Perfect on first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05
let score = agent_1_success.adjusted_score(); // 0.95 * 0.05 = 0.0475
// This agent scores much lower than an established agent with
// 0.80 expertise and 0.50 confidence: 0.80 * 0.50 = 0.40 > 0.0475

// An agent needs ~20 executions before reaching full confidence
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ... */
};

let score = agent_20_success.adjusted_score(); // 0.95 * 1.0 = 0.95
```
**Confidence Visualization**:

```rust
pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0..=40)
        .map(|execs| {
            // f32::min, since std::cmp::min does not accept floats
            let confidence = ((execs as f32) / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}

// Output:
//  0 execs: 0.00
//  1 exec:  0.05
//  2 execs: 0.10
//  5 execs: 0.25
// 10 execs: 0.50
// 20 execs: 1.00 ← Full confidence reached
// 30 execs: 1.00 ← Capped at 1.0
// 40 execs: 1.00 ← Still capped
```
**Key Files**:

- `/crates/vapora-agents/src/learning_profile.rs` (confidence calculation)
- `/crates/vapora-agents/src/selector.rs` (agent selection logic)
- `/crates/vapora-agents/src/scoring.rs` (score calculations)

---
## Verification

```bash
# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1

# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention

# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation

# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
```

**Expected Output**:

- 1 execution: confidence = 0.05 (5%)
- 5 executions: confidence = 0.25 (25%)
- 10 executions: confidence = 0.50 (50%)
- 20 executions: confidence = 1.0 (100%)
- New agent with 1 success not selected over established agent
- Confidence gradually increases as agent executes more
- Adjusted score properly combines expertise and confidence

---
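A self-contained sketch of what the milestone tests might look like (the struct is reduced to the single field the formula needs; the real tests live in `vapora-agents`):

```rust
// Hypothetical, minimal sketch of the confidence milestone tests.
// Only the field the formula depends on is modeled here.
struct TaskTypeLearning {
    executions_total: u32,
}

impl TaskTypeLearning {
    fn confidence(&self) -> f32 {
        (self.executions_total as f32 / 20.0).min(1.0)
    }
}

#[test]
fn test_confidence_milestones() {
    // (executions, expected confidence) per the ramp table above
    for (execs, expected) in [(1, 0.05), (5, 0.25), (10, 0.50), (20, 1.0), (40, 1.0)] {
        let profile = TaskTypeLearning { executions_total: execs };
        assert!((profile.confidence() - expected).abs() < f32::EPSILON);
    }
}
```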
## Consequences

### Agent Cold-Start

- New agents require ~20 successful executions before reaching full score
- Longer ramp-up, but prevents bad deployments
- Users understand why new agents aren't immediately selected

### Agent Ranking

- Established agents (20+ executions) are ranked by expertise only
- Developing agents (< 20 executions) are ranked by expertise * confidence
- Creates a natural progression for agent improvement

### Learning Curve

- First 20 executions are critical for agent adoption
- After 20, confidence is no longer a limiting factor
- Encourages testing new agents early

### Monitoring

- Track which agents reach 20 executions
- Identify agents stuck below 20 (poor performance)
- Celebrate agents reaching full confidence

---
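The monitoring split above can be sketched as a simple partition over profiles (the `AgentProgress` struct and helper are illustrative, not project code):

```rust
// Illustrative: partition agents into established vs. still-ramping,
// per the 20-execution full-confidence threshold. Not project code.
struct AgentProgress {
    agent_id: String,
    executions_total: u32,
}

fn partition_by_confidence(profiles: Vec<AgentProgress>) -> (Vec<String>, Vec<String>) {
    let (established, ramping): (Vec<_>, Vec<_>) =
        profiles.into_iter().partition(|p| p.executions_total >= 20);
    (
        established.into_iter().map(|p| p.agent_id).collect(),
        ramping.into_iter().map(|p| p.agent_id).collect(),
    )
}
```

Agents that stay in the ramping bucket for a long time are the ones worth investigating for poor performance.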
## References

- `/crates/vapora-agents/src/learning_profile.rs` (implementation)
- `/crates/vapora-agents/src/selector.rs` (usage)
- ADR-014 (Learning Profiles)
- ADR-018 (Swarm Load Balancing)

---

**Related ADRs**: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)