ADR-017: Confidence Weighting in Learning Profiles

Status: Accepted | Implemented
Date: 2024-11-01
Deciders: Agent Architecture Team
Technical Story: Preventing new agents from being preferred on lucky first runs


Decision

Implement Confidence Weighting with the formula confidence = min(1.0, total_executions / 20).


Rationale

  1. Prevents Overfitting: New agents with a single success must not become preferred
  2. Statistical Significance: 20 executions provides a reasonable statistical footing (see the sketch below)
  3. Gradual Increase: Confidence rises as the agent executes more tasks
  4. Prevents Lucky Streaks: Evidence must accumulate before an agent is preferred
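
One way to see why ~20 executions is a defensible floor (an illustrative back-of-the-envelope, not part of the original decision record): the standard error of an observed success rate shrinks with the square root of the sample size, and at n = 20 it falls to roughly ±0.11 for p = 0.5:

// Illustrative only: standard error of an observed success rate p after n runs,
// se = sqrt(p * (1 - p) / n). Not from the ADR; it just motivates the 20-run floor.
fn standard_error(p: f32, n: u32) -> f32 {
    (p * (1.0 - p) / n as f32).sqrt()
}

fn main() {
    for n in [1, 5, 10, 20, 50] {
        // At p = 0.5 (maximum uncertainty): n=1 -> 0.50, n=20 -> ~0.11, n=50 -> ~0.07
        println!("n = {:2}: se = {:.2}", n, standard_error(0.5, n));
    }
}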

Alternatives Considered

No Confidence Weighting

  • Pros: Simple
  • Cons: New agent with 1 success could be selected

Higher Threshold (e.g., 50 executions)

  • Pros: More statistical rigor
  • Cons: Cold-start problem worse, new agents never selected

Confidence = min(1.0, executions/20) (CHOSEN)

  • Reasonable threshold that balances learning speed against lucky streaks (see the ramp comparison below)
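
To make the comparison concrete, a minimal sketch of the two ramps; only the /20.0 variant exists in the codebase, the /50.0 column shows the rejected alternative:

// Hypothetical side-by-side of the chosen 20-run ramp vs. the rejected 50-run one.
fn ramp(execs: u32, threshold: f32) -> f32 {
    ((execs as f32) / threshold).min(1.0)
}

fn main() {
    for execs in [1, 5, 10, 20, 50] {
        println!(
            "{:2} execs: /20 -> {:.2}, /50 -> {:.2}",
            execs,
            ramp(execs, 20.0), // chosen: full confidence at 20
            ramp(execs, 50.0), // rejected: full confidence only at 50
        );
    }
}

At 20 executions the chosen ramp already grants full confidence (1.00) while the 50-run variant sits at 0.40, which is exactly the worsened cold-start noted above.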

Trade-offs

Pros:

  • Prevents overfitting on single success
  • Reasonable learning curve (20 executions)
  • Simple formula
  • Transparent and explainable

Cons:

  • ⚠️ Cold-start: new agents take 20 runs to full confidence
  • ⚠️ Not adaptive (same threshold for all task types)
  • ⚠️ May still allow lucky streaks (before 20 runs; see the example below)
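
To make the last point concrete, a hypothetical scenario (numbers invented for illustration, not measured): once a streaky new agent's confidence passes ~0.5, it can outrank a mediocre established agent before the 20-run ramp completes:

// Hypothetical numbers, for illustration only.
fn main() {
    let streaky = 0.95 * (10.0f32 / 20.0).min(1.0);     // 0.95 expertise, 10 runs -> 0.475
    let established = 0.45 * (40.0f32 / 20.0).min(1.0); // 0.45 expertise, 40 runs -> 0.45
    assert!(streaky > established); // the streak wins before 20 runs complete
}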

Implementation

Confidence Model:

// crates/vapora-agents/src/learning_profile.rs

impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's score
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+
    pub fn confidence(&self) -> f32 {
        // f32 is not Ord, so std::cmp::min won't compile here; use f32::min
        ((self.executions_total as f32) / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence
    /// Even with perfect expertise, low confidence reduces score
    pub fn adjusted_score(&self) -> f32 {
        let expertise = self.expertise_score();
        let confidence = self.confidence();
        expertise * confidence
    }

    /// Confidence progression examples:
    /// 1 exec:  confidence = 0.05 (5%)
    /// 5 exec:  confidence = 0.25 (25%)
    /// 10 exec: confidence = 0.50 (50%)
    /// 20 exec: confidence = 1.0 (100%)
}

Agent Selection with Confidence:

pub async fn select_best_agent_with_confidence(
    db: &Surreal<Client>, // surrealdb::engine::remote::ws::Client
    task_type: &str,
) -> Result<String> {
    // Fetch all learning profiles for this task type. expertise_score() and
    // confidence() are Rust methods, not database functions, so ranking
    // happens in Rust after deserialization.
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;
    let mut profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence), highest first
    profiles.sort_by(|a, b| {
        b.adjusted_score()
            .partial_cmp(&a.adjusted_score())
            .unwrap_or(std::cmp::Ordering::Equal)
    });

    let best = profiles.first().ok_or(Error::NoAgentsAvailable)?;

    // Log selection with confidence for debugging
    tracing::info!(
        "Selected agent {} with confidence {:.2}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );

    Ok(best.agent_id.clone())
}

Preventing Lucky Streaks:

// Example: Agent with 1 success but 5% confidence
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95,  // Perfect on first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05
let score = agent_1_success.adjusted_score();  // 0.95 * 0.05 = 0.0475
// This agent scores much lower than established agent with 0.80 expertise, 0.50 confidence
// 0.80 * 0.50 = 0.40 > 0.0475

// An agent needs ~20 executions (confidence counts executions_total, not successes)
// before reaching full confidence
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ... */
};

let score = agent_20_success.adjusted_score();  // 0.95 * 1.0 = 0.95

Confidence Visualization:

pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0..=40)
        .map(|execs| {
            let confidence = ((execs as f32) / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}

// Output:
// 0 execs:  0.00
// 1 exec:   0.05
// 2 execs:  0.10
// 5 execs:  0.25
// 10 execs: 0.50
// 20 execs: 1.00  ← Full confidence reached
// 30 execs: 1.00  ← Capped at 1.0
// 40 execs: 1.00  ← Still capped

Key Files:

  • /crates/vapora-agents/src/learning_profile.rs (confidence calculation)
  • /crates/vapora-agents/src/selector.rs (agent selection logic)
  • /crates/vapora-agents/src/scoring.rs (score calculations)

Verification

# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1

# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention

# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation

# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
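
The test names above come from the real suite; the bodies below are only a sketch of what two of them might assert, and the use of Default for the remaining fields is an assumption:

#[cfg(test)]
mod tests {
    use super::*;

    // Sketch of test_confidence_at_20_execs; the real assertions may differ.
    #[test]
    fn test_confidence_at_20_execs() {
        let learning = TaskTypeLearning {
            executions_total: 20,
            ..Default::default() // assumes TaskTypeLearning implements Default
        };
        assert!((learning.confidence() - 1.0).abs() < f32::EPSILON);
    }

    // Sketch of test_confidence_cap_at_1: confidence never exceeds 1.0.
    #[test]
    fn test_confidence_cap_at_1() {
        let learning = TaskTypeLearning {
            executions_total: 100,
            ..Default::default()
        };
        assert_eq!(learning.confidence(), 1.0);
    }
}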

Expected Output:

  • 1 execution: confidence = 0.05 (5%)
  • 5 executions: confidence = 0.25 (25%)
  • 10 executions: confidence = 0.50 (50%)
  • 20 executions: confidence = 1.0 (100%)
  • New agent with 1 success not selected over established agent
  • Confidence gradually increases as agent executes more
  • Adjusted score properly combines expertise and confidence

Consequences

Agent Cold-Start

  • New agents require ~20 successful executions before reaching full score
  • Longer ramp-up but prevents bad deployments
  • Users understand why new agents aren't immediately selected

Agent Ranking

  • Established agents (20+ executions) ranked by expertise only
  • Developing agents (< 20 executions) ranked by expertise * confidence
  • Creates natural progression for agent improvement

Learning Curve

  • First 20 executions critical for agent adoption
  • After 20, confidence no longer a limiting factor
  • Encourages testing new agents early

Monitoring

  • Track which agents reach 20 executions
  • Identify agents stuck below 20 executions, a signal of poor performance (see the sketch below)
  • Celebrate agents reaching full confidence
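
A minimal sketch of such a monitoring pass; partition_by_confidence is a hypothetical helper, not existing code:

// Hypothetical monitoring helper, not present in the codebase: splits profiles
// into those at full confidence (>= 20 executions) and those still ramping up.
pub fn partition_by_confidence(
    profiles: &[TaskTypeLearning],
) -> (Vec<&TaskTypeLearning>, Vec<&TaskTypeLearning>) {
    profiles.iter().partition(|p| p.executions_total >= 20)
}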

References

  • /crates/vapora-agents/src/learning_profile.rs (implementation)
  • /crates/vapora-agents/src/selector.rs (usage)
  • ADR-014 (Learning Profiles)
  • ADR-018 (Swarm Load Balancing)

Related ADRs: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)