ADR-017: Confidence Weighting in Learning Profiles

Status: Accepted | Implemented
Date: 2024-11-01
Deciders: Agent Architecture Team
Technical Story: Preventing new agents from being preferred on lucky first runs


Decision

Implement Confidence Weighting with the formula confidence = min(1.0, total_executions / 20).


Rationale

  1. Prevents Overfitting: New agents with a single success must not become preferred
  2. Statistical Significance: 20 executions provides a reasonable statistical footing (see the sketch below)
  3. Gradual Increase: Confidence rises as the agent executes more tasks
  4. Prevents Lucky Streaks: Evidence must accumulate before an agent is preferred
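
One way to see why ~20 executions is a defensible floor (an illustrative back-of-the-envelope, not part of the original decision record): the standard error of an observed success rate shrinks with the square root of the sample size, and at n = 20 it falls to roughly ±0.11 for p = 0.5:

// Illustrative only: standard error of an observed success rate p after n runs,
// se = sqrt(p * (1 - p) / n). Not from the ADR; it just motivates the 20-run floor.
fn standard_error(p: f32, n: u32) -> f32 {
    (p * (1.0 - p) / n as f32).sqrt()
}

fn main() {
    for n in [1, 5, 10, 20, 50] {
        // At p = 0.5 (maximum uncertainty): n=1 -> 0.50, n=20 -> ~0.11, n=50 -> ~0.07
        println!("n = {:2}: se = {:.2}", n, standard_error(0.5, n));
    }
}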

Alternatives Considered

No Confidence Weighting

  • Pros: Simple
  • Cons: New agent with 1 success could be selected

Higher Threshold (e.g., 50 executions)

  • Pros: More statistical rigor
  • Cons: Cold-start problem worse, new agents never selected

Confidence = min(1.0, executions/20) (CHOSEN)

  • Reasonable threshold that balances learning speed against lucky streaks (see the ramp comparison below)
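
To make the comparison concrete, a minimal sketch of the two ramps; only the /20.0 variant exists in the codebase, the /50.0 column shows the rejected alternative:

// Hypothetical side-by-side of the chosen 20-run ramp vs. the rejected 50-run one.
fn ramp(execs: u32, threshold: f32) -> f32 {
    ((execs as f32) / threshold).min(1.0)
}

fn main() {
    for execs in [1, 5, 10, 20, 50] {
        println!(
            "{:2} execs: /20 -> {:.2}, /50 -> {:.2}",
            execs,
            ramp(execs, 20.0), // chosen: full confidence at 20
            ramp(execs, 50.0), // rejected: full confidence only at 50
        );
    }
}

At 20 executions the chosen ramp already grants full confidence (1.00) while the 50-run variant sits at 0.40, which is exactly the worsened cold-start noted above.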

Trade-offs

Pros:

  • Prevents overfitting on single success
  • Reasonable learning curve (20 executions)
  • Simple formula
  • Transparent and explainable

Cons:

  • ⚠️ Cold-start: new agents take 20 runs to full confidence
  • ⚠️ Not adaptive (same threshold for all task types)
  • ⚠️ May still allow lucky streaks (before 20 runs; see the example below)
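
To make the last point concrete, a hypothetical scenario (numbers invented for illustration, not measured): once a streaky new agent's confidence passes ~0.5, it can outrank a mediocre established agent before the 20-run ramp completes:

// Hypothetical numbers, for illustration only.
fn main() {
    let streaky = 0.95 * (10.0f32 / 20.0).min(1.0);     // 0.95 expertise, 10 runs -> 0.475
    let established = 0.45 * (40.0f32 / 20.0).min(1.0); // 0.45 expertise, 40 runs -> 0.45
    assert!(streaky > established); // the streak wins before 20 runs complete
}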

Implementation

Confidence Model:

// crates/vapora-agents/src/learning_profile.rs

impl TaskTypeLearning {
    /// Confidence score: how much to trust this agent's score
    /// min(1.0, executions / 20) = 0.05 at 1 execution, 1.0 at 20+
    pub fn confidence(&self) -> f32 {
        // f32 is not Ord, so std::cmp::min won't compile here; use f32::min
        ((self.executions_total as f32) / 20.0).min(1.0)
    }

    /// Adjusted score: expertise * confidence
    /// Even with perfect expertise, low confidence reduces score
    pub fn adjusted_score(&self) -> f32 {
        let expertise = self.expertise_score();
        let confidence = self.confidence();
        expertise * confidence
    }

    /// Confidence progression examples:
    /// 1 exec:  confidence = 0.05 (5%)
    /// 5 exec:  confidence = 0.25 (25%)
    /// 10 exec: confidence = 0.50 (50%)
    /// 20 exec: confidence = 1.0 (100%)
}

Agent Selection with Confidence:

pub async fn select_best_agent_with_confidence(
    db: &Surreal<Client>, // surrealdb::engine::remote::ws::Client
    task_type: &str,
) -> Result<String> {
    // Fetch all learning profiles for this task type. expertise_score() and
    // confidence() are Rust methods, not database functions, so ranking
    // happens in Rust after deserialization.
    let mut response = db
        .query("SELECT * FROM task_type_learning WHERE task_type = $task_type")
        .bind(("task_type", task_type.to_string()))
        .await?;
    let mut profiles: Vec<TaskTypeLearning> = response.take(0)?;

    // Rank by adjusted score (expertise * confidence), highest first
    profiles.sort_by(|a, b| {
        b.adjusted_score()
            .partial_cmp(&a.adjusted_score())
            .unwrap_or(std::cmp::Ordering::Equal)
    });

    let best = profiles.first().ok_or(Error::NoAgentsAvailable)?;

    // Log selection with confidence for debugging
    tracing::info!(
        "Selected agent {} with confidence {:.2}% (after {} executions)",
        best.agent_id,
        best.confidence() * 100.0,
        best.executions_total
    );

    Ok(best.agent_id.clone())
}

Preventing Lucky Streaks:

// Example: Agent with 1 success but 5% confidence
let agent_1_success = TaskTypeLearning {
    agent_id: "new-agent-1".to_string(),
    task_type: "code_generation".to_string(),
    executions_total: 1,
    executions_successful: 1,
    avg_quality_score: 0.95,  // Perfect on first try!
    records: vec![ExecutionRecord { /* ... */ }],
};

// Expertise would be 0.95, but confidence is only 0.05
let score = agent_1_success.adjusted_score();  // 0.95 * 0.05 = 0.0475
// This agent scores much lower than established agent with 0.80 expertise, 0.50 confidence
// 0.80 * 0.50 = 0.40 > 0.0475

// An agent needs ~20 executions (confidence counts executions_total, not successes)
// before reaching full confidence
let agent_20_success = TaskTypeLearning {
    executions_total: 20,
    executions_successful: 20,
    avg_quality_score: 0.95,
    /* ... */
};

let score = agent_20_success.adjusted_score();  // 0.95 * 1.0 = 0.95

Confidence Visualization:

pub fn confidence_ramp() -> Vec<(u32, f32)> {
    (0..=40)
        .map(|execs| {
            let confidence = ((execs as f32) / 20.0).min(1.0);
            (execs, confidence)
        })
        .collect()
}

// Output:
// 0 execs:  0.00
// 1 exec:   0.05
// 2 execs:  0.10
// 5 execs:  0.25
// 10 execs: 0.50
// 20 execs: 1.00  ← Full confidence reached
// 30 execs: 1.00  ← Capped at 1.0
// 40 execs: 1.00  ← Still capped

Key Files:

  • /crates/vapora-agents/src/learning_profile.rs (confidence calculation)
  • /crates/vapora-agents/src/selector.rs (agent selection logic)
  • /crates/vapora-agents/src/scoring.rs (score calculations)

Verification

# Test confidence calculation at key milestones
cargo test -p vapora-agents test_confidence_at_1_exec
cargo test -p vapora-agents test_confidence_at_5_execs
cargo test -p vapora-agents test_confidence_at_20_execs
cargo test -p vapora-agents test_confidence_cap_at_1

# Test lucky streak prevention
cargo test -p vapora-agents test_lucky_streak_prevention

# Test adjusted score (expertise * confidence)
cargo test -p vapora-agents test_adjusted_score_calculation

# Integration: new agent vs established agent selection
cargo test -p vapora-agents test_agent_selection_with_confidence
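
The test names above come from the real suite; the bodies below are only a sketch of what two of them might assert, and the use of Default for the remaining fields is an assumption:

#[cfg(test)]
mod tests {
    use super::*;

    // Sketch of test_confidence_at_20_execs; the real assertions may differ.
    #[test]
    fn test_confidence_at_20_execs() {
        let learning = TaskTypeLearning {
            executions_total: 20,
            ..Default::default() // assumes TaskTypeLearning implements Default
        };
        assert!((learning.confidence() - 1.0).abs() < f32::EPSILON);
    }

    // Sketch of test_confidence_cap_at_1: confidence never exceeds 1.0.
    #[test]
    fn test_confidence_cap_at_1() {
        let learning = TaskTypeLearning {
            executions_total: 100,
            ..Default::default()
        };
        assert_eq!(learning.confidence(), 1.0);
    }
}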

Expected Output:

  • 1 execution: confidence = 0.05 (5%)
  • 5 executions: confidence = 0.25 (25%)
  • 10 executions: confidence = 0.50 (50%)
  • 20 executions: confidence = 1.0 (100%)
  • New agent with 1 success not selected over established agent
  • Confidence gradually increases as agent executes more
  • Adjusted score properly combines expertise and confidence

Consequences

Agent Cold-Start

  • New agents require ~20 successful executions before reaching full score
  • Longer ramp-up but prevents bad deployments
  • Users understand why new agents aren't immediately selected

Agent Ranking

  • Established agents (20+ executions) ranked by expertise only
  • Developing agents (< 20 executions) ranked by expertise * confidence
  • Creates natural progression for agent improvement

Learning Curve

  • First 20 executions critical for agent adoption
  • After 20, confidence no longer a limiting factor
  • Encourages testing new agents early

Monitoring

  • Track which agents reach 20 executions
  • Identify agents stuck below 20 executions, a signal of poor performance (see the sketch below)
  • Celebrate agents reaching full confidence
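
A minimal sketch of such a monitoring pass; partition_by_confidence is a hypothetical helper, not existing code:

// Hypothetical monitoring helper, not present in the codebase: splits profiles
// into those at full confidence (>= 20 executions) and those still ramping up.
pub fn partition_by_confidence(
    profiles: &[TaskTypeLearning],
) -> (Vec<&TaskTypeLearning>, Vec<&TaskTypeLearning>) {
    profiles.iter().partition(|p| p.executions_total >= 20)
}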

References

  • /crates/vapora-agents/src/learning_profile.rs (implementation)
  • /crates/vapora-agents/src/selector.rs (usage)
  • ADR-014 (Learning Profiles)
  • ADR-018 (Swarm Load Balancing)

Related ADRs: ADR-014 (Learning Profiles), ADR-018 (Load Balancing), ADR-019 (Temporal History)