

ADR-0040: Agent Hot-Reload — Stable Identity and Zero-Downtime Config Reload

Status: Implemented
Date: 2026-03-02
Deciders: VAPORA Team

Technical Story: AgentMetadata::id was a Uuid::new_v4() generated at startup. learning_profiles in AgentCoordinator and execution records in KGPersistence used this UUID as the key. Every process restart or SIGHUP reload rotated all UUIDs, orphaning accumulated expertise profiles and resetting the learning system to zero.


Decision

Introduce stable_id: String on AgentMetadata, computed as role.clone() at construction time. Switch all learning profile keys and KG execution records from the ephemeral id (UUID) to stable_id. Add hot-reload mechanics — SIGHUP handler and POST /reload endpoint — that drain and re-spawn executors while leaving learning_profiles untouched.


Context

The Identity Problem

Before this change, every agent had two implicit identities that were conflated into one field:

| Identity | Purpose | Lifecycle |
|----------|---------|-----------|
| Instance ID (id) | Sender handle in executor_channels, registry key | Ephemeral — dies with the process or on reload |
| Profile ID | Key for learning_profiles and KG records | Must survive restarts to preserve learning |

Using Uuid::new_v4() for both meant any reload (SIGHUP, restart, crash recovery) threw away all accumulated expertise. An agent that had processed 500 coding tasks and learned optimal patterns would start from zero on the next deploy.

Why role as stable_id

VAPORA's architecture already partitions learning at the role level: AgentScoringService::rank_agents accepts Vec<(agent_id, Option<LearningProfile>)> where multiple agents of the same role compete for a task. The profile that matters for selection is role-level expertise (how well the "developer" role handles "coding" tasks), not per-instance expertise. Using role as the stable key:

  • Is deterministic across restarts
  • Aggregates learning across all instances of the same role
  • Requires no additional persistence (no UUID→role mapping table)
  • Degrades gracefully: legacy-deserialized records with empty stable_id fall back to role via stable_id_or_role()

Implementation

AgentMetadata (registry.rs)

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AgentMetadata {
    pub id: String,           // Uuid::new_v4() — ephemeral, per-instance
    #[serde(default)]
    pub stable_id: String,    // role.clone() — persistent across restarts
    pub role: String,
    // ...
}

impl AgentMetadata {
    pub fn new(role: String, ...) -> Self {
        Self {
            id: Uuid::new_v4().to_string(),
            stable_id: role.clone(),   // set before role is moved
            role,
            // ...
        }
    }

    pub fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
    }
}

AgentRegistry::drain_role (registry.rs)

Removes all agents for a role from the agents map and clears running_count. This allows immediate re-registration after drain without hitting MaxAgentsReached.

pub fn drain_role(&self, role: &str) -> Vec<String> {
    let mut inner = self.inner.write().expect("registry write lock");
    let ids: Vec<String> = inner.agents.values()
        .filter(|a| a.role == role)
        .map(|a| a.id.clone())
        .collect();
    for id in &ids { inner.agents.remove(id); }
    inner.running_count.remove(role);
    ids
}

AgentCoordinator::drain_role (coordinator.rs)

Delegates to registry.drain_role, then removes the corresponding Sender entries from executor_channels. Dropping the Sender closes the mpsc channel; the executor's while let Some(task) = rx.recv().await loop exits after draining any buffered messages — no explicit shutdown signal required.

pub fn drain_role(&self, role: &str) -> Vec<String> {
    let ids = self.registry.drain_role(role);
    for id in &ids {
        self.executor_channels.remove(id);
    }
    ids
}

learning_profiles is keyed by stable_id (= role) and is not touched during drain. New executor instances spawned after reload inherit accumulated expertise immediately.

Profile lookup (coordinator.rs)

// assign_task — before:
.map(|a| (a.id.clone(), profiles.get(&a.id).cloned()))

// assign_task — after:
.map(|a| {
    let key = a.stable_id_or_role();
    (a.id.clone(), profiles.get(key).cloned())
})

Hot-reload entry points (server.rs)

Two entry points invoke the same reload_agents function:

// SIGHUP
while sighup.recv().await.is_some() {
    handle_sighup_reload(&state, &registry).await;
}

// REST
.route("/reload", axum::routing::post(reload_handler))

reload_agents sequence:

  1. registry.list_roles() → drain each role via coordinator.drain_role
  2. Re-spawn capability executors from CapabilityRegistry
  3. Re-spawn config agents not covered by capabilities
  4. Return registry.total_count()

Availability Window

reload_agents drains all roles before re-spawning. During the window between the last drain and the first successful register_agent, assign_task for those roles returns CoordinatorError::NoAvailableAgent. This window is typically sub-millisecond on the same thread, but callers must handle this error and retry.

This is a deliberate trade-off: atomic swap-in of new executors would require a blue-green registry pattern, adding significant complexity for a latency window that is orders of magnitude shorter than any typical LLM call (which takes 500 ms–30 s).


Out of Scope

  • BudgetManager reload: budget limit changes require process restart. The BudgetManager is constructed once from config in main() and stored in AppState. Adding reload support requires either a RwLock<BudgetConfig> wrapper or rebuilding the manager and swapping it in AppState under a lock.
  • LLMRouter reload: provider API key changes require restart for the same reason.

Alternatives Considered

UUID + external persistence of UUID→role mapping

Would preserve per-instance identity. Rejected: adds a SurrealDB table (UUID→role) that must be kept in sync across restarts, adds a lookup on every assign_task, and provides no additional value since role-level profiles already capture collective expertise.

Blue-green registry swap

Two AgentRegistry instances: old one drains while new one accepts assignments. Rejected: requires AgentCoordinator to hold Arc<RwLock<Arc<AgentRegistry>>> and all call sites to acquire the inner lock on every call. Complexity disproportionate to the gain (sub-millisecond → zero gap).

Versioned stable_id (e.g., developer-v2)

For breaking role renames. Rejected: out of scope; role renames already require explicit operator action.


Trade-offs

Pros:

  • Learning profiles survive indefinitely across restarts and hot-reloads
  • SIGHUP and POST /reload provide two operator-friendly reload paths
  • stable_id_or_role() fallback ensures backward compatibility with persisted data that predates this change
  • drain_role leaves nothing behind: no stale executor channels, no MaxAgentsReached on re-register

Cons:

  • All agents of the same role share one learning profile. Per-instance specialization (e.g., "this specific GPU node is faster at inference") is not representable. Acceptable: VAPORA's role model deliberately treats same-role agents as interchangeable for task routing purposes.
  • Brief NoAvailableAgent window during reload (see Availability Window above).
  • BudgetManager and LLMRouter not reloadable without restart.

Verification

cargo test -p vapora-agents test_stable_id_deterministic
cargo test -p vapora-agents test_drain_role
cargo test -p vapora-agents test_profile_survives_role_drain
cargo test -p vapora-agents test_list_roles

# Hot-reload via signal
kill -HUP $(pgrep vapora-agents)

# Hot-reload via REST
curl -s -X POST http://localhost:9000/reload | jq .
# Expected: {"reloaded": true, "agents": N}

cargo clippy -p vapora-agents -- -D warnings

Consequences

  • AgentMetadata gains a new field stable_id with #[serde(default)]. Existing serialized records deserialize cleanly; stable_id_or_role() falls back to role.
  • KG execution records (the agent_id field in SurrealDB) now store stable_id (= role) instead of a UUID. Existing records with UUID keys remain in the database but are no longer updated; they can be cleaned up with a migration if needed.
  • ADR-0014 (Learning Profiles) and ADR-0015 (Budget Enforcement) are unaffected at the API level; only the internal key used to look up profiles changes.

References