
# ADR-0040: Agent Hot-Reload — Stable Identity and Zero-Downtime Config Reload

**Status**: Implemented
**Date**: 2026-03-02
**Deciders**: VAPORA Team
**Technical Story**: `AgentMetadata::id` was a `Uuid::new_v4()` generated at startup. `learning_profiles` in `AgentCoordinator` and execution records in `KGPersistence` used this UUID as the key. Every process restart or SIGHUP reload rotated all UUIDs, orphaning accumulated expertise profiles and resetting the learning system to zero.

---

## Decision

Introduce `stable_id: String` on `AgentMetadata`, computed as `role.clone()` at construction time. Switch all learning profile keys and KG execution records from the ephemeral `id` (UUID) to `stable_id`. Add hot-reload mechanics — SIGHUP handler and `POST /reload` endpoint — that drain and re-spawn executors while leaving `learning_profiles` untouched.

---

## Context

### The Identity Problem

Before this change, every agent had two implicit identities that were conflated into one field:

| Identity | Purpose | Lifecycle |
|----------|---------|-----------|
| Instance ID (`id`) | Sender handle in `executor_channels`, registry key | Ephemeral — dies with the process or on reload |
| Profile ID | Key for `learning_profiles` and KG records | Must survive restarts to preserve learning |

Using `Uuid::new_v4()` for both meant any reload (SIGHUP, restart, crash recovery) threw away all accumulated expertise. An agent that had processed 500 coding tasks and learned optimal patterns would start from zero on the next deploy.

### Why `role` as stable_id

VAPORA's architecture already partitions learning at the role level: `AgentScoringService::rank_agents` accepts `Vec<(agent_id, Option<LearningProfile>)>` where multiple agents of the same role compete for a task. The profile that matters for selection is role-level expertise (how well the "developer" role handles "coding" tasks), not per-instance expertise. Using `role` as the stable key:

- Is deterministic across restarts
- Aggregates learning across all instances of the same role
- Requires no additional persistence (no UUID→role mapping table)
- Degrades gracefully: records deserialized from data that predates this change carry an empty `stable_id` and fall back to `role` via `stable_id_or_role()`

---

## Implementation

### `AgentMetadata` (registry.rs)

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AgentMetadata {
    pub id: String,           // Uuid::new_v4() — ephemeral, per-instance
    #[serde(default)]
    pub stable_id: String,    // role.clone() — persistent across restarts
    pub role: String,
    // ...
}

impl AgentMetadata {
    pub fn new(role: String, ...) -> Self {
        Self {
            id: Uuid::new_v4().to_string(),
            stable_id: role.clone(),   // set before role is moved
            role,
            // ...
        }
    }

    pub fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
    }
}
```
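The fallback can be illustrated with a minimal std-only sketch. It assumes a trimmed-down `AgentMetadata` with only the identity fields; `legacy_record` and `current_record` are hypothetical helpers that stand in for what `#[serde(default)]` and `AgentMetadata::new` would produce, and are not part of the codebase.

```rust
/// Trimmed-down stand-in for `AgentMetadata`; only identity fields are kept.
#[derive(Debug, Clone)]
pub struct AgentMetadata {
    pub id: String,        // ephemeral, per-instance
    pub stable_id: String, // persistent; empty on legacy records
    pub role: String,
}

impl AgentMetadata {
    /// Same logic as the ADR's accessor: prefer `stable_id`, else `role`.
    pub fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
    }
}

/// Hypothetical helper: mimics a record serialized before `stable_id`
/// existed, where `#[serde(default)]` would yield an empty string.
pub fn legacy_record(id: &str, role: &str) -> AgentMetadata {
    AgentMetadata {
        id: id.to_string(),
        stable_id: String::new(),
        role: role.to_string(),
    }
}

/// Hypothetical helper: mimics a record built after this change,
/// where `stable_id` is a copy of `role`.
pub fn current_record(id: &str, role: &str) -> AgentMetadata {
    AgentMetadata {
        id: id.to_string(),
        stable_id: role.to_string(),
        role: role.to_string(),
    }
}
```

Both kinds of record resolve to the same profile key, so legacy and current data end up under one `learning_profiles` entry.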

### `AgentRegistry::drain_role` (registry.rs)

Removes all agents for a role from the `agents` map and removes that role's entry from `running_count`. This allows immediate re-registration after drain without hitting `MaxAgentsReached`.

```rust
pub fn drain_role(&self, role: &str) -> Vec<String> {
    let mut inner = self.inner.write().expect("registry write lock");
    let ids: Vec<String> = inner.agents.values()
        .filter(|a| a.role == role)
        .map(|a| a.id.clone())
        .collect();
    for id in &ids { inner.agents.remove(id); }
    inner.running_count.remove(role);
    ids
}
```
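To see why clearing `running_count` matters, here is a self-contained sketch of a registry with a per-role cap. The `Registry` type, its `register` method, and the `max_per_role` limit are assumptions made for illustration; the real `AgentRegistry` may enforce its limit differently.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
pub enum RegistryError {
    MaxAgentsReached,
}

struct Inner {
    agents: HashMap<String, String>,       // instance id -> role
    running_count: HashMap<String, usize>, // role -> live instances
}

/// Hypothetical, simplified registry with a per-role instance cap.
pub struct Registry {
    inner: RwLock<Inner>,
    max_per_role: usize,
}

impl Registry {
    pub fn new(max_per_role: usize) -> Self {
        let inner = Inner { agents: HashMap::new(), running_count: HashMap::new() };
        Self { inner: RwLock::new(inner), max_per_role }
    }

    /// Registers an instance unless the role is already at its cap.
    pub fn register(&self, id: &str, role: &str) -> Result<(), RegistryError> {
        let mut inner = self.inner.write().expect("registry write lock");
        let count = inner.running_count.entry(role.to_string()).or_insert(0);
        if *count >= self.max_per_role {
            return Err(RegistryError::MaxAgentsReached);
        }
        *count += 1;
        inner.agents.insert(id.to_string(), role.to_string());
        Ok(())
    }

    /// Mirrors the ADR's `drain_role`: drop the agents and the role counter.
    pub fn drain_role(&self, role: &str) -> Vec<String> {
        let mut inner = self.inner.write().expect("registry write lock");
        let ids: Vec<String> = inner
            .agents
            .iter()
            .filter(|(_, r)| r.as_str() == role)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &ids {
            inner.agents.remove(id);
        }
        inner.running_count.remove(role);
        ids
    }
}
```

With a cap of one, a second `register` for the same role fails, while a `register` issued right after `drain_role` succeeds because the counter entry is gone.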

### `AgentCoordinator::drain_role` (coordinator.rs)

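Delegates to `registry.drain_role`, then removes the corresponding `Sender` entries from `executor_channels`. Once the last `Sender` for a channel is dropped, the mpsc channel closes; the executor's `while let Some(task) = rx.recv().await` loop processes any buffered messages and then exits, so no explicit shutdown signal is required. This relies on `executor_channels` holding the only live `Sender` for each executor: a clone kept elsewhere would keep the channel open.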

```rust
pub fn drain_role(&self, role: &str) -> Vec<String> {
    let ids = self.registry.drain_role(role);
    for id in &ids {
        self.executor_channels.remove(id);
    }
    ids
}
```

`learning_profiles` is keyed by `stable_id` (= role) and is **not** touched during drain. New executor instances spawned after reload inherit accumulated expertise immediately.
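The shutdown-by-drop behaviour can be sketched without an async runtime. The example below uses `std::sync::mpsc` and a thread as a blocking stand-in for the executor's async channel; `run_until_sender_dropped` is a hypothetical name, and the real executors are assumed to follow the same close-on-last-sender semantics.

```rust
use std::sync::mpsc;
use std::thread;

/// Sends every task, drops the only sender, and returns what the worker
/// processed before its receive loop ended.
pub fn run_until_sender_dropped(tasks: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    let worker = thread::spawn(move || {
        let mut done = Vec::new();
        // Blocking analogue of `while let Some(task) = rx.recv().await`:
        // `recv` only fails once all senders are gone AND the buffer is empty.
        while let Ok(task) = rx.recv() {
            done.push(task);
        }
        done
    });

    for task in tasks {
        tx.send(task).expect("worker is still receiving");
    }

    // Corresponds to `executor_channels.remove(id)`: the last sender goes away.
    drop(tx);

    // The worker drains buffered tasks, then exits on its own.
    worker.join().expect("worker panicked")
}
```

Messages sent before the drop are still delivered, which is why drain does not lose tasks that were already queued for an executor.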

### Profile lookup (coordinator.rs)

```rust
// assign_task — before:
.map(|a| (a.id.clone(), profiles.get(&a.id).cloned()))

// assign_task — after:
.map(|a| {
    let key = a.stable_id_or_role();
    (a.id.clone(), profiles.get(key).cloned())
})
```
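A small sketch of the "after" lookup shows two instances with different ephemeral ids resolving to the same profile. `Agent`, `LearningProfile`, and `candidates` are simplified, hypothetical stand-ins; the real types carry more fields.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct LearningProfile {
    pub tasks_completed: u32,
}

pub struct Agent {
    pub id: String,        // ephemeral instance id
    pub stable_id: String, // role-derived key, may be empty on legacy data
    pub role: String,
}

impl Agent {
    pub fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
    }
}

/// Pairs each instance id with the profile stored under its stable key,
/// in the shape the scoring step is described as accepting.
pub fn candidates(
    agents: &[Agent],
    profiles: &HashMap<String, LearningProfile>,
) -> Vec<(String, Option<LearningProfile>)> {
    agents
        .iter()
        .map(|a| {
            let key = a.stable_id_or_role();
            (a.id.clone(), profiles.get(key).cloned())
        })
        .collect()
}
```

Because the key no longer depends on `id`, an instance spawned after a reload is scored with the same profile as the instance it replaced.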

### Hot-reload entry points (server.rs)

Two entry points invoke the same `reload_agents` function:

```rust
// SIGHUP
while sighup.recv().await.is_some() {
    handle_sighup_reload(&state, &registry).await;
}

// REST
.route("/reload", axum::routing::post(reload_handler))
```

`reload_agents` sequence:

1. `registry.list_roles()` → drain each role via `coordinator.drain_role`
2. Re-spawn capability executors from `CapabilityRegistry`
3. Re-spawn config agents not covered by capabilities
4. Return `registry.total_count()`
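The four steps above can be sketched against an in-memory stand-in. Everything here is an assumption made for illustration: `Coordinator` merges registry and coordinator into one struct, instance ids come from a counter instead of `Uuid::new_v4()`, and the capability and config role lists are passed in as plain slices.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical in-memory stand-in for the coordinator plus registry.
#[derive(Default)]
pub struct Coordinator {
    agents: HashMap<String, String>,             // instance id -> role
    pub learning_profiles: HashMap<String, u32>, // stable_id (role) -> tasks seen
    next_instance: u32,
}

impl Coordinator {
    /// Spawns a new instance with a fresh id (a counter replaces the UUID).
    pub fn spawn(&mut self, role: &str) -> String {
        self.next_instance += 1;
        let id = format!("{}-{}", role, self.next_instance);
        self.agents.insert(id.clone(), role.to_string());
        id
    }

    pub fn list_roles(&self) -> BTreeSet<String> {
        self.agents.values().cloned().collect()
    }

    /// Removes every instance of `role`; profiles are deliberately untouched.
    pub fn drain_role(&mut self, role: &str) -> Vec<String> {
        let ids: Vec<String> = self
            .agents
            .iter()
            .filter(|(_, r)| r.as_str() == role)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &ids {
            self.agents.remove(id);
        }
        ids
    }

    pub fn total_count(&self) -> usize {
        self.agents.len()
    }

    pub fn has_instance(&self, id: &str) -> bool {
        self.agents.contains_key(id)
    }
}

pub fn reload_agents(c: &mut Coordinator, capability_roles: &[&str], config_roles: &[&str]) -> usize {
    // 1. Drain every currently registered role.
    for role in c.list_roles() {
        c.drain_role(&role);
    }
    // 2. Re-spawn capability executors.
    for role in capability_roles {
        c.spawn(role);
    }
    // 3. Re-spawn config agents not already covered by a capability.
    for role in config_roles {
        if !capability_roles.contains(role) {
            c.spawn(role);
        }
    }
    // 4. Report how many agents are registered after the reload.
    c.total_count()
}
```

After a reload the old instance id is gone, while the entry in `learning_profiles` for its role is unchanged.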

---

## Availability Window

`reload_agents` drains all roles before re-spawning. From the moment a role is drained until its first successful `register_agent`, `assign_task` for that role returns `CoordinatorError::NoAvailableAgent`. This window is typically sub-millisecond when drain and re-spawn run back to back on the same thread, but callers must handle this error and retry.

This is a deliberate trade-off: atomic swap-in of new executors would require a blue-green registry pattern, adding significant complexity for a latency window that is orders of magnitude shorter than any typical LLM call (which takes 500ms–30s).
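One way a caller might absorb the window is a bounded retry. The sketch below is synchronous and std-only; `assign_with_retry`, the `Other` variant, and the retry parameters are hypothetical, and only `NoAvailableAgent` is taken from this ADR. In async code the blocking sleep would presumably be replaced by the runtime's timer.

```rust
use std::thread;
use std::time::Duration;

/// Simplified error type; only `NoAvailableAgent` is named in the ADR,
/// `Other` is a placeholder for every non-retryable failure.
#[derive(Debug, PartialEq)]
pub enum CoordinatorError {
    NoAvailableAgent,
    Other(String),
}

/// Calls `attempt` until it returns anything other than `NoAvailableAgent`,
/// or until `max_retries` additional attempts have been used up.
pub fn assign_with_retry<T>(
    mut attempt: impl FnMut() -> Result<T, CoordinatorError>,
    max_retries: u32,
    backoff: Duration,
) -> Result<T, CoordinatorError> {
    let mut retries = 0;
    loop {
        match attempt() {
            // Retry only the transient "role is mid-reload" case.
            Err(CoordinatorError::NoAvailableAgent) if retries < max_retries => {
                retries += 1;
                thread::sleep(backoff);
            }
            // Success, a different error, or retries exhausted: hand it back.
            other => return other,
        }
    }
}
```

Since the reload gap is far shorter than a typical LLM call, a handful of retries with a backoff of a few milliseconds should be enough in practice; the exact values are a tuning choice rather than something this ADR prescribes.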

---

## Out of Scope

- **BudgetManager reload**: budget limit changes require process restart. The `BudgetManager` is constructed once from config in `main()` and stored in `AppState`. Adding reload support requires either a `RwLock<BudgetConfig>` wrapper or rebuilding the manager and swapping it in `AppState` under a lock.
- **LLMRouter reload**: provider API key changes require restart for the same reason.

---

## Alternatives Considered

### UUID + external persistence of UUID→role mapping

Would preserve per-instance identity. Rejected: adds a SurrealDB table (UUID→role) that must be kept in sync across restarts, adds a lookup on every `assign_task`, and provides no additional value since role-level profiles already capture collective expertise.

### Blue-green registry swap

Two `AgentRegistry` instances: the old one drains while the new one accepts assignments. Rejected: requires `AgentCoordinator` to hold `Arc<RwLock<Arc<AgentRegistry>>>` and every call site to acquire the inner lock on every call. The complexity is disproportionate to the gain (a sub-millisecond gap reduced to zero).

### Versioned stable_id (e.g., `developer-v2`)

For breaking role renames. Rejected: out of scope; role renames already require explicit operator action.

---

## Trade-offs

**Pros**:

- Learning profiles survive indefinitely across restarts and hot-reloads
- SIGHUP and `POST /reload` provide two operator-friendly reload paths
- `stable_id_or_role()` fallback ensures backward compatibility with persisted data that predates this change
- `drain_role` leaves no residue: no stale executor channels, no `MaxAgentsReached` on re-register

**Cons**:

- All agents of the same role share one learning profile. Per-instance specialization (e.g., "this specific GPU node is faster at inference") is not representable. Acceptable: VAPORA's role model deliberately treats same-role agents as interchangeable for task routing purposes.
- Brief `NoAvailableAgent` window during reload (see Availability Window above).
- `BudgetManager` and `LLMRouter` are not reloadable without a restart (see Out of Scope above).

---

## Verification

```bash
cargo test -p vapora-agents test_stable_id_deterministic
cargo test -p vapora-agents test_drain_role
cargo test -p vapora-agents test_profile_survives_role_drain
cargo test -p vapora-agents test_list_roles

# Hot-reload via signal
kill -HUP $(pgrep vapora-agents)

# Hot-reload via REST
curl -s -X POST http://localhost:9000/reload | jq .
# Expected: {"reloaded": true, "agents": N}

cargo clippy -p vapora-agents -- -D warnings
```

---

## Consequences

- `AgentMetadata` gains a new field `stable_id` with `#[serde(default)]`. Existing serialized records deserialize cleanly; `stable_id_or_role()` falls back to `role`.
- KG execution records (the `agent_id` field in SurrealDB) now store `stable_id` (= role) instead of a UUID. Existing records with UUID keys remain in the database but are no longer updated; they can be cleaned up with a migration if needed.
- ADR-0014 (Learning Profiles) and ADR-0015 (Budget Enforcement) are unaffected at the API level; only the internal key used to look up profiles changes.

---

## References

- [ADR-0014 — Learning Profiles](./0014-learning-profiles.md)
- [ADR-0015 — Budget Enforcement](./0015-budget-enforcement.md)
- [ADR-0026 — Arc-Based Shared State](./0026-shared-state.md)
- `crates/vapora-agents/src/registry.rs` — `AgentMetadata`, `drain_role`, `list_roles`
- `crates/vapora-agents/src/coordinator.rs` — `drain_role`, `registry_arc`, profile lookup
- `crates/vapora-agents/src/bin/server.rs` — `reload_agents`, SIGHUP handler, `/reload` endpoint