Introduce stable_id = role on AgentMetadata so learning profiles and KG
execution records survive process restarts and hot-reloads. Previously
every Uuid::new_v4() rotation orphaned accumulated expertise.
- registry: add stable_id field (serde default, backward-compatible),
stable_id_or_role() fallback helper, drain_role(), list_roles()
- coordinator: profile lookup and KG writes use stable_id_or_role()
instead of the ephemeral UUID; drain_role() drops Sender to close
mpsc channels after in-flight messages drain; registry_arc() accessor
- executor: agent_id written to KG now uses stable_id_or_role()
- server: reload_agents() drain-and-respawn function; SIGHUP handler
via while sighup.recv().await.is_some(); POST /reload endpoint;
AppState extended with config_path, router, cap_registry
- fix: SIGHUP recv() spin-loop guard (is_some())
- fix: io_other_error clippy lint in vapora-agents, vapora-llm-router,
vapora-workflow-engine (std::io::Error::other instead of Error::new)
- docs: ADR-0040, CHANGELOG entry, README hot-reload section
ADR-0040: Agent Hot-Reload — Stable Identity and Zero-Downtime Config Reload
Status: Implemented
Date: 2026-03-02
Deciders: VAPORA Team
Technical Story: AgentMetadata::id was a Uuid::new_v4() generated at startup. learning_profiles in AgentCoordinator and execution records in KGPersistence used this UUID as the key. Every process restart or SIGHUP reload rotated all UUIDs, orphaning accumulated expertise profiles and resetting the learning system to zero.
Decision
Introduce stable_id: String on AgentMetadata, computed as role.clone() at construction time. Switch all learning profile keys and KG execution records from the ephemeral id (UUID) to stable_id. Add hot-reload mechanics — SIGHUP handler and POST /reload endpoint — that drain and re-spawn executors while leaving learning_profiles untouched.
Context
The Identity Problem
Before this change, every agent had two implicit identities that were conflated into one field:
| Identity | Purpose | Lifecycle |
|---|---|---|
| Instance ID (id) | Sender handle in executor_channels, registry key | Ephemeral; dies with the process or on reload |
| Profile ID | Key for learning_profiles and KG records | Must survive restarts to preserve learning |
Using Uuid::new_v4() for both meant any reload (SIGHUP, restart, crash recovery) threw away all accumulated expertise. An agent that had processed 500 coding tasks and learned optimal patterns would start from zero on the next deploy.
Why role as stable_id
VAPORA's architecture already partitions learning at the role level: AgentScoringService::rank_agents accepts Vec<(agent_id, Option<LearningProfile>)> where multiple agents of the same role compete for a task. The profile that matters for selection is role-level expertise (how well the "developer" role handles "coding" tasks), not per-instance expertise. Using role as the stable key:
- Is deterministic across restarts
- Aggregates learning across all instances of the same role
- Requires no additional persistence (no UUID→role mapping table)
- Degrades gracefully: legacy-deserialized records with an empty stable_id fall back to role via stable_id_or_role()
Implementation
AgentMetadata (registry.rs)
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AgentMetadata {
    pub id: String,        // Uuid::new_v4(): ephemeral, per-instance
    #[serde(default)]
    pub stable_id: String, // role.clone(): persistent across restarts
    pub role: String,
    // ...
}

impl AgentMetadata {
    pub fn new(role: String, ...) -> Self {
        Self {
            id: Uuid::new_v4().to_string(),
            stable_id: role.clone(), // set before role is moved
            role,
            // ...
        }
    }

    pub fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
    }
}
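The fallback can be exercised without serde by building a record whose stable_id is empty, which is the value #[serde(default)] produces for data written before the field existed. The following is a minimal sketch under that assumption: the struct is trimmed to the two fields the helper reads, and legacy_record is a hypothetical constructor introduced only for illustration.

```rust
/// Trimmed stand-in for AgentMetadata: only the fields the fallback reads.
#[derive(Debug, Clone, Default)]
pub struct AgentMetadata {
    pub stable_id: String,
    pub role: String,
}

impl AgentMetadata {
    /// Same logic as the helper above: prefer stable_id, fall back to role.
    pub fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() {
            &self.role
        } else {
            &self.stable_id
        }
    }
}

/// Hypothetical helper: mimics a record deserialized from data that predates
/// stable_id, where the serde default leaves the field as an empty String.
pub fn legacy_record(role: &str) -> AgentMetadata {
    AgentMetadata {
        stable_id: String::new(),
        role: role.to_string(),
    }
}
```

Calling legacy_record("developer").stable_id_or_role() yields the role, so legacy and freshly constructed records resolve to the same profile key.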
AgentRegistry::drain_role (registry.rs)
Removes all agents for a role from the agents map and clears running_count. This allows immediate re-registration after drain without hitting MaxAgentsReached.
pub fn drain_role(&self, role: &str) -> Vec<String> {
    let mut inner = self.inner.write().expect("registry write lock");
    let ids: Vec<String> = inner.agents.values()
        .filter(|a| a.role == role)
        .map(|a| a.id.clone())
        .collect();
    for id in &ids { inner.agents.remove(id); }
    inner.running_count.remove(role);
    ids
}
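Why clearing running_count matters is easiest to see with a per-role cap in place. The sketch below is a simplified, std-only model, not the real AgentRegistry: the Registry type, its register method, and the max_per_role limit are assumptions made for illustration. Only the drain logic and the MaxAgentsReached name come from the text above.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
pub enum RegistryError {
    MaxAgentsReached,
}

#[derive(Default)]
struct Inner {
    agents: HashMap<String, String>,       // instance id -> role
    running_count: HashMap<String, usize>, // role -> live instances
}

pub struct Registry {
    inner: RwLock<Inner>,
    max_per_role: usize, // assumed cap; the real limit may be configured differently
}

impl Registry {
    pub fn new(max_per_role: usize) -> Self {
        Self { inner: RwLock::new(Inner::default()), max_per_role }
    }

    /// Registers an instance, refusing once the role has reached its cap.
    pub fn register(&self, id: &str, role: &str) -> Result<(), RegistryError> {
        let mut inner = self.inner.write().expect("registry write lock");
        let count = inner.running_count.entry(role.to_string()).or_insert(0);
        if *count >= self.max_per_role {
            return Err(RegistryError::MaxAgentsReached);
        }
        *count += 1;
        inner.agents.insert(id.to_string(), role.to_string());
        Ok(())
    }

    /// Removes every instance of `role` and resets its counter.
    pub fn drain_role(&self, role: &str) -> Vec<String> {
        let mut inner = self.inner.write().expect("registry write lock");
        let ids: Vec<String> = inner
            .agents
            .iter()
            .filter(|(_, r)| r.as_str() == role)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &ids {
            inner.agents.remove(id);
        }
        inner.running_count.remove(role);
        ids
    }
}
```

With a cap of one, a second register for the same role fails until drain_role runs; afterwards a new instance registers immediately because the counter entry is gone.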
AgentCoordinator::drain_role (coordinator.rs)
Delegates to registry.drain_role, then removes the corresponding Sender entries from executor_channels. Dropping the Sender closes the mpsc channel; the executor's while let Some(task) = rx.recv().await loop exits after draining any buffered messages — no explicit shutdown signal required.
pub fn drain_role(&self, role: &str) -> Vec<String> {
    let ids = self.registry.drain_role(role);
    for id in &ids {
        self.executor_channels.remove(id);
    }
    ids
}
learning_profiles is keyed by stable_id (= role) and is not touched during drain. New executor instances spawned after reload inherit accumulated expertise immediately.
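The close-on-drop behavior can be demonstrated with the standard library alone. This sketch uses std::sync::mpsc and a thread as a blocking stand-in for the async channel and executor task in the source; it assumes the async channel shares the same semantics, namely that the receiver sees the channel as closed only after the last Sender is dropped and the buffer is empty. The function name drain_then_close is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Queues `tasks`, drops the only Sender, and returns what the executor
/// loop processed before it observed the closed channel.
pub fn drain_then_close(tasks: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    // Executor loop: blocking analogue of
    // `while let Some(task) = rx.recv().await` in the source.
    let executor = thread::spawn(move || {
        let mut processed = Vec::new();
        while let Ok(task) = rx.recv() {
            processed.push(task);
        }
        // recv() returned Err: every sender is gone and the buffer is empty.
        processed
    });

    for task in tasks {
        tx.send(task).expect("receiver is alive");
    }
    // Dropping the Sender is what removing the executor_channels entry amounts to.
    drop(tx);

    executor.join().expect("executor thread panicked")
}
```

Every task queued before the drop is still processed, which is the property that lets drain proceed without an explicit shutdown message.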
Profile lookup (coordinator.rs)
// assign_task, before:
.map(|a| (a.id.clone(), profiles.get(&a.id).cloned()))

// assign_task, after:
.map(|a| {
    let key = a.stable_id_or_role();
    (a.id.clone(), profiles.get(key).cloned())
})
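To see the effect of the key change end to end, the sketch below builds the candidate list for two generations of the same role and checks that both resolve to one profile. It is a simplified model: Agent and the single-field LearningProfile are stand-ins for the real types, and candidates is a hypothetical free function wrapping the "after" closure.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct LearningProfile {
    pub tasks_completed: u32, // illustrative field, not the real profile shape
}

#[derive(Debug, Clone)]
pub struct Agent {
    pub id: String,        // ephemeral instance id
    pub stable_id: String, // role-derived key
    pub role: String,
}

impl Agent {
    pub fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
    }
}

/// Pairs each instance id with the profile stored under its stable key.
pub fn candidates(
    agents: &[Agent],
    profiles: &HashMap<String, LearningProfile>,
) -> Vec<(String, Option<LearningProfile>)> {
    agents
        .iter()
        .map(|a| {
            let key = a.stable_id_or_role();
            (a.id.clone(), profiles.get(key).cloned())
        })
        .collect()
}
```

An instance created after a reload carries a new id but the same role, so it is paired with the profile accumulated by its predecessors; an agent whose role has no profile yet is paired with None.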
Hot-reload entry points (server.rs)
Two entry points invoke the same reload_agents function:
// SIGHUP
while sighup.recv().await.is_some() {
    handle_sighup_reload(&state, &registry).await;
}

// REST
.route("/reload", axum::routing::post(reload_handler))
reload_agents sequence:
- registry.list_roles() → drain each role via coordinator.drain_role
- Re-spawn capability executors from CapabilityRegistry
- Re-spawn config agents not covered by capabilities
- Return registry.total_count()
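The four steps can be sketched synchronously as follows. This is an illustrative model only: the real reload_agents is async and works against the coordinator, CapabilityRegistry, and config file, whereas here Coordinator is a single-threaded stand-in, role lists are plain string slices, and instance ids come from a counter instead of Uuid::new_v4().

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Default)]
pub struct Coordinator {
    agents: HashMap<String, String>, // instance id -> role
    next_id: u64,                    // stand-in for Uuid::new_v4()
}

impl Coordinator {
    pub fn spawn(&mut self, role: &str) -> String {
        self.next_id += 1;
        let id = format!("instance-{}", self.next_id);
        self.agents.insert(id.clone(), role.to_string());
        id
    }

    /// Distinct roles currently registered, in sorted order.
    pub fn list_roles(&self) -> Vec<String> {
        let roles: BTreeSet<&String> = self.agents.values().collect();
        roles.into_iter().cloned().collect()
    }

    pub fn drain_role(&mut self, role: &str) -> Vec<String> {
        let ids: Vec<String> = self
            .agents
            .iter()
            .filter(|(_, r)| r.as_str() == role)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &ids {
            self.agents.remove(id);
        }
        ids
    }

    pub fn total_count(&self) -> usize {
        self.agents.len()
    }
}

pub fn reload_agents(c: &mut Coordinator, capability_roles: &[&str], config_roles: &[&str]) -> usize {
    // 1. Drain every currently registered role.
    for role in c.list_roles() {
        c.drain_role(&role);
    }
    // 2. Re-spawn capability executors.
    for role in capability_roles {
        c.spawn(role);
    }
    // 3. Re-spawn config agents whose role no capability already covers.
    for role in config_roles {
        if !capability_roles.contains(role) {
            c.spawn(role);
        }
    }
    // 4. Report how many agents are registered after the reload.
    c.total_count()
}
```

Roles that appear in neither list after a config edit simply disappear, and every surviving role comes back with fresh instance ids.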
Availability Window
reload_agents drains all roles before re-spawning. During the window between the last drain and the first successful register_agent, assign_task for those roles returns CoordinatorError::NoAvailableAgent. This window is typically sub-millisecond on the same thread, but callers must handle this error and retry.
This is a deliberate trade-off: atomic swap-in of new executors would require a blue-green registry pattern, adding significant complexity for a latency window that is orders of magnitude shorter than any typical LLM call (which takes 500ms–30s).
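On the caller side, handling the window amounts to a bounded retry on NoAvailableAgent. The helper below is a minimal blocking sketch, not part of the VAPORA API: assign_with_retry, its parameters, and the Other variant are hypothetical, and async callers would swap thread::sleep for their runtime's sleep.

```rust
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum CoordinatorError {
    NoAvailableAgent,
    Other(String), // stand-in for the remaining variants
}

/// Calls `assign` until it stops reporting NoAvailableAgent or `max_attempts`
/// calls have been made (always at least one), sleeping `delay` in between.
/// Any other error is returned immediately.
pub fn assign_with_retry<T, F>(
    mut assign: F,
    max_attempts: u32,
    delay: Duration,
) -> Result<T, CoordinatorError>
where
    F: FnMut() -> Result<T, CoordinatorError>,
{
    let mut attempt = 1;
    loop {
        match assign() {
            Err(CoordinatorError::NoAvailableAgent) if attempt < max_attempts => {
                attempt += 1;
                thread::sleep(delay);
            }
            other => return other,
        }
    }
}
```

Because the gap is far shorter than a typical LLM call, a handful of attempts with a delay of a few milliseconds should be enough in practice; errors other than NoAvailableAgent are not retried.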
Out of Scope
- BudgetManager reload: budget limit changes require process restart. The BudgetManager is constructed once from config in main() and stored in AppState. Adding reload support requires either a RwLock<BudgetConfig> wrapper or rebuilding the manager and swapping it in AppState under a lock.
- LLMRouter reload: provider API key changes require restart for the same reason.
Alternatives Considered
UUID + external persistence of UUID→role mapping
Would preserve per-instance identity. Rejected: adds a SurrealDB table (UUID→role) that must be kept in sync across restarts, adds a lookup on every assign_task, and provides no additional value since role-level profiles already capture collective expertise.
Blue-green registry swap
Two AgentRegistry instances: old one drains while new one accepts assignments. Rejected: requires AgentCoordinator to hold Arc<RwLock<Arc<AgentRegistry>>> and all call sites to acquire the inner lock on every call. Complexity disproportionate to the gain (sub-millisecond → zero gap).
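For comparison, the rejected pattern would look roughly like the sketch below. It is hypothetical code written to illustrate the cost, not something that exists in the repository: AgentRegistry is reduced to a generation counter, and the current and swap methods are invented names.

```rust
use std::sync::{Arc, RwLock};

/// Placeholder registry; the generation field only makes swaps observable.
pub struct AgentRegistry {
    pub generation: u64,
}

pub struct Coordinator {
    registry: Arc<RwLock<Arc<AgentRegistry>>>,
}

impl Coordinator {
    pub fn new(initial: AgentRegistry) -> Self {
        Self { registry: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Every call site must take the outer read lock to find the live registry.
    pub fn current(&self) -> Arc<AgentRegistry> {
        Arc::clone(&self.registry.read().expect("registry slot read lock"))
    }

    /// Installs the new registry and hands back the old one so it can drain.
    pub fn swap(&self, next: AgentRegistry) -> Arc<AgentRegistry> {
        let mut slot = self.registry.write().expect("registry slot write lock");
        std::mem::replace(&mut *slot, Arc::new(next))
    }
}
```

Callers that fetched the registry before the swap keep using the old instance until they drop it, which is what removes the gap, at the price of an extra lock acquisition on every access.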
Versioned stable_id (e.g., developer-v2)
For breaking role renames. Rejected: out of scope; role renames already require explicit operator action.
Trade-offs
Pros:
- Learning profiles survive indefinitely across restarts and hot-reloads
- SIGHUP and POST /reload provide two operator-friendly reload paths
- stable_id_or_role() fallback ensures backward compatibility with persisted data that predates this change
- drain_role cleans up cleanly: no stale executor channels, no MaxAgentsReached on re-register
Cons:
- All agents of the same role share one learning profile. Per-instance specialization (e.g., "this specific GPU node is faster at inference") is not representable. Acceptable: VAPORA's role model deliberately treats same-role agents as interchangeable for task routing purposes.
- Brief NoAvailableAgent window during reload (see Availability Window above).
- BudgetManager and LLMRouter not reloadable without restart.
Verification
cargo test -p vapora-agents test_stable_id_deterministic
cargo test -p vapora-agents test_drain_role
cargo test -p vapora-agents test_profile_survives_role_drain
cargo test -p vapora-agents test_list_roles
# Hot-reload via signal
kill -HUP $(pgrep vapora-agents)
# Hot-reload via REST
curl -s -X POST http://localhost:9000/reload | jq .
# Expected: {"reloaded": true, "agents": N}
cargo clippy -p vapora-agents -- -D warnings
Consequences
- AgentMetadata gains a new field stable_id with #[serde(default)]. Existing serialized records deserialize cleanly; stable_id_or_role() falls back to role.
- KG execution records (the agent_id field in SurrealDB) now store stable_id (= role) instead of a UUID. Existing records with UUID keys remain in the database but are no longer updated; they can be cleaned up with a migration if needed.
- ADR-0014 (Learning Profiles) and ADR-0015 (Budget Enforcement) are unaffected at the API level; only the internal key used to look up profiles changes.
References
- ADR-0014 — Learning Profiles
- ADR-0015 — Budget Enforcement
- ADR-0026 — Arc-Based Shared State
- crates/vapora-agents/src/registry.rs: AgentMetadata, drain_role, list_roles
- crates/vapora-agents/src/coordinator.rs: drain_role, registry_arc, profile lookup
- crates/vapora-agents/src/bin/server.rs: reload_agents, SIGHUP handler, /reload endpoint