feat(agents): stable identity + hot-reload for zero learning loss on config change

Introduce stable_id = role on AgentMetadata so learning profiles and KG
  execution records survive process restarts and hot-reloads. Previously
  every Uuid::new_v4() rotation orphaned accumulated expertise.

  - registry: add stable_id field (serde default, backward-compatible),
    stable_id_or_role() fallback helper, drain_role(), list_roles()
  - coordinator: profile lookup and KG writes use stable_id_or_role()
    instead of the ephemeral UUID; drain_role() drops Sender to close
    mpsc channels after in-flight messages drain; registry_arc() accessor
  - executor: agent_id written to KG now uses stable_id_or_role()
  - server: reload_agents() drain-and-respawn function; SIGHUP handler
    via while sighup.recv().await.is_some(); POST /reload endpoint;
    AppState extended with config_path, router, cap_registry
  - fix: SIGHUP recv() spin-loop guard (is_some())
  - fix: io_other_error clippy lint in vapora-agents, vapora-llm-router,
    vapora-workflow-engine (std::io::Error::other instead of Error::new)
  - docs: ADR-0040, CHANGELOG entry, README hot-reload section
Jesús Pérez 2026-03-02 22:54:28 +00:00
parent 847523e4d4
commit c5f4caa2ab
Signed by: jesus
GPG key ID: 9F243E355E0BC939
37 changed files with 1689 additions and 662 deletions


@@ -7,6 +7,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
### Added - Agent Hot-Reload: Stable Identity and Zero-Downtime Config Reload
#### `vapora-agents` — stable_id, drain/respawn, SIGHUP, REST endpoint
- **`AgentMetadata::stable_id`** (`registry.rs`): New `#[serde(default)]` field computed as `role.clone()` at construction, before the `role` field is moved. `stable_id_or_role()` helper returns `stable_id` if non-empty, otherwise falls back to `role` for backward compatibility with serialized records that predate this change.
- **Profile key switch** (`coordinator.rs`): `assign_task` and `load_all_learning_profiles` now look up `learning_profiles` by `stable_id_or_role()` instead of the ephemeral UUID `id`. Learning expertise accumulated across any number of restarts or hot-reloads is no longer orphaned.
- **KG execution records** (`executor.rs`): The `agent_id` field written to SurrealDB on task completion now uses `stable_id_or_role()` (= role) instead of the per-instance UUID. Execution history is partitioned by role, consistent with how profiles are keyed.
- **`AgentRegistry::drain_role`** (`registry.rs`): Removes all agents for a given role from the registry and clears `running_count`, enabling immediate re-registration without hitting `MaxAgentsReached`.
- **`AgentRegistry::list_roles`** (`registry.rs`): Returns the sorted list of distinct roles currently registered.
- **`AgentCoordinator::drain_role`** (`coordinator.rs`): Calls `registry.drain_role`, then drops the corresponding `Sender` entries from `executor_channels`. Dropping the `Sender` closes the mpsc channel; each executor's `while let Some(task) = rx.recv().await` loop exits after draining buffered messages — no explicit shutdown signal needed. `learning_profiles` is untouched.
- **`AgentCoordinator::registry_arc`** (`coordinator.rs`): New accessor returning `Arc<AgentRegistry>`, used by the `/reload` endpoint to obtain the registry without exposing coordinator internals.
- **`reload_agents`** (`server.rs`): Async function that (1) drains all active roles, (2) re-spawns capability executors from `CapabilityRegistry`, (3) re-spawns config agents not covered by capabilities, (4) returns the new total agent count. Learning profiles survive the entire sequence.
- **SIGHUP handler** (`server.rs`): `tokio::signal::unix::signal(SignalKind::hangup())` drives a `while sighup.recv().await.is_some()` loop calling `handle_sighup_reload`. The `is_some()` guard prevents a spin-loop if the signal stream closes during runtime shutdown.
- **`POST /reload` endpoint** (`server.rs`): HTTP entry point for operators and CI pipelines. Loads a fresh config from `config_path`, calls `reload_agents`, returns `{"reloaded": true, "agents": N}` on success or `500` with error detail on failure.
- **Availability window documented**: `reload_agents` has a brief window (typically sub-millisecond) between drain and re-registration during which `assign_task` returns `NoAvailableAgent`. Callers must handle and retry. `BudgetManager` and `LLMRouter` are not reloaded; changes to those require a process restart.
#### Tests added (`vapora-agents`)
- `test_stable_id_deterministic` — two `AgentMetadata::new("developer", ...)` produce distinct `id` but identical `stable_id = "developer"`
- `test_stable_id_or_role_fallback` — empty `stable_id` falls back to `role`
- `test_drain_role` — after `drain_role("developer")`, reviewer agents persist and developer count is zero
- `test_list_roles` — returns correct sorted list after mixed-role registrations
- `test_profile_survives_role_drain` — `get_learning_profile("developer")` returns `Some` after `drain_role("developer")`
#### Clippy fixes (pre-existing, unblocked by this change)
- `vapora-workflow-engine/src/config.rs`: 4× `std::io::Error::new(ErrorKind::Other, ...)``std::io::Error::other(...)`
- `vapora-llm-router/src/budget.rs`: 4× same pattern
- `vapora-llm-router/src/config.rs`: 3× same pattern
- `vapora-agents/src/config.rs`: 4× same pattern
---
### Fixed - Stub Elimination: Real Implementations for 6 Hollow Integration Points
#### `vapora-backend` — WorkflowOrchestrator and WorkflowService wiring


@@ -99,7 +99,15 @@
- **Workflow events**: `on_stage_complete`, `on_stage_failed`, `on_completed`, `on_cancelled` — per-workflow routing config
- **REST API**: `GET /api/v1/channels` (list), `POST /api/v1/channels/:name/test` (connectivity check)
-### 🧠 Intelligent Learning & Cost Optimization (Phase 5.3 + 5.4)
+### ♻️ Agent Hot-Reload — Zero Config-Rotation Learning Loss
- **`stable_id` identity**: Each agent carries a `stable_id = role` that is deterministic across restarts. Learning profiles and KG execution records are keyed by `stable_id`, not the ephemeral UUID, so accumulated expertise survives every reload.
- **SIGHUP reload**: `kill -HUP $(pgrep vapora-agents)` drains all executors and re-spawns them from the updated config without restarting the process.
- **`POST /reload` endpoint**: HTTP entry point for CI pipelines and operators. Returns `{"reloaded": true, "agents": N}` on success.
- **Graceful drain**: Dropping the executor `Sender` closes the mpsc channel; in-flight messages are drained before the executor exits — no tasks are lost during reload.
- **Profile persistence**: `learning_profiles` (keyed by `stable_id`) is never touched during drain/respawn. New executor instances inherit accumulated expertise immediately.
### 🧠 Intelligent Learning & Cost Optimization
- **Per-Task-Type Learning**: Agents build expertise profiles from execution history
- **Recency Bias**: Recent performance weighted 3x (last 7 days) for adaptive selection


@@ -1,39 +0,0 @@
# Agent Role Budget Configuration
# Defines monthly and weekly spending limits per agent role
# Budget enforcement prevents runaway LLM costs
# Fallback providers used when budget thresholds exceeded
[budgets.architect]
role = "architect"
monthly_limit_cents = 50000 # $500/month
weekly_limit_cents = 12500 # $125/week
fallback_provider = "gemini" # Cheaper alternative when budget hit
alert_threshold = 0.8 # Alert at 80% utilization
[budgets.developer]
role = "developer"
monthly_limit_cents = 30000 # $300/month
weekly_limit_cents = 7500 # $75/week
fallback_provider = "ollama" # Free local model
alert_threshold = 0.8
[budgets.reviewer]
role = "reviewer"
monthly_limit_cents = 20000 # $200/month
weekly_limit_cents = 5000 # $50/week
fallback_provider = "gemini"
alert_threshold = 0.8
[budgets.documenter]
role = "documenter"
monthly_limit_cents = 15000 # $150/month
weekly_limit_cents = 3750 # $37.50/week
fallback_provider = "ollama"
alert_threshold = 0.8
[budgets.tester]
role = "tester"
monthly_limit_cents = 25000 # $250/month
weekly_limit_cents = 6250 # $62.50/week
fallback_provider = "ollama"
alert_threshold = 0.8

config/agents.ncl (new file, 120 lines)

@@ -0,0 +1,120 @@
let C = import "../nickel/agents/contracts.ncl" in
{
registry | C.RegistryConfig = {
max_agents_per_role = 5,
health_check_interval = 30,
agent_timeout = 300,
},
agents | Array C.AgentDefinition = [
{
role = "architect",
description = "System design, architecture decisions, ADRs",
llm_provider = "claude",
llm_model = "claude-opus-4-20250514",
parallelizable = false,
priority = 100,
capabilities = ["system_design", "architecture", "adr", "patterns"],
},
{
role = "developer",
description = "Code implementation, feature development",
llm_provider = "claude",
llm_model = "claude-sonnet-4-5-20250929",
parallelizable = true,
priority = 80,
capabilities = ["coding", "implementation", "debugging"],
},
{
role = "code_reviewer",
description = "Code quality assurance, style checking",
llm_provider = "claude",
llm_model = "claude-sonnet-4-5-20250929",
parallelizable = true,
priority = 70,
capabilities = ["code_review", "quality", "best_practices"],
},
{
role = "tester",
description = "Tests, benchmarks, quality validation",
llm_provider = "claude",
llm_model = "claude-sonnet-4-5-20250929",
parallelizable = true,
priority = 75,
capabilities = ["testing", "benchmarks", "validation"],
},
{
role = "documenter",
description = "Documentation, root files (README, CHANGELOG)",
llm_provider = "openai",
llm_model = "gpt-4o",
parallelizable = true,
priority = 60,
capabilities = ["documentation", "readme", "changelog", "guides"],
},
{
role = "marketer",
description = "Marketing content, announcements",
llm_provider = "claude",
llm_model = "claude-sonnet-4-5-20250929",
parallelizable = true,
priority = 40,
capabilities = ["marketing", "content", "announcements"],
},
{
role = "presenter",
description = "Presentations, slides, demos",
llm_provider = "claude",
llm_model = "claude-sonnet-4-5-20250929",
parallelizable = false,
priority = 50,
capabilities = ["presentations", "slides", "demos"],
},
{
role = "devops",
description = "CI/CD, deployment, infrastructure",
llm_provider = "claude",
llm_model = "claude-sonnet-4-5-20250929",
parallelizable = true,
priority = 85,
capabilities = ["cicd", "deployment", "kubernetes", "infrastructure"],
},
{
role = "monitor",
description = "System health, alerting, observability",
llm_provider = "gemini",
llm_model = "gemini-2.0-flash",
parallelizable = false,
priority = 90,
capabilities = ["monitoring", "health", "alerts", "metrics"],
},
{
role = "security",
description = "Security audit, vulnerability detection",
llm_provider = "claude",
llm_model = "claude-opus-4-20250514",
parallelizable = true,
priority = 95,
capabilities = ["security", "audit", "vulnerabilities"],
},
{
role = "project_manager",
description = "Roadmap, task tracking, coordination",
llm_provider = "claude",
llm_model = "claude-sonnet-4-5-20250929",
parallelizable = false,
priority = 65,
capabilities = ["planning", "tracking", "coordination"],
},
{
role = "decision_maker",
description = "Conflict resolution, strategic decisions",
llm_provider = "claude",
llm_model = "claude-opus-4-20250514",
parallelizable = false,
priority = 100,
capabilities = ["decisions", "conflict_resolution", "strategy"],
},
],
}


@@ -1,122 +0,0 @@
# Agent Registry Configuration
# Phase 0: Definition of 12 agent roles
[registry]
# Maximum number of concurrent agents per role
max_agents_per_role = 5
# Agent health check interval (seconds)
health_check_interval = 30
# Agent timeout (seconds)
agent_timeout = 300
# The 12 Agent Roles
[[agents]]
role = "architect"
description = "System design, architecture decisions, ADRs"
llm_provider = "claude"
llm_model = "claude-opus-4-20250514"
parallelizable = false
priority = 100
capabilities = ["system_design", "architecture", "adr", "patterns"]
[[agents]]
role = "developer"
description = "Code implementation, feature development"
llm_provider = "claude"
llm_model = "claude-sonnet-4-5-20250929"
parallelizable = true
priority = 80
capabilities = ["coding", "implementation", "debugging"]
[[agents]]
role = "code_reviewer"
description = "Code quality assurance, style checking"
llm_provider = "claude"
llm_model = "claude-sonnet-4-5-20250929"
parallelizable = true
priority = 70
capabilities = ["code_review", "quality", "best_practices"]
[[agents]]
role = "tester"
description = "Tests, benchmarks, quality validation"
llm_provider = "claude"
llm_model = "claude-sonnet-4-5-20250929"
parallelizable = true
priority = 75
capabilities = ["testing", "benchmarks", "validation"]
[[agents]]
role = "documenter"
description = "Documentation, root files (README, CHANGELOG)"
llm_provider = "openai"
llm_model = "gpt-4o"
parallelizable = true
priority = 60
capabilities = ["documentation", "readme", "changelog", "guides"]
[[agents]]
role = "marketer"
description = "Marketing content, announcements"
llm_provider = "claude"
llm_model = "claude-sonnet-4-5-20250929"
parallelizable = true
priority = 40
capabilities = ["marketing", "content", "announcements"]
[[agents]]
role = "presenter"
description = "Presentations, slides, demos"
llm_provider = "claude"
llm_model = "claude-sonnet-4-5-20250929"
parallelizable = false
priority = 50
capabilities = ["presentations", "slides", "demos"]
[[agents]]
role = "devops"
description = "CI/CD, deployment, infrastructure"
llm_provider = "claude"
llm_model = "claude-sonnet-4-5-20250929"
parallelizable = true
priority = 85
capabilities = ["cicd", "deployment", "kubernetes", "infrastructure"]
[[agents]]
role = "monitor"
description = "System health, alerting, observability"
llm_provider = "gemini"
llm_model = "gemini-2.0-flash"
parallelizable = false
priority = 90
capabilities = ["monitoring", "health", "alerts", "metrics"]
[[agents]]
role = "security"
description = "Security audit, vulnerability detection"
llm_provider = "claude"
llm_model = "claude-opus-4-20250514"
parallelizable = true
priority = 95
capabilities = ["security", "audit", "vulnerabilities"]
[[agents]]
role = "project_manager"
description = "Roadmap, task tracking, coordination"
llm_provider = "claude"
llm_model = "claude-sonnet-4-5-20250929"
parallelizable = false
priority = 65
capabilities = ["planning", "tracking", "coordination"]
[[agents]]
role = "decision_maker"
description = "Conflict resolution, strategic decisions"
llm_provider = "claude"
llm_model = "claude-opus-4-20250514"
parallelizable = false
priority = 100
capabilities = ["decisions", "conflict_resolution", "strategy"]

config/budgets.ncl (new file, 45 lines)

@@ -0,0 +1,45 @@
let C = import "../nickel/budgets/contracts.ncl" in
{
budgets = {
architect | C.RoleBudget = {
role = "architect",
monthly_limit_cents = 50000,
weekly_limit_cents = 12500,
fallback_provider = "gemini",
alert_threshold = 0.8,
},
developer | C.RoleBudget = {
role = "developer",
monthly_limit_cents = 30000,
weekly_limit_cents = 7500,
fallback_provider = "ollama",
alert_threshold = 0.8,
},
reviewer | C.RoleBudget = {
role = "reviewer",
monthly_limit_cents = 20000,
weekly_limit_cents = 5000,
fallback_provider = "gemini",
alert_threshold = 0.8,
},
documenter | C.RoleBudget = {
role = "documenter",
monthly_limit_cents = 15000,
weekly_limit_cents = 3750,
fallback_provider = "ollama",
alert_threshold = 0.8,
},
tester | C.RoleBudget = {
role = "tester",
monthly_limit_cents = 25000,
weekly_limit_cents = 6250,
fallback_provider = "ollama",
alert_threshold = 0.8,
},
},
}

config/channels.ncl (new file, 10 lines)

@@ -0,0 +1,10 @@
{
channels = {},
notifications = {
on_task_done = [],
on_proposal_approved = [],
on_proposal_rejected = [],
on_agent_inactive = [],
},
}

config/config.ncl (new file, 6 lines)

@@ -0,0 +1,6 @@
(import "./server.ncl")
& (import "./agents.ncl")
& (import "./llm-router.ncl")
& (import "./budgets.ncl")
& (import "./workflows.ncl")
& (import "./channels.ncl")

config/llm-router.ncl (new file, 80 lines)

@@ -0,0 +1,80 @@
let C = import "../nickel/llm-router/contracts.ncl" in
{
routing | C.RoutingConfig = {
default_provider = "claude",
cost_tracking_enabled = true,
fallback_enabled = true,
},
providers = {
claude | C.ProviderConfig = {
enabled = true,
api_key = "${ANTHROPIC_API_KEY}",
model = "claude-sonnet-4-5-20250929",
max_tokens = 8192,
temperature = 0.7,
cost_per_1m_input = 3.00,
cost_per_1m_output = 15.00,
},
openai | C.ProviderConfig = {
enabled = true,
api_key = "${OPENAI_API_KEY}",
model = "gpt-4o",
max_tokens = 4096,
temperature = 0.7,
cost_per_1m_input = 2.50,
cost_per_1m_output = 10.00,
},
gemini | C.ProviderConfig = {
enabled = true,
api_key = "${GOOGLE_API_KEY}",
model = "gemini-2.0-flash",
max_tokens = 8192,
temperature = 0.7,
cost_per_1m_input = 0.30,
cost_per_1m_output = 1.20,
},
ollama | C.ProviderConfig = {
enabled = true,
url = "${OLLAMA_URL:-http://localhost:11434}",
model = "llama3.2",
max_tokens = 4096,
temperature = 0.7,
cost_per_1m_input = 0.00,
cost_per_1m_output = 0.00,
},
},
routing_rules | Array C.RoutingRule = [
{
name = "architecture_design",
condition = { task_type = "architecture" },
provider = "claude",
model_override = "claude-opus-4-20250514",
},
{
name = "code_generation",
condition = { task_type = "development" },
provider = "claude",
},
{
name = "documentation",
condition = { task_type = "documentation" },
provider = "openai",
},
{
name = "monitoring",
condition = { task_type = "monitoring" },
provider = "gemini",
},
{
name = "local_testing",
condition = { environment = "development" },
provider = "ollama",
},
],
}


@@ -1,87 +0,0 @@
# Multi-IA Router Configuration
# Phase 0: Configuration for LLM provider selection
[routing]
# Default provider if no specific routing rules match
default_provider = "claude"
# Enable cost tracking
cost_tracking_enabled = true
# Enable fallback on provider failure
fallback_enabled = true
[providers.claude]
enabled = true
# ANTHROPIC_API_KEY environment variable required
api_key = "${ANTHROPIC_API_KEY}"
model = "claude-sonnet-4-5-20250929"
max_tokens = 8192
temperature = 0.7
# Cost per 1M tokens (input/output)
cost_per_1m_input = 3.00
cost_per_1m_output = 15.00
[providers.openai]
enabled = true
# OPENAI_API_KEY environment variable required
api_key = "${OPENAI_API_KEY}"
model = "gpt-4o"
max_tokens = 4096
temperature = 0.7
# Cost per 1M tokens (input/output)
cost_per_1m_input = 2.50
cost_per_1m_output = 10.00
[providers.gemini]
enabled = true
# GOOGLE_API_KEY environment variable required
api_key = "${GOOGLE_API_KEY}"
model = "gemini-2.0-flash"
max_tokens = 8192
temperature = 0.7
# Cost per 1M tokens (input/output)
cost_per_1m_input = 0.30
cost_per_1m_output = 1.20
[providers.ollama]
enabled = true
# Local Ollama instance, no API key needed
url = "${OLLAMA_URL:-http://localhost:11434}"
model = "llama3.2"
max_tokens = 4096
temperature = 0.7
# No cost for local models
cost_per_1m_input = 0.00
cost_per_1m_output = 0.00
# Routing rules: assign providers based on task characteristics
[[routing_rules]]
name = "architecture_design"
condition = { task_type = "architecture" }
provider = "claude"
model_override = "claude-opus-4-20250514"
[[routing_rules]]
name = "code_generation"
condition = { task_type = "development" }
provider = "claude"
[[routing_rules]]
name = "documentation"
condition = { task_type = "documentation" }
provider = "openai"
[[routing_rules]]
name = "monitoring"
condition = { task_type = "monitoring" }
provider = "gemini"
[[routing_rules]]
name = "local_testing"
condition = { environment = "development" }
provider = "ollama"

config/server.ncl (new file, 38 lines)

@@ -0,0 +1,38 @@
let C = import "../nickel/vapora/contracts.ncl" in
{
server | C.ServerConfig = {
host = "127.0.0.1",
port = 3000,
tls = {
enabled = false,
cert_path = "",
key_path = "",
},
},
database | C.DatabaseConfig = {
url = "ws://localhost:8000",
max_connections = 10,
},
nats | C.NatsConfig = {
url = "nats://localhost:4222",
stream_name = "vapora-tasks",
},
auth | C.AuthConfig = {
jwt_secret = "change-in-production",
jwt_expiration_hours = 24,
},
logging | C.LoggingConfig = {
level = "info",
json = false,
},
metrics | C.MetricsConfig = {
enabled = true,
port = 9090,
},
}


@@ -1,40 +0,0 @@
# VAPORA Server Configuration
# Phase 0: Environment-based configuration
# Note: Load runtime configuration from environment variables, not this file
[server]
# Server configuration (override with env vars: VAPORA_HOST, VAPORA_PORT)
host = "127.0.0.1"
port = 3000
[server.tls]
# TLS configuration (optional)
# Override with: VAPORA_TLS_ENABLED, VAPORA_TLS_CERT_PATH, VAPORA_TLS_KEY_PATH
enabled = false
cert_path = ""
key_path = ""
[database]
# Database connection (override with: VAPORA_DB_URL, VAPORA_DB_MAX_CONNECTIONS)
url = "ws://localhost:8000"
max_connections = 10
[nats]
# NATS JetStream configuration (override with: VAPORA_NATS_URL, VAPORA_NATS_STREAM)
url = "nats://localhost:4222"
stream_name = "vapora-tasks"
[auth]
# Authentication configuration (override with: VAPORA_JWT_SECRET, VAPORA_JWT_EXPIRATION_HOURS)
jwt_secret = "change-in-production"
jwt_expiration_hours = 24
[logging]
# Logging configuration (override with: VAPORA_LOG_LEVEL, VAPORA_LOG_JSON)
level = "info"
json = false
[metrics]
# Metrics configuration (override with: VAPORA_METRICS_ENABLED, VAPORA_METRICS_PORT)
enabled = true
port = 9090

config/workflows.ncl (new file, 153 lines)

@@ -0,0 +1,153 @@
let C = import "../nickel/workflows/contracts.ncl" in
{
engine | C.EngineConfig = {
max_parallel_tasks = 10,
workflow_timeout = 3600,
approval_gates_enabled = true,
},
workflows = [
{
name = "feature_development",
trigger = "manual",
stages | Array C.StageConfig = [
{
name = "architecture_design",
agents = ["architect"],
parallel = false,
approval_required = false,
},
{
name = "implementation",
agents = ["developer", "developer"],
parallel = true,
max_parallel = 2,
approval_required = false,
},
{
name = "testing",
agents = ["tester"],
parallel = false,
approval_required = false,
},
{
name = "code_review",
agents = ["reviewer"],
parallel = false,
approval_required = true,
},
{
name = "deployment",
agents = ["devops"],
parallel = false,
approval_required = true,
},
],
notifications = {
on_completed = [],
on_failed = [],
on_approval_required = [],
},
},
{
name = "bugfix",
trigger = "manual",
stages | Array C.StageConfig = [
{
name = "investigation",
agents = ["developer"],
parallel = false,
approval_required = false,
},
{
name = "fix_implementation",
agents = ["developer"],
parallel = false,
approval_required = false,
},
{
name = "testing",
agents = ["tester"],
parallel = false,
approval_required = false,
},
{
name = "deployment",
agents = ["devops"],
parallel = false,
approval_required = false,
},
],
notifications = {
on_completed = [],
on_failed = [],
on_approval_required = [],
},
},
{
name = "documentation_update",
trigger = "manual",
stages | Array C.StageConfig = [
{
name = "content_creation",
agents = ["technical_writer"],
parallel = false,
approval_required = false,
},
{
name = "review",
agents = ["reviewer"],
parallel = false,
approval_required = true,
},
{
name = "publish",
agents = ["devops"],
parallel = false,
approval_required = false,
},
],
notifications = {
on_completed = [],
on_failed = [],
on_approval_required = [],
},
},
{
name = "security_audit",
trigger = "manual",
stages | Array C.StageConfig = [
{
name = "code_analysis",
agents = ["security_engineer"],
parallel = false,
approval_required = false,
},
{
name = "penetration_testing",
agents = ["security_engineer"],
parallel = false,
approval_required = false,
},
{
name = "remediation",
agents = ["developer"],
parallel = false,
approval_required = false,
},
{
name = "verification",
agents = ["security_engineer"],
parallel = false,
approval_required = true,
},
],
notifications = {
on_completed = [],
on_failed = [],
on_approval_required = [],
},
},
],
}


@@ -1,117 +0,0 @@
[engine]
max_parallel_tasks = 10
workflow_timeout = 3600
approval_gates_enabled = true
[[workflows]]
name = "feature_development"
trigger = "manual"
[[workflows.stages]]
name = "architecture_design"
agents = ["architect"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "implementation"
agents = ["developer", "developer"]
parallel = true
max_parallel = 2
approval_required = false
[[workflows.stages]]
name = "testing"
agents = ["tester"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "code_review"
agents = ["reviewer"]
parallel = false
approval_required = true
[[workflows.stages]]
name = "deployment"
agents = ["devops"]
parallel = false
approval_required = true
[[workflows]]
name = "bugfix"
trigger = "manual"
[[workflows.stages]]
name = "investigation"
agents = ["developer"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "fix_implementation"
agents = ["developer"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "testing"
agents = ["tester"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "deployment"
agents = ["devops"]
parallel = false
approval_required = false
[[workflows]]
name = "documentation_update"
trigger = "manual"
[[workflows.stages]]
name = "content_creation"
agents = ["technical_writer"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "review"
agents = ["reviewer"]
parallel = false
approval_required = true
[[workflows.stages]]
name = "publish"
agents = ["devops"]
parallel = false
approval_required = false
[[workflows]]
name = "security_audit"
trigger = "manual"
[[workflows.stages]]
name = "code_analysis"
agents = ["security_engineer"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "penetration_testing"
agents = ["security_engineer"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "remediation"
agents = ["developer"]
parallel = false
approval_required = false
[[workflows.stages]]
name = "verification"
agents = ["security_engineer"]
parallel = false
approval_required = true


@@ -5,7 +5,7 @@ use std::collections::HashMap;
use std::sync::Arc;
use anyhow::Result;
-use axum::{extract::State, routing::get, Json, Router};
+use axum::{extract::State, http::StatusCode, response::IntoResponse, routing::get, Json, Router};
use clap::Parser;
use serde_json::json;
use tokio::net::TcpListener;
@@ -26,6 +26,12 @@ struct AppState {
coordinator: Arc<AgentCoordinator>,
#[allow(dead_code)]
budget_manager: Option<Arc<BudgetManager>>,
/// Path to the agent config file, used for re-loading on hot-reload.
config_path: String,
/// LLM router shared across executors.
router: Option<Arc<LLMRouter>>,
/// Capability registry for re-spawning capability executors.
cap_registry: Arc<CapabilityRegistry>,
}
#[derive(Parser, Debug)]
@@ -41,11 +47,7 @@ struct Args {
)]
struct Args {
/// Path to budget configuration file
-#[arg(
-long,
-default_value = "config/agent-budgets.toml",
-env = "BUDGET_CONFIG_PATH"
-)]
+#[arg(long, default_value = "config/config.ncl", env = "BUDGET_CONFIG_PATH")]
budget_config: String,
}
@@ -64,6 +66,8 @@ async fn main() -> Result<()> {
// Load agent configuration
let config = AgentConfig::from_env()?;
let config_path =
std::env::var("VAPORA_AGENT_CONFIG").unwrap_or_else(|_| "config/config.ncl".to_string());
info!("Loaded configuration from environment");
// Load budget configuration
@@ -96,7 +100,7 @@ async fn main() -> Result<()> {
let router = router.map(Arc::new);
// Initialize capability registry with built-in capability packages
-let cap_registry = CapabilityRegistry::with_built_ins();
+let cap_registry = Arc::new(CapabilityRegistry::with_built_ins());
info!(
"Capability registry initialized: {:?}",
cap_registry.list_ids()
@@ -118,8 +122,6 @@
let coordinator = Arc::new(coordinator);
// Spawn one executor per built-in capability, each wired to the LLM router.
-// The executor's channel sender is registered with the coordinator so that
-// assign_task() dispatches directly in-process.
for cap_id in cap_registry.list_ids() {
spawn_capability_executor(
&cap_id,
@@ -130,8 +132,7 @@
);
}
-// Spawn executors for any agents defined in agents.toml that are NOT
-// already covered by a capability package (role not registered yet).
+// Spawn executors for config agents not covered by a capability package.
for agent_def in &config.agents {
if registry.get_agents_by_role(&agent_def.role).is_empty() {
spawn_single_config_executor(agent_def, &registry, &coordinator, router.as_ref());
@@ -151,13 +152,38 @@
};
let state = AppState {
-coordinator,
+coordinator: Arc::clone(&coordinator),
budget_manager,
config_path: config_path.clone(),
router: router.clone(),
cap_registry: Arc::clone(&cap_registry),
};
// SIGHUP handler for config reload. Note: there is a brief unavailability
// window between drain and re-registration during which assign_task returns
// NoAvailableAgent. Learning profiles (keyed by stable_id) are preserved.
tokio::spawn({
let state = state.clone();
let registry = Arc::clone(&registry);
async move {
let mut sighup =
match tokio::signal::unix::signal(tokio::signal::unix::SignalKind::hangup()) {
Ok(s) => s,
Err(e) => {
warn!("Failed to register SIGHUP handler: {}", e);
return;
}
};
while sighup.recv().await.is_some() {
handle_sighup_reload(&state, &registry).await;
}
}
});
let app = Router::new()
.route("/health", get(health_handler))
.route("/ready", get(readiness_handler))
.route("/reload", axum::routing::post(reload_handler))
.with_state(state);
let addr = std::env::var("BIND_ADDR").unwrap_or_else(|_| "0.0.0.0:9000".to_string());
@@ -169,6 +195,76 @@
Ok(())
}
/// Called from the SIGHUP loop. Loads a fresh config and delegates to
/// `reload_agents`. Errors are logged; the process keeps running.
async fn handle_sighup_reload(state: &AppState, registry: &Arc<AgentRegistry>) {
info!("SIGHUP received: reloading agent configuration");
let new_config = match AgentConfig::load(&state.config_path) {
Ok(c) => c,
Err(e) => {
error!("Config parse failed during reload: {}", e);
return;
}
};
match reload_agents(
&state.coordinator,
registry,
&new_config,
&state.cap_registry,
state.router.as_ref(),
)
.await
{
Ok(n) => info!("Reload complete: {} agents active", n),
Err(e) => error!("Reload failed: {}", e),
}
}
/// Drain all roles and re-spawn executors from config + capability registry.
///
/// Learning profiles (keyed by `stable_id`) are preserved in the coordinator;
/// new executor instances inherit the accumulated expertise immediately.
///
/// # Availability window
///
/// Between the drain phase and the first successful `register_agent`, callers
/// of `assign_task` for the affected roles receive `NoAvailableAgent`. This is
/// a brief window (microseconds to low milliseconds). Callers must handle this
/// error and retry. BudgetManager and LLMRouter are not reloaded; changes to
/// those require a process restart.
async fn reload_agents(
coordinator: &Arc<AgentCoordinator>,
registry: &Arc<AgentRegistry>,
new_config: &AgentConfig,
cap_registry: &Arc<CapabilityRegistry>,
router: Option<&Arc<LLMRouter>>,
) -> anyhow::Result<usize> {
// Drain all currently active roles. The dropped Sender causes each
// executor's recv loop to exit after draining in-flight messages.
for role in registry.list_roles() {
let drained = coordinator.drain_role(&role);
if !drained.is_empty() {
info!("Drained {} agent(s) for role '{}'", drained.len(), role);
}
}
// Re-spawn capability executors
for cap_id in cap_registry.list_ids() {
spawn_capability_executor(cap_id.as_str(), cap_registry, registry, coordinator, router);
}
// Re-spawn config agents not covered by capabilities
for agent_def in &new_config.agents {
if registry.get_agents_by_role(&agent_def.role).is_empty() {
spawn_single_config_executor(agent_def, registry, coordinator, router);
}
}
let total = registry.total_count();
info!("Reload complete: {} agents active", total);
Ok(total)
}
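The availability window documented above means callers of `assign_task` should tolerate a transient `NoAvailableAgent` during a reload. A minimal retry sketch — the `AssignError` enum and the `assign` closure are stand-ins for the real coordinator API, not part of this crate:

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for the coordinator's NoAvailableAgent error variant.
#[derive(Debug, PartialEq)]
pub enum AssignError {
    NoAvailableAgent,
}

/// Retry `assign` a bounded number of times while a reload drains the registry.
pub fn assign_with_retry<F>(mut assign: F, max_retries: u32) -> Result<String, AssignError>
where
    F: FnMut() -> Result<String, AssignError>,
{
    let mut attempt = 0;
    loop {
        match assign() {
            Err(AssignError::NoAvailableAgent) if attempt < max_retries => {
                attempt += 1;
                // The drain/respawn window is typically sub-millisecond,
                // so a short fixed backoff is enough.
                sleep(Duration::from_millis(10));
            }
            other => return other,
        }
    }
}
```

With `max_retries = 0` the wrapper degrades to a single call, so existing error handling is unchanged.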
/// Activate a capability, register the resulting agent, and spawn its executor.
fn spawn_capability_executor(
cap_id: &str,
@@ -381,3 +477,34 @@ async fn readiness_handler(State(state): State<AppState>) -> Json<serde_json::Va
"agents": state.coordinator.get_agent_count().await
}))
}
async fn reload_handler(State(state): State<AppState>) -> impl IntoResponse {
let new_config = match AgentConfig::load(&state.config_path) {
Ok(c) => c,
Err(e) => {
return (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": e.to_string()})),
)
.into_response();
}
};
let registry = state.coordinator.registry_arc();
match reload_agents(
&state.coordinator,
&registry,
&new_config,
&state.cap_registry,
state.router.as_ref(),
)
.await
{
Ok(n) => (StatusCode::OK, Json(json!({"reloaded": true, "agents": n}))).into_response(),
Err(e) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": e.to_string()})),
)
.into_response(),
}
}


@@ -11,8 +11,8 @@ pub enum ConfigError {
#[error("Failed to read config file: {0}")]
ReadError(#[from] std::io::Error),
#[error("Failed to parse config: {0}")]
ParseJson(#[from] serde_json::Error),
#[error("Invalid configuration: {0}")]
ValidationError(String),
@@ -52,10 +52,55 @@ fn default_agent_timeout() -> u64 {
pub use vapora_shared::AgentDefinition;
impl AgentConfig {
/// Load configuration from a TOML or NCL file. When the path has a `.ncl`
/// extension, `nickel export --format json` is invoked and the resulting
/// JSON is parsed. Otherwise the file is read and parsed as TOML.
pub fn load<P: AsRef<Path>>(path: P) -> Result<Self, ConfigError> {
let path = path.as_ref();
let (raw, is_json) = if path.extension().and_then(|e| e.to_str()) == Some("ncl") {
let out = std::process::Command::new("nickel")
.args(["export", "--format", "json"])
.arg(path)
.output()
.map_err(|e| {
ConfigError::ReadError(std::io::Error::other(format!(
"Failed to invoke nickel for {:?}: {}",
path, e
)))
})?;
if !out.status.success() {
let stderr = String::from_utf8_lossy(&out.stderr);
return Err(ConfigError::ReadError(std::io::Error::other(format!(
"nickel export failed for {:?}: {}",
path, stderr
))));
}
let json = String::from_utf8(out.stdout).map_err(|e| {
ConfigError::ReadError(std::io::Error::other(format!(
"nickel output is not valid UTF-8: {}",
e
)))
})?;
(json, true)
} else {
let content = std::fs::read_to_string(path)?;
(content, false)
};
let interpolated = interpolate_env_vars(&raw);
let config: Self = if is_json {
serde_json::from_str(&interpolated)?
} else {
toml::from_str(&interpolated).map_err(|e| {
ConfigError::ReadError(std::io::Error::other(format!(
"Failed to parse TOML: {}",
e
)))
})?
};
config.validate()?;
Ok(config)
}
@@ -63,12 +108,11 @@ impl AgentConfig {
/// Load configuration from environment or default file
pub fn from_env() -> Result<Self, ConfigError> {
let config_path = std::env::var("VAPORA_AGENT_CONFIG")
.unwrap_or_else(|_| "config/config.ncl".to_string());
if Path::new(&config_path).exists() {
Self::load(&config_path)
} else {
Ok(Self::default())
}
}
@@ -129,6 +173,34 @@ impl Default for AgentConfig {
}
}
/// Expand every `${VAR}` / `${VAR:-default}` reference in `content`.
/// Unresolved vars without a default are replaced with an empty string.
fn interpolate_env_vars(content: &str) -> String {
let mut result = String::with_capacity(content.len());
let mut remaining = content;
while let Some(start) = remaining.find("${") {
result.push_str(&remaining[..start]);
let after_open = &remaining[start + 2..];
if let Some(close) = after_open.find('}') {
let var_expr = &after_open[..close];
let value = if let Some(sep) = var_expr.find(":-") {
let var_name = &var_expr[..sep];
let default_val = &var_expr[sep + 2..];
std::env::var(var_name).unwrap_or_else(|_| default_val.to_string())
} else {
std::env::var(var_expr).unwrap_or_default()
};
result.push_str(&value);
remaining = &after_open[close + 1..];
} else {
result.push_str("${");
remaining = after_open;
}
}
result.push_str(remaining);
result
}
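A standalone copy of the interpolation routine makes its three expansion rules easy to exercise in isolation: `${VAR}` substitutes the variable (empty string when unset), `${VAR:-default}` falls back to the default when unset, and an unterminated `${` is passed through verbatim. The function body is duplicated from above so this compiles on its own:

```rust
/// Standalone copy of interpolate_env_vars, duplicated for experimentation.
fn interpolate_env_vars(content: &str) -> String {
    let mut result = String::with_capacity(content.len());
    let mut remaining = content;
    while let Some(start) = remaining.find("${") {
        result.push_str(&remaining[..start]);
        let after_open = &remaining[start + 2..];
        if let Some(close) = after_open.find('}') {
            let var_expr = &after_open[..close];
            let value = if let Some(sep) = var_expr.find(":-") {
                // ${VAR:-default}: use the env var, else the literal default.
                std::env::var(&var_expr[..sep])
                    .unwrap_or_else(|_| var_expr[sep + 2..].to_string())
            } else {
                // ${VAR}: unset vars expand to the empty string.
                std::env::var(var_expr).unwrap_or_default()
            };
            result.push_str(&value);
            remaining = &after_open[close + 1..];
        } else {
            // No closing brace: emit "${" literally and keep scanning.
            result.push_str("${");
            remaining = after_open;
        }
    }
    result.push_str(remaining);
    result
}
```

Note the last rule: a dangling `${` does not abort parsing, so a config containing literal shell-style text degrades gracefully rather than erroring.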
#[cfg(test)]
mod tests {
use super::*;


@@ -1,5 +1,4 @@
// vapora-agents: Agent coordinator - orchestrates agent workflows
use std::collections::HashMap;
use std::path::PathBuf;
@@ -238,7 +237,7 @@ impl AgentCoordinator {
// Simple heuristic: check if title/description contains known task types
let task_type = extract_task_type(&title, &description, role);
// Get learning profiles for all candidates, keyed by stable_id.
let learning_profiles = {
let profiles = self
.learning_profiles
@@ -246,7 +245,10 @@ impl AgentCoordinator {
.unwrap_or_else(|e| e.into_inner());
candidates
.iter()
.map(|a| {
let key = a.stable_id_or_role();
(a.id.clone(), profiles.get(key).cloned())
})
.collect::<Vec<_>>()
};
@@ -426,6 +428,27 @@ impl AgentCoordinator {
Arc::clone(&self.registry)
}
/// Shared reference to the registry (for hot-reload coordination).
pub fn registry_arc(&self) -> Arc<AgentRegistry> {
Arc::clone(&self.registry)
}
/// Drain all agents for `role`: removes them from the registry and drops
/// their in-process executor channels.
///
/// Learning profiles keyed by `stable_id` are intentionally preserved so
/// that re-spawned executors immediately benefit from past expertise.
///
/// The dropped `Sender` causes the executor loop to exit once it drains
/// any in-flight messages.
pub fn drain_role(&self, role: &str) -> Vec<String> {
let ids = self.registry.drain_role(role);
for id in &ids {
self.executor_channels.remove(id);
}
ids
}
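The close-on-drop contract that `drain_role` relies on is standard mpsc behavior, sketched here with `std::sync::mpsc` (the executor uses an async channel, but the drop semantics match): buffered messages are still delivered after the last `Sender` is gone, and only then does the receive loop end.

```rust
use std::sync::mpsc;

/// Send two messages, drop the Sender, and collect whatever still arrives.
fn drained_messages() -> Vec<u32> {
    let (tx, rx) = mpsc::channel::<u32>();
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    // Analogous to drain_role removing the executor's channel entry:
    // dropping the last Sender closes the channel.
    drop(tx);
    // In-flight messages are still delivered; the iterator (like the
    // executor's recv loop) terminates once the buffer is empty.
    rx.iter().collect()
}
```

This is why a drain never loses queued work: the executor finishes everything already enqueued before its loop observes the closed channel and exits.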
/// Start coordinator (subscribe to NATS topics)
pub async fn start(&self) -> Result<(), CoordinatorError> {
if self.nats_client.is_none() {
@@ -535,17 +558,18 @@ impl AgentCoordinator {
);
for agent in agents {
let key = agent.stable_id_or_role().to_string();
match self
.load_learning_profile_from_kg(&key, task_type, kg_persistence)
.await
{
Ok(profile) => {
self.update_learning_profile(&key, profile)?;
}
Err(e) => {
warn!(
"Failed to load learning profile for agent {} (stable_id: {}): {}",
agent.id, key, e
);
// Continue with other agents on failure
}
@@ -743,6 +767,35 @@ mod tests {
assert!(task_id.is_ok());
}
#[test]
fn test_profile_survives_role_drain() {
let registry = Arc::new(AgentRegistry::new(5));
let agent = AgentMetadata::new(
"developer".to_string(),
"Dev 1".to_string(),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
registry.register_agent(agent).unwrap();
let coordinator = AgentCoordinator::with_registry(Arc::clone(&registry));
// Insert a profile under the stable_id key
let profile = crate::learning_profile::LearningProfile::new("developer".to_string());
coordinator
.update_learning_profile("developer", profile)
.unwrap();
// Drain the role (removes agents + channels)
let drained = coordinator.drain_role("developer");
assert_eq!(drained.len(), 1);
// Profile must survive the drain
let retained = coordinator.get_learning_profile("developer");
assert!(retained.is_some(), "profile must survive drain_role");
}
#[tokio::test]
async fn test_no_available_agent() {
// Set schema directory for tests (relative to workspace root)


@@ -1,6 +1,4 @@
// Profile adapter: AgentMetadata + KG metrics → Swarm AgentProfile
use vapora_swarm::messages::AgentProfile;
@@ -40,9 +38,12 @@ impl ProfileAdapter {
profile
}
/// Create a learning profile with the given ID.
///
/// Callers should pass `agent.stable_id_or_role()` (not `agent.id`) so
/// that the profile survives hot-reloads. The coordinator already enforces
/// this; use this function for constructing profiles that will be inserted
/// via `AgentCoordinator::update_learning_profile`.
pub fn create_learning_profile(agent_id: String) -> LearningProfile {
LearningProfile::new(agent_id)
}
@@ -79,6 +80,7 @@ mod tests {
fn test_profile_creation_from_metadata() {
let agent = AgentMetadata {
id: "agent-1".to_string(),
stable_id: "developer".to_string(),
role: "developer".to_string(),
name: "Dev Agent 1".to_string(),
version: "0.1.0".to_string(),
@@ -109,6 +111,7 @@ mod tests {
let agents = vec![
AgentMetadata {
id: "agent-1".to_string(),
stable_id: "developer".to_string(),
role: "developer".to_string(),
name: "Dev 1".to_string(),
version: "0.1.0".to_string(),
@@ -126,6 +129,7 @@
},
AgentMetadata {
id: "agent-2".to_string(), id: "agent-2".to_string(),
stable_id: "reviewer".to_string(),
role: "reviewer".to_string(),
name: "Reviewer 1".to_string(),
version: "0.1.0".to_string(),


@@ -1,5 +1,4 @@
// vapora-agents: Agent registry - manages agent lifecycle and availability
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
@@ -35,7 +34,12 @@ pub enum AgentStatus {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AgentMetadata {
/// Ephemeral UUID generated at runtime — changes every startup.
pub id: String,
/// Deterministic identity keyed on role — survives hot-reloads.
/// Defaults to `role` for data produced before this field existed.
#[serde(default)]
pub stable_id: String,
pub role: String,
pub name: String,
pub version: String,
@@ -66,6 +70,7 @@ impl AgentMetadata {
let now = Utc::now();
Self {
id: Uuid::new_v4().to_string(),
stable_id: role.clone(),
role,
name,
version: "0.1.0".to_string(),
@@ -90,6 +95,18 @@ impl AgentMetadata {
self
}
/// Stable profile identity: `stable_id` when non-empty, `role` otherwise.
///
/// The fallback handles data deserialized from records produced before
/// `stable_id` was introduced.
pub fn stable_id_or_role(&self) -> &str {
if self.stable_id.is_empty() {
&self.role
} else {
&self.stable_id
}
}
/// Check if agent can accept new tasks
pub fn can_accept_task(&self) -> bool {
self.status == AgentStatus::Active && self.current_tasks < self.max_concurrent_tasks
@@ -283,6 +300,40 @@ impl AgentRegistry {
let inner = self.inner.read().expect("Failed to acquire read lock");
inner.agents.len()
}
/// Remove all agents for `role` from the registry and return their
/// ephemeral IDs. The `running_count` entry for the role is also cleared.
///
/// Called during hot-reload to drain a role before re-spawning executors.
/// Learning profiles (keyed by `stable_id`) are unaffected.
pub fn drain_role(&self, role: &str) -> Vec<String> {
let mut inner = self.inner.write().expect("registry write lock");
let ids: Vec<String> = inner
.agents
.values()
.filter(|a| a.role == role)
.map(|a| a.id.clone())
.collect();
for id in &ids {
inner.agents.remove(id);
}
inner.running_count.remove(role);
ids
}
/// Return the set of roles currently present in the registry.
pub fn list_roles(&self) -> Vec<String> {
let inner = self.inner.read().expect("registry read lock");
let mut roles: Vec<String> = inner
.agents
.values()
.map(|a| a.role.clone())
.collect::<std::collections::HashSet<_>>()
.into_iter()
.collect();
roles.sort();
roles
}
}
impl Default for AgentRegistry {
@@ -363,6 +414,98 @@ mod tests {
assert_eq!(agent.total_tasks_completed, 1);
}
#[test]
fn test_stable_id_deterministic() {
let a = AgentMetadata::new(
"developer".to_string(),
"Dev 1".to_string(),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
let b = AgentMetadata::new(
"developer".to_string(),
"Dev 2".to_string(),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
assert_ne!(a.id, b.id, "ephemeral IDs must differ");
assert_eq!(a.stable_id, "developer");
assert_eq!(b.stable_id, "developer");
assert_eq!(a.stable_id_or_role(), "developer");
}
#[test]
fn test_stable_id_or_role_fallback() {
let mut agent = AgentMetadata::new(
"reviewer".to_string(),
"Rev".to_string(),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
agent.stable_id = String::new();
assert_eq!(agent.stable_id_or_role(), "reviewer");
}
#[test]
fn test_drain_role() {
let registry = AgentRegistry::new(5);
for i in 0..2 {
let agent = AgentMetadata::new(
"developer".to_string(),
format!("Dev {}", i),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
registry.register_agent(agent).unwrap();
}
let reviewer = AgentMetadata::new(
"reviewer".to_string(),
"Rev".to_string(),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
registry.register_agent(reviewer).unwrap();
let drained = registry.drain_role("developer");
assert_eq!(drained.len(), 2);
assert_eq!(registry.count_by_role("developer"), 0);
assert_eq!(registry.count_by_role("reviewer"), 1);
assert_eq!(registry.total_count(), 1);
}
#[test]
fn test_list_roles() {
let registry = AgentRegistry::new(5);
let a = AgentMetadata::new(
"developer".to_string(),
"Dev".to_string(),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
let b = AgentMetadata::new(
"reviewer".to_string(),
"Rev".to_string(),
"claude".to_string(),
"claude-sonnet-4".to_string(),
vec![],
);
registry.register_agent(a).unwrap();
registry.register_agent(b).unwrap();
let mut roles = registry.list_roles();
roles.sort();
assert_eq!(roles, vec!["developer", "reviewer"]);
}
#[test]
fn test_get_available_agent() {
let registry = AgentRegistry::new(5);


@@ -64,10 +64,13 @@ impl AgentExecutor {
/// Run the executor loop, processing tasks until the channel closes.
pub async fn run(mut self) {
info!(
"AgentExecutor started for agent: {} (stable_id: {})",
self.agent.metadata.id,
self.agent.metadata.stable_id_or_role()
);
// Use stable_id so KG records survive hot-reloads and are correlated
// across agent restarts for the same role.
let agent_id = self.agent.metadata.stable_id_or_role().to_string();
while let Some(task) = self.task_rx.recv().await {
debug!("Received task: {}", task.id);
@@ -275,6 +278,7 @@ mod tests {
async fn test_executor_creation() {
let metadata = AgentMetadata {
id: "test-executor".to_string(),
stable_id: "developer".to_string(),
role: "developer".to_string(),
name: "Test Executor".to_string(),
version: "0.1.0".to_string(),
@@ -303,6 +307,7 @@ mod tests {
fn test_executor_persistence_disabled_by_default() {
let metadata = AgentMetadata {
id: "test-no-persist".to_string(),
stable_id: "reviewer".to_string(),
role: "reviewer".to_string(),
name: "Test No Persist".to_string(),
version: "0.1.0".to_string(),


@@ -150,6 +150,7 @@ mod tests {
// Create metadata for testing
let metadata = AgentMetadata {
id: "test-agent".to_string(),
stable_id: "developer".to_string(),
role: "developer".to_string(),
name: "Test Developer".to_string(),
version: "0.1.0".to_string(),
@@ -201,6 +202,7 @@
fn test_failed_state_transition() {
let metadata = AgentMetadata {
id: "test-agent".to_string(),
stable_id: "developer".to_string(),
role: "developer".to_string(),
name: "Test Developer".to_string(),
version: "0.1.0".to_string(),


@@ -37,9 +37,6 @@ async fn test_end_to_end_learning_with_budget_enforcement() {
vec!["coding".to_string(), "documentation".to_string()],
);
registry.register_agent(developer_a).ok();
registry.register_agent(developer_b).ok();
@@ -62,9 +59,10 @@
let budget_manager = Arc::new(BudgetManager::new(budgets));
let coordinator = coordinator.with_budget_manager(budget_manager.clone());
// Build a role-level learning profile keyed by stable_id ("developer").
// All developer agents share one profile since stable_id = role.
let now = Utc::now();
let dev_executions: Vec<ExecutionData> = (0..30)
.map(|i| ExecutionData {
timestamp: now - Duration::days(i),
duration_ms: 200 + (i as u64 * 5),
@@ -72,46 +70,22 @@
})
.collect();
let dev_expertise = TaskTypeExpertise::from_executions(dev_executions, "coding");
assert!(dev_expertise.success_rate > 0.9);
let mut role_profile = ProfileAdapter::create_learning_profile("developer".to_string());
role_profile =
ProfileAdapter::add_task_type_expertise(role_profile, "coding".to_string(), dev_expertise);
// Insert under stable_id = "developer" so assign_task can find it.
coordinator
.update_learning_profile("developer", role_profile)
.ok();
// Verify profile is stored under stable_id key
let stored = coordinator.get_learning_profile("developer");
assert!(stored.is_some(), "Role-level profile must be stored");
// Check budget status before task assignment
let budget_status = budget_manager.check_budget("developer").await.unwrap();
@@ -119,7 +93,7 @@
assert!(!budget_status.near_threshold);
assert_eq!(budget_status.monthly_remaining_cents, 100000);
// Assign a coding task — learning profile found → profile-based routing.
let task_id = coordinator
.assign_task(
"developer",
@@ -131,28 +105,15 @@
.await
.expect("Should assign task");
// Verify one developer was assigned the task
let all_agents = coordinator.registry().list_all();
let total_assigned: u32 = all_agents.iter().map(|a| a.current_tasks).sum();
assert_eq!(total_assigned, 1, "Exactly one agent must have the task");
// Complete the task using the agent that received it
if let Some(assigned) = all_agents.iter().find(|a| a.current_tasks > 0) {
coordinator.complete_task(&task_id, &assigned.id).await.ok();
}
// Verify budget status is still within limits
let budget_status = budget_manager.check_budget("developer").await.unwrap();
@@ -160,7 +121,7 @@
// Simulate multiple tasks to test cumulative budget tracking
for i in 0..5 {
if let Ok(tid) = coordinator
.assign_task(
"developer",
format!("Task {}", i),
@@ -168,15 +129,11 @@
"Context".to_string(),
1,
)
.await
{
let agents = coordinator.registry().list_all();
if let Some(assigned) = agents.iter().find(|a| a.current_tasks > 0) {
coordinator.complete_task(&tid, &assigned.id).await.ok();
}
}
}
@@ -188,9 +145,12 @@
"Should not exceed monthly budget"
);
// Verify role-level profile is intact
let all_profiles = coordinator.get_all_learning_profiles();
assert!(
all_profiles.contains_key("developer"),
"Role-level profile must survive task processing"
);
}
/// Test that budget enforcement doesn't break learning-based selection
@@ -214,9 +174,6 @@ async fn test_learning_selection_with_budget_constraints() {
vec!["coding".to_string()],
);
registry.register_agent(agent_expert).ok();
registry.register_agent(agent_novice).ok();
@@ -238,9 +195,9 @@
let budget_manager = Arc::new(BudgetManager::new(budgets));
let coordinator = coordinator.with_budget_manager(budget_manager.clone());
// Build a role-level learning profile for "developer" (stable_id = role).
let now = Utc::now();
let role_execs: Vec<ExecutionData> = (0..20)
.map(|i| ExecutionData {
timestamp: now - Duration::days(i),
duration_ms: 100,
@@ -248,36 +205,14 @@
})
.collect();
let role_expertise = TaskTypeExpertise::from_executions(role_execs, "coding");
let mut role_profile = ProfileAdapter::create_learning_profile("developer".to_string());
role_profile =
ProfileAdapter::add_task_type_expertise(role_profile, "coding".to_string(), role_expertise);
coordinator
.update_learning_profile("developer", role_profile)
.ok();
// Verify budget status
@@ -287,11 +222,10 @@
"Initial budget should be healthy"
);
// Assign multiple tasks — profile-based scoring should pick some developer.
let mut assigned_count = 0;
for i in 0..3 {
if let Ok(tid) = coordinator
.assign_task(
"developer",
format!("Coding Task {}", i),
@@ -301,19 +235,17 @@
)
.await
{
assigned_count += 1;
let agents = coordinator.registry().list_all();
if let Some(assigned) = agents.iter().find(|a| a.current_tasks > 0) {
coordinator.complete_task(&tid, &assigned.id).await.ok();
}
}
}
assert!(
assigned_count > 0,
"Tasks should be assigned despite budget constraints"
);
}
@ -330,7 +262,6 @@ async fn test_learning_profile_improvement_with_budget_tracking() {
vec!["coding".to_string()], vec!["coding".to_string()],
); );
let agent_id = agent.id.clone();
registry.register_agent(agent).ok(); registry.register_agent(agent).ok();
let coordinator = AgentCoordinator::with_registry(registry); let coordinator = AgentCoordinator::with_registry(registry);
@ -351,7 +282,7 @@ async fn test_learning_profile_improvement_with_budget_tracking() {
let budget_manager = Arc::new(BudgetManager::new(budgets)); let budget_manager = Arc::new(BudgetManager::new(budgets));
let coordinator = coordinator.with_budget_manager(budget_manager.clone()); let coordinator = coordinator.with_budget_manager(budget_manager.clone());
// Initial profile: mediocre performance // Initial profile: mediocre performance — keyed by stable_id = "developer".
let now = Utc::now(); let now = Utc::now();
let initial_execs: Vec<ExecutionData> = (0..10) let initial_execs: Vec<ExecutionData> = (0..10)
.map(|i| ExecutionData { .map(|i| ExecutionData {
@ -364,7 +295,7 @@ async fn test_learning_profile_improvement_with_budget_tracking() {
let mut initial_expertise = TaskTypeExpertise::from_executions(initial_execs, "coding"); let mut initial_expertise = TaskTypeExpertise::from_executions(initial_execs, "coding");
assert!((initial_expertise.success_rate - 0.5).abs() < 0.01); assert!((initial_expertise.success_rate - 0.5).abs() < 0.01);
let mut profile = ProfileAdapter::create_learning_profile(agent_id.clone()); let mut profile = ProfileAdapter::create_learning_profile("developer".to_string());
profile = ProfileAdapter::add_task_type_expertise( profile = ProfileAdapter::add_task_type_expertise(
profile, profile,
"coding".to_string(), "coding".to_string(),
@ -372,17 +303,17 @@ async fn test_learning_profile_improvement_with_budget_tracking() {
); );
coordinator coordinator
.update_learning_profile(&agent_id, profile.clone()) .update_learning_profile("developer", profile)
.ok(); .ok();
// Check initial profile // Check initial profile
let stored_profile = coordinator.get_learning_profile(&agent_id).unwrap(); let stored_profile = coordinator.get_learning_profile("developer").unwrap();
assert_eq!( assert_eq!(
stored_profile.get_task_type_score("coding"), stored_profile.get_task_type_score("coding"),
initial_expertise.success_rate initial_expertise.success_rate
); );
// Simulate improvement: add successful recent executions // Simulate improvement: add successful recent execution
let new_exec = ExecutionData { let new_exec = ExecutionData {
timestamp: now, timestamp: now,
duration_ms: 120, duration_ms: 120,
@ -396,7 +327,7 @@ async fn test_learning_profile_improvement_with_budget_tracking() {
); );
// Update profile with improved expertise // Update profile with improved expertise
let mut updated_profile = ProfileAdapter::create_learning_profile(agent_id.clone()); let mut updated_profile = ProfileAdapter::create_learning_profile("developer".to_string());
updated_profile = ProfileAdapter::add_task_type_expertise( updated_profile = ProfileAdapter::add_task_type_expertise(
updated_profile, updated_profile,
"coding".to_string(), "coding".to_string(),
@ -404,11 +335,11 @@ async fn test_learning_profile_improvement_with_budget_tracking() {
); );
coordinator coordinator
.update_learning_profile(&agent_id, updated_profile) .update_learning_profile("developer", updated_profile)
.ok(); .ok();
// Verify improvement is reflected // Verify improvement is reflected
let final_profile = coordinator.get_learning_profile(&agent_id).unwrap(); let final_profile = coordinator.get_learning_profile("developer").unwrap();
let final_score = final_profile.get_task_type_score("coding"); let final_score = final_profile.get_task_type_score("coding");
assert!(final_score > 0.5, "Final score should reflect improvement"); assert!(final_score > 0.5, "Final score should reflect improvement");
@ -320,9 +320,10 @@ async fn test_coordinator_assignment_with_learning_scores() {
// Create coordinator // Create coordinator
let coordinator = AgentCoordinator::with_registry(registry); let coordinator = AgentCoordinator::with_registry(registry);
// Create learning profiles: Agent A excels at coding, Agent B is mediocre // Build a role-level learning profile for "developer" (stable_id = role).
// Both agents share this profile since they share a role.
let now = Utc::now(); let now = Utc::now();
let agent_a_executions: Vec<ExecutionData> = (0..20) let executions: Vec<ExecutionData> = (0..20)
.map(|i| ExecutionData { .map(|i| ExecutionData {
timestamp: now - Duration::days(i), timestamp: now - Duration::days(i),
duration_ms: 100, duration_ms: 100,
@ -330,40 +331,19 @@ async fn test_coordinator_assignment_with_learning_scores() {
}) })
.collect(); .collect();
let agent_b_executions: Vec<ExecutionData> = (0..20) let expertise = TaskTypeExpertise::from_executions(executions, "coding");
.map(|i| ExecutionData {
timestamp: now - Duration::days(i),
duration_ms: 100,
success: i < 14, // 70% success rate
})
.collect();
let agent_a_expertise = TaskTypeExpertise::from_executions(agent_a_executions, "coding"); // Profiles are keyed by stable_id ("developer") so they survive hot-reloads.
let agent_b_expertise = TaskTypeExpertise::from_executions(agent_b_executions, "coding"); let mut role_profile = ProfileAdapter::create_learning_profile("developer".to_string());
role_profile =
ProfileAdapter::add_task_type_expertise(role_profile, "coding".to_string(), expertise);
let mut agent_a_profile = ProfileAdapter::create_learning_profile(agent_a_id.clone());
agent_a_profile = ProfileAdapter::add_task_type_expertise(
agent_a_profile,
"coding".to_string(),
agent_a_expertise,
);
let mut agent_b_profile = ProfileAdapter::create_learning_profile(agent_b_id.clone());
agent_b_profile = ProfileAdapter::add_task_type_expertise(
agent_b_profile,
"coding".to_string(),
agent_b_expertise,
);
// Update coordinator with learning profiles
coordinator coordinator
.update_learning_profile(&agent_a_id, agent_a_profile) .update_learning_profile("developer", role_profile)
.ok();
coordinator
.update_learning_profile(&agent_b_id, agent_b_profile)
.ok(); .ok();
// Assign a coding task // Assign a coding task — profile-based scoring will be used since a
// "developer" profile exists.
let _task_id = coordinator let _task_id = coordinator
.assign_task( .assign_task(
"developer", "developer",
@ -375,37 +355,31 @@ async fn test_coordinator_assignment_with_learning_scores() {
.await .await
.expect("Should assign task"); .expect("Should assign task");
// Get the registry to verify which agent was selected // Verify one developer was assigned the task
let registry = coordinator.registry(); let registry = coordinator.registry();
let agent_a_tasks = registry let total_tasks: u32 = registry
.list_all() .list_all()
.iter() .iter()
.find(|a| a.id == agent_a_id) .filter(|a| a.role == "developer")
.map(|a| a.current_tasks) .map(|a| a.current_tasks)
.unwrap_or(0); .sum();
let agent_b_tasks = registry assert_eq!(total_tasks, 1, "Exactly one developer should have the task");
.list_all()
.iter()
.find(|a| a.id == agent_b_id)
.map(|a| a.current_tasks)
.unwrap_or(0);
// Agent A (higher expertise in coding) should have been selected // Verify the profile is stored under the stable_id key
assert!(
agent_a_tasks > 0,
"Agent A (coding specialist) should have 1+ tasks"
);
assert_eq!(agent_b_tasks, 0, "Agent B (generalist) should have 0 tasks");
// Verify learning profiles are stored
let stored_profiles = coordinator.get_all_learning_profiles(); let stored_profiles = coordinator.get_all_learning_profiles();
assert!( assert!(
stored_profiles.contains_key(&agent_a_id), stored_profiles.contains_key("developer"),
"Agent A profile should be stored" "Role-level 'developer' profile must be stored"
);
// The per-instance IDs should NOT be the profile keys after the refactor
assert!(
!stored_profiles.contains_key(&agent_a_id),
"Ephemeral agent IDs must not be profile keys"
); );
assert!( assert!(
stored_profiles.contains_key(&agent_b_id), !stored_profiles.contains_key(&agent_b_id),
"Agent B profile should be stored" "Ephemeral agent IDs must not be profile keys"
); );
} }
@ -104,23 +104,52 @@ pub struct MetricsConfig {
} }
impl Config { impl Config {
/// Load configuration from a TOML file with environment variable /// Load configuration from a TOML or NCL file with environment variable
/// interpolation /// interpolation. When the path has a `.ncl` extension, `nickel export
/// --format json` is invoked and the resulting JSON is parsed. Otherwise
/// the file is read and parsed as TOML (legacy / test compatibility).
pub fn load<P: AsRef<Path>>(path: P) -> Result<Self> { pub fn load<P: AsRef<Path>>(path: P) -> Result<Self> {
let path = path.as_ref(); let path = path.as_ref();
// Read file content let (raw, is_json) = if path.extension().and_then(|e| e.to_str()) == Some("ncl") {
let content = fs::read_to_string(path).map_err(|e| { let out = std::process::Command::new("nickel")
VaporaError::ConfigError(format!("Failed to read config file {:?}: {}", path, e)) .args(["export", "--format", "json"])
})?; .arg(path)
.output()
.map_err(|e| {
VaporaError::ConfigError(format!(
"Failed to invoke nickel for {:?}: {}",
path, e
))
})?;
if !out.status.success() {
let stderr = String::from_utf8_lossy(&out.stderr);
return Err(VaporaError::ConfigError(format!(
"nickel export failed for {:?}: {}",
path, stderr
)));
}
let json = String::from_utf8(out.stdout).map_err(|e| {
VaporaError::ConfigError(format!("nickel output is not valid UTF-8: {}", e))
})?;
(json, true)
} else {
let content = fs::read_to_string(path).map_err(|e| {
VaporaError::ConfigError(format!("Failed to read config file {:?}: {}", path, e))
})?;
(content, false)
};
// Interpolate environment variables let interpolated = Self::interpolate_env_vars(&raw)?;
let interpolated = Self::interpolate_env_vars(&content)?;
// Parse TOML let config: Config = if is_json {
let config: Config = toml::from_str(&interpolated)?; serde_json::from_str(&interpolated).map_err(|e| {
VaporaError::ConfigError(format!("Failed to parse config JSON: {}", e))
})?
} else {
toml::from_str(&interpolated)?
};
// Validate configuration
config.validate()?; config.validate()?;
Ok(config) Ok(config)
@ -48,7 +48,7 @@ struct Args {
#[arg( #[arg(
short, short,
long, long,
default_value = "config/vapora.toml", default_value = "config/config.ncl",
env = "VAPORA_CONFIG" env = "VAPORA_CONFIG"
)] )]
config: String, config: String,
@ -7,14 +7,16 @@ use serde::{Deserialize, Serialize};
use thiserror::Error; use thiserror::Error;
use tokio::sync::RwLock; use tokio::sync::RwLock;
use crate::config::interpolate_env_vars;
/// Budget configuration errors /// Budget configuration errors
#[derive(Debug, Error)] #[derive(Debug, Error)]
pub enum BudgetConfigError { pub enum BudgetConfigError {
#[error("Failed to read budget config file: {0}")] #[error("Failed to read budget config file: {0}")]
ReadError(#[from] std::io::Error), ReadError(#[from] std::io::Error),
#[error("Failed to parse TOML: {0}")] #[error("Failed to parse config: {0}")]
ParseError(#[from] toml::de::Error), ParseJson(#[from] serde_json::Error),
#[error("Invalid budget configuration: {0}")] #[error("Invalid budget configuration: {0}")]
ValidationError(String), ValidationError(String),
@ -84,24 +86,65 @@ pub struct BudgetConfig {
} }
impl BudgetConfig { impl BudgetConfig {
/// Load budget configuration from TOML file /// Load budget configuration from a TOML or NCL file. When the path has a
/// `.ncl` extension, `nickel export --format json` is invoked and the
/// resulting JSON is parsed. Otherwise the file is read and parsed as TOML.
pub fn load<P: AsRef<Path>>(path: P) -> Result<Self, BudgetConfigError> { pub fn load<P: AsRef<Path>>(path: P) -> Result<Self, BudgetConfigError> {
let content = std::fs::read_to_string(path)?; let path = path.as_ref();
let config: Self = toml::from_str(&content)?;
let (raw, is_json) = if path.extension().and_then(|e| e.to_str()) == Some("ncl") {
let out = std::process::Command::new("nickel")
.args(["export", "--format", "json"])
.arg(path)
.output()
.map_err(|e| {
BudgetConfigError::ReadError(std::io::Error::other(format!(
"Failed to invoke nickel for {:?}: {}",
path, e
)))
})?;
if !out.status.success() {
let stderr = String::from_utf8_lossy(&out.stderr);
return Err(BudgetConfigError::ReadError(std::io::Error::other(
format!("nickel export failed for {:?}: {}", path, stderr),
)));
}
let json = String::from_utf8(out.stdout).map_err(|e| {
BudgetConfigError::ReadError(std::io::Error::other(format!(
"nickel output is not valid UTF-8: {}",
e
)))
})?;
(json, true)
} else {
let content = std::fs::read_to_string(path)?;
(content, false)
};
let interpolated = interpolate_env_vars(&raw);
let config: Self = if is_json {
serde_json::from_str(&interpolated)?
} else {
toml::from_str(&interpolated).map_err(|e| {
BudgetConfigError::ReadError(std::io::Error::other(format!(
"Failed to parse TOML: {}",
e
)))
})?
};
config.validate()?; config.validate()?;
Ok(config) Ok(config)
} }
/// Load from TOML with default fallback if file doesn't exist /// Load with default fallback if file doesn't exist
pub fn load_or_default<P: AsRef<Path>>(path: P) -> Result<Self, BudgetConfigError> { pub fn load_or_default<P: AsRef<Path>>(path: P) -> Result<Self, BudgetConfigError> {
match Self::load(&path) { match Self::load(&path) {
Ok(config) => Ok(config), Ok(config) => Ok(config),
Err(BudgetConfigError::ReadError(_)) => { Err(BudgetConfigError::ReadError(_)) => Ok(BudgetConfig {
// File doesn't exist, use defaults budgets: HashMap::new(),
Ok(BudgetConfig { }),
budgets: HashMap::new(),
})
}
Err(e) => Err(e), Err(e) => Err(e),
} }
} }
@ -12,8 +12,8 @@ pub enum ConfigError {
#[error("Failed to read config file: {0}")] #[error("Failed to read config file: {0}")]
ReadError(#[from] std::io::Error), ReadError(#[from] std::io::Error),
#[error("Failed to parse TOML: {0}")] #[error("Failed to parse config: {0}")]
ParseError(#[from] toml::de::Error), ParseJson(#[from] serde_json::Error),
#[error("Invalid configuration: {0}")] #[error("Invalid configuration: {0}")]
ValidationError(String), ValidationError(String),
@ -74,21 +74,66 @@ pub struct RoutingRule {
} }
impl LLMRouterConfig { impl LLMRouterConfig {
/// Load configuration from TOML file /// Load configuration from a TOML or NCL file. When the path has a `.ncl`
/// extension, `nickel export --format json` is invoked and the resulting
/// JSON is parsed with full `${VAR}` interpolation applied pre-parse.
/// Otherwise the file is read and parsed as TOML (legacy compatibility).
pub fn load<P: AsRef<Path>>(path: P) -> Result<Self, ConfigError> { pub fn load<P: AsRef<Path>>(path: P) -> Result<Self, ConfigError> {
let content = std::fs::read_to_string(path)?; let path = path.as_ref();
let mut config: Self = toml::from_str(&content)?;
let (raw, is_json) = if path.extension().and_then(|e| e.to_str()) == Some("ncl") {
let out = std::process::Command::new("nickel")
.args(["export", "--format", "json"])
.arg(path)
.output()
.map_err(|e| {
ConfigError::ReadError(std::io::Error::other(format!(
"Failed to invoke nickel for {:?}: {}",
path, e
)))
})?;
if !out.status.success() {
let stderr = String::from_utf8_lossy(&out.stderr);
return Err(ConfigError::ReadError(std::io::Error::other(format!(
"nickel export failed for {:?}: {}",
path, stderr
))));
}
let json = String::from_utf8(out.stdout).map_err(|e| {
ConfigError::ReadError(std::io::Error::other(format!(
"nickel output is not valid UTF-8: {}",
e
)))
})?;
(json, true)
} else {
let content = std::fs::read_to_string(path)?;
(content, false)
};
let interpolated = interpolate_env_vars(&raw);
let config: Self = if is_json {
serde_json::from_str(&interpolated)?
} else {
let mut c: Self = toml::from_str(&interpolated).map_err(|e| {
ConfigError::ReadError(std::io::Error::other(format!(
"Failed to parse TOML: {}",
e
)))
})?;
// Legacy TOML path: expand env vars in specific fields
c.expand_env_vars();
c
};
// Expand environment variables in API keys and URLs
config.expand_env_vars();
config.validate()?; config.validate()?;
Ok(config) Ok(config)
} }
/// Expand environment variables in configuration /// Expand environment variables in API key and URL fields (TOML path only).
fn expand_env_vars(&mut self) { fn expand_env_vars(&mut self) {
for (_, provider) in self.providers.iter_mut() { for provider in self.providers.values_mut() {
if let Some(ref api_key) = provider.api_key { if let Some(ref api_key) = provider.api_key {
provider.api_key = Some(expand_env_var(api_key)); provider.api_key = Some(expand_env_var(api_key));
} }
@ -136,7 +181,36 @@ impl LLMRouterConfig {
} }
} }
/// Expand environment variables in format ${VAR} or ${VAR:-default} /// Expand every `${VAR}` / `${VAR:-default}` reference in `content`.
/// Unresolved vars without a default are replaced with an empty string.
pub(crate) fn interpolate_env_vars(content: &str) -> String {
let mut result = String::with_capacity(content.len());
let mut remaining = content;
while let Some(start) = remaining.find("${") {
result.push_str(&remaining[..start]);
let after_open = &remaining[start + 2..];
if let Some(close) = after_open.find('}') {
let var_expr = &after_open[..close];
let value = if let Some(sep) = var_expr.find(":-") {
let var_name = &var_expr[..sep];
let default_val = &var_expr[sep + 2..];
std::env::var(var_name).unwrap_or_else(|_| default_val.to_string())
} else {
std::env::var(var_expr).unwrap_or_default()
};
result.push_str(&value);
remaining = &after_open[close + 1..];
} else {
result.push_str("${");
remaining = after_open;
}
}
result.push_str(remaining);
result
}
/// Expand environment variables in format ${VAR} or ${VAR:-default} (single
/// token).
fn expand_env_var(input: &str) -> String { fn expand_env_var(input: &str) -> String {
if !input.starts_with("${") || !input.ends_with('}') { if !input.starts_with("${") || !input.ends_with('}') {
return input.to_string(); return input.to_string();
@ -101,11 +101,53 @@ pub struct StageConfig {
impl WorkflowsConfig { impl WorkflowsConfig {
pub fn load<P: AsRef<Path>>(path: P) -> Result<Self> { pub fn load<P: AsRef<Path>>(path: P) -> Result<Self> {
let content = std::fs::read_to_string(path).map_err(ConfigError::IoError)?; let path = path.as_ref();
let config: WorkflowsConfig = toml::from_str(&content).map_err(ConfigError::Parse)?;
let (raw, is_json) = if path.extension().and_then(|e| e.to_str()) == Some("ncl") {
let out = std::process::Command::new("nickel")
.args(["export", "--format", "json"])
.arg(path)
.output()
.map_err(|e| {
ConfigError::IoError(std::io::Error::other(format!(
"Failed to invoke nickel for {:?}: {}",
path, e
)))
})?;
if !out.status.success() {
let stderr = String::from_utf8_lossy(&out.stderr);
return Err(ConfigError::IoError(std::io::Error::other(format!(
"nickel export failed for {:?}: {}",
path, stderr
)))
.into());
}
let json = String::from_utf8(out.stdout).map_err(|e| {
ConfigError::IoError(std::io::Error::other(format!(
"nickel output is not valid UTF-8: {}",
e
)))
})?;
(json, true)
} else {
let content = std::fs::read_to_string(path).map_err(ConfigError::IoError)?;
(content, false)
};
let interpolated = interpolate_env_vars(&raw);
let config: WorkflowsConfig = if is_json {
serde_json::from_str(&interpolated).map_err(ConfigError::ParseJson)?
} else {
toml::from_str(&interpolated).map_err(|e| {
ConfigError::IoError(std::io::Error::other(format!(
"Failed to parse TOML: {}",
e
)))
})?
};
config.validate()?; config.validate()?;
Ok(config) Ok(config)
} }
@ -149,6 +191,34 @@ impl WorkflowsConfig {
} }
} }
/// Expand every `${VAR}` / `${VAR:-default}` reference in `content`.
/// Unresolved vars without a default are replaced with an empty string.
fn interpolate_env_vars(content: &str) -> String {
let mut result = String::with_capacity(content.len());
let mut remaining = content;
while let Some(start) = remaining.find("${") {
result.push_str(&remaining[..start]);
let after_open = &remaining[start + 2..];
if let Some(close) = after_open.find('}') {
let var_expr = &after_open[..close];
let value = if let Some(sep) = var_expr.find(":-") {
let var_name = &var_expr[..sep];
let default_val = &var_expr[sep + 2..];
std::env::var(var_name).unwrap_or_else(|_| default_val.to_string())
} else {
std::env::var(var_expr).unwrap_or_default()
};
result.push_str(&value);
remaining = &after_open[close + 1..];
} else {
result.push_str("${");
remaining = after_open;
}
}
result.push_str(remaining);
result
}
fn validate_schedule_config( fn validate_schedule_config(
workflow_name: &str, workflow_name: &str,
schedule: &Option<ScheduleConfig>, schedule: &Option<ScheduleConfig>,
@ -68,8 +68,8 @@ pub enum ConfigError {
#[error("Failed to read config file: {0}")] #[error("Failed to read config file: {0}")]
IoError(#[from] std::io::Error), IoError(#[from] std::io::Error),
#[error("Failed to parse TOML: {0}")] #[error("Failed to parse config: {0}")]
Parse(#[from] toml::de::Error), ParseJson(#[from] serde_json::Error),
#[error("Invalid configuration: {0}")] #[error("Invalid configuration: {0}")]
Invalid(String), Invalid(String),
@ -0,0 +1,222 @@
# ADR-0040: Agent Hot-Reload — Stable Identity and Zero-Downtime Config Reload
**Status**: Implemented
**Date**: 2026-03-02
**Deciders**: VAPORA Team
**Technical Story**: `AgentMetadata::id` was a `Uuid::new_v4()` generated at startup. `learning_profiles` in `AgentCoordinator` and execution records in `KGPersistence` used this UUID as the key. Every process restart or SIGHUP reload rotated all UUIDs, orphaning accumulated expertise profiles and resetting the learning system to zero.
---
## Decision
Introduce `stable_id: String` on `AgentMetadata`, computed as `role.clone()` at construction time. Switch all learning profile keys and KG execution records from the ephemeral `id` (UUID) to `stable_id`. Add hot-reload mechanics — SIGHUP handler and `POST /reload` endpoint — that drain and re-spawn executors while leaving `learning_profiles` untouched.
---
## Context
### The Identity Problem
Before this change, every agent had two implicit identities that were conflated into one field:
| Identity | Purpose | Lifecycle |
|----------|---------|-----------|
| Instance ID (`id`) | Sender handle in `executor_channels`, registry key | Ephemeral — dies with the process or on reload |
| Profile ID | Key for `learning_profiles` and KG records | Must survive restarts to preserve learning |
Using `Uuid::new_v4()` for both meant any reload (SIGHUP, restart, crash recovery) threw away all accumulated expertise. An agent that had processed 500 coding tasks and learned optimal patterns would start from zero on the next deploy.
### Why `role` as stable_id
VAPORA's architecture already partitions learning at the role level: `AgentScoringService::rank_agents` accepts `Vec<(agent_id, Option<LearningProfile>)>` where multiple agents of the same role compete for a task. The profile that matters for selection is role-level expertise (how well the "developer" role handles "coding" tasks), not per-instance expertise. Using `role` as the stable key:
- Is deterministic across restarts
- Aggregates learning across all instances of the same role
- Requires no additional persistence (no UUID→role mapping table)
- Degrades gracefully: legacy-deserialized records with empty `stable_id` fall back to `role` via `stable_id_or_role()`
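The fallback semantics can be reproduced in a few lines. This is an illustrative, self-contained sketch (the struct mirrors `AgentMetadata` but omits every field not needed here), assuming a legacy record deserializes with `stable_id` left empty by `#[serde(default)]`:

```rust
// Illustrative minimal model of the fallback described above; not the
// real AgentMetadata definition.
struct AgentMetadata {
    stable_id: String, // empty when deserialized from a legacy record
    role: String,
}

impl AgentMetadata {
    fn stable_id_or_role(&self) -> &str {
        if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
    }
}

fn main() {
    // New-style record: stable_id populated at construction.
    let fresh = AgentMetadata { stable_id: "developer".into(), role: "developer".into() };
    assert_eq!(fresh.stable_id_or_role(), "developer");

    // Legacy record: #[serde(default)] leaves stable_id empty, so the
    // helper falls back to role and no profile is orphaned.
    let legacy = AgentMetadata { stable_id: String::new(), role: "reviewer".into() };
    assert_eq!(legacy.stable_id_or_role(), "reviewer");
    println!("fallback ok");
}
```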
---
## Implementation
### `AgentMetadata` (registry.rs)
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AgentMetadata {
pub id: String, // Uuid::new_v4() — ephemeral, per-instance
#[serde(default)]
pub stable_id: String, // role.clone() — persistent across restarts
pub role: String,
// ...
}
impl AgentMetadata {
pub fn new(role: String, ...) -> Self {
Self {
id: Uuid::new_v4().to_string(),
stable_id: role.clone(), // set before role is moved
role,
// ...
}
}
pub fn stable_id_or_role(&self) -> &str {
if self.stable_id.is_empty() { &self.role } else { &self.stable_id }
}
}
```
### `AgentRegistry::drain_role` (registry.rs)
Removes all agents for a role from the `agents` map and clears that role's `running_count` entry. This allows immediate re-registration after drain without hitting `MaxAgentsReached`.
```rust
pub fn drain_role(&self, role: &str) -> Vec<String> {
let mut inner = self.inner.write().expect("registry write lock");
let ids: Vec<String> = inner.agents.values()
.filter(|a| a.role == role)
.map(|a| a.id.clone())
.collect();
for id in &ids { inner.agents.remove(id); }
inner.running_count.remove(role);
ids
}
```
### `AgentCoordinator::drain_role` (coordinator.rs)
Delegates to `registry.drain_role`, then removes the corresponding `Sender` entries from `executor_channels`. Dropping the `Sender` closes the mpsc channel; the executor's `while let Some(task) = rx.recv().await` loop exits after draining any buffered messages — no explicit shutdown signal required.
```rust
pub fn drain_role(&self, role: &str) -> Vec<String> {
let ids = self.registry.drain_role(role);
for id in &ids {
self.executor_channels.remove(id);
}
ids
}
```
`learning_profiles` is keyed by `stable_id` (= role) and is **not** touched during drain. New executor instances spawned after reload inherit accumulated expertise immediately.
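The close-on-drop behavior that makes this work can be observed with the standard library's channel (the production code uses tokio's `mpsc`, which behaves the same way on sender drop): dropping the last `Sender` lets the receiver drain everything already buffered, then `recv` reports the channel closed.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<&str>();

    // Executor loop: mirrors `while let Some(task) = rx.recv().await` —
    // it keeps draining buffered tasks and exits once every Sender is gone.
    let worker = thread::spawn(move || {
        let mut processed = 0;
        while let Ok(task) = rx.recv() {
            println!("processing {task}");
            processed += 1;
        }
        processed // loop ends only when the channel is closed and empty
    });

    tx.send("task-1").unwrap();
    tx.send("task-2").unwrap();
    drop(tx); // drain_role: dropping the Sender closes the channel

    // Both in-flight tasks are still delivered before the loop exits.
    assert_eq!(worker.join().unwrap(), 2);
    println!("executor drained cleanly");
}
```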
### Profile lookup (coordinator.rs)
```rust
// assign_task — before:
.map(|a| (a.id.clone(), profiles.get(&a.id).cloned()))
// assign_task — after:
.map(|a| {
let key = a.stable_id_or_role();
(a.id.clone(), profiles.get(key).cloned())
})
```
### Hot-reload entry points (server.rs)
Two entry points invoke the same `reload_agents` function:
```rust
// SIGHUP
while sighup.recv().await.is_some() {
handle_sighup_reload(&state, &registry).await;
}
// REST
.route("/reload", axum::routing::post(reload_handler))
```
`reload_agents` sequence:
1. `registry.list_roles()` → drain each role via `coordinator.drain_role`
2. Re-spawn capability executors from `CapabilityRegistry`
3. Re-spawn config agents not covered by capabilities
4. Return `registry.total_count()`
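The invariant this sequence relies on — instances are replaced while role-keyed profiles are left alone — can be sketched with plain maps. Names here are illustrative stand-ins, not the real registry or coordinator API:

```rust
use std::collections::HashMap;

// Toy model of the reload sequence above: drain every instance of a role,
// re-spawn with a fresh ephemeral id, and never touch the profiles map.
fn main() {
    let mut agents: HashMap<String, String> = HashMap::new(); // ephemeral id -> role
    let mut profiles: HashMap<String, f64> = HashMap::new();  // stable_id -> score

    agents.insert("uuid-a".into(), "developer".into());
    profiles.insert("developer".into(), 0.85);

    // Step 1: drain — remove every instance of the role.
    agents.retain(|_, role| role.as_str() != "developer");
    assert!(agents.is_empty());

    // Steps 2-3: re-spawn with a fresh ephemeral id.
    agents.insert("uuid-b".into(), "developer".into());

    // Accumulated expertise is immediately visible to the new instance,
    // because the profile key is the role, not the rotated uuid.
    assert_eq!(profiles.get("developer"), Some(&0.85));
    println!("profile survived reload: {}", profiles["developer"]);
}
```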
---
## Availability Window
`reload_agents` drains all roles before re-spawning. During the window between the last drain and the first successful `register_agent`, `assign_task` for those roles returns `CoordinatorError::NoAvailableAgent`. This window is typically sub-millisecond on the same thread, but callers must handle this error and retry.
This is a deliberate trade-off: atomic swap-in of new executors would require a blue-green registry pattern, adding significant complexity for a latency window that is orders of magnitude shorter than any typical LLM call (which takes 500ms-30s).
---
## Out of Scope
- **BudgetManager reload**: budget limit changes require process restart. The `BudgetManager` is constructed once from config in `main()` and stored in `AppState`. Adding reload support requires either a `RwLock<BudgetConfig>` wrapper or rebuilding the manager and swapping it in `AppState` under a lock.
- **LLMRouter reload**: provider API key changes require restart for the same reason.
---
## Alternatives Considered
### UUID + external persistence of UUID→role mapping
Would preserve per-instance identity. Rejected: adds a SurrealDB table (UUID→role) that must be kept in sync across restarts, adds a lookup on every `assign_task`, and provides no additional value since role-level profiles already capture collective expertise.
### Blue-green registry swap
Two `AgentRegistry` instances: old one drains while new one accepts assignments. Rejected: requires `AgentCoordinator` to hold `Arc<RwLock<Arc<AgentRegistry>>>` and all call sites to acquire the inner lock on every call. Complexity disproportionate to the gain (sub-millisecond → zero gap).
### Versioned stable_id (e.g., `developer-v2`)
For breaking role renames. Rejected: out of scope; role renames already require explicit operator action.
---
## Trade-offs
**Pros**:
- Learning profiles survive indefinitely across restarts and hot-reloads
- SIGHUP and `POST /reload` provide two operator-friendly reload paths
- `stable_id_or_role()` fallback ensures backward compatibility with persisted data that predates this change
- `drain_role` cleans up cleanly: no stale executor channels, no MaxAgentsReached on re-register
**Cons**:
- All agents of the same role share one learning profile. Per-instance specialization (e.g., "this specific GPU node is faster at inference") is not representable. Acceptable: VAPORA's role model deliberately treats same-role agents as interchangeable for task routing purposes.
- Brief `NoAvailableAgent` window during reload (see Availability Window above).
- BudgetManager and LLMRouter not reloadable without restart.
---
## Verification
```bash
cargo test -p vapora-agents test_stable_id_deterministic
cargo test -p vapora-agents test_drain_role
cargo test -p vapora-agents test_profile_survives_role_drain
cargo test -p vapora-agents test_list_roles
# Hot-reload via signal
kill -HUP $(pgrep vapora-agents)
# Hot-reload via REST
curl -s -X POST http://localhost:9000/reload | jq .
# Expected: {"reloaded": true, "agents": N}
cargo clippy -p vapora-agents -- -D warnings
```
---
## Consequences
- `AgentMetadata` gains a new field `stable_id` with `#[serde(default)]`. Existing serialized records deserialize cleanly; `stable_id_or_role()` falls back to `role`.
- KG execution records (the `agent_id` field in SurrealDB) now store `stable_id` (= role) instead of a UUID. Existing records with UUID keys remain in the database but are no longer updated; they can be cleaned up with a migration if needed.
- ADR-0014 (Learning Profiles) and ADR-0015 (Budget Enforcement) are unaffected at the API level; only the internal key used to look up profiles changes.
---
## References
- [ADR-0014 — Learning Profiles](./0014-learning-profiles.md)
- [ADR-0015 — Budget Enforcement](./0015-budget-enforcement.md)
- [ADR-0026 — Arc-Based Shared State](./0026-shared-state.md)
- `crates/vapora-agents/src/registry.rs``AgentMetadata`, `drain_role`, `list_roles`
- `crates/vapora-agents/src/coordinator.rs``drain_role`, `registry_arc`, profile lookup
- `crates/vapora-agents/src/bin/server.rs``reload_agents`, SIGHUP handler, `/reload` endpoint
```diff
@@ -2,8 +2,8 @@
 Documentation of the key architectural decisions of the VAPORA project.
-**Status**: Complete (39 ADRs documented)
-**Last Updated**: 2026-02-26
+**Status**: Complete (40 ADRs documented)
+**Last Updated**: 2026-03-02
 **Format**: Custom VAPORA (Decision, Rationale, Alternatives, Trade-offs, Implementation, Verification, Consequences)
 ---
@@ -37,7 +37,7 @@ Fundamental decisions about the technology stack and base structure of the project
 ---
-## 🔄 Agent Coordination & Messaging (5 ADRs)
+## 🔄 Agent Coordination & Messaging (6 ADRs)
 Decisions on coordination between agents and message communication.
@@ -48,6 +48,7 @@ Decisions on coordination between agents and message communication.
 | [030](./0030-a2a-protocol-implementation.md) | A2A Protocol Implementation | Axum JSON-RPC 2.0 server + resilient client with exponential backoff | ✅ Implemented |
 | [031](./0031-kubernetes-deployment-kagent.md) | Kubernetes Deployment Strategy for kagent | Kustomize + StatefulSet with dev/prod overlays | ✅ Accepted |
 | [032](./0032-a2a-error-handling-json-rpc.md) | A2A Error Handling and JSON-RPC 2.0 Compliance | Two-layer: thiserror domain errors + JSON-RPC 2.0 protocol conversion | ✅ Implemented |
+| [040](./0040-agent-hot-reload-stable-identity.md) | Agent Hot-Reload — Stable Identity and Zero-Downtime Config Reload | `stable_id = role` as persistent profile key; SIGHUP + `POST /reload` drain-and-respawn without learning loss | ✅ Implemented |
 ---
@@ -126,6 +127,7 @@ Development and architecture patterns used across the whole codebase.
 - **A2A Protocol**: JSON-RPC 2.0 over HTTP enables interoperability with Google kagent and other A2A-compliant agents
 - **kagent Kubernetes Deployment**: Kustomize StatefulSet with stable pod identities for predictable A2A endpoint addressing
 - **A2A Error Handling**: Two-layer strategy (domain `thiserror` + JSON-RPC 2.0 protocol conversion) specializes ADR-0022 for A2A
+- **Agent Hot-Reload**: `stable_id = role` decouples ephemeral instance identity from persistent profile key; SIGHUP and `POST /reload` drain executors while preserving all learning profiles
 ### ☁️ Infrastructure & Security
@@ -270,7 +272,7 @@ Each ADR follows the Custom VAPORA format:
 ## Statistics
-- **Total ADRs**: 38
+- **Total ADRs**: 40
 - **Core Architecture**: 13 (41%)
 - **Agent Coordination**: 5 (16%)
 - **Infrastructure**: 4 (12%)
@@ -291,4 +293,4 @@
 **Generated**: January 12, 2026
 **Status**: Production-Ready
-**Last Reviewed**: 2026-02-17
+**Last Reviewed**: 2026-03-02
```
@@ -0,0 +1,18 @@
{
RegistryConfig = {
max_agents_per_role | Number,
health_check_interval | Number,
agent_timeout | Number,
},
AgentDefinition = {
role | String,
description | String,
llm_provider | String,
llm_model | String,
parallelizable | Bool,
priority | Number,
capabilities | Array String,
system_prompt | String | optional,
},
}
@@ -0,0 +1,20 @@
let AlertThreshold = std.contract.custom (
fun label =>
fun value =>
if value >= 0.0 && value <= 1.0 then
'Ok value
else
'Error {
message = "Invalid alert_threshold '%{std.to_string value}'.\nValid range: 0.0 - 1.0"
}
) in
{
RoleBudget = {
role | String,
monthly_limit_cents | Number,
weekly_limit_cents | Number,
fallback_provider | String,
alert_threshold | AlertThreshold,
},
}
@@ -0,0 +1,27 @@
{
SlackConfig = {
type | String,
webhook_url | String,
channel | String | optional,
username | String | optional,
},
DiscordConfig = {
type | String,
webhook_url | String,
username | String | optional,
avatar_url | String | optional,
},
TelegramConfig = {
type | String,
bot_token | String,
chat_id | String,
api_base | String | optional,
},
ChannelEntry = {
type | String,
..
},
}
@@ -0,0 +1,25 @@
{
RoutingConfig = {
default_provider | String,
cost_tracking_enabled | Bool,
fallback_enabled | Bool,
},
ProviderConfig = {
enabled | Bool,
api_key | String | optional,
url | String | optional,
model | String,
max_tokens | Number,
temperature | Number,
cost_per_1m_input | Number,
cost_per_1m_output | Number,
},
RoutingRule = {
name | String,
condition | { .. },
provider | String,
model_override | String | optional,
},
}
@@ -0,0 +1,68 @@
let LogLevel = std.contract.custom (
fun label =>
fun value =>
let valid = ["trace", "debug", "info", "warn", "error"] in
if std.array.any (fun x => x == value) valid then
'Ok value
else
'Error {
message = "Invalid log_level '%{value}'.\nValid values: trace | debug | info | warn | error"
}
) in
let Port = std.contract.custom (
fun label =>
fun value =>
if value >= 1 && value <= 65535 then
'Ok value
else
'Error {
message = "Invalid port '%{std.to_string value}'.\nValid range: 1 - 65535"
}
) in
{
TlsConfig = {
enabled | Bool,
cert_path | String,
key_path | String,
},
ServerConfig = {
host | String,
port | Port,
tls | TlsConfig,
},
DatabaseConfig = {
url | String,
max_connections | Number,
},
NatsConfig = {
url | String,
stream_name | String,
},
AuthConfig = {
jwt_secret | String,
jwt_expiration_hours | Number,
},
LoggingConfig = {
level | LogLevel,
json | Bool,
},
MetricsConfig = {
enabled | Bool,
port | Port,
},
NotificationConfig = {
on_task_done | Array String | default = [],
on_proposal_approved | Array String | default = [],
on_proposal_rejected | Array String | default = [],
on_agent_inactive | Array String | default = [],
},
}
@@ -0,0 +1,30 @@
{
EngineConfig = {
max_parallel_tasks | Number,
workflow_timeout | Number,
approval_gates_enabled | Bool,
cedar_policy_dir | String | optional,
},
ScheduleConfig = {
cron | String,
timezone | String | optional,
allow_concurrent | Bool,
catch_up | Bool,
},
WorkflowNotifications = {
on_completed | Array String | default = [],
on_failed | Array String | default = [],
on_approval_required | Array String | default = [],
},
StageConfig = {
name | String,
agents | Array String,
parallel | Bool | default = false,
max_parallel | Number | optional,
approval_required | Bool | default = false,
compensation_agents | Array String | optional,
},
}