3.8 KiB
3.8 KiB
ADR-005: NATS JetStream para Agent Coordination
Status: Accepted | Implemented Date: 2024-11-01 Deciders: Agent Architecture Team Technical Story: Selecting persistent message broker for reliable agent task queuing
Decision
Usar async-nats 0.45 con JetStream para coordinación de agentes (no Redis Pub/Sub, no RabbitMQ).
Rationale
- At-Least-Once Delivery: JetStream garantiza persistencia + retries (vs Redis Pub/Sub que pierde mensajes)
- Lightweight: Ninguna dependencia pesada (vs RabbitMQ/Kafka setup)
- Async Native: Diseñado para Tokio (mismo runtime que VAPORA)
- VAPORA Use Case: Coordinar tareas entre múltiples agentes con garantías de entrega
Alternatives Considered
❌ Redis Pub/Sub
- Pros: Simple, fast
- Cons: Sin persistencia, mensajes perdidos si broker cae
❌ RabbitMQ
- Pros: Maduro, confiable
- Cons: Pesado, require seperate server, más complejidad operacional
✅ NATS JetStream (CHOSEN)
- At-least-once delivery
- Lightweight
- Tokio-native async
Trade-offs
Pros:
- ✅ Persistencia garantizada (JetStream)
- ✅ Retries automáticos
- ✅ Bajo overhead operacional
- ✅ Integración natural con Tokio
Cons:
- ⚠️ Cluster setup requiere configuración adicional
- ⚠️ Menos tooling que RabbitMQ
- ⚠️ Fallback a in-memory si NATS cae (degrada a at-most-once)
Implementation
Task Publishing:
// crates/vapora-agents/src/coordinator.rs
let client = async_nats::connect(&nats_url).await?;
let jetstream = async_nats::jetstream::new(client);
// Publish task assignment
jetstream.publish("tasks.assigned", serde_json::to_vec(&task_msg)?).await?;
Agent Subscription:
// Subscribe to task queue
let subscriber = jetstream
.subscribe_durable("tasks.assigned", "agent-consumer")
.await?;
// Process incoming tasks
while let Some(message) = subscriber.next().await {
let task: TaskMessage = serde_json::from_slice(&message.payload)?;
process_task(task).await?;
message.ack().await?; // Acknowledge after successful processing
}
Key Files:
/crates/vapora-agents/src/coordinator.rs:53-72(message dispatch)/crates/vapora-agents/src/messages.rs(message types)/crates/vapora-backend/src/api/(task creation publishes to JetStream)
Verification
# Start NATS with JetStream support
docker run -d -p 4222:4222 nats:latest -js
# Create stream and consumer
nats stream add TASKS --subjects 'tasks.assigned' --storage file
# Monitor message throughput
nats sub 'tasks.assigned' --raw
# Test agent coordination
cargo test -p vapora-agents -- --nocapture
# Check message processing
nats stats
Expected Output:
- JetStream stream created with persistence
- Messages published to
tasks.assignedpersisted - Agent subscribers receive and acknowledge messages
- Retries work if agent processing fails
- All agent tests pass
Consequences
Message Queue Management
- Streams must be pre-created (infra responsibility)
- Retention policies configured per stream (age, size limits)
- Consumer groups enable load-balanced processing
Failure Modes
- If NATS unavailable: Agents fallback to in-memory queue (graceful degradation)
- Lost messages only if dual failure (server down + no backup)
- See disaster recovery plan for NATS clustering
Scaling
- Multiple agents subscribe to same consumer group (load balancing)
- One message processed by one agent (exclusive delivery)
- Ordering preserved within subject
References
- NATS JetStream Documentation
/crates/vapora-agents/src/coordinator.rs(coordinator implementation)/crates/vapora-agents/src/messages.rs(message types)
Related ADRs: ADR-001 (Workspace), ADR-018 (Swarm Load Balancing)