Vapora/docs/adrs/0005-nats-jetstream.md

147 lines
3.8 KiB
Markdown
Raw Normal View History

# ADR-005: NATS JetStream para Agent Coordination
**Status**: Accepted | Implemented
**Date**: 2024-11-01
**Deciders**: Agent Architecture Team
**Technical Story**: Selecting persistent message broker for reliable agent task queuing
---
## Decision
Usar **async-nats 0.45 con JetStream** para coordinación de agentes (no Redis Pub/Sub, no RabbitMQ).
---
## Rationale
1. **At-Least-Once Delivery**: JetStream garantiza persistencia + retries (vs Redis Pub/Sub que pierde mensajes)
2. **Lightweight**: Ninguna dependencia pesada (vs RabbitMQ/Kafka setup)
3. **Async Native**: Diseñado para Tokio (mismo runtime que VAPORA)
4. **VAPORA Use Case**: Coordinar tareas entre múltiples agentes con garantías de entrega
---
## Alternatives Considered
### ❌ Redis Pub/Sub
- **Pros**: Simple, fast
- **Cons**: Sin persistencia, mensajes perdidos si broker cae
### ❌ RabbitMQ
- **Pros**: Maduro, confiable
- **Cons**: Pesado, require seperate server, más complejidad operacional
### ✅ NATS JetStream (CHOSEN)
- At-least-once delivery
- Lightweight
- Tokio-native async
---
## Trade-offs
**Pros**:
- ✅ Persistencia garantizada (JetStream)
- ✅ Retries automáticos
- ✅ Bajo overhead operacional
- ✅ Integración natural con Tokio
**Cons**:
- ⚠️ Cluster setup requiere configuración adicional
- ⚠️ Menos tooling que RabbitMQ
- ⚠️ Fallback a in-memory si NATS cae (degrada a at-most-once)
---
## Implementation
**Task Publishing**:
```rust
// crates/vapora-agents/src/coordinator.rs
let client = async_nats::connect(&nats_url).await?;
let jetstream = async_nats::jetstream::new(client);
// Publish task assignment
jetstream.publish("tasks.assigned", serde_json::to_vec(&task_msg)?).await?;
```
**Agent Subscription**:
```rust
// Subscribe to task queue
let subscriber = jetstream
.subscribe_durable("tasks.assigned", "agent-consumer")
.await?;
// Process incoming tasks
while let Some(message) = subscriber.next().await {
let task: TaskMessage = serde_json::from_slice(&message.payload)?;
process_task(task).await?;
message.ack().await?; // Acknowledge after successful processing
}
```
**Key Files**:
- `/crates/vapora-agents/src/coordinator.rs:53-72` (message dispatch)
- `/crates/vapora-agents/src/messages.rs` (message types)
- `/crates/vapora-backend/src/api/` (task creation publishes to JetStream)
---
## Verification
```bash
# Start NATS with JetStream support
docker run -d -p 4222:4222 nats:latest -js
# Create stream and consumer
nats stream add TASKS --subjects 'tasks.assigned' --storage file
# Monitor message throughput
nats sub 'tasks.assigned' --raw
# Test agent coordination
cargo test -p vapora-agents -- --nocapture
# Check message processing
nats stats
```
**Expected Output**:
- JetStream stream created with persistence
- Messages published to `tasks.assigned` persisted
- Agent subscribers receive and acknowledge messages
- Retries work if agent processing fails
- All agent tests pass
---
## Consequences
### Message Queue Management
- Streams must be pre-created (infra responsibility)
- Retention policies configured per stream (age, size limits)
- Consumer groups enable load-balanced processing
### Failure Modes
- If NATS unavailable: Agents fallback to in-memory queue (graceful degradation)
- Lost messages only if dual failure (server down + no backup)
- See disaster recovery plan for NATS clustering
### Scaling
- Multiple agents subscribe to same consumer group (load balancing)
- One message processed by one agent (exclusive delivery)
- Ordering preserved within subject
---
## References
- [NATS JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
- `/crates/vapora-agents/src/coordinator.rs` (coordinator implementation)
- `/crates/vapora-agents/src/messages.rs` (message types)
---
**Related ADRs**: ADR-001 (Workspace), ADR-018 (Swarm Load Balancing)