Vapora/docs/adrs/0005-nats-jetstream.md
Jesús Pérez 7110ffeea2
Some checks failed
Rust CI / Security Audit (push) Has been cancelled
Rust CI / Check + Test + Lint (nightly) (push) Has been cancelled
Rust CI / Check + Test + Lint (stable) (push) Has been cancelled
chore: extend doc: adr, tutorials, operations, etc
2026-01-12 03:32:47 +00:00

3.8 KiB

ADR-005: NATS JetStream para Agent Coordination

Status: Accepted | Implemented Date: 2024-11-01 Deciders: Agent Architecture Team Technical Story: Selecting persistent message broker for reliable agent task queuing


Decision

Usar async-nats 0.45 con JetStream para coordinación de agentes (no Redis Pub/Sub, no RabbitMQ).


Rationale

  1. At-Least-Once Delivery: JetStream garantiza persistencia + retries (vs Redis Pub/Sub que pierde mensajes)
  2. Lightweight: Ninguna dependencia pesada (vs RabbitMQ/Kafka setup)
  3. Async Native: Diseñado para Tokio (mismo runtime que VAPORA)
  4. VAPORA Use Case: Coordinar tareas entre múltiples agentes con garantías de entrega

Alternatives Considered

Redis Pub/Sub

  • Pros: Simple, fast
  • Cons: Sin persistencia, mensajes perdidos si broker cae

RabbitMQ

  • Pros: Maduro, confiable
  • Cons: Pesado, require seperate server, más complejidad operacional

NATS JetStream (CHOSEN)

  • At-least-once delivery
  • Lightweight
  • Tokio-native async

Trade-offs

Pros:

  • Persistencia garantizada (JetStream)
  • Retries automáticos
  • Bajo overhead operacional
  • Integración natural con Tokio

Cons:

  • ⚠️ Cluster setup requiere configuración adicional
  • ⚠️ Menos tooling que RabbitMQ
  • ⚠️ Fallback a in-memory si NATS cae (degrada a at-most-once)

Implementation

Task Publishing:

// crates/vapora-agents/src/coordinator.rs
let client = async_nats::connect(&nats_url).await?;
let jetstream = async_nats::jetstream::new(client);

// Publish task assignment
jetstream.publish("tasks.assigned", serde_json::to_vec(&task_msg)?).await?;

Agent Subscription:

// Subscribe to task queue
let subscriber = jetstream
    .subscribe_durable("tasks.assigned", "agent-consumer")
    .await?;

// Process incoming tasks
while let Some(message) = subscriber.next().await {
    let task: TaskMessage = serde_json::from_slice(&message.payload)?;
    process_task(task).await?;
    message.ack().await?; // Acknowledge after successful processing
}

Key Files:

  • /crates/vapora-agents/src/coordinator.rs:53-72 (message dispatch)
  • /crates/vapora-agents/src/messages.rs (message types)
  • /crates/vapora-backend/src/api/ (task creation publishes to JetStream)

Verification

# Start NATS with JetStream support
docker run -d -p 4222:4222 nats:latest -js

# Create stream and consumer
nats stream add TASKS --subjects 'tasks.assigned' --storage file

# Monitor message throughput
nats sub 'tasks.assigned' --raw

# Test agent coordination
cargo test -p vapora-agents -- --nocapture

# Check message processing
nats stats

Expected Output:

  • JetStream stream created with persistence
  • Messages published to tasks.assigned persisted
  • Agent subscribers receive and acknowledge messages
  • Retries work if agent processing fails
  • All agent tests pass

Consequences

Message Queue Management

  • Streams must be pre-created (infra responsibility)
  • Retention policies configured per stream (age, size limits)
  • Consumer groups enable load-balanced processing

Failure Modes

  • If NATS unavailable: Agents fallback to in-memory queue (graceful degradation)
  • Lost messages only if dual failure (server down + no backup)
  • See disaster recovery plan for NATS clustering

Scaling

  • Multiple agents subscribe to same consumer group (load balancing)
  • One message processed by one agent (exclusive delivery)
  • Ordering preserved within subject

References

  • NATS JetStream Documentation
  • /crates/vapora-agents/src/coordinator.rs (coordinator implementation)
  • /crates/vapora-agents/src/messages.rs (message types)

Related ADRs: ADR-001 (Workspace), ADR-018 (Swarm Load Balancing)