# ADR-0028: Workflow Orchestrator for Cost-Efficient Multi-Agent Pipelines
## Status
Accepted - Implemented in v1.2.0
## Context

### The Problem: Excessive LLM Costs from Cache Token Accumulation
Analysis of real Claude Code usage data (5 weeks, single developer) reveals a critical cost pattern:
| Metric | Value |
|---|---|
| Total cost | $1,050.68 |
| Weekly average | ~$210 |
| Monthly projection | ~$840 |
| Cache read tokens | 3.82B (95.7% of total) |
| Cache creation tokens | 170M (4.3%) |
| Direct input tokens | 2.4M (0.06%) |
| Direct output tokens | 366K (0.009%) |
The cost is dominated by cache tokens, not generation.

### Root Cause: Monolithic Session Pattern
The current workflow with Claude Code follows a monolithic session pattern:

```
Session start
├─ Message 1:  context 50K  → cache read 50K
├─ Message 2:  context 100K → cache read 100K
├─ Message 3:  context 150K → cache read 150K
├─ ...
└─ Message 50: context 800K → cache read 800K
─────────────────
~20M cache reads per session
```
Each message in a long session re-sends the entire conversation history. Over a typical development session (50+ messages), context accumulates to 500K-1M tokens, with each subsequent message re-transmitting all previous context.
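Because every message re-reads all accumulated context, total cache reads grow quadratically with message count. A minimal sketch (illustrative numbers, assuming a constant per-message context growth) shows how a 50-message session reaches the ~20M figure above:

```rust
/// Total cache-read tokens for an `n`-message session in which context
/// grows by `step` tokens per message and every message re-reads all of it:
/// step * (1 + 2 + ... + n) = step * n * (n + 1) / 2.
fn monolithic_cache_reads(n: u64, step: u64) -> u64 {
    step * n * (n + 1) / 2
}

fn main() {
    // ~50 messages, growing ~16K tokens per message (illustrative assumption)
    let total = monolithic_cache_reads(50, 16_000);
    println!("total cache reads: {:.1}M tokens", total as f64 / 1e6); // ~20.4M
}
```

Doubling the session length roughly quadruples the cache reads, which is why splitting work into short-lived contexts pays off so heavily.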
### Why This Matters
At current pricing (2026 rates):
- Cache read (Haiku): $0.03/1M tokens
- Cache read (Sonnet): $0.30/1M tokens
- Cache read (Opus): $1.50/1M tokens
With 3.82B cache read tokens distributed across Sonnet (51%) and Haiku (38%), the cache cost alone exceeds what direct input/output would cost.
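A back-of-envelope calculation (covering only the stated Sonnet and Haiku shares; the remaining ~11% is not broken out in the source data) shows the cache-read bill dwarfing what 2.4M direct input tokens could ever cost:

```rust
fn main() {
    let cache_reads: f64 = 3.82e9; // total cache-read tokens over 5 weeks

    // Per-model share of cache reads (from the usage data) times rate per 1M tokens
    let sonnet_cost = cache_reads * 0.51 * 0.30 / 1e6; // ~$584
    let haiku_cost = cache_reads * 0.38 * 0.03 / 1e6;  // ~$44

    println!("Sonnet cache reads: ${:.0}", sonnet_cost);
    println!("Haiku cache reads:  ${:.0}", haiku_cost);
}
```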
## Decision
Implement a Workflow Orchestrator (vapora-workflow-engine) that executes multi-stage pipelines with short-lived agent contexts.
### Architecture: Agents with Short Lifecycles
Instead of one long session accumulating context, workflows execute as discrete stages:
```
┌─────────────────────────────────────────────────────────┐
│ Task: "Implement feature X"                             │
└─────────────────────────────────────────────────────────┘
                     │
┌────────────────────┼────────────────────┐
▼                    ▼                    ▼
┌─────────┐       ┌──────────┐       ┌──────────┐
│Architect│       │Developer │       │ Reviewer │
│ (Opus)  │       │ (Haiku)  │       │ (Sonnet) │
├─────────┤       ├──────────┤       ├──────────┤
│Context: │       │Context:  │       │Context:  │
│ 40K     │──────▶│ 25K      │──────▶│ 35K      │
│ 5 msgs  │ spec  │ 12 msgs  │ code  │ 4 msgs   │
│ 200K    │       │ 300K     │       │ 140K     │
│ cache   │       │ cache    │       │ cache    │
└────┬────┘       └────┬─────┘       └────┬─────┘
     │                 │                  │
     ▼                 ▼                  ▼
TERMINATES        TERMINATES         TERMINATES
(context          (context           (context
 discarded)        discarded)         discarded)

Total cache: ~640K
Monolithic equivalent: ~20-40M
Reduction: 95-97%
```
### Key Principles
- **Context isolation:** Each agent receives only what it needs (spec, relevant files), not the full conversation history
- **Artifact passing, not conversation passing:** Between agents flows the result (spec, code, review), not the dialogue that produced it
- **Short lifecycles:** Agent completes its task → context dies → next agent starts fresh
- **Persistent memory via Kogral:** Important decisions/patterns are stored in the knowledge base, not in session context
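The first three principles can be sketched in a few lines. This is a toy model, not the crate's actual API: `Artifact` mirrors the artifact types named later in this ADR, while `run_stage` is a hypothetical stand-in for invoking an agent with a fresh, short-lived context.

```rust
// Hypothetical sketch of artifact passing between short-lived agents.
#[derive(Debug, Clone)]
enum Artifact {
    Spec(String),
    Code(String),
    Review(String),
}

/// A stage runs with a fresh, minimal context and returns one artifact.
/// The agent's working context lives only for the duration of this call;
/// only the returned Artifact crosses the stage boundary.
fn run_stage(role: &str, input: Option<&Artifact>) -> Artifact {
    match role {
        "architect" => Artifact::Spec("design for feature X".into()),
        "developer" => Artifact::Code(format!("impl derived from {:?}", input)),
        _ => Artifact::Review("approved".into()),
    }
}

fn main() {
    let spec = run_stage("architect", None);          // Opus, ~40K context
    let code = run_stage("developer", Some(&spec));   // Haiku, sees spec only
    let review = run_stage("reviewer", Some(&code));  // Sonnet, sees code only
    println!("{:?}", review);
}
```

Note that the developer never sees the architect's 5-message dialogue, only the spec it produced.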
## Implementation

### Components
- **vapora-workflow-engine** (new crate):
  - `WorkflowOrchestrator`: Main coordinator managing workflow lifecycle
  - `WorkflowInstance`: State machine tracking individual workflow execution
  - `StageState`: Manages stage execution and task assignment
  - `Artifact`: Data passed between stages (ADR, Code, TestResults, Review, Documentation)
- **Workflow Templates** (`config/workflows.toml`):
  - `feature_development` (5 stages): architecture → implementation → testing → review → deployment
  - `bugfix` (4 stages): investigation → fix → testing → deployment
  - `documentation_update` (3 stages): content → review → publish
  - `security_audit` (4 stages): analysis → pentesting → remediation → verification
- **REST API** (`/api/v1/workflow_orchestrator`):
  - `POST /` - Start workflow
  - `GET /` - List active workflows
  - `GET /:id` - Get workflow status
  - `POST /:id/approve` - Approve waiting stage
  - `POST /:id/cancel` - Cancel running workflow
  - `GET /templates` - List available templates
- **CLI** (`vapora-cli`):
  - `vapora workflow start --template <name> --context context.json`
  - `vapora workflow list`
  - `vapora workflow status <id>`
  - `vapora workflow approve <id> --approver "Name"`
  - `vapora workflow cancel <id> --reason "Reason"`
  - `vapora workflow templates`
- **Kogral Integration:**
  - `enrich_context_from_kogral()` - Loads guidelines, patterns, ADRs
  - Filesystem-based knowledge retrieval from the `.kogral/` directory
  - Configurable via the `KOGRAL_PATH` environment variable
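A template in `config/workflows.toml` might look like the fragment below. The field names and schema here are illustrative assumptions, not the crate's actual format:

```toml
# Hypothetical sketch of one workflow template; real schema may differ.
[workflows.feature_development]
description = "Feature pipeline: architecture through deployment"

[[workflows.feature_development.stages]]
name = "architecture"
role = "architect"
model = "opus"
produces = "ADR"

[[workflows.feature_development.stages]]
name = "implementation"
role = "developer"
model = "haiku"
produces = "Code"
```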
### Integration with Existing Components
| Component | Usage |
|---|---|
| SwarmCoordinator | Task assignment via submit_task_for_bidding() |
| AgentRegistry | 12 roles with lifecycle management |
| LearningProfiles | Expertise-based agent selection |
| KGPersistence | Workflow execution history |
| NATS JetStream | Inter-stage event coordination |
## Rationale

### Why Vapora Already Has the Pieces
The current Vapora implementation already includes:
| Component | Status | Functionality |
|---|---|---|
| SwarmCoordinator | Complete | Task assignment, load balancing |
| AgentRegistry | Complete | 12 roles, lifecycle management |
| Learning Profiles | Complete | Expertise scoring with recency bias |
| KG Persistence | Complete | SurrealDB, execution history |
| NATS Messaging | Complete | Inter-agent communication |
| Workflow Templates | Complete | workflows.toml with stage definitions |
| Artifact Types | Complete | TaskCompleted.artifacts field |
What was missing: The orchestration layer that executes workflow templates by loading templates, creating instances, listening for task completions, advancing stages, and passing artifacts.
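The missing stage-advancement logic reduces to a small state machine. The types below are a simplified sketch (the real `WorkflowInstance` in vapora-workflow-engine is surely richer), showing only how a task-completion event advances the current stage through an ordered template:

```rust
// Simplified sketch of the orchestration loop; hypothetical structure.
struct WorkflowInstance {
    template: Vec<&'static str>, // ordered stage names from workflows.toml
    current: usize,              // index of the stage currently executing
}

impl WorkflowInstance {
    /// Called when a task-completion event arrives for the current stage:
    /// advance the index and return the next stage to dispatch, if any.
    fn on_stage_complete(&mut self) -> Option<&'static str> {
        self.current += 1;
        self.template.get(self.current).copied()
    }
}

fn main() {
    let mut wf = WorkflowInstance {
        template: vec!["architecture", "implementation", "testing", "review", "deployment"],
        current: 0,
    };
    // Simulate completion events driving the workflow forward.
    while let Some(next) = wf.on_stage_complete() {
        println!("dispatching stage: {next}");
    }
    println!("workflow complete");
}
```

In the real system, artifact hand-off, approval gates, and NATS event subscriptions wrap this core loop.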
### Why Not Alternative Solutions

| Alternative | Why Not |
|---|---|
| Manual `/compact` in Claude Code | Requires user discipline, doesn't fundamentally change the pattern |
| Shorter sessions manually | Loses context continuity, user must track state |
| External tools (LiteLLM, CrewAI) | Python-based, doesn't leverage existing Vapora infrastructure |
| Just use Haiku everywhere | Quality degradation for complex tasks |
Vapora already has budget-aware routing, learning profiles, and swarm coordination. The workflow orchestrator completes the picture.
### Why Kogral Integration
Kogral provides persistent knowledge that would otherwise bloat session context:
| Without Kogral | With Kogral |
|---|---|
| Guidelines re-explained each session | Query once via MCP, inject 5K tokens |
| ADRs repeated in conversation | Reference by ID, inject summary |
| Patterns described verbally | Structured retrieval, minimal tokens |
Kogral transforms "remember our auth pattern" (requires context) into "query pattern:auth" (stateless lookup).
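That stateless lookup can be sketched as a plain filesystem read, in the spirit of `enrich_context_from_kogral()`. The directory layout and the `lookup` helper here are assumptions for illustration:

```rust
// Hypothetical stateless knowledge lookup against a .kogral/ directory.
use std::fs;
use std::path::PathBuf;

/// Resolve a query like "pattern:auth" to a small, injectable snippet
/// (e.g. .kogral/patterns/auth.md) instead of carrying the explanation
/// in session context.
fn lookup(kogral_path: &str, kind: &str, name: &str) -> Option<String> {
    let path = PathBuf::from(kogral_path)
        .join(kind)
        .join(format!("{name}.md"));
    fs::read_to_string(path).ok()
}

fn main() {
    // KOGRAL_PATH overrides the default location, per the ADR.
    let base = std::env::var("KOGRAL_PATH").unwrap_or_else(|_| ".kogral".into());
    match lookup(&base, "patterns", "auth") {
        Some(doc) => println!("injecting {} bytes of pattern context", doc.len()),
        None => println!("pattern not found; agent proceeds without it"),
    }
}
```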
## Consequences

### Positive
- **~95% reduction in cache token costs:** $840/month → ~$50-100/month for the same workload
- **Better model allocation:** Opus for architecture (high quality, few tokens), Haiku for implementation (lower quality acceptable, many tokens)
- **Leverages existing investment:** Uses the SwarmCoordinator, LearningProfiles, and KGPersistence already built
- **Audit trail:** Each agent execution persisted to the KG with tokens, cost, and duration
- **Parallelization:** Multiple developers can work simultaneously on different parts
- **Quality through specialization:** Each agent is optimized for its role vs one generalist session
### Negative
- **Orchestration overhead:** Additional component to maintain
- **Latency between stages:** Artifact passing adds delay vs a continuous conversation
- **Context loss between agents:** Agent B doesn't know what Agent A "considered but rejected"
- **Debugging complexity:** Issues span multiple agent executions
### Mitigations
| Negative | Mitigation |
|---|---|
| Orchestration overhead | Minimal code (~1500 lines), clear separation of concerns |
| Latency | Parallel stages where possible, async execution |
| Context loss | Kogral captures decisions, not just outcomes |
| Debugging | Workflow ID traces all related executions in KG |
## Metrics for Success
| Metric | Before | After (Target) |
|---|---|---|
| Monthly LLM cost | ~$840 | <$150 |
| Cache tokens per task | ~20M | <1M |
| Average context size | 500K+ | <50K per agent |
| Workflow completion rate | N/A | >95% |
## Cost Projection
Based on analyzed usage patterns with optimized workflow:
| Role | Model | % of Work | Monthly Cost |
|---|---|---|---|
| Architect | Opus | 10% | ~$25 |
| Developer | Haiku | 50% | ~$30 |
| Reviewer | Sonnet | 25% | ~$40 |
| Tester | Haiku | 15% | ~$15 |
| **Total** | | | ~$110 |
Savings: ~$730/month (87% reduction)
## Implementation Status
- Status: Complete (v1.2.0)
- Crates: vapora-workflow-engine, vapora-cli
- Tests: 26 unit tests + 1 doc test passing
- Endpoints: 6 REST API endpoints
- Templates: 4 pre-configured workflows
- CLI Commands: 6 workflow management commands
## References

- Usage data: Claude Code usage analysis (5 weeks, 3.82B cache tokens)
- Vapora SwarmCoordinator: `crates/vapora-swarm/src/coordinator.rs`
- Vapora Workflows Config: `config/workflows.toml`
- Kogral MCP: `kogral-mcp` (external project)
- Implementation: `crates/vapora-workflow-engine/`
- CLI: `crates/vapora-cli/`
## Related ADRs
- ADR-0014: Learning-Based Agent Selection
- ADR-0015: Budget Enforcement & Cost Optimization
- ADR-0013: Knowledge Graph for Temporal Execution History
- ADR-0018: Swarm Load Balancing
## Decision Drivers
- Data-driven: 95% of cost is cache tokens from long sessions
- Infrastructure exists: Vapora has all pieces except orchestrator
- Kogral synergy: Persistent knowledge reduces context requirements
- Measurable outcome: Clear before/after metrics for validation
- Production-ready: Complete implementation with tests and documentation