ADR-008: Recursive Language Models (RLM) Integration
Date: 2026-02-16
Status: Accepted
Deciders: VAPORA Team
Technical Story: Phase 9 - RLM as Core Foundation
Context and Problem Statement
VAPORA's agent system relied on direct LLM calls for all reasoning tasks, which created fundamental limitations:
- Context window limitations: Single LLM calls degrade sharply beyond 50-100k tokens (context rot)
- No knowledge reuse: Historical executions were not semantically searchable
- Single-shot reasoning: No distributed analysis across document chunks
- Cost inefficiency: Processing entire documents repeatedly instead of relevant chunks
- No incremental learning: Agents couldn't learn from past successful solutions
Question: How do we enable long-context reasoning, knowledge reuse, and distributed LLM processing in VAPORA?
Decision Drivers
Must Have:
- Handle documents >100k tokens without context rot
- Semantic search over historical executions
- Distributed reasoning across document chunks
- Integration with existing SurrealDB + NATS architecture
- Support multiple LLM providers (OpenAI, Claude, Ollama)
Should Have:
- Hybrid search (keyword + semantic)
- Cost tracking per provider
- Prometheus metrics
- Sandboxed execution environment
Nice to Have:
- WASM-based fast execution tier
- Docker warm pool for complex tasks
Considered Options
Option 1: RAG (Retrieval-Augmented Generation) Only
Approach: Traditional RAG with vector embeddings + SurrealDB
Pros:
- Simple to implement
- Well-understood pattern
- Good for basic Q&A
Cons:
- ❌ No distributed reasoning (single LLM call)
- ❌ No keyword search (semantic retrieval only)
- ❌ No execution sandbox
- ❌ Limited to simple retrieval tasks
Option 2: LangChain/LlamaIndex Integration
Approach: Use existing framework (LangChain or LlamaIndex)
Pros:
- Pre-built components
- Active community
- Many integrations
Cons:
- ❌ Python-based (VAPORA is Rust-first)
- ❌ Heavy dependencies
- ❌ Less control over implementation
- ❌ Tight coupling to framework abstractions
Option 3: Recursive Language Models (RLM) - SELECTED
Approach: Custom Rust implementation with distributed reasoning, hybrid search, and sandboxed execution
Pros:
- ✅ Native Rust (zero-cost abstractions, safety)
- ✅ Hybrid search (BM25 + semantic + RRF fusion)
- ✅ Distributed LLM calls across chunks
- ✅ Sandboxed execution (WASM + Docker)
- ✅ Full control over implementation
- ✅ Reuses existing VAPORA patterns (SurrealDB, NATS, Prometheus)
Cons:
- ⚠️ More initial implementation effort
- ⚠️ Maintaining custom codebase
Decision: Option 3 - RLM Custom Implementation
Decision Outcome
Chosen Solution: Recursive Language Models (RLM)
Implement a native Rust RLM system as a foundational VAPORA component, providing:
- Chunking: Fixed, Semantic, Code-aware strategies
- Hybrid Search: BM25 (Tantivy) + Semantic (embeddings) + RRF fusion
- Distributed Reasoning: Parallel LLM calls across relevant chunks
- Sandboxed Execution: WASM tier (<10ms) + Docker tier (80-150ms)
- Knowledge Graph: Store execution history with learning curves
- Multi-Provider: OpenAI, Claude, Gemini, Ollama support
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ RLM Engine │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Chunking │ │ Hybrid Search│ │ Dispatcher │ │
│ │ │ │ │ │ │ │
│ │ • Fixed │ │ • BM25 │ │ • Parallel │ │
│ │ • Semantic │ │ • Semantic │ │ LLM calls │ │
│ │ • Code │ │ • RRF Fusion │ │ • Aggregation│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Storage │ │ Sandbox │ │ Metrics │ │
│ │ │ │ │ │ │ │
│ │ • SurrealDB │ │ • WASM │ │ • Prometheus │ │
│ │ • Chunks │ │ • Docker │ │ • Costs │ │
│ │ • Buffers │ │ • Auto-tier │ │ • Latency │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
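The Dispatcher box is what separates RLM from plain RAG: instead of one LLM call over the whole document, it fans out one call per relevant chunk and then aggregates the partial answers. A minimal sketch of that pattern under stated assumptions (the LlmCall trait and answer_over_chunks function are illustrative, not vapora-rlm's API; uses the futures, async-trait, and anyhow crates):

```rust
// Illustrative sketch of the Dispatcher pattern from the diagram: fan out one
// sub-prompt per relevant chunk, await all calls concurrently, then aggregate.
// `LlmCall` and `answer_over_chunks` are hypothetical names, not vapora-rlm APIs.
use std::sync::Arc;

use futures::future::join_all;

#[async_trait::async_trait]
pub trait LlmCall: Send + Sync {
    async fn complete(&self, prompt: &str) -> anyhow::Result<String>;
}

pub async fn answer_over_chunks(
    llm: Arc<dyn LlmCall>,
    task: &str,
    chunks: &[String],
) -> anyhow::Result<String> {
    // One parallel LLM call per chunk, each seeing only its own slice of context.
    let calls = chunks.iter().map(|chunk| {
        let llm = Arc::clone(&llm);
        let prompt = format!("Task: {task}\n\nContext chunk:\n{chunk}");
        async move { llm.complete(&prompt).await }
    });
    let partials: Vec<String> = join_all(calls)
        .await
        .into_iter()
        .collect::<Result<_, _>>()?;

    // Aggregation step: a final call reasons over the partial answers instead
    // of the raw document, keeping every individual prompt small.
    let summary_prompt = format!(
        "Task: {task}\n\nPartial answers from {} chunks:\n{}",
        partials.len(),
        partials.join("\n---\n")
    );
    llm.complete(&summary_prompt).await
}
```

Because every prompt stays bounded by the chunk size, the pattern sidesteps context rot regardless of total document length.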
Implementation Details
Crate: vapora-rlm (17,000+ LOC)
Key Components:
// 1. Chunking
pub enum ChunkingStrategy {
    Fixed,    // Fixed-size chunks with overlap
    Semantic, // Unicode-aware, sentence boundaries
    Code,     // AST-based (Rust, Python, JS)
}

// 2. Hybrid Search
pub struct HybridSearch {
    bm25_index: Arc<BM25Index>,  // Tantivy in-memory
    storage: Arc<dyn Storage>,   // SurrealDB
    config: HybridSearchConfig,  // RRF weights
}

// 3. LLM Dispatch
pub struct LLMDispatcher {
    client: Option<Arc<dyn LLMClient>>, // Multi-provider
    config: DispatchConfig,             // Aggregation strategy
}

// 4. Sandbox
pub enum SandboxTier {
    WASM,   // <10ms, WASI-compatible commands
    Docker, // <150ms, full compatibility
}
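For reference, the Fixed strategy reduces to a sliding window with overlap. A standalone sketch, character-based for brevity (the crate's real chunkers are Unicode- and AST-aware; this helper is not their API):

```rust
// Minimal standalone sketch of the Fixed strategy: fixed-size windows with
// overlap so each chunk carries some surrounding context. Character-based for
// brevity; not the vapora-rlm implementation.
pub fn fixed_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

With values like the chunk_size: 1000 / overlap: 200 used in Production Usage below, consecutive chunks share 200 units of context, so a hybrid-search hit carries enough surrounding text to be useful on its own.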
Database Schema (SCHEMALESS for flexibility):
-- Chunks (from documents)
DEFINE TABLE rlm_chunks SCHEMALESS;
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
DEFINE INDEX idx_rlm_chunks_doc_id ON TABLE rlm_chunks COLUMNS doc_id;
-- Execution History (for learning)
DEFINE TABLE rlm_executions SCHEMALESS;
DEFINE INDEX idx_rlm_executions_execution_id ON TABLE rlm_executions COLUMNS execution_id UNIQUE;
DEFINE INDEX idx_rlm_executions_doc_id ON TABLE rlm_executions COLUMNS doc_id;
Key Decision: Use SCHEMALESS instead of SCHEMAFULL tables to avoid conflicts with SurrealDB's auto-generated id fields.
Production Usage
use std::sync::Arc;

use vapora_rlm::{ChunkingConfig, ChunkingStrategy, EmbeddingConfig, RLMEngine, RLMEngineConfig};
use vapora_llm_router::providers::OpenAIClient;

// Setup LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key, "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0
)?);

// Configure RLM
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic,
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};

// Create engine
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config)
)?;

// Usage
let chunks = engine.load_document(doc_id, content, None).await?;
let results = engine.query(doc_id, "error handling", None, 5).await?;
let response = engine.dispatch_subtask(doc_id, "Analyze code", None, 5).await?;
Consequences
Positive
Performance:
- ✅ Handles 100k+ line documents without context rot
- ✅ Query latency: ~90ms average (100 queries benchmark)
- ✅ WASM tier: <10ms for simple commands
- ✅ Docker tier: <150ms from warm pool
- ✅ Full workflow: <30s for 10k lines (2728 chunks)
Functionality:
- ✅ Hybrid search outperforms pure semantic or BM25 alone
- ✅ Distributed reasoning reduces hallucinations
- ✅ Knowledge Graph enables learning from past executions
- ✅ Multi-provider support (OpenAI, Claude, Ollama)
Quality:
- ✅ 38/38 tests passing (100% pass rate)
- ✅ 0 clippy warnings
- ✅ Comprehensive E2E, performance, security tests
- ✅ Production-ready with real persistence (no stubs)
Cost Efficiency:
- ✅ Chunk-based processing reduces token usage
- ✅ Cost tracking per provider and task
- ✅ Local Ollama option for development (free)
Negative
Complexity:
- ⚠️ Additional component to maintain (17k+ LOC)
- ⚠️ Learning curve for distributed reasoning patterns
- ⚠️ More moving parts (chunking, BM25, embeddings, dispatch)
Infrastructure:
- ⚠️ Requires SurrealDB for persistence
- ⚠️ Requires embedding provider (OpenAI/Ollama)
- ⚠️ Optional Docker for full sandbox tier
Performance Trade-offs:
- ⚠️ Load time ~22s for 10k lines (chunking + embedding + indexing)
- ⚠️ BM25 rebuild time proportional to document size
- ⚠️ Memory usage: ~25MB per WASM instance, ~100-300MB per Docker container
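The auto-tier routing shown in the architecture diagram is what keeps these memory and latency costs bounded: simple, WASI-compatible commands stay on the cheap WASM tier, and only commands that need full OS compatibility pay for a Docker container. A minimal sketch of that routing (the Command type and is_wasi_compatible predicate are hypothetical, not the crate's API):

```rust
// Illustrative sketch of the Sandbox auto-tier choice. The tier enum mirrors
// the SandboxTier shown earlier; `Command` and `is_wasi_compatible` are
// hypothetical names, not vapora-rlm APIs.
pub enum SandboxTier {
    Wasm,   // <10ms for simple commands, ~25MB per instance
    Docker, // <150ms from warm pool, ~100-300MB per container
}

pub struct Command {
    pub program: String,
    pub needs_network: bool,
    pub needs_subprocess: bool,
}

fn is_wasi_compatible(cmd: &Command) -> bool {
    // In this sketch, the WASM tier cannot offer networking or subprocess
    // spawning, so anything that needs either is routed to Docker.
    !cmd.needs_network && !cmd.needs_subprocess
}

pub fn select_tier(cmd: &Command) -> SandboxTier {
    if is_wasi_compatible(cmd) {
        SandboxTier::Wasm
    } else {
        SandboxTier::Docker
    }
}
```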
Risks and Mitigations
| Risk | Mitigation | Status |
|---|---|---|
| SurrealDB schema conflicts | Use SCHEMALESS tables | ✅ Resolved |
| BM25 index performance | In-memory Tantivy, auto-rebuild | ✅ Verified |
| LLM provider costs | Cost tracking, local Ollama option | ✅ Implemented |
| Sandbox escape | WASM isolation, Docker security tests | ✅ 13/13 tests passing |
| Context window limits | Chunking + hybrid search + aggregation | ✅ Handles 100k+ tokens |
Validation
Test Coverage
Basic integration: 4/4 ✅ (100%)
E2E integration: 9/9 ✅ (100%)
Security: 13/13 ✅ (100%)
Performance: 8/8 ✅ (100%)
Debug tests: 4/4 ✅ (100%)
───────────────────────────────────
Total: 38/38 ✅ (100%)
Performance Benchmarks
Query Latency (100 queries):
Average: 90.6ms
P50: 87.5ms
P95: 88.3ms
P99: 91.7ms
Large Document (10k lines):
Load: ~22s (2728 chunks)
Query: ~565ms
Full workflow: <30s
BM25 Index:
Build time: ~100ms for 1000 docs
Search: <1ms for most queries
Integration Points
Existing VAPORA Components:
- ✅ vapora-llm-router: LLM client integration
- ✅ vapora-knowledge-graph: Execution history persistence
- ✅ vapora-shared: Common error types and models
- ✅ SurrealDB: Persistent storage backend
- ✅ Prometheus: Metrics export
New Integration Surface:
// Backend API
POST /api/v1/rlm/analyze
{
    "content": "...",
    "query": "...",
    "strategy": "semantic"
}

// Agent Coordinator
let rlm_result = rlm_engine.dispatch_subtask(
    doc_id, task.description, None, 5
).await?;
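For callers that do not link the crate directly, the same capability is reachable over the backend route above. A hedged reqwest example (request shape taken from the snippet above; the response schema is not specified in this ADR, so it is read as untyped JSON, and auth headers are omitted):

```rust
// Hedged example of calling the backend endpoint shown above with reqwest.
// The response schema isn't documented in this ADR, so it is parsed as raw JSON.
use serde_json::json;

async fn analyze(base_url: &str, content: &str, query: &str) -> anyhow::Result<serde_json::Value> {
    let client = reqwest::Client::new();
    let response = client
        .post(format!("{base_url}/api/v1/rlm/analyze"))
        .json(&json!({
            "content": content,
            "query": query,
            "strategy": "semantic",
        }))
        .send()
        .await?
        .error_for_status()?;
    Ok(response.json::<serde_json::Value>().await?)
}
```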
Related Decisions
- ADR-003: Multi-provider LLM routing (Phase 6 dependency)
- ADR-005: Knowledge Graph temporal modeling (RLM execution history)
- ADR-006: Prometheus metrics standardization (RLM metrics)
References
Implementation:
- crates/vapora-rlm/ - Full RLM implementation
- crates/vapora-rlm/PRODUCTION.md - Production setup guide
- crates/vapora-rlm/examples/ - Working examples
- migrations/008_rlm_schema.surql - Database schema
External:
- Tantivy - BM25 full-text search
- RRF Paper - Reciprocal Rank Fusion
- WASM Security Model
Tests:
- tests/e2e_integration.rs - End-to-end workflow tests
- tests/performance_test.rs - Performance benchmarks
- tests/security_test.rs - Sandbox security validation
Notes
Why SCHEMALESS vs SCHEMAFULL?
Initial implementation used SCHEMAFULL with explicit id field definitions:
DEFINE TABLE rlm_chunks SCHEMAFULL;
DEFINE FIELD id ON TABLE rlm_chunks TYPE record<rlm_chunks>; -- ❌ Conflict
This caused data persistence failures because SurrealDB auto-generates id fields. Changed to SCHEMALESS:
DEFINE TABLE rlm_chunks SCHEMALESS; -- ✅ Works
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
Indexes still work with SCHEMALESS, providing necessary performance without schema conflicts.
Why Hybrid Search?
Pure BM25 (keyword):
- ✅ Fast, exact matches
- ❌ Misses semantic similarity
Pure Semantic (embeddings):
- ✅ Understands meaning
- ❌ Expensive, misses exact keywords
Hybrid (BM25 + Semantic + RRF):
- ✅ Best of both worlds
- ✅ Reciprocal Rank Fusion combines both rankings using ranks alone, so no score normalization is needed
- ✅ Empirically outperforms either alone
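The fusion step itself is small. A minimal sketch of Reciprocal Rank Fusion over the two ranked lists, using the standard score(d) = Σ 1/(k + rank_i(d)) with the customary k = 60 (function shape and equal weighting are illustrative, not vapora-rlm's HybridSearchConfig):

```rust
use std::collections::HashMap;

/// Minimal Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per
/// document, and documents are re-ordered by the summed score. k = 60 is the
/// value from the original RRF paper; the equal weighting is illustrative.
pub fn rrf_fuse(bm25_ranked: &[String], semantic_ranked: &[String], k: f64) -> Vec<String> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [bm25_ranked, semantic_ranked] {
        for (rank, doc_id) in list.iter().enumerate() {
            // rank is 0-based here, so rank + 1 matches the usual 1-based formula.
            *scores.entry(doc_id.clone()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused.into_iter().map(|(doc_id, _)| doc_id).collect()
}

// Usage: let merged = rrf_fuse(&bm25_ids, &semantic_ids, 60.0);
```

Because RRF operates on ranks rather than raw scores, BM25 scores and cosine similarities never have to be normalized against each other, which is what makes the fusion robust across query types.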
Why Custom Implementation vs Framework?
Frameworks (LangChain, LlamaIndex):
- Python-based (VAPORA is Rust)
- Heavy abstractions
- Less control
- Dependency lock-in
Custom Rust RLM:
- Native performance
- Full control
- Zero-cost abstractions
- Direct integration with VAPORA patterns
Trade-off accepted: More initial effort for long-term maintainability and performance.
Supersedes: None (new decision)
Amended by: None
Last Updated: 2026-02-16