# ADR-008: Recursive Language Models (RLM) Integration

**Date**: 2026-02-16
**Status**: Accepted
**Deciders**: VAPORA Team
**Technical Story**: Phase 9 - RLM as Core Foundation

## Context and Problem Statement

VAPORA's agent system relied on **direct LLM calls** for all reasoning tasks, which created fundamental limitations:

1. **Context window limitations**: Single LLM calls fail beyond 50-100k tokens (context rot)
2. **No knowledge reuse**: Historical executions were not semantically searchable
3. **Single-shot reasoning**: No distributed analysis across document chunks
4. **Cost inefficiency**: Processing entire documents repeatedly instead of relevant chunks
5. **No incremental learning**: Agents couldn't learn from past successful solutions

**Question**: How do we enable long-context reasoning, knowledge reuse, and distributed LLM processing in VAPORA?

## Decision Drivers

**Must Have:**

- Handle documents >100k tokens without context rot
- Semantic search over historical executions
- Distributed reasoning across document chunks
- Integration with the existing SurrealDB + NATS architecture
- Support for multiple LLM providers (OpenAI, Claude, Ollama)

**Should Have:**

- Hybrid search (keyword + semantic)
- Cost tracking per provider
- Prometheus metrics
- Sandboxed execution environment

**Nice to Have:**

- WASM-based fast execution tier
- Docker warm pool for complex tasks

## Considered Options

### Option 1: RAG (Retrieval-Augmented Generation) Only

**Approach**: Traditional RAG with vector embeddings + SurrealDB

**Pros:**

- Simple to implement
- Well-understood pattern
- Good for basic Q&A

**Cons:**

- ❌ No distributed reasoning (single LLM call)
- ❌ No keyword search (semantic retrieval only)
- ❌ No execution sandbox
- ❌ Limited to simple retrieval tasks

### Option 2: LangChain/LlamaIndex Integration

**Approach**: Use an existing framework (LangChain or LlamaIndex)

**Pros:**

- Pre-built components
- Active community
- Many integrations

**Cons:**

- ❌ Python-based (VAPORA is Rust-first)
- ❌ Heavy dependencies
- ❌ Less control over implementation
- ❌ Tight coupling to framework abstractions

### Option 3: Recursive Language Models (RLM) - **SELECTED**

**Approach**: Custom Rust implementation with distributed reasoning, hybrid search, and sandboxed execution

**Pros:**

- ✅ Native Rust (zero-cost abstractions, safety)
- ✅ Hybrid search (BM25 + semantic + RRF fusion)
- ✅ Distributed LLM calls across chunks
- ✅ Sandboxed execution (WASM + Docker)
- ✅ Full control over implementation
- ✅ Reuses existing VAPORA patterns (SurrealDB, NATS, Prometheus)

**Cons:**

- ⚠️ More initial implementation effort
- ⚠️ Maintaining a custom codebase

**Decision**: **Option 3 - RLM Custom Implementation**

## Decision Outcome

### Chosen Solution: Recursive Language Models (RLM)

Implement a **native Rust RLM system** as a foundational VAPORA component, providing:

1. **Chunking**: Fixed, Semantic, and Code-aware strategies
2. **Hybrid Search**: BM25 (Tantivy) + Semantic (embeddings) + RRF fusion (see the sketch after this list)
3. **Distributed Reasoning**: Parallel LLM calls across relevant chunks
4. **Sandboxed Execution**: WASM tier (<10ms) + Docker tier (80-150ms)
5. **Knowledge Graph**: Store execution history with learning curves
6. **Multi-Provider**: OpenAI, Claude, Gemini, Ollama support
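Item 2 is the retrieval core: the BM25 and semantic result lists are merged with Reciprocal Rank Fusion. Below is a minimal sketch of that fusion step, assuming the standard RRF formula from the referenced paper, score(d) = Σ wᵢ / (k + rankᵢ(d)); the `rrf_fuse` helper and the per-retriever weights are illustrative and may not match the actual `HybridSearchConfig` API in `vapora-rlm`.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = Σ_i w_i / (k + rank_i(d)),
/// where rank_i(d) is the 1-based rank of chunk d in retriever i's list.
/// k (conventionally 60) dampens the influence of very top ranks.
fn rrf_fuse(rankings: &[(f64, Vec<String>)], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (weight, ranked_ids) in rankings {
        for (rank, id) in ranked_ids.iter().enumerate() {
            // enumerate() is 0-based; the RRF formula uses 1-based ranks.
            *scores.entry(id.clone()).or_insert(0.0) += *weight / (k + (rank as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    fused
}

fn main() {
    // Chunk IDs as ranked by BM25 (weight 1.0) and by semantic search (weight 1.0).
    let bm25 = (1.0, vec!["c3".to_string(), "c1".to_string(), "c7".to_string()]);
    let semantic = (1.0, vec!["c1".to_string(), "c7".to_string(), "c2".to_string()]);
    for (id, score) in rrf_fuse(&[bm25, semantic], 60.0) {
        println!("{id}: {score:.5}");
    }
}
```

With the conventional k = 60, a chunk ranked first by both retrievers scores 2/61 ≈ 0.033, roughly twice what a chunk ranked first by only one of them can reach, so agreement between the two rankers dominates either ranking alone.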
### Architecture Overview

```
┌────────────────────────────────────────────────────────┐
│                       RLM Engine                       │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Chunking   │  │ Hybrid Search│  │  Dispatcher  │  │
│  │              │  │              │  │              │  │
│  │ • Fixed      │  │ • BM25       │  │ • Parallel   │  │
│  │ • Semantic   │  │ • Semantic   │  │   LLM calls  │  │
│  │ • Code       │  │ • RRF Fusion │  │ • Aggregation│  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Storage    │  │   Sandbox    │  │   Metrics    │  │
│  │              │  │              │  │              │  │
│  │ • SurrealDB  │  │ • WASM       │  │ • Prometheus │  │
│  │ • Chunks     │  │ • Docker     │  │ • Costs      │  │
│  │ • Buffers    │  │ • Auto-tier  │  │ • Latency    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
```

### Implementation Details

**Crate**: `vapora-rlm` (17,000+ LOC)

**Key Components:**

```rust
// 1. Chunking
pub enum ChunkingStrategy {
    Fixed,    // Fixed-size chunks with overlap
    Semantic, // Unicode-aware, sentence boundaries
    Code,     // AST-based (Rust, Python, JS)
}

// 2. Hybrid Search
pub struct HybridSearch {
    bm25_index: Arc<BM25Index>,   // Tantivy in-memory
    storage: Arc<RLMStorage>,     // SurrealDB
    config: HybridSearchConfig,   // RRF weights
}

// 3. LLM Dispatch
pub struct LLMDispatcher {
    client: Option<Arc<dyn LLMClient>>, // Multi-provider
    config: DispatchConfig,             // Aggregation strategy
}

// 4. Sandbox
pub enum SandboxTier {
    WASM,   // <10ms, WASI-compatible commands
    Docker, // <150ms, full compatibility
}
```

**Database Schema** (SCHEMALESS for flexibility):

```sql
-- Chunks (from documents)
DEFINE TABLE rlm_chunks SCHEMALESS;
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
DEFINE INDEX idx_rlm_chunks_doc_id ON TABLE rlm_chunks COLUMNS doc_id;

-- Execution History (for learning)
DEFINE TABLE rlm_executions SCHEMALESS;
DEFINE INDEX idx_rlm_executions_execution_id ON TABLE rlm_executions COLUMNS execution_id UNIQUE;
DEFINE INDEX idx_rlm_executions_doc_id ON TABLE rlm_executions COLUMNS doc_id;
```

**Key Decision**: Use **SCHEMALESS** instead of SCHEMAFULL tables to avoid conflicts with SurrealDB's auto-generated `id` fields.
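To make `ChunkingStrategy::Fixed` concrete before the production example below, here is a minimal sketch of fixed-size chunking with overlap, assuming character-based sizing for illustration. The `fixed_chunks` helper is a hypothetical stand-in rather than the crate's API; the real Semantic and Code strategies additionally respect sentence boundaries and AST nodes.

```rust
/// Split `content` into chunks of at most `chunk_size` characters, where each
/// chunk overlaps the previous one by `overlap` characters so that context
/// straddling a boundary is still fully contained in at least one chunk.
fn fixed_chunks(content: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = content.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    let doc = "fn handle_error(e: Error) -> Result<(), Error> { Err(e) }\n".repeat(200);
    // Same knobs as the ChunkingConfig in the production example: 1000 / 200.
    let chunks = fixed_chunks(&doc, 1000, 200);
    println!("{} chunks of at most 1000 chars", chunks.len());
}
```

The Semantic strategy exposes the same `chunk_size`/`overlap` knobs but snaps boundaries to sentences, which is why the production configuration below can use 1000/200 unchanged.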
### Production Usage

```rust
use std::sync::Arc;

use vapora_rlm::{RLMEngine, RLMEngineConfig, ChunkingConfig, ChunkingStrategy, EmbeddingConfig};
use vapora_llm_router::providers::OpenAIClient;

// Set up the LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key, "gpt-4".to_string(), 4096, 0.7, 5.0, 15.0
)?);

// Configure RLM
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic,
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};

// Create engine
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config)
)?;

// Usage
let chunks = engine.load_document(doc_id, content, None).await?;
let results = engine.query(doc_id, "error handling", None, 5).await?;
let response = engine.dispatch_subtask(doc_id, "Analyze code", None, 5).await?;
```

## Consequences

### Positive

**Performance:**

- ✅ Handles 100k+ line documents without context rot
- ✅ Query latency: ~90ms average (100-query benchmark)
- ✅ WASM tier: <10ms for simple commands
- ✅ Docker tier: <150ms from warm pool
- ✅ Full workflow: <30s for 10k lines (2728 chunks)

**Functionality:**

- ✅ Hybrid search outperforms pure semantic or BM25 alone
- ✅ Distributed reasoning reduces hallucinations
- ✅ Knowledge Graph enables learning from past executions
- ✅ Multi-provider support (OpenAI, Claude, Ollama)

**Quality:**

- ✅ 38/38 tests passing (100% pass rate)
- ✅ 0 clippy warnings
- ✅ Comprehensive E2E, performance, and security tests
- ✅ Production-ready with real persistence (no stubs)

**Cost Efficiency:**

- ✅ Chunk-based processing reduces token usage
- ✅ Cost tracking per provider and task
- ✅ Local Ollama option for development (free)

### Negative

**Complexity:**

- ⚠️ Additional component to maintain (17k+ LOC)
- ⚠️ Learning curve for distributed reasoning patterns
- ⚠️ More moving parts (chunking, BM25, embeddings, dispatch)

**Infrastructure:**

- ⚠️ Requires SurrealDB for persistence
- ⚠️ Requires an embedding provider (OpenAI/Ollama)
- ⚠️ Optional Docker for the full sandbox tier

**Performance Trade-offs:**

- ⚠️ Load time ~22s for 10k lines (chunking + embedding + indexing)
- ⚠️ BM25 rebuild time proportional to document size
- ⚠️ Memory usage: ~25MB per WASM instance, ~100-300MB per Docker container

### Risks and Mitigations

| Risk | Mitigation | Status |
|------|------------|--------|
| SurrealDB schema conflicts | Use SCHEMALESS tables | ✅ Resolved |
| BM25 index performance | In-memory Tantivy, auto-rebuild | ✅ Verified |
| LLM provider costs | Cost tracking, local Ollama option | ✅ Implemented |
| Sandbox escape | WASM isolation, Docker security tests | ✅ 13/13 tests passing |
| Context window limits | Chunking + hybrid search + aggregation | ✅ Handles 100k+ tokens |

## Validation

### Test Coverage

```
Basic integration:   4/4   ✅ (100%)
E2E integration:     9/9   ✅ (100%)
Security:           13/13  ✅ (100%)
Performance:         8/8   ✅ (100%)
Debug tests:         4/4   ✅ (100%)
───────────────────────────────────
Total:              38/38  ✅ (100%)
```

### Performance Benchmarks

```
Query Latency (100 queries):
  Average: 90.6ms
  P50:     87.5ms
  P95:     88.3ms
  P99:     91.7ms

Large Document (10k lines):
  Load:          ~22s (2728 chunks)
  Query:         ~565ms
  Full workflow: <30s

BM25 Index:
  Build time: ~100ms for 1000 docs
  Search:     <1ms for most queries
```

### Integration Points

**Existing VAPORA Components:**

- ✅ `vapora-llm-router`: LLM client integration
- ✅ `vapora-knowledge-graph`: Execution history persistence
- ✅ `vapora-shared`: Common error types and models
- ✅ SurrealDB: Persistent storage backend
- ✅ Prometheus: Metrics export (see the sketch below)
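As a sketch of what the metrics surface can look like, the snippet below records one query-latency observation with the `prometheus` crate and renders the text exposition format; the metric name, the `provider` label, and the observed value are illustrative assumptions, not the identifiers `vapora-rlm` actually registers.

```rust
use prometheus::{Encoder, HistogramOpts, HistogramVec, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();

    // Hypothetical per-provider latency histogram for RLM queries.
    let query_latency = HistogramVec::new(
        HistogramOpts::new(
            "vapora_rlm_query_latency_seconds",
            "Latency of RLM hybrid-search queries",
        ),
        &["provider"],
    )?;
    registry.register(Box::new(query_latency.clone()))?;

    // Record one observation, roughly the ~90ms average from the benchmarks above.
    query_latency.with_label_values(&["openai"]).observe(0.090);

    // Render the text exposition format that a /metrics endpoint would serve.
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buffer)?;
    println!("{}", String::from_utf8(buffer)?);
    Ok(())
}
```

In VAPORA this would register against the existing Prometheus registry rather than a fresh one; the point here is only the shape of the metric and its export path.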
**New Integration Surface:**

```rust
// Backend API
// POST /api/v1/rlm/analyze
// { "content": "...", "query": "...", "strategy": "semantic" }

// Agent Coordinator
let rlm_result = rlm_engine.dispatch_subtask(
    doc_id, task.description, None, 5
).await?;
```

## Related Decisions

- **ADR-003**: Multi-provider LLM routing (Phase 6 dependency)
- **ADR-005**: Knowledge Graph temporal modeling (RLM execution history)
- **ADR-006**: Prometheus metrics standardization (RLM metrics)

## References

**Implementation:**

- `crates/vapora-rlm/` - Full RLM implementation
- `crates/vapora-rlm/PRODUCTION.md` - Production setup guide
- `crates/vapora-rlm/examples/` - Working examples
- `migrations/008_rlm_schema.surql` - Database schema

**External:**

- [Tantivy](https://github.com/quickwit-oss/tantivy) - BM25 full-text search
- [RRF Paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) - Reciprocal Rank Fusion
- [WASM Security Model](https://webassembly.org/docs/security/)

**Tests:**

- `tests/e2e_integration.rs` - End-to-end workflow tests
- `tests/performance_test.rs` - Performance benchmarks
- `tests/security_test.rs` - Sandbox security validation

## Notes

**Why SCHEMALESS vs SCHEMAFULL?**

The initial implementation used SCHEMAFULL with explicit `id` field definitions:

```sql
DEFINE TABLE rlm_chunks SCHEMAFULL;
DEFINE FIELD id ON TABLE rlm_chunks TYPE record;  -- ❌ Conflict
```

This caused data persistence failures because SurrealDB auto-generates `id` fields. The tables were switched to SCHEMALESS:

```sql
DEFINE TABLE rlm_chunks SCHEMALESS;  -- ✅ Works
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
```

Indexes still work on SCHEMALESS tables, providing the necessary query performance without schema conflicts.

**Why Hybrid Search?**

Pure BM25 (keyword):
- ✅ Fast, exact matches
- ❌ Misses semantic similarity

Pure Semantic (embeddings):
- ✅ Understands meaning
- ❌ Expensive, misses exact keywords

Hybrid (BM25 + Semantic + RRF):
- ✅ Best of both worlds
- ✅ Reciprocal Rank Fusion combines the two rankings without score normalization
- ✅ Empirically outperforms either alone

**Why Custom Implementation vs Framework?**

Frameworks (LangChain, LlamaIndex):
- Python-based (VAPORA is Rust)
- Heavy abstractions
- Less control
- Dependency lock-in

Custom Rust RLM:
- Native performance
- Full control
- Zero-cost abstractions
- Direct integration with VAPORA patterns

**Trade-off accepted**: More initial effort in exchange for long-term maintainability and performance.

---

**Supersedes**: None (new decision)
**Amended by**: None
**Last Updated**: 2026-02-16