ADR-008: Recursive Language Models (RLM) Integration

Date: 2026-02-16
Status: Accepted
Deciders: VAPORA Team
Technical Story: Phase 9 - RLM as Core Foundation

Context and Problem Statement

VAPORA's agent system relied on direct LLM calls for all reasoning tasks, which created fundamental limitations:

  1. Context window limitations: Single LLM calls fail beyond 50-100k tokens (context rot)
  2. No knowledge reuse: Historical executions were not semantically searchable
  3. Single-shot reasoning: No distributed analysis across document chunks
  4. Cost inefficiency: Processing entire documents repeatedly instead of relevant chunks
  5. No incremental learning: Agents couldn't learn from past successful solutions

Question: How do we enable long-context reasoning, knowledge reuse, and distributed LLM processing in VAPORA?

Decision Drivers

Must Have:

  • Handle documents >100k tokens without context rot
  • Semantic search over historical executions
  • Distributed reasoning across document chunks
  • Integration with existing SurrealDB + NATS architecture
  • Support multiple LLM providers (OpenAI, Claude, Ollama)

Should Have:

  • Hybrid search (keyword + semantic)
  • Cost tracking per provider
  • Prometheus metrics
  • Sandboxed execution environment

Nice to Have:

  • WASM-based fast execution tier
  • Docker warm pool for complex tasks

Considered Options

Option 1: RAG (Retrieval-Augmented Generation) Only

Approach: Traditional RAG with vector embeddings + SurrealDB

Pros:

  • Simple to implement
  • Well-understood pattern
  • Good for basic Q&A

Cons:

  • No distributed reasoning (single LLM call)
  • No keyword search (semantic retrieval only)
  • No execution sandbox
  • Limited to simple retrieval tasks

Option 2: LangChain/LlamaIndex Integration

Approach: Use existing framework (LangChain or LlamaIndex)

Pros:

  • Pre-built components
  • Active community
  • Many integrations

Cons:

  • Python-based (VAPORA is Rust-first)
  • Heavy dependencies
  • Less control over implementation
  • Tight coupling to framework abstractions

Option 3: Recursive Language Models (RLM) - SELECTED

Approach: Custom Rust implementation with distributed reasoning, hybrid search, and sandboxed execution

Pros:

  • Native Rust (zero-cost abstractions, safety)
  • Hybrid search (BM25 + semantic + RRF fusion)
  • Distributed LLM calls across chunks
  • Sandboxed execution (WASM + Docker)
  • Full control over implementation
  • Reuses existing VAPORA patterns (SurrealDB, NATS, Prometheus)

Cons:

  • ⚠️ More initial implementation effort
  • ⚠️ Maintaining a custom codebase

Decision: Option 3 - RLM Custom Implementation

Decision Outcome

Chosen Solution: Recursive Language Models (RLM)

Implement a native Rust RLM system as a foundational VAPORA component, providing:

  1. Chunking: Fixed, Semantic, and Code-aware strategies (see the sketch after this list)
  2. Hybrid Search: BM25 (Tantivy) + Semantic (embeddings) + RRF fusion
  3. Distributed Reasoning: Parallel LLM calls across relevant chunks
  4. Sandboxed Execution: WASM tier (<10ms) + Docker tier (80-150ms)
  5. Knowledge Graph: Store execution history with learning curves
  6. Multi-Provider: OpenAI, Claude, Gemini, Ollama support
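
The sketch below makes the Fixed strategy concrete. It is an illustration only, not the vapora-rlm implementation: it chunks by characters, while the real strategies also respect sentence and AST boundaries, and the function name chunk_fixed is hypothetical.

// Illustrative sketch (not the vapora-rlm API): fixed-size chunks with overlap,
// as in ChunkingStrategy::Fixed. Operates on characters for simplicity.
fn chunk_fixed(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = text.chars().collect(); // avoid splitting UTF-8 codepoints
    let stride = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += stride;
    }
    chunks
}

// With chunk_size = 1000 and overlap = 200 (the production config shown later),
// consecutive chunks share 200 characters, so a 10,000-character input yields
// 13 chunks starting at offsets 0, 800, 1600, ..., 9600.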

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                        RLM Engine                            │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Chunking    │  │ Hybrid Search│  │  Dispatcher  │      │
│  │              │  │              │  │              │      │
│  │ • Fixed      │  │ • BM25       │  │ • Parallel   │      │
│  │ • Semantic   │  │ • Semantic   │  │   LLM calls  │      │
│  │ • Code       │  │ • RRF Fusion │  │ • Aggregation│      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Storage    │  │   Sandbox    │  │  Metrics     │      │
│  │              │  │              │  │              │      │
│  │ • SurrealDB  │  │ • WASM       │  │ • Prometheus │      │
│  │ • Chunks     │  │ • Docker     │  │ • Costs      │      │
│  │ • Buffers    │  │ • Auto-tier  │  │ • Latency    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘

Implementation Details

Crate: vapora-rlm (17,000+ LOC)

Key Components:

// 1. Chunking
pub enum ChunkingStrategy {
    Fixed,      // Fixed-size chunks with overlap
    Semantic,   // Unicode-aware, sentence boundaries
    Code,       // AST-based (Rust, Python, JS)
}

// 2. Hybrid Search
pub struct HybridSearch {
    bm25_index: Arc<BM25Index>,      // Tantivy in-memory
    storage: Arc<dyn Storage>,        // SurrealDB
    config: HybridSearchConfig,       // RRF weights
}

// 3. LLM Dispatch
pub struct LLMDispatcher {
    client: Option<Arc<dyn LLMClient>>,  // Multi-provider
    config: DispatchConfig,               // Aggregation strategy
}

// 4. Sandbox
pub enum SandboxTier {
    WASM,   // <10ms, WASI-compatible commands
    Docker, // <150ms, full compatibility
}
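
The auto-tier selection referenced in the architecture diagram can be pictured as follows. This is a sketch under the assumption that tiering is driven by command compatibility, network needs, and expected runtime; select_tier and its allowlist are hypothetical, not the actual vapora-rlm logic.

// Sketch only: illustrates the auto-tier idea using the SandboxTier enum above.
fn select_tier(command: &str, needs_network: bool, expected_ms: u64) -> SandboxTier {
    const WASI_SAFE: &[&str] = &["cat", "grep", "wc", "jq"]; // illustrative allowlist
    let wasi_compatible = WASI_SAFE.iter().any(|c| command.starts_with(c));
    if wasi_compatible && !needs_network && expected_ms < 10 {
        SandboxTier::WASM // <10ms target, in-process isolation
    } else {
        SandboxTier::Docker // <150ms from the warm pool, full compatibility
    }
}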

Database Schema (SCHEMALESS for flexibility):

-- Chunks (from documents)
DEFINE TABLE rlm_chunks SCHEMALESS;
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
DEFINE INDEX idx_rlm_chunks_doc_id ON TABLE rlm_chunks COLUMNS doc_id;

-- Execution History (for learning)
DEFINE TABLE rlm_executions SCHEMALESS;
DEFINE INDEX idx_rlm_executions_execution_id ON TABLE rlm_executions COLUMNS execution_id UNIQUE;
DEFINE INDEX idx_rlm_executions_doc_id ON TABLE rlm_executions COLUMNS doc_id;

Key Decision: Use SCHEMALESS instead of SCHEMAFULL tables to avoid conflicts with SurrealDB's auto-generated id fields.
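
As an illustration of how a chunk record might be written to the SCHEMALESS table with the surrealdb Rust client: the ChunkRow struct and store_chunk helper below are hypothetical, not vapora-rlm types; only the table and field names mirror the schema above.

use serde::Serialize;
use surrealdb::engine::remote::ws::Client;
use surrealdb::Surreal;

#[derive(Serialize)]
struct ChunkRow {
    chunk_id: String,
    doc_id: String,
    content: String,
}

// Hypothetical helper: stores one chunk in the SCHEMALESS rlm_chunks table.
// SurrealDB generates the record id itself, which is why no id field is defined.
async fn store_chunk(db: &Surreal<Client>, row: ChunkRow) -> Result<(), surrealdb::Error> {
    db.query("CREATE rlm_chunks CONTENT $row")
        .bind(("row", row))
        .await?;
    Ok(())
}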

Production Usage

use std::sync::Arc;
use vapora_rlm::{RLMEngine, RLMEngineConfig, ChunkingConfig, ChunkingStrategy, EmbeddingConfig};
use vapora_llm_router::providers::OpenAIClient;

// Setup LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key, "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0
)?);

// Configure RLM
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic,
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};

// Create engine
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config)
)?;

// Usage
let chunks = engine.load_document(doc_id, content, None).await?;
let results = engine.query(doc_id, "error handling", None, 5).await?;
let response = engine.dispatch_subtask(doc_id, "Analyze code", None, 5).await?;
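
Conceptually, dispatch_subtask fans the subtask out over the top-k retrieved chunks and then aggregates the partial answers. The sketch below shows that fan-out/aggregate pattern only; the call_llm closure stands in for the multi-provider client, and the real dispatcher additionally handles retries, cost tracking, and metrics.

use futures::future::try_join_all;

// Sketch of the fan-out/aggregate pattern behind distributed reasoning.
// `call_llm` is an assumed stand-in, not the actual LLMClient trait.
async fn dispatch_over_chunks<F, Fut>(
    call_llm: F,
    task: &str,
    chunks: &[String],
) -> anyhow::Result<String>
where
    F: Fn(String) -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<String>>,
{
    // One prompt per relevant chunk, executed concurrently.
    let partials = try_join_all(
        chunks
            .iter()
            .map(|chunk| call_llm(format!("{task}\n\nContext:\n{chunk}"))),
    )
    .await?;

    // Aggregation step: merge the per-chunk answers with one final call.
    let merged = format!("{task}\n\nPartial answers:\n{}", partials.join("\n---\n"));
    call_llm(merged).await
}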

Consequences

Positive

Performance:

  • Handles 100k+ line documents without context rot
  • Query latency: ~90ms average (100-query benchmark)
  • WASM tier: <10ms for simple commands
  • Docker tier: <150ms from warm pool
  • Full workflow: <30s for 10k lines (2728 chunks)

Functionality:

  • Hybrid search outperforms pure semantic or BM25 alone
  • Distributed reasoning reduces hallucinations
  • Knowledge Graph enables learning from past executions
  • Multi-provider support (OpenAI, Claude, Ollama)

Quality:

  • 38/38 tests passing (100% pass rate)
  • 0 clippy warnings
  • Comprehensive E2E, performance, security tests
  • Production-ready with real persistence (no stubs)

Cost Efficiency:

  • Chunk-based processing reduces token usage
  • Cost tracking per provider and task
  • Local Ollama option for development (free)

Negative

Complexity:

  • ⚠️ Additional component to maintain (17k+ LOC)
  • ⚠️ Learning curve for distributed reasoning patterns
  • ⚠️ More moving parts (chunking, BM25, embeddings, dispatch)

Infrastructure:

  • ⚠️ Requires SurrealDB for persistence
  • ⚠️ Requires embedding provider (OpenAI/Ollama)
  • ⚠️ Optional Docker for full sandbox tier

Performance Trade-offs:

  • ⚠️ Load time ~22s for 10k lines (chunking + embedding + indexing)
  • ⚠️ BM25 rebuild time proportional to document size
  • ⚠️ Memory usage: ~25MB per WASM instance, ~100-300MB per Docker container

Risks and Mitigations

Risk                        Mitigation                              Status
SurrealDB schema conflicts  Use SCHEMALESS tables                   Resolved
BM25 index performance      In-memory Tantivy, auto-rebuild         Verified
LLM provider costs          Cost tracking, local Ollama option      Implemented
Sandbox escape              WASM isolation, Docker security tests   13/13 tests passing
Context window limits       Chunking + hybrid search + aggregation  Handles 100k+ tokens

Validation

Test Coverage

Basic integration:     4/4  ✅ (100%)
E2E integration:       9/9  ✅ (100%)
Security:             13/13 ✅ (100%)
Performance:           8/8  ✅ (100%)
Debug tests:           4/4  ✅ (100%)
───────────────────────────────────
Total:                38/38 ✅ (100%)

Performance Benchmarks

Query Latency (100 queries):
  Average: 90.6ms
  P50: 87.5ms
  P95: 88.3ms
  P99: 91.7ms

Large Document (10k lines):
  Load: ~22s (2728 chunks)
  Query: ~565ms
  Full workflow: <30s

BM25 Index:
  Build time: ~100ms for 1000 docs
  Search: <1ms for most queries

Integration Points

Existing VAPORA Components:

  • vapora-llm-router: LLM client integration
  • vapora-knowledge-graph: Execution history persistence
  • vapora-shared: Common error types and models
  • SurrealDB: Persistent storage backend
  • Prometheus: Metrics export (see the sketch below)
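
For the metrics export, the following sketch registers an RLM query-latency histogram with the prometheus crate. The metric name and label set are illustrative, not VAPORA's actual metric definitions.

use prometheus::{HistogramOpts, HistogramVec, Registry};

// Illustrative only: a per-provider query latency histogram.
fn register_rlm_metrics(registry: &Registry) -> prometheus::Result<HistogramVec> {
    let latency = HistogramVec::new(
        HistogramOpts::new(
            "vapora_rlm_query_latency_seconds",
            "Latency of RLM hybrid-search queries",
        ),
        &["provider"],
    )?;
    registry.register(Box::new(latency.clone()))?;
    Ok(latency)
}

// Usage: let timer = metrics.with_label_values(&["openai"]).start_timer();
//        ... run the query ...
//        timer.observe_duration();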

New Integration Surface:

// Backend API
POST /api/v1/rlm/analyze
{
  "content": "...",
  "query": "...",
  "strategy": "semantic"
}

// Agent Coordinator
let rlm_result = rlm_engine.dispatch_subtask(
    doc_id, task.description, None, 5
).await?;

Related Decisions:

  • ADR-003: Multi-provider LLM routing (Phase 6 dependency)
  • ADR-005: Knowledge Graph temporal modeling (RLM execution history)
  • ADR-006: Prometheus metrics standardization (RLM metrics)

References

Implementation:

  • crates/vapora-rlm/ - Full RLM implementation
  • crates/vapora-rlm/PRODUCTION.md - Production setup guide
  • crates/vapora-rlm/examples/ - Working examples
  • migrations/008_rlm_schema.surql - Database schema

Tests:

  • tests/e2e_integration.rs - End-to-end workflow tests
  • tests/performance_test.rs - Performance benchmarks
  • tests/security_test.rs - Sandbox security validation

Notes

Why SCHEMALESS vs SCHEMAFULL?

Initial implementation used SCHEMAFULL with explicit id field definitions:

DEFINE TABLE rlm_chunks SCHEMAFULL;
DEFINE FIELD id ON TABLE rlm_chunks TYPE record<rlm_chunks>;  -- ❌ Conflict

This caused data persistence failures because SurrealDB auto-generates the id field. The tables were changed to SCHEMALESS:

DEFINE TABLE rlm_chunks SCHEMALESS;  -- ✅ Works
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;

Indexes still work with SCHEMALESS, providing necessary performance without schema conflicts.

Why Hybrid Search?

Pure BM25 (keyword):

  • Fast, exact matches
  • Misses semantic similarity

Pure Semantic (embeddings):

  • Understands meaning
  • Expensive, misses exact keywords

Hybrid (BM25 + Semantic + RRF):

  • Combines exact keyword matches with semantic similarity
  • Reciprocal Rank Fusion (RRF) merges the two rankings without requiring score calibration (see the sketch below)
  • Empirically outperforms either method alone
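
A minimal sketch of Reciprocal Rank Fusion over the two ranked chunk lists. The k = 60 constant is the commonly used default; the real HybridSearchConfig also applies configurable weights, which are omitted here.

use std::collections::HashMap;

// Sketch of RRF: each list contributes 1 / (k + rank) per chunk (rank counted
// from 1), and chunks are re-ordered by the summed score.
fn rrf_fuse(bm25: &[String], semantic: &[String], k: f64) -> Vec<String> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [bm25, semantic] {
        for (rank, id) in list.iter().enumerate() {
            *scores.entry(id.clone()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused.into_iter().map(|(id, _)| id).collect()
}

// rrf_fuse(&bm25_ids, &semantic_ids, 60.0) yields a single ranking that rewards
// chunks appearing near the top of either list.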

Why Custom Implementation vs Framework?

Frameworks (LangChain, LlamaIndex):

  • Python-based (VAPORA is Rust)
  • Heavy abstractions
  • Less control
  • Dependency lock-in

Custom Rust RLM:

  • Native performance
  • Full control
  • Zero-cost abstractions
  • Direct integration with VAPORA patterns

Trade-off accepted: More initial effort for long-term maintainability and performance.


Supersedes: None (new decision)
Amended by: None
Last Updated: 2026-02-16