ADR-008: Recursive Language Models (RLM) Integration

Date: 2026-02-16
Status: Accepted
Deciders: VAPORA Team
Technical Story: Phase 9 - RLM as Core Foundation

Context and Problem Statement

VAPORA's agent system relied on direct LLM calls for all reasoning tasks, which created fundamental limitations:

  1. Context window limitations: Single LLM calls fail beyond 50-100k tokens (context rot)
  2. No knowledge reuse: Historical executions were not semantically searchable
  3. Single-shot reasoning: No distributed analysis across document chunks
  4. Cost inefficiency: Processing entire documents repeatedly instead of relevant chunks
  5. No incremental learning: Agents couldn't learn from past successful solutions

Question: How do we enable long-context reasoning, knowledge reuse, and distributed LLM processing in VAPORA?

Decision Drivers

Must Have:

  • Handle documents >100k tokens without context rot
  • Semantic search over historical executions
  • Distributed reasoning across document chunks
  • Integration with existing SurrealDB + NATS architecture
  • Support multiple LLM providers (OpenAI, Claude, Ollama)

Should Have:

  • Hybrid search (keyword + semantic)
  • Cost tracking per provider
  • Prometheus metrics
  • Sandboxed execution environment

Nice to Have:

  • WASM-based fast execution tier
  • Docker warm pool for complex tasks

Considered Options

Option 1: RAG (Retrieval-Augmented Generation) Only

Approach: Traditional RAG with vector embeddings + SurrealDB

Pros:

  • Simple to implement
  • Well-understood pattern
  • Good for basic Q&A

Cons:

  • No distributed reasoning (single LLM call)
  • No keyword search (semantic retrieval only)
  • No execution sandbox
  • Limited to simple retrieval tasks

Option 2: LangChain/LlamaIndex Integration

Approach: Use existing framework (LangChain or LlamaIndex)

Pros:

  • Pre-built components
  • Active community
  • Many integrations

Cons:

  • Python-based (VAPORA is Rust-first)
  • Heavy dependencies
  • Less control over implementation
  • Tight coupling to framework abstractions

Option 3: Recursive Language Models (RLM) - SELECTED

Approach: Custom Rust implementation with distributed reasoning, hybrid search, and sandboxed execution

Pros:

  • Native Rust (zero-cost abstractions, safety)
  • Hybrid search (BM25 + semantic + RRF fusion)
  • Distributed LLM calls across chunks
  • Sandboxed execution (WASM + Docker)
  • Full control over implementation
  • Reuses existing VAPORA patterns (SurrealDB, NATS, Prometheus)

Cons:

  • ⚠️ More initial implementation effort
  • ⚠️ Maintaining a custom codebase

Decision: Option 3 - RLM Custom Implementation

Decision Outcome

Chosen Solution: Recursive Language Models (RLM)

Implement a native Rust RLM system as a foundational VAPORA component, providing:

  1. Chunking: Fixed, Semantic, and Code-aware strategies (see the sketch after this list)
  2. Hybrid Search: BM25 (Tantivy) + Semantic (embeddings) + RRF fusion
  3. Distributed Reasoning: Parallel LLM calls across relevant chunks
  4. Sandboxed Execution: WASM tier (<10ms) + Docker tier (80-150ms)
  5. Knowledge Graph: Store execution history with learning curves
  6. Multi-Provider: OpenAI, Claude, Gemini, Ollama support
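
The sketch below makes the Fixed strategy concrete. It is an illustration only, not the vapora-rlm implementation: it chunks by characters, while the real strategies also respect sentence and AST boundaries, and the function name chunk_fixed is hypothetical.

// Illustrative sketch (not the vapora-rlm API): fixed-size chunks with overlap,
// as in ChunkingStrategy::Fixed. Operates on characters for simplicity.
fn chunk_fixed(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = text.chars().collect(); // avoid splitting UTF-8 codepoints
    let stride = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += stride;
    }
    chunks
}

// With chunk_size = 1000 and overlap = 200 (the production config shown later),
// consecutive chunks share 200 characters, so a 10,000-character input yields
// 13 chunks starting at offsets 0, 800, 1600, ..., 9600.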

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                        RLM Engine                            │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Chunking    │  │ Hybrid Search│  │  Dispatcher  │      │
│  │              │  │              │  │              │      │
│  │ • Fixed      │  │ • BM25       │  │ • Parallel   │      │
│  │ • Semantic   │  │ • Semantic   │  │   LLM calls  │      │
│  │ • Code       │  │ • RRF Fusion │  │ • Aggregation│      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Storage    │  │   Sandbox    │  │  Metrics     │      │
│  │              │  │              │  │              │      │
│  │ • SurrealDB  │  │ • WASM       │  │ • Prometheus │      │
│  │ • Chunks     │  │ • Docker     │  │ • Costs      │      │
│  │ • Buffers    │  │ • Auto-tier  │  │ • Latency    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘

Implementation Details

Crate: vapora-rlm (17,000+ LOC)

Key Components:

// 1. Chunking
pub enum ChunkingStrategy {
    Fixed,      // Fixed-size chunks with overlap
    Semantic,   // Unicode-aware, sentence boundaries
    Code,       // AST-based (Rust, Python, JS)
}

// 2. Hybrid Search
pub struct HybridSearch {
    bm25_index: Arc<BM25Index>,      // Tantivy in-memory
    storage: Arc<dyn Storage>,        // SurrealDB
    config: HybridSearchConfig,       // RRF weights
}

// 3. LLM Dispatch
pub struct LLMDispatcher {
    client: Option<Arc<dyn LLMClient>>,  // Multi-provider
    config: DispatchConfig,               // Aggregation strategy
}

// 4. Sandbox
pub enum SandboxTier {
    WASM,   // <10ms, WASI-compatible commands
    Docker, // <150ms, full compatibility
}
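
The auto-tier selection referenced in the architecture diagram can be pictured as follows. This is a sketch under the assumption that tiering is driven by command compatibility, network needs, and expected runtime; select_tier and its allowlist are hypothetical, not the actual vapora-rlm logic.

// Sketch only: illustrates the auto-tier idea using the SandboxTier enum above.
fn select_tier(command: &str, needs_network: bool, expected_ms: u64) -> SandboxTier {
    const WASI_SAFE: &[&str] = &["cat", "grep", "wc", "jq"]; // illustrative allowlist
    let wasi_compatible = WASI_SAFE.iter().any(|c| command.starts_with(c));
    if wasi_compatible && !needs_network && expected_ms < 10 {
        SandboxTier::WASM // <10ms target, in-process isolation
    } else {
        SandboxTier::Docker // <150ms from the warm pool, full compatibility
    }
}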

Database Schema (SCHEMALESS for flexibility):

-- Chunks (from documents)
DEFINE TABLE rlm_chunks SCHEMALESS;
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
DEFINE INDEX idx_rlm_chunks_doc_id ON TABLE rlm_chunks COLUMNS doc_id;

-- Execution History (for learning)
DEFINE TABLE rlm_executions SCHEMALESS;
DEFINE INDEX idx_rlm_executions_execution_id ON TABLE rlm_executions COLUMNS execution_id UNIQUE;
DEFINE INDEX idx_rlm_executions_doc_id ON TABLE rlm_executions COLUMNS doc_id;

Key Decision: Use SCHEMALESS instead of SCHEMAFULL tables to avoid conflicts with SurrealDB's auto-generated id fields.
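
As an illustration of how a chunk record might be written to the SCHEMALESS table with the surrealdb Rust client: the ChunkRow struct and store_chunk helper below are hypothetical, not vapora-rlm types; only the table and field names mirror the schema above.

use serde::Serialize;
use surrealdb::engine::remote::ws::Client;
use surrealdb::Surreal;

#[derive(Serialize)]
struct ChunkRow {
    chunk_id: String,
    doc_id: String,
    content: String,
}

// Hypothetical helper: stores one chunk in the SCHEMALESS rlm_chunks table.
// SurrealDB generates the record id itself, which is why no id field is defined.
async fn store_chunk(db: &Surreal<Client>, row: ChunkRow) -> Result<(), surrealdb::Error> {
    db.query("CREATE rlm_chunks CONTENT $row")
        .bind(("row", row))
        .await?;
    Ok(())
}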

Production Usage

use std::sync::Arc;
use vapora_rlm::{RLMEngine, RLMEngineConfig, ChunkingConfig, ChunkingStrategy, EmbeddingConfig};
use vapora_llm_router::providers::OpenAIClient;

// Setup LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key, "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0
)?);

// Configure RLM
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic,
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};

// Create engine
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config)
)?;

// Usage
let chunks = engine.load_document(doc_id, content, None).await?;
let results = engine.query(doc_id, "error handling", None, 5).await?;
let response = engine.dispatch_subtask(doc_id, "Analyze code", None, 5).await?;
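
Conceptually, dispatch_subtask fans the subtask out over the top-k retrieved chunks and then aggregates the partial answers. The sketch below shows that fan-out/aggregate pattern only; the call_llm closure stands in for the multi-provider client, and the real dispatcher additionally handles retries, cost tracking, and metrics.

use futures::future::try_join_all;

// Sketch of the fan-out/aggregate pattern behind distributed reasoning.
// `call_llm` is an assumed stand-in, not the actual LLMClient trait.
async fn dispatch_over_chunks<F, Fut>(
    call_llm: F,
    task: &str,
    chunks: &[String],
) -> anyhow::Result<String>
where
    F: Fn(String) -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<String>>,
{
    // One prompt per relevant chunk, executed concurrently.
    let partials = try_join_all(
        chunks
            .iter()
            .map(|chunk| call_llm(format!("{task}\n\nContext:\n{chunk}"))),
    )
    .await?;

    // Aggregation step: merge the per-chunk answers with one final call.
    let merged = format!("{task}\n\nPartial answers:\n{}", partials.join("\n---\n"));
    call_llm(merged).await
}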

Consequences

Positive

Performance:

  • Handles 100k+ line documents without context rot
  • Query latency: ~90ms average (100-query benchmark)
  • WASM tier: <10ms for simple commands
  • Docker tier: <150ms from warm pool
  • Full workflow: <30s for 10k lines (2728 chunks)

Functionality:

  • Hybrid search outperforms pure semantic or BM25 alone
  • Distributed reasoning reduces hallucinations
  • Knowledge Graph enables learning from past executions
  • Multi-provider support (OpenAI, Claude, Ollama)

Quality:

  • 38/38 tests passing (100% pass rate)
  • 0 clippy warnings
  • Comprehensive E2E, performance, security tests
  • Production-ready with real persistence (no stubs)

Cost Efficiency:

  • Chunk-based processing reduces token usage
  • Cost tracking per provider and task
  • Local Ollama option for development (free)

Negative

Complexity:

  • ⚠️ Additional component to maintain (17k+ LOC)
  • ⚠️ Learning curve for distributed reasoning patterns
  • ⚠️ More moving parts (chunking, BM25, embeddings, dispatch)

Infrastructure:

  • ⚠️ Requires SurrealDB for persistence
  • ⚠️ Requires embedding provider (OpenAI/Ollama)
  • ⚠️ Optional Docker for full sandbox tier

Performance Trade-offs:

  • ⚠️ Load time ~22s for 10k lines (chunking + embedding + indexing)
  • ⚠️ BM25 rebuild time proportional to document size
  • ⚠️ Memory usage: ~25MB per WASM instance, ~100-300MB per Docker container

Risks and Mitigations

Risk                        Mitigation                              Status
SurrealDB schema conflicts  Use SCHEMALESS tables                   Resolved
BM25 index performance      In-memory Tantivy, auto-rebuild         Verified
LLM provider costs          Cost tracking, local Ollama option      Implemented
Sandbox escape              WASM isolation, Docker security tests   13/13 tests passing
Context window limits       Chunking + hybrid search + aggregation  Handles 100k+ tokens

Validation

Test Coverage

Basic integration:     4/4  ✅ (100%)
E2E integration:       9/9  ✅ (100%)
Security:             13/13 ✅ (100%)
Performance:           8/8  ✅ (100%)
Debug tests:           4/4  ✅ (100%)
───────────────────────────────────
Total:                38/38 ✅ (100%)

Performance Benchmarks

Query Latency (100 queries):
  Average: 90.6ms
  P50: 87.5ms
  P95: 88.3ms
  P99: 91.7ms

Large Document (10k lines):
  Load: ~22s (2728 chunks)
  Query: ~565ms
  Full workflow: <30s

BM25 Index:
  Build time: ~100ms for 1000 docs
  Search: <1ms for most queries

Integration Points

Existing VAPORA Components:

  • vapora-llm-router: LLM client integration
  • vapora-knowledge-graph: Execution history persistence
  • vapora-shared: Common error types and models
  • SurrealDB: Persistent storage backend
  • Prometheus: Metrics export (see the sketch below)
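
For the metrics export, the following sketch registers an RLM query-latency histogram with the prometheus crate. The metric name and label set are illustrative, not VAPORA's actual metric definitions.

use prometheus::{HistogramOpts, HistogramVec, Registry};

// Illustrative only: a per-provider query latency histogram.
fn register_rlm_metrics(registry: &Registry) -> prometheus::Result<HistogramVec> {
    let latency = HistogramVec::new(
        HistogramOpts::new(
            "vapora_rlm_query_latency_seconds",
            "Latency of RLM hybrid-search queries",
        ),
        &["provider"],
    )?;
    registry.register(Box::new(latency.clone()))?;
    Ok(latency)
}

// Usage: let timer = metrics.with_label_values(&["openai"]).start_timer();
//        ... run the query ...
//        timer.observe_duration();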

New Integration Surface:

// Backend API
POST /api/v1/rlm/analyze
{
  "content": "...",
  "query": "...",
  "strategy": "semantic"
}

// Agent Coordinator
let rlm_result = rlm_engine.dispatch_subtask(
    doc_id, task.description, None, 5
).await?;

Related Decisions:

  • ADR-003: Multi-provider LLM routing (Phase 6 dependency)
  • ADR-005: Knowledge Graph temporal modeling (RLM execution history)
  • ADR-006: Prometheus metrics standardization (RLM metrics)

References

Implementation:

  • crates/vapora-rlm/ - Full RLM implementation
  • crates/vapora-rlm/PRODUCTION.md - Production setup guide
  • crates/vapora-rlm/examples/ - Working examples
  • migrations/008_rlm_schema.surql - Database schema

Tests:

  • tests/e2e_integration.rs - End-to-end workflow tests
  • tests/performance_test.rs - Performance benchmarks
  • tests/security_test.rs - Sandbox security validation

Notes

Why SCHEMALESS vs SCHEMAFULL?

Initial implementation used SCHEMAFULL with explicit id field definitions:

DEFINE TABLE rlm_chunks SCHEMAFULL;
DEFINE FIELD id ON TABLE rlm_chunks TYPE record<rlm_chunks>;  -- ❌ Conflict

This caused data persistence failures because SurrealDB auto-generates the id field. The tables were changed to SCHEMALESS:

DEFINE TABLE rlm_chunks SCHEMALESS;  -- ✅ Works
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;

Indexes still work with SCHEMALESS, providing necessary performance without schema conflicts.

Why Hybrid Search?

Pure BM25 (keyword):

  • Fast, exact matches
  • Misses semantic similarity

Pure Semantic (embeddings):

  • Understands meaning
  • Expensive, misses exact keywords

Hybrid (BM25 + Semantic + RRF):

  • Combines exact keyword matches with semantic similarity
  • Reciprocal Rank Fusion (RRF) merges the two rankings without requiring score calibration (see the sketch below)
  • Empirically outperforms either method alone
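
A minimal sketch of Reciprocal Rank Fusion over the two ranked chunk lists. The k = 60 constant is the commonly used default; the real HybridSearchConfig also applies configurable weights, which are omitted here.

use std::collections::HashMap;

// Sketch of RRF: each list contributes 1 / (k + rank) per chunk (rank counted
// from 1), and chunks are re-ordered by the summed score.
fn rrf_fuse(bm25: &[String], semantic: &[String], k: f64) -> Vec<String> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [bm25, semantic] {
        for (rank, id) in list.iter().enumerate() {
            *scores.entry(id.clone()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused.into_iter().map(|(id, _)| id).collect()
}

// rrf_fuse(&bm25_ids, &semantic_ids, 60.0) yields a single ranking that rewards
// chunks appearing near the top of either list.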

Why Custom Implementation vs Framework?

Frameworks (LangChain, LlamaIndex):

  • Python-based (VAPORA is Rust)
  • Heavy abstractions
  • Less control
  • Dependency lock-in

Custom Rust RLM:

  • Native performance
  • Full control
  • Zero-cost abstractions
  • Direct integration with VAPORA patterns

Trade-off accepted: More initial effort for long-term maintainability and performance.


Supersedes: None (new decision)
Amended by: None
Last Updated: 2026-02-16