# ADR-008: Recursive Language Models (RLM) Integration
**Date**: 2026-02-16
**Status**: Accepted
**Deciders**: VAPORA Team
**Technical Story**: Phase 9 - RLM as Core Foundation
## Context and Problem Statement
VAPORA's agent system relied on **direct LLM calls** for all reasoning tasks, which created fundamental limitations:
1. **Context window limitations**: Single LLM calls fail beyond 50-100k tokens (context rot)
2. **No knowledge reuse**: Historical executions were not semantically searchable
3. **Single-shot reasoning**: No distributed analysis across document chunks
4. **Cost inefficiency**: Processing entire documents repeatedly instead of relevant chunks
5. **No incremental learning**: Agents couldn't learn from past successful solutions
**Question**: How do we enable long-context reasoning, knowledge reuse, and distributed LLM processing in VAPORA?
## Decision Drivers
**Must Have:**
- Handle documents >100k tokens without context rot
- Semantic search over historical executions
- Distributed reasoning across document chunks
- Integration with existing SurrealDB + NATS architecture
- Support multiple LLM providers (OpenAI, Claude, Ollama)
**Should Have:**
- Hybrid search (keyword + semantic)
- Cost tracking per provider
- Prometheus metrics
- Sandboxed execution environment
**Nice to Have:**
- WASM-based fast execution tier
- Docker warm pool for complex tasks
## Considered Options
### Option 1: RAG (Retrieval-Augmented Generation) Only
**Approach**: Traditional RAG with vector embeddings + SurrealDB
**Pros:**
- Simple to implement
- Well-understood pattern
- Good for basic Q&A
**Cons:**
- ❌ No distributed reasoning (single LLM call)
- ❌ No keyword search (semantic-only retrieval)
- ❌ No execution sandbox
- ❌ Limited to simple retrieval tasks
### Option 2: LangChain/LlamaIndex Integration
**Approach**: Use existing framework (LangChain or LlamaIndex)
**Pros:**
- Pre-built components
- Active community
- Many integrations
**Cons:**
- ❌ Python-based (VAPORA is Rust-first)
- ❌ Heavy dependencies
- ❌ Less control over implementation
- ❌ Tight coupling to framework abstractions
### Option 3: Recursive Language Models (RLM) - **SELECTED**
**Approach**: Custom Rust implementation with distributed reasoning, hybrid search, and sandboxed execution
**Pros:**
- ✅ Native Rust (zero-cost abstractions, safety)
- ✅ Hybrid search (BM25 + semantic + RRF fusion)
- ✅ Distributed LLM calls across chunks
- ✅ Sandboxed execution (WASM + Docker)
- ✅ Full control over implementation
- ✅ Reuses existing VAPORA patterns (SurrealDB, NATS, Prometheus)
**Cons:**
- ⚠️ More initial implementation effort
- ⚠️ Maintaining custom codebase
**Decision**: **Option 3 - RLM Custom Implementation**
## Decision Outcome
### Chosen Solution: Recursive Language Models (RLM)
Implement a **native Rust RLM system** as a foundational VAPORA component, providing:
1. **Chunking**: Fixed, Semantic, Code-aware strategies
2. **Hybrid Search**: BM25 (Tantivy) + Semantic (embeddings) + RRF fusion (see the fusion sketch after this list)
3. **Distributed Reasoning**: Parallel LLM calls across relevant chunks
4. **Sandboxed Execution**: WASM tier (<10ms) + Docker tier (80-150ms)
5. **Knowledge Graph**: Store execution history with learning curves
6. **Multi-Provider**: OpenAI, Claude, Gemini, Ollama support
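To make the fusion step in item 2 concrete, the sketch below shows Reciprocal Rank Fusion over ranked lists of chunk ids. It is a minimal illustration, not the `vapora-rlm` implementation; the weighting parameter mirrors the configurable RRF weights mentioned later, and `k = 60` follows the original RRF paper.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion:
/// score(d) = sum over rankings of weight / (k + rank(d)), with 1-based ranks.
/// `k` dampens the dominance of top ranks (the RRF paper uses k = 60).
fn rrf_fuse(rankings: &[Vec<String>], weights: &[f64], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (ranking, weight) in rankings.iter().zip(weights) {
        for (rank, chunk_id) in ranking.iter().enumerate() {
            *scores.entry(chunk_id.clone()).or_insert(0.0) += weight / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

Because RRF works on rank positions rather than raw scores, BM25 and embedding-similarity results can be combined without score normalization.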
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│ RLM Engine │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Chunking │ │ Hybrid Search│ │ Dispatcher │ │
│ │ │ │ │ │ │ │
│ │ • Fixed │ │ • BM25 │ │ • Parallel │ │
│ │ • Semantic │ │ • Semantic │ │ LLM calls │ │
│ │ • Code │ │ • RRF Fusion │ │ • Aggregation│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Storage │ │ Sandbox │ │ Metrics │ │
│ │ │ │ │ │ │ │
│ │ • SurrealDB │ │ • WASM │ │ • Prometheus │ │
│ │ • Chunks │ │ • Docker │ │ • Costs │ │
│ │ • Buffers │ │ • Auto-tier │ │ • Latency │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
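The "Auto-tier" box in the Sandbox component is essentially a routing decision: simple, WASI-compatible work goes to the fast WASM tier, everything else to Docker. A hypothetical selector is sketched below; the request fields, predicates, and allow-list are assumptions, not the actual `vapora-rlm` logic.

```rust
/// Hypothetical execution request; field names are illustrative assumptions.
struct ExecRequest {
    command: String,
    needs_network: bool,
    needs_filesystem_write: bool,
}

enum SandboxTier {
    Wasm,   // <10ms, WASI-compatible commands
    Docker, // <150ms from the warm pool, full compatibility
}

/// Route each request to the cheapest tier that can run it.
fn select_tier(req: &ExecRequest, wasi_allow_list: &[&str]) -> SandboxTier {
    let wasi_ok = wasi_allow_list.iter().any(|cmd| req.command.starts_with(cmd));
    if wasi_ok && !req.needs_network && !req.needs_filesystem_write {
        SandboxTier::Wasm
    } else {
        SandboxTier::Docker
    }
}
```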
### Implementation Details
**Crate**: `vapora-rlm` (17,000+ LOC)
**Key Components:**
```rust
// 1. Chunking
pub enum ChunkingStrategy {
    Fixed,    // Fixed-size chunks with overlap
    Semantic, // Unicode-aware, sentence boundaries
    Code,     // AST-based (Rust, Python, JS)
}

// 2. Hybrid Search
pub struct HybridSearch {
    bm25_index: Arc<BM25Index>, // Tantivy in-memory
    storage: Arc<dyn Storage>,  // SurrealDB
    config: HybridSearchConfig, // RRF weights
}

// 3. LLM Dispatch
pub struct LLMDispatcher {
    client: Option<Arc<dyn LLMClient>>, // Multi-provider
    config: DispatchConfig,             // Aggregation strategy
}

// 4. Sandbox
pub enum SandboxTier {
    WASM,   // <10ms, WASI-compatible commands
    Docker, // <150ms, full compatibility
}
```
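For intuition on the `Fixed` strategy above, a minimal character-based chunker with overlap could look like the sketch below (the `Chunk` shape and character units are simplifying assumptions; the real strategies also respect sentence and AST boundaries):

```rust
/// Illustrative chunk shape; the real vapora-rlm chunk type carries more metadata.
struct Chunk {
    index: usize,
    text: String,
}

/// Split `content` into fixed-size chunks that share `overlap` characters
/// with their predecessor, so context at chunk boundaries is not lost.
fn chunk_fixed(content: &str, chunk_size: usize, overlap: usize) -> Vec<Chunk> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = content.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(Chunk {
            index: chunks.len(),
            text: chars[start..end].iter().collect(),
        });
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}
```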
**Database Schema** (SCHEMALESS for flexibility):
```sql
-- Chunks (from documents)
DEFINE TABLE rlm_chunks SCHEMALESS;
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
DEFINE INDEX idx_rlm_chunks_doc_id ON TABLE rlm_chunks COLUMNS doc_id;
-- Execution History (for learning)
DEFINE TABLE rlm_executions SCHEMALESS;
DEFINE INDEX idx_rlm_executions_execution_id ON TABLE rlm_executions COLUMNS execution_id UNIQUE;
DEFINE INDEX idx_rlm_executions_doc_id ON TABLE rlm_executions COLUMNS doc_id;
```
**Key Decision**: Use **SCHEMALESS** instead of SCHEMAFULL tables to avoid conflicts with SurrealDB's auto-generated `id` fields.
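In practice this means record types carry their own business keys and leave the record `id` entirely to SurrealDB. A minimal sketch, with field names mirroring the indexes above (the actual `vapora-rlm` types differ):

```rust
use serde::{Deserialize, Serialize};

/// Stored in the SCHEMALESS `rlm_chunks` table. There is deliberately no `id`
/// field: SurrealDB generates the record id itself, which is what the
/// SCHEMAFULL definition conflicted with (see Notes below).
#[derive(Debug, Serialize, Deserialize)]
struct ChunkRecord {
    chunk_id: String,            // unique via idx_rlm_chunks_chunk_id
    doc_id: String,              // indexed via idx_rlm_chunks_doc_id
    content: String,
    embedding: Option<Vec<f32>>, // populated when an embedding provider is configured
}
```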
### Production Usage
```rust
use std::sync::Arc;

use vapora_llm_router::providers::OpenAIClient;
use vapora_rlm::{
    ChunkingConfig, ChunkingStrategy, EmbeddingConfig, RLMEngine, RLMEngineConfig,
};

// Setup LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key, "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0,
)?);

// Configure RLM
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic,
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};

// Create engine
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config),
)?;

// Usage
let chunks = engine.load_document(doc_id, content, None).await?;
let results = engine.query(doc_id, "error handling", None, 5).await?;
let response = engine.dispatch_subtask(doc_id, "Analyze code", None, 5).await?;
```
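Conceptually, `dispatch_subtask` retrieves the top-k chunks for the task, fans the task out as one LLM call per chunk, and aggregates the partial answers. The sketch below shows that shape with `futures::future::join_all`; the `LLMClient` trait and the naive join-based aggregation are assumptions, not the actual `vapora-rlm` internals.

```rust
use std::sync::Arc;

use futures::future::join_all;

/// Hypothetical minimal client trait; vapora-llm-router's real trait differs.
#[async_trait::async_trait]
trait LLMClient: Send + Sync {
    async fn complete(&self, prompt: &str) -> anyhow::Result<String>;
}

/// Fan one task out across the retrieved chunks in parallel, then aggregate.
async fn dispatch_over_chunks(
    client: Arc<dyn LLMClient>,
    task: &str,
    chunks: &[String],
) -> anyhow::Result<String> {
    let calls = chunks.iter().map(|chunk| {
        let client = Arc::clone(&client);
        let prompt = format!("Task: {task}\n\nRelevant chunk:\n{chunk}");
        async move { client.complete(&prompt).await }
    });

    // One LLM call per relevant chunk, executed concurrently.
    let partials: Vec<String> = join_all(calls)
        .await
        .into_iter()
        .collect::<anyhow::Result<_>>()?;

    // Naive aggregation; a real strategy might run a final summarization call.
    Ok(partials.join("\n---\n"))
}
```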
## Consequences
### Positive
**Performance:**
- Handles 100k+ line documents without context rot
- Query latency: ~90ms average (100-query benchmark)
- WASM tier: <10ms for simple commands
- Docker tier: <150ms from warm pool
- Full workflow: <30s for 10k lines (2728 chunks)
**Functionality:**
- Hybrid search outperforms pure semantic or BM25 alone
- Distributed reasoning reduces hallucinations
- Knowledge Graph enables learning from past executions
- Multi-provider support (OpenAI, Claude, Ollama)
**Quality:**
- 38/38 tests passing (100% pass rate)
- 0 clippy warnings
- Comprehensive E2E, performance, security tests
- Production-ready with real persistence (no stubs)
**Cost Efficiency:**
- Chunk-based processing reduces token usage
- Cost tracking per provider and task (see the sketch after this list)
- Local Ollama option for development (free)
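For the per-provider cost tracking mentioned above, the usual pattern with the `prometheus` crate is a labelled counter, roughly as sketched below (the metric name and label set are assumptions; the actual RLM metric names may differ):

```rust
use prometheus::{register_counter_vec, CounterVec};

/// Hypothetical metric: accumulated LLM spend in USD, labelled by provider.
/// Register once at startup and reuse the handle.
fn cost_counter() -> prometheus::Result<CounterVec> {
    register_counter_vec!(
        "rlm_llm_cost_usd_total",
        "Accumulated LLM cost in USD per provider",
        &["provider"]
    )
}

/// Record the cost of a single LLM call.
fn record_cost(counter: &CounterVec, provider: &str, usd: f64) {
    counter.with_label_values(&[provider]).inc_by(usd);
}
```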
### Negative
**Complexity:**
- Additional component to maintain (17k+ LOC)
- Learning curve for distributed reasoning patterns
- More moving parts (chunking, BM25, embeddings, dispatch)
**Infrastructure:**
- Requires SurrealDB for persistence
- Requires embedding provider (OpenAI/Ollama)
- Optional Docker for full sandbox tier
**Performance Trade-offs:**
- Load time ~22s for 10k lines (chunking + embedding + indexing)
- BM25 rebuild time proportional to document size
- Memory usage: ~25MB per WASM instance, ~100-300MB per Docker container
### Risks and Mitigations
| Risk | Mitigation | Status |
|------|-----------|--------|
| SurrealDB schema conflicts | Use SCHEMALESS tables | Resolved |
| BM25 index performance | In-memory Tantivy, auto-rebuild | Verified |
| LLM provider costs | Cost tracking, local Ollama option | Implemented |
| Sandbox escape | WASM isolation, Docker security tests | 13/13 tests passing |
| Context window limits | Chunking + hybrid search + aggregation | Handles 100k+ tokens |
## Validation
### Test Coverage
```
Basic integration: 4/4 ✅ (100%)
E2E integration: 9/9 ✅ (100%)
Security: 13/13 ✅ (100%)
Performance: 8/8 ✅ (100%)
Debug tests: 4/4 ✅ (100%)
───────────────────────────────────
Total: 38/38 ✅ (100%)
```
### Performance Benchmarks
```
Query Latency (100 queries):
  Average: 90.6ms
  P50:     87.5ms
  P95:     88.3ms
  P99:     91.7ms

Large Document (10k lines):
  Load:          ~22s (2728 chunks)
  Query:         ~565ms
  Full workflow: <30s

BM25 Index:
  Build time: ~100ms for 1000 docs
  Search:     <1ms for most queries
```
### Integration Points
**Existing VAPORA Components:**
- `vapora-llm-router`: LLM client integration
- `vapora-knowledge-graph`: Execution history persistence
- `vapora-shared`: Common error types and models
- SurrealDB: Persistent storage backend
- Prometheus: Metrics export
**New Integration Surface:**
```rust
// Backend API
// POST /api/v1/rlm/analyze
// {
//   "content": "...",
//   "query": "...",
//   "strategy": "semantic"
// }

// Agent Coordinator
let rlm_result = rlm_engine.dispatch_subtask(
    doc_id, task.description, None, 5,
).await?;
```
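A hedged example of calling that endpoint from Rust with `reqwest` (the base URL, authentication, and response shape are assumptions; adjust to the actual API contract):

```rust
use serde_json::{json, Value};

/// POST a document and query to the RLM analyze endpoint.
async fn rlm_analyze(base_url: &str, content: &str, query: &str) -> anyhow::Result<Value> {
    let client = reqwest::Client::new();
    let response = client
        .post(format!("{base_url}/api/v1/rlm/analyze"))
        .json(&json!({
            "content": content,
            "query": query,
            "strategy": "semantic"
        }))
        .send()
        .await?
        .error_for_status()?;

    // Assumes a JSON response body.
    Ok(response.json::<Value>().await?)
}
```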
## Related Decisions
- **ADR-003**: Multi-provider LLM routing (Phase 6 dependency)
- **ADR-005**: Knowledge Graph temporal modeling (RLM execution history)
- **ADR-006**: Prometheus metrics standardization (RLM metrics)
## References
**Implementation:**
- `crates/vapora-rlm/` - Full RLM implementation
- `crates/vapora-rlm/PRODUCTION.md` - Production setup guide
- `crates/vapora-rlm/examples/` - Working examples
- `migrations/008_rlm_schema.surql` - Database schema
**External:**
- [Tantivy](https://github.com/quickwit-oss/tantivy) - BM25 full-text search
- [RRF Paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) - Reciprocal Rank Fusion
- [WASM Security Model](https://webassembly.org/docs/security/)
**Tests:**
- `tests/e2e_integration.rs` - End-to-end workflow tests
- `tests/performance_test.rs` - Performance benchmarks
- `tests/security_test.rs` - Sandbox security validation
## Notes
**Why SCHEMALESS vs SCHEMAFULL?**
Initial implementation used SCHEMAFULL with explicit `id` field definitions:
```sql
DEFINE TABLE rlm_chunks SCHEMAFULL;
DEFINE FIELD id ON TABLE rlm_chunks TYPE record<rlm_chunks>; -- ❌ Conflict
```
This caused data persistence failures because SurrealDB auto-generates `id` fields. Changed to SCHEMALESS:
```sql
DEFINE TABLE rlm_chunks SCHEMALESS; -- ✅ Works
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
```
Indexes still work with SCHEMALESS, providing necessary performance without schema conflicts.
**Why Hybrid Search?**
Pure BM25 (keyword):
- Fast, exact matches
- Misses semantic similarity
Pure Semantic (embeddings):
- Understands meaning
- More expensive; misses exact keywords
Hybrid (BM25 + Semantic + RRF):
- Combines the strengths of both
- Reciprocal Rank Fusion merges the two ranked lists by rank position, so no score normalization is needed
- Empirically outperforms either approach alone
**Why Custom Implementation vs Framework?**
Frameworks (LangChain, LlamaIndex):
- Python-based (VAPORA is Rust-first)
- Heavy abstractions
- Less control
- Dependency lock-in
Custom Rust RLM:
- Native performance
- Full control
- Zero-cost abstractions
- Direct integration with VAPORA patterns
**Trade-off accepted**: More initial effort for long-term maintainability and performance.
---
**Supersedes**: None (new decision)
**Amended by**: None
**Last Updated**: 2026-02-16