# ADR-008: Recursive Language Models (RLM) Integration

**Date**: 2026-02-16

**Status**: Accepted

**Deciders**: VAPORA Team

**Technical Story**: Phase 9 - RLM as Core Foundation

## Context and Problem Statement

VAPORA's agent system relied on **direct LLM calls** for all reasoning tasks, which created fundamental limitations:

1. **Context window limitations**: Single LLM calls fail beyond 50-100k tokens (context rot)
2. **No knowledge reuse**: Historical executions were not semantically searchable
3. **Single-shot reasoning**: No distributed analysis across document chunks
4. **Cost inefficiency**: Processing entire documents repeatedly instead of relevant chunks
5. **No incremental learning**: Agents couldn't learn from past successful solutions

**Question**: How do we enable long-context reasoning, knowledge reuse, and distributed LLM processing in VAPORA?

## Decision Drivers

**Must Have:**

- Handle documents >100k tokens without context rot
- Semantic search over historical executions
- Distributed reasoning across document chunks
- Integration with existing SurrealDB + NATS architecture
- Support multiple LLM providers (OpenAI, Claude, Ollama)

**Should Have:**

- Hybrid search (keyword + semantic)
- Cost tracking per provider
- Prometheus metrics
- Sandboxed execution environment

**Nice to Have:**

- WASM-based fast execution tier
- Docker warm pool for complex tasks

## Considered Options

### Option 1: RAG (Retrieval-Augmented Generation) Only

**Approach**: Traditional RAG with vector embeddings + SurrealDB

**Pros:**

- Simple to implement
- Well-understood pattern
- Good for basic Q&A

**Cons:**

- ❌ No distributed reasoning (single LLM call)
- ❌ No keyword search (semantic retrieval only)
- ❌ No execution sandbox
- ❌ Limited to simple retrieval tasks

### Option 2: LangChain/LlamaIndex Integration

**Approach**: Use existing framework (LangChain or LlamaIndex)

**Pros:**

- Pre-built components
- Active community
- Many integrations

**Cons:**

- ❌ Python-based (VAPORA is Rust-first)
- ❌ Heavy dependencies
- ❌ Less control over implementation
- ❌ Tight coupling to framework abstractions

### Option 3: Recursive Language Models (RLM) - **SELECTED**

**Approach**: Custom Rust implementation with distributed reasoning, hybrid search, and sandboxed execution

**Pros:**

- ✅ Native Rust (zero-cost abstractions, safety)
- ✅ Hybrid search (BM25 + semantic + RRF fusion)
- ✅ Distributed LLM calls across chunks
- ✅ Sandboxed execution (WASM + Docker)
- ✅ Full control over implementation
- ✅ Reuses existing VAPORA patterns (SurrealDB, NATS, Prometheus)

**Cons:**

- ⚠️ More initial implementation effort
- ⚠️ Maintaining custom codebase

**Decision**: **Option 3 - RLM Custom Implementation**

## Decision Outcome

### Chosen Solution: Recursive Language Models (RLM)

Implement a **native Rust RLM system** as a foundational VAPORA component, providing:

1. **Chunking**: Fixed, Semantic, Code-aware strategies (see the sketch after this list)
2. **Hybrid Search**: BM25 (Tantivy) + Semantic (embeddings) + RRF fusion
3. **Distributed Reasoning**: Parallel LLM calls across relevant chunks
4. **Sandboxed Execution**: WASM tier (<10ms) + Docker tier (80-150ms)
5. **Knowledge Graph**: Store execution history with learning curves
6. **Multi-Provider**: OpenAI, Claude, Gemini, Ollama support

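To make item 1 concrete, here is a minimal sketch of the fixed strategy: character windows of `chunk_size` that share `overlap` characters with their neighbour. It is illustrative only and deliberately simpler than the crate's token- and boundary-aware implementations.

```rust
/// Illustrative only: split `text` into windows of `chunk_size` characters
/// that overlap by `overlap` characters (the idea behind ChunkingStrategy::Fixed).
fn fixed_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

With the defaults shown later (`chunk_size: 1000`, `overlap: 200`), each window advances by 800 characters, which is roughly how a 10k-line document becomes a few thousand chunks.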
### Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                         RLM Engine                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │   Chunking   │    │ Hybrid Search│    │  Dispatcher  │   │
│  │              │    │              │    │              │   │
│  │ • Fixed      │    │ • BM25       │    │ • Parallel   │   │
│  │ • Semantic   │    │ • Semantic   │    │   LLM calls  │   │
│  │ • Code       │    │ • RRF Fusion │    │ • Aggregation│   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │   Storage    │    │   Sandbox    │    │   Metrics    │   │
│  │              │    │              │    │              │   │
│  │ • SurrealDB  │    │ • WASM       │    │ • Prometheus │   │
│  │ • Chunks     │    │ • Docker     │    │ • Costs      │   │
│  │ • Buffers    │    │ • Auto-tier  │    │ • Latency    │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
└─────────────────────────────────────────────────────────────┘
```

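The "Auto-tier" entry in the Sandbox box stands for picking the cheapest tier that can run a command: WASM for simple WASI-compatible invocations, Docker for everything else. A hedged sketch of that routing; the allow-list and function below are hypothetical illustrations, not the crate's actual API.

```rust
#[derive(Debug, PartialEq)]
enum SandboxTier {
    Wasm,   // <10ms, WASI-compatible commands only
    Docker, // 80-150ms from the warm pool, full compatibility
}

/// Hypothetical tier selection: route simple, WASI-friendly commands to WASM
/// and everything else (network, subprocesses, arbitrary syscalls) to Docker.
fn select_tier(command: &str) -> SandboxTier {
    // Illustrative allow-list; the real implementation is more nuanced.
    const WASI_OK: &[&str] = &["cat", "grep", "wc", "sort", "head", "tail"];
    let program = command.split_whitespace().next().unwrap_or("");
    if WASI_OK.contains(&program) {
        SandboxTier::Wasm
    } else {
        SandboxTier::Docker
    }
}
```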
### Implementation Details

**Crate**: `vapora-rlm` (17,000+ LOC)

**Key Components:**

```rust
// 1. Chunking
pub enum ChunkingStrategy {
    Fixed,    // Fixed-size chunks with overlap
    Semantic, // Unicode-aware, sentence boundaries
    Code,     // AST-based (Rust, Python, JS)
}

// 2. Hybrid Search
pub struct HybridSearch {
    bm25_index: Arc<BM25Index>, // Tantivy in-memory
    storage: Arc<dyn Storage>,  // SurrealDB
    config: HybridSearchConfig, // RRF weights
}

// 3. LLM Dispatch
pub struct LLMDispatcher {
    client: Option<Arc<dyn LLMClient>>, // Multi-provider
    config: DispatchConfig,             // Aggregation strategy
}

// 4. Sandbox
pub enum SandboxTier {
    WASM,   // <10ms, WASI-compatible commands
    Docker, // <150ms, full compatibility
}
```
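To illustrate how the dispatcher's fan-out and aggregation fit together: one prompt per relevant chunk, all calls running concurrently, then a single synthesis pass over the partial answers. The `LLMClient` trait and error type below are simplified stand-ins (using the `futures` and `async-trait` crates), not the crate's real signatures.

```rust
use futures::future::join_all;
use std::sync::Arc;

/// Simplified stand-in for the real multi-provider client trait.
#[async_trait::async_trait]
trait LLMClient: Send + Sync {
    async fn complete(&self, prompt: &str)
        -> Result<String, Box<dyn std::error::Error + Send + Sync>>;
}

/// Fan out one sub-prompt per relevant chunk, then aggregate the answers.
async fn dispatch_over_chunks(
    client: Arc<dyn LLMClient>,
    task: &str,
    chunks: &[String],
) -> Result<String, Box<dyn std::error::Error + Send + Sync>> {
    // One LLM call per chunk, all running concurrently.
    let calls = chunks.iter().map(|chunk| {
        let client = Arc::clone(&client);
        let prompt = format!("{task}\n\nRelevant excerpt:\n{chunk}");
        async move { client.complete(&prompt).await }
    });
    let partials: Vec<String> = join_all(calls)
        .await
        .into_iter()
        .collect::<Result<_, _>>()?;

    // Aggregation pass: ask the model to synthesize the per-chunk answers.
    let summary_prompt = format!("{task}\n\nPartial answers:\n{}", partials.join("\n---\n"));
    client.complete(&summary_prompt).await
}
```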
**Database Schema** (SCHEMALESS for flexibility):

```sql
-- Chunks (from documents)
DEFINE TABLE rlm_chunks SCHEMALESS;
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
DEFINE INDEX idx_rlm_chunks_doc_id ON TABLE rlm_chunks COLUMNS doc_id;

-- Execution History (for learning)
DEFINE TABLE rlm_executions SCHEMALESS;
DEFINE INDEX idx_rlm_executions_execution_id ON TABLE rlm_executions COLUMNS execution_id UNIQUE;
DEFINE INDEX idx_rlm_executions_doc_id ON TABLE rlm_executions COLUMNS doc_id;
```
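Because the tables are SCHEMALESS, the record shape lives on the Rust side: SurrealDB stores whatever serde serializes, and only the indexed columns above are contractual. A hypothetical shape for an `rlm_executions` row, for illustration only:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative only: a possible shape for an rlm_executions row. Only
/// `execution_id` and `doc_id` are implied by the indexes above; the other
/// fields are hypothetical examples of what a SCHEMALESS table will accept.
#[derive(Debug, Serialize, Deserialize)]
struct RlmExecution {
    execution_id: String,
    doc_id: String,
    task: String,
    chunks_used: u32,
    provider: String, // e.g. "openai", "claude", "ollama"
    cost_usd: f64,
    latency_ms: u64,
    created_at: String, // RFC 3339 timestamp
}
```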
**Key Decision**: Use **SCHEMALESS** instead of SCHEMAFULL tables to avoid conflicts with SurrealDB's auto-generated `id` fields.

### Production Usage

```rust
use std::sync::Arc;

use vapora_rlm::{
    ChunkingConfig, ChunkingStrategy, EmbeddingConfig, RLMEngine, RLMEngineConfig,
};
use vapora_llm_router::providers::OpenAIClient;

// Set up the LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key, "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0,
)?);

// Configure RLM
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic,
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};

// Create the engine
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config),
)?;

// Usage
let chunks = engine.load_document(doc_id, content, None).await?;
let results = engine.query(doc_id, "error handling", None, 5).await?;
let response = engine.dispatch_subtask(doc_id, "Analyze code", None, 5).await?;
```
## Consequences

### Positive

**Performance:**

- ✅ Handles 100k+ line documents without context rot
- ✅ Query latency: ~90ms average (100-query benchmark)
- ✅ WASM tier: <10ms for simple commands
- ✅ Docker tier: <150ms from warm pool
- ✅ Full workflow: <30s for 10k lines (2728 chunks)

**Functionality:**

- ✅ Hybrid search outperforms pure semantic or BM25 alone
- ✅ Distributed reasoning reduces hallucinations
- ✅ Knowledge Graph enables learning from past executions
- ✅ Multi-provider support (OpenAI, Claude, Ollama)

**Quality:**

- ✅ 38/38 tests passing (100% pass rate)
- ✅ 0 clippy warnings
- ✅ Comprehensive E2E, performance, security tests
- ✅ Production-ready with real persistence (no stubs)

**Cost Efficiency:**

- ✅ Chunk-based processing reduces token usage (see the estimate below)
- ✅ Cost tracking per provider and task
- ✅ Local Ollama option for development (free)

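A rough estimate behind the first bullet, using this ADR's own numbers (1,000-character chunks, top-5 retrieval) plus two assumptions: ~80 characters per line and ~4 characters per token.

```rust
fn main() {
    // Assumptions: ~80 characters per line, ~4 characters per token.
    let chars_per_line = 80u64;
    let chars_per_token = 4u64;

    // Sending a whole 10k-line document on every call:
    let full_doc_tokens = 10_000 * chars_per_line / chars_per_token; // ~200,000 tokens

    // Sending only the top-5 retrieved chunks of 1,000 characters each:
    let retrieved_tokens = 5 * 1_000 / chars_per_token; // ~1,250 tokens

    println!("full document: ~{full_doc_tokens} tokens, top-5 chunks: ~{retrieved_tokens} tokens");
}
```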
### Negative

**Complexity:**

- ⚠️ Additional component to maintain (17k+ LOC)
- ⚠️ Learning curve for distributed reasoning patterns
- ⚠️ More moving parts (chunking, BM25, embeddings, dispatch)

**Infrastructure:**

- ⚠️ Requires SurrealDB for persistence
- ⚠️ Requires embedding provider (OpenAI/Ollama)
- ⚠️ Optional Docker for full sandbox tier

**Performance Trade-offs:**

- ⚠️ Load time ~22s for 10k lines (chunking + embedding + indexing)
- ⚠️ BM25 rebuild time proportional to document size
- ⚠️ Memory usage: ~25MB per WASM instance, ~100-300MB per Docker container

### Risks and Mitigations

| Risk | Mitigation | Status |
|------|-----------|--------|
| SurrealDB schema conflicts | Use SCHEMALESS tables | ✅ Resolved |
| BM25 index performance | In-memory Tantivy, auto-rebuild | ✅ Verified |
| LLM provider costs | Cost tracking, local Ollama option | ✅ Implemented |
| Sandbox escape | WASM isolation, Docker security tests | ✅ 13/13 tests passing |
| Context window limits | Chunking + hybrid search + aggregation | ✅ Handles 100k+ tokens |

## Validation

### Test Coverage

```
Basic integration:   4/4  ✅ (100%)
E2E integration:     9/9  ✅ (100%)
Security:           13/13 ✅ (100%)
Performance:         8/8  ✅ (100%)
Debug tests:         4/4  ✅ (100%)
───────────────────────────────────
Total:              38/38 ✅ (100%)
```

### Performance Benchmarks

```
Query Latency (100 queries):
  Average: 90.6ms
  P50:     87.5ms
  P95:     88.3ms
  P99:     91.7ms

Large Document (10k lines):
  Load:          ~22s (2728 chunks)
  Query:         ~565ms
  Full workflow: <30s

BM25 Index:
  Build time: ~100ms for 1000 docs
  Search:     <1ms for most queries
```

### Integration Points

**Existing VAPORA Components:**

- ✅ `vapora-llm-router`: LLM client integration
- ✅ `vapora-knowledge-graph`: Execution history persistence
- ✅ `vapora-shared`: Common error types and models
- ✅ SurrealDB: Persistent storage backend
- ✅ Prometheus: Metrics export (see the sketch below)

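RLM metrics follow the same Prometheus registration pattern as the rest of VAPORA. A minimal sketch with the `prometheus` crate; the metric names and labels here are hypothetical, not the series actually exported by `vapora-rlm`.

```rust
use prometheus::{register_counter_vec, register_histogram_vec, CounterVec, HistogramVec};

/// Hypothetical RLM metrics; the real names and labels live in vapora-rlm.
struct RlmMetrics {
    query_latency: HistogramVec,
    llm_cost_usd: CounterVec,
}

impl RlmMetrics {
    fn register() -> prometheus::Result<Self> {
        Ok(Self {
            query_latency: register_histogram_vec!(
                "rlm_query_latency_seconds",
                "Latency of RLM hybrid-search queries",
                &["strategy"]
            )?,
            llm_cost_usd: register_counter_vec!(
                "rlm_llm_cost_usd_total",
                "Accumulated LLM spend, labelled by provider",
                &["provider"]
            )?,
        })
    }
}
```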
**New Integration Surface:**

```
// Backend API
POST /api/v1/rlm/analyze
{
  "content": "...",
  "query": "...",
  "strategy": "semantic"
}
```

```rust
// Agent Coordinator
let rlm_result = rlm_engine.dispatch_subtask(
    doc_id, task.description, None, 5,
).await?;
```
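A client-side sketch of calling that endpoint with `reqwest`; the host, port, and response shape are assumptions for illustration, so check the backend crate for the actual contract.

```rust
use serde_json::json;

/// Sketch of a client call; host, port, and response shape are assumptions.
async fn analyze(content: &str, query: &str) -> Result<serde_json::Value, reqwest::Error> {
    let client = reqwest::Client::new();
    let response = client
        .post("http://localhost:8080/api/v1/rlm/analyze") // host/port are an assumption
        .json(&json!({
            "content": content,
            "query": query,
            "strategy": "semantic",
        }))
        .send()
        .await?
        .error_for_status()?;
    response.json().await
}
```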
## Related Decisions

- **ADR-003**: Multi-provider LLM routing (Phase 6 dependency)
- **ADR-005**: Knowledge Graph temporal modeling (RLM execution history)
- **ADR-006**: Prometheus metrics standardization (RLM metrics)

## References

**Implementation:**

- `crates/vapora-rlm/` - Full RLM implementation
- `crates/vapora-rlm/PRODUCTION.md` - Production setup guide
- `crates/vapora-rlm/examples/` - Working examples
- `migrations/008_rlm_schema.surql` - Database schema

**External:**

- [Tantivy](https://github.com/quickwit-oss/tantivy) - BM25 full-text search
- [RRF Paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) - Reciprocal Rank Fusion
- [WASM Security Model](https://webassembly.org/docs/security/)

**Tests:**

- `tests/e2e_integration.rs` - End-to-end workflow tests
- `tests/performance_test.rs` - Performance benchmarks
- `tests/security_test.rs` - Sandbox security validation

## Notes

**Why SCHEMALESS vs SCHEMAFULL?**

The initial implementation used SCHEMAFULL with explicit `id` field definitions:

```sql
DEFINE TABLE rlm_chunks SCHEMAFULL;
DEFINE FIELD id ON TABLE rlm_chunks TYPE record<rlm_chunks>; -- ❌ Conflict
```

This caused data persistence failures because SurrealDB auto-generates `id` fields. Changed to SCHEMALESS:

```sql
DEFINE TABLE rlm_chunks SCHEMALESS; -- ✅ Works
DEFINE INDEX idx_rlm_chunks_chunk_id ON TABLE rlm_chunks COLUMNS chunk_id UNIQUE;
```

Indexes still work with SCHEMALESS tables, providing the necessary query performance without schema conflicts.

**Why Hybrid Search?**

Pure BM25 (keyword):

- ✅ Fast, exact matches
- ❌ Misses semantic similarity

Pure Semantic (embeddings):

- ✅ Understands meaning
- ❌ Expensive, misses exact keywords

Hybrid (BM25 + Semantic + RRF):

- ✅ Best of both worlds
- ✅ Reciprocal Rank Fusion combines the two rankings without needing score normalization (see the sketch below)
- ✅ Empirically outperforms either alone

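Reciprocal Rank Fusion itself is small enough to sketch: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k ≈ 60 in the original paper. A minimal version:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank_of_d).
/// `rankings` holds document ids ordered best-first (e.g. one list from BM25,
/// one from semantic search); `k` dampens the weight of top ranks (~60 in the paper).
fn rrf_fuse(rankings: &[Vec<String>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in rankings {
        for (i, doc_id) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // ranks are 1-based in the RRF formulation
            *scores.entry(doc_id.clone()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```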
**Why Custom Implementation vs Framework?**

Frameworks (LangChain, LlamaIndex):

- Python-based (VAPORA is Rust)
- Heavy abstractions
- Less control
- Dependency lock-in

Custom Rust RLM:

- Native performance
- Full control
- Zero-cost abstractions
- Direct integration with VAPORA patterns

**Trade-off accepted**: More initial effort for long-term maintainability and performance.

---

**Supersedes**: None (new decision)

**Amended by**: None

**Last Updated**: 2026-02-16