RLM Production Setup Guide
This guide shows how to configure vapora-rlm for production use with LLM clients and embeddings.
Prerequisites
- SurrealDB running on port 8000
- LLM Provider (choose one):
  - OpenAI (cloud, requires API key)
  - Anthropic Claude (cloud, requires API key)
  - Ollama (local, free)
- Optional: Docker for Docker sandbox tier
Quick Start
Option 1: Cloud (OpenAI)
# Set API key
export OPENAI_API_KEY="sk-..."
# Run example
cargo run --example production_setup
Option 2: Local (Ollama)
# Install and start Ollama
brew install ollama
ollama serve
# Pull model
ollama pull llama3.2
# Run example
cargo run --example local_ollama
Production Configuration
1. Create RLM Engine with LLM Client
use std::sync::Arc;
use vapora_llm_router::providers::OpenAIClient;
use vapora_rlm::RLMEngine;
// Setup LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-4".to_string(),
    4096, // max_tokens
    0.7,  // temperature
    5.0,  // cost per 1M input tokens
    15.0, // cost per 1M output tokens
)?);
// Create engine with LLM
let engine = RLMEngine::with_llm_client(
    storage,
    bm25_index,
    llm_client,
    Some(config),
)?;
2. Configure Chunking Strategy
use vapora_rlm::chunking::{ChunkingConfig, ChunkingStrategy};
use vapora_rlm::embeddings::EmbeddingConfig;
use vapora_rlm::engine::RLMEngineConfig;
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic, // or Fixed, Code
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};
3. Configure Embeddings
use vapora_rlm::embeddings::EmbeddingConfig;
// OpenAI (1536 dimensions)
let embedding_config = EmbeddingConfig::openai_small();
// OpenAI (3072 dimensions)
let embedding_config = EmbeddingConfig::openai_large();
// Ollama (local)
let embedding_config = EmbeddingConfig::ollama("llama3.2");
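The embedding config plugs into the engine through the embedding field of RLMEngineConfig. A minimal sketch using only the names shown above (and assuming RLMEngineConfig implements Default, as the tuning examples further down suggest):
// Sketch: wire an embedding config into the engine configuration;
// remaining fields fall back to their defaults.
let config = RLMEngineConfig {
    embedding: Some(EmbeddingConfig::ollama("llama3.2")),
    ..Default::default()
};
let engine = RLMEngine::with_llm_client(storage, bm25_index, llm_client, Some(config))?;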
4. Use RLM in Production
// Load document
let chunk_count = engine.load_document(doc_id, content, None).await?;
// Query with hybrid search (BM25 + semantic + RRF)
let results = engine.query(doc_id, "your query", None, 5).await?;
// Dispatch to LLM for distributed reasoning
let response = engine
    .dispatch_subtask(doc_id, "Analyze this code", None, 5)
    .await?;
println!("LLM Response: {}", response.text);
println!(
    "Tokens: {} in, {} out",
    response.total_input_tokens,
    response.total_output_tokens
);
LLM Provider Options
OpenAI
use vapora_llm_router::providers::OpenAIClient;
let client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0, // max_tokens, temperature, $/1M input, $/1M output
)?);
Models:
- gpt-4 - Most capable
- gpt-4-turbo - Faster, cheaper
- gpt-3.5-turbo - Fast, cheapest
Anthropic Claude
use vapora_llm_router::providers::ClaudeClient;
let client = Arc::new(ClaudeClient::new(
    api_key,
    "claude-3-opus-20240229".to_string(),
    4096, 0.7, 15.0, 75.0, // max_tokens, temperature, $/1M input, $/1M output
)?);
Models:
- claude-3-opus - Most capable
- claude-3-sonnet - Balanced
- claude-3-haiku - Fast, cheap
Ollama (Local)
use vapora_llm_router::providers::OllamaClient;
let client = Arc::new(OllamaClient::new(
    "http://localhost:11434".to_string(),
    "llama3.2".to_string(),
    4096, 0.7, // max_tokens, temperature (no per-token cost for local models)
)?);
Popular models:
- llama3.2 - Meta's latest
- mistral - Fast, capable
- codellama - Code-focused
- mixtral - Large, powerful
Performance Tuning
Chunk Size Optimization
// Small chunks (500 chars) - Better precision, more chunks
ChunkingConfig {
    strategy: ChunkingStrategy::Fixed,
    chunk_size: 500,
    overlap: 100,
}
// Large chunks (2000 chars) - More context, fewer chunks
ChunkingConfig {
    strategy: ChunkingStrategy::Fixed,
    chunk_size: 2000,
    overlap: 400,
}
BM25 Index Tuning
let config = RLMEngineConfig {
    auto_rebuild_bm25: true, // Rebuild after loading
    ..Default::default()
};
Max Chunks Per Document
let config = RLMEngineConfig {
    max_chunks_per_doc: 10_000, // Safety limit
    ..Default::default()
};
Production Checklist
- LLM client configured with valid API key
- Embedding provider configured
- SurrealDB schema applied: bash tests/test_setup.sh
- Chunking strategy selected (Semantic for prose, Code for code)
- Max chunks per doc set appropriately
- Prometheus metrics endpoint exposed
- Error handling and retries in place (see the retry sketch after this list)
- Cost tracking enabled (for cloud providers)
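For the retry item above, here is a minimal sketch of bounded retries with exponential backoff around dispatch_subtask. It assumes a Tokio runtime and an error type that implements Display; the attempt count and delays are illustrative, not part of vapora-rlm.
use std::time::Duration;

// Sketch only: retry dispatch_subtask up to 3 times with exponential backoff.
let mut last_err = None;
for attempt in 0u32..3 {
    match engine.dispatch_subtask(doc_id, "Analyze this code", None, 5).await {
        Ok(response) => {
            println!("LLM Response: {}", response.text);
            last_err = None;
            break;
        }
        Err(e) => {
            last_err = Some(e);
            // 1s, 2s, 4s between attempts; tune for your provider's rate limits.
            tokio::time::sleep(Duration::from_secs(1u64 << attempt)).await;
        }
    }
}
if let Some(e) = last_err {
    // Assumes the error type implements Display.
    eprintln!("dispatch failed after retries: {e}");
}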
Troubleshooting
"No LLM client configured"
// Don't use RLMEngine::new() - it has no LLM client
let engine = RLMEngine::new(storage, bm25_index)?; // ❌
// Use with_llm_client() instead
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config),
)?; // ✅
"Embedding generation failed"
// Make sure embedding config matches your provider
let config = RLMEngineConfig {
    embedding: Some(EmbeddingConfig::openai_small()), // ✅
    ..Default::default()
};
"SurrealDB schema error"
# Apply the schema
cd crates/vapora-rlm/tests
bash test_setup.sh
Examples
See examples/ directory:
- production_setup.rs - OpenAI production setup
- local_ollama.rs - Local development with Ollama
Run with:
cargo run --example production_setup
cargo run --example local_ollama
Cost Optimization
Use Local Ollama for Development
// Free, local, no API keys
let client = Arc::new(OllamaClient::new(
    "http://localhost:11434".to_string(),
    "llama3.2".to_string(),
    4096, 0.7, // max_tokens, temperature
)?);
Choose Cheaper Models for Production
// Instead of gpt-4 ($5/$15 per 1M tokens)
OpenAIClient::new(api_key, "gpt-4".to_string(), ...)
// Use gpt-3.5-turbo ($0.50/$1.50 per 1M tokens)
OpenAIClient::new(api_key, "gpt-3.5-turbo".to_string(), ...)
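Spelled out with the same positional arguments as the gpt-4 example (4096 max tokens, 0.7 temperature) and the per-1M-token prices quoted in the comment above:
// Same constructor shape as the gpt-4 example; prices from the comment above.
let client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-3.5-turbo".to_string(),
    4096, // max_tokens
    0.7,  // temperature
    0.5,  // cost per 1M input tokens
    1.5,  // cost per 1M output tokens
)?);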
Track Costs with Metrics
// RLM automatically tracks token usage
let response = engine.dispatch_subtask(...).await?;
println!("Cost: ${:.4}",
(response.total_input_tokens as f64 * 5.0 / 1_000_000.0) +
(response.total_output_tokens as f64 * 15.0 / 1_000_000.0)
);
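To avoid hard-coding prices at each call site, a small hypothetical helper (not part of vapora-rlm) can take the per-1M-token prices as arguments, matching the constructor parameters above:
// Hypothetical helper: dollars from token counts and per-1M-token prices,
// same arithmetic as the println! above.
fn estimate_cost(input_tokens: f64, output_tokens: f64, in_price: f64, out_price: f64) -> f64 {
    (input_tokens * in_price + output_tokens * out_price) / 1_000_000.0
}

// gpt-4 prices from the client constructor above: $5 in, $15 out per 1M tokens.
let cost = estimate_cost(
    response.total_input_tokens as f64,
    response.total_output_tokens as f64,
    5.0,
    15.0,
);
println!("Cost: ${:.4}", cost);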
Next Steps
- Review examples: cargo run --example local_ollama
- Run tests: cargo test -p vapora-rlm
- Check metrics: See src/metrics.rs
- Integrate with backend: See vapora-backend integration patterns