RLM Production Setup Guide
This guide shows how to configure vapora-rlm for production use with LLM clients and embeddings.
Prerequisites
- SurrealDB running on port 8000
- LLM Provider (choose one):
  - OpenAI (cloud, requires API key)
  - Anthropic Claude (cloud, requires API key)
  - Ollama (local, free)
- Optional: Docker for Docker sandbox tier
Quick Start
Option 1: Cloud (OpenAI)
# Set API key
export OPENAI_API_KEY="sk-..."
# Run example
cargo run --example production_setup
Option 2: Local (Ollama)
# Install and start Ollama
brew install ollama
ollama serve
# Pull model
ollama pull llama3.2
# Run example
cargo run --example local_ollama
Production Configuration
1. Create RLM Engine with LLM Client
use std::sync::Arc;
use vapora_llm_router::providers::OpenAIClient;
use vapora_rlm::RLMEngine;
// Setup LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-4".to_string(),
    4096, // max_tokens
    0.7,  // temperature
    5.0,  // cost per 1M input tokens
    15.0, // cost per 1M output tokens
)?);
// Create engine with LLM
let engine = RLMEngine::with_llm_client(
    storage,
    bm25_index,
    llm_client,
    Some(config),
)?;
2. Configure Chunking Strategy
use vapora_rlm::chunking::{ChunkingConfig, ChunkingStrategy};
use vapora_rlm::embeddings::EmbeddingConfig;
use vapora_rlm::engine::RLMEngineConfig;
let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic, // or Fixed, Code
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};
3. Configure Embeddings
use vapora_rlm::embeddings::EmbeddingConfig;
// OpenAI (1536 dimensions)
let embedding_config = EmbeddingConfig::openai_small();
// OpenAI (3072 dimensions)
let embedding_config = EmbeddingConfig::openai_large();
// Ollama (local)
let embedding_config = EmbeddingConfig::ollama("llama3.2");
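The embedding config plugs into the engine through the embedding field of RLMEngineConfig. A minimal sketch using only the names shown above (and assuming RLMEngineConfig implements Default, as the tuning examples further down suggest):
// Sketch: wire an embedding config into the engine configuration;
// remaining fields fall back to their defaults.
let config = RLMEngineConfig {
    embedding: Some(EmbeddingConfig::ollama("llama3.2")),
    ..Default::default()
};
let engine = RLMEngine::with_llm_client(storage, bm25_index, llm_client, Some(config))?;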
4. Use RLM in Production
// Load document
let chunk_count = engine.load_document(doc_id, content, None).await?;
// Query with hybrid search (BM25 + semantic + RRF)
let results = engine.query(doc_id, "your query", None, 5).await?;
// Dispatch to LLM for distributed reasoning
let response = engine
    .dispatch_subtask(doc_id, "Analyze this code", None, 5)
    .await?;
println!("LLM Response: {}", response.text);
println!(
    "Tokens: {} in, {} out",
    response.total_input_tokens,
    response.total_output_tokens
);
LLM Provider Options
OpenAI
use vapora_llm_router::providers::OpenAIClient;
let client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0, // max_tokens, temperature, $/1M input, $/1M output
)?);
Models:
- gpt-4 - Most capable
- gpt-4-turbo - Faster, cheaper
- gpt-3.5-turbo - Fast, cheapest
Anthropic Claude
use vapora_llm_router::providers::ClaudeClient;
let client = Arc::new(ClaudeClient::new(
    api_key,
    "claude-3-opus-20240229".to_string(),
    4096, 0.7, 15.0, 75.0, // max_tokens, temperature, $/1M input, $/1M output
)?);
Models:
- claude-3-opus - Most capable
- claude-3-sonnet - Balanced
- claude-3-haiku - Fast, cheap
Ollama (Local)
use vapora_llm_router::providers::OllamaClient;
let client = Arc::new(OllamaClient::new(
    "http://localhost:11434".to_string(),
    "llama3.2".to_string(),
    4096, 0.7, // max_tokens, temperature (no per-token cost for local models)
)?);
Popular models:
- llama3.2 - Meta's latest
- mistral - Fast, capable
- codellama - Code-focused
- mixtral - Large, powerful
Performance Tuning
Chunk Size Optimization
// Small chunks (500 chars) - Better precision, more chunks
ChunkingConfig {
    strategy: ChunkingStrategy::Fixed,
    chunk_size: 500,
    overlap: 100,
}
// Large chunks (2000 chars) - More context, fewer chunks
ChunkingConfig {
    strategy: ChunkingStrategy::Fixed,
    chunk_size: 2000,
    overlap: 400,
}
BM25 Index Tuning
let config = RLMEngineConfig {
    auto_rebuild_bm25: true, // Rebuild after loading
    ..Default::default()
};
Max Chunks Per Document
let config = RLMEngineConfig {
    max_chunks_per_doc: 10_000, // Safety limit
    ..Default::default()
};
Production Checklist
- LLM client configured with valid API key
- Embedding provider configured
- SurrealDB schema applied: bash tests/test_setup.sh
- Chunking strategy selected (Semantic for prose, Code for code)
- Max chunks per doc set appropriately
- Prometheus metrics endpoint exposed
- Error handling and retries in place (see the retry sketch after this list)
- Cost tracking enabled (for cloud providers)
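For the retry item above, here is a minimal sketch of bounded retries with exponential backoff around dispatch_subtask. It assumes a Tokio runtime and an error type that implements Display; the attempt count and delays are illustrative, not part of vapora-rlm.
use std::time::Duration;

// Sketch only: retry dispatch_subtask up to 3 times with exponential backoff.
let mut last_err = None;
for attempt in 0u32..3 {
    match engine.dispatch_subtask(doc_id, "Analyze this code", None, 5).await {
        Ok(response) => {
            println!("LLM Response: {}", response.text);
            last_err = None;
            break;
        }
        Err(e) => {
            last_err = Some(e);
            // 1s, 2s, 4s between attempts; tune for your provider's rate limits.
            tokio::time::sleep(Duration::from_secs(1u64 << attempt)).await;
        }
    }
}
if let Some(e) = last_err {
    // Assumes the error type implements Display.
    eprintln!("dispatch failed after retries: {e}");
}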
Troubleshooting
"No LLM client configured"
// Don't use RLMEngine::new() - it has no LLM client
let engine = RLMEngine::new(storage, bm25_index)?; // ❌
// Use with_llm_client() instead
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config),
)?; // ✅
"Embedding generation failed"
// Make sure embedding config matches your provider
let config = RLMEngineConfig {
    embedding: Some(EmbeddingConfig::openai_small()), // ✅
    ..Default::default()
};
"SurrealDB schema error"
# Apply the schema
cd crates/vapora-rlm/tests
bash test_setup.sh
Examples
See examples/ directory:
- production_setup.rs - OpenAI production setup
- local_ollama.rs - Local development with Ollama
Run with:
cargo run --example production_setup
cargo run --example local_ollama
Cost Optimization
Use Local Ollama for Development
// Free, local, no API keys
let client = Arc::new(OllamaClient::new(
    "http://localhost:11434".to_string(),
    "llama3.2".to_string(),
    4096, 0.7, // max_tokens, temperature
)?);
Choose Cheaper Models for Production
// Instead of gpt-4 ($5/$15 per 1M tokens)
OpenAIClient::new(api_key, "gpt-4".to_string(), ...)
// Use gpt-3.5-turbo ($0.50/$1.50 per 1M tokens)
OpenAIClient::new(api_key, "gpt-3.5-turbo".to_string(), ...)
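Spelled out with the same positional arguments as the gpt-4 example (4096 max tokens, 0.7 temperature) and the per-1M-token prices quoted in the comment above:
// Same constructor shape as the gpt-4 example; prices from the comment above.
let client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-3.5-turbo".to_string(),
    4096, // max_tokens
    0.7,  // temperature
    0.5,  // cost per 1M input tokens
    1.5,  // cost per 1M output tokens
)?);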
Track Costs with Metrics
// RLM automatically tracks token usage
let response = engine.dispatch_subtask(...).await?;
println!("Cost: ${:.4}",
(response.total_input_tokens as f64 * 5.0 / 1_000_000.0) +
(response.total_output_tokens as f64 * 15.0 / 1_000_000.0)
);
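To avoid hard-coding prices at each call site, a small hypothetical helper (not part of vapora-rlm) can take the per-1M-token prices as arguments, matching the constructor parameters above:
// Hypothetical helper: dollars from token counts and per-1M-token prices,
// same arithmetic as the println! above.
fn estimate_cost(input_tokens: f64, output_tokens: f64, in_price: f64, out_price: f64) -> f64 {
    (input_tokens * in_price + output_tokens * out_price) / 1_000_000.0
}

// gpt-4 prices from the client constructor above: $5 in, $15 out per 1M tokens.
let cost = estimate_cost(
    response.total_input_tokens as f64,
    response.total_output_tokens as f64,
    5.0,
    15.0,
);
println!("Cost: ${:.4}", cost);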
Next Steps
- Review examples: cargo run --example local_ollama
- Run tests: cargo test -p vapora-rlm
- Check metrics: See src/metrics.rs
- Integrate with backend: See vapora-backend integration patterns