# RLM Production Setup Guide

This guide shows how to configure vapora-rlm for production use with LLM clients and embeddings.

## Prerequisites

1. **SurrealDB** running on port 8000 (a startup snippet follows this list)
2. **LLM Provider** (choose one):
   - OpenAI (cloud, requires API key)
   - Anthropic Claude (cloud, requires API key)
   - Ollama (local, free)
3. **Optional**: Docker, for the Docker sandbox tier
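
SurrealDB only needs to be reachable on port 8000 for the examples in this guide. One way to start it locally is the official Docker image or the `surreal` CLI; the flags below (root credentials, in-memory storage) are illustrative development defaults, not project requirements:

```bash
# Option A: Docker (data is discarded when the container stops)
docker run --rm -p 8000:8000 surrealdb/surrealdb:latest start --user root --pass root memory

# Option B: surreal CLI installed locally
surreal start --user root --pass root memory
```

For production, point SurrealDB at persistent storage instead of `memory`.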
## Quick Start

### Option 1: Cloud (OpenAI)

```bash
# Set API key
export OPENAI_API_KEY="sk-..."

# Run example
cargo run --example production_setup
```

### Option 2: Local (Ollama)

```bash
# Install and start Ollama (macOS via Homebrew)
brew install ollama
ollama serve

# Pull model
ollama pull llama3.2

# Run example
cargo run --example local_ollama
```

## Production Configuration

### 1. Create RLM Engine with LLM Client

```rust
use std::sync::Arc;
use vapora_llm_router::providers::OpenAIClient;
use vapora_rlm::RLMEngine;

// Setup LLM client
let llm_client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-4".to_string(),
    4096, // max_tokens
    0.7,  // temperature
    5.0,  // cost per 1M input tokens
    15.0, // cost per 1M output tokens
)?);

// Create engine with LLM
let engine = RLMEngine::with_llm_client(
    storage,
    bm25_index,
    llm_client,
    Some(config),
)?;
```
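
The snippet above assumes `api_key`, `storage`, `bm25_index`, and `config` are already in scope (`config` is built in step 2). A minimal sketch for the key itself, matching the environment variable used in the Quick Start:

```rust
use std::env;

// Read the key exported in the Quick Start; fail fast if it is missing.
let api_key = env::var("OPENAI_API_KEY")
    .expect("OPENAI_API_KEY must be set for the OpenAI client");
```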

### 2. Configure Chunking Strategy

```rust
use vapora_rlm::chunking::{ChunkingConfig, ChunkingStrategy};
use vapora_rlm::embeddings::EmbeddingConfig;
use vapora_rlm::engine::RLMEngineConfig;

let config = RLMEngineConfig {
    chunking: ChunkingConfig {
        strategy: ChunkingStrategy::Semantic, // or Fixed, Code
        chunk_size: 1000,
        overlap: 200,
    },
    embedding: Some(EmbeddingConfig::openai_small()),
    auto_rebuild_bm25: true,
    max_chunks_per_doc: 10_000,
};
```
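
The `Code` strategy mentioned in the comment (and in the Production Checklist below) targets source files; the field values in this sketch are illustrative, not tuned defaults:

```rust
// Illustrative: chunk source code rather than prose.
let code_config = ChunkingConfig {
    strategy: ChunkingStrategy::Code,
    chunk_size: 1500,
    overlap: 150,
};
```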

### 3. Configure Embeddings

```rust
use vapora_rlm::embeddings::EmbeddingConfig;

// OpenAI (1536 dimensions)
let embedding_config = EmbeddingConfig::openai_small();

// OpenAI (3072 dimensions)
let embedding_config = EmbeddingConfig::openai_large();

// Ollama (local)
let embedding_config = EmbeddingConfig::ollama("llama3.2");
```
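
For a fully local setup, the Ollama embedding config can be dropped into the same `RLMEngineConfig` shown in step 2; a sketch using defaults for everything else:

```rust
let config = RLMEngineConfig {
    embedding: Some(EmbeddingConfig::ollama("llama3.2")),
    ..Default::default()
};
```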

### 4. Use RLM in Production

```rust
// Load document
let chunk_count = engine.load_document(doc_id, content, None).await?;

// Query with hybrid search (BM25 + semantic + RRF)
let results = engine.query(doc_id, "your query", None, 5).await?;

// Dispatch to LLM for distributed reasoning
let response = engine
    .dispatch_subtask(doc_id, "Analyze this code", None, 5)
    .await?;

println!("LLM Response: {}", response.text);
println!("Tokens: {} in, {} out",
    response.total_input_tokens,
    response.total_output_tokens
);
```
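
These calls are all async, so they need an async runtime. A minimal harness sketch, assuming the `tokio` and `anyhow` crates (neither is prescribed by this guide):

```rust
// The engine construction and calls are left as comments because `storage`,
// `bm25_index`, and `config` come from steps 1-3 of this guide.
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // let engine = RLMEngine::with_llm_client(storage, bm25_index, llm_client, Some(config))?;
    // let chunk_count = engine.load_document(doc_id, content, None).await?;
    // let results = engine.query(doc_id, "your query", None, 5).await?;
    Ok(())
}
```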

## LLM Provider Options

### OpenAI

```rust
use std::sync::Arc;
use vapora_llm_router::providers::OpenAIClient;

let client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-4".to_string(),
    4096, 0.7, 5.0, 15.0,
)?);
```

**Models:**

- `gpt-4` - Most capable
- `gpt-4-turbo` - Faster, cheaper
- `gpt-3.5-turbo` - Fast, cheapest

### Anthropic Claude

```rust
use std::sync::Arc;
use vapora_llm_router::providers::ClaudeClient;

let client = Arc::new(ClaudeClient::new(
    api_key,
    "claude-3-opus-20240229".to_string(),
    4096, 0.7, 15.0, 75.0,
)?);
```

**Models:**

- `claude-3-opus` - Most capable
- `claude-3-sonnet` - Balanced
- `claude-3-haiku` - Fast, cheap

### Ollama (Local)

```rust
use std::sync::Arc;
use vapora_llm_router::providers::OllamaClient;

let client = Arc::new(OllamaClient::new(
    "http://localhost:11434".to_string(),
    "llama3.2".to_string(),
    4096, 0.7,
)?);
```

**Popular models:**

- `llama3.2` - Meta's Llama 3.2
- `mistral` - Fast, capable
- `codellama` - Code-focused
- `mixtral` - Large, powerful

## Performance Tuning

### Chunk Size Optimization

```rust
// Small chunks (500 chars) - Better precision, more chunks
let precise = ChunkingConfig {
    strategy: ChunkingStrategy::Fixed,
    chunk_size: 500,
    overlap: 100,
};

// Large chunks (2000 chars) - More context, fewer chunks
let contextual = ChunkingConfig {
    strategy: ChunkingStrategy::Fixed,
    chunk_size: 2000,
    overlap: 400,
};
```

### BM25 Index Tuning

```rust
let config = RLMEngineConfig {
    auto_rebuild_bm25: true, // Rebuild after loading
    ..Default::default()
};
```
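
When bulk-loading many documents, it can be cheaper to switch the automatic rebuild off and rebuild once at the end; only the config side is sketched here, since the manual rebuild call depends on the crate's API:

```rust
// Bulk-load mode: skip the per-document BM25 rebuild.
let bulk_config = RLMEngineConfig {
    auto_rebuild_bm25: false,
    ..Default::default()
};
```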

### Max Chunks Per Document

```rust
let config = RLMEngineConfig {
    max_chunks_per_doc: 10_000, // Safety limit
    ..Default::default()
};
```

## Production Checklist

- [ ] LLM client configured with valid API key
- [ ] Embedding provider configured
- [ ] SurrealDB schema applied: `bash tests/test_setup.sh`
- [ ] Chunking strategy selected (Semantic for prose, Code for code)
- [ ] Max chunks per doc set appropriately
- [ ] Prometheus metrics endpoint exposed
- [ ] Error handling and retries in place (see the sketch after this list)
- [ ] Cost tracking enabled (for cloud providers)
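
The retry item is deliberately open-ended; one minimal retry-with-backoff sketch around `dispatch_subtask` is shown below. The attempt count, delay, and use of `tokio::time::sleep` are illustrative assumptions, not project defaults:

```rust
use std::time::Duration;

// Retry the dispatch up to 3 times, pausing between attempts.
let mut response = None;
for attempt in 1u64..=3 {
    match engine.dispatch_subtask(doc_id, "Analyze this code", None, 5).await {
        Ok(r) => {
            response = Some(r);
            break;
        }
        Err(e) => {
            eprintln!("dispatch attempt {attempt} failed: {e}");
            tokio::time::sleep(Duration::from_secs(2 * attempt)).await;
        }
    }
}
// How to handle `response` still being `None` after three failures is an application decision.
```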

## Troubleshooting

### "No LLM client configured"

```rust
// Don't use RLMEngine::new() - it has no LLM client
let engine = RLMEngine::new(storage, bm25_index)?; // ❌

// Use with_llm_client() instead
let engine = RLMEngine::with_llm_client(
    storage, bm25_index, llm_client, Some(config)
)?; // ✅
```

### "Embedding generation failed"

```rust
// Make sure the embedding config matches your provider
let config = RLMEngineConfig {
    embedding: Some(EmbeddingConfig::openai_small()), // ✅
    ..Default::default()
};
```

### "SurrealDB schema error"

```bash
# Apply the schema
cd crates/vapora-rlm/tests
bash test_setup.sh
```

## Examples

See the `examples/` directory:

- `production_setup.rs` - OpenAI production setup
- `local_ollama.rs` - Local development with Ollama

Run them with:

```bash
cargo run --example production_setup
cargo run --example local_ollama
```

## Cost Optimization

### Use Local Ollama for Development

```rust
// Free, local, no API keys
let client = Arc::new(OllamaClient::new(
    "http://localhost:11434".to_string(),
    "llama3.2".to_string(),
    4096, 0.7,
)?);
```

### Choose Cheaper Models for Production

```rust
// Instead of gpt-4 ($5/$15 per 1M tokens)
OpenAIClient::new(api_key, "gpt-4".to_string(), ...)

// Use gpt-3.5-turbo ($0.50/$1.50 per 1M tokens)
OpenAIClient::new(api_key, "gpt-3.5-turbo".to_string(), ...)
```
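
Filling in the elided arguments with the parameter order used in step 1 (max_tokens, temperature, then the two per-1M-token rates quoted in the comment above), a cheaper client might look like this sketch:

```rust
let cheap_client = Arc::new(OpenAIClient::new(
    api_key,
    "gpt-3.5-turbo".to_string(),
    4096, // max_tokens
    0.7,  // temperature
    0.50, // cost per 1M input tokens
    1.50, // cost per 1M output tokens
)?);
```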

### Track Costs with Metrics

```rust
// RLM automatically tracks token usage; the rates below are the gpt-4
// example rates ($5 in / $15 out per 1M tokens) used earlier in this guide.
let response = engine.dispatch_subtask(...).await?;
println!("Cost: ${:.4}",
    (response.total_input_tokens as f64 * 5.0 / 1_000_000.0) +
    (response.total_output_tokens as f64 * 15.0 / 1_000_000.0)
);
```
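
If the rates are used in more than one place, a small helper keeps them consistent; this is plain arithmetic over the token counts returned above, not part of the vapora-rlm API:

```rust
// Illustrative helper: estimate USD cost from token counts and per-1M-token rates.
fn estimate_cost_usd(input_tokens: u64, output_tokens: u64, input_rate: f64, output_rate: f64) -> f64 {
    (input_tokens as f64 * input_rate + output_tokens as f64 * output_rate) / 1_000_000.0
}

// gpt-4 example rates from this guide:
let cost = estimate_cost_usd(
    response.total_input_tokens as u64,
    response.total_output_tokens as u64,
    5.0,
    15.0,
);
println!("Cost: ${cost:.4}");
```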

## Next Steps

1. Review examples: `cargo run --example local_ollama`
2. Run tests: `cargo test -p vapora-rlm`
3. Check metrics: See `src/metrics.rs`
4. Integrate with backend: See `vapora-backend` integration patterns