# Retrieval-Augmented Generation (RAG) System
**Status**: ✅ Production-Ready (SurrealDB 1.5.0+, 22/22 tests passing)
The RAG system enables the AI service to access, retrieve, and reason over infrastructure documentation, schemas, and past configurations. This allows
the AI to generate contextually accurate infrastructure configurations and provide intelligent troubleshooting advice grounded in actual platform
knowledge.
## Architecture Overview
The RAG system consists of:
1. **Document Store**: SurrealDB vector store with semantic indexing
2. **Hybrid Search**: Vector similarity + BM25 keyword search
3. **Chunk Management**: Intelligent document chunking for code and markdown
4. **Context Ranking**: Relevance scoring for retrieved documents
5. **Semantic Cache**: Deduplication of repeated queries
## Core Components
### 1. Vector Embeddings
The system uses embedding models to convert documents into vector representations:
```text
┌─────────────────────┐
│ Document Source     │
│ (Markdown, Code)    │
└──────────┬──────────┘
           ▼
┌──────────────────────────────────┐
│ Chunking & Tokenization          │
│ - Code-aware splits              │
│ - Markdown aware                 │
│ - Preserves context              │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ Embedding Model                  │
│ (OpenAI Ada, Anthropic, Local)   │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ Vector Storage (SurrealDB)       │
│ - Vector index                   │
│ - Metadata indexed               │
│ - BM25 index for keywords        │
└──────────────────────────────────┘
```
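The `embed()` helper referenced throughout the Rust examples below is assumed to sit behind a small provider abstraction; the trait below is an illustrative sketch, not the service's actual interface:
```rust
use anyhow::Result;
use async_trait::async_trait;

/// Hypothetical abstraction over embedding backends (OpenAI, Anthropic, local).
/// The `embed()` calls in the examples below are assumed to go through an
/// implementation of this trait.
#[async_trait]
pub trait EmbeddingProvider {
    /// Convert text into a dense vector representation.
    async fn embed(&self, text: &str) -> Result<Vec<f32>>;
}
```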
### 2. SurrealDB Integration
SurrealDB serves as the vector database and knowledge store:
```nickel
# Configuration in provisioning/schemas/ai.ncl
{
  rag = {
    enabled = true,
    db_url = "surreal://localhost:8000",
    namespace = "provisioning",
    database = "ai_rag",

    # Collections for different document types
    collections = {
      documentation = {
        chunking_strategy = "markdown",
        chunk_size = 1024,
        overlap = 256,
      },
      schemas = {
        chunking_strategy = "code",
        chunk_size = 512,
        overlap = 128,
      },
      deployments = {
        chunking_strategy = "json",
        chunk_size = 2048,
        overlap = 512,
      },
    },

    # Embedding configuration
    embedding = {
      provider = "openai", # or "anthropic", "local"
      model = "text-embedding-3-small",
      cache_vectors = true,
    },

    # Search configuration
    search = {
      hybrid_enabled = true,
      vector_weight = 0.7,
      keyword_weight = 0.3,
      top_k = 5, # Number of results to return
      semantic_cache = true,
    },
  }
}
```
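On the service side, this record can be deserialized into plain Rust structs. The sketch below mirrors the field names above; the struct names and the use of serde are assumptions, not the service's actual types:
```rust
use serde::Deserialize;
use std::collections::HashMap;

// Illustrative Rust mirror of the `rag` record above.
#[derive(Debug, Deserialize)]
pub struct RagConfig {
    pub enabled: bool,
    pub db_url: String,
    pub namespace: String,
    pub database: String,
    pub collections: HashMap<String, CollectionConfig>,
    pub embedding: EmbeddingConfig,
    pub search: SearchConfig,
}

#[derive(Debug, Deserialize)]
pub struct CollectionConfig {
    pub chunking_strategy: String, // "markdown" | "code" | "json"
    pub chunk_size: usize,
    pub overlap: usize,
}

#[derive(Debug, Deserialize)]
pub struct EmbeddingConfig {
    pub provider: String,
    pub model: String,
    pub cache_vectors: bool,
}

#[derive(Debug, Deserialize)]
pub struct SearchConfig {
    pub hybrid_enabled: bool,
    pub vector_weight: f32,
    pub keyword_weight: f32,
    pub top_k: usize,
    pub semantic_cache: bool,
}
```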
### 3. Document Chunking
Intelligent chunking preserves context while managing token limits:
#### Markdown Chunking Strategy
```text
Input Document: provisioning/docs/src/guides/from-scratch.md
Chunks:
[1] Header + first section (up to 1024 tokens)
[2] Next logical section + overlap with [1]
[3] Code examples preserved as atomic units
[4] Continue with overlap...
Each chunk includes:
- Original section heading (for context)
- Content
- Source file and line numbers
- Metadata (doctype, category, version)
```
#### Code Chunking Strategy
```text
Input Document: provisioning/schemas/main.ncl
Chunks:
[1] Top-level let binding + comments
[2] Function definition (atomic, preserves signature)
[3] Type definition (atomic, preserves interface)
[4] Implementation blocks with context overlap
Each chunk preserves:
- Type signatures
- Function signatures
- Import statements needed for context
- Comments and docstrings
```
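As a rough illustration of the markdown strategy, a chunker can walk the document section by section and carry a fixed overlap into the next chunk. The sketch below counts words rather than tokens and uses a hypothetical `Chunk` type; the real chunker is token-aware, tracks line numbers, and keeps code examples atomic:
```rust
/// Illustrative chunk record carrying some of the metadata listed above.
pub struct Chunk {
    pub heading: String, // original section heading, kept for context
    pub content: String,
    pub source: String,  // source file path
}

/// Naive sketch: split a markdown document on `## ` headings, then emit
/// chunks of at most `chunk_size` words, carrying `overlap` words from the
/// previous chunk into the next one.
pub fn chunk_markdown(source: &str, text: &str, chunk_size: usize, overlap: usize) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut carry: Vec<&str> = Vec::new(); // overlap carried from the previous chunk
    let mut heading = String::from("(document start)");

    for (i, section) in text.split("\n## ").enumerate() {
        if i > 0 {
            heading = section.lines().next().unwrap_or("").to_string();
        }
        let words: Vec<&str> = section.split_whitespace().collect();
        let step = chunk_size.saturating_sub(overlap).max(1);
        for window in words.chunks(step) {
            let mut body = carry.clone();
            body.extend_from_slice(window);
            chunks.push(Chunk {
                heading: heading.clone(),
                content: body.join(" "),
                source: source.to_string(),
            });
            // Keep the tail of this window as overlap for the next chunk
            carry = window.iter().rev().take(overlap).rev().cloned().collect();
        }
    }
    chunks
}
```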
## Hybrid Search
The system implements a dual search strategy for optimal results:
### Vector Similarity Search
```rust
// Find semantically similar documents
async fn vector_search(query: &str, top_k: usize) -> Result<Vec<Document>> {
    let embedding = embed(query).await?;

    // Cosine similarity scoring in SurrealDB
    let mut response = db
        .query("
            SELECT *, vector::similarity::cosine(embedding, $embedding) AS score
            FROM documents
            ORDER BY score DESC
            LIMIT $top_k
        ")
        .bind(("embedding", embedding))
        .bind(("top_k", top_k))
        .await?;

    Ok(response.take(0)?)
}
```
**Use case**: Semantic understanding of intent
- Query: "How to configure PostgreSQL"
- Finds: Documents about database configuration, examples, schemas
### BM25 Keyword Search
```rust
// Find documents with matching keywords
async fn keyword_search(query: &str, top_k: usize) -> Result<Vec<Document>> {
    // BM25 full-text search in SurrealDB
    let mut response = db
        .query("
            SELECT *, search::score(1) AS score
            FROM documents
            WHERE text @1@ $query
            ORDER BY score DESC
            LIMIT $top_k
        ")
        .bind(("query", query))
        .bind(("top_k", top_k))
        .await?;

    Ok(response.take(0)?)
}
```
```
**Use case**: Exact term matching
- Query: "SurrealDB configuration"
- Finds: Documents mentioning SurrealDB specifically
### Hybrid Results
```rust
async fn hybrid_search(
    query: &str,
    vector_weight: f32,
    keyword_weight: f32,
    top_k: usize,
) -> Result<Vec<Document>> {
    let vector_results = vector_search(query, top_k * 2).await?;
    let keyword_results = keyword_search(query, top_k * 2).await?;

    let mut scored = HashMap::new();

    // Score from vector search (earlier results contribute more)
    for (i, doc) in vector_results.iter().enumerate() {
        *scored.entry(doc.id.clone()).or_insert(0.0) +=
            vector_weight * (1.0 - (i as f32 / top_k as f32));
    }

    // Score from keyword search
    for (i, doc) in keyword_results.iter().enumerate() {
        *scored.entry(doc.id.clone()).or_insert(0.0) +=
            keyword_weight * (1.0 - (i as f32 / top_k as f32));
    }

    // Return top-k by combined score
    let mut results: Vec<_> = scored.into_iter().collect();
    results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    // Look up the full Document for each id (elided here)
    Ok(results.into_iter().take(top_k).map(|(id, _)| ...).collect())
}
```
## Semantic Caching
Reduces API calls by caching embeddings of repeated queries:
```rust
struct SemanticCache {
    // Keyed by the original query text; the value keeps the query's
    // embedding alongside the cached result.
    queries: Arc<DashMap<String, (Vec<f32>, CachedResult)>>,
    similarity_threshold: f32,
}

impl SemanticCache {
    async fn get(&self, query: &str) -> Option<CachedResult> {
        let embedding = embed(query).await.ok()?;

        // Find a cached query with a similar embedding
        // (cosine distance below the threshold)
        for entry in self.queries.iter() {
            let (cached_embedding, result) = entry.value();
            let distance = cosine_distance(&embedding, cached_embedding);
            if distance < self.similarity_threshold {
                return Some(result.clone());
            }
        }
        None
    }

    async fn insert(&self, query: &str, result: CachedResult) -> Result<()> {
        let embedding = embed(query).await?;
        self.queries.insert(query.to_string(), (embedding, result));
        Ok(())
    }
}
```
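The `cosine_distance` helper used above is the standard dot-product formulation; a minimal sketch, assuming non-zero vectors of equal length:
```rust
/// Cosine distance = 1 - cosine similarity.
/// Assumes both vectors are non-zero and of equal length.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}
```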
**Benefits**:
- 50-80% reduction in embedding API calls
- Identical queries return in <10ms
- Similar queries reuse cached context
## Ingestion Workflow
### Document Indexing
```text
# Index all documentation
provisioning ai index-docs provisioning/docs/src
# Index schemas
provisioning ai index-schemas provisioning/schemas
# Index past deployments
provisioning ai index-deployments workspaces/*/deployments
# Watch directory for changes (development mode)
provisioning ai watch docs provisioning/docs/src
```
### Programmatic Indexing
```rust
// In ai-service on startup
async fn initialize_rag() -> Result<()> {
    let rag = RAGSystem::new(&config.rag).await?;

    // Index documentation
    let docs = load_markdown_docs("provisioning/docs/src")?;
    for doc in docs {
        rag.ingest_document(&doc).await?;
    }

    // Index schemas
    let schemas = load_nickel_schemas("provisioning/schemas")?;
    for schema in schemas {
        rag.ingest_schema(&schema).await?;
    }

    Ok(())
}
```
## Usage Examples
### Query the RAG System
```text
# Search for context-aware information
provisioning ai query "How do I configure PostgreSQL with encryption?"
# Get configuration template
provisioning ai template "Describe production Kubernetes on AWS"
# Interactive mode
provisioning ai chat
> What are the best practices for database backup?
```
### AI Service Integration
```rust
// AI service uses RAG to enhance generation
async fn generate_config(user_request: &str) -> Result<String> {
    // Retrieve relevant context
    let context = rag.search(user_request, 5).await?; // top_k = 5

    // Build prompt with context
    let prompt = build_prompt_with_context(user_request, &context);

    // Generate configuration
    let config = llm.generate(&prompt).await?;

    // Validate against schemas
    validate_nickel_config(&config)?;

    Ok(config)
}
```
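`build_prompt_with_context` is not defined in this document; one plausible shape, assuming the retrieved `Document` exposes its source path and content, simply quotes the retrieved chunks ahead of the user request:
```rust
/// Illustrative prompt assembly: retrieved chunks are quoted ahead of the
/// user's request so the model can ground its answer in platform docs.
/// The `source` and `content` fields on `Document` are assumptions.
fn build_prompt_with_context(user_request: &str, context: &[Document]) -> String {
    let mut prompt = String::from("Use the following platform documentation as context:\n\n");
    for doc in context {
        prompt.push_str(&format!("--- {} ---\n{}\n\n", doc.source, doc.content));
    }
    prompt.push_str("Request:\n");
    prompt.push_str(user_request);
    prompt
}
```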
### Form Assistance Integration
```javascript
// In typdialog-ai (JavaScript/TypeScript)
async function suggestFieldValue(fieldName, currentInput) {
  // Query RAG for similar configurations
  const context = await rag.search(
    `Field: ${fieldName}, Input: ${currentInput}`,
    { topK: 3, semantic: true }
  );

  // Generate suggestion using context
  const suggestion = await ai.suggest({
    field: fieldName,
    input: currentInput,
    context: context,
  });

  return suggestion;
}
```
## Performance Characteristics
| Operation | Time | Cache Hit |
| ----------- | ------ | ----------- |
| Vector embedding | 200-500ms | N/A |
| Vector search (cold) | 300-800ms | N/A |
| Keyword search | 50-200ms | N/A |
| Hybrid search | 500-1200ms | <100ms cached |
| Semantic cache hit | 10-50ms | Always |
**Typical query flow**:
1. Embedding: 300ms
2. Vector search: 400ms
3. Keyword search: 100ms
4. Ranking: 50ms
5. **Total**: ~850ms (first call), <100ms (cached)
## Configuration
See [Configuration Guide](configuration.md) for detailed RAG setup:
- LLM provider for embeddings
- SurrealDB connection
- Chunking strategies
- Search weights and limits
- Cache settings and TTLs
## Limitations and Considerations
### Document Freshness
- RAG indexes static snapshots
- Changes to documentation require re-indexing
- Use watch mode during development
### Token Limits
- Large documents chunked to fit LLM context
- Some context may be lost in chunking
- Adjustable chunk size vs. context trade-off
### Embedding Quality
- Quality depends on embedding model
- Domain-specific models perform better
- Fine-tuning possible for specialized vocabularies
## Monitoring and Debugging
### Query Metrics
```text
# View RAG search metrics
provisioning ai metrics show rag
# Analysis of search quality
provisioning ai eval-rag --sample-queries 100
```
### Debug Mode
```toml
# In provisioning/config/ai.toml
[ai.rag.debug]
enabled = true
log_embeddings = true # Log embedding vectors
log_search_scores = true # Log relevance scores
log_context_used = true # Log context retrieved
```
## Related Documentation
- [Architecture](architecture.md) - AI system overview
- [MCP Integration](mcp-integration.md) - RAG access via MCP
- [Configuration](configuration.md) - RAG setup guide
- [API Reference](api-reference.md) - RAG API endpoints
- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions
---
**Last Updated**: 2025-01-13
**Status**: ✅ Production-Ready
**Test Coverage**: 22/22 tests passing
**Database**: SurrealDB 1.5.0+