
Retrieval-Augmented Generation (RAG) System

Status: Production-Ready (SurrealDB 1.5.0+, 22/22 tests passing)

The RAG system enables the AI service to access, retrieve, and reason over infrastructure documentation, schemas, and past configurations. This allows the AI to generate contextually accurate infrastructure configurations and provide intelligent troubleshooting advice grounded in actual platform knowledge.

Architecture Overview

The RAG system consists of:

  1. Document Store: SurrealDB vector store with semantic indexing
  2. Hybrid Search: Vector similarity + BM25 keyword search
  3. Chunk Management: Intelligent document chunking for code and markdown
  4. Context Ranking: Relevance scoring for retrieved documents
  5. Semantic Cache: Deduplication of repeated queries
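
A minimal sketch of how these five pieces might be composed inside the AI service (the field types below are illustrative, not the actual implementation):

// Hypothetical composition of the RAG components
struct RAGSystem {
    db: Surreal<Client>,          // document store: SurrealDB vector + BM25 indexes
    chunker: Chunker,             // code- and markdown-aware chunking
    embedder: Box<dyn Embedder>,  // embedding provider (OpenAI, Anthropic, local)
    ranker: Ranker,               // relevance scoring of retrieved chunks
    cache: SemanticCache,         // semantic deduplication of repeated queries
}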

Core Components

1. Vector Embeddings

The system uses embedding models to convert documents into vector representations:

┌─────────────────────┐
│ Document Source     │
│ (Markdown, Code)    │
└──────────┬──────────┘
           │
           ▼
┌──────────────────────────────────┐
│ Chunking & Tokenization          │
│ - Code-aware splits              │
│ - Markdown aware                 │
│ - Preserves context              │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│ Embedding Model                  │
│ (OpenAI Ada, Anthropic, Local)   │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│ Vector Storage (SurrealDB)       │
│ - Vector index                   │
│ - Metadata indexed               │
│ - BM25 index for keywords        │
└──────────────────────────────────┘
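
The embedding stage in this pipeline is provider-agnostic. A sketch of the abstraction it implies, assuming the async-trait crate (the trait and struct names are illustrative, not the service's actual API):

// Hypothetical provider abstraction behind the embedding stage
#[async_trait::async_trait]
trait Embedder: Send + Sync {
    // Convert a chunk of text into a dense vector representation
    async fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>>;
}

struct OpenAIEmbedder { model: String }   // e.g. "text-embedding-3-small"
struct LocalEmbedder { model_path: std::path::PathBuf }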

2. SurrealDB Integration

SurrealDB serves as the vector database and knowledge store:

# Configuration in provisioning/schemas/ai.ncl
{
  rag = {
    enabled = true,
    db_url = "surreal://localhost:8000",
    namespace = "provisioning",
    database = "ai_rag",
    
    # Collections for different document types
    collections = {
      documentation = {
        chunking_strategy = "markdown",
        chunk_size = 1024,
        overlap = 256,
      },
      schemas = {
        chunking_strategy = "code",
        chunk_size = 512,
        overlap = 128,
      },
      deployments = {
        chunking_strategy = "json",
        chunk_size = 2048,
        overlap = 512,
      },
    },
    
    # Embedding configuration
    embedding = {
      provider = "openai",  # or "anthropic", "local"
      model = "text-embedding-3-small",
      cache_vectors = true,
    },
    
    # Search configuration
    search = {
      hybrid_enabled = true,
      vector_weight = 0.7,
      keyword_weight = 0.3,
      top_k = 5,  # Number of results to return
      semantic_cache = true,
    },
  }
}
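
On the Rust side, the AI service could deserialize this record into plain configuration structs. A sketch assuming serde, with field names mirroring the Nickel schema above:

use serde::Deserialize;
use std::collections::HashMap;

// Mirrors the `rag` record in provisioning/schemas/ai.ncl (sketch)
#[derive(Debug, Deserialize)]
struct RagConfig {
    enabled: bool,
    db_url: String,
    namespace: String,
    database: String,
    collections: HashMap<String, CollectionConfig>,
    embedding: EmbeddingConfig,
    search: SearchConfig,
}

#[derive(Debug, Deserialize)]
struct CollectionConfig {
    chunking_strategy: String,  // "markdown", "code", or "json"
    chunk_size: usize,
    overlap: usize,
}

#[derive(Debug, Deserialize)]
struct EmbeddingConfig {
    provider: String,           // "openai", "anthropic", or "local"
    model: String,
    cache_vectors: bool,
}

#[derive(Debug, Deserialize)]
struct SearchConfig {
    hybrid_enabled: bool,
    vector_weight: f32,
    keyword_weight: f32,
    top_k: usize,
    semantic_cache: bool,
}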

3. Document Chunking

Intelligent chunking preserves context while managing token limits:

Markdown Chunking Strategy

Input Document: provisioning/docs/src/guides/from-scratch.md

Chunks:
  [1] Header + first section (up to 1024 tokens)
  [2] Next logical section + overlap with [1]
  [3] Code examples preserved as atomic units
  [4] Continue with overlap...

Each chunk includes:
  - Original section heading (for context)
  - Content
  - Source file and line numbers
  - Metadata (doctype, category, version)
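
Expressed as a data structure, a stored chunk might look like this (a sketch; the field names are assumptions, not the actual schema):

// One retrievable unit stored in SurrealDB (sketch)
struct Chunk {
    heading: String,             // original section heading, kept for context
    content: String,             // chunk text, up to chunk_size tokens
    source_file: String,         // e.g. "guides/from-scratch.md"
    line_range: (usize, usize),  // start and end line in the source file
    doctype: String,             // documentation, schema, or deployment
    category: String,
    version: String,
    embedding: Vec<f32>,         // filled in by the embedding stage
}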

Code Chunking Strategy

Input Document: provisioning/schemas/main.ncl

Chunks:
  [1] Top-level let binding + comments
  [2] Function definition (atomic, preserves signature)
  [3] Type definition (atomic, preserves interface)
  [4] Implementation blocks with context overlap

Each chunk preserves:
  - Type signatures
  - Function signatures
  - Import statements needed for context
  - Comments and docstrings
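
A minimal heading-aware markdown chunker illustrating the overlap idea (a simplified sketch that measures characters rather than tokens and does not treat code fences atomically):

// Split markdown into overlapping, heading-prefixed chunks (sketch)
fn chunk_markdown(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut heading = String::new();

    for line in text.lines() {
        if line.starts_with('#') {
            heading = line.to_string();  // remember the current section heading
        }
        if !current.is_empty() && current.len() + line.len() > chunk_size {
            chunks.push(format!("{heading}\n{current}"));
            // Carry the tail of the previous chunk forward as overlap
            let chars: Vec<char> = current.chars().collect();
            current = chars[chars.len().saturating_sub(overlap)..].iter().collect();
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.is_empty() {
        chunks.push(format!("{heading}\n{current}"));
    }
    chunks
}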

4. Hybrid Search

The system implements a dual search strategy for optimal results:

Vector Search

// Find semantically similar documents
async fn vector_search(query: &str, top_k: usize) -> Result<Vec<Document>> {
    let embedding = embed(query).await?;

    // Cosine similarity scoring in SurrealDB
    let mut response = db.query("
        SELECT *, vector::similarity::cosine(embedding, $embedding) AS score
        FROM documents
        ORDER BY score DESC
        LIMIT $top_k
    ")
    .bind(("embedding", embedding))
    .bind(("top_k", top_k))
    .await?;

    Ok(response.take(0)?)
}

Use case: Semantic understanding of intent

  • Query: "How to configure PostgreSQL"
  • Finds: Documents about database configuration, examples, schemas

Keyword Search

// Find documents with matching keywords
async fn keyword_search(query: &str, top_k: usize) -> Result<Vec<Document>> {
    // BM25 full-text search in SurrealDB
    let mut response = db.query("
        SELECT *, search::score(1) AS score
        FROM documents
        WHERE text @1@ $query
        ORDER BY score DESC
        LIMIT $top_k
    ")
    .bind(("query", query))
    .bind(("top_k", top_k))
    .await?;

    Ok(response.take(0)?)
}

Use case: Exact term matching

  • Query: "SurrealDB configuration"
  • Finds: Documents mentioning SurrealDB specifically

Hybrid Results

async fn hybrid_search(
    query: &str,
    vector_weight: f32,
    keyword_weight: f32,
    top_k: usize,
) -> Result<Vec<Document>> {
    let vector_results = vector_search(query, top_k * 2).await?;
    let keyword_results = keyword_search(query, top_k * 2).await?;
    
    let mut scored = HashMap::new();
    
    // Score from vector search
    for (i, doc) in vector_results.iter().enumerate() {
        *scored.entry(doc.id).or_insert(0.0) +=
            vector_weight * (1.0 - (i as f32 / top_k as f32));
    }
    
    // Score from keyword search
    for (i, doc) in keyword_results.iter().enumerate() {
        *scored.entry(doc.id).or_insert(0.0) +=
            keyword_weight * (1.0 - (i as f32 / top_k as f32));
    }
    
    // Return top-k by combined score
    let mut results: Vec<_> = scored.into_iter().collect();
    results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    Ok(results.into_iter().take(top_k).map(|(id, _)| ...).collect())
}
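
For example, with vector_weight = 0.7, keyword_weight = 0.3, and top_k = 5, a document ranked first by vector search (i = 0) and third by keyword search (i = 2) scores 0.7 × (1 − 0/5) + 0.3 × (1 − 2/5) = 0.70 + 0.18 = 0.88, while a document found by only one of the two searches scores at most 0.7.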

Semantic Caching

Reduces API calls by caching embeddings of repeated queries:

struct SemanticCache {
    // Keyed by the raw query text; each entry stores the query embedding and result
    queries: Arc<DashMap<String, (Vec<f32>, CachedResult)>>,
    similarity_threshold: f32,
}

impl SemanticCache {
    async fn get(&self, query: &str) -> Option<CachedResult> {
        let embedding = embed(query).await.ok()?;

        // Find a cached query with a similar embedding
        // (cosine distance below the threshold)
        for entry in self.queries.iter() {
            let (cached_embedding, result) = entry.value();
            if cosine_distance(&embedding, cached_embedding) < self.similarity_threshold {
                return Some(result.clone());
            }
        }
        None
    }

    async fn insert(&self, query: &str, result: CachedResult) -> Result<()> {
        let embedding = embed(query).await?;
        self.queries.insert(query.to_string(), (embedding, result));
        Ok(())
    }
}
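
The cache relies on a cosine_distance helper; a straightforward version, assuming non-zero vectors of equal length:

// Cosine distance = 1 - cosine similarity; smaller means more similar (sketch)
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}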

Benefits:

  • 50-80% reduction in embedding API calls
  • Identical queries return in <10ms
  • Similar queries reuse cached context

Ingestion Workflow

Document Indexing

# Index all documentation
provisioning ai index-docs provisioning/docs/src

# Index schemas
provisioning ai index-schemas provisioning/schemas

# Index past deployments
provisioning ai index-deployments workspaces/*/deployments

# Watch directory for changes (development mode)
provisioning ai watch docs provisioning/docs/src

Programmatic Indexing

// In ai-service on startup
async fn initialize_rag() -> Result<()> {
    let rag = RAGSystem::new(&config.rag).await?;
    
    // Index documentation
    let docs = load_markdown_docs("provisioning/docs/src")?;
    for doc in docs {
        rag.ingest_document(&doc).await?;
    }
    
    // Index schemas
    let schemas = load_nickel_schemas("provisioning/schemas")?;
    for schema in schemas {
        rag.ingest_schema(&schema).await?;
    }
    
    Ok(())
}
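
ingest_document itself might chain the chunking, embedding, and storage steps. A sketch reusing the query/bind style shown earlier (SourceDocument and the chunk_markdown helper are illustrative, not the service's actual API):

impl RAGSystem {
    // Chunk a document, embed each chunk, and store it in SurrealDB (sketch)
    async fn ingest_document(&self, doc: &SourceDocument) -> Result<()> {
        let chunks = chunk_markdown(&doc.text, 1024, 256);

        for (index, content) in chunks.iter().enumerate() {
            let embedding = self.embedder.embed(content).await?;

            self.db
                .query("CREATE documents CONTENT {
                    text: $text,
                    source_file: $source,
                    chunk_index: $index,
                    embedding: $embedding,
                }")
                .bind(("text", content.clone()))
                .bind(("source", doc.path.clone()))
                .bind(("index", index))
                .bind(("embedding", embedding))
                .await?;
        }
        Ok(())
    }
}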

Usage Examples

Query the RAG System

# Search for context-aware information
provisioning ai query "How do I configure PostgreSQL with encryption?"

# Get configuration template
provisioning ai template "Describe production Kubernetes on AWS"

# Interactive mode
provisioning ai chat
> What are the best practices for database backup?

AI Service Integration

// AI service uses RAG to enhance generation
async fn generate_config(user_request: &str) -> Result<String> {
    // Retrieve relevant context (top_k = 5)
    let context = rag.search(user_request, 5).await?;
    
    // Build prompt with context
    let prompt = build_prompt_with_context(user_request, &context);
    
    // Generate configuration
    let config = llm.generate(&prompt).await?;
    
    // Validate against schemas
    validate_nickel_config(&config)?;
    
    Ok(config)
}
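
build_prompt_with_context can be as simple as concatenating the retrieved chunks, tagged with their sources, ahead of the user request. A sketch (the Document fields used here are assumptions):

// Assemble an LLM prompt from retrieved context plus the user request (sketch)
fn build_prompt_with_context(user_request: &str, context: &[Document]) -> String {
    let mut prompt = String::from("Use the following platform documentation to answer.\n\n");
    for doc in context {
        prompt.push_str(&format!("--- {} ---\n{}\n\n", doc.source_file, doc.text));
    }
    prompt.push_str(&format!("Request: {user_request}\n"));
    prompt
}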

Form Assistance Integration

// In typdialog-ai (JavaScript/TypeScript)
async function suggestFieldValue(fieldName, currentInput) {
    // Query RAG for similar configurations
    const context = await rag.search(
        `Field: ${fieldName}, Input: ${currentInput}`,
        { topK: 3, semantic: true }
    );
    
    // Generate suggestion using context
    const suggestion = await ai.suggest({
        field: fieldName,
        input: currentInput,
        context: context,
    });
    
    return suggestion;
}

Performance Characteristics

| Operation             | Time       | Cache Hit     |
| --------------------- | ---------- | ------------- |
| Vector embedding      | 200-500ms  | N/A           |
| Vector search (cold)  | 300-800ms  | N/A           |
| Keyword search        | 50-200ms   | N/A           |
| Hybrid search         | 500-1200ms | <100ms cached |
| Semantic cache hit    | 10-50ms    | Always        |

Typical query flow:

  1. Embedding: 300ms
  2. Vector search: 400ms
  3. Keyword search: 100ms
  4. Ranking: 50ms
  5. Total: ~850ms (first call), <100ms (cached)

Configuration

See Configuration Guide for detailed RAG setup:

  • LLM provider for embeddings
  • SurrealDB connection
  • Chunking strategies
  • Search weights and limits
  • Cache settings and TTLs

Limitations and Considerations

Document Freshness

  • RAG indexes static snapshots
  • Changes to documentation require re-indexing
  • Use watch mode during development

Token Limits

  • Large documents chunked to fit LLM context
  • Some context may be lost in chunking
  • Adjustable chunk size vs. context trade-off

Embedding Quality

  • Quality depends on embedding model
  • Domain-specific models perform better
  • Fine-tuning possible for specialized vocabularies

Monitoring and Debugging

Query Metrics

# View RAG search metrics
provisioning ai metrics show rag

# Analysis of search quality
provisioning ai eval-rag --sample-queries 100

Debug Mode

# In provisioning/config/ai.toml
[ai.rag.debug]
enabled = true
log_embeddings = true      # Log embedding vectors
log_search_scores = true   # Log relevance scores
log_context_used = true    # Log context retrieved

Last Updated: 2025-01-13
Status: Production-Ready
Test Coverage: 22/22 tests passing
Database: SurrealDB 1.5.0+