prvng_kcl/docs/VALIDATION.md
2025-10-07 11:17:54 +01:00

Schema Validation and Best Practices

This document explains how to validate KCL schemas in the provisioning package and describes recommended practices for writing them.

Schema Validation

Basic Validation

# Validate syntax and run schema checks
kcl run config.k

# Format and validate all files
kcl fmt *.k

# Validate with verbose output
kcl run config.k --debug

# Validate a JSON or YAML data file against a specific schema
# (server.json is a placeholder name for your data file)
kcl vet server.json main.k --schema Server
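
The commands above assume a `config.k` file exists. As a minimal sketch of such a file, the following uses a hypothetical `Endpoint` schema that is not part of the provisioning package; it is only meant to give `kcl run` something to evaluate and check.

```kcl
# config.k: hypothetical file used only to exercise the commands above
schema Endpoint:
    """Illustrative schema; not part of the provisioning package"""
    host: str
    port: int = 443

    check:
        len(host) > 0, "host cannot be empty"
        1 <= port <= 65535, "port must be in 1..65535"

# Top-level values are evaluated, and their check blocks run, during `kcl run`
endpoint = Endpoint {
    host: "web-01.example.com"
}
```

Changing `host` to an empty string should make `kcl run config.k` fail with the first check message, which is a quick way to confirm that validation is actually running.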

JSON Output Validation

# Generate and validate JSON output
kcl run config.k --format json | jq '.'

# Validate JSON schema structure
kcl run config.k --format json | jq '.workflow_id // error("Missing workflow_id")'

# Pretty print for inspection
kcl run config.k --format json | jq '.operations[] | {operation_id, name, provider}'

Validation in CI/CD

# GitHub Actions example
- name: Validate KCL Schemas
  run: |
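    # xargs exits non-zero when any invocation fails, so the step fails too;
    # find -exec ... \; would discard those exit codes
    find . -name "*.k" -print0 | xargs -0 -n 1 kcl fmt
    find . -name "*.k" -print0 | xargs -0 -n 1 kcl run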

# Check for schema changes
- name: Check Schema Compatibility
  run: |
    kcl run main.k --format json > current_schema.json
    diff expected_schema.json current_schema.json

Built-in Constraints

Server Schema Constraints

import .main

# ✅ Valid server configuration
valid_server: main.Server = main.Server {
    hostname: "web-01"        # ✅ Non-empty string required
    title: "Web Server"       # ✅ Non-empty string required
    labels: "env: prod"       # ✅ Required field
    user: "admin"             # ✅ Required field

    # Optional but validated fields
    user_ssh_port: 22         # ✅ Valid port number
    running_timeout: 300      # ✅ Positive integer
    time_zone: "UTC"          # ✅ Valid timezone string
}

# ❌ Invalid configurations that will fail validation
invalid_examples: {
    # hostname: ""            # ❌ Empty hostname not allowed
    # title: ""               # ❌ Empty title not allowed
    # user_ssh_port: -1       # ❌ Negative port not allowed
    # running_timeout: 0      # ❌ Zero timeout not allowed
}
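
To make the constraints above concrete, here is a sketch of how rules of this kind can be written in a `check` block. `ServerSketch` is a hypothetical stand-in, and the defaults and bounds shown are assumptions; the real `main.Server` schema may define its rules differently.

```kcl
# Hypothetical stand-in for main.Server; the real schema may differ
schema ServerSketch:
    hostname: str
    title: str
    user_ssh_port: int = 22        # assumed default
    running_timeout: int = 200     # assumed default

    check:
        len(hostname) > 0, "hostname cannot be empty"
        len(title) > 0, "title cannot be empty"
        1 <= user_ssh_port <= 65535, "user_ssh_port must be in 1..65535"
        running_timeout > 0, "running_timeout must be positive"

web = ServerSketch {
    hostname: "web-01"
    title: "Web Server"
}
```

Each line in the block is an independent condition with its own message, so a failing instance reports the specific rule it violated.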

Workflow Schema Constraints

import .main

# ✅ Valid workflow with proper constraints
valid_workflow: main.BatchWorkflow = main.BatchWorkflow {
    workflow_id: "deploy_001"           # ✅ Non-empty ID required
    name: "Production Deployment"       # ✅ Non-empty name required
    operations: [                       # ✅ At least one operation required
        main.BatchOperation {
            operation_id: "create_servers"  # ✅ Unique operation ID
            name: "Create Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
            timeout: 600                     # ✅ Positive timeout
            priority: 5                      # ✅ Valid priority
        }
    ]
    max_parallel_operations: 3              # ✅ Non-negative number
    global_timeout: 3600                    # ✅ Positive global timeout
}

# ❌ Constraint violations
constraint_violations: {
    # workflow_id: ""                     # ❌ Empty workflow ID
    # operations: []                      # ❌ Empty operations list
    # max_parallel_operations: -1         # ❌ Negative parallel limit
    # global_timeout: 0                   # ❌ Zero global timeout
}

Kubernetes Schema Constraints

import .main

# ✅ Valid Kubernetes deployment with constraints
valid_k8s: main.K8sDeploy = main.K8sDeploy {
    name: "webapp"                      # ✅ Non-empty name
    namespace: "production"             # ✅ Valid namespace

    spec: main.K8sDeploySpec {
        replicas: 3                     # ✅ Positive replica count
        containers: [                   # ✅ At least one container required
            main.K8sContainers {
                name: "app"             # ✅ Non-empty container name
                image: "nginx:1.21"     # ✅ Valid image reference

                resources_requests: main.K8sResources {
                    memory: "128Mi"     # ✅ Valid K8s memory format
                    cpu: "100m"         # ✅ Valid K8s CPU format
                }

                resources_limits: main.K8sResources {
                    memory: "256Mi"     # ✅ Limits >= requests (enforced)
                    cpu: "200m"
                }
            }
        ]
    }
}

Dependency Schema Constraints

import .main

# ✅ Valid dependency definitions
valid_dependencies: main.TaskservDependencies = main.TaskservDependencies {
    name: "kubernetes"                  # ✅ Lowercase name required

    requires: ["containerd", "cni"]     # ✅ Valid dependency list
    conflicts: ["docker"]               # ✅ Cannot coexist with docker

    resources: main.ResourceRequirement {
        cpu: "100m"                     # ✅ Non-empty CPU requirement
        memory: "128Mi"                 # ✅ Non-empty memory requirement
        disk: "1Gi"                     # ✅ Non-empty disk requirement
    }

    timeout: 600                        # ✅ Positive timeout
    retry_count: 3                      # ✅ Non-negative retry count

    os_support: ["linux"]               # ✅ At least one OS required
    arch_support: ["amd64", "arm64"]    # ✅ At least one arch required
}

# ❌ Constraint violations
dependency_violations: {
    # name: "Kubernetes"                # ❌ Must be lowercase
    # name: ""                          # ❌ Cannot be empty
    # timeout: 0                        # ❌ Must be positive
    # retry_count: -1                   # ❌ Cannot be negative
    # os_support: []                    # ❌ Must specify at least one OS
}
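
The dependency rules listed above can be expressed with ordinary string and list operations. The sketch below uses a hypothetical `DependencySketch` schema whose field names mirror `TaskservDependencies`; the overlap rule between `requires` and `conflicts` is an assumption added for illustration, not a documented constraint.

```kcl
# Hypothetical schema; field names mirror TaskservDependencies, rules are assumed
schema DependencySketch:
    name: str
    requires: [str] = []
    conflicts: [str] = []
    os_support: [str]

    check:
        len(name) > 0, "name cannot be empty"
        name == name.lower(), "name must be lowercase"
        len(os_support) > 0, "at least one OS must be listed"
        # Assumed rule: a taskserv cannot require something it also conflicts with
        all_true([r not in conflicts for r in requires]), "requires and conflicts must not overlap"

k8s = DependencySketch {
    name: "kubernetes"
    requires: ["containerd", "cni"]
    conflicts: ["docker"]
    os_support: ["linux"]
}
```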

Custom Validation

Adding Custom Constraints

import .main
import regex

# Custom server schema with additional validation
schema CustomServer(main.Server):
    """Custom server with additional business rules"""

    # Additional custom fields
    environment: "dev" | "staging" | "prod"
    cost_center: str

    check:
        # Business rule: production servers must have specific naming
        regex.match(hostname, "^prod-[a-z0-9-]+$") if environment == "prod", "Production servers must start with 'prod-'"

        # Business rule: staging servers have resource limits
        len(taskservs or []) <= 3 if environment == "staging", "Staging servers limited to 3 taskservs"

        # Business rule: cost center must be valid
        cost_center in ["engineering", "operations", "security"], "Invalid cost center: ${cost_center}"

# Usage with validation
prod_server: CustomServer = CustomServer {
    hostname: "prod-web-01"         # ✅ Matches production naming
    title: "Production Web Server"
    labels: "env: prod"
    user: "admin"
    environment: "prod"             # ✅ Valid environment
    cost_center: "engineering"      # ✅ Valid cost center
}
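
A rule that should apply only in one environment can be written with a trailing `if` on the check line, so the condition is skipped for every other instance. The sketch below is self-contained and uses a hypothetical `NamedByEnvironment` schema to show a development host passing a production-only naming rule.

```kcl
import regex

# Hypothetical schema used only to demonstrate a conditional check
schema NamedByEnvironment:
    hostname: str
    environment: "dev" | "staging" | "prod"

    check:
        # Evaluated only when environment == "prod"; skipped otherwise
        regex.match(hostname, "^prod-[a-z0-9-]+$") if environment == "prod", "Production servers must start with 'prod-'"

# Passes: the naming rule does not apply to dev
dev_box = NamedByEnvironment {
    hostname: "scratch-01"
    environment: "dev"
}

# Passes: the hostname matches the production pattern
prod_box = NamedByEnvironment {
    hostname: "prod-web-01"
    environment: "prod"
}
```

Writing the same rule as `environment == "prod" and regex.match(...)` would instead require every instance to be a production server, because the whole expression must be true for the check to pass.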

Conditional Validation

import .main

# Workflow with conditional validation based on environment
schema EnvironmentWorkflow(main.BatchWorkflow):
    """Workflow with environment-specific validation"""

    environment: "dev" | "staging" | "prod"

    check:
        # Production workflows must have monitoring
        monitoring.enabled == True if environment == "prod", "Production workflows must enable monitoring"

        # Production workflows must have rollback enabled
        default_rollback_strategy.enabled == True if environment == "prod", "Production workflows must enable rollback"

        # Development can have shorter timeouts (30 minutes)
        global_timeout <= 1800 if environment == "dev", "Development workflows should complete within 30 minutes"

        # Staging must have retry policies
        default_retry_policy.max_attempts >= 2 if environment == "staging", "Staging workflows must have retry policies"

# Valid production workflow
prod_workflow: EnvironmentWorkflow = EnvironmentWorkflow {
    workflow_id: "prod_deploy_001"
    name: "Production Deployment"
    environment: "prod"             # ✅ Production environment

    operations: [
        main.BatchOperation {
            operation_id: "deploy"
            name: "Deploy Application"
            operation_type: "server"
            action: "create"
            parameters: {}
        }
    ]

    # ✅ Required for production
    monitoring: main.MonitoringConfig {
        enabled: True
        backend: "prometheus"
    }

    # ✅ Required for production
    default_rollback_strategy: main.RollbackStrategy {
        enabled: True
        strategy: "immediate"
    }
}

Cross-Field Validation

import .main

# Validate relationships between fields
schema ValidatedBatchOperation(main.BatchOperation):
    """Batch operation with cross-field validation"""

    check:
        # Timeout should be reasonable for operation type
        timeout >= 300 if operation_type == "server", "Server operations need at least 5 minutes timeout"

        timeout >= 600 if operation_type == "taskserv", "Taskserv operations need at least 10 minutes timeout"

        # High priority operations should have retry policies
        retry_policy.max_attempts >= 2 if priority >= 8, "High priority operations should have retry policies"

        # Parallel operations should have lower priority
        priority <= 7 if allow_parallel, "Parallel operations should have lower priority for scheduling"

# Validate workflow operation consistency
schema ConsistentWorkflow(main.BatchWorkflow):
    """Workflow with consistent operation validation"""

    check:
        # All operation IDs must be unique
        isunique([op.operation_id for op in operations]), "All operation IDs must be unique"

        # Dependencies must reference existing operations
        all_true([
            dep.target_operation_id in [o.operation_id for o in operations]
            for op in operations
            for dep in op.dependencies or []
        ]), "All dependencies must reference existing operations"

        # Cycle detection is not implemented here; this only rejects an empty workflow
        len(operations) > 0, "Workflow must have at least one operation"
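
The uniqueness and reference rules can be tried in isolation with a reduced model. The sketch below assumes hypothetical `OpSketch` and `WorkflowSketch` schemas, with dependencies simplified to a list of operation IDs; the real `BatchOperation` carries more fields and a richer dependency type.

```kcl
# Hypothetical minimal schemas; the real BatchOperation has more fields
schema OpSketch:
    operation_id: str
    depends_on: [str] = []

schema WorkflowSketch:
    operations: [OpSketch]

    check:
        # isunique is a KCL builtin that reports whether a list has no duplicates
        isunique([op.operation_id for op in operations]), "operation IDs must be unique"
        # Every referenced ID must belong to some operation in the same workflow
        all_true([
            dep in [o.operation_id for o in operations]
            for op in operations
            for dep in op.depends_on
        ]), "dependencies must reference existing operations"

wf = WorkflowSketch {
    operations: [
        {operation_id: "create_servers"},
        {operation_id: "install_k8s", depends_on: ["create_servers"]}
    ]
}
```

Renaming the second `depends_on` entry to an ID that does not exist, or giving both operations the same `operation_id`, should trigger the corresponding message.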

Best Practices

1. Schema Design Principles

# ✅ Good: Descriptive field names and documentation
schema WellDocumentedServer:
    """
    Server configuration for production workloads
    Follows company security and operational standards
    """

    # Core identification
    hostname: str              # DNS-compliant hostname
    fqdn?: str                # Fully qualified domain name

    # Environment classification
    environment: "dev" | "staging" | "prod"
    classification: "public" | "internal" | "confidential"

    # Operational metadata
    owner_team: str           # Team responsible for maintenance
    cost_center: str          # Billing allocation
    backup_required: bool     # Whether automated backups are needed

    check:
        len(hostname) > 0 and len(hostname) <= 63, "Hostname must be 1-63 characters"
        len(owner_team) > 0, "Owner team must be specified"
        len(cost_center) > 0, "Cost center must be specified"

# ❌ Avoid: Unclear field names and missing validation
schema PoorlyDocumentedServer:
    name: str                 # ❌ Ambiguous - hostname? title? display name?
    env: str                  # ❌ No constraints - any string allowed
    data: {str: str}          # ❌ Unstructured data without validation

2. Validation Strategy

import .main
import regex

# ✅ Good: Layered validation with clear error messages
schema ProductionWorkflow(main.BatchWorkflow):
    """Production workflow with comprehensive validation"""

    # Business metadata
    change_request_id: str
    approver: str
    maintenance_window?: str

    check:
        # Business process validation
        regex.match(change_request_id, "^CHG-[0-9]{4}-[0-9]{3}$"), "Change request ID must match format CHG-YYYY-NNN"

        # Operational validation (4 hours max)
        global_timeout <= 14400, "Production workflows must complete within 4 hours"

        # Safety validation
        default_rollback_strategy.enabled == True, "Production workflows must enable rollback"

        # Monitoring validation
        monitoring.enabled == True and monitoring.enable_notifications == True, "Production workflows must enable monitoring and notifications"

# ✅ Good: Environment-specific defaults with validation
schema EnvironmentDefaults:
    """Environment-specific default configurations"""

    environment: "dev" | "staging" | "prod"

    # Default timeouts by environment
    default_timeout: int = 1800 if environment == "prod" else (1200 if environment == "staging" else 600)

    # Default retry attempts by environment
    default_retries: int = 3 if environment == "prod" else (2 if environment == "staging" else 1)

    # Default monitoring settings
    monitoring_enabled: bool = environment == "prod"

    check:
        default_timeout > 0, "Timeout must be positive"
        default_retries >= 0, "Retries cannot be negative"
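
When the number of environments grows, nested conditionals become hard to read. One alternative, sketched here under the assumption that a lookup table suits your defaults, keeps the per-environment values in dictionaries and indexes them by `environment`. The `_timeouts` and `_retries` names and the `EnvironmentDefaultsSketch` schema are hypothetical.

```kcl
# Hypothetical lookup tables; the values mirror the defaults listed above.
# Names starting with an underscore are hidden from the generated output.
_timeouts = {"dev": 600, "staging": 1200, "prod": 1800}
_retries = {"dev": 1, "staging": 2, "prod": 3}

schema EnvironmentDefaultsSketch:
    environment: "dev" | "staging" | "prod"
    default_timeout: int = _timeouts[environment]
    default_retries: int = _retries[environment]
    monitoring_enabled: bool = environment == "prod"

    check:
        default_timeout > 0, "Timeout must be positive"
        default_retries >= 0, "Retries cannot be negative"

staging = EnvironmentDefaultsSketch {
    environment: "staging"
}
```

Adding an environment then means adding one entry per table and one literal to the `environment` type, rather than another level of nesting.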

3. Schema Composition Patterns

import regex

# ✅ Good: Composable schema design
schema BaseResource:
    """Common fields for all resources"""
    name: str
    tags: {str: str} = {}
    created_at?: str
    updated_at?: str

    check:
        len(name) > 0, "Name cannot be empty"
        regex.match(name, "^[a-z0-9-]+$"), "Name must be lowercase alphanumeric with hyphens"

schema MonitoredResource(BaseResource):
    """Resource with monitoring capabilities"""
    monitoring_enabled: bool = True
    alert_thresholds: {str: float} = {}

    check:
        len(alert_thresholds) > 0 if monitoring_enabled, "Monitored resources must define alert thresholds"

schema SecureResource(BaseResource):
    """Resource with security requirements"""
    encryption_enabled: bool = True
    access_policy: str
    compliance_tags: [str] = []

    check:
        encryption_enabled == True, "Security-sensitive resources must enable encryption"
        len(access_policy) > 0, "Access policy must be defined"
        "pci" in compliance_tags or "sox" in compliance_tags or "hipaa" in compliance_tags, "Must specify compliance requirements"

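# Composed schema. KCL schemas accept a single base schema, so this extends
# SecureResource and restates the monitoring fields from MonitoredResource.
schema ProductionDatabase(SecureResource):
    """Production database with full operational requirements"""
    monitoring_enabled: bool = True
    alert_thresholds: {str: float} = {}
    backup_retention_days: int = 30
    high_availability: bool = True

    check:
        len(alert_thresholds) > 0 if monitoring_enabled, "Monitored resources must define alert thresholds"
        backup_retention_days >= 7, "Production databases need minimum 7 days backup retention"
        high_availability == True, "Production databases must be highly available"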

4. Error Handling Patterns

import .main

# ✅ Good: Comprehensive error scenarios with specific handling
schema RobustBatchOperation(main.BatchOperation):
    """Batch operation with robust error handling"""

    # Error classification
    critical_operation: bool = False
    max_failure_rate: float = 0.1

    # Enhanced retry configuration
    retry_policy: main.RetryPolicy = main.RetryPolicy {
        max_attempts: 5 if critical_operation else 3
        initial_delay: 30 if critical_operation else 10
        max_delay: 600 if critical_operation else 300
        backoff_multiplier: 2
        retry_on_errors: [
            "connection_error",
            "timeout",
            "rate_limit",
            "resource_unavailable"
        ]
    }

    # Enhanced rollback strategy
    rollback_strategy: main.RollbackStrategy = main.RollbackStrategy {
        enabled: True
        strategy: "manual" if critical_operation else "immediate"
        preserve_partial_state: critical_operation
        custom_rollback_operations: [
            "create_incident_ticket",
            "notify_on_call_engineer",
            "preserve_logs"
        ] if critical_operation else []
    }

    check:
        0 <= max_failure_rate <= 1, "Failure rate must be between 0 and 1"

        timeout >= 1800 if critical_operation, "Critical operations need extended timeout"

Common Patterns

1. Multi-Environment Configuration

import .main

# Configuration that adapts to environment
schema EnvironmentAwareConfig:
    environment: "dev" | "staging" | "prod"

    # Computed values based on environment
    replica_count: int = 3 if environment == "prod" else (2 if environment == "staging" else 1)

    resource_requests: main.K8sResources = main.K8sResources {
        memory: "512Mi" if environment == "prod" else "256Mi"
        cpu: "200m" if environment == "prod" else "100m"
    }
    }

    monitoring_enabled: bool = environment != "dev"

    backup_enabled: bool = environment == "prod"

# Usage pattern
prod_config: EnvironmentAwareConfig = EnvironmentAwareConfig {
    environment: "prod"
    # replica_count automatically becomes 3
    # monitoring_enabled automatically becomes True
    # backup_enabled automatically becomes True
}

2. Provider Abstraction

import .main

# Provider-agnostic resource definition
schema AbstractServer:
    """Provider-agnostic server specification"""

    # Common specification
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    network_performance: "low" | "moderate" | "high"

    # Provider-specific mapping
    provider: "upcloud" | "aws" | "gcp"

    # Computed provider-specific values
    instance_type: str = (
        "${cpu_cores}xCPU-${memory_gb}GB" if provider == "upcloud" else (
        "m5." + ("large" if cpu_cores == 1 else "xlarge") if provider == "aws" else (
        "n2-standard-${cpu_cores}" if provider == "gcp" else "unknown"
        ))
    )

    storage_type: str = (
        "MaxIOPS" if provider == "upcloud" else (
        "gp3" if provider == "aws" else (
        "pd-ssd" if provider == "gcp" else "standard"
        ))
    )

# Multi-provider workflow using abstraction
mixed_deployment: main.BatchWorkflow = main.BatchWorkflow {
    workflow_id: "mixed_deploy_001"
    name: "Multi-Provider Deployment"

    operations: [
        # UpCloud servers
        main.BatchOperation {
            operation_id: "upcloud_servers"
            provider: "upcloud"
            parameters: {
                "instance_type": "2xCPU-4GB"  # UpCloud format
                "storage_type": "MaxIOPS"
            }
        },
        # AWS servers
        main.BatchOperation {
            operation_id: "aws_servers"
            provider: "aws"
            parameters: {
                "instance_type": "m5.large"   # AWS format
                "storage_type": "gp3"
            }
        }
    ]
}

3. Dependency Management

import .main

# Complex dependency patterns
schema DependencyAwareWorkflow(main.BatchWorkflow):
    """Workflow with intelligent dependency management"""

    # Categorize operations by type
    infrastructure_ops: [str] = [
        op.operation_id for op in operations
        if op.operation_type == "server"
    ]

    service_ops: [str] = [
        op.operation_id for op in operations
        if op.operation_type == "taskserv"
    ]

    validation_ops: [str] = [
        op.operation_id for op in operations
        if op.operation_type == "custom" and "validate" in op.name.lower()
    ]

    check:
        # Infrastructure must come before services
        all_true([
            len([dep for dep in op.dependencies or []
                if dep.target_operation_id in infrastructure_ops]) > 0
            for op in operations
            if op.operation_id in service_ops
        ]) or len(service_ops) == 0, "Service operations must depend on infrastructure operations"

        # Validation must come last
        all_true([
            len([dep for dep in op.dependencies or []
                if dep.target_operation_id in service_ops or dep.target_operation_id in infrastructure_ops]) > 0
            for op in operations
            if op.operation_id in validation_ops
        ]) or len(validation_ops) == 0, "Validation operations must depend on other operations"

Troubleshooting

Common Validation Errors

1. Missing Required Fields

# Error: attribute 'labels' of Server is required
# ❌ Incomplete server definition
server: main.Server = main.Server {
    hostname: "web-01"
    title: "Web Server"
    # Missing: labels, user
}

# ✅ Complete server definition
server: main.Server = main.Server {
    hostname: "web-01"
    title: "Web Server"
    labels: "env: prod"        # ✅ Required field
    user: "admin"              # ✅ Required field
}
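
Whether a field is required is decided by how the schema declares it. The sketch below uses a hypothetical `AccountSketch` schema to show the three common cases: a required attribute, an optional one marked with `?`, and a required attribute that a default value satisfies.

```kcl
# Hypothetical schema illustrating required, optional, and defaulted attributes
schema AccountSketch:
    user: str                        # required: omitting it fails validation
    shell?: str                      # optional: may be left unset
    home: str = "/home/${user}"      # required, but satisfied by the default

admin = AccountSketch {
    user: "admin"
}
```

If a "required" error appears for a field you expected to be optional, checking for a missing `?` or a missing default in the schema declaration is usually the first step.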

2. Type Mismatches

# Error: expect int, got str
# ❌ Wrong type
workflow: main.BatchWorkflow = main.BatchWorkflow {
    max_parallel_operations: "3"  # ❌ String instead of int
}

# ✅ Correct type
workflow: main.BatchWorkflow = main.BatchWorkflow {
    max_parallel_operations: 3    # ✅ Integer
}
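
Type mismatches often come from values that arrive as strings, for example from an external data source. A minimal sketch, assuming a hypothetical `LimitsSketch` schema and a placeholder `_raw` value, converts the string explicitly with the `int()` builtin before assigning it.

```kcl
# Hypothetical schema with a single integer attribute
schema LimitsSketch:
    max_parallel_operations: int

# Placeholder for a value that arrived as a string
_raw = "3"

limits = LimitsSketch {
    # Convert explicitly; assigning _raw directly would be a type error
    max_parallel_operations: int(_raw)
}
```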

3. Constraint Violations

# Error: Check failed: hostname cannot be empty
# ❌ Constraint violation
server: main.Server = main.Server {
    hostname: ""              # ❌ Empty string violates constraint
    title: "Server"
    labels: "env: prod"
    user: "admin"
}

# ✅ Valid constraint
server: main.Server = main.Server {
    hostname: "web-01"        # ✅ Non-empty string
    title: "Server"
    labels: "env: prod"
    user: "admin"
}

Debugging Techniques

1. Step-by-step Validation

# Validate incrementally
kcl run basic_config.k        # Start with minimal config
kcl run enhanced_config.k     # Add features gradually
kcl run complete_config.k     # Full configuration

2. Schema Introspection

# Check what fields are available
kcl run -c 'import .main; main.Server' --format json

# Validate a JSON or YAML data file against a specific schema
# (server.json is a placeholder name for your data file)
kcl vet server.json main.k --schema Server

# Debug with verbose output
kcl run config.k --debug --verbose

3. Constraint Testing

import .main

# Test constraint behavior
# Test minimum values
min_timeout = main.BatchOperation {
    operation_id: "test"
    name: "Test"
    operation_type: "server"
    action: "create"
    parameters: {}
    timeout: 1              # Test minimum allowed
}

# Test maximum values
max_parallel = main.BatchWorkflow {
    workflow_id: "test"
    name: "Test"
    operations: [min_timeout]     # Reuses the operation defined above
    max_parallel_operations: 100  # Test upper limits
}

Performance Considerations

1. Schema Complexity

# ✅ Good: Simple, focused schemas
schema SimpleServer:
    hostname: str
    user: str
    labels: str

    check:
        len(hostname) > 0, "Hostname required"

# ❌ Avoid: Overly complex schemas with many computed fields
schema OverlyComplexServer:
    # ... many fields with complex interdependencies
    # ... computationally expensive check conditions
    # ... deep nested validations

2. Validation Efficiency

import regex

# ✅ Good: Efficient validation
schema EfficientValidation:
    name: str
    tags: {str: str}

    check:
        len(name) > 0, "Name required"                    # ✅ Simple check
        len(tags) <= 10, "Maximum 10 tags allowed"       # ✅ Simple count check

# ❌ Avoid: Expensive validation
schema ExpensiveValidation:
    items: [str]

    check:
        # ❌ Runs a regex match against every item on each evaluation
        all_true([regex.match(item, "^[a-z0-9-]+$") for item in items]), "All items must match pattern"

This validation guide provides the foundation for creating robust, maintainable KCL schemas with proper error handling and validation strategies.