prvng_kcl/docs/VALIDATION.md
2025-10-07 11:17:54 +01:00


# Schema Validation and Best Practices
This document provides comprehensive guidance on validating KCL schemas and following best practices for the provisioning package.
## Table of Contents
- [Schema Validation](#schema-validation)
- [Built-in Constraints](#built-in-constraints)
- [Custom Validation](#custom-validation)
- [Best Practices](#best-practices)
- [Common Patterns](#common-patterns)
- [Troubleshooting](#troubleshooting)
## Schema Validation
### Basic Validation
```bash
# Validate syntax and run schema checks
kcl run config.k
# Format all files in place (parse errors are reported)
kcl fmt *.k
# Validate with verbose output
kcl run config.k --debug
# Validate a JSON/YAML data file against a schema defined in KCL code
kcl vet data.json main.k --schema Server
```
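To try these commands on something concrete, a minimal, self-contained `config.k` is enough. The sketch below uses a hypothetical `ServerSpec` schema rather than the package's `main.Server`. `kcl run config.k` prints `web` as YAML; setting `hostname` to `""` should make the run fail with the check message.

```kcl
# config.k: hypothetical, self-contained example (does not import main)
schema ServerSpec:
    hostname: str
    user: str
    user_ssh_port: int = 22

    check:
        len(hostname) > 0, "hostname cannot be empty"
        user_ssh_port > 0 and user_ssh_port <= 65535, "user_ssh_port must be in 1-65535"

# Instantiating the schema is what triggers the check block
web = ServerSpec {
    hostname: "web-01"
    user: "admin"
}
```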
### JSON Output Validation
```bash
# Generate and validate JSON output
kcl run config.k --format json | jq '.'
# Fail with a non-zero exit code if workflow_id is missing or null
kcl run config.k --format json | jq '.workflow_id // error("Missing workflow_id")'
# Pretty print for inspection
kcl run config.k --format json | jq '.operations[] | {operation_id, name, provider}'
```
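The `jq` filters above assume the output has `workflow_id` and `operations` as top-level keys. A hypothetical `config.k` with that shape (the values are illustrative):

```kcl
# config.k: hypothetical top-level layout matching the jq filters
workflow_id = "deploy_001"

operations = [
    {operation_id: "create_servers", name: "Create Servers", provider: "upcloud"},
    {operation_id: "install_k8s", name: "Install Kubernetes", provider: "upcloud"}
]
```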
### Validation in CI/CD
```yaml
# GitHub Actions example
- name: Validate KCL Schemas
  run: |
    # xargs exits non-zero if any invocation fails; find -exec ... \; would not
    find . -name "*.k" -print0 | xargs -0 -n1 kcl fmt
    find . -name "*.k" -print0 | xargs -0 -n1 kcl run

# Check for schema changes
- name: Check Schema Compatibility
  run: |
    kcl run main.k --format json > current_schema.json
    diff expected_schema.json current_schema.json
```
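The CI step above only catches failures in values that some file actually instantiates. One option, sketched here under the assumption that CI runs it like any other file (`kcl run checks.k`), is a dedicated file of top-level `assert` statements: a false assertion stops the run with an error, which fails the step. `Limits` is a hypothetical schema.

```kcl
# checks.k: hypothetical smoke tests, run with `kcl run checks.k`
schema Limits:
    max_parallel_operations: int = 3
    global_timeout: int = 3600

    check:
        max_parallel_operations >= 0, "max_parallel_operations cannot be negative"
        global_timeout > 0, "global_timeout must be positive"

# Private (underscore) variables are not emitted in the output
_defaults = Limits {}

# A false assert aborts the run with the given message
assert _defaults.max_parallel_operations == 3, "unexpected default parallelism"
assert _defaults.global_timeout <= 14400, "default timeout exceeds 4 hours"
```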
## Built-in Constraints
### Server Schema Constraints
```kcl
import .main

# ✅ Valid server configuration
valid_server: main.Server = main.Server {
    hostname: "web-01" # ✅ Non-empty string required
    title: "Web Server" # ✅ Non-empty string required
    labels: "env: prod" # ✅ Required field
    user: "admin" # ✅ Required field
    # Optional but validated fields
    user_ssh_port: 22 # ✅ Valid port number
    running_timeout: 300 # ✅ Positive integer
    time_zone: "UTC" # ✅ Valid timezone string
}

# ❌ Invalid configurations that will fail validation
invalid_examples: {
    # hostname: "" # ❌ Empty hostname not allowed
    # title: "" # ❌ Empty title not allowed
    # user_ssh_port: -1 # ❌ Negative port not allowed
    # running_timeout: 0 # ❌ Zero timeout not allowed
}
```
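The ✅/❌ annotations above describe rules enforced by `main.Server`. The sketch below shows how such rules are typically written in a KCL `check` block; `ServerConstraints` is a hypothetical stand-in, and the package's actual rules may differ. Optional fields are guarded with `if ... != None` so that leaving them unset does not fail the check.

```kcl
schema ServerConstraints:
    hostname: str
    title: str
    labels: str
    user: str
    user_ssh_port?: int
    running_timeout?: int

    check:
        len(hostname) > 0, "hostname cannot be empty"
        len(title) > 0, "title cannot be empty"
        # Guards skip the rule when the optional field is unset
        user_ssh_port > 0 and user_ssh_port <= 65535 if user_ssh_port != None, "user_ssh_port must be 1-65535"
        running_timeout > 0 if running_timeout != None, "running_timeout must be positive"

web01 = ServerConstraints {
    hostname: "web-01"
    title: "Web Server"
    labels: "env: prod"
    user: "admin"
    user_ssh_port: 22
}
```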
### Workflow Schema Constraints
```kcl
import .main

# ✅ Valid workflow with proper constraints
valid_workflow: main.BatchWorkflow = main.BatchWorkflow {
    workflow_id: "deploy_001" # ✅ Non-empty ID required
    name: "Production Deployment" # ✅ Non-empty name required
    operations: [ # ✅ At least one operation required
        main.BatchOperation {
            operation_id: "create_servers" # ✅ Unique operation ID
            name: "Create Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
            timeout: 600 # ✅ Positive timeout
            priority: 5 # ✅ Valid priority
        }
    ]
    max_parallel_operations: 3 # ✅ Non-negative number
    global_timeout: 3600 # ✅ Positive global timeout
}

# ❌ Constraint violations
constraint_violations: {
    # workflow_id: "" # ❌ Empty workflow ID
    # operations: [] # ❌ Empty operations list
    # max_parallel_operations: -1 # ❌ Negative parallel limit
    # global_timeout: 0 # ❌ Zero global timeout
}
```
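A sketch of how the workflow constraints listed above can be expressed, using hypothetical `OperationSketch` and `WorkflowSketch` schemas. `isunique` is a KCL builtin that returns `True` when a list has no duplicates.

```kcl
schema OperationSketch:
    operation_id: str
    timeout: int = 600

    check:
        len(operation_id) > 0, "operation_id cannot be empty"
        timeout > 0, "timeout must be positive"

schema WorkflowSketch:
    workflow_id: str
    operations: [OperationSketch]
    max_parallel_operations: int = 3
    global_timeout: int = 3600

    check:
        len(workflow_id) > 0, "workflow_id cannot be empty"
        len(operations) > 0, "at least one operation is required"
        isunique([op.operation_id for op in operations]), "operation IDs must be unique"
        max_parallel_operations >= 0, "max_parallel_operations cannot be negative"
        global_timeout > 0, "global_timeout must be positive"

deploy = WorkflowSketch {
    workflow_id: "deploy_001"
    operations: [OperationSketch {operation_id: "create_servers"}]
}
```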
### Kubernetes Schema Constraints
```kcl
import .main

# ✅ Valid Kubernetes deployment with constraints
valid_k8s: main.K8sDeploy = main.K8sDeploy {
    name: "webapp" # ✅ Non-empty name
    namespace: "production" # ✅ Valid namespace
    spec: main.K8sDeploySpec {
        replicas: 3 # ✅ Positive replica count
        containers: [ # ✅ At least one container required
            main.K8sContainers {
                name: "app" # ✅ Non-empty container name
                image: "nginx:1.21" # ✅ Valid image reference
                resources_requests: main.K8sResources {
                    memory: "128Mi" # ✅ Valid K8s memory format
                    cpu: "100m" # ✅ Valid K8s CPU format
                }
                resources_limits: main.K8sResources {
                    memory: "256Mi" # ✅ Limits >= requests (enforced)
                    cpu: "200m"
                }
            }
        ]
    }
}
```
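The comment on `resources_limits` says limits must be at least the requests. A cross-field rule for that is sketched below under a simplifying assumption: memory is always written in `Mi` and CPU in millicores (`m`). Handling every Kubernetes quantity suffix would need a real parser. `Resources` and `ContainerSketch` are hypothetical.

```kcl
schema Resources:
    memory: str
    cpu: str

    check:
        memory.endswith("Mi"), "this sketch only accepts memory in Mi"
        cpu.endswith("m"), "this sketch only accepts CPU in millicores"

schema ContainerSketch:
    name: str
    requests: Resources
    limits: Resources

    check:
        # Strip the unit suffix and compare the numeric parts
        int(limits.memory.replace("Mi", "")) >= int(requests.memory.replace("Mi", "")), "memory limit must be >= memory request"
        int(limits.cpu.replace("m", "")) >= int(requests.cpu.replace("m", "")), "CPU limit must be >= CPU request"

app = ContainerSketch {
    name: "app"
    requests: Resources {memory: "128Mi", cpu: "100m"}
    limits: Resources {memory: "256Mi", cpu: "200m"}
}
```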
### Dependency Schema Constraints
```kcl
import .main

# ✅ Valid dependency definitions
valid_dependencies: main.TaskservDependencies = main.TaskservDependencies {
    name: "kubernetes" # ✅ Lowercase name required
    requires: ["containerd", "cni"] # ✅ Valid dependency list
    conflicts: ["docker"] # ✅ Cannot coexist with docker
    resources: main.ResourceRequirement {
        cpu: "100m" # ✅ Non-empty CPU requirement
        memory: "128Mi" # ✅ Non-empty memory requirement
        disk: "1Gi" # ✅ Non-empty disk requirement
    }
    timeout: 600 # ✅ Positive timeout
    retry_count: 3 # ✅ Non-negative retry count
    os_support: ["linux"] # ✅ At least one OS required
    arch_support: ["amd64", "arm64"] # ✅ At least one arch required
}

# ❌ Constraint violations
dependency_violations: {
    # name: "Kubernetes" # ❌ Must be lowercase
    # name: "" # ❌ Cannot be empty
    # timeout: 0 # ❌ Must be positive
    # retry_count: -1 # ❌ Cannot be negative
    # os_support: [] # ❌ Must specify at least one OS
}
```
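The dependency rules above map naturally onto a `check` block. In this sketch (hypothetical `DependencySketch` schema), one regular expression covers both "lowercase" and "non-empty", and an extra rule rejects a component listed in both `requires` and `conflicts`.

```kcl
import regex

schema DependencySketch:
    name: str
    requires: [str] = []
    conflicts: [str] = []
    timeout: int = 600
    retry_count: int = 0
    os_support: [str]

    check:
        # `+` rejects the empty string; [a-z0-9-] rejects uppercase
        regex.match(name, "^[a-z0-9-]+$"), "name must be non-empty and lowercase"
        timeout > 0, "timeout must be positive"
        retry_count >= 0, "retry_count cannot be negative"
        len(os_support) > 0, "at least one OS must be listed"
        # A taskserv cannot both require and conflict with the same component
        len([r for r in requires if r in conflicts]) == 0, "requires and conflicts overlap"

kubernetes = DependencySketch {
    name: "kubernetes"
    requires: ["containerd", "cni"]
    conflicts: ["docker"]
    os_support: ["linux"]
}
```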
## Custom Validation
### Adding Custom Constraints
```kcl
import .main
import regex

# Custom server schema with additional validation
schema CustomServer(main.Server):
    """Custom server with additional business rules"""
    # Additional custom fields
    environment: "dev" | "staging" | "prod"
    cost_center: str

    check:
        # Business rule: production servers must have specific naming
        regex.match(hostname, "^prod-[a-z0-9-]+$") if environment == "prod", "Production servers must start with 'prod-'"
        # Business rule: staging servers are limited to 3 taskservs
        len(taskservs or []) <= 3 if environment == "staging", "Staging servers limited to 3 taskservs"
        # Business rule: cost center must be valid
        cost_center in ["engineering", "operations", "security"], "Invalid cost center: ${cost_center}"

# Usage with validation
prod_server: CustomServer = CustomServer {
    hostname: "prod-web-01" # ✅ Matches production naming
    title: "Production Web Server"
    labels: "env: prod"
    user: "admin"
    environment: "prod" # ✅ Valid environment
    cost_center: "engineering" # ✅ Valid cost center
}
```
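Rules that apply to a single environment are best written with KCL's guarded form, `condition if guard, message`, which evaluates `condition` only when `guard` holds. Writing `guard and condition` instead fails for every value where `guard` is false, so a `dev` server would be rejected by a prod-only naming rule. A minimal sketch with a hypothetical `EnvServer` schema:

```kcl
import regex

schema EnvServer:
    hostname: str
    environment: "dev" | "staging" | "prod"

    check:
        # Evaluated only for prod; dev and staging hostnames are unconstrained
        regex.match(hostname, "^prod-[a-z0-9-]+$") if environment == "prod", "Production servers must start with 'prod-'"

dev_box = EnvServer {hostname: "laptop-01", environment: "dev"}
prod_box = EnvServer {hostname: "prod-web-01", environment: "prod"}
```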
### Conditional Validation
```kcl
import .main

# Workflow with conditional validation based on environment
schema EnvironmentWorkflow(main.BatchWorkflow):
    """Workflow with environment-specific validation"""
    environment: "dev" | "staging" | "prod"

    check:
        # Production workflows must have monitoring
        monitoring.enabled == True if environment == "prod", "Production workflows must enable monitoring"
        # Production workflows must have rollback enabled
        default_rollback_strategy.enabled == True if environment == "prod", "Production workflows must enable rollback"
        # Development workflows are capped at 30 minutes
        global_timeout <= 1800 if environment == "dev", "Development workflows should complete within 30 minutes"
        # Staging must have retry policies
        default_retry_policy.max_attempts >= 2 if environment == "staging", "Staging workflows must have retry policies"

# Valid production workflow
prod_workflow: EnvironmentWorkflow = EnvironmentWorkflow {
    workflow_id: "prod_deploy_001"
    name: "Production Deployment"
    environment: "prod" # ✅ Production environment
    operations: [
        main.BatchOperation {
            operation_id: "deploy"
            name: "Deploy Application"
            operation_type: "server"
            action: "create"
            parameters: {}
        }
    ]
    # ✅ Required for production
    monitoring: main.MonitoringConfig {
        enabled: True
        backend: "prometheus"
    }
    # ✅ Required for production
    default_rollback_strategy: main.RollbackStrategy {
        enabled: True
        strategy: "immediate"
    }
}
```
### Cross-Field Validation
```kcl
import .main

# Validate relationships between fields
schema ValidatedBatchOperation(main.BatchOperation):
    """Batch operation with cross-field validation"""

    check:
        # Timeout should be reasonable for operation type
        timeout >= 300 if operation_type == "server", "Server operations need at least 5 minutes timeout"
        timeout >= 600 if operation_type == "taskserv", "Taskserv operations need at least 10 minutes timeout"
        # High priority operations should have retry policies
        retry_policy.max_attempts >= 2 if priority >= 8, "High priority operations should have retry policies"
        # Parallel operations should have lower priority
        priority <= 7 if allow_parallel == True, "Parallel operations should have lower priority for scheduling"

# Validate workflow operation consistency
schema ConsistentWorkflow(main.BatchWorkflow):
    """Workflow with consistent operation validation"""

    check:
        # All operation IDs must be unique (isunique is a KCL builtin)
        isunique([op.operation_id for op in operations]), "All operation IDs must be unique"
        # Dependencies must reference existing operations
        all_true([
            dep.target_operation_id in [o.operation_id for o in operations]
            for op in operations
            for dep in op.dependencies or []
        ]), "All dependencies must reference existing operations"
        # Non-empty check only; this schema does not detect circular dependencies
        len(operations) > 0, "Workflow must have at least one operation"
```
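When a cross-field check fails, the message is more useful if it names the offending values. One way, sketched with hypothetical `Dep`, `Op` and `RefCheckedWorkflow` schemas, is to compute the helper lists as attributes and interpolate them into the message. Like the checks above, this does not detect dependency cycles.

```kcl
schema Dep:
    target_operation_id: str

schema Op:
    operation_id: str
    dependencies: [Dep] = []

schema RefCheckedWorkflow:
    operations: [Op]
    # Computed helpers; they also appear in the output, which aids debugging
    declared_ids: [str] = [op.operation_id for op in operations]
    dangling: [str] = [
        d.target_operation_id
        for op in operations
        for d in op.dependencies
        if d.target_operation_id not in declared_ids
    ]

    check:
        isunique(declared_ids), "All operation IDs must be unique"
        len(dangling) == 0, "Unknown dependency targets: ${dangling}"

wf = RefCheckedWorkflow {
    operations: [
        Op {operation_id: "servers"},
        Op {operation_id: "k8s", dependencies: [Dep {target_operation_id: "servers"}]}
    ]
}
```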
## Best Practices
### 1. Schema Design Principles
```kcl
# ✅ Good: Descriptive field names and documentation
schema WellDocumentedServer:
    """
    Server configuration for production workloads
    Follows company security and operational standards
    """
    # Core identification
    hostname: str # DNS-compliant hostname
    fqdn?: str # Fully qualified domain name
    # Environment classification
    environment: "dev" | "staging" | "prod"
    classification: "public" | "internal" | "confidential"
    # Operational metadata
    owner_team: str # Team responsible for maintenance
    cost_center: str # Billing allocation
    backup_required: bool # Whether automated backups are needed

    check:
        len(hostname) > 0 and len(hostname) <= 63, "Hostname must be 1-63 characters"
        len(owner_team) > 0, "Owner team must be specified"
        len(cost_center) > 0, "Cost center must be specified"

# ❌ Avoid: Unclear field names and missing validation
schema PoorlyDocumentedServer:
    name: str # ❌ Ambiguous - hostname? title? display name?
    env: str # ❌ No constraints - any string allowed
    data: {str: str} # ❌ Unstructured data without validation
```
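The `hostname` comment promises a DNS-compliant name, but the check only bounds the length. A sketch that adds an RFC 1123 label pattern (letters, digits and hyphens, no leading or trailing hyphen); it validates a single label only, so an FQDN such as `web-01.example.com` would need per-label checks. `DnsHostname` is hypothetical.

```kcl
import regex

schema DnsHostname:
    hostname: str

    check:
        len(hostname) > 0 and len(hostname) <= 63, "Hostname must be 1-63 characters"
        # RFC 1123 label: alphanumeric at both ends, hyphens allowed inside
        regex.match(hostname, "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"), "Hostname must be a valid DNS label"

web = DnsHostname {hostname: "web-01"}
```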
### 2. Validation Strategy
```kcl
import .main
import regex

# ✅ Good: Layered validation with clear error messages
schema ProductionWorkflow(main.BatchWorkflow):
    """Production workflow with comprehensive validation"""
    # Business metadata
    change_request_id: str
    approver: str
    maintenance_window?: str

    check:
        # Business process validation
        regex.match(change_request_id, "^CHG-[0-9]{4}-[0-9]{3}$"), "Change request ID must match format CHG-YYYY-NNN"
        # Operational validation: 4 hours max
        global_timeout <= 14400, "Production workflows must complete within 4 hours"
        # Safety validation
        default_rollback_strategy.enabled == True, "Production workflows must enable rollback"
        # Monitoring validation
        monitoring.enabled == True and monitoring.enable_notifications == True, "Production workflows must enable monitoring and notifications"

# ✅ Good: Environment-specific defaults with validation
schema EnvironmentDefaults:
    """Environment-specific default configurations"""
    environment: "dev" | "staging" | "prod"
    # Default timeouts by environment
    default_timeout: int = 1800 if environment == "prod" else (1200 if environment == "staging" else 600)
    # Default retry attempts by environment
    default_retries: int = 3 if environment == "prod" else (2 if environment == "staging" else 1)
    # Default monitoring settings
    monitoring_enabled: bool = environment == "prod"

    check:
        default_timeout > 0, "Timeout must be positive"
        default_retries >= 0, "Retries cannot be negative"
```
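A compact, self-contained variant of the defaults pattern, using KCL's `value if condition else other` expression. The usage lines show the computed values for `staging` and that an explicitly supplied value takes precedence over the computed default. `EnvDefaults` is hypothetical.

```kcl
schema EnvDefaults:
    environment: "dev" | "staging" | "prod"
    # Defaults are computed from the final value of `environment`
    default_timeout: int = 1800 if environment == "prod" else (1200 if environment == "staging" else 600)
    default_retries: int = 3 if environment == "prod" else (2 if environment == "staging" else 1)

    check:
        default_timeout > 0, "Timeout must be positive"

staging = EnvDefaults {environment: "staging"}
# An explicitly supplied value overrides the computed default
staging_slow = EnvDefaults {environment: "staging", default_timeout: 3000}
```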
### 3. Schema Composition Patterns
```kcl
import regex

# ✅ Good: Composable schema design
schema BaseResource:
    """Common fields for all resources"""
    name: str
    tags: {str: str} = {}
    created_at?: str
    updated_at?: str

    check:
        len(name) > 0, "Name cannot be empty"
        regex.match(name, "^[a-z0-9-]+$"), "Name must be lowercase alphanumeric with hyphens"

schema MonitoredResource(BaseResource):
    """Resource with monitoring capabilities"""
    monitoring_enabled: bool = True
    alert_thresholds: {str: float} = {}

    check:
        len(alert_thresholds) > 0 if monitoring_enabled, "Monitored resources must define alert thresholds"

# KCL schemas have a single parent, so the patterns are chained:
# SecureResource builds on MonitoredResource
schema SecureResource(MonitoredResource):
    """Resource with monitoring and security requirements"""
    encryption_enabled: bool = True
    access_policy: str
    compliance_tags: [str] = []

    check:
        encryption_enabled == True, "Security-sensitive resources must enable encryption"
        len(access_policy) > 0, "Access policy must be defined"
        "pci" in compliance_tags or "sox" in compliance_tags or "hipaa" in compliance_tags, "Must specify compliance requirements"

# Production database inheriting both patterns through the chain
schema ProductionDatabase(SecureResource):
    """Production database with full operational requirements"""
    backup_retention_days: int = 30
    high_availability: bool = True

    check:
        backup_retention_days >= 7, "Production databases need minimum 7 days backup retention"
        high_availability == True, "Production databases must be highly available"
```
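KCL schemas inherit from a single parent, so combining several concerns can also be done by composition: the resource holds one attribute per concern, and each nested schema keeps its own checks. A sketch with hypothetical `MonitoringSpec`, `SecuritySpec` and `ComposedDatabase` schemas:

```kcl
schema MonitoringSpec:
    enabled: bool = True
    alert_thresholds: {str: float} = {}

    check:
        len(alert_thresholds) > 0 if enabled, "Monitored resources must define alert thresholds"

schema SecuritySpec:
    encryption_enabled: bool = True
    access_policy: str

    check:
        encryption_enabled, "Security-sensitive resources must enable encryption"
        len(access_policy) > 0, "Access policy must be defined"

schema ComposedDatabase:
    name: str
    # One attribute per concern instead of multiple parents
    monitoring: MonitoringSpec
    security: SecuritySpec
    backup_retention_days: int = 30

    check:
        backup_retention_days >= 7, "Production databases need minimum 7 days backup retention"

orders_db = ComposedDatabase {
    name: "orders-db"
    monitoring: MonitoringSpec {alert_thresholds: {cpu: 0.8}}
    security: SecuritySpec {access_policy: "dba-only"}
}
```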
### 4. Error Handling Patterns
```kcl
import .main

# ✅ Good: Comprehensive error scenarios with specific handling
schema RobustBatchOperation(main.BatchOperation):
    """Batch operation with robust error handling"""
    # Error classification
    critical_operation: bool = False
    max_failure_rate: float = 0.1
    # Enhanced retry configuration
    retry_policy: main.RetryPolicy = main.RetryPolicy {
        max_attempts: 5 if critical_operation else 3
        initial_delay: 30 if critical_operation else 10
        max_delay: 600 if critical_operation else 300
        backoff_multiplier: 2
        retry_on_errors: [
            "connection_error",
            "timeout",
            "rate_limit",
            "resource_unavailable"
        ]
    }
    # Enhanced rollback strategy
    rollback_strategy: main.RollbackStrategy = main.RollbackStrategy {
        enabled: True
        strategy: "manual" if critical_operation else "immediate"
        preserve_partial_state: critical_operation
        custom_rollback_operations: [
            "create_incident_ticket",
            "notify_on_call_engineer",
            "preserve_logs"
        ] if critical_operation else []
    }

    check:
        0 <= max_failure_rate and max_failure_rate <= 1, "Failure rate must be between 0 and 1"
        timeout >= 1800 if critical_operation, "Critical operations need extended timeout"
```
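Defaults that depend on a flag, as in `RobustBatchOperation`, are computed from the flag's final value. A reduced, self-contained sketch (hypothetical `RetrySketch` and `OperationWithPolicy` schemas) and the values it yields:

```kcl
schema RetrySketch:
    max_attempts: int
    initial_delay: int

schema OperationWithPolicy:
    operation_id: str
    critical_operation: bool = False
    # Nested default derived from a sibling attribute
    retry_policy: RetrySketch = RetrySketch {
        max_attempts: 5 if critical_operation else 3
        initial_delay: 30 if critical_operation else 10
    }

normal = OperationWithPolicy {operation_id: "resize"}
critical = OperationWithPolicy {operation_id: "migrate_db", critical_operation: True}
```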
## Common Patterns
### 1. Multi-Environment Configuration
```kcl
import .main

# Configuration that adapts to environment
schema EnvironmentAwareConfig:
    environment: "dev" | "staging" | "prod"
    # Computed values based on environment
    replica_count: int = 3 if environment == "prod" else (2 if environment == "staging" else 1)
    resource_requests: main.K8sResources = main.K8sResources {
        memory: "512Mi" if environment == "prod" else "256Mi"
        cpu: "200m" if environment == "prod" else "100m"
    }
    monitoring_enabled: bool = environment != "dev"
    backup_enabled: bool = environment == "prod"

# Usage pattern
prod_config: EnvironmentAwareConfig = EnvironmentAwareConfig {
    environment: "prod"
    # replica_count automatically becomes 3
    # monitoring_enabled automatically becomes True
    # backup_enabled automatically becomes True
}
```
### 2. Provider Abstraction
```kcl
import .main

# Provider-agnostic resource definition
schema AbstractServer:
    """Provider-agnostic server specification"""
    # Common specification
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    network_performance: "low" | "moderate" | "high"
    # Provider-specific mapping
    provider: "upcloud" | "aws" | "gcp"
    # Computed provider-specific values (KCL uses ${...} interpolation)
    instance_type: str = (
        "${cpu_cores}xCPU-${memory_gb}GB" if provider == "upcloud" else (
            ("m5.large" if cpu_cores == 1 else "m5.xlarge") if provider == "aws" else (
                "n2-standard-${cpu_cores}" if provider == "gcp" else "unknown"
            )
        )
    )
    storage_type: str = (
        "MaxIOPS" if provider == "upcloud" else (
            "gp3" if provider == "aws" else (
                "pd-ssd" if provider == "gcp" else "standard"
            )
        )
    )

# Multi-provider workflow with provider-specific parameters
mixed_deployment: main.BatchWorkflow = main.BatchWorkflow {
    workflow_id: "mixed_deploy_001"
    name: "Multi-Provider Deployment"
    operations: [
        # UpCloud servers
        main.BatchOperation {
            operation_id: "upcloud_servers"
            provider: "upcloud"
            parameters: {
                "instance_type": "2xCPU-4GB" # UpCloud format
                "storage_type": "MaxIOPS"
            }
        },
        # AWS servers
        main.BatchOperation {
            operation_id: "aws_servers"
            provider: "aws"
            parameters: {
                "instance_type": "m5.large" # AWS format
                "storage_type": "gp3"
            }
        }
    ]
}
```
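The provider mapping in `AbstractServer` could be factored into a lambda and reused, for example when filling in `parameters`. A sketch using KCL's `lambda` syntax and `${...}` string interpolation; `_instance_type_for` is a hypothetical helper, and the mappings are copied from `AbstractServer` rather than checked against provider catalogues. The leading underscore keeps the helper out of the rendered output.

```kcl
_instance_type_for = lambda provider: str, cpu_cores: int, memory_gb: int -> str {
    # The last expression in the body is the return value
    "${cpu_cores}xCPU-${memory_gb}GB" if provider == "upcloud" else (
        ("m5.large" if cpu_cores == 1 else "m5.xlarge") if provider == "aws" else (
            "n2-standard-${cpu_cores}" if provider == "gcp" else "unknown"
        )
    )
}

upcloud_type = _instance_type_for("upcloud", 2, 4)
aws_type = _instance_type_for("aws", 1, 4)
gcp_type = _instance_type_for("gcp", 4, 16)
```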
### 3. Dependency Management
```kcl
import .main

# Complex dependency patterns
schema DependencyAwareWorkflow(main.BatchWorkflow):
    """Workflow with intelligent dependency management"""
    # Categorize operations by type
    infrastructure_ops: [str] = [
        op.operation_id for op in operations
        if op.operation_type == "server"
    ]
    service_ops: [str] = [
        op.operation_id for op in operations
        if op.operation_type == "taskserv"
    ]
    validation_ops: [str] = [
        op.operation_id for op in operations
        if op.operation_type == "custom" and "validate" in op.name.lower()
    ]

    check:
        # Infrastructure must come before services
        all_true([
            len([dep for dep in op.dependencies or [] if dep.target_operation_id in infrastructure_ops]) > 0
            for op in operations
            if op.operation_id in service_ops
        ]) or len(service_ops) == 0, "Service operations must depend on infrastructure operations"
        # Validation must come last
        all_true([
            len([dep for dep in op.dependencies or [] if dep.target_operation_id in service_ops or dep.target_operation_id in infrastructure_ops]) > 0
            for op in operations
            if op.operation_id in validation_ops
        ]) or len(validation_ops) == 0, "Validation operations must depend on other operations"
```
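A variant of the first rule that reports which service operations lack an infrastructure dependency, instead of a bare pass/fail. `DepRef`, `OpSpec` and `OrderedWorkflow` are hypothetical reductions of the package's types.

```kcl
schema DepRef:
    target_operation_id: str

schema OpSpec:
    operation_id: str
    operation_type: "server" | "taskserv" | "custom"
    dependencies: [DepRef] = []

schema OrderedWorkflow:
    operations: [OpSpec]
    infrastructure_ops: [str] = [op.operation_id for op in operations if op.operation_type == "server"]
    # Service operations with no dependency on any infrastructure operation
    unanchored_services: [str] = [
        op.operation_id
        for op in operations
        if op.operation_type == "taskserv" and len([d for d in op.dependencies if d.target_operation_id in infrastructure_ops]) == 0
    ]

    check:
        len(unanchored_services) == 0, "Service operations without an infrastructure dependency: ${unanchored_services}"

wf = OrderedWorkflow {
    operations: [
        OpSpec {operation_id: "servers", operation_type: "server"},
        OpSpec {operation_id: "k8s", operation_type: "taskserv", dependencies: [DepRef {target_operation_id: "servers"}]}
    ]
}
```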
## Troubleshooting
### Common Validation Errors
#### 1. Missing Required Fields
```kcl
# Error: attribute 'labels' of Server is required
# ❌ Incomplete server definition
server: main.Server = main.Server {
    hostname: "web-01"
    title: "Web Server"
    # Missing: labels, user
}
# ✅ Complete server definition
server: main.Server = main.Server {
    hostname: "web-01"
    title: "Web Server"
    labels: "env: prod" # ✅ Required field
    user: "admin" # ✅ Required field
}
```
#### 2. Type Mismatches
```kcl
# Error: expect int, got str
# ❌ Wrong type
workflow: main.BatchWorkflow = main.BatchWorkflow {
    max_parallel_operations: "3" # ❌ String instead of int
}
# ✅ Correct type
workflow: main.BatchWorkflow = main.BatchWorkflow {
    max_parallel_operations: 3 # ✅ Integer
}
```
#### 3. Constraint Violations
```kcl
# Error: Check failed: hostname cannot be empty
# ❌ Constraint violation
server: main.Server = main.Server {
    hostname: "" # ❌ Empty string violates constraint
    title: "Server"
    labels: "env: prod"
    user: "admin"
}
# ✅ Valid constraint
server: main.Server = main.Server {
    hostname: "web-01" # ✅ Non-empty string
    title: "Server"
    labels: "env: prod"
    user: "admin"
}
```
### Debugging Techniques
#### 1. Step-by-step Validation
```bash
# Validate incrementally
kcl run basic_config.k # Start with minimal config
kcl run enhanced_config.k # Add features gradually
kcl run complete_config.k # Full configuration
```
#### 2. Schema Introspection
```bash
# Check what fields are available
kcl run -c 'import .main; main.Server' --format json
# Validate a JSON/YAML data file against a schema defined in KCL code
kcl vet data.json main.k --schema Server
# Debug with verbose output
kcl run config.k --debug --verbose
```
#### 3. Constraint Testing
```kcl
import .main

# Test constraint behavior at the boundaries
# Minimum timeout
test_min_timeout = main.BatchOperation {
    operation_id: "test"
    name: "Test"
    operation_type: "server"
    action: "create"
    parameters: {}
    timeout: 1 # Test minimum allowed
}
# Upper limit for parallel operations
test_max_parallel = main.BatchWorkflow {
    workflow_id: "test"
    name: "Test"
    operations: [test_min_timeout]
    max_parallel_operations: 100 # Test upper limits
}
```
### Performance Considerations
#### 1. Schema Complexity
```kcl
# ✅ Good: Simple, focused schemas
schema SimpleServer:
    hostname: str
    user: str
    labels: str

    check:
        len(hostname) > 0, "Hostname required"

# ❌ Avoid: Overly complex schemas with many computed fields
schema OverlyComplexServer:
    # ... many fields with complex interdependencies
    # ... computationally expensive check conditions
    # ... deep nested validations
```
#### 2. Validation Efficiency
```kcl
import regex

# ✅ Good: Efficient validation
schema EfficientValidation:
    name: str
    tags: {str: str}

    check:
        len(name) > 0, "Name required" # ✅ Simple check
        len(tags) <= 10, "Maximum 10 tags allowed" # ✅ Simple count check

# ❌ Avoid on large lists: one regex evaluation per item
schema ExpensiveValidation:
    items: [str]

    check:
        # ❌ Cost grows with the length of the list
        all_true([regex.match(item, "^[a-z0-9-]+$") for item in items]), "All items must match pattern"
```
This validation guide provides the foundation for creating robust, maintainable KCL schemas with proper error handling and validation strategies.