# KCL Best Practices for Provisioning
This document outlines best practices for using and developing with the provisioning KCL package, covering schema design, workflow patterns, and operational guidelines.
## Table of Contents
- [Schema Design](#schema-design)
- [Workflow Patterns](#workflow-patterns)
- [Error Handling](#error-handling)
- [Performance Optimization](#performance-optimization)
- [Security Considerations](#security-considerations)
- [Testing Strategies](#testing-strategies)
- [Maintenance Guidelines](#maintenance-guidelines)
## Schema Design
### 1. Clear Naming Conventions
```kcl
# ✅ Good: Descriptive, consistent naming
schema ProductionWebServer:
    """Web server optimized for production workloads"""
    hostname: str                          # Clear, specific field names
    fully_qualified_domain_name?: str
    environment_classification: "dev" | "staging" | "prod"
    cost_allocation_center: str
    operational_team_owner: str

# ✅ Good: Consistent prefixes for related schemas
schema K8sDeploymentSpec:
    """Kubernetes deployment specification"""
    replica_count: int
    container_definitions: [K8sContainerSpec]
    volume_mount_configs: [K8sVolumeMountSpec]

schema K8sContainerSpec:
    """Kubernetes container specification"""
    image_reference: str
    resource_requirements: K8sResourceRequirements

# ❌ Avoid: Ambiguous or inconsistent naming
schema Server:          # ❌ Too generic
    name: str           # ❌ Ambiguous - hostname? display name?
    env: str            # ❌ Unclear - environment? variables?
    cfg: {str: str}     # ❌ Cryptic abbreviations
```
### 2. Comprehensive Documentation
```kcl
import regex

# ✅ Good: Detailed documentation with examples
schema ServerConfiguration:
    """
    Production server configuration following company standards.

    This schema defines servers for multi-tier applications with
    proper security, monitoring, and operational requirements.

    Example:
        web_server: ServerConfiguration = ServerConfiguration {
            hostname: "prod-web-01"
            server_role: "frontend"
            environment: "production"
            cost_center: "engineering"
        }
    """
    # Core identification (required)
    hostname: str               # DNS-compliant hostname (RFC 1123)
    server_role: "frontend" | "backend" | "database" | "cache"

    # Environment and operational metadata
    environment: "development" | "staging" | "production"
    cost_center: str            # Billing allocation identifier
    primary_contact_team: str   # Team responsible for maintenance

    # Security and compliance
    security_zone: "dmz" | "internal" | "restricted"
    compliance_requirements: [str]  # e.g., ["pci", "sox", "hipaa"]

    # Optional operational settings
    backup_policy?: str         # Backup schedule identifier
    monitoring_profile?: str    # Monitoring configuration profile

    check:
        # Hostname validation (DNS RFC 1123)
        regex.match(hostname, "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$"), "Hostname must be DNS-compliant (RFC 1123): ${hostname}"
        # Environment-specific validations (if-guards express implications)
        len(primary_contact_team) > 0 if environment == "production", "Production servers must specify primary contact team"
        # Security requirements
        "encryption" in compliance_requirements if security_zone == "restricted", "Restricted zone servers must have encryption compliance"

# ❌ Avoid: Minimal or missing documentation
schema Srv:    # ❌ No documentation
    h: str     # ❌ No field documentation
    t: str     # ❌ Cryptic field names
```
### 3. Hierarchical Schema Design
```kcl
import regex

# ✅ Good: Base schemas with specialized extensions
schema BaseInfrastructureResource:
    """Foundation schema for all infrastructure resources"""
    # Universal metadata
    resource_name: str
    creation_timestamp?: str
    last_modified_timestamp?: str
    created_by_user?: str

    # Organizational metadata
    cost_center: str
    project_identifier: str
    environment: "dev" | "staging" | "prod"

    # Operational metadata
    tags: {str: str} = {}
    monitoring_enabled: bool = True

    check:
        len(resource_name) > 0 and len(resource_name) <= 63, "Resource name must be 1-63 characters"
        regex.match(resource_name, "^[a-z0-9]([a-z0-9-]*[a-z0-9])?$"), "Resource name must be DNS-label compatible"

schema ComputeResource(BaseInfrastructureResource):
    """Compute resources with CPU/memory specifications"""
    # Hardware specifications
    cpu_cores: int
    memory_gigabytes: int
    storage_gigabytes: int

    # Performance characteristics
    cpu_architecture: "x86_64" | "arm64"
    performance_tier: "burstable" | "standard" | "high_performance"

    check:
        cpu_cores > 0 and cpu_cores <= 128, "CPU cores must be between 1 and 128"
        memory_gigabytes > 0 and memory_gigabytes <= 1024, "Memory must be between 1GB and 1TB"

schema ManagedDatabaseResource(BaseInfrastructureResource):
    """Managed database service configuration"""
    # Database specifications
    database_engine: "postgresql" | "mysql" | "redis" | "mongodb"
    engine_version: str
    instance_class: str

    # High availability and backup
    multi_availability_zone: bool = False
    backup_retention_days: int = 7
    automated_backup_enabled: bool = True

    # Security
    encryption_at_rest: bool = True
    encryption_in_transit: bool = True

    check:
        multi_availability_zone == True if environment == "prod", "Production databases must enable multi-AZ"
        backup_retention_days >= 30 if environment == "prod", "Production databases need minimum 30 days backup retention"
```
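Instantiating a specialized schema exercises the base checks and the subclass checks together. A minimal sketch (hypothetical values, assuming the schemas above) of a production database that satisfies both layers:
```kcl
# Hypothetical instance: the base name/metadata checks plus the
# prod-specific multi-AZ and retention checks all apply.
orders_database: ManagedDatabaseResource = ManagedDatabaseResource {
    resource_name: "orders-db-01"      # validated by BaseInfrastructureResource
    cost_center: "engineering"
    project_identifier: "orders"
    environment: "prod"
    database_engine: "postgresql"
    engine_version: "15"
    instance_class: "db-standard-4"
    multi_availability_zone: True      # required because environment == "prod"
    backup_retention_days: 35          # prod requires >= 30
}
```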
### 4. Flexible Configuration Patterns
```kcl
# ✅ Good: Environment-aware defaults
schema EnvironmentAdaptiveConfiguration:
    """Configuration that adapts based on environment"""
    environment: "dev" | "staging" | "prod"

    # Computed defaults based on environment (KCL conditional expressions,
    # not C-style ternaries)
    default_timeout_seconds: int = 300 if environment == "prod" else (180 if environment == "staging" else 60)
    default_retry_attempts: int = 5 if environment == "prod" else (3 if environment == "staging" else 1)

    resource_allocation: ComputeResource = ComputeResource {
        resource_name: "default-compute"
        cost_center: "shared"
        project_identifier: "infrastructure"
        environment: environment
        # Environment-specific resource sizing
        cpu_cores: 4 if environment == "prod" else (2 if environment == "staging" else 1)
        memory_gigabytes: 8 if environment == "prod" else (4 if environment == "staging" else 2)
        storage_gigabytes: 100 if environment == "prod" else 50
        cpu_architecture: "x86_64"
        performance_tier: "high_performance" if environment == "prod" else "standard"
    }

    monitoring_configuration: MonitoringConfig = MonitoringConfig {
        collection_interval_seconds: 15 if environment == "prod" else 60
        retention_days: 90 if environment == "prod" else 30
        alert_thresholds: "strict" if environment == "prod" else "relaxed"
    }

# ✅ Good: Composable configuration with mixins
schema SecurityMixin:
    """Security-related configuration that can be mixed into other schemas"""
    encryption_enabled: bool = True
    access_logging_enabled: bool = True
    security_scan_enabled: bool = True

    # Security-specific validations
    check:
        encryption_enabled == True, "Encryption must be enabled for security compliance"

schema ComplianceMixin:
    """Compliance-related configuration"""
    compliance_frameworks: [str] = []
    audit_logging_enabled: bool = False
    data_retention_policy?: str

    check:
        audit_logging_enabled == True if len(compliance_frameworks) > 0, "Compliance frameworks require audit logging"

# KCL schemas support single inheritance; additional behavior is composed
# with the mixin statement
schema SecureComputeResource(ComputeResource):
    """Compute resource with security and compliance requirements"""
    mixin [SecurityMixin, ComplianceMixin]

    # Additional security requirements for compute
    secure_boot_enabled: bool = True
    encrypted_storage: bool = True

    check:
        # Parent and mixin validations still apply, plus this one
        encrypted_storage == True if "pci" in compliance_frameworks, "PCI compliance requires encrypted storage"
```
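A sketch of the composed schema in use (hypothetical values): the instance must simultaneously satisfy the compute checks, both mixins' checks, and the PCI rule.
```kcl
# Hypothetical instance of the composed schema
pci_compute: SecureComputeResource = SecureComputeResource {
    resource_name: "pci-app-01"
    cost_center: "payments"
    project_identifier: "cardholder-env"
    environment: "prod"
    cpu_cores: 8
    memory_gigabytes: 32
    storage_gigabytes: 200
    cpu_architecture: "x86_64"
    performance_tier: "high_performance"
    compliance_frameworks: ["pci"]
    audit_logging_enabled: True   # required by ComplianceMixin's check
    encrypted_storage: True       # required by the PCI rule
}
```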
## Workflow Patterns
### 1. Dependency Management
```kcl
# ✅ Good: Clear dependency patterns with proper error handling
schema InfrastructureWorkflow(main.BatchWorkflow):
    """Infrastructure deployment with proper dependency management"""
    # Categorize operations for dependency analysis
    foundation_operations: [str] = []   # Network, security groups, etc.
    compute_operations: [str] = []      # Servers, instances
    service_operations: [str] = []      # Applications, databases
    validation_operations: [str] = []   # Testing, health checks

    check:
        # Foundation must come first
        all_true([
            len([dep for dep in op.dependencies or [] if dep.target_operation_id in foundation_operations]) > 0
            for op in operations
            if op.operation_id in compute_operations
        ]) or len(compute_operations) == 0, "Compute operations must depend on foundation operations"
        # Services depend on compute
        all_true([
            len([dep for dep in op.dependencies or [] if dep.target_operation_id in compute_operations]) > 0
            for op in operations
            if op.operation_id in service_operations
        ]) or len(service_operations) == 0, "Service operations must depend on compute operations"

# Example usage with proper dependency chains
production_deployment: InfrastructureWorkflow = InfrastructureWorkflow {
    workflow_id: "prod-infra-2025-001"
    name: "Production Infrastructure Deployment"
    foundation_operations: ["create_vpc", "setup_security_groups"]
    compute_operations: ["create_web_servers", "create_db_servers"]
    service_operations: ["install_applications", "configure_databases"]
    validation_operations: ["run_health_checks", "validate_connectivity"]
    operations: [
        # Foundation layer
        main.BatchOperation {
            operation_id: "create_vpc"
            name: "Create VPC and Networking"
            operation_type: "custom"
            action: "create"
            parameters: {"cidr": "10.0.0.0/16"}
            priority: 10
            timeout: 600
        },
        # Compute layer (depends on foundation)
        main.BatchOperation {
            operation_id: "create_web_servers"
            name: "Create Web Servers"
            operation_type: "server"
            action: "create"
            parameters: {"count": "3", "type": "web"}
            dependencies: [
                main.DependencyDef {
                    target_operation_id: "create_vpc"
                    dependency_type: "sequential"
                    timeout: 300
                    fail_on_dependency_error: True
                }
            ]
            priority: 8
            timeout: 900
        },
        # Service layer (depends on compute)
        main.BatchOperation {
            operation_id: "install_applications"
            name: "Install Web Applications"
            operation_type: "taskserv"
            action: "create"
            parameters: {"apps": ["nginx", "prometheus"]}
            dependencies: [
                main.DependencyDef {
                    target_operation_id: "create_web_servers"
                    dependency_type: "conditional"
                    conditions: ["servers_ready", "ssh_accessible"]
                    timeout: 600
                }
            ]
            priority: 6
        }
    ]
}
```
```
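Category membership alone does not guarantee that a dependency target actually exists in the workflow. A standalone sketch (assuming the `production_deployment` value above) that asserts every dependency references a declared operation:
```kcl
# Sketch: verify all dependency targets are declared operation_ids
_declared_ids = [op.operation_id for op in production_deployment.operations]
_deps_resolve = all_true([
    all_true([dep.target_operation_id in _declared_ids for dep in op.dependencies or []])
    for op in production_deployment.operations
])
assert _deps_resolve, "A dependency references an undeclared operation_id"
```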
### 2. Multi-Environment Workflows
```kcl
# ✅ Good: Environment-specific workflow configurations
schema MultiEnvironmentWorkflow:
    """Workflow that adapts to different environments"""
    base_workflow: main.BatchWorkflow
    target_environment: "dev" | "staging" | "prod"

    # Environment-specific overrides
    environment_config: EnvironmentConfig = EnvironmentConfig {
        environment: target_environment
        # Reduce parallelism in production to limit blast radius
        max_parallel: 3 if target_environment == "prod" else 5
        # Adjust timeouts
        operation_timeout_multiplier: 1.5 if target_environment == "prod" else 1.0
        # Monitoring intensity
        monitoring_level: "comprehensive" if target_environment == "prod" else "basic"
    }

    # Generate final workflow with environment adaptations
    final_workflow: main.BatchWorkflow = main.BatchWorkflow {
        workflow_id: "${base_workflow.workflow_id}-${target_environment}"
        name: "${base_workflow.name} (${target_environment})"
        description: base_workflow.description
        operations: [
            main.BatchOperation {
                operation_id: op.operation_id
                name: op.name
                operation_type: op.operation_type
                provider: op.provider
                action: op.action
                parameters: op.parameters
                dependencies: op.dependencies
                # Environment-adapted timeout
                timeout: int(op.timeout * environment_config.operation_timeout_multiplier)
                priority: op.priority
                allow_parallel: op.allow_parallel
                # Environment-specific retry policy
                retry_policy: main.RetryPolicy {
                    max_attempts: 3 if target_environment == "prod" else 2
                    initial_delay: 30 if target_environment == "prod" else 10
                    backoff_multiplier: 2
                }
            }
            for op in base_workflow.operations
        ]
        max_parallel_operations: environment_config.max_parallel
        global_timeout: base_workflow.global_timeout
        fail_fast: False if target_environment == "prod" else True
        # Environment-specific storage
        storage: main.StorageConfig {
            backend: "surrealdb" if target_environment == "prod" else "filesystem"
            base_path: "./workflows/${target_environment}"
            enable_persistence: target_environment != "dev"
            retention_hours: 2160 if target_environment == "prod" else 168   # 90 days vs 1 week
        }
        # Environment-specific monitoring
        monitoring: main.MonitoringConfig {
            enabled: True
            backend: "prometheus"
            enable_tracing: target_environment == "prod"
            enable_notifications: target_environment != "dev"
            log_level: "debug" if target_environment == "dev" else "info"
        }
    }

# Usage for different environments
dev_deployment: MultiEnvironmentWorkflow = MultiEnvironmentWorkflow {
    target_environment: "dev"
    base_workflow: main.BatchWorkflow {
        workflow_id: "webapp-deploy"
        name: "Web Application Deployment"
        operations: [
            # ... base operations
        ]
    }
}

prod_deployment: MultiEnvironmentWorkflow = MultiEnvironmentWorkflow {
    target_environment: "prod"
    base_workflow: dev_deployment.base_workflow   # Reuse the same base workflow
}
```
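The executor then consumes the derived `final_workflow` values rather than the base definition; a usage sketch:
```kcl
# Usage sketch: extract the environment-adapted workflows
dev_workflow = dev_deployment.final_workflow     # filesystem storage, debug logging
prod_workflow = prod_deployment.final_workflow   # surrealdb storage, tracing enabled
```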
### 3. Error Recovery Patterns
```kcl
# ✅ Good: Comprehensive error recovery strategy
schema ResilientWorkflow(main.BatchWorkflow):
    """Workflow with advanced error recovery capabilities"""
    # Error categorization
    critical_operations: [str] = []   # Operations that cannot fail
    optional_operations: [str] = []   # Operations that can be skipped
    retry_operations: [str] = []      # Operations with custom retry logic

    # Recovery strategies
    global_error_strategy: "fail_fast" | "continue_on_error" | "intelligent" = "intelligent"

    # Enhanced operations with error handling
    enhanced_operations: [EnhancedBatchOperation] = [
        EnhancedBatchOperation {
            base_operation: op
            is_critical: op.operation_id in critical_operations
            is_optional: op.operation_id in optional_operations
            custom_retry: op.operation_id in retry_operations
            # Adaptive retry policy based on operation characteristics
            adaptive_retry_policy: main.RetryPolicy {
                max_attempts: 5 if is_critical else (1 if is_optional else 3)
                initial_delay: 60 if is_critical else 30
                max_delay: 900 if is_critical else 300
                backoff_multiplier: 2
                retry_on_errors: [
                    "timeout",
                    "connection_error",
                    "rate_limit"
                ] + (["resource_unavailable", "quota_exceeded"] if is_critical else [])
            }
            # Adaptive rollback strategy
            adaptive_rollback_strategy: main.RollbackStrategy {
                enabled: True
                strategy: "manual" if is_critical else "immediate"
                preserve_partial_state: is_critical
                custom_rollback_operations: [
                    "notify_engineering_team",
                    "create_incident_ticket",
                    "preserve_debug_info"
                ] if is_critical else []
            }
        }
        for op in operations
    ]

schema EnhancedBatchOperation:
    """Batch operation with enhanced error handling"""
    base_operation: main.BatchOperation
    is_critical: bool = False
    is_optional: bool = False
    custom_retry: bool = False
    adaptive_retry_policy: main.RetryPolicy
    adaptive_rollback_strategy: main.RollbackStrategy

    # Circuit breaker pattern
    failure_threshold: int = 3
    recovery_timeout_seconds: int = 300

    check:
        not (is_critical and is_optional), "Operation cannot be both critical and optional"
```
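A usage sketch (hypothetical operation ids) showing how classification drives the derived retry and rollback behavior:
```kcl
# Hypothetical classification: database creation is critical,
# dashboard installation may be skipped.
resilient_rollout: ResilientWorkflow = ResilientWorkflow {
    workflow_id: "payments-rollout-001"
    name: "Payments Service Rollout"
    critical_operations: ["create_database"]
    optional_operations: ["install_dashboards"]
    operations: [
        main.BatchOperation {
            operation_id: "create_database"
            name: "Create Payments Database"
            operation_type: "server"
            action: "create"
            parameters: {"service": "postgresql"}
        },
        main.BatchOperation {
            operation_id: "install_dashboards"
            name: "Install Dashboards"
            operation_type: "taskserv"
            action: "create"
            parameters: {"apps": "grafana"}
        }
    ]
}
# enhanced_operations[0] gets 5 retry attempts and manual rollback;
# enhanced_operations[1] gets a single attempt and immediate rollback.
```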
## Error Handling
### 1. Graceful Degradation
```kcl
# ✅ Good: Graceful degradation for non-critical components
schema GracefulDegradationWorkflow(main.BatchWorkflow):
    """Workflow that can degrade gracefully on partial failures"""
    # Categorize operations by importance
    core_operations: [str] = []         # Must succeed
    enhancement_operations: [str] = []  # Nice to have
    monitoring_operations: [str] = []   # Can be skipped if needed

    # Minimum viable deployment definition
    minimum_viable_operations: [str] = core_operations

    # Degradation strategy
    degradation_policy: DegradationPolicy = DegradationPolicy {
        allow_partial_deployment: True
        minimum_success_percentage: 80.0
        operation_priorities: {
            # Core operations (must succeed)
            "${op_id}": 10 for op_id in core_operations
        } | {
            # Enhancement operations (should succeed)
            "${op_id}": 5 for op_id in enhancement_operations
        } | {
            # Monitoring operations (can fail)
            "${op_id}": 1 for op_id in monitoring_operations
        }
    }

    check:
        # Ensure minimum viable deployment is achievable
        len(minimum_viable_operations) > 0, "Must specify at least one operation for minimum viable deployment"
        # Core operations should not depend on enhancement operations
        all_true([
            all_true([
                dep.target_operation_id not in enhancement_operations
                for dep in op.dependencies or []
            ])
            for op in operations
            if op.operation_id in core_operations
        ]), "Core operations should not depend on enhancement operations"

schema DegradationPolicy:
    """Policy for graceful degradation"""
    allow_partial_deployment: bool = False
    minimum_success_percentage: float = 100.0
    operation_priorities: {str: int} = {}

    # Fallback configurations
    fallback_configurations: {str: str} = {}
    emergency_contacts: [str] = []

    check:
        0.0 <= minimum_success_percentage and minimum_success_percentage <= 100.0, "Success percentage must be between 0 and 100"
```
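A small usage sketch (hypothetical operation ids): with the 80% threshold, a failed monitoring operation leaves the deployment viable as long as every core operation succeeds.
```kcl
# Hypothetical categorization feeding the computed priorities
checkout_rollout: GracefulDegradationWorkflow = GracefulDegradationWorkflow {
    workflow_id: "checkout-rollout-007"
    name: "Checkout Service Rollout"
    core_operations: ["create_servers", "deploy_app"]
    enhancement_operations: ["warm_caches"]
    monitoring_operations: ["install_exporters"]
    operations: [
        main.BatchOperation {
            operation_id: "create_servers"
            name: "Create Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
        }
        # ... remaining operations elided
    ]
}
```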
### 2. Circuit Breaker Patterns
```kcl
# ✅ Good: Circuit breaker for external dependencies
schema CircuitBreakerOperation(main.BatchOperation):
    """Operation with circuit breaker pattern for external dependencies"""
    # Circuit breaker configuration
    circuit_breaker_enabled: bool = False
    failure_threshold: int = 5
    recovery_timeout_seconds: int = 300

    # Health check configuration
    health_check_endpoint?: str
    health_check_interval_seconds: int = 30

    # Fallback behavior
    fallback_enabled: bool = False
    fallback_operation?: main.BatchOperation

    check:
        failure_threshold > 0 if circuit_breaker_enabled, "Circuit breaker must have positive failure threshold"
        recovery_timeout_seconds > 0 if circuit_breaker_enabled, "Circuit breaker must have positive recovery timeout"
        fallback_operation != Undefined if fallback_enabled, "Fallback requires fallback operation definition"

# Example: Database operation with circuit breaker
database_operation_with_circuit_breaker: CircuitBreakerOperation = CircuitBreakerOperation {
    # Base operation
    operation_id: "setup_database"
    name: "Setup Production Database"
    operation_type: "server"
    action: "create"
    parameters: {"service": "postgresql", "version": "15"}
    timeout: 1800

    # Circuit breaker settings
    circuit_breaker_enabled: True
    failure_threshold: 3
    recovery_timeout_seconds: 600

    # Health monitoring
    health_check_endpoint: "http://db-health.internal/health"
    health_check_interval_seconds: 60

    # Fallback to read replica
    fallback_enabled: True
    fallback_operation: main.BatchOperation {
        operation_id: "setup_database_readonly"
        name: "Setup Read-Only Database Fallback"
        operation_type: "server"
        action: "create"
        parameters: {"service": "postgresql", "mode": "readonly"}
        timeout: 900
    }
}
```
## Performance Optimization
### 1. Parallel Execution Strategies
```kcl
# ✅ Good: Intelligent parallelization
schema OptimizedParallelWorkflow(main.BatchWorkflow):
    """Workflow optimized for parallel execution"""
    # Parallel execution groups
    parallel_groups: [[str]] = []   # Groups of operations that can run in parallel

    # Resource-aware scheduling
    resource_requirements: {str: ResourceRequirement} = {}
    total_available_resources: ResourceCapacity = ResourceCapacity {
        max_cpu_cores: 16
        max_memory_gb: 64
        max_network_bandwidth_mbps: 1000
        max_concurrent_operations: 10
    }

    # Computed optimal parallelism
    optimal_parallel_limit: int = min([
        total_available_resources.max_concurrent_operations,
        len(operations),
        8   # Reasonable default maximum
    ])

    # Generate workflow with optimized settings
    optimized_workflow: main.BatchWorkflow = main.BatchWorkflow {
        workflow_id: workflow_id
        name: name
        description: description
        operations: [
            OptimizedBatchOperation {
                base_operation: op
                resource_hint: resource_requirements[op.operation_id] or ResourceRequirement {
                    cpu_cores: 1
                    memory_gb: 2
                    estimated_duration_seconds: int(op.timeout / 2)
                }
                # Enable parallelism for operations in parallel groups
                computed_allow_parallel: any_true([
                    op.operation_id in group and len(group) > 1
                    for group in parallel_groups
                ])
            }.optimized_operation
            for op in operations
        ]
        max_parallel_operations: optimal_parallel_limit
        global_timeout: global_timeout
        fail_fast: fail_fast
        # Optimize storage for performance
        storage: main.StorageConfig {
            backend: "surrealdb"        # Better for concurrent access
            enable_compression: False   # Trade space for speed
            connection_config: {
                "connection_pool_size": str(optimal_parallel_limit * 2)
                "max_retries": "3"
                "timeout": "30"
            }
        }
    }

schema OptimizedBatchOperation:
    """Batch operation with performance optimizations"""
    base_operation: main.BatchOperation
    resource_hint: ResourceRequirement
    computed_allow_parallel: bool

    # Performance-optimized operation
    optimized_operation: main.BatchOperation = main.BatchOperation {
        operation_id: base_operation.operation_id
        name: base_operation.name
        operation_type: base_operation.operation_type
        provider: base_operation.provider
        action: base_operation.action
        parameters: base_operation.parameters
        dependencies: base_operation.dependencies
        # Optimized settings
        timeout: max([base_operation.timeout, resource_hint.estimated_duration_seconds * 2])
        allow_parallel: computed_allow_parallel
        priority: base_operation.priority
        # Performance-oriented retry policy
        retry_policy: main.RetryPolicy {
            max_attempts: 2   # Fewer retries for faster failure detection
            initial_delay: 10
            max_delay: 60
            backoff_multiplier: 1.5
            retry_on_errors: ["timeout", "rate_limit"]   # Only retry fast-failing errors
        }
    }

schema ResourceRequirement:
    """Resource requirements for performance planning"""
    cpu_cores: int = 1
    memory_gb: int = 2
    estimated_duration_seconds: int = 300
    io_intensive: bool = False
    network_intensive: bool = False

schema ResourceCapacity:
    """Available resource capacity"""
    max_cpu_cores: int
    max_memory_gb: int
    max_network_bandwidth_mbps: int
    max_concurrent_operations: int
```
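How the pieces fit together (a sketch with hypothetical operation ids): declare resource hints and parallel groups, and the schema derives the parallelism settings.
```kcl
# Hypothetical usage: two independent server builds may run in parallel
optimized_build: OptimizedParallelWorkflow = OptimizedParallelWorkflow {
    workflow_id: "batch-build-003"
    name: "Batch Server Build"
    parallel_groups: [["build_web", "build_cache"]]
    resource_requirements: {
        "build_web": ResourceRequirement {cpu_cores: 2, memory_gb: 4, estimated_duration_seconds: 600}
        "build_cache": ResourceRequirement {cpu_cores: 1, memory_gb: 2, estimated_duration_seconds: 300}
    }
    operations: [
        main.BatchOperation {
            operation_id: "build_web"
            name: "Build Web Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
        },
        main.BatchOperation {
            operation_id: "build_cache"
            name: "Build Cache Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
        }
    ]
}
# optimized_build.optimal_parallel_limit == 2 (min of capacity 10, 2 ops, cap 8)
```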
### 2. Caching and Memoization
```kcl
# ✅ Good: Caching for expensive operations
schema CachedOperation(main.BatchOperation):
    """Operation with caching capabilities"""
    # Caching configuration
    cache_enabled: bool = False
    cache_key_template: str = "${operation_id}-${provider}-${action}"
    cache_ttl_seconds: int = 3600   # 1 hour default

    # Cache invalidation rules
    cache_invalidation_triggers: [str] = []
    force_cache_refresh: bool = False

    # Computed cache key
    computed_cache_key: str = "${operation_id}-${provider}-${action}"

    # Cache-aware timeout (shorter if a cache hit is expected)
    cache_aware_timeout: int = int(timeout / 2) if cache_enabled else timeout

    check:
        cache_ttl_seconds > 0 if cache_enabled, "Cache TTL must be positive when caching is enabled"

# Example: Cached provider operations
cached_server_creation: CachedOperation = CachedOperation {
    # Base operation
    operation_id: "create_standardized_servers"
    name: "Create Standardized Web Servers"
    operation_type: "server"
    provider: "upcloud"
    action: "create"
    parameters: {
        "plan": "2xCPU-4GB"
        "zone": "fi-hel2"
        "image": "ubuntu-22.04"
    }
    timeout: 900

    # Caching settings
    cache_enabled: True
    cache_key_template: "server-${plan}-${zone}-${image}"
    cache_ttl_seconds: 7200   # 2 hours

    # Cache invalidation
    cache_invalidation_triggers: ["image_updated", "plan_changed"]
}
```
## Security Considerations
### 1. Secure Configuration Management
```kcl
# ✅ Good: Secure configuration with proper secret handling
schema SecureConfiguration:
    """Security-first configuration management"""
    # Secret management
    secrets_provider: main.SecretProvider = main.SecretProvider {
        provider: "sops"
        sops_config: main.SopsConfig {
            config_path: "./.sops.yaml"
            age_key_file: "{{env.HOME}}/.config/sops/age/keys.txt"
            use_age: True
        }
    }

    # Security classifications
    data_classification: "public" | "internal" | "confidential" | "restricted"
    encryption_required: bool = data_classification != "public"
    audit_logging_required: bool = data_classification in ["confidential", "restricted"]
    audit_log_destinations: [str] = []   # e.g., SIEM endpoints, object storage

    # Access control
    allowed_environments: [str] = ["dev", "staging", "prod"]
    environment_access_matrix: {str: [str]} = {
        "dev": ["developers", "qa_team"]
        "staging": ["developers", "qa_team", "release_team"]
        "prod": ["release_team", "operations_team"]
    }

    # Network security
    network_isolation_required: bool = data_classification in ["confidential", "restricted"]
    vpc_isolation: bool = network_isolation_required
    private_subnets_only: bool = data_classification == "restricted"

    check:
        encryption_required == True if data_classification == "restricted", "Restricted data must be encrypted"
        len(audit_log_destinations) > 0 if audit_logging_required, "Audit logging destinations must be specified for sensitive data"

# Example: Production security configuration
production_security: SecureConfiguration = SecureConfiguration {
    data_classification: "confidential"
    # encryption_required automatically becomes True
    # audit_logging_required automatically becomes True
    # network_isolation_required automatically becomes True
    allowed_environments: ["staging", "prod"]
    environment_access_matrix: {
        "staging": ["release_team", "security_team"]
        "prod": ["operations_team", "security_team"]
    }
    audit_log_destinations: [
        "siem://security.company.com",
        "s3://audit-logs-prod/workflows"
    ]
}
```
### 2. Compliance and Auditing
```kcl
# ✅ Good: Compliance-aware workflow design
schema RetentionRequirements:
    """Data retention requirements based on compliance"""
    workflow_data_hours: int = 8760       # 1 year default
    audit_log_hours: int = 26280          # 3 years default
    backup_retention_hours: int = 43800   # 5 years default

schema ComplianceMetadata:
    """Metadata for compliance requirements"""
    frameworks: [str]
    audit_trail_required: bool
    data_residency_requirements: [str]
    retention_requirements: RetentionRequirements

# Helper for retention requirements. KCL has no Python-style `def`;
# reusable logic is written as a lambda, declared before first use.
get_retention_requirements = lambda frameworks: [str] -> RetentionRequirements {
    _result = RetentionRequirements {
        workflow_data_hours: 8760       # 1 year default
        audit_log_hours: 26280          # 3 years default
        backup_retention_hours: 43800   # 5 years default
    }
    if "sox" in frameworks:
        _result = RetentionRequirements {
            workflow_data_hours: 43800      # 5 years
            audit_log_hours: 61320          # 7 years
            backup_retention_hours: 87600   # 10 years
        }
    elif "pci" in frameworks:
        _result = RetentionRequirements {
            workflow_data_hours: 8760       # 1 year
            audit_log_hours: 26280          # 3 years
            backup_retention_hours: 43800   # 5 years
        }
    _result
}

schema ComplianceAwareBatchOperation:
    """Batch operation with compliance awareness"""
    base_operation: main.BatchOperation
    compliance_metadata: ComplianceMetadata

    compliant_operation: main.BatchOperation = main.BatchOperation {
        operation_id: base_operation.operation_id
        name: base_operation.name
        operation_type: base_operation.operation_type
        provider: base_operation.provider
        action: base_operation.action
        parameters: base_operation.parameters | ({
            "audit_enabled": "true"
            "compliance_mode": "strict"
        } if compliance_metadata.audit_trail_required else {})
        dependencies: base_operation.dependencies
        timeout: base_operation.timeout
        allow_parallel: base_operation.allow_parallel
        priority: base_operation.priority
        # Enhanced retry for compliance
        retry_policy: main.RetryPolicy {
            max_attempts: 5 if compliance_metadata.audit_trail_required else 3
            initial_delay: 30
            max_delay: 300
            backoff_multiplier: 2
            retry_on_errors: ["timeout", "connection_error", "rate_limit"]
        }
        # Conservative rollback for compliance
        rollback_strategy: main.RollbackStrategy {
            enabled: True
            strategy: "manual"   # Manual approval for compliance
            preserve_partial_state: True
            rollback_timeout: 1800
            custom_rollback_operations: [
                "create_audit_entry",
                "notify_compliance_team",
                "preserve_evidence"
            ]
        }
    }

schema ComplianceWorkflow(main.BatchWorkflow):
    """Workflow with built-in compliance features"""
    # Compliance framework requirements
    compliance_frameworks: [str] = []
    compliance_metadata: ComplianceMetadata = ComplianceMetadata {
        frameworks: compliance_frameworks
        audit_trail_required: "sox" in compliance_frameworks or "pci" in compliance_frameworks
        data_residency_requirements: ["eu"] if "gdpr" in compliance_frameworks else []
        retention_requirements: get_retention_requirements(compliance_frameworks)
    }

    # Enhanced workflow with compliance features
    compliant_workflow: main.BatchWorkflow = main.BatchWorkflow {
        workflow_id: workflow_id
        name: name
        description: description
        operations: [
            ComplianceAwareBatchOperation {
                base_operation: op
                compliance_metadata: compliance_metadata
            }.compliant_operation
            for op in operations
        ]
        # Compliance-aware storage
        storage: main.StorageConfig {
            backend: "surrealdb"
            enable_persistence: True
            retention_hours: compliance_metadata.retention_requirements.workflow_data_hours
            enable_compression: False   # For audit clarity
            encryption: main.SecretProvider {
                provider: "sops"
                sops_config: main.SopsConfig {
                    config_path: "./.sops.yaml"
                    age_key_file: "{{env.HOME}}/.config/sops/age/keys.txt"
                    use_age: True
                }
            } if compliance_metadata.audit_trail_required else Undefined
        }
        # Compliance-aware monitoring
        monitoring: main.MonitoringConfig {
            enabled: True
            backend: "prometheus"
            enable_tracing: compliance_metadata.audit_trail_required
            enable_notifications: True
            log_level: "info"
            collection_interval: 15 if compliance_metadata.audit_trail_required else 30
        }
        # Audit trail in execution context
        execution_context: execution_context | {
            "compliance_frameworks": str(compliance_frameworks)
            "audit_trail_enabled": str(compliance_metadata.audit_trail_required)
            "data_classification": "confidential"
        }
    }
```
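A quick sanity check of the helper (usage sketch; the expected values follow from the branches above):
```kcl
# SOX dominates when multiple frameworks are declared
sox_retention = get_retention_requirements(["sox", "gdpr"])
# sox_retention.audit_log_hours == 61320 (7 years)

pci_retention = get_retention_requirements(["pci"])
# pci_retention.workflow_data_hours == 8760 (1 year)
```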
## Testing Strategies
### 1. Schema Testing
```bash
#!/bin/bash
# Schema testing script

# Test 1: Basic syntax validation
echo "Testing schema syntax..."
find . -name "*.k" -exec kcl fmt {} \;

# Test 2: Schema compilation
echo "Testing schema compilation..."
for file in *.k; do
    echo "Testing $file"
    kcl run "$file" > /dev/null || echo "FAILED: $file"
done

# Test 3: Constraint validation
echo "Testing constraints..."
kcl run test_constraints.k

# Test 4: JSON serialization
echo "Testing JSON serialization..."
kcl run examples/simple_workflow.k --format json | jq '.' > /dev/null

# Test 5: Cross-schema compatibility
echo "Testing cross-schema compatibility..."
kcl run integration_test.k
```
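The script assumes a `test_constraints.k` exists; a minimal sketch of what such a file might contain (hypothetical, reusing the hostname rule from earlier):
```kcl
# test_constraints.k (hypothetical) - exercise a constraint in isolation
import regex

schema HostnameProbe:
    hostname: str
    check:
        regex.match(hostname, "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$"), "invalid hostname: ${hostname}"

passing_case = HostnameProbe {hostname: "prod-web-01"}
# Uncomment to confirm the constraint rejects bad input (kcl run exits non-zero):
# failing_case = HostnameProbe {hostname: "-bad-hostname-"}
```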
### 2. Validation Testing
```kcl
# Test configurations for validation (top-level values, one per case)

# Valid case
valid_server: main.Server = main.Server {
    hostname: "test-01"
    title: "Test Server"
    labels: "env: test"
    user: "test"
}

# Edge case: smallest workflow that should still validate
minimal_workflow: main.BatchWorkflow = main.BatchWorkflow {
    workflow_id: "minimal"
    name: "Minimal Test Workflow"
    operations: [
        main.BatchOperation {
            operation_id: "test_op"
            name: "Test Operation"
            operation_type: "custom"
            action: "test"
            parameters: {}
        }
    ]
}

# Boundary testing
max_timeout_operation: main.BatchOperation = main.BatchOperation {
    operation_id: "max_timeout"
    name: "Maximum Timeout Test"
    operation_type: "custom"
    action: "test"
    parameters: {}
    timeout: 86400   # 24 hours - test upper boundary
}
```
## Maintenance Guidelines
### 1. Schema Evolution
```kcl
# ✅ Good: Backward-compatible schema evolution
schema ServerV2(main.Server):
    """Enhanced server schema with backward compatibility"""
    # New optional fields (backward compatible)
    performance_profile?: "standard" | "high_performance" | "burstable"
    auto_scaling_enabled: bool = False

    # Deprecated fields (marked but still supported)
    deprecated_field?: str   # TODO: Remove in v3.0

    # Version metadata
    schema_version: str = "2.0"

    check:
        # Maintain existing validations
        len(hostname) > 0, "Hostname required"
        len(title) > 0, "Title required"
        # New validations for new fields
        performance_profile != "burstable" if auto_scaling_enabled and performance_profile != Undefined, "Auto-scaling not compatible with burstable performance profile"

# Migration helper
schema ServerMigration:
    """Helper for migrating from ServerV1 to ServerV2"""
    v1_server: main.Server
    v2_server: ServerV2 = ServerV2 {
        # Copy all existing fields
        hostname: v1_server.hostname
        title: v1_server.title
        labels: v1_server.labels
        user: v1_server.user
        # Set defaults for new fields
        performance_profile: "standard"
        auto_scaling_enabled: False
        # Copy optional fields if they exist
        taskservs: v1_server.taskservs
        cluster: v1_server.cluster
    }
```
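Applying the helper (a sketch with hypothetical values): construct it with the legacy definition and read the upgraded value back.
```kcl
# Hypothetical legacy server being migrated
legacy_web: main.Server = main.Server {
    hostname: "web-01"
    title: "Web Server 01"
    labels: "env: prod"
    user: "admin"
}

web_v2: ServerV2 = ServerMigration {v1_server: legacy_web}.v2_server
# web_v2.schema_version == "2.0"; performance_profile defaults to "standard"
```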
### 2. Documentation Updates
```kcl
# ✅ Good: Self-documenting schemas with examples
schema DocumentedWorkflow(main.BatchWorkflow):
    """
    Production workflow with comprehensive documentation.

    This workflow follows company best practices for:
    - Multi-environment deployment
    - Error handling and recovery
    - Security and compliance
    - Performance optimization

    Example Usage:
        prod_workflow: DocumentedWorkflow = DocumentedWorkflow {
            environment: "prod"
            security_level: "high"
            base_workflow: main.BatchWorkflow {
                workflow_id: "webapp-deploy-001"
                name: "Web Application Deployment"
                operations: [...]
            }
        }

    See Also:
        - examples/production_workflow.k
        - docs/WORKFLOW_PATTERNS.md
        - docs/SECURITY_GUIDELINES.md
    """
    # Required metadata for documentation
    environment: "dev" | "staging" | "prod"
    security_level: "low" | "medium" | "high"
    base_workflow: main.BatchWorkflow

    # Auto-generated documentation fields
    documentation_generated_at: str = "{{now.date}}"
    schema_version: str = "1.0"

    check:
        security_level == "high" if environment == "prod", "Production workflows must use high security level"
```
This comprehensive best practices guide provides the foundation for creating maintainable, secure, and performant KCL configurations for the provisioning system.