# KCL Best Practices for Provisioning
This document outlines best practices for using and developing with the provisioning KCL package, covering schema design, workflow patterns, and operational guidelines.
## Table of Contents
- [Schema Design](#schema-design)
- [Workflow Patterns](#workflow-patterns)
- [Error Handling](#error-handling)
- [Performance Optimization](#performance-optimization)
- [Security Considerations](#security-considerations)
- [Testing Strategies](#testing-strategies)
- [Maintenance Guidelines](#maintenance-guidelines)
## Schema Design
### 1. Clear Naming Conventions
```kcl
# ✅ Good: Descriptive, consistent naming
schema ProductionWebServer:
    """Web server optimized for production workloads"""
    hostname: str                          # Clear, specific field names
    fully_qualified_domain_name?: str
    environment_classification: "dev" | "staging" | "prod"
    cost_allocation_center: str
    operational_team_owner: str

# ✅ Good: Consistent prefixes for related schemas
schema K8sDeploymentSpec:
    """Kubernetes deployment specification"""
    replica_count: int
    container_definitions: [K8sContainerSpec]
    volume_mount_configs: [K8sVolumeMountSpec]

schema K8sContainerSpec:
    """Kubernetes container specification"""
    image_reference: str
    resource_requirements: K8sResourceRequirements

# ❌ Avoid: Ambiguous or inconsistent naming
schema Server:          # ❌ Too generic
    name: str           # ❌ Ambiguous - hostname? display name?
    env: str            # ❌ Unclear - environment? variables?
    cfg: {str: str}     # ❌ Cryptic abbreviations
```
### 2. Comprehensive Documentation
```kcl
import regex

# ✅ Good: Detailed documentation with examples
schema ServerConfiguration:
    """
    Production server configuration following company standards.

    This schema defines servers for multi-tier applications with
    proper security, monitoring, and operational requirements.

    Example:
        web_server: ServerConfiguration = ServerConfiguration {
            hostname: "prod-web-01"
            server_role: "frontend"
            environment: "production"
            cost_center: "engineering"
        }
    """
    # Core identification (required)
    hostname: str               # DNS-compliant hostname (RFC 1123)
    server_role: "frontend" | "backend" | "database" | "cache"

    # Environment and operational metadata
    environment: "development" | "staging" | "production"
    cost_center: str            # Billing allocation identifier
    primary_contact_team: str   # Team responsible for maintenance

    # Security and compliance
    security_zone: "dmz" | "internal" | "restricted"
    compliance_requirements: [str]  # e.g., ["pci", "sox", "hipaa"]

    # Optional operational settings
    backup_policy?: str         # Backup schedule identifier
    monitoring_profile?: str    # Monitoring configuration profile

    check:
        # Hostname validation (DNS RFC 1123)
        regex.match(hostname, "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$"), "Hostname must be DNS-compliant (RFC 1123): ${hostname}"
        # Environment-specific validations (if-guards express implications)
        len(primary_contact_team) > 0 if environment == "production", "Production servers must specify primary contact team"
        # Security requirements
        "encryption" in compliance_requirements if security_zone == "restricted", "Restricted zone servers must have encryption compliance"

# ❌ Avoid: Minimal or missing documentation
schema Srv:    # ❌ No documentation
    h: str     # ❌ No field documentation
    t: str     # ❌ Cryptic field names
```
### 3. Hierarchical Schema Design
```kcl
import regex

# ✅ Good: Base schemas with specialized extensions
schema BaseInfrastructureResource:
    """Foundation schema for all infrastructure resources"""
    # Universal metadata
    resource_name: str
    creation_timestamp?: str
    last_modified_timestamp?: str
    created_by_user?: str

    # Organizational metadata
    cost_center: str
    project_identifier: str
    environment: "dev" | "staging" | "prod"

    # Operational metadata
    tags: {str: str} = {}
    monitoring_enabled: bool = True

    check:
        len(resource_name) > 0 and len(resource_name) <= 63, "Resource name must be 1-63 characters"
        regex.match(resource_name, "^[a-z0-9]([a-z0-9-]*[a-z0-9])?$"), "Resource name must be DNS-label compatible"

schema ComputeResource(BaseInfrastructureResource):
    """Compute resources with CPU/memory specifications"""
    # Hardware specifications
    cpu_cores: int
    memory_gigabytes: int
    storage_gigabytes: int

    # Performance characteristics
    cpu_architecture: "x86_64" | "arm64"
    performance_tier: "burstable" | "standard" | "high_performance"

    check:
        cpu_cores > 0 and cpu_cores <= 128, "CPU cores must be between 1 and 128"
        memory_gigabytes > 0 and memory_gigabytes <= 1024, "Memory must be between 1GB and 1TB"

schema ManagedDatabaseResource(BaseInfrastructureResource):
    """Managed database service configuration"""
    # Database specifications
    database_engine: "postgresql" | "mysql" | "redis" | "mongodb"
    engine_version: str
    instance_class: str

    # High availability and backup
    multi_availability_zone: bool = False
    backup_retention_days: int = 7
    automated_backup_enabled: bool = True

    # Security
    encryption_at_rest: bool = True
    encryption_in_transit: bool = True

    check:
        multi_availability_zone == True if environment == "prod", "Production databases must enable multi-AZ"
        backup_retention_days >= 30 if environment == "prod", "Production databases need minimum 30 days backup retention"
```
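Instantiating a specialized schema exercises the base checks and the subclass checks together. A minimal sketch (hypothetical values, assuming the schemas above) of a production database that satisfies both layers:
```kcl
# Hypothetical instance: the base name/metadata checks plus the
# prod-specific multi-AZ and retention checks all apply.
orders_database: ManagedDatabaseResource = ManagedDatabaseResource {
    resource_name: "orders-db-01"      # validated by BaseInfrastructureResource
    cost_center: "engineering"
    project_identifier: "orders"
    environment: "prod"
    database_engine: "postgresql"
    engine_version: "15"
    instance_class: "db-standard-4"
    multi_availability_zone: True      # required because environment == "prod"
    backup_retention_days: 35          # prod requires >= 30
}
```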
### 4. Flexible Configuration Patterns
```kcl
# ✅ Good: Environment-aware defaults
schema EnvironmentAdaptiveConfiguration:
    """Configuration that adapts based on environment"""
    environment: "dev" | "staging" | "prod"

    # Computed defaults based on environment (KCL conditional expressions,
    # not C-style ternaries)
    default_timeout_seconds: int = 300 if environment == "prod" else (180 if environment == "staging" else 60)
    default_retry_attempts: int = 5 if environment == "prod" else (3 if environment == "staging" else 1)

    resource_allocation: ComputeResource = ComputeResource {
        resource_name: "default-compute"
        cost_center: "shared"
        project_identifier: "infrastructure"
        environment: environment
        # Environment-specific resource sizing
        cpu_cores: 4 if environment == "prod" else (2 if environment == "staging" else 1)
        memory_gigabytes: 8 if environment == "prod" else (4 if environment == "staging" else 2)
        storage_gigabytes: 100 if environment == "prod" else 50
        cpu_architecture: "x86_64"
        performance_tier: "high_performance" if environment == "prod" else "standard"
    }

    monitoring_configuration: MonitoringConfig = MonitoringConfig {
        collection_interval_seconds: 15 if environment == "prod" else 60
        retention_days: 90 if environment == "prod" else 30
        alert_thresholds: "strict" if environment == "prod" else "relaxed"
    }

# ✅ Good: Composable configuration with mixins
schema SecurityMixin:
    """Security-related configuration that can be mixed into other schemas"""
    encryption_enabled: bool = True
    access_logging_enabled: bool = True
    security_scan_enabled: bool = True

    # Security-specific validations
    check:
        encryption_enabled == True, "Encryption must be enabled for security compliance"

schema ComplianceMixin:
    """Compliance-related configuration"""
    compliance_frameworks: [str] = []
    audit_logging_enabled: bool = False
    data_retention_policy?: str

    check:
        audit_logging_enabled == True if len(compliance_frameworks) > 0, "Compliance frameworks require audit logging"

# KCL schemas support single inheritance; additional behavior is composed
# with the mixin statement
schema SecureComputeResource(ComputeResource):
    """Compute resource with security and compliance requirements"""
    mixin [SecurityMixin, ComplianceMixin]

    # Additional security requirements for compute
    secure_boot_enabled: bool = True
    encrypted_storage: bool = True

    check:
        # Parent and mixin validations still apply, plus this one
        encrypted_storage == True if "pci" in compliance_frameworks, "PCI compliance requires encrypted storage"
```
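A sketch of the composed schema in use (hypothetical values): the instance must simultaneously satisfy the compute checks, both mixins' checks, and the PCI rule.
```kcl
# Hypothetical instance of the composed schema
pci_compute: SecureComputeResource = SecureComputeResource {
    resource_name: "pci-app-01"
    cost_center: "payments"
    project_identifier: "cardholder-env"
    environment: "prod"
    cpu_cores: 8
    memory_gigabytes: 32
    storage_gigabytes: 200
    cpu_architecture: "x86_64"
    performance_tier: "high_performance"
    compliance_frameworks: ["pci"]
    audit_logging_enabled: True   # required by ComplianceMixin's check
    encrypted_storage: True       # required by the PCI rule
}
```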
## Workflow Patterns
### 1. Dependency Management
```kcl
# ✅ Good: Clear dependency patterns with proper error handling
schema InfrastructureWorkflow(main.BatchWorkflow):
    """Infrastructure deployment with proper dependency management"""
    # Categorize operations for dependency analysis
    foundation_operations: [str] = []   # Network, security groups, etc.
    compute_operations: [str] = []      # Servers, instances
    service_operations: [str] = []      # Applications, databases
    validation_operations: [str] = []   # Testing, health checks

    check:
        # Foundation must come first
        all_true([
            len([dep for dep in op.dependencies or [] if dep.target_operation_id in foundation_operations]) > 0
            for op in operations
            if op.operation_id in compute_operations
        ]) or len(compute_operations) == 0, "Compute operations must depend on foundation operations"
        # Services depend on compute
        all_true([
            len([dep for dep in op.dependencies or [] if dep.target_operation_id in compute_operations]) > 0
            for op in operations
            if op.operation_id in service_operations
        ]) or len(service_operations) == 0, "Service operations must depend on compute operations"

# Example usage with proper dependency chains
production_deployment: InfrastructureWorkflow = InfrastructureWorkflow {
    workflow_id: "prod-infra-2025-001"
    name: "Production Infrastructure Deployment"
    foundation_operations: ["create_vpc", "setup_security_groups"]
    compute_operations: ["create_web_servers", "create_db_servers"]
    service_operations: ["install_applications", "configure_databases"]
    validation_operations: ["run_health_checks", "validate_connectivity"]
    operations: [
        # Foundation layer
        main.BatchOperation {
            operation_id: "create_vpc"
            name: "Create VPC and Networking"
            operation_type: "custom"
            action: "create"
            parameters: {"cidr": "10.0.0.0/16"}
            priority: 10
            timeout: 600
        },
        # Compute layer (depends on foundation)
        main.BatchOperation {
            operation_id: "create_web_servers"
            name: "Create Web Servers"
            operation_type: "server"
            action: "create"
            parameters: {"count": "3", "type": "web"}
            dependencies: [
                main.DependencyDef {
                    target_operation_id: "create_vpc"
                    dependency_type: "sequential"
                    timeout: 300
                    fail_on_dependency_error: True
                }
            ]
            priority: 8
            timeout: 900
        },
        # Service layer (depends on compute)
        main.BatchOperation {
            operation_id: "install_applications"
            name: "Install Web Applications"
            operation_type: "taskserv"
            action: "create"
            parameters: {"apps": ["nginx", "prometheus"]}
            dependencies: [
                main.DependencyDef {
                    target_operation_id: "create_web_servers"
                    dependency_type: "conditional"
                    conditions: ["servers_ready", "ssh_accessible"]
                    timeout: 600
                }
            ]
            priority: 6
        }
    ]
}
```
```
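Category membership alone does not guarantee that a dependency target actually exists in the workflow. A standalone sketch (assuming the `production_deployment` value above) that asserts every dependency references a declared operation:
```kcl
# Sketch: verify all dependency targets are declared operation_ids
_declared_ids = [op.operation_id for op in production_deployment.operations]
_deps_resolve = all_true([
    all_true([dep.target_operation_id in _declared_ids for dep in op.dependencies or []])
    for op in production_deployment.operations
])
assert _deps_resolve, "A dependency references an undeclared operation_id"
```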
### 2. Multi-Environment Workflows
```kcl
# ✅ Good: Environment-specific workflow configurations
schema MultiEnvironmentWorkflow:
    """Workflow that adapts to different environments"""
    base_workflow: main.BatchWorkflow
    target_environment: "dev" | "staging" | "prod"

    # Environment-specific overrides
    environment_config: EnvironmentConfig = EnvironmentConfig {
        environment: target_environment
        # Reduce parallelism in production to limit blast radius
        max_parallel: 3 if target_environment == "prod" else 5
        # Adjust timeouts
        operation_timeout_multiplier: 1.5 if target_environment == "prod" else 1.0
        # Monitoring intensity
        monitoring_level: "comprehensive" if target_environment == "prod" else "basic"
    }

    # Generate final workflow with environment adaptations
    final_workflow: main.BatchWorkflow = main.BatchWorkflow {
        workflow_id: "${base_workflow.workflow_id}-${target_environment}"
        name: "${base_workflow.name} (${target_environment})"
        description: base_workflow.description
        operations: [
            main.BatchOperation {
                operation_id: op.operation_id
                name: op.name
                operation_type: op.operation_type
                provider: op.provider
                action: op.action
                parameters: op.parameters
                dependencies: op.dependencies
                # Environment-adapted timeout
                timeout: int(op.timeout * environment_config.operation_timeout_multiplier)
                priority: op.priority
                allow_parallel: op.allow_parallel
                # Environment-specific retry policy
                retry_policy: main.RetryPolicy {
                    max_attempts: 3 if target_environment == "prod" else 2
                    initial_delay: 30 if target_environment == "prod" else 10
                    backoff_multiplier: 2
                }
            }
            for op in base_workflow.operations
        ]
        max_parallel_operations: environment_config.max_parallel
        global_timeout: base_workflow.global_timeout
        fail_fast: False if target_environment == "prod" else True
        # Environment-specific storage
        storage: main.StorageConfig {
            backend: "surrealdb" if target_environment == "prod" else "filesystem"
            base_path: "./workflows/${target_environment}"
            enable_persistence: target_environment != "dev"
            retention_hours: 2160 if target_environment == "prod" else 168   # 90 days vs 1 week
        }
        # Environment-specific monitoring
        monitoring: main.MonitoringConfig {
            enabled: True
            backend: "prometheus"
            enable_tracing: target_environment == "prod"
            enable_notifications: target_environment != "dev"
            log_level: "debug" if target_environment == "dev" else "info"
        }
    }

# Usage for different environments
dev_deployment: MultiEnvironmentWorkflow = MultiEnvironmentWorkflow {
    target_environment: "dev"
    base_workflow: main.BatchWorkflow {
        workflow_id: "webapp-deploy"
        name: "Web Application Deployment"
        operations: [
            # ... base operations
        ]
    }
}

prod_deployment: MultiEnvironmentWorkflow = MultiEnvironmentWorkflow {
    target_environment: "prod"
    base_workflow: dev_deployment.base_workflow   # Reuse the same base workflow
}
```
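The executor then consumes the derived `final_workflow` values rather than the base definition; a usage sketch:
```kcl
# Usage sketch: extract the environment-adapted workflows
dev_workflow = dev_deployment.final_workflow     # filesystem storage, debug logging
prod_workflow = prod_deployment.final_workflow   # surrealdb storage, tracing enabled
```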
### 3. Error Recovery Patterns
```kcl
# ✅ Good: Comprehensive error recovery strategy
schema ResilientWorkflow(main.BatchWorkflow):
    """Workflow with advanced error recovery capabilities"""
    # Error categorization
    critical_operations: [str] = []   # Operations that cannot fail
    optional_operations: [str] = []   # Operations that can be skipped
    retry_operations: [str] = []      # Operations with custom retry logic

    # Recovery strategies
    global_error_strategy: "fail_fast" | "continue_on_error" | "intelligent" = "intelligent"

    # Enhanced operations with error handling
    enhanced_operations: [EnhancedBatchOperation] = [
        EnhancedBatchOperation {
            base_operation: op
            is_critical: op.operation_id in critical_operations
            is_optional: op.operation_id in optional_operations
            custom_retry: op.operation_id in retry_operations
            # Adaptive retry policy based on operation characteristics
            adaptive_retry_policy: main.RetryPolicy {
                max_attempts: 5 if is_critical else (1 if is_optional else 3)
                initial_delay: 60 if is_critical else 30
                max_delay: 900 if is_critical else 300
                backoff_multiplier: 2
                retry_on_errors: [
                    "timeout",
                    "connection_error",
                    "rate_limit"
                ] + (["resource_unavailable", "quota_exceeded"] if is_critical else [])
            }
            # Adaptive rollback strategy
            adaptive_rollback_strategy: main.RollbackStrategy {
                enabled: True
                strategy: "manual" if is_critical else "immediate"
                preserve_partial_state: is_critical
                custom_rollback_operations: [
                    "notify_engineering_team",
                    "create_incident_ticket",
                    "preserve_debug_info"
                ] if is_critical else []
            }
        }
        for op in operations
    ]

schema EnhancedBatchOperation:
    """Batch operation with enhanced error handling"""
    base_operation: main.BatchOperation
    is_critical: bool = False
    is_optional: bool = False
    custom_retry: bool = False
    adaptive_retry_policy: main.RetryPolicy
    adaptive_rollback_strategy: main.RollbackStrategy

    # Circuit breaker pattern
    failure_threshold: int = 3
    recovery_timeout_seconds: int = 300

    check:
        not (is_critical and is_optional), "Operation cannot be both critical and optional"
```
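A usage sketch (hypothetical operation ids) showing how classification drives the derived retry and rollback behavior:
```kcl
# Hypothetical classification: database creation is critical,
# dashboard installation may be skipped.
resilient_rollout: ResilientWorkflow = ResilientWorkflow {
    workflow_id: "payments-rollout-001"
    name: "Payments Service Rollout"
    critical_operations: ["create_database"]
    optional_operations: ["install_dashboards"]
    operations: [
        main.BatchOperation {
            operation_id: "create_database"
            name: "Create Payments Database"
            operation_type: "server"
            action: "create"
            parameters: {"service": "postgresql"}
        },
        main.BatchOperation {
            operation_id: "install_dashboards"
            name: "Install Dashboards"
            operation_type: "taskserv"
            action: "create"
            parameters: {"apps": "grafana"}
        }
    ]
}
# enhanced_operations[0] gets 5 retry attempts and manual rollback;
# enhanced_operations[1] gets a single attempt and immediate rollback.
```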
## Error Handling
### 1. Graceful Degradation
```kcl
# ✅ Good: Graceful degradation for non-critical components
schema GracefulDegradationWorkflow(main.BatchWorkflow):
    """Workflow that can degrade gracefully on partial failures"""
    # Categorize operations by importance
    core_operations: [str] = []         # Must succeed
    enhancement_operations: [str] = []  # Nice to have
    monitoring_operations: [str] = []   # Can be skipped if needed

    # Minimum viable deployment definition
    minimum_viable_operations: [str] = core_operations

    # Degradation strategy
    degradation_policy: DegradationPolicy = DegradationPolicy {
        allow_partial_deployment: True
        minimum_success_percentage: 80.0
        operation_priorities: {
            # Core operations (must succeed)
            "${op_id}": 10 for op_id in core_operations
        } | {
            # Enhancement operations (should succeed)
            "${op_id}": 5 for op_id in enhancement_operations
        } | {
            # Monitoring operations (can fail)
            "${op_id}": 1 for op_id in monitoring_operations
        }
    }

    check:
        # Ensure minimum viable deployment is achievable
        len(minimum_viable_operations) > 0, "Must specify at least one operation for minimum viable deployment"
        # Core operations should not depend on enhancement operations
        all_true([
            all_true([
                dep.target_operation_id not in enhancement_operations
                for dep in op.dependencies or []
            ])
            for op in operations
            if op.operation_id in core_operations
        ]), "Core operations should not depend on enhancement operations"

schema DegradationPolicy:
    """Policy for graceful degradation"""
    allow_partial_deployment: bool = False
    minimum_success_percentage: float = 100.0
    operation_priorities: {str: int} = {}

    # Fallback configurations
    fallback_configurations: {str: str} = {}
    emergency_contacts: [str] = []

    check:
        0.0 <= minimum_success_percentage and minimum_success_percentage <= 100.0, "Success percentage must be between 0 and 100"
```
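A small usage sketch (hypothetical operation ids): with the 80% threshold, a failed monitoring operation leaves the deployment viable as long as every core operation succeeds.
```kcl
# Hypothetical categorization feeding the computed priorities
checkout_rollout: GracefulDegradationWorkflow = GracefulDegradationWorkflow {
    workflow_id: "checkout-rollout-007"
    name: "Checkout Service Rollout"
    core_operations: ["create_servers", "deploy_app"]
    enhancement_operations: ["warm_caches"]
    monitoring_operations: ["install_exporters"]
    operations: [
        main.BatchOperation {
            operation_id: "create_servers"
            name: "Create Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
        }
        # ... remaining operations elided
    ]
}
```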
### 2. Circuit Breaker Patterns
```kcl
# ✅ Good: Circuit breaker for external dependencies
schema CircuitBreakerOperation(main.BatchOperation):
    """Operation with circuit breaker pattern for external dependencies"""
    # Circuit breaker configuration
    circuit_breaker_enabled: bool = False
    failure_threshold: int = 5
    recovery_timeout_seconds: int = 300

    # Health check configuration
    health_check_endpoint?: str
    health_check_interval_seconds: int = 30

    # Fallback behavior
    fallback_enabled: bool = False
    fallback_operation?: main.BatchOperation

    check:
        failure_threshold > 0 if circuit_breaker_enabled, "Circuit breaker must have positive failure threshold"
        recovery_timeout_seconds > 0 if circuit_breaker_enabled, "Circuit breaker must have positive recovery timeout"
        fallback_operation != Undefined if fallback_enabled, "Fallback requires fallback operation definition"

# Example: Database operation with circuit breaker
database_operation_with_circuit_breaker: CircuitBreakerOperation = CircuitBreakerOperation {
    # Base operation
    operation_id: "setup_database"
    name: "Setup Production Database"
    operation_type: "server"
    action: "create"
    parameters: {"service": "postgresql", "version": "15"}
    timeout: 1800

    # Circuit breaker settings
    circuit_breaker_enabled: True
    failure_threshold: 3
    recovery_timeout_seconds: 600

    # Health monitoring
    health_check_endpoint: "http://db-health.internal/health"
    health_check_interval_seconds: 60

    # Fallback to read replica
    fallback_enabled: True
    fallback_operation: main.BatchOperation {
        operation_id: "setup_database_readonly"
        name: "Setup Read-Only Database Fallback"
        operation_type: "server"
        action: "create"
        parameters: {"service": "postgresql", "mode": "readonly"}
        timeout: 900
    }
}
```
## Performance Optimization
### 1. Parallel Execution Strategies
```kcl
# ✅ Good: Intelligent parallelization
schema OptimizedParallelWorkflow(main.BatchWorkflow):
    """Workflow optimized for parallel execution"""
    # Parallel execution groups
    parallel_groups: [[str]] = []   # Groups of operations that can run in parallel

    # Resource-aware scheduling
    resource_requirements: {str: ResourceRequirement} = {}
    total_available_resources: ResourceCapacity = ResourceCapacity {
        max_cpu_cores: 16
        max_memory_gb: 64
        max_network_bandwidth_mbps: 1000
        max_concurrent_operations: 10
    }

    # Computed optimal parallelism
    optimal_parallel_limit: int = min([
        total_available_resources.max_concurrent_operations,
        len(operations),
        8   # Reasonable default maximum
    ])

    # Generate workflow with optimized settings
    optimized_workflow: main.BatchWorkflow = main.BatchWorkflow {
        workflow_id: workflow_id
        name: name
        description: description
        operations: [
            OptimizedBatchOperation {
                base_operation: op
                resource_hint: resource_requirements[op.operation_id] or ResourceRequirement {
                    cpu_cores: 1
                    memory_gb: 2
                    estimated_duration_seconds: int(op.timeout / 2)
                }
                # Enable parallelism for operations in parallel groups
                computed_allow_parallel: any_true([
                    op.operation_id in group and len(group) > 1
                    for group in parallel_groups
                ])
            }.optimized_operation
            for op in operations
        ]
        max_parallel_operations: optimal_parallel_limit
        global_timeout: global_timeout
        fail_fast: fail_fast
        # Optimize storage for performance
        storage: main.StorageConfig {
            backend: "surrealdb"        # Better for concurrent access
            enable_compression: False   # Trade space for speed
            connection_config: {
                "connection_pool_size": str(optimal_parallel_limit * 2)
                "max_retries": "3"
                "timeout": "30"
            }
        }
    }

schema OptimizedBatchOperation:
    """Batch operation with performance optimizations"""
    base_operation: main.BatchOperation
    resource_hint: ResourceRequirement
    computed_allow_parallel: bool

    # Performance-optimized operation
    optimized_operation: main.BatchOperation = main.BatchOperation {
        operation_id: base_operation.operation_id
        name: base_operation.name
        operation_type: base_operation.operation_type
        provider: base_operation.provider
        action: base_operation.action
        parameters: base_operation.parameters
        dependencies: base_operation.dependencies
        # Optimized settings
        timeout: max([base_operation.timeout, resource_hint.estimated_duration_seconds * 2])
        allow_parallel: computed_allow_parallel
        priority: base_operation.priority
        # Performance-oriented retry policy
        retry_policy: main.RetryPolicy {
            max_attempts: 2   # Fewer retries for faster failure detection
            initial_delay: 10
            max_delay: 60
            backoff_multiplier: 1.5
            retry_on_errors: ["timeout", "rate_limit"]   # Only retry fast-failing errors
        }
    }

schema ResourceRequirement:
    """Resource requirements for performance planning"""
    cpu_cores: int = 1
    memory_gb: int = 2
    estimated_duration_seconds: int = 300
    io_intensive: bool = False
    network_intensive: bool = False

schema ResourceCapacity:
    """Available resource capacity"""
    max_cpu_cores: int
    max_memory_gb: int
    max_network_bandwidth_mbps: int
    max_concurrent_operations: int
```
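How the pieces fit together (a sketch with hypothetical operation ids): declare resource hints and parallel groups, and the schema derives the parallelism settings.
```kcl
# Hypothetical usage: two independent server builds may run in parallel
optimized_build: OptimizedParallelWorkflow = OptimizedParallelWorkflow {
    workflow_id: "batch-build-003"
    name: "Batch Server Build"
    parallel_groups: [["build_web", "build_cache"]]
    resource_requirements: {
        "build_web": ResourceRequirement {cpu_cores: 2, memory_gb: 4, estimated_duration_seconds: 600}
        "build_cache": ResourceRequirement {cpu_cores: 1, memory_gb: 2, estimated_duration_seconds: 300}
    }
    operations: [
        main.BatchOperation {
            operation_id: "build_web"
            name: "Build Web Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
        },
        main.BatchOperation {
            operation_id: "build_cache"
            name: "Build Cache Servers"
            operation_type: "server"
            action: "create"
            parameters: {}
        }
    ]
}
# optimized_build.optimal_parallel_limit == 2 (min of capacity 10, 2 ops, cap 8)
```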
### 2. Caching and Memoization
```kcl
# ✅ Good: Caching for expensive operations
schema CachedOperation(main.BatchOperation):
    """Operation with caching capabilities"""
    # Caching configuration
    cache_enabled: bool = False
    cache_key_template: str = "${operation_id}-${provider}-${action}"
    cache_ttl_seconds: int = 3600   # 1 hour default

    # Cache invalidation rules
    cache_invalidation_triggers: [str] = []
    force_cache_refresh: bool = False

    # Computed cache key
    computed_cache_key: str = "${operation_id}-${provider}-${action}"

    # Cache-aware timeout (shorter if a cache hit is expected)
    cache_aware_timeout: int = int(timeout / 2) if cache_enabled else timeout

    check:
        cache_ttl_seconds > 0 if cache_enabled, "Cache TTL must be positive when caching is enabled"

# Example: Cached provider operations
cached_server_creation: CachedOperation = CachedOperation {
    # Base operation
    operation_id: "create_standardized_servers"
    name: "Create Standardized Web Servers"
    operation_type: "server"
    provider: "upcloud"
    action: "create"
    parameters: {
        "plan": "2xCPU-4GB"
        "zone": "fi-hel2"
        "image": "ubuntu-22.04"
    }
    timeout: 900

    # Caching settings
    cache_enabled: True
    cache_key_template: "server-${plan}-${zone}-${image}"
    cache_ttl_seconds: 7200   # 2 hours

    # Cache invalidation
    cache_invalidation_triggers: ["image_updated", "plan_changed"]
}
```
## Security Considerations
### 1. Secure Configuration Management
```kcl
# ✅ Good: Secure configuration with proper secret handling
schema SecureConfiguration:
    """Security-first configuration management"""
    # Secret management
    secrets_provider: main.SecretProvider = main.SecretProvider {
        provider: "sops"
        sops_config: main.SopsConfig {
            config_path: "./.sops.yaml"
            age_key_file: "{{env.HOME}}/.config/sops/age/keys.txt"
            use_age: True
        }
    }

    # Security classifications
    data_classification: "public" | "internal" | "confidential" | "restricted"
    encryption_required: bool = data_classification != "public"
    audit_logging_required: bool = data_classification in ["confidential", "restricted"]
    audit_log_destinations: [str] = []   # e.g., SIEM endpoints, object storage

    # Access control
    allowed_environments: [str] = ["dev", "staging", "prod"]
    environment_access_matrix: {str: [str]} = {
        "dev": ["developers", "qa_team"]
        "staging": ["developers", "qa_team", "release_team"]
        "prod": ["release_team", "operations_team"]
    }

    # Network security
    network_isolation_required: bool = data_classification in ["confidential", "restricted"]
    vpc_isolation: bool = network_isolation_required
    private_subnets_only: bool = data_classification == "restricted"

    check:
        encryption_required == True if data_classification == "restricted", "Restricted data must be encrypted"
        len(audit_log_destinations) > 0 if audit_logging_required, "Audit logging destinations must be specified for sensitive data"

# Example: Production security configuration
production_security: SecureConfiguration = SecureConfiguration {
    data_classification: "confidential"
    # encryption_required automatically becomes True
    # audit_logging_required automatically becomes True
    # network_isolation_required automatically becomes True
    allowed_environments: ["staging", "prod"]
    environment_access_matrix: {
        "staging": ["release_team", "security_team"]
        "prod": ["operations_team", "security_team"]
    }
    audit_log_destinations: [
        "siem://security.company.com",
        "s3://audit-logs-prod/workflows"
    ]
}
```
### 2. Compliance and Auditing
```kcl
# ✅ Good: Compliance-aware workflow design
schema RetentionRequirements:
    """Data retention requirements based on compliance"""
    workflow_data_hours: int = 8760       # 1 year default
    audit_log_hours: int = 26280          # 3 years default
    backup_retention_hours: int = 43800   # 5 years default

schema ComplianceMetadata:
    """Metadata for compliance requirements"""
    frameworks: [str]
    audit_trail_required: bool
    data_residency_requirements: [str]
    retention_requirements: RetentionRequirements

# Helper for retention requirements. KCL has no Python-style `def`;
# reusable logic is written as a lambda, declared before first use.
get_retention_requirements = lambda frameworks: [str] -> RetentionRequirements {
    _result = RetentionRequirements {
        workflow_data_hours: 8760       # 1 year default
        audit_log_hours: 26280          # 3 years default
        backup_retention_hours: 43800   # 5 years default
    }
    if "sox" in frameworks:
        _result = RetentionRequirements {
            workflow_data_hours: 43800      # 5 years
            audit_log_hours: 61320          # 7 years
            backup_retention_hours: 87600   # 10 years
        }
    elif "pci" in frameworks:
        _result = RetentionRequirements {
            workflow_data_hours: 8760       # 1 year
            audit_log_hours: 26280          # 3 years
            backup_retention_hours: 43800   # 5 years
        }
    _result
}

schema ComplianceAwareBatchOperation:
    """Batch operation with compliance awareness"""
    base_operation: main.BatchOperation
    compliance_metadata: ComplianceMetadata

    compliant_operation: main.BatchOperation = main.BatchOperation {
        operation_id: base_operation.operation_id
        name: base_operation.name
        operation_type: base_operation.operation_type
        provider: base_operation.provider
        action: base_operation.action
        parameters: base_operation.parameters | ({
            "audit_enabled": "true"
            "compliance_mode": "strict"
        } if compliance_metadata.audit_trail_required else {})
        dependencies: base_operation.dependencies
        timeout: base_operation.timeout
        allow_parallel: base_operation.allow_parallel
        priority: base_operation.priority
        # Enhanced retry for compliance
        retry_policy: main.RetryPolicy {
            max_attempts: 5 if compliance_metadata.audit_trail_required else 3
            initial_delay: 30
            max_delay: 300
            backoff_multiplier: 2
            retry_on_errors: ["timeout", "connection_error", "rate_limit"]
        }
        # Conservative rollback for compliance
        rollback_strategy: main.RollbackStrategy {
            enabled: True
            strategy: "manual"   # Manual approval for compliance
            preserve_partial_state: True
            rollback_timeout: 1800
            custom_rollback_operations: [
                "create_audit_entry",
                "notify_compliance_team",
                "preserve_evidence"
            ]
        }
    }

schema ComplianceWorkflow(main.BatchWorkflow):
    """Workflow with built-in compliance features"""
    # Compliance framework requirements
    compliance_frameworks: [str] = []
    compliance_metadata: ComplianceMetadata = ComplianceMetadata {
        frameworks: compliance_frameworks
        audit_trail_required: "sox" in compliance_frameworks or "pci" in compliance_frameworks
        data_residency_requirements: ["eu"] if "gdpr" in compliance_frameworks else []
        retention_requirements: get_retention_requirements(compliance_frameworks)
    }

    # Enhanced workflow with compliance features
    compliant_workflow: main.BatchWorkflow = main.BatchWorkflow {
        workflow_id: workflow_id
        name: name
        description: description
        operations: [
            ComplianceAwareBatchOperation {
                base_operation: op
                compliance_metadata: compliance_metadata
            }.compliant_operation
            for op in operations
        ]
        # Compliance-aware storage
        storage: main.StorageConfig {
            backend: "surrealdb"
            enable_persistence: True
            retention_hours: compliance_metadata.retention_requirements.workflow_data_hours
            enable_compression: False   # For audit clarity
            encryption: main.SecretProvider {
                provider: "sops"
                sops_config: main.SopsConfig {
                    config_path: "./.sops.yaml"
                    age_key_file: "{{env.HOME}}/.config/sops/age/keys.txt"
                    use_age: True
                }
            } if compliance_metadata.audit_trail_required else Undefined
        }
        # Compliance-aware monitoring
        monitoring: main.MonitoringConfig {
            enabled: True
            backend: "prometheus"
            enable_tracing: compliance_metadata.audit_trail_required
            enable_notifications: True
            log_level: "info"
            collection_interval: 15 if compliance_metadata.audit_trail_required else 30
        }
        # Audit trail in execution context
        execution_context: execution_context | {
            "compliance_frameworks": str(compliance_frameworks)
            "audit_trail_enabled": str(compliance_metadata.audit_trail_required)
            "data_classification": "confidential"
        }
    }
```
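A quick sanity check of the helper (usage sketch; the expected values follow from the branches above):
```kcl
# SOX dominates when multiple frameworks are declared
sox_retention = get_retention_requirements(["sox", "gdpr"])
# sox_retention.audit_log_hours == 61320 (7 years)

pci_retention = get_retention_requirements(["pci"])
# pci_retention.workflow_data_hours == 8760 (1 year)
```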
## Testing Strategies
### 1. Schema Testing
```bash
#!/bin/bash
# Schema testing script

# Test 1: Basic syntax validation
echo "Testing schema syntax..."
find . -name "*.k" -exec kcl fmt {} \;

# Test 2: Schema compilation
echo "Testing schema compilation..."
for file in *.k; do
    echo "Testing $file"
    kcl run "$file" > /dev/null || echo "FAILED: $file"
done

# Test 3: Constraint validation
echo "Testing constraints..."
kcl run test_constraints.k

# Test 4: JSON serialization
echo "Testing JSON serialization..."
kcl run examples/simple_workflow.k --format json | jq '.' > /dev/null

# Test 5: Cross-schema compatibility
echo "Testing cross-schema compatibility..."
kcl run integration_test.k
```
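The script assumes a `test_constraints.k` exists; a minimal sketch of what such a file might contain (hypothetical, reusing the hostname rule from earlier):
```kcl
# test_constraints.k (hypothetical) - exercise a constraint in isolation
import regex

schema HostnameProbe:
    hostname: str
    check:
        regex.match(hostname, "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$"), "invalid hostname: ${hostname}"

passing_case = HostnameProbe {hostname: "prod-web-01"}
# Uncomment to confirm the constraint rejects bad input (kcl run exits non-zero):
# failing_case = HostnameProbe {hostname: "-bad-hostname-"}
```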
### 2. Validation Testing
```kcl
# Test configurations for validation (top-level values, one per case)

# Valid case
valid_server: main.Server = main.Server {
    hostname: "test-01"
    title: "Test Server"
    labels: "env: test"
    user: "test"
}

# Edge case: smallest workflow that should still validate
minimal_workflow: main.BatchWorkflow = main.BatchWorkflow {
    workflow_id: "minimal"
    name: "Minimal Test Workflow"
    operations: [
        main.BatchOperation {
            operation_id: "test_op"
            name: "Test Operation"
            operation_type: "custom"
            action: "test"
            parameters: {}
        }
    ]
}

# Boundary testing
max_timeout_operation: main.BatchOperation = main.BatchOperation {
    operation_id: "max_timeout"
    name: "Maximum Timeout Test"
    operation_type: "custom"
    action: "test"
    parameters: {}
    timeout: 86400   # 24 hours - test upper boundary
}
```
## Maintenance Guidelines
### 1. Schema Evolution
```kcl
# ✅ Good: Backward-compatible schema evolution
schema ServerV2(main.Server):
    """Enhanced server schema with backward compatibility"""
    # New optional fields (backward compatible)
    performance_profile?: "standard" | "high_performance" | "burstable"
    auto_scaling_enabled: bool = False

    # Deprecated fields (marked but still supported)
    deprecated_field?: str   # TODO: Remove in v3.0

    # Version metadata
    schema_version: str = "2.0"

    check:
        # Maintain existing validations
        len(hostname) > 0, "Hostname required"
        len(title) > 0, "Title required"
        # New validations for new fields
        performance_profile != "burstable" if auto_scaling_enabled and performance_profile != Undefined, "Auto-scaling not compatible with burstable performance profile"

# Migration helper
schema ServerMigration:
    """Helper for migrating from ServerV1 to ServerV2"""
    v1_server: main.Server
    v2_server: ServerV2 = ServerV2 {
        # Copy all existing fields
        hostname: v1_server.hostname
        title: v1_server.title
        labels: v1_server.labels
        user: v1_server.user
        # Set defaults for new fields
        performance_profile: "standard"
        auto_scaling_enabled: False
        # Copy optional fields if they exist
        taskservs: v1_server.taskservs
        cluster: v1_server.cluster
    }
```
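Applying the helper (a sketch with hypothetical values): construct it with the legacy definition and read the upgraded value back.
```kcl
# Hypothetical legacy server being migrated
legacy_web: main.Server = main.Server {
    hostname: "web-01"
    title: "Web Server 01"
    labels: "env: prod"
    user: "admin"
}

web_v2: ServerV2 = ServerMigration {v1_server: legacy_web}.v2_server
# web_v2.schema_version == "2.0"; performance_profile defaults to "standard"
```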
### 2. Documentation Updates
```kcl
# ✅ Good: Self-documenting schemas with examples
schema DocumentedWorkflow(main.BatchWorkflow):
    """
    Production workflow with comprehensive documentation.

    This workflow follows company best practices for:
    - Multi-environment deployment
    - Error handling and recovery
    - Security and compliance
    - Performance optimization

    Example Usage:
        prod_workflow: DocumentedWorkflow = DocumentedWorkflow {
            environment: "prod"
            security_level: "high"
            base_workflow: main.BatchWorkflow {
                workflow_id: "webapp-deploy-001"
                name: "Web Application Deployment"
                operations: [...]
            }
        }

    See Also:
        - examples/production_workflow.k
        - docs/WORKFLOW_PATTERNS.md
        - docs/SECURITY_GUIDELINES.md
    """
    # Required metadata for documentation
    environment: "dev" | "staging" | "prod"
    security_level: "low" | "medium" | "high"
    base_workflow: main.BatchWorkflow

    # Auto-generated documentation fields
    documentation_generated_at: str = "{{now.date}}"
    schema_version: str = "1.0"

    check:
        security_level == "high" if environment == "prod", "Production workflows must use high security level"
```
This comprehensive best practices guide provides the foundation for creating maintainable, secure, and performant KCL configurations for the provisioning system.