Troubleshooting Guide
This comprehensive troubleshooting guide helps you diagnose and resolve common issues with Infrastructure Automation.
What You’ll Learn
- Common issues and their solutions
- Diagnostic commands and techniques
- Error message interpretation
- Performance optimization
- Recovery procedures
- Prevention strategies
General Troubleshooting Approach
1. Identify the Problem
# Check overall system status
provisioning env
provisioning validate config
# Check specific component status
provisioning show servers --infra my-infra
provisioning taskserv list --infra my-infra --installed
2. Gather Information
# Enable debug mode for detailed output
provisioning --debug <command>
# Check logs and errors
provisioning show logs --infra my-infra
3. Use Diagnostic Commands
# Validate configuration
provisioning validate config --detailed
# Test connectivity
provisioning provider test aws
provisioning network test --infra my-infra
Installation and Setup Issues
Issue: Installation Fails
Symptoms:
- Installation script errors
- Missing dependencies
- Permission denied errors
Diagnosis:
# Check system requirements
uname -a
df -h
whoami
# Check permissions
ls -la /usr/local/
sudo -l
Solutions:
Permission Issues
# Run installer with sudo
sudo ./install-provisioning
# Or install to user directory
./install-provisioning --prefix=$HOME/provisioning
export PATH="$HOME/provisioning/bin:$PATH"
Missing Dependencies
# Ubuntu/Debian
sudo apt update
sudo apt install -y curl wget tar build-essential
# RHEL/CentOS
sudo dnf install -y curl wget tar gcc make
Architecture Issues
# Check architecture
uname -m
# Download correct architecture package
# x86_64: Intel/AMD 64-bit
# arm64: ARM 64-bit (Apple Silicon)
wget https://releases.example.com/provisioning-linux-x86_64.tar.gz
Issue: Command Not Found
Symptoms:
bash: provisioning: command not found
Diagnosis:
# Check if provisioning is installed
which provisioning
ls -la /usr/local/bin/provisioning
# Check PATH
echo $PATH
Solutions:
# Add to PATH
export PATH="/usr/local/bin:$PATH"
# Make permanent (add to shell profile)
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Create symlink if missing
sudo ln -sf /usr/local/provisioning/core/nulib/provisioning /usr/local/bin/provisioning
Issue: Nushell Plugin Errors
Symptoms:
Plugin not found: nu_plugin_kcl
Plugin registration failed
Diagnosis:
# Check Nushell version
nu --version
# Check KCL installation (required for nu_plugin_kcl)
kcl version
# Check plugin registration
nu -c "version | get installed_plugins"
Solutions:
# Install KCL CLI (required for nu_plugin_kcl)
# Download from: https://github.com/kcl-lang/cli/releases
# Re-register plugins
nu -c "plugin add /usr/local/provisioning/plugins/nu_plugin_kcl"
nu -c "plugin add /usr/local/provisioning/plugins/nu_plugin_tera"
# Restart Nushell after plugin registration
Configuration Issues
Issue: Configuration Not Found
Symptoms:
Configuration file not found
Failed to load configuration
Diagnosis:
# Check configuration file locations
provisioning env | grep config
# Check if files exist
ls -la ~/.config/provisioning/
ls -la /usr/local/provisioning/config.defaults.toml
Solutions:
# Initialize user configuration
provisioning init config
# Create missing directories
mkdir -p ~/.config/provisioning
# Copy template
cp /usr/local/provisioning/config-examples/config.user.toml ~/.config/provisioning/config.toml
# Verify configuration
provisioning validate config
Issue: Configuration Validation Errors
Symptoms:
Configuration validation failed
Invalid configuration value
Missing required field
Diagnosis:
# Detailed validation
provisioning validate config --detailed
# Check specific sections
provisioning config show --section paths
provisioning config show --section providers
Solutions:
Path Configuration Issues
# Check base path exists
ls -la /path/to/provisioning
# Update configuration
nano ~/.config/provisioning/config.toml
# Fix paths section
[paths]
base = "/correct/path/to/provisioning"
Provider Configuration Issues
# Test provider connectivity
provisioning provider test aws
# Check credentials
aws configure list # For AWS
upcloud-cli config # For UpCloud
# Update provider configuration
[providers.aws]
interface = "CLI" # or "API"
Issue: Interpolation Failures
Symptoms:
Interpolation pattern not resolved: {{env.VARIABLE}}
Template rendering failed
Diagnosis:
# Test interpolation
provisioning validate interpolation test
# Check environment variables
env | grep VARIABLE
# Debug interpolation
provisioning --debug validate interpolation validate
Solutions:
# Set missing environment variables
export MISSING_VARIABLE="value"
# Use fallback values in configuration
config_value = "{{env.VARIABLE || 'default_value'}}"
# Check interpolation syntax
# Correct: {{env.HOME}}
# Incorrect: ${HOME} or $HOME
Server Management Issues
Issue: Server Creation Fails
Symptoms:
Failed to create server
Provider API error
Insufficient quota
Diagnosis:
# Check provider status
provisioning provider status aws
# Test connectivity
ping api.provider.com
curl -I https://api.provider.com
# Check quota
provisioning provider quota --infra my-infra
# Debug server creation
provisioning --debug server create web-01 --infra my-infra --check
Solutions:
API Authentication Issues
# AWS
aws configure list
aws sts get-caller-identity
# UpCloud
upcloud-cli account show
# Update credentials
aws configure # For AWS
export UPCLOUD_USERNAME="your-username"
export UPCLOUD_PASSWORD="your-password"
Quota/Limit Issues
# Check current usage
provisioning show costs --infra my-infra
# Request quota increase from provider
# Or reduce resource requirements
# Use smaller instance types
# Reduce number of servers
Network/Connectivity Issues
# Test network connectivity
curl -v https://api.aws.amazon.com
curl -v https://api.upcloud.com
# Check DNS resolution
nslookup api.aws.amazon.com
# Check firewall rules
# Ensure outbound HTTPS (port 443) is allowed
Issue: SSH Access Fails
Symptoms:
Connection refused
Permission denied
Host key verification failed
Diagnosis:
# Check server status
provisioning server list --infra my-infra
# Test SSH manually
ssh -v user@server-ip
# Check SSH configuration
provisioning show servers web-01 --infra my-infra
Solutions:
Connection Issues
# Wait for server to be fully ready
provisioning server list --infra my-infra --status
# Check security groups/firewall
# Ensure SSH (port 22) is allowed
# Use correct IP address
provisioning show servers web-01 --infra my-infra | grep ip
Authentication Issues
# Check SSH key
ls -la ~/.ssh/
ssh-add -l
# Generate new key if needed
ssh-keygen -t ed25519 -f ~/.ssh/provisioning_key
# Use specific key
provisioning server ssh web-01 --key ~/.ssh/provisioning_key --infra my-infra
Host Key Issues
# Remove old host key
ssh-keygen -R server-ip
# Accept new host key
ssh -o StrictHostKeyChecking=accept-new user@server-ip
Task Service Issues
Issue: Service Installation Fails
Symptoms:
Service installation failed
Package not found
Dependency conflicts
Diagnosis:
# Check service prerequisites
provisioning taskserv check kubernetes --infra my-infra
# Debug installation
provisioning --debug taskserv create kubernetes --infra my-infra --check
# Check server resources
provisioning server ssh web-01 --command "free -h && df -h" --infra my-infra
Solutions:
Resource Issues
# Check available resources
provisioning server ssh web-01 --command "
echo 'Memory:' && free -h
echo 'Disk:' && df -h
echo 'CPU:' && nproc
" --infra my-infra
# Upgrade server if needed
provisioning server resize web-01 --plan larger-plan --infra my-infra
Package Repository Issues
# Update package lists
provisioning server ssh web-01 --command "
sudo apt update && sudo apt upgrade -y
" --infra my-infra
# Check repository connectivity
provisioning server ssh web-01 --command "
curl -I https://download.docker.com/linux/ubuntu/
" --infra my-infra
Dependency Issues
# Install missing dependencies
provisioning taskserv create containerd --infra my-infra
# Then install dependent service
provisioning taskserv create kubernetes --infra my-infra
Issue: Service Not Running
Symptoms:
Service status: failed
Service not responding
Health check failures
Diagnosis:
# Check service status
provisioning taskserv status kubernetes --infra my-infra
# Check service logs
provisioning taskserv logs kubernetes --infra my-infra
# SSH and check manually
provisioning server ssh web-01 --command "
sudo systemctl status kubernetes
sudo journalctl -u kubernetes --no-pager -n 50
" --infra my-infra
Solutions:
Configuration Issues
# Reconfigure service
provisioning taskserv configure kubernetes --infra my-infra
# Reset to defaults
provisioning taskserv reset kubernetes --infra my-infra
Port Conflicts
# Check port usage
provisioning server ssh web-01 --command "
sudo netstat -tulpn | grep :6443
sudo ss -tulpn | grep :6443
" --infra my-infra
# Change port configuration or stop conflicting service
Permission Issues
# Fix permissions
provisioning server ssh web-01 --command "
sudo chown -R kubernetes:kubernetes /var/lib/kubernetes
sudo chmod 600 /etc/kubernetes/admin.conf
" --infra my-infra
Cluster Management Issues
Issue: Cluster Deployment Fails
Symptoms:
Cluster deployment failed
Pod creation errors
Service unavailable
Diagnosis:
# Check cluster status
provisioning cluster status web-cluster --infra my-infra
# Check Kubernetes cluster
provisioning server ssh master-01 --command "
kubectl get nodes
kubectl get pods --all-namespaces
" --infra my-infra
# Check cluster logs
provisioning cluster logs web-cluster --infra my-infra
Solutions:
Node Issues
# Check node status
provisioning server ssh master-01 --command "
kubectl describe nodes
" --infra my-infra
# Drain and rejoin problematic nodes
provisioning server ssh master-01 --command "
kubectl drain worker-01 --ignore-daemonsets
kubectl delete node worker-01
" --infra my-infra
# Rejoin node
provisioning taskserv configure kubernetes --infra my-infra --servers worker-01
Resource Constraints
# Check resource usage
provisioning server ssh master-01 --command "
kubectl top nodes
kubectl top pods --all-namespaces
" --infra my-infra
# Scale down or add more nodes
provisioning cluster scale web-cluster --replicas 3 --infra my-infra
provisioning server create worker-04 --infra my-infra
Network Issues
# Check network plugin
provisioning server ssh master-01 --command "
kubectl get pods -n kube-system | grep cilium
" --infra my-infra
# Restart network plugin
provisioning taskserv restart cilium --infra my-infra
Performance Issues
Issue: Slow Operations
Symptoms:
- Commands take very long to complete
- Timeouts during operations
- High CPU/memory usage
Diagnosis:
# Check system resources
top
htop
free -h
df -h
# Check network latency
ping api.aws.amazon.com
traceroute api.aws.amazon.com
# Profile command execution
time provisioning server list --infra my-infra
Solutions:
Local System Issues
# Close unnecessary applications
# Upgrade system resources
# Use SSD storage if available
# Increase timeout values
export PROVISIONING_TIMEOUT=600 # 10 minutes
Network Issues
# Use region closer to your location
[providers.aws]
region = "us-west-1" # Closer region
# Enable connection pooling/caching
[cache]
enabled = true
Large Infrastructure Issues
# Use parallel operations
provisioning server create --infra my-infra --parallel 4
# Filter results
provisioning server list --infra my-infra --filter "status == 'running'"
Issue: High Memory Usage
Symptoms:
- System becomes unresponsive
- Out of memory errors
- Swap usage high
Diagnosis:
# Check memory usage
free -h
ps aux --sort=-%mem | head
# Check for memory leaks
valgrind provisioning server list --infra my-infra
Solutions:
# Increase system memory
# Close other applications
# Use streaming operations for large datasets
# Enable garbage collection
export PROVISIONING_GC_ENABLED=true
# Reduce concurrent operations
export PROVISIONING_MAX_PARALLEL=2
Network and Connectivity Issues
Issue: API Connectivity Problems
Symptoms:
Connection timeout
DNS resolution failed
SSL certificate errors
Diagnosis:
# Test basic connectivity
ping 8.8.8.8
curl -I https://api.aws.amazon.com
nslookup api.upcloud.com
# Check SSL certificates
openssl s_client -connect api.aws.amazon.com:443 -servername api.aws.amazon.com
Solutions:
DNS Issues
# Use alternative DNS
echo 'nameserver 8.8.8.8' | sudo tee /etc/resolv.conf
# Clear DNS cache
sudo systemctl restart systemd-resolved # Ubuntu
sudo dscacheutil -flushcache # macOS
Proxy/Firewall Issues
# Configure proxy if needed
export HTTP_PROXY=http://proxy.company.com:9090
export HTTPS_PROXY=http://proxy.company.com:9090
# Check firewall rules
sudo ufw status # Ubuntu
sudo firewall-cmd --list-all # RHEL/CentOS
Certificate Issues
# Update CA certificates
sudo apt update && sudo apt install ca-certificates # Ubuntu
brew install ca-certificates # macOS
# Skip SSL verification (temporary)
export PROVISIONING_SKIP_SSL_VERIFY=true
Security and Encryption Issues
Issue: SOPS Decryption Fails
Symptoms:
SOPS decryption failed
Age key not found
Invalid key format
Diagnosis:
# Check SOPS configuration
provisioning sops config
# Test SOPS manually
sops -d encrypted-file.k
# Check Age keys
ls -la ~/.config/sops/age/keys.txt
age-keygen -y ~/.config/sops/age/keys.txt
Solutions:
Missing Keys
# Generate new Age key
age-keygen -o ~/.config/sops/age/keys.txt
# Update SOPS configuration
provisioning sops config --key-file ~/.config/sops/age/keys.txt
Key Permissions
# Fix key file permissions
chmod 600 ~/.config/sops/age/keys.txt
chown $(whoami) ~/.config/sops/age/keys.txt
Configuration Issues
# Update SOPS configuration in ~/.config/provisioning/config.toml
[sops]
use_sops = true
key_search_paths = [
"~/.config/sops/age/keys.txt",
"/path/to/your/key.txt"
]
Issue: Access Denied Errors
Symptoms:
Permission denied
Access denied
Insufficient privileges
Diagnosis:
# Check user permissions
id
groups
# Check file permissions
ls -la ~/.config/provisioning/
ls -la /usr/local/provisioning/
# Test with sudo
sudo provisioning env
Solutions:
# Fix file ownership
sudo chown -R $(whoami):$(whoami) ~/.config/provisioning/
# Fix permissions
chmod -R 755 ~/.config/provisioning/
chmod 600 ~/.config/provisioning/config.toml
# Add user to required groups
sudo usermod -a -G docker $(whoami) # For Docker access
Data and Storage Issues
Issue: Disk Space Problems
Symptoms:
No space left on device
Write failed
Disk full
Diagnosis:
# Check disk usage
df -h
du -sh ~/.config/provisioning/
du -sh /usr/local/provisioning/
# Find large files
find /usr/local/provisioning -type f -size +100M
Solutions:
# Clean up cache files
rm -rf ~/.config/provisioning/cache/*
rm -rf /usr/local/provisioning/.cache/*
# Clean up logs
find /usr/local/provisioning -name "*.log" -mtime +30 -delete
# Clean up temporary files
rm -rf /tmp/provisioning-*
# Compress old backups
gzip ~/.config/provisioning/backups/*.yaml
Recovery Procedures
Configuration Recovery
# Restore from backup
provisioning config restore --backup latest
# Reset to defaults
provisioning config reset
# Recreate configuration
provisioning init config --force
Infrastructure Recovery
# Check infrastructure status
provisioning show servers --infra my-infra
# Recover failed servers
provisioning server create failed-server --infra my-infra
# Restore from backup
provisioning restore --backup latest --infra my-infra
Service Recovery
# Restart failed services
provisioning taskserv restart kubernetes --infra my-infra
# Reinstall corrupted services
provisioning taskserv delete kubernetes --infra my-infra
provisioning taskserv create kubernetes --infra my-infra
Prevention Strategies
Regular Maintenance
# Weekly maintenance script
#!/bin/bash
# Update system
provisioning update --check
# Validate configuration
provisioning validate config
# Check for service updates
provisioning taskserv check-updates
# Clean up old files
provisioning cleanup --older-than 30d
# Create backup
provisioning backup create --name "weekly-$(date +%Y%m%d)"
Monitoring Setup
# Set up health monitoring
#!/bin/bash
# Check system health every hour
0 * * * * /usr/local/bin/provisioning health check || echo "Health check failed" | mail -s "Provisioning Alert" admin@company.com
# Weekly cost reports
0 9 * * 1 /usr/local/bin/provisioning show costs --all | mail -s "Weekly Cost Report" finance@company.com
Best Practices
-
Configuration Management
- Version control all configuration files
- Use check mode before applying changes
- Regular validation and testing
-
Security
- Regular key rotation
- Principle of least privilege
- Audit logs review
-
Backup Strategy
- Automated daily backups
- Test restore procedures
- Off-site backup storage
-
Documentation
- Document custom configurations
- Keep troubleshooting logs
- Share knowledge with team
Getting Additional Help
Debug Information Collection
#!/bin/bash
# Collect debug information
echo "Collecting provisioning debug information..."
mkdir -p /tmp/provisioning-debug
cd /tmp/provisioning-debug
# System information
uname -a > system-info.txt
free -h >> system-info.txt
df -h >> system-info.txt
# Provisioning information
provisioning --version > provisioning-info.txt
provisioning env >> provisioning-info.txt
provisioning validate config --detailed > config-validation.txt 2>&1
# Configuration files
cp ~/.config/provisioning/config.toml user-config.toml 2>/dev/null || echo "No user config" > user-config.toml
# Logs
provisioning show logs > system-logs.txt 2>&1
# Create archive
cd /tmp
tar czf provisioning-debug-$(date +%Y%m%d_%H%M%S).tar.gz provisioning-debug/
echo "Debug information collected in: provisioning-debug-*.tar.gz"
Support Channels
-
Built-in Help
provisioning help provisioning help <command> -
Documentation
- User guides in
docs/user/ - CLI reference:
docs/user/cli-reference.md - Configuration guide:
docs/user/configuration.md
- User guides in
-
Community Resources
- Project repository issues
- Community forums
- Documentation wiki
-
Enterprise Support
- Professional services
- Priority support
- Custom development
Remember: When reporting issues, always include the debug information collected above and specific error messages.