provisioning/templates/docs/troubleshooting.md.j2
Jesús Pérez 44648e3206
chore: complete nickel migration and consolidate legacy configs
- Remove KCL ecosystem (~220 files deleted)
- Migrate all infrastructure to Nickel schema system
- Consolidate documentation: legacy docs → provisioning/docs/src/
- Add CI/CD workflows (.github/) and Rust build config (.cargo/)
- Update core system for Nickel schema parsing
- Update README.md and CHANGES.md for v5.0.0 release
- Fix pre-commit hooks: end-of-file, trailing-whitespace
- Breaking changes: KCL workspaces require migration
- Migration bridge available in docs/src/development/
2026-01-08 09:55:37 +00:00

333 lines
7.7 KiB
Django/Jinja

# Workspace {{ workspace_name | title }} - Troubleshooting Guide
**Purpose**: Common issues and solutions for {{ workspace_name }} deployment
---
## Authentication & Credentials
### Issue: "{{ primary_provider | title }} authentication failed"
**Symptoms**:
```
Error: Authentication failed
HTTP 401: Unauthorized
Invalid credentials
```
**Solutions**:
1. **Verify credentials are set**:
```bash
# Check {{ primary_provider | title }} credentials
echo "Provider credentials configured"
```
2. **Test with provider CLI**:
```bash
{{ primary_provider | lower }} auth test
```
3. **Re-set credentials**:
```bash
{% for var, hint in provider_env_vars %}
export {{ var }}="your-{{ hint | lower }}"
{% endfor %}
```
4. **Check provider account**: Log in to {{ provider_url }}
---
### Issue: "SSH key not found"
**Symptoms**:
```
Error: SSH key not found
Unable to load SSH public key
Permission denied (publickey)
```
**Solutions**:
1. **Check SSH key**:
```bash
ls -la ~/.ssh/id_deployment.pub
```
2. **Generate SSH key**:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/id_deployment -C "provisioning@{{ workspace_name }}"
```
3. **Update configuration** if path is different:
```nickel
ssh_key_path = "~/.ssh/your-custom-key.pub"
```
---
## Server Deployment
### Issue: "Server creation failed"
**Symptoms**:
```
Error: Failed to create server
Server creation timeout
```
**Causes**:
- Provider quota exceeded
- Insufficient account balance
- Zone unavailable
- Invalid configuration
**Solutions**:
1. **Check provider quota**: Log in to provider console
2. **Verify zone availability**: `provisioning zone list`
3. **Check configuration**: `nickel export infra/{{ default_infra }}/main.ncl | jq .servers`
4. **Try dry-run first**: `provisioning -c server create --infra {{ default_infra }}`
5. **Check provider status**: {{ provider_status_url }}
---
### Issue: "SSH connection refused"
**Symptoms**:
```
ssh: connect to host xxx.xxx.xxx.xxx port 22: Connection refused
```
**Causes**:
- Server hasn't finished booting (takes 2-3 minutes)
- SSH service not running
- Firewall blocking access
- Wrong IP address
**Solutions**:
1. **Wait for server boot**: Wait 2-3 minutes
2. **Check server status**: `provisioning server list --infra {{ default_infra }}`
3. **Verify IP address**: `provisioning server list --infra {{ default_infra }} --format json | jq '.[] | {hostname, public_ip}'`
4. **Test SSH**: `provisioning server ssh {{ servers[0].name }} --command "uptime"`
5. **Check firewall rules**: In provider console, allow SSH (port 22)
---
## Configuration & Validation
### Issue: "Nickel type-check failed"
**Symptoms**:
```
Error: Type checking failed
Type mismatch
```
**Solutions**:
1. **Run type-check**:
```bash
cd {{ workspace_path }}
nickel typecheck config/config.ncl
nickel typecheck infra/{{ default_infra }}/main.ncl
```
2. **Check error message**: Fix the line mentioned in error
3. **Validate JSON export**: `nickel export config/config.ncl | jq .`
4. **Revert changes**: `git checkout config/config.ncl`
---
### Issue: "Workspace not found"
**Symptoms**:
```
Error: Workspace {{ workspace_name }} not found
```
**Solutions**:
1. **Check if registered**: `provisioning workspace list`
2. **Register workspace**: `provisioning workspace register {{ workspace_name }} {{ workspace_path }}`
3. **Activate workspace**: `provisioning workspace activate {{ workspace_name }}:{{ default_infra }}`
4. **Verify**: `provisioning workspace active`
---
## Service Installation
### Issue: "Kubernetes installation failed"
**Symptoms**:
```
Error: Failed to install Kubernetes
kubectl: command not found
```
**Solutions**:
1. **Verify server resources**:
```bash
ssh root@{{ servers[0].name }} "free -h" # Check memory
ssh root@{{ servers[0].name }} "df -h" # Check disk
```
2. **Check container runtime**: `ssh root@{{ servers[0].name }} "ctr version"`
3. **Increase server size if needed**: Edit `infra/{{ default_infra }}/servers.ncl` and change plan
4. **Retry installation**: `provisioning taskserv create kubernetes`
---
## Network Connectivity
### Issue: "Servers cannot communicate"
**Symptoms**:
```
ping: ICMP timeout
No route to host
```
**Solutions**:
1. **Verify private IP**: `provisioning server list --infra {{ default_infra }} --format json | jq '.[] | {hostname, private_ip}'`
2. **Test ping**: `ssh root@{{ servers[0].name }} "ping -c 3 <private-ip>"`
3. **Check firewall**: In provider console, verify firewall allows internal traffic
4. **Verify servers are in same zone**: `provisioning server list --infra {{ default_infra }} --format json | jq '.[] | {hostname, zone}'`
---
## Performance & Timeouts
### Issue: "Pricing command times out"
**Symptoms**: `provisioning price --infra {{ default_infra }}` takes 30+ seconds
**Solutions**:
1. **First run is slow**: Caches provider data
2. **Use cache for subsequent runs**: Cache at `data/{{ primary_provider | lower }}_prices.yaml`
3. **Refresh cache**: `rm -f data/*_cache.yaml && provisioning price --infra {{ default_infra }}`
4. **Check network**: `ping -c 5 {{ provider_api_host }}`
---
## Orchestrator Issues
### Issue: "Orchestrator health check fails"
**Symptoms**:
```
curl http://localhost:9090/health
curl: (7) Failed to connect to localhost port 9090
```
**Solutions**:
1. **Check if running**: `ps aux | grep orchestrator`
2. **Check port 9090**: `lsof -i :9090`
3. **Check logs**: `tail -f provisioning/platform/orchestrator/data/orchestrator.log`
4. **Start manually**: `cd provisioning/platform/orchestrator && ./scripts/start-orchestrator.nu --background`
5. **Verify**: `curl http://localhost:9090/health`
---
## Cleanup Issues
### Issue: "Server deletion failed"
**Symptoms**:
```
Error: Failed to delete server
Server is locked or still in use
```
**Solutions**:
1. **Delete through provider console**: Log in and manually delete
2. **Check server state**: `provisioning server list --infra {{ default_infra }} --format json | jq '.[] | {hostname, status}'`
3. **Retry deletion**: `provisioning server delete --infra {{ default_infra }}`
4. **Clean cache**: `provisioning cache clean && rm -rf {{ workspace_path }}/.providers/*/state/*`
5. **Verify**: `provisioning server list --infra {{ default_infra }}`
---
## Debug Procedures
### Check Logs
1. **Orchestrator logs**:
```bash
tail -f provisioning/platform/orchestrator/data/orchestrator.log
```
2. **Enable debug mode**:
```bash
export {{ primary_provider | upper }}_DEBUG=true
provisioning --debug server create --infra {{ default_infra }}
```
3. **Server logs**:
```bash
ssh root@{{ servers[0].name }} "tail -f /var/log/syslog"
```
---
## Diagnosis Checklist
```
[ ] Verify {{ primary_provider | title }} credentials
[ ] Check SSH key exists
[ ] Validate Nickel config
[ ] Workspace registered
[ ] Workspace activated
[ ] Workspace validated
[ ] Orchestrator running
[ ] Dry-run passes
[ ] Check provider status
[ ] Review logs
```
---
## Quick Reference
**Common Commands**:
```bash
# Validate
provisioning validate config
nu workspace.nu validate
nickel typecheck infra/{{ default_infra }}/main.ncl
# Deploy
provisioning -c server create --infra {{ default_infra }}
provisioning server create --infra {{ default_infra }}
# Verify
provisioning server list --infra {{ default_infra }}
provisioning price --infra {{ default_infra }}
# Debug
provisioning --debug server create --infra {{ default_infra }}
tail -f provisioning/platform/orchestrator/data/orchestrator.log
```
---
## Next Steps
After troubleshooting:
1. Validate configuration again
2. Perform dry-run deployment
3. Deploy infrastructure
4. Verify deployment health
For more information:
- Deployment Guide: `deployment-guide.md`
- Configuration Guide: `configuration-guide.md`