# Pre-Deployment Checklist

Critical verification steps before any VAPORA deployment to production or staging.

---

## 24 Hours Before Deployment

### Communication & Scheduling

- [ ] Schedule deployment with team (record in calendar/ticket)
- [ ] Post in #deployments channel: "Deployment scheduled for [DATE TIME UTC]"
- [ ] Identify on-call engineer for deployment period
- [ ] Brief on-call on deployment plan and rollback procedure
- [ ] Ensure affected teams (support, product, etc.) are notified
- [ ] Verify no other critical infrastructure changes are scheduled in the same time window

### Change Documentation

- [ ] Create GitHub issue or ticket tracking the deployment
- [ ] Document: what's changing (configs, manifests, versions)
- [ ] Document: why (bug fix, feature, performance, security)
- [ ] Document: rollback plan (revision number or previous config)
- [ ] Document: success criteria (what indicates successful deployment)
- [ ] Document: estimated duration (usually 5-15 minutes)

### Code Review & Validation

- [ ] All provisioning changes merged and code reviewed
- [ ] Confirm `main` branch has latest changes
- [ ] Run validation locally: `nu scripts/validate-config.nu --mode enterprise`
- [ ] Verify all 3 modes (solo, multiuser, enterprise) validate without errors or critical warnings
- [ ] Check git log for unexpected commits
- [ ] Review artifact generation: ensure configs are correct

---

## 4 Hours Before Deployment

### Environment Verification

#### Staging Environment

- [ ] Access staging Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify cluster is healthy: `kubectl get nodes` (all Ready)
- [ ] Check namespace exists: `kubectl get namespace vapora`
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check ConfigMap is up to date: `kubectl get configmap -n vapora -o yaml | head -20`

#### Production Environment (if applicable)

- [ ] Access production Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify all nodes healthy: `kubectl get nodes` (all Ready)
- [ ] Check current resource usage: `kubectl top nodes` (not near capacity)
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check pod status: `kubectl get pods -n vapora` (all Running)
- [ ] Verify recent events: `kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10`

### Health Baseline

- [ ] Record current metrics before deployment
  - CPU usage per deployment
  - Memory usage per deployment
  - Request latency (p50, p95, p99)
  - Error rate (4xx, 5xx)
  - Queue depth (if applicable)

- [ ] Verify services are responsive:
```bash
curl http://localhost:8001/health -H "Authorization: Bearer $TOKEN"
curl http://localhost:8001/api/projects
```
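
The two `curl` checks above can be wrapped in a small helper that fails on any non-200 response. This is a minimal sketch, not part of the VAPORA tooling: `check_health` is a hypothetical name, the 5-second timeout is an assumed value, and sending the bearer token to both endpoints is an assumption.

```bash
# Hypothetical helper: succeed only when the endpoint answers HTTP 200.
check_health() {
  local url="$1" code
  # -s hides progress, -o discards the body, -w prints only the status code.
  # --max-time (assumed 5s) keeps a hung service from stalling the checklist.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -H "Authorization: Bearer ${TOKEN:-}" "$url") || code="000"
  if [ "$code" = "200" ]; then
    echo "OK   $url"
  else
    echo "FAIL $url (HTTP $code)"
    return 1
  fi
}
```

For example, `check_health http://localhost:8001/health && check_health http://localhost:8001/api/projects` stops at the first endpoint that does not return 200.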

- [ ] Check logs for recent errors:
```bash
kubectl logs deployment/vapora-backend -n vapora --tail=50
kubectl logs deployment/vapora-agents -n vapora --tail=50
```

### Infrastructure Check

- [ ] Verify storage is not near capacity: `df -h /var/lib/vapora`
- [ ] Check database health: `kubectl exec -n vapora <pod> -- surreal info`
- [ ] Verify backups are recent (within 24 hours)
- [ ] Check SSL certificate expiration: `echo | openssl s_client -connect api.vapora.com:443 -servername api.vapora.com 2>/dev/null | openssl x509 -noout -dates`
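
The storage check above can be turned into a pass/fail test instead of reading `df -h` by eye. The sketch below is illustrative only: `disk_usage_pct` and `storage_ok` are hypothetical helpers, and the 80% default limit is an assumed threshold, not a documented VAPORA value.

```bash
# Hypothetical helper: print the usage percentage (0-100) of the
# filesystem that holds the given path.
disk_usage_pct() {
  # -P forces POSIX output: one line per filesystem, capacity in column 5.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed only when usage is below the limit.
# The default of 80 is an assumption; pick a limit that fits the volume.
storage_ok() {
  local path="$1" limit="${2:-80}" pct
  pct=$(disk_usage_pct "$path")
  [ -n "$pct" ] && [ "$pct" -lt "$limit" ]
}

# Usage:
#   storage_ok /var/lib/vapora 80 || echo "storage above threshold"
```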

---

## 2 Hours Before Deployment

### Artifact Preparation

- [ ] Trigger validation in CI/CD pipeline
- [ ] Wait for artifact generation to complete
- [ ] Download artifacts from pipeline:
```bash
# From GitHub Actions or Woodpecker UI
# Download: deployment-artifacts.zip
```

- [ ] Verify artifact contents:
```bash
unzip deployment-artifacts.zip
ls -la
# Should contain:
# - configmap.yaml
# - deployment.yaml
# - docker-compose.yml
# - vapora-{solo,multiuser,enterprise}.{toml,yaml,json}
```
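
The file list above can be checked mechanically rather than by scanning `ls` output. The following is a sketch under one assumption: the brace notation is read as all nine mode/format combinations. `missing_artifacts` is a hypothetical helper name.

```bash
# Hypothetical helper: print every expected artifact missing from a
# directory and return non-zero when anything is missing.
missing_artifacts() {
  local dir="${1:-.}" missing=0 f mode ext
  for f in configmap.yaml deployment.yaml docker-compose.yml; do
    [ -f "$dir/$f" ] || { echo "missing: $f"; missing=1; }
  done
  # Assumed expansion of vapora-{solo,multiuser,enterprise}.{toml,yaml,json}
  for mode in solo multiuser enterprise; do
    for ext in toml yaml json; do
      f="vapora-$mode.$ext"
      [ -f "$dir/$f" ] || { echo "missing: $f"; missing=1; }
    done
  done
  return "$missing"
}
```

Run `missing_artifacts .` in the directory where the archive was unpacked; no output and a zero exit status mean every expected file is present.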

- [ ] Validate manifest syntax:
```bash
yq eval '.' configmap.yaml > /dev/null && echo "✓ ConfigMap valid"
yq eval '.' deployment.yaml > /dev/null && echo "✓ Deployment valid"
```

### Test in Staging

- [ ] Perform dry-run deployment to staging cluster:
```bash
kubectl apply -f configmap.yaml --dry-run=server -n vapora
kubectl apply -f deployment.yaml --dry-run=server -n vapora
```

- [ ] Review dry-run output for any warnings or errors
- [ ] If test deployment available, do actual staging deployment and verify:
```bash
kubectl get deployments -n vapora
kubectl get pods -n vapora
kubectl logs deployment/vapora-backend -n vapora --tail=5
```

- [ ] Test health endpoints on staging
- [ ] Run smoke tests against staging (if available)

### Rollback Plan Verification

- [ ] Document current deployment revisions:
```bash
kubectl rollout history deployment/vapora-backend -n vapora
# Record the highest revision number
```

- [ ] Create backup of current ConfigMap:
```bash
kubectl get configmap -n vapora vapora-config -o yaml > configmap-backup.yaml
```

- [ ] Test rollback procedure on staging (if safe):
```bash
# Record current revision: keep the last numeric first column,
# which skips the header lines and any trailing blank line
CURRENT_REV=$(kubectl rollout history deployment/vapora-backend -n vapora \
  | awk '$1 ~ /^[0-9]+$/ {rev=$1} END {print rev}')

# Test undo
kubectl rollout undo deployment/vapora-backend -n vapora

# Verify rollback
kubectl get deployment vapora-backend -n vapora -o yaml | grep image

# Restore to current
kubectl rollout undo deployment/vapora-backend -n vapora --to-revision="$CURRENT_REV"
```

- [ ] Confirm rollback command is documented in ticket/issue

---

## 1 Hour Before Deployment

### Final Checks

- [ ] Confirm all prerequisites met:
  - [ ] Code merged to main
  - [ ] Artifacts generated and validated
  - [ ] Staging deployment tested
  - [ ] Rollback plan documented
  - [ ] Team notified

### Communication Setup

- [ ] Set status page to "Maintenance Mode" (if public)
```
"VAPORA maintenance deployment starting at HH:MM UTC.
Expected duration: 10 minutes. Services may be briefly unavailable."
```

- [ ] Join #deployments Slack channel
- [ ] Prepare message: "🚀 Deployment starting now. Will update every 2 minutes."
- [ ] Have on-call engineer monitoring
- [ ] Verify monitoring/alerting dashboards are accessible

### Access Verification

- [ ] Verify kubeconfig is valid and up to date:
```bash
kubectl cluster-info
kubectl get nodes
```

- [ ] Verify kubectl version compatibility:
```bash
kubectl version
# Should match server version reasonably (within 1 minor version)
```

- [ ] Test write access to cluster:
```bash
kubectl auth can-i create deployments --namespace=vapora
# Should return "yes"
```
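
A rollout usually needs more than the single permission tested above. The sketch below loops over several verb/resource pairs with `kubectl auth can-i`, which exits non-zero when the answer is "no". `check_access` is a hypothetical helper, and the pairs listed are assumptions; adjust them to what the deployment actually touches.

```bash
# Hypothetical helper: report each permission and fail if any is denied.
check_access() {
  local ns="${1:-vapora}" failed=0 verb resource
  while read -r verb resource; do
    if kubectl auth can-i "$verb" "$resource" --namespace="$ns" >/dev/null 2>&1; then
      echo "yes  $verb $resource"
    else
      echo "NO   $verb $resource"
      failed=1
    fi
  done <<'EOF'
create deployments
update deployments
update configmaps
get pods
EOF
  return "$failed"
}
```

Calling `check_access vapora` prints one line per permission and returns non-zero if any of them is denied.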

- [ ] Verify docker/docker-compose access (if Docker deployment)
- [ ] Verify Slack webhook is working (test send message)

---

## 15 Minutes Before Deployment

### Final Go/No-Go Decision

**STOP HERE** and make final decision to proceed or reschedule:

**Proceed IF:**
- ✅ All checklist items above completed
- ✅ No critical issues found during testing
- ✅ Staging deployment successful
- ✅ Team ready and monitoring
- ✅ Rollback plan clear and tested
- ✅ Within designated maintenance window

**RESCHEDULE IF:**
- ❌ Any critical issues discovered
- ❌ Staging tests failed
- ❌ Team member unavailable
- ❌ Production issues detected
- ❌ Unexpected changes in code/configs

### Final Notifications

If proceeding:
- [ ] Post to #deployments: "🚀 Deployment starting in 5 minutes"
- [ ] Alert on-call engineer: "Ready to start - confirm you're monitoring"
- [ ] Have rollback plan visible and accessible
- [ ] Open monitoring dashboard showing current metrics

### Terminal Setup

- [ ] Open terminal with kubeconfig configured:
```bash
export KUBECONFIG=/path/to/production/kubeconfig
kubectl cluster-info # Verify connected to production
```

- [ ] Open second terminal for tailing logs:
```bash
kubectl logs -f deployment/vapora-backend -n vapora
```

- [ ] Have rollback commands ready:
```bash
# For quick rollback if needed
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-agents -n vapora
kubectl rollout undo deployment/vapora-llm-router -n vapora
```
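
Under pressure it can help to have the three undo commands behind one function that also waits for each rollback to settle. This is a sketch, not an official VAPORA script: `rollback_all` is a hypothetical name and the 120-second default timeout is an assumed value.

```bash
# Hypothetical helper: undo each deployment, then wait for it to settle.
rollback_all() {
  local ns="${1:-vapora}" timeout="${2:-120s}" d rc=0
  for d in vapora-backend vapora-agents vapora-llm-router; do
    echo "Rolling back $d"
    kubectl rollout undo "deployment/$d" -n "$ns" || rc=1
    # rollout status blocks until the rollout finishes or the timeout expires.
    kubectl rollout status "deployment/$d" -n "$ns" --timeout="$timeout" || rc=1
  done
  return "$rc"
}
```

`rollback_all vapora 120s` keeps going after a failure so that every deployment is attempted, and returns non-zero if any step failed.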

- [ ] Prepare metrics check script:
```bash
# Each watch blocks its terminal; run them in separate terminals or panes
watch kubectl top pods -n vapora
watch kubectl get pods -n vapora
```

---

## Success Criteria Verification

Document what "success" looks like for this deployment:

- [ ] All three deployments have updated image IDs
- [ ] All pods reach "Ready" state within 5 minutes
- [ ] No pod restarts: `kubectl get pods -n vapora --watch` (RESTARTS column not increasing)
- [ ] No error logs in first 2 minutes
- [ ] Health endpoints respond (200 OK)
- [ ] API endpoints respond to test requests
- [ ] Metrics show normal resource usage
- [ ] No alerts triggered
- [ ] Support team reports no user impact
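
The "no pod restarts" criterion can be reduced to a single number instead of watching the RESTARTS column. A minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE); `total_restarts` is a hypothetical helper.

```bash
# Hypothetical helper: sum the RESTARTS column of
# `kubectl get pods --no-headers` output read from stdin.
total_restarts() {
  # Column 4 is RESTARTS; a suffix such as "(3m ago)" falls into
  # later columns and is ignored.
  awk '{ sum += $4 } END { print sum + 0 }'
}

# Usage: a non-zero total shortly after the rollout deserves a closer look.
#   kubectl get pods -n vapora --no-headers | total_restarts
```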

---

## Team Roles During Deployment

### Deployment Lead
- Executes deployment commands
- Monitors progress
- Communicates status updates
- Decides whether to proceed or roll back

### On-Call Engineer
- Monitors dashboards and alerts
- Watches for anomalies
- Prepares for rollback if needed
- Available for emergency decisions

### Communications Lead (optional)
- Updates #deployments channel
- Notifies support/product teams
- Updates status page if public
- Handles external communication

### Backup Person
- Monitors for issues
- Ready to assist with troubleshooting
- Prepares rollback procedures
- Escalates if needed

---

## Common Issues to Watch For

⚠️ **Pod CrashLoopBackOff**
- Indicates config or image issue
- Check pod logs: `kubectl logs <pod> -n vapora`
- Check events: `kubectl describe pod <pod> -n vapora`
- **Action**: Roll back immediately

⚠️ **Pending Pods (not starting)**
- Check resource availability: `kubectl describe pod <pod> -n vapora`
- Check node capacity
- **Action**: Investigate, or roll back if resources are exhausted

⚠️ **High Error Rate**
- Check application logs
- Compare with baseline errors
- **Action**: If the error rate increases by more than 10%, roll back
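
The rollback threshold above can be evaluated against the baseline recorded before deployment. The sketch below makes one loud assumption: the increase is measured in percentage points of error rate (for example 0.5% to 12% is an increase of 11.5 points). If the threshold is meant as a relative increase, the comparison needs to change. `error_rate_regressed` is a hypothetical helper.

```bash
# Hypothetical helper: exit 0 when the error rate rose by more than
# the threshold. Arguments are percentages: baseline, current, and an
# optional threshold in percentage points (default 10, an assumption).
error_rate_regressed() {
  local baseline="$1" current="$2" threshold="${3:-10}"
  awk -v b="$baseline" -v c="$current" -v t="$threshold" \
    'BEGIN { exit !((c - b) > t) }'
}

# Usage:
#   error_rate_regressed 0.5 12.0 && echo "roll back"
```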

⚠️ **Database Connection Errors**
- Check ConfigMap has correct database URL
- Verify network connectivity to database
- **Action**: Check ConfigMap, fix and reapply if needed

⚠️ **Memory or CPU Spike**
- Monitor trends (sudden spike vs gradual)
- Check if within expected range for new code
- **Action**: Roll back if resource limits are exceeded

---

## Post-Deployment Documentation

After deployment completes, record:

- [ ] Deployment start time (UTC)
- [ ] Deployment end time (UTC)
- [ ] Total duration
- [ ] Any issues encountered and resolution
- [ ] Rollback performed (Y/N)
- [ ] Metrics before/after (CPU, memory, latency, errors)
- [ ] Team members involved
- [ ] Blockers or lessons learned

---

## Sign-Off

Use this template for the deployment issue/ticket:

```
DEPLOYMENT COMPLETED

✓ All checks passed
✓ Deployment successful
✓ All pods running
✓ Health checks passing
✓ No user impact

Deployed by: [Name]
Start time: [UTC]
Duration: [X minutes]
Rollback needed: No

Metrics:
- Latency (p99): [X]ms
- Error rate: [X]%
- Pod restarts: 0

Next deployment: [Date/Time]
```