# Pre-Deployment Checklist
Critical verification steps before any VAPORA deployment to production or staging.
---
## 24 Hours Before Deployment
### Communication & Scheduling
- [ ] Schedule deployment with team (record in calendar/ticket)
- [ ] Post in #deployments channel: "Deployment scheduled for [DATE TIME UTC]"
- [ ] Identify on-call engineer for deployment period
- [ ] Brief on-call on deployment plan and rollback procedure
- [ ] Ensure affected teams (support, product, etc.) are notified
- [ ] Verify no other critical infrastructure changes scheduled same time window
### Change Documentation
- [ ] Create GitHub issue or ticket tracking the deployment
- [ ] Document: what's changing (configs, manifests, versions)
- [ ] Document: why (bug fix, feature, performance, security)
- [ ] Document: rollback plan (revision number or previous config)
- [ ] Document: success criteria (what indicates successful deployment)
- [ ] Document: estimated duration (usually 5-15 minutes)
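A minimal ticket skeleton covering these fields (a sketch; adapt to your tracker):
```
DEPLOYMENT: [short description]
What's changing: [configs / manifests / image versions]
Why: [bug fix / feature / performance / security]
Rollback plan: [revision number or previous config]
Success criteria: [e.g. all pods Ready, health checks 200, error rate at baseline]
Estimated duration: [5-15 minutes]
```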
### Code Review & Validation
- [ ] All provisioning changes merged and code reviewed
- [ ] Confirm `main` branch has latest changes
- [ ] Run validation locally: `nu scripts/validate-config.nu --mode enterprise`
- [ ] Verify all 3 modes (solo, multiuser, enterprise) validate without errors or critical warnings (see the loop sketch below)
- [ ] Check git log for unexpected commits
- [ ] Review artifact generation: ensure configs are correct
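To cover all three modes in one pass, a small loop works (a sketch; assumes the validation script exits non-zero on failure):
```bash
# Validate every deployment mode; stop at the first failure
for mode in solo multiuser enterprise; do
  echo "Validating mode: $mode"
  nu scripts/validate-config.nu --mode "$mode" || { echo "✗ $mode failed"; exit 1; }
done
echo "✓ All modes validated"
```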
---
## 4 Hours Before Deployment
### Environment Verification
#### Staging Environment
- [ ] Access staging Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify cluster is healthy: `kubectl get nodes` (all Ready)
- [ ] Check namespace exists: `kubectl get namespace vapora`
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check ConfigMap is up to date: `kubectl get configmap -n vapora -o yaml | head -20`
#### Production Environment (if applicable)
- [ ] Access production Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify all nodes healthy: `kubectl get nodes` (all Ready)
- [ ] Check current resource usage: `kubectl top nodes` (not near capacity)
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check pod status: `kubectl get pods -n vapora` (all Running)
- [ ] Verify recent events: `kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10`
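To make the "all nodes Ready" checks above scriptable, something like the following works (a sketch):
```bash
# Exit non-zero if any node is not Ready
kubectl get nodes --no-headers \
  | awk '$2 != "Ready" { print "Not Ready:", $1; bad = 1 } END { exit bad }' \
  && echo "✓ All nodes Ready"
```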
### Health Baseline
- [ ] Record current metrics before deployment (see the baseline capture sketch at the end of this subsection)
- CPU usage per deployment
- Memory usage per deployment
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Queue depth (if applicable)
- [ ] Verify services are responsive:
```bash
# Assumes the backend API is reachable locally (e.g. via a port-forward)
curl http://localhost:8001/health -H "Authorization: Bearer $TOKEN"
curl http://localhost:8001/api/projects
```
- [ ] Check logs for recent errors:
```bash
kubectl logs deployment/vapora-backend -n vapora --tail=50
kubectl logs deployment/vapora-agents -n vapora --tail=50
```
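One way to capture the baseline items above in a single timestamped file (a sketch; the endpoint and port assume the local setup shown above, adjust to your environment):
```bash
# Capture a pre-deployment baseline into a timestamped file
# (latency percentiles and error rates come from your metrics dashboard)
BASELINE="baseline-$(date -u +%Y%m%dT%H%M%SZ).txt"
{
  echo "== Node and pod resource usage =="
  kubectl top nodes
  kubectl top pods -n vapora
  echo "== Pod status =="
  kubectl get pods -n vapora -o wide
  echo "== Backend health and latency =="
  curl -s -o /dev/null -w "health: HTTP %{http_code} in %{time_total}s\n" \
    -H "Authorization: Bearer $TOKEN" http://localhost:8001/health
} | tee "$BASELINE"
```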
### Infrastructure Check
- [ ] Verify storage is not near capacity: `df -h /var/lib/vapora`
- [ ] Check database health: `kubectl exec -n vapora <pod> -- surreal info`
- [ ] Verify backups are recent (within 24 hours; a recency-check sketch follows this list)
- [ ] Check SSL certificate expiration: `openssl s_client -connect api.vapora.com:443 -servername api.vapora.com </dev/null 2>/dev/null | openssl x509 -noout -dates`
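For the backup-recency item, a simple file-age check is often enough (a sketch; the path `/var/backups/vapora` is an assumption, point it at wherever your backups actually land):
```bash
# Assumption: backups are written under /var/backups/vapora
find /var/backups/vapora -type f -mmin -1440 | grep -q . \
  && echo "✓ Backup newer than 24 hours found" \
  || echo "✗ No backup in the last 24 hours"
```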
---
## 2 Hours Before Deployment
### Artifact Preparation
- [ ] Trigger validation in CI/CD pipeline
- [ ] Wait for artifact generation to complete
- [ ] Download artifacts from pipeline:
```bash
# From GitHub Actions or Woodpecker UI
# Download: deployment-artifacts.zip
```
- [ ] Verify artifact contents:
```bash
unzip deployment-artifacts.zip
ls -la
# Should contain:
# - configmap.yaml
# - deployment.yaml
# - docker-compose.yml
# - vapora-{solo,multiuser,enterprise}.{toml,yaml,json}
```
- [ ] Validate manifest syntax:
```bash
yq eval '.' configmap.yaml > /dev/null && echo "✓ ConfigMap valid"
yq eval '.' deployment.yaml > /dev/null && echo "✓ Deployment valid"
```
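Optionally, syntax-check the generated per-mode configs as well (a sketch; uses the same `yq` as above):
```bash
# YAML is a superset of JSON, so yq parses both generated variants
for f in vapora-*.yaml vapora-*.json; do
  yq eval '.' "$f" > /dev/null && echo "✓ $f valid" || echo "✗ $f invalid"
done
```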
### Test in Staging
- [ ] Perform dry-run deployment to staging cluster:
```bash
kubectl apply -f configmap.yaml --dry-run=server -n vapora
kubectl apply -f deployment.yaml --dry-run=server -n vapora
```
- [ ] Review dry-run output for any warnings or errors
- [ ] If a test deployment is available, perform the actual staging deployment and verify:
```bash
kubectl get deployments -n vapora
kubectl get pods -n vapora
kubectl logs deployment/vapora-backend -n vapora --tail=5
```
- [ ] Test health endpoints on staging
- [ ] Run smoke tests against staging (if available)
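If no formal smoke-test suite exists, a minimal check of the known endpoints can stand in (a sketch; `STAGING_URL` is a placeholder for however you reach the staging backend, e.g. a port-forward):
```bash
# Minimal staging smoke test against known endpoints
STAGING_URL="${STAGING_URL:-http://localhost:8001}"
for path in /health /api/projects; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $TOKEN" "$STAGING_URL$path")
  echo "$path -> HTTP $code"
  [ "$code" = "200" ] || { echo "✗ Smoke test failed on $path"; exit 1; }
done
echo "✓ Smoke tests passed"
```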
### Rollback Plan Verification
- [ ] Document current deployment revisions:
```bash
kubectl rollout history deployment/vapora-backend -n vapora
# Record the highest revision number
```
- [ ] Create backup of current ConfigMap:
```bash
kubectl get configmap -n vapora vapora-config -o yaml > configmap-backup.yaml
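# To restore this backup later if needed, re-apply the saved manifest:
# kubectl apply -f configmap-backup.yaml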
```
- [ ] Test rollback procedure on staging (if safe):
```bash
# Record current revision
CURRENT_REV=$(kubectl rollout history deployment/vapora-backend -n vapora | tail -1 | awk '{print $1}')
# Test undo
kubectl rollout undo deployment/vapora-backend -n vapora
# Verify rollback
kubectl get deployment vapora-backend -n vapora -o yaml | grep image
# Restore to current
kubectl rollout undo deployment/vapora-backend -n vapora --to-revision=$CURRENT_REV
```
- [ ] Confirm rollback command is documented in ticket/issue
---
## 1 Hour Before Deployment
### Final Checks
- [ ] Confirm all prerequisites met:
- [ ] Code merged to main
- [ ] Artifacts generated and validated
- [ ] Staging deployment tested
- [ ] Rollback plan documented
- [ ] Team notified
### Communication Setup
- [ ] Set status page to "Maintenance Mode" (if public)
```
"VAPORA maintenance deployment starting at HH:MM UTC.
Expected duration: 10 minutes. Services may be briefly unavailable."
```
- [ ] Join #deployments Slack channel
- [ ] Prepare message: "🚀 Deployment starting now. Will update every 2 minutes."
- [ ] Have on-call engineer monitoring
- [ ] Verify monitoring/alerting dashboards are accessible
### Access Verification
- [ ] Verify kubeconfig is valid and up to date:
```bash
kubectl cluster-info
kubectl get nodes
```
- [ ] Verify kubectl version compatibility:
```bash
kubectl version
# Should match server version reasonably (within 1 minor version)
```
- [ ] Test write access to cluster:
```bash
kubectl auth can-i create deployments --namespace=vapora
# Should return "yes"
```
- [ ] Verify docker/docker-compose access (if Docker deployment)
- [ ] Verify Slack webhook is working (test send message)
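A quick way to test the webhook (a sketch; `SLACK_WEBHOOK_URL` is assumed to hold your incoming-webhook URL):
```bash
# Send a test message to the #deployments webhook
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"text": "Pre-deployment webhook test - please ignore"}' \
  "$SLACK_WEBHOOK_URL"
```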
---
## 15 Minutes Before Deployment
### Final Go/No-Go Decision
**STOP HERE** and make the final decision to proceed or reschedule:
**Proceed IF:**
- ✅ All checklist items above completed
- ✅ No critical issues found during testing
- ✅ Staging deployment successful
- ✅ Team ready and monitoring
- ✅ Rollback plan clear and tested
- ✅ Within designated maintenance window
**RESCHEDULE IF:**
- ❌ Any critical issues discovered
- ❌ Staging tests failed
- ❌ Team member unavailable
- ❌ Production issues detected
- ❌ Unexpected changes in code/configs
### Final Notifications
If proceeding:
- [ ] Post to #deployments: "🚀 Deployment starting in 5 minutes"
- [ ] Alert on-call engineer: "Ready to start - confirm you're monitoring"
- [ ] Have rollback plan visible and accessible
- [ ] Open monitoring dashboard showing current metrics
### Terminal Setup
- [ ] Open terminal with kubeconfig configured:
```bash
export KUBECONFIG=/path/to/production/kubeconfig
kubectl cluster-info # Verify connected to production
```
- [ ] Open second terminal for tailing logs:
```bash
kubectl logs -f deployment/vapora-backend -n vapora
```
- [ ] Have rollback commands ready:
```bash
# For quick rollback if needed
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-agents -n vapora
kubectl rollout undo deployment/vapora-llm-router -n vapora
```
- [ ] Prepare metrics check script:
```bash
# Run each watch in its own terminal or tmux pane
watch kubectl top pods -n vapora
watch kubectl get pods -n vapora
```
---
## Success Criteria Verification
Document what "success" looks like for this deployment:
- [ ] All three deployments (backend, agents, llm-router) have updated image IDs
- [ ] All pods reach "Ready" state within 5 minutes
- [ ] No pod restarts: `kubectl get pods -n vapora --watch` (RESTARTS column not increasing)
- [ ] No error logs in first 2 minutes
- [ ] Health endpoints respond (200 OK)
- [ ] API endpoints respond to test requests
- [ ] Metrics show normal resource usage
- [ ] No alerts triggered
- [ ] Support team reports no user impact
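Several of these checks can be scripted (a sketch; deployment names match the rollback commands earlier in this checklist):
```bash
# Wait for each rollout to complete (fails if not Ready within 5 minutes)
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/"$d" -n vapora --timeout=5m || exit 1
done
# Confirm pod status and scan the first minutes of logs for errors
kubectl get pods -n vapora
kubectl logs deployment/vapora-backend -n vapora --since=2m | grep -i error \
  || echo "✓ No errors in recent backend logs"
```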
---
## Team Roles During Deployment
### Deployment Lead
- Executes deployment commands
- Monitors progress
- Communicates status updates
- Decides to proceed/rollback
### On-Call Engineer
- Monitors dashboards and alerts
- Watches for anomalies
- Prepares for rollback if needed
- Available for emergency decisions
### Communications Lead (optional)
- Updates #deployments channel
- Notifies support/product teams
- Updates status page if public
- Handles external communication
### Backup Person
- Monitors for issues
- Ready to assist with troubleshooting
- Prepares rollback procedures
- Escalates if needed
---
## Common Issues to Watch For
⚠️ **Pod CrashLoopBackOff**
- Indicates config or image issue
- Check pod logs: `kubectl logs <pod> -n vapora`
- Check events: `kubectl describe pod <pod> -n vapora`
- **Action**: Rollback immediately
⚠️ **Pending Pods (not starting)**
- Check resource availability: `kubectl describe pod <pod> -n vapora`
- Check node capacity
- **Action**: Investigate or rollback if resource exhausted
⚠️ **High Error Rate**
- Check application logs
- Compare with baseline errors
- **Action**: If >10% error increase, rollback
⚠️ **Database Connection Errors**
- Check ConfigMap has correct database URL
- Verify network connectivity to database
- **Action**: Check ConfigMap, fix and reapply if needed
⚠️ **Memory or CPU Spike**
- Monitor trends (sudden spike vs gradual)
- Check if within expected range for new code
- **Action**: Rollback if resource limits exceeded
---
## Post-Deployment Documentation
After deployment completes, record:
- [ ] Deployment start time (UTC)
- [ ] Deployment end time (UTC)
- [ ] Total duration
- [ ] Any issues encountered and resolution
- [ ] Rollback performed (Y/N)
- [ ] Metrics before/after (CPU, memory, latency, errors)
- [ ] Team members involved
- [ ] Blockers or lessons learned
---
## Sign-Off
Use this template for deployment issue/ticket:
```
DEPLOYMENT COMPLETED
✓ All checks passed
✓ Deployment successful
✓ All pods running
✓ Health checks passing
✓ No user impact
Deployed by: [Name]
Start time: [UTC]
Duration: [X minutes]
Rollback needed: No
Metrics:
- Latency (p99): [X]ms
- Error rate: [X]%
- Pod restarts: 0
Next deployment: [Date/Time]
```