
# Pre-Deployment Checklist

Critical verification steps before any VAPORA deployment to production or staging.

---

## 24 Hours Before Deployment

### Communication & Scheduling

- [ ] Schedule deployment with team (record in calendar/ticket)
- [ ] Post in #deployments channel: "Deployment scheduled for [DATE TIME UTC]"
- [ ] Identify on-call engineer for deployment period
- [ ] Brief on-call on deployment plan and rollback procedure
- [ ] Ensure affected teams (support, product, etc.) are notified
- [ ] Verify no other critical infrastructure changes are scheduled in the same time window

### Change Documentation

- [ ] Create GitHub issue or ticket tracking the deployment
- [ ] Document: what's changing (configs, manifests, versions)
- [ ] Document: why (bug fix, feature, performance, security)
- [ ] Document: rollback plan (revision number or previous config)
- [ ] Document: success criteria (what indicates successful deployment)
- [ ] Document: estimated duration (usually 5-15 minutes)

### Code Review & Validation

- [ ] All provisioning changes merged and code reviewed
- [ ] Confirm `main` branch has latest changes
- [ ] Run validation locally: `nu scripts/validate-config.nu --mode enterprise`
- [ ] Verify all 3 modes validate without errors or critical warnings
- [ ] Check git log for unexpected commits
- [ ] Review artifact generation: ensure configs are correct
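
The validation command in this list covers only `enterprise`, while the next item asks for all three modes. A minimal sketch of looping over them follows. It assumes the mode names are `solo`, `multiuser`, and `enterprise` (inferred from the artifact file names later in this checklist) and that the validator exits non-zero on failure; `validate_all_modes` is a hypothetical helper name.

```bash
# Run a validator command once per mode and report any failing mode.
# Assumption: the validator accepts `--mode <name>` and exits non-zero on failure.
validate_all_modes() {
  failed=0
  for mode in solo multiuser enterprise; do
    if "$@" --mode "$mode"; then
      echo "OK   $mode"
    else
      echo "FAIL $mode"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: validate_all_modes nu scripts/validate-config.nu
```

The function keeps going after a failure so that a single run lists every broken mode.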

---

## 4 Hours Before Deployment

### Environment Verification

#### Staging Environment

- [ ] Access staging Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify cluster is healthy: `kubectl get nodes` (all Ready)
- [ ] Check namespace exists: `kubectl get namespace vapora`
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check ConfigMap is up to date: `kubectl get configmap -n vapora -o yaml | head -20`

#### Production Environment (if applicable)

- [ ] Access production Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify all nodes healthy: `kubectl get nodes` (all Ready)
- [ ] Check current resource usage: `kubectl top nodes` (not near capacity)
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check pod status: `kubectl get pods -n vapora` (all Running)
- [ ] Verify recent events: `kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10`
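
Scanning `kubectl get` output by eye is easy to get wrong on larger clusters. As a rough sketch, the "all Ready" and "all Running" checks can be reduced to filters that print only the offenders. The helper names are hypothetical, and the column positions assume the default `kubectl get` output with `--no-headers`.

```bash
# Print the names of nodes whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Note: cordoned nodes ("Ready,SchedulingDisabled") are reported as well.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Print the names of pods whose STATUS is not "Running".
# Reads `kubectl get pods --no-headers` output on stdin.
not_running_pods() {
  awk '$3 != "Running" { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pods -n vapora --no-headers | not_running_pods
```

Empty output from both filters means the check passes.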

### Health Baseline

- [ ] Record current metrics before deployment
  - CPU usage per deployment
  - Memory usage per deployment
  - Request latency (p50, p95, p99)
  - Error rate (4xx, 5xx)
  - Queue depth (if applicable)

- [ ] Verify services are responsive:
  ```bash
  curl http://localhost:8001/health -H "Authorization: Bearer $TOKEN"
  curl http://localhost:8001/api/projects
  ```

- [ ] Check logs for recent errors:
  ```bash
  kubectl logs deployment/vapora-backend -n vapora --tail=50
  kubectl logs deployment/vapora-agents -n vapora --tail=50
  ```
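
To make the recorded baseline comparable after the deployment, it can help to reduce `kubectl top` output to a couple of totals. The sketch below is one way to do that. It assumes CPU is reported in millicores (`12m`) and memory in mebibytes (`128Mi`); `summarize_top` and the baseline file name are illustrative, not part of VAPORA.

```bash
# Sum CPU (millicores) and memory (Mi) across pods.
# Reads `kubectl top pods --no-headers` output on stdin, e.g.:
#   vapora-backend-abc   12m   128Mi
# Assumption: CPU carries an "m" suffix and memory an "Mi" suffix.
summarize_top() {
  awk '{ cpu += $2 + 0; mem += $3 + 0 }
       END { printf "cpu_m=%d mem_mi=%d\n", cpu, mem }'
}

# Usage (file name is illustrative):
#   kubectl top pods -n vapora --no-headers | summarize_top > baseline.txt
```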

### Infrastructure Check

- [ ] Verify storage is not near capacity: `df -h /var/lib/vapora`
- [ ] Check database health: `kubectl exec -n vapora <pod> -- surreal info`
- [ ] Verify backups are recent (within 24 hours)
- [ ] Check SSL certificate expiration: `openssl s_client -connect api.vapora.com:443 -servername api.vapora.com </dev/null 2>/dev/null | openssl x509 -noout -dates`
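
"Not near capacity" is easier to enforce with an explicit threshold. The following sketch wraps `df` in a pass/fail check. The 80% figure in the usage line is an assumed threshold, not a VAPORA requirement, and both function names are hypothetical.

```bash
# Print the used-space percentage (without "%") for the filesystem holding a path.
# `df -P` forces the portable one-line-per-filesystem format.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only while usage is below the threshold (percent).
check_disk() {
  usage=$(disk_usage_pct "$1")
  [ -n "$usage" ] || { echo "ERROR: cannot read usage for $1" >&2; return 2; }
  if [ "$usage" -ge "$2" ]; then
    echo "WARN: $1 is at ${usage}% (threshold $2%)"
    return 1
  fi
  echo "OK: $1 is at ${usage}%"
}

# Usage: check_disk /var/lib/vapora 80
```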

---

## 2 Hours Before Deployment

### Artifact Preparation

- [ ] Trigger validation in CI/CD pipeline
- [ ] Wait for artifact generation to complete
- [ ] Download artifacts from pipeline:
  ```bash
  # From GitHub Actions or Woodpecker UI
  # Download: deployment-artifacts.zip
  ```

- [ ] Verify artifact contents:
  ```bash
  unzip deployment-artifacts.zip
  ls -la
  # Should contain:
  # - configmap.yaml
  # - deployment.yaml
  # - docker-compose.yml
  # - vapora-{solo,multiuser,enterprise}.{toml,yaml,json}
  ```

- [ ] Validate manifest syntax:
  ```bash
  yq eval '.' configmap.yaml > /dev/null && echo "✓ ConfigMap valid"
  yq eval '.' deployment.yaml > /dev/null && echo "✓ Deployment valid"
  ```
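
The "Should contain" comment above can be turned into an automated check. This sketch lists any expected file that is missing from the unzipped directory; the file list is copied from that comment, and `missing_artifacts` is a hypothetical helper name.

```bash
# List expected artifact files that are absent from a directory.
# The file list mirrors the "Should contain" comment in this checklist.
missing_artifacts() {
  dir=$1
  for f in configmap.yaml deployment.yaml docker-compose.yml; do
    [ -f "$dir/$f" ] || echo "$f"
  done
  for mode in solo multiuser enterprise; do
    for ext in toml yaml json; do
      [ -f "$dir/vapora-$mode.$ext" ] || echo "vapora-$mode.$ext"
    done
  done
}

# Usage:
#   missing=$(missing_artifacts .)
#   [ -z "$missing" ] || echo "Missing artifacts: $missing"
```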

### Test in Staging

- [ ] Perform dry-run deployment to staging cluster:
  ```bash
  kubectl apply -f configmap.yaml --dry-run=server -n vapora
  kubectl apply -f deployment.yaml --dry-run=server -n vapora
  ```

- [ ] Review dry-run output for any warnings or errors
- [ ] If a test deployment is available, perform an actual staging deployment and verify:
  ```bash
  kubectl get deployments -n vapora
  kubectl get pods -n vapora
  kubectl logs deployment/vapora-backend -n vapora --tail=5
  ```

- [ ] Test health endpoints on staging
- [ ] Run smoke tests against staging (if available)
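
After an actual staging deployment, `kubectl rollout status` gives a clearer pass/fail signal than inspecting pod lists by hand. The sketch below waits on each deployment in turn. The deployment names are the three used in the rollback commands of this checklist, the 300-second timeout mirrors the five-minute readiness target, and the `KUBECTL` override and `wait_for_rollouts` name are assumptions added for illustration.

```bash
# Command used to talk to the cluster; overridable for testing.
KUBECTL=${KUBECTL:-kubectl}

# Wait for each named deployment to finish rolling out, stopping at the first failure.
# `kubectl rollout status` exits non-zero if the rollout fails or the timeout elapses.
wait_for_rollouts() {
  ns=$1
  shift
  for deploy in "$@"; do
    if ! $KUBECTL rollout status "deployment/$deploy" -n "$ns" --timeout=300s; then
      echo "Rollout of $deploy did not complete" >&2
      return 1
    fi
  done
}

# Usage: wait_for_rollouts vapora vapora-backend vapora-agents vapora-llm-router
```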

### Rollback Plan Verification

- [ ] Document current deployment revisions:
  ```bash
  kubectl rollout history deployment/vapora-backend -n vapora
  # Record the highest revision number
  ```

- [ ] Create backup of current ConfigMap:
  ```bash
  kubectl get configmap -n vapora vapora-config -o yaml > configmap-backup.yaml
  ```

- [ ] Test rollback procedure on staging (if safe):
  ```bash
  # Record current revision (last numeric row; the history output ends with a blank line)
  CURRENT_REV=$(kubectl rollout history deployment/vapora-backend -n vapora | awk '$1 ~ /^[0-9]+$/ {rev=$1} END {print rev}')

  # Test undo
  kubectl rollout undo deployment/vapora-backend -n vapora

  # Verify rollback
  kubectl get deployment vapora-backend -n vapora -o yaml | grep image

  # Restore to current
  kubectl rollout undo deployment/vapora-backend -n vapora --to-revision=$CURRENT_REV
  ```

- [ ] Confirm rollback command is documented in ticket/issue
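
`kubectl rollout history` prints two header lines and usually a trailing blank line, so taking the last line of its output can yield an empty value. A small filter that keeps only numeric rows is one way around this; `latest_revision` is a hypothetical helper name.

```bash
# Print the highest revision number from `kubectl rollout history` output on stdin.
# Only rows whose first column is numeric are considered, which skips the
# header lines and the trailing blank line.
latest_revision() {
  awk '$1 ~ /^[0-9]+$/ { rev = $1 } END { if (rev != "") print rev }'
}

# Usage:
#   kubectl rollout history deployment/vapora-backend -n vapora | latest_revision
```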

---

## 1 Hour Before Deployment

### Final Checks

- [ ] Confirm all prerequisites met:
  - [ ] Code merged to main
  - [ ] Artifacts generated and validated
  - [ ] Staging deployment tested
  - [ ] Rollback plan documented
  - [ ] Team notified

### Communication Setup

- [ ] Set status page to "Maintenance Mode" (if public)
  ```
  "VAPORA maintenance deployment starting at HH:MM UTC.
   Expected duration: 10 minutes. Services may be briefly unavailable."
  ```

- [ ] Join #deployments Slack channel
- [ ] Prepare message: "🚀 Deployment starting now. Will update every 2 minutes."
- [ ] Have on-call engineer monitoring
- [ ] Verify monitoring/alerting dashboards are accessible

### Access Verification

- [ ] Verify kubeconfig is valid and up to date:
  ```bash
  kubectl cluster-info
  kubectl get nodes
  ```

- [ ] Verify kubectl version compatibility:
  ```bash
  kubectl version
  # Client version should be within one minor version of the server version
  ```

- [ ] Test write access to cluster:
  ```bash
  kubectl auth can-i create deployments --namespace=vapora
  # Should return "yes"
  ```

- [ ] Verify docker/docker-compose access (if Docker deployment)
- [ ] Verify Slack webhook is working (test send message)
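
The "within 1 minor version" rule can be checked mechanically once the client and server version strings are known. The sketch below compares two strings of the form `v1.29.3`; the helper names are hypothetical, and the usage line assumes `jq` is installed.

```bash
# Extract the minor version number from a string such as "v1.29.3".
minor_of() {
  echo "$1" | sed 's/^v//' | cut -d. -f2
}

# Print the absolute difference between client and server minor versions.
minor_skew() {
  c=$(minor_of "$1")
  s=$(minor_of "$2")
  d=$(( c - s ))
  if [ "$d" -lt 0 ]; then
    d=$(( 0 - d ))
  fi
  echo "$d"
}

# Usage (assumes jq is available):
#   client=$(kubectl version -o json | jq -r .clientVersion.gitVersion)
#   server=$(kubectl version -o json | jq -r .serverVersion.gitVersion)
#   [ "$(minor_skew "$client" "$server")" -le 1 ] || echo "Version skew too large"
```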

---

## 15 Minutes Before Deployment

### Final Go/No-Go Decision

**STOP HERE** and make the final decision to proceed or reschedule:

**Proceed IF:**
- ✅ All checklist items above completed
- ✅ No critical issues found during testing
- ✅ Staging deployment successful
- ✅ Team ready and monitoring
- ✅ Rollback plan clear and tested
- ✅ Within designated maintenance window

**RESCHEDULE IF:**
- ❌ Any critical issues discovered
- ❌ Staging tests failed
- ❌ Team member unavailable
- ❌ Production issues detected
- ❌ Unexpected changes in code/configs

### Final Notifications

If proceeding:
- [ ] Post to #deployments: "🚀 Deployment starting in 5 minutes"
- [ ] Alert on-call engineer: "Ready to start - confirm you're monitoring"
- [ ] Have rollback plan visible and accessible
- [ ] Open monitoring dashboard showing current metrics

### Terminal Setup

- [ ] Open terminal with kubeconfig configured:
  ```bash
  export KUBECONFIG=/path/to/production/kubeconfig
  kubectl cluster-info  # Verify connected to production
  ```

- [ ] Open second terminal for tailing logs:
  ```bash
  kubectl logs -f deployment/vapora-backend -n vapora
  ```

- [ ] Have rollback commands ready:
  ```bash
  # For quick rollback if needed
  kubectl rollout undo deployment/vapora-backend -n vapora
  kubectl rollout undo deployment/vapora-agents -n vapora
  kubectl rollout undo deployment/vapora-llm-router -n vapora
  ```

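- [ ] Prepare metrics watch commands (each `watch` blocks, so run them in separate terminals):
  ```bash
  # In one terminal: resource usage
  watch kubectl top pods -n vapora
  # In another terminal: pod status
  watch kubectl get pods -n vapora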
  ```
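
Because the same commands run against staging and production, a guard that confirms the active context before anything is applied can prevent deploying to the wrong cluster. The sketch below is one way to do it. The context name `vapora-production` is hypothetical, as are the `require_context` helper and the `KUBECTL` override.

```bash
# Command used to talk to the cluster; overridable for testing.
KUBECTL=${KUBECTL:-kubectl}

# Refuse to continue unless the active kubeconfig context matches the expected name.
require_context() {
  expected=$1
  actual=$($KUBECTL config current-context)
  if [ "$actual" != "$expected" ]; then
    echo "Refusing to continue: context is '$actual', expected '$expected'" >&2
    return 1
  fi
  echo "Context OK: $actual"
}

# Usage: require_context vapora-production   # context name is hypothetical
```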

---

## Success Criteria Verification

Document what "success" looks like for this deployment:

- [ ] All three deployments have updated image IDs
- [ ] All pods reach "Ready" state within 5 minutes
- [ ] No pod restarts: `kubectl get pods -n vapora --watch` (the RESTARTS column does not increase)
- [ ] No error logs in first 2 minutes
- [ ] Health endpoints respond (200 OK)
- [ ] API endpoints respond to test requests
- [ ] Metrics show normal resource usage
- [ ] No alerts triggered
- [ ] Support team reports no user impact
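
The readiness and restart criteria can be checked in one pass over `kubectl get pods` output. The following sketch prints one line per problem and nothing when all pods are healthy. It assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE); `unhealthy_pods` is a hypothetical helper name.

```bash
# Report pods that are not fully ready or that have restarted.
# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '{
    split($2, r, "/")                      # READY looks like "1/1"
    if (r[1] != r[2]) print $1, "not-ready", $2
    if ($4 + 0 > 0)   print $1, "restarts", $4 + 0
  }'
}

# Usage:
#   kubectl get pods -n vapora --no-headers | unhealthy_pods
```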

---

## Team Roles During Deployment

### Deployment Lead
- Executes deployment commands
- Monitors progress
- Communicates status updates
- Decides to proceed/rollback

### On-Call Engineer
- Monitors dashboards and alerts
- Watches for anomalies
- Prepares for rollback if needed
- Available for emergency decisions

### Communications Lead (optional)
- Updates #deployments channel
- Notifies support/product teams
- Updates status page if public
- Handles external communication

### Backup Person
- Monitors for issues
- Ready to assist with troubleshooting
- Prepares rollback procedures
- Escalates if needed

---

## Common Issues to Watch For

⚠️ **Pod CrashLoopBackOff**
- Indicates config or image issue
- Check pod logs: `kubectl logs <pod>`
- Check events: `kubectl describe pod <pod>`
- **Action**: Roll back immediately

⚠️ **Pending Pods (not starting)**
- Check resource availability: `kubectl describe pod <pod>`
- Check node capacity
- **Action**: Investigate, or roll back if resources are exhausted

⚠️ **High Error Rate**
- Check application logs
- Compare with baseline errors
- **Action**: If the error rate rises more than 10% above the recorded baseline, roll back

⚠️ **Database Connection Errors**
- Check ConfigMap has correct database URL
- Verify network connectivity to database
- **Action**: Check ConfigMap, fix and reapply if needed

⚠️ **Memory or CPU Spike**
- Monitor trends (sudden spike vs gradual)
- Check if within expected range for new code
- **Action**: Roll back if resource limits are exceeded
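
The error-rate threshold above depends on comparing against the baseline recorded earlier. A minimal sketch of that comparison follows. It reads the 10% figure as a relative increase over the baseline rather than as percentage points, which is an assumption; both function names are hypothetical.

```bash
# Print the relative change in error rate, in percent, versus a baseline.
# Arguments: baseline rate, current rate (same unit, e.g. percent of requests).
error_increase_pct() {
  awk -v b="$1" -v c="$2" 'BEGIN {
    if (b == 0) { print "n/a"; exit }
    printf "%.1f\n", (c - b) / b * 100
  }'
}

# Exit 0 when the increase exceeds the threshold (default 10), meaning a
# rollback is advised. With a zero baseline, any errors at all count.
should_roll_back() {
  awk -v b="$1" -v c="$2" -v t="${3:-10}" 'BEGIN {
    if (b == 0) exit !(c > 0)
    exit !((c - b) / b * 100 > t)
  }'
}

# Usage:
#   should_roll_back 2.0 2.5 && echo "Error rate up $(error_increase_pct 2.0 2.5)%: roll back"
```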

---

## Post-Deployment Documentation

After deployment completes, record:

- [ ] Deployment start time (UTC)
- [ ] Deployment end time (UTC)
- [ ] Total duration
- [ ] Any issues encountered and resolution
- [ ] Rollback performed (Y/N)
- [ ] Metrics before/after (CPU, memory, latency, errors)
- [ ] Team members involved
- [ ] Blockers or lessons learned
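
Start time, end time, and duration are easiest to record if they are captured by the shell rather than reconstructed afterwards. The sketch below formats an elapsed time in seconds; `format_duration` is a hypothetical helper name.

```bash
# Format a duration in seconds as "Xm YYs".
format_duration() {
  printf '%dm %02ds\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Usage around the deployment itself:
#   start=$(date -u +%s)
#   ... run the deployment ...
#   end=$(date -u +%s)
#   echo "Duration: $(format_duration $(( end - start )))"
```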

---

## Sign-Off

Use this template for the deployment issue/ticket:

```
DEPLOYMENT COMPLETED

✓ All checks passed
✓ Deployment successful
✓ All pods running
✓ Health checks passing
✓ No user impact

Deployed by: [Name]
Start time: [UTC]
Duration: [X minutes]
Rollback needed: No

Metrics:
- Latency (p99): [X]ms
- Error rate: [X]%
- Pod restarts: 0

Next deployment: [Date/Time]
```