# Pre-Deployment Checklist
Critical verification steps before any VAPORA deployment to production or staging.
---
## 24 Hours Before Deployment
### Communication & Scheduling
- [ ] Schedule deployment with team (record in calendar/ticket)
- [ ] Post in #deployments channel: "Deployment scheduled for [DATE TIME UTC]"
- [ ] Identify on-call engineer for deployment period
- [ ] Brief on-call on deployment plan and rollback procedure
- [ ] Ensure affected teams (support, product, etc.) are notified
- [ ] Verify no other critical infrastructure changes scheduled same time window
### Change Documentation
- [ ] Create GitHub issue or ticket tracking the deployment
- [ ] Document: what's changing (configs, manifests, versions)
- [ ] Document: why (bug fix, feature, performance, security)
- [ ] Document: rollback plan (revision number or previous config)
- [ ] Document: success criteria (what indicates successful deployment)
- [ ] Document: estimated duration (usually 5-15 minutes)
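A minimal ticket skeleton covering these fields (a sketch; adapt to your tracker):
```
DEPLOYMENT: [short description]
What's changing: [configs / manifests / image versions]
Why: [bug fix / feature / performance / security]
Rollback plan: [revision number or previous config]
Success criteria: [e.g. all pods Ready, health checks 200, error rate at baseline]
Estimated duration: [5-15 minutes]
```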
### Code Review & Validation
- [ ] All provisioning changes merged and code reviewed
- [ ] Confirm `main` branch has latest changes
- [ ] Run validation locally: `nu scripts/validate-config.nu --mode enterprise`
- [ ] Verify all 3 modes (solo, multiuser, enterprise) validate without errors or critical warnings (see the loop sketch below)
- [ ] Check git log for unexpected commits
- [ ] Review artifact generation: ensure configs are correct
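To cover all three modes in one pass, a small loop works (a sketch; assumes the validation script exits non-zero on failure):
```bash
# Validate every deployment mode; stop at the first failure
for mode in solo multiuser enterprise; do
  echo "Validating mode: $mode"
  nu scripts/validate-config.nu --mode "$mode" || { echo "✗ $mode failed"; exit 1; }
done
echo "✓ All modes validated"
```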
---
## 4 Hours Before Deployment
### Environment Verification
#### Staging Environment
- [ ] Access staging Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify cluster is healthy: `kubectl get nodes` (all Ready)
- [ ] Check namespace exists: `kubectl get namespace vapora`
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check ConfigMap is up to date: `kubectl get configmap -n vapora -o yaml | head -20`
#### Production Environment (if applicable)
- [ ] Access production Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify all nodes healthy: `kubectl get nodes` (all Ready)
- [ ] Check current resource usage: `kubectl top nodes` (not near capacity)
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check pod status: `kubectl get pods -n vapora` (all Running)
- [ ] Verify recent events: `kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10`
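To make the "all nodes Ready" checks above scriptable, something like the following works (a sketch):
```bash
# Exit non-zero if any node is not Ready
kubectl get nodes --no-headers \
  | awk '$2 != "Ready" { print "Not Ready:", $1; bad = 1 } END { exit bad }' \
  && echo "✓ All nodes Ready"
```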
### Health Baseline
- [ ] Record current metrics before deployment (see the baseline capture sketch at the end of this subsection)
- CPU usage per deployment
- Memory usage per deployment
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Queue depth (if applicable)
- [ ] Verify services are responsive:
```bash
# Assumes the backend API is reachable locally (e.g. via a port-forward)
curl http://localhost:8001/health -H "Authorization: Bearer $TOKEN"
curl http://localhost:8001/api/projects
```
- [ ] Check logs for recent errors:
```bash
kubectl logs deployment/vapora-backend -n vapora --tail=50
kubectl logs deployment/vapora-agents -n vapora --tail=50
```
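One way to capture the baseline items above in a single timestamped file (a sketch; the endpoint and port assume the local setup shown above, adjust to your environment):
```bash
# Capture a pre-deployment baseline into a timestamped file
# (latency percentiles and error rates come from your metrics dashboard)
BASELINE="baseline-$(date -u +%Y%m%dT%H%M%SZ).txt"
{
  echo "== Node and pod resource usage =="
  kubectl top nodes
  kubectl top pods -n vapora
  echo "== Pod status =="
  kubectl get pods -n vapora -o wide
  echo "== Backend health and latency =="
  curl -s -o /dev/null -w "health: HTTP %{http_code} in %{time_total}s\n" \
    -H "Authorization: Bearer $TOKEN" http://localhost:8001/health
} | tee "$BASELINE"
```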
### Infrastructure Check
- [ ] Verify storage is not near capacity: `df -h /var/lib/vapora`
- [ ] Check database health: `kubectl exec -n vapora <pod> -- surreal info`
- [ ] Verify backups are recent (within 24 hours; a recency-check sketch follows this list)
- [ ] Check SSL certificate expiration: `openssl s_client -connect api.vapora.com:443 -servername api.vapora.com </dev/null 2>/dev/null | openssl x509 -noout -dates`
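For the backup-recency item, a simple file-age check is often enough (a sketch; the path `/var/backups/vapora` is an assumption, point it at wherever your backups actually land):
```bash
# Assumption: backups are written under /var/backups/vapora
find /var/backups/vapora -type f -mmin -1440 | grep -q . \
  && echo "✓ Backup newer than 24 hours found" \
  || echo "✗ No backup in the last 24 hours"
```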
---
## 2 Hours Before Deployment
### Artifact Preparation
- [ ] Trigger validation in CI/CD pipeline
- [ ] Wait for artifact generation to complete
- [ ] Download artifacts from pipeline:
```bash
# From GitHub Actions or Woodpecker UI
# Download: deployment-artifacts.zip
```
- [ ] Verify artifact contents:
```bash
unzip deployment-artifacts.zip
ls -la
# Should contain:
# - configmap.yaml
# - deployment.yaml
# - docker-compose.yml
# - vapora-{solo,multiuser,enterprise}.{toml,yaml,json}
```
- [ ] Validate manifest syntax:
```bash
yq eval '.' configmap.yaml > /dev/null && echo "✓ ConfigMap valid"
yq eval '.' deployment.yaml > /dev/null && echo "✓ Deployment valid"
```
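Optionally, syntax-check the generated per-mode configs as well (a sketch; uses the same `yq` as above):
```bash
# YAML is a superset of JSON, so yq parses both generated variants
for f in vapora-*.yaml vapora-*.json; do
  yq eval '.' "$f" > /dev/null && echo "✓ $f valid" || echo "✗ $f invalid"
done
```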
### Test in Staging
- [ ] Perform dry-run deployment to staging cluster:
```bash
kubectl apply -f configmap.yaml --dry-run=server -n vapora
kubectl apply -f deployment.yaml --dry-run=server -n vapora
```
- [ ] Review dry-run output for any warnings or errors
- [ ] If a test deployment is available, perform the actual staging deployment and verify:
```bash
kubectl get deployments -n vapora
kubectl get pods -n vapora
kubectl logs deployment/vapora-backend -n vapora --tail=5
```
- [ ] Test health endpoints on staging
- [ ] Run smoke tests against staging (if available)
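If no formal smoke-test suite exists, a minimal check of the known endpoints can stand in (a sketch; `STAGING_URL` is a placeholder for however you reach the staging backend, e.g. a port-forward):
```bash
# Minimal staging smoke test against known endpoints
STAGING_URL="${STAGING_URL:-http://localhost:8001}"
for path in /health /api/projects; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $TOKEN" "$STAGING_URL$path")
  echo "$path -> HTTP $code"
  [ "$code" = "200" ] || { echo "✗ Smoke test failed on $path"; exit 1; }
done
echo "✓ Smoke tests passed"
```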
### Rollback Plan Verification
- [ ] Document current deployment revisions:
```bash
kubectl rollout history deployment/vapora-backend -n vapora
# Record the highest revision number
```
- [ ] Create backup of current ConfigMap:
```bash
kubectl get configmap -n vapora vapora-config -o yaml > configmap-backup.yaml
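# To restore this backup later if needed, re-apply the saved manifest:
# kubectl apply -f configmap-backup.yaml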
```
- [ ] Test rollback procedure on staging (if safe):
```bash
# Record current revision
CURRENT_REV=$(kubectl rollout history deployment/vapora-backend -n vapora | tail -1 | awk '{print $1}')
# Test undo
kubectl rollout undo deployment/vapora-backend -n vapora
# Verify rollback
kubectl get deployment vapora-backend -n vapora -o yaml | grep image
# Restore to current
kubectl rollout undo deployment/vapora-backend -n vapora --to-revision=$CURRENT_REV
```
- [ ] Confirm rollback command is documented in ticket/issue
---
## 1 Hour Before Deployment
### Final Checks
- [ ] Confirm all prerequisites met:
- [ ] Code merged to main
- [ ] Artifacts generated and validated
- [ ] Staging deployment tested
- [ ] Rollback plan documented
- [ ] Team notified
### Communication Setup
- [ ] Set status page to "Maintenance Mode" (if public)
```
"VAPORA maintenance deployment starting at HH:MM UTC.
Expected duration: 10 minutes. Services may be briefly unavailable."
```
- [ ] Join #deployments Slack channel
- [ ] Prepare message: "🚀 Deployment starting now. Will update every 2 minutes."
- [ ] Have on-call engineer monitoring
- [ ] Verify monitoring/alerting dashboards are accessible
### Access Verification
- [ ] Verify kubeconfig is valid and up to date:
```bash
kubectl cluster-info
kubectl get nodes
```
- [ ] Verify kubectl version compatibility:
```bash
kubectl version
# Should match server version reasonably (within 1 minor version)
```
- [ ] Test write access to cluster:
```bash
kubectl auth can-i create deployments --namespace=vapora
# Should return "yes"
```
- [ ] Verify docker/docker-compose access (if Docker deployment)
- [ ] Verify Slack webhook is working (test send message)
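A quick way to test the webhook (a sketch; `SLACK_WEBHOOK_URL` is assumed to hold your incoming-webhook URL):
```bash
# Send a test message to the #deployments webhook
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"text": "Pre-deployment webhook test - please ignore"}' \
  "$SLACK_WEBHOOK_URL"
```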
---
## 15 Minutes Before Deployment
### Final Go/No-Go Decision
**STOP HERE** and make the final decision to proceed or reschedule:
**Proceed IF:**
- ✅ All checklist items above completed
- ✅ No critical issues found during testing
- ✅ Staging deployment successful
- ✅ Team ready and monitoring
- ✅ Rollback plan clear and tested
- ✅ Within designated maintenance window
**RESCHEDULE IF:**
- ❌ Any critical issues discovered
- ❌ Staging tests failed
- ❌ Team member unavailable
- ❌ Production issues detected
- ❌ Unexpected changes in code/configs
### Final Notifications
If proceeding:
- [ ] Post to #deployments: "🚀 Deployment starting in 5 minutes"
- [ ] Alert on-call engineer: "Ready to start - confirm you're monitoring"
- [ ] Have rollback plan visible and accessible
- [ ] Open monitoring dashboard showing current metrics
### Terminal Setup
- [ ] Open terminal with kubeconfig configured:
```bash
export KUBECONFIG=/path/to/production/kubeconfig
kubectl cluster-info # Verify connected to production
```
- [ ] Open second terminal for tailing logs:
```bash
kubectl logs -f deployment/vapora-backend -n vapora
```
- [ ] Have rollback commands ready:
```bash
# For quick rollback if needed
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-agents -n vapora
kubectl rollout undo deployment/vapora-llm-router -n vapora
```
- [ ] Prepare metrics check script:
```bash
# Run each watch in its own terminal or tmux pane
watch kubectl top pods -n vapora
watch kubectl get pods -n vapora
```
---
## Success Criteria Verification
Document what "success" looks like for this deployment:
- [ ] All three deployments (backend, agents, llm-router) have updated image IDs
- [ ] All pods reach "Ready" state within 5 minutes
- [ ] No pod restarts: `kubectl get pods -n vapora --watch` (RESTARTS column not increasing)
- [ ] No error logs in first 2 minutes
- [ ] Health endpoints respond (200 OK)
- [ ] API endpoints respond to test requests
- [ ] Metrics show normal resource usage
- [ ] No alerts triggered
- [ ] Support team reports no user impact
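Several of these checks can be scripted (a sketch; deployment names match the rollback commands earlier in this checklist):
```bash
# Wait for each rollout to complete (fails if not Ready within 5 minutes)
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/"$d" -n vapora --timeout=5m || exit 1
done
# Confirm pod status and scan the first minutes of logs for errors
kubectl get pods -n vapora
kubectl logs deployment/vapora-backend -n vapora --since=2m | grep -i error \
  || echo "✓ No errors in recent backend logs"
```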
---
## Team Roles During Deployment
### Deployment Lead
- Executes deployment commands
- Monitors progress
- Communicates status updates
- Decides to proceed/rollback
### On-Call Engineer
- Monitors dashboards and alerts
- Watches for anomalies
- Prepares for rollback if needed
- Available for emergency decisions
### Communications Lead (optional)
- Updates #deployments channel
- Notifies support/product teams
- Updates status page if public
- Handles external communication
### Backup Person
- Monitors for issues
- Ready to assist with troubleshooting
- Prepares rollback procedures
- Escalates if needed
---
## Common Issues to Watch For
⚠️ **Pod CrashLoopBackOff**
- Indicates config or image issue
- Check pod logs: `kubectl logs <pod> -n vapora`
- Check events: `kubectl describe pod <pod> -n vapora`
- **Action**: Rollback immediately
⚠️ **Pending Pods (not starting)**
- Check resource availability: `kubectl describe pod <pod> -n vapora`
- Check node capacity
- **Action**: Investigate or rollback if resource exhausted
⚠️ **High Error Rate**
- Check application logs
- Compare with baseline errors
- **Action**: If >10% error increase, rollback
⚠️ **Database Connection Errors**
- Check ConfigMap has correct database URL
- Verify network connectivity to database
- **Action**: Check ConfigMap, fix and reapply if needed
⚠️ **Memory or CPU Spike**
- Monitor trends (sudden spike vs gradual)
- Check if within expected range for new code
- **Action**: Rollback if resource limits exceeded
---
## Post-Deployment Documentation
After deployment completes, record:
- [ ] Deployment start time (UTC)
- [ ] Deployment end time (UTC)
- [ ] Total duration
- [ ] Any issues encountered and resolution
- [ ] Rollback performed (Y/N)
- [ ] Metrics before/after (CPU, memory, latency, errors)
- [ ] Team members involved
- [ ] Blockers or lessons learned
---
## Sign-Off
Use this template for deployment issue/ticket:
```
DEPLOYMENT COMPLETED
✓ All checks passed
✓ Deployment successful
✓ All pods running
✓ Health checks passing
✓ No user impact
Deployed by: [Name]
Start time: [UTC]
Duration: [X minutes]
Rollback needed: No
Metrics:
- Latency (p99): [X]ms
- Error rate: [X]%
- Pod restarts: 0
Next deployment: [Date/Time]
```