jesus/Vapora

Jesús Pérez 7110ffeea2

chore: extend doc: adr, tutorials, operations, etc

2026-01-12 03:32:47 +00:00

Pre-Deployment Checklist

Critical verification steps before any VAPORA deployment to production or staging.

24 Hours Before Deployment

Communication & Scheduling

Schedule deployment with team (record in calendar/ticket)
Post in #deployments channel: "Deployment scheduled for [DATE TIME UTC]"
Identify on-call engineer for deployment period
Brief on-call on deployment plan and rollback procedure
Ensure affected teams (support, product, etc.) are notified
Verify no other critical infrastructure changes are scheduled in the same time window

Change Documentation

Create GitHub issue or ticket tracking the deployment
Document: what's changing (configs, manifests, versions)
Document: why (bug fix, feature, performance, security)
Document: rollback plan (revision number or previous config)
Document: success criteria (what indicates successful deployment)
Document: estimated duration (usually 5-15 minutes)

Code Review & Validation

All provisioning changes merged and code reviewed
Confirm main branch has latest changes
Run validation locally: nu scripts/validate-config.nu --mode enterprise
Verify all 3 modes (solo, multiuser, enterprise) validate without errors or critical warnings
Check git log for unexpected commits
Review artifact generation: ensure configs are correct
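
The per-mode validation above can be looped rather than run by hand. The following is a minimal sketch, assuming the validator accepts `solo`, `multiuser`, and `enterprise` as `--mode` values (only `enterprise` is shown in this checklist; the other two are inferred from the artifact names). `VALIDATE_CMD` and `validate_all_modes` are hypothetical names introduced here, not part of VAPORA.

```shell
#!/usr/bin/env bash
# Sketch: run the config validator once per deployment mode and report failures.
# VALIDATE_CMD is a hypothetical override hook so the loop can be exercised
# without Nushell installed; by default it calls the script named in this checklist.
VALIDATE_CMD="${VALIDATE_CMD:-nu scripts/validate-config.nu}"

validate_all_modes() {
  local failed=0 mode
  # Mode names are an assumption taken from the artifact file names.
  for mode in solo multiuser enterprise; do
    # Word splitting of VALIDATE_CMD is intentional: it holds command + args.
    if $VALIDATE_CMD --mode "$mode"; then
      echo "OK: $mode"
    else
      echo "FAILED: $mode" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `validate_all_modes` from the repository root returns non-zero if any mode fails, which makes it usable as a gate in a wrapper script.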

4 Hours Before Deployment

Environment Verification

Staging Environment

Access staging Kubernetes cluster: kubectl cluster-info
Verify cluster is healthy: kubectl get nodes (all Ready)
Check namespace exists: kubectl get namespace vapora
Verify current deployments: kubectl get deployments -n vapora
Check ConfigMap is up to date: kubectl get configmap vapora-config -n vapora -o yaml | head -20

Production Environment (if applicable)

Access production Kubernetes cluster: kubectl cluster-info
Verify all nodes healthy: kubectl get nodes (all Ready)
Check current resource usage: kubectl top nodes (not near capacity)
Verify current deployments: kubectl get deployments -n vapora
Check pod status: kubectl get pods -n vapora (all Running)
Verify recent events: kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10
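
The "all Ready" and "all Running" checks above can be made mechanical instead of eyeballed. This is a sketch under the assumption that the default `kubectl get ... --no-headers` column layout is in use (STATUS is the 2nd column for nodes and the 3rd for pods). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Reads `kubectl get nodes --no-headers` output on stdin.
# Fails if any node's STATUS is not exactly "Ready" (a cordoned node shows
# "Ready,SchedulingDisabled" and is reported too) or if no nodes are listed.
all_nodes_ready() {
  awk '$2 != "Ready" { print "Not ready: " $1; bad = 1 }
       END { exit (bad || NR == 0) }'
}

# Reads `kubectl get pods --no-headers` output on stdin.
# Fails if any pod's STATUS is not "Running" or if no pods are listed.
all_pods_running() {
  awk '$3 != "Running" { print "Not running: " $1 " (" $3 ")"; bad = 1 }
       END { exit (bad || NR == 0) }'
}
```

Typical use would be `kubectl get nodes --no-headers | all_nodes_ready` and `kubectl get pods -n vapora --no-headers | all_pods_running`.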

Health Baseline

Record current metrics before deployment
- CPU usage per deployment
- Memory usage per deployment
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Queue depth (if applicable)

Verify services are responsive:

curl http://localhost:8001/health -H "Authorization: Bearer $TOKEN"
curl http://localhost:8001/api/projects

Check logs for recent errors:

kubectl logs deployment/vapora-backend -n vapora --tail=50
kubectl logs deployment/vapora-agents -n vapora --tail=50

Infrastructure Check

Verify storage is not near capacity: df -h /var/lib/vapora
Check database health: kubectl exec -n vapora <pod> -- surreal info
Verify backups are recent (within 24 hours)
Check SSL certificate expiration: echo | openssl s_client -connect api.vapora.com:443 -servername api.vapora.com 2>/dev/null | openssl x509 -noout -dates
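
"Not near capacity" is easier to enforce with an explicit limit. The sketch below parses `df -P` output (the POSIX single-line format) and fails above a threshold. The 80% default and the function name are assumptions for illustration, not values from this checklist.

```shell
#!/usr/bin/env bash
# Reads `df -P <path>` output on stdin and prints the usage percentage.
# Exit status: 0 if usage is below the limit, 1 if at or above it,
# 2 if no data row was found. The default limit of 80 is an assumed value.
disk_below_threshold() {
  local limit="${1:-80}"
  awk -v limit="$limit" '
    NR == 2 { sub(/%/, "", $5); used = $5 + 0; found = 1 }
    END {
      if (!found) exit 2
      print "Usage: " used "%"
      exit (used >= limit + 0)
    }'
}
```

For example: `df -P /var/lib/vapora | disk_below_threshold 80`.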

2 Hours Before Deployment

Artifact Preparation

Trigger validation in CI/CD pipeline
Wait for artifact generation to complete

Download artifacts from pipeline:

# From GitHub Actions or Woodpecker UI
# Download: deployment-artifacts.zip

Verify artifact contents:

unzip deployment-artifacts.zip
ls -la
# Should contain:
# - configmap.yaml
# - deployment.yaml
# - docker-compose.yml
# - vapora-{solo,multiuser,enterprise}.{toml,yaml,json}

Validate manifest syntax:

yq eval '.' configmap.yaml > /dev/null && echo "✓ ConfigMap valid"
yq eval '.' deployment.yaml > /dev/null && echo "✓ Deployment valid"

Test in Staging

Perform dry-run deployment to staging cluster:

kubectl apply -f configmap.yaml --dry-run=server -n vapora
kubectl apply -f deployment.yaml --dry-run=server -n vapora

Review dry-run output for any warnings or errors

If a test deployment is possible, deploy to staging for real and verify:

kubectl get deployments -n vapora
kubectl get pods -n vapora
kubectl logs deployment/vapora-backend -n vapora --tail=5

Test health endpoints on staging
Run smoke tests against staging (if available)

Rollback Plan Verification

Document current deployment revisions:

kubectl rollout history deployment/vapora-backend -n vapora
# Record the highest revision number

Create backup of current ConfigMap:

kubectl get configmap -n vapora vapora-config -o yaml > configmap-backup.yaml

Test rollback procedure on staging (if safe):

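# Record current revision (read from the deployment's revision annotation;
# `rollout history | tail -1` can return an empty trailing line)
CURRENT_REV=$(kubectl get deployment vapora-backend -n vapora -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')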

# Test undo
kubectl rollout undo deployment/vapora-backend -n vapora

# Verify rollback
kubectl get deployment vapora-backend -n vapora -o yaml | grep image

# Restore to current
kubectl rollout undo deployment/vapora-backend -n vapora --to-revision=$CURRENT_REV

Confirm rollback command is documented in ticket/issue
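
To record "the highest revision number" without depending on where it appears in the output, the history listing can be filtered by content. A minimal sketch, assuming the usual `REVISION  CHANGE-CAUSE` table printed by `kubectl rollout history`. `latest_revision` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Reads `kubectl rollout history deployment/<name>` output on stdin and prints
# the highest revision number. Only lines whose first field is all digits are
# considered, so the resource line, the header, and blank lines are skipped.
# Exits non-zero if no revision row is found.
latest_revision() {
  awk '$1 ~ /^[0-9]+$/ && $1 + 0 > max { max = $1 + 0 }
       END { if (max) print max; else exit 1 }'
}
```

For example: `kubectl rollout history deployment/vapora-backend -n vapora | latest_revision`, with the result pasted into the deployment ticket.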

1 Hour Before Deployment

Final Checks

Confirm all prerequisites met:
- Code merged to main
- Artifacts generated and validated
- Staging deployment tested
- Rollback plan documented
- Team notified

Communication Setup

Set status page to "Maintenance Mode" (if public)

"VAPORA maintenance deployment starting at HH:MM UTC.
 Expected duration: 10 minutes. Services may be briefly unavailable."

Join #deployments Slack channel
Prepare message: "🚀 Deployment starting now. Will update every 2 minutes."
Have on-call engineer monitoring
Verify monitoring/alerting dashboards are accessible

Access Verification

Verify kubeconfig is valid and up to date:
```
kubectl cluster-info
kubectl get nodes
```

Verify kubectl version compatibility:

kubectl version
# Client should be within one minor version of the server

Test write access to cluster:

kubectl auth can-i create deployments --namespace=vapora
# Should return "yes"

Verify docker/docker-compose access (if Docker deployment)
Verify Slack webhook is working (test send message)
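
The "within 1 minor version" rule from the version check above can be tested directly. This sketch compares two minor-version strings. It assumes they come from the `clientVersion.minor` and `serverVersion.minor` fields of `kubectl version -o json`, and the use of `jq` to extract them is an assumption about the operator's toolbox. `minor_skew_ok` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Succeeds when client and server minor versions differ by at most one.
# Non-digit suffixes such as the "+" in "29+" are stripped before comparing.
# Returns 2 if either argument contains no digits.
minor_skew_ok() {
  local client server diff
  client=$(printf '%s' "$1" | tr -cd '0-9')
  server=$(printf '%s' "$2" | tr -cd '0-9')
  [ -n "$client" ] && [ -n "$server" ] || return 2
  diff=$(( client - server ))
  # ${diff#-} drops a leading minus sign, giving the absolute difference.
  [ "${diff#-}" -le 1 ]
}

# Example (requires kubectl and jq, so it is left commented out):
# v=$(kubectl version -o json)
# minor_skew_ok "$(printf '%s' "$v" | jq -r .clientVersion.minor)" \
#               "$(printf '%s' "$v" | jq -r .serverVersion.minor)"
```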

15 Minutes Before Deployment

Final Go/No-Go Decision

STOP HERE and make final decision to proceed or reschedule:

Proceed IF:

✅ All checklist items above completed
✅ No critical issues found during testing
✅ Staging deployment successful
✅ Team ready and monitoring
✅ Rollback plan clear and tested
✅ Within designated maintenance window

RESCHEDULE IF:

❌ Any critical issues discovered
❌ Staging tests failed
❌ Team member unavailable
❌ Production issues detected
❌ Unexpected changes in code/configs

Final Notifications

If proceeding:

Post to #deployments: "🚀 Deployment starting in 5 minutes"
Alert on-call engineer: "Ready to start - confirm you're monitoring"
Have rollback plan visible and accessible
Open monitoring dashboard showing current metrics

Terminal Setup

Open terminal with kubeconfig configured:

export KUBECONFIG=/path/to/production/kubeconfig
kubectl cluster-info  # Verify connected to production

Open second terminal for tailing logs:

kubectl logs -f deployment/vapora-backend -n vapora

Have rollback commands ready:

# For quick rollback if needed
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-agents -n vapora
kubectl rollout undo deployment/vapora-llm-router -n vapora

Prepare metrics check script:

# Run each in its own terminal; watch blocks until interrupted
watch kubectl top pods -n vapora
watch kubectl get pods -n vapora

Success Criteria Verification

Document what "success" looks like for this deployment:

All three deployments have updated image IDs
All pods reach "Ready" state within 5 minutes
No pod restarts: kubectl get pods -n vapora --watch (RESTARTS column stays at 0)
No error logs in first 2 minutes
Health endpoints respond (200 OK)
API endpoints respond to test requests
Metrics show normal resource usage
No alerts triggered
Support team reports no user impact
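
The "no pod restarts" criterion can be checked with a one-shot command instead of watching the table. A sketch, assuming the default column layout of `kubectl get pods --no-headers`, where RESTARTS is the 4th field (newer kubectl versions append text such as `(30s ago)` after the count, which does not move the count itself). `total_restarts` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Reads `kubectl get pods --no-headers` output on stdin and prints the sum of
# the RESTARTS column (4th field). Exit status is non-zero if the sum is > 0.
total_restarts() {
  awk '{ total += $4 } END { print total + 0; exit (total > 0) }'
}
```

For example: `kubectl get pods -n vapora --no-headers | total_restarts`, run a few minutes after the rollout finishes.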

Team Roles During Deployment

Deployment Lead

Executes deployment commands
Monitors progress
Communicates status updates
Decides to proceed/rollback

On-Call Engineer

Monitors dashboards and alerts
Watches for anomalies
Prepares for rollback if needed
Available for emergency decisions

Communications Lead (optional)

Updates #deployments channel
Notifies support/product teams
Updates status page if public
Handles external communication

Backup Person

Monitors for issues
Ready to assist with troubleshooting
Prepares rollback procedures
Escalates if needed

Common Issues to Watch For

⚠️ Pod CrashLoopBackOff

Indicates config or image issue
Check pod logs: kubectl logs <pod>
Check events: kubectl describe pod <pod>
Action: Roll back immediately

⚠️ Pending Pods (not starting)

Check resource availability: kubectl describe pod <pod>
Check node capacity
Action: Investigate, or roll back if resources are exhausted

⚠️ High Error Rate

Check application logs
Compare with baseline errors
Action: If the error rate rises more than 10% above baseline, roll back

⚠️ Database Connection Errors

Check ConfigMap has correct database URL
Verify network connectivity to database
Action: Check ConfigMap, fix and reapply if needed

⚠️ Memory or CPU Spike

Monitor trends (sudden spike vs gradual)
Check if within expected range for new code
Action: Roll back if resource limits are exceeded
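
The 10% error-rate rule for High Error Rate can be expressed as a small decision helper. This sketch assumes the 10% is a relative increase over the recorded baseline; if the team means percentage points instead, the formula needs to change. The function name and argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Decide whether the post-deploy error rate warrants a rollback.
# Assumption: the 10% rule means a relative increase over the baseline rate.
# Args: baseline error rate, current error rate (same unit), limit in % (default 10).
# Exit status 0 means "roll back"; non-zero means "within tolerance".
should_roll_back() {
  awk -v base="$1" -v cur="$2" -v limit="${3:-10}" 'BEGIN {
    # With a zero baseline a relative increase is undefined: treat any error as a breach.
    if (base + 0 == 0) exit !(cur + 0 > 0)
    increase = (cur - base) / base * 100
    exit !(increase > limit + 0)
  }'
}
```

For example, `should_roll_back 0.4 0.9 && kubectl rollout undo deployment/vapora-backend -n vapora` would roll back when the error rate moves from 0.4% to 0.9%.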

Post-Deployment Documentation

After deployment completes, record:

Deployment start time (UTC)
Deployment end time (UTC)
Total duration
Any issues encountered and resolution
Rollback performed (Y/N)
Metrics before/after (CPU, memory, latency, errors)
Team members involved
Blockers or lessons learned

Sign-Off

Use this template for deployment issue/ticket:

DEPLOYMENT COMPLETED

✓ All checks passed
✓ Deployment successful
✓ All pods running
✓ Health checks passing
✓ No user impact

Deployed by: [Name]
Start time: [UTC]
Duration: [X minutes]
Rollback needed: No

Metrics:
- Latency (p99): [X]ms
- Error rate: [X]%
- Pod restarts: 0

Next deployment: [Date/Time]
