VAPORA Business Continuity Plan
Strategic plan for maintaining business operations during and after disaster events.
Purpose & Scope
Purpose: Minimize business impact during service disruptions
Scope:
- Service availability targets
- Incident response procedures
- Communication protocols
- Recovery priorities
- Business impact assessment
Owner: Operations Team
Review Frequency: Quarterly
Last Updated: 2026-01-12
Business Impact Analysis
Service Criticality
Tier 1 - Critical:
- Backend API (projects, tasks, agents)
- SurrealDB (all user data)
- Authentication system
- Health monitoring
Tier 2 - Important:
- Frontend UI
- Agent orchestration
- LLM routing
Tier 3 - Optional:
- Analytics
- Logging aggregation
- Monitoring dashboards
Recovery Priorities
Phase 1 (First 30 minutes):
1. Backend API availability
2. Database connectivity
3. User authentication
Phase 2 (Next 30 minutes):
4. Frontend UI access
5. Agent services
6. Core functionality
Phase 3 (Next 2 hours):
7. All features
8. Monitoring/alerting
9. Analytics/logging
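These phases can also be captured as data so a recovery runbook walks them in strict priority order. A minimal Nushell sketch, assuming illustrative service names rather than the actual VAPORA topology:

def recovery-phases [] {
    [[phase service target_minutes];
     [1 "backend-api"    30]
     [1 "surrealdb"      30]
     [1 "authentication" 30]
     [2 "frontend-ui"    60]
     [2 "agent-services" 60]
     [3 "monitoring"    240]
     [3 "analytics"     240]]
}

# Walk services in priority order during recovery
recovery-phases | sort-by phase | each { |svc|
    print $"Phase ($svc.phase): verify ($svc.service) within ($svc.target_minutes) min"
}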
Service Level Targets
Availability Targets
Monthly Uptime Target: 99.9%
- Allowed downtime: ~43 minutes/month
- Current status: 99.95% (last quarter)
Weekly Uptime Target: 99.9%
- Allowed downtime: ~10 minutes/week
Daily Uptime Target: 99.8%
- Allowed downtime: ~2.9 minutes/day
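The downtime budgets follow directly from the targets: allowed downtime = (1 - target) * period length. A quick Nushell check of the figures above:

def downtime-budget [target: float, period_minutes: int] {
    (1 - $target) * $period_minutes
}

print $"Monthly at 99.9%: (downtime-budget 0.999 43200 | math round) minutes allowed"
print $"Weekly at 99.9%: (downtime-budget 0.999 10080 | math round) minutes allowed"
print $"Daily at 99.8%: (downtime-budget 0.998 1440 | math round --precision 1) minutes allowed"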
Performance Targets
API Response Time: p99 < 500ms
- Current: p99 = 250ms
- Acceptable: < 500ms
- Red alert: > 2000ms
Error Rate: < 0.1%
- Current: 0.05%
- Acceptable: < 0.1%
- Red alert: > 1%
Database Query Time: p99 < 100ms
- Current: p99 = 75ms
- Acceptable: < 100ms
- Red alert: > 500ms
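A sketch of how these tiers can gate alerting, using the API latency thresholds above; the wiring to a real metrics source is omitted:

def check-latency [p99_ms: int] {
    if $p99_ms > 2000 {
        "red-alert"   # page on-call immediately
    } else if $p99_ms > 500 {
        "degraded"    # outside the acceptable range, open a ticket
    } else {
        "ok"
    }
}

check-latency 250   # => "ok", matching the current p99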
Recovery Objectives
RPO (Recovery Point Objective): 1 hour
- Maximum acceptable data loss: 1 hour
- Backup frequency: Hourly
RTO (Recovery Time Objective): 4 hours
- Time to restore full service: 4 hours
- Critical services (Tier 1): 30 minutes
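The hourly backup cadence is what bounds the RPO: in the worst case an incident lands just before the next backup runs, so the maximum data loss equals the backup interval. A sanity check in Nushell:

let backup_interval = 1hr   # from the backup strategy
let rpo_target = 1hr

if $backup_interval > $rpo_target {
    print "Backup cadence cannot satisfy the RPO target"
} else {
    print $"Worst-case data loss: ($backup_interval), within the RPO of ($rpo_target)"
}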
Incident Response Workflow
Severity Classification
Level 1 - Critical 🔴
- Service completely unavailable
- All users affected
- RPO: 1 hour, RTO: 30 minutes
- Response: Immediate activation of DR procedures
Level 2 - Major 🟠
- Service significantly degraded
- ≥50% of users affected or critical path broken
- RPO: 2 hours, RTO: 1 hour
- Response: Activate incident response team
Level 3 - Minor 🟡
- Service partially unavailable
- <50% users affected
- RPO: 4 hours, RTO: 2 hours
- Response: Alert on-call engineer
Level 4 - Informational 🟢
- Service available but with issues
- No user impact
- Response: Document in ticket
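A sketch that maps observed impact onto these levels; the inputs would come from monitoring, and the parameter names here are assumptions:

def classify-severity [available: bool, pct_users_affected: int] {
    if not $available {
        1   # Critical: service completely unavailable
    } else if $pct_users_affected >= 50 {
        2   # Major: significantly degraded
    } else if $pct_users_affected > 0 {
        3   # Minor: partial impact
    } else {
        4   # Informational: no user impact
    }
}

classify-severity true 60   # => 2 (Major)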
Response Team Activation
Level 1 Response (Disaster Declaration):
Immediately notify:
- CTO (@cto)
- VP Operations (@ops-vp)
- Incident Commander (assign)
- Database Team (@dba)
- Infrastructure Team (@infra)
Activate:
- 24/7 incident command center
- Continuous communication (every 2 min)
- Status page updates (every 5 min)
- Executive briefings (every 30 min)
Resources:
- All on-call staff activated
- Contractors/consultants if needed
- Executive decision makers available
Communication Plan
Stakeholders & Audiences
| Audience | Notification | Frequency |
|---|---|---|
| Internal Team | Slack #incident-critical | Every 2 minutes |
| Customers | Status page + email | Every 5 minutes |
| Executives | Direct call/email | Every 30 minutes |
| Support Team | Slack + email | Initial + every 10 min |
| Partners | Email + phone | Initial + every 1 hour |
Communication Templates
Initial Notification (to be sent within 5 minutes of incident):
INCIDENT ALERT - VAPORA SERVICE DISRUPTION
Status: [Active/Investigating]
Severity: Level [1-4]
Affected Services: [List]
Time Detected: [UTC]
Impact: [X] customers, [Y]% of functionality
Current Actions:
- [Action 1]
- [Action 2]
- [Action 3]
Expected Update: [Time + 5 min]
Support Contact: [Email/Phone]
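One way to make the template actionable is to render and post it automatically. A minimal sketch, assuming a hypothetical Slack incoming-webhook URL:

def send-initial-alert [severity: int, services: list<string>, impact: string] {
    let text = ([
        "INCIDENT ALERT - VAPORA SERVICE DISRUPTION"
        $"Severity: Level ($severity)"
        $"Affected Services: ($services | str join ', ')"
        $"Time Detected: (date now | date to-timezone 'UTC' | format date '%Y-%m-%d %H:%M UTC')"
        $"Impact: ($impact)"
    ] | str join "\n")
    # Webhook URL is a placeholder; substitute the real incident-channel hook
    http post --content-type application/json "https://hooks.slack.com/services/EXAMPLE" {text: $text}
}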
Ongoing Status Updates (every 5-10 minutes for Level 1):
INCIDENT UPDATE
Severity: Level [1-4]
Duration: [X] minutes
Impact: [Latest status]
What We've Learned:
- [Finding 1]
- [Finding 2]
What We're Doing:
- [Action 1]
- [Action 2]
Estimated Recovery: [Time/ETA]
Next Update: [+5 minutes]
Resolution Notification:
INCIDENT RESOLVED
Service: VAPORA [All systems restored]
Duration: [X] hours [Y] minutes
Root Cause: [Brief description]
Data Loss: [None/X transactions]
Impact Summary:
- Users affected: [X]
- Revenue impact: $[X]
Next Steps:
- Root cause analysis (scheduled for [date])
- Preventive measures (to be implemented by [date])
- Post-incident review ([date])
We apologize for the disruption and appreciate your patience.
Alternative Operating Procedures
Degraded Mode Operations
If Tier 1 services are available but Tier 2-3 services are degraded:
DEGRADED MODE PROCEDURES
Available:
✓ Create/update projects
✓ Create/update tasks
✓ View dashboard (read-only)
✓ Basic API access
Unavailable:
✗ Advanced search
✗ Analytics
✗ Agent orchestration (can queue, won't execute)
✗ Real-time updates
User Communication:
- Notify via status page
- Email affected users
- Provide timeline for restoration
- Suggest workarounds
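Degraded mode can be expressed as feature flags the backend reads at startup. The flag names below are illustrative, not VAPORA's actual configuration keys:

let degraded_mode = {
    projects_write: true
    tasks_write: true
    dashboard_read_only: true
    advanced_search: false
    analytics: false
    agent_execution: false   # queueing stays on, execution is paused
    realtime_updates: false
}

# List the features to announce as unavailable on the status page
$degraded_mode | transpose feature enabled | where enabled == false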
Manual Operations
If automation fails:
MANUAL BACKUP PROCEDURES
If automated backups unavailable:
1. Database Backup:
   kubectl exec pod/surrealdb -- surreal export ... > backup.sql
   aws s3 cp backup.sql s3://manual-backups/
2. Configuration Backup:
   kubectl get configmap -n vapora -o yaml > config.yaml
   aws s3 cp config.yaml s3://manual-backups/
3. Manual Deployment (if automation down):
   kubectl apply -f manifests/
   kubectl rollout status deployment/vapora-backend
Performed by: [Name]
Time: [UTC]
Verified by: [Name]
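After a manual backup, it is worth confirming the object actually landed in S3 before signing off. A sketch using the same bucket as the commands above; the helper name is ours, not part of existing tooling:

def verify-manual-backup [key: string] {
    let listing = (^aws s3 ls $"s3://manual-backups/($key)" | complete)
    if $listing.exit_code != 0 or ($listing.stdout | is-empty) {
        error make {msg: $"Backup ($key) not found in s3://manual-backups/"}
    }
    print $"Verified: ($listing.stdout | str trim)"
}

verify-manual-backup "backup.sql"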
Resource Requirements
Personnel
Required Team (Level 1 Incident):
- Incident Commander (1): Directs response
- Database Specialist (1): Database recovery
- Infrastructure Specialist (1): Infrastructure/K8s
- Operations Engineer (1): Monitoring/verification
- Communications Lead (1): Stakeholder updates
- Executive Sponsor (1): Decision making
Total: 6 people minimum
Available 24/7:
- On-call rotations cover all time zones
- Escalation to backup personnel if needed
Infrastructure
Required Infrastructure (Minimum):
- Primary data center: 99.5% uptime SLA
- Backup data center: Available within 2 hours
- Network: Redundant connectivity, 99.9% SLA
- Storage: Geo-redundant, 99.99% durability
- Communication: Slack, email, phone all operational
Failover Targets:
- Alternate cloud region: Pre-configured
- On-prem backup: Tested quarterly
- Third-party hosting: As last resort
Technology Stack
Essential Systems:
✓ kubectl (Kubernetes CLI)
✓ AWS CLI (S3, EC2 management)
✓ Git (code access)
✓ Email/Slack (communication)
✓ VPN (access to infrastructure)
✓ Backup storage (accessible from anywhere)
Testing Requirements:
- Test failover: Quarterly
- Test restore: Monthly
- Update tools: Annually
Escalation Paths
Escalation Decision Tree
Initial Alert
↓
Can on-call resolve within 15 minutes?
YES → Proceed with resolution
NO → Escalate to Level 2
↓
Can Level 2 team resolve within 30 minutes?
YES → Proceed with resolution
NO → Escalate to Level 3
↓
Can Level 3 team resolve within 1 hour?
YES → Proceed with resolution
NO → Activate full DR procedures
↓
Incident Commander takes full control
All personnel mobilized
Executive decision making engaged
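The tree above reduces to a single function of elapsed time, which makes it easy to embed in incident tooling. A sketch; the escalation targets map to the contacts listed below:

def escalation-level [elapsed: duration] {
    if $elapsed < 15min {
        "on-call"    # keep working the incident
    } else if $elapsed < 30min {
        "level-2"    # senior engineer engaged
    } else if $elapsed < 1hr {
        "level-3"    # management engaged
    } else {
        "full-dr"    # incident commander takes control
    }
}

escalation-level 45min   # => "level-3"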
Contact Escalation
Level 1 (On-Call):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 5 minutes
Level 2 (Senior Engineer):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 15 minutes
Level 3 (Management):
- Engineering Manager: [Name] [Phone]
- Operations Manager: [Name] [Phone]
- Response SLA: 30 minutes
Executive (CTO/VP):
- CTO: [Name] [Phone]
- VP Operations: [Name] [Phone]
- Response SLA: 15 minutes
Business Continuity Testing
Test Schedule
Monthly:
- Backup restore test (data only)
- Alert notification test
- Contact list verification
Quarterly:
- Full disaster recovery drill
- Failover to alternate region
- Complete service recovery simulation
Annually:
- Full comprehensive BCP review
- Stakeholder review and sign-off
- Update based on lessons learned
Monthly Test Procedure
def monthly_bc_test [] {
    print "=== Monthly Business Continuity Test ==="

    # 1. Backup test (see backup strategy procedures)
    print "Testing backup restore..."

    # 2. Notification test: all team members should receive the alert
    print "Testing incident notifications..."
    send_test_alert   # custom command from incident tooling

    # 3. Verify contacts: call or text one contact per team
    print "Verifying contact information..."

    # 4. Document results: record "all tests passed" or issues found
    print "Test complete"
}
Quarterly Disaster Drill
def quarterly_dr_drill [] {
    print "=== Quarterly Disaster Recovery Drill ==="

    # 1. Declare simulated disaster
    declare_simulated_disaster "database-corruption"

    # 2. Activate team
    notify_team
    activate_incident_command

    # 3. Execute recovery procedures: restore from backup, redeploy services

    # 4. Measure timings
    record_rto   # Recovery Time Objective
    record_rpo   # Recovery Point Objective

    # 5. Debrief
    print "Comparing results to targets:"
    print "RTO Target: 4 hours"
    print "RTO Actual: [X] hours"
    print "RPO Target: 1 hour"
    print "RPO Actual: [X] minutes"

    # 6. Identify improvements
    record_improvements
}
Key Contacts & Resources
24/7 Contact Directory
TIER 1 - IMMEDIATE RESPONSE
Position: On-Call Engineer
Name: [Rotating roster]
Primary Phone: [Number]
Backup Phone: [Number]
Slack: @on-call
TIER 2 - SENIOR SUPPORT
Position: Senior Database Engineer
Name: [Name]
Phone: [Number]
Slack: @[name]
TIER 3 - MANAGEMENT
Position: Operations Manager
Name: [Name]
Phone: [Number]
Slack: @[name]
EXECUTIVE ESCALATION
Position: CTO
Name: [Name]
Phone: [Number]
Slack: @[name]
Critical Resources
Documentation:
- Disaster Recovery Runbook: /docs/disaster-recovery/
- Backup Procedures: /docs/disaster-recovery/backup-strategy.md
- Database Recovery: /docs/disaster-recovery/database-recovery-procedures.md
- This BCP: /docs/disaster-recovery/business-continuity-plan.md
Access:
- Backup S3 bucket: s3://vapora-backups/
- Secondary infrastructure: [Details]
- GitHub repository access: [Details]
Tools:
- kubectl config: ~/.kube/config
- AWS credentials: Stored in secure vault
- Slack access: [Workspace]
- Email access: [Details]
Review & Approval
BCP Sign-Off
By signing below, stakeholders acknowledge they have reviewed
and understand this Business Continuity Plan.
CTO: _________________ Date: _________
VP Operations: _________________ Date: _________
Engineering Manager: _________________ Date: _________
Database Team Lead: _________________ Date: _________
Next Review Date: [Quarterly from date above]
BCP Maintenance
Quarterly Review Process
1. Schedule Review (3 weeks before expiration)
   - Calendar reminder sent
   - Team members notified
2. Assess Changes
   - Any new services deployed?
   - Any team changes?
   - Any incidents learned from?
   - Any process improvements?
3. Update Document
   - Add new procedures if needed
   - Update contact information
   - Revise recovery objectives if needed
4. Conduct Drill
   - Test updated procedures
   - Measure against objectives
   - Document results
5. Stakeholder Review
   - Present updates to team
   - Get approval signatures
   - Communicate to organization
Annual Comprehensive Review
1. Full Strategic Review
   - Are recovery objectives still valid?
   - Has business changed?
   - Are we meeting RTO/RPO consistently?
2. Process Improvements
   - What worked well in past year?
   - What could be improved?
   - Any new technologies available?
3. Team Feedback
   - Gather feedback from recent incidents
   - Get input from operations team
   - Consider lessons learned
4. Update and Reapprove
   - Revise critical sections
   - Update all contact information
   - Get new stakeholder approvals
Summary
Business Continuity at a Glance:
| Metric | Target | Status |
|---|---|---|
| RTO | 4 hours | On track |
| RPO | 1 hour | On track |
| Monthly uptime | 99.9% | 99.95% |
| Backup frequency | Hourly | Hourly |
| Restore test | Monthly | Monthly |
| DR drill | Quarterly | Quarterly |
Key Success Factors:
- ✅ Regular testing (monthly backups, quarterly drills)
- ✅ Clear roles & responsibilities
- ✅ Updated contact information
- ✅ Well-documented procedures
- ✅ Stakeholder engagement
- ✅ Continuous improvement
Next Review: [Date + 3 months]