VAPORA Business Continuity Plan
Strategic plan for maintaining business operations during and after disaster events.
Purpose & Scope
Purpose: Minimize business impact during service disruptions
Scope:
- Service availability targets
- Incident response procedures
- Communication protocols
- Recovery priorities
- Business impact assessment
Owner: Operations Team
Review Frequency: Quarterly
Last Updated: 2026-01-12
Business Impact Analysis
Service Criticality
Tier 1 - Critical:
- Backend API (projects, tasks, agents)
- SurrealDB (all user data)
- Authentication system
- Health monitoring
Tier 2 - Important:
- Frontend UI
- Agent orchestration
- LLM routing
Tier 3 - Optional:
- Analytics
- Logging aggregation
- Monitoring dashboards
Recovery Priorities
Phase 1 (First 30 minutes):
1. Backend API availability
2. Database connectivity
3. User authentication
Phase 2 (Next 30 minutes):
4. Frontend UI access
5. Agent services
6. Core functionality
Phase 3 (Next 2 hours):
7. All features
8. Monitoring/alerting
9. Analytics/logging
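These phases can also be captured as data so a recovery runbook walks them in strict priority order. A minimal Nushell sketch, assuming illustrative service names rather than the actual VAPORA topology:

def recovery-phases [] {
    [[phase service target_minutes];
     [1 "backend-api"    30]
     [1 "surrealdb"      30]
     [1 "authentication" 30]
     [2 "frontend-ui"    60]
     [2 "agent-services" 60]
     [3 "monitoring"    240]
     [3 "analytics"     240]]
}

# Walk services in priority order during recovery
recovery-phases | sort-by phase | each { |svc|
    print $"Phase ($svc.phase): verify ($svc.service) within ($svc.target_minutes) min"
}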
Service Level Targets
Availability Targets
Monthly Uptime Target: 99.9%
- Allowed downtime: ~43 minutes/month
- Current status: 99.95% (last quarter)
Weekly Uptime Target: 99.9%
- Allowed downtime: ~10 minutes/week
Daily Uptime Target: 99.8%
- Allowed downtime: ~2.9 minutes/day
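The downtime budgets follow directly from the targets: allowed downtime = (1 - target) * period length. A quick Nushell check of the figures above:

def downtime-budget [target: float, period_minutes: int] {
    (1 - $target) * $period_minutes
}

print $"Monthly at 99.9%: (downtime-budget 0.999 43200 | math round) minutes allowed"
print $"Weekly at 99.9%: (downtime-budget 0.999 10080 | math round) minutes allowed"
print $"Daily at 99.8%: (downtime-budget 0.998 1440 | math round --precision 1) minutes allowed"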
Performance Targets
API Response Time: p99 < 500ms
- Current: p99 = 250ms
- Acceptable: < 500ms
- Red alert: > 2000ms
Error Rate: < 0.1%
- Current: 0.05%
- Acceptable: < 0.1%
- Red alert: > 1%
Database Query Time: p99 < 100ms
- Current: p99 = 75ms
- Acceptable: < 100ms
- Red alert: > 500ms
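A sketch of how these tiers can gate alerting, using the API latency thresholds above; the wiring to a real metrics source is omitted:

def check-latency [p99_ms: int] {
    if $p99_ms > 2000 {
        "red-alert"   # page on-call immediately
    } else if $p99_ms > 500 {
        "degraded"    # outside the acceptable range, open a ticket
    } else {
        "ok"
    }
}

check-latency 250   # => "ok", matching the current p99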
Recovery Objectives
RPO (Recovery Point Objective): 1 hour
- Maximum acceptable data loss: 1 hour
- Backup frequency: Hourly
RTO (Recovery Time Objective): 4 hours
- Time to restore full service: 4 hours
- Critical services (Tier 1): 30 minutes
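The hourly backup cadence is what bounds the RPO: in the worst case an incident lands just before the next backup runs, so the maximum data loss equals the backup interval. A sanity check in Nushell:

let backup_interval = 1hr   # from the backup strategy
let rpo_target = 1hr

if $backup_interval > $rpo_target {
    print "Backup cadence cannot satisfy the RPO target"
} else {
    print $"Worst-case data loss: ($backup_interval), within the RPO of ($rpo_target)"
}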
Incident Response Workflow
Severity Classification
Level 1 - Critical 🔴
- Service completely unavailable
- All users affected
- RPO: 1 hour, RTO: 30 minutes
- Response: Immediate activation of DR procedures
Level 2 - Major 🟠
- Service significantly degraded
- ≥50% of users affected or critical path broken
- RPO: 2 hours, RTO: 1 hour
- Response: Activate incident response team
Level 3 - Minor 🟡
- Service partially unavailable
- <50% users affected
- RPO: 4 hours, RTO: 2 hours
- Response: Alert on-call engineer
Level 4 - Informational 🟢
- Service available but with issues
- No user impact
- Response: Document in ticket
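A sketch that maps observed impact onto these levels; the inputs would come from monitoring, and the parameter names here are assumptions:

def classify-severity [available: bool, pct_users_affected: int] {
    if not $available {
        1   # Critical: service completely unavailable
    } else if $pct_users_affected >= 50 {
        2   # Major: significantly degraded
    } else if $pct_users_affected > 0 {
        3   # Minor: partial impact
    } else {
        4   # Informational: no user impact
    }
}

classify-severity true 60   # => 2 (Major)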
Response Team Activation
Level 1 Response (Disaster Declaration):
Immediately notify:
- CTO (@cto)
- VP Operations (@ops-vp)
- Incident Commander (assign)
- Database Team (@dba)
- Infrastructure Team (@infra)
Activate:
- 24/7 incident command center
- Continuous communication (every 2 min)
- Status page updates (every 5 min)
- Executive briefings (every 30 min)
Resources:
- All on-call staff activated
- Contractors/consultants if needed
- Executive decision makers available
Communication Plan
Stakeholders & Audiences
| Audience | Notification | Frequency |
|---|---|---|
| Internal Team | Slack #incident-critical | Every 2 minutes |
| Customers | Status page + email | Every 5 minutes |
| Executives | Direct call/email | Every 30 minutes |
| Support Team | Slack + email | Initial + every 10 min |
| Partners | Email + phone | Initial + every 1 hour |
Communication Templates
Initial Notification (to be sent within 5 minutes of incident):
INCIDENT ALERT - VAPORA SERVICE DISRUPTION
Status: [Active/Investigating]
Severity: Level [1-4]
Affected Services: [List]
Time Detected: [UTC]
Impact: [X] customers, [Y]% of functionality
Current Actions:
- [Action 1]
- [Action 2]
- [Action 3]
Expected Update: [Time + 5 min]
Support Contact: [Email/Phone]
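One way to make the template actionable is to render and post it automatically. A minimal sketch, assuming a hypothetical Slack incoming-webhook URL:

def send-initial-alert [severity: int, services: list<string>, impact: string] {
    let text = ([
        "INCIDENT ALERT - VAPORA SERVICE DISRUPTION"
        $"Severity: Level ($severity)"
        $"Affected Services: ($services | str join ', ')"
        $"Time Detected: (date now | date to-timezone 'UTC' | format date '%Y-%m-%d %H:%M UTC')"
        $"Impact: ($impact)"
    ] | str join "\n")
    # Webhook URL is a placeholder; substitute the real incident-channel hook
    http post --content-type application/json "https://hooks.slack.com/services/EXAMPLE" {text: $text}
}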
Ongoing Status Updates (every 5-10 minutes for Level 1):
INCIDENT UPDATE
Severity: Level [1-4]
Duration: [X] minutes
Impact: [Latest status]
What We've Learned:
- [Finding 1]
- [Finding 2]
What We're Doing:
- [Action 1]
- [Action 2]
Estimated Recovery: [Time/ETA]
Next Update: [+5 minutes]
Resolution Notification:
INCIDENT RESOLVED
Service: VAPORA [All systems restored]
Duration: [X] hours [Y] minutes
Root Cause: [Brief description]
Data Loss: [None/X transactions]
Impact Summary:
- Users affected: [X]
- Revenue impact: $[X]
Next Steps:
- Root cause analysis (scheduled for [date])
- Preventive measures (to be implemented by [date])
- Post-incident review ([date])
We apologize for the disruption and appreciate your patience.
Alternative Operating Procedures
Degraded Mode Operations
If Tier 1 services are available but Tier 2-3 services are degraded:
DEGRADED MODE PROCEDURES
Available:
✓ Create/update projects
✓ Create/update tasks
✓ View dashboard (read-only)
✓ Basic API access
Unavailable:
✗ Advanced search
✗ Analytics
✗ Agent orchestration (can queue, won't execute)
✗ Real-time updates
User Communication:
- Notify via status page
- Email affected users
- Provide timeline for restoration
- Suggest workarounds
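Degraded mode can be expressed as feature flags the backend reads at startup. The flag names below are illustrative, not VAPORA's actual configuration keys:

let degraded_mode = {
    projects_write: true
    tasks_write: true
    dashboard_read_only: true
    advanced_search: false
    analytics: false
    agent_execution: false   # queueing stays on, execution is paused
    realtime_updates: false
}

# List the features to announce as unavailable on the status page
$degraded_mode | transpose feature enabled | where enabled == false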
Manual Operations
If automation fails:
MANUAL BACKUP PROCEDURES
If automated backups unavailable:
1. Database Backup:
   kubectl exec pod/surrealdb -- surreal export ... > backup.sql
   aws s3 cp backup.sql s3://manual-backups/
2. Configuration Backup:
   kubectl get configmap -n vapora -o yaml > config.yaml
   aws s3 cp config.yaml s3://manual-backups/
3. Manual Deployment (if automation down):
   kubectl apply -f manifests/
   kubectl rollout status deployment/vapora-backend
Performed by: [Name]
Time: [UTC]
Verified by: [Name]
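After a manual backup, it is worth confirming the object actually landed in S3 before signing off. A sketch using the same bucket as the commands above; the helper name is ours, not part of existing tooling:

def verify-manual-backup [key: string] {
    let listing = (^aws s3 ls $"s3://manual-backups/($key)" | complete)
    if $listing.exit_code != 0 or ($listing.stdout | is-empty) {
        error make {msg: $"Backup ($key) not found in s3://manual-backups/"}
    }
    print $"Verified: ($listing.stdout | str trim)"
}

verify-manual-backup "backup.sql"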
Resource Requirements
Personnel
Required Team (Level 1 Incident):
- Incident Commander (1): Directs response
- Database Specialist (1): Database recovery
- Infrastructure Specialist (1): Infrastructure/K8s
- Operations Engineer (1): Monitoring/verification
- Communications Lead (1): Stakeholder updates
- Executive Sponsor (1): Decision making
Total: 6 people minimum
Available 24/7:
- On-call rotations cover all time zones
- Escalation to backup personnel if needed
Infrastructure
Required Infrastructure (Minimum):
- Primary data center: 99.5% uptime SLA
- Backup data center: Available within 2 hours
- Network: Redundant connectivity, 99.9% SLA
- Storage: Geo-redundant, 99.99% durability
- Communication: Slack, email, phone all operational
Failover Targets:
- Alternate cloud region: Pre-configured
- On-prem backup: Tested quarterly
- Third-party hosting: As last resort
Technology Stack
Essential Systems:
✓ kubectl (Kubernetes CLI)
✓ AWS CLI (S3, EC2 management)
✓ Git (code access)
✓ Email/Slack (communication)
✓ VPN (access to infrastructure)
✓ Backup storage (accessible from anywhere)
Testing Requirements:
- Test failover: Quarterly
- Test restore: Monthly
- Update tools: Annually
Escalation Paths
Escalation Decision Tree
Initial Alert
↓
Can on-call resolve within 15 minutes?
YES → Proceed with resolution
NO → Escalate to Level 2
↓
Can Level 2 team resolve within 30 minutes?
YES → Proceed with resolution
NO → Escalate to Level 3
↓
Can Level 3 team resolve within 1 hour?
YES → Proceed with resolution
NO → Activate full DR procedures
↓
Incident Commander takes full control
All personnel mobilized
Executive decision making engaged
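The tree above reduces to a single function of elapsed time, which makes it easy to embed in incident tooling. A sketch; the escalation targets map to the contacts listed below:

def escalation-level [elapsed: duration] {
    if $elapsed < 15min {
        "on-call"    # keep working the incident
    } else if $elapsed < 30min {
        "level-2"    # senior engineer engaged
    } else if $elapsed < 1hr {
        "level-3"    # management engaged
    } else {
        "full-dr"    # incident commander takes control
    }
}

escalation-level 45min   # => "level-3"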
Contact Escalation
Level 1 (On-Call):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 5 minutes
Level 2 (Senior Engineer):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 15 minutes
Level 3 (Management):
- Engineering Manager: [Name] [Phone]
- Operations Manager: [Name] [Phone]
- Response SLA: 30 minutes
Executive (CTO/VP):
- CTO: [Name] [Phone]
- VP Operations: [Name] [Phone]
- Response SLA: 15 minutes
Business Continuity Testing
Test Schedule
Monthly:
- Backup restore test (data only)
- Alert notification test
- Contact list verification
Quarterly:
- Full disaster recovery drill
- Failover to alternate region
- Complete service recovery simulation
Annually:
- Full comprehensive BCP review
- Stakeholder review and sign-off
- Update based on lessons learned
Monthly Test Procedure
def monthly_bc_test [] {
    print "=== Monthly Business Continuity Test ==="

    # 1. Backup test (see backup strategy procedures)
    print "Testing backup restore..."

    # 2. Notification test: all team members should receive the alert
    print "Testing incident notifications..."
    send_test_alert   # custom command from incident tooling

    # 3. Verify contacts: call or text one contact per team
    print "Verifying contact information..."

    # 4. Document results: record "all tests passed" or issues found
    print "Test complete"
}
Quarterly Disaster Drill
def quarterly_dr_drill [] {
    print "=== Quarterly Disaster Recovery Drill ==="

    # 1. Declare simulated disaster
    declare_simulated_disaster "database-corruption"

    # 2. Activate team
    notify_team
    activate_incident_command

    # 3. Execute recovery procedures: restore from backup, redeploy services

    # 4. Measure timings
    record_rto   # Recovery Time Objective
    record_rpo   # Recovery Point Objective

    # 5. Debrief
    print "Comparing results to targets:"
    print "RTO Target: 4 hours"
    print "RTO Actual: [X] hours"
    print "RPO Target: 1 hour"
    print "RPO Actual: [X] minutes"

    # 6. Identify improvements
    record_improvements
}
Key Contacts & Resources
24/7 Contact Directory
TIER 1 - IMMEDIATE RESPONSE
Position: On-Call Engineer
Name: [Rotating roster]
Primary Phone: [Number]
Backup Phone: [Number]
Slack: @on-call
TIER 2 - SENIOR SUPPORT
Position: Senior Database Engineer
Name: [Name]
Phone: [Number]
Slack: @[name]
TIER 3 - MANAGEMENT
Position: Operations Manager
Name: [Name]
Phone: [Number]
Slack: @[name]
EXECUTIVE ESCALATION
Position: CTO
Name: [Name]
Phone: [Number]
Slack: @[name]
Critical Resources
Documentation:
- Disaster Recovery Runbook: /docs/disaster-recovery/
- Backup Procedures: /docs/disaster-recovery/backup-strategy.md
- Database Recovery: /docs/disaster-recovery/database-recovery-procedures.md
- This BCP: /docs/disaster-recovery/business-continuity-plan.md
Access:
- Backup S3 bucket: s3://vapora-backups/
- Secondary infrastructure: [Details]
- GitHub repository access: [Details]
Tools:
- kubectl config: ~/.kube/config
- AWS credentials: Stored in secure vault
- Slack access: [Workspace]
- Email access: [Details]
Review & Approval
BCP Sign-Off
By signing below, stakeholders acknowledge they have reviewed
and understand this Business Continuity Plan.
CTO: _________________ Date: _________
VP Operations: _________________ Date: _________
Engineering Manager: _________________ Date: _________
Database Team Lead: _________________ Date: _________
Next Review Date: [Quarterly from date above]
BCP Maintenance
Quarterly Review Process
1. Schedule Review (3 weeks before expiration)
   - Calendar reminder sent
   - Team members notified
2. Assess Changes
   - Any new services deployed?
   - Any team changes?
   - Any incidents learned from?
   - Any process improvements?
3. Update Document
   - Add new procedures if needed
   - Update contact information
   - Revise recovery objectives if needed
4. Conduct Drill
   - Test updated procedures
   - Measure against objectives
   - Document results
5. Stakeholder Review
   - Present updates to team
   - Get approval signatures
   - Communicate to organization
Annual Comprehensive Review
1. Full Strategic Review
   - Are recovery objectives still valid?
   - Has business changed?
   - Are we meeting RTO/RPO consistently?
2. Process Improvements
   - What worked well in past year?
   - What could be improved?
   - Any new technologies available?
3. Team Feedback
   - Gather feedback from recent incidents
   - Get input from operations team
   - Consider lessons learned
4. Update and Reapprove
   - Revise critical sections
   - Update all contact information
   - Get new stakeholder approvals
Summary
Business Continuity at a Glance:
| Metric | Target | Status |
|---|---|---|
| RTO | 4 hours | On track |
| RPO | 1 hour | On track |
| Monthly uptime | 99.9% | 99.95% |
| Backup frequency | Hourly | Hourly |
| Restore test | Monthly | Monthly |
| DR drill | Quarterly | Quarterly |
Key Success Factors:
- ✅ Regular testing (monthly backups, quarterly drills)
- ✅ Clear roles & responsibilities
- ✅ Updated contact information
- ✅ Well-documented procedures
- ✅ Stakeholder engagement
- ✅ Continuous improvement
Next Review: [Date + 3 months]