
VAPORA Business Continuity Plan

Strategic plan for maintaining business operations during and after disaster events.


Purpose & Scope

Purpose: Minimize business impact during service disruptions

Scope:

  • Service availability targets
  • Incident response procedures
  • Communication protocols
  • Recovery priorities
  • Business impact assessment

Owner: Operations Team
Review Frequency: Quarterly
Last Updated: 2026-01-12


Business Impact Analysis

Service Criticality

Tier 1 - Critical:

  • Backend API (projects, tasks, agents)
  • SurrealDB (all user data)
  • Authentication system
  • Health monitoring

Tier 2 - Important:

  • Frontend UI
  • Agent orchestration
  • LLM routing

Tier 3 - Optional:

  • Analytics
  • Logging aggregation
  • Monitoring dashboards

Recovery Priorities

Phase 1 (First 30 minutes):

  1. Backend API availability
  2. Database connectivity
  3. User authentication

Phase 2 (Next 30 minutes):

  4. Frontend UI access
  5. Agent services
  6. Core functionality

Phase 3 (Next 2 hours):

  7. All features
  8. Monitoring/alerting
  9. Analytics/logging
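
A quick way to confirm Phase 1 is actually complete is to script the three checks. A minimal shell sketch, assuming the vapora namespace and vapora-backend deployment referenced later in this plan; the backend URL, pod label selector, and /health path are assumptions and should be replaced with the real values:

# Phase 1 verification: backend API, database, authentication
NAMESPACE=vapora
BACKEND_URL=${BACKEND_URL:-https://api.vapora.example.com}   # assumed URL; replace with the real one

# 1. Backend API availability: deployment rolled out, pods Ready
kubectl rollout status deployment/vapora-backend -n "$NAMESPACE" --timeout=120s
kubectl get pods -n "$NAMESPACE" -l app=vapora-backend        # label selector is an assumption

# 2. Database connectivity: SurrealDB pod running
kubectl get pods -n "$NAMESPACE" | grep surrealdb

# 3. User authentication: health endpoint responds (path is an assumption)
curl -fsS "$BACKEND_URL/health" && echo "Phase 1 checks passed"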


Service Level Targets

Availability Targets

Monthly Uptime Target: 99.9%
- Allowed downtime: ~43 minutes/month
- Current status: 99.95% (last quarter)

Weekly Uptime Target: 99.9%
- Allowed downtime: ~10 minutes/week

Daily Uptime Target: 99.8%
- Allowed downtime: ~3 minutes/day
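
Each allowed-downtime figure is the error budget (1 minus the uptime target) multiplied by the length of the period. A small sketch of the arithmetic:

# Downtime budget = (1 - uptime target) x period length
awk 'BEGIN {
  printf "monthly (99.9%%): %.1f min\n", (1 - 0.999) * 30 * 24 * 60
  printf "weekly  (99.9%%): %.1f min\n", (1 - 0.999) * 7  * 24 * 60
  printf "daily   (99.8%%): %.1f min\n", (1 - 0.998) * 24 * 60
}'
# => ~43.2 min/month, ~10.1 min/week, ~2.9 min/day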

Performance Targets

API Response Time: p99 < 500ms
- Current: p99 = 250ms
- Acceptable: < 500ms
- Red alert: > 2000ms

Error Rate: < 0.1%
- Current: 0.05%
- Acceptable: < 0.1%
- Red alert: > 1%

Database Query Time: p99 < 100ms
- Current: p99 = 75ms
- Acceptable: < 100ms
- Red alert: > 500ms
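
For an ad-hoc check against these thresholds during an incident (the dashboards remain the source of truth for p99), a single timed request can be taken with curl. The URL and /health path below are assumptions:

# One-off latency sample; real p99 figures come from the monitoring dashboards
BACKEND_URL=${BACKEND_URL:-https://api.vapora.example.com}   # assumed URL
t=$(curl -o /dev/null -s -w '%{time_total}' "$BACKEND_URL/health")
echo "response time: ${t}s (acceptable < 0.5s, red alert > 2s)"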

Recovery Objectives

RPO (Recovery Point Objective): 1 hour
- Maximum data loss acceptable: 1 hour
- Backup frequency: Hourly

RTO (Recovery Time Objective): 4 hours
- Time to restore full service: 4 hours
- Critical services (Tier 1): 30 minutes
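
Whether the hourly cadence is actually meeting the 1-hour RPO can be checked from the age of the newest object in the backup bucket (the bucket name comes from the Critical Resources section below; the query and the use of GNU date for timestamp parsing are assumptions):

# Newest backup must be < 1 hour old to satisfy the RPO
latest=$(aws s3api list-objects-v2 --bucket vapora-backups \
  --query 'sort_by(Contents, &LastModified)[-1].LastModified' --output text)
age_sec=$(( $(date -u +%s) - $(date -u -d "$latest" +%s) ))   # GNU date
[ "$age_sec" -lt 3600 ] && echo "RPO OK (${age_sec}s old)" || echo "RPO VIOLATION (${age_sec}s old)"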

Incident Response Workflow

Severity Classification

Level 1 - Critical 🔴

  • Service completely unavailable
  • All users affected
  • RPO: 1 hour, RTO: 30 minutes
  • Response: Immediate activation of DR procedures

Level 2 - Major 🟠

  • Service significantly degraded
  • ≥50% of users affected or a critical path broken
  • RPO: 2 hours, RTO: 1 hour
  • Response: Activate incident response team

Level 3 - Minor 🟡

  • Service partially unavailable
  • <50% users affected
  • RPO: 4 hours, RTO: 2 hours
  • Response: Alert on-call engineer

Level 4 - Informational 🟢

  • Service available but with issues
  • No user impact
  • Response: Document in ticket

Response Team Activation

Level 1 Response (Disaster Declaration):

Immediately notify:
  - CTO (@cto)
  - VP Operations (@ops-vp)
  - Incident Commander (assign)
  - Database Team (@dba)
  - Infrastructure Team (@infra)

Activate:
  - 24/7 incident command center
  - Continuous communication (every 2 min)
  - Status page updates (every 5 min)
  - Executive briefings (every 30 min)

Resources:
  - All on-call staff activated
  - Contractors/consultants if needed
  - Executive decision makers available

Communication Plan

Stakeholders & Audiences

| Audience | Notification | Frequency |
|----------|--------------|-----------|
| Internal Team | Slack #incident-critical | Every 2 minutes |
| Customers | Status page + email | Every 5 minutes |
| Executives | Direct call/email | Every 30 minutes |
| Support Team | Slack + email | Initial + every 10 min |
| Partners | Email + phone | Initial + every 1 hour |
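
For the internal-team channel, the initial notification template below can also be pushed from the command line. A minimal sketch assuming a Slack incoming-webhook URL exported as SLACK_WEBHOOK (the webhook and its storage location are assumptions):

# Post the initial alert to #incident-critical via an incoming webhook
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"INCIDENT ALERT - VAPORA SERVICE DISRUPTION: Severity Level 1, investigating. Next update in 5 minutes."}' \
  "$SLACK_WEBHOOK"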

Communication Templates

Initial Notification (to be sent within 5 minutes of incident detection):

INCIDENT ALERT - VAPORA SERVICE DISRUPTION

Status: [Active/Investigating]
Severity: Level [1-4]
Affected Services: [List]
Time Detected: [UTC]
Impact: [X] customers, [Y]% of functionality

Current Actions:
- [Action 1]
- [Action 2]
- [Action 3]

Expected Update: [Time + 5 min]

Support Contact: [Email/Phone]

Ongoing Status Updates (every 5-10 minutes for Level 1):

INCIDENT UPDATE

Severity: Level [1-4]
Duration: [X] minutes
Impact: [Latest status]

What We've Learned:
- [Finding 1]
- [Finding 2]

What We're Doing:
- [Action 1]
- [Action 2]

Estimated Recovery: [Time/ETA]

Next Update: [+5 minutes]

Resolution Notification:

INCIDENT RESOLVED

Service: VAPORA [All systems restored]
Duration: [X hours] [Y minutes]
Root Cause: [Brief description]
Data Loss: [None/X transactions]

Impact Summary:
- Users affected: [X]
- Revenue impact: $[X]

Next Steps:
- Root cause analysis (scheduled for [date])
- Preventive measures (to be implemented by [date])
- Post-incident review ([date])

We apologize for the disruption and appreciate your patience.

Alternative Operating Procedures

Degraded Mode Operations

If Tier 1 services are available but Tier 2-3 degraded:

DEGRADED MODE PROCEDURES

Available:
✓ Create/update projects
✓ Create/update tasks
✓ View dashboard (read-only)
✓ Basic API access

Unavailable:
✗ Advanced search
✗ Analytics
✗ Agent orchestration (can queue, won't execute)
✗ Real-time updates

User Communication:
- Notify via status page
- Email affected users
- Provide timeline for restoration
- Suggest workarounds
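
When degraded mode is entered deliberately, for example to shed load while Tier 1 recovers, Tier 2-3 workloads can be scaled to zero and restored later. The deployment names and replica counts below are assumptions; check kubectl get deploy -n vapora for the real ones:

# Scale non-critical workloads down while Tier 1 recovers (names are assumptions)
kubectl scale deployment/vapora-agents    --replicas=0 -n vapora
kubectl scale deployment/vapora-analytics --replicas=0 -n vapora

# Restore once Tier 1 is stable (replica counts are examples)
kubectl scale deployment/vapora-agents    --replicas=2 -n vapora
kubectl scale deployment/vapora-analytics --replicas=1 -n vapora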

Manual Operations

If automation fails:

MANUAL BACKUP PROCEDURES

If automated backups unavailable:

1. Database Backup:
   kubectl exec pod/surrealdb -- surreal export ... > backup.sql
   aws s3 cp backup.sql s3://manual-backups/

2. Configuration Backup:
   kubectl get configmap -n vapora -o yaml > config.yaml
   aws s3 cp config.yaml s3://manual-backups/

3. Manual Deployment (if automation down):
   kubectl apply -f manifests/
   kubectl rollout status deployment/vapora-backend

Performed by: [Name]
Time: [UTC]
Verified by: [Name]
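
After a manual backup, confirm the upload actually landed and record a checksum next to it so the restore can be verified later. A short sketch; the checksum file name is an assumption:

# Verify the manual backup reached S3 and store its checksum
sha256sum backup.sql > backup.sql.sha256
aws s3 cp backup.sql.sha256 s3://manual-backups/
aws s3 ls s3://manual-backups/ | grep backup.sql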

Resource Requirements

Personnel

Required Team (Level 1 Incident):
- Incident Commander (1): Directs response
- Database Specialist (1): Database recovery
- Infrastructure Specialist (1): Infrastructure/K8s
- Operations Engineer (1): Monitoring/verification
- Communications Lead (1): Stakeholder updates
- Executive Sponsor (1): Decision making

Total: 6 people minimum

Available 24/7:
- On-call rotations cover all time zones
- Escalation to backup personnel if needed

Infrastructure

Required Infrastructure (Minimum):
- Primary data center: 99.5% uptime SLA
- Backup data center: Available within 2 hours
- Network: Redundant connectivity, 99.9% SLA
- Storage: Geo-redundant, 99.99% durability
- Communication: Slack, email, phone all operational

Failover Targets:
- Alternate cloud region: Pre-configured
- On-prem backup: Tested quarterly
- Third-party hosting: As last resort
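
Switching the CLI to the pre-configured alternate region is a single kubeconfig context change. A sketch assuming a hypothetical context name for the failover cluster:

# Point kubectl at the pre-configured failover cluster (context name is an assumption)
kubectl config get-contexts
kubectl config use-context vapora-failover
kubectl get nodes   # confirm the alternate cluster is reachable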

Technology Stack

Essential Systems:
✓ kubectl (Kubernetes CLI)
✓ AWS CLI (S3, EC2 management)
✓ Git (code access)
✓ Email/Slack (communication)
✓ VPN (access to infrastructure)
✓ Backup storage (accessible from anywhere)
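
A quick preflight that the essential tools above are present on the responder's machine is worth running at the start of every drill:

# Verify essential CLI tooling before starting recovery work
for tool in kubectl aws git; do
  command -v "$tool" >/dev/null || echo "MISSING: $tool"
done
kubectl version --client
aws --version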

Testing Requirements:
- Test failover: Quarterly
- Test restore: Monthly
- Update tools: Annually

Escalation Paths

Escalation Decision Tree

Initial Alert
    ↓
Can on-call resolve within 15 minutes?
  YES → Proceed with resolution
  NO → Escalate to Level 2
    ↓
Can Level 2 team resolve within 30 minutes?
  YES → Proceed with resolution
  NO → Escalate to Level 3
    ↓
Can Level 3 team resolve within 1 hour?
  YES → Proceed with resolution
  NO → Activate full DR procedures
    ↓
Incident Commander takes full control
All personnel mobilized
Executive decision making engaged

Contact Escalation

Level 1 (On-Call):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 5 minutes

Level 2 (Senior Engineer):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 15 minutes

Level 3 (Management):
- Engineering Manager: [Name] [Phone]
- Operations Manager: [Name] [Phone]
- Response SLA: 30 minutes

Executive (CTO/VP):
- CTO: [Name] [Phone]
- VP Operations: [Name] [Phone]
- Response SLA: 15 minutes

Business Continuity Testing

Test Schedule

Monthly:
- Backup restore test (data only)
- Alert notification test
- Contact list verification

Quarterly:
- Full disaster recovery drill
- Failover to alternate region
- Complete service recovery simulation

Annually:
- Full comprehensive BCP review
- Stakeholder review and sign-off
- Update based on lessons learned

Monthly Test Procedure

def monthly_bc_test [] {
  print "=== Monthly Business Continuity Test ==="

  # 1. Backup test
  print "Testing backup restore..."
  # (See backup strategy procedures)

  # 2. Notification test
  print "Testing incident notifications..."
  send_test_alert  # placeholder command: all team members receive a test alert

  # 3. Verify contacts
  print "Verifying contact information..."
  # Call/text one contact per team

  # 4. Document results
  print "Test complete"
  # Record: all tests passed / issues found
}

Quarterly Disaster Drill

def quarterly_dr_drill [] {
  print "=== Quarterly Disaster Recovery Drill ==="

  # 1. Declare a simulated disaster (placeholder command)
  declare_simulated_disaster "database-corruption"

  # 2. Activate the team (placeholder commands)
  notify_team
  activate_incident_command

  # 3. Execute recovery procedures
  # Restore from backup, redeploy services

  # 4. Measure timings (placeholder commands)
  record_rto  # Recovery Time Objective
  record_rpo  # Recovery Point Objective

  # 5. Debrief
  print "Comparing results to targets:"
  print "RTO Target: 4 hours"
  print "RTO Actual: [X] hours"
  print "RPO Target: 1 hour"
  print "RPO Actual: [X] minutes"

  # 6. Identify improvements
  record_improvements
}

Key Contacts & Resources

24/7 Contact Directory

TIER 1 - IMMEDIATE RESPONSE
Position: On-Call Engineer
Name: [Rotating roster]
Primary Phone: [Number]
Backup Phone: [Number]
Slack: @on-call

TIER 2 - SENIOR SUPPORT
Position: Senior Database Engineer
Name: [Name]
Phone: [Number]
Slack: @[name]

TIER 3 - MANAGEMENT
Position: Operations Manager
Name: [Name]
Phone: [Number]
Slack: @[name]

EXECUTIVE ESCALATION
Position: CTO
Name: [Name]
Phone: [Number]
Slack: @[name]

Critical Resources

Documentation:
- Disaster Recovery Runbook: /docs/disaster-recovery/
- Backup Procedures: /docs/disaster-recovery/backup-strategy.md
- Database Recovery: /docs/disaster-recovery/database-recovery-procedures.md
- This BCP: /docs/disaster-recovery/business-continuity-plan.md

Access:
- Backup S3 bucket: s3://vapora-backups/
- Secondary infrastructure: [Details]
- GitHub repository access: [Details]

Tools:
- kubectl config: ~/.kube/config
- AWS credentials: Stored in secure vault
- Slack access: [Workspace]
- Email access: [Details]

Review & Approval

BCP Sign-Off

By signing below, stakeholders acknowledge they have reviewed
and understand this Business Continuity Plan.

CTO: _________________ Date: _________
VP Operations: _________________ Date: _________
Engineering Manager: _________________ Date: _________
Database Team Lead: _________________ Date: _________

Next Review Date: [Quarterly from date above]

BCP Maintenance

Quarterly Review Process

  1. Schedule Review (3 weeks before the next review is due)

    • Calendar reminder sent
    • Team members notified
  2. Assess Changes

    • Any new services deployed?
    • Any team changes?
    • Any incidents learned from?
    • Any process improvements?
  3. Update Document

    • Add new procedures if needed
    • Update contact information
    • Revise recovery objectives if needed
  4. Conduct Drill

    • Test updated procedures
    • Measure against objectives
    • Document results
  5. Stakeholder Review

    • Present updates to team
    • Get approval signatures
    • Communicate to organization

Annual Comprehensive Review

  1. Full Strategic Review

    • Are recovery objectives still valid?
    • Has business changed?
    • Are we meeting RTO/RPO consistently?
  2. Process Improvements

    • What worked well in past year?
    • What could be improved?
    • Any new technologies available?
  3. Team Feedback

    • Gather feedback from recent incidents
    • Get input from operations team
    • Consider lessons learned
  4. Update and Reapprove

    • Revise critical sections
    • Update all contact information
    • Get new stakeholder approvals

Summary

Business Continuity at a Glance:

| Metric | Target | Status |
|--------|--------|--------|
| RTO | 4 hours | On track |
| RPO | 1 hour | On track |
| Monthly uptime | 99.9% | 99.95% |
| Backup frequency | Hourly | Hourly |
| Restore test | Monthly | Monthly |
| DR drill | Quarterly | Quarterly |

Key Success Factors:

  1. Regular testing (monthly backups, quarterly drills)
  2. Clear roles & responsibilities
  3. Updated contact information
  4. Well-documented procedures
  5. Stakeholder engagement
  6. Continuous improvement

Next Review: [Date + 3 months]