
VAPORA Business Continuity Plan

Strategic plan for maintaining business operations during and after disaster events.


Purpose & Scope

Purpose: Minimize business impact during service disruptions

Scope:

  • Service availability targets
  • Incident response procedures
  • Communication protocols
  • Recovery priorities
  • Business impact assessment

Owner: Operations Team
Review Frequency: Quarterly
Last Updated: 2026-01-12


Business Impact Analysis

Service Criticality

Tier 1 - Critical:

  • Backend API (projects, tasks, agents)
  • SurrealDB (all user data)
  • Authentication system
  • Health monitoring

Tier 2 - Important:

  • Frontend UI
  • Agent orchestration
  • LLM routing

Tier 3 - Optional:

  • Analytics
  • Logging aggregation
  • Monitoring dashboards

Recovery Priorities

Phase 1 (First 30 minutes):

  1. Backend API availability
  2. Database connectivity
  3. User authentication

Phase 2 (Next 30 minutes):

  4. Frontend UI access
  5. Agent services
  6. Core functionality

Phase 3 (Next 2 hours):

  7. All features
  8. Monitoring/alerting
  9. Analytics/logging
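
A quick way to confirm Phase 1 is actually complete is to script the three checks. A minimal shell sketch, assuming the vapora namespace and vapora-backend deployment referenced later in this plan; the backend URL, pod label selector, and /health path are assumptions and should be replaced with the real values:

# Phase 1 verification: backend API, database, authentication
NAMESPACE=vapora
BACKEND_URL=${BACKEND_URL:-https://api.vapora.example.com}   # assumed URL; replace with the real one

# 1. Backend API availability: deployment rolled out, pods Ready
kubectl rollout status deployment/vapora-backend -n "$NAMESPACE" --timeout=120s
kubectl get pods -n "$NAMESPACE" -l app=vapora-backend        # label selector is an assumption

# 2. Database connectivity: SurrealDB pod running
kubectl get pods -n "$NAMESPACE" | grep surrealdb

# 3. User authentication: health endpoint responds (path is an assumption)
curl -fsS "$BACKEND_URL/health" && echo "Phase 1 checks passed"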


Service Level Targets

Availability Targets

Monthly Uptime Target: 99.9%
- Allowed downtime: ~43 minutes/month
- Current status: 99.95% (last quarter)

Weekly Uptime Target: 99.9%
- Allowed downtime: ~10 minutes/week

Daily Uptime Target: 99.8%
- Allowed downtime: ~3 minutes/day
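
Each allowed-downtime figure is the error budget (1 minus the uptime target) multiplied by the length of the period. A small sketch of the arithmetic:

# Downtime budget = (1 - uptime target) x period length
awk 'BEGIN {
  printf "monthly (99.9%%): %.1f min\n", (1 - 0.999) * 30 * 24 * 60
  printf "weekly  (99.9%%): %.1f min\n", (1 - 0.999) * 7  * 24 * 60
  printf "daily   (99.8%%): %.1f min\n", (1 - 0.998) * 24 * 60
}'
# => ~43.2 min/month, ~10.1 min/week, ~2.9 min/day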

Performance Targets

API Response Time: p99 < 500ms
- Current: p99 = 250ms
- Acceptable: < 500ms
- Red alert: > 2000ms

Error Rate: < 0.1%
- Current: 0.05%
- Acceptable: < 0.1%
- Red alert: > 1%

Database Query Time: p99 < 100ms
- Current: p99 = 75ms
- Acceptable: < 100ms
- Red alert: > 500ms
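
For an ad-hoc check against these thresholds during an incident (the dashboards remain the source of truth for p99), a single timed request can be taken with curl. The URL and /health path below are assumptions:

# One-off latency sample; real p99 figures come from the monitoring dashboards
BACKEND_URL=${BACKEND_URL:-https://api.vapora.example.com}   # assumed URL
t=$(curl -o /dev/null -s -w '%{time_total}' "$BACKEND_URL/health")
echo "response time: ${t}s (acceptable < 0.5s, red alert > 2s)"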

Recovery Objectives

RPO (Recovery Point Objective): 1 hour
- Maximum data loss acceptable: 1 hour
- Backup frequency: Hourly

RTO (Recovery Time Objective): 4 hours
- Time to restore full service: 4 hours
- Critical services (Tier 1): 30 minutes
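
Whether the hourly cadence is actually meeting the 1-hour RPO can be checked from the age of the newest object in the backup bucket (the bucket name comes from the Critical Resources section below; the query and the use of GNU date for timestamp parsing are assumptions):

# Newest backup must be < 1 hour old to satisfy the RPO
latest=$(aws s3api list-objects-v2 --bucket vapora-backups \
  --query 'sort_by(Contents, &LastModified)[-1].LastModified' --output text)
age_sec=$(( $(date -u +%s) - $(date -u -d "$latest" +%s) ))   # GNU date
[ "$age_sec" -lt 3600 ] && echo "RPO OK (${age_sec}s old)" || echo "RPO VIOLATION (${age_sec}s old)"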

Incident Response Workflow

Severity Classification

Level 1 - Critical 🔴

  • Service completely unavailable
  • All users affected
  • RPO: 1 hour, RTO: 30 minutes
  • Response: Immediate activation of DR procedures

Level 2 - Major 🟠

  • Service significantly degraded
  • ≥50% of users affected or a critical path broken
  • RPO: 2 hours, RTO: 1 hour
  • Response: Activate incident response team

Level 3 - Minor 🟡

  • Service partially unavailable
  • <50% users affected
  • RPO: 4 hours, RTO: 2 hours
  • Response: Alert on-call engineer

Level 4 - Informational 🟢

  • Service available but with issues
  • No user impact
  • Response: Document in ticket

Response Team Activation

Level 1 Response (Disaster Declaration):

Immediately notify:
  - CTO (@cto)
  - VP Operations (@ops-vp)
  - Incident Commander (assign)
  - Database Team (@dba)
  - Infrastructure Team (@infra)

Activate:
  - 24/7 incident command center
  - Continuous communication (every 2 min)
  - Status page updates (every 5 min)
  - Executive briefings (every 30 min)

Resources:
  - All on-call staff activated
  - Contractors/consultants if needed
  - Executive decision makers available

Communication Plan

Stakeholders & Audiences

| Audience | Notification | Frequency |
|----------|--------------|-----------|
| Internal Team | Slack #incident-critical | Every 2 minutes |
| Customers | Status page + email | Every 5 minutes |
| Executives | Direct call/email | Every 30 minutes |
| Support Team | Slack + email | Initial + every 10 min |
| Partners | Email + phone | Initial + every 1 hour |
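
For the internal-team channel, the initial notification template below can also be pushed from the command line. A minimal sketch assuming a Slack incoming-webhook URL exported as SLACK_WEBHOOK (the webhook and its storage location are assumptions):

# Post the initial alert to #incident-critical via an incoming webhook
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"INCIDENT ALERT - VAPORA SERVICE DISRUPTION: Severity Level 1, investigating. Next update in 5 minutes."}' \
  "$SLACK_WEBHOOK"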

Communication Templates

Initial Notification (to be sent within 5 minutes of incident detection):

INCIDENT ALERT - VAPORA SERVICE DISRUPTION

Status: [Active/Investigating]
Severity: Level [1-4]
Affected Services: [List]
Time Detected: [UTC]
Impact: [X] customers, [Y]% of functionality

Current Actions:
- [Action 1]
- [Action 2]
- [Action 3]

Expected Update: [Time + 5 min]

Support Contact: [Email/Phone]

Ongoing Status Updates (every 5-10 minutes for Level 1):

INCIDENT UPDATE

Severity: Level [1-4]
Duration: [X] minutes
Impact: [Latest status]

What We've Learned:
- [Finding 1]
- [Finding 2]

What We're Doing:
- [Action 1]
- [Action 2]

Estimated Recovery: [Time/ETA]

Next Update: [+5 minutes]

Resolution Notification:

INCIDENT RESOLVED

Service: VAPORA [All systems restored]
Duration: [X hours] [Y minutes]
Root Cause: [Brief description]
Data Loss: [None/X transactions]

Impact Summary:
- Users affected: [X]
- Revenue impact: $[X]

Next Steps:
- Root cause analysis (scheduled for [date])
- Preventive measures (to be implemented by [date])
- Post-incident review ([date])

We apologize for the disruption and appreciate your patience.

Alternative Operating Procedures

Degraded Mode Operations

If Tier 1 services are available but Tier 2-3 degraded:

DEGRADED MODE PROCEDURES

Available:
✓ Create/update projects
✓ Create/update tasks
✓ View dashboard (read-only)
✓ Basic API access

Unavailable:
✗ Advanced search
✗ Analytics
✗ Agent orchestration (can queue, won't execute)
✗ Real-time updates

User Communication:
- Notify via status page
- Email affected users
- Provide timeline for restoration
- Suggest workarounds
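
When degraded mode is entered deliberately, for example to shed load while Tier 1 recovers, Tier 2-3 workloads can be scaled to zero and restored later. The deployment names and replica counts below are assumptions; check kubectl get deploy -n vapora for the real ones:

# Scale non-critical workloads down while Tier 1 recovers (names are assumptions)
kubectl scale deployment/vapora-agents    --replicas=0 -n vapora
kubectl scale deployment/vapora-analytics --replicas=0 -n vapora

# Restore once Tier 1 is stable (replica counts are examples)
kubectl scale deployment/vapora-agents    --replicas=2 -n vapora
kubectl scale deployment/vapora-analytics --replicas=1 -n vapora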

Manual Operations

If automation fails:

MANUAL BACKUP PROCEDURES

If automated backups unavailable:

1. Database Backup:
   kubectl exec pod/surrealdb -- surreal export ... > backup.sql
   aws s3 cp backup.sql s3://manual-backups/

2. Configuration Backup:
   kubectl get configmap -n vapora -o yaml > config.yaml
   aws s3 cp config.yaml s3://manual-backups/

3. Manual Deployment (if automation down):
   kubectl apply -f manifests/
   kubectl rollout status deployment/vapora-backend

Performed by: [Name]
Time: [UTC]
Verified by: [Name]
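
After a manual backup, confirm the upload actually landed and record a checksum next to it so the restore can be verified later. A short sketch; the checksum file name is an assumption:

# Verify the manual backup reached S3 and store its checksum
sha256sum backup.sql > backup.sql.sha256
aws s3 cp backup.sql.sha256 s3://manual-backups/
aws s3 ls s3://manual-backups/ | grep backup.sql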

Resource Requirements

Personnel

Required Team (Level 1 Incident):
- Incident Commander (1): Directs response
- Database Specialist (1): Database recovery
- Infrastructure Specialist (1): Infrastructure/K8s
- Operations Engineer (1): Monitoring/verification
- Communications Lead (1): Stakeholder updates
- Executive Sponsor (1): Decision making

Total: 6 people minimum

Available 24/7:
- On-call rotations cover all time zones
- Escalation to backup personnel if needed

Infrastructure

Required Infrastructure (Minimum):
- Primary data center: 99.5% uptime SLA
- Backup data center: Available within 2 hours
- Network: Redundant connectivity, 99.9% SLA
- Storage: Geo-redundant, 99.99% durability
- Communication: Slack, email, phone all operational

Failover Targets:
- Alternate cloud region: Pre-configured
- On-prem backup: Tested quarterly
- Third-party hosting: As last resort
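
Switching the CLI to the pre-configured alternate region is a single kubeconfig context change. A sketch assuming a hypothetical context name for the failover cluster:

# Point kubectl at the pre-configured failover cluster (context name is an assumption)
kubectl config get-contexts
kubectl config use-context vapora-failover
kubectl get nodes   # confirm the alternate cluster is reachable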

Technology Stack

Essential Systems:
✓ kubectl (Kubernetes CLI)
✓ AWS CLI (S3, EC2 management)
✓ Git (code access)
✓ Email/Slack (communication)
✓ VPN (access to infrastructure)
✓ Backup storage (accessible from anywhere)
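
A quick preflight that the essential tools above are present on the responder's machine is worth running at the start of every drill:

# Verify essential CLI tooling before starting recovery work
for tool in kubectl aws git; do
  command -v "$tool" >/dev/null || echo "MISSING: $tool"
done
kubectl version --client
aws --version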

Testing Requirements:
- Test failover: Quarterly
- Test restore: Monthly
- Update tools: Annually

Escalation Paths

Escalation Decision Tree

Initial Alert
    ↓
Can on-call resolve within 15 minutes?
  YES → Proceed with resolution
  NO → Escalate to Level 2
    ↓
Can Level 2 team resolve within 30 minutes?
  YES → Proceed with resolution
  NO → Escalate to Level 3
    ↓
Can Level 3 team resolve within 1 hour?
  YES → Proceed with resolution
  NO → Activate full DR procedures
    ↓
Incident Commander takes full control
All personnel mobilized
Executive decision making engaged

Contact Escalation

Level 1 (On-Call):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 5 minutes

Level 2 (Senior Engineer):
- Primary: [Name] [Phone]
- Backup: [Name] [Phone]
- Response SLA: 15 minutes

Level 3 (Management):
- Engineering Manager: [Name] [Phone]
- Operations Manager: [Name] [Phone]
- Response SLA: 30 minutes

Executive (CTO/VP):
- CTO: [Name] [Phone]
- VP Operations: [Name] [Phone]
- Response SLA: 15 minutes

Business Continuity Testing

Test Schedule

Monthly:
- Backup restore test (data only)
- Alert notification test
- Contact list verification

Quarterly:
- Full disaster recovery drill
- Failover to alternate region
- Complete service recovery simulation

Annually:
- Full comprehensive BCP review
- Stakeholder review and sign-off
- Update based on lessons learned

Monthly Test Procedure

def monthly_bc_test [] {
  print "=== Monthly Business Continuity Test ==="

  # 1. Backup test
  print "Testing backup restore..."
  # (See backup strategy procedures)

  # 2. Notification test
  print "Testing incident notifications..."
  send_test_alert  # placeholder command: all team members receive a test alert

  # 3. Verify contacts
  print "Verifying contact information..."
  # Call/text one contact per team

  # 4. Document results
  print "Test complete"
  # Record: all tests passed / issues found
}

Quarterly Disaster Drill

def quarterly_dr_drill [] {
  print "=== Quarterly Disaster Recovery Drill ==="

  # 1. Declare a simulated disaster (placeholder command)
  declare_simulated_disaster "database-corruption"

  # 2. Activate the team (placeholder commands)
  notify_team
  activate_incident_command

  # 3. Execute recovery procedures
  # Restore from backup, redeploy services

  # 4. Measure timings (placeholder commands)
  record_rto  # Recovery Time Objective
  record_rpo  # Recovery Point Objective

  # 5. Debrief
  print "Comparing results to targets:"
  print "RTO Target: 4 hours"
  print "RTO Actual: [X] hours"
  print "RPO Target: 1 hour"
  print "RPO Actual: [X] minutes"

  # 6. Identify improvements
  record_improvements
}

Key Contacts & Resources

24/7 Contact Directory

TIER 1 - IMMEDIATE RESPONSE
Position: On-Call Engineer
Name: [Rotating roster]
Primary Phone: [Number]
Backup Phone: [Number]
Slack: @on-call

TIER 2 - SENIOR SUPPORT
Position: Senior Database Engineer
Name: [Name]
Phone: [Number]
Slack: @[name]

TIER 3 - MANAGEMENT
Position: Operations Manager
Name: [Name]
Phone: [Number]
Slack: @[name]

EXECUTIVE ESCALATION
Position: CTO
Name: [Name]
Phone: [Number]
Slack: @[name]

Critical Resources

Documentation:
- Disaster Recovery Runbook: /docs/disaster-recovery/
- Backup Procedures: /docs/disaster-recovery/backup-strategy.md
- Database Recovery: /docs/disaster-recovery/database-recovery-procedures.md
- This BCP: /docs/disaster-recovery/business-continuity-plan.md

Access:
- Backup S3 bucket: s3://vapora-backups/
- Secondary infrastructure: [Details]
- GitHub repository access: [Details]

Tools:
- kubectl config: ~/.kube/config
- AWS credentials: Stored in secure vault
- Slack access: [Workspace]
- Email access: [Details]

Review & Approval

BCP Sign-Off

By signing below, stakeholders acknowledge they have reviewed
and understand this Business Continuity Plan.

CTO: _________________ Date: _________
VP Operations: _________________ Date: _________
Engineering Manager: _________________ Date: _________
Database Team Lead: _________________ Date: _________

Next Review Date: [Quarterly from date above]

BCP Maintenance

Quarterly Review Process

  1. Schedule Review (3 weeks before the next review is due)

    • Calendar reminder sent
    • Team members notified
  2. Assess Changes

    • Any new services deployed?
    • Any team changes?
    • Any incidents learned from?
    • Any process improvements?
  3. Update Document

    • Add new procedures if needed
    • Update contact information
    • Revise recovery objectives if needed
  4. Conduct Drill

    • Test updated procedures
    • Measure against objectives
    • Document results
  5. Stakeholder Review

    • Present updates to team
    • Get approval signatures
    • Communicate to organization

Annual Comprehensive Review

  1. Full Strategic Review

    • Are recovery objectives still valid?
    • Has business changed?
    • Are we meeting RTO/RPO consistently?
  2. Process Improvements

    • What worked well in past year?
    • What could be improved?
    • Any new technologies available?
  3. Team Feedback

    • Gather feedback from recent incidents
    • Get input from operations team
    • Consider lessons learned
  4. Update and Reapprove

    • Revise critical sections
    • Update all contact information
    • Get new stakeholder approvals

Summary

Business Continuity at a Glance:

| Metric | Target | Status |
|--------|--------|--------|
| RTO | 4 hours | On track |
| RPO | 1 hour | On track |
| Monthly uptime | 99.9% | 99.95% |
| Backup frequency | Hourly | Hourly |
| Restore test | Monthly | Monthly |
| DR drill | Quarterly | Quarterly |

Key Success Factors:

  1. Regular testing (monthly backups, quarterly drills)
  2. Clear roles & responsibilities
  3. Updated contact information
  4. Well-documented procedures
  5. Stakeholder engagement
  6. Continuous improvement

Next Review: [Date + 3 months]