ADR-009: SLO and Error Budget Management

Status: Accepted | Date: 2025-01-16 | Supersedes: None

Context

Provisioning automates infrastructure for production systems, so its failures cascade to customer infrastructure. SLOs make the trade-off between reliability investment and development velocity explicit.

Decision

Define service level objectives (SLOs) for each critical service with monitored error budgets. Availability targets guide operational decisions.

SLOs Defined

Tier 1: Critical Infrastructure Services

Availability Target: 99.99% (52.6 minutes downtime/year)

| Service | Metric | Target | Measurement |
|---|---|---|---|
| Orchestrator | Workflow success rate | 99.99% | Failed / Total workflows (5m window) |
| Vault-Service | Secret retrieval | 99.99% | Failed requests / Total requests (5m) |
| Control-Center | API availability | 99.99% | HTTP 5xx / Total requests (5m) |

Tier 2: Supporting Services

Availability Target: 99.9% (8.76 hours downtime/year)

| Service | Metric | Target | Measurement |
|---|---|---|---|
| Extension-Registry | API availability | 99.9% | HTTP 5xx / Total requests (5m) |
| AI-Service | Response time | 99.9% | Queries > 10s / Total queries (5m) |
| Detector | Analysis completion | 99.9% | Failed analyses / Total analyses (5m) |

Tier 3: Enhancement Services

Availability Target: 99.5% (1.83 days downtime/year)

| Service | Metric | Target | Measurement |
|---|---|---|---|
| RAG | Index freshness | 99.5% | Stale results / Total queries (5m) |
| MCP-Server | Tool availability | 99.5% | Unavailable tools / Total tools (5m) |
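
The yearly downtime allowance for each tier follows directly from its availability target. A minimal sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_minutes` is illustrative and not part of the provisioning codebase:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The unavailable fraction of the year is the downtime allowance.
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


# Print the allowance for the three tier targets defined in this ADR.
for tier, target in [("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.5)]:
    minutes = allowed_downtime_minutes(target)
    print(f"{tier}: {target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h/year)")
```

Each additional "nine" cuts the allowance by a factor of ten, which is why Tier 1 targets are far more expensive to hold than Tier 2 targets.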

Error Budget Management

Error Budget Calculation

SLO Target: 99.99% (Tier 1)
Available Errors: 100% - 99.99% = 0.01%
Error Budget: 0.01% × Total Requests

Example:
- 1 million requests/day
- Error budget = 0.01% × 1,000,000 = 100 allowed errors/day
- If 50 errors already occurred
- Remaining budget = 50 errors (50% of budget consumed)
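
The budget calculation can be sketched as two small functions. This is an illustration under stated assumptions rather than project code: the names `error_budget` and `budget_remaining_percent` are hypothetical, and the sample figures use the Tier 2 target with an invented request volume.

```python
def error_budget(slo_percent: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates for the given traffic."""
    return (100 - slo_percent) / 100 * total_requests


def budget_remaining_percent(slo_percent: float, total_requests: int, errors: int) -> float:
    """Share of the error budget still unspent (100 = untouched, 0 = exhausted).

    The result goes negative once more errors occurred than the budget allows.
    """
    budget = error_budget(slo_percent, total_requests)
    if budget <= 0:
        raise ValueError("no error budget: SLO is 100% or there was no traffic")
    return (1 - errors / budget) * 100


# Illustrative figures: Tier 2 target, 200,000 requests, 50 failures.
budget = error_budget(99.9, 200_000)
remaining = budget_remaining_percent(99.9, 200_000, 50)
print(f"budget: {budget:.0f} errors, remaining: {remaining:.1f}%")
```

Expressing the remainder as a percentage of the budget, rather than as a raw error count, is what lets one set of thresholds apply to services with very different traffic.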

Error Budget Policies

Burn Rate (error consumption speed):

Slow Burn (< 1x rate): Safe, continue normal operations
Fast Burn (1-2x rate): Monitor, may trigger incident response
Critical Burn (> 2x rate): Stop all deployments, emergency incident

Example:
- Daily error budget: 100 errors (0.01% of 1 million requests/day)
- 1x burn rate: 100 errors/day
- 2x burn rate: 200 errors/day (double consumption)

Action Triggers:

| Burn Rate | Budget Remaining | Action |
|---|---|---|
| < 1x | > 50% | Deploy freely, run experiments |
| 1x | 25-50% | Code freeze for non-critical features |
| 2x | 10-25% | No deployments except hotfixes |
| > 2x | < 10% | Emergency incident, all hands on deck |
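
Burn rate is commonly computed as the observed error rate divided by the error rate the SLO allows, so 1x means the budget would be spent exactly at the end of the period. The sketch below applies the slow/fast/critical bands listed above; the function names are hypothetical, and treating the boundaries 1x and 2x as "fast" is an assumption, since the policy does not say which band they belong to.

```python
def burn_rate(errors: int, total_requests: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_requests <= 0:
        raise ValueError("burn rate is undefined without traffic")
    allowed_rate = (100 - slo_percent) / 100
    if allowed_rate <= 0:
        raise ValueError("a 100% SLO leaves no budget to burn")
    return (errors / total_requests) / allowed_rate


def classify_burn(rate: float) -> str:
    """Map a burn rate onto the slow / fast / critical bands of this ADR."""
    if rate < 1:
        return "slow"      # safe, continue normal operations
    if rate <= 2:
        return "fast"      # monitor, may trigger incident response
    return "critical"      # stop all deployments, emergency incident


# Illustrative figures: 150 failures in 1,000,000 requests against a 99.99% SLO.
rate = burn_rate(150, 1_000_000, 99.99)
print(f"burn rate {rate:.2f}x -> {classify_burn(rate)}")
```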

Prometheus Rules for Error Budget

# provisioning/monitoring/slo-rules.yaml
groups:
- name: slo_monitoring
  rules:
  - record: slo:success_rate:5m
    expr: (1 - (increase(http_requests_errors_total[5m]) / increase(http_requests_total[5m]))) * 100

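  # Percentage of the error budget still unspent in the window:
  # 100 = untouched, 0 = exhausted, negative = overspent
  - record: slo:error_budget:remaining
    expr: (1 - ((100 - slo:success_rate:5m) / (100 - 99.99))) * 100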

  - alert: ErrorBudgetBurnWarning
    expr: slo:error_budget:remaining < 50
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Error budget below 50%, {{ $value }}% remaining"

  - alert: ErrorBudgetBurnCritical
    expr: slo:error_budget:remaining < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Error budget critical! {{ $value }}% remaining"
      runbook: "https://provisioning.internal/runbooks/error-budget-critical"

Measuring SLOs

Service-Level Indicators (SLIs)

SLI = Good Requests / Total Requests

Good Request Definition:
- HTTP status 2xx-3xx
- Response time < 1000ms (latency SLI)
- No errors in workflow execution
- Database transaction committed
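
A minimal sketch of how the good-request definition could be turned into an SLI. It assumes a request must satisfy all four conditions at once, which the list above does not state explicitly; the `RequestOutcome` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestOutcome:
    """Hypothetical per-request record; the field names are illustrative."""
    status: int
    latency_ms: float
    workflow_ok: bool = True
    transaction_committed: bool = True


def is_good(outcome: RequestOutcome) -> bool:
    """Assumption: a request is good only if every condition holds."""
    return (
        200 <= outcome.status < 400        # HTTP 2xx-3xx
        and outcome.latency_ms < 1000      # latency SLI threshold
        and outcome.workflow_ok            # no errors in workflow execution
        and outcome.transaction_committed  # database transaction committed
    )


def sli_percent(outcomes: Iterable[RequestOutcome]) -> float:
    """SLI = good requests / total requests, as a percentage."""
    items = list(outcomes)
    if not items:
        raise ValueError("SLI is undefined without requests")
    return sum(is_good(o) for o in items) / len(items) * 100


sample = [
    RequestOutcome(200, 120.0),
    RequestOutcome(302, 80.0),
    RequestOutcome(500, 95.0),    # server error
    RequestOutcome(200, 1500.0),  # too slow
]
print(f"SLI: {sli_percent(sample):.1f}%")
```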

SLO Calculation

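# Daily SLO report
# Assumes `prometheus query` is a custom command that returns a single number.
def slo-report [] {
    let total = (prometheus query "increase(http_requests_total[1d])")
    let errors = (prometheus query "increase(http_requests_errors_total[1d])")
    if $total == 0 {
        print "No requests recorded, SLI is undefined"
        return
    }
    let success = $total - $errors
    let sli = ($success / $total) * 100

    let target = 99.99
    # Share of the error budget still unspent, in percent (100 = untouched, 0 = exhausted)
    let remaining_budget = (1 - ((100 - $sli) / (100 - $target))) * 100
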
    print $"SLI: ($sli)%"
    print $"Target: ($target)%"
    print $"Budget Remaining: ($remaining_budget)%"

    if $remaining_budget < 10 {
        print "⚠️  CRITICAL: Error budget exhausted, halt deployments"
    } else if $remaining_budget < 25 {
        print "⚠️  WARNING: Error budget low, restrict changes"
    } else {
        print "✓ Healthy: Error budget available"
    }
}

slo-report

Deployment Policies Based on Error Budget

Green Light Conditions (Error Budget Available)

if remaining_error_budget > 50% {
    allow: normal deployments
    allow: experimental features
    allow: canary at 50%
    frequency: multiple deploys/day
}

Yellow Light Conditions (Error Budget Tight)

if 10% < remaining_error_budget <= 50% {
    allow: critical bug fixes only
    allow: security patches
    disallow: feature releases
    disallow: large refactors
    disallow: canary > 25%
    frequency: 1 deploy/day maximum
}

Red Light Conditions (Error Budget Exhausted)

if remaining_error_budget <= 10% {
    allow: emergency hotfixes only
    disallow: all non-critical changes
    disallow: all other deployments
    action: incident response required
    escalation: VP Engineering approval needed
}
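
The three conditions above can be sketched as a deployment gate. The thresholds come from this ADR; the change-type labels, and the assumption that a less restrictive light also permits everything a more restrictive one does, are illustrative only.

```python
def budget_light(remaining_percent: float) -> str:
    """Map the remaining error budget (in percent) to a traffic light."""
    if remaining_percent > 50:
        return "green"
    if remaining_percent > 10:
        return "yellow"
    return "red"


# Hypothetical change-type labels; adapt them to the real release taxonomy.
ALLOWED_CHANGES = {
    "green": {"feature", "experiment", "critical_bugfix", "security_patch", "emergency_hotfix"},
    "yellow": {"critical_bugfix", "security_patch", "emergency_hotfix"},
    "red": {"emergency_hotfix"},
}


def may_deploy(change_type: str, remaining_percent: float) -> bool:
    """Return True if the change type is permitted at the current budget level."""
    return change_type in ALLOWED_CHANGES[budget_light(remaining_percent)]


for change, remaining in [("feature", 80.0), ("feature", 30.0), ("emergency_hotfix", 5.0)]:
    verdict = "allowed" if may_deploy(change, remaining) else "blocked"
    print(f"{change} at {remaining}% remaining ({budget_light(remaining)}): {verdict}")
```

A gate like this could run as a pipeline step before deployment, with the remaining budget read from the monitoring system. Canary limits, deploy frequency, and the approval and escalation steps are left out of the sketch.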

SLO Review Cycle

Monthly:

  • Review SLI data vs SLO targets
  • Identify services approaching budget limits
  • Plan remediation for low-performing services

Quarterly:

  • Review SLO targets against business requirements
  • Adjust targets based on incident patterns
  • Plan infrastructure improvements

Annually:

  • SLO target review with product/ops leadership
  • Align SLOs with business goals
  • Plan year-long reliability improvements

Consequences

  • Positive:

    • Data-driven deployment decisions
    • Balance between innovation and reliability
    • Early warning system for degradation
    • Alignment between dev and ops
  • Negative:

    • Developers may resist deployment restrictions
    • Overhead of monitoring error budgets
    • Complex to communicate to stakeholders
    • SLO targets may feel arbitrary