# ADR-009: SLO and Error Budget Management

**Status**: Accepted | **Date**: 2025-01-16 | **Supersedes**: None

## Context

Provisioning automates infrastructure for production systems, so its failures cascade into
customer infrastructure. SLOs make the trade-off between reliability investment and development velocity explicit.

## Decision

Define service level objectives (SLOs) for each critical service with monitored error budgets. Availability targets guide operational decisions.

## SLOs Defined

### Tier 1: Critical Infrastructure Services

**Availability Target**: 99.99% (52.6 minutes downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Orchestrator | Workflow success rate | 99.99% | Failed / Total workflows (5m window) |
| Vault-Service | Secret retrieval | 99.99% | Failed requests / Total requests (5m) |
| Control-Center | API availability | 99.99% | HTTP 5xx / Total requests (5m) |

### Tier 2: Supporting Services

**Availability Target**: 99.9% (8.76 hours downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Extension-Registry | API availability | 99.9% | HTTP 5xx / Total requests (5m) |
| AI-Service | Response time | 99.9% | Queries > 10s / Total queries (5m) |
| Detector | Analysis completion | 99.9% | Failed analyses / Total analyses (5m) |

### Tier 3: Enhancement Services

**Availability Target**: 99.5% (1.83 days downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| RAG | Index freshness | 99.5% | Stale results / Total queries (5m) |
| MCP-Server | Tool availability | 99.5% | Unavailable tools / Total tools (5m) |
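
The downtime figures quoted for each tier follow directly from the availability target. The sketch below shows the arithmetic, assuming a 365-day year; the function name `allowed_downtime_minutes` and the `TIER_TARGETS` mapping are illustrative and not part of the provisioning codebase.

```python
# Convert an availability target into an allowed-downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year, no leap day


def allowed_downtime_minutes(target_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return (100 - target_percent) / 100 * period_minutes


# Targets taken from the three tiers defined above.
TIER_TARGETS = {"Tier 1": 99.99, "Tier 2": 99.9, "Tier 3": 99.5}

for tier, target in TIER_TARGETS.items():
    minutes = allowed_downtime_minutes(target)
    print(f"{tier}: {target}% -> {minutes:.1f} min/year "
          f"({minutes / 60:.2f} h, {minutes / 1440:.2f} days)")
```

Passing a different `period_minutes` (for example `30 * 24 * 60`) gives the budget for a rolling 30-day window instead of a year.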

## Error Budget Management

### Error Budget Calculation

```text
SLO Target: 99.99% (Tier 1)
Allowed Error Rate: 100% - 99.99% = 0.01%
Error Budget: 0.01% × Total Requests

Example:
- 1 million requests/day
- Error budget = 0.01% × 1,000,000 = 100 allowed errors/day
- If 50 errors already occurred
- Remaining budget = 50 errors (50% of budget consumed)
```
```
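
The same calculation can be expressed as a small helper. This is a minimal sketch, not the project's implementation: the function names are hypothetical, and `Decimal` is used only so that the percentages do not pick up floating-point noise.

```python
from decimal import Decimal


def error_budget(slo_target_percent: str, total_requests: int) -> Decimal:
    """Number of failed requests the SLO tolerates for a given request volume."""
    allowed_error_ratio = (Decimal(100) - Decimal(slo_target_percent)) / Decimal(100)
    return allowed_error_ratio * total_requests


def budget_consumed_percent(errors_so_far: int, budget: Decimal) -> Decimal:
    """Share of the error budget already spent, as a percentage."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return Decimal(errors_so_far) / budget * 100


# Tier 1 target applied to one million requests in a day.
budget = error_budget("99.99", 1_000_000)
errors_so_far = 50
print(f"Budget: {budget:.0f} errors")
print(f"Remaining: {budget - errors_so_far:.0f} errors")
print(f"Consumed: {budget_consumed_percent(errors_so_far, budget):.0f}%")
```

At a 99.99% target, one million requests leave room for only 100 failures, which is why Tier 1 budgets are spent quickly during even short incidents.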

### Error Budget Policies

**Burn Rate** (error consumption speed):

```text
Slow Burn (< 1x rate): Safe, continue normal operations
Fast Burn (1-2x rate): Monitor, may trigger incident response
Critical Burn (> 2x rate): Stop all deployments, emergency incident

Example:
- Daily error budget: 100 errors (0.01% of 1 million requests)
- 1x burn rate: 100 errors/day
- 2x burn rate: 200 errors/day (double consumption)
```
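
Burn rate can be read as the ratio of errors observed in a window to the errors budgeted for that same window. The sketch below classifies a rate into the three bands listed above; the function names and the treatment of exactly 2x as "fast" are assumptions made for illustration.

```python
def burn_rate(observed_errors: float, budgeted_errors: float) -> float:
    """Errors observed in a window divided by errors budgeted for that window."""
    if budgeted_errors <= 0:
        raise ValueError("budgeted_errors must be positive")
    return observed_errors / budgeted_errors


def classify_burn(rate: float) -> str:
    """Map a burn rate onto the slow / fast / critical bands."""
    if rate < 1.0:
        return "slow"      # budget outlasts the window; normal operations
    if rate <= 2.0:
        return "fast"      # monitor; may trigger incident response
    return "critical"      # stop deployments; emergency incident


# Hypothetical window budgeted for 100 errors.
for observed in (40, 150, 250):
    rate = burn_rate(observed, 100)
    print(f"{observed} errors -> {rate:.1f}x ({classify_burn(rate)})")
```

A sustained rate of 2x spends a full window's budget in half the window, so the critical band leaves little time to react.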

**Action Triggers**:

| Burn Rate | Budget Remaining | Action |
| ----------- | ------------------ | -------- |
| < 1x | > 50% | Deploy freely, run experiments |
| 1x | 25-50% | Code freeze for non-critical features |
| 2x | 10-25% | No deployments except hotfixes |
| > 2x | < 10% | Emergency incident, all hands on deck |

### Prometheus Rules for Error Budget

```yaml
# provisioning/monitoring/slo-rules.yaml
groups:
- name: slo_monitoring
  rules:
  - record: slo:success_rate:5m
    expr: (1 - (increase(http_requests_errors_total[5m]) / increase(http_requests_total[5m]))) * 100

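  # Percentage of the Tier 1 error budget still unspent in the window.
  # Allowed error rate = 100 - 99.99 = 0.01 percentage points.
  - record: slo:error_budget:remaining
    expr: (1 - ((100 - slo:success_rate:5m) / 0.01)) * 100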

  - alert: ErrorBudgetBurnWarning
    expr: slo:error_budget:remaining < 50
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Error budget below 50%: {{ $value }}% remaining"

  - alert: ErrorBudgetBurnCritical
    expr: slo:error_budget:remaining < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Error budget critical! {{ $value }}% remaining"
      runbook: "https://provisioning.internal/runbooks/error-budget-critical"
```

## Measuring SLOs

### Service-Level Indicators (SLIs)

```text
SLI = Good Requests / Total Requests

Good Request Definition:
- HTTP status 2xx-3xx
- Response time < 1000ms (latency SLI)
- No errors in workflow execution
- Database transaction committed
```
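
As an illustration of the definition above, the sketch below counts a request as good only when every listed criterion holds. Combining the criteria with a logical AND is an assumption, as are the `Request` fields and function names, which are hypothetical and do not mirror any existing service schema.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 1000  # latency SLI threshold from the definition above


@dataclass
class Request:
    status: int               # HTTP status code
    latency_ms: float         # response time in milliseconds
    workflow_ok: bool = True  # no errors in workflow execution
    committed: bool = True    # database transaction committed


def is_good(req: Request) -> bool:
    """A request is good only if it meets every criterion (assumed AND)."""
    return (
        200 <= req.status < 400
        and req.latency_ms < LATENCY_THRESHOLD_MS
        and req.workflow_ok
        and req.committed
    )


def sli(requests: list[Request]) -> float:
    """SLI = good requests / total requests."""
    if not requests:
        raise ValueError("cannot compute an SLI over zero requests")
    return sum(is_good(r) for r in requests) / len(requests)


sample = [
    Request(200, 120),   # good
    Request(302, 80),    # good: a redirect counts as success
    Request(500, 40),    # bad: server error
    Request(200, 1500),  # bad: too slow
]
print(f"SLI: {sli(sample):.2%}")
```

In practice each service would likely track these criteria as separate SLIs (availability, latency, correctness) rather than one combined ratio; the combined form is shown only because it matches the single formula above.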

### SLO Calculation

```nushell
# Daily SLO report
def slo-report [] {
    let total = (prometheus query "increase(http_requests_total[1d])")
    let errors = (prometheus query "increase(http_requests_errors_total[1d])")
    let success = $total - $errors
    let sli = ($success / $total) * 100

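    let target = 99.99
    # Share of the allowed error rate (100 - target) still unspent, in percent
    let allowed = 100 - $target
    let remaining_budget = (1 - ((100 - $sli) / $allowed)) * 100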

    print $"SLI: ($sli)%"
    print $"Target: ($target)%"
    print $"Budget Remaining: ($remaining_budget)%"

    if $remaining_budget < 10 {
        print "⚠️  CRITICAL: Error budget exhausted, halt deployments"
    } else if $remaining_budget < 25 {
        print "⚠️  WARNING: Error budget low, restrict changes"
    } else {
        print "✓ Healthy: Error budget available"
    }
}

slo-report
```

## Deployment Policies Based on Error Budget

### Green Light Conditions (Error Budget Available)

```text
if remaining_error_budget > 50% {
    allow: normal deployments
    allow: experimental features
    allow: canary at 50%
    frequency: multiple deploys/day
}
```

### Yellow Light Conditions (Error Budget Tight)

```text
if 10% < remaining_error_budget <= 50% {
    allow: critical bug fixes only
    allow: security patches
    disallow: feature releases
    disallow: large refactors
    disallow: canary > 25%
    frequency: 1 deploy/day maximum
}
```

### Red Light Conditions (Error Budget Exhausted)

```text
if remaining_error_budget <= 10% {
    allow: emergency hotfixes only
    disallow: all non-critical changes
    disallow: any other deployments
    action: incident response required
    escalation: VP Engineering approval needed
}
```
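
The three conditions above amount to a simple gate keyed on the remaining error budget. The sketch below shows one way a deployment pipeline might consult it; `deployment_light` and the `POLICIES` table are hypothetical names, and the policy text is abbreviated from the blocks above.

```python
def deployment_light(remaining_budget_percent: float) -> str:
    """Return the traffic-light state for a remaining error budget (in percent)."""
    if remaining_budget_percent > 50:
        return "green"
    if remaining_budget_percent > 10:
        return "yellow"
    return "red"


# Abbreviated summary of what each state permits.
POLICIES = {
    "green": "normal deployments, experiments, canary at 50%",
    "yellow": "critical fixes and security patches only, max 1 deploy/day",
    "red": "emergency hotfixes only, VP Engineering approval needed",
}

for remaining in (80, 50, 25, 10, 3):
    light = deployment_light(remaining)
    print(f"{remaining:>3}% remaining -> {light}: {POLICIES[light]}")
```

The boundaries follow the conditions as written: exactly 50% is yellow and exactly 10% is red.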

## SLO Review Cycle

**Monthly**:
- Review SLI data vs SLO targets
- Identify services approaching budget limits
- Plan remediation for low-performing services

**Quarterly**:
- Review SLO targets against business requirements
- Adjust targets based on incident patterns
- Plan infrastructure improvements

**Annually**:
- SLO target review with product/ops leadership
- Align SLOs with business goals
- Plan year-long reliability improvements

## Consequences

- **Positive**:
  - Data-driven deployment decisions
  - Balance between innovation and reliability
  - Early warning system for degradation
  - Alignment between dev and ops

- **Negative**:
  - Developers may resist deployment restrictions
  - Overhead of monitoring error budgets
  - Complex to communicate to stakeholders
  - SLO targets may feel arbitrary

## Related ADRs

- [ADR-008: Unified Observability Stack](./adr-008-observability-and-monitoring.md) - Measure SLOs via metrics
- [ADR-010: Incident Response Procedures](./adr-010-incident-response.md)