ADR-009: SLO and Error Budget Management
Status: Accepted | Date: 2025-01-16 | Supersedes: None
Context
Provisioning automates infrastructure for production systems, so its failures cascade to customer infrastructure. SLOs give an explicit way to balance reliability investment against development velocity.
Decision
Define service level objectives (SLOs) for each critical service with monitored error budgets. Availability targets guide operational decisions.
SLOs Defined
Tier 1: Critical Infrastructure Services
Availability Target: 99.99% (52.6 minutes downtime/year)
| Service | Metric | Target | Measurement |
|---|---|---|---|
| Orchestrator | Workflow success rate | 99.99% | Failed / Total workflows (5m window) |
| Vault-Service | Secret retrieval | 99.99% | Failed requests / Total requests (5m) |
| Control-Center | API availability | 99.99% | HTTP 5xx / Total requests (5m) |
Tier 2: Supporting Services
Availability Target: 99.9% (8.76 hours downtime/year)
| Service | Metric | Target | Measurement |
|---|---|---|---|
| Extension-Registry | API availability | 99.9% | HTTP 5xx / Total requests (5m) |
| AI-Service | Response time | 99.9% | Queries > 10s / Total queries (5m) |
| Detector | Analysis completion | 99.9% | Failed analyses / Total analyses (5m) |
Tier 3: Enhancement Services
Availability Target: 99.5% (1.83 days downtime/year)
| Service | Metric | Target | Measurement |
|---|---|---|---|
| RAG | Index freshness | 99.5% | Stale results / Total queries (5m) |
| MCP-Server | Tool availability | 99.5% | Unavailable tools / Total tools (5m) |
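The yearly downtime allowance for a tier follows directly from its availability target. The sketch below shows that arithmetic, assuming a 365-day year; the function name `allowed_downtime_hours` is illustrative and not part of the provisioning codebase.

```python
# Sketch: convert an availability target into allowed downtime per year.
# Assumption: a 365-day year (8760 hours); names here are hypothetical.

HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(target_percent: float) -> float:
    """Return the yearly downtime, in hours, permitted by an availability target."""
    return (100.0 - target_percent) / 100.0 * HOURS_PER_YEAR


for tier, target in [("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.5)]:
    hours = allowed_downtime_hours(target)
    print(f"{tier}: {target}% -> {hours * 60:.1f} min = {hours:.2f} h = {hours / 24:.2f} days")
```

Expressing all tiers with one formula makes it easy to re-check the quoted figures whenever a target is adjusted during a quarterly review.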
Error Budget Management
Error Budget Calculation
SLO Target: 99.99% (Tier 1)
Allowed Error Rate: 100% - 99.99% = 0.01%
Error Budget: 0.01% × Total Requests
Example:
- 1 million requests/day
- Error budget = 1,000,000 × 0.0001 = 100 allowed errors/day
- If 50 errors have already occurred
- Remaining budget = 50 errors (50% of budget consumed)
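The calculation can be checked with a few lines of code. This is a minimal sketch, not the project's implementation; `error_budget` and `budget_remaining_percent` are hypothetical helper names. Note that 0.01% of 1,000,000 requests is 100 errors.

```python
# Sketch: request-based error budget for a given SLO target.
# Assumption: the budget is counted in failed requests over one period (e.g. a day).


def error_budget(total_requests: int, slo_percent: float) -> float:
    """Number of failed requests the SLO tolerates for this request volume."""
    return total_requests * (100.0 - slo_percent) / 100.0


def budget_remaining_percent(total_requests: int, errors: int, slo_percent: float) -> float:
    """Share of the error budget still unspent (100 = untouched, 0 = exhausted)."""
    budget = error_budget(total_requests, slo_percent)
    if budget <= 0:
        return 0.0  # a 100% target leaves no budget to spend
    return (1.0 - errors / budget) * 100.0


budget = error_budget(1_000_000, 99.99)
remaining = budget_remaining_percent(1_000_000, 50, 99.99)
print(f"budget: {budget:.0f} errors, remaining: {remaining:.0f}%")
```

The remaining percentage goes negative once the budget is overspent, which is useful for reporting how far past the limit a service has gone.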
Error Budget Policies
Burn Rate (error consumption speed):
Slow Burn (< 1x rate): Safe, continue normal operations
Fast Burn (1-2x rate): Monitor, may trigger incident response
Critical Burn (> 2x rate): Stop all deployments, emergency incident
Example:
- Daily error budget: 100 errors (0.01% of 1 million requests)
- 1x burn rate: 100 errors/day (budget is spent exactly by the end of the day)
- 2x burn rate: 200 errors/day (double consumption)
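Burn rate is the ratio of errors actually observed in a period to the errors the budget allows for that same period. The sketch below classifies a rate into the three bands listed above. The function names are hypothetical, and treating exactly 2x as "fast" rather than "critical" is an assumption based on the "1-2x" and "> 2x" wording.

```python
# Sketch: compute and classify an error budget burn rate.
# Assumption: observed errors and budget cover the same time period.


def burn_rate(errors_observed: float, budget_for_period: float) -> float:
    """Ratio of observed errors to the errors allowed for the same period."""
    if budget_for_period <= 0:
        raise ValueError("budget_for_period must be positive")
    return errors_observed / budget_for_period


def classify_burn(rate: float) -> str:
    """Map a burn rate onto the slow / fast / critical bands."""
    if rate < 1.0:
        return "slow"      # safe, continue normal operations
    if rate <= 2.0:
        return "fast"      # monitor, may trigger incident response
    return "critical"      # stop all deployments, emergency incident


for errors in (50, 150, 250):
    rate = burn_rate(errors, 100)
    print(f"{errors} errors against a budget of 100: {rate:.1f}x ({classify_burn(rate)})")
```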
Action Triggers:
| Burn Rate | Budget Remaining | Action |
|---|---|---|
| < 1x | > 50% | Deploy freely, run experiments |
| 1x | 25-50% | Code freeze for non-critical features |
| 2x | 10-25% | No deployments except hotfixes |
| > 2x | < 10% | Emergency incident, all hands on deck |
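The table pairs a burn rate with a remaining-budget range, but the two signals can disagree in practice (for example, a low burn rate late in a period when most of the budget is already gone). One possible reading, sketched below, is to evaluate each signal separately and apply the stricter row. That rule, the handling of boundary values (25% falls in the 25-50% row, burn rates between 1x and 2x fall in the 1x row), and the function names are all assumptions rather than part of this ADR.

```python
# Sketch: pick the action row from burn rate and remaining budget.
# Assumption: when the two signals point at different rows, the stricter one wins.

ACTIONS = [
    "Deploy freely, run experiments",
    "Code freeze for non-critical features",
    "No deployments except hotfixes",
    "Emergency incident, all hands on deck",
]


def level_from_burn_rate(rate: float) -> int:
    """Row index implied by the burn rate column."""
    if rate > 2.0:
        return 3
    if rate >= 2.0:
        return 2
    if rate >= 1.0:
        return 1
    return 0


def level_from_budget(remaining_percent: float) -> int:
    """Row index implied by the budget remaining column."""
    if remaining_percent > 50.0:
        return 0
    if remaining_percent >= 25.0:
        return 1
    if remaining_percent >= 10.0:
        return 2
    return 3


def required_action(rate: float, remaining_percent: float) -> str:
    """Return the action for the stricter of the two signals."""
    level = max(level_from_burn_rate(rate), level_from_budget(remaining_percent))
    return ACTIONS[level]


print(required_action(0.5, 80.0))
print(required_action(0.5, 20.0))
print(required_action(3.0, 90.0))
```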
Prometheus Rules for Error Budget
# provisioning/monitoring/slo-rules.yaml
groups:
  - name: slo_monitoring
    rules:
      # Success rate over the last 5 minutes, as a percentage
      - record: slo:success_rate:5m
        expr: (1 - (increase(http_requests_errors_total[5m]) / increase(http_requests_total[5m]))) * 100
      # Share of the Tier 1 error budget (100 - 99.99) still unspent, as a percentage
      - record: slo:error_budget:remaining
        expr: (1 - ((100 - slo:success_rate:5m) / (100 - 99.99))) * 100
      - alert: ErrorBudgetBurnWarning
        expr: slo:error_budget:remaining < 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget below 50%: {{ $value }}% remaining"
      - alert: ErrorBudgetBurnCritical
        expr: slo:error_budget:remaining < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget critical! {{ $value }}% remaining"
          runbook: "https://provisioning.internal/runbooks/error-budget-critical"
Measuring SLOs
Service-Level Indicators (SLIs)
SLI = Good Requests / Total Requests
Good Request Definition:
- HTTP status 2xx-3xx
- Response time < 1000ms (latency SLI)
- No errors in workflow execution
- Database transaction committed
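The definition above can be turned into a simple classifier. The sketch below assumes that a request must satisfy all four criteria to count as good, which the list does not state explicitly. The `RequestOutcome` type and its field names are hypothetical.

```python
# Sketch: classify requests as good or bad and compute the SLI.
# Assumption: a request is good only if every listed criterion holds.
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestOutcome:
    status: int                        # HTTP status code
    latency_ms: float                  # end-to-end response time
    workflow_ok: bool = True           # no errors in workflow execution
    transaction_committed: bool = True # database transaction committed


def is_good(outcome: RequestOutcome) -> bool:
    """Apply the good-request definition to a single request."""
    return (
        200 <= outcome.status < 400
        and outcome.latency_ms < 1000
        and outcome.workflow_ok
        and outcome.transaction_committed
    )


def sli_percent(outcomes: Sequence[RequestOutcome]) -> float:
    """SLI = good requests / total requests, as a percentage."""
    if not outcomes:
        raise ValueError("cannot compute an SLI without requests")
    good = sum(1 for outcome in outcomes if is_good(outcome))
    return good / len(outcomes) * 100.0


sample = [
    RequestOutcome(200, 120),
    RequestOutcome(500, 80),    # server error
    RequestOutcome(200, 1500),  # too slow
    RequestOutcome(302, 40),
]
print(f"SLI: {sli_percent(sample):.1f}%")
```

In practice each service would likely use only the criteria relevant to its own metric, for example latency for AI-Service and workflow success for Orchestrator.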
SLO Calculation
# Daily SLO report
# `prometheus query` is assumed to return a single numeric value
def slo-report [] {
    let total = (prometheus query "increase(http_requests_total[1d])")
    let errors = (prometheus query "increase(http_requests_errors_total[1d])")
    let success = $total - $errors
    let sli = ($success / $total) * 100
    let target = 99.99
    # Percentage of the error budget still unspent (100 = untouched, 0 = exhausted)
    let remaining_budget = (1 - ((100 - $sli) / (100 - $target))) * 100
    print $"SLI: ($sli)%"
    print $"Target: ($target)%"
    print $"Budget Remaining: ($remaining_budget)%"
    if $remaining_budget < 10 {
        print "⚠️ CRITICAL: Error budget nearly exhausted, halt deployments"
    } else if $remaining_budget < 25 {
        print "⚠️ WARNING: Error budget low, restrict changes"
    } else {
        print "✓ Healthy: Error budget available"
    }
}
slo-report
Deployment Policies Based on Error Budget
Green Light Conditions (Error Budget Available)
if remaining_error_budget > 50% {
    allow: normal deployments
    allow: experimental features
    allow: canary at 50%
    frequency: multiple deploys/day
}
Yellow Light Conditions (Error Budget Tight)
if 10% < remaining_error_budget <= 50% {
    allow: critical bug fixes only
    allow: security patches
    disallow: feature releases
    disallow: large refactors
    disallow: canary > 25%
    frequency: 1 deploy/day maximum
}
Red Light Conditions (Error Budget Exhausted)
if remaining_error_budget <= 10% {
    allow: emergency hotfixes only
    disallow: all non-critical changes
    disallow: any deployment other than an emergency hotfix
    action: incident response required
    escalation: VP Engineering approval needed
}
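The three conditions above could be enforced by a deployment gate in the release pipeline. The sketch below is one way to express them. The change-type labels are hypothetical, and allowing emergency hotfixes under yellow conditions is an assumption (they are treated as critical bug fixes).

```python
# Sketch: deployment gate driven by the remaining error budget.
# Assumption: change types are tagged with the hypothetical labels used below.

YELLOW_ALLOWED = {"critical_bugfix", "security_patch", "emergency_hotfix"}
RED_ALLOWED = {"emergency_hotfix"}


def deployment_light(remaining_percent: float) -> str:
    """Map remaining error budget (in percent) to green, yellow, or red."""
    if remaining_percent > 50.0:
        return "green"
    if remaining_percent > 10.0:
        return "yellow"
    return "red"


def is_deploy_allowed(change_type: str, remaining_percent: float) -> bool:
    """Return True if this kind of change may be deployed right now."""
    light = deployment_light(remaining_percent)
    if light == "green":
        return True
    if light == "yellow":
        return change_type in YELLOW_ALLOWED
    return change_type in RED_ALLOWED


for change, remaining in [("feature", 60.0), ("feature", 30.0), ("security_patch", 30.0), ("emergency_hotfix", 5.0)]:
    verdict = "allowed" if is_deploy_allowed(change, remaining) else "blocked"
    print(f"{change} at {remaining:.0f}% budget ({deployment_light(remaining)}): {verdict}")
```

The gate covers only the allow/disallow decision. Canary percentages, deploy frequency limits, and the VP Engineering escalation would need separate handling.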
SLO Review Cycle
Monthly:
- Review SLI data vs SLO targets
- Identify services approaching budget limits
- Plan remediation for low-performing services
Quarterly:
- Review SLO targets against business requirements
- Adjust targets based on incident patterns
- Plan infrastructure improvements
Annually:
- SLO target review with product/ops leadership
- Align SLOs with business goals
- Plan year-long reliability improvements
Consequences
Positive:
- Data-driven deployment decisions
- Balance between innovation and reliability
- Early warning system for degradation
- Alignment between dev and ops
Negative:
- Developers may resist deployment restrictions
- Overhead of monitoring error budgets
- Complex to communicate to stakeholders
- SLO targets may feel arbitrary
Related ADRs
- ADR-008: Unified Observability Stack - Measure SLOs via metrics
- ADR-010: Incident Response Procedures