# ADR-009: SLO and Error Budget Management

**Status**: Accepted | **Date**: 2025-01-16 | **Supersedes**: None

## Context

Provisioning provides infrastructure automation for production systems. Failures cascade to
customer infrastructure. SLOs balance reliability investment with development velocity.

## Decision

Define service level objectives (SLOs) for each critical service with monitored error budgets. Availability targets guide operational decisions.

## SLOs Defined

### Tier 1: Critical Infrastructure Services

**Availability Target**: 99.99% (52.6 minutes downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Orchestrator | Workflow success rate | 99.99% | Failed / Total workflows (5m window) |
| Vault-Service | Secret retrieval | 99.99% | Failed requests / Total requests (5m) |
| Control-Center | API availability | 99.99% | HTTP 5xx / Total requests (5m) |

### Tier 2: Supporting Services

**Availability Target**: 99.9% (8.76 hours downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Extension-Registry | API availability | 99.9% | HTTP 5xx / Total requests (5m) |
| AI-Service | Response time | 99.9% | Queries > 10s / Total queries (5m) |
| Detector | Analysis completion | 99.9% | Failed analyses / Total analyses (5m) |

### Tier 3: Enhancement Services

**Availability Target**: 99.5% (1.83 days downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| RAG | Index freshness | 99.5% | Stale results / Total queries (5m) |
| MCP-Server | Tool availability | 99.5% | Unavailable tools / Total tools (5m) |

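The downtime figure quoted next to each availability target follows from simple arithmetic. A minimal Python sketch, assuming a 365-day year; the function name `allowed_downtime_minutes` is illustrative and not part of the provisioning codebase:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for tier, target in [("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.5)]:
        minutes = allowed_downtime_minutes(target)
        print(f"{tier}: {target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h/year)")
```

Under that assumption the three tiers come out at roughly 52.6 minutes, 8.76 hours, and 1.8 days of downtime per year.
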
## Error Budget Management

### Error Budget Calculation

```text
SLO Target: 99.99% (Tier 1)
Available Errors: 100% - 99.99% = 0.01%
Error Budget: 0.01% × Total Requests

Example:
- 1 million requests/day
- Error budget = 100 allowed errors/day
- If 50 errors already occurred
- Remaining budget = 50 errors (50% of budget consumed)
```

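The same calculation can be sketched in Python. This is an illustration only: the function names are hypothetical, and it assumes the budget is counted in failed requests over whatever window `total_requests` covers. For a 99.99% target, 0.01% of 1 million requests is 100 errors.

```python
def error_budget(slo_target_pct: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates for the given traffic."""
    return (1 - slo_target_pct / 100) * total_requests


def budget_remaining_pct(slo_target_pct: float, total_requests: int, errors: int) -> float:
    """Percentage of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target_pct, total_requests)
    if budget <= 0:
        return 0.0  # a 100% target or zero traffic leaves no budget to spend
    return (1 - errors / budget) * 100


if __name__ == "__main__":
    budget = error_budget(99.99, 1_000_000)
    remaining = budget_remaining_pct(99.99, 1_000_000, 50)
    print(f"budget: {budget:.0f} errors, remaining: {remaining:.0f}%")
```

Returning a negative percentage once the budget is overspent is a design choice in this sketch; it keeps the size of the overrun visible instead of clamping at zero.
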
### Error Budget Policies

**Burn Rate** (error consumption speed):

```text
Slow Burn (< 1x rate): Safe, continue normal operations
Fast Burn (1-2x rate): Monitor, may trigger incident response
Critical Burn (> 2x rate): Stop all deployments, emergency incident

Example:
- Daily error budget: 100 errors
- 1x burn rate: 100 errors/day
- 2x burn rate: 200 errors/day (double consumption)
```

**Action Triggers**:

| Burn Rate | Budget Remaining | Action |
| ----------- | ------------------ | -------- |
| < 1x | > 50% | Deploy freely, run experiments |
| 1x | 25-50% | Code freeze for non-critical features |
| 2x | 10-25% | No deployments except hotfixes |
| > 2x | < 10% | Emergency incident, all hands on deck |

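Burn rate is the ratio of errors observed in a window to the errors the budget allows for that window. A minimal Python sketch of the three bands described above; the function names and the handling of the exact boundaries (1x and 2x both count as "fast") are assumptions:

```python
def burn_rate(errors_observed: float, budget_for_window: float) -> float:
    """Ratio of observed errors to the errors budgeted for the same window."""
    if budget_for_window <= 0:
        raise ValueError("budget_for_window must be positive")
    return errors_observed / budget_for_window


def classify_burn(rate: float) -> str:
    """Map a burn rate onto the slow / fast / critical bands."""
    if rate < 1:
        return "slow"      # safe, continue normal operations
    if rate <= 2:
        return "fast"      # monitor, may trigger incident response
    return "critical"      # stop all deployments, emergency incident


if __name__ == "__main__":
    # With a daily budget of 100 errors, 200 errors in a day is a 2x burn.
    rate = burn_rate(200, 100)
    print(rate, classify_burn(rate))
```
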
### Prometheus Rules for Error Budget

```yaml
# provisioning/monitoring/slo-rules.yaml
groups:
  - name: slo_monitoring
    rules:
      # Success rate over the last 5 minutes, as a percentage.
      - record: slo:success_rate:5m
        expr: (1 - (increase(http_requests_errors_total[5m]) / increase(http_requests_total[5m]))) * 100

      # Share of the error budget still unspent, as a percentage of the
      # 0.01% error allowance implied by the 99.99% Tier 1 target.
      - record: slo:error_budget:remaining
        expr: (1 - ((100 - slo:success_rate:5m) / (100 - 99.99))) * 100

      - alert: ErrorBudgetBurnWarning
        expr: slo:error_budget:remaining < 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burn rate is 1x, {{ $value }}% remaining"

      - alert: ErrorBudgetBurnCritical
        expr: slo:error_budget:remaining < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget critical! {{ $value }}% remaining"
          runbook: "https://provisioning.internal/runbooks/error-budget-critical"
```

## Measuring SLOs

### Service-Level Indicators (SLIs)

```text
SLI = Good Requests / Total Requests

Good Request Definition:
- HTTP status 2xx-3xx
- Response time < 1000ms (latency SLI)
- No errors in workflow execution
- Database transaction committed
```

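As a rough illustration of the SLI ratio, the Python sketch below classifies requests using only the two HTTP-level criteria (status class and latency). The `Request` record and its field names are hypothetical. The workflow and database criteria are left out because they depend on application state this example does not model.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def is_good(req: Request) -> bool:
    """A request is good if it returned 2xx-3xx in under 1000 ms."""
    return 200 <= req.status < 400 and req.latency_ms < 1000


def sli(requests: list[Request]) -> float:
    """Good requests divided by total requests, as a fraction in [0, 1]."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as meeting the SLO
    return sum(is_good(r) for r in requests) / len(requests)


if __name__ == "__main__":
    sample = [
        Request(200, 120),   # good
        Request(302, 80),    # good
        Request(500, 50),    # bad: server error
        Request(200, 1500),  # bad: too slow
    ]
    print(sli(sample))
```
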
### SLO Calculation

```nushell
# Daily SLO report
# Assumes `prometheus query` is a locally provided command that returns a single number.
def slo-report [] {
    let total = (prometheus query "increase(http_requests_total[1d])")
    let errors = (prometheus query "increase(http_requests_errors_total[1d])")
    let success = $total - $errors
    let sli = ($success / $total) * 100

    let target = 99.99
    # Budget remaining as a percentage of the allowed error rate (100 - target)
    let remaining_budget = (1 - ((100 - $sli) / (100 - $target))) * 100

    print $"SLI: ($sli)%"
    print $"Target: ($target)%"
    print $"Budget Remaining: ($remaining_budget)%"

    if $remaining_budget < 10 {
        print "⚠️ CRITICAL: Error budget exhausted, halt deployments"
    } else if $remaining_budget < 25 {
        print "⚠️ WARNING: Error budget low, restrict changes"
    } else {
        print "✓ Healthy: Error budget available"
    }
}

slo-report
```

## Deployment Policies Based on Error Budget

### Green Light Conditions (Error Budget Available)

```text
if remaining_error_budget > 50% {
  allow: normal deployments
  allow: experimental features
  allow: canary at 50%
  frequency: multiple deploys/day
}
```

### Yellow Light Conditions (Error Budget Tight)

```text
if 10% < remaining_error_budget <= 50% {
  allow: critical bug fixes only
  allow: security patches
  disallow: feature releases
  disallow: large refactors
  disallow: canary > 25%
  frequency: 1 deploy/day maximum
}
```

### Red Light Conditions (Error Budget Exhausted)

```text
if remaining_error_budget <= 10% {
  allow: emergency hotfixes only
  disallow: all non-critical changes
  disallow: any other new deployments
  action: incident response required
  escalation: VP Engineering approval needed
}
```

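The three light conditions amount to a threshold lookup on the remaining budget. A minimal Python sketch of a deployment gate, under stated assumptions: the change-type labels are hypothetical, and allowing emergency hotfixes in the yellow band is an assumption, since the policy only lists critical bug fixes and security patches there.

```python
def deployment_light(remaining_budget_pct: float) -> str:
    """Map remaining error budget (in percent) to green / yellow / red."""
    if remaining_budget_pct > 50:
        return "green"
    if remaining_budget_pct > 10:
        return "yellow"
    return "red"


# Hypothetical change-type labels; None means every change type is allowed.
ALLOWED_CHANGES = {
    "green": None,
    # Assumption: emergency hotfixes count as critical bug fixes in yellow.
    "yellow": {"critical_bugfix", "security_patch", "emergency_hotfix"},
    "red": {"emergency_hotfix"},
}


def can_deploy(remaining_budget_pct: float, change_type: str) -> bool:
    """Return True if the policy permits this change type right now."""
    allowed = ALLOWED_CHANGES[deployment_light(remaining_budget_pct)]
    return allowed is None or change_type in allowed


if __name__ == "__main__":
    for pct in (80, 30, 5):
        print(pct, deployment_light(pct), can_deploy(pct, "feature"))
```

A gate like this could run as a pre-deploy check in CI, with the remaining budget read from the monitoring stack. The escalation step for the red band (VP Engineering approval) is a human process and is not modeled here.
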
## SLO Review Cycle

**Monthly**:
- Review SLI data vs SLO targets
- Identify services approaching budget limits
- Plan remediation for low-performing services

**Quarterly**:
- Review SLO targets against business requirements
- Adjust targets based on incident patterns
- Plan infrastructure improvements

**Annually**:
- SLO target review with product/ops leadership
- Align SLOs with business goals
- Plan year-long reliability improvements

## Consequences

- **Positive**:
  - Data-driven deployment decisions
  - Balance between innovation and reliability
  - Early warning system for degradation
  - Alignment between dev and ops

- **Negative**:
  - Developers may resist deployment restrictions
  - Overhead of monitoring error budgets
  - Complex to communicate to stakeholders
  - SLO targets may feel arbitrary

## Related ADRs

- [ADR-008: Unified Observability Stack](./adr-008-observability-and-monitoring.md) - Measure SLOs via metrics
- [ADR-010: Incident Response Procedures](./adr-010-incident-response.md)