
# ADR-009: SLO and Error Budget Management
**Status**: Accepted | **Date**: 2025-01-16 | **Supersedes**: None
## Context
Provisioning provides infrastructure automation for production systems, so its failures cascade
into customer infrastructure. SLOs make the trade-off between reliability investment and development velocity explicit.
## Decision
Define service level objectives (SLOs) for each critical service with monitored error budgets. Availability targets guide operational decisions.
## SLOs Defined
### Tier 1: Critical Infrastructure Services
**Availability Target**: 99.99% (52.6 minutes downtime/year)
| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Orchestrator | Workflow success rate | 99.99% | Failed / Total workflows (5m window) |
| Vault-Service | Secret retrieval | 99.99% | Failed requests / Total requests (5m) |
| Control-Center | API availability | 99.99% | HTTP 5xx / Total requests (5m) |
### Tier 2: Supporting Services
**Availability Target**: 99.9% (8.76 hours downtime/year)
| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Extension-Registry | API availability | 99.9% | HTTP 5xx / Total requests (5m) |
| AI-Service | Response time | 99.9% | Queries > 10s / Total queries (5m) |
| Detector | Analysis completion | 99.9% | Failed analyses / Total analyses (5m) |
### Tier 3: Enhancement Services
**Availability Target**: 99.5% (1.83 days downtime/year)
| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| RAG | Index freshness | 99.5% | Stale results / Total queries (5m) |
| MCP-Server | Tool availability | 99.5% | Unavailable tools / Total tools (5m) |
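
The downtime figures quoted for each tier follow directly from the availability targets. As a minimal sketch for checking them (the function name and the 365-day year are assumptions made for illustration, not part of the provisioning codebase):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime, in minutes, that an availability target permits."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * period_minutes


# Tier targets as stated in this ADR
TIER_TARGETS = {"Tier 1": 99.99, "Tier 2": 99.9, "Tier 3": 99.5}

for tier, target in TIER_TARGETS.items():
    minutes = allowed_downtime_minutes(target)
    print(f"{tier}: {target}% allows {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Passing a different `period_minutes` (for example 30 days) gives the allowance for a monthly review window instead of a yearly one.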
## Error Budget Management
### Error Budget Calculation
```text
SLO Target: 99.99% (Tier 1)
Allowed Error Rate: 100% - 99.99% = 0.01%
Error Budget: 0.01% × Total Requests

Example:
- 1 million requests/day
- Error budget = 0.01% × 1,000,000 = 100 allowed errors/day
- If 50 errors already occurred
- Remaining budget = 50 errors (50% of budget consumed)
```
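
The same arithmetic can be sketched in a few lines of Python. The `ErrorBudget` class and its field names are hypothetical helpers introduced here for illustration; only the formula (100% minus the SLO target, times total requests) comes from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one measurement period (hypothetical helper)."""
    slo_target_pct: float   # e.g. 99.99 for Tier 1
    total_requests: int     # requests served in the period
    failed_requests: int    # requests that violated the SLI

    @property
    def allowed_errors(self) -> float:
        # Budget = (100% - SLO target) x total requests
        return (100.0 - self.slo_target_pct) / 100.0 * self.total_requests

    @property
    def remaining_errors(self) -> float:
        # Negative when the budget is overspent
        return self.allowed_errors - self.failed_requests

    @property
    def consumed_pct(self) -> float:
        # Share of the budget already spent, as a percentage
        if self.allowed_errors == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / self.allowed_errors * 100.0


budget = ErrorBudget(slo_target_pct=99.99, total_requests=1_000_000, failed_requests=50)
print(f"allowed={budget.allowed_errors:.0f} "
      f"remaining={budget.remaining_errors:.0f} "
      f"consumed={budget.consumed_pct:.0f}%")
```

The values are floats because the budget is a fraction of traffic; round only when presenting them.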
### Error Budget Policies
**Burn Rate** (how fast errors consume the budget, relative to the rate that would spend it exactly over the period):
```text
Slow Burn (< 1x rate): Safe, continue normal operations
Fast Burn (1-2x rate): Monitor, may trigger incident response
Critical Burn (> 2x rate): Stop all deployments, emergency incident
Example:
- Daily error budget: 100 errors (1 million requests/day at 99.99%)
- 1x burn rate: 100 errors/day (budget lasts exactly the day)
- 2x burn rate: 200 errors/day (budget exhausted in half a day)
```
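
As a rough sketch of how the three bands could be evaluated, assuming the burn rate is computed as observed errors divided by the errors budgeted for the same window (the function names, band labels, and the 100-error budget are illustrative, not existing tooling):

```python
def burn_rate(errors_observed: float, errors_budgeted: float) -> float:
    """Ratio of observed errors to the errors budgeted for the same window."""
    if errors_budgeted <= 0:
        raise ValueError("errors_budgeted must be positive")
    return errors_observed / errors_budgeted


def classify_burn(rate: float) -> str:
    """Map a burn rate onto the slow / fast / critical bands."""
    if rate < 1.0:
        return "slow"
    if rate <= 2.0:
        return "fast"
    return "critical"


def hours_until_exhausted(rate: float, period_hours: float = 24.0) -> float:
    """Time for a full budget to run out if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return period_hours / rate


# Hypothetical window budgeted for 100 errors
for observed in (40, 150, 300):
    rate = burn_rate(observed, errors_budgeted=100)
    print(f"{observed} errors: {rate:.1f}x, {classify_burn(rate)}, "
          f"budget lasts {hours_until_exhausted(rate):.1f} h")
```

Treating exactly 2x as "fast" rather than "critical" is a choice made here because the bands above read "1-2x" and "> 2x".
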
**Action Triggers**:
| Burn Rate | Budget Remaining | Action |
| ----------- | ------------------ | -------- |
| < 1x | > 50% | Deploy freely, run experiments |
| 1x | 25-50% | Code freeze for non-critical features |
| 2x | 10-25% | No deployments except hotfixes |
| > 2x | < 10% | Emergency incident, all hands on deck |
### Prometheus Rules for Error Budget
```yaml
# provisioning/monitoring/slo-rules.yaml
groups:
  - name: slo_monitoring
    rules:
      # Success rate over the last 5 minutes, as a percentage
      - record: slo:success_rate:5m
        expr: (1 - (increase(http_requests_errors_total[5m]) / increase(http_requests_total[5m]))) * 100
      # Unspent share of the error budget, as a percentage of the 0.01%
      # budget implied by the 99.99% Tier 1 target
      - record: slo:error_budget:remaining
        expr: (1 - (100 - slo:success_rate:5m) / (100 - 99.99)) * 100
      - alert: ErrorBudgetBurnWarning
        expr: slo:error_budget:remaining < 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget below 50%: {{ $value }}% remaining"
      - alert: ErrorBudgetBurnCritical
        expr: slo:error_budget:remaining < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget critical! {{ $value }}% remaining"
          runbook: "https://provisioning.internal/runbooks/error-budget-critical"
```
## Measuring SLOs
### Service-Level Indicators (SLIs)
```text
SLI = Good Requests / Total Requests
Good Request Definition:
- HTTP status 2xx-3xx
- Response time < 1000ms (latency SLI)
- No errors in workflow execution
- Database transaction committed
```
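
A minimal sketch of this ratio, covering only the HTTP status and latency criteria (workflow errors and transaction commits would need service-specific signals). The `RequestRecord` type and its field names are hypothetical:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical record of one request; field names are illustrative."""
    status: int
    latency_ms: float


def is_good(req: RequestRecord, latency_threshold_ms: float = 1000.0) -> bool:
    """A request is good if it returned 2xx-3xx and met the latency bound."""
    return 200 <= req.status < 400 and req.latency_ms < latency_threshold_ms


def sli(requests: Iterable[RequestRecord]) -> float:
    """SLI = good requests / total requests, as a fraction in [0, 1]."""
    records = list(requests)
    if not records:
        raise ValueError("cannot compute an SLI over zero requests")
    good = sum(1 for r in records if is_good(r))
    return good / len(records)


sample = [
    RequestRecord(status=200, latency_ms=120.0),   # good
    RequestRecord(status=302, latency_ms=80.0),    # good
    RequestRecord(status=200, latency_ms=1500.0),  # too slow
    RequestRecord(status=503, latency_ms=40.0),    # server error
]
print(f"SLI: {sli(sample):.2%}")
```

Raising on an empty window, rather than returning 0 or 1, keeps a period with no traffic from being read as either an outage or a perfect score.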
### SLO Calculation
```nushell
# Daily SLO report
# Assumes `prometheus query` is a project wrapper that returns a single number.
def slo-report [] {
    let total = (prometheus query "increase(http_requests_total[1d])")
    let errors = (prometheus query "increase(http_requests_errors_total[1d])")
    let success = $total - $errors
    let sli = ($success / $total) * 100
    let target = 99.99

    # The budget is the allowed error rate; remaining is its unspent share
    let budget = 100 - $target
    let consumed = 100 - $sli
    let remaining_budget = (1 - ($consumed / $budget)) * 100

    print $"SLI: ($sli)%"
    print $"Target: ($target)%"
    print $"Budget Remaining: ($remaining_budget)%"

    if $remaining_budget < 10 {
        print "⚠️ CRITICAL: Error budget nearly exhausted, halt deployments"
    } else if $remaining_budget < 25 {
        print "⚠️ WARNING: Error budget low, restrict changes"
    } else {
        print "✓ Healthy: Error budget available"
    }
}

slo-report
```
## Deployment Policies Based on Error Budget
### Green Light Conditions (Error Budget Available)
```text
if remaining_error_budget > 50% {
  allow: normal deployments
  allow: experimental features
  allow: canary at 50%
  frequency: multiple deploys/day
}
```
### Yellow Light Conditions (Error Budget Tight)
```text
if 10% < remaining_error_budget <= 50% {
  allow: critical bug fixes only
  allow: security patches
  disallow: feature releases
  disallow: large refactors
  disallow: canary > 25%
  frequency: 1 deploy/day maximum
}
```
### Red Light Conditions (Error Budget Exhausted)
```text
if remaining_error_budget <= 10% {
  allow: emergency hotfixes only
  disallow: all non-critical changes
  disallow: any other deployments
  action: incident response required
  escalation: VP Engineering approval needed
}
```
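
The three conditions could be turned into a deployment gate along the following lines. This is a sketch under stated assumptions: the change-category names are hypothetical labels for the items in the policies above, and allowing emergency hotfixes under a yellow light is an assumption, since the policy names only critical bug fixes and security patches there.

```python
from enum import Enum


class Light(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Change categories permitted under each light (hypothetical category names)
ALLOWED_CHANGES = {
    Light.GREEN: {"feature", "experiment", "critical_bugfix",
                  "security_patch", "emergency_hotfix"},
    Light.YELLOW: {"critical_bugfix", "security_patch", "emergency_hotfix"},
    Light.RED: {"emergency_hotfix"},
}


def deployment_light(remaining_budget_pct: float) -> Light:
    """Map remaining error budget (percent) onto green / yellow / red."""
    if remaining_budget_pct > 50.0:
        return Light.GREEN
    if remaining_budget_pct > 10.0:
        return Light.YELLOW
    return Light.RED


def may_deploy(remaining_budget_pct: float, change_type: str) -> bool:
    """Return True if a change of this category is allowed right now."""
    return change_type in ALLOWED_CHANGES[deployment_light(remaining_budget_pct)]


for pct, change in [(75.0, "feature"), (30.0, "feature"), (30.0, "security_patch"),
                    (5.0, "emergency_hotfix")]:
    print(f"{pct:>5.1f}% remaining, {change}: "
          f"{'allowed' if may_deploy(pct, change) else 'blocked'}")
```

The gate covers only what may ship; canary percentages, deploy frequency, and the VP Engineering escalation would still need to be enforced separately.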
## SLO Review Cycle
**Monthly**:
- Review SLI data vs SLO targets
- Identify services approaching budget limits
- Plan remediation for low-performing services
**Quarterly**:
- Review SLO targets against business requirements
- Adjust targets based on incident patterns
- Plan infrastructure improvements
**Annually**:
- SLO target review with product/ops leadership
- Align SLOs with business goals
- Plan year-long reliability improvements
## Consequences
- **Positive**:
  - Data-driven deployment decisions
  - Balance between innovation and reliability
  - Early warning system for degradation
  - Alignment between dev and ops
- **Negative**:
  - Developers may resist deployment restrictions
  - Overhead of monitoring error budgets
  - Complex to communicate to stakeholders
  - SLO targets may feel arbitrary
## Related ADRs
- [ADR-008: Unified Observability Stack](./adr-008-observability-and-monitoring.md) - Measure SLOs via metrics
- [ADR-010: Incident Response Procedures](./adr-010-incident-response.md)