# ADR-009: SLO and Error Budget Management

**Status**: Accepted | **Date**: 2025-01-16 | **Supersedes**: None

## Context

Provisioning provides infrastructure automation for production systems, so its failures cascade to customer infrastructure. SLOs balance reliability investment against development velocity.

## Decision

Define service level objectives (SLOs) for each critical service with monitored error budgets. Availability targets guide operational decisions.

## SLOs Defined

Each Measurement column gives the bad-event ratio over a 5-minute window; the value compared against the target is 1 minus that ratio.

### Tier 1: Critical Infrastructure Services

**Availability Target**: 99.99% (52.6 minutes downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Orchestrator | Workflow success rate | 99.99% | Failed / Total workflows (5m window) |
| Vault-Service | Secret retrieval | 99.99% | Failed requests / Total requests (5m) |
| Control-Center | API availability | 99.99% | HTTP 5xx / Total requests (5m) |

### Tier 2: Supporting Services

**Availability Target**: 99.9% (8.76 hours downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| Extension-Registry | API availability | 99.9% | HTTP 5xx / Total requests (5m) |
| AI-Service | Response time | 99.9% | Queries > 10s / Total queries (5m) |
| Detector | Analysis completion | 99.9% | Failed analyses / Total analyses (5m) |

### Tier 3: Enhancement Services

**Availability Target**: 99.5% (1.83 days downtime/year)

| Service | Metric | Target | Measurement |
| --------- | -------- | -------- | ------------- |
| RAG | Index freshness | 99.5% | Stale results / Total queries (5m) |
| MCP-Server | Tool availability | 99.5% | Unavailable tools / Total tools (5m) |

## Error Budget Management

### Error Budget Calculation

```text
SLO Target: 99.99% (Tier 1)
Allowed Error Rate: 100% - 99.99% = 0.01%
Error Budget: 0.01% × Total Requests

Example:
- 1 million requests/day
- Error budget = 0.01% × 1,000,000 = 100 allowed errors/day
- If 50 errors already occurred
- Remaining budget = 50 errors (50% of budget consumed)
```

### Error Budget Policies

**Burn Rate** (error consumption speed, relative to the rate that would use up exactly the whole budget over the period):

```text
Slow Burn (< 1x rate): Safe, continue normal operations
Fast Burn (1-2x rate): Monitor, may trigger incident response
Critical Burn (> 2x rate): Stop all deployments, emergency incident

Example:
- Daily error budget: 100 errors
- 1x burn rate: 100 errors/day
- 2x burn rate: 200 errors/day (double consumption)
```

**Action Triggers**:

| Burn Rate | Budget Remaining | Action |
| ----------- | ------------------ | -------- |
| < 1x | > 50% | Deploy freely, run experiments |
| 1x | 25-50% | Code freeze for non-critical features |
| 2x | 10-25% | No deployments except hotfixes |
| > 2x | < 10% | Emergency incident, all hands on deck |

### Prometheus Rules for Error Budget

```yaml
# provisioning/monitoring/slo-rules.yaml
groups:
  - name: slo_monitoring
    rules:
      # Success rate over the last 5 minutes, as a percentage
      - record: slo:success_rate:5m
        expr: (1 - (increase(http_requests_errors_total[5m]) / increase(http_requests_total[5m]))) * 100

      # Percentage of the Tier 1 error budget (100 - 99.99 = 0.01) still unused in the window
      - record: slo:error_budget:remaining
        expr: (1 - ((100 - slo:success_rate:5m) / (100 - 99.99))) * 100

      - alert: ErrorBudgetBurnWarning
        expr: slo:error_budget:remaining < 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget below 50%: {{ $value }}% remaining"

      - alert: ErrorBudgetBurnCritical
        expr: slo:error_budget:remaining < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget critical! {{ $value }}% remaining"
          runbook: "https://provisioning.internal/runbooks/error-budget-critical"
```

## Measuring SLOs

### Service-Level Indicators (SLIs)

```text
SLI = Good Requests / Total Requests

Good Request Definition:
- HTTP status 2xx-3xx
- Response time < 1000ms (latency SLI)
- No errors in workflow execution
- Database transaction committed
```

### SLO Calculation

```nushell
# Daily SLO report
# Assumes a `prometheus query` command that returns a single number
def slo-report [] {
    let total = (prometheus query "increase(http_requests_total[1d])")
    let errors = (prometheus query "increase(http_requests_errors_total[1d])")

    let success = $total - $errors
    let sli = ($success / $total) * 100

    let target = 99.99
    # Share of the error budget (100 - target) still unused, as a percentage
    let remaining_budget = (1 - ((100 - $sli) / (100 - $target))) * 100

    print $"SLI: ($sli)%"
    print $"Target: ($target)%"
    print $"Budget Remaining: ($remaining_budget)%"

    if $remaining_budget < 10 {
        print "⚠️ CRITICAL: Error budget nearly exhausted, halt deployments"
    } else if $remaining_budget < 25 {
        print "⚠️ WARNING: Error budget low, restrict changes"
    } else {
        print "✓ Healthy: Error budget available"
    }
}

slo-report
```

## Deployment Policies Based on Error Budget

### Green Light Conditions (Error Budget Available)

```text
if remaining_error_budget > 50% {
  allow: normal deployments
  allow: experimental features
  allow: canary at 50%
  frequency: multiple deploys/day
}
```

### Yellow Light Conditions (Error Budget Tight)

```text
if 10% < remaining_error_budget <= 50% {
  allow: critical bug fixes only
  allow: security patches
  disallow: feature releases
  disallow: large refactors
  disallow: canary > 25%
  frequency: 1 deploy/day maximum
}
```

### Red Light Conditions (Error Budget Exhausted)

```text
if remaining_error_budget <= 10% {
  allow: emergency hotfixes only
  disallow: all non-critical changes
  disallow: any other new deployments
  action: incident response required
  escalation: VP Engineering approval needed
}
```

## SLO Review Cycle

**Monthly**:

- Review SLI data vs SLO targets
- Identify services approaching budget limits
- Plan remediation for low-performing services


**Quarterly**:

- Review SLO targets against business requirements
- Adjust targets based on incident patterns
- Plan infrastructure improvements

**Annually**:

- SLO target review with product/ops leadership
- Align SLOs with business goals
- Plan year-long reliability improvements

## Consequences

- **Positive**:
  - Data-driven deployment decisions
  - Balance between innovation and reliability
  - Early warning system for degradation
  - Alignment between dev and ops
- **Negative**:
  - Developers may resist deployment restrictions
  - Overhead of monitoring error budgets
  - Complex to communicate to stakeholders
  - SLO targets may feel arbitrary

## Related ADRs

- [ADR-008: Unified Observability Stack](./adr-008-observability-and-monitoring.md) - Measure SLOs via metrics
- [ADR-010: Incident Response Procedures](./adr-010-incident-response.md)
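
## Appendix: Error Budget Sketch

The error budget arithmetic and the green/yellow/red deployment gates described in this ADR can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the Provisioning codebase: the names `BudgetStatus`, `error_budget_status`, and `burn_rate` are hypothetical, and the thresholds (50% and 10% of budget remaining) are taken from the deployment policies in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    """Result of one error budget evaluation (hypothetical type)."""
    allowed_errors: float  # total errors the SLO permits in the period
    remaining_pct: float   # share of the budget still unused, in percent
    light: str             # "green", "yellow", or "red"


def error_budget_status(slo_target_pct: float, total_requests: int, errors: int) -> BudgetStatus:
    """Compute the remaining error budget and map it to a deployment light."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Error budget = (100% - SLO target) x total requests
    allowed = total_requests * (100.0 - slo_target_pct) / 100.0
    if allowed <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    remaining_pct = (1.0 - errors / allowed) * 100.0
    # Thresholds follow the Green/Yellow/Red conditions in this ADR
    if remaining_pct > 50:
        light = "green"
    elif remaining_pct > 10:
        light = "yellow"
    else:
        light = "red"
    return BudgetStatus(allowed, remaining_pct, light)


def burn_rate(errors: int, elapsed_fraction: float, allowed_errors: float) -> float:
    """Budget consumption speed relative to an even spend over the period.

    elapsed_fraction is the share of the budget period that has passed (0-1].
    A result of 1.0 means the budget lasts exactly until the period ends.
    """
    if not 0 < elapsed_fraction <= 1:
        raise ValueError("elapsed_fraction must be in (0, 1]")
    return (errors / allowed_errors) / elapsed_fraction
```

For example, `error_budget_status(99.99, 1_000_000, 40)` describes a Tier 1 service that has used 40 of roughly 100 allowed errors, which leaves about 60% of the budget and a green light. Values that sit exactly on a threshold may fall on either side because of floating-point rounding, so a production version would likely use integer or decimal arithmetic.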