
Jesús Pérez 27dbc5cd08

chore: review docs from scratch

2026-01-17 03:58:28 +00:00


ADR-009: SLO and Error Budget Management

Status: Accepted | Date: 2025-01-16 | Supersedes: None

Context

Provisioning automates infrastructure for production systems, so its failures cascade to customer infrastructure. SLOs give a measurable way to balance reliability investment against development velocity.

Decision

Define service level objectives (SLOs) for each critical service with monitored error budgets. Availability targets guide operational decisions.

SLOs Defined

Tier 1: Critical Infrastructure Services

Availability Target: 99.99% (52.6 minutes downtime/year)

Service	Metric	Target	Measurement
Orchestrator	Workflow success rate	99.99%	Failed / Total workflows (5m window)
Vault-Service	Secret retrieval	99.99%	Failed requests / Total requests (5m)
Control-Center	API availability	99.99%	HTTP 5xx / Total requests (5m)

Tier 2: Supporting Services

Availability Target: 99.9% (8.76 hours downtime/year)

Service	Metric	Target	Measurement
Extension-Registry	API availability	99.9%	HTTP 5xx / Total requests (5m)
AI-Service	Response time	99.9%	Queries > 10s / Total queries (5m)
Detector	Analysis completion	99.9%	Failed analyses / Total analyses (5m)

Tier 3: Enhancement Services

Availability Target: 99.5% (1.83 days downtime/year)

Service	Metric	Target	Measurement
RAG	Index freshness	99.5%	Stale results / Total queries (5m)
MCP-Server	Tool availability	99.5%	Unavailable tools / Total tools (5m)
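
The yearly downtime allowance quoted for each tier follows directly from its availability target. A minimal sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_hours` is illustrative and not part of the provisioning codebase:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def allowed_downtime_hours(availability_pct: float) -> float:
    """Return the yearly downtime, in hours, permitted by an availability target."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


# Targets taken from the three tiers defined above.
for tier, target in [("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.5)]:
    hours = allowed_downtime_hours(target)
    print(f"{tier}: {target}% -> {hours:.2f} hours/year ({hours / 24:.2f} days)")
```

Running the loop is a quick way to re-check the downtime figures whenever a tier target is adjusted during a review.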

Error Budget Management

Error Budget Calculation

SLO Target: 99.99% (Tier 1)
Allowed Error Rate: 100% - 99.99% = 0.01%
Error Budget: 0.01% × Total Requests

Example:
- 1 million requests/day
- Error budget = 0.01% × 1,000,000 = 100 allowed errors/day
- If 50 errors have already occurred
- Remaining budget = 50 errors (50% of budget consumed)
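
The calculation above can be sketched in a few lines. This is an illustration under the stated formula only; the function names `error_budget` and `budget_status` are hypothetical and do not refer to existing provisioning code:

```python
def error_budget(slo_pct: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates for the given request volume."""
    return (100.0 - slo_pct) / 100.0 * total_requests


def budget_status(slo_pct: float, total_requests: int, errors_so_far: int):
    """Return (remaining_errors, consumed_pct) for the current period."""
    budget = error_budget(slo_pct, total_requests)
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget to track")
    remaining = budget - errors_so_far
    consumed_pct = errors_so_far / budget * 100.0
    return remaining, consumed_pct


# Tier 1 target with an assumed volume of 1 million requests per day.
budget = error_budget(99.99, 1_000_000)
remaining, consumed = budget_status(99.99, 1_000_000, errors_so_far=50)
print(f"budget={budget:.0f} remaining={remaining:.0f} consumed={consumed:.0f}%")
```

Keeping the budget as an absolute error count makes it easy to compare against raw counters, while the consumed percentage is what the deployment policies are keyed on.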

Error Budget Policies

Burn Rate (error consumption speed, relative to the pace that would use up exactly the budget over the period):

Slow Burn (< 1x rate): Safe, continue normal operations
Fast Burn (1-2x rate): Monitor, may trigger incident response
Critical Burn (> 2x rate): Stop all deployments, emergency incident

Example:
- Daily error budget: 100 errors (0.01% of 1 million requests)
- 1x burn rate: 100 errors/day
- 2x burn rate: 200 errors/day (double consumption)
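
As an illustration of the burn-rate bands above, the sketch below computes a burn rate from errors observed in a partial day and classifies it. The function names and the choice to treat exactly 2x as "fast" are assumptions made for this example:

```python
def burn_rate(errors_observed: float, window_hours: float, daily_budget: float) -> float:
    """Ratio of actual error consumption to the pace that would spend
    exactly the daily budget over 24 hours."""
    if window_hours <= 0 or daily_budget <= 0:
        raise ValueError("window_hours and daily_budget must be positive")
    expected = daily_budget * window_hours / 24.0
    return errors_observed / expected


def classify_burn(rate: float) -> str:
    """Map a burn rate onto the slow / fast / critical bands."""
    if rate < 1.0:
        return "slow"      # safe, continue normal operations
    if rate <= 2.0:
        return "fast"      # monitor, may trigger incident response
    return "critical"      # stop all deployments, emergency incident


# 90 errors in the first 6 hours against a hypothetical daily budget of 100.
rate = burn_rate(errors_observed=90, window_hours=6, daily_budget=100)
print(f"burn rate {rate:.1f}x -> {classify_burn(rate)}")
```

Normalizing by the elapsed window is what lets a short spike be compared against a full-day budget.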

Action Triggers:

Burn Rate	Budget Remaining	Action
< 1x	> 50%	Deploy freely, run experiments
1x	25-50%	Code freeze for non-critical features
2x	10-25%	No deployments except hotfixes
> 2x	< 10%	Emergency incident, all hands on deck

Prometheus Rules for Error Budget

# provisioning/monitoring/slo-rules.yaml
groups:
- name: slo_monitoring
  rules:
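  # Success rate over the last 5 minutes, as a percentage
  - record: slo:success_rate:5m
    expr: (1 - (increase(http_requests_errors_total[5m]) / increase(http_requests_total[5m]))) * 100

  # Share of the Tier 1 error budget (100% - 99.99%) still unspent, as a percentage
  - record: slo:error_budget:remaining
    expr: (1 - ((100 - slo:success_rate:5m) / (100 - 99.99))) * 100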

  - alert: ErrorBudgetBurnWarning
    expr: slo:error_budget:remaining < 50
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Error budget below 50%, {{ $value }}% remaining"

  - alert: ErrorBudgetBurnCritical
    expr: slo:error_budget:remaining < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Error budget critical! {{ $value }}% remaining"
      runbook: "https://provisioning.internal/runbooks/error-budget-critical"

Measuring SLOs

Service-Level Indicators (SLIs)

SLI = Good Requests / Total Requests

Good Request Definition:
- HTTP status 2xx-3xx
- Response time < 1000ms (latency SLI)
- No errors in workflow execution
- Database transaction committed
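
A minimal sketch of the SLI computation, assuming a request counts as good only when every listed condition holds (the source lists the conditions without saying how they combine). The `RequestRecord` type and its field names are hypothetical:

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 1000  # latency SLI threshold from the definition above


@dataclass
class RequestRecord:
    status: int                 # HTTP status code
    latency_ms: float           # end-to-end response time
    workflow_ok: bool = True    # no errors in workflow execution
    tx_committed: bool = True   # database transaction committed


def is_good(r: RequestRecord) -> bool:
    """Assumption: all four conditions must hold for a request to be good."""
    return (
        200 <= r.status < 400
        and r.latency_ms < LATENCY_THRESHOLD_MS
        and r.workflow_ok
        and r.tx_committed
    )


def sli(records: list) -> float:
    """SLI = good requests / total requests."""
    if not records:
        raise ValueError("SLI is undefined without any requests")
    return sum(is_good(r) for r in records) / len(records)


sample = [
    RequestRecord(200, 120),
    RequestRecord(302, 50),
    RequestRecord(500, 30),      # server error
    RequestRecord(200, 1500),    # too slow
]
print(f"SLI: {sli(sample) * 100:.1f}%")
```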

SLO Calculation

# Daily SLO report
# Assumes a custom `prometheus query` command that returns a single number.
def slo-report [] {
    let total = (prometheus query "increase(http_requests_total[1d])")
    let errors = (prometheus query "increase(http_requests_errors_total[1d])")
    let success = $total - $errors
    let sli = ($success / $total) * 100

    let target = 99.99
    # Share of the allowed error rate (100 - target) already spent, in percent
    let consumed = ((100 - $sli) / (100 - $target)) * 100
    let remaining_budget = 100 - $consumed

    print $"SLI: ($sli)%"
    print $"Target: ($target)%"
    print $"Budget Remaining: ($remaining_budget)%"

    if $remaining_budget < 10 {
        print "⚠️  CRITICAL: Error budget nearly exhausted, halt deployments"
    } else if $remaining_budget < 25 {
        print "⚠️  WARNING: Error budget low, restrict changes"
    } else {
        print "✓ Healthy: Error budget available"
    }
}

slo-report

Deployment Policies Based on Error Budget

Green Light Conditions (Error Budget Available)

if remaining_error_budget > 50% {
    allow: normal deployments
    allow: experimental features
    allow: canary at 50%
    frequency: multiple deploys/day
}

Yellow Light Conditions (Error Budget Tight)

if 10% < remaining_error_budget <= 50% {
    allow: critical bug fixes
    allow: security patches
    disallow: feature releases
    disallow: large refactors
    disallow: canary > 25%
    frequency: 1 deploy/day maximum
}

Red Light Conditions (Error Budget Exhausted)

if remaining_error_budget <= 10% {
    allow: emergency hotfixes only
    disallow: all non-critical changes
    disallow: any other new deployments
    action: incident response required
    escalation: VP Engineering approval needed
}
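
The three policy blocks above can be read as a single deployment gate. The sketch below is one possible encoding, not an existing provisioning component; the change-type names are hypothetical, and allowing emergency hotfixes in every state is an assumption made for this example:

```python
def deployment_light(remaining_budget_pct: float) -> str:
    """Map remaining error budget (in percent) to green / yellow / red."""
    if remaining_budget_pct > 50:
        return "green"
    if remaining_budget_pct > 10:
        return "yellow"
    return "red"


# Hypothetical change types allowed under each light.
ALLOWED_CHANGES = {
    "green": {"feature", "experiment", "refactor",
              "critical_bugfix", "security_patch", "emergency_hotfix"},
    "yellow": {"critical_bugfix", "security_patch", "emergency_hotfix"},
    "red": {"emergency_hotfix"},
}


def may_deploy(remaining_budget_pct: float, change_type: str) -> bool:
    """Return True if the change type is permitted at this budget level."""
    return change_type in ALLOWED_CHANGES[deployment_light(remaining_budget_pct)]


for pct, change in [(75, "feature"), (30, "feature"), (30, "security_patch"), (5, "emergency_hotfix")]:
    verdict = "allowed" if may_deploy(pct, change) else "blocked"
    print(f"{pct}% remaining, {change}: {verdict}")
```

A gate like this could run as a pre-deploy check in CI, with the remaining budget read from the recorded Prometheus metric; canary limits, deploy frequency and the escalation step would still need to be enforced separately.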

SLO Review Cycle

Monthly:

Review SLI data vs SLO targets
Identify services approaching budget limits
Plan remediation for low-performing services

Quarterly:

Review SLO targets against business requirements
Adjust targets based on incident patterns
Plan infrastructure improvements

Annually:

SLO target review with product/ops leadership
Align SLOs with business goals
Plan year-long reliability improvements

Consequences

Positive:
- Data-driven deployment decisions
- Balance between innovation and reliability
- Early warning system for degradation
- Alignment between dev and ops

Negative:
- Developers may resist deployment restrictions
- Overhead of monitoring error budgets
- Complex to communicate to stakeholders
- SLO targets may feel arbitrary

Related ADRs

ADR-008: Unified Observability Stack - Measure SLOs via metrics
ADR-010: Incident Response Procedures
