Vapora/docs/adrs/0034-autonomous-scheduling.md
Jesús Pérez bb55c80d2b
feat(workflow-engine): autonomous scheduling with timezone and distributed lock
Add cron-based autonomous workflow firing with two hardening layers:

  - Timezone-aware scheduling via chrono-tz: ScheduledWorkflow.timezone
    (IANA identifier), compute_next_fire_at/after_tz, validate_timezone;
    DST-safe, UTC fallback when absent; validated at config load and REST API

  - Distributed fire-lock via SurrealDB conditional UPDATE (locked_by/locked_at
    fields, 120 s TTL); WorkflowScheduler gains instance_id (UUID) as lock owner;
    prevents double-fires across multi-instance deployments without extra infra

  - ScheduleStore: try_acquire_fire_lock, release_fire_lock (own-instance guard),
    full CRUD (load_one/all, full_upsert, patch, delete, load_runs)

  - REST: 7 endpoints (GET/PUT/PATCH/DELETE schedules, runs history, manual fire)
    with timezone field in all request/response types

  - Migrations 010 (schedule tables) + 011 (timezone + lock columns)
  - Tests: 48 passing (was 26); ADR-0034; changelog; feature docs updated
2026-02-26 11:34:44 +00:00


# ADR-0034: Autonomous Cron Scheduling — Timezone Support and Distributed Fire-Lock
**Status**: Implemented

**Date**: 2026-02-26

**Deciders**: VAPORA Team

**Technical Story**: `vapora-workflow-engine` scheduler fired cron jobs only in UTC and had no protection against double-fires in multi-instance deployments.

---
## Decision
Extend the autonomous scheduling subsystem with two independent hardening layers:
1. **Timezone-aware scheduling** (`chrono-tz`) — cron expressions evaluated in any IANA timezone, stored per-schedule, validated at API and config-load boundaries.
2. **Distributed fire-lock** — SurrealDB conditional `UPDATE ... WHERE locked_by IS NONE OR locked_at < $expiry` provides atomic, TTL-backed mutual exclusion across instances without additional infrastructure.
---
## Context
### Gaps Addressed
| Gap | Consequence |
|-----|-------------|
| UTC-only cron evaluation | `"0 9 * * *"` fires at 09:00 UTC regardless of business timezone; scheduled reports or maintenance windows drift by the UTC offset |
| No distributed coordination | Two `vapora-workflow-engine` instances reading the same `scheduled_workflows` table both fire the same schedule at the same tick |
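
The drift in the first row can be made concrete with a little hour arithmetic. The sketch below is illustrative only: the helper names are hypothetical and not part of `vapora-workflow-engine`, and it assumes whole-hour offsets, using the standard `America/New_York` values (UTC-5 in winter, UTC-4 in summer).

```rust
/// Local wall-clock hour observed when a job fires at `utc_hour`,
/// for a zone whose current offset is `utc_offset_hours` (negative = west of UTC).
/// Hypothetical helper for illustration; whole-hour offsets only.
fn local_hour_for_utc(utc_hour: u32, utc_offset_hours: i32) -> u32 {
    (utc_hour as i32 + utc_offset_hours).rem_euclid(24) as u32
}

/// UTC hour at which a job must fire to land on `local_hour` in that zone.
/// Hypothetical helper for illustration; whole-hour offsets only.
fn utc_hour_for_local(local_hour: u32, utc_offset_hours: i32) -> u32 {
    (local_hour as i32 - utc_offset_hours).rem_euclid(24) as u32
}
```

With these helpers, a UTC-only `"0 9 * * *"` lands at `local_hour_for_utc(9, -5) == 4` in winter and `local_hour_for_utc(9, -4) == 5` in summer, while a true 09:00 New York schedule needs `utc_hour_for_local(9, -5) == 14` and `utc_hour_for_local(9, -4) == 13`. No single fixed UTC hour satisfies both seasons, which is why the offset has to come from a timezone database rather than a constant.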
### Why These Approaches
**`chrono-tz`** over manual UTC-offset arithmetic:
- Compile-time exhaustive enum of all IANA timezone names — invalid names are rejected at parse time.
- The `cron` crate's `Schedule::upcoming(tz)` / `Schedule::after(&dt_in_tz)` are generic over any `TimeZone`, so timezone-awareness requires no special-casing in iteration logic: pass `DateTime<chrono_tz::Tz>` instead of `DateTime<Utc>`, convert output with `.with_timezone(&Utc)`.
- DST transitions handled automatically by `chrono-tz` — no application code needed.
**SurrealDB conditional UPDATE** over external distributed lock (Redis, etcd):
- No additional infrastructure dependency.
- SurrealDB applies document-level write locking; `UPDATE record WHERE condition` is atomic — two concurrent instances race on the same document and only one succeeds (non-empty return array = lock acquired).
- 120-second TTL enforced in application code: `locked_at < $expiry` in the WHERE clause auto-expires a lock from a crashed instance within two scheduler ticks.
---
## Implementation
### New Fields
`scheduled_workflows` table gains three columns (migration 011):
| Field | Type | Purpose |
|-------|------|---------|
| `timezone` | `option<string>` | IANA identifier (`"America/New_York"`) or `NONE` for UTC |
| `locked_by` | `option<string>` | UUID of the instance holding the current fire-lock |
| `locked_at` | `option<datetime>` | When the lock was acquired; used for TTL expiry |
### Lock Protocol
```
Tick N fires schedule S:
  try_acquire_fire_lock(id, instance_id, now)
    → UPDATE ... WHERE locked_by IS NONE OR locked_at < (now - 120s)
    → returns true (non-empty) or false (empty)
  if false: log + inc schedules_skipped, return
  fire_with_lock(S, now)          ← actual workflow start
  release_fire_lock(id, instance_id)
    → UPDATE ... WHERE locked_by = instance_id
    → own-instance guard prevents stale release
```
Lock release is always attempted even on `fire_with_lock` error; a `warn!` is emitted if release fails (TTL provides fallback).
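
The acquire/release semantics above can be modelled in memory to make the TTL and own-instance guard easier to reason about. This is a minimal sketch, not the actual `ScheduleStore` code: the `ScheduleRow` struct, the use of plain `u64` seconds for time, and the choice to clear both fields on release are assumptions made for illustration. The real implementation issues conditional `UPDATE` statements against SurrealDB.

```rust
/// Lock TTL from the ADR: a lock older than this is treated as expired.
const LOCK_TTL_SECS: u64 = 120;

/// In-memory stand-in (hypothetical) for the lock columns on one
/// `scheduled_workflows` row.
#[derive(Debug, Default)]
struct ScheduleRow {
    locked_by: Option<String>,
    locked_at: Option<u64>, // seconds since an arbitrary epoch
}

impl ScheduleRow {
    /// Models `UPDATE ... WHERE locked_by IS NONE OR locked_at < (now - 120s)`.
    /// Returns true when the caller now holds the lock.
    fn try_acquire_fire_lock(&mut self, instance_id: &str, now: u64) -> bool {
        let expiry = now.saturating_sub(LOCK_TTL_SECS);
        let free = self.locked_by.is_none();
        let expired = self.locked_at.map_or(false, |at| at < expiry);
        if free || expired {
            self.locked_by = Some(instance_id.to_string());
            self.locked_at = Some(now);
            true
        } else {
            false
        }
    }

    /// Models `UPDATE ... WHERE locked_by = instance_id`.
    /// Only the current owner can release; anyone else is a no-op.
    fn release_fire_lock(&mut self, instance_id: &str) -> bool {
        if self.locked_by.as_deref() == Some(instance_id) {
            self.locked_by = None;
            self.locked_at = None;
            true
        } else {
            false
        }
    }
}
```

In this model, if instance `a` acquires at `now = 1000`, instance `b` is refused at `now = 1060` and again at `now = 1120`, but succeeds at `now = 1121` because the lock is then strictly older than 120 s. A late `release_fire_lock("a")` after that takeover does nothing, which is the stale-release case the own-instance guard exists for.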
### Timezone-Aware Cron Evaluation
```
compute_fire_times_tz(schedule, last, now, catch_up, tz):
  match tz.parse::<chrono_tz::Tz>():      ← FromStr returns a Result
    Ok(tz)  → schedule.after(&last.with_timezone(&tz))
                .take_while(|t| t.with_timezone(&Utc) <= now)
                .map(|t| t.with_timezone(&Utc))
    Err(_)  → schedule.after(&last)       ← UTC fallback
```
Parsing an unknown/invalid timezone string silently falls back to UTC — avoids a hard error at runtime if a previously valid TZ identifier is removed from the `chrono-tz` database in a future upgrade.
### API Surface Changes
`PUT /api/v1/schedules/:id` and `PATCH /api/v1/schedules/:id` accept and return `timezone: Option<String>`. Timezone is validated at the API boundary using `validate_timezone()` (returns `400 InvalidInput` for unknown identifiers). Config-file `[schedule]` blocks also accept `timezone` and are validated at startup (fail-fast, same as `cron`).

---
## Consequences
### Positive
- Schedules expressed in business-local time — no mental UTC arithmetic for operators.
- Multi-instance deployments safe by default; no external lock service required.
- `ScheduledWorkflow.timezone` is optional: existing schedules without the field default to UTC, so no data backfill is required beyond the schema change in migration 011.
### Negative / Trade-offs
- `chrono-tz` adds ~2 MB of IANA timezone data to the binary (compile-time embedded).
- Distributed lock TTL of 120 s means a worst-case window of one double-fire per 120 s if the winning instance crashes between acquiring the lock and calling `update_after_fire`. Acceptable given the `schedule_runs` audit log makes duplicates visible.
- PATCH cannot clear the timezone: `timezone: null` in the JSON body is treated the same as an absent field (`#[serde(default)]`), so reverting a schedule to UTC requires a full PUT.