Vapora/docs/adrs/0034-autonomous-scheduling.md
Jesús Pérez bb55c80d2b
feat(workflow-engine): autonomous scheduling with timezone and distributed lock
Add cron-based autonomous workflow firing with two hardening layers:

  - Timezone-aware scheduling via chrono-tz: ScheduledWorkflow.timezone
    (IANA identifier), compute_next_fire_at/after_tz, validate_timezone;
    DST-safe, UTC fallback when absent; validated at config load and REST API

  - Distributed fire-lock via SurrealDB conditional UPDATE (locked_by/locked_at
    fields, 120 s TTL); WorkflowScheduler gains instance_id (UUID) as lock owner;
    prevents double-fires across multi-instance deployments without extra infra

  - ScheduleStore: try_acquire_fire_lock, release_fire_lock (own-instance guard),
    full CRUD (load_one/all, full_upsert, patch, delete, load_runs)

  - REST: 7 endpoints (GET/PUT/PATCH/DELETE schedules, runs history, manual fire)
    with timezone field in all request/response types

  - Migrations 010 (schedule tables) + 011 (timezone + lock columns)
  - Tests: 48 passing (was 26); ADR-0034; changelog; feature docs updated
2026-02-26 11:34:44 +00:00


ADR-0034: Autonomous Cron Scheduling — Timezone Support and Distributed Fire-Lock

Status: Implemented
Date: 2026-02-26
Deciders: VAPORA Team
Technical Story: vapora-workflow-engine scheduler fired cron jobs only in UTC and had no protection against double-fires in multi-instance deployments.


Decision

Extend the autonomous scheduling subsystem with two independent hardening layers:

  1. Timezone-aware scheduling (chrono-tz) — cron expressions evaluated in any IANA timezone, stored per-schedule, validated at API and config-load boundaries.
  2. Distributed fire-lock — SurrealDB conditional UPDATE ... WHERE locked_by IS NONE OR locked_at < $expiry provides atomic, TTL-backed mutual exclusion across instances without additional infrastructure.

Context

Gaps Addressed

| Gap | Consequence |
|-----|-------------|
| UTC-only cron evaluation | "0 9 * * *" fires at 09:00 UTC regardless of business timezone; scheduled reports or maintenance windows drift by the UTC offset |
| No distributed coordination | Two vapora-workflow-engine instances reading the same scheduled_workflows table both fire the same schedule at the same tick |

Why These Approaches

chrono-tz over manual UTC-offset arithmetic:

  • Compile-time exhaustive enum of all IANA timezone names — invalid names are rejected at parse time.
  • The cron crate's Schedule::upcoming(tz) / Schedule::after(&dt_in_tz) are generic over any TimeZone, so timezone-awareness requires no special-casing in iteration logic: pass DateTime<chrono_tz::Tz> instead of DateTime<Utc>, convert output with .with_timezone(&Utc).
  • DST transitions handled automatically by chrono-tz — no application code needed.

SurrealDB conditional UPDATE over external distributed lock (Redis, etcd):

  • No additional infrastructure dependency.
  • SurrealDB applies document-level write locking; UPDATE record WHERE condition is atomic — two concurrent instances race on the same document and only one succeeds (non-empty return array = lock acquired).
  • 120-second TTL enforced in application code: locked_at < $expiry in the WHERE clause auto-expires a lock from a crashed instance within two scheduler ticks.

Implementation

New Fields

scheduled_workflows table gains three columns (migration 011):

| Field | Type | Purpose |
|-------|------|---------|
| timezone | option<string> | IANA identifier ("America/New_York") or NONE for UTC |
| locked_by | option<string> | UUID of the instance holding the current fire-lock |
| locked_at | option<datetime> | When the lock was acquired; used for TTL expiry |

Lock Protocol

Tick N fires schedule S:
  try_acquire_fire_lock(id, instance_id, now)
    → UPDATE ... WHERE locked_by IS NONE OR locked_at < (now - 120s)
    → returns true (non-empty) or false (empty)
  if false: log + inc schedules_skipped, return
  fire_with_lock(S, now)         ← actual workflow start
  release_fire_lock(id, instance_id)
    → UPDATE ... WHERE locked_by = instance_id
    → own-instance guard prevents stale release

Lock release is always attempted even on fire_with_lock error; a warn! is emitted if release fails (TTL provides fallback).
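The acquire/release semantics above can be modelled without a database. The following is a minimal sketch, not the ScheduleStore implementation: `InMemoryScheduleStore`, `LockColumns`, and `add_schedule` are hypothetical names, timestamps are plain epoch seconds, and a `Mutex` stands in for the atomicity that the ADR attributes to SurrealDB's conditional UPDATE.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Lock TTL from the ADR, in seconds.
const LOCK_TTL_SECS: u64 = 120;

/// Hypothetical in-memory mirror of the two lock columns on a schedule row.
#[derive(Default)]
struct LockColumns {
    locked_by: Option<String>,
    locked_at: Option<u64>, // epoch seconds (assumption; the real column is a datetime)
}

/// Hypothetical stand-in for the scheduled_workflows table. The Mutex plays
/// the role of the database's atomic conditional UPDATE.
#[derive(Default)]
struct InMemoryScheduleStore {
    rows: Mutex<HashMap<String, LockColumns>>,
}

impl InMemoryScheduleStore {
    /// Registers an unlocked schedule row.
    fn add_schedule(&self, id: &str) {
        self.rows
            .lock()
            .unwrap()
            .insert(id.to_string(), LockColumns::default());
    }

    /// Models: UPDATE ... WHERE locked_by IS NONE OR locked_at < (now - 120s).
    /// Returns true when the row matched and the caller now owns the lock.
    fn try_acquire_fire_lock(&self, id: &str, instance_id: &str, now: u64) -> bool {
        let mut rows = self.rows.lock().unwrap();
        let row = match rows.get_mut(id) {
            Some(row) => row,
            None => return false, // no such schedule: the UPDATE matches nothing
        };
        let expiry = now.saturating_sub(LOCK_TTL_SECS);
        let unlocked = row.locked_by.is_none();
        let expired = row.locked_at.map_or(false, |at| at < expiry);
        if unlocked || expired {
            row.locked_by = Some(instance_id.to_string());
            row.locked_at = Some(now);
            true
        } else {
            false
        }
    }

    /// Models: UPDATE ... WHERE locked_by = instance_id (own-instance guard).
    /// Returns false when the caller does not hold the lock.
    fn release_fire_lock(&self, id: &str, instance_id: &str) -> bool {
        let mut rows = self.rows.lock().unwrap();
        match rows.get_mut(id) {
            Some(row) if row.locked_by.as_deref() == Some(instance_id) => {
                row.locked_by = None;
                row.locked_at = None;
                true
            }
            _ => false,
        }
    }
}
```

In this model a second instance is refused while the lock is fresh, takes over only once `locked_at` is strictly older than `now - 120`, and an instance whose lock has been taken over cannot release the new owner's lock.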

Timezone-Aware Cron Evaluation

compute_fire_times_tz(schedule, last, now, catch_up, tz):
  match tz.parse::<chrono_tz::Tz>():
    Ok(tz)  → schedule.after(&last.with_timezone(&tz))
                      .take_while(|t| t.with_timezone(&Utc) <= now)
                      .map(|t| t.with_timezone(&Utc))
    Err(_)  → schedule.after(&last)     ← UTC fallback

Parsing an unknown/invalid timezone string silently falls back to UTC — avoids a hard error at runtime if a previously valid TZ identifier is removed from the chrono-tz database in a future upgrade.

API Surface Changes

PUT /api/v1/schedules/:id and PATCH /api/v1/schedules/:id accept and return timezone: Option<String>. Timezone is validated at the API boundary using validate_timezone() (returns 400 InvalidInput for unknown identifiers). Config-file [schedule] blocks also accept timezone and are validated at startup (fail-fast, same as cron).
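The two policies described here (strict rejection at the API and config boundary, lenient UTC fallback at evaluation time) can be contrasted in a small sketch. It is an illustration only: `KNOWN_ZONES` is a hypothetical three-entry stand-in for the chrono-tz name table, `lookup` and `resolve_or_utc` are invented helper names, and the `validate_timezone` shown is not the real function's signature.

```rust
/// Hypothetical stand-in for the chrono-tz table of IANA names.
const KNOWN_ZONES: &[&str] = &["UTC", "America/New_York", "Europe/Madrid"];

/// Returns the canonical name if the identifier is known.
fn lookup(name: &str) -> Option<&'static str> {
    KNOWN_ZONES.iter().copied().find(|zone| *zone == name)
}

/// Boundary policy: unknown identifiers are an error, which the API layer
/// would map to a 400 response and the config loader to a startup failure.
fn validate_timezone(name: &str) -> Result<(), String> {
    match lookup(name) {
        Some(_) => Ok(()),
        None => Err(format!("unknown timezone: {}", name)),
    }
}

/// Runtime policy: a missing or unrecognised identifier falls back to UTC
/// instead of failing the scheduler tick.
fn resolve_or_utc(name: Option<&str>) -> &'static str {
    name.and_then(lookup).unwrap_or("UTC")
}
```

Keeping both paths means a bad identifier is normally caught before it is stored, while a stored identifier that later stops parsing degrades to UTC rather than stopping the schedule.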


Consequences

Positive

  • Schedules expressed in business-local time — no mental UTC arithmetic for operators.
  • Multi-instance deployments safe by default; no external lock service required.
  • ScheduledWorkflow.timezone is optional: existing schedules without the field default to UTC, so no backfill of existing rows is required.

Negative / Trade-offs

  • chrono-tz adds ~2 MB of IANA timezone data to the binary (compile-time embedded).
  • Distributed lock TTL of 120 s means a worst-case window of one double-fire per 120 s if the winning instance crashes between acquiring the lock and calling update_after_fire. Acceptable given the schedule_runs audit log makes duplicates visible.
  • PATCH cannot clear the timezone: passing timezone: null in JSON is treated the same as omitting the field (#[serde(default)]). Clearing the timezone (reverting to UTC) requires a full PUT.
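The last trade-off follows from merging a patch whose fields are plain `Option` values. The sketch below assumes a simplified, hypothetical `Schedule` with only `cron` and `timezone` fields and models the request bodies after deserialization, so no serde code is involved; `apply_patch` and `apply_put` are illustrative names, not the handlers in the codebase.

```rust
/// Hypothetical, simplified schedule record.
#[derive(Debug, Clone, PartialEq)]
struct Schedule {
    cron: String,
    timezone: Option<String>,
}

/// PATCH body after deserialization. With a plain Option, "field absent"
/// and "field: null" both arrive as None, so they cannot be told apart.
#[derive(Default)]
struct SchedulePatch {
    cron: Option<String>,
    timezone: Option<String>,
}

/// Merge semantics: None always means "keep the current value".
fn apply_patch(current: &Schedule, patch: SchedulePatch) -> Schedule {
    Schedule {
        cron: patch.cron.unwrap_or_else(|| current.cron.clone()),
        timezone: patch.timezone.or_else(|| current.timezone.clone()),
    }
}

/// Replace semantics: the body is the whole record, so a missing timezone
/// really does mean "no timezone", which evaluates as UTC.
fn apply_put(replacement: Schedule) -> Schedule {
    replacement
}
```

Under these assumptions a PATCH with `timezone: None` leaves "America/New_York" in place, while a PUT whose record has `timezone: None` reverts the schedule to UTC.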