Add cron-based autonomous workflow firing with two hardening layers:
- Timezone-aware scheduling via chrono-tz: ScheduledWorkflow.timezone
(IANA identifier), compute_next_fire_at/after_tz, validate_timezone;
DST-safe, UTC fallback when absent; validated at config load and REST API
- Distributed fire-lock via SurrealDB conditional UPDATE (locked_by/locked_at
fields, 120 s TTL); WorkflowScheduler gains instance_id (UUID) as lock owner;
prevents double-fires across multi-instance deployments without extra infra
- ScheduleStore: try_acquire_fire_lock, release_fire_lock (own-instance guard),
full CRUD (load_one/all, full_upsert, patch, delete, load_runs)
- REST: 7 endpoints (GET/PUT/PATCH/DELETE schedules, runs history, manual fire)
with timezone field in all request/response types
- Migrations 010 (schedule tables) + 011 (timezone + lock columns)
- Tests: 48 passing (was 26); ADR-0034; changelog; feature docs updated
# ADR-0034: Autonomous Cron Scheduling — Timezone Support and Distributed Fire-Lock

**Status**: Implemented
**Date**: 2026-02-26
**Deciders**: VAPORA Team
**Technical Story**: `vapora-workflow-engine` scheduler fired cron jobs only in UTC and had no protection against double-fires in multi-instance deployments.

---

## Decision

Extend the autonomous scheduling subsystem with two independent hardening layers:

1. **Timezone-aware scheduling** (`chrono-tz`): cron expressions are evaluated in any IANA timezone, stored per schedule, and validated at the API and config-load boundaries.
2. **Distributed fire-lock**: a SurrealDB conditional `UPDATE ... WHERE locked_by IS NONE OR locked_at < $expiry` provides atomic, TTL-backed mutual exclusion across instances without additional infrastructure.

---

## Context

### Gaps Addressed

| Gap | Consequence |
|-----|-------------|
| UTC-only cron evaluation | `"0 9 * * *"` fires at 09:00 UTC regardless of business timezone; scheduled reports or maintenance windows drift by the UTC offset |
| No distributed coordination | Two `vapora-workflow-engine` instances reading the same `scheduled_workflows` table both fire the same schedule at the same tick |

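
To make the first gap concrete, the sketch below works out the local wall-clock hour of a job that always fires at 09:00 UTC. `local_hour` is a hypothetical helper written for this illustration, not part of `vapora-workflow-engine`, and the offsets are illustrative inputs (New York is UTC-5 in winter and UTC-4 in summer).

```rust
/// Hypothetical helper (illustration only): the local wall-clock hour seen
/// when a job fires at `utc_hour`, given the zone's current offset from UTC
/// in whole hours.
fn local_hour(utc_hour: i32, utc_offset_hours: i32) -> i32 {
    // rem_euclid keeps the result in 0..24 even when the sum is negative.
    (utc_hour + utc_offset_hours).rem_euclid(24)
}

/// A cron expression evaluated in UTC pins the UTC hour, so the local hour
/// moves whenever the zone's offset changes (for example across DST).
fn local_hours_for_utc_fire(utc_hour: i32, offsets: &[i32]) -> Vec<i32> {
    offsets.iter().map(|&off| local_hour(utc_hour, off)).collect()
}
```

With these inputs, `local_hours_for_utc_fire(9, &[-5, -4])` yields `[4, 5]`: the same schedule lands at 04:00 local time in winter and 05:00 in summer, which is the drift the table describes.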
### Why These Approaches

**`chrono-tz`** over manual UTC-offset arithmetic:
- `chrono_tz::Tz` is a compile-time enum of all IANA timezone names, so invalid names are rejected when the string is parsed.
- The `cron` crate's `Schedule::upcoming(tz)` / `Schedule::after(&dt_in_tz)` are generic over any `TimeZone`, so timezone-awareness requires no special-casing in iteration logic: pass `DateTime<chrono_tz::Tz>` instead of `DateTime<Utc>`, and convert the output with `.with_timezone(&Utc)`.
- DST transitions are handled automatically by `chrono-tz`; no application code is needed.

**SurrealDB conditional UPDATE** over an external distributed lock (Redis, etcd):
- No additional infrastructure dependency.
- SurrealDB applies document-level write locking, so `UPDATE record WHERE condition` is atomic: two concurrent instances race on the same document and only one succeeds (a non-empty return array means the lock was acquired).
- The 120-second TTL is enforced in application code: `locked_at < $expiry` in the WHERE clause auto-expires a lock left by a crashed instance within two scheduler ticks.

---

## Implementation

### New Fields

`scheduled_workflows` table gains three columns (migration 011):

| Field | Type | Purpose |
|-------|------|---------|
| `timezone` | `option<string>` | IANA identifier (`"America/New_York"`) or `NONE` for UTC |
| `locked_by` | `option<string>` | UUID of the instance holding the current fire-lock |
| `locked_at` | `option<datetime>` | When the lock was acquired; used for TTL expiry |

### Lock Protocol

```
Tick N fires schedule S:
  try_acquire_fire_lock(id, instance_id, now)
    → UPDATE ... WHERE locked_by IS NONE OR locked_at < (now - 120s)
    → returns true (non-empty) or false (empty)
  if false: log + inc schedules_skipped, return
  fire_with_lock(S, now)                 ← actual workflow start
  release_fire_lock(id, instance_id)
    → UPDATE ... WHERE locked_by = instance_id
    → own-instance guard prevents stale release
```

Lock release is always attempted, even when `fire_with_lock` returns an error; a `warn!` is emitted if the release fails, and the TTL then acts as the fallback.

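
The acquire/release rules above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the engine's code: `InMemoryLocks` and `LockState` are hypothetical stand-ins for the SurrealDB-backed `ScheduleStore`, timestamps are plain seconds, and a missing row is created on first use for brevity.

```rust
use std::collections::HashMap;

/// Lock TTL in seconds, matching the 120 s TTL described in this ADR.
const LOCK_TTL_SECS: u64 = 120;

/// Hypothetical stand-in for the lock columns of one `scheduled_workflows` row.
#[derive(Default)]
struct LockState {
    locked_by: Option<String>,
    locked_at: Option<u64>, // seconds since an arbitrary epoch
}

/// Hypothetical in-memory model; the real store issues SurrealDB queries.
#[derive(Default)]
struct InMemoryLocks {
    rows: HashMap<String, LockState>,
}

impl InMemoryLocks {
    /// Models `UPDATE ... WHERE locked_by IS NONE OR locked_at < now - TTL`.
    /// Returns true when this call took the lock.
    fn try_acquire_fire_lock(&mut self, id: &str, instance_id: &str, now: u64) -> bool {
        // Assumption for brevity: a missing row is created unlocked.
        let row = self.rows.entry(id.to_string()).or_default();
        // `at + TTL < now` is the same test as `at < now - TTL`, without underflow.
        let expired = row.locked_at.map_or(false, |at| at + LOCK_TTL_SECS < now);
        if row.locked_by.is_none() || expired {
            row.locked_by = Some(instance_id.to_string());
            row.locked_at = Some(now);
            true
        } else {
            false
        }
    }

    /// Models `UPDATE ... WHERE locked_by = instance_id`: only the current
    /// owner can clear the lock, so a stale holder cannot release a lock
    /// that another instance has since taken over.
    fn release_fire_lock(&mut self, id: &str, instance_id: &str) -> bool {
        match self.rows.get_mut(id) {
            Some(row) if row.locked_by.as_deref() == Some(instance_id) => {
                row.locked_by = None;
                row.locked_at = None;
                true
            }
            _ => false,
        }
    }
}
```

In this model, a second instance calling `try_acquire_fire_lock` 60 s after the first is refused, while a call made more than 120 s after `locked_at` takes over the lock; a late `release_fire_lock` from the original holder then returns `false` and leaves the new owner's lock intact.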
### Timezone-Aware Cron Evaluation

```
compute_fire_times_tz(schedule, last, now, catch_up, tz):
  match tz.parse::<chrono_tz::Tz>():
    Ok(tz)  → schedule.after(&last.with_timezone(&tz))
                .take_while(|t| t.with_timezone(&Utc) <= now)
                .map(|t| t.with_timezone(&Utc))
    Err(_)  → schedule.after(&last)      ← UTC
```

At evaluation time, an unknown or invalid timezone string silently falls back to UTC. This avoids a hard runtime error if a previously valid TZ identifier is removed from the `chrono-tz` database in a future upgrade.

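
The fallback rule can be expressed as a single combinator. The sketch below is generic so that it stays free of external crates: `resolve_or_utc` is a hypothetical helper, and in the engine the `parse` argument would presumably be something like `str::parse::<chrono_tz::Tz>` with `chrono_tz::UTC` as the default (an assumption, not confirmed by this ADR).

```rust
/// Hypothetical helper: resolve an optional timezone string, falling back to
/// `utc` when the value is absent or fails to parse.
///
/// `parse` stands in for the real parser; it is a parameter so this sketch
/// compiles with the standard library alone.
fn resolve_or_utc<T, E>(
    raw: Option<&str>,
    parse: impl Fn(&str) -> Result<T, E>,
    utc: T,
) -> T {
    raw
        // A parse error is discarded rather than propagated: this is the
        // "silent fallback" behaviour described in the text.
        .and_then(|name| parse(name).ok())
        .unwrap_or(utc)
}
```

Both a missing field and an unparseable identifier take the same path to UTC, which is why validation at the API and config-load boundaries is the place where bad identifiers are reported.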
### API Surface Changes

`PUT /api/v1/schedules/:id` and `PATCH /api/v1/schedules/:id` accept and return `timezone: Option<String>`. The timezone is validated at the API boundary with `validate_timezone()`; unknown identifiers are rejected with `400 InvalidInput`. Config-file `[schedule]` blocks also accept `timezone` and are validated at startup (fail-fast, same as `cron`).

---

## Consequences

### Positive

- Schedules are expressed in business-local time, so operators no longer do mental UTC arithmetic.
- Multi-instance deployments are safe by default; no external lock service is required.
- `ScheduledWorkflow.timezone` is nullable/optional, so existing schedules without the field default to UTC and need no data migration.

### Negative / Trade-offs

- `chrono-tz` adds ~2 MB of IANA timezone data to the binary (compile-time embedded).
- The distributed lock TTL of 120 s means a worst-case window of one double-fire per 120 s if the winning instance crashes between acquiring the lock and calling `update_after_fire`. This is acceptable because the `schedule_runs` audit log makes duplicates visible.
- PATCH cannot clear the timezone: passing `timezone: null` in JSON is treated as absent (`#[serde(default)]`). Clearing the timezone (reverting to UTC) requires a full PUT.

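
A minimal sketch of why PATCH cannot clear the field, assuming the patch is merged field by field (`SchedulePatch` and `apply_timezone_patch` are hypothetical names, and the merge rule is an assumption rather than the engine's actual code):

```rust
/// Hypothetical PATCH payload. With a plain `Option<String>`, a JSON `null`
/// and a missing key both end up as `None`.
struct SchedulePatch {
    timezone: Option<String>,
}

/// Assumed merge rule: a present value overwrites, `None` keeps the stored
/// value. No input can express "clear the timezone".
fn apply_timezone_patch(current: Option<String>, patch: &SchedulePatch) -> Option<String> {
    patch.timezone.clone().or(current)
}
```

Under this rule a patch with `timezone: None` leaves `"America/New_York"` in place, which matches the limitation above: only a full PUT, which replaces the whole record, can return a schedule to UTC.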