Vapora/docs/adrs/0035-notification-channels.md
Jesús Pérez 027b8f2836
feat(channels): webhook notification channels with built-in secret resolution
Add vapora-channels crate with trait-based Slack/Discord/Telegram webhook
  delivery. ${VAR}/${VAR:-default} interpolation is mandatory inside
  ChannelRegistry::from_config — callers cannot bypass secret resolution.
  Fire-and-forget dispatch via tokio::spawn in both vapora-workflow-engine
  (four lifecycle events) and vapora-backend (task Done, proposal approve/reject).
  New REST endpoints: GET /channels, POST /channels/:name/test.
  dispatch_notifications extracted as pub(crate) fn for inline testability;
  5 handler tests + 6 workflow engine tests + 7 secret resolution unit tests.

  Closes: vapora-channels bootstrap, notification gap in workflow/backend layer
  ADR: docs/adrs/0035-notification-channels.md
2026-02-26 14:49:34 +00:00


ADR-0035: Webhook-Based Notification Channels — vapora-channels Crate

Status: Implemented
Date: 2026-02-26
Deciders: VAPORA Team
Technical Story: Workflow events (task completion, proposal approve/reject, schedule fires) had no outbound delivery path; operators had to poll the API to learn about state changes.


Decision

Introduce a dedicated vapora-channels crate implementing a trait-based webhook delivery layer with:

  1. NotificationChannel trait — single send(&Message) -> Result<()> method; consumers implement HTTP webhooks (Slack, Discord, Telegram) without vendor SDK dependencies.
  2. ChannelRegistry — name-keyed routing hub; from_config(HashMap<String, ChannelConfig>) builds the registry from TOML config, resolving secrets at construction time.
  3. ${VAR} / ${VAR:-default} interpolation inside the library — secret resolution is mandatory and cannot be bypassed by callers.
  4. Fire-and-forget delivery at both layers: AppState::notify (backend) and WorkflowOrchestrator::notify_* (workflow engine) spawn background tasks; delivery failures are warn!-logged and never surface to API callers.
  5. Per-event routing config (NotificationConfig) maps event names to channel-name lists, not hardcoded channel identifiers.
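
A minimal sketch of how the trait and registry from points 1 and 2 might fit together. It is not the crate's code: `send` is written as a blocking method to keep the example dependency-free (the ADR describes it as async), and `register`, `StdoutChannel`, and the payloads of `ChannelError` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Severity levels listed for `Message` in this ADR.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Level {
    Info,
    Success,
    Warning,
    Error,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub title: String,
    pub body: String,
    pub level: Level,
}

/// Only two of the documented variants; payload types are assumptions.
#[derive(Debug, PartialEq)]
pub enum ChannelError {
    NotFound(String),
    ApiError(String),
}

/// Blocking here for brevity; the ADR describes the real method as async.
pub trait NotificationChannel: Send + Sync {
    fn send(&self, msg: &Message) -> Result<(), ChannelError>;
}

/// Name-keyed routing hub.
#[derive(Default)]
pub struct ChannelRegistry {
    channels: HashMap<String, Arc<dyn NotificationChannel>>,
}

impl ChannelRegistry {
    /// Hypothetical helper: add a channel under a name.
    pub fn register(&mut self, name: &str, channel: Arc<dyn NotificationChannel>) {
        self.channels.insert(name.to_string(), channel);
    }

    /// Route one message to one named channel.
    pub fn send(&self, name: &str, msg: &Message) -> Result<(), ChannelError> {
        match self.channels.get(name) {
            Some(channel) => channel.send(msg),
            None => Err(ChannelError::NotFound(name.to_string())),
        }
    }
}

/// Stand-in channel that prints the message instead of POSTing it.
pub struct StdoutChannel;

impl NotificationChannel for StdoutChannel {
    fn send(&self, msg: &Message) -> Result<(), ChannelError> {
        println!("[{:?}] {}: {}", msg.level, msg.title, msg.body);
        Ok(())
    }
}
```

Because routing goes through names, a new provider only needs another `impl NotificationChannel`; the registry and its callers stay unchanged.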

Context

Gaps Addressed

| Gap | Consequence |
|-----|-------------|
| No outbound event delivery | Operators must poll 40+ API endpoints to detect state changes |
| Secrets in TOML as plain strings | If resolution is left to callers, a ${SLACK_WEBHOOK_URL} placeholder reaches the HTTP layer verbatim when the caller forgets to interpolate |
| Tight vendor coupling | Using slack-rs / serenity locks the feature to specific Slack/Discord API versions and transitive dependency trees |

Why NotificationChannel Trait Over Vendor SDKs

Slack, Discord, and Telegram all accept a simple POST with a JSON body to a webhook URL: no OAuth, no persistent connection, no stateful session. A trait with one async method covers all three in fewer than 50 lines per implementation. Vendor SDKs add 200-500 kB of transitive dependencies and introduce breaking changes on provider API updates.

Why Secret Resolution in the Library

Placing the responsibility on the caller leaves a gap that opens the first time it is forgotten: if any caller skips resolve_secrets() before constructing ChannelRegistry, a raw ${SLACK_WEBHOOK_URL} string is sent to Slack's API as the URL. The request fails without surfacing to anyone (Slack returns 404 or 400), the placeholder leaks into logs, and no compile-time or runtime warning is raised until someone inspects a log.

Moving interpolation into ChannelRegistry::from_config makes it structurally impossible to construct a registry with unresolved secrets: ChannelError::SecretNotFound(var_name) is returned immediately if an env var is absent and no default is provided. There is no non-error path that bypasses resolution.
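
The "no bypass" property can be sketched as a type whose only constructor performs resolution. This is an illustration under assumptions, not the crate's code: `ChannelConfig` is reduced to a single `webhook_url` field, `resolve` handles only a value that is exactly `${VAR}`, and the `url` accessor is hypothetical. In a separate crate the private field means callers cannot build the struct any other way.

```rust
use std::collections::HashMap;
use std::env;

#[derive(Debug, PartialEq)]
pub enum ChannelError {
    SecretNotFound(String),
}

/// Raw config as read from TOML; `webhook_url` may still hold `${VAR}`.
pub struct ChannelConfig {
    pub webhook_url: String,
}

/// The field is private, so outside this module `from_config` is the
/// only way to obtain a registry.
#[derive(Debug)]
pub struct ChannelRegistry {
    resolved_urls: HashMap<String, String>,
}

impl ChannelRegistry {
    pub fn from_config(raw: HashMap<String, ChannelConfig>) -> Result<Self, ChannelError> {
        let mut resolved_urls = HashMap::new();
        for (name, cfg) in raw {
            // Resolution happens here, unconditionally, for every channel.
            resolved_urls.insert(name, resolve(&cfg.webhook_url)?);
        }
        Ok(Self { resolved_urls })
    }

    /// Hypothetical accessor, present only so the result can be inspected.
    pub fn url(&self, name: &str) -> Option<&str> {
        self.resolved_urls.get(name).map(String::as_str)
    }
}

/// Simplified resolver: only a value that is exactly `${VAR}` is looked up.
fn resolve(value: &str) -> Result<String, ChannelError> {
    match value.strip_prefix("${").and_then(|v| v.strip_suffix('}')) {
        Some(var) => env::var(var).map_err(|_| ChannelError::SecretNotFound(var.to_string())),
        None => Ok(value.to_string()),
    }
}
```

With this shape, a missing variable turns into an `Err` at construction time, so an unresolved placeholder never reaches the code that sends requests.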

Why Fire-and-Forget With tokio::spawn

Notification delivery is a best-effort side-effect, not part of the request/response contract. A Slack outage should not cause a POST /api/v1/proposals/:id/approve to return 500. Spawning an independent task decouples delivery latency from API latency; warn! logging provides observability without blocking the caller.
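
A rough illustration of the decoupling described above. To stay dependency-free it uses `std::thread::spawn` where the ADR uses `tokio::spawn`, and `eprintln!` where the ADR uses `warn!`; `deliver`, `notify`, the `"down-channel"` name, and this reduced `approve_proposal` are hypothetical stand-ins. The join handle is returned only so the background work can be observed in a test; a real handler would drop it.

```rust
use std::thread::{self, JoinHandle};

/// Stand-in for a webhook call that may fail.
fn deliver(channel: &str) -> Result<(), String> {
    if channel == "down-channel" {
        Err(format!("{channel}: 503 Service Unavailable"))
    } else {
        Ok(())
    }
}

/// Spawn delivery in the background and return immediately.
/// The background task reports how many deliveries failed.
fn notify(channels: Vec<String>) -> JoinHandle<usize> {
    thread::spawn(move || {
        let mut failures = 0;
        for channel in &channels {
            if let Err(e) = deliver(channel) {
                // Plays the role of the warn! log in the ADR.
                eprintln!("WARN notification delivery failed: {e}");
                failures += 1;
            }
        }
        failures
    })
}

/// Hypothetical handler: its status depends only on the approval itself,
/// never on whether the notification was delivered.
fn approve_proposal(channels: Vec<String>) -> (u16, JoinHandle<usize>) {
    let handle = notify(channels);
    (200, handle)
}
```

The handler's return value is computed before any delivery result exists, which is the property that keeps a provider outage from turning into a 500.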


Implementation

Crate Structure (vapora-channels)

vapora-channels/
├── src/
│   ├── lib.rs         — pub re-exports (ChannelRegistry, Message, NotificationChannel)
│   ├── channel.rs     — NotificationChannel trait
│   ├── config.rs      — ChannelsConfig, ChannelConfig, SlackConfig/DiscordConfig/TelegramConfig
│   │                    resolve_secrets() chain + interpolate() with OnceLock<Regex>
│   ├── error.rs       — ChannelError: NotFound, ApiError, SecretNotFound, SerializationError
│   ├── message.rs     — Message { title, body, level: Info|Success|Warning|Error }
│   ├── registry.rs    — ChannelRegistry: name → Arc<dyn NotificationChannel>
│   └── webhooks/
│       ├── slack.rs   — SlackChannel: POST IncomingWebhook JSON
│       ├── discord.rs — DiscordChannel: POST Webhook embed JSON
│       └── telegram.rs— TelegramChannel: POST bot sendMessage JSON

Secret Resolution

interpolate(s: &str) -> Result<String>:
  regex: \$\{([^}:]+)(?::-(.*?))?\}   (compiled once via OnceLock)
  fast-path: if !s.contains("${") { return Ok(s.to_string()) }
  for each capture:
    var_name = capture[1]
    default  = capture[2] (optional)
    match env::var(var_name):
      Ok(v)  → replace placeholder with v
      Err(_) → if default.is_some(): replace with default
               else: return Err(SecretNotFound(var_name))

resolve_secrets() is called unconditionally in ChannelRegistry::from_config — single mandatory call site, no consumer bypass.
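
The pseudocode above can be approximated in plain Rust. This sketch swaps the regex for a hand-written scan so that it needs no external crate, and it adds a hypothetical `interpolate_with` that takes the variable lookup as a closure so the logic can be exercised without touching the process environment. Malformed or nested placeholders may be treated differently from the regex.

```rust
#[derive(Debug, PartialEq)]
pub enum ChannelError {
    SecretNotFound(String),
}

/// Expand `${VAR}` and `${VAR:-default}` using `lookup` to read variables.
pub fn interpolate_with<F>(s: &str, lookup: F) -> Result<String, ChannelError>
where
    F: Fn(&str) -> Option<String>,
{
    // Fast path: nothing to expand.
    if !s.contains("${") {
        return Ok(s.to_string());
    }
    let mut out = String::with_capacity(s.len());
    let mut rest = s;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let Some(end) = after.find('}') else {
            // Unterminated placeholder: keep the remaining text verbatim.
            out.push_str(&rest[start..]);
            return Ok(out);
        };
        let inner = &after[..end];
        let (name, default) = match inner.split_once(":-") {
            Some((n, d)) => (n, Some(d)),
            None => (inner, None),
        };
        if name.is_empty() || name.contains(':') {
            // Not a valid placeholder name: leave it untouched.
            out.push_str(&rest[start..start + 2 + end + 1]);
        } else {
            match (lookup(name), default) {
                (Some(value), _) => out.push_str(&value),
                (None, Some(d)) => out.push_str(d),
                (None, None) => return Err(ChannelError::SecretNotFound(name.to_string())),
            }
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

/// Environment-backed variant, matching the signature in the pseudocode.
pub fn interpolate(s: &str) -> Result<String, ChannelError> {
    interpolate_with(s, |name| std::env::var(name).ok())
}
```

Injecting the lookup is a design choice of this sketch only; it keeps unit tests independent of process-wide environment state.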

Integration Points

vapora-workflow-engine

WorkflowConfig.notifications: WorkflowNotifications maps four events to channel-name lists:

[workflows.myflow.notifications]
on_stage_complete = ["team-slack"]
on_stage_failed   = ["team-slack", "ops-discord"]
on_completed      = ["team-slack"]
on_cancelled      = ["ops-discord"]

WorkflowOrchestrator holds Option<Arc<ChannelRegistry>> and calls notify_stage_complete, notify_stage_failed, notify_completed, notify_cancelled — each spawns dispatch_notifications.
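
One way to picture the per-event routing is a struct whose fields mirror the four TOML keys, plus a lookup from event to channel names. The field names follow the config above; the `WorkflowEvent` enum and the `channels_for` helper are hypothetical and only illustrate the mapping.

```rust
/// The four workflow lifecycle events that can trigger notifications.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WorkflowEvent {
    StageComplete,
    StageFailed,
    Completed,
    Cancelled,
}

/// Mirrors the keys under `[workflows.<name>.notifications]`.
#[derive(Debug, Default, Clone)]
pub struct WorkflowNotifications {
    pub on_stage_complete: Vec<String>,
    pub on_stage_failed: Vec<String>,
    pub on_completed: Vec<String>,
    pub on_cancelled: Vec<String>,
}

impl WorkflowNotifications {
    /// Hypothetical helper: channel names configured for an event.
    /// An empty slice means the event is not routed anywhere.
    pub fn channels_for(&self, event: WorkflowEvent) -> &[String] {
        match event {
            WorkflowEvent::StageComplete => &self.on_stage_complete,
            WorkflowEvent::StageFailed => &self.on_stage_failed,
            WorkflowEvent::Completed => &self.on_completed,
            WorkflowEvent::Cancelled => &self.on_cancelled,
        }
    }
}
```

Each `notify_*` method would then pass the slice for its event to the dispatch function, so channel names live only in configuration.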

vapora-backend

Config.channels: HashMap<String, ChannelConfig> and Config.notifications: NotificationConfig:

[channels.team-slack]
type = "slack"
webhook_url = "${SLACK_WEBHOOK_URL}"

[notifications]
on_task_done         = ["team-slack"]
on_proposal_approved = ["team-slack", "ops-discord"]
on_proposal_rejected = ["ops-discord"]

AppState gains channel_registry: Option<Arc<ChannelRegistry>> and notification_config: Arc<NotificationConfig>. Hooks in three existing handlers:

  • update_task_status — fires Message::success on TaskStatus::Done
  • approve_proposal — fires Message::success
  • reject_proposal — fires Message::warning

New REST Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | /api/v1/channels | List registered channel names |
| POST | /api/v1/channels/:name/test | Send connectivity test; 200 OK / 404 / 502 |
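
The three status codes of the test endpoint suggest a mapping from the delivery result to an HTTP status. The sketch below is one plausible reading, not the handler's actual code: it assumes an unknown channel name yields 404 and any other delivery error yields 502. The function name `test_endpoint_status` and the `String` payloads on the error variants are assumptions.

```rust
/// Error variants named in the crate structure; payload types are assumed.
#[derive(Debug)]
pub enum ChannelError {
    NotFound(String),
    ApiError(String),
    SecretNotFound(String),
    SerializationError(String),
}

/// Hypothetical mapping for POST /api/v1/channels/:name/test.
pub fn test_endpoint_status(result: &Result<(), ChannelError>) -> u16 {
    match result {
        // The test message was accepted by the provider.
        Ok(()) => 200,
        // No channel is registered under the requested name.
        Err(ChannelError::NotFound(_)) => 404,
        // The upstream webhook (or message encoding) failed.
        Err(_) => 502,
    }
}
```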

Testability

dispatch_notifications is extracted as pub(crate) async fn taking Option<Arc<ChannelRegistry>> directly, making it testable without a DB or a fully-constructed AppState. Five inline tests in state.rs use RecordingChannel (captures messages) and FailingChannel (returns 503 error) to verify:

  1. No-op when registry is None
  2. Single-channel delivery
  3. Multi-channel broadcast
  4. Resilience: delivery continues after one channel fails
  5. Warn-log on unknown channel name, other channels still receive
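
The five behaviours above can be sketched with simplified types. This is an illustration, not the code in state.rs: the function is blocking rather than async, messages are plain strings, the registry is a bare `HashMap` alias, and the returned success count is an assumption added so the result is easy to check. `RecordingChannel` and `FailingChannel` follow the names used in the ADR.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

pub trait NotificationChannel: Send + Sync {
    fn send(&self, msg: &str) -> Result<(), String>;
}

/// Simplified stand-in for the crate's registry type.
pub type ChannelRegistry = HashMap<String, Arc<dyn NotificationChannel>>;

/// Captures every message it receives.
#[derive(Default)]
pub struct RecordingChannel {
    pub seen: Mutex<Vec<String>>,
}

impl NotificationChannel for RecordingChannel {
    fn send(&self, msg: &str) -> Result<(), String> {
        self.seen.lock().unwrap().push(msg.to_string());
        Ok(())
    }
}

/// Always fails, standing in for a 503 from the provider.
pub struct FailingChannel;

impl NotificationChannel for FailingChannel {
    fn send(&self, _msg: &str) -> Result<(), String> {
        Err("503 Service Unavailable".to_string())
    }
}

/// Deliver `msg` to each named channel; returns the number of successes.
pub fn dispatch_notifications(
    registry: Option<Arc<ChannelRegistry>>,
    names: &[&str],
    msg: &str,
) -> usize {
    // No registry configured: nothing to do.
    let Some(registry) = registry else { return 0 };
    let mut delivered = 0;
    for name in names {
        match registry.get(*name) {
            // Unknown name: log and keep going.
            None => eprintln!("WARN unknown channel '{name}'"),
            Some(channel) => match channel.send(msg) {
                Ok(()) => delivered += 1,
                // One failing channel must not stop the others.
                Err(e) => eprintln!("WARN delivery to '{name}' failed: {e}"),
            },
        }
    }
    delivered
}
```

Taking the registry as a plain argument is what lets such tests run without a database or a fully built application state.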

Consequences

Positive

  • Operators get real-time Slack/Discord/Telegram alerts on task completion, proposal decisions, and workflow lifecycle events.
  • Adding a new channel type requires implementing one trait method and one TOML variant — no changes to routing or dispatch code.
  • Secret resolution failures surface immediately at startup (if ChannelRegistry::from_config is called at boot), not silently at first delivery.
  • Zero additional infrastructure: webhooks are outbound-only HTTP POSTs.

Negative / Trade-offs

  • Delivery is best-effort (fire-and-forget). A channel that is consistently down produces warn! logs but no alert escalation; consumers needing guaranteed delivery must implement their own retry loop or use a message queue.
  • Tests for ${VAR} interpolation call unsafe { std::env::set_var } (std::env::set_var is an unsafe fn in the Rust 2024 edition). Because these tests set and unset process-wide env vars, parallel test threads can produce flaky results unless the tests are isolated with #[serial_test::serial].
  • No per-channel rate limiting: a workflow that fires 1,000 stage-complete events will produce 1,000 Slack messages. Operators must configure notifications lists deliberately.
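
The env-var caveat in the list above can be illustrated with a small test helper. This sketch uses a process-wide `Mutex` as a std-only stand-in for `#[serial_test::serial]`; the names `ENV_LOCK` and `with_env_var` are hypothetical. The lock only serializes callers of this helper, and the variable is not restored if the closure panics.

```rust
use std::env;
use std::sync::Mutex;

/// Serializes helpers that mutate process-wide environment variables.
static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Run `f` with `key` set to `value`, then remove the variable again.
#[allow(unused_unsafe)] // set_var/remove_var are `unsafe fn` only in edition 2024
pub fn with_env_var<T>(key: &str, value: &str, f: impl FnOnce() -> T) -> T {
    // Recover the guard even if an earlier holder panicked.
    let _guard = ENV_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    // SAFETY: callers of this helper are serialized by ENV_LOCK; code that
    // reads or writes the environment elsewhere is not covered by the lock.
    unsafe { env::set_var(key, value) };
    let out = f();
    unsafe { env::remove_var(key) };
    out
}
```

A test would wrap its call to the interpolation function in `with_env_var`, so that concurrent tests using the same helper never observe each other's variables.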

Supersedes / Specializes

  • Builds on the philosophy of the SecretumVault pattern (ADR-0011), which is to never store secrets as plain strings, and specializes it to webhook tokens held in config files.
  • Parallel to vapora-a2a-client's retry pattern (ADR-0030): both handle external HTTP delivery, but channels are fire-and-forget while A2A requires a confirmed response.