provisioning/adrs/adr-031-unified-component-cli.ncl

48 lines
6.5 KiB
XML
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

let d = import "adr-defaults.ncl" in
d.make_adr {
id = "adr-031",
title = "Unified Component CLI: prvng component <op> with Polymorphic Mode Dispatch",
status = 'Accepted,
date = "2026-04-19",
context = "Before this decision the CLI exposed two separate command hierarchies for infrastructure lifecycle: `prvng taskserv <op>` for taskserv-mode components and `prvng component list|show|info` for read-only introspection. Write operations (install, delete, update, reinstall, restart, backup, restore) were routed through `taskserv.nu` regardless of the component's actual deploy mode (taskserv/cluster/container). This created three problems: (1) a cluster-mode component like postgresql required `prvng taskserv create postgresql` even though it runs as a Kubernetes deployment — the semantics were wrong and confused operators; (2) the script resolution used only the Tier-1 `install-<name>.sh` convention — Tier-2 per-op scripts (`delete-<name>.sh`, `backup-<name>.sh`) were never invoked; (3) the orchestrator endpoint `/workflows/taskserv/create` accepted a `TaskservWorkflow` body that embedded no mode information, so the worker had no way to route script execution by mode. A prerequisite condition also existed: no precondition gate validated that capability providers (e.g. k0s for cluster-mode, NFS for democratic_csi) were healthy before enqueuing a write.",
decision = "Replace `prvng taskserv` with a unified `prvng component <op>` command that dispatches polymorphically by deploy mode. The implementation has five parts: (1) A new orchestrator endpoint `/api/v1/workflows/component/{op}` accepts `ComponentWorkflow` (workspace + infra + component + server + namespace + ssh_user + ssh_key_path + settings + check_mode + provisioning). The `{op}` path segment is validated against a fixed allowlist (install, delete, update, reinstall, restart, backup, restore, check-updates for write; status, health, list, show for read). (2) A precondition gate (`src/preconditions.rs`) runs before task enqueue for write ops: it fast-fails on `.provisioning-state.ncl` terminal states (failed/error), then runs live SSH probes via the system `ssh` binary with a 15-second global timeout. The gate is skipped for read-only ops and for taskserv-mode components (root provider, no dependencies). (3) Script resolution is upgraded to two tiers: Tier 2 (`<op>-<name>.sh`) is preferred; Tier 1 (`install-<name>.sh` + `CMD_TASK=<op>` env) is the fallback. `cluster_deploy.nu` calls `get-component-script-path` from `extensions/discovery.nu` and merges `CMD_TASK` only for Tier-1 scripts. (4) The NuShell CLI adds eight exported lifecycle commands to `cli/components.nu` (install/delete/update/reinstall/restart/backup/restore/check-updates) and two new functions to `platform/clients/orchestrator.nu` (`orch-submit-component`, `orch-wait-task`). (5) A feature flag `orchestrator.features.enable_component_endpoint` (default: true) allows a one-line rollback to 404 without binary redeployment.",
rationale = [
{
claim = "Single endpoint with path-param op is cleaner than one endpoint per operation",
detail = "The alternative was `/api/v1/workflows/component/install`, `/api/v1/workflows/component/delete`, etc. — eight separate routes with identical handler bodies. A single parameterised route `/api/v1/workflows/component/{op}` validates the op at the top of the handler and reuses one `WorkflowTask` construction path. The operation name is forwarded as the Nu script's `CMD_TASK` env var for Tier-1 scripts, so the routing is structural, not branching.",
},
{
claim = "SSH via tokio::process::Command is safer than the russh crate for the precondition gate",
detail = "The codebase has a `russh` dependency but `pool/executor.rs` is a stub. Adding a live russh integration solely for the precondition gate would require implementing the full connection pool — a significant scope addition. The system `ssh` binary with `BatchMode=yes StrictHostKeyChecking=no ConnectTimeout=N` is a one-file implementation with no state, no connection pool management, and no key-format assumptions. The probe is fire-and-forget (invoked once per precondition check, not per request), so there is no meaningful performance loss from process spawn overhead.",
},
{
claim = "The feature flag is a deployment-level rollback, not an A/B switch",
detail = "The flag `enable_component_endpoint` returns HTTP 404 when false. This is intentional: callers that haven't migrated receive a 404 and fall back to the legacy `/workflows/taskserv/create` route, which remains in place. The flag is not intended for permanent dual-track operation — it exists to give a one-flag rollback path during the migration window, after which the legacy route and taskserv.nu will be deleted.",
},
{
claim = "Deleting prvng taskserv is a breaking change accepted as deliberate",
detail = "All three call sites for the old taskserv API (NuShell CLI, control-center-ui, MCP server) are migrated atomically in this decision. The commands-registry.ncl entry is removed, the `t` single-char alias is removed, and the bash wrapper dispatch case is removed. `taskserv.nu` is retained temporarily to avoid breaking any in-flight sessions. Its permanent deletion is a follow-on commit after confirming no un-migrated callers exist.",
},
],
consequences = {
positive = [
"prvng component install postgresql correctly names the operation for both cluster and taskserv modes",
"Backup and restore operations are now surfaced via prvng component backup/restore — previously inaccessible from the CLI",
"Precondition gate prevents cascading failures when a capability provider is unhealthy before a write operation",
"Tier-2 per-op scripts are now invoked when present — operators can specialise delete/backup logic without patching the generic install script",
],
negative = [
"Breaking change: prvng taskserv is removed. Any external scripts or documentation referencing the old command require update",
"The precondition gate adds 015 seconds to write operations when providers are unhealthy or unreachable — healthy-path overhead is ~2s for the state-file fast-fail",
],
neutral = [
"taskserv.nu is not deleted in this commit — a follow-on cleanup commit removes it once in-flight migration is confirmed",
"The legacy /workflows/taskserv/create endpoint is preserved indefinitely until the feature flag is toggled and the route removed in a future cleanup",
],
},
}