48 lines
6.5 KiB
XML
48 lines
6.5 KiB
XML
let d = import "adr-defaults.ncl" in
|
||
|
||
d.make_adr {
|
||
id = "adr-031",
|
||
title = "Unified Component CLI: prvng component <op> with Polymorphic Mode Dispatch",
|
||
status = 'Accepted,
|
||
date = "2026-04-19",
|
||
|
||
context = "Before this decision the CLI exposed two separate command hierarchies for infrastructure lifecycle: `prvng taskserv <op>` for taskserv-mode components and `prvng component list|show|info` for read-only introspection. Write operations (install, delete, update, reinstall, restart, backup, restore) were routed through `taskserv.nu` regardless of the component's actual deploy mode (taskserv/cluster/container). This created three problems: (1) a cluster-mode component like postgresql required `prvng taskserv create postgresql` even though it runs as a Kubernetes deployment — the semantics were wrong and confused operators; (2) the script resolution used only the Tier-1 `install-<name>.sh` convention — Tier-2 per-op scripts (`delete-<name>.sh`, `backup-<name>.sh`) were never invoked; (3) the orchestrator endpoint `/workflows/taskserv/create` accepted a `TaskservWorkflow` body that embedded no mode information, so the worker had no way to route script execution by mode. A prerequisite condition also existed: no precondition gate validated that capability providers (e.g. k0s for cluster-mode, NFS for democratic_csi) were healthy before enqueuing a write.",
|
||
|
||
decision = "Replace `prvng taskserv` with a unified `prvng component <op>` command that dispatches polymorphically by deploy mode. The implementation has five parts: (1) A new orchestrator endpoint `/api/v1/workflows/component/{op}` accepts `ComponentWorkflow` (workspace + infra + component + server + namespace + ssh_user + ssh_key_path + settings + check_mode + provisioning). The `{op}` path segment is validated against a fixed allowlist (install, delete, update, reinstall, restart, backup, restore, check-updates for write; status, health, list, show for read). (2) A precondition gate (`src/preconditions.rs`) runs before task enqueue for write ops: it fast-fails on `.provisioning-state.ncl` terminal states (failed/error), then runs live SSH probes via the system `ssh` binary with a 15-second global timeout. The gate is skipped for read-only ops and for taskserv-mode components (root provider, no dependencies). (3) Script resolution is upgraded to two tiers: Tier 2 (`<op>-<name>.sh`) is preferred; Tier 1 (`install-<name>.sh` + `CMD_TASK=<op>` env) is the fallback. `cluster_deploy.nu` calls `get-component-script-path` from `extensions/discovery.nu` and merges `CMD_TASK` only for Tier-1 scripts. (4) The NuShell CLI adds eight exported lifecycle commands to `cli/components.nu` (install/delete/update/reinstall/restart/backup/restore/check-updates) and two new functions to `platform/clients/orchestrator.nu` (`orch-submit-component`, `orch-wait-task`). (5) A feature flag `orchestrator.features.enable_component_endpoint` (default: true) allows a one-line rollback to 404 without binary redeployment.",
|
||
|
||
rationale = [
|
||
{
|
||
claim = "Single endpoint with path-param op is cleaner than one endpoint per operation",
|
||
detail = "The alternative was `/api/v1/workflows/component/install`, `/api/v1/workflows/component/delete`, etc. — eight separate routes with identical handler bodies. A single parameterised route `/api/v1/workflows/component/{op}` validates the op at the top of the handler and reuses one `WorkflowTask` construction path. The operation name is forwarded as the Nu script's `CMD_TASK` env var for Tier-1 scripts, so the routing is structural, not branching.",
|
||
},
|
||
{
|
||
claim = "SSH via tokio::process::Command is safer than the russh crate for the precondition gate",
|
||
detail = "The codebase has a `russh` dependency but `pool/executor.rs` is a stub. Adding a live russh integration solely for the precondition gate would require implementing the full connection pool — a significant scope addition. The system `ssh` binary with `BatchMode=yes StrictHostKeyChecking=no ConnectTimeout=N` is a one-file implementation with no state, no connection pool management, and no key-format assumptions. The probe is fire-and-forget (invoked once per precondition check, not per request), so there is no meaningful performance loss from process spawn overhead.",
|
||
},
|
||
{
|
||
claim = "The feature flag is a deployment-level rollback, not an A/B switch",
|
||
detail = "The flag `enable_component_endpoint` returns HTTP 404 when false. This is intentional: callers that haven't migrated receive a 404 and fall back to the legacy `/workflows/taskserv/create` route, which remains in place. The flag is not intended for permanent dual-track operation — it exists to give a one-flag rollback path during the migration window, after which the legacy route and taskserv.nu will be deleted.",
|
||
},
|
||
{
|
||
claim = "Deleting prvng taskserv is a breaking change accepted as deliberate",
|
||
detail = "All three call sites for the old taskserv API (NuShell CLI, control-center-ui, MCP server) are migrated atomically in this decision. The commands-registry.ncl entry is removed, the `t` single-char alias is removed, and the bash wrapper dispatch case is removed. `taskserv.nu` is retained temporarily to avoid breaking any in-flight sessions. Its permanent deletion is a follow-on commit after confirming no un-migrated callers exist.",
|
||
},
|
||
],
|
||
|
||
consequences = {
|
||
positive = [
|
||
"prvng component install postgresql correctly names the operation for both cluster and taskserv modes",
|
||
"Backup and restore operations are now surfaced via prvng component backup/restore — previously inaccessible from the CLI",
|
||
"Precondition gate prevents cascading failures when a capability provider is unhealthy before a write operation",
|
||
"Tier-2 per-op scripts are now invoked when present — operators can specialise delete/backup logic without patching the generic install script",
|
||
],
|
||
negative = [
|
||
"Breaking change: prvng taskserv is removed. Any external scripts or documentation referencing the old command require update",
|
||
"The precondition gate adds 0–15 seconds to write operations when providers are unhealthy or unreachable — healthy-path overhead is ~2s for the state-file fast-fail",
|
||
],
|
||
neutral = [
|
||
"taskserv.nu is not deleted in this commit — a follow-on cleanup commit removes it once in-flight migration is confirmed",
|
||
"The legacy /workflows/taskserv/create endpoint is preserved indefinitely until the feature flag is toggled and the route removed in a future cleanup",
|
||
],
|
||
},
|
||
}
|