diff --git a/explore_post_torrust_ag_response_full_info.md b/explore_post_torrust_ag_response_full_info.md
index b20c6ca..344fd79 100644
--- a/explore_post_torrust_ag_response_full_info.md
+++ b/explore_post_torrust_ag_response_full_info.md
@@ -1,6 +1,6 @@
Comment: Building with AI Agents — A Practitioner's Response
- Response to "Building with AI Agents / Building for AI Agents" — Torrust Blog
+# Response to "Building with AI Agents / Building for AI Agents" — Torrust Blog
---
This piece resonates deeply with patterns we've converged on independently across three interconnected projects: a typed form and agent execution library (TypeDialog), an agent
@@ -10,9 +10,9 @@
We want to add concrete experience to several of your points, extend two of them, push back on one, and add two problems the article touches but doesn't resolve to their root.
---
- What we found to be true — and harder than expected
+## What we found to be true — and harder than expected
- The skills system tension is real and persistent.
+### The skills system tension is real and persistent.
You describe skills as the solution to context pollution. We agree. Our provisioning project has a .claude/skills/ directory with per-domain Nushell skill scripts — code reviewer, script optimizer, test generator. And yet, six months into the project, AGENTS.md had grown to v2.4 with 500+ lines loaded on every agent invocation. The monolithic approach
@@ -20,7 +20,7 @@
skills system requires active, deliberate enforcement.
We'd argue this is the most underestimated challenge in agent-friendly design: not building the skills system, but not reverting to AGENTS.md over time.
- "Agents are users who can program" cuts deeper than it appears.
+> "Agents are users who can program" cuts deeper than it appears.
Your formulation is correct and we'd extend it: the fact that agents can write and execute code means the interface boundary shifts. A human using your CLI needs good ergonomics.
An agent using your CLI needs structured output and a library. TypeDialog exposes --format json|yaml|toml|text on every command — but the real gain came when we also exposed a
@@ -28,45 +28,57 @@
domain-specific error enums (ValidationErrorKind::ContractViolation, FormParseErrorKind::MissingField). The CLI becomes the human interface; the library becomes the agent interface.
- Documentation staleness is structural, not a discipline problem.
+### Documentation staleness is structural, not a discipline problem.
- You note that agent-focused docs risk going stale faster. This is true but the cause is architectural: when documentation lives in a different layer than the code it describes,
+ You note that agent-focused docs risk going stale faster. This is true but the cause is architectural:
+
+ when documentation lives in a different layer than the code it describes,
divergence is inevitable. We've moved to a three-layer system — session files (.coder/), operational configuration (.claude/), product documentation (docs/) — with explicit rules
+ about what can reference what. The key insight is that skills are code, not documentation. A skill that wraps cargo clippy -- -D warnings and interprets the output for a specific
+ domain doesn't go stale the way a prose description does. The closer documentation is to executable form, the less it rots.
A three-file pattern (contracts.ncl + defaults.ncl + main.ncl) makes the schema self-composing: agents don't just know
+ time — before the config is applied to infrastructure.
+
+ A three-file pattern (contracts.ncl + defaults.ncl + main.ncl) makes the schema self-composing: agents don't just know
what's valid, they get correct defaults for free through deep merge, not shallow merge which silently drops nested fields.
The implication for agent-friendly design: if your configuration language has a type system, expose it as the agent interface. Not a JSON Schema export of it — the actual type system. Agents that can write code can use it directly.
- Budget enforcement belongs in the infrastructure, not in prompts.
+### Budget enforcement belongs in the infrastructure, not in prompts.
Our orchestration platform enforces per-role monthly and weekly budget caps with automatic fallback chains (Claude Opus → GPT-4 → Claude Sonnet) and Prometheus metrics for budget utilization. Without it, a misconfigured agent loop can exhaust your API budget before anyone notices. The principle: cost constraints should be compiler-enforced infrastructure, not guidelines in a prompt.
- Three focused primitives beat one comprehensive agent platform.
+### Three focused primitives beat one comprehensive agent platform.
Tinybird's conclusion — "make your platform work with all agents rather than building your own" — is correct. But there's a corollary: you can build multiple focused primitives
- that compose. TypeDialog handles typed input capture and agent execution via .agent.mdx files with @input and @validate declarations. Our provisioning system handles IaC execution
+ that compose. TypeDialog handles typed input capture and agent execution via .agent.mdx files with @input and @validate declarations.
+
+ Our provisioning system handles IaC execution
with dependency graph ordering, checkpoint recovery, and multi-cloud providers.
The orchestration platform handles agent coordination — routing, learning profiles, approval gates,
- cost tracking. Each is independently useful. Together, they form a complete stack where TypeDialog captures config, the provisioning system executes it, and the orchestrator
+ cost tracking. Each is independently useful.
+
+ Together, they form a complete stack where TypeDialog captures config, the provisioning system executes it, and the orchestrator
coordinates the agents doing both. The failure mode we avoided: trying to make one of them do everything.
---
- One pushback: training cycle reliance is context-dependent
+
+## One pushback: training cycle reliance is context-dependent
- You write: "rely on upcoming LLM training cycles incorporating public GitHub repositories" rather than building custom RAG pipelines. For mainstream tooling — React, Postgres,
- common Rust patterns — this is reasonable. For specialized tooling — Nickel's lazy evaluation semantics, Nushell's structured pipeline model, project-specific provisioning patterns
+ You write: **"rely on upcoming LLM training cycles incorporating public GitHub repositories"** rather than building custom RAG pipelines. For mainstream tooling — React, Postgres, common Rust patterns — this is reasonable.
+
+ For specialized tooling — Nickel's lazy evaluation semantics, Nushell's structured pipeline model, project-specific provisioning patterns
— it's not. These won't be in training data in useful depth, and when they are, the version will lag.
The skills system you recommend is the lightweight alternative to RAG. A skill that explains Nickel's deep merge behavior and shows the three-file pattern gives an agent accurate,
@@ -75,9 +87,10 @@
The heuristic we'd suggest: training cycle reliance works for languages and frameworks with millions of repositories using them. For everything else, skills.
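What such a skill embeds can be mostly executable rather than prose. A hedged sketch of the Nickel it might show an agent; the record and field names here are our illustration, only the `&` merge semantics and contract syntax are Nickel's:

```nickel
# contracts.ncl — the schema written as a contract (fields are illustrative)
let ServerContract = {
  name | String,
  network | { private | Bool, .. },
  ..
} in

# main.ncl would deep-merge user input over defaults.ncl with `&`.
# Nickel's record merge is recursive: overriding one nested field
# leaves its siblings intact, where a shallow merge would drop them.
(
  { name = "ci-runner", network = { private = true, subnet = "10.0.0.0/24" } }
  & { network = { mtu = 1450 } }  # network.private and network.subnet survive
) | ServerContract
```

Because the merge is recursive, the second record overrides only `network.mtu`; a shallow merge would have replaced the whole `network` record and silently lost `private` and `subnet`.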
---
- The two problems the article touches but doesn't resolve to their root
- VM failures: the problem isn't virtualization — it's CI topology
+## The two problems the article touches but doesn't resolve to their root
+
+### VM failures: the problem isn't virtualization — it's CI topology
You describe nested virtualization failures in shared runners as an obstacle to LXD VM testing. The proposed resolution is implicit: simply don't run VM tests on shared CI. That's pragmatic but incomplete.
@@ -88,26 +101,32 @@
What actually works is stratifying execution environments by capability and responsibility:
+```text
Shared runners (GitHub Actions / Woodpecker free tier)
+
→ fast checks only: fmt, lint, unit tests without external I/O
→ constraint: <5 minutes, no virtualization, no private network
Self-hosted runners (your own VMs, nested virt enabled)
+
→ integration tests, LXD containers, network topology tests
→ UpCloud and Hetzner support nested virtualization natively
→ no arbitrary timeout
Self-hosted runners with full VM support
+
→ LXD VM tests, systemd-full, kernel-level tests
→ expensive and slow — run on merge to main, not on every PR
+```
The key is not attempting to make shared runners do what they structurally cannot. The design error is a single pipeline trying to run everything in the same environment.
- From the provisioning project perspective: the same Nickel playbooks that deploy production infrastructure should deploy CI runners. A workflows/ci-runner-setup.ncl that provisions
- a VM on UpCloud with nested_virtualization: true and registers the runner in Woodpecker. Direct dogfooding: CI infrastructure becomes the acceptance test for the provisioning
+ From the provisioning project perspective: the same Nickel playbooks that deploy production infrastructure should deploy CI runners.
+
+ A workflows/ci-runner-setup.ncl that provisions a VM on UpCloud with nested_virtualization: true and registers the runner in Woodpecker.
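To make the shape concrete, a sketch of what such a workflow file could declare. Every field name below is an assumption (the project's real schema isn't shown here); the point is that the nested-virtualization requirement is declared data an orchestrator can validate, not a comment:

```nickel
# Hypothetical sketch of workflows/ci-runner-setup.ncl, not the actual schema.
{
  name = "ci-runner-setup",

  # Declared, not assumed: a scheduler can refuse environments
  # lacking this capability before anything is provisioned.
  requires = { nested_virtualization = true },

  server = {
    provider = "upcloud",
    plan | String | default = "4xCPU-8GB",
    nested_virtualization = true,
  },

  runner = {
    kind = "woodpecker",
    # registration credentials come from the secrets layer, not this file
  },
}
```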
+
+ Direct dogfooding: CI infrastructure becomes the acceptance test for the provisioning
system. If you can't provision your own runners with your own tool, your tool isn't production-ready.
---
- Pre-commit timeouts: the problem isn't timeout duration — it's designing for two different audiences
+### Pre-commit timeouts: the problem isn't timeout duration — it's designing for two different audiences
Pre-commit hooks are designed for human developers committing once every few minutes. For remote agents running automated correction cycles, the contract is entirely different:
@@ -118,27 +137,34 @@
The solution isn't removing hooks or lowering quality. It's stratifying by speed and audience:
+```text
pre-commit hooks — always, <30s:
+
cargo fmt --check (~2s)
taplo fmt --check (~1s)
markdownlint (~3s)
yamllint (~1s)
pre-push hooks — on push, <3min:
+
cargo clippy -- -D warnings (~45s)
cargo test --lib (~2min)
CI only — no time limit:
+
integration tests
cargo test --all-features
LXD / VM tests
+```
+
For agents specifically, the most important improvement isn't speed — it's structured output. cargo clippy --message-format json produces errors an agent can parse and act on directly without interpreting ANSI escape codes. The difference between a hook that blocks an agent and one that guides it is whether the output is machine-readable.
A pattern that emerges naturally from working across these projects: the justfile as an indirection layer. Instead of agents executing hooks directly, they invoke just recipes that internally do the right thing per context:
-
+
+```text
# For agents — fast, structured, no color
check-agent:
cargo fmt --check
@@ -149,33 +175,39 @@
cargo fmt --check
cargo clippy -- -D warnings
cargo test
+```
The agent learns from its skill which recipe to invoke. The CI pipeline invokes check. The agent invokes check-agent. Same underlying code, separate audiences, explicit contracts.
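As a sketch of what machine-readable buys: with `--message-format json`, cargo prints one JSON object per line and wraps rustc diagnostics under `reason: "compiler-message"`. A minimal agent-side parser (Python here for brevity; the field names follow cargo's published JSON message format):

```python
import json

def parse_clippy(stream):
    """Collect actionable diagnostics from `cargo clippy --message-format json`.

    Each line of cargo's output is one JSON object; rustc diagnostics
    arrive under reason == "compiler-message".
    """
    findings = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        if msg.get("reason") != "compiler-message":
            continue
        diag = msg["message"]
        if diag["level"] not in ("warning", "error"):
            continue
        span = diag["spans"][0] if diag["spans"] else {}
        findings.append({
            "level": diag["level"],
            "lint": (diag.get("code") or {}).get("code"),
            "text": diag["message"],
            "file": span.get("file_name"),
            "line": span.get("line_start"),
        })
    return findings

# One captured line, shaped like cargo's real output (content is illustrative):
sample = [
    '{"reason":"compiler-message","message":{"level":"warning",'
    '"message":"returning the result of a `let` binding from a block",'
    '"code":{"code":"clippy::let_and_return"},'
    '"spans":[{"file_name":"src/lib.rs","line_start":12}]}}'
]
print(parse_clippy(sample))
```

An agent acting on this list knows the lint name, file, and line without touching ANSI-colored prose.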
---
- The synthesis: development infrastructure as a first-class citizen of the provisioning system
+### The synthesis: development infrastructure as a first-class citizen of the provisioning system
Both problems — VM failures and pre-commit timeouts — share a common root: CI infrastructure designed around a single type of user (human developer with spaced commits) applied to a fundamentally different context (automated agent with rapid iteration cycles). They aren't two distinct problems. They're the same problem of broken implicit contracts.
The explicit contracts that are missing:
- Maximum time per operation. Every hook, test, and build step should have an explicit timeout declared — not as a workaround but as a specification. If a step can't complete within
+ - Maximum time per operation. Every hook, test, and build step should have an explicit timeout declared — not as a workaround but as a specification. If a step can't complete within
its declared time, the step design is wrong, not the timeout.
- Guaranteed idempotency. Any operation an agent may retry must be idempotent. Pre-commit hooks generally are (format, lint). Those that aren't (tests with external state, deploys)
+ - Guaranteed idempotency. Any operation an agent may retry must be idempotent. Pre-commit hooks generally are (format, lint). Those that aren't (tests with external state, deploys)
don't belong in pre-commit.
- Declared environments, not assumed ones. A test requiring nested virtualization should fail immediately with a clear error if the environment doesn't support it — not silently with
+ - Declared environments, not assumed ones. A test requiring nested virtualization should fail immediately with a clear error if the environment doesn't support it — not silently with
In Nickel, this is expressed as an execution environment contract: the workflow schema declares requires: { nested_virtualization: true } and the orchestrator validates before executing. - The argument in full: if your provisioning system doesn't provision its own development environment and CI infrastructure, you have a system that hasn't been validated against its + The argument in full: + + > if your provisioning system doesn't provision its own development environment and CI infrastructure, + > + > you have a system that hasn't been validated against its nearest use case. - The desirable sequence across the three projects we've described: +#### The desirable sequence across the three projects we've described: provisioning/workflows/dev-environment.ncl + → VM with nested virt on UpCloud or Hetzner → Woodpecker runner registered and configured → Pre-commit hooks installed via reproducible script @@ -190,7 +222,8 @@ in CI is no different from technical debt in production. The same patterns resolve both: typed schemas, explicit dependencies, declared environments. --- - What we deferred and why it matters + +## What we deferred and why it matters Your MCP mention as a post-v1 item matches our experience. Two of our three systems have MCP servers. The third doesn't yet — and that gap means the systems can't federate via a common protocol. A workflow where the orchestrator calls TypeDialog to capture typed config, validates it against Nickel schemas, and submits it to the provisioning system is @@ -200,6 +233,9 @@ just within a single tool. --- - The core thesis — invest in primitives, not custom agents — is correct and underappreciated. The only thing we'd add: the quality of the primitive matters as much as its existence. + The core thesis — invest in primitives, not custom agents — is correct and underappreciated. The only thing we'd add: + + > the quality of the primitive matters as much as its existence. 
+ A typed SDK with domain-specific errors, a configuration language with real contracts, a skills system that stays maintained because the architecture makes the monolith costly, and CI infrastructure provisioned by the same system it validates. That's the actual target.
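The first item on that list, a typed SDK with domain-specific errors, can be sketched in a few lines of Rust. The two enum names are the ones mentioned earlier in this response; the variants' fields and Display messages are our illustration, not TypeDialog's actual API:

```rust
use std::fmt;

// Sketch only: variant names echo the response above; the shapes are assumed.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum ValidationErrorKind {
    ContractViolation { field: String, contract: String },
    OutOfRange { field: String, min: i64, max: i64 },
}

#[derive(Debug, Clone, PartialEq)]
pub enum FormParseErrorKind {
    MissingField(String),
    TypeMismatch { field: String, expected: String },
}

impl fmt::Display for FormParseErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FormParseErrorKind::MissingField(name) => write!(f, "missing field `{name}`"),
            FormParseErrorKind::TypeMismatch { field, expected } => {
                write!(f, "field `{field}` is not a {expected}")
            }
        }
    }
}

fn main() {
    // An agent branches on the variant instead of regex-parsing CLI prose:
    let err = FormParseErrorKind::MissingField("hostname".into());
    if let FormParseErrorKind::MissingField(field) = &err {
        // the agent knows exactly which field to re-prompt for
        println!("ask again for `{field}`");
    }
}
```

That is the difference between the CLI as a human interface and the library as an agent interface: the error is data to match on, not text to interpret.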