Agent Operations
Before giving your agent more access, use this checklist
A seven-question checklist for reviewing an AI agent before giving it more permission.
Issue #17
AI-assisted test repair can reduce maintenance toil, but only if teams define what may heal automatically, what requires review, and what evidence proves the test is still protecting the behavior users depend on.
AI
Most agent evals measure the clean path. Production readiness depends on the messy path: tools, time, retries, handoffs, stale state, trace evidence, and recovery.
Issue #16
If orchestration decides sequence, identity decides legitimacy: what an agent can do, for whom, under what authority, across which tenant boundary, and how operators recover when that authority breaks.
Issue #14
Use an escalation ladder, not a hype ladder: stay in plain code longer than the market wants you to, move to a workflow framework when state and recovery become real, and reach for multi-agent coordination only when the job genuinely needs it.
Issue #13
Why long-running agents turn memory design into an ops problem, and what teams should govern before background workflows become invisible operational risk.
Issue #12
A2A turns agent-to-agent communication into a distributed-systems problem, with identity, task ownership, retries, trust, and failure handling now sitting on the critical path.
Issue #11
MCP servers are becoming production dependencies for agent systems. How to inventory ownership, permissions, observability, and failure modes before they become hidden infrastructure risk.
newsletter
AI teams accumulate models faster than they build controls. How to manage model sprawl with registries, drift monitoring, rollbacks, and consolidation.
Issue #9
AI coding tools have genuinely made teams faster. The Harness 2026 State of DevOps report confirms it: AI coding adoption is up, velocity metrics are up, output is up. The same report notes that security and DevOps maturity haven't kept pace with the acceleration. More code is shipping,
Issue #8
A single agent handling predictable traffic is the easy case. Add a gateway, configure it correctly per Parts 1 and 2, and it works. The failure modes at scale are different in kind. An indirect prompt injection embedded in a document your agent was summarizing. A multi-agent workflow where
newsletter
The gap between 'I added a gateway' and 'my gateway is actually working.' Four configuration decisions that separate coverage from false confidence.