Agentic Coding: What's Actually Production-Ready

The productivity gains are real. They're also narrower, weirder, and more conditional than the marketing wants you to believe.


26.9% of all production code is now AI-authored. And there's a version of agentic coding you see in demos: you type a description, the AI spins up, reads your entire codebase, writes a feature across a dozen files, runs the tests, and ships a PR. It looks like magic. It's also completely unlike what most developers experience when they try it on Monday morning.


That gap is worth examining honestly.


The Numbers

By February 2026, AI-authored code makes up 26.9% of all production code, up from 22% the prior quarter, per DX research from the Pragmatic Summit surveying 121,000 developers across 450+ companies. About 75% of those developers use AI coding assistants weekly. GitHub's controlled study (95 developers, randomized, all tasked with implementing an HTTP server in JavaScript) showed the Copilot group finishing in 71 minutes versus 161 for the control. 55% faster on a well-defined, greenfield task.

Then there's the METR study. In July 2025, researchers at Model Evaluation & Threat Research put 16 experienced open-source developers on 246 real issues from their own familiar codebases, using Cursor Pro with Claude 3.5/3.7 Sonnet. Tasks completed with AI assistance took 19% longer than tasks without it. And the developers believed the AI had made them 20–24% faster. (arxiv.org/abs/2507.09089)

Those two data points tell the whole story. Agents accelerate routine, well-scoped work and slow you down on everything else. Not because the AI is bad, but because prompting, reviewing, and fixing AI output is real work that has to go somewhere.

Where They Actually Deliver

Give an agent a data model and a target framework and it'll scaffold routes, controllers, validation, and tests faster than you can type. Tight scope, obvious success criteria. Hard to go wrong. That's where the GitHub 55% stat lives: a bounded, greenfield task with clear inputs and outputs.

Test generation is legitimately good when the agent has indexed your codebase. It writes against actual domain logic, not just stubs. The VS Code team used this, along with AI-assisted triage, PR descriptions, and release notes, to go from monthly to weekly releases. They didn't get there by generating product code in bulk.

Refactoring with a precise spec works too. "Rename UserRecord to UserProfile across the codebase, update all imports, fix the tests." Bounded, unambiguous, no judgment calls. It either works or you reject the diff in two minutes.
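For a sense of why that kind of rename is safe to delegate, here's roughly what the mechanical core looks like (a sketch; the function names are mine, not any tool's API): whole-word substitution, so `UserRecord` never matches inside a longer identifier.

```python
import re
from pathlib import Path


def rename_identifier(source: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of `old` with `new` in one file's text."""
    # \b word boundaries keep UserRecord from matching inside e.g. UserRecordSet.
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


def rename_in_tree(root: Path, old: str, new: str, suffix: str = ".py") -> int:
    """Apply the rename across a source tree; returns the number of files changed."""
    changed = 0
    for path in root.rglob(f"*{suffix}"):
        text = path.read_text()
        updated = rename_identifier(text, old, new)
        if updated != text:
            path.write_text(updated)
            changed += 1
    return changed
```

Either the diff is exactly the rename you asked for, or you reject it. No judgment calls means nothing to get subtly wrong.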

Cursor's whole-repo indexing earns its keep on unfamiliar codebases. You can ask how auth works and get an accurate answer grounded in the actual code, not a hallucination. Cuts hours off the first week. Useful even if you use nothing else.

Where Things Break

Context rot is the most insidious failure mode. LLMs don't process every token equally. What's in the middle of a long context gets less attention than the beginning and end. In extended sessions, agents start forgetting architectural decisions they made earlier. They re-implement things that already exist. They cycle. The longer the session runs without a reset, the worse it gets.
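One mitigation teams use is periodic compaction: keep the earliest turns (where the architectural decisions usually live) and the latest turns, and summarize the lossy middle. A minimal sketch, with a count placeholder standing in for a real LLM-generated summary:

```python
def compact_history(messages: list[str], head: int = 2, tail: int = 4) -> list[str]:
    """Keep the earliest and latest turns; collapse the middle of the window.

    Attention degrades most in the middle of a long context, so that's
    the region to summarize first. Here a marker stands in for a summary.
    """
    if len(messages) <= head + tail:
        return list(messages)
    dropped = len(messages) - head - tail
    marker = f"[{dropped} earlier turns summarized]"
    return messages[:head] + [marker] + messages[-tail:]
```

The same logic applies manually: reset the session and restate the decisions that matter, rather than letting them sink into the middle of the window.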

Open-ended tasks fall apart fast. "Build the payments feature" produces something that compiles and entirely misses the business logic, because that lives in your head, not the repo. The demos that look open-ended are almost always tightly scoped tasks in disguise.

Multi-repo work is a hard constraint, not a roadmap item. GitHub Copilot Workspace can't make changes across multiple repos in a single run. A feature that touches a service, a shared library, and an API client (which describes most non-trivial work) requires you to stitch agent invocations together manually.

Legacy codebases are rough. Non-standard frameworks, mixed paradigms, tribal knowledge that lives nowhere in the code. Agents produce something confident and wrong.

The defect data is consistent with all of this. A 2025 CodeRabbit analysis of 470 open-source PRs found AI-generated code has 1.7x more defects: 10.83 issues per PR versus 6.45 for human-written code, with logic errors 75% more common. Veracode's 2025 GenAI Code Security Report tested over 100 LLMs on 80 coding tasks and found models picked the insecure implementation 45% of the time when given the choice. Review AI-generated code harder than you'd review a senior engineer's pull request, not as a formality, but because the error rate actually warrants it.

The Mental Model That Holds Up

Agentic coding tools are powerful interns with perfect memory and no judgment. Fast, tireless, and they will implement exactly what you describe. Sometimes correctly.

The teams getting real productivity out of this have small, well-defined tasks, diff review before execution, and explicit approval gates built into the workflow. The teams burning hours are the ones who handed an agent an ambiguous brief and walked away.

Tool Spotlight: Cline

Cline gets less attention than Cursor or Copilot. For production workflows, that's backwards.

It's open-source, runs inside VS Code, and is model-agnostic. Bring your own LLM: Claude, GPT, Gemini, a local model on your own hardware. That matters if you have data residency requirements, or if you'd rather not have your codebase indexed on someone else's infrastructure.

What makes it work in production is the plan-and-act loop: it reads your codebase, drafts a structured plan, shows proposed changes as diffs, and stops. It won't run a terminal command or write a file until you approve. Feels like friction until the first time it catches a wrong assumption before it cascades through ten files.
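The gate itself is a simple pattern, whatever the tool. A sketch (the names are hypothetical, not Cline's actual API): every proposed change passes through an approval callable before anything touches disk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedChange:
    path: str
    diff: str  # unified diff the agent wants to apply


def run_with_gate(changes: list[ProposedChange],
                  approve: Callable[[ProposedChange], bool]) -> list[str]:
    """Apply proposed changes only after explicit approval; return applied paths."""
    applied = []
    for change in changes:
        if approve(change):
            # A real tool would patch the file here; the gate is the point.
            applied.append(change.path)
    return applied
```

An interactive `approve` is just a prompt showing the diff; a policy `approve` can auto-reject anything outside an allowlisted directory.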

A Netflix engineering lead reportedly switched from JetBrains after 25 years specifically because of this. Not for capability. For controllability.

Quick Hits

OpenAI quietly made Codex more interesting. The app now runs in fully isolated environments with parallel agent threads. It's built for tasks you kick off and check back on hours later, not something you hover over. Different from Copilot Workspace in a meaningful way: longer-horizon autonomous runs, human PR review before merge.

Moonshot AI's Kimi-Dev-72B just hit 60.4% on SWE-bench Verified, which would've been a notable score for a commercial model six months ago. It patches real repos autonomously in Docker, built on Qwen2.5-72B, trained via RL that only rewards passing the full test suite. Weights on Hugging Face; details at moonshotai.github.io/Kimi-Dev.

Worth reading: Microsoft's March 2026 post-mortem on how the VS Code team uses AI internally. The weekly shipping cadence didn't come from AI-generated features. It came from AI handling triage, PR descriptions, test gap analysis, and release notes. More replicable than the headline implies.

Senior engineers keep posting the same postmortem. Agents used on complex features. AI-generated technical debt stacks up. Someone spends weeks cleaning it up. The pattern is consistent enough now that it's not a cautionary tale anymore. It's a workflow failure with a known fix.

Builder Tip

Set an approval gate before any agentic execution. Your tool should require confirmation before writing files or running terminal commands. Cline does this by default; Cursor needs you to set it up. Ten seconds to review a diff. Two hours to revert a bad run that touched a dozen files.

Gradient Push covers AI and automation for builders. If this was useful, pass it on.

Enjoying this? Subscribe to Gradient Push for practical AI and automation breakdowns — gradientpush.com