AI-Generated Code ≠ Safe Code

AI coding tools have genuinely made teams faster. The Harness 2026 State of DevOps report confirms it: AI coding adoption is up, velocity metrics are up, output is up. The same report notes that security and DevOps maturity haven't kept pace with the acceleration. More code is shipping, faster, with a higher proportion of AI-generated content that hasn't gone through the security review that human-written code usually gets.

The vulnerability data tells us what that content contains.

Veracode's Spring 2026 GenAI Code Security Report, the most comprehensive longitudinal study to date across 150+ models, finds that 45% of AI-generated code contains known security vulnerabilities when no security guidance is explicitly provided. Syntax correctness now exceeds 95%. Security pass rate has stayed flat near 55% across every model generation since 2023. The models have gotten excellent at writing code that compiles. They have not gotten better at writing code that's safe.

The argument about whether to use AI coding tools is over. This issue is about building the security layer that most AI-assisted workflows are missing.

The Failure Pattern

The vulnerability profile isn't random. Four categories account for most of the problem, and they're structural.

Missing input validation is the most consistent failure. Models generate code that processes user input without sanitization unless you explicitly ask for it. SSRF (server-side request forgery) led the February 2026 appsecsanta.com study of 534 code samples across six models, with 32 confirmed findings out of 175 total — LLMs routinely generate code that fetches user-supplied URLs without validation or allowlisting. SQL injection and command injection were close behind.
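The allowlisting fix for the SSRF case is small. A minimal sketch, assuming your service only ever needs to reach a fixed set of hosts (the hostnames below are placeholders):

```python
from urllib.parse import urlparse

# Hypothetical allowlist -- replace with the hosts your service actually fetches from.
ALLOWED_HOSTS = {"api.internal.example.com", "cdn.example.com"}

def validate_outbound_url(raw_url: str) -> str:
    """Reject user-supplied URLs unless the scheme and host are explicitly allowed."""
    parsed = urlparse(raw_url)
    if parsed.scheme != "https":
        raise ValueError(f"disallowed scheme: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not on allowlist: {parsed.hostname!r}")
    return raw_url
```

The point is the shape, not the specifics: the check happens before any request is made, and the default is deny. An AI-generated fetch helper almost never includes this step unless the prompt demands it.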

Authentication failures show up in AI-scaffolded applications with regularity: backend endpoints exposed with no authentication by default, hard-coded credentials, and missing session management. The model generates a working application. "Working" and "secured" are different properties, and the model optimizes for the former.

Dependency sprawl isn't a vulnerability class by itself, but it creates attack surface. A simple prompt generates a solution with five imported packages; a more thoughtful implementation might use two. More dependencies mean more supply chain exposure. The model solves the problem in front of it without modeling the supply chain risk of its choices.

Language-specific failure rates are stark. Veracode's Spring 2026 data: Java passes security checks only 29% of the time, compared to Python at 62%. The hypothesis is that models are trained heavily on legacy Java code that predates modern security frameworks. If your stack is Java-heavy, your AI-generated code has a higher-than-average vulnerability rate by default.

The CVE data sharpens this from benchmark to production reality. Georgia Tech's Vibe Security Radar project, which tracks vulnerabilities in public advisories attributable to AI-generated code, recorded 35 new CVEs in March 2026 alone, up from 15 in February and 6 in January. Claude Code was identified as the generating tool for 27 of the 35 March CVEs based on commit co-author signatures. Total tracked since May 2025: 74 CVEs across 43,849 advisories analyzed. Researchers note these numbers are likely a lower bound — many AI-generated vulnerabilities never get publicly attributed as such.

This is not a reason to stop using AI coding tools. It's a reason to build the security layer that most teams haven't.

The Secure Pipeline

Four components, in order of implementation effort.

Explicit security prompting requires the least effort for measurable gain. The appsecsanta.com study found 25.1% of samples contained confirmed vulnerabilities without security-specific instructions. Studies consistently show that prompting explicitly for secure coding practices materially reduces that rate. The prompts don't need to be elaborate: "follow OWASP Top 10 guidelines," "validate all user-supplied inputs," "use currently recommended cryptographic standards rather than examples from documentation." Add this to your team's standard prompt template and the baseline improves before you've changed any tooling.
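One way to make this stick is to centralize the preamble so nobody types it from memory. A minimal sketch — the wording below is illustrative, not a standard:

```python
# Illustrative security baseline, prepended to every code-generation request.
# Teams should adapt the wording to their own stack and policies.
SECURITY_PREAMBLE = (
    "Follow OWASP Top 10 guidelines. "
    "Validate and sanitize all user-supplied inputs. "
    "Use currently recommended cryptographic standards, not examples from old documentation. "
    "Never hard-code credentials; read secrets from the environment or a secrets manager."
)

def harden_prompt(task: str) -> str:
    """Prepend the team's security baseline to a code-generation task."""
    return f"{SECURITY_PREAMBLE}\n\nTask: {task}"
```

Wiring this into whatever wrapper your team uses to call the model means the baseline applies by default instead of depending on individual discipline.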

SAST before merge is the gate that catches what prompting doesn't. Static Application Security Testing on every AI-generated PR, before it hits main. Snyk, Semgrep, and SonarQube all have AI-specific capabilities now: Snyk's AI Security Fabric (February 2026) includes 60-second setup flows for Claude Code and Gemini CLI; Semgrep Multimodal (March 2026) combines rule-based analysis with AI reasoning and reports 8x more true positives and 50% fewer false positives compared to base foundation models; SonarQube's AI Code Assurance applies separate quality gates specifically for AI-generated code. The practical consideration: if your SAST is configured for human-written code and producing high false positive rates on AI output, tune a separate ruleset rather than disabling it. High false positive rates are what cause teams to bypass the gate entirely.
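As a CI gate, this can be as simple as running the scanner with findings treated as build failures. A sketch using Semgrep's CLI (the ruleset path is a hypothetical name for your AI-specific config):

```python
import subprocess
import sys

def build_sast_command(ruleset: str, target: str) -> list[str]:
    """Semgrep invocation for a merge gate. --error makes findings fail the build."""
    return ["semgrep", "scan", "--config", ruleset, "--error", "--json", target]

def gate(ruleset: str = "ci/semgrep-ai-generated.yml", target: str = ".") -> None:
    """Run the scan and block the merge on a nonzero exit code."""
    result = subprocess.run(build_sast_command(ruleset, target))
    if result.returncode != 0:
        sys.exit("SAST gate failed: fix findings before merge")

if __name__ == "__main__":
    gate()
```

Keeping the AI-specific ruleset in its own file is what makes the "tune a separate ruleset" advice practical: you can tighten it for AI-generated PRs without raising false positives on the rest of the codebase.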

Dependency scanning addresses the supply chain exposure that AI-generated code creates. Generate an SBOM (Software Bill of Materials) at build time and scan it against current CVE databases. Syft for SBOM generation, Grype or Trivy for scanning (both support CycloneDX 1.7 and SPDX 3.0 formats). This doesn't have to be a new process: if you're already generating SBOMs for human-written code, extend the same process to AI-generated PRs. The difference is that AI-generated code tends to introduce dependency sprawl that human reviewers wouldn't accept, so the scan catches things that wouldn't have been in the codebase before.
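The scanning step reduces to a join between the SBOM's component list and an advisory feed. A toy sketch over CycloneDX JSON — in practice Grype or Trivy does this against a live CVE database, and the advisory map below is fabricated for illustration:

```python
import json

# Hypothetical advisory map: package name -> known-vulnerable versions.
# Real scanners pull this from live CVE databases.
KNOWN_BAD = {"lodash": {"4.17.20"}, "log4j-core": {"2.14.1"}}

def flag_sbom(sbom_json: str) -> list[str]:
    """Return 'name@version' for every CycloneDX component with a known advisory."""
    sbom = json.loads(sbom_json)
    hits = []
    for comp in sbom.get("components", []):
        if comp.get("version") in KNOWN_BAD.get(comp.get("name"), set()):
            hits.append(f"{comp['name']}@{comp['version']}")
    return hits
```

Because the SBOM is just structured data, the same artifact feeds license checks and dependency-count tracking — useful for spotting the sprawl pattern described earlier.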

Hard-coded secrets are the most common authentication failure in AI-generated code, and the fix isn't reviewing for them after generation (though do that too with git-secrets or truffleHog). The architectural fix is making static credentials structurally impossible: OIDC-based identity for CI/CD pipelines, secrets management via Vault, AWS Secrets Manager, or GCP Secret Manager, no static credentials anywhere in the codebase. The AI can't hard-code a secret that doesn't exist as a static value. This is the category where the right answer is removing the possibility, not adding a detection step.
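Both halves of that advice fit in a few lines. The patterns below are illustrative only — real scanners like git-secrets and truffleHog ship far more comprehensive rules plus entropy checks — and the environment-lookup helper shows the structural shape, not a specific library's API:

```python
import os
import re

# Illustrative detection patterns only -- not a substitute for a real scanner.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                               # AWS access key ID
    re.compile(r"(?i)(api[_-]?key|password)\s*=\s*['\"][^'\"]+['\"]"),
]

def find_hardcoded_secrets(diff_text: str) -> list[str]:
    """Flag lines in a diff that look like static credentials."""
    return [line for line in diff_text.splitlines()
            if any(p.search(line) for p in SECRET_PATTERNS)]

def get_secret(name: str) -> str:
    """The structural fix: credentials come from the environment (populated by
    Vault, AWS Secrets Manager, or OIDC at deploy time), never from source."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name!r} not provided by the environment")
    return value
```

The second function is the one that matters: once every credential is an environment lookup, there's no static value for the model to copy into generated code.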

What to Do With Existing Code

The pipeline above is for new code. If you've been shipping AI-generated code without these controls, you have a backlog.

Prioritize by exposure: internet-facing endpoints first, authenticated-but-external-facing second, internal tooling last. For each tier, run SAST against the AI-generated PRs from the last 90 days. Generate SBOMs for services that handle sensitive data or external traffic and scan for known CVEs. Run git-secrets or truffleHog against commit history for any secrets that were added and later removed. They're still in the history.

Don't try to audit everything at once. Triage by blast radius: what would an attacker access if they exploited a vulnerability in this service? Start there. A compromised internal reporting tool is a different risk profile than a compromised authentication endpoint.
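The triage order above is mechanical enough to script against a service inventory. A sketch, assuming a hypothetical inventory where each service records its exposure tier and whether it touches sensitive data:

```python
# Exposure tiers from the triage order: lower number = audit first.
TIER = {"internet-facing": 0, "external-authenticated": 1, "internal": 2}

def triage(services: list[dict]) -> list[str]:
    """Order services for backlog audit: exposure tier first, then
    sensitive-data handlers ahead of the rest within each tier."""
    ranked = sorted(
        services,
        key=lambda s: (TIER[s["exposure"]], not s["sensitive_data"]),
    )
    return [s["name"] for s in ranked]
```

Even a spreadsheet-level version of this beats auditing in commit-date order, which is what most backlogs default to.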

Builder Tip

Treat AI-generated code like code from a contractor who's excellent at getting things working and has never been briefed on security. You wouldn't merge a contractor's PR without review. The same standard applies here.

The productivity gain from AI coding tools is real, and it doesn't disappear when you add a security review step. What disappears is the false confidence that "it compiles and tests pass" means "it's safe to ship." Those two things have always been different. AI coding tools just make the gap more visible at scale.

Quick Hits

The appsecsanta.com February 2026 study across six models found a 10.1-point spread between the safest (GPT-5.2 at 19.1% vulnerable) and least safe (Claude Opus 4.6, DeepSeek V3, Llama 4 Maverick, all at 29.2%) when given identical prompts with no security guidance. Model choice has a measurable effect on your baseline vulnerability rate before you do anything else.

CrowdStrike's November 2025 research on DeepSeek R1 found that prompts containing terms the model treats as politically sensitive (Tibet, Uyghurs, Falun Gong) increased the probability of severe security vulnerabilities in generated code by up to 50%. One example: framing a request as "an industrial control system based in Tibet" pushed the severe vulnerability rate from 19% to 27.2%. The mechanism appears to be censorship-trained avoidance triggering fallback to insecure default code paths. If DeepSeek is in your toolchain, this is worth knowing about.

XSS is the failure mode Veracode's data highlights most sharply: only 15% of AI-generated code passes XSS security checks. That's an 85% failure rate on a vulnerability class that's been well-understood for two decades. The models know what XSS is; they're not generating secure output for it by default.
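The defense is equally well-understood: escape user-supplied text before it reaches HTML. A minimal stdlib sketch of the pattern the models keep skipping:

```python
import html

def render_comment(user_input: str) -> str:
    """Escape user-supplied text before interpolating into HTML.
    Template engines with autoescaping (Jinja2, React JSX) do this by
    default -- raw string building, which models often generate, does not."""
    return f"<p>{html.escape(user_input)}</p>"
```

The asymmetry is worth noting: autoescaping frameworks make this failure hard to commit, while f-string HTML construction makes it the default. Which pattern the model emits depends heavily on what the prompt and surrounding codebase look like.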

Semgrep Multimodal's March 2026 launch is worth tracking. The combination of rule-based static analysis with AI reasoning for triage and auto-fix is a different architecture than traditional SAST. If you've been burned by false positive rates on current SAST tooling for AI-generated code, this is the approach designed to address that specifically.

Gradient Push covers AI and automation for builders. If this was useful, pass it on.

Enjoying this? Subscribe to Gradient Push for practical AI and automation breakdowns — gradientpush.com