When Your Gateway Has to Hold
A single agent handling predictable traffic is the easy case. Add a gateway, configure it correctly per Parts 1 and 2, and it works. The failure modes at scale are different in kind.
An indirect prompt injection embedded in a document your agent was summarizing. A multi-agent workflow where a spawned sub-agent creates its own sub-agents, each with independent budget caps that sum to ten times what you thought you authorized. A gateway that handles 50 requests per second fine and degrades badly at 500. These are correct configurations that do not hold when the threat model includes a malicious payload or when traffic spikes in ways the single-agent setup never surfaced.
Part 3 is the production-grade installment. Security threat modeling, scaling patterns that hold under load, multi-agent topology decisions, and the architecture choices that separate a gateway that is working from one that would fail under real pressure.
The Security Threat Model
Prompt injection is the top OWASP threat for LLM applications. The direct case is manageable: input filtering and system prompt hardening address most of it. The harder case is indirect injection: malicious instructions embedded in data the agent is asked to process. A webpage your research agent scrapes. A document your support agent summarizes. The agent processes the content; the injected instruction executes as if it were part of the original task.
The gateway's role in indirect injection defense is output-layer scanning, not input filtering. When a tool call returns data from an external source, that response passes through the gateway before reaching the agent. A gateway that scans tool call outputs for injection patterns can block or sanitize them before the agent's context is poisoned.
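A minimal sketch of that output-layer hook. The pattern list and the function name `scan_tool_output` are illustrative assumptions; a production gateway would use a tuned classifier or a vendor ruleset, not a handful of regexes.

```python
import re

# Hypothetical injection signatures for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"\bsend .* to https?://", re.I),
]

def scan_tool_output(tool_name: str, output: str) -> str:
    """Inspect a tool call's response before it reaches the agent context.

    Returns the output unchanged if clean; raises if the payload looks
    like an indirect injection attempt, so the gateway can block it.
    """
    for pattern in INJECTION_PATTERNS:
        if pattern.search(output):
            raise ValueError(
                f"blocked suspicious output from tool {tool_name!r}: "
                f"matched {pattern.pattern!r}"
            )
    return output
```

The key design point is placement: the scan runs on the tool's *response*, after the external fetch and before the tokens enter the model's context.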
A useful framing from current security research: a prompt injection attack needs three conditions to be dangerous. Access to private data the agent can touch. Exposure to untrusted tokens. An exfiltration vector the agent can invoke. Eliminate any one of the three and the attack chain breaks. Scoped tool access removes the private data access. Output scanning at the gateway blocks exfiltration vectors. Treating all external content as untrusted by default addresses the middle condition.
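The three-condition framing reduces to a conjunction, which is worth making explicit because it is what justifies "eliminate any one." The function below is a trivial illustration, not part of any real gateway API.

```python
def injection_risk(private_data_access: bool,
                   untrusted_content: bool,
                   exfiltration_vector: bool) -> bool:
    """The attack chain is live only when all three conditions hold.

    Removing any single condition (scoped tools, output scanning, or
    treating external content as untrusted) breaks the chain.
    """
    return private_data_access and untrusted_content and exfiltration_vector
```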
Multi-agent infection is the newer concern. If Agent A processes an indirect injection and passes its output to Agent B as part of a normal workflow delegation, Agent B is now operating with poisoned context. Identity-based tool scoping between agents contains the spread: Agent B can only invoke the tools it is authorized to invoke, regardless of what Agent A output instructs it to do.
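A sketch of identity-based tool scoping, assuming a per-agent allowlist held at the gateway. The agent IDs and tool names are hypothetical; the point is that the check keys on the *caller's* identity, so poisoned context from an upstream agent cannot widen a downstream agent's permissions.

```python
# Hypothetical per-agent tool allowlist enforced at the gateway.
TOOL_SCOPES = {
    "agent-a": {"web_search", "summarize"},
    "agent-b": {"summarize"},  # no network or file tools
}

def authorize_tool_call(agent_id: str, tool: str) -> bool:
    """Allow a tool call only if the calling agent is scoped for it.

    Nothing in the prompt or in upstream agent output can change the
    result: authorization derives from identity, not from context.
    """
    return tool in TOOL_SCOPES.get(agent_id, set())
```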
Scaling Patterns
Async architecture matters at high RPS. Synchronous gateways that process requests sequentially hit a ceiling quickly. In agentic call chains where one agent task generates 10-20 downstream calls, latency compounds fast. Go-based async gateways maintain sub-millisecond overhead at 5,000 RPS; Python-based gateways, including LiteLLM, show P99 latency spikes under sustained load past a few hundred RPS. The horizontal scaling path: multiple instances behind a load balancer, PostgreSQL for persistent state, Redis for shared state synchronization at 10+ instances.
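Why the compounding matters can be shown in a few lines of asyncio. When the 10-20 downstream calls are dispatched concurrently rather than sequentially, total latency tracks the slowest call instead of the sum of all of them. The latency figure below is an illustrative stand-in.

```python
import asyncio

async def downstream_call(i: int) -> str:
    # Stand-in for a model or tool call; assume ~50 ms latency each.
    await asyncio.sleep(0.05)
    return f"result-{i}"

async def fan_out(n: int) -> list:
    # Concurrent dispatch: 15 calls complete in roughly 50 ms total,
    # not 15 x 50 ms as a sequential gateway would take.
    return await asyncio.gather(*(downstream_call(i) for i in range(n)))

results = asyncio.run(fan_out(15))
```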
Budget cap behavior under recursive agent spawning is a failure mode that only emerges in multi-agent systems. An agent with a $20 per day cap spawns five sub-agents, each with its own $20 per day cap. Effective daily exposure is now $120. The gateway fix is hierarchical budget controls: sub-agent spend rolls up to the parent cap rather than running independently.
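The roll-up logic can be sketched as a tree of budget nodes where a charge must clear every ancestor's cap, not just the local one. The class name and dollar values are illustrative.

```python
class BudgetNode:
    """Hierarchical budget: child spend rolls up to every ancestor cap."""

    def __init__(self, cap: float, parent=None):
        self.cap = cap
        self.spent = 0.0
        self.parent = parent

    def charge(self, amount: float) -> None:
        # First pass: reject if ANY ancestor would exceed its cap.
        node = self
        while node is not None:
            if node.spent + amount > node.cap:
                raise RuntimeError(f"budget cap exceeded at cap={node.cap}")
            node = node.parent
        # Second pass: apply the charge to self and every ancestor.
        node = self
        while node is not None:
            node.spent += amount
            node = node.parent

# The scenario from the text: a $20/day parent with five $20/day sub-agents.
root = BudgetNode(cap=20.0)
subs = [BudgetNode(cap=20.0, parent=root) for _ in range(5)]
```

With roll-up, the five sub-agents share the parent's $20 exposure instead of multiplying it to $120.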
Semantic caching returns meaningful cache hit rates in agentic systems; exact-match caching largely does not, because agents paraphrase queries constantly. In paraphrase-heavy workloads at volume, semantic matching produces significantly higher hit rates, reducing both cost and latency.
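A minimal sketch of the semantic-cache idea: key on embedding similarity instead of the raw string. The `embed` function is caller-supplied (a real deployment would use a model-based embedder); the linear scan and the 0.9 default threshold are simplifying assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache keyed on embedding similarity rather than exact string match."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # caller-supplied embedding function
        self.threshold = threshold
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]          # paraphrase close enough: cache hit
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

An exact-match cache would miss every paraphrase; this one hits whenever the query's embedding lands within the similarity threshold of a stored entry.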
Multi-Agent Gateway Architecture
A shared gateway with centralized policy is the right default for most deployments. All agent traffic routes through a single instance. Unified audit trail, single policy enforcement point, simpler credential management. Start here.
Federated gateways emerge when you have concrete scaling or compliance reasons to separate. Each agent cluster has its own gateway instance; policies synchronize from a central control plane. The transition from shared to federated should be driven by a real constraint, not anticipated complexity.
Identity propagation across agent hops: the established pattern is JWT-based identity propagation where each token carries the agent identity chain. The result is that a single log entry reconstructs the full agent call graph: Tool X was invoked by Agent C, delegated to by Agent B, delegated to by Agent A, originating from Request ID 12345.
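A stdlib-only sketch of the pattern, assuming a shared gateway signing key and an `agent_chain` claim; the claim name, secret, and HS256 construction here are illustrative, not a spec, and a real deployment would use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # assumption: gateway-held signing key

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_token(claims: dict) -> str:
    """Minimal HMAC-signed JWT (HS256) for illustration only."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def delegate(claims: dict, child_agent: str) -> dict:
    """Append to the identity chain on each delegation hop."""
    chain = claims.get("agent_chain", [])
    return {**claims, "agent_chain": chain + [child_agent]}

# Request ID 12345 flows A -> B -> C; each hop extends the chain.
claims = {"request_id": "12345", "agent_chain": ["agent-a"]}
claims = delegate(claims, "agent-b")
claims = delegate(claims, "agent-c")
token = mint_token(claims)
```

Logging the `agent_chain` claim at each tool invocation is what lets a single log entry reconstruct the full call graph.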
Builder Tip
Add a circuit breaker alongside your budget cap. A budget cap stops a runaway agent after it has spent too much. A circuit breaker stops it while the situation is still recoverable: ten consecutive errors trigger a pause, sustained latency above a threshold triggers a backoff, and an error rate above a threshold in a five-minute window triggers an alert before the cascade reaches the budget wall. Configure it in the same session you configure the budget cap.
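The consecutive-error part of that tip can be sketched in a few lines. Thresholds and the cooldown value are illustrative; the latency and error-rate-window triggers described above would be additional conditions on the same state machine.

```python
import time

class CircuitBreaker:
    """Trip after N consecutive errors; pause traffic until a cooldown passes."""

    def __init__(self, max_consecutive_errors=10, cooldown_seconds=30.0):
        self.max_errors = max_consecutive_errors
        self.cooldown = cooldown_seconds
        self.consecutive_errors = 0
        self.opened_at = None  # None means the circuit is closed (traffic flows)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let a probe request through after the cooldown.
            self.opened_at = None
            self.consecutive_errors = 0
            return True
        return False

    def record_success(self) -> None:
        self.consecutive_errors = 0

    def record_error(self) -> None:
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.max_errors:
            self.opened_at = time.monotonic()  # trip: pause the agent
```

The gateway calls `allow()` before forwarding each request and `record_success()` / `record_error()` after, so a failing agent is paused within seconds instead of burning toward its budget cap.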
Quick Hits
agentgateway.dev shipped v1.0 in March 2026, hosted under the Linux Foundation. It is a Rust-based gateway that handles LLM routing, MCP interception, and A2A communication in a single deployment, with the CEL policy engine for fine-grained tool access control and full OpenTelemetry integration. (agentgateway.dev)
Kong AI Gateway 3.13, released January 2026, added MCP Tool ACLs: granular authorization at the gateway layer, identity-based tool filtering, and default-deny policies for MCP servers.
Cedar is the policy language appearing in MCP gateway implementations that need auditable, deterministic access control. An open-source policy language from AWS, it supports both RBAC and ABAC, and it is shipping in ToolHive and Amazon Bedrock AgentCore Gateway as of early 2026. (cedarpolicy.com)
The lethal trifecta for prompt injection: private data access plus untrusted token exposure plus exfiltration vector. If all three conditions exist in your architecture, an indirect injection is a practical threat, not a theoretical one.
Gradient Push covers AI and automation for builders. If this was useful, pass it on.
Enjoying this? Subscribe to Gradient Push for practical AI and automation breakdowns — gradientpush.com