Your Gateway Is Configured. Is It Working?

The gap between "I added a gateway" and "my gateway is actually working." Four configuration decisions that separate coverage from false confidence.


Adding a gateway to your agent stack is the easy part. You point the agent's API calls through it, the dashboard shows traffic, and it feels like governance. Most of the time, it's not. The gap between "I added a gateway" and "my gateway is actually doing what I think it is" is where production incidents happen: a budget cap that doesn't stop a runaway sub-agent, logs that capture everything but answer nothing, failover that kicks in and produces incoherent output because no one thought about state.

This is Part 2 of the Agent Gateways series. Part 1 covered what gateways are and the landscape. This issue is about the configuration decisions that actually make a gateway useful, and the tests that tell you if yours is.

The Four Critical Configuration Decisions

Ordered by how costly the mistake is.

Budget cap granularity is first, because the mistake here has the most immediate cost consequence. A project-level cap is not the same as a per-agent per-day cap. Here's the failure mode: five agents share a project budget of $100/day. One sub-agent hits an error condition and starts retrying in a loop. It exhausts the $100 in twenty minutes. The other four agents get throttled. Nothing failed from the gateway's perspective. The policy was enforced correctly. The policy was just wrong.

The fix is agent-level granularity. In LiteLLM, generate a separate virtual key per agent and set max_budget and budget_duration on each key. A research agent gets $5/day. A customer-facing agent gets $20/day. The loop agent exhausts its own budget and gets stopped; the others keep running. In Bifrost, the same pattern is available in the open-source tier through virtual key management with independent budget controls per key.
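A minimal sketch of the per-agent key pattern against a LiteLLM proxy's `/key/generate` endpoint. The proxy URL and master key are placeholders, and the budget figures are the examples from the text; treat this as the shape of the call, not copy-paste config.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000"  # assumption: LiteLLM proxy running locally
MASTER_KEY = "sk-master-key"         # assumption: your proxy admin key

def agent_key_payload(agent_id: str, max_budget: float, budget_duration: str) -> dict:
    """Budget caps live on the key itself, so one runaway agent
    exhausts only its own allowance, not the shared project budget."""
    return {
        "key_alias": agent_id,
        "max_budget": max_budget,            # USD cap for this key
        "budget_duration": budget_duration,  # e.g. "1d" resets daily
        "metadata": {"agent_id": agent_id},
    }

def create_agent_key(agent_id: str, max_budget: float, budget_duration: str = "1d") -> str:
    req = urllib.request.Request(
        f"{PROXY_URL}/key/generate",
        data=json.dumps(agent_key_payload(agent_id, max_budget, budget_duration)).encode(),
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["key"]

# create_agent_key("research-agent", 5.0)          # $5/day
# create_agent_key("customer-facing-agent", 20.0)  # $20/day
```

Each agent then authenticates to the gateway with its own key, and the gateway enforces the cap per key.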

The second piece is the reset window. Most gateways default to resetting budgets at midnight UTC. If your agent runs out at 11pm, it gets throttled for an hour and you find out the next morning when nothing ran overnight. Set the reset window to when you're actually watching.

Logging schema design is second. Turning logging on is not the same as having useful logs. The default behavior of most gateways is to capture request/response pairs as blobs: the full prompt, the full completion, a timestamp. You can store it, but you can't query it. When you're debugging a production incident, "search the blob" is not a workflow.

Structured logging means each event gets tagged at capture time: agent ID, task ID, tool name, token count, latency in milliseconds, outcome (success, error, blocked by policy). With that schema, the incident query is a 30-second filter. Without it, you're grepping unstructured logs and hoping.
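The schema described above can be pinned down as a typed event; the field names here are illustrative, not any specific gateway's format.

```python
from dataclasses import dataclass, field
import time

@dataclass
class GatewayEvent:
    """One log entry, tagged at capture time rather than stored as a blob."""
    agent_id: str
    task_id: str
    tool_name: str   # "" for plain LLM calls
    tokens: int
    latency_ms: int
    outcome: str     # "success" | "error" | "blocked_by_policy"
    ts: float = field(default_factory=time.time)

def incident_filter(events, agent_id, outcome):
    """The 30-second incident query: a field filter, not a blob search."""
    return [e for e in events if e.agent_id == agent_id and e.outcome == outcome]
```

With blob logs, the equivalent query means regexing prompt text and guessing which agent produced it.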

Build the schema before you have 500,000 unstructured entries. Retrofitting structure onto existing logs is expensive and often incomplete.

Failover logic and statefulness is third, and it's where the most subtle misconfiguration lives. Gateways handle failover at the network layer: if the primary provider returns 5xx errors, the gateway reroutes to a backup. For a single LLM call, this works cleanly. For agentic workflows with accumulated context, it doesn't automatically.

The problem: an agent has been running for two minutes. It's accumulated five tool call results, made three planning decisions, and is mid-execution on step seven. The primary provider goes down. The gateway fails over to a backup. What context does the backup receive? Whatever the application sends it. If the application only sends the current request, the backup gets partial context and produces incoherent output. The agent doesn't fail loudly; it produces a plausible-looking wrong answer.

Two patterns for handling this correctly. First: checkpointing. Before each major step in a long task, serialize the agent's state to a durable store. LangGraph does this natively through its checkpointing system. If a failure occurs mid-task, the workflow resumes from the last checkpoint with full reconstructed context. Second: design tasks to be atomic enough that retrying from scratch is acceptable. Short, self-contained tasks with idempotent operations don't need checkpoints; they just retry.

What the gateway doesn't do is solve state for you. It reroutes the call. State is your problem.

Auth boundary placement is last, and the error here is structural. Where you put the gateway determines what it sees. A gateway sitting between your application and the LLM provider intercepts model calls but does not see direct tool calls your agent makes outside that path. If your agent calls an external API directly without routing through the gateway, those calls are invisible to your policy enforcement and audit logs.

The alternative is placing the gateway at the tool call layer. This is where MCP changes things materially: every tool call is a JSON-RPC request to an MCP server, which means a gateway can sit in front of all of them with a single integration point. If you're building on MCP, you can get complete call graph coverage in a way that was significantly harder to achieve with custom integrations.

Draw the call graph before touching auth config. Map every path between your agent and external systems. Confirm where the gateway actually sits on each path.

The MCP Integration Pattern

The standard MCP wiring puts the gateway between the agent and its tool servers, not between the agent and the LLM.

When an agent session starts, it sends a tools/list JSON-RPC request to discover what's available. A gateway in this position can filter the response based on the agent's identity, returning a scoped-down tool set to a customer-facing agent and the full set to a privileged internal one. The agent only knows about the tools it's been shown. It can't request or invoke tools outside that set. This is least-privilege architecture for agents, enforced at the gateway rather than in the system prompt.
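A sketch of identity-based tools/list filtering. The response shape follows MCP's tools/list result; the policy table and tool names are hypothetical.

```python
# Hypothetical per-agent allow-lists; in practice this comes from your policy store.
TOOL_POLICY = {
    "support-agent":  {"crm_lookup", "kb_search"},
    "internal-agent": {"crm_lookup", "kb_search", "crm_update", "billing_adjust"},
}

def filter_tools_list(agent_id: str, upstream_response: dict) -> dict:
    """Return a scoped-down tools/list result. The agent never learns
    that the hidden tools exist, so it cannot request them."""
    allowed = TOOL_POLICY.get(agent_id, set())
    tools = [t for t in upstream_response["result"]["tools"] if t["name"] in allowed]
    return {
        **upstream_response,
        "result": {**upstream_response["result"], "tools": tools},
    }
```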

When the agent invokes a tool, it sends a tools/call request with the tool name and arguments. The gateway intercepts, validates against policy, and either passes it to the MCP server or blocks it. Concrete example: a customer support agent has read access to CRM data but no write access. The gateway allows crm_lookup and blocks crm_update. A prompt injection attack that tries to override this restriction can't, because the restriction isn't in the prompt.
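The tools/call check can be sketched the same way, assuming a hypothetical read-only allow-list for the CRM example in the text. The JSON-RPC error code is an illustrative server-defined value.

```python
# Hypothetical policy: support agents may only call read tools.
READ_ONLY_TOOLS = {"support-agent": {"crm_lookup"}}

def authorize_call(agent_id: str, request: dict) -> dict:
    """Allow or block a JSON-RPC tools/call before it reaches the MCP
    server. The restriction lives outside the prompt, so a prompt
    injection attack has nothing to override."""
    tool = request["params"]["name"]
    if tool in READ_ONLY_TOOLS.get(agent_id, set()):
        return {"allowed": True, "tool": tool}
    return {
        "allowed": False,
        "tool": tool,
        "error": {"code": -32001, "message": f"{tool} blocked by policy"},
    }
```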

This pattern requires your gateway to support MCP interception. Kong has early support; LiteLLM is building toward it; Bifrost's roadmap includes it. Check current documentation before relying on this for production.

Testing Your Gateway Config

Three tests before trusting any gateway config in production.

Set your budget cap to $0.01 and run the agent. Confirm two things: that the right agent gets throttled (not all agents on the account), and that the error is handleable rather than a silent hang. You want a clean error your application can catch and log. If the agent just stops with no observable error, your error handling needs work. Restore the real cap after.

After 10 agent runs, try to answer this question in under 30 seconds: all tool calls made by agent X in the last hour, grouped by tool name. If you can't answer it, your logging schema needs work before you're in production.
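The query itself, as code, assuming events use the structured schema this section argues for (field names illustrative):

```python
from collections import Counter
import time

def tool_calls_last_hour(events, agent_id, now=None):
    """All tool calls made by one agent in the last hour, grouped by
    tool name. With tagged fields this is a filter; with blobs it isn't."""
    now = now or time.time()
    cutoff = now - 3600
    return Counter(
        e["tool_name"]
        for e in events
        if e["agent_id"] == agent_id and e["ts"] >= cutoff and e["tool_name"]
    )
```

If expressing this against your logging backend takes more than 30 seconds, the schema is the problem.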

Point the gateway at a mock provider returning 500s and confirm what happens. Does failover trigger? Does the agent recover cleanly, or does it produce output with corrupted state, or does it hang silently? The answer tells you which statefulness pattern you need to implement.
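The failover logic under test can be sketched with providers modeled as callables, so the behavior is observable without a network. This is the gateway's half of the job; note what it does not do.

```python
class ProviderDown(Exception):
    """Stand-in for a 5xx from the primary provider."""

def route_with_failover(primary, backup, request):
    """Gateway-style rerouting: try the primary, fall back on failure.
    Note the limitation: the backup receives only `request` — never the
    agent's accumulated state. That part is the application's problem."""
    try:
        return primary(request)
    except ProviderDown:
        return backup(request)
```

If your real test with a mock 500-server shows output that looks plausible but wrong, you've hit the partial-context failure mode and need checkpointing.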

Builder Tip

Log the tool call, not just the LLM call.

Most default gateway configurations capture prompt-completion pairs. That gives you a view of what the model said, but the interesting failures are usually in what happened around it: which tool was called, what parameters were passed, what the tool returned, how long it took. A 30-second tool call that always returns an empty result is a different problem than a tool call that returns wrong data. You can't distinguish them from LLM logs alone.

Configure tool call events as first-class log entries with their own schema. Most gateways support this with custom event types or middleware hooks. Do it at setup, not after the first incident that requires it.

Quick Hits

LiteLLM enforces virtual key limits first. Set the budget cap on the key itself, not just at the team or user level. Budget precedence across team and user levels has edge cases; check current docs before relying on team- or user-level caps to stop key-based calls.

The budget_duration field in LiteLLM and Bifrost is not just "daily" or "monthly." You can set it to hours ("30h") or specific day counts ("7d", "30d"). If your agent does burst work on weekdays and nothing on weekends, a 5-day window is more accurate than a 7-day one.

Most gateway dashboards show aggregate token spend by day. That number is useful for billing; it's not useful for debugging. The query you'll need in an incident is per-agent, per-task, per-hour. Confirm your logging backend supports that granularity before you need it.

The MCP tools/list filtering pattern is an emerging practice, not yet a standard configuration option in most gateways. Kong and some dedicated MCP gateways support it today. For most teams, the practical path for now is enforcing tool access at the MCP server level rather than the gateway level.

[Part 1: The Control Plane Your Agent Is Missing](/agent-gateways-part-1/) | [Part 3: When Your Gateway Has to Hold](/when-your-gateway-has-to-hold/)

Enjoying this? Subscribe to Gradient Push for practical AI and automation breakdowns — gradientpush.com