Your Agent Evals Are Lying to You

Most agent evals measure the clean path. Production readiness depends on the messy path: tools, time, retries, handoffs, stale state, trace evidence, and recovery.

Your agent can pass the eval suite and still be unsafe to trust.

That does not mean evals are fake. It means many teams are evaluating the part of the system that is easiest to replay: a prompt, a model response, a reference answer, maybe a preference score. That is useful. It is also too narrow to tell you whether a long-running, tool-using agent will survive production.

Production failures usually do not arrive as neat wrong answers. They arrive as expired auth in the middle of a CRM update. They arrive as a tool returning a subtly different schema than yesterday. They arrive as stale state after a retry, a downstream timeout that leaves partial work behind, or a handoff that technically escalates but gives no human enough context to own the next step.

If your evals only ask, “Did the agent produce the right final answer on the happy path?” they can create a dangerous kind of confidence. They validate the demo path while leaving the operating path mostly untested.

The better question is not whether the agent passed a canned test. It is whether the eval proves the system can detect, bound, and recover from the failures that matter.

Capability evals are not operating evals

A capability eval asks whether the model or agent can perform a task under known conditions. Can it classify this ticket? Can it extract the right fields? Can it produce a useful answer from supplied context? Can it choose the right tool in a straightforward scenario?

Those checks matter. OpenAI’s eval guidance frames evals as structured tests for accuracy, performance, and reliability in variable AI systems, and its agent-eval guidance points teams toward reproducible evaluation, datasets, and trace grading for workflow-level errors. That is the right direction: evals belong in the build loop.

But a production agent is more than a model output. It is a running system with tools, permissions, state, latency, retries, escalation paths, and dependencies that change over time.

An operating eval asks a different set of questions:

  • What happens when a tool call succeeds with malformed or stale data?
  • What happens when auth expires halfway through a workflow?
  • What happens when the agent retries after a timeout?
  • What happens when only part of the job completes?
  • What happens when escalation is required but ownership is ambiguous?
  • What evidence does an operator get after the failure?

That distinction matters because many agent failures happen between steps, not inside one isolated model response.

A support triage agent might classify individual tickets correctly in isolation. In production, it can still fail if the escalation queue is misconfigured, the handoff note omits the decision trail, or the assigned human cannot see which tool result the agent trusted.

A research agent might score well on answer quality. In production, it can still cite stale, malformed, or partial tool output if the eval never checks tool freshness, retrieval provenance, or whether the agent degraded safely when evidence was weak.

A background workflow might complete the happy path. In production, it can still leave a customer record half-updated when a downstream dependency times out after step four.

Those are not benchmark failures. They are operating failures.

The missing surface area

Most weak agent eval programs share the same blind spot: they compress a system into an answer.

That hides the actual surface operators have to trust.

1. Tool drift

Tool behavior changes. APIs add fields, remove fields, rename errors, alter rate limits, or return edge-case payloads that were not in the test set. A prompt-response eval may never notice because it only sees the final text.

A stronger eval checks whether the agent handles changed tool output, missing fields, empty results, duplicate records, malformed payloads, and conflicting sources. It should also test whether the agent knows when not to proceed.
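One way to sketch the drift cases above is to derive broken variants from a known-good payload and check that the agent stops on each. Everything here is illustrative: the payload shape, the `REQUIRED` field set, and `toy_agent_step` are hypothetical stand-ins for your own tool schema and agent entry point.

```python
from copy import deepcopy

# Hypothetical baseline payload for an illustrative CRM lookup tool.
BASELINE = {
    "records": [
        {"id": "c-1", "email": "a@example.com", "updated_at": "2024-01-01"}
    ]
}
REQUIRED = {"id", "email", "updated_at"}


def drift_variants(baseline):
    """Yield (name, payload) pairs that mimic common kinds of tool drift."""
    missing = deepcopy(baseline)
    del missing["records"][0]["email"]
    yield "missing_field", missing

    renamed = deepcopy(baseline)
    renamed["records"][0]["email_address"] = renamed["records"][0].pop("email")
    yield "renamed_field", renamed

    yield "empty_result", {"records": []}

    duplicated = deepcopy(baseline)
    duplicated["records"].append(deepcopy(duplicated["records"][0]))
    yield "duplicate_records", duplicated

    yield "malformed", {"recrods": "oops"}


def toy_agent_step(payload):
    """Stand-in for the agent under test: returns 'proceed' or 'stop'."""
    records = payload.get("records")
    if not isinstance(records, list) or len(records) != 1:
        return "stop"
    if not REQUIRED.issubset(records[0]):
        return "stop"
    return "proceed"


def run_drift_eval(agent_step):
    """Pass means: proceed on the baseline, stop on every drifted variant."""
    results = {"baseline": agent_step(BASELINE) == "proceed"}
    for name, payload in drift_variants(BASELINE):
        results[name] = agent_step(payload) == "stop"
    return results
```

An agent that always proceeds would pass the baseline case and fail every variant, which is the gap a final-answer eval cannot see.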

2. Time and session failure

Agents that work across time inherit time-based failure modes. Tokens expire. Sessions close. Approvals age out. Context becomes stale. A workflow that looks reliable in a single replay can break when the same job spans hours or days.

A CRM update eval that starts with fresh credentials is not enough. The eval should include the case where authorization expires after the agent has already gathered context but before it writes the update. The expected behavior is not “try harder.” It is to pause safely, preserve state, ask for renewed authority, and avoid writing ambiguous partial data.
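The mid-workflow expiry case can be sketched with a fake CRM and a simulated clock. This is a minimal illustration, not a real CRM client: `FakeCRM`, `AuthExpired`, and `run_update` are hypothetical names, and the workflow is reduced to one read and one write.

```python
from dataclasses import dataclass, field


class AuthExpired(Exception):
    """Raised by the fake CRM when the token is no longer valid."""


@dataclass
class FakeCRM:
    token_valid_until: float
    now: float = 0.0
    writes: list = field(default_factory=list)

    def _check_auth(self):
        if self.now >= self.token_valid_until:
            raise AuthExpired()

    def read(self, customer_id):
        self._check_auth()
        return {"id": customer_id, "tier": "basic"}

    def write(self, customer_id, changes):
        self._check_auth()
        self.writes.append((customer_id, changes))


def run_update(crm, customer_id, advance_clock_by):
    """Toy workflow: gather context, let time pass, then attempt the write."""
    state = {"customer_id": customer_id, "status": "started"}
    state["context"] = crm.read(customer_id)
    crm.now += advance_clock_by  # simulate the job spanning time
    try:
        crm.write(customer_id, {"tier": "pro"})
        state["status"] = "completed"
    except AuthExpired:
        # Pause safely: keep what was gathered, record the write that is pending.
        state["status"] = "paused_needs_reauth"
        state["pending_write"] = {"tier": "pro"}
    return state
```

The eval then asserts on two things, not one: the returned status, and the fact that the fake CRM recorded no write when the token had expired.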

3. Retry and partial-completion behavior

Retries are useful until they duplicate work, overwrite state, or hide the real fault. If an agent retries a failed tool call, the eval should inspect whether the retry is idempotent, whether partial state is marked, and whether the operator can see what happened.

A downstream timeout should not leave the system guessing whether the agent completed the job. The eval should force timeout scenarios and verify that the agent records the uncertainty.
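A forced-timeout scenario might look like the sketch below. It assumes the downstream system deduplicates by an idempotency key, which is a design assumption and not something every API offers; `FakeDownstream` and `submit_with_retry` are hypothetical names.

```python
class FakeDownstream:
    """Applies the write, then times out on the first call.

    This models the ambiguous case: the side effect happened,
    but the caller never saw a response.
    """

    def __init__(self):
        self.applied = {}  # idempotency key -> payload
        self.calls = 0

    def submit(self, idempotency_key, payload):
        self.calls += 1
        # Deduplicate by key so a retry cannot create a second side effect.
        self.applied.setdefault(idempotency_key, payload)
        if self.calls == 1:
            raise TimeoutError("response lost after write")
        return "ok"


def submit_with_retry(downstream, key, payload, max_attempts=3):
    """Retry with the same key and log every attempt, including uncertainty."""
    log = []
    for attempt in range(1, max_attempts + 1):
        try:
            downstream.submit(key, payload)
            log.append({"attempt": attempt, "outcome": "ok"})
            return {"status": "completed", "log": log}
        except TimeoutError:
            log.append({"attempt": attempt, "outcome": "timeout_unknown_state"})
    # Out of attempts: report uncertainty instead of guessing.
    return {"status": "unknown", "log": log}
```

The grading checks are that the downstream holds exactly one applied write, that the timeout appears in the attempt log, and that an exhausted retry budget yields `unknown` instead of a claimed success.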

4. Handoff failure

Human escalation is often treated as an escape hatch. It is only an escape hatch if the handoff is actionable.

A trustworthy eval checks whether the human receives the user goal, current state, decisions already made, evidence used, failed steps, and the next safe action. If the handoff only says “needs review,” the system has not escalated. It has abandoned the work.
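The handoff contents listed above can be turned into a simple structural check. The field names below are a hypothetical schema chosen to mirror that list; a real grader would likely also judge the quality of each field, not only its presence.

```python
# Hypothetical field names mirroring the handoff contents described above.
REQUIRED_HANDOFF_FIELDS = (
    "user_goal",
    "current_state",
    "decisions_made",
    "evidence_used",
    "failed_steps",
    "next_safe_action",
)


def _is_missing(value):
    """A field counts as missing if it is absent, None, or a blank string."""
    if value is None:
        return True
    return isinstance(value, str) and not value.strip()


def grade_handoff(handoff):
    """Return pass/fail plus the list of fields the human would have to reconstruct."""
    missing = [f for f in REQUIRED_HANDOFF_FIELDS if _is_missing(handoff.get(f))]
    return {"passed": not missing, "missing": missing}
```

A handoff of `{"summary": "needs review"}` fails with all six fields reported missing, which matches the point that such a note is abandonment, not escalation.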

5. Silent degradation

The most dangerous agent failure is not always a crash. It is a system that keeps operating after its evidence quality has degraded.

If retrieval returns old data, if a tool silently falls back to cached results, or if the agent loses access to one dependency but keeps producing confident answers, the eval should catch that. Success should require evidence quality, not just fluent output.
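A grader that requires evidence quality could be sketched as follows. The evidence item shape (`source`, `retrieved_at`, `from_cache`, `disclosed`) and the 24-hour freshness window are assumptions for illustration; the right threshold depends on the workflow.

```python
from datetime import datetime, timedelta, timezone


def grade_evidence(evidence, now, max_age=timedelta(hours=24)):
    """Fail the run if evidence is absent, stale, or silently served from cache."""
    problems = []
    if not evidence:
        problems.append("no_evidence")
    for item in evidence:
        if now - item["retrieved_at"] > max_age:
            problems.append(f"stale:{item['source']}")
        # A cache fallback is acceptable only if the agent disclosed it.
        if item.get("from_cache") and not item.get("disclosed"):
            problems.append(f"undisclosed_cache:{item['source']}")
    return {"passed": not problems, "problems": problems}


# Usage: one fresh item and one stale, silently cached item.
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
report = grade_evidence(
    [
        {"source": "kb", "retrieved_at": now - timedelta(hours=1)},
        {"source": "pricing", "retrieved_at": now - timedelta(days=3), "from_cache": True},
    ],
    now=now,
)
```

Here `report` fails with both a staleness and an undisclosed-cache problem for the `pricing` source, regardless of how fluent the final answer was.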

Traces should become part of the eval artifact

For agents, the final answer is not enough evidence.

OpenAI’s trace grading guidance describes trace grading as assigning structured scores or labels to an end-to-end record of decisions, tool calls, and reasoning steps so teams can identify workflow-level mistakes. OpenTelemetry’s GenAI semantic conventions reflect the same operational need: GenAI instrumentation is moving toward spans, events, metrics, tool execution records, error types, model metadata, and agent/framework spans rather than answer text alone.

The practical point is simple: if you cannot inspect the execution path, you cannot reliably evaluate the agent.

A useful agent eval should preserve enough trace evidence to answer the questions an operator will ask after something goes wrong:

  • Which tools were called?
  • What did each tool return?
  • What state changed?
  • What failed, timed out, retried, or degraded?
  • Which decision depended on which evidence?
  • What did the agent tell the user or human operator?
  • What would an operator need to reproduce or repair the run?

Without that record, teams grade the visible surface while the system fails underneath it.
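As a rough illustration, a trace can be checked for the gaps that would leave an operator guessing. The event schema here (plain dicts with `type`, `call_id`, `key`, and `based_on`) is hypothetical and is not the OpenTelemetry convention or any vendor's trace format.

```python
def trace_gaps(events):
    """List the places where the trace cannot answer an operator's question."""
    gaps = []
    # Every tool call should have a recorded result or a recorded error.
    resolved = {
        e["call_id"] for e in events if e["type"] in ("tool_result", "tool_error")
    }
    for e in events:
        if e["type"] == "tool_call" and e["call_id"] not in resolved:
            gaps.append(f"unresolved_call:{e['call_id']}")
        # Every state change should name the evidence it depended on.
        if e["type"] == "state_change" and not e.get("based_on"):
            gaps.append(f"unattributed_change:{e['key']}")
    return gaps


# Usage: the second call has no recorded outcome, and one change has no evidence.
trace = [
    {"type": "tool_call", "call_id": "t1", "tool": "crm.read"},
    {"type": "tool_result", "call_id": "t1", "output": {"tier": "basic"}},
    {"type": "state_change", "key": "tier", "based_on": ["t1"]},
    {"type": "tool_call", "call_id": "t2", "tool": "crm.write"},
    {"type": "state_change", "key": "owner"},
]
gaps = trace_gaps(trace)
```

An eval can treat a non-empty `gaps` list as a failure even when the final answer looks correct.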

What a trustworthy agent eval stack includes

A better eval stack does not throw away prompt-response tests. It surrounds them with operating tests.

At minimum, production-facing agent evals should cover five layers.

1. Capability checks

Keep the basic tests. Measure whether the agent can complete the intended task under clean conditions. Use representative examples, reference outputs where appropriate, structured graders where they help, and regression tests for known failures.

Just do not stop there.

2. Tool and dependency checks

Test the agent against realistic tool behavior: empty results, stale results, malformed payloads, permission errors, rate limits, slow responses, duplicate records, and conflicting sources.

The expected behavior should include when to continue, when to ask for clarification, when to escalate, and when to stop.
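One lightweight way to produce these conditions in a test harness is a wrapper that replaces scheduled calls with injected faults. This is a sketch under the assumption that tools are plain callables; `FaultInjectingTool` and the search tool below are hypothetical.

```python
class FaultInjectingTool:
    """Wraps a tool callable and swaps scheduled calls for injected faults.

    `faults` maps a 1-based call number to either an exception instance
    (raised) or a replacement payload (returned instead of the real result).
    """

    def __init__(self, tool, faults):
        self.tool = tool
        self.faults = dict(faults)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        fault = self.faults.get(self.calls)
        if isinstance(fault, Exception):
            raise fault
        if fault is not None:
            return fault
        return self.tool(*args, **kwargs)


def real_search(query):
    """Stand-in for a real tool."""
    return {"results": [query]}


# Call 1 hits a permission error, call 2 returns an empty result, call 3 is real.
search = FaultInjectingTool(
    real_search,
    {1: PermissionError("403"), 2: {"results": []}},
)
```

Because the schedule is explicit, each eval case can state which call fails and what the agent is expected to do next: continue, clarify, escalate, or stop.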

3. State and time checks

Replay workflows across session boundaries. Expire credentials mid-run. Resume from old state. Inject stale context. Interrupt the workflow and restart it.

The eval should verify that the agent does not treat old authority, old context, or partial state as clean truth.
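A resume check along these lines gives the eval something concrete to assert on. The checkpoint fields (`saved_at`, `record_version`) and the one-hour limit are assumptions made for the sketch, not a prescribed format.

```python
def resume_decision(checkpoint, current_record_version, now, max_age_seconds=3600):
    """Decide whether saved state may be trusted when a workflow restarts."""
    reasons = []
    if now - checkpoint["saved_at"] > max_age_seconds:
        reasons.append("checkpoint_too_old")
    if checkpoint["record_version"] != current_record_version:
        reasons.append("record_changed_since_checkpoint")
    # Any doubt means re-reading context instead of acting on old state.
    action = "refresh_context" if reasons else "resume"
    return {"action": action, "reasons": reasons}
```

The eval then interrupts a run, ages or mutates the underlying record, restarts, and fails the agent if it writes without first refreshing.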

4. Recovery checks

Force failures that production will eventually produce: timeouts, tool errors, downstream outages, partial writes, repeated retries, and ambiguous completion states.

Then grade the recovery path. Did the agent preserve state? Did it avoid duplicate side effects? Did it expose uncertainty? Did it leave an operator with enough trace evidence to intervene?

5. Escalation checks

Evaluate human handoff as a first-class outcome. The eval should check whether the handoff includes the job goal, current state, attempted actions, evidence trail, failure mode, risk, and recommended next step.

If the human cannot safely continue from the handoff, the agent did not pass.

Operator checklist: before you trust the eval

Before shipping an agent based on eval results, ask these questions:

  1. Does the eval include tool failures, not just task prompts?
     Include malformed output, missing fields, stale data, rate limits, permission errors, and dependency timeouts.
  2. Does it test time?
     Include expired credentials, stale sessions, old context, delayed resumes, and long-running workflows.
  3. Does it inspect traces, not just final answers?
     Capture tool calls, tool outputs, state changes, retries, errors, and escalation decisions.
  4. Does it grade recovery?
     Verify safe pause, retry limits, idempotency, partial-state marking, and operator-visible uncertainty.
  5. Does it test handoff quality?
     A human should receive enough context to own the next step without reconstructing the entire run.
  6. Does it check evidence freshness and provenance?
     The agent should know when its retrieved or tool-provided evidence is outdated, partial, or conflicting.
  7. Does it include known production incidents?
     Every real failure should become a regression case, including the trace and state conditions that caused it.
  8. Does it define unsafe success?
     A fluent answer should fail if it relied on bad evidence, skipped required authority, hid a timeout, or left partial work unmarked.
  9. Does it measure operator visibility?
     The team should be able to explain what happened after a failed run without guessing from logs scattered across systems.
  10. Does it map each eval to an operating risk?
     If a test does not protect a real failure mode, decide whether it belongs in the production-readiness suite or only in a model-quality suite.
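The "unsafe success" idea from the checklist can be made explicit in a grader, so that a correct-looking answer does not pass by default. The flag names below are hypothetical labels for the four conditions named in the checklist; how each flag gets set would come from trace inspection in a real suite.

```python
# Hypothetical flags, one per unsafe condition named in the checklist.
UNSAFE_CONDITIONS = (
    "used_bad_evidence",
    "skipped_required_authority",
    "hid_timeout",
    "unmarked_partial_work",
)


def grade_run(answer_correct, flags):
    """Separate a clean pass from a correct answer reached unsafely."""
    violations = [c for c in UNSAFE_CONDITIONS if flags.get(c)]
    if violations:
        verdict = "unsafe_success" if answer_correct else "fail"
    else:
        verdict = "pass" if answer_correct else "fail"
    return {"verdict": verdict, "violations": violations}
```

Reporting `unsafe_success` as its own verdict keeps these runs out of the pass rate while still distinguishing them from plain wrong answers.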

The close

Agent evals are not lying because evaluation is useless. They are lying when they make a system look ready by measuring only the cleanest part of it.

The production question is broader than answer quality. Can the agent handle tools that drift, sessions that expire, state that goes stale, dependencies that fail, retries that create ambiguity, and humans who need to take over?

If the eval suite cannot see those failures, it cannot certify the system. It can only certify the clean run.

The teams that build reliable agents will not be the ones with the most reassuring leaderboard number. They will be the ones whose eval suites look like production: messy, stateful, tool-dependent, traceable, and recoverable.

Primary operator action: Run the production-failure eval checklist before using clean-path eval results to expand agent autonomy.