A Systematic Study of Evaluating Agents

Evaluating AI agents is one of the hardest unsolved problems in applied AI, and most teams are doing it wrong.

If you've shipped an agent to production, you've likely run into this: the agent performs well in your test harness, then behaves in ways nobody predicted when real users start using it. The instinct is to blame the model. The real culprit, almost always, is the absence of a systematic evaluation framework.

Anthropic, OpenAI, Google DeepMind, and academic labs across the world have published extensively on this problem. This post pulls the most useful threads together and explains how to build an evaluation approach that actually predicts what an agent will do in production, not just what it does in a controlled setting.

What makes agent evaluation different from LLM evaluation

The standard playbook for evaluating a large language model is well-established: assemble a benchmark dataset, pass in prompts, compare outputs to reference answers, compute a score. MMLU tests knowledge breadth. HumanEval tests code generation. HellaSwag tests commonsense completion. These are single-turn, bounded tasks with clear right answers.

Agents break every assumption underlying that playbook.

An agent operates across multiple steps, calls external tools, maintains state between turns, and makes sequential decisions where an error at step 2 corrupts everything downstream. The relevant question is not "did the agent produce the right final output?" It is whether the agent got there reliably, through sound reasoning, using the right tools in the right order, across the full range of tasks it will actually encounter.

OpenAI's work on multi-step agent evaluation, including in the SimpleQA benchmark and internal alignment research, makes the distinction explicit: correct outputs reached via broken trajectories are not evidence of a reliable agent. Anthropic's model cards and alignment research make the same point from a different angle. An agent that performs well on a narrow eval can exhibit entirely unexpected behavior when given longer task horizons, more autonomy, or novel environment configurations.

Evaluation frameworks designed for static LLMs are structurally insufficient for agents. You need a different approach.

What is an agent eval, exactly

Before getting into layers and methods, it helps to be concrete about what you're actually measuring when you evaluate an agent.

An agent eval is a structured test in which the agent is given a task, a set of available tools, and an environment to operate in. The eval records what the agent does (every tool call, every sub-goal, every reasoning step) and then scores that record against some criteria. The criteria can be outcome-based (did it complete the task?), trajectory-based (did it take a reasonable path?), or safety-based (did it avoid prohibited actions?). A complete evaluation framework uses all three.

The difference from an LLM eval is the environment. The agent has to do something, not just say something. And the doing produces a trace that is itself a rich source of signal.

Layer 1: Capability benchmarks

The foundation of any evaluation approach is capability assessment. Can the agent perform the tasks it was designed for, at all, under controlled conditions?

Several benchmarks have become reference points for this layer. SWE-bench, developed at Princeton, tests whether agents can resolve real GitHub issues across popular open-source repositories, a task that requires reading code, understanding a bug report, writing a fix, and verifying it against existing tests. WebArena, from Carnegie Mellon, evaluates agents on realistic web-based tasks: booking, searching, and navigating multi-page workflows. AgentBench, from Tsinghua University, covers operating systems, databases, and knowledge graphs. GAIA, from Meta AI and Hugging Face, tests general-purpose real-world task-solving that requires multi-step tool use and grounded reasoning.

These benchmarks are useful starting points and not finish lines. The limitation is that they measure potential, not reliability. An agent scoring 70% on WebArena still fails 30% of the time on tasks within that distribution, and you don't know which 30%. You know even less about the tasks that matter to your actual users. Benchmark performance tells you whether the agent is capable of a task class. It doesn't tell you whether it will be reliable on the specific distribution of tasks you're shipping it for.

Layer 2: Trajectory evaluation

This is where most teams underinvest, and where the highest-signal information lives.

Trajectory evaluation means evaluating the path the agent took to reach an outcome, not just whether the outcome was correct. Anthropic's research on multi-step reasoning and tool use, in their alignment work and in engineering posts about Claude Code and their multi-agent research approach, consistently points to trajectory quality as a stronger predictor of production reliability than outcome quality. A correct answer produced via confused or lucky reasoning will not hold across the full task distribution.

Google DeepMind's agent evaluations, including those published alongside Gemini, show the same pattern: trajectory-level scoring correlates substantially more with downstream reliability than outcome-only scoring.

What you're evaluating at this layer:

Did the agent call the right tool with the right arguments at the right point in the task? A tool call that produces a correct output because the tool was forgiving about bad inputs is not evidence of a reliable agent.
Does each reasoning step follow logically from the prior state of the world? Are the agent's sub-goals appropriate given what it has observed? This requires reading the chain-of-thought or scratchpad, not just the final response.
When the agent hits a dead end (a tool returns an error, a retrieved document doesn't answer the question), does it recover and revise its plan, or does it repeat failed attempts? Recovery behavior under failure is one of the strongest signals of robustness.
Does the agent take extraneous steps that add latency, token cost, or risk without contributing to the task? Over-calling tools and generating unnecessary intermediate outputs both indicate a poorly calibrated agent.

Instrumenting for this layer means logging every decision point: tool calls, arguments, results, sub-goal generation, and reasoning traces. You need scorers that evaluate path quality, not just final state. Building that infrastructure costs more than running a benchmark. It is also what tells you whether your agent is improving.

Layer 3: Adversarial and safety evaluation

The final layer is the one most teams defer until something goes wrong in production.

Adversarial evaluation asks what happens when the environment doesn't cooperate. What happens when inputs are ambiguous, tools return unexpected responses, or a user provides inputs the agent wasn't designed to handle?

Anthropic's Responsible Scaling Policy ties capability evaluations to safety thresholds before models (and agents built on top of them) are deployed at greater autonomy levels. The policy defines evaluation criteria across four risk areas: CBRN (chemical, biological, radiological, nuclear), cybersecurity, deception, and model autonomy. OpenAI's Preparedness Framework covers similar ground with a defined scoring rubric for pre- and post-mitigation risk across the same categories. Both treat these evaluations as prerequisites for deployment at scale, not optional safety audits.

For most production agent teams, adversarial evaluation covers three things:

Prompt injection testing: if your agent reads content from external sources (documents, search results, email bodies, database records), can a malicious string in that content override the agent's instructions? This is not hypothetical. Researchers at ETH Zurich and elsewhere have demonstrated prompt injection attacks against deployed agents in realistic task environments.

Out-of-distribution input handling: what does the agent do when it encounters a request, document type, or tool response it wasn't designed for? A well-evaluated agent should degrade gracefully, expressing uncertainty or taking a conservative fallback, rather than producing a confident wrong answer.

Failure mode cataloging: have you systematically documented the ways your agent fails? Not just the cases where it produces a wrong final output, but the reasoning errors, tool misuse patterns, and edge cases where it behaves unexpectedly? This is the agentic equivalent of a regression suite, and it requires intentional investment to build.

Building the infrastructure to do this

The frameworks above translate into specific engineering decisions.

Define your task distribution before you write a single eval. An eval set that doesn't reflect the actual distribution of tasks your agent will encounter in production measures the wrong thing. Sample real user sessions or realistic synthetic tasks, cluster them by type and complexity, and build your eval set to match.

Use trajectory logging from day one. Every agent action (tool call, sub-goal, reasoning step) should be logged in a structured format that can be replayed. When something goes wrong in production, the first question is what the agent actually did. If you can't answer that exactly, you can't debug it.

Calibrate your automated scorers against human judgment. LLM-as-judge evaluation, using a frontier model to score agent trajectories, has become a practical approach for scaling evaluation coverage beyond what human annotation allows. Anthropic, OpenAI, and academic groups have published on both the promise and the pitfalls. LLM judges have known biases: they tend to favor verbose responses, may share biases with the agent being evaluated, and are sensitive to surface-level features. Build a small high-quality human annotation set for your hardest task types, then calibrate your automated judge against it before trusting it at scale.

Run regression evals on every agent change. Every prompt update, tool addition, model upgrade, or retrieval change should trigger a regression eval run. This is standard practice in software engineering. Track eval metrics over time. A drift in trajectory quality or task completion rate is a leading indicator of production degradation. Catching it in eval is always cheaper than catching it in production.

What frontier labs are still getting wrong

Even well-resourced teams have open problems in agent evaluation.

Anthropic's research team has published extensively on the difficulty of out-of-distribution generalization, the gap between eval performance and real-world performance. Their work on emergent behaviors in capable models and on alignment properties under increased autonomy is a sustained argument that evaluation is not a solved problem even at the frontier. Their Constitutional AI paper and the follow-on RLHF work show that what an agent learns to do in evaluation can diverge from what it does when deployed more broadly.

OpenAI's research on calibration raises a related issue. An agent that is confidently wrong is more dangerous than one that correctly expresses uncertainty. Most current evaluation frameworks, benchmarks and trajectory scorers alike, don't adequately measure calibration. A well-calibrated agent should say it isn't sure when it isn't sure. Building evals that reward calibrated uncertainty is a mostly unsolved problem.

Google DeepMind's work on reward hacking in agentic settings, where agents learn to satisfy the literal specification of a reward function while violating its intent, is a reminder that evaluation metrics can themselves be gamed. Any sufficiently capable agent optimized against a narrow eval will eventually find the gaps in that eval. The answer is richer, more diverse evals combined with human oversight at the edges.

The appropriate response is treating evaluation as ongoing work, not a one-time certification that an agent is ready to ship.

Why this matters in high-stakes domains

A wrong answer in a consumer chatbot is annoying. A misconfigured access policy, a missed compliance check, a hallucinated authorization decision, or a leaked credential in an identity and access management workflow has consequences of a different order. The asymmetry of failure costs means the bar for evaluation rigor is not the same across all deployment contexts.

At Idam AI, we build agents that operate in exactly these environments, where every action has a downstream access, security, or compliance implication. Systematic agent evaluation here is a safety requirement, not a best practice.

The picture from Anthropic, OpenAI, DeepMind, and practitioners across the field is consistent: evaluation must be layered, continuous, and grounded in the actual task distribution your agent will face. Start with capability benchmarks. Go deeper with trajectory evaluation. Stress-test with adversarial scenarios. Build the infrastructure to do it again after every change, because the agent, the environment, and the task distribution will all keep changing.

What does your current agent evaluation stack look like? Are you measuring trajectories, or just final outputs, and what's the hardest failure mode you've had to design an eval around?

Sources

Your AI Copilot for all things Product Management

Idam AI helps you streamline product tasks, enhance decision-making, and accelerate your product development lifecycle with intelligent automation.

Get Started with Idam AI

Table of Contents