Agent Evals in Weave
It is easy to build an impressive agent demo. It is much harder to know whether the agent is actually improving.
That is the role of an agent evaluation: taking a behavior you care about and turning it into something you can measure repeatedly.
As more teams build agent systems, this matters more every month. A single workflow may involve routing, tool use, memory, planning, handoffs, and execution rules. When any of those pieces change, you need a way to tell whether the system improved, regressed, or simply changed behavior.
This post starts with the basics: what an agent eval is, what benefits evals provide, and what kinds of behavior they can measure. From there, it explains how we structure evals in Weave across Loom and Tapestry.
What is an agent evaluation?
At the simplest level, an agent evaluation is a repeatable test for agent behavior.
Instead of asking, “Did the agent look good in this one demo?”, an eval asks a more disciplined question:
Given a specific input and a clear success condition, did the agent behave the way we wanted?
The behavior being measured might be small or large.
- Did it classify the user’s request correctly?
- Did it choose the right next action?
- Did it call the right tool?
- Did it stay on track over several turns?
- Did it respect the rules the system is supposed to operate under?
One useful analogy is this:
If unit tests protect code behavior, agent evals protect agent behavior.
That comparison is not perfect. Agent systems are often less deterministic than ordinary functions, and some evals look more like scenario tests than unit tests. But the goal is similar: create a stable way to measure whether the system is doing the right thing.
In practice, a complete evaluation setup usually has a few parts:
- a task, or test case
- a success condition, which defines what good behavior looks like
- a grader, which checks whether the behavior met that condition
- a suite, which groups related tasks together
Not every eval needs sophisticated grading. Some can be checked with exact rules. Others need more contextual judgment. What matters is that the system is being measured against a clear expectation rather than a vague impression.
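The pieces listed above can be sketched in a few lines of code. This is a minimal illustration, not Weave's actual API; every name here (EvalTask, Grader, run_suite) is invented for the example.

```python
# Illustrative sketch of a task, a success condition, a grader, and a
# suite runner. None of these names come from Weave's real codebase.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    task_id: str
    input: str          # what the agent is given
    expected: str       # the success condition, encoded as data

# A grader checks whether observed behavior met the expectation.
Grader = Callable[[EvalTask, str], bool]

def exact_match(task: EvalTask, output: str) -> bool:
    """An exact-rule grader: no contextual judgment needed."""
    return output.strip() == task.expected

def run_suite(tasks: list[EvalTask],
              agent: Callable[[str], str],
              grader: Grader) -> float:
    """Run every task through the agent and return the pass rate."""
    passed = sum(grader(t, agent(t.input)) for t in tasks)
    return passed / len(tasks)
```

A looser grader (for example, an LLM judge) would have the same `Grader` signature; only the checking logic changes.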
Why agent evals matter
Without evals, teams usually fall back on anecdotes.
They remember a few strong outputs, a few bad failures, and whichever demo happened most recently. That is not a reliable basis for improving a system.
Good evals help in a few ways immediately:
- They catch regressions. A prompt update, tool change, or model swap can quietly break behavior that used to work.
- They make comparisons fairer. You can compare models, prompts, or system changes against the same tasks instead of against different examples.
- They make quality visible over time. A single successful run is interesting, but a repeatable success pattern is much more useful.
- They focus discussion. Instead of arguing abstractly about whether the system feels better, teams can ask which behaviors improved and which did not.
Evals do not replace human judgment. They make it more grounded.
What good agent evals actually measure
A useful eval suite should do more than produce a single score.
It should help answer questions like:
- What behavior are we testing?
- What counts as success?
- Is there one correct answer, or several acceptable ones?
- Which change introduced the improvement or regression?
- Are we comparing like with like?
- Can we trust the results historically?
That last point matters more than it first appears. If storage is inconsistent, if model identity drifts over time, or if unlike tests are merged into one summary, the dashboard may look tidy while the conclusions become unreliable.
The main types of agent evals
Not every eval is trying to answer the same question. In practice, the most useful categories for us are these.
1. Identity evals
Identity evals check whether the system picked the right actor.
In Weave, this often means verifying that Loom selected the correct specialist, or correctly chose to handle a simple task itself.
These are usually strict, because there is often a single, clearly correct answer.
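A strict identity grader can be as simple as an equality check against the expected actor. The actor labels below are hypothetical examples, not Weave's real specialist names.

```python
# Identity evals are strict: there is usually one correct actor.
# The labels here are invented for illustration.
def grade_identity(expected_actor: str, chosen_actor: str) -> bool:
    return chosen_actor == expected_actor

# Example: routing a coding question should pick the code specialist,
# and handling small talk itself should count as choosing "self".
assert grade_identity("code-specialist", "code-specialist")
assert not grade_identity("self", "code-specialist")
```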
2. Intent evals
Intent evals check whether the system understood what kind of help the user actually needed.
These can be looser than identity evals. Sometimes there are several acceptable ways to represent the same underlying intent, but the core interpretation still needs to be right.
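One way to encode that looseness is to grade against a set of acceptable labels rather than a single answer. The intent categories below are invented for the example.

```python
# Intent evals can accept several representations of the same
# underlying need. These category names are hypothetical.
def grade_intent(acceptable: set[str], predicted: str) -> bool:
    return predicted in acceptable

# "bug_report" and "defect_report" might both be fine encodings of
# the same user intent, while "feature_request" would not be.
assert grade_intent({"bug_report", "defect_report"}, "defect_report")
assert not grade_intent({"bug_report", "defect_report"}, "feature_request")
```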
3. Trajectory evals
Trajectory evals check how the interaction unfolds across multiple steps.
An agent can make the right first move and still fail the task overall. Trajectory evals help test whether the system stays aligned, hands work off correctly, and maintains the right order of operations through a longer interaction.
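One simple trajectory check is an ordered-subsequence test: did the required steps happen, in the right order, even if other steps occurred in between? This is a sketch of the idea, not Weave's actual trajectory logic, and the step names are made up.

```python
# Check that the required steps appear in the trajectory in order,
# allowing unrelated steps in between. A sketch, not Weave's real code.
def follows_order(trajectory: list[str], required: list[str]) -> bool:
    it = iter(trajectory)
    # `step in it` advances the iterator, so each required step must
    # be found after the previous one.
    return all(step in it for step in required)
```

For example, `follows_order(["route", "plan", "call_tool", "respond"], ["plan", "respond"])` passes, while a trajectory that responds before planning would fail.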
4. Execution-contract evals
Execution-contract evals check whether the system follows its operational rules.
For Weave, this matters especially for Tapestry. We want to know whether it continues correctly, stops when blocked, respects review boundaries, and behaves safely under the rules it is supposed to follow.
These are less about style and more about correctness under the system's defined constraints.
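Contract rules like these can often be expressed as predicates over an execution log. The event types and the specific rule below are hypothetical, chosen only to show the shape of such a check.

```python
# One invented contract rule: no write action may occur before an
# approval event. Event shapes are assumptions for this sketch.
def violates_review_boundary(log: list[dict]) -> bool:
    approved = False
    for event in log:
        if event["type"] == "approval":
            approved = True
        if event["type"] == "write" and not approved:
            return True  # wrote before review: contract violated
    return False
```

A contract eval suite would run many such predicates against recorded execution logs and fail any run that trips one.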
Why this matters for Weave
This became especially important for us once Weave separated two very different responsibilities:
- Loom decides what should happen next
- Tapestry executes reviewed work inside clear boundaries
Both are part of the same product, but they are solving different problems.
If Loom routes to the wrong specialist, that is a routing failure. If Tapestry ignores an execution rule, that is a contract failure.
Those should not be mixed into one benchmark and treated as if they were the same kind of correctness.
That is the central design decision behind our evals: we do not treat all agent behavior as one score.
How we structure evals in Weave
In Weave, evals are organized into families so that each family represents a meaningful behavior surface.
Today, the public registry reflects two active families:
- Loom Routing: agent-routing-identity, agent-routing-intent, agent-trajectory
- Tapestry Execution: tapestry-execution-contracts
There is also a reserved Tapestry Review family in the registry, but it is not surfaced on the public landing page yet.
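The registry described above could be represented as a simple mapping from family to suites. This is a hypothetical shape, not Weave's actual registry format.

```python
# Hypothetical shape of the eval registry; only the suite names
# match the ones mentioned in this post.
EVAL_FAMILIES = {
    "loom-routing": [
        "agent-routing-identity",
        "agent-routing-intent",
        "agent-trajectory",
    ],
    "tapestry-execution": [
        "tapestry-execution-contracts",
    ],
    "tapestry-review": [],  # reserved, not yet surfaced publicly
}
```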
This split is intentional.
The Loom family groups together different views of routing quality: who should act, what the user meant, and whether the interaction follows the right path.
Tapestry Execution is kept separate because it answers a different question: whether the execution engine behaves correctly under its contract.
That separation makes the results easier to trust. A single shared score would be simpler to present, but it would also imply a level of comparability that is not actually there.
Where we store evals and why
The structure of the data matters almost as much as the tests themselves.
In Weave, we keep a canonical JSONL file per suite. Each run is recorded as its own line, along with metadata about the run.
That gives us a few practical benefits:
- each suite keeps a durable history
- reporting can group runs by normalized model identity
- summaries stay derived, instead of becoming the source of truth
- historical comparisons remain easier to audit
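The append-one-line-per-run pattern is easy to sketch. The field names here are illustrative; only the one-JSON-object-per-line structure reflects what the post describes.

```python
# Append each run as its own JSON line, then read the full history
# back. Field names in the run dicts are assumptions for this sketch.
import json
from pathlib import Path

def record_run(suite_path: Path, run: dict) -> None:
    """Append one run, with its metadata, to the suite's JSONL file."""
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(run) + "\n")

def load_runs(suite_path: Path) -> list[dict]:
    """Read the suite's durable history, one run per line."""
    with suite_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each run is immutable once written, summaries and charts can always be re-derived from the raw lines rather than becoming the source of truth themselves.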
This sounds like a small implementation detail, but it has a large effect on trust. If the storage layer is sloppy, the reporting layer becomes misleading no matter how polished the charts look.
Why model identity matters
If you want evaluation results to stay useful over time, model identity has to be normalized consistently.
Otherwise, small naming differences can fragment the history and make trends harder to interpret. Weave keeps model-aware reporting tied to run metadata so comparisons remain meaningful across time instead of turning into separate, accidental buckets.
The principle here is simple: preserve the raw history carefully, then build smarter comparison logic on top of it.
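A normalization step might lowercase the name, drop a provider prefix, and strip a trailing date suffix so minor naming variants group together. The specific patterns below are assumptions, not Weave's actual rules.

```python
# One possible normalization for model identifiers. The prefix and
# date-suffix conventions handled here are assumptions for this sketch.
import re

def normalize_model_id(raw: str) -> str:
    name = raw.strip().lower()
    name = name.split("/")[-1]                        # drop provider prefix
    name = re.sub(r"-\d{4}-\d{2}-\d{2}$", "", name)   # drop date suffix
    return name
```

Under these assumptions, `"openai/gpt-4o-2024-08-06"` and `"GPT-4o"` would both normalize to `"gpt-4o"` and land in the same comparison bucket.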
Our philosophy: honest evals over flattering dashboards
The goal of the /evals/ section is not to flatten everything into one leaderboard.
It is to make different kinds of behavior visible without pretending they are the same test.
For us, that means:
- keeping eval families separate when their coverage differs
- storing suite histories canonically
- attaching enough metadata to make comparisons trustworthy
- publishing results in a way that avoids misleading rollups
In other words, we want evals to be honest before they are impressive.
Closing thoughts
Agent evaluations are valuable because they help teams answer a practical question with more confidence: is the system getting better at the behavior we actually care about?
For Weave, that means evaluating routing, execution, and eventually review as distinct surfaces with different expectations.
That structure gives us cleaner history, more credible comparisons, and a better way to explain progress in public.
