The state of agent evaluation is embarrassing. Teams ship agents tested by “running it a few times and seeing if it feels right.” This is the vibe check era, and it needs to end.
What to actually measure
Forget accuracy on benchmarks. In production, three metrics determine whether your agent survives:
Task completion rate — not “did it generate a good answer” but “did it complete the full multi-step task end-to-end.” Partial completions count as failures. Measure this across 100+ diverse inputs.
Cost per task — total API spend including retries, tool calls, and reducer overhead. An agent that succeeds 95% of the time at $2/task may be worse than one that succeeds 88% at $0.04/task; which tradeoff is right depends on your use case.
Recovery rate — when a tool call fails or returns garbage, does the agent recover gracefully or spiral? Inject failures deliberately and measure recovery.
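The three metrics above fall out of a per-run log. A minimal sketch (the `RunRecord` shape and `summarize` helper are illustrative, not from any particular harness):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    completed: bool          # full end-to-end success; partials count as failures
    cost_usd: float          # total API spend, including retries and tool calls
    injected_failure: bool   # a tool failure was deliberately injected this run
    recovered: bool          # agent still finished despite the injected failure

def summarize(runs: list[RunRecord]) -> dict:
    n = len(runs)
    injected = [r for r in runs if r.injected_failure]
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
        # recovery rate is only defined over runs where a failure was injected
        "recovery_rate": (sum(r.recovered for r in injected) / len(injected))
                         if injected else None,
    }
```

Note that recovery rate is computed only over runs with an injected failure; averaging it into the full population would hide spirals behind easy runs.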
```python
# SkillsBench evaluation harness
# (assumes `bench` and `my_agent` are already configured elsewhere)
results = bench.run(
    agent=my_agent,
    suite="travel-booking-v2",
    n_runs=100,
    inject_failures=True,
    metrics=["completion", "cost", "recovery", "latency_p95"],
)
```
The eval loop
Run evals on every PR. Track metrics over time. Set regression thresholds. If completion rate drops below 85%, block the merge. This is CI/CD for agents, and it’s the bare minimum for production readiness.
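The merge-blocking step can be a few lines in CI. A sketch of a gate script (the threshold values and metric keys are assumptions to adapt to your suite):

```python
THRESHOLDS = {"completion": 0.85, "recovery": 0.70}  # assumed regression floors

def gate(metrics: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return a list of threshold violations; empty list means the PR passes."""
    return [
        f"{name}={metrics.get(name, 0):.2f} < {floor:.2f}"
        for name, floor in thresholds.items()
        if metrics.get(name, 0) < floor
    ]

# In CI, exit nonzero on any violation to block the merge:
#   violations = gate(results.metrics)
#   sys.exit(1 if violations else 0)
```

Exit code is the whole interface: any CI system can turn a nonzero exit into a required status check, so no agent-specific tooling is needed on the pipeline side.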