Agent evals that actually matter: beyond vibe checks

Tags: evals · agents · testing
TL;DR
  • Vibe-check testing is the norm and it needs to end
  • Three metrics that matter: task completion rate, cost per task, recovery rate
  • Inject tool failures deliberately to measure agent resilience
  • Run evals on every PR — this is CI/CD for agents

The state of agent evaluation is embarrassing. Teams ship agents tested by “running it a few times and seeing if it feels right.” This is the vibe check era, and it needs to end.

What to actually measure

Forget accuracy on benchmarks. In production, three metrics determine whether your agent survives:

Task completion rate — not “did it generate a good answer” but “did it complete the full multi-step task end-to-end.” Partial completions count as failures. Measure this across 100+ diverse inputs.
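A minimal sketch of measuring completion rate this way — the `run_agent` callable and the run-record shape are hypothetical, and partial completions are counted as failures per the definition above:

```python
# Sketch: task completion rate where partials count as failures.
# `run_agent` and the case/result format are illustrative assumptions.

def completion_rate(cases, run_agent):
    """Fraction of cases where every required step finished end-to-end."""
    completed = 0
    for case in cases:
        result = run_agent(case)
        # A partial completion (some steps done) still counts as a failure.
        if result["steps_done"] == result["steps_required"]:
            completed += 1
    return completed / len(cases)

# Toy run: 2 of 3 tasks fully completed -> rate of 2/3.
runs = [
    {"steps_done": 4, "steps_required": 4},
    {"steps_done": 3, "steps_required": 4},  # partial -> failure
    {"steps_done": 2, "steps_required": 2},
]
rate = completion_rate(runs, lambda case: case)
```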

Cost per task — total API spend including retries, tool calls, and reducer overhead. An agent that succeeds 95% of the time at $2/task can be worse than one at 88% for $0.04/task; which tradeoff wins depends on your use case.
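One way to make that comparison concrete is cost per *successful* task, since failed runs still burn budget. The arithmetic below just plugs in the figures from the text:

```python
# Sketch: compare agents by cost per successful task.

def cost_per_success(cost_per_task, success_rate):
    # Failed runs still cost money, so divide spend by the success rate.
    return cost_per_task / success_rate

expensive = cost_per_success(2.00, 0.95)  # ~$2.11 per successful task
cheap = cost_per_success(0.04, 0.88)      # ~$0.045 per successful task
```

Even normalized for failures, the cheap agent is roughly 45x less expensive per success; whether the 7-point completion gap is worth that depends on what a failure costs you downstream.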

Recovery rate — when a tool call fails or returns garbage, does the agent recover gracefully or spiral? Inject failures deliberately and measure recovery.
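A sketch of deliberate failure injection: wrap each tool so a controlled fraction of calls raise, then measure how often runs that hit a failure still complete. The interfaces here are illustrative assumptions, not any specific framework's API:

```python
import random

def with_injected_failures(tool_fn, failure_rate=0.2, rng=None):
    """Return a version of tool_fn that raises on a fraction of calls."""
    rng = rng or random.Random(0)  # seeded so eval runs are reproducible
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("injected tool failure")
        return tool_fn(*args, **kwargs)
    return wrapped

def recovery_rate(runs):
    """Of the runs that hit an injected failure, how many still completed?"""
    hit = [r for r in runs if r["saw_failure"]]
    return sum(r["completed"] for r in hit) / len(hit) if hit else None
```

An agent that retries, falls back to another tool, or asks for clarification scores well here; one that loops on the broken tool or hallucinates a result does not.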

# SkillsBench evaluation harness
# (assumes a `bench` harness object already configured for your agent)
results = bench.run(
    agent=my_agent,
    suite="travel-booking-v2",
    n_runs=100,
    inject_failures=True,
    metrics=["completion", "cost", "recovery", "latency_p95"],
)

The eval loop

Run evals on every PR. Track metrics over time. Set regression thresholds. If completion rate drops below 85%, block the merge. This is CI/CD for agents, and it’s the bare minimum for production readiness.
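A minimal sketch of that merge gate, assuming eval results arrive as a flat metrics dict (the threshold values mirror the text; the dict shape is an assumption):

```python
# Sketch: block the merge when any eval metric falls below its floor.

THRESHOLDS = {"completion": 0.85, "recovery": 0.80}

def check_regressions(metrics):
    """Return the metrics below threshold; an empty list means pass."""
    return [name for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor]

failures = check_regressions({"completion": 0.87, "recovery": 0.72})
# In CI: exit nonzero when `failures` is non-empty to block the merge.
```

Wire this into your PR pipeline so a regression fails the build the same way a broken unit test would.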

Agent Trace

agent trace · post #4 · 5 steps · 34.2s
THINK Running eval suite. 100 test cases, 4 metrics. 1ms
TOOL bench.run(suite='travel-booking-v2', n=100) 34.2s
OBS completion=87%, cost_avg=$0.034, recovery=72%, p95=3.2s 5ms
ERR recovery rate below 80% threshold. Flagging regression. 1ms
ACT Report generated. 3 failure patterns identified for review. 12ms