The state of agent evaluation is embarrassing. Teams ship agents tested by “running it a few times and seeing if it feels right.” This is the vibe check era, and it needs to end.
What to actually measure
Forget accuracy on benchmarks. In production, three metrics determine whether your agent survives:
Task completion rate — not “did it generate a good answer” but “did it complete the full multi-step task end-to-end.” Partial completions count as failures. Measure this across 100+ diverse inputs.
Cost per task — total API spend including retries, tool calls, and reducer overhead. An agent that succeeds 95% of the time at $2/task may be worse than one that succeeds 88% at $0.04/task; which tradeoff is right depends on your use case.
Recovery rate — when a tool call fails or returns garbage, does the agent recover gracefully or spiral? Inject failures deliberately and measure recovery.
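The three metrics above fall out of a per-run log. A minimal sketch (the `RunRecord` shape and `summarize` helper are illustrative, not from any particular harness):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    completed: bool          # full end-to-end success; partials count as failures
    cost_usd: float          # total API spend, including retries and tool calls
    injected_failure: bool   # a tool failure was deliberately injected this run
    recovered: bool          # agent still finished despite the injected failure

def summarize(runs: list[RunRecord]) -> dict:
    n = len(runs)
    injected = [r for r in runs if r.injected_failure]
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
        # recovery rate is only defined over runs where a failure was injected
        "recovery_rate": (sum(r.recovered for r in injected) / len(injected))
                         if injected else None,
    }
```

Note that recovery rate is computed only over runs with an injected failure; averaging it into the full population would hide spirals behind easy runs.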
```python
# SkillsBench evaluation harness
# (assumes `bench` and `my_agent` are already configured elsewhere)
results = bench.run(
    agent=my_agent,
    suite="travel-booking-v2",
    n_runs=100,
    inject_failures=True,
    metrics=["completion", "cost", "recovery", "latency_p95"],
)
```
The eval loop
Run evals on every PR. Track metrics over time. Set regression thresholds. If completion rate drops below 85%, block the merge. This is CI/CD for agents, and it’s the bare minimum for production readiness.
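The merge-blocking step can be a few lines in CI. A sketch of a gate script (the threshold values and metric keys are assumptions to adapt to your suite):

```python
THRESHOLDS = {"completion": 0.85, "recovery": 0.70}  # assumed regression floors

def gate(metrics: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return a list of threshold violations; empty list means the PR passes."""
    return [
        f"{name}={metrics.get(name, 0):.2f} < {floor:.2f}"
        for name, floor in thresholds.items()
        if metrics.get(name, 0) < floor
    ]

# In CI, exit nonzero on any violation to block the merge:
#   violations = gate(results.metrics)
#   sys.exit(1 if violations else 0)
```

Exit code is the whole interface: any CI system can turn a nonzero exit into a required status check, so no agent-specific tooling is needed on the pipeline side.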