- How do you evaluate an AI agent after it fails?
- What is failure learning?
- What should a benchmark log?
- How does this relate to WisdomBench?
What matters after failure?
Recovery, boundary update, repeated error rate, and evidence of changed behavior.
Capability benchmarks often ask what the system can do once. Reliability evaluation asks what changes after failure, ambiguity, and correction.
Search Intent
Metric
If an agent repeats the same error after feedback, high first-pass fluency is not enough. The evaluation should track recovery, boundary update, repeated failure, and whether the system learned the right lesson.
A failure can improve the system only if it enters the record.
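As a rough illustration of how recovery and repeated failure could be computed from a log of attempts, here is a minimal sketch. The `Attempt` type, its field names, the sample failure classes, and both function names are assumptions made for this example; they are not part of any published benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Attempt:
    """One scored attempt at a task (hypothetical schema)."""
    task_id: str
    index: int                           # 0 = first pass, 1 = first attempt after feedback
    passed: bool
    failure_class: Optional[str] = None  # None when the attempt passed


def _failed_first_with_retry(attempts: List[Attempt]) -> List[List[Attempt]]:
    """Group attempts by task and keep tasks that failed first and were retried."""
    grouped: Dict[str, List[Attempt]] = defaultdict(list)
    for a in attempts:
        grouped[a.task_id].append(a)
    sequences = [sorted(seq, key=lambda a: a.index) for seq in grouped.values()]
    return [seq for seq in sequences if not seq[0].passed and len(seq) > 1]


def recovery_rate(attempts: List[Attempt]) -> Optional[float]:
    """Share of initially failed tasks that pass on any later attempt."""
    failed = _failed_first_with_retry(attempts)
    if not failed:
        return None
    recovered = sum(any(a.passed for a in seq[1:]) for seq in failed)
    return recovered / len(failed)


def repeated_error_rate(attempts: List[Attempt]) -> Optional[float]:
    """Share of initially failed tasks whose next attempt fails in the same class."""
    failed = _failed_first_with_retry(attempts)
    if not failed:
        return None
    repeated = sum(
        (not seq[1].passed) and seq[1].failure_class == seq[0].failure_class
        for seq in failed
    )
    return repeated / len(failed)


# Illustrative data only; the failure class labels are made up.
log = [
    Attempt("t1", 0, False, "wrong_tool"),
    Attempt("t1", 1, True),
    Attempt("t2", 0, False, "wrong_tool"),
    Attempt("t2", 1, False, "wrong_tool"),
    Attempt("t3", 0, True),
    Attempt("t4", 0, False, "unsupported_claim"),
    Attempt("t4", 1, False, "timeout"),
]

print(recovery_rate(log), repeated_error_rate(log))
```

In this sample, three tasks fail on the first pass; one recovers and one repeats the same failure class, so both rates are one third. The task that fails in a different class on retry counts toward neither, which is one reason to report the two numbers separately.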
Drift
Evaluation needs stable criteria. If the grader changes standards between attempts, improvement may be an illusion.
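One possible way to guard against grader drift is to fingerprint the scoring rule and refuse to compare attempts scored under different fingerprints. The sketch below assumes the scoring rule can be expressed as a JSON-serializable dictionary; `rule_fingerprint`, `comparable`, and the example rule fields are hypothetical names for illustration.

```python
import hashlib
import json


def rule_fingerprint(scoring_rule: dict) -> str:
    """Return a stable SHA-256 hash of a scoring rule.

    Keys are sorted so that two dicts with the same content always
    produce the same fingerprint regardless of insertion order.
    """
    canonical = json.dumps(scoring_rule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def comparable(rule_before: dict, rule_after: dict) -> bool:
    """Scores are comparable only if both attempts used the same rule."""
    return rule_fingerprint(rule_before) == rule_fingerprint(rule_after)


# Hypothetical rules: the second one quietly lowers the pass threshold.
rule_v1 = {"rubric": "v1", "pass_threshold": 0.8}
rule_v1_reordered = {"pass_threshold": 0.8, "rubric": "v1"}
rule_loosened = {"rubric": "v1", "pass_threshold": 0.6}

print(comparable(rule_v1, rule_v1_reordered))  # same content, different key order
print(comparable(rule_v1, rule_loosened))      # threshold changed between attempts
```

A fingerprint only detects changes in the written rule. It does not catch a human or model grader applying the same rule more leniently, so it is a necessary check rather than a sufficient one.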
A useful benchmark records prompt, input, output, failure class, correction, next attempt, and scoring rule.
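The seven fields listed above could be captured as one record per failure, written as a JSON line so the log stays append-only and easy to diff. This is a sketch under assumptions: the `FailureRecord` type, the choice of plain strings for every field, and the sample values are illustrative, not a defined format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureRecord:
    """One logged failure, mirroring the fields named in the text."""
    prompt: str                  # instruction given to the agent
    input: str                   # task data the agent worked on
    output: str                  # what the agent produced
    failure_class: str           # label for the kind of failure
    correction: str              # feedback given after the failure
    next_attempt: Optional[str]  # output after feedback, if one exists yet
    scoring_rule: str            # identifier of the rule used to grade


def to_jsonl_line(record: FailureRecord) -> str:
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
record = FailureRecord(
    prompt="Summarize the report and cite the source table.",
    input="report_text_placeholder",
    output="Summary without any citation.",
    failure_class="missing_citation",
    correction="The summary must cite the source table.",
    next_attempt=None,
    scoring_rule="rubric-v1",
)

line = to_jsonl_line(record)
print(line)
```

Keeping `next_attempt` optional lets a failure enter the record before the retry happens, and storing `scoring_rule` on every record makes it possible to check later that both attempts were graded by the same standard.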
Boundary
The public claim should stay bounded: it should name the failure class, the horizon, the evidence, and the cases that remain unresolved.
Evidence Route
This page is an entry point, not a standalone proof. It routes the reader to evidence, DOI records, registries, public challenge paths, and explicit non-claims.
The score should be tied to failure classes and evidence records.
Stable grading criteria matter because inconsistent grading can fake improvement.