
The benchmark starts after the agent fails.

Capability benchmarks often ask what a system can do in a single attempt. Reliability evaluation asks what changes after failure, ambiguity, and correction.

Keywords: AI agent evaluation after failure, LLM evaluation, AI benchmark, failure learning benchmark, agent reliability evaluation, WisdomBench

Search Intent

The reader wants an evaluation method for failure learning, not a single score.

  • How do you evaluate an AI agent after it fails?
  • What is failure learning?
  • What should a benchmark log?
  • How does this relate to WisdomBench?

Metric

A useful score must record what changed.

If an agent repeats the same error after feedback, high first-pass fluency is not enough. The evaluation should track recovery, boundary update, repeated failure, and whether the system learned the right lesson.
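As a minimal sketch of how some of these quantities might be counted, the Python below assumes a hypothetical Episode record (one failure, one round of feedback, one retry) and illustrative failure-class labels. None of these names come from WisdomBench. It covers recovery and repeated failure only. Boundary updates and whether the right lesson was learned need richer evidence than a single label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    """One fail -> feedback -> retry cycle (hypothetical schema)."""
    task_id: str
    first_failure: str            # failure class on the first attempt
    retry_failure: Optional[str]  # failure class on the retry, None if it passed


def summarize(episodes):
    """Return the fraction of episodes in each post-feedback outcome."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to summarize")
    # Recovered: the retry passed.
    recovered = sum(e.retry_failure is None for e in episodes)
    # Repeated: the retry failed in the same class as the first attempt.
    repeated = sum(e.retry_failure == e.first_failure for e in episodes)
    return {
        "recovery_rate": recovered / n,
        "repeated_failure_rate": repeated / n,
        # Remaining episodes failed again, but in a different class.
        "new_failure_rate": (n - recovered - repeated) / n,
    }


# Illustrative data only; the labels are made up for this example.
episodes = [
    Episode("t1", "wrong_tool", None),
    Episode("t2", "wrong_tool", "wrong_tool"),
    Episode("t3", "bad_citation", "timeout"),
    Episode("t4", "bad_citation", None),
]
summary = summarize(episodes)
print(summary)
```

Reporting the three rates side by side keeps a high recovery rate from hiding a cluster of repeated failures in one class.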

A failure can improve the system only if it enters the record.

Drift

The grader can drift too.

Evaluation needs stable criteria. If the grader changes standards between attempts, improvement may be an illusion.
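One way to check for this, sketched here under stated assumptions, is to keep a frozen set of anchor outputs, re-grade them before each evaluation round, and compare the new scores with the original ones. The function name, the anchor ids, and the numeric score scale below are all hypothetical.

```python
def drift_report(baseline, regraded, tolerance=0.0):
    """Compare two gradings of the same frozen anchor outputs.

    baseline, regraded: dicts mapping anchor id -> numeric score.
    Returns the sorted anchor ids whose score moved by more than `tolerance`.
    """
    if baseline.keys() != regraded.keys():
        raise ValueError("anchor sets differ; cannot compare gradings")
    return sorted(
        anchor_id
        for anchor_id in baseline
        if abs(regraded[anchor_id] - baseline[anchor_id]) > tolerance
    )


# Illustrative scores: the grader now gives partial credit on anchor a2.
baseline = {"a1": 1.0, "a2": 0.0, "a3": 0.5}
regraded = {"a1": 1.0, "a2": 0.5, "a3": 0.5}
drifted = drift_report(baseline, regraded)
print(drifted)  # ['a2']
```

If any anchor drifts, score differences between attempts in that round cannot be attributed to the agent alone.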

A useful benchmark records prompt, input, output, failure class, correction, next attempt, and scoring rule.
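The seven fields listed above could be captured as one log entry per failure. The sketch below is an assumption about shape, not a WisdomBench schema: the class name, the field names, and the sample values are hypothetical, and `task_input` stands in for "input" to avoid shadowing the Python builtin.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FailureRecord:
    """One logged failure and its follow-up (hypothetical schema)."""
    prompt: str
    task_input: str
    output: str
    failure_class: str
    correction: str
    next_attempt: str
    scoring_rule: str


# Sample values are invented for illustration.
record = FailureRecord(
    prompt="Summarize the report and cite a page for each claim.",
    task_input="report.txt",
    output="Summary with no page citations.",
    failure_class="missing_citation",
    correction="Each claim needs a page number.",
    next_attempt="Summary with a page number after each claim.",
    scoring_rule="pass if every claim carries a page citation",
)

# One JSON object per line keeps the log appendable and easy to diff.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

Storing the scoring rule in the same entry as the attempt makes it possible to tell later whether a score changed because the agent changed or because the rule did.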

Boundary

Failure learning is not the same as unlimited self-improvement.

The public claim should stay bounded: state which failure class it covers, over what horizon, on what evidence, and which cases remain unresolved.

Evidence Route

Where the claim can be checked.

This page is an entry point, not a standalone proof. It routes the reader to evidence, DOI records, registries, public challenge paths, and explicit non-claims.

Kind | Anchor | URL | Role
Evidence Map | Public evidence map | https://mianzhang.org/evidence/ | Start from supported claims and known boundaries.
Paper Index | DOI and paper status map | https://mianzhang.org/papers/ | Use paper-specific DOI records for paper claims.
Registries | Machine-readable registries | https://mianzhang.org/registries/ | Inspect claim, evidence, counterexample, and action records.
Challenge | Counterexample route | https://mianzhang.org/counterexamples/ | Attack overbroad claims through public routes.
Archive | Zenodo portfolio index | https://zenodo.org/records/20027295 | Long-term archive index; cite specific DOI records where available.
Benchmark | WisdomBench failure learning | https://mianzhang.org/benchmarks/wisdombench-failure-learning/ | Failure learning benchmark route.

Boundary

What this page does not prove.

  • This page does not claim solved general intelligence.
  • It does not certify all agents as self-improving.
  • It does not replace task-specific evaluation.

FAQ

What matters after failure?

Recovery, boundary update, repeated error rate, and evidence of changed behavior.

Is a higher score enough?

No. The score should be tied to failure classes and evidence records.

Why track grader drift?

Because inconsistent grading can fake improvement.
