Public Release - 2026-06-17

Before a benchmark can win, it must answer the same claim.

Artifact: Same-Claim Baseline Intake v1. This page defines public intake fields for baseline challenges so a stronger baseline can sharpen a claim without changing the question being tested.

target_claim candidate_baseline input_boundary failure_standard reproduction_route

[Figure: Same-claim baseline intake. Benchmarks are compared only after input, metric, failure standard, and reproduction route align.]

Artifact

A stronger baseline is welcome only when it attacks the same claim.

This update defines a public intake object for baseline comparisons. A comparison may use real numbers and still fail as evidence if it changes the input, metric, budget, failure standard, or reproduction route.

The public object is not a social-media ranking. It is a field checklist for deciding whether a candidate baseline challenges the original claim, narrows it, or belongs to a different question.

Target claim

The exact claim being challenged, not a neighboring slogan or broader category.

Candidate baseline

The baseline method, system, setting, or route proposed as the comparison.

Input boundary

The context, data, task route, tool access, prompt surface, and budget available to each side.

Metric

The measured quantity and whether it actually corresponds to the target claim.

Failure standard

The condition under which each side is counted as failed, downgraded, or out of scope.

Claim effect

Whether the baseline strengthens, weakens, narrows, or leaves the claim unchanged.
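The fields above can be sketched as a small record type. This is a minimal illustration, not a published schema: the class names `BaselineIntake` and `ClaimEffect` and the helper `missing_fields` are assumptions made for this example, while the field names and the effect values follow the intake matrix on this page.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimEffect(Enum):
    """Bounded effects a baseline may have on the target claim."""
    STRENGTHEN = "strengthen"
    WEAKEN = "weaken"
    NARROW = "narrow"
    UNCHANGED = "unchanged"
    DIFFERENT_QUESTION = "different_question"


@dataclass(frozen=True)
class BaselineIntake:
    """One baseline challenge, recorded field by field (hypothetical type)."""
    target_claim: str
    candidate_baseline: str
    input_boundary: str
    metric: str
    failure_standard: str
    reproduction_route: str
    claim_effect: ClaimEffect

    def missing_fields(self) -> list:
        """Return the names of text fields that were left blank."""
        return [
            name for name, value in vars(self).items()
            if isinstance(value, str) and not value.strip()
        ]
```

An intake with a blank `input_boundary` would report that field from `missing_fields()`, which is the cue to send the challenge back before any score is read.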

Comparison Boundary

Different input means different evidence.

A baseline that received complete context cannot directly defeat a method that received partial context unless the input boundary is stated. A system that makes repeated tool calls cannot be compared directly with a one-shot answer unless the budget and route are recorded.

Those comparisons may still be useful. They may reveal a stronger route, a missing control, or a better experimental design. They should not be sold as evidence for or against the original claim until the fields align.
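One way to keep input differences from staying silent is to record each side's inputs as key-value pairs and list every key that differs. The sketch below assumes a flat dictionary per side; the function name `input_differences`, the key names, and the sample values are all hypothetical.

```python
def input_differences(side_a: dict, side_b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every input that differs.

    A key present on only one side is reported with None for the other.
    """
    keys = sorted(set(side_a) | set(side_b))
    return {
        key: (side_a.get(key), side_b.get(key))
        for key in keys
        if side_a.get(key) != side_b.get(key)
    }


# Illustrative values only: a one-shot method versus a tool-using baseline.
method = {"context": "partial", "tool_calls": 1, "prompt": "P1"}
baseline = {"context": "complete", "tool_calls": 8, "prompt": "P1"}

diffs = input_differences(method, baseline)
print(diffs)  # {'context': ('partial', 'complete'), 'tool_calls': (1, 8)}
```

A non-empty result does not make the comparison useless. It means the differences belong in the input boundary field before anyone reads the score.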

[Figure: Input boundary inspection. Checks whether two methods received the same task surface.]

Input Gate

The first gate is not score. It is comparability.

Before a score enters the evidence layer, the intake asks whether the baseline used the same target claim, input boundary, metric, failure standard, and reproduction route.

If the answer is no, the result may remain in the registry as a lead. It should not be upgraded into a claim-level proof.
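The gate can be sketched as a comparison over the five fields named above, run before any score is considered. This is a simplified illustration: the function name `input_gate` and the status strings `"evidence"` and `"lead"` are assumptions, and exact string equality stands in for what would in practice be a reviewer's judgment about whether two field values match.

```python
GATE_FIELDS = (
    "target_claim",
    "input_boundary",
    "metric",
    "failure_standard",
    "reproduction_route",
)


def input_gate(claim_side: dict, baseline_side: dict) -> tuple:
    """Return (status, mismatched_fields) for a proposed comparison.

    A field counts as mismatched when it is missing on the claim side
    or when the two recorded values differ.
    """
    mismatched = [
        name for name in GATE_FIELDS
        if claim_side.get(name) is None
        or claim_side.get(name) != baseline_side.get(name)
    ]
    status = "lead" if mismatched else "evidence"
    return status, mismatched


# Hypothetical field values for illustration.
claim = {
    "target_claim": "claim-A",
    "input_boundary": "partial context, one shot",
    "metric": "task accuracy",
    "failure_standard": "wrong final answer",
    "reproduction_route": "public issue thread",
}
baseline = dict(claim, input_boundary="complete context, 8 tool calls")

print(input_gate(claim, baseline))  # ('lead', ['input_boundary'])
```

Note that the score never appears in the function signature. A result that fails the gate stays in the registry as a lead regardless of how large the number is.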

Failure Standard

A changed failure rule can steal the conclusion.

The most common failure is not a fake number. It is a real number attached to the wrong conclusion. A demo that performs better under a friendlier task route does not automatically prove broader superiority.

The intake forces the comparison to say what failed, under which standard, and what the outcome does to the public claim.
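A short arithmetic sketch shows how a changed failure rule can flip a conclusion while every number stays real. The scores and thresholds below are made up for illustration, and `pass_rate` is a hypothetical helper, not part of the intake.

```python
def pass_rate(scores: list, threshold: float) -> float:
    """Share of runs that pass when a run fails below `threshold`."""
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up scores for illustration only.
method_scores = [0.62, 0.71, 0.58, 0.80]
baseline_scores = [0.55, 0.66, 0.61, 0.74]

# Mismatched rules: the baseline is judged at 0.5, the method at 0.7.
print(pass_rate(baseline_scores, 0.5))  # 1.0
print(pass_rate(method_scores, 0.7))    # 0.5

# Normalized rule: both sides are judged at 0.7.
print(pass_rate(baseline_scores, 0.7))  # 0.25
print(pass_rate(method_scores, 0.7))    # 0.5
```

Under the mismatched rules the baseline appears to win. Under one shared rule it does not. Neither set of numbers is fake; only the first comparison attaches them to the wrong conclusion.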

Matrix

Same-claim baseline intake fields.

| Field | Required value | Failure mode | Repair route |
|---|---|---|---|
| target_claim | The exact public claim under review. | Baseline attacks a different claim. | Restate claim id and scope. |
| candidate_baseline | The method, system, or setting proposed as comparison. | Baseline is only a label. | Specify method and route. |
| input_boundary | Context, data, prompt, tool access, and budget. | Inputs differ silently. | Record all input differences. |
| metric | The measured quantity and unit. | Metric measures a neighboring question. | Map metric to claim. |
| failure_standard | The rule for failure, downgrade, or out-of-scope status. | One side receives an easier failure rule. | Normalize failure criteria. |
| reproduction_route | The public-safe route for repeating or auditing the comparison. | Result cannot be checked. | Add artifact or issue route. |
| claim_effect | strengthen, weaken, narrow, unchanged, or different_question. | Result is over-amplified. | Assign bounded effect. |
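The matrix can also be read as a lookup from a flagged field to its failure mode and repair route. The sketch below copies the wording from the matrix; the names `REPAIR_TABLE` and `repair_plan` are assumptions introduced for this example.

```python
# field -> (failure mode, repair route), copied from the intake matrix.
REPAIR_TABLE = {
    "target_claim": ("Baseline attacks a different claim.",
                     "Restate claim id and scope."),
    "candidate_baseline": ("Baseline is only a label.",
                           "Specify method and route."),
    "input_boundary": ("Inputs differ silently.",
                       "Record all input differences."),
    "metric": ("Metric measures a neighboring question.",
               "Map metric to claim."),
    "failure_standard": ("One side receives an easier failure rule.",
                         "Normalize failure criteria."),
    "reproduction_route": ("Result cannot be checked.",
                           "Add artifact or issue route."),
    "claim_effect": ("Result is over-amplified.",
                     "Assign bounded effect."),
}


def repair_plan(flagged_fields: list) -> list:
    """Return one 'field: repair route' line per flagged field."""
    unknown = [name for name in flagged_fields if name not in REPAIR_TABLE]
    if unknown:
        raise KeyError(f"not an intake field: {unknown}")
    return [f"{name}: {REPAIR_TABLE[name][1]}" for name in flagged_fields]


print(repair_plan(["input_boundary", "metric"]))
# ['input_boundary: Record all input differences.', 'metric: Map metric to claim.']
```

Rejecting unknown field names is a deliberate choice in this sketch: a mismatch that cannot be named against one of the seven fields is not yet specific enough to repair.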

[Figure: Baseline evaluation ledger. Shows verified match status for same-claim comparisons.]

Receipt

A useful baseline leaves a match status.

The intake should end with a receipt: same claim, same input boundary, same metric, same failure standard, same reproduction route, or a specific mismatch.

That receipt is the difference between a serious baseline challenge and a scoreboard that changed the rules mid-game.
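A receipt of this kind could be produced as one line per field. The output wording below follows the receipt described above; the function name `receipt`, the label mapping, and the boolean input format are assumptions for illustration.

```python
RECEIPT_LABELS = {
    "target_claim": "claim",
    "input_boundary": "input boundary",
    "metric": "metric",
    "failure_standard": "failure standard",
    "reproduction_route": "reproduction route",
}


def receipt(matches: dict) -> list:
    """Return one status line per field.

    `matches` maps a field name to True when both sides agree on it.
    A field that is absent from `matches` is reported as a mismatch.
    """
    lines = []
    for field, label in RECEIPT_LABELS.items():
        if matches.get(field, False):
            lines.append(f"same {label}")
        else:
            lines.append(f"mismatch: {label}")
    return lines


# Hypothetical review outcome: everything aligned except the metric.
status = receipt({
    "target_claim": True,
    "input_boundary": True,
    "metric": False,
    "failure_standard": True,
    "reproduction_route": True,
})
print(status)
```

Treating an unreviewed field as a mismatch keeps the receipt conservative: a comparison earns "same" only for fields someone has actually checked.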

Challenge

Challenge one baseline comparison.

Point to any comparison where the input, metric, failure standard, or reproduction route changed while the conclusion was presented as a direct win.

The strongest challenge names the target claim, the candidate baseline, the mismatched field, and the bounded effect on the claim.