Evidence Discipline

Evidence map and claim boundaries.

The project is organized around inspectable claims: what was measured, what was logged, what is still missing, and which claims should not be made.

Evidence and residual scoring visual.

Evidence Layers

What supports the research program.

Different papers rest on different evidence layers. A paper's public claim is strongest when it combines protocol, raw logs, provenance, negative results, and explicit stop rules.

Layer | Evidence | Where used | Boundary
E0 | Formal definitions, protocols, equations, and claim registries | P04-P21 | Framework support, not empirical deployment proof
E1 | Longitudinal API / text-agent panels with repeated rounds and seeds | P01-P04 | Text-agent evidence, not robot evidence
E2 | RLBench self-trained low-dimensional imitation baseline, 6,300 trials | P05-P06 | Not a public VLA/SOTA leaderboard
E3 | Public-factory sidecars and supported-set VLA/LIBERO evaluation | P05-P07 | Supported-set evidence, not full benchmark domination
E4 | Macro mining, representation compression, and perspectival grounding registries | P08 | Representation-search evidence, not priority over all compression research
E5 | Failure-antigen labels, recovery-adapter pilots, and trajectory provenance | P09 | Offline/pilot recovery evidence, not full trained-policy improvement
E6 | World-model counterfactual, cybernetic, social, and supra-body ablation artifacts | P10-P14 | Simulation and protocol evidence, not real-world deployment
E7 | Public simulator bridges and AI-for-science benchmark proposals | P19 | Benchmark framing, not climate-control authority
E8 | Adverse perception, detector/tracker logs, DAWN1027/COCO128/public panels | P20 | Robust evidence gating, not detector SOTA
E9 | Human-intervention residual schemas and synthetic/public teleoperation ingestion | P21 | No human-subject or wearable-robot validation yet
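
The table can be read as a small machine-readable registry in which every layer carries its own boundary. The Python sketch below is illustrative only: the EvidenceLayer type, its field names, and the boundary_for helper are assumptions rather than part of the published registries; the one row shown is copied from the E1 entry above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceLayer:
    layer_id: str   # e.g. "E1"
    evidence: str   # what was actually produced and logged
    papers: str     # where the layer is used, e.g. "P01-P04"
    boundary: str   # what this layer does NOT establish

# One entry copied from the table above; the remaining rows (E0-E9)
# would follow the same pattern.
LAYERS = {
    "E1": EvidenceLayer(
        layer_id="E1",
        evidence="Longitudinal API / text-agent panels with repeated rounds and seeds",
        papers="P01-P04",
        boundary="Text-agent evidence, not robot evidence",
    ),
}

def boundary_for(layer_id: str) -> str:
    """Return the explicit limitation attached to an evidence layer."""
    return LAYERS[layer_id].boundary
```

A claim that cites a layer then inherits that layer's boundary string verbatim, which keeps the limitation attached to the evidence rather than to the surrounding prose.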

Claim Boundary

What the public archive does not claim.

The project does not claim universal proof over all AI systems, physical robot deployment, detector SOTA, autonomous enforcement, trading returns, medical validity, or digital identity continuity.

The strongest current public claim is narrower: reliable intelligence should be evaluated after experience, failure, feedback, and perturbation, and the evidence must retain provenance and explicit limitations.
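
One way to keep that narrower claim honest in tooling is to store the non-claims next to the claim itself. The sketch below is an assumption about how such a registry entry could look; the Claim type, its fields, and the boundary wording are illustrative, not the project's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    statement: str            # the claim as published
    supporting_layers: list   # evidence-layer IDs from the table above
    boundary: str             # what the claim deliberately does not cover
    non_claims: list = field(default_factory=list)

CORE_CLAIM = Claim(
    statement=(
        "Reliable intelligence should be evaluated after experience, failure, "
        "feedback, and perturbation, with provenance and explicit limitations retained."
    ),
    supporting_layers=[],  # illustrative; a real entry would cite layer IDs such as "E1"
    boundary="Evaluation-methodology claim, not a deployment guarantee",
    non_claims=[
        "universal proof over all AI systems",
        "physical robot deployment",
        "detector SOTA",
        "autonomous enforcement",
        "trading returns",
        "medical validity",
        "digital identity continuity",
    ],
)
```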

Gate 1: Metric

What exactly is measured: first-attempt success, improvement, recovery, transfer, or action gating?

Gate 2: Provenance

Raw logs, manifests, versioned files, supported-set boundaries, and no duplicate evidence cells.
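
For the no-duplicate-cells requirement in particular, a content-hash pass over a manifest is enough to flag repeats. The sketch below assumes a hypothetical manifest format in which each cell records a log path, a metric name, and a condition; none of these field names come from the published manifests.

```python
import hashlib
import json

def cell_fingerprint(cell: dict) -> str:
    """Hash the raw log bytes plus the (metric, condition) key of a cell."""
    with open(cell["log_path"], "rb") as f:
        log_digest = hashlib.sha256(f.read()).hexdigest()
    key = json.dumps({"metric": cell["metric"], "condition": cell["condition"]},
                     sort_keys=True)
    return hashlib.sha256((log_digest + key).encode()).hexdigest()

def find_duplicate_cells(manifest: list) -> list:
    """Return cells whose fingerprint has already appeared earlier in the manifest."""
    seen, duplicates = set(), []
    for cell in manifest:
        fp = cell_fingerprint(cell)
        if fp in seen:
            duplicates.append(cell)
        seen.add(fp)
    return duplicates
```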

Gate 3: Negative Results

Failed cells, weak baselines, low success counts, and limitations are retained rather than hidden.

Gate 4: Cost

Evidence should state whether it came from local runs, API panels, cloud GPU work, public simulators, or small pilots.
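
Taken together, the four gates can be phrased as a pre-publication checklist. The function below is a minimal sketch under assumed field names (metric, raw_logs, manifest, negative_results, cost_source); it is not the project's release tooling.

```python
ALLOWED_METRICS = {"first-attempt success", "improvement", "recovery",
                   "transfer", "action gating"}
ALLOWED_COST_SOURCES = {"local runs", "API panels", "cloud GPU work",
                        "public simulators", "small pilots"}

def passes_gates(claim: dict) -> list:
    """Return a list of gate failures; an empty list means the claim may ship."""
    failures = []
    if claim.get("metric") not in ALLOWED_METRICS:                 # Gate 1
        failures.append("Gate 1: metric is missing or not precisely named")
    if not (claim.get("raw_logs") and claim.get("manifest")):      # Gate 2
        failures.append("Gate 2: raw logs or manifest missing")
    if "negative_results" not in claim:                            # Gate 3
        failures.append("Gate 3: negative results and limitations not retained")
    if claim.get("cost_source") not in ALLOWED_COST_SOURCES:       # Gate 4
        failures.append("Gate 4: compute/cost source not stated")
    return failures
```

A claim dictionary missing only its manifest would come back with a single Gate 2 failure, which is the intended behaviour: the claim is held rather than published.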

Supra-body architecture visual.

Why It Matters

Reliable action needs more than a single model response.

The evidence program treats perception, memory, social calibration, failure recovery, workflow context, and action boundaries as coordinated subsystems. This is why the public site separates papers from evidence and why every strong claim has a boundary.

  • Use DOI records for public priority and inspectability.
  • Use anonymous packages for double-blind venues.
  • Use evidence pages to explain what is proven, partial, or pending.
  • Reserve real-robot and independent human-study claims for future external validation.