
Evidence & authority

The 1-5 rank measures verdict independence: how much the check can fail on its own, from a 1 where the fixture hands over the answer to a 5 where the harness derives the verdict across the whole declared contract and can fail with no fixture echo. It does not measure how heavy the machinery is, how much real tooling runs, or how mature the component is.

Each of the 78 components declares how it is backed: an evidence class and a strength rank from 1 to 5. The rank measures one thing only, verdict independence: how far the check can fail on its own rather than echo an answer the fixture already supplied. At 1, the fixture hands over the verdict and the component only checks its shape. At 5, the harness derives the verdict across the component's whole declared contract and can fail with no fixture-supplied answer. The rank does not measure how much machinery runs or how mature a component is, so read the ordering carefully: a component that runs a real external tool over a deliberately small scope can sit at 4, while a contract validator with no external tool sits at 5. The classes below are ordered by how many components carry them, and each states what it checks and what it does not.

A class and a rank describe how a component's own public contract and fixtures are checked, not whole-system correctness, live freshness, or anything past the component's stated scope. Each component card's “Scope limit” line holds that boundary.
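As a rough illustration of what such a declaration could look like as data, here is a minimal sketch. The record shape, the field names (`evidence_class`, `rank`, `scope_limit`), and the `tally_by_rank` helper are assumptions made for this example, not the project's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceDeclaration:
    """Hypothetical record: how one component says it is backed."""
    component: str
    evidence_class: str
    rank: int          # 1..5, verdict independence only
    scope_limit: str   # boundary past which the check claims nothing

    def __post_init__(self):
        # Reject ranks outside the declared 1..5 scale.
        if not 1 <= self.rank <= 5:
            raise ValueError(f"rank must be 1..5, got {self.rank}")
        # A declaration without a stated boundary is treated as incomplete.
        if not self.scope_limit.strip():
            raise ValueError("scope_limit must be stated")


def tally_by_rank(declarations):
    """Count how many declarations sit at each rank."""
    return Counter(d.rank for d in declarations)
```

The scope-limit strings passed in would be illustrative placeholders; the real wording lives on each component card.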

Why these modes

Microcosm is the public release of a larger working system, and the evidence classes are the release lanes components took to get here. A verified source import carries real code across under content-digest checks. Computed projections and bounded replays exercise that code over public fixtures, where private data and live services cannot follow. Contract validators publish the checks themselves, so they can fail in public. External tool runs close the loop with real machinery, such as Lean and finance statistics code, where a real tool fits a bounded public form. A release where every component ran live external tools would need the live system; this one shows the mechanism, the code, and the check in their inspectable forms instead. The class records the lane a component took, not a quality tier.

Components by rank: rank 5 · 39, rank 4 · 12, rank 3 · 27

If you are wondering why a Contract validator (5) outranks a component that compiles Lean or runs real statistics (4): the rank scores verdict independence, not engineering weight. An External tool run does invoke the real tool, but on this public slice it claims only a bounded witness, the tool's return code plus a few output checks over a small scope, so it is capped at 4. A Contract validator earns 5 because the harness derives the verdict over the component's whole declared contract, with no fixture-supplied answer to lean on. The same ordering holds down the scale: a validator with no external tool can outrank one that runs code when it checks more of its own contract unaided. Components that genuinely run external tools are flagged separately below.

Components that actually run things · Runs real tools

A 4 here often means more machinery, not less. These components execute a real tool or runtime: they compile Lean through Lake, run forecast-evaluation statistics over market-shaped fixtures, or step a small NumPy model forward. They are capped at 4 because each claims only a bounded witness over a small scope, never a general proof. Look for the Runs real tools marker on a component to spot them.

Computed projection (27)

A deterministic projection verified by recomputing it from source rather than by a live run; negative cases are policy checks, not real-world validation. Rank 3: the code computes the result, but failure coverage is partial.

Agent Benchmark Integrity Anti Gaming Replay · 3/5
Agent Memory Temporal Conflict Replay · 3/5
Agent Monitor Redteam Falsification Replay · 3/5
Agent Sabotage Scheming Monitor Replay · 3/5
Agent Sandbox Policy Escape Replay · 3/5
Agentic Vulnerability Discovery Patch Proof Replay · 3/5
Audio Level RMS Port · 3/5
Belief State Process Reward Replay · 3/5
Compliance Pipeline Bundle · 3/5
Formal Evidence Cell Anchor Resolver · 3/5
Formal Math Premise Retrieval · 3/5
Formal Math Readiness Gate · 3/5
Formal Math Verifier Trace Repair Loop · 3/5
Indirect Prompt Injection Information Flow Policy Replay · 3/5
Lean Std Premise Index · 3/5
MCP Tool Authority Replay · 3/5
Materials Chemistry Closed Loop Lab Safety Replay · 3/5
Mathematical Strategy Atlas Hypothesis Scorer · 3/5
Prediction Oracle Reconciliation · 3/5
Proof Diagnostic Evidence Spine · 3/5
Research Replication Rubric Artifact Replay · 3/5
Ring2 Premise Retrieval Precision Recall Harness · 3/5
Self Ignorance Coverage Ledger · 3/5
Sleeper Memory Poisoning Quarantine Replay · 3/5
Tactic Portfolio Availability Probe · 3/5
Target Shape Tactic Routing Gate · 3/5
Undeclared Library Prior Symbol Classifier · 3/5
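The recompute-and-compare idea behind a computed projection can be sketched as follows. This is an illustrative toy, not the code of any listed component: the root-mean-square projection, the function names, and the empty-input policy rule are all assumptions chosen for the example.

```python
import math


def rms(samples):
    """Deterministic projection: root mean square of a list of samples."""
    if not samples:
        # Negative case handled as a policy check, not real-world validation.
        raise ValueError("policy: empty sample list is rejected")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def check_projection(source_samples, stored_value, tol=1e-9):
    """Recompute the projection from source and compare with the stored value."""
    recomputed = rms(source_samples)
    return math.isclose(recomputed, stored_value, rel_tol=0.0, abs_tol=tol)
```

Here `check_projection([1.0, -1.0, 1.0, -1.0], 1.0)` passes because the recomputed value matches, while a stored value of `0.5` would fail. The check exercises the code rather than echoing the fixture, but it covers only the cases the fixtures happen to include.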

Verified source import (21)

A public source body is copied and validated against its origin byte for byte; the check fails on a missing target, a placeholder digest, an unverified body, or a launch or private-equivalence overclaim. Rank 5: a fully independent provenance verdict.
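A digest-and-bytes check of this kind might look like the sketch below. The function name `verify_import`, the set of placeholder digest forms, and the choice of SHA-256 are assumptions for illustration; the launch and private-equivalence overclaim checks are not modeled here.

```python
import hashlib
from pathlib import Path

# Assumed placeholder forms; a real release would define its own.
PLACEHOLDER_DIGESTS = {"", "TODO", "0" * 64}


def verify_import(target: Path, origin_bytes: bytes, declared_sha256: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the import verifies."""
    failures = []
    if declared_sha256 in PLACEHOLDER_DIGESTS:
        failures.append("placeholder digest")
    if not target.is_file():
        failures.append("missing target")
        return failures  # nothing further can be checked without the file
    body = target.read_bytes()
    # The declared digest must match the bytes actually present in the release.
    if hashlib.sha256(body).hexdigest() != declared_sha256:
        failures.append("digest mismatch")
    # Byte-for-byte comparison against the origin body.
    if body != origin_bytes:
        failures.append("unverified body")
    return failures
```

Returning reasons rather than a single boolean keeps each failure mode visible, which matters when the point of the check is that it can fail in public.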

Contract validator (20)

The harness derives the verdict over the component's whole declared public contract and can fail with no answer supplied by the fixture. Rank 5: the most independent check on this slice.
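To make "derives the verdict" concrete, the following sketch evaluates every clause of a declared contract against an artifact. The contract clauses, the artifact fields, and the `derive_verdict` name are hypothetical; what matters is that no expected verdict is read from the fixture.

```python
def derive_verdict(contract, artifact):
    """Evaluate every declared clause; the verdict comes only from the clauses."""
    failed = [name for name, clause in contract.items() if not clause(artifact)]
    return {"passed": not failed, "failed_clauses": failed}


# Hypothetical contract: each clause is a predicate over the artifact.
contract = {
    "has_component_name": lambda a: isinstance(a.get("component"), str)
    and a["component"] != "",
    "rank_in_range": lambda a: type(a.get("rank")) is int and 1 <= a["rank"] <= 5,
    "scope_limit_stated": lambda a: bool(a.get("scope_limit")),
}
```

A rank-1 check would instead read something like an `expected_verdict` field from the fixture and confirm only its shape. Here the artifact carries no such field, so a malformed artifact fails on the clauses alone.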

External tool run (7) · Runs real tools

A real external tool, such as Lean or Lake, is run and its return code plus output checks are witnessed over a deliberately small scope. Rank 4: genuine execution, capped because the witness is bounded, not a general proof.

Agent Completion Faithfulness Audit · 4/5
Bounded Autonomy Campaign Packet · 4/5
Certificate Kernel Execution Lab · 4/5
Corpus Readiness Mathlib Absence Gate · 4/5
Finance Forecast Evaluation Spine · 4/5
Formal Math Lean Proof Witness · 4/5
Verifier Lab Execution Spine · 4/5
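A bounded witness of this sort, return code plus a few output checks, can be sketched with the standard library. The `witness_tool_run` helper is an assumption for illustration, and the Python interpreter stands in for the real tool so the example runs anywhere; a real component would pass its own Lake or Lean command line instead.

```python
import subprocess
import sys


def witness_tool_run(argv, required_substrings, timeout_s=60):
    """Run a tool once and record a bounded witness of what happened."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    # Output checks: every required substring must appear in stdout.
    missing = [s for s in required_substrings if s not in proc.stdout]
    return {
        "returncode": proc.returncode,
        "missing_output": missing,
        "passed": proc.returncode == 0 and not missing,
    }


# Stand-in tool: the current Python interpreter, not Lean or Lake.
result = witness_tool_run([sys.executable, "-c", "print('build ok')"], ["build ok"])
```

The witness says only that this one invocation exited cleanly and printed what was expected, which is why such a check stops short of a general proof.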

Bounded runtime computation (3) · Runs real tools

Real in-process computation runs over public inputs with predicted-versus-actual checks and negative cases, scoped to a declared toy runtime. Rank 4: genuine computation, capped at the bounds of that toy scope.

Mission Transaction Work Spine · 4/5
Provider Context Recipe Budget Policy · 4/5
Public Reveal Walkthrough · 4/5
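A predicted-versus-actual check over a toy runtime might be sketched as below. The relaxation model, its parameters, and the function names are invented for this example and written in plain Python for self-containment; they do not describe any listed component.

```python
def step(state, rate=0.5, target=10.0):
    """One step of a toy relaxation model: move a fraction of the way to target."""
    return state + rate * (target - state)


def run(state, n_steps):
    """Step the toy model forward n_steps times."""
    for _ in range(n_steps):
        state = step(state)
    return state


def check_case(initial, n_steps, predicted, tol=1e-9):
    """Compare the fixture's prediction with what the runtime actually computes."""
    actual = run(initial, n_steps)
    return abs(actual - predicted) <= tol
```

Starting from `0.0`, three steps give 5.0, 7.5, then 8.75, so `check_case(0.0, 3, 8.75)` passes and a negative case such as `check_case(0.0, 3, 9.0)` is rejected. The verdict depends on real computation, but only within the declared toy scope.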
