Microcosm

Area · 8 components

Research & science

Replays that stand in for scientific and forecasting workflows, run over synthetic fixtures.

Components

Research Replication Rubric Artifact Replay
Audits whether a paper-replication claim carries the full evidence trail.
3/5

Does
It checks whether a claim that an AI agent "replicated a research paper" comes with the paper trail a real replication would leave behind. It re-runs nothing; instead it confirms that the bundle names every required piece of evidence: a breakdown of the paper's contributions, a grading rubric, the list of allowed public inputs, a from-scratch repo scaffold, an experiment plan, the metric scripts, a roster of declared file hashes for the outputs (with every cited hash staying inside that roster), a grader report, a capped compute/runtime budget, an ablation diff, a failure list, and a cold-rerun result record. It also catches eight ways a claim can cheat: reusing the original authors' code, leaking a hidden rubric, calling a run a "success" when only a write-up backs it, asserting benchmark performance, leaking a private paper or dataset body, using unbounded compute, grading only the final answer, or pointing at a file hash that was never declared. The check runs on two made-up sample papers (one machine-learning method, one computational-science study), and the generated result record shows which of the eight cheats each test case triggered, rather than taking "it was replicated" on trust.

Scope limit
It validates only the shape and presence of synthetic replay metadata and result-record references. It does not run any experiment, metric script, or rerun; it makes no claim that a paper was actually replicated, that a benchmark result was achieved, or that the underlying science is correct; and it never calls providers, exposes private paper or data bodies, or authorizes public sharing or launch.

Run
PYTHONPATH=src python3 -m microcosm_core.organs.research_replication_rubric_artifact_replay run --input fixtures/first_wave/research_replication_rubric_artifact_replay/input --out receipts/first_wave/research_replication_rubric_artifact_replay

Paper module Research Replication Rubric Artifact Replay

Abstract

research_replication_rubric_artifact_replay is a public Microcosm component that turns "an agent replicated a paper" into a replayable evidence contract. It does not rerun a real paper, use external model services, certify benchmark performance, or grant any publishing-scope decision. It checks whether a public replay bundle exposes the objects a replication claim must cite before its authority can rise: contribution decomposition refs, rubric-tree refs, allowed input refs, scratch-scaffold refs, experiment-DAG refs, metric-script refs, declared artifact hashes, grader reports, runtime budgets, ablation diffs, failure taxonomies, cold-rerun refs, public execution-trace spans, and source-module digests.

The technical result is an R3 local artifact replay: one public metric script is executed over one allowed public input table, the produced output is compared with a declared output artifact, and the declared hash file is checked against that artifact. A successful run says the replay packet is structurally accountable, digest-bound, redaction-aware, and negative-case tested. It does not say that a real paper was independently replicated.

Purpose

The single question this component answers is narrow: before an agent is allowed to say it replicated a paper, can the claim be forced into a bundle that a cold runtime can check without trusting any prose? The interesting move is that the component refuses to treat "replicated" as one fact. It pulls the claim apart into the objects a real replication would have left behind (a contribution decomposition, a grading rubric tree, the allowed public inputs, an experiment DAG, metric scripts, declared artifact hashes, a grader report, a runtime budget, an ablation diff, a failure taxonomy, and a cold-rerun result record) and asks for each one by name.

What keeps this from being a checklist linter is the small executable core. The exported bundle does not just assert that an artifact hash exists. The runtime reads one public metric script, runs it over one allowed public input table, produces an output, and then checks that output against both the declared output artifact and the declared hash file. A replay row can name all the right refs and still fail here if the numbers do not reproduce. The negative-case fixtures attack exactly the gap a plausible fake would exploit: report-only success, benchmark-performance language, final-answer-only grading, undeclared hashes, and reuse of the original author's code.

The deliberately modest part is the subject matter. The two paper bundles are public synthetic examples, and the metric is a single sum over a small table. The component's value is the boundary, not the science. It does not run a real paper, call a provider, search compute without bound, or grant any launch or publishing-scope decision. It only makes a replication claim accountable enough that an independent reader can see where the evidence stops.

Telos

Research-agent demos often collapse four objects into one sentence: the paper, the runnable artifact, the grading rubric, and the evidence that an independent rerun happened. This component keeps those objects separate. A replay is admissible only when it names each evidence object and when the local runtime can check the public artifact replay without touching private paper bodies, non-public data bodies, hidden rubrics, model-output data, original-author code bodies, or launch and publishing-scope decisions.

The central bet is modest and technical: before any replication claim is made, the system can force the claim into a falsifiable bundle with declared hashes, bounded metric execution, metadata-only result records, and explicit scope boundaries.

Mechanism

The mechanism row is mechanism.research_replication_rubric_artifact_replay.validates_public_research_replication_replay. It runs in src/microcosm_core/organs/research_replication_rubric_artifact_replay.py and is backed by the functions run, run_replication_bundle, validate_source_module_imports, validate_projection_protocol, validate_replication_policy, validate_research_replays, _build_result, _freshness_basis, and the constants EXPECTED_NEGATIVE_CASES, AUTHORITY_CEILING, SOURCE_MODULE_MANIFEST_REF, BUNDLE_RESULT_NAME, and CARD_SCHEMA_VERSION.

The runtime has two modes:

  • Fixture mode reads fixtures/first_wave/research_replication_rubric_artifact_replay/input, includes positive replay rows plus eight negative-case fixtures, and writes first-wave result, board, validation, and sign-off result records.
  • Exported-bundle mode reads examples/research_replication_rubric_artifact_replay/exported_research_replication_bundle, validates the public runtime example, checks the source-module manifest, and writes receipts/runtime_shell/demo_project/organs/research_replication_rubric_artifact_replay/exported_research_replication_bundle_validation_result.json.

The proof object is the tuple:

  1. replication_policy.json, which states required replay fields, rubric axes, and forbidden claims.
  2. research_replays.json, which supplies two synthetic paper bundles that cite public inputs, metrics, artifact hashes, grader reports, budgets, failures, and cold-rerun result records.
  3. execution_artifacts/execution_artifact_manifest.json, which authorizes the replayable artifact relation.
  4. source_module_manifest.json, which names copied source bodies and digest obligations.
  5. Runtime result records, which expose refs, counts, digests, trace spans, and scope boundaries without embedding private bodies.

Metric-Script and Artifact Evidence

The exported bundle includes a small but real artifact-replay loop:

Role | Public artifact
Input body | execution_artifacts/inputs/public_synthetic_table.json
Input hash | execution_artifacts/inputs/public_synthetic_table.sha256.json
Metric script | execution_artifacts/metrics/public_sum_metric.json
Metric hash | execution_artifacts/metrics/public_sum_metric.sha256.json
Declared output | execution_artifacts/artifacts/result_table.json
Declared output hash | execution_artifacts/artifacts/result_table.sha256.json

run_replication_bundle reads execution_artifacts/execution_artifact_manifest.json, executes the public_sum_metric over the allowed public input, compares the produced payload with execution_artifacts/artifacts/result_table.json, and verifies the declared hash in execution_artifacts/artifacts/result_table.sha256.json. The focused tests mutate each side of that relation, so the pass is not just a field-presence check.
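
The replay relation described above can be sketched in a few lines. This is a minimal illustration, not the component's code: the table layout (`rows` with a `value` column), the canonical JSON serialization, the finding strings, and the helper names `canonical_bytes`, `sha256_hex`, and `replay_artifact` are all assumptions made for the example. Only the overall shape (run a sum metric over the allowed input, compare with the declared output, verify the declared hash) comes from the text.

```python
import hashlib
import json


def canonical_bytes(payload):
    """Serialize a JSON payload deterministically so its digest is stable (assumed scheme)."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha256_hex(payload):
    """Hex sha256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


def public_sum_metric(table):
    """Toy metric: sum the hypothetical 'value' column of the allowed public input table."""
    return {"metric": "public_sum_metric", "sum": sum(row["value"] for row in table["rows"])}


def replay_artifact(input_table, declared_output, declared_hash):
    """Return a list of finding strings; an empty list means the replay passed."""
    findings = []
    produced = public_sum_metric(input_table)
    if produced != declared_output:
        findings.append("produced_output_differs_from_declared_artifact")
    if sha256_hex(declared_output) != declared_hash:
        findings.append("declared_hash_does_not_match_declared_artifact")
    return findings


input_table = {"rows": [{"value": 2}, {"value": 3}, {"value": 5}]}
declared_output = {"metric": "public_sum_metric", "sum": 10}
declared_hash = sha256_hex(declared_output)

clean = replay_artifact(input_table, declared_output, declared_hash)

# Self-consistent tamper: rewrite the output and rehash it while the input stays unchanged.
tampered_output = {"metric": "public_sum_metric", "sum": 11}
tampered = replay_artifact(input_table, tampered_output, sha256_hex(tampered_output))
```

In this sketch the tampered case keeps a valid hash for its own output, so only the recomputation catches it. That is the reason the check executes the metric instead of trusting the hash file alone.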

Pipeline

[Diagram: pipeline flowchart. JSON bundle authority, replication policy, research replay rows, execution artifacts, and local metric replay feed metadata-only result records, together with the source-module manifest, the public execution trace, and the negative fixtures; the result records end at the scope limit. The Mermaid source is listed under "Diagram source".]

Source refs

JSON bundle authority
paper_module.research_replication_rubric_artifact_replay
Diagram source
flowchart TD
  bundle["JSON bundle authority paper_module.research_replication_rubric_artifact_replay"]
  policy["Replication policy required fields + rubric axes + forbidden claims"]
  replay["Research replay rows 2 synthetic paper bundles"]
  artifacts["Execution artifacts allowed input + metric spec + declared hash"]
  metric["Local metric replay public_sum_metric over allowed input"]
  source_manifest["Source-module manifest 3 source pattern slices + 1 exact-copy component body"]
  trace["Public execution trace 2 metadata-only spans"]
  negatives["Negative fixtures 8 overclaim cases"]
  result_records["metadata-only result records counts, refs, digests, scope boundaries"]
  ceiling["Scope limit no replication-success or publishing-scope decision"]
  bundle --> policy
  policy --> replay
  replay --> artifacts
  artifacts --> metric
  source_manifest --> result_records
  metric --> result_records
  trace --> result_records
  negatives --> result_records
  result_records --> ceiling

Evidence Contract

The policy file requires fourteen replay fields: paper_id, contribution_decomposition_ref, rubric_tree_ref, allowed_public_input_refs, scratch_repo_scaffold_ref, experiment_dag_ref, metric_script_refs, artifact_hash_refs, declared_artifact_hash_refs, grader_report_ref, cost_runtime_budget_ref, ablation_diff_ref, failure_taxonomy_ref, and cold_rerun_receipt_ref.

The policy also requires eight rubric axes: contribution decomposition, artifact replay, experiment DAG, metric script, grader alignment, budget boundary, failure taxonomy, and cold rerun. A replay row can therefore pass only as a structured evidence packet, not as a final answer or narrative report.
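
A field-presence pass over one replay row, plus the declared-hash roster check, might look like the sketch below. The fourteen field names are the ones listed above; the helper names, the "non-empty means present" rule, and the `ref://` and `hash://` example values are assumptions for illustration only.

```python
REQUIRED_REPLAY_FIELDS = (
    "paper_id",
    "contribution_decomposition_ref",
    "rubric_tree_ref",
    "allowed_public_input_refs",
    "scratch_repo_scaffold_ref",
    "experiment_dag_ref",
    "metric_script_refs",
    "artifact_hash_refs",
    "declared_artifact_hash_refs",
    "grader_report_ref",
    "cost_runtime_budget_ref",
    "ablation_diff_ref",
    "failure_taxonomy_ref",
    "cold_rerun_receipt_ref",
)


def missing_fields(row):
    """Required fields that are absent or empty (assumed presence rule)."""
    return [name for name in REQUIRED_REPLAY_FIELDS if not row.get(name)]


def undeclared_hash_refs(row):
    """Artifact hash refs that fall outside the declared roster."""
    declared = set(row.get("declared_artifact_hash_refs", []))
    return [ref for ref in row.get("artifact_hash_refs", []) if ref not in declared]


# Hypothetical replay row with placeholder refs.
row = {name: f"ref://{name}" for name in REQUIRED_REPLAY_FIELDS}
row["artifact_hash_refs"] = ["hash://a", "hash://b"]
row["declared_artifact_hash_refs"] = ["hash://a", "hash://b"]

# A row that drops the grader report and cites a hash outside the roster.
bad_row = dict(row, grader_report_ref="", artifact_hash_refs=["hash://a", "hash://c"])
```

Here `missing_fields(bad_row)` reports the dropped grader report and `undeclared_hash_refs(bad_row)` reports `hash://c`, which corresponds to the "undeclared artifact hash refs" negative case.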

The exported runtime result record currently records the following evidence floor: two synthetic paper bundles, two replay rows, two artifact replay rows, two cold-rerun refs, two public execution-trace spans, four copied source modules, no findings, no error codes, source-module status pass, and input_mode: exported_research_replication_bundle. The fixture result record records all eight negative cases as observed.

Failure Modes and Guardrails

The expected negative cases are:

  • original-author code reuse
  • hidden-rubric leakage
  • report-only success
  • benchmark-performance overclaim
  • private paper or data body leakage
  • unbounded compute search
  • final-answer-only grading
  • undeclared artifact hash refs

The tests also cover source-module digest mismatch, local bundle body tamper, rehashing a swapped source module, wrong execution-artifact hashes, wrong artifact refs with matching hashes, report-only exported replays, metric perturbation, replay metric-script ref tamper, input perturbation, output body tamper, baked output swaps, and self-consistent input/output/hash rewrites. These cases make the component stronger than a field-presence linter: it rejects common ways to produce plausible but unaccountable replication prose.
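
One way to picture negative-case detection is as a set of predicates over the replay row's semantic fields, with any self-declared label ignored. The sketch below is illustrative only: every field name (`original_author_code_refs`, `compute_budget_cap`, `graded_axes`, and so on) and every case id is hypothetical, chosen to mirror the eight cases listed above; the module's real field names and error codes are not shown on this page.

```python
def _has_undeclared_hash(row):
    """True when any cited artifact hash ref is outside the declared roster."""
    declared = set(row.get("declared_artifact_hash_refs", []))
    return any(ref not in declared for ref in row.get("artifact_hash_refs", []))


# (case id, predicate over semantic fields); all field names are hypothetical.
SEMANTIC_CHECKS = (
    ("original_author_code_reuse", lambda r: bool(r.get("original_author_code_refs"))),
    ("hidden_rubric_leakage", lambda r: bool(r.get("hidden_rubric_body"))),
    ("report_only_success",
     lambda r: bool(r.get("claims_success")) and not r.get("artifact_hash_refs")),
    ("benchmark_performance_overclaim", lambda r: bool(r.get("claims_benchmark_performance"))),
    ("private_body_leakage", lambda r: bool(r.get("private_paper_or_data_body"))),
    ("unbounded_compute_search", lambda r: r.get("compute_budget_cap") is None),
    ("final_answer_only_grading", lambda r: r.get("graded_axes") == ["final_answer"]),
    ("undeclared_artifact_hash_refs", _has_undeclared_hash),
)


def observed_negative_cases(row):
    """Derive case ids from semantic fields only; declared labels are never consulted."""
    return [case_id for case_id, check in SEMANTIC_CHECKS if check(row)]


clean_row = {
    "claims_success": True,
    "artifact_hash_refs": ["hash://a"],
    "declared_artifact_hash_refs": ["hash://a"],
    "compute_budget_cap": 3600,
    "graded_axes": ["artifact_replay", "cold_rerun"],
}

# The row labels itself "pass" but has no compute cap; the label changes nothing.
forged_row = dict(clean_row, declared_case_label="pass", compute_budget_cap=None)
```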

Test Matrix

The focused regression file tests/test_research_replication_rubric_artifact_replay.py carries the source proof for this module.

Class | Examples | What it proves
Real-good | test_research_replication_replay_observes_negative_cases, test_research_replication_exported_bundle_validates_runtime_shape, test_public_agent_execution_trace_refactor_builds_research_replay_spans | The fixture and exported bundle produce metadata-only result records, observe the required negative cases, execute the local metric replay, and build two public trace spans.
Real-bad | test_research_replication_rejects_source_module_digest_mismatch, test_research_replication_rejects_bundle_local_source_module_body_tamper, test_research_replication_rejects_rehashed_source_module_body_swap, test_research_replication_rejects_metadata_only_bundle | The validator rejects broken source-module provenance, local bundle tamper, self-consistent source swaps, and metadata-only replay packets.
Perturbation | test_research_replication_rejects_wrong_execution_artifact_hash, test_research_replication_rejects_wrong_artifact_ref_with_matching_hash, test_research_replication_rejects_metric_perturbation, test_research_replication_rejects_valid_metric_script_body_swap, test_research_replication_rejects_replay_metric_script_ref_tamper, test_research_replication_rejects_replay_allowed_input_ref_tamper, test_research_replication_rejects_input_perturbation, test_research_replication_rejects_output_artifact_body_tamper, test_research_replication_rejects_output_artifact_baked_swap, test_research_replication_rejects_self_consistent_input_output_hash_rewrite | Metric, input, output, hash, and replay-row mutations stay blocked even when the tampered bundle tries to preserve self-consistency.
Label forgery | test_research_replication_ignores_forged_negative_case_labels, test_research_replication_negative_case_id_follows_semantics_not_filename, test_research_replication_exported_bundle_ignores_self_declared_pass_labels | Verdicts are derived from semantic replay-row fields, not filenames, declared status labels, or expected error-code labels.
Result record economy | test_research_replication_receipts_are_public_relative_and_secret_excluded, test_research_replication_bundle_card_reuses_fresh_receipt, test_research_replication_bundle_card_rejects_stale_receipt_after_input_mutation | Result records remain public-relative and secret-excluded; command cards reuse fresh result records and reject stale ones after input mutation.
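
The result-record economy row (reuse a fresh record, reject a stale one after input mutation) can be illustrated with a digest-based freshness check. This is a sketch under assumptions: the `freshness_basis` key borrows its name from the `_freshness_basis` function mentioned in the Mechanism section, but the digest scheme, the record layout, and the `card_decision` helper are invented for the example.

```python
import hashlib
import json


def freshness_basis(inputs):
    """Digest of the current inputs; a stand-in for whatever the runtime actually hashes."""
    blob = json.dumps(inputs, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def card_decision(stored_record, current_inputs):
    """Reuse the stored record only if it was produced from the inputs as they are now."""
    if stored_record.get("freshness_basis") == freshness_basis(current_inputs):
        return "reuse"
    return "stale"


# Hypothetical inputs and a record produced from them.
inputs = {"research_replays.json": {"replay_count": 2}}
record = {"status": "pass", "freshness_basis": freshness_basis(inputs)}

# Any change to the inputs invalidates the stored record.
mutated_inputs = {"research_replays.json": {"replay_count": 3}}
```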

Realness Rungs

This module's realness is deliberately laid out in rungs:

  1. Synthetic replay subjects. The two paper bundles are public synthetic examples, one ML-method replay and one computational-science replay.
  2. Real schema pressure. The required fields, rubric axes, declared hash roster, source-module manifest, and non-public-state exclusions are enforced by runtime code and focused tests.
  3. Local artifact replay. The exported bundle executes a local metric over allowed public input and compares produced output against declared artifact hashes.
  4. Source-open provenance. Three public source pattern bodies and one exact Python internal control body are copied into the bundle and digest-checked.
  5. Metadata-only public result records. Result records carry counts, refs, digests, verdicts, trace spans, and scope boundaries while excluding private/live/provider material.

The rung contract matters: the component is more than generic documentation polish, but it is still not paper-replication authority.

Relation to Concepts, Principles, and Axioms

The JSON bundle binds the module to concept.research_and_science_replay_evidence_bundle. That concept is instantiated by the mechanism above and abides by AX-1, AX-6, AX-8, and AX-12 at the concept layer. The bundle's direct axiom refs are AX-1, AX-2, AX-5, and AX-7.

The bundle's principle refs are P-1, P-2, P-3, P-6, P-8, and P-15. For this component, the important principle pressure is:

  • Evidence must be structured and replayable before authority rises.
  • Result records and scope boundaries are part of the artifact, not commentary after it.
  • Projections stay below source authority; a readable paper module does not outrank the JSON bundle, mechanism row, runtime code, source-module manifest, or result records.
  • Typed refusal is part of the mechanism: benchmark, provider, public sharing, private-body, original-code, and unbounded-compute claims remain false unless another authority surface actually grants them.

The module depends on paper_module.agent_benchmark_integrity_anti_gaming_replay. Benchmark performance overclaim controls stay routed through that sibling instead of being reinvented here.

Reader Evidence Routing

Open evidence in this order:

  1. core/paper_module_capsules.json#paper_module.research_replication_rubric_artifact_replay for the source-authority bundle, scope limit, doctrine refs, generated projection statuses, and code loci.
  2. core/mechanism_sources.json#mechanism.research_replication_rubric_artifact_replay.validates_public_research_replication_replay for the validator command, exported-bundle validator command, focused regression, guardrails, input refs, result record refs, and upstream mechanisms.
  3. standards/std_microcosm_research_replication_rubric_artifact_replay.json for the first-wave standard, public/private boundary, source-body floor, and hard launch/public sharing/provider/source-file changes flags.
  4. examples/research_replication_rubric_artifact_replay/exported_research_replication_bundle/source_module_manifest.json for source-open body-floor counts and digest obligations.
  5. receipts/runtime_shell/demo_project/organs/research_replication_rubric_artifact_replay/exported_research_replication_bundle_validation_result.json for the current exported-bundle validation result.
  6. tests/test_research_replication_rubric_artifact_replay.py for negative cases, digest tamper tests, metric replay tests, public-relative result record tests, command-card economy, and source-body exclusion.

Prior Art Grounding

This replay scores a research artifact against a replication rubric. It follows artifact-evaluation practice from systems and machine-learning venues (ACM Artifact Review and Badging), which separates 'available' from 'functional' from 'reproduced'. Microcosm borrows the rubric-over-artifact shape; the result is fixture-bound replay evidence, not a reproducibility guarantee or a peer-review verdict.

Validation Result Record Path

Focused runtime validation:

./repo-pytest tests/test_research_replication_rubric_artifact_replay.py -q --basetemp=/tmp/microcosm_research_replication_rubric_artifact_replay_pytest

Paper-module corpus validation:

./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

The runtime commands behind the result records are:

Scope boundary

Limitations
  • The two replay subjects are synthetic public paper bundles, not real external paper replications.
  • The metric replay is intentionally small: one public metric spec over one public input table with one declared output artifact. Its value is boundary enforcement, not benchmark substance.
  • Source-open proof is limited to three public source pattern body slices and one exact-copy public Python internal control body. It does not expose private source-root bodies, source notes, model-output data, account or browser state, browser UI state, or original-author code bodies.
  • A green run does not establish research truth, paper novelty, formal-result correctness, benchmark performance, external model service, launch-scope decision, or publishing-scope decision.
Authority Boundary

This component validates synthetic public replay metadata, local public artifact replay, source-module digest boundaries, public trace spans, negative-case coverage, and metadata-only result record shape. It does not claim actual paper replication success, benchmark performance, external model service, hidden-rubric access, original-author-code reuse, private paper/data export, unbounded compute search, final-answer-only grading, launch-scope decision, publishing-scope decision, source-file changes, product progress, or whole-system correctness.

Scope limit

This module may claim fixture-bound evidence that the component ran over public synthetic inputs and produced the result records and projections described above, reproduced by the validation result records named on this page.

It may not claim more than its bundle scope limit allows: copied public source pattern provenance bodies, an exact-copy public Python internal control body, metadata-only research-replication replay result records, public agent-execution trace spans, and fixture validation only. It makes no claim of actual paper replication success, benchmark performance, private paper/data body export, hidden-rubric export, external model access, unbounded compute search, original-author code reuse, launch-scope decision, publishing-scope decision, source-file changes, or product-progress evidence.

Source and projection details
Source-Open Body Floor

The source-module manifest at examples/research_replication_rubric_artifact_replay/exported_research_replication_bundle/source_module_manifest.json is the source-open body floor. It declares four copied modules:

  • research_replication_extracted_pattern_ledger_row_body_import, a public source pattern body slice.
  • research_replication_high_novelty_growth_receipt_body_import, a public source reconstruction result record slice.
  • research_replication_deterministic_pattern_order_body_import, a public deterministic pattern-order slice.
  • research_replication_replay_control_plane_source_body_import, an exact-copy public Python internal control body for this component.

Each row carries a source ref, target ref, material class, copied-body flag, result record-body exclusion flag, line count or byte count, and sha256 digest. The runtime verifies target digests; for the exact-copy Python row it also checks source currentness and source-target byte equality. Result records expose refs, counts, digests, and verdicts only. They do not embed source bodies.
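
The two checks described for manifest rows (target digest for every row, plus source-target byte equality for the exact-copy row) can be sketched as follows. The row keys `sha256` and `exact_copy`, the finding strings, and the helper names are assumptions; the real manifest schema may differ.

```python
import hashlib


def sha256_bytes(body: bytes) -> str:
    """Hex sha256 digest of a raw body."""
    return hashlib.sha256(body).hexdigest()


def check_module_row(row, source_body: bytes, target_body: bytes):
    """Return findings for one manifest row; an empty list means the row passes."""
    findings = []
    if sha256_bytes(target_body) != row["sha256"]:
        findings.append("target_digest_mismatch")
    # Exact-copy rows must also match the current source byte for byte.
    if row.get("exact_copy") and source_body != target_body:
        findings.append("source_target_bytes_differ")
    return findings


source = b"def run():\n    return 'ok'\n"
row = {"sha256": sha256_bytes(source), "exact_copy": True}

passing = check_module_row(row, source, source)

# Swap the copied body and rehash the manifest row so the digest check alone would pass.
swapped = b"def run():\n    return 'swapped'\n"
rehashed_row = {"sha256": sha256_bytes(swapped), "exact_copy": True}
rehashed = check_module_row(rehashed_row, source, swapped)
```

The rehashed case shows why a digest check is not enough for the exact-copy row: a swapped body with a recomputed digest is self-consistent, and only the comparison against the current source catches it.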

Spatial World Model Counterfactual Simulation Replay
Replays six what-if robotics scenes to show what a spatial prediction claim is built from.
4/5

Does
This replay takes six made-up "what if" spatial scenes from robotics and self-driving-style settings (a forklift appears from behind an occlusion, a small pedestrian steps into a crosswalk, a gust pushes a drone off course, a shiny floor fools a robot into seeing free space, a stacked load shifts into a lane, and an oncoming car turns late) and shows each one as inspectable rows: the starting scene, the action taken, the predicted next scene, what changed between them, a sanity check, and honest notes on its limits (it is synthetic, not real-world ground truth). The rows show exactly what a spatial "world model" claim is built from, plus a checklist of dangerous claims it deliberately refuses to make.

Scope limit
It validates only the declared public contract of synthetic spatial counterfactual-replay metadata rows. It is evidence for inspectable replay rows and limitation labels, not for real-world spatial accuracy, simulator-product validity, media-only authority, operational deployment, service distribution, or scope decisions.

Run
microcosm spatial-world-model-counterfactual-simulation-replay run-simulation-bundle --input examples/spatial_world_model_counterfactual_simulation_replay/exported_spatial_world_model_simulation_bundle --out receipts/runtime_shell/demo_project/organs/spatial_world_model_counterfactual_simulation_replay

Evidence · Contract validator · evidence 4/5 · Real runtime result

research-workflows · forecasting

Source Design note · Source atlas

Paper module Spatial World Model Counterfactual Simulation Replay

Purpose

Spatial world-model demos are unusually easy to oversell. A plausible-looking video, or a row that simply asserts "the model predicted the next state correctly", can pass for understanding without anything having been checked. This component exists to answer one narrow question: does a declared spatial counterfactual row actually bind a source state, an event, and a predicted outcome that survive an independent recomputation, or is it just a shape that looks right?

The approach is the unusual part. The predicted actor count, transition delta, event label, and spawn cells are derived from the inputs (sensor-packet refs, consistency budget, topology), so a stale or hand-edited prediction no longer matches and the row blocks. The point is not a good simulator. The point is that a spatial-AI claim cannot pass on appearance alone: it has to agree with a recomputation a reader can audit in one screen.

Abstract

spatial_world_model_counterfactual_simulation_replay is a Microcosm component for checking spatial world-model counterfactual claims as metadata transitions, not as generated video, robotics control, AV simulation, geographic truth, or benchmark authority. The component validates six synthetic scene-state rows, six counterfactual replay rows, six predicted transition rows, eight forbidden-claim negative cases, and an exported source-module bundle whose result record stays metadata-only.

The technical claim is deliberately small: for each replay row, the runtime recomputes a deterministic toy gridworld next state from the declared scene state, counterfactual event, sensor-packet refs, consistency budget, topology ref, and limitation labels; it then compares that actual transition against the declared predicted state, transition diff, and oracle check. A green run proves the public replay rows are internally consistent and bounded by their scope limit. It does not establish real-world spatial accuracy, trained simulator quality, generated-video correctness, robot or AV operation, provider behavior, hosting, public sharing, launch-scope decision, or whole-system correctness.

Telos

World-model demos are easy to overstate because visual plausibility can hide whether any state transition was checked. This component makes the proof surface inspectable: a reader can see the scene-state ref, action trace, predicted-state ref, transition-diff ref, oracle-check ref, fidelity limit, limitation labels, negative cases, and source-module digest evidence before accepting any spatial counterfactual claim.

The useful result is not a better simulator. The useful result is an evidence spine that refuses to let a spatial-AI claim advance unless the public row binds input state, counterfactual event, predicted output, actual recomputation, and scope boundary in one result record.

Mechanism

The positive fixture has six scene states and six matching replay rows: warehouse occlusion, crosswalk emergence, drone-corridor gust recovery, mobile robot reflective-floor detour, loading-dock pallet shift, and unprotected-turn late yield. Each row declares a source scene-state ref, action-trace ref, counterfactual event, predicted-state ref, transition-diff ref, oracle-state-check ref, two public sensor-packet refs, a rare-event label, a fidelity-limit label, limitation labels, and explicit false values for private video, raw sensor export, live operation, geography, simulator-product, generated-video-only, benchmark, and launch claims.

Runtime transition checking happens in _state_transition_analysis:

  1. The component resolves each replay to exactly one state-transition row.
  2. It builds an 8 x 8 toy gridworld from the source scene's actor count and topology ref.
  3. It maps the counterfactual event to a deterministic event action such as new_dynamic_actor.
  4. It recomputes the actual next state and transition diff from the input row.
  5. It compares predicted actor count, transition delta, event label, spawn cell or cells, predicted-state ref, diff ref, oracle-check ref, and metadata-only result record status.

The input-driven part matters. Actor-count delta is not copied from the expected fixture. It is recomputed as:

min(
  base_event_actor_count_delta
  + max(0, sensor_packet_count - max_timestep_lag - base_event_actor_count_delta),
  4,
  free_cell_count
)

Spawn cells are also input-derived: the runtime hashes the event, replay id, scene-state ref, topology ref, sensor-packet refs, consistency budget, limitation labels, and source actor count, then walks the bounded grid from the declared event cell. This makes the row sensitive to real input changes while remaining small enough to audit.
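
A runnable reading of the two recomputations above is sketched here. The `actor_count_delta` function transcribes the formula as written. The `spawn_cells` function is only a guess at the general technique (hash the inputs, then walk the bounded grid from the event cell): the join-and-hash scheme, the row-major walk order, the `(row, col)` cell convention, and all parameter names are assumptions. The example values `base=1` and `max_timestep_lag=1` are also assumed; they are chosen because they reproduce a move from 1 to 2 when a third sensor-packet ref is added.

```python
import hashlib

MAX_ACTOR_DELTA = 4  # the constant cap that appears in the formula above


def actor_count_delta(base, sensor_packet_count, max_timestep_lag, free_cell_count):
    """Direct transcription of the min(...) formula in the text."""
    bonus = max(0, sensor_packet_count - max_timestep_lag - base)
    return min(base + bonus, MAX_ACTOR_DELTA, free_cell_count)


def spawn_cells(seed_parts, event_cell, occupied, count, size=8):
    """Pick `count` free cells deterministically (hypothetical scheme, not the runtime's)."""
    digest = hashlib.sha256("|".join(seed_parts).encode("utf-8")).digest()
    # Start at the declared event cell, shifted by an input-derived offset.
    start = (event_cell[0] * size + event_cell[1] + digest[0]) % (size * size)
    cells = []
    for offset in range(size * size):
        if len(cells) >= count:
            break
        index = (start + offset) % (size * size)
        cell = (index // size, index % size)  # (row, col)
        if cell not in occupied:
            cells.append(cell)
    return cells


two_packets = actor_count_delta(base=1, sensor_packet_count=2, max_timestep_lag=1, free_cell_count=60)
three_packets = actor_count_delta(base=1, sensor_packet_count=3, max_timestep_lag=1, free_cell_count=60)
capped = actor_count_delta(base=1, sensor_packet_count=10, max_timestep_lag=1, free_cell_count=60)

seed = ["new_dynamic_actor", "replay-1", "scene-1", "topology-1"]
cells = spawn_cells(seed, event_cell=(2, 3), occupied={(2, 3)}, count=three_packets)
```

Under these assumed values the delta is 1 with two packets, 2 with three, and never exceeds the cap of 4 or the number of free cells.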

Transition Evidence

The current fixture proves a narrow but useful invariant: all six declared predicted states match the runtime's actual toy-gridworld step. The focused test expects:

  • scene_state_count == 6
  • replay_count == 6
  • state_transition_count == 6
  • predicted_state_body_count == 6
  • deterministic_simulation_pass_count == 6
  • gridworld_step_count == 6
  • predicted_actual_match_count == 6
  • transition_diff_count == 6
  • oracle_state_check_count == 6
  • sensor_packet_ref_count == 12

Those counts are technical evidence only because the runtime recomputes the state transition before accepting them. The result record cannot be read as a learned world-model score; it is a public replay consistency check over synthetic metadata and copied source-module digests.

Real-Bad Mutation Contract

The regression suite includes deliberately bad mutations that show the proof is not just shape validation:

  • If a transition row changes actor_count_delta from the recomputed value, run_simulation_bundle blocks with SPATIAL_STATE_TRANSITION_SIMULATION_MISMATCH.
  • If the predicted state misses the gridworld step, the transition row records predicted_state_actor_count_mismatch while the recomputed actual state still shows the expected gridworld execution.
  • If a replay gains an extra sensor-packet ref, the recomputed actor delta moves from 1 to 2. The stale expected transition blocks until the predicted actor count, actor delta, and spawn cells are updated to match the new actual transition.
  • If the source scene actor count and topology ref change, the recomputed source and spawn-cell state moves. The stale predicted state blocks until the transition row is updated.
  • If a source-module manifest tries to place copied body text inside a result record, the source-module summary blocks with SPATIAL_SOURCE_BODY_TEXT_IN_RECEIPT_FORBIDDEN and SPATIAL_SOURCE_MODULE_BODY_TEXT_IN_RECEIPT_FORBIDDEN.
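
The stale-prediction behaviour in the mutations above comes down to comparing a declared transition with the recomputed one and blocking on any difference. In the sketch below, only the error code `SPATIAL_STATE_TRANSITION_SIMULATION_MISMATCH` comes from the text; the compared keys, the result shape, and the `compare_transition` helper are hypothetical.

```python
MISMATCH_CODE = "SPATIAL_STATE_TRANSITION_SIMULATION_MISMATCH"

# Hypothetical subset of the fields the runtime is described as comparing.
COMPARED_KEYS = ("actor_count", "actor_count_delta", "event_label", "spawn_cells")


def compare_transition(declared, actual):
    """Block when any declared field differs from the recomputed transition."""
    mismatched = [key for key in COMPARED_KEYS if declared.get(key) != actual.get(key)]
    if mismatched:
        return {"status": "blocked", "error_codes": [MISMATCH_CODE], "mismatched_fields": mismatched}
    return {"status": "pass", "error_codes": [], "mismatched_fields": []}


# Recomputed state after an extra sensor-packet ref moved the delta from 1 to 2.
actual = {
    "actor_count": 5,
    "actor_count_delta": 2,
    "event_label": "new_dynamic_actor",
    "spawn_cells": [[2, 4], [2, 5]],
}

# The declared prediction was written for the old inputs and is now stale.
stale = {
    "actor_count": 4,
    "actor_count_delta": 1,
    "event_label": "new_dynamic_actor",
    "spawn_cells": [[2, 4]],
}

stale_result = compare_transition(stale, actual)
updated_result = compare_transition(dict(actual), actual)
```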

The negative payload cases are similarly typed: private video export, raw sensor export, live robot or AV operation, real-world location claims, simulator-product claims, generated-video-only authority, geographic accuracy claims, and benchmark-score claims without state-diff result records all have explicit forbidden-code coverage.

Shape

[Diagram: flowchart. A scene-state row feeds a counterfactual replay row, which drives a deterministic 8x8 toy gridworld step to produce the actual next state. The actual state is compared with the declared predicted state; a match yields a metadata-only pass result record and a mismatch yields a typed mismatch finding with blocked status. A forbidden payload or claim also routes to the finding. The Mermaid source is listed under "Diagram source".]
Diagram source
flowchart TD
  Scene["Scene-state row actor count + topology"] --> Replay["Counterfactual replay row event + sensor refs + budget"]
  Replay --> Step["Deterministic toy gridworld step 8x8 bounded recomputation"]
  Step --> Actual["Actual next state actor delta + spawn cells"]
  Replay --> Expected["Declared predicted state transition diff + oracle check"]
  Actual --> Compare{"Actual matches declared transition?"}
  Expected --> Compare
  Compare -->|yes| ResultRecord["metadata-only pass result record counts + refs + digests"]
  Compare -->|no| Finding["Typed mismatch finding blocked status"]
  Replay --> Boundary{"Forbidden payload or claim?"}
  Boundary -->|no| ResultRecord
  Boundary -->|yes| Finding

This diagram is a reader map for the runtime proof. The generated doctrine lattice Mermaid remains the bundle-derived edge proof.

Reader Evidence Routing

Read this page from source authority outward:

  1. Open core/paper_module_capsules.json::paper_modules[53:paper_module.spatial_world_model_counterfactual_simulation_replay] for the JSON bundle and scope limit.
  2. Open paper_modules/spatial_world_model_counterfactual_simulation_replay.json for generated relationship edges, Mermaid status, Atlas status, and source_authority: json_capsule.
  3. Inspect src/microcosm_core/organs/spatial_world_model_counterfactual_simulation_replay.py, especially _state_transition_analysis, _gridworld_step, _gridworld_actor_count_delta, _gridworld_spawn_cells, _replay_policy_findings, and _source_module_manifest_result.
  4. Inspect fixture inputs under fixtures/first_wave/spatial_world_model_counterfactual_simulation_replay/input and exported-bundle inputs under examples/spatial_world_model_counterfactual_simulation_replay/exported_spatial_world_model_simulation_bundle.
  5. Inspect tests/test_spatial_world_model_counterfactual_simulation_replay.py for the positive replay, public-relative result record, source-module import, body-text rejection, transition-delta mutation, predicted-state mutation, input-perturbation, scene-perturbation, and fresh-card reuse contracts.

Runtime Command

microcosm spatial-world-model-counterfactual-simulation-replay run-simulation-bundle --input examples/spatial_world_model_counterfactual_simulation_replay/exported_spatial_world_model_simulation_bundle --out receipts/runtime_shell/demo_project/organs/spatial_world_model_counterfactual_simulation_replay

The runtime shell also exposes the compressed lens at:

microcosm spatial-simulation

Prior Art Grounding

This replay exercises a spatial world model under counterfactual interventions. It is grounded in the world-models line of work (Ha and Schmidhuber, World Models), where an agent learns a compressed model of its environment it can roll forward under hypothetical actions. Microcosm borrows the counterfactual-rollout shape over synthetic metadata; the result is fixture-bound replay evidence, not robot or AV operation, real-world geography, or a calibrated simulator.

Validation Result record Path

Run from microcosm-substrate:

The expected bundle projection is Mermaid available_from_capsule_edges, Atlas linked_from_capsule_edges, and 20 generated relationship edges. These checks prove the public synthetic replay and source-module import boundary only; they do not validate real geography, robot or AV operation, simulator-product claims, benchmark claims, public sharing, hosting, or launch.

Scope boundary

Public Boundary

The exported bundle may include copied Station geometry source bodies as public source-open material, but result records carry refs, digests, counts, and verdicts only. They must not carry private video bodies, raw sensor payloads, GPS trace bodies, model-output data, account or browser state, account secrets, or live-access material.

The scope limit is therefore:

  • allowed: synthetic scene-state refs, action-trace refs, predicted-state refs, transition-diff refs, oracle-check refs, source-open public sensor-packet refs, rare-event labels, fidelity-limit labels, limitation labels, source-module digests, negative-case result records, and metadata-only validation result records;
  • not allowed: simulator-product authority, private video export, raw sensor export, live robot or AV operation, real-world geography claims, benchmark claims, external model access, hosting, public sharing, launch-scope decision, private-system equivalence, or whole-system correctness.
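
The "refs, digests, counts, and verdicts only" rule can be pictured as an allow-list over result record keys. This is a minimal sketch under stated assumptions: the key-suffix convention, the sample keys, and the check_metadata_only helper are all hypothetical, and the real scanner may work quite differently.

```python
# Assumed naming convention: metadata keys end in one of these suffixes.
ALLOWED_SUFFIXES = ("_ref", "_refs", "_digest", "_count", "_verdict", "_status")


def check_metadata_only(record: dict) -> list:
    """Return the keys that do not look like refs, digests, counts, or verdicts."""
    return sorted(key for key in record if not key.endswith(ALLOWED_SUFFIXES))


ok_record = {
    "scene_state_ref": "scene/001",          # hypothetical ref value
    "source_module_digest": "sha256:0000",   # placeholder digest
    "transition_row_count": 6,
    "replay_verdict": "pass",
}
bad_record = dict(ok_record, raw_sensor_payload=b"\x00\x01")

ok_findings = check_metadata_only(ok_record)    # nothing to report
bad_findings = check_metadata_only(bad_record)  # flags the payload key
```

An allow-list fails closed: a new kind of body field is rejected until someone names it as metadata, which matches the direction of the boundary described above.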
Limitations

The dynamics are toy dynamics. The 8 x 8 gridworld models actor counts and spawn cells from public metadata; it does not model perception, control, physics, sensor calibration, camera geometry, lidar, maps, vehicle dynamics, human behavior, or material truth. The synthetic events are useful because they force state-diff accounting, not because they approximate the real world.

The fixture is also finite. It covers six public replay rows, six transition rows, two sensor refs per replay, eight negative claim families, and three copied source modules. It does not establish all possible spatial counterfactuals, full secret absence outside the scanner envelope, complete robotics safety, simulator correctness, or future fixture coverage.

The source-open body floor is limited to exact copied Station geometry guardrail bodies named by the source-module manifest and verified by digest. That does not certify private source-root equivalence, private video or raw sensor availability, account or browser state, provider behavior, hidden GPS trace bodies, live-access material, or launch-scope decision.

Scope limit

This module may claim fixture-bound evidence that the component ran over public synthetic inputs and produced the result records and projections described above, reproduced by the validation result records named on this page.

It may not claim more than its bundle scope limit allows: Declared public synthetic spatial counterfactual-replay metadata and source-module import evidence only; no robot or AV operation, real-world geographic accuracy, simulator product validation, generated-video authority, benchmark claims, external model access, hosting, launch-scope decision, publishing-scope decision, or whole-system correctness.

Materials Chemistry Closed Loop Lab Safety Replay

Replays a self-driving lab loop as records, with safety gates and no real chemicals, robot, or lab. Evidence: 3/5.

Does Takes the pattern of a "self-driving materials lab" (propose a candidate material, run safety screens, simulate an assay, then decide what to try next) and replays it locally as inspectable records: every step, its safety gate, its simulated result, and the decision that followed, plus the pre-recorded points where such a loop would fail and where it would restart. It makes the workflow's structure visible, all on a simulator-only fixture, so how the loop is wired is traceable without any real lab, real chemicals, or real robot ever being involved.

Scope limit It documents projection and replay mechanics only and excludes wetlab protocols, hazardous synthesis steps, reagent amounts, controlled/bioactive targets, robot commands, live assay data, discovery claims, benchmark claims, external model access, or any judgment of domain/chemical correctness.

Run
microcosm materials-chemistry-closed-loop-lab-safety-replay run-lab-bundle --input examples/materials_chemistry_closed_loop_lab_safety_replay/exported_materials_lab_safety_bundle --out receipts/runtime_shell/demo_project/organs/materials_chemistry_closed_loop_lab_safety_replay

Evidence · Computed projection · evidence 3/5 · Source-faithful refactor

research-workflows · forecasting

Source Design note · Source atlas

Paper module Materials Chemistry Closed-Loop Lab-Safety Replay

Purpose

"Closed-loop materials lab" is one of the easier phrases to overclaim. A fixture can look like an autonomous discovery loop while carrying nothing that should be spoken aloud: wetlab steps, reagent quantities, a controlled or bioactive target, robot commands, or a flat assertion that some material was discovered. This component exists to sit in front of that language and answer one question: is a closed-loop-lab-shaped fixture safe and grounded enough to be talked about at all, in a simulator-only frame, before any lab claim is allowed?

Its real name inside the runtime is the materials_chemistry_artifact_safety_refusal_validator. The public-promise name "closed-loop replay" was deliberately reframed because nothing here executes a wetlab loop or commands a robot. The unusual part is that the component does not trust the fixture's own conclusion. A normal replay would read a declared "selected candidate" label and report it. This validator instead recomputes the winner from public numbers, weighting an assay proxy, an active-learning score, and a safety gate, then treats a mismatch between that recomputed pick and the declared label as a failure rather than a footnote. A stale or flattering label cannot pass.

The second discipline is refusal as a first-class result. Eight categories of dangerous or overclaiming content each have a named forbidden code, and a fixture that smuggles one in is expected to be refused, not quietly accepted. The verdict is computed from public simulator rows, safety fields, source-module manifests, replay-graph status, negative-case coverage, and a sentinel scan, and it stays inside a simulator-only ceiling. It is a safety and refusal check, not a laboratory.

Abstract

materials_chemistry_closed_loop_lab_safety_replay is a public, simulator-only replay validator for materials-lab language. It does not claim a material discovery, a wetlab protocol, a robot loop, or a benchmark. It checks whether a closed-loop-lab shaped public fixture has enough evidence to be talked about at all: candidate material refs, safety-screen refs, simulator-only assay rows, active-learning decisions, a Lab/Evolve replay graph, source-module manifest digests, negative-case refusals, metadata-only result records, and an explicit scope limit.

The technical claim is a numeric verdict proof boundary. A passing run must recompute the selected candidate from score-backed fixture rows rather than trusting a declared label. The baseline fixture contains four candidates and selects mat_polymer_membrane_001 with score 0.917; perturbation tests prove that stale labels, missing score rows, out-of-range scores, and safety-gate failures block the verdict.

Mechanism

The runtime locus is src/microcosm_core/organs/materials_chemistry_closed_loop_lab_safety_replay.py. The relevant entrypoints are run for first-wave fixture validation and run_lab_bundle for exported-bundle validation. The validator loads a replay policy, candidate rows, experiment DAG rows, simulator assays, active-learning decisions, optional source-module manifests, and eight forbidden negative-case fixtures.

The sign-off rule is deliberately small:

  1. Positive rows must link candidates, experiments, assays, safety screens, active-learning decisions, failure taxonomy refs, and cold replay refs.
  2. Negative cases must be observed and refused.
  3. Numeric replay must recompute the selected candidate from public numbers.
  4. Source-module imports must verify copied bodies without putting bodies into result records.
  5. The safety verdict must remain inside the simulator-only scope limit.
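
Read as code, the sign-off rule is a conjunction: the verdict is accepted only when all five conditions hold. The sketch below assumes a GateResults container with hypothetical field names; only the two verdict strings come from this page.

```python
from dataclasses import astuple, dataclass, replace

ACCEPTED = "public_safe_simulator_replay_accepted"
BLOCKED = "blocked_public_safety_boundary"


@dataclass(frozen=True)
class GateResults:
    """One boolean per sign-off rule; the field names are hypothetical."""
    positive_rows_linked: bool
    negative_cases_refused: bool
    numeric_replay_matches: bool
    source_modules_verified: bool
    within_simulator_scope: bool


def sign_off(gates: GateResults) -> str:
    """Accept only when every gate holds; any single failure blocks."""
    return ACCEPTED if all(astuple(gates)) else BLOCKED


all_pass = GateResults(True, True, True, True, True)
stale_label = replace(all_pass, numeric_replay_matches=False)

verdict_ok = sign_off(all_pass)        # accepted
verdict_bad = sign_off(stale_label)    # blocked by one failing gate
```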

Source refs

  • numeric policy + expected label: replay_policy.json
  • 4 candidate refs + safety gates: candidate_materials.json
  • 4 public assay proxy values: simulator_assays.json
  • 4 active-learning scores: active_learning_decisions.json
  • 4 copied public body modules: source_module_manifest.json
  • Accepted: public_safe_simulator_replay_accepted
  • Blocked: blocked_public_safety_boundary
Diagram source
flowchart TD
  policy["replay_policy.json numeric policy + expected label"]
  candidates["candidate_materials.json 4 candidate refs + safety gates"]
  assays["simulator_assays.json 4 public assay proxy values"]
  decisions["active_learning_decisions.json 4 active-learning scores"]
  numeric["numeric replay weighted recompute of the winner"]
  labelcheck{"recomputed pick == declared label? safety gate >= 0.70?"}
  negatives["negative-case fixtures 8 forbidden lab classes"]
  refuse{"any forbidden MATERIALS_*_FORBIDDEN observed?"}
  manifest["source_module_manifest.json 4 copied public body modules"]
  replay["Lab/Evolve replay graph replay cases"]
  verdict["safety verdict"]
  accepted["public_safe_simulator_replay_accepted"]
  blocked["blocked_public_safety_boundary"]
  result_record["metadata-only result records counts, digests, findings"]
  ceiling["scope limit no wetlab / no discovery / no launch"]
  policy --> numeric
  candidates --> numeric
  assays --> numeric
  decisions --> numeric
  numeric --> labelcheck
  labelcheck -->|stale label or gate fail| blocked
  labelcheck -->|match| verdict
  negatives --> refuse
  refuse -->|yes| blocked
  refuse -->|no| verdict
  manifest --> replay
  replay --> verdict
  verdict --> accepted
  accepted --> result_record
  blocked --> result_record
  result_record --> ceiling

Numeric Assay And Verdict Evidence

The replay policy declares:

  • selection rule: max_weighted_public_assay_active_learning_and_safety_gate_score
  • minimum safety gate: 0.70
  • expected selected candidate: mat_polymer_membrane_001
  • weighted score: 0.45 * public_assay_proxy_value + 0.35 * public_active_learning_score + 0.20 * public_safety_gate_score

The source fixture binds four score-backed rows:

| Candidate | Safety gate | Assay proxy | Active-learning | Weighted score | Decision / action |
|---|---|---|---|---|---|
| mat_polymer_membrane_001 | 0.94 | 0.92 | 0.90 | 0.917 | decision_membrane_001 / simulate_assay |
| mat_solid_electrolyte_002 | 0.91 | 0.84 | 0.81 | 0.8435 | decision_electrolyte_002 / update_surrogate_model |
| mat_catalyst_support_003 | 0.85 | 0.78 | 0.74 | 0.780 | decision_support_003 / choose_next_simulation |
| mat_sorbent_surface_004 | 0.88 | 0.70 | 0.66 | 0.722 | decision_sorbent_004 / screen_candidate |

The focused regression test_materials_chemistry_numeric_replay_recomputes_verdict_from_fixture_numbers proves the pass case: status pass, verified_numeric_row_count == 4, selected candidate mat_polymer_membrane_001, selected decision decision_membrane_001, selected next action simulate_assay, score 0.917, realness rung R3, and verdict basis recomputed_from_public_assay_active_learning_and_safety_gate_fixture_numbers.

The verifier does not use expected labels for selection. Expected labels are checked only after the selected row is recomputed from candidate, assay, and decision content.
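
The recompute-then-compare order can be sketched as follows. The weights, the 0.70 gate, the candidate inputs, and the three finding codes come from this page; the function names, the row layout, and the choice to drop gate-failing rows before selection are assumptions, and this is not the validator's actual code.

```python
MIN_SAFETY_GATE = 0.70

# (candidate id, safety gate, assay proxy, active-learning score) from the fixture table.
ROWS = [
    ("mat_polymer_membrane_001", 0.94, 0.92, 0.90),
    ("mat_solid_electrolyte_002", 0.91, 0.84, 0.81),
    ("mat_catalyst_support_003", 0.85, 0.78, 0.74),
    ("mat_sorbent_surface_004", 0.88, 0.70, 0.66),
]


def weighted_score(safety, assay, active):
    """Weighted score as declared by the replay policy."""
    return round(0.45 * assay + 0.35 * active + 0.20 * safety, 4)


def numeric_replay(rows, expected):
    """Recompute the winner first; consult the expected label only afterwards."""
    findings, eligible = [], {}
    for cid, safety, assay, active in rows:
        if not all(0.0 <= v <= 1.0 for v in (safety, assay, active)):
            findings.append("MATERIALS_NUMERIC_REPLAY_SCORE_OUT_OF_RANGE")
            continue
        if safety < MIN_SAFETY_GATE:
            # Assumption: a gate-failing row cannot be selected.
            findings.append("MATERIALS_NUMERIC_REPLAY_SAFETY_GATE_FAILED")
            continue
        eligible[cid] = weighted_score(safety, assay, active)
    selected = max(eligible, key=eligible.get) if eligible else None
    if selected != expected:
        findings.append("MATERIALS_NUMERIC_REPLAY_EXPECTED_LABEL_STALE")
    return {
        "status": "blocked" if findings else "pass",
        "selected": selected,
        "score": eligible.get(selected),
        "findings": findings,
    }


baseline = numeric_replay(ROWS, "mat_polymer_membrane_001")

# Perturbation: membrane safety gate lowered to 0.52, expectation left unchanged.
low_gate_rows = [(c, 0.52 if c == "mat_polymer_membrane_001" else s, a, al)
                 for c, s, a, al in ROWS]
low_gate = numeric_replay(low_gate_rows, "mat_polymer_membrane_001")

# Perturbation: sorbent raised and the policy expectation updated to match.
moved_rows = [(c, 0.93, 0.98, 0.98) if c == "mat_sorbent_surface_004" else (c, s, a, al)
              for c, s, a, al in ROWS]
moved = numeric_replay(moved_rows, "mat_sorbent_surface_004")
```

Because the expected label enters only in the final comparison, editing the label can never change which candidate is selected; it can only change whether the run passes.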

Test Matrix

| Class | Evidence | Expected verdict |
|---|---|---|
| Real-good fixture | Baseline first-wave fixture with four candidate, assay, and decision rows | public_safe_simulator_replay_accepted; numeric replay pass; selected candidate mat_polymer_membrane_001; score 0.917 |
| Real-good source body floor | Exported bundle manifest with four copied modules and zero manifest findings | source_module_manifest_status: pass; verified_module_count: 4; result records remain metadata-only; current checked-in bundle still needs refreshed numeric rows before it is a full exported-bundle pass |
| Real-bad lab safety | Controlled/bioactive targets, hazardous synthesis flags, mismatched safety refs, robot command, account secrets, private notebooks, or discovery claims | blocked_public_safety_boundary with the relevant MATERIALS_*_FORBIDDEN or positive-linkage finding |
| Real-bad numeric missingness | Score-backed rows removed while numeric policy is active | MATERIALS_NUMERIC_REPLAY_POLICY_REQUIRES_SCORE_BACKED_ROWS; verified_numeric_row_count: 0 |
| Real-bad numeric required | Numeric policy removed and score rows absent | MATERIALS_NUMERIC_REPLAY_REQUIRED; realness rung blocked |
| Real-bad stale label | Policy declares mat_catalyst_support_003 while recomputation selects mat_polymer_membrane_001 | MATERIALS_NUMERIC_REPLAY_EXPECTED_LABEL_STALE |
| Real-bad score range | Safety, assay, or active-learning score outside [0, 1] | MATERIALS_NUMERIC_REPLAY_SCORE_OUT_OF_RANGE |
| Perturbation, low safety gate | Membrane safety gate lowered to 0.52 | Computed pick moves to mat_solid_electrolyte_002, verdict blocks, and findings include stale label plus MATERIALS_NUMERIC_REPLAY_SAFETY_GATE_FAILED |
| Perturbation, moved valid pick | Sorbent raised to safety 0.93, assay 0.98, active learning 0.98, and policy expectation updated | Numeric replay passes, selected candidate mat_sorbent_surface_004, selected action screen_candidate, score 0.970 |
| Perturbation, moved pick without expectation update | Exported bundle recomputes sorbent as the winner while policy still expects membrane | Source manifest stays pass, but numeric replay blocks with MATERIALS_NUMERIC_REPLAY_EXPECTED_LABEL_STALE |

These cases are source/test-backed by tests/test_materials_chemistry_closed_loop_lab_safety_replay.py. Fresh local first-wave result record output is the authority for current numeric replay; older archived first-wave result records and the checked-in exported bundle predate the numeric replay rows and should not be read as the numeric proof. The exported bundle still needs refreshed numeric rows before it is a full exported-bundle pass.

Evidence Routes

  • JSON bundle: core/paper_module_capsules.json::paper_module.materials_chemistry_closed_loop_lab_safety_replay
  • Generated JSON instance: paper_modules/materials_chemistry_closed_loop_lab_safety_replay.json
  • Mechanism source: core/mechanism_sources.json::mechanism.materials_chemistry_closed_loop_lab_safety_replay.validates_public_materials_lab_safety_replay
  • Runtime: src/microcosm_core/organs/materials_chemistry_closed_loop_lab_safety_replay.py
  • Domain standard: standards/std_microcosm_materials_chemistry_closed_loop_lab_safety_replay.json
  • Paper-module standard: standards/std_microcosm_paper_module.json
  • Fixture input: fixtures/first_wave/materials_chemistry_closed_loop_lab_safety_replay/input
  • Exported bundle: examples/materials_chemistry_closed_loop_lab_safety_replay/exported_materials_lab_safety_bundle
  • Focused tests: tests/test_materials_chemistry_closed_loop_lab_safety_replay.py

Prior Art Grounding

This replay exercises a closed-loop materials and chemistry lab controller with a safety gate over synthetic experiments. It is grounded in the self-driving laboratory literature, where a propose-run-measure loop is paired with safety interlocks that can refuse an unsafe experiment. Microcosm borrows the loop-plus-safety-gate shape on a simulator; the result is metadata-only simulator evidence, not a real laboratory controller, chemical-safety authority, or launch.

Validation Result record Path

Run the current runtime proof from the Microcosm root:

Inspect the exported source-body bundle. Until the exported fixture is refreshed with score-backed numeric rows, this command may return a blocked numeric verdict while still proving the manifest/body-floor boundary:

cd microcosm-substrate
PYTHONPATH=src ../repo-python -m microcosm_core.organs.materials_chemistry_closed_loop_lab_safety_replay run-lab-bundle --input examples/materials_chemistry_closed_loop_lab_safety_replay/exported_materials_lab_safety_bundle --out /tmp/microcosm_materials_chemistry_lab_safety_bundle

Run the focused regression suite:

cd microcosm-substrate
PYTHONPATH=src ../repo-pytest tests/test_materials_chemistry_closed_loop_lab_safety_replay.py -q

Check the paper-module corpus projection without writing it:

cd microcosm-substrate
PYTHONPATH=src ../repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

This lane intentionally does not run scripts/build_doctrine_projection.py --write; generated projections, atlas cards, and shared bundle surfaces belong to their owner lanes.

Scope boundary

Limitations

This module is a replay validator, not a laboratory. It does not synthesize materials, provide wetlab instructions, control robots, rank real compounds, validate live assay data, authorize external model access, or establish a discovery benchmark. Fixture numbers are public replay coordinates for a safety-gated contract; they are not experimental measurements.

The validator can prove local consistency across fixture rows, exported source-module manifests, replay graph records, negative-case checks, sentinel scans, numeric recomputation, and metadata-only result records. It cannot prove chemical safety, regulatory suitability, lab readiness, deployment readiness, public-site freshness, publishing-scope decision, or launch-scope decision.

Scope limit

This module may claim that Microcosm has a public, source-faithful, simulator-only replay contract that checks candidate refs, safety-screen refs, simulator-only assay rows, active-learning decisions, numeric replay, failure-taxonomy refs, cold replay refs, replay cases, source bundle hashes, copied source-module digests, negative-case result records, metadata-only result record policy, and scope limits.

It must not claim wetlab operation, material synthesis, robot control, hazardous synthesis guidance, reagent quantities, controlled or bioactive targeting, live assay data, private lab notebook export, live account secrets, external model service, material discovery, benchmark performance, safety certification, public sharing, hosting, launch-scope decision, source-file changes, or product-progress authority.

Scope limit

This module may claim fixture-bound evidence that the component ran over public synthetic inputs and produced the result records and projections described above, reproduced by the validation result records named on this page.

It may not claim more than its bundle scope limit allows: Copied public Lab/Evolve source/control/result record/standard bodies, metadata-only simulator-only fixture result records, runtime bundle result records, and artifact safety/refusal validation only; no wetlab execution, hazardous synthesis guidance, reagent quantity, controlled or bioactive target, live assay, robot command, private lab notebook, external model access, discovery claim, benchmark claims, launch-scope decision, publishing-scope decision, or product-progress evidence.

Source and projection details
Source-Open Body Floor

The exported bundle at examples/materials_chemistry_closed_loop_lab_safety_replay/exported_materials_lab_safety_bundle contains a source_module_manifest.json with four copied bodies:

| Module id | Material class | Role |
|---|---|---|
| materials_lab_evolve_failure_replay_specimen_body_import | public_macro_tool_body | deterministic replay graph construction, failure classification, restart-point selection, source-bundle hashing, and result record boundaries |
| materials_lab_evolve_replay_graph_body_import | public_macro_control_plane_body | replay graph body, restart points, source bundles, global teachings, and public claim boundary |
| materials_lab_evolve_receipt_body_import | public_macro_receipt_body | replay result record body proving the source evidence shape without moving private material into result records |
| laboratory_standard_body_import | public_standard_body | public laboratory standard floor for the replay |

The bundle validator checks module_count: 4, verified_module_count: 4, source_module_manifest_status: pass, metadata-only result record policy, and zero source module findings. The current checked-in exported bundle is still a source-body floor, not the final numeric exported-bundle proof: run_lab_bundle requires refreshed score-backed numeric rows before it can pass as a full exported-bundle verdict. Focused tests inject those rows to prove the exported-bundle numeric path. The remaining bundle and result record refresh is tracked as outstanding work.
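
Digest verification of copied bodies can be sketched as below. The summary keys module_count, verified_module_count, and source_module_manifest_status are named on this page; the manifest layout ("modules", "path", "sha256"), the finding names, the "blocked" status string, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a copied body on disk."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_manifest(manifest: dict, bundle_root: Path) -> dict:
    """Compare each declared digest with the body on disk; report metadata only."""
    findings, verified = [], 0
    for module in manifest["modules"]:
        body = bundle_root / module["path"]
        if not body.is_file():
            findings.append({"module_id": module["module_id"], "finding": "body_missing"})
        elif sha256_of(body) != module["sha256"]:
            findings.append({"module_id": module["module_id"], "finding": "digest_mismatch"})
        else:
            verified += 1
    return {
        "module_count": len(manifest["modules"]),
        "verified_module_count": verified,
        "source_module_manifest_status": "pass" if not findings else "blocked",
        "findings": findings,  # ids and finding names only, never body text
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    body = root / "standard_body.md"  # hypothetical file name
    body.write_text("public standard body\n", encoding="utf-8")
    manifest = {"modules": [{
        "module_id": "laboratory_standard_body_import",
        "path": "standard_body.md",
        "sha256": sha256_of(body),
    }]}
    clean = verify_manifest(manifest, root)
    body.write_text("edited after copy\n", encoding="utf-8")
    tampered = verify_manifest(manifest, root)
```

The returned summary never includes the body itself, which is how a check like this can verify copied source while keeping result records metadata-only.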

The validator also records the blocked source-open boundary for codex/doctrine/paper_modules/lab_oracle_evolve_pipeline.md: that source paper module cannot be imported as an exact body while raw operator-anchor language remains in scope.

Mechanistic Interpretability Circuit Attribution Replay

Records which model features drove an answer, each tied to checkable evidence. Evidence: 4/5.

Does This takes the workflow of "tracing which internal features inside a model drove an answer" and turns it into inspectable local records. Each row links feature ids to a machine-readable graph of connections, records the before/after results of poking those features (the causal-intervention deltas), notes how far the explanation can be trusted (its faithfulness limit), and points to where the underlying evidence lives. The records show that every interpretability claim is backed by checkable evidence, and they deliberately hold no model weights, no raw activations, no prompts, and no hidden reasoning; they carry only refs, digests, counts, and verdicts.

Scope limit It validates only the declared public circuit-attribution runtime-result record contract. It excludes model-transparency product claims, live model access, export of private weights/raw activations/proprietary prompts/hidden chain-of-thought, external model access, benchmark claims, or public sharing/launch.

Run
microcosm mechanistic-interpretability-circuit-attribution-replay run-attribution-bundle --input examples/mechanistic_interpretability_circuit_attribution_replay/exported_circuit_attribution_bundle --out receipts/runtime_shell/demo_project/organs/mechanistic_interpretability_circuit_attribution_replay

Evidence · Contract validator · evidence 4/5 · Real runtime result

research-workflows · forecasting · provider operations

Source Design note · Source atlas

Paper module Mechanistic Interpretability Circuit Attribution Replay

Purpose

Interpretability writing is unusually easy to overstate. A named feature can read like understanding, a graph picture can read like a discovered circuit, and a small local script can read like access to a real model. This component exists to hold one kind of claim to a smaller, checkable size. It answers a single question: before Microcosm lets a circuit-attribution story stand as public evidence, does the story survive a deterministic replay rather than being taken on trust?

The part worth noticing is how narrow the proof is, and how that narrowness is the point. The component does not attempt to interpret a trained model. It carries a tiny two-layer toy transformer with weights declared in the fixture, recomputes its forward pass, gradient attribution, and per-feature ablation, and then compares the recomputed top feature against the feature the fixture claims. A row passes only when the declared winner still matches after recomputation. Perturb the toy weights and leave the old claim in place and the row is rejected, because the recomputed answer has moved while the prose has not. That is the failure mode the component is built to catch: an interpretability statement that was once true of its inputs but no longer is.

Around that recomputation sit three further gates. Graph evidence must be machine-readable and traversable from declared sparse features to public error nodes, so a screenshot cannot stand in for a circuit. Transparency language needs a causal-intervention reference and faithfulness language needs an explicit limit, so the strongest words carry the strongest evidence requirements. Private weights, raw activations, proprietary prompts, and hidden reasoning are kept out of every result record. What the component produces is an accounting result record for a public fixture, not a transparency tool for any real model.

Abstract

mechanistic_interpretability_circuit_attribution_replay is a public Microcosm component that validates whether circuit-attribution claims are safe to represent as result record evidence. It is not a model-transparency product and does not inspect a live provider model. The component checks a fixture and exported bundle for machine-readable feature graph rows, causal-intervention references, faithfulness limits, source-module digest evidence, negative cases, and a small input-coupled toy-transformer replay.

The technical proof is deliberately modest. A replay passes only when four conditions hold: its declared circuit-attribution story agrees with the recomputed toy-transformer forward, gradient, and ablation winners; graph evidence is traversable from public sparse features to public error nodes; public result records omit private or raw bodies; and the source-open body floor is backed by copied source modules with matching digests. A stale declared top feature is disconfirmed by perturbing the input fixture while leaving the old claim in place.

Problem Statement

Interpretability prose is easy to overclaim: a feature name can sound like transparency, a graph screenshot can sound like a circuit, and a local fixture can sound like model access. This module makes the public claim smaller and more testable. It asks: before Microcosm lets a circuit-attribution story become public evidence, can the story survive a deterministic replay membrane that checks structure, causality refs, source provenance, and explicit scope boundaries?

The answer is local and result record-scoped. Microcosm may claim public circuit-attribution replay accounting for this fixture and exported bundle. It may not claim live model internals, private weights, raw activations, proprietary prompts, hidden reasoning, provider behavior, benchmark claims, publishing-scope decision, hosting, launch-scope decision, or whole-system interpretability correctness.

The technical contribution is therefore an accounting membrane, not a new interpretability algorithm. The membrane turns an interpretability-shaped fixture into a pass/fail public result record by requiring all claim-bearing rows to cross four gates:

| Gate | Accepts | Rejects |
|---|---|---|
| Replay schema | Feature ids, graph rows, causal refs, sufficiency and faithfulness limits, contradiction refs, cold-replay refs, target refs, and metadata-only result record flags. | Missing required fields, unverifiable feature labels, screenshot-only graph evidence, transparency claims without causal-intervention refs, and faithfulness claims without limits. |
| Graph traversal | Machine-readable nodes and edges with a path from declared sparse features to public error nodes. | Disconnected edges and decorative constant-delta edge-weight sequences. |
| Toy recomputation | Fixture-coupled forward, gradient, ablation, weight digest, and declared-winner comparison. | Internal default toy specs, stale declared winners, or uncoupled cached result records. |
| Source/body boundary | Copied source bodies with digest, class, anchor, and metadata-only result record checks. | Private weights, raw activations, proprietary prompt bodies, hidden reasoning, model-output data, body text in result records, and launch-scope decision. |
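
The graph-traversal gate can be illustrated with a breadth-first search plus two small structural checks. This is a sketch under stated assumptions: the node names, the edge layout as (source, target) pairs, and all three helper names are hypothetical, and the component's own analysis may differ in detail.

```python
from collections import deque


def dangling_edges(nodes, edges):
    """Edges whose endpoints are not both declared nodes."""
    declared = set(nodes)
    return [(s, t) for s, t in edges if s not in declared or t not in declared]


def has_path_to_error_node(edges, feature_ids, error_nodes):
    """Breadth-first search from the declared sparse features to any error node."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    queue, seen = deque(feature_ids), set(feature_ids)
    while queue:
        node = queue.popleft()
        if node in error_nodes:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def is_constant_delta(weights, tol=1e-9):
    """True when consecutive weights differ by one fixed step."""
    if len(weights) < 3:
        return False
    step = weights[1] - weights[0]
    return all(abs((b - a) - step) <= tol for a, b in zip(weights[1:], weights[2:]))


nodes = ["feature_a", "mid_1", "error_out", "feature_b"]  # hypothetical ids
edges = [("feature_a", "mid_1"), ("mid_1", "error_out")]

connected = has_path_to_error_node(edges, ["feature_a"], {"error_out"})
isolated = has_path_to_error_node(edges, ["feature_b"], {"error_out"})
unresolved = dangling_edges(nodes, edges + [("mid_1", "ghost")])
decorative = is_constant_delta([0.1, 0.2, 0.3, 0.4])
irregular = is_constant_delta([0.62, 0.17, 0.48])
```

A screenshot cannot satisfy any of these checks, because each one needs nodes and edges as data.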

Technical Mechanism


Source refs

JSON bundle
paper_module.mechanistic_interpretability_circuit_attribution_replay
Diagram source
flowchart TD
  Bundle["JSON bundle paper_module.mechanistic_interpretability_circuit_attribution_replay"]
  Fixture["Fixture / exported bundle feature catalog, replay rows, toy-transformer spec"]
  Policy["Policy gates required fields, forbidden private/raw exports, faithfulness limits"]
  Graph["Graph analyzer feature ids -> edges -> public error nodes"]
  Toy["Toy-transformer replay forward + gradient + ablation recomputation"]
  Source["Source-open body floor copied source bodies + digest checks"]
  Records["metadata-only result records refs, digests, counts, verdicts"]
  Ceiling["Scope limit public replay accounting only"]
  Bundle --> Fixture
  Fixture --> Policy
  Fixture --> Graph
  Fixture --> Toy
  Fixture --> Source
  Policy --> Records
  Graph --> Records
  Toy --> Records
  Source --> Records
  Records --> Ceiling

The component has four coupled checks:

  1. Replay policy validation: each positive row must carry toy prompt refs, sparse feature ids, machine-readable graph nodes and edges, replacement-model approximation scores, causal inhibition and injection refs, causal-intervention result record refs, sufficiency labels, faithfulness limits, contradiction-case refs, cold-replay refs, target refs, and body_in_receipt: false.
  2. Graph analysis: _graph_analysis_for_replay verifies that graph edges resolve to declared nodes and that at least one path exists from the row's sparse feature ids to a public error node. _weight_sequence_analysis rejects simple decorative arithmetic edge-weight sequences across replay rows.
  3. Toy-transformer replay: _toy_transformer_attribution_runtime recomputes a pure-Python two-layer toy transformer from fixture-provided token_ids, embeddings, layer1, layer2, and target_logit_index, then compares the recomputed top attribution and ablation features against declared winners.
  4. Source/body boundary: _source_module_manifest_result, _source_open_body_import_summary, scan_paths, _write_receipts, and result_card verify copied source bodies while keeping result record payloads metadata-only and public-safe.
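
The graph-analysis check in step 2 can be pictured with a minimal sketch. It is not the body of _graph_analysis_for_replay; the edge keys `source` and `target`, the function name, and the returned field names are assumptions made for illustration. It checks that every edge resolves to a declared node and that at least one public error node is reachable from the row's sparse feature ids.

```python
from collections import deque


def graph_path_analysis(nodes, edges, feature_ids, error_nodes):
    """Illustrative check: edges resolve to declared nodes, and an error node is reachable."""
    declared = set(nodes)
    # An edge is unresolved when either endpoint is not a declared node.
    unresolved = [
        e for e in edges
        if e["source"] not in declared or e["target"] not in declared
    ]
    # Build adjacency from resolved edges only.
    adjacency = {}
    for e in edges:
        if e["source"] in declared and e["target"] in declared:
            adjacency.setdefault(e["source"], []).append(e["target"])
    # Breadth-first search starting from every declared sparse feature id.
    queue = deque(f for f in feature_ids if f in declared)
    seen = set(queue)
    reached = set()
    while queue:
        node = queue.popleft()
        if node in error_nodes:
            reached.add(node)
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {
        "unresolved_edges": unresolved,
        "reachable_error_node_count": len(reached),
        "ok": not unresolved and len(reached) > 0,
    }


# Hypothetical node and edge names, not taken from the fixture.
nodes = ["feat_a", "mid", "err_public"]
edges = [
    {"source": "feat_a", "target": "mid"},
    {"source": "mid", "target": "err_public"},
]
connected = graph_path_analysis(nodes, edges, ["feat_a"], {"err_public"})
broken = graph_path_analysis(nodes, edges[:1], ["feat_a"], {"err_public"})
```

Dropping the second edge leaves the graph present but not traversable, which is the distinction the disconnected-edge regression test targets.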

Implementation Contract

| Runtime locus | Role in the mechanism | Evidence surface |
| --- | --- | --- |
| run | First-wave fixture validator. It loads the public input directory, negative cases, source-module manifest, secret-exclusion policy, and sign-off output. | tests/test_mechanistic_interpretability_circuit_attribution_replay.py::test_mechanistic_interpretability_circuit_attribution_replay_observes_negative_cases |
| run_attribution_bundle | Exported-bundle validator for the runtime-shell and public demo path. It uses the same replay gates without requiring first-wave negative-case files. | test_mechanistic_interpretability_exported_bundle_validates_runtime_shape |
| _replay_policy_findings | Row-level policy checker for required fields and forbidden interpretability overclaims. | Negative fixtures in fixtures/.../input/* and EXPECTED_NEGATIVE_CASES |
| _graph_analysis_for_replay / _weight_sequence_analysis | Circuit-graph shape checks: resolvable nodes/edges, feature-to-error paths, and non-decorative weights. | test_mechanistic_interpretability_rejects_disconnected_graph_edges and test_mechanistic_interpretability_rejects_decorative_weight_sequences |
| _toy_transformer_attribution_runtime | Pure-Python recomputation harness for target logit, attribution scores, ablation deltas, declared winners, and fixture digest. | Toy runtime, stale-claim, perturbation, and cache-reuse tests |
| _source_module_manifest_result / _source_open_body_import_summary | Source-open body floor: copied source body checks with digest, class, anchor, and metadata-only result record constraints. | Source-module exact-import and body-text rejection tests |
| _write_receipts / result_card | Public output membrane. Result records and cards carry refs, digests, counts, omitted-payload flags, and scope limits rather than source bodies or private state. | Result record-boundary and card-reuse tests |

Toy-Transformer Attribution Mechanism

The toy-transformer runtime is intentionally small enough to audit. The fixture in fixtures/first_wave/mechanistic_interpretability_circuit_attribution_replay/input/attribution_replays.json declares:

  • token_ids: [0, 1, 2]
  • a three-row embedding table over two dimensions
  • a two-by-three first layer
  • a three-by-two second layer
  • target_logit_index: 1
  • expected top feature by attribution and ablation: toy_hidden_feature_1

The runtime computes token embeddings, averages them into a context vector, applies the first layer, applies a tanh hidden activation, applies the second layer, and reads the target logit. It then computes activation-gradient scores for each hidden feature, using the analytic tanh derivative 1 - h^2 so the attribution score is grounded in the same forward pass rather than a separate estimate. It also ablates each hidden feature in turn, zeroing it and re-reading the target logit, to measure the output delta that feature is responsible for. The fixture currently produces target logit 0.044176; both the gradient attribution and the ablation delta select toy_hidden_feature_1, and the row passes only because those two independent paths agree with each other and with the fixture's declaration.
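
The forward, gradient, and ablation passes described above can be sketched in pure Python. This is a minimal illustration, not the body of _toy_transformer_attribution_runtime: the weights below are made up (so the target logit differs from the fixture's 0.044176), the attribution score is assumed to be pre-activation times gradient, and ranking by absolute value is also an assumption.

```python
import math


def toy_attribution(token_ids, embeddings, layer1, layer2, target_logit_index):
    """Illustrative forward + gradient + ablation pass over a tiny two-layer model."""
    dim = len(embeddings[0])
    hidden_size = len(layer1[0])
    # Average the token embeddings into one context vector.
    context = [sum(embeddings[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]
    # First layer (dim x hidden), then a tanh hidden activation.
    pre = [sum(context[d] * layer1[d][j] for d in range(dim)) for j in range(hidden_size)]
    hidden = [math.tanh(z) for z in pre]

    def target_logit(h):
        # Second layer (hidden x logits), read at the target index only.
        return sum(h[j] * layer2[j][target_logit_index] for j in range(hidden_size))

    logit = target_logit(hidden)
    # Assumed score: pre-activation times d(logit)/d(pre), using tanh' = 1 - h^2.
    attribution = [
        pre[j] * layer2[j][target_logit_index] * (1.0 - hidden[j] ** 2)
        for j in range(hidden_size)
    ]
    # Ablation: zero one hidden feature at a time and re-read the target logit.
    ablation = []
    for j in range(hidden_size):
        ablated = list(hidden)
        ablated[j] = 0.0
        ablation.append(logit - target_logit(ablated))

    def top(scores):
        # Assumed ranking: largest absolute score wins.
        winner = max(range(len(scores)), key=lambda j: abs(scores[j]))
        return f"toy_hidden_feature_{winner}"

    return {
        "target_logit": logit,
        "attribution_scores": attribution,
        "ablation_deltas": ablation,
        "top_attribution_feature": top(attribution),
        "top_ablation_feature": top(ablation),
    }


# Hypothetical weights with the shapes the fixture declares (3x2, 2x3, 3x2).
result = toy_attribution(
    token_ids=[0, 1, 2],
    embeddings=[[0.1, 0.2], [0.3, -0.1], [0.2, 0.5]],
    layer1=[[0.5, 1.0, -0.4], [0.1, 0.5, 0.2]],
    layer2=[[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]],
    target_logit_index=1,
)
declared_winner = "toy_hidden_feature_1"
row_passes = (
    result["top_attribution_feature"] == declared_winner
    and result["top_ablation_feature"] == declared_winner
)
```

The row passes only when both recomputed winners equal the declaration; changing a weight without updating `declared_winner` would flip `row_passes` to false, which mirrors the stale-claim rejection.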

The important point is not that this is a serious transformer. It is a deterministic proof harness for the public replay claim. The result record can say the declared top feature agrees with recomputation only because the verifier recomputes from input fields and compares the result. The result record also records a weight digest so cached or exported bundle cards can prove which fixture basis they are coupled to.
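
One way to picture the weight digest is a hash over a canonical serialization of the fixture spec. The sketch below assumes SHA-256 over sorted-key JSON; the source does not state which hash or serialization the verifier uses, and the function name and spec values are hypothetical.

```python
import copy
import hashlib
import json


def weight_digest(spec):
    """Illustrative digest: SHA-256 over a canonical JSON form of the toy spec."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical spec values; only the field names come from the fixture description.
spec = {
    "token_ids": [0, 1, 2],
    "layer2": [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]],
    "target_logit_index": 1,
}
reordered = {
    "target_logit_index": 1,
    "layer2": [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]],
    "token_ids": [0, 1, 2],
}
perturbed = copy.deepcopy(spec)
perturbed["layer2"][0][1] = -0.5

cached_card = {"weight_digest": weight_digest(spec)}
# A cached card is reusable only when its digest matches the current fixture.
still_coupled = cached_card["weight_digest"] == weight_digest(reordered)
stale_after_edit = cached_card["weight_digest"] != weight_digest(perturbed)
```

Key order does not change the digest, while any weight edit does, so a card cached before a perturbation can be detected as uncoupled.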

Discriminating Tests

The proof is strongest where it distinguishes a real coupling from a plausible but stale story. The focused tests exercise those distinctions directly:

| Test | Fixture move | Expected verdict | Why it matters |
| --- | --- | --- | --- |
| test_mechanistic_interpretability_toy_transformer_input_perturbation_moves_verdict | Changes layer2[0][1] to -0.5 and updates declared winners to toy_hidden_feature_0. | Passes with target logit -0.116939; both attribution and ablation move to toy_hidden_feature_0. | The result record follows changed input when declaration and recomputation remain coupled. |
| test_mechanistic_interpretability_input_perturbation_rejects_stale_claims | Applies the same perturbation but leaves declared winners at toy_hidden_feature_1. | Blocks with INTERPRETABILITY_TOY_TRANSFORMER_DECLARED_TOP_FEATURE_MISMATCH. | The verifier disconfirms stale interpretability claims instead of trusting old fixture prose. |
| test_mechanistic_interpretability_rejects_internal_default_toy_runtime | Removes toy_transformer_runtime from the exported bundle. | Blocks with INTERPRETABILITY_TOY_TRANSFORMER_FIXTURE_SPEC_REQUIRED. | The public proof must be input-coupled, not backed by an internal default. |
| test_mechanistic_interpretability_bundle_card_rejects_uncoupled_cached_receipt | Edits a cached result record so input_coupled_fixture and input_coupled_verdict are false. |  | The command-card path is a freshness optimization, not permission to reuse uncoupled evidence. |
| test_mechanistic_interpretability_rejects_decorative_weight_sequences | Rewrites graph-edge weights into simple arithmetic sequences. | Blocks as suspected decorative graph evidence. | Machine-readable graph rows still need anti-fabrication checks. |
| test_mechanistic_interpretability_rejects_disconnected_graph_edges | Breaks an edge path to a declared public error node. | Blocks with zero path count for the affected row. | A circuit-shaped graph must be traversable, not merely present. |
| test_mechanistic_interpretability_source_modules_reject_body_text_in_receipt | Marks source body text as present in result record material. | Blocks the source/body import. | Source-open evidence remains metadata-only at result record boundaries. |
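
The decorative-weight check can be illustrated with a constant-delta detector. This is a sketch of the idea only: the tolerance, the minimum length, and the function name are assumptions, and the real _weight_sequence_analysis may inspect sequences differently (for example, across replay rows rather than within one list).

```python
def is_constant_delta(weights, tolerance=1e-9, min_length=3):
    """Illustrative test for an arithmetic sequence: every consecutive gap is equal."""
    if len(weights) < min_length:
        # Too short to distinguish a pattern from coincidence (assumed threshold).
        return False
    deltas = [b - a for a, b in zip(weights, weights[1:])]
    return all(abs(d - deltas[0]) <= tolerance for d in deltas)


# Hypothetical edge weights, not taken from the fixture.
decorative = [0.1, 0.2, 0.3, 0.4]
plausible = [0.31, -0.12, 0.57, 0.05]
flags = {
    "decorative": is_constant_delta(decorative),
    "plausible": is_constant_delta(plausible),
}
```

A tolerance is needed because floating-point gaps such as 0.3 - 0.2 and 0.2 - 0.1 are not bit-identical even when the sequence was typed in as an exact arithmetic progression.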

Evidence Contract

| Evidence class | Local authority | What it proves | What it does not establish |
| --- | --- | --- | --- |
| Bundle binding | core/paper_module_capsules.json row 52 | The paper module, component, mechanism, source locus, and generated projection statuses are linked. | Markdown is not promoted to source authority. |
| Replay rows | fixtures/.../input/attribution_replays.json and exported bundle mirror | Six public replay rows with feature ids, graph edges, causal refs, faithfulness limits, contradiction refs, cold replay refs, and metadata-only target refs. | The refs are fixture/accounting evidence, not live model internals. |
| Feature catalog | fixtures/.../input/feature_catalog.json | Six public sparse-feature summary ids with labels and no private weights or activation dumps. | It does not disclose trained-model features or raw activations. |
| Toy runtime | _toy_transformer_attribution_runtime and focused tests | Forward, gradient, ablation, digest, and stale-declaration checks are recomputed from the input fixture. | The toy runtime is not a general interpretability method. |
| Graph analysis | _graph_analysis_for_replay and _weight_sequence_analysis | Graph rows are machine-readable, traversable, and not decorative constant-delta weight sequences. | It does not validate a real neural circuit. |
| Source-open body floor | source_module_manifest.json plus source_modules/ | Eleven copied source bodies have digest/anchor/material-class checks. | Bodies are not copied into result records and do not authorize private/live export. |
| Result record set | receipts/first_wave/..., result records/sign-off/..., runtime-shell lens | Public outputs carry refs, digests, counts, verdicts, omitted-payload flags, and scope limits. | Result records do not publish private model data or launch-scope decision. |

Reader Evidence Routing

The proof consumer for this reader slice is the focused interpretability replay suite plus the paper-module corpus parity check. The table below is the route a rank/projection reader should follow before trusting any claim in this module:

| Reader question | Source surface | Focused proof consumer | Scope limit |
| --- | --- | --- | --- |
| Is this module bound to a real component and mechanism? | core/paper_module_capsules.json::paper_module.mechanistic_interpretability_circuit_attribution_replay and paper_modules/mechanistic_interpretability_circuit_attribution_replay.json | scripts/build_doctrine_projection.py --check-paper-module-corpus |  |
| Does the replay recompute the attribution claim? | _toy_transformer_attribution_runtime over fixture-provided token_ids, weights, and target_logit_index | test_mechanistic_interpretability_toy_transformer_runtime_computes_attribution, perturbation, and stale-claim tests | Proves fixture-local recomputation, not a general interpretability method. |
| Are graph rows actual circuit evidence rather than screenshots? | _graph_analysis_for_replay and _weight_sequence_analysis over declared graph nodes, edges, and public error nodes | disconnected-graph and decorative-weight regression tests | Proves machine-readable traversability and anti-decoration checks, not a real neural circuit. |
| Do source-open bodies stay out of result records? | source_module_manifest.json, copied source_modules/, _source_module_manifest_result, and _write_receipts | source-module exact-import and body-text-in-result record rejection tests | Proves copied body floor and metadata-only result records, not private/live export authority. |
| Where does a reader start when projections disagree? | source record, generated JSON instance, runtime source, focused tests, then result records | corpus check and focused pytest together |  |

Failure Modes And Limitations

  • Missing required replay fields block with INTERPRETABILITY_REPLAY_FIELD_REQUIRED.
  • Feature names without catalog-backed ids block with INTERPRETABILITY_FEATURE_NAME_UNVERIFIABLE.
  • Graph screenshots or disconnected graph rows block because machine-readable edges and traversable paths are required.
  • Transparency language without a causal-intervention result record blocks with INTERPRETABILITY_INTERVENTION_RECEIPT_REQUIRED.
  • Faithfulness language without explicit limits blocks with INTERPRETABILITY_FAITHFULNESS_REQUIRES_LIMITS.
  • Private model weights, raw activation dumps, proprietary prompt exports, hidden chain-of-thought exports, model-output data bodies, and launch-scope decision are forbidden public outputs.
  • Decorative graph-weight sequences block as suspected fabrication.
  • Stale declared toy-transformer winners block when recomputation selects a different top feature.
  • The proof is fixture-local. It verifies a public replay membrane and copied source evidence; it does not certify real-world model faithfulness.
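
The typed refusals listed above can be sketched as a row-level checker that returns finding codes rather than free-text warnings. The four codes are the ones named in this section; the row keys `toy_prompt_refs`, `graph_edges`, `claims_transparency`, `claims_faithfulness`, and `causal_intervention_refs`, and both function names, are hypothetical stand-ins rather than the real schema.

```python
# Hypothetical subset of required keys; the real row schema names more fields.
REQUIRED_FIELDS = ("toy_prompt_refs", "sparse_feature_ids", "graph_edges")


def replay_policy_findings(row, catalog_ids):
    """Illustrative policy check: return typed refusals for one replay row."""
    findings = []
    for field in REQUIRED_FIELDS:
        if not row.get(field):
            findings.append({"code": "INTERPRETABILITY_REPLAY_FIELD_REQUIRED", "field": field})
    for feature_id in row.get("sparse_feature_ids", []):
        if feature_id not in catalog_ids:
            findings.append({"code": "INTERPRETABILITY_FEATURE_NAME_UNVERIFIABLE", "feature": feature_id})
    if row.get("claims_transparency") and not row.get("causal_intervention_refs"):
        findings.append({"code": "INTERPRETABILITY_INTERVENTION_RECEIPT_REQUIRED"})
    if row.get("claims_faithfulness") and not row.get("faithfulness_limits"):
        findings.append({"code": "INTERPRETABILITY_FAITHFULNESS_REQUIRES_LIMITS"})
    return findings


def row_status(findings):
    # Fail closed: any finding blocks the row.
    return "blocked" if findings else "pass"


catalog = {"feature_alpha"}  # hypothetical catalog id
good_row = {
    "toy_prompt_refs": ["prompt_ref_0"],
    "sparse_feature_ids": ["feature_alpha"],
    "graph_edges": [{"source": "feature_alpha", "target": "err_public"}],
}
bad_row = {
    "toy_prompt_refs": ["prompt_ref_0"],
    "sparse_feature_ids": ["unknown_feature"],
    "claims_transparency": True,
    "claims_faithfulness": True,
}
good_findings = replay_policy_findings(good_row, catalog)
bad_codes = [f["code"] for f in replay_policy_findings(bad_row, catalog)]
```

Returning structured codes keeps each refusal machine-checkable, so a negative-case test can assert on the exact code instead of matching prose.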

Relation To Interpretability Literature

The module borrows its accounting shape from the transformer-circuits and mechanistic-interpretability tradition: circuits should be graph-structured, features should be identifiable, causal language should be backed by interventions, and faithfulness language should be bounded. Useful prior-art anchors include Anthropic's transformer-circuits framing, causal scrubbing, and SAE/sparse-feature circuit work.

Microcosm does not reproduce those methods. The local contribution is a public replay boundary around an interpretability-shaped claim: machine-readable edges instead of screenshots, causal-intervention refs instead of bare transparency language, fixture recomputation instead of stale row trust, and explicit scope boundaries before a claim becomes public evidence.

Relation To Microcosm Concepts, Mechanisms, And Principles

The bundle binds this module to:

  • concept.research_and_science_replay_evidence_bundle
  • mechanism.mechanistic_interpretability_circuit_attribution_replay.validates_public_mechanistic_interpretability_circuit_attribution_replay
  • principles P-2, P-4, P-8, and P-9
  • axioms AX-3, AX-5, AX-7, and AX-8

The practical reading is:

  • P-2: claim language stays below the strength of the checker.
  • P-4: public proof routes through result records and explicit evidence refs.
  • P-8: failed preconditions are typed refusals, not vague warnings.
  • P-9: provenance crosses from fixture, source, and result record without upgrading authority.
  • AX-3: dereferenced proof and policy refs matter more than prose labels.
  • AX-5: status fails closed across all required parts.
  • AX-7: partial computation returns a typed refusal.
  • AX-8: public fixture and copied-source labels propagate without becoming private model access.

Named Proof Consumers

Run from microcosm-substrate:

This consumes the first-wave fixture, negative cases, source-module mirror, secret scan, toy-transformer replay, and result record writer.

PYTHONPATH=src ../repo-python -m microcosm_core.organs.mechanistic_interpretability_circuit_attribution_replay run-attribution-bundle \
  --input examples/mechanistic_interpretability_circuit_attribution_replay/exported_circuit_attribution_bundle \
  --out /tmp/microcosm-mechanistic-interpretability-circuit-attribution-replay/bundle \
  --card

This consumes the exported circuit-attribution bundle, copied body floor, digest checks, metadata-only result records, command-card omission contract, and runtime-shell validation shape.

PYTHONPATH=src ../repo-python -m pytest -p no:cacheprovider tests/test_mechanistic_interpretability_circuit_attribution_replay.py -q
PYTHONPATH=src ../repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

The focused regression pins recomputation, stale-row rejection, graph and source-body gates, card result record reuse, and body-text exclusions.

Reader Route

A cold reader should inspect in this order:

  1. core/paper_module_capsules.json row 52 for authority and projection binding.
  2. paper_modules/mechanistic_interpretability_circuit_attribution_replay.json for generated relationship edges.
  3. src/microcosm_core/organs/mechanistic_interpretability_circuit_attribution_replay.py for runtime logic.
  4. tests/test_mechanistic_interpretability_circuit_attribution_replay.py for the stale-row, perturbation, graph, source-body, and result record-boundary proof.
  5. fixtures/first_wave/mechanistic_interpretability_circuit_attribution_replay/input for the fixture.
  6. examples/mechanistic_interpretability_circuit_attribution_replay/exported_circuit_attribution_bundle for the public bundle.
  7. receipts/first_wave/mechanistic_interpretability_circuit_attribution_replay and receipts/runtime_shell/public_mechanistic_interpretability_circuit_attribution_replay_lens.json for metadata-only public result record evidence.

Prior Art Grounding

This replay exercises a circuit-attribution pass that traces which internal components account for a behaviour. It is grounded in mechanistic interpretability, the study of the internal circuits of neural networks (Anthropic, Transformer Circuits). Microcosm borrows the attribution-replay shape over synthetic fixtures; the result is fixture-bound runtime evidence, not live model access, a transparency product, or a correctness claim about any real model.

Validation Result Record Path

Reader-verifiable commands, run from the microcosm-substrate/ public root:

PYTHONPATH=src python3 -m pytest tests/test_mechanistic_interpretability_circuit_attribution_replay.py -q
PYTHONPATH=src python3 scripts/build_doctrine_projection.py --check-paper-module-corpus

These are reader-verifiable evidence only and do not include launch operations, external model access, source-file changes, or whole-system correctness.

Scope boundary

Authority And Evidence Boundary
  • Source authority: core/paper_module_capsules.json::paper_modules[52:paper_module.mechanistic_interpretability_circuit_attribution_replay] with source_authority: json_capsule.
  • Generated instance: paper_modules/mechanistic_interpretability_circuit_attribution_replay.json.
  • Runtime: src/microcosm_core/organs/mechanistic_interpretability_circuit_attribution_replay.py.
  • Focused tests: tests/test_mechanistic_interpretability_circuit_attribution_replay.py.
  • Governing standard: standards/std_microcosm_mechanistic_interpretability_circuit_attribution_replay.json.

This Markdown is a human-readable paper projection. The bundle JSON binds the component, mechanism, source locus, generated Mermaid status available_from_capsule_edges, and Atlas status linked_from_capsule_edges. The runtime, fixtures, tests, result records, and manifests are the technical evidence for the claims below.

Scope limit

This module may claim:

  • public, cold-replayable circuit-attribution accounting for the named fixture and exported bundle;
  • feature ids tied to machine-readable graph edges and traversable public error-node paths;
  • causal-intervention result record refs and faithfulness-limit refs are required before transparency or faithfulness language passes;
  • the toy-transformer declaration is input-coupled to recomputed forward, gradient, and ablation evidence;
  • stale toy-transformer declarations are rejected by focused tests;
  • copied source bodies are verified by manifest and digest checks while result records remain metadata-only.

It may not claim:

  • live model access or external model access;
  • private weights, raw activation tensors/dumps, proprietary prompts, hidden chain-of-thought, hidden reasoning, or model-output data export;
  • real model-transparency product status;
  • benchmark claims authority;
  • public sharing, hosted-product readiness, launch-scope decision, or recipient-send authority;
  • whole-system interpretability correctness.
Source and projection details
Source-Open Body Floor

The source-open body floor is declared in:

  • examples/mechanistic_interpretability_circuit_attribution_replay/exported_circuit_attribution_bundle/source_module_manifest.json
  • fixtures/first_wave/mechanistic_interpretability_circuit_attribution_replay/input/source_module_manifest.json

The manifest covers copied source bodies: Oracle attribution maps, pattern-ledger rows, high-novelty scout records, component projection IR, projection readiness code, mission transaction preflight code, execution trace code, strict JSON code, and trace/readiness standards. The runtime verifies classification, material class, body-copied status, body-not-in-result record status, target digest, source/target digest agreement, line count when the source is available, and required anchors.
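
The per-body checks listed above can be sketched as one function over a manifest entry and the copied text. The entry keys `target_sha256`, `source_sha256`, `line_count`, and `required_anchors`, the finding labels, and the use of SHA-256 are assumptions for illustration; only `body_in_receipt` appears in the source, and the real manifest schema may differ.

```python
import hashlib


def verify_copied_body(entry, body_text):
    """Illustrative body-floor check: digest, line count, anchors, and receipt flag."""
    findings = []
    digest = hashlib.sha256(body_text.encode("utf-8")).hexdigest()
    if digest != entry.get("target_sha256"):
        findings.append("target_digest_mismatch")
    if entry.get("source_sha256") != entry.get("target_sha256"):
        findings.append("source_target_digest_disagreement")
    expected_lines = entry.get("line_count")
    # Line count is checked only when the manifest declares one.
    if expected_lines is not None and len(body_text.splitlines()) != expected_lines:
        findings.append("line_count_mismatch")
    for anchor in entry.get("required_anchors", []):
        if anchor not in body_text:
            findings.append(f"missing_anchor:{anchor}")
    # The body itself must never be marked as present in result record material.
    if entry.get("body_in_receipt") is not False:
        findings.append("body_text_in_result_record")
    return findings


# Hypothetical copied body and manifest entry.
body = "def strict_json_loads(text):\n    return text\n"
body_digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
entry = {
    "target_sha256": body_digest,
    "source_sha256": body_digest,
    "line_count": 2,
    "required_anchors": ["strict_json_loads"],
    "body_in_receipt": False,
}
clean = verify_copied_body(entry, body)
tampered = verify_copied_body(entry, body.replace("strict_json_loads", "loads"))
leaked = verify_copied_body({**entry, "body_in_receipt": True}, body)
```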

The body floor excludes private model weights, raw activations, proprietary prompts, hidden reasoning, model-output data, account or browser state, browser or HUD state, account secret material, private source-root material, public sharing, hosting, and launch-scope decision.

Prediction Oracle Reconciliation · Replays a forecast against the discipline a careful predictor would have to defend. · 3/5

Does Runs a made-up forecasting case through the discipline a careful predictor would have to defend: which way a fork was called and why the losing side was ruled out, whether each prediction stayed inside the pre-declared list of allowed outcomes, that no "after the fact" evidence got used as if it were known in advance, how the guesses compared to a synthetic "what actually happened" result, and that any edits to the running record were small, allowed changes rather than rewrites. The reasoning is laid out as inspectable records rather than a single handed-down verdict. Everything is invented test data — it makes no real forecast and claims no track record.

Scope limit It exercises projection mechanics on a synthetic, invented packet only. It does not establish forecasting correctness or accuracy, recommend trading, financial, or investment-related actions, call live market data or providers, publish predictions, claim any performance or track record, import non-public data, or include launch operations.

Run
PYTHONPATH=src python3 -m microcosm_core.organs.prediction_oracle_reconciliation run --input fixtures/first_wave/prediction_oracle_reconciliation/input --out receipts/first_wave/prediction_oracle_reconciliation

Evidence · Computed projection · evidence 3/5 · Source-faithful refactor

research-workflows · forecasting · provider operations

Source Design note · Source atlas

Paper module Prediction Oracle Reconciliation

prediction_oracle_reconciliation is a source-available runtime fixture component for the prediction-engine slice. It compresses the source pattern group around CP1 bifurcation resolution, CP2 valid target universes, oracle grounding firewalls, diff grading, and dossier mutation into a synthetic packet a cold reader can run.

It is deliberately not a market product. The component has no live data, no external model access, no trading authority, no financial or investment-related actions authority, no publishing-scope decision, and no launch-scope decision. Its job is to make the reasoning shape inspectable without making performance or action claims. The result record contract is source-open by default: public fixture packets, exported bundle refs, source refs, and runtime result records carry the evidence, while secret_exclusion_scan blocks only live market feeds, model-output data bodies, account or browser material, private dossiers, and account secret-equivalent access.

Purpose

A forecast that gets the direction right can still be badly wrong about the number, and a forecast can look accurate only because it quietly used evidence that arrived after the outcome it was meant to predict. This component exists to make those two failures visible on a synthetic packet, before any reasoning is dressed up as a track record. The single question it answers is narrow: does this prediction packet keep its evidence honest and its grading recomputable, or does it cut a corner?

The unusual choice is that the component does not trust the numbers the packet reports. For every numeric row it recomputes the absolute error, the percent error, and the direction hit from the snapshot, predicted, and realized prices, then rejects any claimed value that contradicts the recompute. It also surfaces a direction hit that is still a large numeric miss rather than letting the correct arrow hide the size of the error. Evidence is split at the prediction time: a reference that points past the target window is refused, not silently scored.

None of this is forecasting. There is no live market data, no external model access, no trading or investment-related actions, and no performance claim. The packet, its target universe, and its realized values are invented fixtures. A direction hit or a numeric miss inside a result record is a statement about the fixture and the grading mechanics, nothing more.

Public Contract

The input packet names:

  • source_pattern_ids for the source pattern family being projected.
  • valid_prediction_targets and target_universe for the CP2 gate.
  • cp1_branches with selected side, rationale refs, and opposite-side invalidation refs.
  • cp2_predictions with pre-target evidence refs and grounding ids.
  • oracle_diff rows that grade synthetic realized direction against prediction.
  • dossier_mutations constrained to fixture deltas.
  • public_runtime_refs for the public fixture, exported bundle, and paper module system refs.
  • authority_ceiling values that explicitly keep trading, advice, provider, live-market, public sharing, launch, and secret-export authority false.

How it works

validate_reconciliation_packet runs five checks over the packet and folds the findings into one status. Each check guards a specific way a forecast can flatter itself.

CP1 resolution. Every cp1_branches row must name the side it chose, carry rationale refs, and keep an opposite_side_invalidation_ref, the record of why the losing side lost. A branch that asserts a winner without retaining the discarded alternative is rejected as an unresolved bifurcation. Equity or market-lane branches additionally need an explicit confirmation bit before they count.
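
A minimal sketch of the CP1 check, assuming hypothetical keys `selected_side`, `rationale_refs`, `lane`, and `confirmed` (only `opposite_side_invalidation_ref` is named in the source) and hypothetical finding labels:

```python
def cp1_findings(branch):
    """Illustrative CP1 check: a chosen side needs its rationale and the loser's invalidation."""
    findings = []
    if not branch.get("selected_side"):
        findings.append("missing_selected_side")
    if not branch.get("rationale_refs"):
        findings.append("missing_rationale_refs")
    # Without the record of why the other side lost, the fork is still open.
    if not branch.get("opposite_side_invalidation_ref"):
        findings.append("unresolved_bifurcation")
    # Equity or market-lane branches need an explicit confirmation bit.
    if branch.get("lane") in {"equity", "market"} and branch.get("confirmed") is not True:
        findings.append("unconfirmed_equity_or_market_lane")
    return findings


resolved = cp1_findings({
    "selected_side": "up",
    "rationale_refs": ["rationale_ref_0"],
    "opposite_side_invalidation_ref": "invalidation_ref_0",
    "lane": "equity",
    "confirmed": True,
})
unresolved = cp1_findings({
    "selected_side": "up",
    "rationale_refs": ["rationale_ref_0"],
    "lane": "equity",
})
```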

CP2 universe and pre-target evidence. Predictions must name a target_id inside the declared valid_prediction_targets, so the set of things being predicted is fixed before the outcome rather than chosen afterwards. Evidence refs must be pre-target: a ref is accepted only if it carries the T- time prefix, and a reference that points past the target window raises PREDICTION_ORACLE_POST_T_EVIDENCE_FORBIDDEN. This is the gate that stops a packet from grading itself with hindsight.
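
The CP2 gate reduces to two membership tests, sketched below. The code PREDICTION_ORACLE_POST_T_EVIDENCE_FORBIDDEN and the T- prefix rule come from the text above; the key `evidence_refs`, the label `target_outside_universe`, and the example ref strings are hypothetical.

```python
def cp2_findings(prediction, valid_prediction_targets):
    """Illustrative CP2 check: target inside the declared universe, evidence pre-target."""
    findings = []
    if prediction.get("target_id") not in valid_prediction_targets:
        findings.append("target_outside_universe")
    for ref in prediction.get("evidence_refs", []):
        # A ref is accepted only if it carries the T- time prefix.
        if not ref.startswith("T-"):
            findings.append("PREDICTION_ORACLE_POST_T_EVIDENCE_FORBIDDEN")
    return findings


universe = {"target_a", "target_b"}  # hypothetical target ids
honest = cp2_findings(
    {"target_id": "target_a", "evidence_refs": ["T-3:note_0", "T-1:note_1"]},
    universe,
)
hindsight = cp2_findings(
    {"target_id": "target_c", "evidence_refs": ["T-1:note_1", "T+2:outcome_note"]},
    universe,
)
```

Fixing the universe before grading and refusing any non-prefixed ref means a packet cannot pick its targets or its evidence after the outcome is known.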

Recomputed numeric grading. This is the part that does real arithmetic. For each graded row the component takes the snapshot, predicted, and realized prices and recomputes the absolute delta, the percent delta against the snapshot, and the direction hit. If the row also reports its own abs_error, pred_error_pct, or direction_hit, the claimed value must match the recompute or the row is rejected. Several further rules matter. A row whose direction is correct but whose error clears the floor (ten in absolute terms, or five percent) is surfaced as a large miss, so a right arrow cannot conceal a large numeric error. A row with no realized price is not fabricated into a graded row. A row marked degraded is gated out of grading rather than scored. The STOCK and ETF asset classes are counted separately rather than blended.
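
The recompute-and-compare step can be sketched as follows. The field names abs_error, pred_error_pct, and direction_hit come from the text; the keys `snapshot`, `predicted`, `realized`, and `degraded`, the finding labels, the exact formulas, and the reading of "clears the floor" as greater-than-or-equal are assumptions.

```python
import math

ABS_FLOOR = 10.0  # ten in absolute terms
PCT_FLOOR = 5.0   # five percent


def grade_row(row):
    """Illustrative grading: recompute errors, compare claims, surface large misses."""
    # Degraded rows and rows without a realized price are gated, never fabricated.
    if row.get("degraded") or row.get("realized") is None:
        return {"graded": False, "findings": []}
    snapshot, predicted, realized = row["snapshot"], row["predicted"], row["realized"]
    if snapshot == 0:
        return {"graded": False, "findings": ["snapshot_zero"]}
    abs_error = abs(predicted - realized)
    pred_error_pct = abs_error * 100.0 / abs(snapshot)
    # Assumed direction rule: prediction and outcome moved the same way from the snapshot.
    direction_hit = (predicted - snapshot) * (realized - snapshot) > 0
    findings = []
    for name, value in (("abs_error", abs_error), ("pred_error_pct", pred_error_pct)):
        if name in row and not math.isclose(row[name], value, rel_tol=1e-9, abs_tol=1e-9):
            findings.append(f"claimed_{name}_mismatch")
    if "direction_hit" in row and row["direction_hit"] != direction_hit:
        findings.append("claimed_direction_hit_mismatch")
    large_miss = direction_hit and (abs_error >= ABS_FLOOR or pred_error_pct >= PCT_FLOOR)
    return {
        "graded": True,
        "abs_error": abs_error,
        "pred_error_pct": pred_error_pct,
        "direction_hit": direction_hit,
        "large_miss": large_miss,
        "findings": findings,
    }


# Hypothetical prices: the arrow is right, but the number is far off.
right_arrow = grade_row({"snapshot": 100.0, "predicted": 120.0, "realized": 104.0})
flattering = grade_row({"snapshot": 100.0, "predicted": 120.0, "realized": 104.0, "abs_error": 4.0})
no_truth = grade_row({"snapshot": 100.0, "predicted": 120.0, "realized": None})
```

In the first row both prices rise above the snapshot, so the direction is a hit, yet the prediction misses the realized price by 16, which is why the sketch reports it as a large miss instead of a clean success.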

Oracle diff and bounded mutation. The oracle_diff rows grade synthetic realized direction against each prediction, and dossier_mutations may only add a contradiction, revise a confidence band, or retire a claim. A high-severity mutation needs two evidence refs and an explicit public-delta allowlist before it is allowed.
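
The mutation bound can be sketched as an allowlist plus a stricter rule for high severity. The three permitted operations come from the text; their identifiers, the keys `kind`, `severity`, `evidence_refs`, and `public_delta_allowlist`, and the finding labels are hypothetical.

```python
# Hypothetical identifiers for the three permitted operations.
ALLOWED_KINDS = {"add_contradiction", "revise_confidence_band", "retire_claim"}


def mutation_findings(mutation):
    """Illustrative bound: only small, allowed dossier deltas; high severity needs more."""
    findings = []
    if mutation.get("kind") not in ALLOWED_KINDS:
        findings.append("mutation_kind_not_allowed")
    if mutation.get("severity") == "high":
        if len(mutation.get("evidence_refs", [])) < 2:
            findings.append("high_severity_needs_two_evidence_refs")
        if not mutation.get("public_delta_allowlist"):
            findings.append("high_severity_needs_public_delta_allowlist")
    return findings


bounded = mutation_findings({"kind": "revise_confidence_band", "severity": "low"})
rewrite = mutation_findings({"kind": "rewrite_history", "severity": "high", "evidence_refs": ["T-1:note_1"]})
```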

A run passes only when at least two CP1 branches, two CP2 predictions, two graded numeric rows across both asset classes, and one bounded mutation are present, the recompute and evidence gates raise no findings, the source-module digests match, and the secret scan is clean. The result record carries counts, verdicts, and authority booleans; the packet body, claimed numbers, and source bodies stay out of it.
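
The pass condition can be read as a single conjunction, sketched below. The count keys and status strings are hypothetical, and "two graded numeric rows across both asset classes" is interpreted here as at least two graded rows in total with both STOCK and ETF represented, which is an assumption.

```python
def packet_status(counts, findings, digests_match, secret_scan_clean):
    """Illustrative fold: every floor and every gate must hold, otherwise the run blocks."""
    graded_total = counts["graded_stock_rows"] + counts["graded_etf_rows"]
    floors_met = (
        counts["cp1_branches"] >= 2
        and counts["cp2_predictions"] >= 2
        and graded_total >= 2
        and counts["graded_stock_rows"] >= 1
        and counts["graded_etf_rows"] >= 1
        and counts["bounded_mutations"] >= 1
    )
    passed = floors_met and not findings and digests_match and secret_scan_clean
    return "pass" if passed else "blocked"


counts = {
    "cp1_branches": 2,
    "cp2_predictions": 2,
    "graded_stock_rows": 1,
    "graded_etf_rows": 1,
    "bounded_mutations": 1,
}
clean_run = packet_status(counts, [], True, True)
one_finding = packet_status(counts, ["PREDICTION_ORACLE_POST_T_EVIDENCE_FORBIDDEN"], True, True)
single_class = packet_status({**counts, "graded_stock_rows": 2, "graded_etf_rows": 0}, [], True, True)
```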

Shape

[Diagram: the synthetic prediction packet fans out to CP1 resolution, CP2 universe + evidence, recomputed numeric grading, and oracle diff + mutation; numeric grading surfaces direction-right numeric misses and gates degraded or missing-truth rows; every branch feeds the metadata-only result records, which sit under the synthetic-fixture-only scope limit.]
Diagram source
flowchart TD
  Packet["Synthetic prediction packet target universe, CP1 branches, CP2 predictions, oracle diff, numeric rows, dossier mutations"]
  CP1["CP1 resolution chosen side + rationale + why the opposite side lost; equity lane needs confirmation"]
  CP2["CP2 universe + evidence target inside declared universe; evidence must be pre-target (T-)"]
  Numeric["Recomputed numeric grading abs error, percent error, direction hit recomputed; claimed values must match"]
  Oracle["Oracle diff + mutation realized vs predicted direction; bounded dossier deltas"]
  LargeMiss["Direction-right, numeric-miss surfaced, not hidden"]
  Gated["Degraded / missing-truth rows gated, not fabricated"]
  Records["metadata-only result records result, board, validation, sign-off; counts and verdicts"]
  Ceiling["Scope limit synthetic fixture only; no trading, advice, provider, live market, publish, launch"]
  Packet --> CP1
  Packet --> CP2
  Packet --> Numeric
  Packet --> Oracle
  Numeric --> LargeMiss
  Numeric --> Gated
  CP1 --> Records
  CP2 --> Records
  LargeMiss --> Records
  Gated --> Records
  Oracle --> Records
  Records --> Ceiling

Evidence/accounting:

  • Bundle authority: core/paper_module_capsules.json::paper_modules[54:paper_module.prediction_oracle_reconciliation] sets source_authority: json_capsule, binds the component, binds mechanism.prediction_oracle_reconciliation.validates_public_prediction_oracle_reconciliation, and resolves src/microcosm_core/organs/prediction_oracle_reconciliation.py.
  • Generated instance: paper_modules/prediction_oracle_reconciliation.json reports paper_module_payload.source_authority: json_capsule, Mermaid available_from_capsule_edges, Atlas linked_from_capsule_edges, 15 relationship edges, and no unpopulated selective relations.
  • Runtime and fixture floor: src/microcosm_core/organs/prediction_oracle_reconciliation.py exposes run, run_prediction_bundle, validate_source_module_imports, validate_reconciliation_packet, _source_open_body_import_summary, write_receipts, EXPECTED_NEGATIVE_CASES, and AUTHORITY_CEILING. fixtures/first_wave/prediction_oracle_reconciliation/input/reconciliation_packet.json carries the synthetic CP1/CP2, oracle-diff, target-universe, and dossier-mutation evidence shape.
  • Exported bundle and result records: examples/prediction_oracle_reconciliation/exported_prediction_oracle_bundle/source_module_manifest.json and the exported source artifacts provide source-open replay evidence. receipts/first_wave/prediction_oracle_reconciliation/prediction_oracle_reconciliation_result.json, prediction_oracle_validation_receipt.json, and result records/sign-off/first_wave/prediction_oracle_reconciliation_fixture_acceptance.json keep the result record metadata-only and fixture-bounded.
  • Test and claim boundary: tests/test_prediction_oracle_reconciliation.py checks invalid target universes, unresolved CP1 branches, post-target evidence, unsafe dossier mutation, live-market/trading/advice overclaims, exported-bundle validation, and source-module digest gates. The structured source record scope limit excludes forecasting correctness, financial decisions, trading authority, live market data, external model access, prediction public sharing, performance track record, non-public data import, launch-scope decision, publishing-scope decision, and whole-system correctness.

Reader Evidence Routing

Open this module as a reader map, not as prediction evidence. Use the runtime fixture input for packet shape, the exported bundle for source-open replay, the structured source record for relationship edges, and the test file for the negative cases that enforce the scope limit.

Route evidence in this order:

  1. Read the structured lattice bindings section to confirm the source record path and subject edges.
  2. Inspect the fixture input for declared target universes, CP1 branches, CP2 prediction evidence, oracle-diff rows, and fixture-bounded dossier mutations.
  3. Run the fixture and exported-bundle commands to produce metadata-only result records.
  4. Check tests/test_prediction_oracle_reconciliation.py for the negative cases that reject target-universe escapes, unresolved CP1 branches, post-target evidence, live-market overclaims, and authority overclaims.
  5. Use paper_modules/prediction_oracle_reconciliation.json as the generated relationship graph for this module.

Negative Cases

The fixture rejects:

  • a CP2 prediction outside the target universe;
  • an unresolved CP1 bifurcation;
  • post-target evidence used as prediction evidence;
  • unconfirmed equity or market-lane claims;
  • unsafe high-severity dossier mutation;
  • trading, advice, live-provider, public sharing, launch, or secret-export authority overclaims.
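The first three rejections can be sketched as a small checker. This is an illustration only: the packet field names (target_universe, target_date, cp1_branches, cp2_predictions, evidence_dates) and the rejection codes are hypothetical, not the component's actual schema, and treating same-day evidence as post-target is an assumption.

```python
from datetime import date


def check_packet(packet):
    """Return rejection codes for a synthetic packet (hypothetical schema and codes)."""
    errors = []
    universe = set(packet["target_universe"])
    target_date = packet["target_date"]

    # Every CP1 branch must be resolved to a chosen side.
    for branch in packet["cp1_branches"]:
        if branch.get("chosen_side") is None:
            errors.append("CP1_UNRESOLVED_BRANCH")

    for pred in packet["cp2_predictions"]:
        # The CP2 target must sit inside the declared universe.
        if pred["target"] not in universe:
            errors.append("CP2_TARGET_OUTSIDE_UNIVERSE")
        # Evidence must be strictly pre-target (T-); same-day counts as post-target here.
        if any(d >= target_date for d in pred["evidence_dates"]):
            errors.append("POST_TARGET_EVIDENCE")
    return errors


packet = {
    "target_universe": ["ASSET_A", "ASSET_B"],
    "target_date": date(2024, 1, 10),
    "cp1_branches": [{"chosen_side": "up"}, {"chosen_side": None}],
    "cp2_predictions": [
        {"target": "ASSET_A", "evidence_dates": [date(2024, 1, 5)]},
        {"target": "ASSET_Z", "evidence_dates": [date(2024, 1, 12)]},
    ],
}
print(check_packet(packet))
```

Returning codes instead of raising keeps each refusal a typed outcome that a result record can count.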

Prior Art Grounding

This component is grounded in probabilistic forecast evaluation and prediction market infrastructure. The Brier score is an early probability-forecast verification anchor, proper-scoring-rule work such as Gneiting and Raftery motivates incentive-compatible forecast scoring, and Hanson's logarithmic market scoring rule grounds the prediction-market idea that forecasts can be updated and evaluated through explicit scoring mechanisms. Forecasting tournament work around tracking and calibration also motivates separating prediction evidence from post-outcome explanation.

Microcosm borrows the reconciliation pattern: declare the target universe before the outcome, keep pre-target evidence separate from post-target evidence, grade against a synthetic oracle diff, and constrain dossier mutation to declared fixture deltas. It does not trade, advise, publish predictions, or claim forecast performance.

Commands

PYTHONPATH=src python3 -m microcosm_core.organs.prediction_oracle_reconciliation run \
  --input fixtures/first_wave/prediction_oracle_reconciliation/input \
  --out receipts/first_wave/prediction_oracle_reconciliation

PYTHONPATH=src python3 -m microcosm_core.organs.prediction_oracle_reconciliation run-prediction-bundle \
  --input examples/prediction_oracle_reconciliation/exported_prediction_oracle_bundle \
  --out receipts/runtime_shell/demo_project/organs/prediction_oracle_reconciliation

Validation Result record Path

Run from microcosm-substrate:

PYTHONPATH=src ../repo-python -m microcosm_core.organs.prediction_oracle_reconciliation run \
  --input fixtures/first_wave/prediction_oracle_reconciliation/input \
  --out /tmp/microcosm-prediction-oracle-reconciliation/fixture \
  --card
PYTHONPATH=src ../repo-python -m microcosm_core.organs.prediction_oracle_reconciliation run-prediction-bundle \
  --input examples/prediction_oracle_reconciliation/exported_prediction_oracle_bundle \
  --out /tmp/microcosm-prediction-oracle-reconciliation/bundle \
  --card
PYTHONPATH=src ../repo-python -m pytest -p no:cacheprovider tests/test_prediction_oracle_reconciliation.py -q
PYTHONPATH=src ../repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

A passing run proves only synthetic target-universe reconciliation, CP1/CP2 accounting, oracle-diff grading, and fixture-bounded dossier mutation; it does not establish forecasting performance, financial decisions, trading authority, live market access, public sharing, or launch.

Scope boundary

Scope limit

This module covers only fixture-bounded prediction-oracle reconciliation: synthetic target-universe accounting, CP1/CP2 separation, oracle-diff grading, dossier mutation constraints, copied source-module import evidence, negative cases, and public result records. It does not prove forecasting accuracy, financial decisions, trading authority, live-market access, provider behavior, prediction public sharing, performance track record, private-data import, launch-scope decision, publishing-scope decision, or whole-system correctness.

Limitations

The target universe, CP1 branches, CP2 evidence, realized values, oracle diff, and dossier mutations are fixture artifacts. They exercise the shape of a reconciliation pipeline, but they are not live market data, a validated forecasting track record, an investment strategy, or a prediction public sharing surface. A direction hit or numeric miss inside the result record is evidence about the synthetic packet only.
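A minimal sketch of how a numeric row might be regraded from its raw values follows. The row fields (predicted, realized, baseline, claimed) and the status strings are assumptions for illustration; the component's real row schema is not shown in this document. The direction hit is taken relative to an assumed baseline value.

```python
import math


def _same(a, b, tol):
    # Booleans and None are compared exactly; numbers within a tolerance.
    if isinstance(a, bool) or isinstance(b, bool) or a is None or b is None:
        return a == b
    return math.isclose(a, b, abs_tol=tol)


def grade_row(row, tol=1e-9):
    """Recompute grading for one synthetic row (hypothetical schema)."""
    if row.get("realized") is None:
        # Missing truth is gated, never filled in.
        return {"status": "gated_missing_truth"}
    predicted, realized, baseline = row["predicted"], row["realized"], row["baseline"]
    abs_error = abs(predicted - realized)
    percent_error = None if realized == 0 else 100.0 * abs_error / abs(realized)
    direction_hit = (predicted - baseline) * (realized - baseline) > 0
    recomputed = {
        "abs_error": abs_error,
        "percent_error": percent_error,
        "direction_hit": direction_hit,
    }
    # Any claimed value must match the recomputed one.
    claimed = row.get("claimed", {})
    mismatches = [k for k, v in claimed.items() if not _same(recomputed.get(k), v, tol)]
    status = "claim_mismatch" if mismatches else "graded"
    return {"status": status, **recomputed, "mismatches": mismatches}


row = {"predicted": 110.0, "realized": 104.0, "baseline": 100.0,
       "claimed": {"abs_error": 6.0, "direction_hit": True}}
print(grade_row(row))
```

The example row is direction-right with a numeric miss of 6.0: the direction hit is reported alongside the error, not instead of it.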

The exported bundle is source-open in the narrow body-floor sense. It digest-checks copied source contracts, node manifests, tool code, pattern rows, and route-decision artifacts while keeping body text out of result records. That does not certify private source-root equivalence, provider behavior, account or session state, hidden market feeds, private dossiers, or launch-scope decision.
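The digest check can be pictured with the sketch below, which assumes a manifest shaped as a list of path and sha256 entries. That shape, and the status strings, are hypothetical; the real source_module_manifest.json layout is not reproduced here. Findings carry paths and verdicts only, never file bodies.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(bundle_root, manifest):
    """Compare each copied file's sha256 with the manifest (hypothetical layout)."""
    findings = []
    for entry in manifest["modules"]:
        path = Path(bundle_root) / entry["path"]
        if not path.is_file():
            findings.append({"path": entry["path"], "status": "missing"})
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        status = "ok" if digest == entry["sha256"] else "digest_mismatch"
        # Metadata only: the body is hashed but never stored in the finding.
        findings.append({"path": entry["path"], "status": status})
    return findings


def demo():
    """Build a throwaway bundle with one good, one tampered, one missing file."""
    with tempfile.TemporaryDirectory() as root:
        body = b"def anchor():\n    return 1\n"
        (Path(root) / "module_a.py").write_bytes(body)
        (Path(root) / "module_c.py").write_bytes(b"tampered\n")
        manifest = {"modules": [
            {"path": "module_a.py", "sha256": hashlib.sha256(body).hexdigest()},
            {"path": "module_b.py", "sha256": "0" * 64},
            {"path": "module_c.py", "sha256": "0" * 64},
        ]}
        return verify_manifest(root, manifest)


print(demo())
```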

The negative cases are scoped regression guards. They reject invalid targets, unresolved bifurcations, post-target evidence, unconfirmed equity-lane claims, unsafe dossier mutation, trading/advice overclaims, degraded feed misuse, missing realized numeric truth, and asset-class mixing. Those refusals do not prove full financial safety, whole-system correctness, runtime correctness outside the named component, or complete secret absence beyond the declared scanner envelope.

Scope limit

Synthetic invented prediction packet and source-module import evidence only; no forecasting correctness or accuracy, no trading, financial, or investment-related actions, no live market data, no external model access, no prediction public sharing, no performance track record, no non-public data import, no launch-scope decision, no publishing-scope decision, and no whole-system correctness.

Scope boundary

This module demonstrates synthetic prediction-reconciliation mechanics only. It does not trade, take financial or investment-related actions, call live market providers, publish predictions, claim forecasting performance, import non-public data, or include launch operations.

Source and projection details
Governing Lattice Relation
  • source record: core/paper_module_capsules.json::paper_modules[54:paper_module.prediction_oracle_reconciliation].
  • Subject edges: explains component prediction_oracle_reconciliation and mechanism mechanism.prediction_oracle_reconciliation.validates_public_prediction_oracle_reconciliation.
  • Doctrine edges: governed by principles P-2, P-6, P-8, and P-9; abides by axioms AX-5, AX-7, AX-8, and AX-10.
  • Dependency edges: depends on paper_module.finance_forecast_evaluation_spine, paper_module.world_model_projection_drift_control_room, and paper_module.research_replication_rubric_artifact_replay.
  • Runtime code locus: src/microcosm_core/organs/prediction_oracle_reconciliation.py, including run, run_prediction_bundle, validate_source_module_imports, validate_reconciliation_packet, _source_open_body_import_summary, _build_result, write_receipts, result_card, EXPECTED_NEGATIVE_CASES, and AUTHORITY_CEILING.
  • Generated row proof: 15 resolved relationship edges, no unpopulated selective relations, Mermaid available_from_capsule_edges, and Atlas linked_from_capsule_edges.

The governing lattice turns the component into a bounded reconciliation checker rather than a forecast authority. P-2 lowers every positive claim to the checker strength: CP1/CP2 accounting, oracle-diff grading, numeric-row gates, source-module digest checks, negative cases, and metadata-only result records. P-6 fails closed when a branch is unresolved, a target escapes the declared universe, a source digest mismatches, or an authority flag tries to rise above the accepted component ceiling. P-8 makes those refusals typed outcomes instead of prose warnings. P-9 carries source refs, public runtime refs, copied-body material status, and result record refs across the fixture and exported bundle.

The axiom layer supplies the same boundary. AX-5 prevents the fixture from upgrading synthetic reconciliation evidence into trading, advice, live-market, provider, public sharing, launch, or performance-track-record authority. AX-7 permits partiality: degraded feed health, missing realized numeric truth, and asset-class split pressure are surfaced as scoped findings rather than hidden successes. AX-8 keeps copied source bodies while excluding live market data, model-output data bodies, private dossiers, and account secret-equivalent material. AX-10 requires the target-universe, CP1/CP2, oracle-diff, and source-module evidence to be tied to the current fixture or bundle result records before the Markdown projection is treated as current.

The structured source record's 15 edges prove route parity only.

Finance Forecast Evaluation Spine
Replays synthetic forecast tests through copied finance stats, recording p-values with no advice.
4/5 · Runs real tools

Does Runs public synthetic forecast-evaluation fixtures through copied finance statistics modules and records p-value/refusal behavior without live market data or advice claims.

Scope limit synthetic fixture forecast-evaluation statistics only; no investment-related actions, live market data, track record, or performance claim

Run
microcosm finance-forecast-evaluation-spine run --input fixtures/first_wave/finance_forecast_evaluation_spine/input --out receipts/first_wave/finance_forecast_evaluation_spine

Evidence · External tool run · evidence 4/5 · Real runtime result

research-workflows · forecasting · finance

Source Design note · Source atlas

Paper module Finance Forecast Evaluation Spine

finance_forecast_evaluation_spine is a Crown Jewel import component with a real runnable system and a strict public scope limit. It consumes synthetic public fixtures, copied source bodies, and source manifests that verify sha256 digests, line counts, required anchors, secret-exclusion status, and result record body omission.

Purpose

Comparing two forecasting models is harder than it looks. A lower average loss does not establish that one model genuinely predicts better, because losses are autocorrelated, samples are short, and a careless split can let a model peek at the answer. This component exists to carry the statistical machinery that economists use to answer that question carefully, and to do so without ever claiming the machinery has been pointed at a real market.

The single question it answers is narrow: given two paired loss series over a synthetic fixture, can the difference in predictive accuracy be called significant under an admissible test, or must the test refuse? It computes the Diebold-Mariano loss-differential statistic with a Bartlett HAC long-run variance, the Harvey-Leybourne-Newbold small-sample correction, Hansen's test for superior predictive ability with recentering, a model confidence set, and a Politis-Romano stationary bootstrap.
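As a rough illustration of the first two statistics, the sketch below computes a Diebold-Mariano statistic with a Bartlett-weighted long-run variance and applies the Harvey-Leybourne-Newbold scaling factor. It is not the copied finance module: the function names, the lag choice of horizon minus one, the refusal reason strings, and the returned dictionary are all assumptions.

```python
import math


def bartlett_long_run_variance(d, lag):
    """HAC long-run variance of d using Bartlett weights up to `lag`."""
    n = len(d)
    mean = sum(d) / n
    c = [x - mean for x in d]

    def gamma(k):
        # Sample autocovariance at lag k.
        return sum(c[t] * c[t - k] for t in range(k, n)) / n

    lrv = gamma(0)
    for k in range(1, lag + 1):
        lrv += 2.0 * (1.0 - k / (lag + 1)) * gamma(k)
    return lrv


def diebold_mariano(loss_a, loss_b, horizon=1):
    """DM statistic on paired losses, or a typed refusal (hypothetical reasons)."""
    n = len(loss_a)
    if n != len(loss_b):
        raise ValueError("loss series must be paired")
    if n < 2:
        return {"status": "refused", "reason": "sample_too_small"}
    if horizon < 1 or horizon >= n:
        return {"status": "refused", "reason": "horizon_not_admissible"}
    d = [a - b for a, b in zip(loss_a, loss_b)]
    lrv = bartlett_long_run_variance(d, lag=horizon - 1)  # assumed lag rule
    if lrv <= 0.0:
        return {"status": "refused", "reason": "zero_long_run_variance"}
    dm = (sum(d) / n) / math.sqrt(lrv / n)
    # HLN small-sample scaling; the corrected statistic is compared with t(n-1).
    hln_factor = math.sqrt((n + 1 - 2 * horizon + horizon * (horizon - 1) / n) / n)
    return {
        "status": "computed",
        "dm": dm,
        "hln": dm * hln_factor,
        "p_normal": math.erfc(abs(dm) / math.sqrt(2.0)),  # two-sided normal approximation
    }


print(diebold_mariano([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0]))
```

A positive statistic here means the first series carries the larger average loss. The normal p-value needs only the standard library; the t-distribution p-value for the corrected statistic is a separate matter.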

Failure is handled explicitly. The Harvey-Leybourne-Newbold correction returns its computed statistic, but when SciPy is absent it refuses the p-value with a typed reason rather than fabricating one. The same discipline rejects a horizon that reaches the sample length, a sample too small to estimate anything, a time split that lets the evaluation date sit at or after the event window, and any policy flag that smuggles in advice or a track-record claim. A refusal is recorded as a first-class validator outcome, not an error: "we declined to answer" is itself a valid result.
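The SciPy-dependent refusal might look like the following sketch. The function name, the scipy_available override, and the reason strings are hypothetical; only scipy.stats.t.sf is a real SciPy call, used here for the two-sided t-distribution p-value with n - 1 degrees of freedom.

```python
import importlib.util


def hln_p_value(hln_stat, n, scipy_available=None):
    """Return the HLN p-value, or a typed refusal that still carries the statistic."""
    if n < 2:
        return {"status": "refused", "reason": "sample_too_small",
                "statistic": hln_stat, "p_value": None}
    if scipy_available is None:
        # Detect SciPy without importing it.
        scipy_available = importlib.util.find_spec("scipy") is not None
    if not scipy_available:
        # Refuse rather than substitute a normal approximation silently.
        return {"status": "refused", "reason": "scipy_unavailable_for_t_distribution",
                "statistic": hln_stat, "p_value": None}
    from scipy import stats
    p_value = 2.0 * float(stats.t.sf(abs(hln_stat), df=n - 1))
    return {"status": "computed", "reason": None,
            "statistic": hln_stat, "p_value": p_value}


# Force the refusal path to see the typed outcome.
print(hln_p_value(2.0, 40, scipy_available=False))
```

The refusal keeps the computed statistic and sets p_value to None, so a reader can tell "declined to answer" apart from "answered with a number".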

The guards run before the statistics. If a boundary policy or a leakage check fails, the result record is blocked before any statistics subprocess starts, so an inadmissible request never produces a number that could be misread as a result.

What it proves: synthetic fixture forecast-evaluation statistics only; no investment-related actions, live market data, track record, or performance claim.

How to run it:

microcosm finance-forecast-evaluation-spine run --input fixtures/first_wave/finance_forecast_evaluation_spine/input --out receipts/first_wave/finance_forecast_evaluation_spine

Runtime bundle route:

python -m microcosm_core.organs.finance_forecast_evaluation_spine run-finance-forecast-bundle --input examples/finance_forecast_evaluation_spine/exported_finance_eval_bundle --out receipts/runtime_shell/demo_project/organs/finance_forecast_evaluation_spine

Negative cases covered by the fixture manifest: finance_hln_dependency_refusal, finance_leakage_lookahead_split, finance_no_advice_overclaim.

Source provenance is anchored by examples/finance_forecast_evaluation_spine/exported_finance_eval_bundle/source_module_manifest.json and result records carry refs, digests, counts, verdicts, and scope boundaries only.

Shape

Flowchart: synthetic fixture inputs and copied finance modules feed the runner; guards run first; a failed boundary yields a blocked result record, and a passing boundary routes to the statistics subprocess (first-wave fixture) or the standalone statistics contract (exported bundle); every path ends in result records.

Source refs

Runner
finance_forecast_evaluation_spine.run
Diagram source
flowchart TD
  Fixture["Synthetic fixture inputs: family_loss_matrix, paired_loss_series, finance_boundary_policy, projection_protocol"]
  Source["Copied finance modules plus source manifest digests"]
  Runner["finance_forecast_evaluation_spine.run"]
  Guards["Guards run first: policy no-advice flags, lookahead-split leakage check"]
  Blocked["Blocked result record: statistics subprocess never starts"]
  Branch{"Admissible and exported bundle?"}
  Subprocess["Statistics subprocess: DM/HAC, Hansen SPA, MCS, stationary bootstrap, HLN refusal"]
  Standalone["Standalone statistics contract: no live source-root subprocess"]
  Records["Result records: refs, hashes, counts, verdicts, scope boundaries; body_in_receipt false"]
  Fixture --> Runner
  Source --> Runner
  Runner --> Guards
  Guards -->|"boundary fails"| Blocked
  Guards -->|"boundary passes"| Branch
  Branch -->|"first-wave fixture"| Subprocess
  Branch -->|"exported bundle"| Standalone
  Subprocess --> Records
  Standalone --> Records
  Blocked --> Records

Technical Mechanism

The module is a deterministic forecast-evaluation harness around CrownJewelSpec, not a finance product. The spec fixes four required fixture inputs (family_loss_matrix.json, paired_loss_series.json, finance_boundary_policy.json, and projection_protocol.json), names the three required negative cases, binds the source manifest, and restricts the source-open import to required anchors in model_selection_stats.py, spa_statistics.py, loss_differentials.py, and family_loss_matrix.py.

At runtime, run delegates to run_crown_jewel_organ with evaluate and evaluate_negative_case. evaluate loads the synthetic loss matrix, paired loss series, and boundary policy, then calls _evaluate_payloads. That function first enforces the policy and lookahead-split guards; if either boundary fails, it returns a blocked result record before any statistics subprocess can run. Only after those guards pass does it run the copied statistics modules or, for the exported bundle path, use _standalone_exported_statistics_contract so the standalone public bundle does not depend on a live source-root subprocess.
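The guard-first ordering can be sketched as follows. The two error codes appear in this module's negative cases; everything else (the function name, the policy and split field names, the result shape) is hypothetical and stands in for _evaluate_payloads rather than reproducing it.

```python
from datetime import date


def evaluate_with_guards(policy, split, run_statistics):
    """Run boundary guards first; call run_statistics only if every guard passes."""
    errors = []
    # Policy guard: no advice or track-record claims (hypothetical flag names).
    if policy.get("advice_claim") or policy.get("track_record_claim"):
        errors.append("FINANCE_NO_ADVICE_OVERCLAIM")
    # Leakage guard: evaluation must sit strictly before the event window.
    if split["evaluation_date"] >= split["event_window_start"]:
        errors.append("FINANCE_LOOKAHEAD_SPLIT_FORBIDDEN")
    if errors:
        # Blocked before any statistics are computed.
        return {"status": "blocked", "errors": errors, "statistics": None}
    return {"status": "pass", "errors": [], "statistics": run_statistics()}


calls = []


def fake_statistics():
    # Stand-in for the statistics step; records that it was invoked.
    calls.append("ran")
    return {"dm": 0.0}


leaky = {"evaluation_date": date(2024, 3, 1), "event_window_start": date(2024, 3, 1)}
print(evaluate_with_guards({"advice_claim": False}, leaky, fake_statistics))
print(calls)
```

Because the statistics callable is only invoked after both guards pass, a blocked request cannot leave a number behind.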

The statistical witness is therefore deliberately narrow: Reality Check, Hansen-SPA, MCS, Diebold-Mariano/HAC, stationary bootstrap, and the HLN refusal are result record fields over the synthetic fixture. The same mechanism treats finance_hln_dependency_refusal as a typed negative case when SciPy support is absent, treats policy overclaims as FINANCE_NO_ADVICE_OVERCLAIM, treats temporal leakage as FINANCE_LOOKAHEAD_SPLIT_FORBIDDEN, and keeps copied source bodies out of result records with body_in_receipt: false.

Reader Evidence Routing

Read the positive fixture as a small statistical witness, not as a market result. The current result record has status: pass, sample_size: 40, candidate_count: 3, reality_check.status: computed_bootstrap, spa.status: computed_bootstrap, mcs.implemented: true, paired_loss.diebold_mariano.status: computed_hac_normal_approximation, and a five-replicate stationary-bootstrap witness. Those fields show that the component can exercise the copied forecast evaluation code paths on public synthetic data.

Read the negative floor as equal evidence. The observed negative cases are finance_hln_dependency_refusal, finance_leakage_lookahead_split, and finance_no_advice_overclaim, with stable error codes FINANCE_HLN_TYPED_REFUSAL_REQUIRED, FINANCE_LOOKAHEAD_SPLIT_FORBIDDEN, and FINANCE_NO_ADVICE_OVERCLAIM. The HLN case refuses because SciPy is unavailable for the t-distribution; that is the intended scope limit, not a missing p-value to fill in by hand.

Read source-open evidence through the manifest, not through result records. The source bundle carries 13 copied finance modules; result records carry references, hashes, counts, verdicts, and scope boundaries, and keep body_in_receipt: false. The local claim therefore stays at "synthetic fixture forecast-evaluation statistics and typed refusals." It does not become investment-related actions, live-market data, a track record, performance proof, optimizer authorization, or launch-scope decision.

Forecast-Evaluation Discipline

This component is evidence that the Microcosm can carry professional forecast evaluation logic without pretending to carry market authority. The admissible statistics include Diebold-Mariano loss-differential testing, the Harvey-Leybourne-Newbold small-sample correction, Hansen's SPA test, a Politis-Romano stationary bootstrap, Bartlett HAC long-run variance, and purged/embargoed cross-validation in the Lopez de Prado style.
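For the stationary bootstrap in that list, a minimal index-resampling sketch is shown below. It follows the usual Politis-Romano scheme, where each step starts a new block with probability 1 / mean_block and otherwise continues the current block with wrap-around. The function names, parameters, and the five-replicate usage are illustrative assumptions, not the copied module's interface.

```python
import random


def stationary_bootstrap_indices(n, mean_block, rng):
    """Draw n resampled indices with geometric block lengths (mean = mean_block)."""
    if n < 1 or mean_block < 1:
        raise ValueError("n and mean_block must be at least 1")
    p = 1.0 / mean_block
    indices = [rng.randrange(n)]
    while len(indices) < n:
        if rng.random() < p:
            # Start a new block at a uniformly random position.
            indices.append(rng.randrange(n))
        else:
            # Continue the current block, wrapping around the sample end.
            indices.append((indices[-1] + 1) % n)
    return indices


def bootstrap_means(series, mean_block, replicates, seed):
    """Mean of each bootstrap replicate, seeded for reproducibility."""
    rng = random.Random(seed)
    n = len(series)
    means = []
    for _ in range(replicates):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        means.append(sum(series[i] for i in idx) / n)
    return means


loss_diff = [0.1 * i for i in range(40)]
print(bootstrap_means(loss_diff, mean_block=4, replicates=5, seed=7))
```

Seeding the generator keeps the replay deterministic, which matters more for a fixture witness than the number of replicates.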

The important doctrine is refusal discipline. Horizons greater than or equal to sample length, samples too small to estimate a statistic, leakage-prone splits, missing SciPy support, and advice-shaped claims must return typed refusals instead of crashes or meaningless numbers. Hansen-style recentering of poor or irrelevant alternatives is part of the SPA contract because it is the boundary between a useful superior-predictive-ability test and White Reality Check style over-penalization.

Result records should therefore distinguish "computed statistic" from "refused because inadmissible." Both are successful validator outcomes when the fixture asked for that behavior.

Named Proof Consumers

  • Runtime fixture consumer: finance_forecast_evaluation_spine.run over fixtures/first_wave/finance_forecast_evaluation_spine/input must produce status: pass, the three observed semantic negative cases, false advice/live-data/performance authority flags, and metadata-only source-manifest result record material.
  • Exported-bundle consumer: run-finance-forecast-bundle over examples/finance_forecast_evaluation_spine/exported_finance_eval_bundle must validate the 13 copied finance modules by digest and use the standalone statistics contract rather than a live source subprocess.
  • Focused pytest consumer: tests/test_finance_forecast_evaluation_spine.py must keep the positive statistical fixture, no-advice overclaim, live-market overclaim, lookahead split, semantic-negative-case, standalone-bundle, and digest-mismatch tests green.
  • Corpus consumer: scripts/build_doctrine_projection.py --check-paper-module-corpus must keep the 98-module Microcosm paper-module corpus valid without hand-editing the generated JSON instance.
  • Scope limit consumer: any public or dissemination copy must preserve the local ceiling that this is synthetic fixture forecast-evaluation evidence, not investment-related actions, live data, performance proof, optimizer authorization, or launch-scope decision.

Prior Art Grounding

This component is grounded in forecast-evaluation statistics rather than trading systems. The core anchors are the Diebold-Mariano test for comparing predictive accuracy, the Harvey-Leybourne-Newbold small-sample correction for prediction-error tests (DOI reference), Hansen's test for superior predictive ability, and proper-scoring-rule work such as Gneiting and Raftery. The purged/embargoed split discipline also follows the financial ML concern that temporal leakage can make backtests look stronger than they are.

Microcosm borrows the professional evaluation posture: compute admissible statistics when the fixture supports them, return typed refusals when it does not, and keep evaluation separate from advice, live market data, or performance claims.

Validation Result record Path

PYTHONPATH=src ./repo-pytest tests/test_finance_forecast_evaluation_spine.py -q --basetemp=/tmp/microcosm_finance_forecast_evaluation_spine_pytest
./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

Scope boundary

Scope limit

Finance forecast evaluation spine proves only synthetic market-shaped forecast-evaluation fixture behavior, copied source manifest integrity, metadata-only result records, admissible statistic computation, and typed refusals for inadmissible finance claims. A diagram view and atlas navigation entry are generated for this module, but those navigation projections do not expand the proof. This module makes no investment or trading decisions, uses no live market data, proves no track record or performance claim, mutates no optimizer, certifies no trading strategy, and treats SciPy absence as a typed HLN refusal rather than a hidden statistical success.

Source and projection details
Governing Lattice Relation

The generated JSON instance resolves six bundle-derived edges for this module: it explains component finance_forecast_evaluation_spine, explains mechanism mechanism.finance_forecast_evaluation_spine.validates_public_finance_forecast_evaluation_spine, is governed by concept concept.research_and_science_replay_evidence_bundle, is governed by principle P-8, abides by AX-7, and cites the code locus src/microcosm_core/organs/finance_forecast_evaluation_spine.py. Those edges come from core/paper_module_capsules.json::paper_modules[30:paper_module.finance_forecast_evaluation_spine] and the generated structured source record, not from this Markdown prose.

Mechanically, P-8 and AX-7 show up as refusal discipline: an admissible statistic can pass, but advice-shaped policy flags, live-market authority, leakage-prone time splits, source digest mismatch, and fake HLN p-values must block. The concept edge keeps the module in the research/science replay-evidence family, where proof value is a reproducible fixture and source-manifest witness rather than a claim about markets.

Market Dashboard Read-Model Bundle
Runs a copied market-dashboard reader to catch broken links, stale feeds, and trading overclaims.
5/5

Does This bundle imports the market dashboard read-model source as a public runnable system. Running it over synthetic market-dashboard rows shows how structural read-model checks, feed freshness classification, and related-situation grouping catch dangling graph edges, unsafe route refs, auto-apply overclaims, trading-language overclaims, silent omissions, stale or missing readiness, and no-overlap relation cases.

Scope limit This is fixture-bound read-model, freshness, and relation-grouping evidence only; it is not live market-level conclusions, not investment-related actions, not external model access, not launch-scope decision, and not whole-system correctness.

Run
microcosm batch12-market-dashboard-read-model-capsule run-market-dashboard-bundle --input examples/batch12_market_dashboard_read_model_capsule/exported_batch12_market_dashboard_read_model_capsule_bundle --out receipts/runtime_shell/demo_project/organs/batch12_market_dashboard_read_model_capsule

Evidence · Verified source import · evidence 5/5 · Copied source body

research-workflows · forecasting · finance

Source Design note · Source atlas

Paper module Set 12 Market Dashboard Read-Model Bundle

Purpose

The underlying source module compiles a generated market-situation graph into a backend read model: a trust strip, a ranked situation queue, a detail index, a graph slice, facets, drilldowns, and an API contract. The read model is the shape a dashboard consumes. This bundle runs the copied read-model helpers over small synthetic fixtures and asks one question: does the read-model layer hold its own claim boundary, or does it quietly become a market-truth or advice surface?

The interesting part is what the validator refuses rather than what it accepts. A presentation layer is the easy place for an overclaim to leak in: a label like "strong buy", an auto_apply_allowed flag left true, a freshness state that reports green from a stale or missing artifact. The copied validate_market_dashboard_read_model scans for trading and action-claim language, requires oracle_evolve.auto_apply_allowed to be false and review_gated to be true, requires no_advice_mode to be enabled, and requires the silent-omission count to be zero. The bundle drives those checks with fixtures designed to trip each one, then records whether the source actually flagged them.

The other two mechanisms guard the read path itself. A feed-freshness overlay classifies the current run into a small set of honest states so historical green proof cannot stand in for live-feed capability, and a related-situations scorer groups situations by shared entities or matching type without inventing links. Everything is fixture-bound: there is no live market data, no external model access, and no investment-related actions anywhere in scope.

Mechanisms

  • validate_market_dashboard_read_model
  • _runtime_feed_freshness_overlay
  • _related_situations

What the checks do

validate_market_dashboard_read_model is the structural and overclaim gate. It first checks the read model is well formed: the schema version matches, every situation in the queue resolves to a detail entry, every graph-slice edge points at a node that exists, and each drilldown source-ref returns metadata only with no arbitrary file read and no .. traversal in its route. It then enforces the claim boundary. auto_apply_allowed must be false, review_gated must be true, no_advice_mode must be enabled, the silent-omission count must be zero, and any copied source text is scanned for trading or action-claim language (buy, sell, short, price target, stop loss, and similar). The bundle feeds it five negative fixtures, one per failure shape, and confirms the source emits the matching error string for each. A read model that passed these checks but stayed silent on a planted overclaim would be the real failure, so the bundle treats a missing error as a finding.
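The shape of those checks can be sketched as below. This is not the copied validate_market_dashboard_read_model: the dictionary layout (situation_queue, detail_index, graph_slice, drilldown_routes, silent_omission_count, source_texts) and the error strings are hypothetical, while auto_apply_allowed, review_gated, and no_advice_mode are the flags named above.

```python
import re

# Illustrative subset of the trading and action-claim terms listed above.
TRADING_TERMS = re.compile(r"\b(buy|sell|short|price target|stop loss)\b", re.IGNORECASE)


def validate_read_model(model):
    """Return error strings for structural and claim-boundary failures."""
    errors = []
    # Structure: every queued situation resolves to a detail entry.
    for sid in model["situation_queue"]:
        if sid not in model["detail_index"]:
            errors.append(f"queue_entry_without_detail:{sid}")
    # Structure: every graph edge points at existing nodes.
    nodes = set(model["graph_slice"]["nodes"])
    for src, dst in model["graph_slice"]["edges"]:
        if src not in nodes or dst not in nodes:
            errors.append(f"dangling_graph_edge:{src}->{dst}")
    # Structure: no ".." traversal segment in a drilldown route.
    for route in model["drilldown_routes"]:
        if ".." in route.split("/"):
            errors.append(f"unsafe_route_ref:{route}")
    # Claim boundary: flags must hold their safe values explicitly.
    evolve = model["oracle_evolve"]
    if evolve.get("auto_apply_allowed") is not False:
        errors.append("auto_apply_must_be_false")
    if evolve.get("review_gated") is not True:
        errors.append("review_gate_required")
    if model.get("no_advice_mode") is not True:
        errors.append("no_advice_mode_required")
    if model.get("silent_omission_count") != 0:
        errors.append("silent_omissions_present")
    if any(TRADING_TERMS.search(text) for text in model.get("source_texts", [])):
        errors.append("trading_language_detected")
    return errors


clean = {
    "situation_queue": ["s1"],
    "detail_index": {"s1": {}},
    "graph_slice": {"nodes": ["s1", "e1"], "edges": [("s1", "e1")]},
    "drilldown_routes": ["source/s1/meta"],
    "oracle_evolve": {"auto_apply_allowed": False, "review_gated": True},
    "no_advice_mode": True,
    "silent_omission_count": 0,
    "source_texts": ["Situation summary for review."],
}
planted = {**clean,
           "oracle_evolve": {"auto_apply_allowed": True, "review_gated": True},
           "source_texts": ["Analysts call this a strong buy."]}
print(validate_read_model(clean), validate_read_model(planted))
```

Checking flags with `is not False` and `is not True` makes a missing flag fail closed, which matches the idea that silence on a planted overclaim is itself the failure.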

_runtime_feed_freshness_overlay reads a per-run readiness summary and reports one of three honest states. fresh_green_feed requires the run to be ready, all targets met, no blockers, and same-day generation. stale_green_feed is artifact-backed but no longer same-day. blocked_missing_artifact covers the run that is missing its readiness file, falls short on targets, or carries blockers. The point is that a stale or absent run never reports green: historical proof cannot stand in for live-feed capability, and the state carries a plain truth-statement saying so. The bundle writes synthetic readiness files for each case and checks the classifier returns the expected state.
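A minimal sketch of the three-state classification, assuming a readiness summary with hypothetical fields ready, targets_met, blockers, and generated_on (only the three state names come from the source):

```python
from datetime import date


def classify_feed(summary, today):
    """Map a readiness summary (or None if the file is missing) to a feed state."""
    if summary is None:
        return "blocked_missing_artifact"
    healthy = (
        summary.get("ready") is True
        and summary.get("targets_met") is True
        and not summary.get("blockers")
    )
    if not healthy:
        # Not ready, short on targets, or carrying blockers: never green.
        return "blocked_missing_artifact"
    if summary.get("generated_on") == today:
        return "fresh_green_feed"
    # Artifact-backed but no longer same-day.
    return "stale_green_feed"


today = date(2024, 5, 2)
good = {"ready": True, "targets_met": True, "blockers": [], "generated_on": today}
print(classify_feed(good, today))
print(classify_feed({**good, "generated_on": date(2024, 5, 1)}, today))
print(classify_feed(None, today))
```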

_related_situations builds the "see also" cohort for a situation. It collects other situations that either share an entity or match the situation type, ranks them, excludes the focus situation itself, and caps the list at six. The bundle checks one boundary case in particular: a situation with no entity overlap and a different type produces an empty cohort rather than a spurious link.
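The cohort rule can be illustrated with the sketch below. The situation fields (id, type, entities) and the ranking order (more shared entities first, then type matches, then id) are assumptions; the source states only that candidates are ranked, the focus is excluded, and the list is capped at six.

```python
def related_situations(focus, situations, cap=6):
    """Return ids of situations sharing an entity or type with the focus."""
    scored = []
    focus_entities = set(focus["entities"])
    for other in situations:
        if other["id"] == focus["id"]:
            continue  # never list the focus situation itself
        shared = len(focus_entities & set(other["entities"]))
        same_type = other["type"] == focus["type"]
        if shared == 0 and not same_type:
            continue  # no overlap and a different type: no link is invented
        scored.append((shared, same_type, other["id"]))
    # Assumed ranking: most shared entities, then type matches, then id.
    scored.sort(key=lambda s: (-s[0], not s[1], s[2]))
    return [sid for _, _, sid in scored[:cap]]


s1 = {"id": "s1", "type": "earnings", "entities": ["ACME"]}
s2 = {"id": "s2", "type": "macro", "entities": ["ACME"]}
s3 = {"id": "s3", "type": "earnings", "entities": []}
s4 = {"id": "s4", "type": "policy", "entities": ["OTHER"]}
print(related_situations(s1, [s1, s2, s3, s4]))
print(related_situations(s4, [s1, s2, s3, s4]))
```

The second call shows the boundary case: a situation with no entity overlap and a unique type gets an empty cohort.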

Shape

Flowchart: synthetic dashboard, freshness, and related fixtures feed the copied read-model helpers (market_dashboard_read_model.py), which fan out to validate_market_dashboard_read_model, _runtime_feed_freshness_overlay, and _related_situations; every branch ends in a metadata-only result record and card.

Source refs

Validate market dashboard read model
validate_market_dashboard_read_model
Blocked missing artifact
blocked_missing_artifact
Diagram source
flowchart TD
  A["Synthetic dashboard, freshness, related fixtures"] --> B["Copied read-model helpers (market_dashboard_read_model.py)"]
  B --> C["validate_market_dashboard_read_model"]
  C --> C1["Structure: schema, queue-to-detail, graph edges, drilldown route safety"]
  C --> C2["Scope limit: no auto-apply, review-gated, no-advice, no trading language, zero silent omissions"]
  B --> D["_runtime_feed_freshness_overlay"]
  D --> D1["fresh_green_feed"]
  D --> D2["stale_green_feed"]
  D --> D3["blocked_missing_artifact"]
  B --> E["_related_situations"]
  E --> E1["Entity overlap or type match; self-excluded, capped at six; no overlap means empty"]
  C1 --> F["metadata-only result record and card (refs, digests, counts, verdicts)"]
  C2 --> F
  D1 --> F
  D2 --> F
  D3 --> F
  E1 --> F

Reader Evidence Routing

Start with paper_modules/batch12_market_dashboard_read_model_capsule.json for bundle-derived source authority, then read this Markdown as the explanatory projection. Use examples/batch12_market_dashboard_read_model_capsule/exported_batch12_market_dashboard_read_model_capsule_bundle/source_module_manifest.json to inspect copied-source digest status before opening copied source modules. Use tests/test_batch12_market_dashboard_read_model_capsule.py to verify the fixture and bundle expectations.

The useful evidence is dashboard read-model accounting over synthetic public fixtures: validation rows, freshness overlays, related-situation joins, negative cases, metadata-only result records, and scope limit fields.

Prior Art Grounding

The component is grounded in CQRS/read-model and dashboard-observability patterns: derive presentation-ready projections from source data, make freshness visible, and keep the read surface separate from mutation authority. Useful anchors include:

  • Microsoft's CQRS pattern, where read models are optimized for queries and presentation rather than command handling.
  • Grafana dashboards, which query and transform data sources into operational panels.

Microcosm borrows the read-model shape for dashboard validation, runtime feed freshness overlays, and related-situation joins. The result is fixture-bound mechanism evidence; it does not support market-level conclusions, external model access, investment-related actions, or a launch-scope decision.

Validation Result Record Path

Reader-verifiable commands, run from the microcosm-substrate/ public root:

The fixture command writes the dashboard read-model result record and sign-off JSON. The bundle command validates the copied source system, manifest digests, freshness overlay rows, related-situation joins, negative cases, and metadata-only result record posture. The focused test checks fixture validation, bundle validation, digest/anchor coverage, and scope limits.

This result record path is reader-verifiable evidence only. It excludes launch, external model access, private-system equivalence, market-level conclusions, investment-related actions, or whole-system correctness.

Scope boundary

Scope limit

This module may claim public fixture evidence that the copied source system produced market-dashboard read-model rows, runtime feed freshness overlays, related-situation joins, negative-case checks, metadata-only result record posture, and validation result records over synthetic inputs.

This module may not claim launch-scope decision, external model access, private-system equivalence, live market-level conclusions, investment-related actions, deployment posture, source-file changes, publishing-scope decision, or whole-system correctness.

Scope limit

This is fixture-bound market-dashboard read-model mechanism evidence. It excludes launch, external model access, private-system equivalence, market-level conclusions, investment-related actions, deployment posture, source-file changes, publishing-scope decision, or whole-system correctness.

Prediction Market Board Bundle
Replays imported quant market math on test rows, with duplicate retention and seven refusals.
5/5

Does This bundle imports the quant presentation mart source as a public runnable system. Running it over synthetic prediction-market and feed-diagnostic rows shows event identity joining, duplicate-market retention by volume, orphan identity refusal, provider drift flags, missingness rows, unavailable previous-green deltas, and source lifecycle vintage enrichment.

Scope limit This is deterministic fixture evidence for copied quant helpers only. It does not support live prediction-market conclusions, provider truth, forecast correctness, investment-related actions, external model access, or a launch-scope decision.

Run
microcosm batch12-prediction-market-board-capsule run-prediction-market-board-bundle --input examples/batch12_prediction_market_board_capsule/exported_batch12_prediction_market_board_capsule_bundle --out receipts/runtime_shell/demo_project/organs/batch12_prediction_market_board_capsule

Evidence · Verified source import · evidence 5/5 · Copied source body

research-workflows · forecasting · finance

Source Design note · Source atlas

Paper module Set 12 Prediction Market Board Bundle

Purpose

Market and source dashboards have a recurring failure: a row looks like a fact when it is really a guess. A duplicate listing inflates a volume figure, an unmatched market slug grows a fabricated identity, a feed reports zero rows but the board shows it as healthy, and a "change since last time" number appears even when there is no prior baseline to compare against. The single question this component answers is whether the copied presentation-mart logic keeps those distinctions honest when run over public synthetic inputs.

It does that by importing the real quant_presentation_mart helper body and running it against fixtures that are built to expose each trap, then asserting the exact diagnostic the body should produce. The interesting choice is that the board never asserts what a market price means. It computes accounting about the data: which event a market belongs to, whether its identity was actually matched, how providers drifted, where rows went missing, and whether a vintage date is genuinely present. Aggregation is deliberately conservative. A missing value stays missing rather than defaulting to a confident zero, and an unmatched slug is reported as missing_from_feed_artifact instead of being given a synthetic event id.

The result is fixture-bound evidence, not a forecast. The board is a diagnostic surface over public synthetic rows. It does not read live markets, use external model services, or claim that any number is tradeable.

Mechanisms

  • _prediction_market_board
  • _polymarket_identity_by_slug
  • _provider_drift_monitor
  • _missingness_board
  • _delta_since_previous_green
  • _macro_lifecycle_by_slug
  • _macro_regime_board

How it works

The bundle loads three fixtures, runs the copied helpers, and checks eight named invariants. Each check targets a specific way a board can quietly mislead.

The event-join engine (_prediction_market_board with _polymarket_identity_by_slug) groups raw market rows into events using the Polymarket identity snapshot. Identity is matched by market_slug. When two rows share the same slug and outcome, only the higher-volume one is kept, so a duplicate listing cannot double a market count or inflate an aggregate. A slug with no identity match is not dropped and is not given a made-up event id. Its event_identity_status becomes missing_from_feed_artifact and its max_liquidity stays at 0.0. The fixture proves all three: the duplicate fold (top volume 900000 with one surviving market), the orphan with a null event id, and the deduped aggregate.
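The duplicate fold and orphan handling described above can be sketched as follows. This is a minimal illustration, not the copied helper body: the function name `build_board`, the row fields `outcome`, `volume`, and `liquidity`, the `matched` status value, and the sample slugs are all assumptions made for the example. Only `market_slug`, `event_identity_status`, `missing_from_feed_artifact`, and `max_liquidity` come from the description.

```python
def build_board(rows, identity_by_slug):
    """Fold duplicate (slug, outcome) rows, then join identity by market_slug."""
    best = {}
    for row in rows:
        key = (row["market_slug"], row["outcome"])
        # Keep only the higher-volume row for a repeated slug + outcome.
        if key not in best or row["volume"] > best[key]["volume"]:
            best[key] = row

    board = []
    for row in best.values():
        identity = identity_by_slug.get(row["market_slug"])
        if identity is None:
            # Orphan: the row is kept, but no event id is invented for it.
            board.append({**row, "event_id": None,
                          "event_identity_status": "missing_from_feed_artifact",
                          "max_liquidity": 0.0})
        else:
            board.append({**row, "event_id": identity["event_id"],
                          "event_identity_status": "matched",  # hypothetical value
                          "max_liquidity": identity.get("liquidity", 0.0)})
    return board


# Hypothetical fixture rows: one duplicated listing and one unmatched slug.
rows = [
    {"market_slug": "rate-cut", "outcome": "yes", "volume": 400000},
    {"market_slug": "rate-cut", "outcome": "yes", "volume": 900000},
    {"market_slug": "unknown-market", "outcome": "yes", "volume": 1000},
]
identity = {"rate-cut": {"event_id": "evt-1", "liquidity": 5000.0}}
board = build_board(rows, identity)
```

In this sketch the duplicated slug collapses to a single row carrying the larger volume, and the unmatched slug survives with a null event id rather than disappearing.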

The provider-drift monitor (_provider_drift_monitor) reads each feed's diagnostics and raises typed flags rather than a single health score. Generic transport problems (provider_fallback_used, html_response_seen, fetch_failures) are kept distinct from FRED-specific ones (fred_invalid_series, fred_network_warning). The fixture checks that the stock feed surfaces the generic set, the news feed stays clean, and the source feed surfaces the FRED set. Keeping the families apart means a FRED-specific data-source fault is not laundered into a generic warning.
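One way to picture the two flag families is the sketch below. The flag names are taken from the description; the function `drift_flags`, the shape of the diagnostics dictionaries, and the rule that any truthy value raises a flag are assumptions for illustration only.

```python
GENERIC_FLAGS = ("provider_fallback_used", "html_response_seen", "fetch_failures")
FRED_FLAGS = ("fred_invalid_series", "fred_network_warning")


def drift_flags(diagnostics):
    """Return raised flags per family; a truthy diagnostic value raises its flag."""
    return {
        "generic": [flag for flag in GENERIC_FLAGS if diagnostics.get(flag)],
        "fred": [flag for flag in FRED_FLAGS if diagnostics.get(flag)],
    }


# Hypothetical diagnostics for the three feeds named in the text.
feeds = {
    "stock": {"provider_fallback_used": True, "html_response_seen": True, "fetch_failures": 2},
    "news": {},
    "source": {"fred_invalid_series": ["XYZ"], "fred_network_warning": True},
}
report = {name: drift_flags(diag) for name, diag in feeds.items()}
```

Because each family is reported under its own key, a consumer can tell a transport problem from a FRED problem without parsing a combined score.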

The missingness board (_missingness_board) lists only feeds that are not both non-empty and ok. A feed with zero rows is labelled zero_rows; a populated but low-quality feed is labelled quality_degraded; a healthy feed is omitted entirely. The fixture confirms the healthy feed is absent and the two failing lanes carry the correct reason, so an empty feed cannot read as present.
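The omission rule can be sketched as below. The reason labels `zero_rows` and `quality_degraded` come from the description; the function name `missingness_board`, the `row_count` and `quality` fields, and the lane names are hypothetical.

```python
def missingness_board(feeds):
    """List only feeds that are not both non-empty and ok."""
    board = []
    for lane, feed in feeds.items():
        if feed["row_count"] == 0:
            reason = "zero_rows"
        elif feed["quality"] != "ok":
            reason = "quality_degraded"
        else:
            continue  # Healthy feeds are omitted entirely.
        board.append({"lane": lane, "reason": reason})
    return board


# Hypothetical lanes: one healthy, one empty, one populated but degraded.
feeds = {
    "stock": {"row_count": 120, "quality": "ok"},
    "news": {"row_count": 0, "quality": "ok"},
    "source": {"row_count": 45, "quality": "degraded"},
}
missing = missingness_board(feeds)
```

Checking the empty case first means a zero-row feed is always labelled `zero_rows`, even if its quality field happens to read `ok`.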

The prior-green delta (_delta_since_previous_green) only computes a "change since last run" when a previous green run actually exists. With no baseline it returns status: unavailable and an empty row_deltas_by_lane, which the fixture asserts directly. This is the guard against a delta number that has nothing to compare against.
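A minimal sketch of that guard follows. The `status: unavailable` result and the empty `row_deltas_by_lane` come from the description; the function signature, the `available` status value, and the per-lane row counts are assumptions for illustration.

```python
def delta_since_previous_green(current, previous_green):
    """Compute per-lane row deltas only when a previous green run exists."""
    if previous_green is None:
        # No baseline: report that explicitly instead of inventing a delta.
        return {"status": "unavailable", "row_deltas_by_lane": {}}
    deltas = {
        lane: current[lane] - previous_green[lane]
        for lane in current
        if lane in previous_green  # Lanes without a baseline get no delta.
    }
    return {"status": "available", "row_deltas_by_lane": deltas}  # hypothetical status


no_baseline = delta_since_previous_green({"stock": 120}, None)
with_baseline = delta_since_previous_green({"stock": 120, "news": 8}, {"stock": 100})
```

The sketch also skips lanes that have no prior count, in keeping with the rule that a missing value stays missing rather than defaulting to zero.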

The source lifecycle enrichment (_macro_lifecycle_by_slug feeding _macro_regime_board) buckets source series, then binds each bucket's vintage_status and release_calendar_status to whether the lifecycle structured source record genuinely carries that metadata. The fixture proves a series with a present vintage reads available with the expected observation date, while a series whose lifecycle row is absent reads missing_from_feed_artifact. A vintage date is shown only when it is really there.
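The binding between status and actual metadata can be sketched as below. The status values `available` and `missing_from_feed_artifact` and the field name `vintage_status` come from the description; the function `vintage_row`, the `observation_date` field, the slugs, and the sample date are hypothetical, and release_calendar_status is left out for brevity.

```python
def vintage_row(slug, lifecycle_by_slug):
    """Report a vintage as available only when the lifecycle record carries it."""
    record = lifecycle_by_slug.get(slug)
    if record is None or not record.get("observation_date"):
        return {"slug": slug,
                "vintage_status": "missing_from_feed_artifact",
                "observation_date": None}
    return {"slug": slug,
            "vintage_status": "available",
            "observation_date": record["observation_date"]}


# Hypothetical lifecycle records: one series present, one absent.
lifecycle = {"series-a": {"observation_date": "2024-01-31"}}
present = vintage_row("series-a", lifecycle)
absent = vintage_row("series-b", lifecycle)
```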

Shape

[Diagram: synthetic market rows and the Polymarket identity snapshot feed the event join and identity match (_prediction_market_board). A "slug + outcome seen before?" decision routes to "keep higher-volume market" (yes), "missing_from_feed_artifact, no fabricated event id" (no, unmatched), or "append to event aggregate" (no, matched). Quant-mart helper fixtures feed the provider drift monitor (generic vs FRED flags), the missingness board (zero_rows vs quality_degraded), the prior-green delta (unavailable with no baseline), and the source regime board (vintage status bound to structured source record). Every branch ends in a metadata-only result record and card (diagnostic rows, negative cases, scope limit).]

Source refs

no fabricated event id
missing_from_feed_artifact
Diagram source
flowchart TD
    Rows["Synthetic market rows"] --> Join["Event join + identity match _prediction_market_board"]
    Identity["Polymarket identity snapshot"] --> Join
    Helpers["Quant-mart helper fixtures"] --> Drift["Provider drift monitor generic vs FRED flags"]
    Helpers --> Miss["Missingness board zero_rows vs quality_degraded"]
    Helpers --> Delta["Prior-green delta unavailable with no baseline"]
    Helpers --> Source["Source regime board vintage status bound to structured source record"]
    Join --> Dedup{"Slug + outcome seen before?"}
    Dedup -->|yes| Keep["Keep higher-volume market"]
    Dedup -->|no, unmatched| Orphan["missing_from_feed_artifact no fabricated event id"]
    Dedup -->|no, matched| Append["Append to event aggregate"]
    Keep --> ResultRecord["metadata-only result record and card diagnostic rows, negative cases, scope limit"]
    Orphan --> ResultRecord
    Append --> ResultRecord
    Drift --> ResultRecord
    Miss --> ResultRecord
    Delta --> ResultRecord
    Source --> ResultRecord

Reader Evidence Routing

Start with paper_modules/batch12_prediction_market_board_capsule.json for bundle-derived source authority, then read this Markdown as the explanatory projection. Use examples/batch12_prediction_market_board_capsule/exported_batch12_prediction_market_board_capsule_bundle/source_module_manifest.json to inspect copied-source digest status before opening copied source modules. Use tests/test_batch12_prediction_market_board_capsule.py to verify the fixture and bundle expectations.

The useful evidence is diagnostic accounting over synthetic public fixtures: provider identity matching, drift rows, missingness boards, prior-green deltas, lifecycle/vintage rows, source-regime enrichment, negative cases, metadata-only result records, and scope limit fields.

Prior Art Grounding

The component borrows from prediction-market information aggregation and public market-data integration practice: event contracts expose market prices and settlement states, while dashboards must keep provider identity, missingness, and vintage drift visible.

Microcosm borrows the information-aggregation and provider-join shape, then keeps the board explicitly diagnostic: identity matching, provider drift, missingness, prior-green deltas, lifecycle vintage, and source-regime enrichment are tested over public synthetic fixtures. It does not support market-level conclusions, provider truth, investment-related actions, or a launch-scope decision.

Validation Result Record Path

Reader-verifiable commands, run from the microcosm-substrate/ public root:

The fixture command writes the prediction-market board result record and sign-off JSON. The bundle command validates the copied source system, manifest digests, provider identity and drift diagnostics, missingness rows, lifecycle rows, negative cases, and metadata-only result record posture. The focused test checks fixture validation, bundle validation, digest/anchor coverage, and scope limits.

This result record path is reader-verifiable evidence only. It excludes launch, external model access, private-system equivalence, market-level conclusions, provider truth, investment-related actions, or whole-system correctness.

Scope boundary

Scope limit

This is fixture-bound mechanism evidence for prediction-market joining, quant-mart diagnostics, and source-lifecycle vintage enrichment. It excludes launch, external model access, private-system equivalence, market-level conclusions, provider truth, investment-related actions, source-file changes, publishing-scope decision, or whole-system correctness.

Scope limit

It does not establish live market-level conclusions, provider truth, external model access, investment-related actions, source-file changes, launch-scope decision, publishing-scope decision, private-system equivalence, or whole-system correctness.

Source refs

Built from public source refs, with each input path recorded for provenance.

Each component has a stable public source path with commands, source links, and its supported scope.