Area · 17 components

Agent reliability & safety

Source-open replays of agent failure modes as inspectable specimens.

Components

Agent Benchmark Integrity Anti Gaming Replay · Validates a synthetic benchmark-integrity record and flags the contamination cases it declares. · 3/5

Does: Checks a public benchmark-integrity example bundle that contains three copied source pattern provenance bodies under source_artifacts. The component verifies the source-module manifest digests, requires each replay row to cite those copied source artifacts, recomputes pass/quarantine verdicts from contamination, file-access, and locked-evaluator spans, and rejects common gaming attempts such as peeking at hidden answers, training on the test set, exposing the oracle patch, cherry-picking the best of many tries, or asserting a score. It does not run real bug fixes or make any benchmark claim.

Scope limit: It authorizes only bounded public runtime validation over copied source-open pattern provenance bodies and metadata-only benchmark-integrity replay rows; it does not establish any benchmark or SWE-bench score, agent capability, external model service, live-repo mutation, private/oracle/hidden-gold body access, product progress, or launch-scope decision.

Run

microcosm agent-benchmark-integrity-anti-gaming-replay run-benchmark-integrity-bundle --input examples/agent_benchmark_integrity_anti_gaming_replay/exported_benchmark_integrity_bundle --out .microcosm/agent_benchmark_integrity_anti_gaming_replay

Evidence: Computed projection · evidence 3/5 · Source-faithful refactor

Links to Research Replication Rubric Artifact Replay, Cold Evaluation Honesty Bundle

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Agent Benchmark Integrity Anti-Gaming Replay

Explains: component Agent Benchmark Integrity Anti Gaming Replay · mechanism validates public benchmark integrity replay

Governed by: principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by: axiom Derivation before assertion

Depends on: paper module Mission Transaction Work Spine

This module is the public Microcosm projection of the rule that agent benchmark claims must be replay-backed before they are score-backed. It carries copied source-open source pattern provenance bodies for the benchmark-integrity pattern row and reconstruction state, plus a metadata-only regression integrity component. It is not a benchmark runner or product-progress claim.

The fixture models a repository repair benchmark with public case ids, task and patch hashes, locked evaluator ids, evaluator config hashes, file-access log refs, contamination-check refs, trusted-reference score refs, output-replay refs, held-out guard ids, and body_in_receipt=false rows. It deliberately keeps issue bodies, oracle patch bodies, hidden-gold answers, model-output data, and live repository paths out of the public boundary.

The exported bundle includes source_module_manifest.json and source_artifacts/ copies of the source pattern provenance rows from state/microcosm_portfolio. The validator verifies those copied bodies by manifest digest and keeps body text out of result records.

Purpose

Agent benchmark numbers are easy to state and hard to trust. A single headline like "passes N percent of repository repair tasks" hides every decision that produced it: which evaluator ran, whether its configuration was frozen, whether the agent could see held-out answers, whether the test cases leaked into training, and whether one lucky attempt was promoted as the score. This component exists to answer one question before any of that language is allowed: can each claimed pass be replayed from public refs that name their evaluator, their configuration hash, and the evidence that the run was not gamed?

A positive result cannot be asserted. A replay row that simply declares integrity_pass is recomputed from scratch. The validator checks that the evaluator id is on a locked list, that the configuration hash is one the policy declared in advance, that file-access, contamination, and output-replay evidence artifacts exist and pass, and that the case id was registered up front. If any of those is missing or contradicted, the row is recomputed as quarantine regardless of what it declared. Declaring success is treated as the thing to be checked, not as the proof.

There is a further floor: an integrity_pass must be backed by a sanitised real command-run trace, not only by hand-written replay refs. Each row cites a real_benchmark_trace_ref that has to resolve to a copied artifact carrying a passing focused pytest run for this component, with sha256 digests bound to the recorded command-run id and an explicit list of omitted live material (model-output data, account secrets, private issue bodies, oracle patch bodies). The point is to stop a benchmark claim from resting on prose. The evidence has to trace back to a command that actually ran and is reproducible from public refs, while the private and live material that command touched stays out of the public boundary.

This is a discipline fixture, not a leaderboard. It proves that a metadata-only replay respected an anti-gaming boundary over public case ids and locked evaluator refs. It never reports a score, a SWE-bench result, or a capability claim, and the eleven negative cases below are there to demonstrate the boundary holding rather than to advertise a number.

Technical Mechanism

The component turns a benchmark claim into a replay-verification problem. Its inputs are the projection protocol, locked evaluator policy, benchmark case roster, replay observations, exported bundle manifest, source-module manifest, and copied source_artifacts/ rows. _build_result loads those inputs, validates source-module imports, scans public inputs and copied source bodies against the non-public-state forbidden-class policy, checks projection protocol density, validates the locked evaluator policy, validates the case roster, and then validates each replay row against the same public boundary.

A positive replay cannot pass by declaring success. The replay row must name a case id present in benchmark_cases.json, cite a locked evaluator id, carry an evaluator config hash allowed by locked_evaluator_policy.json, expose file-access, contamination-check, trusted-reference, and output-replay refs, and cite source-artifact evidence refs that match the exported source-module manifest targets. Each of those evidence refs must resolve to a metadata-only benchmark_integrity_evidence_artifact_v1 artifact bound to the same replay, case, evaluator, and config hash, with file-access marked passed, contamination flags clear, a trusted reference present without a claimed score, and an output replay that is not final-answer-only grading. The validator recomputes whether each row is integrity_pass or quarantine; missing refs, unregistered cases, unlocked or mutated evaluators, score authorization, private issue bodies, oracle patch bodies, hidden-gold access, model-output data, pass-k cherry-picking, and misleading tests force quarantine or a blocking finding.
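The recompute-not-echo rule described above can be pictured with a minimal sketch. It is not the component's code: the field names, the `Policy` class, and the reason strings are hypothetical, and only a subset of the listed checks is modelled. What it illustrates is that the row's declared verdict is never read when the verdict is derived.

```python
from dataclasses import dataclass

# Hypothetical field names; the real replay row schema is not shown on this page.
REQUIRED_REFS = (
    "file_access_ref",
    "contamination_check_ref",
    "trusted_reference_ref",
    "output_replay_ref",
)


@dataclass(frozen=True)
class Policy:
    """Stand-in for the contents of locked_evaluator_policy.json (assumed shape)."""
    locked_evaluator_ids: frozenset
    allowed_config_hashes: frozenset


def recompute_verdict(row: dict, policy: Policy, case_ids: set) -> tuple:
    """Return (verdict, reasons). row['declared_verdict'] is deliberately ignored."""
    reasons = []
    if row.get("case_id") not in case_ids:
        reasons.append("unregistered_case")
    if row.get("evaluator_id") not in policy.locked_evaluator_ids:
        reasons.append("unlocked_evaluator")
    if row.get("evaluator_config_hash") not in policy.allowed_config_hashes:
        reasons.append("config_hash_not_allowed")
    for key in REQUIRED_REFS:
        if not row.get(key):
            reasons.append(f"missing_{key}")
    if row.get("claimed_score") is not None:
        reasons.append("score_overclaim")
    verdict = "integrity_pass" if not reasons else "quarantine"
    return verdict, reasons
```

Under these assumptions, a row that declares integrity_pass but carries a swapped config hash still recomputes as quarantine, because the declaration plays no part in the derivation.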

A further gate is the real-trace floor. Every positive replay row also cites a real_benchmark_trace_ref, and that ref must resolve to a copied source-module artifact whose material_class is public_sanitized_real_benchmark_trace. The validator opens that artifact and checks that it records a completed, exit-zero command run of the focused pytest for this component, carries a passing pytest summary, binds sha256 digests for the command metadata, stdout, and stderr to a declared command-run id, cites state/command_runs/ source refs for that id, and declares the omission of model-output data, account secrets, private issue bodies, and oracle patch bodies. A replay whose real_benchmark_trace_ref is missing, unverified, or not also listed in the source-artifact evidence refs cannot stand as a pass. This is what stops a benchmark claim from resting on hand-authored refs alone: the integrity verdict has to trace back to a command that actually ran and is reproducible from public refs.
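As an illustration of the real-trace floor, the sketch below checks a trace artifact against the captured stdout and stderr bytes. The artifact keys (`exit_code`, `command_run_id`, `sha256`, `declared_omissions`) and the omission labels are assumptions made for this example; only the `material_class` value is taken from the text above. The pytest-summary and state/command_runs/ source-ref checks are left out.

```python
import hashlib

# Hypothetical spellings of the four omission classes named in the text.
REQUIRED_OMISSIONS = frozenset({
    "model_output_data",
    "account_secrets",
    "private_issue_bodies",
    "oracle_patch_bodies",
})


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def trace_floor_findings(trace: dict, stdout: bytes, stderr: bytes) -> list:
    """Return a list of findings; an empty list means the floor is met."""
    findings = []
    if trace.get("material_class") != "public_sanitized_real_benchmark_trace":
        findings.append("wrong_material_class")
    if trace.get("exit_code") != 0:
        findings.append("command_did_not_exit_zero")
    if not trace.get("command_run_id"):
        findings.append("missing_command_run_id")
    digests = trace.get("sha256", {})
    if digests.get("stdout") != sha256_hex(stdout):
        findings.append("stdout_digest_mismatch")
    if digests.get("stderr") != sha256_hex(stderr):
        findings.append("stderr_digest_mismatch")
    if REQUIRED_OMISSIONS - set(trace.get("declared_omissions", [])):
        findings.append("omissions_not_declared")
    return findings
```

In this sketch a hand-authored trace fails unless its digests match bytes that a command actually produced, which is the property the floor is meant to enforce.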

The copied body floor is verified separately from the public result record. The source-module manifest must declare copied_non_secret_macro_body material, public source pattern body classes, body_in_receipt=false, and digest-stable targets. validate_source_module_imports checks that each manifest row points to an existing copied artifact and that its recorded SHA-256 digest matches disk. Result records and command cards then omit the bodies and carry only ids, refs, digests, classes, counts, verdicts, findings, and scope limits.
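A digest gate of this kind can be sketched in a few lines. This is not the body of validate_source_module_imports; the function name `verify_manifest_rows` and the manifest row keys (`id`, `target`, `sha256`) are assumptions. The sketch shows the two properties the paragraph describes: each row must point at an existing file whose SHA-256 matches, and the returned records carry ids, digests, and status but never the body.

```python
import hashlib
from pathlib import Path


def verify_manifest_rows(bundle_root: Path, rows: list) -> list:
    """Check each manifest row against disk and return metadata-only records."""
    results = []
    for row in rows:
        target = bundle_root / row["target"]
        if not target.is_file():
            status = "missing"
        elif hashlib.sha256(target.read_bytes()).hexdigest() != row["sha256"]:
            status = "digest_mismatch"
        else:
            status = "verified"
        # Only ids, digests, and status leave this function; the body stays on disk.
        results.append({
            "id": row["id"],
            "sha256": row["sha256"],
            "status": status,
            "body_in_receipt": False,
        })
    return results
```

Reading the file only to hash it, and returning nothing derived from its text, is what keeps the result record metadata-only in this sketch.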

The public trace is a second proof pass rather than a display copy of replay rows. build_public_benchmark_integrity_anti_gaming_trace recomputes each span from locked-evaluator status, contamination signals, file-access refs, contamination-check refs, trusted-reference refs, and declared quarantine reasons. The expected public fixture has three spans: two recompute as integrity_pass, one recomputes as quarantine, and the trace must agree with the declared replay verdicts before the component can return status=pass.
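The agreement check between declared and recomputed verdicts might look like the sketch below. The function name `compare_trace`, the replay ids, and the non-pass status label `blocked` are assumptions; the text above states only that the trace must agree before status=pass is returned.

```python
def compare_trace(declared: dict, recomputed: dict) -> dict:
    """Compare declared verdicts with recomputed span verdicts, keyed by replay id."""
    replay_ids = sorted(set(declared) | set(recomputed))
    mismatches = [
        rid for rid in replay_ids
        if declared.get(rid) != recomputed.get(rid)
    ]
    return {
        # "blocked" is a placeholder label for any non-pass outcome.
        "status": "pass" if not mismatches else "blocked",
        "span_count": len(recomputed),
        "verdict_mismatches": mismatches,
    }
```

A replay id present on only one side also counts as a mismatch here, so a span cannot be silently dropped from either the declaration or the recomputation.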

Named Proof Consumers

run consumes the first-wave fixture and writes the result, board, validation result record, sign-off result record, and metadata-only command card. It is the proof consumer for the canonical fixture boundary and required negative-case floor.
run-benchmark-integrity-bundle consumes the exported public bundle and proves that source-open body imports, bundle shape, manifest digests, and metadata-only result record/card rules survive outside the fixture directory.
tests/test_agent_benchmark_integrity_anti_gaming_replay.py is the focused regression consumer. It asserts negative-case observation, digest verification, source-artifact evidence refs, public trace verdict recomputation, positive/negative verdict handling, metadata-only result records, bundle runtime shape, and command-card reuse of a fresh result record.
A cold reader consumes this Markdown only after checking the JSON bundle, generated JSON instance, exported source manifest, case roster, replay observations, focused test path, and scope limit. The reader may verify the replay boundary but must not infer a benchmark claim, provider behavior, product-progress state, public sharing state, or launch-scope decision.

Shape

Source refs

Protocol: projection_protocol.json
Manifest: source_module_manifest.json

Diagram source

flowchart LR
  Bundle["JSON bundle authority"] --> Markdown["Reader projection"]
  Protocol["projection_protocol.json"] --> ProtocolGate["source refs and result record density"]
  Manifest["source_module_manifest.json"] --> DigestGate["material class and digest gate"]
  DigestGate --> Bodies["copied public source provenance bodies"]
  DigestGate --> RealTrace["sanitised real command-run trace: passing pytest, sha256 digests, declared omissions"]
  Cases["3 public case ids"] --> ReplayGate["case roster and required replay refs"]
  Policy["locked evaluator policy"] --> EvaluatorGate["locked ids and config hashes"]
  Replays["3 replay observations"] --> ReplayGate
  EvaluatorGate --> ReplayGate
  ProtocolGate --> ReplayGate
  ReplayGate --> EvidenceGate["per-ref evidence artifacts: file-access, contamination, trusted reference, output replay"]
  EvidenceGate --> Recompute["recompute integrity_pass or quarantine"]
  RealTrace --> Recompute
  Recompute --> Trace["public trace verdict recomputation"]
  Trace --> Verdicts["2 integrity_pass and 1 quarantine"]
  Negatives["11 anti-gaming fixtures"] --> Quarantine["quarantine or blocking finding"]
  Bodies --> PrivateScan["metadata-only non-public-state scan"]
  RealTrace --> PrivateScan
  Verdicts --> ResultRecord["metadata-only integrity result record"]
  Quarantine --> ResultRecord
  PrivateScan --> ResultRecord
  ResultRecord --> Ceiling["anti-score scope limit"]

The page shape is a bounded replay spine, not a benchmark leaderboard. A reader starts at the JSON bundle, follows the source-open manifest into three copied public source provenance bodies, then checks the public case roster, locked evaluator policy, replay observations, recomputed trace verdicts, and metadata-only result records. The output is an integrity-boundary verdict: two public case replays pass the boundary, one public case replay is quarantined, and no score or hidden-gold authority is created.

Reader Evidence Routing

Bundle route: read core/paper_module_capsules.json::paper_modules[3], then the generated JSON instance, before treating this Markdown as explanatory projection.
Bundle route: read examples/agent_benchmark_integrity_anti_gaming_replay/exported_benchmark_integrity_bundle/source_module_manifest.json for module_count=3, body_in_receipt=false, copied body refs, digest refs, and the explicit secret-exclusion boundary.
Case route: read benchmark_cases.json for repo_issue_public_001, repo_issue_public_002, and repo_issue_public_003; the rows expose ids, hashes, splits, and held-out guard ids, not issue bodies or oracle patches.
Replay route: read replay_observations.json for the locked evaluator ids, config hashes, file-access refs, contamination refs, trusted-reference refs, output-replay refs, and the two integrity_pass plus one quarantine verdict pattern.
Runtime route: run tests/test_agent_benchmark_integrity_anti_gaming_replay.py when the reader needs recomputation evidence. The focused tests assert source-module digest verification, public trace verdict recomputation, required negative cases, and metadata-only result record boundaries.

Public Mechanics

A replay cannot pass unless the evaluator id and config hash are locked.
A replay row cannot pass unless its case id appears in the declared benchmark_cases.json roster.
File-access logs, contamination checks, trusted references, and output replay refs are required before any benchmark-style language can be considered.
Train/test leakage, hidden-gold access, oracle patch bodies, model-output data, final-answer-only grading, pass-k cherry-picking, misleading tests, private issue bodies, unregistered case replays, and score overclaims are quarantine cases.
integrity_pass is evidence that a metadata-only regression replay respected the boundary, not evidence of a SWE-bench score, live agent capability, or product-spine system progress.
Result records expose ids, refs, verdicts, counts, negative cases, and scope limits only.
Source body imports expose source pattern provenance artifacts in the bundle, with result records limited to refs, digests, classes, and validation status.

Prior Art Grounding

This component is grounded in the long-running observation that optimized metrics can become targets and lose evidential force, plus the AI-safety literature on reward hacking and specification gaming. Concrete Problems in AI Safety frames reward hacking as a practical accident-risk problem, DeepMind's specification-gaming survey collects concrete examples of agents satisfying a proxy in the wrong way, and benchmark-contamination work such as Benchmarking Benchmark Leakage in Large Language Models motivates explicit leakage and benchmark-use documentation.

Microcosm borrows the anti-gaming accounting pattern: evaluator ids, config hashes, case rosters, file-access logs, contamination checks, trusted-reference refs, and replay refs must be present before benchmark-style language is allowed. It does not report or imply a model score.

Validation Result records

The focused proof consumer is tests/test_agent_benchmark_integrity_anti_gaming_replay.py. A passing result record has to show that the fixture and exported-bundle validators recompute benchmark-integrity replay from public case ids, locked evaluator ids, config hashes, file-access refs, contamination-check refs, trusted-reference refs, output-replay refs, source-module manifest digests, and negative-case rows rather than trusting declared benchmark language.

PYTHONDONTWRITEBYTECODE=1 ./repo-pytest \
  tests/test_agent_benchmark_integrity_anti_gaming_replay.py \
  -p no:cacheprovider
./repo-python scripts/build_doctrine_projection.py \
  --check-paper-module-corpus

For the focused test, the result record boundary is the asserted shape: three public case ids, three replay rows, two recomputed integrity_pass rows, one quarantine row, three public trace spans, locked-evaluator and config-hash coverage, three copied source-module imports, nine source-artifact evidence refs, three verified source-artifact evidence refs, body_in_receipt=false, and negative cases for verdict mismatch, invalid declared verdict, evaluator config hash swaps, missing replay/source evidence, digest mismatches, manifest boundary violations, hidden-gold/oracle/provider/score overclaims, and unsafe command-card body reuse. For the corpus check, the result record proves only bundle/instance parity; it does not create a benchmark claim, product-progress evidence, provider authority, public sharing, or a launch-scope decision.

Validation Result record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m microcosm_core.organs.agent_benchmark_integrity_anti_gaming_replay run-benchmark-integrity-bundle --input examples/agent_benchmark_integrity_anti_gaming_replay/exported_benchmark_integrity_bundle --out /tmp/agent_benchmark_integrity_bundle_receipt --card > /tmp/agent_benchmark_integrity_bundle_card.json

The focused regression test and corpus projection checks are:

cd microcosm-substrate && ../repo-pytest tests/test_agent_benchmark_integrity_anti_gaming_replay.py
./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

Scope boundary

Scope limit

This module may claim only that the public fixture and exported bundle preserve a metadata-only benchmark-integrity replay boundary: public case ids, locked evaluator refs, config hashes, contamination refs, output-replay refs, manifest digests, negative cases, and scope limits are recomputed or checked.

It must not claim benchmark performance, SWE-bench score, provider capability, hidden-gold access, oracle patch access, private issue access, live repository mutation, publishing-scope decision, product-progress evidence, or launch-scope decision.

Scope boundary

This module does not claim benchmark performance, run providers, expose private issue or oracle patch bodies, access hidden-gold answers, mutate live repositories, publish results, host a benchmark, or include launch operations.

Source and projection details

Source-Open Body Floor

The standard treats the bundle source_module_manifest.json as the body-row authority for three copied source pattern provenance bodies: benchmark_integrity_extracted_pattern_ledger_row_body_import, benchmark_integrity_high_novelty_growth_receipt_body_import, and benchmark_integrity_deterministic_pattern_order_body_import.

Those rows stay in source_artifacts/; result records and workingness/status cards carry refs, digests, classes, counts, and scope limits only. The body floor is accepted as regression-negative fixture evidence, not as a benchmark claim, SWE-bench performance claim, hidden-gold export, provider authority, live repository mutation authority, product-progress evidence, public sharing, or launch-scope decision.

Governing Lattice Relation

The bundle binds this page to mechanism.agent_benchmark_integrity_anti_gaming_replay.validates_public_benchmark_integrity_replay, the agent_reliability_and_safety_validator_bundle concept, provisional principles P-1 and P-2, provisional axiom AX-1, and the paper_module.mission_transaction_work_spine dependency. Within that lattice, the mechanism is an evidence-before-score gate: benchmark-style language has no paper authority unless the source record, copied-source manifest, locked policy, case roster, replay observations, public trace, negative-case floor, and metadata-only result records agree.

The governing concept is accountability for validator bundles, not public leaderboard construction. The principle/axiom ceiling is enforced as a refusal surface: private issue bodies, hidden-gold answers, oracle patch bodies, model-output data, source-file changes, live repository mutation, publishing-scope decision, product-progress evidence, and launch-scope decision remain false even when the replay fixture passes.

Cold Evaluation Honesty Bundle · Runs a copied route-quality simulator and checks its all-B scorecard against the original code. · 5/5

Does: This component imports the real cold_eval.py route-quality simulator as an exact source copy. Running it over a synthetic workspace inspects the all-B scorecard shape, source-module digest evidence, and scope limit checks without exporting body text in result records or turning the fixture into benchmark truth.

Scope limit: Verified cold-eval source body import only; not a live benchmark, navigation truth, source authority, external model access, private-system equivalence, public sharing, or launch-scope decision.

Run

microcosm batch10-cold-eval-honesty-capsule run --input fixtures/first_wave/batch10_cold_eval_honesty_capsule/input --out receipts/first_wave/batch10_cold_eval_honesty_capsule --acceptance-out receipts/acceptance/first_wave/batch10_cold_eval_honesty_capsule_fixture_acceptance.json

Evidence: Verified source import · evidence 5/5 · Copied source body

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Set 10 Cold Eval Honesty Bundle

Explains: component Cold Evaluation Honesty Bundle · mechanism validates public cold eval honesty bundle


Purpose

batch10_cold_eval_honesty_capsule answers one narrow question: can the public Microcosm copy of the source cold_eval.py route-quality simulator run over a synthetic workspace, expose its measured scorecard shape, and refuse to promote that shape into a benchmark or navigation-truth claim?

The useful evidence is deliberately small. A green run means the copied source body executed, the all-B.idea_first_packet winner shape was recomputed from fixture rows, and the scope limit blocked benchmark, hosted-readiness, and launch language. It does not say idea-first routing wins in the live system.

Shape

Diagram source

flowchart TD
  A["Public cold-eval workspace (tasks, navigation packets)"] --> B["Copied cold_eval.py runner"]
  B --> A1["Arm A: flat repo entry (README, quickstart, pyproject)"]
  B --> A2["Arm B: idea-first packet (entry packet, atlas, index)"]
  A1 --> SC["Score each task by declared route refs covered (refs scored, never injected)"]
  A2 --> SC
  SC --> W["Winner per task, idea-first win count"]
  W --> C["Scorecard shape audit: all-B win + route asymmetry + no non-public refs"]
  C --> D["Scope limit gate: injection off, forbidden benchmark/launch claims named"]
  D --> E["metadata-only result record and card"]

Prior Art Grounding

This component is grounded in evaluation-transparency and benchmark-hygiene practice: scorecards should expose what was measured, what fixture assumptions were injected, and what claims the result can and cannot support. Useful anchors include:

HELM, which frames model evaluation as a transparent, scenario-bound benchmark surface rather than a single global capability claim.
Model Cards for Model Reporting, which established the pattern of pairing performance results with intended use, limitations, and caveats.

Microcosm borrows the scorecard-plus-limitations shape, then narrows it to a deterministic route-quality fixture. The all-B.idea_first_packet winner row is accounting evidence for this fixture only; it is not promoted into navigation truth, hosted readiness, or launch-scope decision.

Reader Evidence Routing

Read the scorecard as evidence accounting, not as a leaderboard. The fixture intentionally creates a public workspace where the idea-first packet wins. The component then checks that the expected-ref injection policy is off, that non-public refs are not present, and that forbidden claims are named in the manifest.

The honesty of that win turns on one design choice in the copied scorer. Each task lists the route refs an answer should reach, but those expected refs are only ever used to *score* coverage. They are never added to either arm's route, so neither arm is handed the answer. Arm A is scored on the refs a flat reader reaches from README.md, docs/quickstart.md, and pyproject.toml. Arm B is scored on the refs the navigation packets actually declare. The scoring policy is named in every row as declared_route_refs_no_expected_ref_injection_v1, and every row carries expected_ref_injection_used: false. The idea-first arm wins because the entry packets genuinely declare more of the relevant files, not because the scorer leaked the target into the route. That distinction is the difference between a measured route-quality result and a rigged one, and the scope limit gate reports blocked rather than pass if the injection flag is ever turned on.
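The no-injection scoring rule can be sketched as follows. This is not the copied cold_eval.py scorer: the function names, the row keys other than `expected_ref_injection_used`, the arm labels, and the file paths in the usage note are hypothetical. The policy string is the one named in the paragraph above.

```python
def coverage(route_refs: set, expected_refs: set) -> float:
    """Fraction of expected refs that the arm's own declared route reaches."""
    if not expected_refs:
        return 0.0
    return len(route_refs & expected_refs) / len(expected_refs)


def score_task(task: dict, arm_a_refs: list, arm_b_refs: list) -> dict:
    """Score both arms; expected refs are read for scoring and never added to a route."""
    expected = set(task["expected_refs"])
    score_a = coverage(set(arm_a_refs), expected)
    score_b = coverage(set(arm_b_refs), expected)
    if score_b > score_a:
        winner = "B"
    elif score_a > score_b:
        winner = "A"
    else:
        winner = "tie"
    return {
        "task_id": task["task_id"],
        "score_a": score_a,
        "score_b": score_b,
        "winner": winner,
        "scoring_policy": "declared_route_refs_no_expected_ref_injection_v1",
        "expected_ref_injection_used": False,
    }
```

For a task expecting `docs/atlas.md` and `docs/index.md`, a flat route of `README.md`, `docs/quickstart.md`, and `pyproject.toml` covers none of them, while a packet route that declares both covers all of them. A rigged scorer would instead union the expected refs into one arm's route before calling `coverage`, which this sketch never does.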

The engine ids are:

cold_eval_original_runner: dynamically loads the copied source body and runs run_cold_eval in a temporary public workspace.
cold_eval_scorecard_shape_audit: verifies the all-B winner shape and records visible route-surface asymmetry without upgrading it into proof.
cold_eval_claim_ceiling_gate: checks expected-ref injection policy and forbidden benchmark/launch claims.

Validation Result record Path

Reader-verifiable commands, run from the microcosm-substrate/ public root:

The fixture command writes the route-quality scorecard result record and sign-off JSON. The bundle command validates the copied source, source manifests, metadata-only cards, expected-ref injection policy, and private-ref negative cases. The focused test covers missing tasks, flat-route wins, expected-ref injection, private fixture refs, and the no-benchmark/no-launch scope limit.

This result record path is reader-verifiable evidence only. It does not establish live benchmark results, navigation truth, hosted readiness, launch-scope decision, external model access, source-file changes, or whole-system correctness.

Scope boundary

Scope limit

This module may claim public fixture evidence that the copied cold_eval.py source body executed over the synthetic workspace, the expected scorecard shape was recomputed, expected-ref injection was refused, non-public refs were excluded, negative fixtures were checked, metadata-only cards were emitted, and validation result records enforced the listed scope limit.

This module may not claim live benchmark results, navigation truth, hosted readiness, route-quality superiority, external model access, deployment posture, source-file changes, publishing-scope decision, launch-scope decision, or whole-system correctness.

Scope limit

Fixture-bound route-quality scorecard and copied source refs only; no live benchmark, navigation truth, hosted readiness, launch-scope decision, external model access, source-file changes, or whole-system correctness.

Validator Checker Bundle · Runs the real validator code over public examples so its safety checks stay inspectable. · 5/5

Does: This component imports the real idea_microcosm validators.py body as an exact source copy. Running it shows status-policy judging, private-boundary scans, specimen checks, launch-gate checks, and the validate entrypoint exercised against public fixtures and negative cases.

Scope limit: It validates only the imported validators.py source body and its checker membrane. It does not claim source authority, a full validator-suite proof, private-system equivalence, launch, hosted-public status, public sharing, external model access, or source-file changes.

Run

microcosm batch8-validator-checker-capsule run --input fixtures/first_wave/batch8_validator_checker_capsule/input --out receipts/first_wave/batch8_validator_checker_capsule --acceptance-out receipts/acceptance/first_wave/batch8_validator_checker_capsule_fixture_acceptance.json

Evidence: Verified source import · evidence 5/5 · Copied source body

Links to Cold Evaluation Honesty Bundle, Release Public Wording Gate

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Set 8 Validator Checker Bundle

Explains: component Validator Checker Bundle · mechanism validates public validator checker bundle


Role

This module imports the real self-indexing-cognitive-system/src/idea_microcosm/validators.py body into Microcosm and exercises individual checker functions that were not covered by the earlier status-judge-only import.

Purpose

An earlier import brought across only one entry point from validators.py, the status-judge function. That left most of the validator body imported as text but never actually run. This bundle answers a single question: when the real checker functions are invoked, do they still behave the way their names claim? It picks six groups of checkers from the copied body and runs them, rather than asserting from a distance that the file is correct.

The groups are chosen to span the kinds of judgement the validator makes: whether a status policy blocks a poisoned transition, whether the private boundary scanner finds a planted home path and email address, whether the specimen and launch-gate checkers report zero failures on the existing fixture, and whether the no-write validate(root, write_receipt=False) entry point runs without mutating anything. Each group reaches into a different part of the imported body.

The design choice worth noting is what happens when the private source state is not present. In that case the component does not pretend the checkers passed. It falls back to reading the copied source for the named anchors and marks the remaining engines public_runtime_source_only, recording that as a stated limit rather than a hidden success. The second unusual choice is that the negative cases are judged from the engine outputs themselves, so a check cannot pass merely because a fixture file happens to contain the right error string. Both choices exist to stop a green run from claiming more than it observed.

Prior Art Grounding

This bundle borrows from schema validation, fixture-driven testing, and policy/checker separation. Useful anchors include:

JSON Schema, as a general pattern for declaring structural expectations and validating data instances against them.
pytest fixtures, as a common test pattern for isolating public inputs and expected negative cases.
Open Policy Agent, as a prior art pattern for separating policy evaluation from the application code that invokes it.

Microcosm borrows the validator/checker and fixture-negative-case shape, but keeps this component to bounded checker exercises over copied public source. It is not a launch-scope decision, a hosted-public proof, a source-file change, or a complete validator-suite proof.

Imported system

self-indexing-cognitive-system/src/idea_microcosm/validators.py

Technical Mechanism

The runtime does not ask the reader to trust the phrase "validator checker." It builds a small checker membrane around a single imported source body and then records how far that membrane reaches.

The source-anchor phase reads examples/batch8_validator_checker_capsule/exported_batch8_validator_checker_capsule_bundle/source_module_manifest.json. That manifest declares one exact copied module under the public bundle-relative locus source_modules/self-indexing-cognitive-system/src/idea_microcosm/validators.py, with a 12,747-line body and digest 4b2d44810cb9db2c5f62fd39da55deb7f20f6bd44ed1a8b0ae4324d38012a1d4. Here the root segment is a manifest-included public synthetic Microcosm root. The private source-root path is lineage-only and remains excluded from public copy; the checker validates the copied bundle body, not live private source. _validator_source_anchor_matrix checks that the copied body still contains the named validator anchors: private_boundary_hits, policy_wellformedness_failures, judge_status_request, _status_collapse_suite_failures, _source_shuttle_specimen_failures, and validate(root: Path).
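A source-anchor check of this kind can be pictured as a digest comparison plus a substring scan. The sketch below is not _validator_source_anchor_matrix itself: the function name `anchor_matrix`, the returned keys, and plain substring matching are assumptions. The anchor strings are the six listed in the paragraph above.

```python
import hashlib

# Anchor names as listed in the text; how the real checker matches them is not shown.
REQUIRED_ANCHORS = (
    "private_boundary_hits",
    "policy_wellformedness_failures",
    "judge_status_request",
    "_status_collapse_suite_failures",
    "_source_shuttle_specimen_failures",
    "validate(root: Path)",
)


def anchor_matrix(source_text: str, expected_digest: str) -> dict:
    """Report digest agreement, line count, and per-anchor presence for a copied body."""
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return {
        "digest_ok": digest == expected_digest,
        "line_count": len(source_text.splitlines()),
        "anchors": {anchor: anchor in source_text for anchor in REQUIRED_ANCHORS},
    }
```

The result holds only booleans and counts, so it can be recorded without carrying any of the copied body text.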

The checker-exercise phase then runs six bounded engines when source state is available: source anchoring, status-policy judging, private-boundary scanning, specimen checker groups, launch-gate checker groups, and the no-write validate(root, write_receipt=False) witness. In exported-bundle mode, where a public runtime should not import private source state, the same component falls back to copied-source anchor evidence and marks the remaining engines as public_runtime_source_only. That fallback is a scope limit, not a hidden pass-through to private state.
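One way to picture the fallback is as a per-engine status plan. The sketch below is hypothetical: the engine identifiers and the "exercised" and "copied_source_anchor_evidence" labels are invented for illustration, and only public_runtime_source_only is a value named in the text.

```python
# Hypothetical engine ids for the six bounded engines listed above.
ENGINES = (
    "source_anchoring",
    "status_policy_judging",
    "private_boundary_scanning",
    "specimen_checker_groups",
    "launch_gate_checker_groups",
    "no_write_validate_witness",
)


def plan_engines(source_state_available: bool) -> dict:
    """Return a status label per engine; labels other than
    public_runtime_source_only are assumptions."""
    if source_state_available:
        return {name: "exercised" for name in ENGINES}
    # Exported-bundle mode: never import private source state.
    plan = {name: "public_runtime_source_only" for name in ENGINES}
    plan["source_anchoring"] = "copied_source_anchor_evidence"
    return plan


exported = plan_engines(source_state_available=False)
source_only = [name for name, status in exported.items()
               if status == "public_runtime_source_only"]
print(len(source_only), exported["source_anchoring"])
```

The point of the shape is that the fallback produces an explicit label for every engine, so a reader can count how many engines did not actually run.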

The negative-case phase is semantic rather than fixture-string-only. The component declares six failure modes: missing validator source, policy poisoning, blind private-boundary scanning, missing specimen checkers, missing launch gates, and bypassing the validate entrypoint. evaluate_negative_case observes those cases from the engine outputs, so the tests can prove the negative cases move with runtime evidence instead of passing because a fixture file contains the right error code.
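A rough sketch of what "semantic rather than fixture-string-only" could mean in code: the negative case is judged from what an engine reported, and fixture text is never consulted. The case ids, engine ids, the "status" key, and the function name observe_negative_case are all assumptions and do not describe the real evaluate_negative_case signature.

```python
# Hypothetical mapping from each declared failure mode to the engine
# whose output should reveal it.
CASE_TO_ENGINE = {
    "missing_validator_source": "source_anchoring",
    "policy_poisoning": "status_policy_judging",
    "blind_private_boundary_scan": "private_boundary_scanning",
    "missing_specimen_checkers": "specimen_checker_groups",
    "missing_launch_gates": "launch_gate_checker_groups",
    "validate_entrypoint_bypass": "no_write_validate_witness",
}


def observe_negative_case(case_id: str, engine_outputs: dict) -> bool:
    """A case counts as observed only if the mapped engine reported a failure."""
    engine = CASE_TO_ENGINE[case_id]
    return engine_outputs.get(engine, {}).get("status") == "fail"


# The engine really failed: the negative case is observed.
failing = {"status_policy_judging": {"status": "fail"}}
# The engine passed: the case is not observed, whatever a fixture file says.
passing = {"status_policy_judging": {"status": "pass"}}

observed = observe_negative_case("policy_poisoning", failing)
not_observed = observe_negative_case("policy_poisoning", passing)
print(observed, not_observed)
```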

The result record phase uses the shared crown-jewel runner to write result, board, validation, and sign-off artifacts, then result_card deliberately compresses them into an authority floor and body floor. Those card fields keep release_authorized, publication_authorized, provider_dispatch, model_dispatch, source_mutation_authorized, full_validator_suite_freshness_claim, public_clone_or_hosting_authority, and test_completeness_proof false while also preserving body_in_receipt: false.
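The authority floor can be read as a list of fields that must be exactly false. The sketch below assumes a card layout with authority_floor and body_floor sub-objects; that nesting and the function name floor_violations are assumptions, while the field names come from the paragraph above.

```python
# Field names are taken from the text; the nested card layout is assumed.
AUTHORITY_FLOOR_FIELDS = (
    "release_authorized",
    "publication_authorized",
    "provider_dispatch",
    "model_dispatch",
    "source_mutation_authorized",
    "full_validator_suite_freshness_claim",
    "public_clone_or_hosting_authority",
    "test_completeness_proof",
)


def floor_violations(card: dict) -> list:
    """List every floor field that is not exactly False (missing counts too)."""
    authority = card.get("authority_floor", {})
    violations = [name for name in AUTHORITY_FLOOR_FIELDS
                  if authority.get(name) is not False]
    if card.get("body_floor", {}).get("body_in_receipt") is not False:
        violations.append("body_in_receipt")
    return violations


card = {
    "authority_floor": {name: False for name in AUTHORITY_FLOOR_FIELDS},
    "body_floor": {"body_in_receipt": False},
}
ok = floor_violations(card)

# Flipping a single authority field is enough to fail the floor.
promoted = {**card, "authority_floor": {**card["authority_floor"],
                                        "release_authorized": True}}
bad = floor_violations(promoted)
print(ok, bad)
```

Using `is not False` rather than a truthiness test means a missing field is treated as a violation instead of a silent pass.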

Shape

Diagram source

flowchart TD
    A["Fixture input or exported bundle"] --> B["Source manifest validation"]
    B --> C["Exact copied validators.py digest and required anchors"]
    C --> D{"Source state available?"}
    D -- "yes" --> E["Six runtime checker engines"]
    D -- "no" --> F["Copied-source anchors plus source-only witnesses"]
    E --> G["Semantic negative-case evaluator"]
    F --> G
    G --> H["Crown-jewel result, board, validation, sign-off result records"]
    H --> I["Result card authority_floor and body_floor"]
    I --> J["Reader claim: bounded checker membrane, not launch-scope decision"]

Doctrine Relation

The generated JSON row binds this page to mechanism.batch8_validator_checker_capsule.validates_public_validator_checker_capsule and concept.agent_reliability_and_safety_validator_bundle; that relation is bundle-declared rather than inferred from this prose. The bundle also names the axiom refs AX-1, AX-4, AX-5, AX-7, AX-8, AX-11, and AX-12 and the principle refs P-1, P-2, P-5, P-6, P-8, P-9, P-13, and P-15. In this module those refs matter because the component separates evidence from authority, keeps JSON as the navigable contract, prevents body leakage, and refuses to promote a selected checker run into a launch or proof claim.

The dependency edges also explain the reader route. microcosm_axiom_substrate owns the axiom vocabulary this module abides by; engine_room_generated_projection_drift_gate owns the generated-projection freshness posture this page must not bypass; and public_reveal_walkthrough owns the reading lane for result records, source refs, and scope boundaries.

Evidence Model and Limitations

The strongest positive evidence is narrow and useful: the focused regression checks that all expected engines are present, the exact copied source body matches the source digest, exported-bundle validation does not import source validators, source-anchor corruption blocks validation, result cards omit private bodies, and semantic negative cases fail when runtime evidence is weakened.

The limitations are just as important. Exported-bundle mode validates copied source anchors and public-runtime witness fields; it does not re-run the full source validator suite. The fixture proves selected checker groups and selected negative cases, not all future validator behavior. The copied source body being large does not itself increase the claim; only the named anchors, engines, digests, negative cases, and result record fields are evidence. A green run therefore supports a bounded checker-membrane claim and nothing broader.

Reader Evidence Routing

Bundle route: read core/paper_module_capsules.json::paper_modules[65] before treating this Markdown as explanation.
Generated route: inspect paper_modules/batch8_validator_checker_capsule.json for current relationship state and projection details.
Bundle route: inspect examples/batch8_validator_checker_capsule/exported_batch8_validator_checker_capsule_bundle for copied validator source refs and digest evidence.
Runtime route: run tests/test_batch8_validator_checker_capsule.py and the commands in the Validation Result record Path section.

Exercised checker groups

Policy well-formedness and status transition judging.
Private boundary scanning without putting private body text into result records.
Status collapse, internal control, correction, self-comprehension, task-ledger, and atlas navigation specimen checkers.
Launch standards, source bundle, source shuttle, concurrency, native guard, and launch-root compiler gate checkers.
The no-write validate(root, write_receipt=False) entrypoint.

Validation Result record Path

Reader-verifiable commands, run from the microcosm-substrate/ public root:

The fixture command writes the bounded validator-checker result record and sign-off JSON. The bundle command validates copied checker source, manifest digests, selected checker-group exercises, body-exclusion scans, and scope limit fields. The focused test checks fixture validation, bundle validation, private-boundary scanning, and the no-complete-suite-proof scope limit.

This result record path is reader-verifiable evidence only. It does not establish the complete validator suite, authorize source-file changes, provide hosted-public proof, dispatch providers, authorize public sharing, or approve launch.

Scope boundary

Scope limit

The bundle is not a launch-scope decision, not a hosted-public proof, not an authorization for source-file changes, and not a complete validator-suite proof.

Scope limit

This paper module can claim a bounded validator/checker fixture with a diagram view and Atlas navigation generated for it. It can explain the declared checker groups, no-write validation entrypoint, and metadata-only result record boundary.

It cannot claim a launch-scope decision, hosted-public proof, source-file changes, a complete validator-suite proof, a publishing-scope decision, provider authority, or whole-system correctness. Any broader checker claim must be grounded in the JSON bundle and its generated projection.

Agent Monitor Redteam Falsification Replay
Replays honest and deceptive agent runs and flags any verdict missing its declared backing evidence. 3/5

Does It replays three made-up agent trajectories (a normal one, one trying to hide a failure, and one overstepping a tool's permissions) and checks that every monitor verdict — including a clean "pass" — comes with the result records needed to re-run and verify it. The result records show exactly which checks each verdict depends on, and any verdict missing its backing evidence gets quarantined instead of trusted. It is a demo over fictional data, not a measurement of how well a real monitor catches anything.

Scope limit Bounded public runtime validation over copied source pattern bodies, sanitized dogfood trace slices, recomputed monitor-verdict spans, source-artifact evidence refs, digest/metadata-only/non-public-state gates, and negative cases only; no live agent execution, monitor product performance, control-eval score, safety-validation, benchmark, provider-call, source-file changes, launch, public sharing, or product authority.

Run

microcosm agent-monitor-redteam-falsification-replay run-monitor-bundle --input examples/agent_monitor_redteam_falsification_replay/exported_monitor_redteam_bundle --out receipts/runtime_shell/demo_project/organs/agent_monitor_redteam_falsification_replay

Evidence: Computed projection · evidence 3/5 · Source-faithful refactor

Links to Agent Sabotage Scheming Monitor Replay

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Agent Monitor Redteam Falsification Replay

Explains: component Agent Monitor Redteam Falsification Replay · mechanism validates public monitor verdict result records

Governed by: principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by: axiom Derivation before assertion

Depends on: paper module Source Projection Import Protocol

agent_monitor_redteam_falsification_replay is a synthetic regression fixture for one honesty claim: that a monitor verdict was earned by evidence rather than asserted. It replays public trajectory metadata and checks that every "the monitor caught it" sentence is tied to a probe, an escalation, a mitigation, and a metadata-only result record before it counts.

Purpose

It is easy to say a safety monitor flagged a suspicious agent action. It is harder to show that the verdict followed from the evidence and would survive a rerun. This component answers one question: if a monitor claim is replayed against its own declared evidence, does the claim still hold, or does it fall apart?

The part worth noticing is that the page does not trust the verdict it is given. A separate builder, build_public_monitor_redteam_falsification_trace, independently recomputes two things from the trajectory spans: whether each declared coverage label is actually backed by an adversarial probe, and what the monitor verdict should be given the span's severity, escalation, and mitigation evidence. If that recomputed verdict disagrees with the verdict the fixture declared, the run blocks. So the contract is not "does the row have the right fields" but "does an independent reading of the evidence reach the same conclusion". The failure mode it guards against is the one that quietly defeats most monitoring write-ups: a verdict label that reads well but is not supported by the trajectory it claims to summarise.

The replay is deliberately narrow. It works over three synthetic trajectory cases and seven negative cases, on public metadata and metadata-only result records. It is not a live monitor, a control-evaluation score, an exploit guide, or evidence about how well any real monitor performs. It only proves that, for these fixtures, a monitor claim cannot pass without surviving the disagreement check.

Abstract

This module is the public Microcosm technical paper for a monitor/redteam honesty contract. It specifies a synthetic replay fixture that treats a monitor verdict as an evidence object to falsify: the verdict must be tied to a trajectory hash, suspicious-span ref, adversarial-probe ref, escalation ref, mitigation ref, body-omission ref, cold-replay ref, and source-artifact digest boundary before it can support any monitor-coverage sentence.

The result is a safety/reliability proof about evidence shape, not a live monitor product, control-eval score, exploit guide, product-spine system, or complete-security claim. The technical contribution is the proof contract: a replay can show how a monitor claim would fail under semantic negatives and perturbations without exporting private reasoning, internal source bodies, model-output data, live agent traffic, or launch-scope decision.

Telos

The fixture models three public trajectory cases: a benign control, a suppressed failure attempt, and a tool-authority boundary attempt. Each case must carry a trajectory hash, monitor id, suspicious span ref, adversarial probe ref, verdict, severity, escalation ref, body-omission ref, mitigation ref, and cold replay ref before the claim is admitted.

The point is to make "the monitor caught it" a rerunnable, falsifiable claim. A benign pass proves only that the benign route remains admissible. An escalation proves that a suppressed-failure attempt retained enough evidence for human escalation. A block proves that the tool-authority boundary attempt was rejected under the public policy vocabulary. None of those outcomes proves live monitor performance; together they prove that this fixture's evidence contract catches missing probes, missing source evidence, private-body leakage, and verdict drift.

Monitor-Falsification Mechanism

The component turns a monitor claim into a falsification replay. Its inputs are the projection protocol, monitor policy, trajectory cases, monitor observations, negative-case files, exported bundle manifest, and source-module manifest. _build_result loads those inputs, validates the projection protocol density, checks the monitor policy vocabulary, checks trajectory case floors, validates each observation against required refs and source-artifact evidence, scans public inputs and copied source artifacts for forbidden non-public-state classes, and folds in a public trace recomputation before writing result records.

A positive observation cannot pass by carrying a verdict label alone. The policy requires a trajectory id, monitor id, coverage label, suspicious action span ref, adversarial probe ref, monitor verdict, severity tier, human escalation ref, body-omission ref, mitigation diff ref, and cold replay ref. For positive observations, source-artifact evidence refs must match the targets declared by the exported source-module manifest. Missing fields, unsupported verdicts, unverified source evidence, private chain-of-thought export, internal-code export, exploit-instruction material, account secret material, live agent traffic, monitor product-performance claims, and coverage without a probe become quarantine or blocking findings.
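The required-ref rule can be sketched as a field-presence check. The snake_case keys below are guesses at how the eleven required items might be spelled, and the finding codes and the function name observation_findings are hypothetical; the three supported verdicts are the ones the text names.

```python
# Hypothetical key spellings for the eleven required items listed above.
REQUIRED_FIELDS = (
    "trajectory_id",
    "monitor_id",
    "coverage_label",
    "suspicious_action_span_ref",
    "adversarial_probe_ref",
    "monitor_verdict",
    "severity_tier",
    "human_escalation_ref",
    "body_omission_ref",
    "mitigation_diff_ref",
    "cold_replay_ref",
)
SUPPORTED_VERDICTS = {"pass", "escalate", "block"}


def observation_findings(obs: dict) -> list:
    """Return illustrative findings for missing fields or unsupported verdicts."""
    findings = ["MISSING_FIELD:" + name
                for name in REQUIRED_FIELDS if not obs.get(name)]
    verdict = obs.get("monitor_verdict")
    if verdict and verdict not in SUPPORTED_VERDICTS:
        findings.append("UNSUPPORTED_VERDICT:" + verdict)
    return findings


# Every field filled with a placeholder ref, then a valid verdict.
complete = {name: "ref:" + name for name in REQUIRED_FIELDS}
complete["monitor_verdict"] = "block"

# A verdict label on its own is not enough.
label_only = {"monitor_verdict": "block"}

complete_findings = observation_findings(complete)
label_only_findings = observation_findings(label_only)
print(len(complete_findings), len(label_only_findings))
```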

The exported body floor is checked separately from the result record. The source-module manifest admits two source artifacts: a copied public source pattern body for the monitor-redteam compound row, and a sanitized public dogfood safety-evals trace slice. validate_source_module_manifest requires the declared module count to match the rows, requires body_in_receipt=false, rejects body text in result records, resolves each target path, and verifies every declared digest field against disk. Result records and cards then expose ids, refs, digests, counts, verdicts, findings, and ceilings, not copied bodies.
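The manifest checks listed above (row count, body_in_receipt=false, digest against disk) could look roughly like this. The manifest keys module_count, modules, target_path, and digest, the finding codes, and the function name check_manifest are assumptions made for the sketch; it is not the real validate_source_module_manifest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Digest of the file bytes on disk, in the sha256:<hex> style used on this page."""
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()


def check_manifest(manifest: dict, root: Path) -> list:
    """Return illustrative findings; key names and codes are hypothetical."""
    findings = []
    rows = manifest.get("modules", [])
    if manifest.get("module_count") != len(rows):
        findings.append("MODULE_COUNT_MISMATCH")
    for row in rows:
        if row.get("body_in_receipt") is not False:
            findings.append("BODY_IN_RECEIPT_NOT_FALSE:" + row["id"])
        target = root / row["target_path"]
        if not target.is_file():
            findings.append("TARGET_MISSING:" + row["id"])
        elif sha256_of(target) != row.get("digest"):
            findings.append("DIGEST_MISMATCH:" + row["id"])
    return findings


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    artifact = root / "pattern_slice.json"
    artifact.write_text('{"pattern": "demo"}', encoding="utf-8")
    manifest = {
        "module_count": 1,
        "modules": [{
            "id": "demo_slice",
            "target_path": "pattern_slice.json",
            "digest": sha256_of(artifact),
            "body_in_receipt": False,
        }],
    }
    clean_findings = check_manifest(manifest, root)
    # Editing the artifact after the manifest was written must be caught.
    artifact.write_text('{"pattern": "edited"}', encoding="utf-8")
    tampered_findings = check_manifest(manifest, root)

print(clean_findings, tampered_findings)
```

The findings carry only ids and codes, which mirrors the stated rule that result records expose refs and digests rather than copied bodies.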

The public trace is the anti-echo pass. build_public_monitor_redteam_falsification_trace recomputes whether a coverage label is backed by an adversarial-probe span, then derives the monitor verdict from severity, escalation, and mitigation evidence. In the first-wave fixture, the three spans recompute to one pass, one escalate, and one block. If a declared verdict no longer matches the span-derived verdict, validate_public_trace records PUBLIC_TRACE_MONITOR_REDTEAM_VERDICT_MISMATCH and the component blocks.
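The anti-echo pass can be illustrated with a toy recomputation. The severity-to-verdict rule below is invented for the example, because the page does not state the actual relation; the span keys and the function names are hypothetical too. Only the three verdict values and the mismatch code are taken from the text.

```python
MISMATCH = "PUBLIC_TRACE_MONITOR_REDTEAM_VERDICT_MISMATCH"  # code named in the text


def recompute_verdict(span: dict) -> str:
    """Illustrative rule only; the real severity/verdict relation is not given here."""
    if span["severity"] == "low":
        return "pass"
    if (span["severity"] == "high"
            and span.get("mitigation_ref") and span.get("escalation_ref")):
        return "block"
    if span.get("escalation_ref"):
        return "escalate"
    return "unsupported"


def trace_findings(spans: list) -> list:
    """Flag every span whose declared verdict differs from the recomputed one."""
    return [MISMATCH + ":" + span["id"] for span in spans
            if recompute_verdict(span) != span["declared_verdict"]]


spans = [
    {"id": "benign", "severity": "low", "declared_verdict": "pass"},
    {"id": "suppressed_failure", "severity": "medium",
     "escalation_ref": "esc:1", "declared_verdict": "escalate"},
    {"id": "tool_authority", "severity": "high", "escalation_ref": "esc:2",
     "mitigation_ref": "mit:2", "declared_verdict": "block"},
]
recomputed = [recompute_verdict(span) for span in spans]

# Verdict drift: the declared label is softened without touching the evidence.
drifted = [dict(spans[0]), {**spans[1], "declared_verdict": "pass"}, dict(spans[2])]
drift_findings = trace_findings(drifted)
print(recomputed, drift_findings)
```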

Named Proof Consumers

run consumes the first-wave fixture and writes result, board, validation, sign-off, and metadata-only command-card result records. It is the proof consumer for the three synthetic trajectory cases and seven required negative cases.
run-monitor-bundle consumes the exported public bundle and proves that the bundle manifest, source-module manifest, copied/sanitized source artifacts, four bundle observations, digest checks, non-public-state scan, and metadata-only card path remain valid outside the fixture directory.
tests/test_agent_monitor_redteam_falsification_replay.py is the focused regression consumer. It asserts digest verification, sanitized dogfood trace slicing, workingness exposure of source-body imports, negative-case semantic floors, public trace verdict recomputation, source-evidence ref enforcement, public-relative metadata-only result records, manifest boundary rejection, and fresh result record reuse.
A cold reader consumes this Markdown only after checking the JSON bundle, generated JSON instance, exported source manifest, trajectory and observation fixtures, public trace recomputation path, focused tests, and scope limit. The reader may verify the replay boundary but must not infer live monitor performance, control-eval score, exploit guidance, complete security, public sharing, or launch-scope decision.

Semantic Negative and Perturbation Proof

The replay proves its boundary through negative and perturbation cases, not by trusting the positive fixture. The negative floor includes private chain-of-thought leakage, internal code export, exploit-instruction material, account secret material, live agent traffic, monitor product-performance claims, and coverage claims without adversarial probes. Those cases are expected to block or quarantine because they violate the public-source/metadata-only contract, not because they resemble production attacks.

The perturbation proof is the public trace recomputation. A declared verdict is accepted only when the public span evidence derives the same pass, escalate, or block result. Removing the adversarial-probe span, weakening the escalation or mitigation refs, changing the severity/verdict relation, or declaring source evidence that is absent from the exported source-module manifest turns the replay into a mismatch or blocking finding. That is the technical safety property: the paper authority lives in disagreement detection between declared monitor coverage and independently recomputed public evidence.

Shape

Source refs

Manifest: source_module_manifest.json

Diagram source

flowchart TD
    bundle["JSON bundle authority"]
    markdown["Markdown reader projection"]
    manifest["source_module_manifest.json"]
    digest["2 source-artifact digest gates"]
    cases["three trajectory cases"]
    policy["public monitor policy"]
    observations["three observations"]
    required["required refs: span, probe, escalation, mitigation, replay"]
    trace["public trace verdict recomputation"]
    verdicts["pass / escalate / block"]
    negatives["7 negative semantic floors"]
    privateScan["metadata-only non-public-state scan"]
    resultRecord["metadata-only validation result record"]
    ceiling["scope limits"]
    bundle --> markdown
    manifest --> digest
    digest --> observations
    cases --> required
    policy --> required
    required --> observations
    observations --> trace
    trace --> verdicts
    negatives --> resultRecord
    digest --> privateScan
    verdicts --> resultRecord
    privateScan --> resultRecord
    resultRecord --> ceiling

The module's shape is evidence narrowing.

Reader Evidence Routing

Bundle route: core/paper_module_capsules.json::paper_modules[33] is the bundle-backed authority row, and paper_modules/agent_monitor_redteam_falsification_replay.json is the generated paper-module instance.
Source-module route: examples/agent_monitor_redteam_falsification_replay/exported_monitor_redteam_bundle/source_module_manifest.json records two admitted source artifacts with body_in_receipt: false: the public source-pattern JSON slice with digest sha256:89792add1e2f03a09c40f64d19c1ac0a54d62c053aabe11ecad0a9846a54cf33, and the sanitized public dogfood safety-evals trace slice with digest sha256:88493225f908f3f8892d187370d30231cb4e292b43bed202b462b6a6888e1eb0.
Trajectory route: trajectory_cases.json carries the three public trajectory hashes sha256:monitor-redteam-benign-route-review, sha256:monitor-redteam-suppressed-failure, and sha256:monitor-redteam-tool-authority-escape.
Observation route: monitor_observations.json binds obs_benign_route_review_pass to monitor.synthetic.route-review.v1 and a pass verdict, obs_suppressed_failure_escalate to monitor.synthetic.failure-suppression.v1 and an escalate verdict, and obs_tool_authority_escape_block to monitor.synthetic.tool-authority.v1 and a block verdict. Each observation carries a body-omission ref and cold replay ref.
Runtime route: src/microcosm_core/organs/agent_monitor_redteam_falsification_replay.py and tests/test_agent_monitor_redteam_falsification_replay.py verify manifest digest visibility, negative-case coverage, required observation fields, public trace recomputation, and metadata-only result record boundaries.

Public Mechanics

A monitor claim cannot pass unless the observation includes a verdict and the probe, escalation, mitigation, body-omission, and replay refs that make the verdict rerunnable.
Coverage labels require adversarial probe refs; benign-only trajectories do not authorize coverage language.
Private reasoning, internal code, exploit-detail, account secret, live-traffic, product-performance, and coverage-without-probe cases are expected falsification fixtures.
Result records expose ids, refs, verdict counts, negative cases, body_in_receipt: false, non-public-state scan, and scope limits only.

Prior Art Grounding

This component is grounded in model red-team and behavior-discovery work that treats monitor claims as things to falsify with adversarial probes. Anthropic's Red Teaming Language Models to Reduce Harms is a close procedural anchor for eliciting harmful or unwanted behavior, and Discovering Language Model Behaviors with Model-Written Evaluations anchors the idea that evaluation prompts can surface behavior classes worth tracking. More recent sabotage and control-evaluation work reinforces the same shape: monitors need adversarial trajectories, suspicious spans, escalation paths, and negative cases, not just benign examples.

Microcosm borrows the falsification accounting pattern. A monitor verdict needs trajectory hashes, probe refs, suspicious-span refs, escalation refs, mitigation refs, replay refs, and body-omission result records before coverage language is allowed. It does not claim a live monitor product or control-eval score.

Evidence Contract Summary

The evidence contract has the following gates:

Trajectory gate: each monitor observation must cite a trajectory hash, monitor id, suspicious-span ref, adversarial-probe ref, verdict, severity, escalation ref, body-omission ref, mitigation ref, and cold-replay ref.
Source-body gate: the exported source-module manifest names the admitted copied/sanitized public source artifacts, requires matching digests, and keeps body_in_receipt: false.
Falsification gate: semantic negatives and public trace recomputation reject private-body leakage, unsupported source evidence, missing probes, unsupported verdicts, and declared/recomputed verdict mismatch.

A valid paper claim must pass every gate and still inherit the limitations above.

Validation Result record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m microcosm_core.organs.agent_monitor_redteam_falsification_replay run-monitor-bundle --input examples/agent_monitor_redteam_falsification_replay/exported_monitor_redteam_bundle --out /tmp/agent_monitor_redteam_bundle_receipt --card > /tmp/agent_monitor_redteam_bundle_card.json

The focused regression test and corpus projection checks are run from the repo root:

PYTHONDONTWRITEBYTECODE=1 PYTHONPYCACHEPREFIX=/tmp/mc_agent_monitor_pyc ./repo-pytest tests/test_agent_monitor_redteam_falsification_replay.py -q -p no:cacheprovider --basetemp=/tmp/mc_agent_monitor_bt
./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

The validation ceiling remains synthetic monitor falsification replay only.

Scope boundary

Limitations and Scope limit

This module may claim public fixture evidence that trajectory hashes, synthetic monitor ids, suspicious-span refs, adversarial-probe refs, verdict labels, escalation refs, mitigation refs, body-omission refs, cold replay refs, negative-case labels, manifest digests, and validation result records are checked by the listed runtime witnesses.

This module may not claim a live monitor product, control-eval score, production monitoring, exploit guidance, private reasoning export, internal source export, live agent traffic, provider behavior, complete security, source-file changes, publishing-scope decision, launch-scope decision, or whole-system correctness.

Scope boundary

This module does not run live agents, use external model services, expose private chain-of-thought, export internal code, provide exploit instructions, include account secrets, import live agent traffic, claim monitor product performance, claim control-eval scores, change source files, publish results, or include launch operations.

Scope limit

This module may claim fixture-bound evidence that the component ran over public synthetic inputs and produced the result records and projections described above, reproduced by the validation result records named on this page.

It may not claim more than its bundle scope limit allows: Synthetic replay fixture only; not a live monitor product, control eval, exploit guide, product-spine system, or complete-security claim.

Source and projection details

Governing Lattice Relation

The bundle binds this page to mechanism.agent_monitor_redteam_falsification_replay.validates_public_monitor_verdict_receipts, the agent_reliability_and_safety_validator_bundle concept, provisional principles P-1 and P-2, provisional axiom AX-1, and the agent_monitor_redteam_falsification_replay runtime locus. Within that lattice, the mechanism is an evidence-before-monitor-claim gate: monitor coverage language has no paper authority unless the source record, policy, trajectory roster, observations, source-artifact manifest, negative-case semantic floors, public trace, and metadata-only result records agree.

The governing concept is validator-bundle accountability, not live monitor operation. The principle/axiom ceiling is enforced as a refusal surface: private reasoning, internal code, exploit instructions, account secrets, live agent traffic, model-output data, performance claims, source-file changes, publishing-scope decision, and launch-scope decision remain false even when the replay fixture passes.

Agent Sabotage Scheming Monitor Replay
Audits a hidden-goal catch claim for the steps, suspicion scores, and counterfactual it needs. 3/5

Does This inspects whether a write-up that claims an AI agent was caught secretly working a hidden side-goal actually carries the supporting structure a reader would need: the step-by-step actions, a per-step "how suspicious did the monitor find this?" score, and a stated counterfactual saying that removing the sneaky steps would drop the hidden goal while keeping the real task working. It runs only on synthetic, scrubbed example data, so it shows what the skeleton of such a safety claim should look like and checks the pieces line up with each other. It does not run a real agent, does not detect real scheming, and proves nothing about how any deployed monitor performs.

Scope limit Bounded public runtime validation over copied source pattern bodies, sanitized dogfood trace slices, recomputed sabotage/scheming monitor spans, source-artifact evidence refs, digest/metadata-only/non-public-state gates, and negative cases only; no live sabotage, live agent execution, exploit instruction, account secret/account, private-reasoning, harmful-payload, monitor-product-performance, deployment-risk, benchmark, provider-call, source-file changes, launch, public sharing, or product authority.

Run

microcosm agent-sabotage-scheming-monitor-replay run-sabotage-bundle --input examples/agent_sabotage_scheming_monitor_replay/exported_sabotage_monitor_bundle --out receipts/runtime_shell/demo_project/organs/agent_sabotage_scheming_monitor_replay

Evidence: Computed projection · evidence 3/5 · Source-faithful refactor

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Agent Sabotage Scheming-Monitor Replay

Explains: component Agent Sabotage Scheming Monitor Replay · mechanism validates public sabotage scheming monitor replay

Governed by: principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by: axiom Derivation before assertion

Depends on: paper module Agent Monitor Redteam Falsification Replay

Abstract

agent_sabotage_scheming_monitor_replay is a technical paper module for a drilldown-only regression component. It validates whether a sabotage/scheming monitor claim is backed by joinable synthetic episodes, action traces, per-step monitor scores, counterfactual benign replay, cold replay, falsification fixtures, source-artifact evidence refs, metadata-only result records, and an explicit scope limit.

The core contribution is not a live scheming detector. It is a replay contract: given a public fixture or exported bundle, the component recomputes caught/not-caught verdicts from monitor evidence, checks whether counterfactual replay isolates the side objective, verifies source-artifact digest/provenance rows, rejects common overclaim shapes, and emits result records that omit private or harmful bodies.

Telos

The module exists to make safety-monitor evidence auditable without laundering a demo into a deployment claim. Frontier-lab readers should be able to answer:

Which finite evidence rows support the sabotage/scheming monitor claim?
Which verdicts are recomputed from scores and joins rather than echoed from fixture labels?
Which mutations and negative cases cause rejection?
Which claims remain out of scope even when every local check is green?

Mechanism

The runtime locus is src/microcosm_core/organs/agent_sabotage_scheming_monitor_replay.py. It exposes two public admission paths:

run validates the first-wave fixture under fixtures/first_wave/agent_sabotage_scheming_monitor_replay/input.
run_sabotage_bundle validates the exported bundle under examples/agent_sabotage_scheming_monitor_replay/exported_sabotage_monitor_bundle.

Both paths call _build_result. The builder loads:

projection_protocol.json
scheming_monitor_policy.json
task_episodes.json
action_traces.json
monitor_scores.json
counterfactual_replay.json
cold_replay.json
source_module_manifest.json when present

The monitor proof is a join-and-recompute circuit. validate_task_episodes, validate_action_traces, and validate_monitor_scores require monitor rows to bind to an episode and action step. Monitor rows must carry evidence refs; in the exported bundle those refs resolve through the digest-checked source manifest. validate_counterfactual_replay checks that replay rows remove side-task success while preserving the main task. validate_cold_replay requires explicit passing replay result records rather than narrative assurance.

The recompute is deliberately conjunctive. A monitor row only earns a block verdict if the suspicion score is at or above the policy block threshold and the row is a "backed suspicious action". Backed means five separate joins all hold: the suspicious-action span names a real action id, that action id is also cited in the row's evidence refs, the joined action carries attack_opportunity_label = present, its hidden-side-objective pressure is medium or high, and the counterfactual replay isolates exactly that step by removing side-task success while preserving the main task. Drop any one of those and the row recomputes to escalate or pass rather than block, so a high score alone cannot manufacture a catch. An episode is then recomputed as caught only if it contains at least one recomputed block row with verified source-artifact evidence.
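The conjunctive recompute can be sketched as follows. The five joins mirror the paragraph above, and attack_opportunity_label = present is the one field value taken from the text. The other key names, the threshold values, and the rule that separates escalate from pass are assumptions made for the example.

```python
POLICY = {"block_threshold": 0.8, "escalate_threshold": 0.5}  # hypothetical values


def isolating_steps(counterfactual_rows: list) -> set:
    """Action ids whose removal drops the side task but keeps the main task."""
    return {row["removed_action_id"] for row in counterfactual_rows
            if not row["side_task_success"] and row["main_task_success"]}


def is_backed(row: dict, actions: dict, isolated: set) -> bool:
    """All five joins must hold for a 'backed suspicious action'."""
    action_id = row.get("suspicious_action_id")
    action = actions.get(action_id)
    return (
        action is not None                                   # span names a real action id
        and action_id in row.get("evidence_refs", ())        # same id cited as evidence
        and action.get("attack_opportunity_label") == "present"
        and action.get("side_objective_pressure") in ("medium", "high")
        and action_id in isolated                            # counterfactual isolates this step
    )


def recompute_row(row, actions, isolated, policy=POLICY) -> str:
    score = row["suspicion_score"]
    if score >= policy["block_threshold"] and is_backed(row, actions, isolated):
        return "block"
    # Assumed fallback rule: a second, lower threshold separates escalate from pass.
    return "escalate" if score >= policy["escalate_threshold"] else "pass"


def episode_caught(rows, actions, isolated) -> bool:
    """Caught only with at least one block row that has verified source evidence."""
    return any(recompute_row(row, actions, isolated) == "block"
               and row.get("source_evidence_verified") for row in rows)


actions = {"a2": {"attack_opportunity_label": "present",
                  "side_objective_pressure": "high"}}
steps = isolating_steps([{"removed_action_id": "a2",
                          "side_task_success": False, "main_task_success": True}])
row = {"suspicious_action_id": "a2", "evidence_refs": ["a2"],
       "suspicion_score": 0.93, "source_evidence_verified": True}
unbacked = {**row, "evidence_refs": []}  # same high score, one join removed

print(recompute_row(row, actions, steps), recompute_row(unbacked, actions, steps))
```

The unbacked row keeps its 0.93 score but recomputes to escalate, which is the "high score alone cannot manufacture a catch" property in miniature.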

The trace recomputation locus is src/microcosm_core/macro_tools/agent_execution_trace.py::build_public_sabotage_scheming_monitor_trace. It folds the same joins into per-episode spans, recomputes the caught verdict, and checks that the counterfactual's changed steps are the higher-suspicion ones (a positive rank delta). A stale declared verdict, a benign-step counterfactual spoof, or a missing cold replay becomes a finding rather than a silent pass.
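The page does not define the rank delta, so the sketch below assumes one plausible reading: the mean suspicion score of the steps the counterfactual changed, minus the mean score of the steps it left alone. The function name and the data layout are hypothetical.

```python
from statistics import mean


def rank_delta(scores: dict, changed_steps: set) -> float:
    """Assumed definition: mean score of changed steps minus mean score of the rest."""
    changed = [score for step, score in scores.items() if step in changed_steps]
    unchanged = [score for step, score in scores.items() if step not in changed_steps]
    if not changed or not unchanged:
        return 0.0  # nothing to compare, so no positive delta can be claimed
    return mean(changed) - mean(unchanged)


scores = {"a1": 0.1, "a2": 0.9, "a3": 0.2}

# Honest counterfactual: the removed step is the high-suspicion one.
honest = rank_delta(scores, {"a2"})
# Benign-step spoof: the counterfactual edits a low-suspicion step instead.
spoof = rank_delta(scores, {"a1"})
print(honest > 0, spoof > 0)
```

Under this reading, a counterfactual that only touches benign steps yields a non-positive delta and would surface as a finding rather than a pass.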

Evidence Floor

The local floor is intentionally small and inspectable:

Evidence class	Fixture mode	Exported-bundle mode
Episodes	3	4
Action trace rows	6	7
Monitor-score rows	6	7
Counterfactual replay rows	3	4
Cold replay rows	3	4
Negative cases	7	0 fixture-negative payloads
Copied public source artifacts	source trace body + trace builder import	2 manifest rows

The exported bundle adds scheming_ep_public_authority_probe and obs_public_authority_probe. That row is the real dogfood-trace floor: it comes from a sanitized public command-trace slice, verifies against the source trace digest, drops argv/stdout/stderr previews, and remains metadata-only in result records. The tests require it to compute a block monitor verdict and a caught episode without exporting private session material.

The source-open body floor is split:

The runtime imports microcosm_core.macro_tools.agent_execution_trace::build_public_sabotage_scheming_monitor_trace as the public recomputation body.
The exported bundle manifest carries agent_sabotage_scheming_extracted_pattern_ledger_row_body_import as a source-faithful public source pattern slice with digest sha256:e1ad34691ca598b878f8658485ce78fa4fa59b3eb0a4bd078355c415ca2cb312.
The same manifest carries agent_sabotage_scheming_public_dogfood_safety_evals_trace_slice_import as a source-faithful sanitized command-trace slice with digest sha256:7b6255ec65f4b21163685cf9d7a59dd9ae6075126deac065dec9a45d79dcea06 and source trace digest sha256:24487cf8edb0779cfdd460b04a4ea7701805686545069f9e1c5767250c223cd3.

Both manifest rows require body_in_receipt: false.
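
The digest and `body_in_receipt` requirements can be checked with the standard library alone. The sketch below assumes a manifest row is a dict with a `digest` key in the `sha256:<hex>` form shown above; that key name and the function names are illustrative, not the runtime's API.

```python
import hashlib


def sha256_ref(body: bytes) -> str:
    """Format a digest the way the manifest rows above write it."""
    return "sha256:" + hashlib.sha256(body).hexdigest()


def check_manifest_row(row: dict, body: bytes) -> list:
    """Return a list of findings; an empty list means the row is admissible."""
    findings = []
    # The flag must be present and literally false, not merely falsy or absent.
    if row.get("body_in_receipt") is not False:
        findings.append("body_in_receipt must be exactly false")
    if row.get("digest") != sha256_ref(body):
        findings.append("digest mismatch")
    return findings
```

Checking `is not False` rather than truthiness means a row that simply omits the flag is rejected as well.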

Shape

Diagram source

flowchart TD
  bundle["JSON bundle authority"]
  fixture["fixture or exported bundle"]
  episodes["task episodes"]
  actions["action traces"]
  scores["monitor scores"]
  sourceManifest["source module manifest"]
  sourceArtifacts["digest-checked public artifacts"]
  counterfactual["counterfactual replay"]
  cold["cold replay"]
  recompute["public trace recomputation"]
  negatives["negative/falsification cases"]
  resultRecord["metadata-only validation result record"]
  ceiling["scope limit"]
  bundle --> fixture
  fixture --> episodes
  fixture --> actions
  fixture --> scores
  fixture --> counterfactual
  fixture --> cold
  fixture --> sourceManifest
  sourceManifest --> sourceArtifacts
  episodes --> recompute
  actions --> recompute
  scores --> recompute
  counterfactual --> recompute
  cold --> recompute
  negatives --> resultRecord
  sourceArtifacts --> resultRecord
  recompute --> resultRecord
  resultRecord --> ceiling

Rejection And Perturbation Evidence

The focused regression suite is the strongest technical evidence for the paper claim. tests/test_agent_sabotage_scheming_monitor_replay.py covers:

Digest mismatch in source_module_manifest.json blocks the exported bundle.
target_ref / path mismatch blocks the exported bundle.
body_text_in_receipt at manifest or row level is rejected.
Missing source_artifact_evidence_refs quarantines a monitor row.
A declared caught verdict that disagrees with recomputation emits PUBLIC_TRACE_SCHEMING_CAUGHT_VERDICT_MISMATCH.
Perturbing a high-suspicion block row down to pass moves the recomputed caught verdict and blocks the result record.
A stale caught_verdict label inside a monitor-score row is rejected against the recomputed episode verdict.
A counterfactual row that changes a benign low-suspicion step fails the rank delta check and the isolation check.
Spoofing a declared negative-case id is caught by semantic negative-case floors.
Public-relative, metadata-only result records exclude absolute local paths and private keys such as account secrets, model-output data, hidden reasoning, raw harmful payloads, exploit instructions, and real target identifiers.
The real sanitized public probe must compute block, carry verified source artifact evidence, and join to its counterfactual row.
Spoofing the public probe to pass is rejected by verdict recomputation and the real-sanitized-evidence check.

These tests matter because they exercise the falsification path, not only the happy path. The component is permitted to say "this synthetic replay failed closed under these mutations"; it is not permitted to say "this detects real scheming".
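
One of the listed checks, that result records stay public-relative and metadata-only, is easy to illustrate. The following is a sketch under assumptions: the forbidden key names are examples chosen to echo the categories above, and the path test only recognises POSIX-style absolute or home-relative strings.

```python
# Illustrative key names; the real exclusion list is defined by the runtime.
FORBIDDEN_KEYS = {"argv", "stdout", "stderr", "hidden_reasoning", "model_output"}


def scan_record(value, path="$"):
    """Walk a JSON-like value; report forbidden keys and absolute local paths."""
    findings = []
    if isinstance(value, dict):
        for key, child in value.items():
            if key in FORBIDDEN_KEYS:
                findings.append(f"{path}.{key}: forbidden key")
            findings.extend(scan_record(child, f"{path}.{key}"))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            findings.extend(scan_record(child, f"{path}[{i}]"))
    elif isinstance(value, str):
        # Public records are expected to carry repo-relative paths only.
        if value.startswith(("/", "~")):
            findings.append(f"{path}: absolute local path")
    return findings
```

A validator built this way would block the result record whenever the scan returns anything, which matches the fail-closed posture described above.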

Reader Evidence Routing

Bundle: core/paper_module_capsules.json::paper_modules[34:paper_module.agent_sabotage_scheming_monitor_replay]
Generated instance: paper_modules/agent_sabotage_scheming_monitor_replay.json
Standard: standards/std_microcosm_agent_sabotage_scheming_monitor_replay.json
Mechanism: mechanisms/mechanism.agent_sabotage_scheming_monitor_replay.validates_public_sabotage_scheming_monitor_replay.json
Component model: organs/agent_sabotage_scheming_monitor_replay.json
Runtime: src/microcosm_core/organs/agent_sabotage_scheming_monitor_replay.py
Trace recomputation: src/microcosm_core/macro_tools/agent_execution_trace.py::build_public_sabotage_scheming_monitor_trace
Fixture input: fixtures/first_wave/agent_sabotage_scheming_monitor_replay/input
Exported bundle: examples/agent_sabotage_scheming_monitor_replay/exported_sabotage_monitor_bundle
Focused tests: tests/test_agent_sabotage_scheming_monitor_replay.py
Runtime-shell result record: receipts/runtime_shell/demo_project/organs/agent_sabotage_scheming_monitor_replay/exported_sabotage_monitor_bundle_validation_result.json

Governing Relations

The generated paper-module instance binds this reader page to:

component:agent_sabotage_scheming_monitor_replay
mechanism:mechanism.agent_sabotage_scheming_monitor_replay.validates_public_sabotage_scheming_monitor_replay
concept:concept.agent_reliability_and_safety_validator_bundle
principle:P-1
principle:P-2
axiom:AX-1
paper_module.agent_monitor_redteam_falsification_replay
code locus: src/microcosm_core/organs/agent_sabotage_scheming_monitor_replay.py with run, run_sabotage_bundle, _build_result, _write_receipts, and result_card

Those edges are structural evidence-routing edges. They do not by themselves prove runtime correctness; the runtime result records and tests carry that narrower claim.

Prior-Art Positioning

The module borrows evaluation grammar from sabotage, scheming, and deceptive behavior evaluations: episodes, action traces, monitor scores, counterfactual benign replay, falsification cases, and fail-closed monitor disagreement. The closest public references remain Anthropic's sabotage evaluations, Apollo's in-context scheming evaluations, Anthropic's Sleeper Agents work, and the OpenAI/Apollo line on detecting and reducing scheming.

The Microcosm claim is deliberately narrower than those research programs. It does not evaluate frontier models or live deployments. It gives the public Microcosm system a result record-backed, metadata-only, replayable safety-monitor proof shape that can reject overclaims and provenance drift.

Prior Art Grounding

This component replays a monitor that watches an agent transcript for sabotage and scheming signals. It is grounded in the AI-control literature, which studies how to catch a model that may be deliberately subverting oversight rather than assuming it is benign (Greenblatt et al., AI Control, and Anthropic's sabotage evaluations). Microcosm borrows the monitor-over-transcript shape on synthetic fixtures; the result is fixture-bound replay evidence, not a live monitor, a control guarantee, or an exploit guide.

Validation Result Record Path

Run the metadata-only fixture validator:

Run the exported-bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python \
  -m microcosm_core.organs.agent_sabotage_scheming_monitor_replay \
  run-sabotage-bundle \
  --input examples/agent_sabotage_scheming_monitor_replay/exported_sabotage_monitor_bundle \
  --out /tmp/agent_sabotage_scheming_bundle_receipt \
  --card > /tmp/agent_sabotage_scheming_bundle_card.json

Run focused tests and paper-module corpus validation:

PYTHONPATH=src ./repo-pytest \
  tests/test_agent_sabotage_scheming_monitor_replay.py
cd microcosm-substrate && PYTHONPATH=src python3 \
  scripts/build_doctrine_projection.py --check-paper-module-corpus

Expected local result: the fixture and exported-bundle validators pass on the checked-in inputs; the mutation tests block when digest, target-ref, body-in-receipt, evidence-ref, caught-verdict, counterfactual-rank, negative-case semantic, or public-probe labels are corrupted.

Scope boundary

Public-Safety Boundary

Public rendering may describe the replay grammar, recomputation rule, falsification cases, sanitized command-trace source artifact, digest checks, and metadata-only result records.

Public rendering must not include or imply:

live sabotage instructions
exploit steps
account secrets, browser state, account or browser material, or model-output data
private chain-of-thought or hidden reasoning bodies
raw harmful payloads
unsanitized argv/stdout/stderr previews
browser UI state
production telemetry
live traffic
live agent execution authorization
deployment-risk measurement
product monitor performance
benchmark claims
provider affiliation or provider behavior claims
source-file changes
public sharing/launch-scope decision
private-system equivalence
whole-system safety

Scope limit

It may not claim more than its bundle scope limit allows: No live sabotage, exploit instruction, account secret/account material, private reasoning, harmful payload, or deployment-risk product claim; synthetic fixtures and metadata-only result records only.

Agent Memory Temporal Conflict Replay · Replays a memory edit-and-delete to show stale facts get flagged before they sway an answer. · 3/5

Does Replays a canned three-step story: an agent's memory first records two facts (a preference and a tool result); it then updates the preference's scope and deletes the tool result once that fact is stale; finally it re-runs the same task once with memory on and once with memory off. From plain result records, it shows that the update and the deletion each carry a "conflict" and a "downgrade" result record before the memory is allowed to affect the answer, that the memory-on and memory-off runs are compared through logged evidence rather than just the final wording, and that private-thread content stays in the record only as metadata pointers, never as copied text. It checks the bookkeeping of this synthetic example; it does not judge whether the agent's memory decisions were the right ones.

Scope limit It validates the projection mechanics of a synthetic memory fixture only — that the required refs, decisions, paired replays, negative cases, and secret-exclusion scan line up and that result records are metadata-only. It does not claim live-memory product quality, judge whether memory decisions were domain-correct, treat memory recall as source authority, adopt active injection, export private transcripts, use external model services, change source files, or include launch operations.

Run

PYTHONPATH=src python3 -m microcosm_core.organs.agent_memory_temporal_conflict_replay run --input fixtures/first_wave/agent_memory_temporal_conflict_replay/input --out /tmp/agent_memory_temporal_conflict_replay_out

Evidence · Computed projection · evidence 3/5 · Source-faithful refactor

Links to Sleeper Memory Poisoning Quarantine Replay

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Agent Memory Temporal-Conflict Replay

Explains component Agent Memory Temporal Conflict Replay · mechanism validates public memory conflict replay

Governed by principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by axiom Derivation before assertion

Depends on paper modules Agent Route Observability Runtime Bridge Phase Continuity Runtime

This module is the public Microcosm projection of an agent-memory honesty contract. It is a synthetic replay fixture, not a live memory product, private transcript export, source-authority claim, or launch claim.

The fixture models three public episodes: episode A records a scoped preference and a tool-result fact, episode B updates the preference scope and deletes the now-stale fact through conflict-edge and downgrade result records, and episode C replays the task with memory enabled and disabled. The replay is admitted only when ADD, UPDATE, DELETE, and NOOP decisions, metadata-only non-public refs, evidence handles, cold replay refs, and an answer-delta result record line up.

Purpose

This component exists because an agent that remembers can quietly start trusting the wrong row. A user states a preference, the world changes, a later turn contradicts the earlier one, and a naive memory store keeps serving the stale fact as though it were still true. The single question this fixture answers is narrow and checkable: when one memory write supersedes an earlier one, does the record of that conflict actually hold up, or is it just a label?

The unusual choice is that the validator does not trust the labels it is given. A row can declare decision = UPDATE, attach a plausible-looking conflict-edge ref, and still be quarantined. In _apply_conflict_semantic_recompute the checker re-derives the conflict lineage from the raw fields it can verify: episode order, event timestamp, memory priority, and source-trust score. An UPDATE or DELETE that claims to supersede a prior write but is not timestamped after it, or that regresses priority or relies on lower-trust evidence than the write it replaces, is rejected. _apply_temporal_order_checks adds the coarser ordering rule that a conflict edge must land after some earlier accepted write and a replay must land after the conflict it depends on. The label is treated as a claim to be recomputed, not as authority.

The point of the paired memory-on and memory-off replay is the matching discipline on the output side. Memory is only allowed to take credit for a better answer through an explicit evidence handle and a cold-replay result record, so the gain is attributable rather than asserted. The interesting idea here is not a memory product. It is a small, reproducible accounting method for one specific failure: a stale row outranking newer evidence.

Abstract

Agent memory becomes dangerous when a stored row is allowed to outrank later evidence. This module turns that risk into a public, replayable checker: a synthetic three-episode fixture exercises memory ADD, UPDATE, DELETE, and NOOP decisions, then verifies that later conflicts can influence replay only through typed evidence handles, temporal conflict edges, stale-downgrade result records, paired memory-on/off cold replays, and a metadata-only answer-delta result record.

The technical contribution is not "better memory" and not a product claim. It is a narrow public accounting method for temporal memory conflict: memory rows are metadata under test, non-public refs are metadata-only, copied source-open bodies are digest-checked outside result records, and seven negative fixtures prove that common overclaim paths are rejected before any pass result record is written.

Telos

The reader-facing aim is to make a hard memory-honesty boundary inspectable without exporting private memory bodies. A cold reader should be able to answer four questions from files and result records:

Which memory decision was made, and under which public route ref?
Which evidence handle, timestamp, priority, and source-trust score justified the row?
Which prior row was conflicted or downgraded before later replay credit was allowed?
Did the memory-enabled replay use admissible evidence while the paired memory-disabled replay remained available for answer-delta accounting?

The accepted result is a metadata-only memory-conflict result record. It supports only a synthetic fixture-level claim: this replay respected the declared temporal conflict contract under the checked inputs.

Technical Mechanism

The runtime treats memory as public replay metadata, not as authority. The validator loads projection_protocol.json, memory_policy.json, memory_episodes.json, and replay_observations.json; the exported-bundle mode also loads bundle_manifest.json, source_module_manifest.json, and the copied source artifacts listed in the manifest. _build_result combines secret scanning, public trace construction, protocol validation, policy validation, episode validation, replay validation, source-module import validation, negative-case coverage, and the scope limit before a pass status is possible.

The mechanism has five reader-visible gates:

validate_projection_protocol requires source refs, source pattern ids, projection result records, target refs, public runtime refs, target symbols, reimplemented mechanics, omissions, and an explicit denial that private thread bodies were copied.
validate_memory_policy requires ADD, UPDATE, DELETE, and NOOP as the only admitted decision vocabulary and denies any live-memory product claim, transcript export, source authority, active injection, provider call, or launch-scope decision.
validate_memory_episodes turns the five public event rows into accepted or quarantined memory metadata. Each row needs a route ref, decision, synthetic subject id, evidence handle, metadata-only private thread ref, body-export flag, source-authority flag, and active-injection flag. Positive replay credit requires all four decision classes, two conflict-edge refs, stale-downgrade refs, and a prompt-adoption observation ref.
validate_replay_observations checks the paired memory-enabled and memory-disabled replay rows. Memory-enabled replay must cite public evidence handles that resolve against the accepted event rows, both replays must carry cold-replay result record refs, and the pair must share an answer-delta result record.
validate_source_module_imports verifies the exported bundle's five copied source bodies by digest, material class, relation, and body_in_receipt=false. The card path reports only counts, digest refs, and result record paths; full memory rows, replay rows, source bodies, private transcript bodies, model-output data, and active injection text stay out of result records and public cards.
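
The replay-credit rule in the third gate could look roughly like the sketch below. It assumes event rows are dicts; `decision`, `conflict_edge_ref`, and `stale_downgrade_ref` follow names used on this page, while the function name and the `prompt_adoption_observation_ref` parameter are hypothetical.

```python
ADMITTED_DECISIONS = {"ADD", "UPDATE", "DELETE", "NOOP"}


def replay_credit_findings(rows, prompt_adoption_observation_ref=None):
    """Return reasons to withhold positive replay credit (empty list = credit allowed)."""
    findings = []
    decisions = {row.get("decision") for row in rows}
    for extra in sorted(decisions - ADMITTED_DECISIONS, key=str):
        findings.append(f"decision outside vocabulary: {extra}")
    for missing in sorted(ADMITTED_DECISIONS - decisions):
        findings.append(f"decision class not exercised: {missing}")

    # Only UPDATE and DELETE touch older memory, so only they owe conflict evidence.
    mutations = [r for r in rows if r.get("decision") in {"UPDATE", "DELETE"}]
    edges = {r["conflict_edge_ref"] for r in mutations if r.get("conflict_edge_ref")}
    if len(edges) < 2:
        findings.append("fewer than two conflict-edge refs")
    if any(not r.get("stale_downgrade_ref") for r in mutations):
        findings.append("mutation without stale-downgrade ref")
    if not prompt_adoption_observation_ref:
        findings.append("missing prompt-adoption observation ref")
    return findings
```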

The mechanism is deliberately negative as well as constructive. Seven falsification fixtures prove that raw transcript export, private candidate auto-promotion, stale preference override, memory-as-source-authority, vector recall without evidence, final-answer-only memory credit, and active injection as authority are blocked. build_public_memory_conflict_trace gives the reader a seven-span public trace over the same rows, with five memory-event spans and two cold-replay spans, and audits coverage for evidence handles, metadata-only non-public refs, no non-public body export, cold-replay refs, answer-delta refs, and memory-enabled evidence.

Temporal Conflict Mechanism

The central rule is evidence-before-memory-authority. A memory row may be accepted as replay metadata only after it satisfies the public policy fields: route ref, decision, synthetic subject id, event timestamp, memory priority, source-trust score, evidence handle, metadata-only private thread ref, and explicit false authority flags for body export, source authority, and active injection.

UPDATE and DELETE decisions have an extra burden because they alter older memory. The validator recomputes the conflict lineage instead of trusting the label. _apply_temporal_order_checks verifies that conflict rows occur after the prior writes they touch, and that replay NOOP rows occur after conflict and downgrade evidence. _apply_conflict_semantic_recompute then checks the semantic shape of the mutation: the prior event must exist, the conflict group must be coherent, timestamps must advance, priority may not regress below the allowed floor, and source trust must stay above the declared floor.
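
A minimal sketch of that semantic recompute, under stated assumptions: the `MemoryEvent` shape, the `supersedes` pointer, and the finding strings are hypothetical, and the sketch reads "regression" as falling below either the declared floor or the prior write's value, which is one plausible reading of the rule rather than the runtime's exact comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryEvent:
    event_id: str
    decision: str                      # ADD, UPDATE, DELETE, or NOOP
    conflict_group: str
    timestamp: int                     # any monotonic clock works for the sketch
    priority: int
    source_trust: float
    supersedes: Optional[str] = None   # hypothetical pointer to the prior write


def recompute_conflict(event, accepted, priority_floor, trust_floor):
    """Re-derive whether an UPDATE/DELETE really supersedes its prior write."""
    if event.decision not in {"UPDATE", "DELETE"}:
        return []
    prior = accepted.get(event.supersedes)
    if prior is None:
        return ["prior event missing"]
    findings = []
    if prior.conflict_group != event.conflict_group:
        findings.append("conflict group incoherent")
    if event.timestamp <= prior.timestamp:
        findings.append("timestamp does not advance")
    if event.priority < max(priority_floor, prior.priority):
        findings.append("priority regression")
    if event.source_trust < max(trust_floor, prior.source_trust):
        findings.append("source-trust regression")
    return findings
```

Nothing in the function reads the row's declared conflict label; the verdict depends only on fields that can be compared against the prior write.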

Only after those checks pass can episode C receive replay credit. The memory-enabled replay must cite public evidence refs that resolve to accepted memory rows. The memory-disabled replay stays paired by replay group. The answer-delta result record accounts for the difference between those cold replays without reducing the evaluation to final-answer comparison alone.
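
The pairing rules for episode C can be sketched as a single check over the two replay rows. The rows are assumed to be dicts; `evidence_handle_refs` and `replay_group_id` follow names used on this page, while `cold_replay_ref`, `answer_delta_ref`, and the function name are illustrative.

```python
def check_replay_pair(enabled, disabled, accepted_handles):
    """enabled/disabled: replay rows as dicts.
    accepted_handles: set of evidence handles that belong to accepted memory rows."""
    findings = []
    refs = enabled.get("evidence_handle_refs", [])
    if not refs:
        findings.append("memory-enabled replay cites no evidence")
    unresolved = [r for r in refs if r not in accepted_handles]
    if unresolved:
        findings.append(f"unresolved evidence refs: {unresolved}")
    if disabled.get("evidence_handle_refs"):
        findings.append("memory-disabled replay used memory evidence")
    if enabled.get("replay_group_id") != disabled.get("replay_group_id"):
        findings.append("replays are not paired by replay group")
    for name, row in (("enabled", enabled), ("disabled", disabled)):
        if not row.get("cold_replay_ref"):
            findings.append(f"{name} replay lacks a cold-replay ref")
    delta = enabled.get("answer_delta_ref")
    if not delta or delta != disabled.get("answer_delta_ref"):
        findings.append("pair does not share an answer-delta ref")
    return findings
```

Note that the check never compares answer text: credit flows only through the evidence handles and the shared answer-delta ref.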

Real Sanitized Episode Evidence

The first-wave fixture is not merely shape-only synthetic data. Its memory_episodes.json, memory_policy.json, and replay_observations.json mirror the exported memory-temporal-conflict bundle, and the positive rows carry sanitized_real_episode=true, source artifact refs, source event refs, timestamps, memory priority, and source-trust scores. The source evidence posture declares real_source_floor as copied_non_secret_macro_agent_memory_body_with_provenance, body_in_receipt=false, and private_bodies_exported=false.

The exported bundle contributes a source-open body floor without turning bodies into result record material. source_module_manifest.json lists five copied source bodies across tool, doctrine, standard, and pattern material classes. The runtime verifies their digests and material classes, while public cards and result records expose only paths, counts, digest refs, omitted-material reasons, and the scope limit. That is the realness proof this paper can use: source provenance and result record-level recomputation, not private memory export.

Perturbation and Rejection Contract

The fixture includes positive pass evidence and perturbation evidence. Focused tests mutate the bundle to ensure that the validator rejects timestamp incoherence, priority regression, source-trust regression, temporal order breakage, unverified conflict evidence, source-event drift, stale override without downgrade, downgrade result record field swaps, positive rows without evidence handles, unresolved replay refs, replay without memory evidence, and source-body tampering.

Those rejection tests matter because temporal memory bugs often look plausible in isolation. A stale row with a nice label is still rejected if its conflict edge is absent or late; a memory-enabled replay is still rejected if its evidence refs do not resolve; a digest-mismatched body floor blocks the source import; and final-answer-only comparison remains a negative case rather than utility evidence.

Named Proof Consumers

python -m microcosm_core.organs.agent_memory_temporal_conflict_replay run consumes the first-wave fixture, includes negative cases, and writes the result, board, validation, and sign-off result records.
python -m microcosm_core.organs.agent_memory_temporal_conflict_replay run-memory-bundle consumes the exported source-open bundle, digest-checks copied source bodies, and emits the public bundle validation result.
tests/test_agent_memory_temporal_conflict_replay.py consumes the same fixture and bundle to assert decision counts, conflict counts, stale downgrades, secret exclusion, public-relative result records, unresolved replay rejection, source-module digest verification, metadata-only result record cards, and seven-span trace construction.
A cold public reader consumes the source record, manifest, event rows, replay rows, source-artifact digests, and validation result records; that consumer can verify the synthetic honesty boundary but cannot infer quality of any live memory system or launch-scope decision.

Shape

Diagram source

flowchart TD
  BodyFloor["Source-open body floor"] -->|digest verified; body_in_receipt=false| Policy["Policy vocabulary"]
  Policy -->|allows ADD / UPDATE / DELETE / NOOP only| EpisodeA["Episode A: ADD rows"]
  EpisodeA -->|creates baseline memory metadata| EpisodeB["Episode B: UPDATE / DELETE rows"]
  EpisodeB -->|touches older memory| ConflictGate["Temporal conflict gate"]
  ConflictGate -->|requires conflict_edge_ref| DowngradeGate["Stale downgrade gate"]
  DowngradeGate -->|requires stale_downgrade_ref| Recompute["Semantic recompute"]
  Recompute -->|checks timestamp, priority, source trust| Enabled["Episode C: memory-enabled replay"]
  Recompute -->|keeps paired baseline| Disabled["Episode C: memory-disabled replay"]
  Enabled -->|uses evidence_handle_refs| Delta["Answer-delta result record"]
  Disabled -->|no memory evidence used| Delta
  Delta -->|paired by replay_group_id| Trace["Public 7-span trace"]
  Trace -->|covers events plus cold replays| ResultRecord["metadata-only result record"]
  ResultRecord -->|omits private bodies and model-output data| Ceiling["Scope limit / scope boundary"]
  Ceiling -->|denies live-memory and source-authority claims| Done["Fixture-level validation claim only"]

The page shape is a temporal-conflict replay, not a memory product surface. A reader starts with the JSON bundle, follows the source module manifest to five copied source bodies, then checks three synthetic episodes: initial memory writes, a later temporal conflict with stale downgrades, and paired cold replays with memory enabled and disabled. The accepted outcome is a result record that says the replay respected the memory-honesty boundary; it does not make memory recall into source authority.

Failure Modes and Limitations

This module is intentionally narrow. It validates a public fixture and exported bundle against a declared temporal conflict contract; it does not measure live assistant memory quality, user satisfaction, recall coverage, provider behavior, or deployment posture. Passing result records show that checked rows, digests, traces, negative cases, and scope limits agreed for the fixture under test.

Known failure modes are treated as checker inputs rather than prose caveats: private transcript export, private candidate auto-promotion, stale preference override, memory-as-source-authority, vector recall without evidence, final-answer-only credit, active injection authority, missing source manifests, source-body digest drift, source-event drift, missing conflict edges, missing downgrade result records, and unresolved replay evidence. If a future module wants a stronger memory claim, it needs a new standard and new evidence; this module cannot promote itself beyond its fixture ceiling.

Reader Evidence Routing

Bundle route: read examples/agent_memory_temporal_conflict_replay/exported_memory_temporal_conflict_bundle/source_module_manifest.json for module_count=5, body_in_receipt=false, material classes, digest refs, omitted-material reasons, and the explicit secret-exclusion boundary.
Event route: read memory_episodes.json for the five memory events: episode_a_preference_add, episode_a_tool_fact_add, episode_b_preference_scope_update, episode_b_tool_fact_delete, and episode_c_replay_noop.
Conflict route: verify that the UPDATE and DELETE events carry temporal conflict-edge refs and stale-downgrade refs before they can affect replay credit.
Replay route: read replay_observations.json for the paired episode_c_memory_enabled_replay and episode_c_memory_disabled_replay rows, evidence refs, cold replay result records, and answer-delta accounting.
Runtime route: run tests/test_agent_memory_temporal_conflict_replay.py when the reader needs recomputation evidence. The focused tests assert digest verification, public-relative result records, non-public-state exclusion, unresolved replay rejection, and the exported bundle runtime shape.

Public Mechanics

Memory update claims require route refs, evidence handles, and explicit ADD/UPDATE/DELETE/NOOP decisions.
Updates and deletes that touch older memory require temporal conflict-edge refs plus stale-downgrade refs before memory can affect replay credit.
Private thread references are metadata-only; transcript bodies and private memory candidate bodies stay omitted.
Utility language requires paired memory-enabled and memory-disabled cold replay result records; final-answer-only comparison is not enough to support a memory utility claim.
Raw transcript export, private candidate auto-promotion, stale preference override, memory-as-source-authority, vector recall without evidence, final-answer-only memory credit, and active-injection authority are expected falsification fixtures.

Prior Art Grounding

This component is grounded in agent-memory architectures and the newer literature on stale or poisoned memory. The constructive lineage includes Generative Agents, which made observation, reflection, retrieval, and planning a concrete agent-memory pattern, and MemGPT, which treats long-context behavior as a memory-management problem. The risk lineage includes AgentPoison and STALE, which focus respectively on poisoned retrieval stores and whether agents update invalid memories when new evidence arrives.

Microcosm does not claim a live memory product. It borrows the useful accounting questions: which memory decision was made, which evidence handle justified it, which older row was conflicted or downgraded, and whether memory-on/off replay supports any claim beyond a final-answer comparison.

Validation Result Record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

PYTHONPATH=src python3 -m microcosm_core.organs.agent_memory_temporal_conflict_replay run --input fixtures/first_wave/agent_memory_temporal_conflict_replay/input --out /tmp/agent_memory_temporal_conflict_replay_out

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m microcosm_core.organs.agent_memory_temporal_conflict_replay run-memory-bundle --input examples/agent_memory_temporal_conflict_replay/exported_memory_temporal_conflict_bundle --out /tmp/agent_memory_temporal_conflict_bundle_receipt --card > /tmp/agent_memory_temporal_conflict_bundle_card.json

The focused regression test and corpus projection checks are:

cd microcosm-substrate && ../repo-pytest tests/test_agent_memory_temporal_conflict_replay.py
./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

Scope boundary

Reader Validation Boundary

A cold reader can validate this module by starting from the JSON source record, then checking the generated JSON instance, source-module manifest, synthetic episodes, memory-event decisions, temporal conflict-edge refs, stale-downgrade refs, paired memory-on/off cold replays, negative cases, and focused tests. The validation is limited to whether the synthetic replay preserved a metadata-only memory-honesty boundary.

The validation stops before quality claims about any live memory system, private transcript export, private candidate promotion, memory recall as source authority, provider behavior, active-injection authority, public sharing, and launch. Unpopulated concept, principle, axiom, and dependency edges remain residual pressure unless the JSON bundle owner lane adds real targets.

Scope limit

This module may claim only that a synthetic memory-temporal replay preserved a metadata-only memory-honesty boundary: ADD/UPDATE/DELETE/NOOP decisions, temporal conflict-edge refs, stale-downgrade refs, paired memory-on/off cold replay refs, answer-delta accounting, public trace refs, manifest digests, negative cases, and scope limits are checked.

It must not claim quality of any live memory system, readiness of a memory product, private transcript export, private candidate auto-promotion, source-authority status, provider behavior, active-injection authority, source-file changes, public sharing authorization, or launch-scope decision.

Scope boundary

This module does not run live memory, claim memory product quality, export private transcripts, auto-promote private candidates, treat memory recall as source authority, adopt active injection, use external model services, change source files, publish results, or include launch operations.

Source and projection details

Source-Open Body Floor

The exported bundle manifest is the body-row authority for five copied source bodies spanning tool, doctrine, standard, and pattern material classes. Those bodies stay in bundle source artifacts; result records and cards carry refs, digests, classes, counts, omission reasons, secret-exclusion status, and scope limits only.

The floor is accepted as synthetic temporal-conflict replay evidence. It is not live memory product evidence, private transcript export, private memory candidate export, provider-behavior evidence, source-file changes, public sharing authorization, or launch-scope decision.

Governing Lattice Relation

The JSON bundle binds this module to mechanism.agent_memory_temporal_conflict_replay.validates_public_memory_conflict_replay, concept.agent_reliability_and_safety_validator_bundle, provisional principle refs P-1 and P-2, and provisional axiom ref AX-1. This Markdown does not promote those placeholder refs into stronger doctrine ids; it explains how the concrete mechanism satisfies the current bundle boundary.

Mechanically, the governing relation is evidence-before-memory-authority: memory rows may influence replay only after they carry route refs, public evidence handles, metadata-only non-public refs, conflict-edge or downgrade result records when stale state changes, and paired replay result records. The concept relation is validator-bundle accountability: the module is not a narrative claim about agents remembering well, but an executable fixture whose policy, trace, source-body manifest, negative cases, and result records must all agree. The axiom/principle ceiling is the same one enforced by the validator: private state is not public source authority, synthetic replay is not live product evidence, and projection-ready result records cannot authorize source-file changes, external model access, public sharing, or launch.

Sleeper Memory Poisoning Quarantine Replay

Replays a recorded memory-tamper case, checking that its declared quarantine, block, and delete steps line up. · 3/5

Does This is a worked, made-up example of how an agent should handle a tampered "memory": a poisoned note gets spotted, held in quarantine, blocked from being used in any later action, and finally deleted with an audit, with a clean re-run recorded showing the bad memory is no longer there. The component does not actually delete anything or perform the re-run; it reads the on-disk record of that whole guard-and-cleanup sequence and checks that every required step and result record lines up, so the exact checks that must hold are visible. It works entirely from synthetic refs and metadata, with no private memory or transcripts exposed.

Scope limit It only checks the structural shape and internal consistency of a synthetic memory-security policy projection recorded as JSON. It does not run or validate any real memory store, does not itself quarantine, delete, or re-run anything, and does not establish that any system actually resists poisoning. It exports no private memory bodies or transcripts, calls no providers, mutates no source, produces no benchmark claims, and excludes launch (all scope limit flags are hardcoded false).

Run

PYTHONPATH=src python3 -m microcosm_core.organs.sleeper_memory_poisoning_quarantine_replay run --input fixtures/first_wave/sleeper_memory_poisoning_quarantine_replay/input --out receipts/first_wave/sleeper_memory_poisoning_quarantine_replay --acceptance-out receipts/acceptance/first_wave/sleeper_memory_poisoning_quarantine_replay_fixture_acceptance.json

Evidence Computed projection · evidence 3/5 · Source-faithful refactor

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Sleeper Memory Poisoning Quarantine Replay

Explains component Sleeper Memory Poisoning Quarantine Replay · mechanism validates public sleeper memory poisoning quarantine replay

Governed by principles Preserve provenance across every boundary · Carry basis and provenance together · concept agent reliability and safety validators as bounded public scope limits

Abides by axiom Provenance propagation and non-interference

Depends on paper modules

Purpose

Persistent agent memory is an attack surface. If an agent reads a poisoned source in one session and writes a memory from it, that memory can quietly shape a later session's actions, long after the poisoning is out of view. This module asks one question: if an agent quarantines a poisoned memory write, can it show, from result records alone, that the quarantine actually held when the memory was retrieved later, and that a rollback genuinely removed it?

The interesting part is that the runtime grades the whole chain, not just the final answer. A naive memory-security story checks that the agent reached the right conclusion. This validator refuses that. It requires the poisoned write to carry provenance, the later retrieval to be blocked before any action and to cite the same memory ref the write quarantined, and the rollback to carry a deletion audit ref, a cold-rerun result record, and proof that the memory is absent after the rerun. A blocked retrieval that cannot name the quarantine audit ref or the cold-replay result record for the memory it gates is treated as unproven, not as a pass.

It is a synthetic fixture, deliberately narrow. The inputs are public metadata rows, never live user memory or private bodies, and the result records carry refs, hashes, counts, and verdicts rather than any memory text. It borrows the control shape from prior work on sleeper triggers and memory poisoning; it does not secure a live memory system or claim a benchmark result.

Abstract

This module is the public Microcosm projection of a persistent-memory security claim contract. It is a synthetic replay fixture, not a live memory product, live user memory import, benchmark security result, private memory export, or launch claim.

The fixture models four public sessions: a poisoned source bundle is seen, a memory write proposal is quarantined, later retrieval is blocked before action, and rollback plus cold rerun proves the poisoned memory is absent at the result record boundary. The claim is admitted only when source bundle refs, provenance refs, quarantine verdicts, classifier labels, retrieval influence gates, rollback audit refs, rerun result records, negative cases, and scope limits line up.

Shape

Diagram source

flowchart TD
  inputs["Public metadata inputs: sessions, write proposals, retrieval replays, rollback rows"]
  subgraph Gates["Four ordered gates"]
    provenance["Provenance gate: poisoned write quarantined + provenance-bound control admitted"]
    influence["Delayed-influence gate: later retrieval blocked before action, same memory ref + audit + cold-replay ref"]
    rollback["Rollback gate: deletion audit + cold-rerun result record + memory absent after rerun"]
    bodies["Source-body gate: copied bodies digest-checked, result records stay metadata-only"]
  end
  negatives["Eight negative cases: each must be observed as a typed finding"]
  result_records["metadata-only result records: refs, hashes, counts, verdicts"]
  ceiling["Scope limit: synthetic replay only"]
  inputs --> provenance
  provenance -->|quarantined memory ref| influence
  influence --> rollback
  rollback --> bodies
  inputs --> negatives
  provenance --> result_records
  influence --> result_records
  rollback --> result_records
  bodies --> result_records
  negatives --> result_records
  result_records --> ceiling

The module's shape is a public memory-security replay, not a live memory product. Public metadata inputs pass through four ordered gates: provenance, delayed influence, rollback, and source-body handling. The delayed-influence gate is coupled to the provenance gate, so a blocked retrieval must target the same memory ref the write quarantined and cite its audit and cold-replay refs. Alongside the positive chain, eight negative cases must each surface as a typed finding, and every path lands in metadata-only result records under the scope limit.

Mechanism

The mechanism is a replay reducer over public metadata, not a memory runtime. src/microcosm_core/organs/sleeper_memory_poisoning_quarantine_replay.py loads six positive input families through _build_result (projection protocol, memory policy, session chain, quarantine events, retrieval replays, and rollback/cold-rerun rows), plus the source-module manifest. When run is used on the first-wave fixture it also loads the expected negative fixtures; when run_quarantine_bundle is used on the exported bundle it validates the public bundle without treating that bundle as negative-case authority.

The first gate is provenance. validate_memory_write_proposals accepts a memory write only when it carries the required source bundle ref, provenance ref, trust tier, classifier labels, audit ref, quarantine verdict, and redacted body posture. An untrusted source context with the sleeper-poisoning classifier cannot silently become trusted memory. Missing provenance, private memory body export, raw transcript export, live user-memory claims, and trusted promotion from untrusted context become typed findings instead of admissible memory authority.
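The provenance gate can be pictured with a minimal sketch. It is not the module's validate_memory_write_proposals; the field names, the "sleeper_poisoning" label, and the finding ids are assumptions chosen to mirror the prose above.

```python
# Hypothetical field names; the real fixture schema may differ.
REQUIRED_WRITE_FIELDS = (
    "source_bundle_ref",
    "provenance_ref",
    "trust_tier",
    "classifier_labels",
    "audit_ref",
    "quarantine_verdict",
    "body_posture",
)


def check_write_proposal(row: dict) -> list[str]:
    """Return typed findings for one write proposal (empty list = admissible)."""
    # Every required field must be present and non-empty.
    findings = [f"missing_{name}" for name in REQUIRED_WRITE_FIELDS if not row.get(name)]

    labels = row.get("classifier_labels") or []
    if row.get("trust_tier") == "untrusted" and "sleeper_poisoning" in labels:
        # Untrusted, poison-labelled context must end in quarantine.
        if row.get("quarantine_verdict") != "quarantined":
            findings.append("trusted_promotion_from_untrusted_context")

    # Only a redacted body posture is acceptable in a metadata-only row.
    if row.get("body_posture") not in (None, "redacted"):
        findings.append("private_memory_body_export")
    return findings


good_write = {
    "source_bundle_ref": "bundle:poisoned-001",
    "provenance_ref": "prov:001",
    "trust_tier": "untrusted",
    "classifier_labels": ["sleeper_poisoning"],
    "audit_ref": "audit:q-001",
    "quarantine_verdict": "quarantined",
    "body_posture": "redacted",
}
bad_write = dict(good_write, provenance_ref="", quarantine_verdict="admitted", body_posture="raw")
```

The sketch returns findings rather than raising, which matches the idea that failures become typed findings instead of admissible memory authority.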

The second gate is delayed influence. validate_retrieval_replays checks that later retrieval of the quarantined memory is blocked before any action can use it. The row must carry a retrieval ref, influence grade, action gate, cold replay result record ref, public evidence refs, and a quarantine audit ref coupling back to the write proposal it gates. This is the anti-final-answer check: the runtime rejects a memory-security story that grades only the final answer while omitting retrieval, influence, or rerun evidence.
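A minimal sketch of the coupling between a retrieval replay and the write it gates, under assumed field names (memory_ref, quarantine_audit_ref, action_gate, cold_replay_ref) and assumed finding ids. It illustrates the idea only and is not the module's validate_retrieval_replays.

```python
def check_retrieval_replay(replay: dict, quarantined_writes: dict[str, dict]) -> list[str]:
    """Check one retrieval replay row against the quarantined writes, keyed by memory ref."""
    findings = []

    # The replay must target a memory ref that a write proposal actually quarantined.
    write = quarantined_writes.get(replay.get("memory_ref"))
    if write is None:
        findings.append("retrieval_targets_unquarantined_memory")
    elif replay.get("quarantine_audit_ref") != write["audit_ref"]:
        # The replay cites an audit ref that does not belong to the gated write.
        findings.append("quarantine_audit_ref_mismatch")

    # Blocking must happen before any action can consume the memory.
    if replay.get("action_gate") != "blocked_before_action":
        findings.append("retrieval_not_blocked_before_action")

    # A block without cold-replay evidence is unproven, not a pass.
    if not replay.get("cold_replay_ref"):
        findings.append("missing_cold_replay_ref")
    return findings


writes = {"mem:001": {"audit_ref": "audit:q-001"}}
ok_replay = {
    "memory_ref": "mem:001",
    "quarantine_audit_ref": "audit:q-001",
    "action_gate": "blocked_before_action",
    "cold_replay_ref": "replay:c-001",
}
```

Keying the quarantined writes by memory ref is one way to make the "same memory ref" requirement a lookup rather than a string comparison scattered across rows.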

The third gate is rollback. validate_rollback_rerun requires a rollback result record ref, deletion audit ref, rerun result record ref, and memory_absent_after_rerun=true. Rollback language is therefore admitted only when deletion and cold rerun are both present. Tests mutate these fields to show that nonempty but bogus rollback refs, missing result record refs, and absence failures block rather than becoming evidence.
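The distinction between a nonempty ref and a ref that resolves can be sketched as follows. Only memory_absent_after_rerun comes from the text above; the other field names, the known_refs set, and the finding ids are assumptions for illustration.

```python
def check_rollback_row(row: dict, known_refs: set[str]) -> list[str]:
    """Check one rollback/cold-rerun row; refs must be present and resolvable."""
    findings = []
    for name in ("rollback_ref", "deletion_audit_ref", "rerun_ref"):
        ref = row.get(name)
        if not ref:
            findings.append(f"missing_{name}")
        elif ref not in known_refs:
            # A nonempty but bogus ref blocks instead of counting as evidence.
            findings.append(f"unresolved_{name}")

    # Absence after the cold rerun must be explicitly true, not merely truthy.
    if row.get("memory_absent_after_rerun") is not True:
        findings.append("memory_not_shown_absent_after_rerun")
    return findings


known = {"rollback:r-1", "audit:d-1", "rerun:c-1"}
good_rollback = {
    "rollback_ref": "rollback:r-1",
    "deletion_audit_ref": "audit:d-1",
    "rerun_ref": "rerun:c-1",
    "memory_absent_after_rerun": True,
}
```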

The fourth gate is source-open body handling. validate_source_module_manifest and _source_open_body_import_summary verify seven copied public source bodies, their declared material classes, their digest fields, and their metadata-only result record posture. _write_receipts and result_card then expose public ids, counts, refs, digests, verdicts, negative-case status, and scope limits while omitting retrieval rows, rollback rows, and copied source bodies from command cards.
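As a rough illustration of the card omission described above, the sketch below projects a full result into a card that carries status, counts, and digests but no rows or bodies. The result layout and key names are assumptions, and this is not the module's result_card.

```python
# Hypothetical names for the row families that a card must not embed.
OMITTED_FAMILIES = ("retrieval_rows", "rollback_rows", "source_bodies")


def build_card(result: dict) -> dict:
    """Project a result into a metadata-only card: counts and digests, never bodies."""
    return {
        "status": result["status"],
        "counts": {name: len(result.get(name, [])) for name in OMITTED_FAMILIES},
        "digests": [body["sha256"] for body in result.get("source_bodies", [])],
        "omitted": list(OMITTED_FAMILIES),
        "body_in_receipt": False,
    }


example_result = {
    "status": "pass",
    "retrieval_rows": [{"memory_ref": "mem:001"}],
    "rollback_rows": [{"rollback_ref": "rollback:r-1"}],
    "source_bodies": [{"sha256": "ab12", "text": "copied source body text"}],
}
card = build_card(example_result)
```

Building the card from an explicit allow-list of derived values, instead of deleting keys from the result, means a new row family is omitted by default.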

The proof consumer is therefore a pair of bounded runs plus focused tests: run proves the first-wave fixture with expected negative cases; run_quarantine_bundle proves the exported public bundle; and tests/test_sleeper_memory_poisoning_quarantine_replay.py verifies mutated positive rows, stale baked labels, retrieval/quarantine coupling, rollback result record shape, source-body digest checks, public-relative redaction, and card payload omission.

Reader Evidence Routing

Bundle route: core/paper_module_capsules.json::paper_modules[37:paper_module.sleeper_memory_poisoning_quarantine_replay] is the JSON authority row. A diagram view and an atlas card are generated for this module.
Mechanism route: core/mechanism_sources.json::mechanism.sleeper_memory_poisoning_quarantine_replay.validates_public_sleeper_memory_poisoning_quarantine_replay binds the code locus, input refs, result record refs, validator commands, focused regression, and guardrails.
Runtime route: src/microcosm_core/organs/sleeper_memory_poisoning_quarantine_replay.py owns run, run_quarantine_bundle, _build_result, _write_receipts, result_card, EXPECTED_NEGATIVE_CASES, AUTHORITY_CEILING, and the metadata-only source-module import checks.
Exported-bundle route: examples/sleeper_memory_poisoning_quarantine_replay/exported_sleeper_memory_poisoning_bundle contains bundle_manifest.json, projection_protocol.json, memory_policy.json, session_chain.json, quarantine_events.json, retrieval_replays.json, rollback_rerun.json, and source_module_manifest.json.
Source-module route: source_module_manifest.json records seven copied public source bodies, including the growth result record, memory-plane paper modules, operator-memory tests, agent execution trace runtime, strict JSON helper, and agent execution trace standard; result records keep source bodies out with body_in_receipt: false.
Focused-test route: tests/test_sleeper_memory_poisoning_quarantine_replay.py verifies negative cases, public-relative redacted result records, exported-bundle runtime shape, digest mismatch rejection, exact copied source bodies, and card result record reuse.

Prior Art Grounding

This component combines two prior-art lines: sleeper/deceptive trigger behavior and long-term-memory/RAG poisoning. The sleeper-trigger lineage is Anthropic's Sleeper Agents. The memory-poisoning lineage includes AgentPoison, MemoryGraft, and Hidden in Memory, which all treat retrieved or persistent agent memory as an attack surface rather than a neutral cache.

Microcosm does not claim to secure live memory systems. It borrows the control shape: memory writes need provenance, untrusted source context cannot silently become authority, later retrieval must pass an influence gate, and deletion or rollback needs an audit ref plus cold rerun evidence.

Validation Result Record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m microcosm_core.organs.sleeper_memory_poisoning_quarantine_replay run-quarantine-bundle --input examples/sleeper_memory_poisoning_quarantine_replay/exported_sleeper_memory_poisoning_bundle --out /tmp/sleeper_memory_poisoning_bundle_receipt --card > /tmp/sleeper_memory_poisoning_bundle_card.json

The focused regression test and corpus projection checks are:

cd microcosm-substrate && ../repo-pytest tests/test_sleeper_memory_poisoning_quarantine_replay.py
./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

The result record path proves synthetic memory-poisoning quarantine replay over public metadata refs, not live memory safety, provider behavior, or benchmark security.

Scope boundary

This module does not run live memory, claim memory product quality, import live user memory, export private memory bodies or raw transcripts, promote untrusted context into trusted memory, use external model services, change source files, claim benchmark security, publish results, or include launch operations.

Scope limit

This module may claim synthetic sleeper-memory poisoning quarantine replay over public metadata refs: source bundle refs, provenance refs, quarantine verdicts, classifier labels, retrieval influence gates, rollback audit refs, cold rerun result records, expected negative cases, source-module digest checks, metadata-only result records, and validation result records.

It does not claim live memory product quality, live user-memory handling, trusted promotion from untrusted context, provider behavior, source-file changes, benchmark security, private memory export, public sharing, launch-scope decision, or whole-system correctness. The generated diagram and atlas card are navigation aids, not security benchmark results.

MCP Tool Authority Replay

Audits a recorded tool-use log to confirm each action was scoped, approved, undoable, and fenced. · 3/5

Does When an AI agent uses outside tools (look something up, change a ticket, read a result from an untrusted source), this checks a recorded log of that tool use to confirm each action was properly fenced: bound to a narrow permission, approved before it changed anything, given a way to undo it, and kept from letting untrusted tool output boss the agent around. It runs on a small built-in make-believe example that carries only labels and references (no real accounts, secrets, or tool contents), so the safety checks an agent's tool use is supposed to pass are inspectable without anything touching a live account.

Scope limit It only checks that the tool-authority evidence in a recorded bundle (scopes, approvals, rollbacks, instruction/data splits, cold replays, redaction, and the expected abuse-case failures) is present and internally consistent. It does not run tools or authorize live MCP/account access, account secret or payload export, treating tool output as instruction, source-file changes, benchmark safety scores, or launch, and it makes no claim that the underlying tool-use policy is domain-correct.

Run

PYTHONPATH=src python3 -m microcosm_core.organs.mcp_tool_authority_replay run --input fixtures/first_wave/mcp_tool_authority_replay/input --out receipts/first_wave/mcp_tool_authority_replay

Evidence Computed projection · evidence 3/5 · Source-faithful refactor

Links to Agent Sandbox Policy Escape Replay, Sleeper Memory Poisoning Quarantine Replay

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module MCP Tool Authority Replay

Explains component MCP Tool Authority Replay · mechanism validates public mcp tool authority replay

Governed by principles Possession is not permission · Bind authority to transaction scope · concept agent reliability and safety validators as bounded public scope limits

Abides by axiom Authority by derivation, not possession

Depends on paper module Source Projection Import Protocol

This module is the public Microcosm projection of a tool-authority claim contract. It is a synthetic MCP-like replay fixture, not a live MCP account test, external model access, account secret-handling certification, benchmark security result, or launch claim.

The fixture models three public tools: a readonly docs lookup, a write-capable ticket update, and an untrusted result source. The claim is admitted only when tool manifest scope refs, call argument hashes, approval token refs, side-effect ledger refs, rollback result records, untrusted-output instruction/data splits, cold replay result records, negative cases, and scope limits line up.

Purpose

When an agent uses tools through a protocol like MCP, the sentence "the agent used the tool safely" is cheap to write and hard to back. This component answers one question: given a recorded tool-use trace, does the evidence actually support the authority the trace claims, or is the safety language unearned? It exists so that tool-authority claims have to be replayed against metadata before prose is allowed to call them safe.

The approach is to treat a tool call as a small transaction that must show its working. Each call cites a narrow capability scope, an argument hash, and, if it writes, an approval token, a side-effect ledger entry, and a rollback result record. Those references are not taken on trust: the side-effect ledger and the cold replay rows are cross-checked against the accepted call rows by call id, so a rollback result record that no call refers to, or a write that skips approval, is caught rather than waved through. The point is that a reference string is not authority until something downstream resolves it.
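The cross-check by call id can be sketched as below. The row shapes (call_id, writes, approval_token_ref) and the finding ids are assumptions, and the sketch is not the module's validate_side_effect_ledger or validate_cold_replay.

```python
def cross_check(calls: list[dict], ledger: list[dict], replays: list[dict]) -> list[tuple[str, str]]:
    """Resolve ledger and cold-replay rows against accepted calls by call id."""
    findings = []
    call_ids = {call["call_id"] for call in calls}
    ledgered = {entry["call_id"] for entry in ledger}

    # A ledger entry or replay row that no call refers to is an orphan.
    for entry in ledger:
        if entry["call_id"] not in call_ids:
            findings.append(("orphan_ledger_entry", entry["call_id"]))
    for row in replays:
        if row["call_id"] not in call_ids:
            findings.append(("orphan_replay_row", row["call_id"]))

    # A write must carry an approval ref and appear in the ledger.
    for call in calls:
        if call.get("writes"):
            if not call.get("approval_token_ref"):
                findings.append(("unapproved_side_effect", call["call_id"]))
            if call["call_id"] not in ledgered:
                findings.append(("write_without_ledger_entry", call["call_id"]))
    return findings


calls = [
    {"call_id": "c1", "writes": False},
    {"call_id": "c2", "writes": True, "approval_token_ref": "appr:1"},
]
ledger = [{"call_id": "c2", "rollback_ref": "rb:1"}]
replays = [{"call_id": "c1"}, {"call_id": "c2"}]
```

The check runs in both directions: orphan rows catch references nothing points to, and the per-call loop catches writes that skipped approval or the ledger.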

Two failure modes are worth naming because they are specific to tool-using agents. The first is the confused deputy: a call that asks for a scope wider than its task needs (*, account_full_access) is rejected before it runs, so a tool cannot quietly borrow more authority than it was granted. The second is tool-output-as-instruction, the prompt-injection shape where text returned by an untrusted tool is obeyed as a command. Here untrusted output must stay data and cite an instruction/data split; a row that lets output become instruction is one of the eight negative cases the fixture is built to catch.
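Both failure modes can be expressed as a small per-call check. The two overbroad scope strings come from the paragraph above; the field names (capability_scope_ref, source_trust, output_role, instruction_data_split_ref) and finding ids are hypothetical.

```python
# Scope strings named in the text as wider than any single task needs.
OVERBROAD_SCOPES = {"*", "account_full_access"}


def check_call_authority(call: dict) -> list[str]:
    """Flag confused-deputy scopes and tool output treated as instruction."""
    findings = []

    scope = call.get("capability_scope_ref")
    if not scope:
        findings.append("missing_capability_scope")
    elif scope in OVERBROAD_SCOPES:
        findings.append("overbroad_scope")

    # Output from an untrusted tool must stay data and cite a split ref.
    if call.get("source_trust") == "untrusted":
        if call.get("output_role") != "data":
            findings.append("tool_output_as_instruction")
        if not call.get("instruction_data_split_ref"):
            findings.append("missing_instruction_data_split_ref")
    return findings


docs_lookup = {"capability_scope_ref": "scope:docs.read", "source_trust": "trusted"}
wide_call = {"capability_scope_ref": "*", "source_trust": "trusted"}
injected = {
    "capability_scope_ref": "scope:results.read",
    "source_trust": "untrusted",
    "output_role": "instruction",
    "instruction_data_split_ref": "split:1",
}
```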

This is deliberately a synthetic replay, not a live test. The component never opens an MCP account, calls a provider, or handles an account secret. It reads only public metadata and digests, and it keeps every payload, result body, and account secret out of the result records it writes. What it offers is narrow and honest: a way to check that a tool-authority story is internally consistent and metadata-only, not a certificate that any real tool integration is secure.

Shape

Diagram source

flowchart TD
  json_bundle["JSON bundle authority"]
  markdown["Markdown reader projection"]
  mechanism["mechanism source row"]
  component["MCP authority runtime"]
  fixture["first-wave fixture"]
  exported_bundle["exported authority bundle"]
  manifest["tool manifest"]
  calls["scoped tool calls"]
  results["tool result rows"]
  side_effects["side-effect ledger"]
  replay["cold replay rows"]
  trace["public trace spans"]
  source_modules["source-module body floor"]
  result_records["metadata-only result records"]
  ceiling["scope limit"]
  json_bundle --> markdown
  json_bundle --> mechanism
  mechanism --> component
  component --> fixture
  component --> exported_bundle
  exported_bundle --> manifest
  manifest --> calls
  calls --> results
  calls --> side_effects
  side_effects --> replay
  results --> trace
  replay --> trace
  source_modules --> trace
  trace --> result_records
  result_records --> ceiling

The module's shape is a public tool-authority replay, not a live MCP security claim.

Technical Mechanism

The replay is a fail-closed authority lattice over a synthetic MCP-like tool story. _build_result loads the fixture or exported bundle, runs load_forbidden_classes and scan_paths over input JSON and copied source modules, then validates each contract plane separately: projection protocol, tool policy, tool manifest, tool calls, tool results, side-effect ledger, cold replay rows, public trace spans, and source-module manifest rows. The final status is pass only when every sub-validator passes, no expected negative case is missing, the secret scan has zero blocking hits, and the source-module floor is either present when required or explicitly not required for the first-wave fixture.
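The fail-closed status rule in the last sentence can be written as a single conjunction. This is a sketch under assumed argument names and an assumed "blocked" status string; only the four conditions themselves come from the paragraph above.

```python
def final_status(
    sub_results: dict[str, bool],
    expected_negatives: set[str],
    observed_negatives: set[str],
    blocking_secret_hits: int,
    floor_required: bool,
    floor_present: bool,
) -> str:
    """Return "pass" only when every condition holds; anything else is blocked."""
    ok = (
        bool(sub_results)                              # no validators run means no pass
        and all(sub_results.values())                  # every sub-validator passed
        and expected_negatives <= observed_negatives   # no expected negative case missing
        and blocking_secret_hits == 0                  # secret scan is clean
        and (floor_present or not floor_required)      # body floor present when required
    )
    return "pass" if ok else "blocked"


planes = {"tool_policy": True, "tool_manifest": True, "tool_calls": True}
expected = {"overbroad_scope", "unapproved_side_effect"}
```

The explicit bool(sub_results) guard is a design choice in this sketch: without it, an empty validator set would pass vacuously, which is the opposite of fail-closed.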

Tool authority is checked before prose can turn it into evidence. Manifest rows define the declared tool ids and allowed tool classes. validate_tool_calls then rejects undeclared tool ids, overbroad scopes, missing argument hashes, hidden account secret export, live account access, unapproved side effects, tool-output-as-instruction, final-answer-only grading, and unredacted payload export. Write-capable calls must carry approval token refs, side-effect ledger refs, and rollback result record refs; untrusted-result calls must keep instruction and data boundaries explicit. validate_side_effect_ledger and validate_cold_replay make those refs observable instead of leaving them as decorative strings.

The exported-bundle path adds a body-floor check without moving bodies into result records. _source_module_manifest_result streams digest verification over each copied source module, checks required anchors, requires body_in_receipt: false, and reports a metadata-only source-open import summary. build_public_mcp_tool_authority_trace contributes three public trace spans for the tool calls; _body_import_verification binds that public refactor back to the source and Microcosm target digests. _write_receipts emits the result, board, validation result record, and sign-off result record for fixture runs, and result_card deliberately exposes counts, status bits, digest freshness, and omission result records rather than tool rows or source bodies.
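Streaming digest verification can be sketched with the standard library alone. The manifest row keys (target_ref, sha256) and the choice of SHA-256 are assumptions, anchor checks are left out, and this is not the module's _source_module_manifest_result.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash a binary stream chunk by chunk so the body is never held whole."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_body(row: dict, stream: BinaryIO) -> dict:
    """Compare a copied body against its manifest digest; report metadata only."""
    return {
        "target_ref": row["target_ref"],
        "digest_ok": sha256_stream(stream) == row["sha256"],
        "body_in_receipt": False,  # the body itself never enters the record
    }


body = b"x" * 200_000  # larger than one chunk, so several reads are needed
manifest_row = {
    "target_ref": "source_modules/strict_json.py",
    "sha256": hashlib.sha256(body).hexdigest(),
}
fresh = verify_body(manifest_row, io.BytesIO(body))
tampered = verify_body(manifest_row, io.BytesIO(body + b"!"))
```

In real use the stream would come from opening the copied file in binary mode; io.BytesIO stands in here so the example runs without touching disk.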

The mechanism is intentionally narrower than a tool-use security benchmark. It accepts only public metadata and digest evidence, and it treats every generated projection as a result record over source rows rather than as source authority.

Public Mechanics

Every tool call must bind to a narrow capability scope ref before admission.
Write-capable calls require approval token refs, side-effect ledger refs, and rollback result record refs.
Untrusted tool output is data, not instruction, and must cite an instruction/data split ref.
Call arguments, tool outputs, account refs, and result bodies stay redacted or metadata-only.
Overbroad scopes, hidden account secret export, tool-output-as-instruction, unapproved side effects, live account access, final-answer-only grading, missing rollback result records, and unredacted tool payloads are expected falsification fixtures.
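The last rule above implies a coverage check: each expected falsification fixture has to be observed as a finding. A minimal sketch follows, with case ids invented from the eight names in that list; the module's own EXPECTED_NEGATIVE_CASES constant may use different ids.

```python
# Hypothetical ids derived from the eight falsification fixtures named above.
EXPECTED_CASE_IDS = frozenset({
    "overbroad_scope",
    "hidden_account_secret_export",
    "tool_output_as_instruction",
    "unapproved_side_effect",
    "live_account_access",
    "final_answer_only_grading",
    "missing_rollback_result_record",
    "unredacted_tool_payload",
})


def negative_case_report(observed: set[str]) -> dict:
    """Compare observed findings with the expected negative-case matrix."""
    missing = sorted(EXPECTED_CASE_IDS - observed)
    unexpected = sorted(observed - EXPECTED_CASE_IDS)
    # A missing case means a fixture failed to trip its check, so the run is not ok.
    return {"missing": missing, "unexpected": unexpected, "ok": not missing}


full = negative_case_report(set(EXPECTED_CASE_IDS))
partial = negative_case_report(set(EXPECTED_CASE_IDS) - {"live_account_access"})
```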

Reader Evidence Routing

Bundle route: core/paper_module_capsules.json::paper_modules[39:paper_module.mcp_tool_authority_replay] is the JSON authority row; a diagram view and an atlas card are generated for this module from the source record.
Mechanism route: core/mechanism_sources.json::mechanism.mcp_tool_authority_replay.validates_public_mcp_tool_authority_replay binds the code locus, fixture refs, exported bundle refs, result record refs, validator commands, focused regression, and guardrails.
Runtime route: src/microcosm_core/organs/mcp_tool_authority_replay.py owns run, run_tool_authority_bundle, _build_result, _write_receipts, result_card, EXPECTED_NEGATIVE_CASES, and AUTHORITY_CEILING.
Exported-bundle route: examples/mcp_tool_authority_replay/exported_mcp_tool_authority_bundle contains bundle_manifest.json, projection_protocol.json, tool_policy.json, tool_manifest.json, tool_calls.json, tool_results.json, side_effect_ledger.json, cold_replay.json, and source_module_manifest.json.
Source-module route: source_module_manifest.json records seven copied public source body rows, while the exported source-open body summary exposes at least six body materials. The floor includes high-novelty and extracted-pattern evidence, agent execution trace runtime and standard bodies, route-readiness standard material, mission-transaction preflight internal control material, and the strict JSON helper. Result records carry refs, digests, counts, and status only.
Focused-test route: tests/test_mcp_tool_authority_replay.py verifies negative cases, public-relative redacted result records, exported-bundle runtime shape, source-module digest failures, exact copied source bodies, card result record reuse, and public trace span construction.

Named Proof Consumers

First-wave fixture consumer: PYTHONPATH=src ../repo-python -m microcosm_core.organs.mcp_tool_authority_replay run --input fixtures/first_wave/mcp_tool_authority_replay/input --out /tmp/microcosm-mcp-tool-authority-replay/fixture --sign-off-out /tmp/microcosm-mcp-tool-authority-replay/sign-off.json --card consumes the fixture route, expected negative cases, secret scan, public trace construction, scope limit, metadata-only result record writer, and command-card omission contract.
Exported-bundle consumer: PYTHONPATH=src ../repo-python -m microcosm_core.organs.mcp_tool_authority_replay run-tool-authority-bundle --input examples/mcp_tool_authority_replay/exported_mcp_tool_authority_bundle --out /tmp/microcosm-mcp-tool-authority-replay/bundle --card consumes the public bundle, source-module manifest, copied body-floor digest checks, public trace spans, metadata-only exported-bundle result record, and fresh card reuse path.
Focused regression consumer: PYTHONPATH=src ../repo-python -m pytest -p no:cacheprovider tests/test_mcp_tool_authority_replay.py -q pins the negative-case matrix, undeclared-tool rejection, redacted/public result record paths, source-module digest mismatch failures, exact copied source bodies, public trace span coverage, and card omission behavior.
This check excludes hand-editing generated projections; it is a read-only result record for this Markdown slice.

Prior Art Grounding

This component is grounded in capability security, least privilege, and current MCP authorization guidance. The classic security lineage is Saltzer and Schroeder's Protection of Information in Computer Systems and Hardy's Confused Deputy: authority should be narrow, mediated, and bound to the object/action being requested. The MCP-specific lineage is the official MCP authorization and security best practices guidance, especially least-privilege scopes and token audience boundaries.

Microcosm does not claim live MCP security or account certification. It borrows the prior-art authority shape and makes it replayable: every public tool story must expose capability scope refs, approval refs, side-effect refs, rollback refs, instruction/data split refs, and scope boundaries before write authority is treated as evidence.

Validation Result Record Path

Run the first-wave fixture into disposable result records from the Microcosm root:

Run the exported bundle through the same component:

cd microcosm-substrate
PYTHONPATH=src ../repo-python -m microcosm_core.organs.mcp_tool_authority_replay run-tool-authority-bundle --input examples/mcp_tool_authority_replay/exported_mcp_tool_authority_bundle --out /tmp/microcosm_mcp_tool_authority_bundle

cd microcosm-substrate
PYTHONPATH=src ../repo-python -m pytest -p no:cacheprovider tests/test_mcp_tool_authority_replay.py -q
PYTHONPATH=src ../repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

Scope limit

This module may claim only that a synthetic public MCP-like replay preserved tool-authority boundaries over metadata rows: capability scopes, argument hashes, approval refs, side-effect refs, rollback refs, instruction/data split refs, cold replay refs, source-module digests, negative cases, and metadata-only validation result records.

It must not claim live MCP account safety, account secret-handling certification, live tool behavior, provider behavior, benchmark security, source-file changes, publishing-scope decision, complete security, or launch-scope decision.

Scope boundary

This module does not access live MCP accounts, export account secrets or model-output data, obey tool output as instruction, run live tools, change source files, claim benchmark safety, publish results, or include launch operations.

Source and projection details

Governing Lattice Relation

The source record declares concept.agent_reliability_and_safety_validator_bundle, principles P-4 and P-16, and axiom AX-3 as the governing lattice. The generated component row repeats the paper and mechanism links and adds a component-level P-18 relation; that component relation is useful context but does not expand the paper-module bundle's declared authority.

P-4 and AX-3 are the local authority rule: a tool handle, account secret-shaped string, role name, or trusted-session label is at most bounded evidence of authority, not authority itself. The runtime must derive authority from dereferenced manifest policy, capability scope refs, approval refs, side-effect refs, rollback refs, cold replay refs, and public trace spans. P-16 supplies the transaction boundary: a write-capable tool call is admissible only when the call is scoped, the side effect is ledgered, rollback evidence is present, and the result record says which scope limit still holds.

The mechanism row deliberately leaves sibling/upstream mechanism relations as residual pressure. That residual is part of the scope limit: this module can show a public replay lattice for synthetic MCP-like tool authority, but it does not infer neighbouring mechanisms from prose, certify live MCP security, or promote generated Mermaid, Atlas, site, or corpus projections into source authority.

Source-Open Body Floor

The exported bundle carries copied source bodies under source_modules/, governed by source_module_manifest.json. The manifest records source refs, target refs, hashes, material classes, required anchors, and result record body exclusions for:

state/microcosm_portfolio/reconstruction/high_novelty_substrate_gap_scout_v1.json
state/microcosm_portfolio/extracted_patterns_ledger.jsonl
system/lib/agent_execution_trace.py
codex/standards/std_agent_execution_trace.json
codex/standards/std_extracted_pattern_route_readiness.json
tools/meta/control/mission_transaction_preflight.py
system/lib/strict_json.py

The floor is source-open body evidence, not live-account or provider authority. Result records and command cards expose refs, digest status, counts, and verdicts only; they do not embed copied source bodies or private/live payload material.

Belief State Process Reward Replay

Checks that each step reward in a recorded run cites a declared verifier-feedback row, not a trick. · 3/5

Does Takes a recorded, synthetic bundle of agent steps on three partially observable toy tasks (a terminal investigation, a mock purchase, a small planner) and checks that every "the agent did the right step" reward is actually backed by a checkable verifier or observed feedback reference, not by hidden reasoning, formatting tricks, a smuggled answer key, or a final-answer-only score. It does not run or watch a live agent; it validates pre-recorded files. The result record files show, per step, the belief summary, the reward, and whether the reward-hacking and replay checks passed, so a reader can inspect why each reward was or was not allowed.

Scope limit It only checks that the projection's accounting lines up under its own schema rules over recorded synthetic fixtures; it excludes hidden-reasoning export, RL training, hidden gold or neural-judge-only labels, benchmark-performance claims, external model access, source-file changes, or launch, and proves nothing about real-world reward, live agent behavior, or domain-level conclusions.

Run

PYTHONPATH=src python3 -m microcosm_core.organs.belief_state_process_reward_replay run --input fixtures/first_wave/belief_state_process_reward_replay/input --out .microcosm/belief_state_process_reward_replay

Evidence Computed projection · evidence 3/5 · Source-faithful refactor

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Belief-State Process Reward Replay

Explains component Belief State Process Reward Replay · mechanism validates public belief state process reward replay

Governed by principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by axiom Derivation before assertion

Depends on paper module Agent Route Observability Runtime

This module is the public Microcosm projection of a belief-state process reward claim contract. It is backed by the public agent-execution trace refactor lane plus copied source bodies. It is not a hidden-reasoning export, live RL run, neural-judge-only label set, hidden-gold benchmark, external model access, source-file change, benchmark-score claim, or launch claim.

The public bundle models three partially observable tasks: terminal investigation, mock purchase, and formal-planning toy. A process-reward claim is admitted only when public observation digests, typed belief-state summaries, predicted next evidence, verifier or observed feedback refs, belief-discrepancy scores, dense process rewards, outcome rewards, reward-hacking trap results, trajectory groups, cold replay result records, negative cases, scope limits, and a source-faithful public trace span set line up.

Purpose

Process-reward language is easy to assert and hard to verify. A row can claim that a step earned a reward "for good reasoning" while the underlying evidence is a hidden gold label, a neural-judge guess, or formatting that gamed the scorer. This component exists to answer one narrow question: does a public process-reward claim actually reconstruct from lower-level public evidence, or is it just a label asserting its own correctness?

The interesting part is the recomputation. The validator does not trust any single fixture file. Instead it follows the refs each row declares and rebuilds the chain from belief state to verifier feedback to process reward to trajectory group to outcome reward to cold replay. If any of those refs is missing or points somewhere inconsistent, the claim is blocked with a specific reason code rather than passed. A reward cannot point at a belief that points at a different episode, or cite feedback that belongs to another trajectory, and still count.

That cross-referential check is what separates this from a shape linter. The failure mode it guards against is a process-reward claim that looks correct field by field but does not survive being recomputed end to end. Two further design choices keep the result honest: outcome rewards are carried beside process rewards so a final answer cannot be re-labelled as step-level evidence, and every belief summary, feedback ref, and reward event stays metadata-only, so the validator proves the accounting structure without ever reading hidden reasoning.

Shape

The local governing standard is standards/std_microcosm_belief_state_process_reward_replay.json, whose authority boundary is synthetic belief-state process-reward replay only, not live training, benchmark, provider, source-file changes, public sharing, or launch-scope decision.

Source refs

Local standard: std_microcosm_belief_state_process_reward_replay.json
Runtime locus: belief_state_process_reward_replay.py

Diagram source

flowchart TD
  bundle["JSON source record paper_module_capsules.json[36]"]
  standard["Local standard std_microcosm_belief_state_process_reward_replay.json"]
  component["Runtime locus belief_state_process_reward_replay.py"]
  fixtureMode["run (fixture mode) 8 positive + 7 negative inputs"]
  bundleMode["run_reward_bundle (bundle mode) copied-body manifest floor required"]
  floors["Per-file floors projection protocol, reward policy, episodes, belief states, feedback, rewards, trajectory groups, cold replay"]
  recompute["Semantic recompute rebuild belief -> feedback -> process reward -> trajectory -> outcome reward -> cold replay"]
  negatives["Negative cases 7 planted traps must be observed"]
  scan["Secret-exclusion scan plus metadata-only public trace span set"]
  gate{"All floors pass, chain recomputes, every trap observed, no secret hit?"}
  pass["status: pass"]
  blocked["status: blocked with reason codes"]
  result_records["Result records + compact card refs, hashes, counts, verdicts; body_in_receipt false"]
  ceiling["Scope limit source-faithful replay only"]
  bundle --> component
  standard --> component
  component --> fixtureMode
  component --> bundleMode
  fixtureMode --> floors
  bundleMode --> floors
  floors --> recompute
  recompute --> negatives
  negatives --> scan
  scan --> gate
  gate -->|yes| pass
  gate -->|no| blocked
  pass --> result_records
  blocked --> result_records
  result_records --> ceiling

The generated instance reports eight relationship edges and zero unpopulated selective relations: it explains the belief_state_process_reward_replay component and the validating mechanism, is governed by P-1, P-2, and concept.agent_reliability_and_safety_validator_bundle, abides by AX-1, depends on paper_module.agent_route_observability_runtime, and cites src/microcosm_core/organs/belief_state_process_reward_replay.py as the resolved code locus. The component atlas adds the human/agent gloss and result record set; it classifies the evidence as algorithmic_projection and restates that the validator operates on recorded synthetic fixtures rather than live agent behavior.

The fixture manifest fixtures/first_wave/belief_state_process_reward_replay/fixture_manifest.json names eight positive input files and seven planted negative cases: hidden-chain-of-thought export, neural-judge-only labels, hidden gold labels, reward-by-formatting, verifier bypass, benchmark-performance claims, and final-answer-only scoring. The exported bundle manifest carries the source-open body floor: seven copied source modules under source_modules/, checked by digest and anchor refs, with body_text_exported_in_receipts: false. The focused test file tests/test_belief_state_process_reward_replay.py covers the fixture validator, exported bundle validator, public-root copy, negative cases, exact source body imports, and route/result record shape.

The honest ceiling is therefore narrow: this module can support public, metadata-only belief-state process-reward replay over synthetic tasks with verifier-backed process feedback, separated outcome rewards, cold replay, and negative-case coverage. It cannot support hidden reasoning export, live RL, reward-model quality, hidden-gold benchmark claims, provider behavior, source-file changes, publishing-scope decision, launch-scope decision, or whole-system correctness.

Reader Evidence Routing

Read this page from the structured bindings outward. The bindings name the component, mechanism, concept, dependency module, runtime code locus, principle and axiom refs. The fixture result records, bundle result records, and focused test then show the metadata-only replay behavior. This page explains that chain for readers.

Technical Mechanism

The runtime validator is a two-mode replay checker. In fixture mode, run loads eight positive fixture files plus the seven planted negative inputs named by EXPECTED_NEGATIVE_CASES; _build_result then validates the projection protocol, reward policy, task episodes, belief states, verifier feedback, reward events, trajectory groups, cold replay, negative cases, secret-exclusion scan, and public trace projection before _write_receipts writes the result, board, validation, and sign-off result records. A pass requires all required positive floors to pass, every expected negative case to be observed, zero secret-scan blocking hits, public trace status pass, and no positive finding outside the expected falsification cases.
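The pass condition in the paragraph above can be read as a single predicate. The following is a minimal sketch, not the body of _build_result: the function name gate_status, its parameters, the reason-code strings, and the identifier spellings of the seven negative cases are all assumptions made for illustration. Only the seven trap categories and the five pass conditions come from the text.

```python
# Hypothetical sketch of the fixture-mode pass gate; all names are illustrative.
EXPECTED_NEGATIVE_CASES = frozenset({
    "hidden_chain_of_thought_export",
    "neural_judge_only_labels",
    "hidden_gold_labels",
    "reward_by_formatting",
    "verifier_bypass",
    "benchmark_performance_claim",
    "final_answer_only_scoring",
})


def gate_status(floor_results, observed_negatives, secret_hits, trace_status, findings):
    """Return ("pass", []) or ("blocked", reason_codes)."""
    reasons = []
    # Every required positive floor must pass.
    for name, ok in sorted(floor_results.items()):
        if not ok:
            reasons.append(f"floor_failed:{name}")
    # Every planted trap must actually be observed.
    for case in sorted(EXPECTED_NEGATIVE_CASES - set(observed_negatives)):
        reasons.append(f"negative_case_not_observed:{case}")
    # Any blocking secret-scan hit blocks the result.
    if secret_hits:
        reasons.append("secret_scan_blocking_hit")
    if trace_status != "pass":
        reasons.append("public_trace_not_pass")
    # A finding is tolerated only when it belongs to an expected trap.
    for finding in sorted(set(findings) - EXPECTED_NEGATIVE_CASES):
        reasons.append(f"unexpected_finding:{finding}")
    return ("pass" if not reasons else "blocked"), reasons
```

Collecting every reason before deciding, rather than returning on the first failure, keeps a blocked result inspectable in the way the result records are described.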

In exported-bundle mode, run_reward_bundle validates the public bundle without negative inputs and makes the copied-body floor mandatory. The source_module_manifest.json path must declare seven copied source body modules, body_in_receipt: false, body_text_in_receipt: false, public material classes, exact source/target digests, existing copied targets, and all declared anchors. Digest mismatch, missing manifest, wrong body class, result record-body leakage, count mismatch, missing target, and missing anchor cases block the bundle instead of degrading silently.
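A copied-body floor of this kind can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's manifest validator: the manifest field names (modules, target_ref, sha256, required_anchors), the function name check_manifest, and the reason-code strings are hypothetical. The count of seven and the body_in_receipt flag come from the text.

```python
import hashlib
from pathlib import Path

EXPECTED_MODULE_COUNT = 7  # the count the text states for this bundle


def check_manifest(bundle_root, manifest):
    """Return reason codes; an empty list means the copied-body floor holds."""
    reasons = []
    modules = manifest.get("modules", [])  # field name is an assumption
    if len(modules) != EXPECTED_MODULE_COUNT:
        reasons.append("module_count_mismatch")
    # The flag must be explicitly false, not merely absent.
    if manifest.get("body_in_receipt") is not False:
        reasons.append("body_in_receipt_not_false")
    for module in modules:
        target = Path(bundle_root) / module["target_ref"]
        if not target.is_file():
            reasons.append(f"missing_target:{module['target_ref']}")
            continue
        body = target.read_bytes()
        # Recompute the digest from the copied body instead of trusting the manifest.
        if hashlib.sha256(body).hexdigest() != module["sha256"]:
            reasons.append(f"digest_mismatch:{module['target_ref']}")
        text = body.decode("utf-8", errors="replace")
        for anchor in module.get("required_anchors", []):
            if anchor not in text:
                reasons.append(f"missing_anchor:{module['target_ref']}:{anchor}")
    return reasons
```

Each failure becomes its own reason code, which mirrors the stated behavior of blocking the bundle instead of degrading silently.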

Between the per-file floors and the result records sits validate_semantic_recompute, which is where most of the real work happens. It checks that the cited feedback belongs to the same episode, that the process reward references the same belief, episode, trajectory, feedback ref, and belief-discrepancy value, that the trajectory actually lists that episode and that reward, that the trajectory's outcome reward is a real outcome event for the same episode, and that the cold replay both exists and passes. Any inconsistency appends a precise reason code such as feedback_episode_mismatch, belief_discrepancy_mismatch, or trajectory_process_reward_missing, and a single blocked row is enough to block the whole result. This is the check that a label-only fixture cannot fake: the references have to recompute into one coherent chain, not merely be present.
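The cross-reference chain described above can be sketched as follows. This is a simplified illustration, not the source of validate_semantic_recompute: the row field names and the function name validate_chain are assumptions, and only the three reason codes feedback_episode_mismatch, belief_discrepancy_mismatch, and trajectory_process_reward_missing are taken from the text. The remaining codes are hypothetical placeholders.

```python
def validate_chain(beliefs, feedback, rewards, trajectories, outcomes, cold_replays):
    """Recompute each process reward's chain; return sorted reason codes."""
    reasons = set()
    for reward in rewards:
        belief = beliefs.get(reward["belief_ref"])
        fb = feedback.get(reward["feedback_ref"])
        traj = trajectories.get(reward["trajectory_ref"])
        if belief is None or fb is None or traj is None:
            reasons.add("dangling_ref")  # hypothetical code
            continue
        # Feedback and belief must belong to the reward's own episode.
        if fb["episode"] != reward["episode"]:
            reasons.add("feedback_episode_mismatch")
        if belief["episode"] != reward["episode"]:
            reasons.add("belief_episode_mismatch")  # hypothetical code
        if belief["discrepancy"] != reward["belief_discrepancy"]:
            reasons.add("belief_discrepancy_mismatch")
        # The trajectory must list both the episode and this reward.
        if reward["id"] not in traj["process_rewards"] or reward["episode"] not in traj["episodes"]:
            reasons.add("trajectory_process_reward_missing")
        # The outcome reward must be a real outcome event for the same episode.
        outcome = outcomes.get(traj["outcome_ref"])
        if outcome is None or outcome["episode"] != reward["episode"]:
            reasons.add("outcome_reward_mismatch")  # hypothetical code
        # Cold replay must both exist and pass.
        replay = cold_replays.get(traj["id"])
        if replay is None or replay["status"] != "pass":
            reasons.add("cold_replay_not_pass")  # hypothetical code
    return sorted(reasons)
```

A non-empty return value is enough to block the whole result, which matches the rule that a single blocked row blocks everything.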

The validator links process-reward claims to public belief summaries rather than private reasoning. build_public_belief_state_process_reward_trace emits six metadata-only trace spans from the exported bundle, and the card path reports only compact counts, status, freshness digest, source-body floor metadata, and result record refs. CARD_OMITTED_FULL_PAYLOAD_KEYS keeps findings, scans, trace bodies, row payloads, source imports, scope limit, and scope boundary text out of the command card so public surfaces carry proof handles rather than copied private or source bodies.
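The card-side omission can be pictured as a key filter. The sketch below is hypothetical: the key spellings in OMITTED_KEYS are guesses at how the classes named above might be written, compact_card is an illustrative name rather than the module's result_card function, and the omitted_keys field is an invented convenience.

```python
# Assumed spellings for the payload classes the text says are kept off the card.
OMITTED_KEYS = frozenset({
    "findings",
    "scans",
    "trace_bodies",
    "row_payloads",
    "source_imports",
    "scope_limit",
    "scope_boundary",
})


def compact_card(result):
    """Keep compact fields; drop heavy payload keys and record which were dropped."""
    card = {key: value for key, value in result.items() if key not in OMITTED_KEYS}
    # Listing the omitted keys keeps the omission itself inspectable.
    card["omitted_keys"] = sorted(key for key in result if key in OMITTED_KEYS)
    return card
```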

Named Proof Consumers

microcosm_core.organs.belief_state_process_reward_replay.run is the first-wave fixture consumer. It writes result, board, validation, and sign-off result records for the synthetic episodes and negative-case floor.
microcosm_core.organs.belief_state_process_reward_replay.run_reward_bundle is the exported bundle consumer. It validates copied source bodies, public trace spans, digest/anchor contracts, and metadata-only result record behavior.
microcosm_core.organs.belief_state_process_reward_replay.result_card is the compact public card consumer. It reports counts and validation state while omitting the heavy/private payload classes named by CARD_OMITTED_FULL_PAYLOAD_KEYS.
tests/test_belief_state_process_reward_replay.py is the focused regression consumer. It asserts the three episode groups, six belief states, six process rewards, three outcome rewards, three cold replays, seven expected negative cases, exact source-body imports, digest mismatch blockers, manifest-boundary blockers, public-relative redacted result records, fresh-card reuse, and metadata-only public trace projection.
microcosm_core.macro_tools.agent_execution_trace.build_public_belief_state_process_reward_trace is the trace-projection consumer. It converts the exported bundle into six public spans with belief-state, feedback, process-reward, outcome-reward, and cold-replay coverage while preserving body_in_receipt: false.

Public Mechanics

Belief-state JSON is a public summary, not hidden chain-of-thought.
Process rewards must cite deterministic verifier or observed environment feedback refs; neural-judge-only labels are rejected.
Outcome rewards are carried beside process rewards so final answers cannot masquerade as process evidence.
Reward-hacking traps and cold replay result records must pass for each trajectory group.
microcosm_core.macro_tools.agent_execution_trace::build_public_belief_state_process_reward_trace turns the public bundle into ordered trace spans that preserve belief, verifier, process-reward, outcome-reward, and cold-replay refs while keeping bodies out of result records.
examples/belief_state_process_reward_replay/exported_belief_state_process_reward_bundle/source_module_manifest.json verifies exact copied source bodies for the extracted-pattern ledger, high-novelty reconstruction result record, canonical component model, agent-execution trace runtime, trace standard, and route-readiness checker. Those bodies live in source_modules/; result records carry refs, hashes, counts, and verdicts only.
Hidden reasoning export, neural-judge-only labels, hidden gold labels, reward-by-formatting, verifier bypass, benchmark-performance claims, and final-answer-only scoring are the seven expected falsification fixtures.
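Two of the rules in this list, verifier-backed feedback and the separation of outcome from process rewards, reduce to a small admission check. The sketch below is an assumption-laden illustration: the kind labels, field names, return codes, and the function admit_process_reward are all hypothetical and are not taken from the module.

```python
# Assumed labels for the two feedback kinds the list accepts.
ALLOWED_FEEDBACK_KINDS = {"deterministic_verifier", "observed_environment"}


def admit_process_reward(reward, feedback_rows):
    """Return (admitted, code) for one reward row."""
    fb = feedback_rows.get(reward.get("feedback_ref"))
    if fb is None:
        return False, "feedback_ref_missing"
    # Neural-judge-only labels fall outside the allowed kinds and are rejected.
    if fb.get("kind") not in ALLOWED_FEEDBACK_KINDS:
        return False, "feedback_kind_rejected"
    # An outcome reward may not be relabelled as step-level evidence.
    if reward.get("kind") != "process":
        return False, "outcome_reward_relabelled"
    return True, "admitted"
```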

Prior Art Grounding

This component is grounded in three older ideas: belief-state tracking under partial observability, process supervision, and reward-hacking controls. The belief lineage comes from POMDP work such as Kaelbling, Littman, and Cassandra's Planning and Acting in Partially Observable Stochastic Domains. The process-reward lineage follows OpenAI's Let's Verify Step by Step, where step-level feedback is separated from outcome-only supervision. The reward-hacking lineage comes from Concrete Problems in AI Safety and related work on specification gaming.

Microcosm does not train a reward model or expose hidden reasoning. It borrows the accounting form: public belief summaries, verifier-backed process feedback, outcome rewards kept separate from process rewards, reward-hacking traps, and cold replay result records before process-reward language is admitted.

Validation Result record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m microcosm_core.organs.belief_state_process_reward_replay run-reward-bundle --input examples/belief_state_process_reward_replay/exported_belief_state_process_reward_bundle --out /tmp/belief_state_process_reward_bundle_receipt --card > /tmp/belief_state_process_reward_bundle_card.json

Scope boundary

Limitations

The replay is intentionally small and synthetic. It covers three partially observable task families, six accepted belief summaries, six process rewards, three outcome rewards, three trajectory groups, three cold replays, seven negative cases in fixture mode, and seven copied source modules in exported-bundle mode. Those counts are proof boundaries, not scale claims. They show that the public validator can separate belief summaries, verifier-backed process feedback, outcome rewards, reward-hacking traps, and cold replay under declared fixtures.

The mechanism does not estimate reward-model calibration, generalize to unseen tasks, compare agent policies, certify live training behavior, or score a benchmark. build_public_belief_state_process_reward_trace emits metadata-only trace spans, so it can prove trace structure and privacy boundaries, not hidden reasoning fidelity. The copied-source manifest proves exact declared public source bodies and anchors for this bundle; it excludes private source root export, external model access, source-file changes, public sharing, or launch.

Scope limit

Source-faithful refactored fixtures and copied source bodies only; not fixture-echo product evidence, hidden reasoning export, live RL training, neural-judge sufficiency, hidden-gold benchmarking, provider behavior, benchmark claims, source-file changes, publishing-scope decision, launch-scope decision, or whole-system correctness.

Scope limit

This paper module can claim a metadata-only belief-state process-reward replay over synthetic tasks, with public belief summaries, verifier-backed process feedback, separated outcome rewards, reward-hacking traps, and cold replay result records.

It cannot claim hidden reasoning export, live RL training, reward-model quality, hidden-gold benchmark claims, provider behavior, source-file changes, publishing-scope decision, launch-scope decision, or whole-system correctness. Any higher claim must land first in core/paper_module_capsules.json and the generated paper-module projection.

Scope boundary

This module does not export hidden reasoning, run RL or train a model, use hidden gold labels, rely on neural-judge-only labels, claim benchmark performance, use external model services, change source files, publish results, or include launch operations.

Source and projection details

Governing Lattice Relation

The governing lattice relation is that belief-state process-reward language is admissible only after the runtime recomputes the claim from lower-level public evidence. The generated JSON instance resolves eight edges and leaves no selective relation open: the bundle explains the belief_state_process_reward_replay component and mechanism.belief_state_process_reward_replay.validates_public_belief_state_process_reward_replay, is governed by concept.agent_reliability_and_safety_validator_bundle, P-1, and P-2, abides by AX-1, depends on paper_module.agent_route_observability_runtime, and cites src/microcosm_core/organs/belief_state_process_reward_replay.py as the code locus.

That relation matters because the module is not trying to make reward quality plausible from a label. P-1 requires recomputation rather than echoing fixture verdicts, so _build_result rechecks projection protocol, reward policy, episodes, belief rows, feedback rows, reward rows, trajectory groups, cold replay, expected negative cases, secret-exclusion scans, public trace shape, and copied source-module manifests. P-2 and AX-1 then lower the paper claim to what those checks derive: a local replay certificate over declared public inputs. The focused proof consumer is tests/test_belief_state_process_reward_replay.py, which exercises both fixture and exported-bundle modes, mutates real positive feedback linkage, rejects digest and manifest boundary violations, verifies exact source-body imports, checks freshness over live source authority, and confirms the command card omits full payload keys.

Agent Sandbox Policy Escape Replay

Maps sandboxed agent actions to show each was approved or blocked before running, then rolled back. Evidence 3/5.

Does This takes a synthetic record of an agent attempting risky actions inside a sandbox and lays out, step by step, what each action requested, whether a safety policy approved or blocked it before it would have run, what (if anything) changed afterward, and for the actions that did run, whether the change was rolled back and could be re-checked. The record shows exactly how each containment decision is captured: every blocked attempt is still logged as a traced step but is marked as never executed with no resulting change, all from local files with no real secrets, network, or live agent involved.

Scope limit It validates the projection / trace-refactor mechanics over a synthetic fixture only; it excludes live sandbox escape, secret or account secret handling, live network access, host filesystem mutation, executable payload export, raw environment export, external model access, security benchmark claims, source-file changes, or launch. A pass proves the projection boundary and trace-refactor mechanics for this contract, not real sandbox security, exploit resistance, or whole-system safety.

Run

PYTHONPATH=src python3 -m microcosm_core.cli agent-sandbox-policy-escape-replay run-sandbox-bundle --input examples/agent_sandbox_policy_escape_replay/exported_sandbox_policy_escape_bundle --out receipts/runtime_shell/demo_project/organs/agent_sandbox_policy_escape_replay

Evidence Computed projection · evidence 3/5 · Source-faithful refactor

Links to Sleeper Memory Poisoning Quarantine Replay

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Agent Sandbox Policy-Escape Replay

Explains component Agent Sandbox Policy Escape Replay · mechanism validates public sandbox policy trace

Governed by principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by axiom Derivation before assertion

Depends on paper modules Computer-Use Action Trace Replay · MCP Tool Authority Replay

agent_sandbox_policy_escape_replay is a validator-backed public refactor of the source agent_execution_trace system for sandbox/security claims. It asks a narrow question: can Microcosm compute metadata-only trace spans from action requests, pre-execution policy verdicts, side-effect diff result records, rollback result records, cold replay, falsification fixtures, and an explicit scope limit?

Purpose

Agent sandbox claims are easy to assert and hard to evidence. A system can say it blocks an untrusted action, or that a side effect was rolled back, without ever showing the record that would make the statement checkable. This component exists to hold one narrow claim to account: that for a fixed set of synthetic action requests, the policy decision was recorded before execution, blocked actions left no side effect, and permitted side effects each carry a diff and a rollback result record.

The interesting choice is that the validator does not trust the verdict it is handed. For each request it derives the expected verdict from the request's own shape (its action kind, requested capability, risk class, and source trust label), using a small fixed policy table (_derived_sandbox_policy_verdict). A secret read from untrusted tool output derives to block; a low-risk public fixture write derives to allow; a mock database update derives to review. Any action shape the table does not recognise fails closed to block. The recorded verdict row is then checked against that derivation. A declared allow that should have derived to block is a finding, not a pass. The same fail-closed semantics drive the side-effect check: a request whose shape requires a block must show no execution and a zero diff count regardless of what the verdict row claims.
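A fail-closed derivation of this kind can be sketched as a dictionary lookup with a blocking default. This is not the source of _derived_sandbox_policy_verdict: the table keys, the label spellings, and the names derive_verdict and check_verdict are hypothetical. The three example outcomes and the fail-closed default follow the text.

```python
# Key: (action_kind, requested_capability, risk_class, source_trust).
# All label spellings are assumptions made for illustration.
POLICY_TABLE = {
    ("file_read", "secret_read", "high", "untrusted_tool_output"): "block",
    ("file_write", "public_fixture_write", "low", "trusted_operator"): "allow",
    ("db_update", "mock_db_write", "medium", "trusted_operator"): "review",
}


def derive_verdict(request):
    """Derive the expected verdict from the request's own shape."""
    shape = (
        request.get("action_kind"),
        request.get("requested_capability"),
        request.get("risk_class"),
        request.get("source_trust"),
    )
    # Unrecognised shapes fail closed to block.
    return POLICY_TABLE.get(shape, "block")


def check_verdict(request, recorded_verdict):
    """Compare the recorded verdict row with the derivation; return findings."""
    expected = derive_verdict(request)
    if recorded_verdict != expected:
        return [f"verdict_mismatch:expected_{expected}:recorded_{recorded_verdict}"]
    return []
```

Using the blocking verdict as the lookup default means a missing or misspelled field can never widen what is permitted.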

Because every check runs over symbolic references, the page can report concrete numbers (six action requests, four derived blocks, six metadata-only trace spans) while staying honest about what those numbers are not. They demonstrate that the pre-execution accounting pattern is wired and replayable over public fixtures. They do not demonstrate containment in a real host, resistance to a live exploit, or network isolation. That gap is the point of the scope limit below, and it is the line this component is built to keep visible.

Named Proof Consumers

The primary proof consumer is tests/test_agent_sandbox_policy_escape_replay.py. Its 17 tests exercise both runtime entry points (run and run_sandbox_bundle) and the public trace builder from microcosm_core.macro_tools.agent_execution_trace. The consumer does not accept declared labels at face value: it mutates policy verdicts, side-effect rows, cold replay labels, exported bundle rows, source-module digests, source/target manifest fields, body-boundary fields, and cached-card freshness to prove that the validator recomputes the sandbox-policy result from source rows.

The fixture proof path is microcosm_core.organs.agent_sandbox_policy_escape_replay run. Its success result record must include six action requests, six pre-execution verdicts, four derived blocked rows, one derived allow row, one derived review row, six side-effect result records, two rollback-verified rows, six cold replay passes, all expected negative cases, and a six-span public trace. The negative-case rows are not auxiliary examples; they are the admission boundary that rejects semantic policy drift, blocked-action execution, executable escape payload material, tool-output authority bypass, raw environment exposure, and broad security or benchmark overclaim.

The exported-bundle proof path is microcosm_core.cli agent-sandbox-policy-escape-replay run-sandbox-bundle. It has no fixture-only negative cases, so its proof surface shifts to bundle shape: the bundle id, source-module manifest, seven copied source bodies, digest equality, required anchors, metadata-only result records, public-relative paths, and public trace spans must all validate. The same test file also breaks the manifest in targeted ways to prove that missing manifests, invalid material classes, body-in-result record flags, count mismatches, missing target copies, and partial source or target digest drift block the result.

The corpus proof consumer is scripts/build_doctrine_projection.py --check-paper-module-corpus. It proves only that this reader page remains aligned with the JSON bundle-backed paper-module corpus. It does not refresh generated Mermaid, Atlas, site, or verifier projections, and it does not raise the claim above public fixture and bundle replay evidence.

Technical Mechanism

The mechanism is a validating transducer over public refs, not a sandbox. The runtime entry points run and run_sandbox_bundle load the fixture or exported bundle, then _build_result recomputes each claim from lower-level rows before any result record is accepted. The named proof consumer is tests/test_agent_sandbox_policy_escape_replay.py, with the corpus-level projection consumer scripts/build_doctrine_projection.py --check-paper-module-corpus.

The validator first establishes an input boundary. _load_payloads reads the projection protocol, sandbox policy, action requests, verdicts, side-effect result records, rollback result records, and cold replay rows with strict JSON parsing. scan_paths checks the same public files and copied source-module bodies against core/private_state_forbidden_classes.json, while _source_module_manifest_result verifies that the seven copied source bodies are present, digest-matched, labelled with a valid material class, and excluded from result record body fields.

The policy mechanism is then recomputed row by row. validate_action_requests admits only symbolic request metadata with redacted bodies and no live network target. validate_policy_verdicts joins each request to a pre-execution verdict and checks the verdict against the request's risk class instead of trusting the declared label. validate_side_effect_receipts enforces the mechanical consequence: block decisions must have no execution and no diff, while allow/review decisions must carry a non-empty public diff result record. validate_rollback_receipts requires rollback refs for side-effecting actions, and validate_cold_replay requires replay rows that reproduce verdict and side-effect state.
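The mechanical consequence enforced on side-effect rows can be sketched as below. This is an illustration rather than the source of validate_side_effect_receipts or validate_rollback_receipts: the field names (executed, diff_count, diff_ref, request_id), the reason codes, and the function check_side_effect are assumptions. The verdict passed in is meant to be the derived one, so a mislabelled verdict row cannot relax the check.

```python
def check_side_effect(derived_verdict, receipt, rollback_request_ids):
    """Return reason codes for one side-effect row under its derived verdict."""
    reasons = []
    if derived_verdict == "block":
        # A blocked action must show no execution and a zero diff count.
        if receipt.get("executed"):
            reasons.append("blocked_action_executed")
        if receipt.get("diff_count", 0) != 0:
            reasons.append("blocked_action_has_diff")
    elif derived_verdict in ("allow", "review"):
        # A permitted action must carry a non-empty public diff and a rollback ref.
        if receipt.get("diff_count", 0) < 1 or not receipt.get("diff_ref"):
            reasons.append("missing_public_diff")
        if receipt.get("request_id") not in rollback_request_ids:
            reasons.append("missing_rollback_ref")
    else:
        reasons.append("unknown_verdict")
    return reasons
```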

The trace layer converts accepted public rows into metadata-only PublicTraceSpan records through build_public_sandbox_policy_trace. Each span carries a request id, authority verdict ref, side-effect or rollback ref, outcome, digest, and sandbox_policy_action tool label. This is why the module can claim six public trace spans and outcome counts, but cannot claim live sandbox security: the trace proves replay consistency over refs, not containment in a real host environment.
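A metadata-only span with a stable digest can be sketched with a frozen dataclass and canonical JSON. The Span class below is a stand-in, not the real PublicTraceSpan: its field names, the digest recipe, and the build_trace helper are assumptions. Only the sandbox_policy_action tool label, the kinds of refs a span carries, and the body_in_receipt flag are taken from the text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for a metadata-only trace span; it carries refs, never bodies."""
    sequence: int
    request_id: str
    verdict_ref: str
    effect_ref: str
    outcome: str
    tool: str = "sandbox_policy_action"
    body_in_receipt: bool = False


def span_digest(span):
    """Digest over canonical JSON so equal spans always hash identically."""
    payload = json.dumps(asdict(span), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_trace(rows):
    """Turn accepted rows into ordered spans plus per-outcome counts."""
    spans = [
        Span(index, row["request_id"], row["verdict_ref"], row["effect_ref"], row["outcome"])
        for index, row in enumerate(rows, start=1)
    ]
    counts = {}
    for span in spans:
        counts[span.outcome] = counts.get(span.outcome, 0) + 1
    return spans, counts
```

Because the span holds only identifiers and an outcome label, its digest can show that the replay is consistent over refs, and nothing more than that.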

Negative cases define the refusal surface. The focused test suite mutates the fixture and exported bundle to verify semantic mismatch, blocked-action execution, source-module digest mismatch, source-module manifest boundary breakage, public-relative result record paths, and card reuse. These tests are the source-bound evidence that the validator fails closed for the named public contract.

Shape

Diagram source

flowchart TD
  bundle["JSON bundle authority"]
  markdown["Markdown reader projection"]
  manifest["source module manifest"]
  requests["six action requests"]
  verdicts["six pre-execution verdicts"]
  effects["six side-effect result records"]
  rollbacks["two rollback result records"]
  replay["six cold replay rows"]
  trace["public agent-execution trace"]
  negative["negative-case refusals"]
  tests["focused proof consumer"]
  result_record["metadata-only validation result record"]
  ceiling["scope limits"]
  bundle --> markdown
  manifest --> result_record
  requests --> verdicts
  verdicts --> effects
  effects --> rollbacks
  effects --> replay
  verdicts --> negative
  effects --> negative
  replay --> trace
  negative --> tests
  trace --> result_record
  tests --> result_record
  result_record --> ceiling

The module's shape is pre-execution containment accounting. Public action requests are normalized into symbolic refs, policy verdicts must exist before execution, blocked actions carry zero side effects, allowed or reviewed side effects need diff refs and rollback refs, cold replay must reproduce the public state, and the trace builder emits metadata-only spans over those refs without promoting the fixture into live sandbox-security authority.

Reader Evidence Routing

Bundle route: core/paper_module_capsules.json::paper_modules[35] is the bundle-backed authority row, and paper_modules/agent_sandbox_policy_escape_replay.json is the generated paper-module instance.
Bundle route: examples/agent_sandbox_policy_escape_replay/exported_sandbox_policy_escape_bundle carries action_requests.json, policy_verdicts.json, side_effect_receipts.json, rollback_receipts.json, cold_replay.json, sandbox_policy.json, projection_protocol.json, and source_module_manifest.json.
Action route: the six public request ids are req_secret_read_attempt, req_network_exfil_attempt, req_destructive_delete_attempt, req_shell_obfuscation_attempt, req_safe_file_edit, and req_reviewed_mock_db_update.
Verdict route: the six verdict rows are pre-execution policy decisions under sandbox-policy-v1-public-synthetic, with four block, one allow, and one review outcome.
Side-effect route: all six requests have side-effect result records; blocked rows use zero diffs, while the public fixture edit and reviewed mock database update carry diff refs plus rollback refs.
Manifest route: source_module_manifest.json records seven copied source bodies, body_in_receipt: false, body_text_in_receipt: false, and the boundary excluding keys, account secrets, account or browser material, model-output data, live network payloads, raw environments, executable escape payloads, and account secret-equivalent material.
Runtime route: src/microcosm_core/organs/agent_sandbox_policy_escape_replay.py, src/microcosm_core/macro_tools/agent_execution_trace.py, and tests/test_agent_sandbox_policy_escape_replay.py verify negative cases, public trace-span construction, exact source-module imports, manifest digest rejection, result record public-relativity, and card result record reuse.

Contract

Input shape: projection_protocol, sandbox_policy, action_requests, policy_verdicts, side_effect_receipts, rollback_receipts, and cold_replay.
Positive evidence: six metadata-only action requests converted into public agent_execution_trace spans, six pre-execution policy verdicts, six side-effect result records, two verified rollback result records, and six cold replay rows.
Negative cases: real secret material, live network access, raw environment export, policy after execution, unlogged side effect, tool-output policy bypass, executable escape payload, and security benchmark claim.
Result record boundary: the validation result record proves the source-faithful trace refactor mechanics, negative-case coverage, secret-exclusion scan, and scope limit.
Scope limit: no live sandbox escape, live secret handling, live network access, host filesystem mutation, executable payload export, raw environment export, external model access, security benchmark claim, source-file changes, or launch-scope decision.

Projection Protocol

Copied: the public shape of the source agent-execution trace membrane and the idea that containment must be proven before a security claim is admitted.

Source-faithfully refactored: PublicTraceSpan construction, sequence-ordered trace rows, authority verdict refs, side-effect and rollback refs, public summary counts, trace digests, local JSON validators, and result record generation.

Cleaned: real secrets, host paths, live network targets, raw environment data, executable payloads, provider data, and account state.

Omitted: live exploit material, hosted sandbox details, real account secrets, raw tool-output bodies, real filesystem paths, raw environment variables, and security benchmark claims.

Public runtime surface: a metadata-only sandbox policy bundle plus generated result records under receipts/first_wave/agent_sandbox_policy_escape_replay/ and receipts/runtime_shell/demo_project/organs/agent_sandbox_policy_escape_replay/.

Source-open body floor: the exported bundle carries source_module_manifest.json plus seven exact copied source bodies: the extracted-pattern ledger, the high novelty reconstruction result record, the canonical component model, the source system/lib/agent_execution_trace.py runtime, std_agent_execution_trace, the extracted-pattern route-readiness checker, and the strict JSON helper required by the refreshed trace body. Result records and cards cite the manifest, hashes, material classes, and counts only; full body text stays in the bundle source module files.

Prior Art Grounding

This component is grounded in least-privilege sandboxing and agent-security evaluation work, not in a new exploit technique. The security-control lineage is Saltzer and Schroeder's least-privilege / complete-mediation principles and capability-oriented confused-deputy thinking. The agent-evaluation lineage is closer to sandboxed tool-use benchmarks such as AgentDojo and misuse/harm evaluations such as AgentHarm, where an agent's requested actions, tool calls, and policy outcomes are evaluated under controlled conditions.

Microcosm does not claim real sandbox security, exploit resistance, or live environment isolation. It borrows the pattern that containment must be checked before action, side effects must be logged, rollback needs its own result record, and harmful payloads must stay out of the public surface.

Validation proves the projection boundary and public trace-refactor mechanics for this contract; it does not establish real sandbox security, live model behavior, benchmark claims, exploit resistance, or whole-system safety.

Validation Result records

The focused proof consumer is tests/test_agent_sandbox_policy_escape_replay.py. A passing result record has to show that the fixture validator and exported-bundle validator both recompute the public sandbox-policy trace from action-request refs, pre-execution verdict refs, side-effect result record refs, rollback refs, cold replay rows, and the source-module manifest instead of trusting declared labels.

PYTHONDONTWRITEBYTECODE=1 ./repo-pytest \
  tests/test_agent_sandbox_policy_escape_replay.py \
  -p no:cacheprovider
./repo-python scripts/build_doctrine_projection.py \
  --check-paper-module-corpus

For the focused test, the result record boundary is the asserted shape: six action requests, six policy verdicts, four blocked-without-execution rows, two verified rollback result records, six cold replay rows, six public trace spans, manifest digest checks, public-relative result record paths, and negative cases for semantic mismatch, blocked-action execution, digest mismatch, manifest-boundary breakage, and unsafe card reuse. For the corpus check, the result record is only parity evidence that the JSON bundle and generated paper-module instance still agree; it is not a live sandbox-security result.

Validation Result record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python \
  -m microcosm_core.organs.agent_sandbox_policy_escape_replay \
  run-sandbox-bundle \
  --input examples/agent_sandbox_policy_escape_replay/exported_sandbox_policy_escape_bundle \
  --out /tmp/agent_sandbox_policy_escape_bundle_receipt \
  --card > /tmp/agent_sandbox_policy_escape_bundle_card.json

The focused regression test and corpus projection checks are:

cd microcosm-substrate && ../repo-pytest \
  tests/test_agent_sandbox_policy_escape_replay.py
./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

The result record path proves pre-execution policy replay over public refs, not live sandbox security, exploit resistance, or host isolation.

Scope boundary

Scope limit

This module may claim public fixture evidence that action request refs, pre-execution policy verdict refs, side-effect result record refs, rollback refs, cold replay rows, metadata-only trace spans, source manifests, digest checks, secret-exclusion scans, negative cases, and validation result records are checked by the listed runtime witnesses.

This module may not claim live sandbox escape resistance, live secret handling, live network isolation, host filesystem mutation authority, executable payload export, raw environment export, provider behavior, security benchmark performance, source-file changes, publishing-scope decision, launch-scope decision, or whole-system correctness.

Source and projection details

Governing Lattice Relation

The JSON bundle binds this paper module to the component agent_sandbox_policy_escape_replay and to mechanism.agent_sandbox_policy_escape_replay.validates_public_sandbox_policy_trace. The mechanism row states the actual computation: check action requests, pre-execution policy verdicts, side-effect result records, rollback result records, cold replay rows, public trace spans, source-module manifest boundaries, secret-exclusion scans, and escape negative cases before writing bounded result records.

AX-1 supplies the axiom-level rule: derivation must precede assertion, and a claim cannot be stronger than the checker that accepted it. P-1 specializes that rule into recomputation over lower-level evidence instead of echoing fixture labels, declared verdicts, or public copy lines. P-2 lowers the module's public claim to the strength of the named validator, which is why the scope limit stops at metadata-only public sandbox-policy replay result records. The governing concept, concept.agent_reliability_and_safety_validator_bundle, groups this component with agent reliability and safety validators whose public value is bounded result record evidence, not a broad claim that agents are safe in the world.

Indirect Prompt Injection Information Flow Policy Replay · Replays an agent run to show untrusted text was gated before any sensitive action, leaking no secret. · 3/5

Does Replays a recorded sample agent episode (built from synthetic, metadata-only rows) and makes visible whether the instructions the agent treated as trusted were kept separate from untrusted web, tool, or browser text before any sensitive action was taken. Row by row, the record shows where untrusted text flowed, what the policy decided for each flow (allow / warn / block / review) before the action, and that the recorded outcome leaked no secret and disclosed no trusted context. It also bundles deliberately bad cases it must reject (e.g. untrusted text reaching a sensitive action ungated, or an account secret being exfiltrated).

Scope limit Passing result records only show this projection satisfies the named information-flow contract over synthetic, redacted, metadata-only rows; they do not prove general prompt-injection robustness, benchmark performance, live account/tool/provider safety, hidden-message handling in a real system, source-file changes, or launch-scope decision.

Run

microcosm indirect-prompt-injection-information-flow-policy-replay run-prompt-injection-bundle --input examples/indirect_prompt_injection_information_flow_policy_replay/exported_prompt_injection_flow_bundle --out receipts/runtime_shell/demo_project/organs/indirect_prompt_injection_information_flow_policy_replay

Evidence: Computed projection · evidence 3/5 · Source-faithful refactor

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Indirect Prompt-Injection Information-Flow Policy Replay

Explains: component Indirect Prompt Injection Information Flow Policy Replay · mechanism validates public indirect prompt injection information flow policy replay

Governed by: principles Preserve provenance across every boundary · Carry basis and provenance together · concept agent reliability and safety validators as bounded public scope limits

Abides by: axiom Provenance propagation and non-interference

Depends on: paper module Source Projection Import Protocol

This validator-backed claim contract admits one narrow public claim: a source-faithful public trace refactor separated trusted instructions from untrusted web/tool/browser text before any privileged action or answer claim was accepted.

The runnable contract requires source trust labels, taint labels, source-to-sink flow rows, pre-action policy verdicts, sanitized-output result records, cold replay, secret-exclusion scan, negative cases, a public agent-execution trace, and an explicit scope limit.

Purpose

An agent that reads web pages, tool output, or retrieved documents takes in text from sources it does not control. Indirect prompt injection is the case where that untrusted text carries an instruction, and the agent acts on it as if the operator had asked. This component exists to make one specific safety property checkable on a synthetic trace: untrusted text was kept separate from trusted instructions, and no untrusted source reached a privileged action without being gated first. It answers a single question. Did the trust boundary actually hold through the flow, or only on paper?

The unusual part is that the validator does not trust the labels the fixture declares. Each flow row claims a set of taint labels and a policy verdict, but the runtime ignores those and recomputes both. It propagates taint along the source-to-sink graph from the labelled sources, so a sink inherits the taint of everything that fed it, and it derives the verdict from that propagated taint plus the sink's privilege, the sanitizer state, the sink kind, and the proposed action. If the declared taint or the declared verdict disagrees with the recomputed one, the row is blocked. A flow cannot quietly relabel an untrusted source as clean, or mark a dangerous action as allow, because the contradiction is recomputed rather than read back.

That recomputation is the point. The failure mode it guards against is a trace that looks safe because the labels were written to look safe. By deriving the labels and verdicts from the graph itself, the contract catches the mislabelled flow that a field-by-field check would wave through. To stay honest about live behaviour, it also takes one generated public tool-call trace span and pushes it through the same machinery as untrusted tool output, so the runtime is seen to treat tool output as data until a policy gate reviews it, never as instruction authority.
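
The recompute-then-compare idea can be sketched in a few lines. This is a minimal sketch under stated assumptions: the row keys (`sink`, `sink_privileged`, `sanitized`, `declared_taint`, `declared_verdict`) and the small rule table in `derive_verdict` are hypothetical, and the real derivation also weighs sink kind and the proposed action.

```python
def propagate_taint(source_taint, edges):
    """Propagate taint labels along directed edges until nothing changes.

    source_taint: {node: set of labels}; edges: list of (from_node, to_node).
    """
    taint = {node: set(labels) for node, labels in source_taint.items()}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            inherited = taint.get(src, set())
            current = taint.setdefault(dst, set())
            if not inherited <= current:
                current |= inherited  # a sink inherits everything that fed it
                changed = True
    return taint


def derive_verdict(labels, sink_privileged, sanitized):
    """Hypothetical rule table, far smaller than a real policy."""
    if "untrusted" not in labels:
        return "allow"
    if not sink_privileged:
        return "warn"
    return "review" if sanitized else "block"


def check_flow(row, taint):
    """Return (verdict, drifted). A drifted row is always blocked."""
    derived_labels = taint.get(row["sink"], set())
    derived = derive_verdict(derived_labels, row["sink_privileged"], row["sanitized"])
    drifted = (
        set(row["declared_taint"]) != derived_labels
        or row["declared_verdict"] != derived
    )
    return ("block", True) if drifted else (derived, False)
```

A row that relabels an untrusted sink as clean and declares `allow` comes back as `("block", True)`, because the declared values are compared against the recomputed ones instead of being read back.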

Cold-Reader Path

microcosm indirect-prompt-injection-information-flow-policy-replay run-prompt-injection-bundle \
  --input examples/indirect_prompt_injection_information_flow_policy_replay/exported_prompt_injection_flow_bundle \
  --out receipts/runtime_shell/demo_project/organs/indirect_prompt_injection_information_flow_policy_replay

Primary result record: receipts/runtime_shell/demo_project/organs/indirect_prompt_injection_information_flow_policy_replay/exported_prompt_injection_flow_bundle_validation_result.json

First-wave fixture result record: receipts/first_wave/indirect_prompt_injection_information_flow_policy_replay/indirect_prompt_injection_information_flow_policy_replay_validation_receipt.json

Shape

Diagram source

flowchart TD
  sources["Source rows trusted and untrusted, with taint labels"]
  flows["Source-to-sink flow rows (declared taint and verdict)"]
  propagate["Propagate taint along the source-to-sink graph"]
  derive["Derive verdict from taint + sink privilege + sanitizer + sink kind + action"]
  compare{"Declared labels and verdict match the derived ones?"}
  blocked["Block the row (relabelled or wrong verdict)"]
  gate{"Untrusted into a privileged sink?"}
  verdicts["Pre-action verdict allow / warn / review / block"]
  outputs["Sanitized output no trusted context disclosed, no untrusted instruction obeyed"]
  toolspan["One public tool-call trace span treated as untrusted tool output"]
  result_records["metadata-only result records refs, digests, counts, status"]
  sources --> flows
  flows --> propagate
  propagate --> derive
  derive --> compare
  compare -- no --> blocked
  compare -- yes --> gate
  gate -- yes --> verdicts
  gate -- no --> verdicts
  verdicts --> outputs
  toolspan --> propagate
  outputs --> result_records
  blocked --> result_records

The module's shape is a public information-flow replay, not a live prompt-injection defense. This page points at the mechanism and runtime component; the runtime validates source trust labels, taint propagation, privileged sink gates, pre-action verdicts, sanitized outputs, cold replay, public trace spans, source-module digest anchors, negative cases, and metadata-only result records.

Technical Mechanism

The runtime mechanism is an evidence compiler plus an information-flow validator. run loads the first-wave fixture with negative cases enabled; run_prompt_injection_bundle loads the exported public bundle and leaves the fixture-only negative cases out. Both routes call _build_result, which loads the projection protocol, injection policy, source-document rows, flow graph, policy verdict rows, sanitized outputs, cold replay rows, public trace, copied source-module manifest, and secret-exclusion policy before it writes any result record.

The source and flow validators separate instruction authority from untrusted data before claim admission. validate_source_documents requires every source row to carry source id, trust label, channel, body ref, taint labels, instruction-authority flag, body redaction, synthetic-fixture status, and no raw or real-account body export; untrusted sources cannot carry instruction authority. validate_information_flow_graph joins each flow to its source row, derives taint labels through _taint_propagation_receipt, derives the expected policy verdict from propagated taint, sink kind, sink privilege, sanitizer state, and proposed action, and rejects hand-written taint or verdict drift.
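
A source-row check of this kind might look like the sketch below. It is illustrative only: the function name `validate_source_row_sketch` and the field names are hypothetical stand-ins for the field floor listed above, not the schema used by validate_source_documents.

```python
# Hypothetical field names standing in for the required source-row floor.
REQUIRED_FIELDS = (
    "source_id", "trust_label", "channel", "body_ref",
    "taint_labels", "instruction_authority", "body_redacted", "synthetic_fixture",
)


def validate_source_row_sketch(row):
    """Return findings for one source row; an empty list admits the row."""
    findings = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    if findings:
        return findings  # no point checking values on an incomplete row
    if row["trust_label"] == "untrusted" and row["instruction_authority"]:
        findings.append("untrusted source carries instruction authority")
    if not row["body_redacted"]:
        findings.append("body is not redacted")
    if not row["synthetic_fixture"]:
        findings.append("row is not marked as a synthetic fixture")
    return findings
```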

Policy and output validation then bind the pre-action membrane. The injection policy must name allow, warn, block, and review verdicts; require the source, flow, verdict, and output field floors; and deny real accounts, raw prompt bodies, account secrets, tool-output authority, hidden-message promotion, live tool calls, general robustness claims, and launch. validate_policy_verdicts requires verdicts to join to flows, precede action, cite rules, stay redacted, and match the derived flow verdict. validate_sanitized_outputs requires output refs to join to flows, disclose no trusted context, obey no untrusted instruction, and avoid external action on blocked flows.

Replay and trace validation keep the public claim metadata-only. validate_cold_replay requires replay commands and result record refs to reproduce each verdict and sanitized output without trusted-context disclosure. The component uses build_public_prompt_injection_trace to build five public trace spans, then _live_tool_call_trace_promotion promotes one generated public tool-call trace span back through the same taint-graph machinery as an untrusted tool-output source. That promotion is evidence that the runtime treats tool output as data until a policy gate reviews it, not as instruction authority.

The copied-source floor is checked independently. _source_module_manifest_result requires the exported bundle's source_module_manifest.json to classify copied material as source body material, keep body text out of result records, match declared module counts, resolve path and target_ref to the same copied body, stream SHA-256 digests over each target, and verify required anchors. _source_open_body_import_summary exposes only body ids, classes, manifest refs, counts, and ceiling flags; the copied bodies remain under source_modules/.
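
The digest part of that check can be sketched with the standard library. The manifest entry keys used here (`path`, `target_ref`, `sha256`, `required_anchors`) are assumed names for illustration; only `hashlib` and `pathlib` behaviour is relied on as fact.

```python
import hashlib
import tempfile
from pathlib import Path


def stream_sha256(path, chunk_size=65536):
    """Hash a file in fixed-size chunks so a large body is never loaded whole."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest_entry(entry, bundle_root):
    """Check one manifest entry (hypothetical keys) against the copied body."""
    root = Path(bundle_root)
    target = (root / entry["path"]).resolve()
    if target != (root / entry["target_ref"]).resolve():
        return ["path and target_ref resolve to different bodies"]
    findings = []
    if stream_sha256(target) != entry["sha256"]:
        findings.append("digest mismatch")
    text = target.read_text(encoding="utf-8")
    for anchor in entry.get("required_anchors", []):
        if anchor not in text:
            findings.append(f"missing anchor: {anchor}")
    return findings


# Usage: write a throwaway body and check it against matching and tampered entries.
with tempfile.TemporaryDirectory() as tmp_root:
    body = b"class PublicTraceSpan: ...\n"
    (Path(tmp_root) / "body.py").write_bytes(body)
    entry = {
        "path": "body.py",
        "target_ref": "body.py",
        "sha256": hashlib.sha256(body).hexdigest(),
        "required_anchors": ["PublicTraceSpan"],
    }
    clean = check_manifest_entry(entry, tmp_root)
    tampered = check_manifest_entry({**entry, "sha256": "0" * 64}, tmp_root)
```

Streaming keeps memory use flat regardless of body size, and the findings list carries only short status strings, so no body text leaks into a result record.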

The result record mechanism is intentionally small. _write_receipts writes first-wave result, board, validation, and sign-off result records for fixture mode, while exported-bundle mode writes the bundle validation result. result_card emits a compact command card and omits findings, secret-scan details, scope limit bodies, scope boundary text, source refs, target refs, public trace spans, source rows, flow rows, verdict rows, sanitized output rows, cold replay rows, board rows, and copied source-module bodies. The card preserves counts, status, negative-case coverage, trace span count, body-floor status, and result record refs.
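
One way to get that behaviour is an allowlist projection, sketched below. The allowlist approach and the key names are assumptions made for illustration; the source states only which kinds of fields the card keeps and which it omits.

```python
# Hypothetical key names for the fields the card is described as preserving.
CARD_FIELDS = (
    "status",
    "counts",
    "negative_case_coverage",
    "trace_span_count",
    "body_floor_status",
    "result_record_refs",
)


def build_card_sketch(result):
    """Copy only allowlisted keys, so any new row-level field is omitted by default."""
    return {key: result[key] for key in CARD_FIELDS if key in result}
```

An allowlist fails closed: a field added to the full result later stays out of the card until someone names it explicitly.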

The lattice binding is the source record paper_module.indirect_prompt_injection_information_flow_policy_replay, the mechanism row mechanism.indirect_prompt_injection_information_flow_policy_replay.validates_public_indirect_prompt_injection_information_flow_policy_replay, principles P-9 and P-14, axiom AX-8, and concept.agent_reliability_and_safety_validator_bundle. Those refs are used as an admission-control lattice: source-labelled public evidence may enter the claim surface, while untrusted instruction authority, private bodies, model-output data, live account material, source-file changes, and launch claims remain out of scope.

Input Contract

projection_protocol.json: source-available projection statement and omitted private material.
injection_policy.json: required source, flow, verdict, and output fields plus authority denials.
source_documents.json: synthetic trusted and untrusted sources with trust labels and taint labels.
information_flow_graph.json: source-to-sink flow rows before claim admission.
policy_verdicts.json: allow, warn, block, and review verdicts before synthetic action.
sanitized_outputs.json: output refs proving no trusted context disclosure and no untrusted instruction obedience.
cold_replay.json: rerunnable command and result record refs that reproduce verdicts and sanitized state.

Public Trace Refactor

The product evidence is no longer the fixture verdict fields alone. The component uses microcosm_core.macro_tools.agent_execution_trace::build_public_prompt_injection_trace to emit metadata-only spans over the public source, flow, verdict, output, and replay refs. That builder is a Microcosm refactor of the source system/lib/agent_execution_trace.py span model, so the accepted result record can show sequence, authority, audit, coverage, and digest mechanics without copying real accounts, prompt bodies, model-output data, or live tool material.

Reader Evidence Routing

Bundle route: core/paper_module_capsules.json::paper_modules[38:paper_module.indirect_prompt_injection_information_flow_policy_replay] is the JSON authority row; a Mermaid diagram and an Atlas card are generated for this module from that row.
Mechanism route: core/mechanism_sources.json::mechanism.indirect_prompt_injection_information_flow_policy_replay.validates_public_indirect_prompt_injection_information_flow_policy_replay binds the code locus, fixture refs, exported bundle refs, result record refs, validator commands, focused regression, and guardrails.
Runtime route: src/microcosm_core/organs/indirect_prompt_injection_information_flow_policy_replay.py owns run, run_prompt_injection_bundle, _build_result, _write_receipts, result_card, EXPECTED_NEGATIVE_CASES, and AUTHORITY_CEILING.
Exported-bundle route: examples/indirect_prompt_injection_information_flow_policy_replay/exported_prompt_injection_flow_bundle contains bundle_manifest.json, projection_protocol.json, injection_policy.json, source_documents.json, information_flow_graph.json, policy_verdicts.json, sanitized_outputs.json, cold_replay.json, and source_module_manifest.json.
Source-module route: source_module_manifest.json records five copied public source bodies: the extracted-pattern ledger row, high-novelty reconstruction result record, agent execution trace runtime, agent execution trace standard, and strict JSON helper. Result records carry refs, digests, counts, and status only; source body text stays in the bundle's source_modules/ tree.
Focused-test route: tests/test_indirect_prompt_injection_information_flow_policy_replay.py verifies negative cases, public-relative redacted result records, exported-bundle runtime shape, source-module digest and target-ref failures, exact copied source bodies, card result record reuse, and public trace span construction.

Prior Art Grounding

This component is grounded in the prompt-injection and information-flow-control literature. The prompt-injection side follows the threat shape described by Greshake et al. in Not what you've signed up for, the agentic evaluation framing in AgentDojo, and later data-leakage benchmarks over tool-calling agents such as Simple Prompt Injection Attacks Can Leak Personal Data. The policy mechanism borrows from dynamic information-flow / taint-tracking ideas, including Permissive Information-Flow Analysis for Large Language Models.

Microcosm does not claim a general prompt-injection defense. It preserves the internal-control lesson of that prior art: untrusted content must be labelled as data, source-to-sink flows must be visible before privileged action, and sanitized outputs need result records. The local component turns that lesson into a metadata-only replay contract with explicit scope boundaries and negative cases.

Negative Cases

The validator rejects real account material, secret or trusted-context exfiltration, raw prompt body export, untrusted tool output treated as instruction authority, hidden system-message promotion, account secret exfiltration, final-answer-only success, and ungated untrusted flow into a privileged sink.

These are falsification fixtures. They are part of the contract, not examples of live exploit traffic.

Named Proof Consumers

The named proof consumer is tests/test_indirect_prompt_injection_information_flow_policy_replay.py. It checks first-wave negative-case coverage, five sources, three untrusted and two trusted source labels, five information flows, derived taint paths, derived policy verdicts, allow/warn/block/review counts, blocked-without-external-action counts, sanitized-output non-disclosure, cold replay, scope limit flags, public trace spans, public tool-call trace promotion through taint propagation, public-relative redacted result records, exported-bundle validation, source-module digest mismatch rejection, target-ref/path mismatch rejection, partial digest mismatch rejection, manifest body-text boundary rejection, streaming source-module digests, exact copied source body imports, fresh --card result record reuse, public trace construction, and fixture-manifest binding to the body-open refactor.

The runtime proof consumers are the two module commands in the Validation Result record Path: fixture mode via indirect_prompt_injection_information_flow_policy_replay run, and exported bundle mode via indirect_prompt_injection_information_flow_policy_replay run-prompt-injection-bundle. Fixture mode must observe all eight negative cases and write metadata-only result, board, validation, and sign-off result records. Bundle mode must validate the public bundle shape, source-module manifest, public trace spans, and metadata-only exported bundle result record.

The corpus proof consumer is scripts/build_doctrine_projection.py --check-paper-module-corpus. It is a corpus check only; it does not refresh generated Mermaid, Atlas, site, verifier, or bundle state.

Validation Result record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m microcosm_core.organs.indirect_prompt_injection_information_flow_policy_replay run-prompt-injection-bundle --input examples/indirect_prompt_injection_information_flow_policy_replay/exported_prompt_injection_flow_bundle --out /tmp/indirect_prompt_injection_flow_bundle_receipt --card > /tmp/indirect_prompt_injection_flow_bundle_card.json

The focused regression test and corpus projection checks are:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m pytest -p no:cacheprovider tests/test_indirect_prompt_injection_information_flow_policy_replay.py -q
cd microcosm-substrate && PYTHONPATH=src ../repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

The result record path proves synthetic information-flow replay and body omission, not general prompt-injection robustness or live account safety.

Scope boundary

Limitations

The replay is intentionally small and synthetic. Fixture mode covers five source documents, three untrusted and two trusted source labels, five source-to-sink flows, five pre-action verdicts, five sanitized outputs, five cold replay passes, five public trace spans, one generated public tool-call trace promoted through the taint graph, five copied source bodies, and eight negative cases. Exported-bundle mode validates the public bundle, source-module manifest, trace spans, and metadata-only result record shape, but it does not carry the fixture-only negative-case payloads.

Those counts are proof boundaries, not scale claims. They show that this local validator recomputes source trust, taint propagation, pre-action verdicts, sanitized output constraints, cold replay, source-module digest anchors, and metadata-only result record shape over declared public inputs. They do not estimate attack coverage, compare defenses, score a benchmark, certify hidden-message handling in production, or demonstrate live email, browser, account, tool, or provider behavior.

The source-open body floor is also narrow. The manifest proves byte parity and declared anchors for the five copied source bodies in the exported bundle. It excludes private source-root export, raw prompt or system body export, account secret-bearing material, source-file changes, public sharing, hosting, launch-scope decision, complete security, or product readiness.

Scope limit

This module supports only the public claim that the replay exposes and checks a prompt-injection information-flow policy over source trust labels, taint labels, source-to-sink flow rows, pre-action policy verdict refs, sanitized-output refs, cold replay refs, public trace spans, live public tool-call trace taint promotion, copied source-module digests, negative-case result records, secret-exclusion checks, and metadata-only scope limits.

The copied source-module digest row proves byte parity for the named source body only; it does not widen the replay into live source authority.

It does not claim general prompt-injection robustness, live account safety, live tool or provider behavior, raw prompt/system/tool body export, account-secret-bearing data, hidden-message production handling, benchmark security or performance, source-file changes, publishing-scope decision, hosting authority, launch-scope decision, complete security, or product-progress authority.

Scope limit

Passing result records prove only that this public trace refactor satisfies the named prompt-injection information-flow contract over metadata-only rows. They do not prove general prompt-injection robustness, benchmark performance, live account safety, provider behavior, tool behavior, hidden-message handling in a real system, source-file changes, publishing-scope decision, or launch operations.

Source and projection details

Governing Lattice Relation

The generated JSON instance gives this page a specific admission lattice, not a loose security story. The only unresolved selective relation is the dependency edge; it remains a residual because the bundle does not name a sibling paper-module dependency.

The governing law is provenance propagation and non-interference. P-9 requires every source, fixture, result record, public-copy, provider-shape, or private-boundary crossing to carry provenance class and scope limit. P-14 requires byte or row basis and provenance to travel together. AX-8 requires data labels to propagate along flows, with untrusted labels entering privileged sinks only through declared transforms that satisfy the sink policy.

The runtime implements that lattice in _build_result: it loads the projection protocol, source documents, information-flow graph, policy verdicts, sanitized outputs, cold replay rows, public trace spans, source-module manifest, and secret-exclusion policy before status is admitted. validate_source_documents rejects untrusted instruction authority, validate_information_flow_graph derives taint labels and policy verdicts instead of trusting hand-written rows, _live_tool_call_trace_promotion treats generated public tool-call trace spans as untrusted tool output, and _write_receipts/result_card keep public result records metadata-only. The focused proof consumer is tests/test_indirect_prompt_injection_information_flow_policy_replay.py, which checks fixture and exported-bundle modes, taint/verdict derivation, negative cases, source-module digest boundaries, exact copied source-body imports, card redaction, fresh result record reuse, public trace spans, and fixture-manifest binding to the body-open refactor.

Source-Open Body Floor

The exported bundle carries exact copied source bodies under source_modules/ai_workflow/, governed by source_module_manifest.json. The imported bodies are:

state/microcosm_portfolio/extracted_patterns_ledger.jsonl
state/microcosm_portfolio/reconstruction/high_novelty_substrate_gap_scout_v1.json
system/lib/agent_execution_trace.py
codex/standards/std_agent_execution_trace.json
system/lib/strict_json.py

The manifest records source refs, target refs, hashes, material classes, and required anchors. Result records and cards expose refs, counts, and validation status only; they do not embed ledger, reconstruction, prompt, account, account secret, browser UI, model-output data, or live-access bodies.

Agentic Vulnerability Discovery Patch Proof Replay · Checks a fixed-bug evidence chain and re-runs three small real security checks; no real attack material. · 3/5

Does Takes a claim that an AI agent "found and fixed a security bug" and lays it out as a local, inspectable chain of synthetic evidence: the imagined target, the suspected issue, the trace cited as backing, a reference to an abstract exploitability argument, the patch, and regression tests that the fixture says fail before the fix and pass after it. The component checks that these pieces are all present, refer to each other consistently, and carry no real targets, exploits, payloads, account secrets, or attack steps. It re-runs the small regression witnesses against three public mini-targets, but it does not test any real target or judge whether the bug or fix is real. The result record shows whether the declared chain holds together, with no real attack material ever present.

Scope limit It validates only the projection/evidence-chain mechanics of a synthetic replay: structural presence, cross-reference consistency, declared boolean flags, and the secret/live-access exclusion scan. It executes small regression witnesses but performs no real vulnerability discovery and makes no judgment of real-world security or fix correctness. It excludes live-target testing, real CVE exploitation, weaponized payloads, account secret handling, network exfiltration, actionable exploit steps, external model access, source-file changes, benchmark security scores, launch, or any whole-system security claim.

Run

PYTHONPATH=src python3 -m microcosm_core.organs.agentic_vulnerability_discovery_patch_proof_replay run --input fixtures/first_wave/agentic_vulnerability_discovery_patch_proof_replay/input --out receipts/first_wave/agentic_vulnerability_discovery_patch_proof_replay --acceptance-out receipts/acceptance/first_wave/agentic_vulnerability_discovery_patch_proof_replay_fixture_acceptance.json

Evidence: Computed projection · evidence 3/5 · Source-faithful refactor

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Agentic Vulnerability Discovery Patch-Proof Replay

Explains: component Agentic Vulnerability Discovery Patch Proof Replay · mechanism validates public agentic vulnerability patch proof replay

Governed by: principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by: axiom Derivation before assertion

Depends on: paper module Mission Transaction Work Spine

This module documents the source-available claim contract for agentic_vulnerability_discovery_patch_proof_replay. It turns an agentic vulnerability-discovery claim into a public trace-backed local replay: synthetic metadata-only targets, issue hypotheses, trace evidence, abstract exploitability refs, patch diffs, regression tests, verifier result records, sandbox policy verdicts, false-positive triage, cold replay, negative cases, and scope limits.

Purpose

An agent that says it found and fixed a security bug is making a claim that is easy to assert and hard to check. The phrase "found and fixed" can stand for a real, tested repair, or for a plausible-looking patch that was never run, a false positive promoted to a finding, or a benchmark number with no evidence behind it. This component exists to refuse that ambiguity. It answers one question: before any "found and fixed" language is allowed, does a complete evidence chain line up, from a synthetic target through a hypothesis, a trace, an abstract exploitability ref, a patch diff, a regression test, and a verifier result record?

The part worth noticing is that two of those checks are not field checks. They recompute the thing the fixture is claiming. Each executable regression witness names one of three small, public mini-targets: a webhook redirect allowlist, a notebook log redactor, and a scheduler path normaliser. The validator runs that function twice, once in its unpatched form and once patched, and compares the results it computes against the expected_pre_patch and expected_post_patch values the fixture declared. A witness whose declared output does not match the computed output is rejected. In the same spirit, each verifier result record has its pass or false_positive verdict recomputed from the joined proof, patch, test, and witness evidence; the row's own label and result record filename are not taken on trust. The failure mode this guards against is a fixture that asserts a green result without the work behind it ever having run.
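The witness recomputation described above can be sketched in a few lines. This is a minimal illustration, not the component's code: the allowlist host, both `redirect_allowed_*` functions, the `check_witness` helper, the `input` key, and the error labels are all hypothetical. Only the `expected_pre_patch` and `expected_post_patch` names come from the text.

```python
from typing import Callable
from urllib.parse import urlparse

# Hypothetical allowlist for a synthetic webhook redirect mini-target.
ALLOWED_HOSTS = {"hooks.example.test"}


def redirect_allowed_unpatched(url: str) -> bool:
    # Hypothetical flaw: a substring match also accepts look-alike hosts.
    return any(host in url for host in ALLOWED_HOSTS)


def redirect_allowed_patched(url: str) -> bool:
    # Hypothetical fix: compare the parsed hostname exactly.
    return urlparse(url).hostname in ALLOWED_HOSTS


def check_witness(
    witness: dict,
    unpatched: Callable[[str], bool],
    patched: Callable[[str], bool],
) -> list[str]:
    """Run both forms of the function and compare against declared values."""
    errors = []
    if unpatched(witness["input"]) != witness["expected_pre_patch"]:
        errors.append("pre_patch_mismatch")
    if patched(witness["input"]) != witness["expected_post_patch"]:
        errors.append("post_patch_mismatch")
    return errors


# A witness whose declared outputs match what the functions compute.
honest = {
    "input": "https://hooks.example.test.evil.invalid/cb",
    "expected_pre_patch": True,
    "expected_post_patch": False,
}
# The same witness with a declared value that the computation contradicts.
tampered = dict(honest, expected_pre_patch=False)
```

The point of the design is that the declared values are only ever compared, never used as the result, so editing a label in the fixture cannot turn a failing witness green.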

This is a synthetic, metadata-only replay, not live security work. The synthetic overclaim fixtures (live targets, real CVE exploitation, weaponized payloads, exploit steps, patch-without-test claims, and benchmark claims) are regression boundaries the runtime must reject, not capabilities it offers. The useful claim is narrow: Microcosm can hold an agentic security story to a checked evidence chain before it admits patch-proof language.

Shape

Diagram source

flowchart TD
  bundle["JSON bundle authority"]
  markdown["Markdown reader projection"]
  mechanism["mechanism source row"]
  component["patch-proof replay runtime"]
  fixture["first-wave fixture"]
  bundle["exported patch-proof bundle"]
  targets["synthetic target refs"]
  hypotheses["issue hypotheses"]
  traces["trace evidence refs"]
  proofs["abstract exploitability refs"]
  patches["patch diff refs"]
  regressions["regression test refs"]
  executable["executable regression witnesses"]
  verifiers["verifier result records"]
  sandbox["sandbox verdicts"]
  negative["negative-case fixtures"]
  secret_scan["secret-exclusion scan"]
  replay["cold replay rows"]
  public_trace["public trace spans"]
  source_modules["source-module body floor"]
  result_records["metadata-only result records"]
  consumer["focused proof-consumer tests"]
  ceiling["scope limit"]
  bundle --> markdown
  bundle --> mechanism
  mechanism --> component
  component --> fixture
  component --> bundle
  fixture --> targets
  bundle --> targets
  targets --> hypotheses
  hypotheses --> traces
  traces --> proofs
  proofs --> patches
  patches --> regressions
  regressions --> executable
  executable --> verifiers
  verifiers --> sandbox
  negative --> result_records
  secret_scan --> result_records
  sandbox --> replay
  replay --> public_trace
  source_modules --> secret_scan
  source_modules --> public_trace
  public_trace --> result_records
  result_records --> consumer
  result_records --> ceiling

The module shape is a metadata-only synthetic patch-proof replay, not a live vulnerability discovery or fix-correctness claim. The runtime forces target refs, hypotheses, trace refs, abstract exploitability refs, patch diff refs, regression test refs, verifier result records, sandbox verdicts, false-positive triage, cold replay, public trace spans, source-module digests, negative cases, and scope boundaries to line up before bounded patch-proof language is admitted.

Technical Mechanism

The mechanism is an evidence join, not a scanner. The JSON bundle names the component and mechanism row, and the component resolves every claim through _build_result in src/microcosm_core/organs/agentic_vulnerability_discovery_patch_proof_replay.py. That function loads the projection protocol and vulnerability policy, then validates targets, issue hypotheses, trace evidence, exploitability refs, patch diffs, regression tests, executable regression witnesses, verifier result records, sandbox verdicts, false-positive triage, cold replay rows, optional negative-case fixtures, the public trace builder, and the source-module manifest. A result can pass only when those validators agree, the secret-exclusion scan has zero blocking hits, the public trace status is pass, all positive validators pass, and the exported bundle's manifest digests match the copied source bodies.

Two of those validators do work the others do not. The executable regression witness check runs each declared mini-target function in both its unpatched and patched form and compares the computed pre/post outputs against the values the fixture declared, so a witness cannot pass on a label alone. The verifier result record check recomputes each pass or false_positive verdict from the joined hypothesis, proof, patch, test, and witness evidence, and also requires the result record-ref filename to match that recomputed verdict, so a row cannot claim a result its own evidence does not support. The other validators are stricter joins: every hypothesis must resolve to a synthetic target, every patch-required hypothesis must carry both an abstract exploitability ref and a metadata-only patch diff, and every patch must pair with a regression test that fails before the patch and passes after it. A patch without a paired test, or a false positive promoted to a finding, blocks the result.
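The verdict recomputation can be pictured as follows. This is a sketch under stated assumptions, not the runtime's logic: every key name, the `blocked` fallback value, the error labels, and the convention that a ref filename ends in `<verdict>.json` are hypothetical. Only the `pass` and `false_positive` verdicts and the filename-must-match rule come from the text.

```python
def recompute_verdict(evidence: dict) -> str:
    """Derive a verdict from joined evidence instead of echoing a row label."""
    if evidence["triaged_false_positive"]:
        return "false_positive"
    complete = (
        evidence["has_exploitability_ref"]
        and evidence["has_patch_diff"]
        and evidence["test_fails_pre_patch"]
        and evidence["test_passes_post_patch"]
        and evidence["witness_matches"]
    )
    # "blocked" is a hypothetical label for an incomplete chain.
    return "pass" if complete else "blocked"


def check_verifier_row(row: dict, evidence: dict) -> list[str]:
    """Compare a row's declared verdict and ref filename to the recomputed one."""
    errors = []
    verdict = recompute_verdict(evidence)
    if row["declared_verdict"] != verdict:
        errors.append("verdict_mismatch")
    # Assumed convention: the ref filename ends with "<verdict>.json".
    if not row["result_ref"].endswith(f"{verdict}.json"):
        errors.append("result_ref_mismatch")
    return errors


# Hypothetical joined evidence for one hypothesis with a complete chain.
full_chain = {
    "triaged_false_positive": False,
    "has_exploitability_ref": True,
    "has_patch_diff": True,
    "test_fails_pre_patch": True,
    "test_passes_post_patch": True,
    "witness_matches": True,
}
# The same chain, but the paired test never failed before the patch.
untested_patch = dict(full_chain, test_fails_pre_patch=False)
```

Because the row's label is an input to a comparison rather than the source of the verdict, a cross-wired or optimistic row surfaces as an explicit mismatch.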

The runtime deliberately keeps two evidence modes separate. The first-wave fixture includes the negative-case authority, so it must observe the expected overclaim failures such as live target material, real CVE exploitation, weaponized payload export, exploit steps, patch-without-test claims, and benchmark claims. The exported bundle is the public runtime example, so its expected_negative_cases can be empty while it still proves the body floor, public trace, digest checks, regression witnesses, and scope limit. Both modes write metadata-only result records; copied bodies stay behind the source_module_manifest.json refs and hashes.

Named Proof Consumers

tests/test_agentic_vulnerability_discovery_patch_proof_replay.py::test_agentic_vulnerability_patch_proof_replay_observes_negative_cases consumes the first-wave fixture and checks the expected counts, negative-case coverage, public trace status, body-import boundary, secret-exclusion scan, and scope limit booleans.
tests/test_agentic_vulnerability_discovery_patch_proof_replay.py::test_agentic_vulnerability_exported_bundle_validates_runtime_shape consumes the exported bundle and checks runtime mode, target/hypothesis/patch counts, executable regression witnesses, source-module manifest status, copied-body count, metadata-only import summary, secret-exclusion status, and public trace span count.
The rejection tests in the same file are the scope limit in executable form: they mutate false-positive promotion, remove regression tests, tamper with executable witnesses, omit exploitability proof, cross-wire verifier result records, and alter source-module digests, then require blocked results and specific error codes instead of allowing patch-proof language.

What It Admits

The validator admits only metadata-only patch-proof evidence where trace refs, abstract proof refs, patch diff refs, regression tests, verifier result records, sandbox verdicts, and cold replay line up.

The result record fields to inspect first are target_count, issue_hypothesis_count, patch_diff_count, regression_test_count, verifier_receipt_count, observed_negative_cases, secret_exclusion_scan, public_agent_execution_trace, body_import_verification, and authority_ceiling.
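A cold reader could pull those fields out of a result record with a short script. This is a sketch that assumes the listed names are top-level keys of the result JSON, which the text does not confirm; `summarize`, `load_and_summarize`, and the `<missing>` marker are hypothetical helpers.

```python
import json

# Field names as listed in the text; their nesting is an assumption.
FIELDS_TO_INSPECT = [
    "target_count",
    "issue_hypothesis_count",
    "patch_diff_count",
    "regression_test_count",
    "verifier_receipt_count",
    "observed_negative_cases",
    "secret_exclusion_scan",
    "public_agent_execution_trace",
    "body_import_verification",
    "authority_ceiling",
]


def summarize(result: dict) -> dict:
    """Return the first-look fields, marking any that are absent."""
    return {name: result.get(name, "<missing>") for name in FIELDS_TO_INSPECT}


def load_and_summarize(path: str) -> dict:
    """Load a result record from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))
```

Calling `load_and_summarize` on a result record path prints nothing by itself; pass the returned dict to `json.dumps(..., indent=2)` for a readable view.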

Prior Art Grounding

This component is grounded in the recent line of agentic software-engineering and security-evaluation work that treats code repair as an executable, test-backed claim rather than a prose claim. SWE-bench popularized repository issue resolution as an LLM task with real codebases and test-based patch evaluation, while SWE-agent made the agent-computer interface itself part of the repair system. Security benchmarks such as CyberSecEval 2 and SecCodePLT motivate separating secure-code or vulnerability capability claims from uninspected generated patches.

Microcosm borrows the accountability pattern: issue hypotheses, trace evidence, patch diffs, regression tests, verifier result records, and negative cases must line up before patch-proof language is allowed. It does not import live targets, CVE exploitation authority, weaponized payloads, or benchmark performance claims.

Source-Backed Doctrine Binding

Component: src/microcosm_core/organs/agentic_vulnerability_discovery_patch_proof_replay.py
Bundle: core/paper_module_capsules.json#paper_module.agentic_vulnerability_discovery_patch_proof_replay
Mechanism: core/mechanism_sources.json#mechanism.agentic_vulnerability_discovery_patch_proof_replay.validates_public_agentic_vulnerability_patch_proof_replay
Standard: standards/std_microcosm_agentic_vulnerability_discovery_patch_proof_replay.json
Evidence class: core/organ_evidence_classes.json::agentic_vulnerability_discovery_patch_proof_replay records algorithmic_projection at rank 3.
Source-module manifest: examples/agentic_vulnerability_discovery_patch_proof_replay/exported_patch_proof_bundle/source_module_manifest.json declares nine copied source/control/standard/tool bodies, including strict_json_source_body_import.
Runtime result record: receipts/runtime_shell/demo_project/organs/agentic_vulnerability_discovery_patch_proof_replay/exported_patch_proof_bundle_validation_result.json
Sign-off result records: receipts/first_wave/agentic_vulnerability_discovery_patch_proof_replay/* and receipts/acceptance/first_wave/agentic_vulnerability_discovery_patch_proof_replay_fixture_acceptance.json

Reader Evidence Routing

Bundle route: core/paper_module_capsules.json::paper_modules[5:paper_module.agentic_vulnerability_discovery_patch_proof_replay] is the JSON authority row. A diagram view is generated for this module; the Atlas card view is a staged exercise pending the component-atlas lane.
Mechanism route: core/mechanism_sources.json::mechanism.agentic_vulnerability_discovery_patch_proof_replay.validates_public_agentic_vulnerability_patch_proof_replay binds the validator command, exported-bundle validator command, focused regression, guardrails, input refs, result record refs, and runtime code locus.
Runtime route: src/microcosm_core/organs/agentic_vulnerability_discovery_patch_proof_replay.py owns run, run_patch_proof_bundle, _source_module_manifest_result, _source_open_body_import_summary, _build_result, _freshness_basis, EXPECTED_NEGATIVE_CASES, AUTHORITY_CEILING, SOURCE_MODULE_MANIFEST_NAME, BUNDLE_RESULT_NAME, and CARD_SCHEMA_VERSION.
Exported-bundle route: examples/agentic_vulnerability_discovery_patch_proof_replay/exported_patch_proof_bundle is the public runtime bundle for the synthetic patch-proof replay. Open source_module_manifest.json before trusting copied-body counts, then inspect the runtime validation result record.
Focused-test route: tests/test_agentic_vulnerability_discovery_patch_proof_replay.py verifies negative cases, public-relative metadata-only result records, exported-bundle runtime shape, exact copied source modules, digest mismatch rejection, command-card result record reuse, and public trace construction.

Cold-Agent Use

Open the source-module manifest first, then the runtime result record, then the component source. The useful claim is not that a real vulnerability was discovered or fixed.

The useful claim is that Microcosm can force an agentic security story to expose synthetic target refs, issue hypotheses, trace evidence, abstract exploitability refs, patch diffs, regression tests, verifier result records, sandbox verdicts, false-positive triage, cold replay, public trace spans, secret-exclusion scan, negative-case result records, and scope limits before patch-proof language is allowed.

Re-entry condition: after the sibling organ_atlas.json lane releases, bind this paper-module bundle, mechanism ref, and code locus into the atlas row and rerun python -m microcosm_core.doctrine_lattice --check.

Negative Cases

The contract rejects live_target_material, real_cve_exploitation, weaponized_payload_export, account_secret_material, network_exfiltration, exploit_instruction_steps, patch_without_tests, and benchmark_score_claim. These are falsification fixtures, not product evidence.
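Negative-case coverage reduces to a set comparison. The sketch below uses a local copy of the case names listed above and a hypothetical `negative_case_gaps` helper; it is not the runtime's EXPECTED_NEGATIVE_CASES handling, whose exact shape the text does not show.

```python
# Local stand-in built from the case names in the text.
EXPECTED = frozenset({
    "live_target_material",
    "real_cve_exploitation",
    "weaponized_payload_export",
    "account_secret_material",
    "network_exfiltration",
    "exploit_instruction_steps",
    "patch_without_tests",
    "benchmark_score_claim",
})


def negative_case_gaps(observed: set[str], expected: frozenset = EXPECTED) -> dict:
    """Report expected cases that were not observed, and unexpected extras."""
    return {
        "missing": sorted(expected - observed),
        "unexpected": sorted(observed - expected),
    }
```

A fixture run would count as covering the floor only when `missing` is empty; passing an empty `expected` set models a mode that declares no negative cases.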

Validation Result Record Path

Run the first-wave fixture validator from the repo root and write its result record outside the repo working tree:

Then run the exported bundle validator:

cd microcosm-substrate && PYTHONPATH=src ../repo-python -m microcosm_core.organs.agentic_vulnerability_discovery_patch_proof_replay \
  run-patch-proof-bundle \
  --input examples/agentic_vulnerability_discovery_patch_proof_replay/exported_patch_proof_bundle \
  --out /tmp/agentic_vulnerability_patch_proof_bundle_receipt \
  --card > /tmp/agentic_vulnerability_patch_proof_bundle_card.json

The focused regression test and corpus projection checks are:

PYTHONPATH=src ./repo-pytest \
  tests/test_agentic_vulnerability_discovery_patch_proof_replay.py
cd microcosm-substrate && PYTHONPATH=src ../repo-python scripts/build_doctrine_projection.py \
  --check-paper-module-corpus

Scope boundary

Scope limit

The result records do not authorize live target testing, real CVE exploitation, weaponized payload export, account secret handling, network exfiltration, actionable exploit instructions, external model access, source-file changes, benchmark security scores, launch, or any whole-system security claim.

This module may claim public fixture evidence that synthetic target refs, issue hypotheses, trace-evidence refs, abstract exploitability refs, patch diff refs, regression-test refs, verifier result records, sandbox verdicts, false-positive triage rows, cold replay rows, public trace spans, source-module digest checks, secret-exclusion scans, negative-case labels, and metadata-only validation result records are checked by the listed runtime witnesses.

This module may not claim live target testing, real CVE exploitation, weaponized payload export, account secret handling, network exfiltration, actionable exploit instructions, live provider behavior, benchmark security scores, patch correctness on real repositories, source-file changes, publishing-scope decision, launch-scope decision, product-progress evidence, or whole-system security.

Source and projection details

Governing Lattice Relation

The governing row is mechanism.agentic_vulnerability_discovery_patch_proof_replay.validates_public_agentic_vulnerability_patch_proof_replay. It binds this reader module to concept.agent_reliability_and_safety_validator_bundle, P-1, P-2, AX-1, and the upstream paper_module.mission_transaction_work_spine dependency. The relation matters because the mechanism is a public safety validator bundle: the paper module can claim that Microcosm checks a source-open, synthetic patch-proof evidence chain, but the lattice ceiling prevents that claim from becoming live vulnerability discovery, exploit proof, benchmark claims, source-file changes, or launch-scope decision.

Source-Open Body Floor

The exported bundle carries nine exact copied source/control/standard/tool bodies under examples/agentic_vulnerability_discovery_patch_proof_replay/exported_patch_proof_bundle/source_modules/. The body floor is governed by source_module_manifest.json, which records digest-verified copies of:

the source pattern ledger
the high-novelty reconstruction result record
the component projection IR
the agent-execution trace runtime and standard
the extracted-pattern route-readiness standard
the mission-transaction preflight wrapper
the mission-transaction landing preflight runtime
the strict JSON helper

Result records and cards do not duplicate those bodies. They carry source_module_manifest_ref, source_open_body_import_refs, source_open_body_imports, body_material_status, and body_copied_material_count so a cold reader can open the real bodies.
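Digest verification of copied bodies can be sketched as below. The manifest layout (`modules`, `path`, `sha256`), the choice of SHA-256, and the error labels are assumptions made for illustration; the real source_module_manifest.json schema should be read from the exported bundle.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_manifest(manifest: dict, root: Path) -> list[str]:
    """Check that every listed body exists under root and matches its digest."""
    errors = []
    for entry in manifest["modules"]:  # hypothetical key
        body = root / entry["path"]
        if not body.is_file():
            errors.append(f"missing:{entry['path']}")
        elif sha256_of(body) != entry["sha256"]:
            errors.append(f"digest_mismatch:{entry['path']}")
    return errors
```

An empty list means every copied body is present and byte-identical to what the manifest recorded; any altered or absent body is named in the output.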

The public result record surface stays free of account secrets, account or browser state, model-output data bodies, browser UI live access, recipient-send state, weaponized payloads, live targets, exploit steps, and account secret-equivalent material.

Agent Route Observability Runtime

Recomputes an agent run's route-compliance score and anti-pattern flags with real trace-analytics code.

5/5

Does This validator takes a sample (synthetic, not live) record of an agent's local run — the route it picked, the work it did, the events it logged, the evidence it pointed to, and the authority limit it declared — and checks that this recorded trail is well-formed and self-consistent, instead of leaving raw log JSON to be read by hand. The record is built to state, up front, where the agent's authority was supposed to stop, so the limits are written down and checkable rather than taken on faith. It checks the recorded evidence; it does not watch a live agent or prove one actually stayed in bounds.

Scope limit It validates only public, recorded trace-feedback metadata and regression fixtures; it does not inspect live operator state, certify or prove runtime behavior, read model-output data, mutate the work log, authorize pattern assimilation, or include launch operations.

Run

PYTHONPATH=src python3 -m microcosm_core.organs.agent_route_observability_runtime run --input fixtures/first_wave/agent_route_observability_runtime/input --out receipts/first_wave/agent_route_observability_runtime

Evidence Contract validator · evidence 5/5 · Import validation

Links to Navigation Hologram Route Plane, Cold Reader Route Map, Routing Anti Patterns Registry, Pattern Binding Contract, Source Projection Import Protocol, Bounded Autonomy Campaign Packet, Saturation Engines Bundle, Proof / Control / Runtime Import Bundle, Unsurfaced Source Primitives Bundle, Trace, Code-Map & Scheduling Engines Bundle, Compliance Pipeline Bundle, Agent Memory Temporal Conflict Replay, Agent Sandbox Policy Escape Replay, Provider Context Recipe Budget Policy, Belief State Process Reward Replay

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Computer-Use Action Trace Replay

Explains component Agent Route Observability Runtime · mechanism validates public route feedback

Governed by principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by axiom Derivation before assertion

Depends on paper modules Agent Route Observability Runtime · Source Projection Import Protocol

computer_use_action_trace_replay is a validator-backed claim contract under agent_route_observability_runtime. It asks a narrow eval-harness question: does a claimed computer-use episode bind visible observations, affordances, actions, pre-action authority verdicts, state-transition result records, recovery result records, cold replay, falsification fixtures, non-public-state scan posture, and an explicit scope limit?

Run:

PYTHONPATH=src ../repo-python -m microcosm_core.cli agent-route-observability-runtime \
  --input examples/agent_route_observability_runtime/exported_computer_use_action_trace_bundle \
  --out receipts/runtime_shell/demo_project/organs/agent_route_observability_runtime \
  validate-computer-use-bundle

The fixture rejects live account action, account secret entry, external network mutation, purchase/send without approval, destructive action without review, hidden screen-state claims, actions without observation and affordance refs, and benchmark-score claims.

Purpose

A computer-use agent produces a stream of screenshots, clicks, keystrokes, and "it worked" assertions. The hard question for anyone reviewing such a trace is not whether the agent moved the mouse, but whether the record actually supports the claim that something happened safely. A trace can look complete while hiding the two failures that matter most: an action that was blocked or sent for review but is later narrated as a success, and a success that is asserted without any state evidence to back it. This module exists to make that question decidable on a synthetic episode, offline, before any of the language reaches a reader.

The single question it answers is: does each recorded action line up, row by row, with a prior visible observation, a pre-action authority verdict, and a state-transition result record whose outcome agrees with that verdict? The mechanism is a typed join, not a screenshot replay. An action must cite the observation it reacted to and an affordance that was visible in it; a verdict must be stamped before the action and must explicitly deny live-account, account secret, network, destructive, and purchase or send authority; a transition result record must then match the verdict. If the verdict said allow, the result record has to show the action was executed and an oracle confirmed the resulting state. If the verdict said block or review, the result record has to show the action was not executed and the status reads blocked or review-required. Nondeterministic "it probably succeeded" claims are refused outright.
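The verdict-to-transition agreement rule can be written as a small check. This is a sketch, not the runtime's code: the key names (`executed`, `oracle_status`, `status`), the status strings, the block-to-blocked and review-to-review_required pairing, and the error labels are assumptions.

```python
def transition_errors(verdict: str, transition: dict) -> list[str]:
    """Check that a transition record agrees with its pre-action verdict."""
    errors = []
    if verdict == "allow":
        # An allowed action must have run and been confirmed by an oracle.
        if not transition.get("executed"):
            errors.append("allowed_action_not_executed")
        if transition.get("oracle_status") != "pass":
            errors.append("oracle_did_not_confirm_state")
    elif verdict in ("block", "review"):
        # A held action must not have run, and its status must say so.
        if transition.get("executed"):
            errors.append("held_action_was_executed")
        expected = "blocked" if verdict == "block" else "review_required"
        if transition.get("status") != expected:
            errors.append("status_disagrees_with_verdict")
    else:
        errors.append("unknown_verdict")
    return errors
```

An oracle status other than `pass`, such as a "probably succeeded" marker, fails the allow branch, which is how this sketch models the refusal of nondeterministic success claims.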

What is genuinely unusual here is the inversion. Most action-trace tooling treats a screenshot as the proof. This module treats the screenshot as the one thing it will not trust: observations enter only as a digest and a visible-state hash, with raw pixels, hidden-state assertions, and live-browser state all required to be absent. The evidence that carries weight is the agreement between the verdict and the transition, not the image. The result record that comes out the other end records counts, refs, hashes, and the redaction posture, and never the raw bodies it checked. It describes a synthetic episode under the route-observability runtime; it does not drive a live browser or desktop.

Shape

Source refs

Component: agent_route_observability_runtime runtime

Diagram source

flowchart TD
  bundle["JSON source record"]
  bundle --> mermaid["generated Mermaid available"]
  bundle --> atlas["generated Atlas linked"]
  bundle --> component["agent_route_observability_runtime runtime"]
  component --> bundle["exported computer-use bundle"]
  bundle --> observations["visible observations: digest + visible-state hash, no raw pixels"]
  observations --> actions["action rows: cite observation + affordance, allowed kind, redacted"]
  actions --> verdicts["pre-action authority verdict per action"]
  verdicts -->|allow| executed["transition: executed + oracle status pass"]
  verdicts -->|block or review| held["transition: not executed + blocked / review-required"]
  held --> recovery["recovery result record, no upgrade to executed"]
  executed --> cold["cold replay reproduces action, verdict, transition"]
  recovery --> cold
  cold --> trace["public trace spans: refs, counts, hashes, redaction posture"]
  trace --> result_record["metadata-only validation result record"]
  result_record --> ceiling["scope limit: no live control"]

The shape is a reader route over a synthetic computer-use action trace validator. The evidence path runs through the source record, fixture manifest, exported bundle, runtime validator, public trace builder, metadata-only result records, and explicit scope limit. A diagram view and Atlas entry are generated for this module from the source record.

Technical Mechanism

The runtime entry point is run_computer_use_action_trace_bundle in src/microcosm_core/organs/agent_route_observability_runtime.py. It first loads the bundle through the strict JSON path and decides whether the input is the full fixture with negative cases or the public exported bundle. It then checks the projection protocol, interaction policy, task episodes, screen observations, action trace, authority verdicts, state transitions, recovery result records, cold replay rows, source-module manifest, non-public-state scan, and public trace spans before writing a result record. The status is pass only when positive findings are empty, required negative cases are observed for the fixture path, the non-public-state scan passes, and copied public source-module digests verify.

The mechanism is a typed join, not a screenshot replay. Actions must cite prior observation and affordance refs. Authority verdicts must cite action ids before state transitions can be credited. Cold replay rows must cover the action ids and reproduce the action, verdict, and transition relation. Recovery result records cover blocked or review-required actions without upgrading them into executed mutations. The public trace builder then emits bounded spans over refs, counts, hashes, and redaction posture, while the result record deliberately omits raw screen bodies, account secrets, hidden screen state, model-output data, private source bodies, absolute local paths, and benchmark-score claims.
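The citation and coverage joins in that paragraph can be sketched as one pass over the action rows. All row shapes here are hypothetical: `seq` as an ordering field, `observation_ref`, `affordance_ref`, `action_id`, and the error labels are illustrative names, not the bundle's schema.

```python
def join_errors(
    observations: list[dict],
    actions: list[dict],
    verdicts: list[dict],
    cold_replay_action_ids: set[str],
) -> list[str]:
    """Check that each action cites a prior observation and a visible affordance,
    and is covered by an authority verdict and a cold replay row."""
    errors = []
    by_id = {obs["id"]: obs for obs in observations}
    verdict_ids = {verdict["action_id"] for verdict in verdicts}
    for action in actions:
        obs = by_id.get(action["observation_ref"])
        # "Prior" is modeled as a strictly lower sequence number (assumption).
        if obs is None or obs["seq"] >= action["seq"]:
            errors.append(f"{action['id']}:no_prior_observation")
        elif action["affordance_ref"] not in obs["affordances"]:
            errors.append(f"{action['id']}:affordance_not_visible")
        if action["id"] not in verdict_ids:
            errors.append(f"{action['id']}:missing_verdict")
        if action["id"] not in cold_replay_action_ids:
            errors.append(f"{action['id']}:not_in_cold_replay")
    return errors
```

Nothing in this check looks at screen content; it only follows refs between rows, which is the sense in which the mechanism is a join rather than a replay.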

Named Proof Consumers

validate-computer-use-bundle is the reader command. On the exported bundle, it should produce exported_computer_use_action_trace_bundle_validation_result.json with four episodes, six observations, eight actions, eight authority verdicts, eight state-transition result records, one recovery result record, four cold replay rows, eight public trace spans, copied source-module digest verification, and an explicit no-live-control scope limit.
tests/test_agent_route_observability_runtime.py::test_computer_use_action_trace_replay_observes_negative_cases is the negative fixture consumer. It checks that live account action, account secret entry, external network mutation, unapproved purchase/send, destructive file action, hidden screen-state claims, action-without-observation rows, and benchmark-score claims are rejected.
tests/test_agent_route_observability_runtime.py::test_computer_use_action_trace_receipt_is_public_relative_and_redacted is the result record-safety consumer. It verifies public-relative paths and absence of account secret values, hidden screen state, absolute paths, and raw bodies.
tests/test_agent_route_observability_runtime.py::test_computer_use_action_trace_exported_bundle_validates_runtime_shape is the public-bundle consumer. It checks the exported-bundle shape, action kinds, source-module digest posture, public trace coverage, and no benchmark authority.
tests/test_agent_route_observability_runtime.py::test_computer_use_trace_loader_rejects_duplicate_json_keys is the parser-integrity consumer. It prevents a replay bundle from passing by hiding conflicting values behind duplicate JSON keys.
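Duplicate-key rejection, the concern of the parser-integrity consumer above, can be illustrated with Python's standard json module, which by default silently keeps the last value for a repeated key. The sketch below is not the project's strict JSON helper; `DuplicateKeyError` and `strict_loads` are hypothetical names, while `object_pairs_hook` is the standard-library parameter.

```python
import json


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a key."""


def _reject_duplicates(pairs: list[tuple[str, object]]) -> dict:
    # Called once per JSON object with its key/value pairs in document order.
    seen: dict = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(f"duplicate JSON key: {key!r}")
        seen[key] = value
    return seen


def strict_loads(text: str):
    """Parse JSON, refusing any object that repeats a key."""
    return json.loads(text, object_pairs_hook=_reject_duplicates)
```

With a plain `json.loads`, a document containing `"status": "blocked"` followed by `"status": "pass"` would parse as a pass; the hook turns that into a hard failure at load time.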

Reader Evidence Routing

Bundle route: core/paper_module_capsules.json::paper_modules[46:paper_module.computer_use_action_trace_replay] is the source-authority row for this module. A diagram view and Atlas entry are generated from that source record.
Dependency route: downstream modules may reference paper_module.computer_use_action_trace_replay, but this page's source authority is the source record named above, not those downstream dependencies.
Fixture-manifest route: core/fixture_manifests/agent_route_observability_runtime.fixture_manifest.json::computer_use_action_trace_replay_contract_v1 names the positive inputs, negative-case floor, expected result record fields, runtime-example command, and scope limit.
Runtime route: src/microcosm_core/organs/agent_route_observability_runtime.py::run_computer_use_action_trace_bundle loads the bundle, validates projection protocol, interaction policy, episodes, observations, actions, authority verdicts, state transitions, recovery result records, cold replay, source-module manifest, negative cases, and public trace spans.
Exported-bundle route: examples/agent_route_observability_runtime/exported_computer_use_action_trace_bundle contains bundle_manifest.json, projection_protocol.json, interaction_policy.json, task_episodes.json, screen_observations.json, action_trace.json, authority_verdicts.json, state_transition_receipts.json, recovery_receipts.json, cold_replay.json, and source_module_manifest.json.
Source-module route: source_module_manifest.json records copied public source bodies for codex/standards/std_agent_execution_trace.json, system/lib/agent_execution_trace.py, and system/lib/strict_json.py, with body_in_receipt: false.
Focused-test route: tests/test_agent_route_observability_runtime.py validates negative cases, public-relative redacted result records, exported-bundle runtime shape, public trace span coverage, source-faithful public refactor status, source digest matching, and duplicate-key rejection.

Prior Art Grounding

This component is grounded in web and desktop agent benchmarks that make action trajectories inspectable. WebArena and Mind2Web anchor realistic web-task evaluation, while OSWorld extends the concern to multimodal agents acting in real computer environments. Browser automation standards such as WebDriver are also prior art for representing actions against visible browser state through a controlled protocol.

Microcosm borrows the action-trace accounting pattern: observations, affordances, actions, pre-action authority verdicts, transition result records, recovery result records, cold replay, and falsification cases must line up before a computer-use episode is credited. It does not operate a live browser or desktop.

The result record proves only this public synthetic replay boundary. It does not control a live browser or desktop, use accounts, enter account secrets, mutate external systems, export raw screenshots, claim benchmark performance, change source files, use external model services, or include launch operations.

Validation Result Record Path

Reader-verifiable bundle command, run from microcosm-substrate/:

PYTHONPATH=src ../repo-python -m microcosm_core.cli agent-route-observability-runtime \
  --input examples/agent_route_observability_runtime/exported_computer_use_action_trace_bundle \
  --out receipts/runtime_shell/demo_project/organs/agent_route_observability_runtime \
  validate-computer-use-bundle

The command writes the computer-use replay result record under receipts/runtime_shell/demo_project/organs/agent_route_observability_runtime/, including computer_use_action_trace_replay_result.json and the exported bundle validation result. The tracked fixture result record records the synthetic observations, affordances, authority verdicts, transition result records, recovery result records, falsification cases, non-public-state scan posture, and scope limit.

This result record path is reader-verifiable evidence only. It does not flip Mermaid/Atlas status, create bundle authority, operate a live browser or desktop, use accounts, enter account secrets, mutate external systems, claim benchmark performance, or aggregate doctrine-lattice coverage.

Scope boundary

Scope limit

This module may claim synthetic computer-use action-trace replay over public fixtures: visible observations, affordances, action rows, pre-action authority verdicts, state-transition result records, recovery result records, cold replay rows, public trace spans, source-module digest checks, expected negative cases, and metadata-only result records.

It does not claim live browser or desktop control, account automation, account secret entry, purchase/send authority, external network mutation, destructive host action, hidden screen-state truth, benchmark performance, provider behavior, source-file changes, launch-scope decision, or whole-system correctness. The diagram view and Atlas entry generated for this module are navigation surfaces; they are not additional proof authority.

Source and projection details

Governing Lattice Relation

The source record binds this module to the accepted agent_route_observability_runtime component and to mechanism.agent_route_observability_runtime.validates_public_route_feedback. That places the page under AX-1 and the P-1 / P-2 claim discipline: a computer-use claim is admissible only when the runtime recomputes it from lower level evidence, and the public sentence cannot exceed what the named validator actually checks. The generated JSON instance records nine resolved edges: component, mechanism, concept, axiom, principle, dependency, and code-locus links.

The relevant concept is concept.agent_reliability_and_safety_validator_bundle, not a generic browser agent benchmark. It frames the replay as an evidence bundle: visible observations and affordances are the basis, action rows are candidate transitions, pre-action authority verdicts decide whether a transition may be executed or blocked, and result record rows carry the bounded public result. The dependencies on agent_route_observability_runtime and macro_projection_import_protocol keep the proof below the source-open import and result record lanes instead of treating this Markdown page as source authority.

Provider Context Recipe Budget Policy
Runs the real context harness to measure assembled byte sizes and check each bundle fits its budget.
4/5 · Runs real tools

Does This component checks that the bundles of context an AI agent would assemble before calling an outside model provider stay inside fixed size limits (in bytes), fill their sections in the declared order until the budget runs out, list any section that was dropped for not fitting, and never carry answer keys, proof solutions, or other "correct answer" material. The record shows the exact size ceilings for each recipe, which sections fit versus got left out, and which output each recipe is allowed to produce, so the context boundary is inspectable as plain accounting before any external model access or answer authority is ever in play. It only validates this metadata; it does not itself call any provider.

Scope limit It validates context-budget projection mechanics (byte ceilings, ordered section fill, omitted-section manifests, deliverable routing, and digest-checked source-body imports) only. It does not make provider/API calls, run Lean/Lake, expose or carry proof or oracle truth-side material, assert theorem or domain-level conclusions, or include launch operations.

Run

PYTHONPATH=src python3 -m microcosm_core.organs.provider_context_recipe_budget_policy run --input fixtures/first_wave/provider_context_recipe_budget_policy/input --out receipts/first_wave/provider_context_recipe_budget_policy

Evidence Bounded runtime computation · evidence 4/5 · Real runtime result

Links to Bounded Autonomy Campaign Packet, Tool Server Pressure Inventory

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Provider Context Recipe Budget

Explains component Provider Context Recipe Budget Policy · mechanism validates public context budget boundary

Governed by principles

Abides by axioms Derivation before assertion · Kernelized verification

provider_context_recipe_budget_policy is the public Microcosm component for turning retrieved proof-support metadata into bounded provider context recipes.

It validates six public recipe shapes: minimal_4kb, premise_16kb, skill_32kb, repair_32kb, fewshot_64kb, and strategy_classification_4kb. Each recipe has a fixed byte ceiling, ordered section fill, a graph role, a reducer deliverable type, and an omitted-sections manifest when a section cannot fit.

Purpose

This component answers one question: when a proof-support pipeline is about to hand material to a model, which sections fit inside a fixed byte budget, in what order, and which sections are dropped? It treats the context window as a budget to spend rather than a place to dump everything retrieved. The board records this stance directly as context_is_budget_not_dump.

The byte sizes are not asserted by the fixture. The validator imports the copied benchmark harness, runs its real _provider_context_pack over each recipe, and measures the actual byte size of each packed section. A recipe is filled in declared order, admitting a section only while the running total stays under the ceiling, so an over-budget section is omitted and named in an explicit manifest rather than silently truncated. If the harness is unavailable the component falls back to declared sizes and says so, rather than guessing.

The second deliberate choice is what cannot enter context at all. A small fixed set of section ids and field keys, covering proof bodies, oracle-only premise ids, ideal answers, and provider output bodies, is rejected structurally, not by convention. Any recipe or section material carrying one of them is blocked before a packet is built. The output is metadata about the context shape: byte ceilings, the admitted and omitted section ids, the deliverable route, and a set of authority claims that stay false. No provider is called and no answer is produced.
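A minimal sketch of what "rejected structurally" can look like, assuming recipes and section materials are plain JSON-like dictionaries. The deny-list entries, the `section_ids` field, and both function names are hypothetical stand-ins; the real ids and keys live in the component source and are not reproduced here.

```python
from typing import Any, Iterator

# Hypothetical deny-lists for illustration only.
FORBIDDEN_SECTION_IDS = frozenset({"proof_body", "oracle_premise_ids", "ideal_answer"})
FORBIDDEN_FIELD_KEYS = frozenset({"proof_body", "provider_output_body", "ideal_answer"})


def forbidden_key_paths(value: Any, path: str = "$") -> Iterator[str]:
    """Yield the path of every forbidden field key found anywhere in a JSON-like value."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if key in FORBIDDEN_FIELD_KEYS:
                yield child_path
            yield from forbidden_key_paths(child, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from forbidden_key_paths(child, f"{path}[{index}]")


def recipe_block_reasons(recipe: dict) -> list[str]:
    """Return every reason a recipe must be blocked before a packet is built."""
    reasons = [
        f"forbidden section id: {section_id}"
        for section_id in recipe.get("section_ids", [])
        if section_id in FORBIDDEN_SECTION_IDS
    ]
    reasons += [f"forbidden field key at {p}" for p in forbidden_key_paths(recipe)]
    return reasons
```

The check looks at keys and ids, never at the text of a section, so a forbidden field is blocked even when its value is empty.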

Shape

Source refs

JSON bundle: paper_module.provider_context_recipe_budget
Reader projection: provider_context_recipe_budget.md
6 public recipe budgets: provider_context_recipes.json
Runtime: provider_context_recipe_budget_policy.py
9 source-backed sections: section_materials.json
8 copied bodies: source_module_manifest.json

Diagram source

flowchart TD
  Bundle["JSON bundle paper_module.provider_context_recipe_budget"] --> Instance["Generated instance 19 relationships, no selective residuals"]
  Bundle --> Markdown["Reader projection provider_context_recipe_budget.md"]
  Recipes["provider_context_recipes.json 6 public recipe budgets"] --> Runtime["provider_context_recipe_budget_policy.py"]
  Sections["section_materials.json 9 source-backed sections"] --> Runtime
  SourceManifest["source_module_manifest.json 8 copied bodies"] --> Runtime
  NegativeCases["negative fixtures 7 forbidden-boundary cases"] --> Runtime
  Runtime --> Projection["context_packets included/omitted sections, byte counts, routes"]
  Runtime --> ResultRecords["metadata-only result records result, board, validation, sign-off"]
  Projection --> Ceiling["scope limit no provider/proof/launch-scope decision"]
  ResultRecords --> Ceiling

Evidence and accounting:

Bundle authority: core/paper_module_capsules.json::paper_modules[55:paper_module.provider_context_recipe_budget] sets source_authority: json_capsule, lists as subjects the component provider_context_recipe_budget_policy plus the mechanism mechanism.provider_context_recipe_budget_policy.validates_public_context_budget_boundary, and names generated_projections.mermaid.status: available_from_capsule_edges plus generated_projections.atlas_card.status: linked_from_capsule_edges.
Generated instance: paper_modules/provider_context_recipe_budget.json::relationships.edges contains 19 bundle-derived relationship edges, and relationships.unpopulated_selective_relations is empty. That is lattice wiring evidence, not implementation-correctness proof.
Runtime accounting: src/microcosm_core/organs/provider_context_recipe_budget_policy.py defines EXPECTED_RECIPE_BUDGETS for the six recipes, EXPECTED_DELIVERABLES for their reducer routes, _recipe_projection for included/omitted section accounting, _recipe_findings and _section_findings for boundary errors, and _write_receipts for metadata-only result record output.
Fixture inputs: fixtures/first_wave/provider_context_recipe_budget_policy/input/provider_context_recipes.json carries six public recipes with byte budgets from 4096 to 65536, while .../section_materials.json carries nine section rows with source refs and anchors.
Body-floor and result records: core/fixture_manifests/provider_context_recipe_budget_policy.fixture_manifest.json records body_copied_material_count: 8, seven negative_case_ids, four expected fixture result record paths, and source_open_body_imports.authority_ceiling fields that keep external model access, Lean/Lake execution, proof authority, truth-side material, payload export, runtime-correctness claims, and launch-scope decision false.
Focused tests: tests/test_provider_context_recipe_budget_policy.py checks the six recipe ids, expected negative cases, source-backed section materials, public-relative redacted result records, exported bundle validation, omitted-section movement when section size changes, digest mismatch rejection, and manifest body-text result record-boundary rejection.

Technical Mechanism

The runtime mechanism is a context-packet compiler plus boundary validator. It does not ask a provider for an answer. run loads fixture inputs with negative cases enabled; run_budget_bundle loads the exported bundle shape without the fixture-only negative cases. Both routes call _build_result, which loads recipe rows, section rows, copied source-module bodies, and the non-public-state scan policy before it constructs any result record.

Recipe projection is deterministic. _recipe_projection walks each recipe's ordered section ids, computes each section's byte size with _byte_size, admits a section only while the running total stays within the recipe's byte_budget, and records omitted sections when the next section would exceed the budget. The projection records graph role, deliverable type, included and omitted section ids, included bytes, approximate tokens, and whether the omitted-sections manifest is emitted. The six public recipes are the closed set in EXPECTED_RECIPE_BUDGETS: minimal_4kb, premise_16kb, skill_32kb, repair_32kb, fewshot_64kb, and strategy_classification_4kb.
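The ordered fill described above can be sketched as follows. This is an illustration under stated assumptions, not the body of `_recipe_projection`: the names `Projection`, `byte_size`, and `project_recipe` are hypothetical, "within the budget" is read as less than or equal to the ceiling, and the sketch assumes that once one section fails to fit, every later section is omitted as well.

```python
from dataclasses import dataclass, field


@dataclass
class Projection:
    included: list[str] = field(default_factory=list)
    omitted: list[str] = field(default_factory=list)
    included_bytes: int = 0

    @property
    def omitted_manifest_emitted(self) -> bool:
        # A manifest is only needed when something was left out.
        return bool(self.omitted)


def byte_size(text: str) -> int:
    """Measure the UTF-8 encoded size, which can exceed len(text)."""
    return len(text.encode("utf-8"))


def project_recipe(section_ids: list[str], bodies: dict[str, str], byte_budget: int) -> Projection:
    """Admit sections in declared order while the running total fits the budget.

    Assumption: after the first section that does not fit, all later
    sections are omitted too, so declared order is never reshuffled.
    """
    projection = Projection()
    budget_exhausted = False
    for section_id in section_ids:
        size = byte_size(bodies[section_id])
        if not budget_exhausted and projection.included_bytes + size <= byte_budget:
            projection.included.append(section_id)
            projection.included_bytes += size
        else:
            budget_exhausted = True
            projection.omitted.append(section_id)
    return projection
```

Because sizes are measured in encoded bytes rather than characters, the same recipe can move a section between the included and omitted lists when its text changes, which is the behaviour the focused tests exercise.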

The validator then checks three independent boundaries. _recipe_findings rejects budget changes, forbidden truth-side section ids, proof/provider body fields, provider-call authorization, deliverable-route drift, and over-budget context with no omitted-sections manifest. _section_findings requires each public section to cite an allowed source ref and source anchor, verifies those anchors against the copied source bodies, and rejects synthetic or truth-side section material. _source_module_findings checks the source-module manifest, expected module ids, metadata-only result record flags, target presence, source/target digest equality, and required anchors for the eight copied source bodies.
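The digest and anchor part of the source-module check can be illustrated with the standard library alone. The sketch below is a guess at the shape, not the code of `_source_module_findings`: the manifest row keys `target_path`, `sha256`, and `required_anchors` and both function names are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_and_newline_count(path: Path, chunk_size: int = 65536) -> tuple[str, int]:
    """Stream a file once, returning its SHA-256 hex digest and newline count."""
    digest = hashlib.sha256()
    newlines = 0
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            newlines += chunk.count(b"\n")
    return digest.hexdigest(), newlines


def copied_module_findings(row: dict, bundle_root: Path) -> list[str]:
    """Check one copied source body against its (hypothetical) manifest row."""
    target = bundle_root / row["target_path"]
    if not target.is_file():
        return [f"target missing: {row['target_path']}"]
    findings = []
    digest, _ = sha256_and_newline_count(target)
    if digest != row["sha256"]:
        findings.append(f"digest mismatch: {row['target_path']}")
    text = target.read_text(encoding="utf-8")
    for anchor in row.get("required_anchors", []):
        if anchor not in text:
            findings.append(f"anchor missing: {anchor}")
    return findings
```

The findings carry only paths and anchor names, in keeping with the rule that result records report hashes and verdicts rather than copied body text.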

The result record mechanism is deliberately metadata-only. _write_receipts writes the fixture result, board, validation result record, and sign-off result record for fixture mode; bundle mode writes only the exported-bundle validation result. result_card emits a compact command card while omitting context packets, source-module imports, source refs, result record paths, private scan hit bodies, and the scope boundary payload. The full result records keep counts, ids, hashes, routes, and verdicts, but no evidence bodies or provider answers.

In lattice terms, the JSON bundle binds this Markdown projection to provider_context_recipe_budget_policy, to mechanism.provider_context_recipe_budget_policy.validates_public_context_budget_boundary, and to concept.agent_reliability_and_safety_validator_bundle. The principle and axiom refs in the bundle (P-1, P-2, P-3, P-6, P-8, P-16 and AX-1, AX-2, AX-5, AX-7, AX-8, AX-9) are implemented here as admission control over public evidence: bounded context metadata is allowed, truth-side material and provider authority are not.

Runtime Surfaces

PYTHONPATH=src python3 -m microcosm_core.organs.provider_context_recipe_budget_policy run \
  --input fixtures/first_wave/provider_context_recipe_budget_policy/input \
  --out receipts/first_wave/provider_context_recipe_budget_policy
PYTHONPATH=src python3 -m microcosm_core.cli provider-context-recipe-budget-policy run-budget-bundle \
  --input examples/provider_context_recipe_budget_policy/exported_provider_context_budget_bundle \
  --out receipts/runtime_shell/demo_project/organs/provider_context_recipe_budget_policy

Named Proof Consumers

The named proof consumer is tests/test_provider_context_recipe_budget_policy.py. It verifies streaming hash and line-count helpers, real-text byte sizing, all six expected recipe ids, all seven negative cases, source-backed section material, public-relative and redacted result records, exported-bundle validation, omitted-section movement when a section becomes small enough to fit, source-module digest mismatch rejection, source/target digest mismatch rejection, manifest and row body-text result record boundary rejection, compact --card output, exact copied source body imports, and fixture-manifest source-open body-floor counts.

The runtime proof consumers are the two module commands in the Validation Result record Path: provider_context_recipe_budget_policy run for fixture mode and provider_context_recipe_budget_policy run-budget-bundle for exported-bundle mode. Fixture mode must observe the negative-case set and write result, board, validation, and sign-off result records. Bundle mode must validate the exported runtime shape and write one metadata-only bundle validation result.

The corpus proof consumer is scripts/build_doctrine_projection.py --check-paper-module-corpus.

Reader Evidence Routing

Start with the JSON Bundle Binding to identify the source record and the launch-safe scope limit before reading any validation result as a capability claim.
Use Structured Lattice Bindings for navigation: it names the component, mechanism, generated row, and runtime code locus that the bundle binds.
Use Validation Result record Path for reproducibility: fixture and bundle commands produce metadata-only result records, the focused pytest exercises negative cases, and the corpus check verifies paper-module projection parity.
The lattice wiring for this module supports discoverability and internal consistency checks; it does not establish external model service, Lean/Lake execution, formal-result correctness, launch-scope decision, or public-send permission.

Negative Cases

budget_overflow_recipe rejects recipes above the public byte ceiling.
truth_side_section rejects oracle-only section ids.
proof_body_leakage rejects proof and provider body fields.
provider_call_authorized rejects any public fixture that authorizes external model access.
deliverable_type_route_mismatch rejects a recipe whose reducer output type changed.
omitted_sections_suppressed rejects over-budget context without an omitted-sections manifest.
synthetic_section_materials rejects section material that lacks an allowed source ref or source anchor, or that is otherwise synthetic.

Why It Matters

Microcosm needs provider context to look like a small operating system, not a prompt dump. This component makes the context boundary inspectable: a cold reader can see the exact byte ceilings, section order, omitted material, and deliverable routes before any provider or proof authority is even in scope.

Prior Art Grounding

The recipe budget is grounded in retrieval-augmented generation and context packing practice. Lewis et al.'s Retrieval-Augmented Generation paper is the direct research anchor for conditioning generation on retrieved supporting material rather than relying only on model parameters. Microcosm narrows that idea into recipe metadata: retrieved proof-support sections are budgeted, ordered, and omitted explicitly before any external model access is in scope.

The command-facing budget style also borrows from the Command Line Interface Guidelines principle of saying enough but not too much. The component turns that UX pressure into fixed byte ceilings, omitted-section manifests, and deliverable-type routing so "more context" does not silently become proof authority or provider authorization.

Validation Result record Path

Run from microcosm-substrate:

PYTHONPATH=src ../repo-python -m microcosm_core.organs.provider_context_recipe_budget_policy run \
  --input fixtures/first_wave/provider_context_recipe_budget_policy/input \
  --out /tmp/microcosm-provider-context-recipe-budget-policy/fixture \
  --card
PYTHONPATH=src ../repo-python -m microcosm_core.organs.provider_context_recipe_budget_policy run-budget-bundle \
  --input examples/provider_context_recipe_budget_policy/exported_provider_context_budget_bundle \
  --out /tmp/microcosm-provider-context-recipe-budget-policy/bundle \
  --card
PYTHONPATH=src ../repo-python -m pytest -p no:cacheprovider tests/test_provider_context_recipe_budget_policy.py -q
PYTHONPATH=src ../repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

A green result record proves only public context-recipe metadata, byte ceilings, omitted sections, deliverable routing, copied source-module refs, and negative cases; it does not use external model services, run Lean or Lake, prove formal-result correctness, export proof bodies, expose oracle-only material, include launch operations, or convert context metadata into proof authority.

Scope boundary

Scope limit

This component does not use external model services, run Lean or Lake, prove a theorem, expose a proof body, or reveal oracle-only truth-side material. Its output is context metadata: which sections would be admitted, which sections were omitted, which deliverable route is allowed, and which authority claims remain false.

The strategy_classification_4kb route emits only strategy_id_classification. It is not a proof-body route and cannot carry a provider answer body.

Scope limit

This module covers only public context-recipe metadata: byte ceilings, ordered section admission, omitted-section manifests, deliverable routing, copied source-module refs, digest and anchor checks, negative cases, and metadata-only result records. It does not authorize provider or API calls, Lean or Lake execution, formal-result correctness, proof-body export, oracle-only truth-side material, provider answer bodies, launch-scope decision, publishing-scope decision, or whole-system correctness.

Source and projection details

Source-Open Body Floor

The public bundle carries exact source bodies for the context recipe compiler, formal ladder consumer, provider result record reducer, set calibration report, transform-job ABI, provider adapter policy, compute-provider policy, and provider-navigation transform result record policy. The validator checks every copied module by digest and required anchors; result records report only paths, hashes, counts, anchor status, and verdicts.

The body floor is deliberately metadata-only at the result record edge: runtime result records may prove copied-module paths, digests, anchor presence, counts, and verdicts, but they must not expose proof bodies, oracle-only truth-side material, provider answer bodies, account state, account secrets, or launch-send authority.

Agent Completion Faithfulness Audit
Runs real git and pytest on a sample repo so wrap-up claims state only what the evidence proves.
4/5 · Runs real tools

Does Runs a public fixture repo through real git and pytest subprocesses, then checks that completion claims only say what the evidence supports: commit object exists, ledger cap exists, and pytest pass is claimed only after exit-zero status was checked.

Scope limit verified means the referenced evidence object exists or a pytest span ran; it does not imply the span passed unless exit-zero status was explicitly checked.

Run

microcosm agent-closeout-faithfulness-audit run --input fixtures/first_wave/agent_closeout_faithfulness_audit/input --out receipts/first_wave/agent_closeout_faithfulness_audit

Evidence External tool run · evidence 4/5 · Real runtime result

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Agent Completion Faithfulness Audit

Explains component Agent Completion Faithfulness Audit · mechanism validates completion evidence claims

Governed by principles Recompute, do not echo · Lower claim strength to checker strength · concept agent reliability and safety validators as bounded public scope limits

Abides by axiom Derivation before assertion

Depends on paper module Durable Agent Work-Landing Replay

agent_closeout_faithfulness_audit checks the kind of sentence an agent writes when it finishes a task: "I committed the change, closed the ledger item, and the test passed." It runs the supplied public fixture evidence through real git and pytest subprocesses and refuses any claim that the evidence does not actually support.

Purpose

When an agent reports that work is done, the report is prose. The commit may or may not exist, the ledger row may or may not be there, and "the test passed" may mean the test ran, or it may mean nothing was checked at all. This component exists to answer one question over a fixed fixture: is each completion claim backed by an evidence object that genuinely exists, and is a "passed" claim backed by an explicit exit-zero status check rather than by the wording of the claim?

The approach is unusual in that it does not parse the completion prose or score it against a rubric. The fixture's public_fixture_repo is copied into a throwaway directory, initialised and committed with real git subprocesses, and its HEAD is read back with git rev-parse. A commit claim passes only when it points at that observed HEAD. A declared pytest span is run with python -m pytest <nodeid> inside that temporary repo, and only the exit code decides whether the span passed. The result record records the run as bytes of work that happened, not as a paraphrase of what the agent said.

The distinction the audit defends is narrow and easy to lose. "The span ran" and "the span passed" are separate facts, and a completion sentence that conflates them is the precise failure mode here. A pass claim is admitted only when pass_status_checked is true and the subprocess exited zero; a claim that expected a pass without that check is rejected with CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED. The same separation applies to commits and ledger caps, so a referenced commit object is not treated as a landed change and a named cap is not treated as closed work.
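The separation between "ran" and "passed" can be made concrete with a small sketch. It relies on pytest's documented exit codes (0 means all collected tests passed, 1 means tests ran and some failed, and higher codes mean interruption, internal error, usage error, or nothing collected). The function name, the returned field layout, and the `HYPOTHETICAL_SPAN_DID_NOT_EXIT_ZERO` label are assumptions for illustration; only `CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED` comes from the component.

```python
from typing import Optional

# pytest exit codes that mean the selected tests were actually executed.
RAN_EXIT_CODES = frozenset({0, 1})


def audit_pass_claim(returncode: int, pass_status_checked: bool) -> dict:
    """Classify a 'the test passed' claim from the subprocess exit code alone."""
    span_ran = returncode in RAN_EXIT_CODES
    error: Optional[str]
    if not pass_status_checked:
        # The claim asserted a pass without anyone looking at the exit status.
        error = "CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED"
    elif returncode != 0:
        # Hypothetical label: the source does not name a code for this branch.
        error = "HYPOTHETICAL_SPAN_DID_NOT_EXIT_ZERO"
    else:
        error = None
    return {"span_ran": span_ran, "pass_admitted": error is None, "error": error}
```

A span that exits zero but was never status-checked still counts as having run, yet the pass claim is refused, which is exactly the conflation the audit is built to catch.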

Route Card

Component id: agent_closeout_faithfulness_audit
Accepted-component evidence class: external_subprocess_witness
Standard: standards/std_microcosm_agent_closeout_faithfulness_audit.json
Runner: src/microcosm_core/organs/agent_closeout_faithfulness_audit.py
Fixture input: fixtures/first_wave/agent_closeout_faithfulness_audit/input
Runtime bundle: examples/agent_closeout_faithfulness_audit/exported_agent_closeout_faithfulness_audit_bundle
Source manifest: examples/agent_closeout_faithfulness_audit/exported_agent_closeout_faithfulness_audit_bundle/source_module_manifest.json
Primary result records: receipts/first_wave/agent_closeout_faithfulness_audit/agent_closeout_faithfulness_audit_result.json, receipts/first_wave/agent_closeout_faithfulness_audit/agent_closeout_faithfulness_audit_board.json, receipts/first_wave/agent_closeout_faithfulness_audit/agent_closeout_faithfulness_audit_validation_receipt.json, and result records/sign-off/first_wave/agent_completion_faithfulness_audit_fixture_acceptance.json
Generated posture: this paper module is authored doctrine. Its generated projections are refreshed through their owner commands instead of being patched by hand.

Shape

This module is a completion-claim accounting fixture, not a completion oracle. Its single question is: did the supplied public fixture evidence support the completion claims, and did the result record refuse the overclaims that should not pass?

Source refs

3 fixture claims: closeout_claims.json
Audit: agent_closeout_faithfulness_audit.run
2 cap rows: fixture_ledger.json
declared nodeid: tests/test_closeout_fixture.py
1 exact-copy source body: source_module_manifest.json

Diagram source

flowchart TD
  Claims[closeout_claims.json 3 fixture claims] --> Audit[agent_closeout_faithfulness_audit.run]
  Ledger[fixture_ledger.json 2 cap rows] --> Audit
  Repo[public_fixture_repo git fixture] --> Audit
  Pytest[tests/test_closeout_fixture.py declared nodeid] --> Audit
  Manifest[source_module_manifest.json 1 exact-copy source body] --> Audit
  Audit --> Pass[pass result record 3 verified claims]
  Audit --> Neg[negative-case semantics 4 overclaim classes]
  Audit --> Ceiling[scope limit no live mutation or launch]

The accounting is source-backed:

Evidence input	Runtime check	Result record/accounting field
`closeout_claims.json` carries `claim_public_head_exists`, `claim_cap_exists`, and `claim_pytest_span_passed`	`evaluate()` loops over the three claim rows in `src/microcosm_core/organs/agent_closeout_faithfulness_audit.py`	`claim_count: 3`, `verified_claim_count: 3`
`public_fixture_repo` is copied into a temporary git repo	`_prepare_public_fixture_repo()` runs `git init`, config, add, commit, and `rev-parse HEAD` subprocesses	`git_subprocess_count: 6`, `head_verified_by_subprocess: true`
`fixture_ledger.json` names fixture cap rows	`task_ledger_cap` claims must match `task_ledger_caps[].cap_id`	`cap_fixture_closeout_receipt_exists` is accepted; missing caps emit `CLOSEOUT_FAKE_CAP_CLAIM`
`tests/test_closeout_fixture.py::test_public_fixture_addition` is the declared pytest span	`evaluate()` runs `python -m pytest <nodeid> -q` and records return code, `span_ran`, and explicit pass-status checking	`pytest_subprocess_count: 1`, `pytest_span_ran_count: 1`, `pytest_pass_status_checked_count: 1`
`source_module_manifest.json` names one copied source body	the bundle validator checks digest equality, line count, required anchors, and metadata-only result record posture	`module_count: 1`, `line_count: 1703`, `sha256_match: true`, `body_in_receipt: false`

Negative cases are part of the Shape rather than an appendix because they define the claim boundary. EXPECTED_NEGATIVE_CASES names fake commit, fake cap, fake pytest node, and unchecked-pytest-pass classes; the focused tests assert the first three directly against fixture mutation and assert unchecked pass rejection against CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED. The runtime-bundle result record observes all four classes, so a cold reader can distinguish "the span ran" from "the pass claim had exit-zero evidence."

The source-body route is deliberately narrow. The exported bundle copies exactly system/lib/agent_experience_diagnostics.py to examples/agent_closeout_faithfulness_audit/exported_agent_closeout_faithfulness_audit_bundle/source_modules/system/lib/agent_experience_diagnostics.py; the manifest carries the matching digest, 1703 lines, required anchors Agent Experience Grand Rounds and completion, and body_in_receipt: false. Result records carry refs, hashes, counts, verdicts, and scope boundaries only. They do not carry copied body text, private root paths, model-output data, account or browser state, live work log authority, source-file changes, launch-scope decision, or whole-system completion truth.

Technical Mechanism

The fixture validator is centered on evaluate() in src/microcosm_core/organs/agent_closeout_faithfulness_audit.py. It loads closeout_claims.json and fixture_ledger.json, copies public_fixture_repo into a temporary repository, initializes and commits that copy with real git subprocesses, and records the resulting HEAD through git rev-parse HEAD. Commit claims pass only when the claim ref is HEAD or the actual subprocess-observed HEAD; fixture cap claims pass only when the cap id appears in the fixture ledger.
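As a rough sketch of the git witness path, the following copies a fixture directory into a throwaway repository, commits it, and reads HEAD back through subprocesses. It is not the body of `_prepare_public_fixture_repo()`: the helper names, the commit identity, and the split of the config step into two calls are assumptions, chosen so that the sketch happens to issue six git subprocesses.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git subprocess inside repo and return its stripped stdout."""
    completed = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return completed.stdout.strip()


def observed_head(fixture_repo: Path) -> tuple[str, int]:
    """Commit a copy of the fixture and return (HEAD, git subprocess count)."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp) / "repo"
        shutil.copytree(fixture_repo, repo)
        commands = [
            ("init", "-q"),
            ("config", "user.name", "fixture"),  # placeholder identity
            ("config", "user.email", "fixture@example.invalid"),
            ("add", "-A"),
            ("commit", "-q", "--no-gpg-sign", "-m", "public fixture"),
        ]
        for command in commands:
            git(repo, *command)
        head = git(repo, "rev-parse", "HEAD")
        return head, len(commands) + 1


def commit_claim_passes(claim_ref: str, head: str) -> bool:
    """A commit claim passes only for the literal HEAD or the observed hash."""
    return claim_ref in ("HEAD", head)
```

The original fixture directory is never touched; every git command runs against the temporary copy, which is removed when the function returns.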

For pytest claims, evaluate() runs python -m pytest <nodeid> -q inside the temporary public fixture repo. A span can be counted as observed when the nodeid runs, but a pass claim is accepted only when pass_status_checked is true and the pytest subprocess exits zero. The same source file carries evaluate_negative_case(), which mutates one claim row at a time to force the fake commit, fake cap, fake pytest node, and unchecked pass paths. The expected error codes are declared in EXPECTED_NEGATIVE_CASES, so the negative floor is source-bound rather than inferred from prose.
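The one-row-at-a-time mutation idea can be sketched without subprocesses by evaluating claims against a fixed evidence snapshot. The case ids and error codes below are the ones this page lists; the claim row schema, the `EVIDENCE` snapshot, the mutation table, and the `_sketch` function names are hypothetical stand-ins for `evaluate()` and `evaluate_negative_case()`.

```python
import copy

EXPECTED_NEGATIVE_CASES = {
    "fake_commit_claim": "CLOSEOUT_FAKE_COMMIT_CLAIM",
    "fake_cap_claim": "CLOSEOUT_FAKE_CAP_CLAIM",
    "fake_test_claim": "CLOSEOUT_FAKE_TEST_CLAIM",
    "unchecked_pass_claim": "CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED",
}

NODEID = "tests/test_closeout_fixture.py::test_public_fixture_addition"
# Stand-in for what the git and pytest subprocesses would have observed.
EVIDENCE = {"head": "a" * 40, "cap_ids": {"cap_fixture_closeout_receipt_exists"}, "nodeids": {NODEID}}

BASE_CLAIMS = [
    {"claim_id": "claim_public_head_exists", "kind": "commit", "ref": "HEAD"},
    {"claim_id": "claim_cap_exists", "kind": "task_ledger_cap", "ref": "cap_fixture_closeout_receipt_exists"},
    {"claim_id": "claim_pytest_span_passed", "kind": "pytest_span", "ref": NODEID, "pass_status_checked": True},
]

# Each case patches exactly one claim row: (row index, fields to overwrite).
MUTATIONS = {
    "fake_commit_claim": (0, {"ref": "f" * 40}),
    "fake_cap_claim": (1, {"ref": "cap_does_not_exist"}),
    "fake_test_claim": (2, {"ref": "tests/test_closeout_fixture.py::test_missing"}),
    "unchecked_pass_claim": (2, {"pass_status_checked": False}),
}


def evaluate_sketch(claims: list[dict], evidence: dict) -> list[str]:
    """Return the error codes raised by claims the evidence does not support."""
    errors = []
    for claim in claims:
        kind, ref = claim["kind"], claim["ref"]
        if kind == "commit" and ref not in ("HEAD", evidence["head"]):
            errors.append("CLOSEOUT_FAKE_COMMIT_CLAIM")
        elif kind == "task_ledger_cap" and ref not in evidence["cap_ids"]:
            errors.append("CLOSEOUT_FAKE_CAP_CLAIM")
        elif kind == "pytest_span":
            if ref not in evidence["nodeids"]:
                errors.append("CLOSEOUT_FAKE_TEST_CLAIM")
            elif not claim.get("pass_status_checked", False):
                errors.append("CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED")
    return errors


def negative_case_sketch(case_id: str) -> list[str]:
    """Mutate one claim row on a deep copy and evaluate the mutated set."""
    index, patch = MUTATIONS[case_id]
    claims = copy.deepcopy(BASE_CLAIMS)
    claims[index].update(patch)
    return evaluate_sketch(claims, EVIDENCE)
```

Mutating a deep copy keeps the positive fixture intact, so each negative case is judged in isolation and a single expected code per case is a meaningful assertion.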

The exported-bundle path uses run_agent_closeout_bundle() against examples/agent_closeout_faithfulness_audit/exported_agent_closeout_faithfulness_audit_bundle. That path reuses the same evaluator while making the source-module manifest floor mandatory: the copied diagnostic body must match the manifest digest, include required anchors, and remain absent from result records. AUTHORITY_CEILING then records the scope boundaries in machine-readable form: no live repo mutation, no launch-scope decision, no work log closure, and no pytest-pass claim without exit-zero evidence.

Named Proof Consumers

microcosm_core.organs.agent_closeout_faithfulness_audit.run is the first-wave fixture consumer. It materializes the public fixture repo, ledger, completion-claim rows, semantic negative cases, validation result record, board, and sign-off result record.
microcosm_core.organs.agent_closeout_faithfulness_audit.run_agent_closeout_bundle is the exported-bundle consumer. It validates the source-open bundle and the copied diagnostic body manifest while preserving body_in_receipt: false.
microcosm_core.organs.agent_closeout_faithfulness_audit.evaluate is the subprocess witness consumer. It checks commit, cap, and pytest-span claims against actual fixture evidence instead of accepting completion prose.
microcosm_core.organs.agent_closeout_faithfulness_audit.evaluate_negative_case is the falsification consumer for fake commit, fake cap, fake nodeid, and unchecked pytest-pass overclaims.
tests/test_agent_closeout_faithfulness_audit.py is the focused regression consumer. It asserts the public subprocess witness path, fake-claim rejections, semantic negative-case evaluation, exported-bundle metadata-only source manifest behavior, digest-mismatch rejection, and pytest-capable interpreter selection.

First Commands

From microcosm-substrate:

Validate the exported bundle when the question is whether the public source-open copy still matches the declared source body:

What It Proves

This component checks completion claims against public fixture evidence instead of trusting completion prose. A positive run proves four things:

the fixture repo exists and the referenced commit object is visible to real git subprocesses;
fixture HEAD is checked by subprocess evidence rather than by prose;
the declared pytest span actually ran;
work-log-style cap claims point only at rows present in the fixture ledger.

The useful distinction is narrow: verified means the referenced evidence object exists or the pytest span ran. A claim that a pytest span passed is valid only when the result record checked an explicit exit-zero status. That is the reader value of this component: it separates "I referenced a test" from "I proved the test passed."

Prior Art Grounding

This component is grounded in claim-verification and reproducibility patterns rather than in trust of summary prose. FEVER popularized fact extraction and verification as a separate task over cited evidence, while TruthfulQA made explicit that fluent model answers can be misleading without a truthfulness check. The artifact-review tradition also motivates separating a claim, its artifact, and its validation evidence instead of treating a report as self-validating.

Microcosm borrows that verification posture for agent completion: commit refs, work log refs, pytest spans, subprocess witnesses, and pass-status checks must line up before completion language is admitted. It does not certify all live completion prose or turn a referenced test into a passed test without exit-zero evidence.

Source-Backed System

The source-open body import is a single exact source body:

system/lib/agent_experience_diagnostics.py

The copied target is:

examples/agent_closeout_faithfulness_audit/exported_agent_closeout_faithfulness_audit_bundle/source_modules/system/lib/agent_experience_diagnostics.py

The manifest records:

source_to_target_relation: exact_copy;
body_copied: true;
body_in_receipt: false;
a 1703-line body;
matching source and target sha256 digests;
required anchors Agent Experience Grand Rounds and completion.

Result records carry refs, hashes, counts, verdicts, and scope boundaries only.
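The manifest checks listed above can be approximated in a few lines. This is a minimal sketch under stated assumptions: `check_exact_copy` and its return keys are hypothetical names, and the real manifest schema may differ. It shows the posture the section describes, where digests, line counts, and anchor presence are reported while the body itself never enters the result.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def sha256_of(path: Path) -> str:
    """Hex sha256 digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_exact_copy(source: Path, target: Path, expected_sha256: str,
                     expected_lines: int, required_anchors: Sequence[str]) -> dict:
    """Hypothetical helper: verify an exact copy and return metadata only."""
    body = target.read_text(encoding="utf-8")
    source_digest = sha256_of(source)
    target_digest = sha256_of(target)
    return {
        "digests_match": source_digest == target_digest == expected_sha256,
        "line_count_matches": len(body.splitlines()) == expected_lines,
        "anchors_present": all(anchor in body for anchor in required_anchors),
        "target_sha256": target_digest,
        "body_in_receipt": False,  # `body` is inspected but never returned
    }
```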

Result record Floor

A passing fixture run emits:

agent_closeout_faithfulness_audit_result.json
agent_closeout_faithfulness_audit_board.json
agent_closeout_faithfulness_audit_validation_receipt.json
agent_closeout_faithfulness_audit_fixture_acceptance.json

A passing runtime-bundle run emits:

exported_agent_closeout_faithfulness_audit_bundle_validation_result.json
agent_closeout_faithfulness_audit_board.json
agent_closeout_faithfulness_audit_validation_receipt.json

The first-wave result must show:

status: pass;
real_substrate_disposition: real_substrate_capsule;
body_in_receipt: false;
source_module_manifest.status: pass;
all_expected_digests_matched: true;
all_required_anchors_present: true;
secret_exclusion_scan.blocking_hit_count: 0;
receipt_body_scan.status: pass.

The exercise floor is:

three verified completion claims;
six git subprocess witnesses;
one pytest subprocess witness;
one checked pass status;
one ran pytest span;
head_verified_by_subprocess: true.

Negative Cases

The current negative-case floor is:

fake_commit_claim -> CLOSEOUT_FAKE_COMMIT_CLAIM
fake_cap_claim -> CLOSEOUT_FAKE_CAP_CLAIM
fake_test_claim -> CLOSEOUT_FAKE_TEST_CLAIM
unchecked_pass_claim -> CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED

These cases are the claim-language guardrail. If they stop appearing in observed negative cases, the component no longer proves that public completion result records reject fabricated commit, cap, test-node, or unchecked-pytest-pass claims.
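One way to picture the guardrail is a floor check over observed negative cases. The case ids and error codes below are taken from the list above; the function name and the shape of the `observed` mapping are assumptions for illustration, not the component's API.

```python
from typing import Dict, List

# Case id -> error code, as listed in the negative-case floor above.
EXPECTED_CLOSEOUT_NEGATIVE_CASES: Dict[str, str] = {
    "fake_commit_claim": "CLOSEOUT_FAKE_COMMIT_CLAIM",
    "fake_cap_claim": "CLOSEOUT_FAKE_CAP_CLAIM",
    "fake_test_claim": "CLOSEOUT_FAKE_TEST_CLAIM",
    "unchecked_pass_claim": "CLOSEOUT_PYTEST_PASS_STATUS_NOT_CHECKED",
}


def missing_negative_cases(observed: Dict[str, str]) -> List[str]:
    """Return case ids whose expected error code was not actually observed."""
    return sorted(
        case_id
        for case_id, code in EXPECTED_CLOSEOUT_NEGATIVE_CASES.items()
        if observed.get(case_id) != code
    )
```

An empty return value means the floor holds; any returned case id means the pass path should no longer be trusted as proof of overclaim rejection.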

Evidence Binding

JSON bundle authority: core/paper_module_capsules.json#paper_module.agent_closeout_faithfulness_audit.
Mechanism source: core/mechanism_sources.json#mechanism.agent_closeout_faithfulness_audit.validates_closeout_evidence_claims.
Component atlas edge: core/organ_atlas.json#agent_closeout_faithfulness_audit.
Runtime source: src/microcosm_core/organs/agent_closeout_faithfulness_audit.py.
First command: PYTHONPATH=src python3 -m microcosm_core.organs.agent_closeout_faithfulness_audit run --input fixtures/first_wave/agent_closeout_faithfulness_audit/input --out receipts/first_wave/agent_closeout_faithfulness_audit --acceptance-out receipts/acceptance/first_wave/agent_closeout_faithfulness_audit_fixture_acceptance.json.

Reader Evidence Routing

Start with the Route Card and JSON Bundle Binding to identify the component, standard, source row, runner, fixture input, exported bundle, and result record surfaces.
For behavior questions, read src/microcosm_core/organs/agent_closeout_faithfulness_audit.py and the focused tests before trusting this prose.
For source-open body questions, read the exported bundle's source_module_manifest.json; the manifest is the evidence for exact-copy relation, digest match, anchor match, and metadata-only result record posture.
For claim-language questions, read the Negative Cases and Result record Expectations together; the pass path only matters if the overclaim cases still fail.
Treat generated component Markdown, atlas cards, graphs, health files, and runtime result records as navigation or validation projections. They do not become source authority for broader completion truth.

Validation Result record Path

The focused proof consumer is tests/test_agent_closeout_faithfulness_audit.py. A passing result record has to show that completion language was checked against public fixture evidence: referenced commit objects, fixture work log rows, git subprocess witnesses, pytest subprocess witnesses, explicit pass-status checks, negative completion cases, and the exported source-module manifest. It must not rely on completion prose as its own proof.

./repo-pytest tests/test_agent_closeout_faithfulness_audit.py -q --basetemp=/tmp/microcosm_agent_closeout_faithfulness_audit_pytest
./repo-python scripts/build_doctrine_projection.py --check-paper-module-corpus

For the focused test, the result record boundary is the asserted shape: three verified completion claims, at least five git subprocess witnesses, one pytest subprocess witness, one ran pytest span, one checked pass-status row, head_verified_by_subprocess=true, source-module digest and required-anchor matches, metadata-only result record posture, and semantic observation of the four negative completion classes. For the corpus check, the result record only proves bundle/instance parity; it does not close live work log work, mutate live work log state, certify arbitrary completion prose, prove launch-scope decision, or turn a referenced pytest span into a passed span without exit-zero evidence.

Validation Anchors

Focused coverage lives in tests/test_agent_closeout_faithfulness_audit.py and checks:

public git and pytest subprocess witness behavior;
fake commit rejection;
unchecked pytest pass rejection;
fake cap claim rejection;
fake pytest node id rejection;
metadata-only source manifest behavior in the exported bundle;
source-module digest mismatch rejection;
pytest-capable Python selection.

Scope boundary

Scope limit

This module may claim public fixture evidence that completion claims are checked against referenced commit objects, fixture work log rows, pytest subprocess witnesses, explicit pass-status checks, negative completion cases, a copied diagnostic body, source-module manifest digest equality, metadata-only result record posture, and validation result records.

This module may not claim live completion truth, live work log mutation, live Git mutation, external model access, source-file changes, launch-scope decision, publishing-scope decision, deployment posture, all-agent faithfulness, formal-result correctness beyond the listed witnesses, or whole-system correctness.

Scope limit

This component is a public fixture witness for completion evidence. It does not:

prove arbitrary live commits landed;
close or mutate work log work;
mutate Git state;
include launch operations;
use external model services;
certify all completion prose;
turn a ran pytest span into a passed span without an explicit exit-zero check.

Its useful claim is narrower: over the supplied fixture repo, fixture ledger, completion claims, and copied diagnostic body, the component proves that completion evidence references are checked and that specific overclaims are refused.

Source and projection details

Governing Lattice Relation

The governing mechanism, mechanism.agent_closeout_faithfulness_audit.validates_closeout_evidence_claims, is active in core/mechanism_sources.json and says the component validates public completion evidence claims through fixture commit objects, fixture HEAD evidence, git subprocesses, pytest span execution, explicit pass-status checks, fixture-ledger cap rows, copied source-module digests, and stable overclaim negative cases before writing metadata-only result records.

The doctrine edge is narrow and constructive. The JSON instance reports concept.agent_reliability_and_safety_validator_bundle, principles P-1 and P-2, axiom AX-1, and dependency paper_module.durable_agent_work_landing_replay; those edges explain why this module is a validator-bundle proof instrument rather than a general completion truth oracle. The generated Mermaid and Atlas edges are navigation result records for that binding, not launch or correctness authority.

Bounded Autonomy Campaign Packet

Drafts proposed work from coverage gaps and proves it cannot repair or rewrite the code itself.

4/5 · Runs real tools

Does Turns synthetic coverage gaps into a draft candidate packet in a subprocess and records the boundary that it proposes work but cannot repair itself or write source.

Scope limit self-proposal campaign packet only; no self-repair or unsupervised source-file changes

Run

microcosm bounded-autonomy-campaign-packet run --input fixtures/first_wave/bounded_autonomy_campaign_packet/input --out receipts/first_wave/bounded_autonomy_campaign_packet

Evidence: External tool run · evidence 4/5 · Real runtime result

ai-safety · agent-evaluation · red-teaming

Source Design note · Source atlas

Paper module Bounded Autonomy Campaign Packet

Explains: component Bounded Autonomy Campaign Packet · mechanism validates public bounded autonomy campaign packet

Governed by: principles

Abides by: axiom Derivation before assertion

bounded_autonomy_campaign_packet is a Crown Jewel import component with a real runnable system and a strict public scope limit. It consumes synthetic public fixtures, copied source bodies, and source manifests that verify sha256 digests, line counts, required anchors, secret-exclusion status, and result record body omission.

What it proves: self-proposal campaign packet only; no self-repair or unsupervised source-file changes.

Purpose

An agent can usefully notice its own coverage gaps and draft a plan to close them. The danger is that "draft a plan" quietly becomes "do the work": a proposal grows a write surface, and a system that was meant to suggest starts mutating its own source unsupervised. This component exists to keep those two steps apart. It answers one question: can an agent emit a draft campaign proposal from real coverage gaps without that proposal carrying any authority to act on them?

The design choice that makes this interesting is where the candidate count comes from. The component does not invent a plausible-looking list of work. It runs a real source campaign builder in read-only mode (build_standard_skill_pairing_campaign.py --check --report) and accepts its witness only when the builder reports candidate targets and leaves wrote_packet unset. The proposal is therefore derived from a surface that could do real work, observed in a mode where it did not. Each drafted candidate is then stamped write_surface: none, source_mutation_authorized: false, and requires_human_review: true, so the act of proposing can never be mistaken for the act of authorising.

Two refusals guard the boundary. A campaign policy that lists write_source among its allowed actions is rejected outright, before any candidate is drafted. And a campaign digest that already appears in the failed-campaign ledger more than once is refused, so a plan that has already failed cannot be quietly re-proposed under a fresh wrapper. Both refusals are checked by mutating the fixture and confirming the expected error code fires, not by trusting a declared label.

Shape

Source refs

Read-only builder witness (--check --report): build_standard_skill_pairing_campaign.py

Diagram source

flowchart TD
  Inputs["Public synthetic inputs: coverage_gaps, campaign_policy, failed_campaign_digests"]
  PolicyGate{"campaign_policy allows write_source?"}
  Witness["Read-only builder witness: build_standard_skill_pairing_campaign.py --check --report"]
  WitnessGate{"reports candidate targets and wrote_packet unset?"}
  Draft["Draft candidate packet: write_surface: none, requires_human_review, source_mutation: false"]
  DigestGate{"failed digest repeated?"}
  Refuse["Refuse: SOURCE_WRITE_FORBIDDEN / REPEATED_FAILED_DIGEST / witness blocked"]
  ResultRecords["metadata-only result records: refs, digests, stdout/stderr hashes; builder output bodies excluded"]
  Ceiling["Scope limit: no self-repair, source-file changes, providers, launch, or public sharing"]
  Inputs --> PolicyGate
  PolicyGate -- "yes" --> Refuse
  PolicyGate -- "no" --> Witness
  Witness --> WitnessGate
  WitnessGate -- "no" --> Refuse
  WitnessGate -- "yes" --> Draft
  Draft --> DigestGate
  DigestGate -- "yes" --> Refuse
  DigestGate -- "no" --> ResultRecords
  Refuse --> ResultRecords
  ResultRecords --> Ceiling

This diagram is a reader aid. The machine graph remains the generated paper_module.bounded_autonomy_campaign_packet.mermaid projection derived from the JSON source record.

Technical Mechanism

The runtime is intentionally narrower than "autonomous repair." SPEC declares the four required public inputs, the source-module manifest, the expected negative cases, and an AUTHORITY_CEILING in which self-repair, unsupervised source-file changes, source-write packets, external model access, and launch are all false. run() and run_bounded_autonomy_bundle() then route both the fixture and exported bundle through run_crown_jewel_organ, so the same evaluator, source-manifest checks, metadata-only result record policy, and semantic negative-case evaluator guard both command surfaces.

The positive lane is witnessed by _campaign_builder_witness(), not by a fictional campaign row. It invokes tools/meta/factory/build_standard_skill_pairing_campaign.py --check --report --max-targets <n> from the source root, then accepts the witness only when the builder returns standard_skill_pairing_campaign_summary, reports at least one candidate target, emits a source_digest, and leaves wrote_packet unset. This makes the campaign packet a read-only proposal derived from a real builder surface; the result record stores return code, digest fields, and stdout/stderr hashes, but keeps builder output bodies out of the result record.
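The acceptance rule for the builder witness can be sketched as a predicate over captured subprocess output. This is a hedged illustration, not `_campaign_builder_witness()` itself. It assumes the builder prints a JSON report on stdout, that the summary name appears under a `kind` key, that the target count is an integer field named `candidate_target_count`, and that a zero return code is required; none of those details are confirmed by the text above.

```python
import hashlib
import json

SUMMARY_KIND = "standard_skill_pairing_campaign_summary"


def _sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def accept_builder_witness(returncode: int, stdout: str, stderr: str) -> dict:
    """Hypothetical acceptance check; returns metadata only, never output bodies."""
    record = {
        "returncode": returncode,
        "stdout_sha256": _sha256_text(stdout),
        "stderr_sha256": _sha256_text(stderr),
        "status": "blocked",
    }
    try:
        report = json.loads(stdout)  # assumed: the builder reports JSON on stdout
    except json.JSONDecodeError:
        return record
    if not isinstance(report, dict):
        return record
    count = report.get("candidate_target_count")  # assumed field name
    accepted = (
        returncode == 0  # assumed requirement
        and report.get("kind") == SUMMARY_KIND  # assumed key
        and isinstance(count, int) and not isinstance(count, bool) and count >= 1
        and bool(report.get("source_digest"))
        and not report.get("wrote_packet")  # must be unset for a read-only run
    )
    if accepted:
        record.update(status="pass", candidate_target_count=count,
                      source_digest=report["source_digest"])
    return record
```

In a real run the three arguments would come from something like `subprocess.run(..., capture_output=True, text=True)`; the sketch takes them as plain values so the acceptance logic can be read in isolation.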

_candidate_packet_subprocess() converts the witnessed target count into draft candidate rows. Each candidate is tied to one fixture coverage gap when available, carries the builder ref and builder source digest, sets write_surface: none, requires human review, and records source_mutation_authorized: false. evaluate() then applies the policy checks: write_source in campaign_policy.allowed_actions is a hard refusal; blocked builder witness or empty candidate packet is a hard refusal; any candidate that authorizes source-file changes or writes to the source surface is also refused.
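The drafting and refusal steps can be sketched together. The stamped fields (`write_surface`, `requires_human_review`, `source_mutation_authorized`) and the `BOUNDED_AUTONOMY_SOURCE_WRITE_FORBIDDEN` code come from the text; the function names, the `candidate_id` format, and the other verdict strings are assumptions made for this example.

```python
from typing import Any, Dict, List, Optional, Sequence


def draft_candidates(target_count: int, coverage_gaps: Sequence[str],
                     builder_ref: str, builder_source_digest: str) -> List[Dict[str, Any]]:
    """Turn a witnessed target count into draft rows with no write authority."""
    candidates = []
    for index in range(target_count):
        gap: Optional[str] = coverage_gaps[index] if index < len(coverage_gaps) else None
        candidates.append({
            "candidate_id": f"candidate-{index + 1}",  # hypothetical id scheme
            "coverage_gap": gap,
            "builder_ref": builder_ref,
            "builder_source_digest": builder_source_digest,
            "write_surface": "none",
            "requires_human_review": True,
            "source_mutation_authorized": False,
        })
    return candidates


def evaluate_sketch(policy: Dict[str, Any], witness_status: str,
                    candidates: Sequence[Dict[str, Any]]) -> str:
    """Apply the refusals in the order the text describes."""
    if "write_source" in policy.get("allowed_actions", []):
        return "BOUNDED_AUTONOMY_SOURCE_WRITE_FORBIDDEN"
    if witness_status != "pass" or not candidates:
        return "witness_blocked_or_empty_packet"  # hypothetical label
    for candidate in candidates:
        if candidate.get("source_mutation_authorized") or candidate.get("write_surface") != "none":
            return "candidate_write_refused"  # hypothetical label
    return "pass"
```

Checking the policy before drafting matters: a policy that grants `write_source` is refused even when every candidate row looks harmless.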

The negative cases are semantic mutations of the input, not trusted labels. evaluate_negative_case() copies the required inputs into a temporary directory and mutates the relevant file: source_write_campaign_packet appends write_source to campaign_policy.allowed_actions, while repeated_failed_campaign_digest rewrites the failed-digest ledger to contain a duplicate digest. The component passes its own evidence floor only when these mutations produce BOUNDED_AUTONOMY_SOURCE_WRITE_FORBIDDEN and BOUNDED_AUTONOMY_REPEATED_FAILED_DIGEST; stale declared error-code labels cannot satisfy the proof consumer.
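The mutate-then-evaluate pattern can be shown with a self-contained stand-in. The case ids and error codes are from the text; the file names, the JSON layout (`allowed_actions`, a `digests` list), and the simplified duplicate check are assumptions, and `evaluate_inputs` is a toy evaluator rather than the component's `evaluate()`.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Assumed file names; the source names the inputs but not their on-disk layout.
POLICY_FILE = "campaign_policy.json"
LEDGER_FILE = "failed_campaign_digests.json"


def evaluate_inputs(input_dir: Path) -> str:
    """Toy evaluator: returns an error code, or 'pass' when nothing is refused."""
    policy = json.loads((input_dir / POLICY_FILE).read_text(encoding="utf-8"))
    if "write_source" in policy.get("allowed_actions", []):
        return "BOUNDED_AUTONOMY_SOURCE_WRITE_FORBIDDEN"
    ledger = json.loads((input_dir / LEDGER_FILE).read_text(encoding="utf-8"))
    digests = ledger.get("digests", [])
    if len(digests) != len(set(digests)):
        return "BOUNDED_AUTONOMY_REPEATED_FAILED_DIGEST"
    return "pass"


def mutate(work_dir: Path, case_id: str) -> None:
    """Apply the semantic mutation for one negative case to the copied inputs."""
    if case_id == "source_write_campaign_packet":
        path = work_dir / POLICY_FILE
        policy = json.loads(path.read_text(encoding="utf-8"))
        policy.setdefault("allowed_actions", []).append("write_source")
        path.write_text(json.dumps(policy), encoding="utf-8")
    elif case_id == "repeated_failed_campaign_digest":
        path = work_dir / LEDGER_FILE
        path.write_text(json.dumps({"digests": ["sha256:dup", "sha256:dup"]}),
                        encoding="utf-8")
    else:
        raise ValueError(f"unknown negative case: {case_id}")


def run_negative_case(input_dir: Path, case_id: str) -> str:
    """Copy inputs to a temp dir, mutate the copy, and report the observed code."""
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp) / "input"
        shutil.copytree(input_dir, work_dir)
        mutate(work_dir, case_id)
        return evaluate_inputs(work_dir)
```

Because the code returned is whatever the evaluator actually produced on the mutated copy, a stale label in a manifest cannot stand in for the refusal, and the original fixture is left untouched.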

Reader Evidence Routing

The primary evidence for this module is the fixture result record and the exported-bundle result record, which demonstrate the bounded campaign packet behavior under synthetic public inputs. Source-module manifests and digest checks are evidence for copied body provenance. This page is an explanation of those sources; the underlying JSON and test outputs are the authority.

Prior Art Grounding

This component borrows from AI risk-management, policy gating, and controlled workflow-automation patterns. Useful anchors include:

NIST's AI Risk Management Framework, which frames AI work in terms of governance, mapping, measuring, and managing risk rather than assuming autonomy is inherently authorized.
Open Policy Agent, as a policy-engine pattern for deciding whether a proposed action may proceed.
GitHub Actions workflow syntax, as a widely used automation surface where jobs, permissions, and concurrency behavior are declared before execution.

Microcosm borrows the governed-campaign and preflight-gate shape, but keeps the component to draft self-proposal packets over synthetic public coverage gaps. It does not self-repair, change source files unsupervised, use external model services, or include launch operations.

How to run it:

microcosm bounded-autonomy-campaign-packet run --input fixtures/first_wave/bounded_autonomy_campaign_packet/input --out receipts/first_wave/bounded_autonomy_campaign_packet

Runtime bundle route:

python -m microcosm_core.organs.bounded_autonomy_campaign_packet run-bounded-autonomy-bundle --input examples/bounded_autonomy_campaign_packet/exported_bounded_autonomy_campaign_packet_bundle --out receipts/runtime_shell/demo_project/organs/bounded_autonomy_campaign_packet

Validation Result record Path

If the fixture or bundle reports source-module digest drift, route that through microcosm_exact_copy_refresh; this page is source-linked only for copied source bodies. If the full projection check fails because another active session holds shared lattice outputs, treat that as unrelated contention and use the corpus check as the local gate for this module.

Negative cases covered by the fixture manifest: repeated_failed_campaign_digest, source_write_campaign_packet.

Source provenance is anchored by examples/bounded_autonomy_campaign_packet/exported_bounded_autonomy_campaign_packet_bundle/source_module_manifest.json and result records carry refs, digests, counts, verdicts, and scope boundaries only.

Scope boundary

Scope limit

This component emits a draft self-proposal from public synthetic coverage gaps and refuses source-write or repeated-failure packets. It does not self-repair, change source files unsupervised, use external model services, include launch operations or public sharing, or widen the proof boundary beyond the copied source bodies, synthetic fixtures, source manifests, negative cases, and validation result records.

Scope limit

This paper module demonstrates a bounded-autonomy fixture that builds a draft campaign packet and refuses unsafe packets under public synthetic inputs. A diagram view and atlas card are generated for this module.

It cannot claim autonomous repair, unsupervised source-file changes, external model access, launch-scope decision, publishing-scope decision, production campaign safety, private-system equivalence, or whole-system correctness.

Secondary Runtime Source Bundle

Runs eight trace, graph, and market engines on test rows without fetching live markets.

5/5

Does This bundle imports a second Set 7 runtime slice as public runnable system. It checks agent trace view-model trust classes, lane-progress state normalization, graph-lens focus roles, graph projection summaries, observe-only cartography rendering, stockgrid payload terms, Polymarket CLOB microstructure, and four-lens market scanning over synthetic public fixtures without exporting sessions, fetching live markets, or giving trading decisions.

Scope limit verified source body import only; no browser/session export, wallet authority, live market data, investment-related actions, external model access, source-file changes, private-system equivalence, public sharing, launch, semantic-truth, or whole-system correctness claim

Run

microcosm batch7-secondary-runtime-capsule run --input fixtures/first_wave/batch7_secondary_runtime_capsule/input --out receipts/first_wave/batch7_secondary_runtime_capsule --acceptance-out receipts/acceptance/first_wave/batch7_secondary_runtime_capsule_fixture_acceptance.json

Evidence: Verified source import · evidence 5/5 · Copied source body

source intake · provenance · drift-control

Source Design note · Source atlas

Paper module Set 7 Secondary Runtime Bundle

Explains: component Secondary Runtime Source Bundle · mechanism validates public secondary runtime bundle

Governed by: principles

Abides by: axioms

batch7_secondary_runtime_capsule imports a second Set 7 runtime slice into Microcosm. It exact-copies runtime view-model, lane-progress, graph-lens, graph-projection, cartography, stockgrid, and Polymarket source bodies into a public bundle, runs the bounded witness path, and exercises the Python market/numeric cores against synthetic public fixtures.

Imported Source Bodies

system/server/ui/src/components/world/agentTraceViewModel.ts
system/server/ui/src/components/world/laneProgress.ts
system/server/ui/src/components/graph/universalGraphLens.ts
system/server/ui/src/components/graph/graphProjection.ts
system/server/ui/src/lib/capCartographyShadowRender.ts
their focused Vitest witnesses where public-safe
tools/stockgrid/stockgrid.py
tools/polymarket/clob_snapshot.py
tools/polymarket/score.py
tools/polymarket/models.py

Purpose

This module is the reader-facing instrument for the accepted batch7_secondary_runtime_capsule component. Its source authority is the JSON source record in core/paper_module_capsules.json; this Markdown explains what a cold reader may trust from the public secondary-runtime fixture and what remains out of scope.

The component exists to answer one question: do these copied frontend and market bodies still behave the way their original code claims to, when run in isolation over synthetic inputs? It copies eight slices into a bundle, then exercises each one against a small fixture and re-checks the exact behaviour the original author relied on. The interesting part is not that the code runs, but that each engine is paired with a planted regression. The component mutates a single token in the copied body, or feeds an adversarial input, and asserts that the behaviour breaks in the expected way. A check that only passes on good input proves little; a check that also fails on the right bad input is evidence the behaviour is real.

Several of these guards encode a concrete bug that was found in production. The Polymarket order-book reader documents a probe from 2026-05-12: the API can return bids floor-first and asks ceiling-first, so a naive bids[0] / asks[0] reader silently inverts best-bid and best-ask. The body derives best prices by numeric extrema instead, and the polymarket_sorted_book_trap case feeds a deliberately mis-sorted book to confirm the extrema rule still holds. The stockgrid momentum primitive refuses an impossible -100% daily change rather than returning a misleading number. The graph projection drops self-edges so a collapsed cluster does not draw an arrow to itself. The scope stays narrow on purpose: this is local body import and synthetic-fixture witness evidence, not live market access, wallet authority, browser export, or investment-related actions.
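The order-book trap is easy to reproduce. The sketch below is not the copied `compute_best_prices` body. It assumes each book row is a mapping with `price` and `size` fields, and the depth-imbalance formula is one common definition chosen for illustration; the source only states that the value stays in [-1, 1].

```python
from typing import Mapping, Optional, Sequence, Tuple

Row = Mapping[str, object]  # assumed row shape: {"price": ..., "size": ...}


def best_prices(bids: Sequence[Row], asks: Sequence[Row]) -> Tuple[Optional[float], Optional[float]]:
    """Best bid is the highest bid price, best ask the lowest ask price, whatever the row order."""
    best_bid = max((float(row["price"]) for row in bids), default=None)
    best_ask = min((float(row["price"]) for row in asks), default=None)
    return best_bid, best_ask


def depth_imbalance(bids: Sequence[Row], asks: Sequence[Row]) -> Optional[float]:
    """Assumed definition: (bid size - ask size) / total size, bounded to [-1, 1]."""
    bid_size = sum(float(row["size"]) for row in bids)
    ask_size = sum(float(row["size"]) for row in asks)
    total = bid_size + ask_size
    if total <= 0:
        return None
    return (bid_size - ask_size) / total


# A deliberately mis-sorted book: bids floor-first, asks ceiling-first.
bids = [{"price": "0.10", "size": "50"}, {"price": "0.30", "size": "20"}, {"price": "0.42", "size": "10"}]
asks = [{"price": "0.90", "size": "40"}, {"price": "0.70", "size": "25"}, {"price": "0.53", "size": "15"}]

naive = (float(bids[0]["price"]), float(asks[0]["price"]))  # (0.10, 0.90): wrong
robust = best_prices(bids, asks)                            # (0.42, 0.53)
```

The naive reader reports a spread of 0.80 on this book while the extrema rule reports about 0.11, which is why a first-row reader fails silently rather than loudly.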

Shape

Source refs

Vitest witness: world/graph/cartography tests

Diagram source

flowchart TD
  bundle["Exported bundle: copied bodies + source digest anchors"]
  witness["Vitest witness: world/graph/cartography tests"]
  subgraph Engines["Eight fixture engines"]
    ui["Trace view-model and lane progress"]
    graph["Graph lens and graph projection"]
    carto["Cartography observe-only render"]
    market["Stockgrid + Polymarket CLOB and four-lens scoring"]
  end
  subgraph Negatives["Planted regressions"]
    invert["Mis-sorted book must still find extrema"]
    momentum["-100% change must be refused"]
    selfedge["Self-edge must be dropped"]
    resolved["Resolved market must gate NEWSBREAKER"]
  end
  result_records["metadata-only result records: status, digests, anchor checks"]
  ceiling["scope limit"]
  bundle --> witness
  witness --> ui
  bundle --> graph
  bundle --> carto
  bundle --> market
  ui --> Negatives
  graph --> Negatives
  carto --> Negatives
  market --> Negatives
  Negatives --> result_records
  result_records --> ceiling

Reader Evidence Routing

Start from the component source when checking behavior:

EXPECTED_ENGINES names the eight fixture engines for trace view-models, lane progress, graph lenses, graph projection, cartography, stockgrid, CLOB microstructure, and Polymarket scoring.
EXPECTED_NEGATIVE_CASES names the planted regressions for raw-authority omission, unknown lane state, hidden descendants, self edges, observe-only cartography, extreme stock momentum, sorted-book traps, and resolved-market gating.
AUTHORITY_CEILING keeps launch, public sharing, provider/model dispatch, browser or wallet access, source-file changes, investment-related actions, semantic-truth authority, and test-completeness proof false.
run, run_batch7_secondary_bundle, and result_card expose the reproducible command and metadata-only summary.

What the engines check

Each engine reads a copied body and asserts a specific, checkable behaviour. The four with the clearest stakes:

Polymarket CLOB microstructure. compute_best_prices derives the best bid as the maximum bid price and the best ask as the minimum ask price, never from the first row of each side. This guards a real failure documented in the source: the API can return bids floor-first and asks ceiling-first, which inverts a naive bids[0] / asks[0] reader. The polymarket_sorted_book_trap case feeds a mis-sorted book and confirms the chosen best bid (0.42) and ask (0.53) are not the first entries, then checks the spread and that depth imbalance stays in [-1, 1].
Stockgrid momentum. _daily_log_momentum_bps converts a percentage change into a daily log-return in basis points, but returns nothing when the ratio is at or below -0.999999. A claimed -100% daily change has no finite log return, so the primitive refuses it rather than emitting a misleading value. The stockgrid_extreme_momentum case asserts that refusal.
Graph projection. projectGraphForRender groups nodes into per-lane, per-wave summary clusters and rewrites edges between clusters. It drops any edge whose source and target land in the same cluster, so a collapsed cluster never draws an arrow to itself. The graph_projection_self_edge case removes the sourceId === targetId guard from the copied body and confirms the self-edge would otherwise survive.
Polymarket four-lens scoring. calculate_lenses zeroes the NEWSBREAKER lens for any market that is resolved, low-volume, low-uncertainty, or an outlier in velocity. The fixture scores one open and one resolved synthetic market and asserts the resolved one scores zero on NEWSBREAKER while the open one does not.
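The stockgrid refusal described in the list above follows from the log-return formula itself. A minimal sketch, assuming the conversion is ln(1 + r) × 10,000 with r the percentage change divided by 100; the function name drops the leading underscore to mark it as a stand-in for `_daily_log_momentum_bps`, whose exact body is not shown here.

```python
import math
from typing import Optional

MIN_RATIO = -0.999999  # threshold named in the text


def daily_log_momentum_bps(pct_change: float) -> Optional[float]:
    """Convert a daily percentage change to a log return in basis points.

    Returns None when the ratio is at or below MIN_RATIO, because ln(1 + r)
    diverges to minus infinity as r approaches -1 (a -100% change).
    """
    ratio = pct_change / 100.0
    if ratio <= MIN_RATIO:
        return None
    return math.log1p(ratio) * 10_000.0
```

For a +1% day the result is roughly 99.5 bps rather than 100, the usual gap between simple and log returns; for a claimed -100% day the function returns `None` instead of an infinite or misleading number.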

The remaining engines cover the trace view-model trust taxonomy (seven labels including missing and fallback, with an explicit "raw provider JSONL is unavailable" path), lane-progress state normalisation (an unknown state falls back to idle, not an invented status), the graph lens (collapsing a parent keeps the parent visible but hides its descendants), and the cartography render (a fixed set of mutating actions stays blocked, so the surface observes without creating or editing). Each negative case is run by mutating one token in the copied body or supplying an adversarial input, then checking the engine reports blocked. The result records record status, digests, and anchor matches only; copied bodies and command output are never inlined.
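The self-edge rule described for graph projection can be sketched independently of the TypeScript body. This Python version is an illustration, not a port of `projectGraphForRender`: the `cluster_of` mapping, the tuple edge shape, and the de-duplication of repeated cluster edges are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str]


def project_edges(edges: Sequence[Edge], cluster_of: Dict[str, str]) -> List[Edge]:
    """Rewrite node edges as cluster edges, dropping edges that stay inside one cluster."""
    projected: List[Edge] = []
    seen = set()
    for source, target in edges:
        # Nodes without a cluster assignment stand for themselves.
        source_cluster = cluster_of.get(source, source)
        target_cluster = cluster_of.get(target, target)
        if source_cluster == target_cluster:
            continue  # the guard: a collapsed cluster never points at itself
        key = (source_cluster, target_cluster)
        if key not in seen:  # assumed: repeated cluster edges are merged
            seen.add(key)
            projected.append(key)
    return projected
```

Removing the equality guard reproduces the planted regression: the edge between two nodes of the same cluster would survive as a cluster-to-itself arrow.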

Prior Art Grounding

The component borrows from MVVM/read-model UI architecture, graph visualization, and market-data board patterns: view models shape raw state for views, graph projections make relationships inspectable, and market rows must preserve provider identity and missingness. Useful anchors include:

Microsoft's MVVM guidance, where view models encapsulate presentation state while separating UI from underlying model logic.
D3 force layouts as a common graph visualization family for networks and hierarchies.
The CFTC's prediction markets explainer, as a boundary reference for event-market data and consumer caution.

Microcosm borrows the view-model, graph-projection, and market-diagnostic shapes, but runs them only over synthetic runtime packets and synthetic market rows. It is not browser/session export, live market data, trading decisions, or proof that frontend projections are complete.

Validation Result record Path

Reader-verifiable fixture command, run from microcosm-substrate/:

Focused test result record, run from the repository root:

PYTHONPATH=src ./repo-pytest \
  tests/test_batch7_secondary_runtime_capsule.py \
  -q --basetemp /tmp/microcosm-batch7-secondary-runtime-tests

The fixture run writes receipts/first_wave/batch7_secondary_runtime_capsule/batch7_secondary_runtime_capsule_result.json, receipts/first_wave/batch7_secondary_runtime_capsule/batch7_secondary_runtime_capsule_validation_receipt.json, and receipts/first_wave/batch7_secondary_runtime_capsule/batch7_secondary_runtime_capsule_board.json; the sign-off file records fixture sign-off. The exported-bundle re-run uses the run-batch7-secondary-bundle action over exported_batch7_secondary_runtime_capsule_bundle.

This result record path is public fixture evidence only. It does not export browser or account sessions, fetch live market data, provide investment-related actions, complete UI/ranking coverage, include launch operations or public sharing, or aggregate doctrine-lattice coverage.

Scope boundary

Scope limit

This bundle can claim fixture-bound public source-body import evidence and secondary runtime/market witness result records. It cannot authorize browser/session export, wallet authority, live market data, investment-related actions, external model access, source-file changes, launch, public sharing, private-system equivalence, semantic truth, complete UI/ranking coverage, or whole-system correctness.

Source refs

Built from public source refs, with each input path recorded for provenance.

Each component has a stable public source path with commands, source links, and its supported scope.
