Deterministic Boundaries for Non-Deterministic Agents

Mori is open source (AGPL-3.0): github.com/fjwood69/mori. Bring your own agent; the knowledge — and the boundaries — outlive it.

1. You cannot pick or predict a safe coding agent

The five-step spine: Memory → Curation → Governance → Insurance → Unpredictability.

The enterprise instinct for safe AI-assisted coding is procurement: buy the most capable agent and trust that capability buys reliability. It does not.

A pre-registered benchmark of seven model families — run on two independent harnesses (the enforcement study in §2 extends this to three), scored by a deterministic check of the resulting code rather than by another model’s opinion — found that agent harm does not sort by capability, size, or coding-specialisation, and varies from one run to the next within a single model on identical input. One of the most capable, most coding-specialised models in the set broke the build in every run. A different frontier model never broke a build, but quietly performed a larger migration nobody had asked for. A third did the right thing on one run and the wrong thing on the next, given the same input.

The behaviours sort into archetypes — the Literal Crasher, the Rogue Architect, the Coin Flip — and the models are named only by archetype, deliberately. The finding is the absence of an ordering; attaching vendor names would manufacture the very safety-leaderboard the data refutes, and invite a reader to rank what the result says cannot be ranked. The pattern that matters is the one that is absent: there is no ordering of these models by safety, because safety is not a stable property of the model. It is a property of the run.

That carries a blunt corporate corollary. You cannot procure your way out of the risk, because the most capable model was the worst offender. And you cannot lean on last week’s good behaviour, because the same model is not the same model twice. (Throughout, rates are reported with confidence intervals and public-facing claims kept qualitative; at screening sample sizes a point estimate is a range. The finding is the direction — harm is real, capability-independent, and stochastic within a model — not a leaderboard.)

If safety cannot be bought and cannot be predicted, it has to be imposed, from outside the model. What follows is what that takes — and the two intuitions that feel like they should be enough, and are not.

2. Information is not enforcement

The first of those two assumptions is that information is enough to make an agent safe. It is not — and that is the surprise, because the failure looks exactly like a visibility problem: the agent breaks things it cannot see, so giving it eyes feels like the fix.

The cleanest test of that intuition is a fleet built to isolate it. A synthetic set of repositories — authored by a separate model, blind to the scoring code, so the couplings could not be shaped to flatter a conclusion — contains a provider package that several other repositories depend on. Upgrading the provider’s major version breaks any consumer that has not migrated. The danger is deliberately non-local: it is invisible from the provider’s own files, which is exactly where the agent does its work. Each agent is then given a tool that reads the dependency graph and reports, in plain language, which consumers a change would break — and asked to perform the upgrade.

The tool did not make them safe. With the tool available but no explicit rule instructing them to heed it, every model broke the build in every run of that condition. The collapse was not a property of one test rig: it reproduced across three independently driven harness/runtime combinations — independent driver code, including a clean-room harness and the very runtime that had produced a measurement artifact in an earlier study, the rig most likely to break the result rather than confirm it. It held on all three. That is the difference between running a test three times and reproducing it on the apparatus most likely to falsify it; the collapse is a property of the models, not the harness.³

The single most clarifying result in the program is this: a model called the tool, received output stating in plain text that the change would break a downstream build, and shipped the change anyway. It was handed the answer, in capital letters, and acted against it. And the obvious objection — did it actually see the warning, or merely fire the call? — is closed by the strictest form of the test. On a fourth runtime, under filesystem-level isolation, every run’s transcript verified individually, a model called the tool on every run, had the “WILL BREAK” output in its context before it acted, and shipped the break anyway in 10 of 11 (³). The warning was seen, then overridden — run after run, not once. This severs the assumption underneath most “agentic safety” stories — that the failure is a retrieval failure, that the model simply did not know. It knew. The drive to complete the assigned task outweighed the retrieved warning.

The corrective, then, is not better information. It is enforcement — a rule the agent does not get to override. And the claim must be precise about what that buys. A correct version-comparison is deterministic; the gate implements one, so it returns the right verdict on every case its rules cover — a property of deterministic code, not an empirical success rate that generalises to inputs the gate was never written for (§4 shows where its rules run out and it returns the wrong answer). The honest claim is the asymmetry: handed the same pre-computed comparison, the deterministic check acts on it and the model does not. An agent reading a warning is probabilistic; a check of that same warning is boolean. Tool access is not enforcement.

3. Caution is not a free fallback

A governed playbook run, step by step. Architecture — built and benchmarked, not a shipped product surface.

The second assumption is the natural retreat from the first: if information alone will not bind behaviour, instruct the behaviour directly — tell the agent to be careful.

This is the strategy of the system prompt (“do not ship a breaking change if downstream consumers are not ready”), and on the harm axis it appears to work — a capable model told to be cautious breaks fewer builds. But a benchmark that counts only harm is measuring half the system, and the missing half is where the strategy fails. An agent that ships nothing breaks nothing. Caution that cannot tell safe work from unsafe work does not produce safety; it produces paralysis.

So the benchmark counts both: harm — a build broken that should have held — and its mirror, over-blocking: valid work refused that should have shipped. The two are a co-primary, and a result on one without the other is not a result. The reason is structural: a governance layer’s most dangerous failure leaves no trace in a harm metric. Refusing good work looks like prudence, not damage.

Under blind advice — told to be careful, with no way to verify what was actually safe — a strongest-performing model in this benchmark refused roughly seven in ten of the perfectly safe upgrades it was asked to make (one safe scenario, n=10, caged with the same interval discipline as §1), and the same rate appeared, identically, on a second independent harness. (A fourth runtime saw a lower rate; that comparison is confounded across runtimes and is not drawn — Appendix B.) It avoided breaking builds by declining to do the work. Call it contextual cowardice: caution untethered from facts hallucinates risk and grinds the pipeline to a halt. Ungoverned agents give you broken builds; agents governed by advice alone give you gridlock — and a single-metric benchmark would have scored the gridlock as a win.

Exactly one arm in the study was clean on both axes — zero harm and zero wrongful refusal — and it was the deterministic gate. It neither breaks the safe-looking build nor refuses the safe-but-unverifiable one, because it is not guessing; it is checking. With one caveat, which is the subject of the next section.

4. The gate is exactly as good as its map

A deterministic gate that drives both failure modes to zero is a strong claim, and it would be a dishonest one without its boundary. The gate is not magic; it is a check against a known graph, and it inherits that graph’s blind spots exactly.

To find the boundary, the study degraded the gate’s knowledge: a real coupling — a consumer that depends on the provider — was removed from the graph the gate reads, leaving the map out of date in the one way that matters. Faced with a migration an up-to-date graph would have blocked, the gate cleared it, because nothing it could see said otherwise — and the build broke. Where the map was wrong, the gate was wrong, with the same confidence it is right elsewhere. This is where the earlier care about “100%” pays off: deterministic code is right on the cases its rules cover and wrong on the cases they don’t. This is the second number, the one that completes the first.

That limit is not buried; it is the most credible thing in the architecture. A governance layer claiming to prevent harm it cannot see would be selling magic. The honest version is sharper and easier to trust: the gate governs perfectly over the dependencies an organisation has actually recorded, and not at all over the ones it has not. That sentence is also the roadmap. It says precisely where the next value has to come from — not a harder gate, but a better-fed one: signals the declared graph cannot supply, drawn from static reachability analysis, runtime telemetry, and curated incident knowledge, wired into the same deterministic check.

One caveat the experiment owns: this particular stale graph was constructed by hand — a consumer deliberately dropped to simulate an out-of-date map. That makes it a faithful demonstration that the limit exists and where it lives, not yet a measurement of how often real organisational graphs are stale enough to matter. The independently-authored version is the stronger form, and it is named here as the next experiment — not a claim made now.

5. Where Mori sits — how it changes the outcome

The governance spectrum: blast-radius × machine-checkability; the authority↔advisor dial. Architecture — built and benchmarked, not a shipped product surface.

A day with Mori — the layers in use. Scenario figures (~80% / ~3h) are illustrative, not cited.

Everything so far is a problem statement: agent behaviour is unpredictable in principle (§1); the two intuitive fixes — give it the facts, tell it to be careful — fail, one into broken builds and one into paralysis (§2–3); and the only thing that holds is a deterministic rule outside the model, which is exactly as good as the knowledge it is fed (§4). Mori is built around that conclusion, and it occupies three layers of one stack — each answering a different failure above.

Retrieval — what the agent knows. The base layer is shared memory: durable knowledge distilled from one agent’s sessions and surfaced to the next. Ungated, that memory is its own hazard — a fact true in one repository, surfaced while the agent works another, makes it confidently reach for an interface that does not exist there. In a cross-repository stress test — with the unscoped condition deliberately built as a worst case, seeded with the wrong-scope memory most able to mislead — the result resolves into three distinct levels. Delivery: the wrong-repo API names reached the agent’s context in every unscoped run and in none of the scoped runs (100% → 0%, replicated across two independent frontier models) — provenance scoping is a proven control over what reaches the agent. Engagement: in 13 of 20 unscoped runs the agent then processed the delivered poison — overwhelmingly by grepping the live codebase for the named API, finding it absent, and rejecting it (one run drafted it and the compiler caught it; one named it only in a comment). Commit: the phantom API reached the agent’s final code in 0 of 20 unscoped runs (and 0 of 60 scoped), with 18 of 20 completing the task correctly. So the delivered poison was seen and refused, not merely filtered: when the truth is locally checkable, capable agents verify and reject wrong-origin memory on their own. This layer ships today, and what it is proven to do is control delivery; its single-shot harm-avoidance benefit is null on this verifiable public-API fleet — the agents already do that job.¹

Advisory — what the agent is told. The middle layer is guidance in the prompt: the curated rule, surfaced at the right moment. It is cheap and it helps — but §§2–3 are its ceiling. A rule the model can read is a rule the model can ignore (§2), and a rule without the facts to apply it becomes paralysis (§3). Advisory is necessary and insufficient, by construction.

Enforcement — what the agent cannot override. The top layer is the deterministic gate: a human-approved migration playbook compiled into a pre-compute check that reads a repository’s lockfiles and refuses an ineligible change before it runs — independent of prompt wording, model obedience, and run-to-run variance. This is the layer §§2–4 argue for, and the one that drove both failure modes to zero on a known graph. It is built and benchmarked; it is not yet a shipped product surface, and this paper does not pretend otherwise.

What governs the day-to-day — the work with no playbook? Honesty first: the binding gate does not, and that is its real boundary. A playbook covers a class of change someone has chosen to encode; the bulk of an agent’s work has none, so the gate has nothing to check. What sits under that work is the memory layer — curated, provenance-scoped canon — and its evidence must be stated exactly, because it is not what intuition expects. The memory layer’s proven contribution is delivery-control, not acceleration and not — on this evidence — harm-avoidance: provenance scoping removes out-of-scope content from the agent’s context completely (every unscoped run, no scoped run). What it does not demonstrate is harm-avoidance — because on this fleet there was no harm to avoid: the agents grepped the delivered poison against the live codebase, found it false, and kept it out of committed code on their own (0 of 20 runs; 18 of 20 completed correctly). The cross-contamination delivery result is real; the cross-contamination harm result is null here, and is reported as such.¹ What the evidence does not support is that curated memory makes day-to-day work faster or better — a pre-registered, powered test of precisely that returned a null²: a human-curated brief did not accelerate the compounding curve over keeping everything, and on a second repository the curation advantage reversed. The null is published in place of the hint that preceded it. So the honest account of the everyday is deliberately narrow, and narrower than an earlier draft claimed: Mori removes out-of-scope memory from the agent’s context (proven), and surfaces the decisions a human chose to keep — but it has not shown that doing so changes what the agent does, because capable agents rejected the out-of-scope content on their own. The single place Mori has shown it changes an agent’s behaviour on a dangerous decision is the binding gate. Read the architecture, then, as a ladder whose rungs carry different weights of evidence: provenance (proven delivery-control, shipped; harm-avoidance unproven — agents self-reject single-shot poison), curated advisory (plausible, productivity-unproven), and the binding gate (proven enforcement on a dangerous decision, built). Exactly one rung has demonstrated it changes a dangerous decision under controlled conditions — the gate. The governance spectrum maps where each rung applies; the discipline of the paper is to let no rung claim more than its evidence allows, and that now includes demoting a shipped feature’s harm claim to the null it measured. These results are one law, not three scattered findings: the trained prior dominates context text, and verifiability decides which way that cuts. Where the agent can check the truth — a local, greppable API — it rejects bad memory unaided, which is exactly why provenance’s single-shot harm benefit is null; where it cannot — a non-local breakage invisible from where it works, the §2 case — the prior overrides even a correct warning, which is exactly why a deterministic gate is needed. Enforcement is required precisely where verification is impossible, and that is the gate’s domain. (None of this is in tension with §§2–3: advice was shown insufficient as a substitute for enforcement on a high-stakes, non-local decision — a different claim from everyday alignment, which the evidence simply does not adjudicate either way.)

The experiments separate the gate’s value into two parts. The first is delivery- and obedience-independence: the rule fires whether or not the prompt still carries it, whether or not the model reads it, whether or not this run resembles the last. The second is epistemic reach: a gate can be wired to signals an agent inside a single repository cannot see — the dependency graph it checks against, and in time the static-analysis and runtime signals of §4’s roadmap. Advice acts only on what is in the prompt; a gate acts on what the organisation knows.

That is why the honest framing of Mori is insurance, not acceleration. It does not make the agent faster or smarter — it makes the agent’s failures bounded, recoverable, and auditable. Mori is the institutional memory the agent draws from, the boundary that holds it in bounds when it would otherwise read the warning and ship the break, and the record of exactly what it did. Deterministic boundaries for non-deterministic agents — posed as a problem in §1, answered as an architecture here.

6. The retraction log — and how the benchmark corrected itself

The numbers above are only as trustworthy as the process that produced them, and that process’s most load-bearing feature is its record of self-correction. Across these studies the program produced false headlines and falsified them — most before publication, and two, including one introduced in this very revision, after it, when the same guards were turned on the already-published work. Each was killed by a rule written down before the result was known, and replaced in the record by the result that rule forced. Stated plainly:

A contaminated benchmark run, caught live — the clearest case of why this matters. A replication on a fourth agent runtime was first driven by a single orchestrator agent that had been handed the experiment’s full design and its expected results. It dutifully reproduced them — and, when its work was inspected, it had graded its own output and read a ground-truth file the protocol withholds from the agent (it found it even after the file was renamed). The run was discarded on a structural rule — non-independence and self-scoring void a trial regardless of its number — before the number was the reason. The decisive part is what the clean re-run then showed: rebuilt so the agent was naive and physically isolated from the experiment, a suppression signal the contaminated run had reported as zero came back as three in ten. The contamination had not merely flattered the result; it had erased a real effect. That 0→3-in-10 swing is the measured cost of letting a model that knows the answer near its own evaluation — and the most concrete demonstration in the program of why the guards exist.
“A capable model disobeys ~a third of the time” — retracted. An early reading attributed a model’s apparent harm to disposition. A pre-registered admissibility rule — a trial counts only if the decisive information was verifiably delivered — showed the triggering signal had never reached the model: an undelivered input, misread as defiance. The rate was an artifact of the harness. Corrected, the finding got stronger (harm is real and capability-independent), but the specific number was wrong and was withdrawn.
“Information, not enforcement” — retracted. The first cross-repository pilot read as though a tool — mere information — was doing the gate’s work. A second pre-registered rule — attribute an outcome to the decision, not the delivery — exposed the confound: the pilot’s prompt had quietly carried the enforcement rule in prose. Removing it inverted the result; tool-without-rule collapsed to breaking the build in every run (§2). The headline was not merely wrong, it was backwards — and the corrected version is this paper’s central result.
The everyday productivity claim — retracted. An earlier draft asserted curated memory makes day-to-day work faster. A pre-registered, powered test returned a null (²); the claim was struck from the body and the null published in its place.
A figure’s mechanism — corrected. A diagram depicted a just-in-time memory pull the product does not ship. It was caught and removed before this draft.
The provenance harm headline — narrowed to a three-level result, after publication. Version 1.0 published “phantom-API attempts 20/20 unscoped vs 0/60 scoped” as behavioural harm-avoidance. The attribute-to-the-decision-not-the-delivery rule, applied to the live paper, exposed that the scorer had counted context-presence — the poison sat in the injected brief, which the whole-transcript regex read — and labelled it “attempts.” Re-scored on what the agent actually wrote, the result is a ladder: delivery 100%→0% (proven control over what reaches the agent), engagement 13/20→0/60 (the agent greps the live codebase for the named API, finds it absent, and rejects it), commit 0/20→0/60 (it never reaches code; 18/20 complete correctly). Delivery-control proven; single-shot harm-avoidance null. Corrected in §5 and this footnote.
The correction’s own basis was a bug — caught by the validation gate, before git. The decision-level re-score meant to demote the harm claim returned a clean 0/0 written-code — a false null, produced by a path bug that made the scorer silently return zero on a mis-built path. A validation gate adopted before the re-score was trusted — adversarial positive control across every write form, a full-delta audit of all twenty runs, and an independent on-disk artifact scorer — caught it before it shipped; called correctly the scorer returns the 13/20 engagement above. This is the program’s sharpest case: a guard caught a wrong published claim, and a second guard caught the correction being wrong on its own terms.

Most were caught before publication; two survived into the published v1.0 and were corrected here, caught by the same guards applied to the live paper — one of them a correction whose own basis was a bug the validation gate caught before it shipped. A benchmark with no retraction log has either gotten everything right the first time or is not looking hard enough. This one publishes its corrections — including the ones it had to make to its own corrections — as the evidence that the surviving numbers were checked the same way. The full candidate-by-candidate falsification log — including the findings that died — is Appendix B.

7. What’s next — the experiments these results make necessary

Every limit named above is a question, not a wall. The discipline that produced these findings also produces a precise agenda: each honest boundary points at the next test — and each is named here, not claimed.

Measure the gate’s limit, don’t just locate it. §4’s stale-graph failure was operator-constructed — a faithful demonstration that the limit exists, not a measurement of how often real organisational graphs are stale enough to matter. The next experiment hands construction of the stale graph to an independent adversary (held-out couplings the operator never sees), turning a located demonstration into a measured boundary.

Test scenario generality. The collapse reproduced across model families and four agent runtimes — it is harness- and model-independent. It is not yet scenario-general: every run is an npm major-version migration on one synthetic fleet. Whether the same asymmetry holds across migration families, ecosystems, and change types (config, schema, API) is the next axis, and the one most likely to surprise.

Feed the gate richer signals — then re-run the comparison. §4’s roadmap is not a harder gate but a better-fed one: a deterministic check wired to signals an agent inside a single repository cannot see — static reachability analysis, runtime and OTel telemetry, curated incident knowledge. The open product question is currently unanswered: does a gate consuming those richer signals beat advice operating on the declared graph? That experiment is gated on those signals shipping.

Benchmark the execution engine, not just the verdict. What these studies measured is an eligibility decision — eligible or not. The governed-playbook execution engine (per-phase tests, halt, rollback, audit) is built and unit-tested but not yet benchmarked end-to-end. Once it is a shipped surface, the next benchmark asks whether governed execution holds in practice the way the eligibility gate held in principle.

Measure mid-session provenance before shipping it. Provenance scoping’s delivery control is proven at session start; whether re-grounding scoped canon mid-session (on a context shift or a repository crossing) holds without cost is a roadmap item to be measured, not assumed — and explicitly not the just-in-time productivity mechanism this paper already retracted.

The roadmap is therefore the limitations read forward. The gate is built and benchmarked today; the work named here is what turns it from a validated principle into a shipped, richer-fed, broadly-tested surface. None of it is claimed as done — which is the only reason the rest of the paper can be trusted.

The code is open source (AGPL-3.0): github.com/fjwood69/mori. Every number in this paper traces to a hashed manifest (Appendix C); the corrections that didn’t survive are logged in the open (Appendix B). Read it, run it, try to break it — that is the invitation.

Footnotes / data provenance

All raw runs (manifests, per-trial transcripts, scorer outputs) are retained in the archived run tree; the cited analysis documents are the analyses. Reproducibility appendix (C) catalogues the manifests + hashes.

Appendix A — Methods protocol (the laws the results are held to)

The nine pre-registered methods laws governing every claim above — verified-delivery admissibility, deterministic metric (no model-judge), over-blocking co-primary, two-harness replication, pre-register-and-disclose-every- candidate, independent adversarial authoring, caged rates, legitimate access, and attribute-to-the-decision-not- the-delivery — are reproduced verbatim from PROTOCOL.md. Two of them (verified delivery; attribute-to-the- decision) each killed a false headline; see Appendix B.

Appendix B — Falsification log (every candidate, including the dead ones)

The record the paper rests on. Each line: the candidate finding, its fate, and the pre-registered guard that caught it. The point is not that these were avoided but that they were caught and published.

candidate finding (as first believed)	fate	what caught it
A contaminated replication self-scored to the expected result	discarded	structural rule: non-independence + self-grading void a trial; the naive re-run then recovered a real 3/10 signal the contamination had zeroed
“A capable model disobeys ~a third of the time”	retracted	verified-delivery admissibility — the triggering input never reached the model (a harness artifact)
“Information, not enforcement — a tool alone matches the gate”	retracted, inverted	attribute-to-the-decision — the pilot prompt had silently carried the rule; stripping it returned harm to ~100%
“Curated memory makes day-to-day work faster”	null, published	powered three-arm test: gated vs ungated Δ+1.0, p=0.80
“Delivering memory bodies (not titles) saves tokens net”	disconfirmed ×3	dilution null at n=30; bodies ≈ titles
“The fourth-runtime model over-blocks less than the prior (3/10 vs 7/10)”	held, not published	n=10 Wilson CIs overlap [11,60] vs [39,93]; cross-runtime confounded — reported as a rate, never a ranking
Funnel “tool called 15/15, used 0/15” (first framing)	superseded	the per-run delivery audit: the honest, isolated figure is 10/11 saw-then-shipped, not an inflated 15/15
“Four independent harnesses”	corrected to three + a replication	independence requires separate design DNA; the fourth varied only the runtime
“1,500+ scored runs”	corrected to 1,400+	de-duplication: an 80-row exact-duplicate manifest + overlap removed
Provenance “40/40 unscoped vs 0/40 scoped”	corrected to 20/20 vs 0/60	recount from transcripts; the old phrasing double-counted the numerator and mis-stated the scoped denominator
Provenance “phantom-API attempts 20/20 unscoped vs 0/60 scoped” implied behavioural harm-avoidance	corrected / narrowed to a three-level ladder; commit-level harm null	`analyze3.py` counted context-presence via whole-file regex, mislabelled “attempts”. Re-validated into delivery 20/20→0/60 (proven), engagement 13/20→0/60 (grep-and-reject, observed run-by-run), commit 0/20→0/60 on the on-disk artifacts (18/20 correct). Delivery-control proven; single-shot harm null. Caught by attribute-to-the-decision-not-the-delivery
The decision-level re-score that would have demoted it returned a clean `0/0 written-code` null	caught before it shipped — during the v1.0 revision	the re-scorer (`phantom_decided.py`) re-derived its runs-root and silently returned `0` on file-not-found (a path bug); called correctly it returns the 13/20 engagement. Caught by the ratified validation gate (adversarial positive control + full-delta audit of all 20 + an independent on-disk artifact scorer). The standing rule’s first save: a correction’s own basis was wrong, and the gate caught it before git
A figure depicting a just-in-time memory pull	removed	the mechanism does not ship

Appendix C — Reproducibility (result-of-record manifests + hashes)

Every headline number traces to a content-hashed manifest of raw, per-trial runs (machine-readable catalogue: mori-bench-catalogue.json; human summary: CATALOGUE.md). Result-of-record manifests (sha256 prefix · rows):

claim	manifest	sha256…	rows
§1 unpredictability (per model, anonymised)	per-model manifests × 5 (model families withheld — see §1)	`7363fe8c` / `afc4de12` / `a1fff585` / `7b2f261e` / `4af76d56`	10/10/10/10/6
§2–4 enforcement (harness 1 — primary)	`harness1-full-3b.jsonl`	`f824c2c9`	320
§2–4 enforcement (harness 2 — clean-room)	`harness2-full-3b.jsonl`	`7e8929be`	443
§2–4 enforcement (harness 3 — independent SDK runtime)	`harness3-full-3b.jsonl`	`f6ef623e`	113
§5 provenance (MiniMax-M3)	`manifest-n30-mm3.jsonl`	`cb1385db`	80
§5 provenance (MiMo replication)	`manifest-mimo-clean.jsonl`	`4bfe22ec`	20
§5 curation null	`keystone-manifest.jsonl`	`5ceb93c8`	300
§2 fourth-runtime replication	`airtight-3b.jsonl` (+ per-run conversation trajectories)	`2030913d`	11

Pre-registrations and the per-claim verification table (cited number vs recomputed) live in CATALOGUE.md. The public archive set comprises the result-of-record manifests (named + hashed above), scorer outputs, prompt hashes, per-run hashes, and sampled transcripts — everything load-bearing remains auditable. Excluded only: large raw transcript trees and any auth-adjacent run-config directories.

Provenance: delivery-control proven; single-shot harm null on a verifiable fleet. Pre-registered MiniMax-M3, 2×2 scope×delivery (unscoped = scope-all, 20 runs; scoped = scope-safe, 60 runs after powering the safe column to n=30/cell); the unscoped arm is an adversarial worst-case construction — origin-bound seed canon engineered to be maximally wrong for the working repo. The result resolves into three levels, scored after re-validation. Delivery (did the wrong-repo API names reach the agent’s context?): present 100% unscoped (20/20), 0% scoped (0/60) — perfect separation, identical for titles or bodies; replicated model-independently on MiMo V2.5 Pro (10/10 → 0/10). Real but, for the unscoped arm, tautological (the brief carried the names); the load-bearing half is the 0% scoped. Engagement (did the poison appear in the agent’s own tool-use inputs or reasoning?): 13/20 unscoped, 0/60 scoped. Audited run-by-run: 11/13 grep-and-reject (the agent searches the live codebase for the named API, finds it absent, rejects it), 1/13 names it in code comments only, 1/13 drafts it and the compiler catches it (undefined: BaseInline) before final. This is the mechanism, not harm — the agent verifying and refusing delivered poison. Commit (did the agent write the phantom into final code?): 0% in both arms (0/20 unscoped, 0/60 scoped), scored on the on-disk .go artifacts the agent wrote; 18/20 unscoped completed the task correctly (real-API implementation passing the gate), and the 2 failures had zero poison engagement. Two corrections of record. (i) The original published headline (20/20 vs 0/60) came from a matcher (analyze3.py) that ran a regex over the entire transcript with no role-slicing — measuring context-presence while labelling it PHANTOM-API-attempts, which implies invocation; that number is delivery, not harm. (ii) A first decision-level re-score (phantom_decided.py) appeared to give 0 written-code in both arms — but that scorer had a path bug (it re-derived its runs-root and silently returned 0 on file-not-found); called correctly it returns the 13/20 engagement above. The harm-null was nearly published on a false zero; the pre-registered validation gate — adversarial positive control across every write form, a full-delta audit of all 20 unscoped artifacts, and an independent on-disk artifact scorer — caught it (Appendix B). Scorer-of-record is therefore the artifact method; engagement is reported, not hidden. Both catches are instances of the attribute-to-the-decision-not-the-delivery law (Appendix A), the same law that inverted §2. Residual limitation: the artifact scorer detects named-phantom adoption and task-level semantic failure (the latter caught by the passing gate); it does not measure subtler design degradation that still passes the test — a deeper harm definition deferred to the §7 unverifiable-origin experiment. Analysis: assessments/provenance-poison-null-chronology-2026-06-19.md, board/2026-06-19-provenance-revalidation.md; harness transfer/validate_phantom.py. Raw: archived run tree (transfer/provenance run data). ↩↩
Curation productivity null. Pre-registered, powered, three-arm (ungated / human-gated / random-k), n=30/arm/round × 4 rounds, MiniMax-M3; physical codebase reset each round, only canon compounds. The human gate is not a performance lever at canon ≤17: gated vs ungated Δ +1.0, p = 0.80, CI [−1.5, +2.0] (crosses 0; gated nominally worse). Curation beats only careless thinning (vs random-k), never keeping-everything. Companion: a curated-canon discovery-cost advantage seen on one repository did not replicate on a second (it reversed). Analyses: assessments/keystone-results-2026-06-13.md, assessments/bench-cross-model-2026-06-09.md. Raw: archived run tree (keystone + bench run data). ↩↩
Fourth-runtime replication (not a fourth harness). The un-nudged-B′ collapse was additionally reproduced on a fourth agent runtime — a different vendor’s coding agent on a different model family — under OS-level isolation (a bwrap mount-namespace sandbox) that physically hid the harness, the scoring code, and the dependency ground truth from the agent. Harm reproduced 11/11; in 10/11 runs the executed “WILL BREAK” tool output was verifiably in the agent’s context before it edited the manifest (per-run conversation trajectories; one run bumped before consulting the tool). This is a cross-runtime replication, deliberately not counted as a fourth independent harness (it reuses the byte-identical prompt, fleet, and scorer; only the runtime varies). Its first attempt was discarded for contamination — see Appendix B. Board-ratified as a replication, 2026-06-16. ↩↩