Green on Mocks Is Not Done

I built a large multi-agent system over a few days. Every layer declared itself done. The design passed seven rounds of adversarial review. Each unit was test-green and mutation-proven. Each unit was reviewed and committed. And then, every single time, the next — more real — layer found bugs the previous layer could not have seen.

Design said “flawless.” The first real implementation pass found non-executable SQL and a safety cap that silenced itself during the exact incident it was meant to catch. Units said “mutation-proven.” An independent review found two critical and a dozen major bugs the unit’s own tests passed right over. Review said “sound, committed.” The live box found integration bugs no unit test could catch — a component that couldn’t read the real dependency at all, a process check that inspected the wrong process.

The pattern only ever broke when verification touched something more real: a different model, the live machine, a human’s eyes.

What “green” actually meant

Here is the uncomfortable part. Every “green” was honest. The tests really passed. The mutations were really caught. The reviewer really approved.

But “green” meant consistent with my own assumptions — not matches reality. The bugs all lived at the boundary between the system and the real environment: the real command’s exact output shape, the difference between a window and a session, the difference between a shell process and the program running inside it, a directory two agents quietly shared.

Those are exactly the places mocks hide. My unit tests injected fakes. So they verified the logic against the same wrong assumptions the code was built on. “Mutation-proven” proves a test catches a regression against the mock’s shape. It says nothing about whether the mock’s shape is right. At a boundary, it usually isn’t.

The verification chain was allowed to terminate on a mock. That was the whole bug.

The ladder

The fix is a ladder, and a rule: a unit is not done until it has been exercised against the most-real substrate available right now, and every rung it hasn’t reached is named out loud.

R0  logic on mocks / injected seams     ← cheapest; never enough for a boundary unit
R1  real-I/O integration smoke          ← real command / fs / network, zero fakes
R2  live single-instance                ← one real run, end to end
R3  live fleet / soak                   ← many instances + the time dimension

Mocks may verify logic. They may never close a unit that touches an external boundary.

Claim	What it actually proves
”tsc clean”	the types line up
”tests pass”	the logic matches the fixtures
”mutation-proven”	the test catches a regression against the mock
”reviewed sound”	a reader agreed with the author’s assumptions
”works on the live box”	it actually does the thing

Only the last row is reality. The rest are necessary and never sufficient for anything that calls a real command, spawns a process, or talks to another machine.

The discipline

Three rules carry it:

Every boundary unit needs at least one acceptance test against real I/O — not injected seams — before it’s done. A suite that is 100% fakes has verified nothing about the boundary.
The reviewer is a different substrate. A different model, the live machine, a human. Not the builder’s own fixtures grading the builder’s own code.
No silent mock-termination. If a thing was only checked at R0, say so. “Logic-green, reality-unverified” is an honest status. “Done” is a claim you have to earn with reality-contact.

The words done, green, sound, shipped applied to a boundary unit should make you ask one question: against what? If the answer is “a mock I wrote,” it isn’t done. It’s a hypothesis that happens to be green.

How I wire it in

The discipline only holds if it’s mechanical. The shape I use:

A unit’s “done” transition is gated on an R1+ artifact — a real-I/O test, a live-run log, a screenshot — not just a passing unit suite. For UI, that means actually opening the page in a real browser (I use the Interceptor-style “drive the real Chrome” approach), not asserting a 200.
The reviewer is a different model. When I build with Claude Code, the adversarial review pass runs on a different frontier model than the builder — a cross-vendor check catches the blind spots a model shares with itself. (This is also why mutation testing — Stryker and friends — is necessary but not sufficient: it proves the test catches a regression against the mock, never that the mock is real.)
Every status carries its reality rung explicitly: R0 logic-green, R1 real-I/O, R2 live, R3 soak. An un-reached rung is named, never silently omitted.

None of it is heavy. It’s one gate and one habit: don’t let the verification chain terminate on something you wrote to make it pass.

Built on: the verification ladder is just Google’s testing pyramid read from the other end — most teams over-trust the bottom; the lesson is that the boundary needs the top. Adversarial cross-model review uses Claude Code with a different-vendor reviewer. Mutation testing via the Stryker family.