← All posts

My Agent's 'Green' Was a Lie Until I Ran the Real Test

How I stopped hand-wiring build-and-review pipelines for every task and standardized on a plan→work loop — and the one gate the loop can't close for me.

  • ai-agents
  • claude-code
  • orchestration
  • testing
  • workflow

For months, every non-trivial change I made with AI coding agents started the same dumb way: I’d fire off a build agent — Codex, or my own Forge setup — to write the code, then I’d separately spin up a second agent by hand to review what the first one did. Dispatch, then review. Glue I re-tied every single time.

If you’re not in the weeds here: an “agent” is just an AI given a task and the keys to actually do it — edit files, run commands, the works. Left alone, it will happily start typing code before it understands the problem, the same way an eager contractor starts knocking down walls before anyone’s looked at the plumbing. My hand-rolled dispatch-and-review dance was me being the general contractor for every job, badly, from scratch.

The fix was boring and it worked: I made two commands my default and stopped improvising.

First is a plan step — /ce-plan. It refuses to write code. It forces the agent to research the actual codebase, propose an approach, and write out acceptance criteria into a plan.md file before anything else happens. That plan file is a leash. The agent can’t lunge at the keyboard because the only thing it’s allowed to produce in that phase is understanding.

Then /ce-work executes that plan — and here’s the part I’d been rebuilding by hand for no reason: the review is inside the loop. The heavy build engine runs, and an adversarial code-review pass runs against its own output, automatically. I’m not wiring up a second reviewer anymore. The loop brings its own critic.

So I deleted my bespoke orchestration and I don’t miss it.

But there’s a trap, and it cost me a “done” I had to take back. The loop went green. Tests passed. I almost shipped. Then I ran a real test — a live call against the actual database, actual network, no fakes — and it fell over.

Here’s why. The loop’s tests lean on mocks: stand-in fakes that pretend to be the database or the API so tests run fast. When the loop says “green,” it means green against the fakes. That’s a real signal — it proves the logic holds together — but it is not proof the thing works against reality. A flight simulator landing isn’t a landing.

Where the mock boundary bites, concretely give me the detail

The /ce-work loop runs unit tests, and units mock their I/O. A typical passing test never touches the wire:

def test_sync(mocker):
    db = mocker.patch("app.db.write")   # <-- the fake
    sync_record({"id": 42})
    db.assert_called_once()             # green, but never hit a real DB

That passes even if the real schema changed, the credentials rotated, or the API now returns a field you don’t expect. The reality gate is a separate run I drive myself — the live test, real I/O, no mocker.patch. I keep it outside the skill on purpose so a mocked pass can never masquerade as a real one.

The general habit, and the thing I’d tell anyone orchestrating these agents: standardize the loop so you stop reinventing the glue — plan first, execute-with-review second — and then treat the loop’s “pass” as a hypothesis, not a verdict. The agent can prove its logic to itself. Only a live test proves it to reality, and that test is your job, not the loop’s.