CODY Has to Prove Itself Wrong First

The first version of CODY — my standalone AI code reviewer that runs over my repos and flags problems before I merge — told me a race condition existed in a function that had no shared state at all. It was confident. It cited line numbers. It explained the interleaving in crisp prose. And it was completely wrong.

That’s the thing nobody warns you about when you point a language model at a codebase and ask “what’s broken here?” It will always find something. It’s fluent by design, and fluency reads as authority. So CODY would hand me a tidy list of ten findings, and three of them were real, and the other seven were beautifully argued fiction. Sorting the real from the imagined cost me more time than just reading the diff myself. A reviewer that makes your job slower is worse than no reviewer.

If you’ve never built one of these: think of it like a spell-checker that’s also willing to invent grammar rules and insist your correct sentence is broken. The underlining looks the same whether it’s right or not.

My first instinct was the obvious one — make the finder smarter. Better prompt, more context, sharper instructions to “only flag high-confidence issues.” That barely moved the needle. Asking a confident model to be less confident just makes it write more hedging words in front of the same wrong answer. The wrongness wasn’t a knowledge problem I could prompt away. It was structural: one model, one pass, no opposition.

So I stopped trying to improve the finder and added an opponent.

Now CODY works in two stages. Opus does the finding — it reads the diff and raises every issue it suspects. Then each finding gets handed to a second model from a different vendor (I use codex through a tool I call Forge) whose entire job is to refute it. Not to confirm. To kill it. “Here’s a claimed race condition — prove it can’t happen.” A finding only reaches me if it survives that attack.

The seven fictional findings? Most of them collapse the moment something actually tries to argue against them, because there’s no real evidence underneath — just plausible narrative. The three real ones survive, because the refuter goes looking for the disproof and can’t find it.

The finder/refuter harness, in detail

The key move is using a different model family for the refuter. Same-family models share the same blind spots — Opus refuting Opus tends to rubber-stamp its own reasoning. Cross-vendor disagreement is the signal.

The refuter prompt is deliberately one-sided:

You are refuting a code-review finding. Your job is to
DISPROVE it, not validate it. Output VERDICT: SURVIVES
only if you cannot construct a concrete counterexample,
execution path, or code reference that defeats the claim.
Otherwise output VERDICT: REFUTED with the specific reason.
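
A reply in that format still has to be turned into a keep-or-drop decision. Here is a minimal Python sketch of that step, not CODY's actual code: `Verdict` and `parse_verdict` are hypothetical names, and treating a reply with no verdict line as refuted is an assumption (you could just as reasonably surface those for a manual look).

```python
import re
from typing import NamedTuple


class Verdict(NamedTuple):
    survives: bool
    reason: str


# Matches "VERDICT: SURVIVES" or "VERDICT: REFUTED", plus any trailing reason text.
_VERDICT_RE = re.compile(
    r"VERDICT:\s*(SURVIVES|REFUTED)\b[\s:.\-]*(.*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_verdict(reply: str) -> Verdict:
    """Extract the refuter's verdict from its raw reply text."""
    match = _VERDICT_RE.search(reply)
    if match is None:
        # Assumption: a reply that never states a verdict does not count as a survivor.
        return Verdict(False, "no verdict line in reply")
    label = match.group(1).upper()
    reason = match.group(2).strip()
    return Verdict(label == "SURVIVES", reason)
```

Parsing strictly matters more than it looks: a refuter that rambles without committing to a verdict should not be able to wave a finding through by accident.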

Findings are dispatched in parallel, one refutation call each. Cheap relative to the cost of me chasing a phantom bug. The harness — finder, then adversarial check, keep only survivors — is maybe 40 lines of orchestration around two models that already existed.
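
The shape of that harness can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CODY's real orchestration: `review`, `find`, and `refute` are hypothetical names, and the two callables stand in for whatever client code actually reaches the finder and refuter models.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical signatures: the real model calls are whatever your clients expose.
Finder = Callable[[str], List[str]]   # diff -> list of suspected findings
Refuter = Callable[[str, str], str]   # (diff, finding) -> raw reply text


def review(diff: str, find: Finder, refute: Refuter, max_workers: int = 8) -> List[str]:
    """Run the finder once, attack each finding in parallel, keep only survivors."""
    findings = find(diff)
    if not findings:
        return []

    # One refutation call per finding; threads suit I/O-bound model calls.
    # pool.map preserves input order and re-raises any exception from a call.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        replies = list(pool.map(lambda finding: refute(diff, finding), findings))

    # A finding reaches the human only if the refuter explicitly gave up.
    return [
        finding
        for finding, reply in zip(findings, replies)
        if "VERDICT: SURVIVES" in reply.upper()
    ]
```

Passing the two models in as plain callables keeps the harness indifferent to vendor, which is the point: swapping the refuter for a different model family is a one-argument change.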

Here’s why this generalizes beyond code review. Any AI that makes judgment calls — flagging fraud, triaging tickets, reviewing contracts — fails the same way: a single confident pass produces plausible-but-wrong calls, and each one spends a little of your trust until you stop reading the output at all.

The fix isn’t a better finder. It’s an adversary. Pair your generator with a skeptic from a different model family and ship only what survives the attack. The model’s raw intelligence matters less than the harness you put around it — and a harness that forces the system to argue against itself is the cheapest trust you can buy.