CODY Has to Argue Itself Out of Every Bug It Finds

The first week I ran CODY on my own repos, it told me a function had a race condition. Confident, specific, named the two lines. I spent twenty minutes reading code that turned out to be perfectly fine — the “race” required a calling pattern that didn’t exist anywhere in the codebase.

CODY is a code reviewer I built — a standalone tool that reads my pull requests and tells me what’s wrong before a human has to. Think of it as a very fast junior engineer who never gets tired. The problem with that junior engineer, in version one, was that it was fluent. It wrote up every finding in the calm, authoritative voice of someone who’s right. And maybe half the time, it wasn’t.

That’s worse than no reviewer. A reviewer you can’t trust isn’t a reviewer; it’s a pile of homework. Every false alarm costs you the time to disprove it, plus a little erosion of your willingness to read the next one. I caught myself skimming CODY’s output, which defeats the entire point.

My first instinct was the obvious one: make it smarter. Better prompt, more context, bigger model. I told it to “only report high-confidence issues.” You can guess how that went — it just got more confident about the wrong things. Confidence and correctness are not the same dial, and prompting mostly turns the first one.

So I stopped trying to make the finder better and added an enemy.

Now every finding CODY raises gets handed to a second model, from a different vendor, whose only job is to refute it. Not to double-check politely — to actively argue the finding is wrong. The skeptic gets the same code and a single mission: prove this bug isn’t real. If it can build a credible case, the finding dies right there and I never see it. Only the findings that survive a genuine attempt to demolish them reach me.

The race-condition false alarm? The refuter would have killed it instantly: show me the caller that triggers this — there isn’t one.

Why this works is the interesting part. A single model grading its own work brings the same blind spots to the grading; asking it “are you sure?” just gets you the same reasoning, restated. A model from a different family fails differently. It wasn’t trained on the same data in the same way, so where the finder hallucinates a calling pattern, the skeptic has little reason to hallucinate the same one. You’re not averaging two opinions. You’re making them fight, and disagreement is the signal.
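To put rough numbers on that argument, here is a back-of-the-envelope sketch. Every rate in it is an assumption I made up for illustration, not a measurement from CODY: `precision` borrows the "maybe half" figure from earlier, while `kill_false` and `kill_true` (the share of false and real findings the skeptic refutes) are hypothetical. The function name `after_gate` is mine too.

```python
def after_gate(n_findings, precision, kill_false, kill_true):
    """Return (findings surfaced, share of surfaced findings that are real).

    precision  -- share of the finder's raw findings that are real bugs
    kill_false -- share of false findings the skeptic refutes (assumed)
    kill_true  -- share of real findings the skeptic wrongly refutes (assumed)
    """
    real = n_findings * precision
    false = n_findings - real
    real_kept = real * (1 - kill_true)
    false_kept = false * (1 - kill_false)
    surfaced = real_kept + false_kept
    share_real = real_kept / surfaced if surfaced else 0.0
    return surfaced, share_real


# Hypothetical: a skeptic that shares the finder's blind spots kills few false alarms.
same_family = after_gate(100, 0.5, kill_false=0.2, kill_true=0.05)

# Hypothetical: a skeptic that fails differently kills most of them.
other_family = after_gate(100, 0.5, kill_false=0.9, kill_true=0.05)

print(same_family)
print(other_family)
```

Under those made-up rates, the same-family skeptic still surfaces about 88 of 100 findings, with only a little over half of them real. The different-family skeptic surfaces about 53, roughly 90% of them real. The real rates will differ; the point is only that the skeptic pays off to the extent that its mistakes don't overlap with the finder's.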

How the finder/refuter loop is wired

CODY runs the finder pass with Opus, then dispatches each finding to a different-vendor refuter (codex via a tool I call Forge). The contract matters: the refuter is told to refute, not to “review.” Framing it as adversarial is what stops it from rubber-stamping.

findings = opus_review(diff)              # the finder (Opus)
for f in findings:
    # Refuter prompt: "Argue this finding is WRONG. Cite the code.
    #                  Conclude SURVIVES or REFUTED."
    verdict = forge_refute(diff, f)       # different model family, hostile prompt
    if verdict.conclusion == "SURVIVES":
        # The rebuttal failed, so the finding ships with the case against it.
        surface(f, rebuttal=verdict.reasoning)
    # REFUTED findings are dropped here and never surfaced.

The refuter’s failed rebuttal ships with the finding, so when I do get a report, I’m reading the strongest case against it too. The refuter pass alone roughly halved what reached me, and what reached me was real.
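If you want to try the shape of this loop without any model calls, here is a minimal runnable sketch. It is not CODY's actual code: `Verdict`, `parse_verdict`, `review`, and `fake_refuter` are hypothetical names, and the refuter is a stub that returns canned text. Only the SURVIVES/REFUTED contract comes from the loop above. One choice here is my own assumption: a reply that can't be parsed counts as SURVIVES, so a formatting slip by the refuter can't quietly bury a real bug.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    conclusion: str  # "SURVIVES" or "REFUTED"
    reasoning: str   # the refuter's full argument, kept either way


def parse_verdict(reply: str) -> Verdict:
    """Read the conclusion from the last non-empty line of the refuter's reply."""
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    last = lines[-1].upper().rstrip(" .!") if lines else ""
    # Assumption: anything that is not an explicit REFUTED survives.
    conclusion = "REFUTED" if last.endswith("REFUTED") else "SURVIVES"
    return Verdict(conclusion, reply.strip())


def review(
    diff: str,
    findings: List[str],
    refute: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Return (finding, failed rebuttal) pairs for findings that survive."""
    surfaced = []
    for finding in findings:
        verdict = parse_verdict(refute(diff, finding))
        if verdict.conclusion == "SURVIVES":
            surfaced.append((finding, verdict.reasoning))
    return surfaced


def fake_refuter(diff: str, finding: str) -> str:
    """Stand-in for the different-vendor model; returns canned replies."""
    if "race" in finding:
        return "No caller in the diff triggers this interleaving.\nREFUTED"
    return "I could not find a guard for this input.\nSURVIVES"


if __name__ == "__main__":
    raw = ["possible race in cache.get", "unchecked None in parse"]
    for finding, rebuttal in review("<diff text>", raw, fake_refuter):
        print(finding)
        print("  strongest case against:", rebuttal)
```

Passing the refuter in as a plain callable keeps the harness indifferent to which vendor sits behind it, which is the property the whole setup depends on.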

The lesson generalizes past code review. If you’re building an AI that judges, flags, or scores anything — resumes, transactions, support tickets — your instinct will be to make the judge more confident. Don’t. Pair the finder with a skeptic from a different model family and ship only what survives the fight. The harness around the model earns trust faster than the model’s raw intelligence ever will.