← All posts

Make Your AI Reviewer Argue With Itself

CODY surfaced confident, wrong findings until I gave every one of them to a skeptic from a different vendor whose only job was to tear it down.

  • AI
  • code-review
  • agents
  • Claude
  • engineering

The first week I ran CODY on my own repos, it flagged a race condition in a function that had no concurrency anywhere near it. CODY is my standalone code reviewer — an AI that reads a diff and writes up what looks wrong, the same job a teammate does when they leave comments on your pull request. It wrote three confident paragraphs about a thread-safety bug. There were no threads.

That was the problem in a sentence. CODY was fluent, and fluent reads as correct. So it kept handing me plausible-sounding findings that were just wrong, and I’d burn ten minutes proving each one false. A reviewer that costs you ten minutes per bad call is worse than no reviewer — at least with no reviewer you don’t go chasing ghosts.

My first instinct was the obvious one: make it smarter. Better prompt, more context, bigger model. That helped the wording and not the trust. A sharper writer just makes a wrong claim more convincing. I was tuning the part that was already too good at sounding right.

What actually fixed it was adding a second model whose entire job is to disagree.

Now when CODY (running on Opus, Anthropic’s model) raises a finding, it doesn’t come to me. It goes to a different-vendor model — codex, in a setup I call Forge — and that model is told one thing: refute this. Find the reason this is wrong. Show the code path that makes it a non-issue. Only if the finding survives that attack does it reach my screen.

The race-condition flag died instantly. The refuter pointed out there’s no shared state and no second caller — the function runs once, top to bottom. CODY had pattern-matched on a variable name. The skeptic caught it because catching it was the only thing it was being graded on.

Why a different vendor matters: two instances of the same model tend to share the same blind spots. They were trained on overlapping data and they rhyme. Ask Opus to check Opus and it mostly nods. A model from a different family fails differently, so it notices things the first one is structurally blind to. The disagreement is the feature.

The finder/refuter harness give me the detail

The structure is two passes with a hard gate between them. CODY runs the finder (Opus) over the diff and emits structured findings. Each finding is then dispatched to the refuter (codex/Forge) with an adversarial system prompt, and the refuter must return a verdict plus a concrete code-path justification.

finding = opus_review(diff)          # vendor A: find
rebuttal = codex_refute(finding, diff)  # vendor B: attack

if rebuttal.verdict == "REFUTED":
    drop(finding, reason=rebuttal.justification)
else:
    surface(finding)  # survived the skeptic, show me

The refuter has to do work to dismiss a finding — name the missing caller, the guard clause, the type that makes it impossible. “Looks fine” isn’t accepted; it has to win the argument. Run the skeptic at low temperature; you want it pedantic, not creative.

The thing I keep coming back to is that the harness mattered more than the finder’s raw intelligence. I spent days trying to make the smart part smarter when the missing piece was a dumb, stubborn opponent.

So if you’re building anything where an AI makes a judgment call — flags fraud, triages tickets, reviews code — don’t spend your effort making it more sure of itself. Pair it with a skeptic from a different model family and ship only what survives the fight. Confidence is cheap. Surviving an attack is the signal.