A Claude Code agent opened a pull request that wired a database query straight into a request handler with no timeout, and the AI reviewer I’d set up to catch exactly that kind of thing left a thoughtful three-paragraph comment about naming conventions instead. It approved the PR.
That was the afternoon I stopped trying to make the reviewer smarter.
Here’s the situation in plain terms, because it’s a problem anyone managing people would recognize. I have agents writing a lot of code now — more than I can read line by line. So I’d handed the reviewing job to another AI: a large language model (the same kind of “predict the next word” system that powers chatbots) reading each change and commenting like a senior engineer would. The trouble is that an LLM is a brilliant, slightly unreliable intern. Ask it the same question twice and you get two different answers. Some days it caught the missing timeout. Some days it wrote an essay about variable names and waved the real bug through. You cannot build a gate out of something that changes its mind.
So I split the job in two.
The cheap, boring stuff — the rules that are either true or false — I pulled out into a layer of plain pattern checks that run first and block the merge if they fail. No database call without a timeout. No secret-shaped string committed in plaintext. No catch block that swallows the error and returns nothing. These are dumb. A regex can do most of them. That’s the point: they give the same answer every single time, they run in under a second, and an agent in a hurry can’t sweet-talk its way past them.
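To make that concrete, here is a minimal sketch of what such a layer might look like in Python. The `Rule` class, the rule names, and the `check` helper are hypothetical, and the secret and swallowed-exception patterns are illustrative assumptions, not a complete rule set.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern


# Hypothetical rule set: the patterns are illustrative, not exhaustive.
RULES = [
    # An .execute( call with no "timeout" before the first closing parenthesis.
    Rule("no-query-without-timeout",
         re.compile(r"\.execute\((?![^)]*timeout)")),
    # A secret-looking name assigned a long quoted literal.
    Rule("no-plaintext-secret",
         re.compile(r"(?i)(?:api_?key|secret|token|password)\w*\s*=\s*['\"][A-Za-z0-9_\-]{16,}['\"]")),
    # An except block whose first statement is `pass` or a bare/None return.
    Rule("no-swallowed-exception",
         re.compile(r"except[^\n]*:\s*\n\s*(?:pass|return(?:\s+None)?)\s*$",
                    re.MULTILINE)),
]


def check(source: str) -> list[str]:
    """Return the name of every rule that the source text violates."""
    return [rule.name for rule in RULES if rule.pattern.search(source)]
```

An empty list means the change may proceed; anything else blocks the merge. Regexes like these will miss cases (nested parentheses, for one), which is acceptable as long as the answer never varies between runs.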
Only after that gate passes does the LLM reviewer get a turn — and now it’s purely advisory. It can’t block anything. I freed it up to do the one thing the dumb rules genuinely can’t: judge intent. Is this the right abstraction? Does this change make the architecture worse even though every line is technically fine? That’s where probabilistic judgment earns its keep.
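The ordering is simple enough to sketch in a few lines. This is an assumption about how one might wire it, not the exact pipeline: `run_pipeline`, `pytest_gate`, and the `advisory_review` callback are hypothetical names, and the review callback stands in for whatever posts the model's comments.

```python
import subprocess
import sys
from typing import Callable


def run_pipeline(gate: Callable[[], int],
                 advisory_review: Callable[[], None]) -> int:
    """Run the blocking gate first; run the advisory review only if it passes."""
    gate_status = gate()
    if gate_status != 0:
        # Blocked: the reviewer is never invoked, so no tokens are spent.
        return gate_status
    try:
        advisory_review()
    except Exception as exc:
        # The review is advisory: report its failure but never block on it.
        print(f"advisory review failed: {exc}", file=sys.stderr)
    return 0


def pytest_gate() -> int:
    """Hypothetical gate: run the rule tests and return pytest's exit code."""
    result = subprocess.run([sys.executable, "-m", "pytest", "rules/", "-q"])
    return result.returncode
```

The only exit code that can fail the build comes from the gate. Whatever the reviewer does, including crashing, the pipeline still returns 0 once the gate has passed.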
The reason this works is almost embarrassing. The expensive reviewer was failing not because it was a bad reviewer but because I’d given it a job — be reliable — that its very nature forbids. The reliable job belongs to the cheap deterministic thing. The judgment job belongs to the expensive probabilistic thing. I’d had them swapped.
The harness: self-proving rules in the blocking path
The trick that made me trust the gate: every blocking rule ships with a known-bad snippet it must flag, run as a unit test of the rule itself. If a rule ever stops catching its own canary, CI fails — so the gate can’t silently rot.
# rules/no_query_without_timeout.py
import re

# Flags any .execute( call with no "timeout" before its first closing parenthesis.
PATTERN = re.compile(r"\.execute\((?![^)]*timeout)")

KNOWN_BAD = "cur.execute('SELECT 1')"  # must be flagged
KNOWN_GOOD = "cur.execute('SELECT 1', timeout=5)"  # must pass


def test_rule_proves_itself():
    assert PATTERN.search(KNOWN_BAD)
    assert not PATTERN.search(KNOWN_GOOD)

CI runs pytest rules/ (the gate) before it ever spends a token on the LLM pass. The deterministic checks exit with code 1 on failure and cannot be bypassed; the Claude review step posts comments and always exits 0.
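The same canary idea can cover a whole rule set at once, so that no rule can be added without its proof. A rough sketch, assuming a hypothetical registry: the `Rule` tuple, the `RULES` dict, `broken_rules`, and the second rule are all invented for illustration, and only the timeout pattern comes from the rule shown earlier.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    pattern: re.Pattern
    known_bad: str   # canary the rule must flag
    known_good: str  # near-miss the rule must leave alone


# Hypothetical registry: the second entry is an illustrative assumption.
RULES = {
    "no_query_without_timeout": Rule(
        re.compile(r"\.execute\((?![^)]*timeout)"),
        "cur.execute('SELECT 1')",
        "cur.execute('SELECT 1', timeout=5)",
    ),
    "no_bare_except_pass": Rule(
        re.compile(r"except[^\n]*:\s*\n\s*pass\b"),
        "try:\n    f()\nexcept Exception:\n    pass\n",
        "try:\n    f()\nexcept Exception:\n    raise\n",
    ),
}


def broken_rules(rules: dict[str, Rule]) -> list[str]:
    """Return names of rules that miss their canary or flag their near-miss."""
    return [
        name for name, rule in rules.items()
        if not rule.pattern.search(rule.known_bad)
        or rule.pattern.search(rule.known_good)
    ]


def test_every_rule_proves_itself():
    assert broken_rules(RULES) == []
```

Because the canaries live in the same structure as the pattern, a rule cannot be registered without them, and the failure message names exactly which rule rotted.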
If you’re drowning in AI-written pull requests, don’t shop for a smarter reviewer. Take inventory of what your reviewer is checking, and move everything with a yes/no answer into a fast un-bypassable gate. Leave the model only the questions that have no regex. The harness around the model matters more than which model you picked.