
Why the Same Code Looks Different From Every Angle: BugBot Lessons Learned

After running BugBot across several real codebases, the result that surprised me most wasn't the bugs it found. It was which iteration found them. The same files, reviewed from a different angle in a

  • ai-agents
  • code-review
  • debugging
  • claude-code
  • developer-tools
  • testing

I was staring at a JSON state file named bugbot-state.json, scrolling through findings, when I noticed something that stopped me cold. Two different Claude Code agents had flagged the same json.loads() call on line 1142. Neither agent knew the other existed. They’d been given different review angles, read the files in different orders, and each started from a completely fresh Claude conversation session — no shared memory, no inherited bias. Yet both landed on the same bug independently.

The code hadn’t changed between runs. The angle had.

This is Part 2 of the BugBot series. Part 1 covers the methodology and design. This post is about what I learned running it in practice.

BugBot Lessons Learned: Angle Diversity in Code Review

The Second Pass Finds What the First Walked Past

I built BugBot as an adversarial code review loop: Claude Code agents review files from different attack angles, each in a brand-new LLM conversation that carries no context from what any prior agent noticed. Between runs, the only thing that persists is bugbot-state.json, a flat JSON file tracking which angles have been tried and what findings exist so far. Everything else (the agent’s reasoning, its assumptions, the path it took to reach a conclusion) gets thrown away.

I expected this design to catch bugs. What I didn’t expect was which iteration would catch them. Again and again, the second or third pass at a file — same code, different review angle — surfaced issues the first agent had walked right past.

In one review, an agent running a “data round-trip” angle (tracing what happens to data as it moves through the system) caught a normalization mismatch. An earlier agent focused on “cross-feature interaction” had seen the exact same code and dismissed it as correct. Same file, different frame, opposite conclusion.

Human reviewers do this intuitively — you walk away from code, come back with fresh eyes, and see what you missed. BugBot automates that pattern. Every iteration is genuinely fresh because Claude Code starts each agent in a new conversation with no accumulated bias from what came before.

Shuffled File Order Changes What Gets Noticed

Each parallel agent gets the file list in a different randomized order. I took this idea straight from Cursor’s BugBot research on parallel review passes.

Here’s why it works. When you read file_A before file_B, you form hypotheses from file_A that you carry into file_B. Those hypotheses act as a filter — they determine what you notice and what you skip. Reverse the order, and different hypotheses form first. Different hypotheses, different things stand out.

I saw this play out concretely. An agent that read the frontend template first caught a missing variable that a backend-first agent had completely missed. The backend agent had formed an assumption — “this data is always present” — before it ever reached the template. The frontend-first agent had no such assumption. It saw the gap cold.

It’s a one-line code change. The payoff is real.

When Two Agents Flag the Same Line, I Pay Attention

BugBot does something simple when two independent agents flag the same issue from different angles: it bumps the confidence level.

C1 (Possible)  →  C2 (Probable)
C2 (Probable)  →  C3 (Confirmed)

Individual agent findings are noisy. An agent will sometimes flag a suspicious pattern that’s actually fine — an LLM being overly cautious, seeing a ghost. But when two agents, each reading files in a different order, each focused on a different review angle, both independently point at line 1142? That convergence is a stronger signal than either agent alone could produce.

That json.loads() finding I mentioned earlier — a “data integrity” agent and a “NULL/empty/missing” agent both flagged it without ever seeing each other’s output. The consensus upgrade moved it from C2 (Probable) to C3 (Confirmed), which bumped the priority from MEDIUM to HIGH. It was real. json.loads() on stored data was returning a dict (a key-value mapping) when the code expected a list (an ordered sequence), and the type mismatch was silently producing wrong output downstream. No error, no crash — just quietly wrong results.
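A minimal sketch of the kind of guard that turns this silent failure into a loud one. The function name load_items and the error message are hypothetical; the only detail taken from the finding itself is that json.loads() returned a dict where a list was expected.

```python
import json


def load_items(raw: str) -> list:
    """Parse stored JSON and fail loudly if the top-level value is not a list."""
    data = json.loads(raw)
    if not isinstance(data, list):
        # Without this check a dict would flow downstream, where iterating it
        # yields its keys: wrong output, but no error and no crash.
        raise TypeError(f"expected a JSON array, got {type(data).__name__}")
    return data
```

Whether to raise, log, or coerce is a design choice that depends on the caller. The point of the sketch is only that the type check happens at the boundary where stored data re-enters the program.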

The ALL_CLEAN Contract Is Strict For a Reason

The completion criteria for BugBot are deliberately demanding.

The Mechanism: State, Shuffling, and Consensus

Why fresh context actually works. Each iteration spawns a completely new LLM conversation — no message history from the prior pass. The only continuity is an explicit JSON state file (bugbot-state.json) that records which angles ran, which findings exist, and their current confidence tier. This structure means the handoff is data, not memory. That distinction matters: memory is lossy and biased; data is exact and angle-neutral.
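Here is one way the handoff could look in code. The post names the file but not its schema, so the keys angles_run and findings, the helper names, and the atomic-write step are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_state(path: str) -> dict:
    """Return the saved state, or a fresh empty state if no file exists yet."""
    if not os.path.exists(path):
        # Hypothetical schema: which angles ran, and the findings so far.
        return {"angles_run": [], "findings": []}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
```

Each agent would call load_state at the start and save_state at the end, and nothing else crosses the boundary between conversations.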

File-order shuffling is one line of code with non-trivial payoff.

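import random

files = list(repo_files())  # repo_files() is a placeholder for however you enumerate the files under review
random.shuffle(files)       # shuffles in place; run once per agent so each gets a different order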

Because LLMs form hypotheses early in a context and then pattern-match against them, reading auth.py before api.py produces different priors than the reverse. Shuffling per-agent is the cheapest way to break shared anchoring across parallel runs.
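If you want a run to be reproducible, a per-agent seeded generator is one option. This is a sketch under my own assumptions: the function name, the agent_index and run_seed parameters, and the idea of seeding at all are not part of BugBot as described above.

```python
import random


def order_for_agent(files: list[str], agent_index: int, run_seed: int = 0) -> list[str]:
    """Return a shuffled copy of files that is stable for a given agent and run."""
    # A private Random instance keeps the global RNG untouched; a string seed
    # gives the same order every time for the same (run_seed, agent_index).
    rng = random.Random(f"{run_seed}:{agent_index}")
    shuffled = list(files)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    return shuffled
```

Replaying a run with the same run_seed then hands every agent the same order it saw the first time, which helps when you want to work out why one agent caught something another missed.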

Consensus upgrading is a simple counter check. When two independent agents both flag the same file:line, BugBot increments a hit counter for that finding in the state file. A threshold check (hit count ≥ 2) triggers a one-step confidence bump: C1 → C2, or C2 → C3. No embedding similarity, no vector dedup; just location-keyed agreement across agents that never saw each other’s output.
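The counter check can be sketched in a few lines. The dictionary layout, the function name record_flag, and the choice to bump only once per finding are assumptions; the tiers and the "two agents at the same file:line" rule come from the description above.

```python
# One-step bumps between the confidence tiers; C3 is the ceiling.
BUMP = {"C1": "C2", "C2": "C3", "C3": "C3"}


def record_flag(findings: dict, file: str, line: int, agent: str,
                confidence: str = "C1") -> dict:
    """Record one agent's flag at file:line and bump confidence on first consensus."""
    key = f"{file}:{line}"
    entry = findings.setdefault(
        key, {"agents": [], "confidence": confidence, "bumped": False}
    )
    if agent not in entry["agents"]:
        entry["agents"].append(agent)  # the same agent flagging twice counts once
    if len(entry["agents"]) >= 2 and not entry["bumped"]:
        entry["confidence"] = BUMP[entry["confidence"]]
        entry["bumped"] = True  # assumption: a third agent does not bump again
    return entry
```

Lists rather than sets keep the structure directly serializable to JSON, so it can live in the state file unchanged.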

The ALL_CLEAN contract that actually terminates the loop:

| Gate | Purpose |
| --- | --- |
| Zero CRITICAL/HIGH findings | No production-blocking bugs open |
| All 7 ODC triggers exercised | Error, boundary, role, format, config, concurrency, recovery paths each touched |
| ≥ 15 attack angles completed | Breadth across all seven angle categories |
| All regression tests passing | Fixes don’t introduce new breaks |

The 7-trigger requirement is the most important gate — it prevents the loop from quitting after covering only the happy path.
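The four gates reduce to one boolean check. This is a sketch, not BugBot’s actual code: the function signature and the lowercase trigger names are assumptions, while the gate values (seven triggers, fifteen angles, no CRITICAL/HIGH, tests passing) come from the contract above.

```python
ODC_TRIGGERS = {"error", "boundary", "role", "format", "config", "concurrency", "recovery"}
MIN_ANGLES = 15


def all_clean(open_severities: list[str], triggers_hit: set[str],
              angles_completed: int, tests_passing: bool) -> bool:
    """Return True only when every gate of the ALL_CLEAN contract holds."""
    no_blockers = not any(s in ("CRITICAL", "HIGH") for s in open_severities)
    triggers_covered = ODC_TRIGGERS <= triggers_hit  # every trigger was exercised
    enough_angles = angles_completed >= MIN_ANGLES
    return no_blockers and triggers_covered and enough_angles and tests_passing
```

Because the trigger gate is a subset test, a loop that only ever walked the happy path fails it no matter how many angles it ran.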

Try it yourself: add a triggers_hit set to your next code review checklist. Before you sign off, verify you’ve explicitly tested error recovery, empty/null inputs, and a user with an unexpected role. The gaps that remain are exactly where production bugs hide.

Early versions of BugBot declared clean too quickly. I’d watch the loop exit after two passes, confident it had covered everything, only to find bugs later that it had never even tried to look for. The 7-trigger requirement came from that frustration — I realized “error recovery” and “configuration” paths almost never got reviewed in the first few passes because the early agents gravitated toward the happy path. The happy path is comfortable.

It’s also where the fewest bugs live.

The tool now refuses to stop until it has forced itself to think about what happens when the network is down, when a user has an unusual role, when data arrives in an unexpected format. These are exactly the paths that fail silently in production — no crash, no alert, just wrong behavior that nobody notices until a customer reports it three weeks later.

What Surprised Me

The most interesting bugs were in the interaction category. Feature A writes a field. Feature B reads it. Feature B was written before Feature A existed, so Feature B’s author baked in assumptions that were true at the time but broke silently when Feature A came along. Classic integration bug. BugBot’s “adjacent feature mutation” angle surfaces exactly this — it asks: who else reads what you just wrote? Every feature author thinks about their own feature. Almost nobody thinks about the downstream readers.

The mechanical pre-pass catches more than you’d think. I added a pre-pass that runs ruff (a Python linter that checks for style and logic errors) and black (a Python code formatter) before any Claude Code agent touches the code. I figured it’d catch a few formatting issues. It consistently finds real problems — unused imports, variables that accidentally shadow builtins — that would have wasted agent context if discovered mid-review. Automate the automatable. It’s obvious advice, but seeing the before-and-after difference made it land.
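A pre-pass like this can be a thin wrapper around the two command-line tools. The sketch below assumes ruff and black are installed and on PATH, and that black runs in --check mode so it reports without rewriting files; the post does not say which mode BugBot uses. The function name and the injectable run parameter are hypothetical.

```python
import subprocess


def mechanical_prepass(path: str, run=subprocess.run) -> dict:
    """Run ruff and black in check mode; return each tool's exit code and output."""
    commands = {
        "ruff": ["ruff", "check", path],
        "black": ["black", "--check", path],
    }
    results = {}
    for name, cmd in commands.items():
        # A non-zero return code means the tool found something to report.
        proc = run(cmd, capture_output=True, text=True)
        results[name] = {
            "returncode": proc.returncode,
            "output": proc.stdout + proc.stderr,
        }
    return results
```

If either tool is missing, subprocess.run raises FileNotFoundError, which is probably what you want: a pre-pass that silently skips a linter is worse than one that stops.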

Evidence requirements eliminate noise. Every finding in BugBot needs three things: a specific file:line citation, the actual code snippet, and a trigger scenario (the conditions that make the bug surface). When I added this rule, the false positive rate dropped noticeably. Agents that couldn’t cite their evidence had to downgrade their finding to C1 (Possible), which doesn’t block shipping. Requiring evidence doesn’t just filter noise — it changes how the agents approach the review. They look harder when they know they need to produce a citation.
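The evidence rule is easy to enforce mechanically. In this sketch the Finding class, its field names, and enforce_evidence are hypothetical; the three required parts and the downgrade to C1 are as described above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str
    confidence: str     # "C1", "C2", or "C3"
    location: str = ""  # file:line citation
    snippet: str = ""   # the actual code
    trigger: str = ""   # conditions that make the bug surface

    def has_evidence(self) -> bool:
        """True only when all three evidence parts are non-empty."""
        return all(part.strip() for part in (self.location, self.snippet, self.trigger))


def enforce_evidence(finding: Finding) -> Finding:
    """Downgrade to C1 (Possible) when any evidence part is missing."""
    if not finding.has_evidence():
        finding.confidence = "C1"
    return finding
```

Running every agent’s output through a check like this before it reaches the state file means an uncited suspicion can never block shipping.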

The state file is the whole system. Between iterations, bugbot-state.json is the only thing that persists. Every Claude Code conversation is disposable. Every agent’s reasoning is temporary. The state file is the system’s memory, and it’s explicit, readable, and trivially inspectable. This is a design principle I’ve been applying to other agent workflows: if your system’s correctness depends on anything other than explicit, readable, persistent state, you have a fragile system. If you can’t explain what the agent knows by pointing at a file, you don’t actually know what the agent knows.

Applying This to Any Code Review Process

You don’t need BugBot to benefit from these ideas. Here’s what generalizes:

  • Use angle diversity, not one monolithic pass. Don’t try to catch everything in a single review. Run separate focused passes: data integrity, security, error handling, cross-feature effects. Each pass has a specific mandate, and that narrow focus produces better results than a broad “find all the bugs” sweep.
  • Track trigger coverage, not just findings. Before you call a review complete, ask yourself: have I tested what happens on error? On empty data? With different user roles? The answers tell you what you’ve missed — and what you’ve missed is where the bugs are.
  • Require evidence. A suspicion without a code citation isn’t a finding, it’s noise. You need a file:line citation, the snippet, and the trigger conditions. If you can’t produce those three things, you haven’t found a bug; you’ve found a feeling.
  • Iterate. The second pass at a different angle will find things the first pass didn’t. The third will find things the second didn’t. Stop when a complete pass — one that exercises every angle you’ve defined — finds nothing new.
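The stopping rule in the last point can be written as a short loop. This is a sketch under stated assumptions: review stands in for whatever performs one focused pass (a person or an agent) and returns finding identifiers, and max_passes is a safety cap I added.

```python
from typing import Callable, Iterable


def review_until_quiet(angles: Iterable[str],
                       review: Callable[[str], set[str]],
                       max_passes: int = 10) -> set[str]:
    """Repeat full passes over every angle until one pass adds no new findings."""
    angles = list(angles)
    known: set[str] = set()
    for _ in range(max_passes):
        before = len(known)
        for angle in angles:
            known |= review(angle)  # each angle is its own focused pass
        if len(known) == before:
            break  # a complete pass found nothing new
    return known
```

The loop only stops on a complete pass that turns up nothing, so a quiet first angle is never enough to end the review.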

The core insight is simple: code review completeness is about which failure modes you tested, not how carefully you read the diff. The diff is the same every time you look at it. The angle is what changes.