Agentic Engineering, Part 3: Tracing Every Code Path Before It Becomes a Bug

  • agentic-engineering
  • claude-code
  • architecture
  • code-review
  • ai-agents
  • series

BugBot finds bugs in code that’s already written. But what about the bugs that exist because the architecture is wrong — where the code does exactly what it says, but “what it says” is inconsistent across six different call sites that each implement their own version of the same logic?

I found this out the hard way.

A clinic wanted visit-reason notes attached to every appointment. I added the note creation to book_appointment(). Tested it. Worked perfectly. Shipped it. Moved on.

Three weeks later: “the visit reason notes are missing from about half the appointments.”

I opened the codebase and just stared at the screen. book_appointment() wasn’t the only function that called create_appointment(). There were four others — reschedule flows, admin overrides, a timer-based auto-book. None of them created the note. The side effect lived in the wrong layer. The bug wasn’t in any single line of code. It was in the architecture.
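One way to picture the fix is to move the note into the function every path already shares. The sketch below is a minimal illustration, not the clinic's actual code: the in-memory lists, the argument names, and reschedule_appointment() are assumptions, and only book_appointment() and create_appointment() come from the story above.

```python
# Hypothetical in-memory stand-ins for the appointments and notes tables.
appointments = []
notes = []


def create_appointment(patient, slot, visit_reason=None):
    """Shared layer: every entry point ends up here, so the side effect lives here."""
    appointment = {"id": len(appointments) + 1, "patient": patient, "slot": slot}
    appointments.append(appointment)
    if visit_reason:
        # The note is created once, in the one place all callers pass through.
        notes.append({"appointment_id": appointment["id"], "text": visit_reason})
    return appointment


def book_appointment(patient, slot, visit_reason):
    return create_appointment(patient, slot, visit_reason)


def reschedule_appointment(patient, new_slot, visit_reason):
    # Assumed second entry point; it gets the note without any extra code.
    return create_appointment(patient, new_slot, visit_reason)


book_appointment("Ada", "09:00", "annual checkup")
reschedule_appointment("Grace", "10:30", "follow-up")
```

With the note inside create_appointment(), a new entry point cannot forget it, which is the property the original book_appointment() placement lacked.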

That pattern — duplicated logic, bypassed pipelines, side effects in the wrong place — keeps showing up in any codebase that grows fast. So I built ArchReview, a skill that traces every code path through a feature and maps where paths diverge.

Architectural Trace

Here’s the core idea: most bugs that survive testing are gaps between code paths, not bugs inside them. A single path works fine in isolation. It’s when four different paths all reach the same destination, each applying a different subset of the same processing steps, that you get “works on my machine” — or worse, “works on the route I tested.”

Two Modes: Audit and Design

ArchReview has two workflows that form a pipeline:

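| Workflow       | What It Does                                                           | When To Use                                             |
|----------------|------------------------------------------------------------------------|---------------------------------------------------------|
| AuditFeature   | Trace all code paths, map entry points, find structural problems       | “Why is this broken?” or “How does this actually work?” |
| DesignSolution | Research patterns, run a Red Team debate, generate implementation spec | “How should we fix this?”                               |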

You can run them independently or chain them: audit first to understand the architecture, then design a solution for the problems found.

How AuditFeature Works

When invoked, AuditFeature spawns five specialized agents in parallel — separate Claude Code subprocesses, each with a narrow focus and deep expertise:

| Agent                      | Focus                                                                                     | Output                                                           |
|----------------------------|-------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| Entry Point Mapper         | Find EVERY call site for the key function, trace what happens before and after each call  | Numbered list of all entry points with data flow                 |
| Logic Duplication Detector | Find ALL copies of the feature’s core logic, diff them against each other                 | Coverage matrix showing which filters/transforms each path applies |
| Data Flow Tracer           | Follow data from source to sink, find monkey-patches and overrides                        | Execution order map with every transformation point              |
| Transaction Safety Auditor | Verify every database write uses BEGIN IMMEDIATE transactions                             | Safety matrix: N of M write paths are transaction-safe           |
| i18n Completeness Checker  | Verify every data-i18n key exists in all language dictionaries                            | Translation coverage: N of M keys are fully translated           |
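The fan-out itself is simple to sketch. The example below is an assumption-heavy illustration of running the five agents concurrently and collecting their reports: run_agent() is a placeholder that returns a string, and how each Claude Code subprocess is actually launched is not shown because the post does not specify it.

```python
from concurrent.futures import ThreadPoolExecutor

# Agent names taken from the table above.
AGENTS = [
    "Entry Point Mapper",
    "Logic Duplication Detector",
    "Data Flow Tracer",
    "Transaction Safety Auditor",
    "i18n Completeness Checker",
]


def run_agent(name, feature):
    """Placeholder: a real version would launch a subprocess and return its report."""
    return f"[{name}] report for {feature}"


def audit_feature(feature, runner=run_agent):
    """Run every agent concurrently and collect reports keyed by agent name."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {name: pool.submit(runner, name, feature) for name in AGENTS}
        # result() blocks until that agent finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}


reports = audit_feature("visit-reason notes")
```

Passing the runner as a parameter keeps the orchestration testable without spawning anything.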

The key output is the coverage matrix — a table showing which processing steps each code path applies:

| Code Path        | Filter A | Filter B | Post-hook | SSE Broadcast |
|------------------|----------|----------|-----------|---------------|
| SSE update       | Yes      | NO       | NO        | N/A           |
| UI click         | Yes      | Yes      | Yes       | Yes           |
| API call         | Yes      | Yes      | NO        | Yes           |
| Timer refresh    | NO       | NO       | NO        | NO            |

When you see a matrix like that, the architecture problem stops being invisible. Four paths to the same destination, each applying a different subset of processing. The bugs aren’t in any individual path — they’re in the gaps between paths.
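A matrix like this is easy to turn into a list of gaps once each path's applied steps are written down. The following is a small sketch under that assumption: the data mirrors the example table, while coverage_gaps() and the exempt argument (used for the N/A cell) are hypothetical names, not part of ArchReview.

```python
STEPS = ["Filter A", "Filter B", "Post-hook", "SSE Broadcast"]

# Which steps each code path applies, copied from the example matrix.
APPLIED = {
    "SSE update": {"Filter A"},
    "UI click": {"Filter A", "Filter B", "Post-hook", "SSE Broadcast"},
    "API call": {"Filter A", "Filter B", "SSE Broadcast"},
    "Timer refresh": set(),
}

# Steps that do not apply to a path (the N/A cells).
EXEMPT = {"SSE update": {"SSE Broadcast"}}


def coverage_gaps(applied, exempt=None):
    """Return {path: [missing steps]} for every path that skips a required step."""
    exempt = exempt or {}
    gaps = {}
    for path, steps in applied.items():
        skipped = exempt.get(path, set())
        missing = [s for s in STEPS if s not in steps and s not in skipped]
        if missing:
            gaps[path] = missing
    return gaps


gaps = coverage_gaps(APPLIED, EXEMPT)
for path, missing in gaps.items():
    print(f"{path}: missing {', '.join(missing)}")
```

Only the fully covered path (UI click) drops out of the result; every other row is a candidate bug.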

Template Variable Completeness

One of ArchReview’s most valuable checks is something no linter catches: template variable completeness across render_template callers.

In Flask, multiple routes can render the same template. If you add a new feature to one route (say, a lunch break banner that needs a lunch_break variable), every other route that renders that template needs to pass the same variable. Miss one and the template sees an undefined variable: Jinja2 may render the banner as empty, or raise an UndefinedError in production as soon as the template reads an attribute of it. Either way, it only happens on the route you didn’t test.

The Entry Point Mapper agent greps for every render_template call for each template, compares the keyword arguments, and flags any variable present in some callers but missing from others. This caught a real bug: both /welcome and /flow render kiosk/welcome.html, but only /flow passed the lunch_break variable. The lunch break banner worked on one route and crashed the other.
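The comparison described here can be approximated with Python's ast module. This is a rough sketch rather than the agent's implementation: it only handles render_template calls whose template name is a string literal, the function names are hypothetical, and the two sample routes (including the clinic argument) are invented to mirror the /welcome and /flow case.

```python
import ast
from collections import defaultdict


def template_kwargs(source):
    """Map template name -> list of (line, keyword-argument names) per call."""
    calls = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        first = node.args[0]
        if name == "render_template" and isinstance(first, ast.Constant) \
                and isinstance(first.value, str):
            # kw.arg is None for **kwargs, which cannot be checked statically.
            kwargs = {kw.arg for kw in node.keywords if kw.arg is not None}
            calls[first.value].append((node.lineno, kwargs))
    return calls


def missing_variables(source):
    """Return {template: {line: [variables other callers pass but this one omits]}}."""
    report = {}
    for template, sites in template_kwargs(source).items():
        union = set().union(*(kwargs for _, kwargs in sites))
        gaps = {line: sorted(union - kwargs) for line, kwargs in sites if union - kwargs}
        if gaps:
            report[template] = gaps
    return report


# Invented routes that reproduce the shape of the lunch_break bug.
ROUTES = '''
def welcome():
    return render_template("kiosk/welcome.html", clinic=clinic)

def flow():
    return render_template("kiosk/welcome.html", clinic=clinic, lunch_break=lunch_break)
'''

report = missing_variables(ROUTES)
```

A caller that forwards variables through **context would need manual review, since its keys are not visible to a static walk.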

Transaction Safety Auditing

This one is specific to SQLite, but the pattern generalizes. Python’s sqlite3 module uses DEFERRED transactions by default: a transaction takes a shared lock on its first read, then tries to upgrade to a write lock on its first write. Under concurrent load from multiple Gunicorn workers (the server processes that handle web requests in parallel), that upgrade can fail immediately with “database is locked”, bypassing busy_timeout, because SQLite refuses to wait when the connection already holds a read lock and waiting could deadlock. The fix is BEGIN IMMEDIATE, which takes the write lock upfront, at a point where busy_timeout does apply.

ArchReview’s Transaction Safety agent traces every database write in the feature under audit and verifies it goes through the transaction() context manager (which issues BEGIN IMMEDIATE).
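For readers who have not seen such a wrapper, here is a minimal sketch of what a transaction() context manager could look like. It is an assumption about the shape, not the project's code: it expects a connection opened with isolation_level=None so the sqlite3 module does not start transactions of its own, and the visits columns are invented for the demo.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Take the write lock at BEGIN, commit on success, roll back on any error."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


# isolation_level=None stops the sqlite3 module from opening implicit transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE visits (id INTEGER PRIMARY KEY, reason TEXT)")

with transaction(conn):
    conn.execute("INSERT INTO visits (reason) VALUES (?)", ("annual checkup",))
```

Funnelling every write through one wrapper is also what makes the audit tractable: the agent only has to find writes that are not inside it.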

The Mechanism: Why This Catches What Linters Miss

The SQLite concurrency trap. A DEFERRED transaction, the sqlite3 default, holds only a shared read lock until its first write, then tries to upgrade. If another Gunicorn worker already holds the write lock, that upgrade fails immediately instead of waiting out busy_timeout, since waiting could deadlock both connections. The fix is a single keyword:

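# Dangerous: default DEFERRED mode; a transaction that has already read
# must upgrade its lock here, and that upgrade can fail without waiting
conn.execute("INSERT INTO visits ...")
conn.commit()

# Safe: IMMEDIATE acquires the write lock upfront, so busy_timeout applies;
# "with conn" commits on success and rolls back on an exception
with conn:
    conn.execute("BEGIN IMMEDIATE")
    conn.execute("INSERT INTO visits ...")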
You can reproduce the race locally with two connections hitting the same SQLite file at once: the DEFERRED path can raise OperationalError: database is locked, while the IMMEDIATE path queues cleanly.
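The failure can also be reproduced deterministically, without threads, by interleaving two connections by hand. The sketch below assumes a throwaway database file in the default rollback-journal mode; the function name and the visits columns are made up for the demo. It returns the error message and how long the blocked write waited, which should be far less than the configured timeout.

```python
import os
import sqlite3
import tempfile
import time


def deferred_upgrade_fails_fast(busy_timeout_s=5.0):
    """Return (error message or None, seconds waited) for a blocked lock upgrade."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: BEGIN is issued explicitly below.
        reader = sqlite3.connect(path, timeout=busy_timeout_s, isolation_level=None)
        writer = sqlite3.connect(path, timeout=busy_timeout_s, isolation_level=None)
        try:
            reader.execute("CREATE TABLE visits (id INTEGER PRIMARY KEY, reason TEXT)")

            reader.execute("BEGIN")  # plain BEGIN is DEFERRED
            reader.execute("SELECT COUNT(*) FROM visits").fetchall()  # shared lock

            writer.execute("BEGIN IMMEDIATE")  # second connection takes the write lock
            writer.execute("INSERT INTO visits (reason) VALUES ('checkup')")

            started = time.monotonic()
            try:
                # The reader must upgrade its lock; SQLite refuses without waiting.
                reader.execute("INSERT INTO visits (reason) VALUES ('follow-up')")
                message = None
            except sqlite3.OperationalError as exc:
                message = str(exc)
            return message, time.monotonic() - started
        finally:
            # Closing a connection discards its open transaction.
            reader.close()
            writer.close()


message, waited = deferred_upgrade_fails_fast()
print(message, f"after {waited:.3f}s")
```

Had the reader started with BEGIN IMMEDIATE instead, the contention would surface at BEGIN, where the busy timeout is honored.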

How the agent finds the gap. The Transaction Safety agent runs a targeted grep/AST walk for every conn.commit(), session.flush(), or direct execute() that isn’t inside a context manager wrapping BEGIN IMMEDIATE. It then cross-references against the entry-point map from the parallel Entry Point Mapper agent, so it knows which call sites reach each write, not just which files contain one. The result is a safety matrix:

| Write Location    | Wrapped in transaction()? | Reachable via concurrent path? | Risk  |
|-------------------|---------------------------|-------------------------------|-------|
| data_service.py   | Yes                       | Yes                           | None  |
| app_db.py         | No — bare commit()        | Yes (timer + API both hit it) | P0    |
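The AST half of that search can be sketched in a few lines. This is a deliberately crude approximation, not the agent's logic: it assumes the safe wrapper is a context manager named transaction, it flags every execute() and commit() outside one (including harmless reads), and it only sees lexical nesting, so a write in a helper called from inside the wrapper would still be reported.

```python
import ast

SAFE_WRAPPER = "transaction"  # assumed name of the project's safe context manager
WRITE_METHODS = {"commit", "execute"}


def _is_safe_with(node):
    """True if any item of this with-statement calls the safe wrapper."""
    for item in node.items:
        call = item.context_expr
        if isinstance(call, ast.Call):
            func = call.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if name == SAFE_WRAPPER:
                return True
    return False


def find_unwrapped_writes(source):
    """Return sorted (line, method) pairs for writes outside `with transaction(...)`."""
    findings = []

    def visit(node, protected):
        if isinstance(node, (ast.With, ast.AsyncWith)) and _is_safe_with(node):
            protected = True
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
                and node.func.attr in WRITE_METHODS and not protected):
            findings.append((node.lineno, node.func.attr))
        for child in ast.iter_child_nodes(node):
            visit(child, protected)

    visit(ast.parse(source), False)
    return sorted(findings)


# Invented module with one safe and one unsafe write path.
SAMPLE = '''
def safe(conn):
    with transaction(conn):
        conn.execute("INSERT INTO visits DEFAULT VALUES")

def unsafe(conn):
    conn.execute("INSERT INTO visits DEFAULT VALUES")
    conn.commit()
'''

findings = find_unwrapped_writes(SAMPLE)
```

The cross-reference against entry points is what turns this raw list into the risk column of the matrix; the scan alone cannot say whether a flagged write is reachable concurrently.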

Generalizes beyond SQLite. The same pattern — “which write paths skip the safe wrapper?” — applies to any resource with exclusive-lock semantics: Redis MULTI/EXEC, Postgres advisory locks, file-system flock. The agent doesn’t need to know the database; it needs to know what the project’s safe wrapper is (documented in the skill) and grep for every write that bypasses it.

This is architectural, not syntactic. A linter can’t tell you that bare commit() is dangerous specifically because of how SQLite handles lock upgrades under concurrency. The agent understands the architectural context because the workflow explains it.

DesignSolution: Red Team Debates

Once AuditFeature maps the problems, DesignSolution finds the fix.

It starts with parallel research — two agents search for patterns in open-source codebases solving similar problems. One in the primary domain, one in adjacent domains: React patterns applied to vanilla JS, backend pipeline patterns applied to frontend, that kind of cross-pollination.

From the research, I select the two most promising approaches. Then a Red Team agent debates them:

For EACH approach:
  1. Steel-man it (present it at its strongest)
  2. Identify the top 3 risks/weaknesses
  3. Score on: complexity, regression risk, cognitive load,
     edge case handling, future extensibility
  4. Deliver a verdict with what the winner should borrow
     from the loser

The structured debate produces better decisions than me asking myself “which approach is better?” ever did. The steel-manning forces me to actually present the option I’m biased against in its best light — and more than once, that’s been the one that won. The scoring matrix prevents the gut-feel decision where I pick the approach that feels cleaner but has worse edge-case handling.
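The scoring step can be made mechanical so the verdict is harder to bend toward a favorite. The sketch below is one possible tally, assuming a 1 to 5 scale where higher is better on every criterion; the two approach names and all of the scores are invented purely for illustration.

```python
# Criteria from the Red Team prompt above.
CRITERIA = [
    "complexity",
    "regression risk",
    "cognitive load",
    "edge case handling",
    "future extensibility",
]


def verdict(scores):
    """scores: approach -> {criterion: 1..5, higher is better}. Returns (winner, totals)."""
    for approach, by_criterion in scores.items():
        missing = [c for c in CRITERIA if c not in by_criterion]
        if missing:
            # Refuse to decide on a partially scored approach.
            raise ValueError(f"{approach} is missing scores for {missing}")
    totals = {a: sum(s[c] for c in CRITERIA) for a, s in scores.items()}
    return max(totals, key=totals.get), totals


# Invented example scores, not results from a real debate.
scores = {
    "shared pipeline function": dict(zip(CRITERIA, [4, 4, 5, 3, 4])),
    "per-path decorators": dict(zip(CRITERIA, [3, 2, 3, 5, 4])),
}

winner, totals = verdict(scores)
```

An unweighted sum is the simplest choice; a per-criterion breakdown still matters, because the loser's strongest column is what the winner should borrow from.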

The output is a full implementation spec written to .agent/specs/ — problem statement, before/after architecture, every call site that needs migration, and a testing checklist.

The Difference From CodeReview

I have three review skills, and people ask how they’re different:

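| Skill             | Scope                | Depth                                            | Output                                  |
|-------------------|----------------------|--------------------------------------------------|-----------------------------------------|
| /PortalCodeReview | Broad codebase sweep | Surface: pattern matching across 12 categories   | Prioritized findings list               |
| /PortalBugBot     | Recent changes       | Deep: adversarial loop with attack angles        | Fixed bugs + regression tests           |
| /PortalArchReview | Single feature       | Deepest: full code path trace                    | Architecture audit + implementation spec |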

CodeReview is a net cast wide. BugBot is a drill aimed at recent changes. ArchReview is an X-ray of one system’s skeleton. They complement each other because they find different classes of problems: CodeReview finds anti-patterns, BugBot finds bugs, ArchReview finds architectural debt.

Composing Into the Pipeline

In practice, these skills layer:

  1. Build a feature on a branch
  2. /PortalArchReview if the feature touches complex pipelines — audit before implementation to understand the architecture you’re modifying
  3. /PortalBugBot after implementation — adversarial review of your changes
  4. /PortalCodeReview periodically — broad sweep for accumulating anti-patterns
  5. /PortalDevFlow throughout — enforces the pipeline at every step

Each skill encodes knowledge I’ve accumulated through bugs that shipped. The visit-reason note bug became a rule in ArchReview. The phone normalization bug became an attack angle in BugBot. The timezone bugs became a category in CodeReview. The skills get smarter because the mistakes are encoded as structure, not just memory.

What I Learned

Architecture audits before implementation save more time than reviews after. When I run ArchReview on a feature before modifying it, I find the five call sites that all need updating instead of finding them one at a time through production bugs.

Structured debate beats intuition for architectural decisions. The Red Team workflow has reversed my initial instinct on approach selection multiple times. Steel-manning the option I was leaning against often reveals it’s actually better.

The coverage matrix is the most valuable artifact. A single table showing which processing steps each code path applies makes invisible inconsistencies visible instantly. Most architectural bugs are gaps in that matrix.

Next: Part 4 covers how all nine skills compose into a complete development lifecycle — from cleaning test data to deploying to production.