One morning I found five of my AI developer workflows — I call them ADWs, little agent processes that take a coding task, open a pull request, and babysit it through review — still grinding away on work that was already finished. Their PRs had merged hours ago. The code was in main. And there they were, every ten minutes, politely polling GitHub for review comments that would never come, pushing fixes nobody asked for onto branches nobody was watching.
Zombies. Doing their job with great enthusiasm, long after the job existed.
If you’re not an engineer: picture a delivery driver who keeps circling your block re-ringing the doorbell because nobody told him the package was already signed for. The order was closed. He just never checked.
Here’s what I got wrong. Each ADW has an outer loop — “is this PR still open? then keep going” — and inside it, a review_fix step with its own inner loop that waits for code review and applies changes. The outer loop checked whether the PR had merged. The inner ten-minute poll did not. So the moment a PR merged while the inner loop was mid-wait, the outer check never got a turn again. The inner loop just kept spinning, oblivious, on its own little timeline.
I assumed the boundary check at the top level was enough. It wasn’t, because the inner loop can outlive the condition that the outer loop was watching for.
My first instinct was to reach for a watchdog — a separate process that notices a stuck agent and kills it. I actually started writing one. Then I realized I was building a system whose primary way of stopping was “something else kills it from outside.” That’s backwards. An agent should know when it’s done. A watchdog is for when it crashes and can’t know — belt-and-suspenders, never the belt.
The real fix was three rules, and I now apply them to every nested wait-loop I write:
Re-check your terminal conditions at the top of every iteration of every loop — not just the outermost one. The inner poll now asks “is this PR still open?” before each wait, same as the outer one.
Validate required inputs at step entry and fail loud. review_fix now throws a specific ValueError if pr_number, token, or repo is missing, instead of politely looping on garbage.
Put a hard wall-clock ceiling under every iteration budget. “Poll up to 30 times” isn’t a limit if each poll can hang — “and never run past 90 minutes” is.
The re-check pattern for nested poll loops give me the detail
The bug lived in a Python review_fix step. Simplified:
def review_fix(pr_number, token, repo, max_iters=30, deadline_min=90):
if not all([pr_number, token, repo]):
raise ValueError(f"review_fix needs pr_number/token/repo, got {pr_number!r}")
hard_stop = time.monotonic() + deadline_min * 60
for i in range(max_iters):
# terminal check at the TOP of the INNER loop — the fix
if pr_is_merged_or_closed(repo, pr_number, token):
return "done: PR no longer open"
if time.monotonic() > hard_stop:
return "done: wall-clock cap hit"
feedback = poll_review(repo, pr_number, token) # can hang
if feedback:
push_fixes(feedback)
time.sleep(600)
return "done: iteration budget exhausted"The three guards — input ValueError, per-iteration terminal check, monotonic() deadline — each stop a different failure. A systemd watchdog still runs, but only to reap genuinely crashed PIDs. It is never how a healthy agent decides it’s finished.
The generalizable version: if a loop waits on external state — a merged PR, a finished job, a freed lock — the condition that ends it must be checked inside the loop that’s actually waiting, at the same depth as the wait. A stop condition one level up is a stop condition that can be skipped. Go read your own agent loops and ask, for each nested while: what tells this loop to quit, and does it ask on every pass? If the honest answer is “the process gets killed eventually,” you’ve got zombies waiting to happen.