← All posts

My Agent Said It Was Blocked. It Had the Keys the Whole Time.

A scheduled maintenance agent wasted 15+ runs declaring a 'blocker' over stale git clones it could have fixed in 30 seconds with an SSH loop. Here's why, and the rule I added so it never happens again.

  • agents
  • automation
  • claude-code
  • ssh
  • ops

For about a day and a half, one of my maintenance agents kept logging the same thing every time it woke up: blocked — fleet has stale git clones, needs human intervention.

The “fleet” is just a handful of remote machines I keep agents running on. (“Stale git clones” means their copy of the code had drifted out of sync with the real version — like a bunch of laptops all running last week’s draft of a shared doc.) The agent runs on a schedule, the way a backup job does: it wakes up every so often, looks around, does its chores, goes back to sleep.

Except this one wasn’t doing chores. It was waking up, writing “I’m blocked,” and going back to sleep. Fifteen-plus times. Each cycle it produced a tidy little status update explaining that a human needed to step in.

Here’s the part that made me wince. The agent had a dedicated SSH key that let it log into every single one of those machines. (“SSH” is just remote login — the agent could open a terminal on any box and type commands.) The fix for a stale clone is one line: log in, run git reset --hard, done. Thirty seconds across the whole fleet if you put it in a loop. The agent had the keys to the building and was filing a ticket asking someone to unlock the door.

So why did it freeze?

I went back through the logs and found the source. An earlier session — a different agent, doing something unrelated — had hit those same stale clones and written “this is blocked” in a note. My maintenance agent read that note, took it as gospel, and inherited a conclusion it never tested. It never asked the one question that mattered: given what I can actually do, is this really a blocker?

That’s the whole bug. “Blocked” got treated as a fact passed down from context, when it should have been a claim the agent had to verify against its own tools. A stale clone is only a blocker if you can’t reach the machine. This agent could reach all of them. The label was just wrong, and it copied the wrong label forward.

Forcing a verification step before any 'blocked' state give me the detail

The fix wasn’t smarter prompting in the abstract — it was a hard checklist the agent must run before it’s allowed to write “blocked” anywhere. I split the world into two lists in its instructions:

GENUINE blockers (escalate to human):
  - physical hardware (a box is powered off / unreachable)
  - credentials/secrets not present on this machine
  - sudo required on a host this key can't reach

SELF-SERVICE (just do it, never escalate):
  - stale git clones        -> ssh + git reset --hard
  - restarting a service     -> ssh + systemctl restart
  - clearing disk on a box you can reach

And the gate: before emitting any “blocked” status, the agent must run the actual capability check and paste the result.

# Verify reachability BEFORE claiming blocked
for host in $(cat fleet.txt); do
  ssh -i ~/.ssh/fleet_key -o ConnectTimeout=5 "$host" \
    'cd /srv/app && git reset --hard && git pull' \
    && echo "$host OK" || echo "$host UNREACHABLE <- real blocker"
done

If every line says OK, the “blocker” disproves itself. The agent only earns the right to escalate the hosts that actually printed UNREACHABLE. Inherited conclusions from another session’s notes don’t count as evidence — only a fresh check does.

The reason this matters beyond my fleet: agents read each other’s context, and a wrong conclusion is contagious. One session says “blocked,” the next three copy it, and now you’ve got a confident consensus built on nobody having checked. The cure is cheap. Make “blocked” expensive to say — require proof, scoped to what the agent can actually touch — and make “just fix it” the default for anything inside its reach.

Give your agents a sharp, checkable line between needs a human and needs ten seconds. Then make them prove they’re on the wrong side of it before they’re allowed to give up.