The Rescuer Couldn't See the Lifeline

Three of my four Claude Code accounts hit their rate limits within the same hour, and the daemon whose entire job is to rescue them sat there reporting that nothing could be done.

Some background for anyone who isn’t knee-deep in this. I run a bunch of automated Claude Code sessions — think of them as workers, each logged in under a different account. Every account has a usage ceiling, a cap on how much it can do in a given window, like a monthly data plan on your phone that throttles you once you’ve used it up. When a worker hits that ceiling I call it “walled.” So I built two things: a load balancer named cl that hands new work to whichever account has room, and a separate rescue daemon that watches for walled sessions and moves them onto a healthy account.

The balancer was smart about finding accounts. It just looked on disk — every account is a credential file at /opt/claude/credentials-<name>.json — and used whatever it found. Drop a new file, you’ve got a new account. No code change.

The rescue daemon did not do that. Somewhere in its guts it carried its own hardcoded list of account names, written down once and never updated.

For a long time that didn’t matter, because the two lists happened to agree. Then I registered a fourth account to give myself more headroom. The balancer saw it instantly — new file, done. The daemon never did. As far as the rescuer was concerned, only three accounts existed in the whole world.

You can guess the afternoon this bit me. Those same three accounts all walled at once. The one account with plenty of room left was the fourth — the lifeline. And the rescuer, scanning its stale little array, concluded every account it knew about was dead and gave up.
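To make the failure concrete, here is a minimal sketch of the selection logic. It is not the daemon's actual code: the account names, the `walled` set, and `pick_rescue_target` are all hypothetical stand-ins.

```python
# Hypothetical account names; the real ones are not shown in this post.
HARDCODED_ACCOUNTS = ["alpha", "bravo", "charlie"]  # the stale list: written once, never updated


def pick_rescue_target(known_accounts, walled):
    """Return the first known account that is not walled, or None if there is none."""
    for name in known_accounts:
        if name not in walled:
            return name
    return None


# The afternoon in question: all three original accounts are walled.
walled = {"alpha", "bravo", "charlie"}

# What the daemon saw: every account it knew about was dead.
stale_view = pick_rescue_target(HARDCODED_ACCOUNTS, walled)

# What was actually on disk: a fourth, healthy account.
on_disk = ["alpha", "bravo", "charlie", "delta"]
real_view = pick_rescue_target(on_disk, walled)

print(stale_view)  # None
print(real_view)   # delta
```

The selection function is the same in both calls. The only thing that changes the answer from "give up" to "use delta" is which roster it is handed.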

What gets me is the shape of the bug. It was invisible right up until the exact moment it mattered. Any normal day, three healthy accounts out of three is fine. The flaw only surfaces during a failover, which is the one situation the daemon exists for. A rescuer that’s blind precisely when you need rescuing is worse than no rescuer, because you thought you were covered.

The fix (PR #475) was small and a little embarrassing: delete the hardcoded array, make the daemon discover accounts the same way the balancer does — read the directory, trust the files.

Why one source of truth beats two agreeing copies

The trap is that two hardcoded lists can be correct simultaneously, which feels like agreement but is really just coincidence. Correctness has to be derived, not duplicated. Both processes now resolve the roster at runtime from the same place:

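from pathlib import Path

def discover_accounts():
    # Each account is one credential file: /opt/claude/credentials-<name>.json
    creds = Path("/opt/claude")
    # Strip the "credentials-" prefix from each file stem to get the account name.
    # str.removeprefix needs Python 3.9+.
    return sorted(
        p.stem.removeprefix("credentials-")
        for p in creds.glob("credentials-*.json")
    )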

The filesystem is the registry. Adding an account is cp of a credential file; removing one is rm. There’s no second list to forget. A cheap test that would have caught this: assert the balancer and daemon return identical sets, and run it in CI.

assert set(balancer.accounts()) == set(daemon.accounts())
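If you want that check to run somewhere other than the production box, one way is to point both components at a throwaway directory. This is a sketch under assumptions: `balancer_accounts` and `daemon_accounts` are hypothetical stand-ins for whatever each component really exposes, and `discover_accounts` takes the directory as a parameter here so the test can redirect it.

```python
import tempfile
from pathlib import Path


def discover_accounts(creds_dir):
    """Same discovery rule as above, with the directory passed in so a test can use a temp dir."""
    return sorted(
        p.stem.removeprefix("credentials-")
        for p in Path(creds_dir).glob("credentials-*.json")
    )


# Hypothetical stand-ins for the two components' roster functions.
def balancer_accounts(creds_dir):
    return discover_accounts(creds_dir)


def daemon_accounts(creds_dir):
    return discover_accounts(creds_dir)


def test_rosters_agree_after_adding_account():
    with tempfile.TemporaryDirectory() as tmp:
        # Start with three accounts (names are made up for the test).
        for name in ("alpha", "bravo", "charlie"):
            Path(tmp, f"credentials-{name}.json").write_text("{}")
        assert set(balancer_accounts(tmp)) == set(daemon_accounts(tmp))

        # Register a fourth account the way the post describes: drop a file.
        Path(tmp, "credentials-delta.json").write_text("{}")
        assert set(balancer_accounts(tmp)) == set(daemon_accounts(tmp))
        assert "delta" in daemon_accounts(tmp)


test_rosters_agree_after_adding_account()
```

The second half of the test is the part that matters. With only the original three files in place, a daemon carrying a hardcoded list of those same three names would pass the equality check too; it is adding a file mid-test that exposes a component that isn't really looking.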

Here’s the rule I’d hand anyone running agents or API keys across a balancer plus a recovery process: if two components share a roster of resources, neither is allowed to remember it. They both have to look it up, from the same source, every time. A duplicated list is a time bomb with a delay fuse set to “next failover.”
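The "every time" part is easy to get wrong even after the hardcoded list is gone. A rough sketch of what it means in practice, assuming a hypothetical `Rescuer` class (only the roster handling is shown, and the account names are invented):

```python
import tempfile
from pathlib import Path


def discover_accounts(creds_dir):
    """List account names from credentials-<name>.json files in creds_dir."""
    return sorted(
        p.stem.removeprefix("credentials-")
        for p in Path(creds_dir).glob("credentials-*.json")
    )


class Rescuer:
    """Hypothetical rescue loop skeleton; only the roster handling is shown."""

    def __init__(self, creds_dir):
        self.creds_dir = creds_dir
        # Deliberately no `self.accounts = discover_accounts(...)` here:
        # a snapshot taken at startup goes stale just like a hardcoded list.

    def healthy_targets(self, walled):
        # Re-read the directory on every pass so new files are seen right away.
        return [a for a in discover_accounts(self.creds_dir) if a not in walled]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("alpha", "bravo", "charlie"):
        Path(tmp, f"credentials-{name}.json").write_text("{}")

    rescuer = Rescuer(tmp)
    walled = {"alpha", "bravo", "charlie"}
    before = rescuer.healthy_targets(walled)

    # A fourth account is registered while the daemon is already running.
    Path(tmp, "credentials-delta.json").write_text("{}")
    after = rescuer.healthy_targets(walled)

print(before)  # []
print(after)   # ['delta']
```

Caching the roster in the constructor would make `after` come back empty until the next restart, which is the same blindness on a shorter timer.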

Go find the place where you wrote the same list down twice. One of those copies is already drifting.