I run a fleet of Claude Code agents — autonomous coding sessions working in parallel across several Anthropic accounts. The constraint that shapes everything is rate limits: each account has a rolling 5-hour budget and a weekly one. When a session burns through its budget mid-task, it’s walled — it just stops, often in the middle of something useful.
So I built a rescue daemon. It watches every session’s remaining quota, and when one gets walled (or is about to), it restarts that session on a different account that still has headroom — carrying the full conversation context over with --resume, so the agent picks up exactly where it left off. For weeks it worked beautifully. An agent would hit a wall, and a minute later it was back, running on a fresh account, none the wiser.
Then one afternoon a bunch of sessions hit their limits at once, and the daemon didn’t save any of them. It was dead too.
Here’s why, and it’s almost funny: the rescue daemon authenticated against the same pool of accounts it was rescuing. When the pool got tight enough that sessions started getting walled — which is exactly the moment the daemon exists for — the daemon’s own account got walled mid-rescue. Now the stuck agents were still stuck, the rescuer was down, and the rescue attempt had burned the last scraps of headroom something else could have used. Strictly worse than having no daemon at all.
The general lesson outlived the specific bug. It applies to any self-healing component — a watchdog, a failover controller, a circuit breaker’s recovery path, a backup job. Ask one question: does the healer depend on the exact resource it’s trying to heal? If yes, you don’t have a self-healing system. You have a single point of failure wearing a rescue costume. The healer has to keep working precisely in the condition where everything it watches has failed — that’s the only condition that matters.
how the rescue daemon actually works give me the detail
The daemon polls each session’s status line, which reports the remaining 5-hour and weekly budget per account. A session counts as walled when its budget hits zero or the model starts refusing with a rate-limit notice. Recovery is a tmux respawn-pane that relaunches the same agent pointed at a different account’s CLAUDE_CONFIG_DIR, with claude --resume <session-id> so the conversation history carries over intact — same agent, same context, fresh budget.
The fix for the shared-failure-domain bug is isolation: the daemon gets a reserved account the worker pool never draws from, plus an explicit fallback to a cheaper, separate substrate for the moment even that is exhausted. The startup assertion is one line — the healer’s account must not be a member of the pool it monitors:
assert rescuer.account not in worker_pool, "rescuer shares the failure domain it heals"Run the daemon itself under systemd with Restart=always so it survives its own host hiccups — not as a thread inside the very process it’s meant to revive. And the test that would have caught this before production: drain the pool on purpose and confirm the daemon still runs. If it can’t act with the pool exhausted, it was never going to help you on the day you needed it.
So whenever you build something that recovers a system from failure, trace its dependencies against the failure it’s meant to handle. If they overlap, the day they overlap is the day you needed it most — and the day it won’t be there.