← All posts

My Resume Hook Came Back Alive and Froze on the First Question

I killed my whole tmux server to test crash recovery for a fleet of Claude Code agents. The processes came back. Every one of them got stuck on a menu nobody was there to answer.

  • claude-code
  • tmux
  • agents
  • resilience
  • automation

I run a small fleet of Claude Code sessions — the AI coding agents I leave working in their own terminal panes — and I’d never actually tested what happens if the whole thing crashes. So one afternoon I just did the scary thing: tmux kill-server. That nukes the entire terminal multiplexer, the program that holds all those panes open, in one shot. About sixty windows, gone.

The recovery machinery is two pieces. tmux-resurrect saves and reloads the layout of all my panes. Then a hook I wrote — a little script that fires after the layout comes back — walks each pane and runs claude --resume <session-id> to wake each agent up exactly where it left off. Think of it as a power-cut drill for a wall of monitors: when the lights come back, every screen should reopen the right document.

First lesson, and I almost botched it: the hook didn’t fire right away. I checked a few seconds after restore, saw nothing happening, and assumed the hook was broken. It wasn’t. It fired minutes later. tmux-resurrect was still rebuilding sixty panes, and my script politely waited its turn. If I’d shipped a “the hook failed” fix based on that early peek, I’d have been fixing the wrong thing. Poll for slow side effects before you declare anything dead.

Then the real problem. Geometry restored perfectly — every pane back in its place. The hook ran. The agents relaunched. And almost all of them sat there frozen.

Here’s why. When a Claude Code session has a summary checkpoint — a compressed memory of the conversation so far — --resume doesn’t just resume. It asks you a question:

❯ 1. Resume from summary (recommended)
  2. Resume full session as-is

That arrow is a cursor waiting for a human. My hook launched the process and walked away. Nobody pressed anything. Every checkpointed agent was parked at that menu, alive but useless — a relaunched process that never actually restored its session. The ones I cared about most, the ones deep enough to have summaries, were exactly the ones stuck.

That’s the part worth carrying out of my terminal and into yours. Restarting an agent is not the same as recovering it. The gap hides in the interactive prompt — the recovery question that assumes a person is watching. Cron doesn’t answer questions. systemd doesn’t. A rescue daemon doesn’t. So your process count goes green while your context quietly bleeds out, and you’ve got a wall of dead seats that look alive.

Driving the resume dialog deterministically give me the detail

The fix is to make the hook answer the menu instead of hoping no menu appears. After launching, send the keystrokes for the choice you want, then verify the pane left the dialog:

# resume, then pick "Resume full session as-is" (Down, Enter)
tmux send-keys -t "$pane" "claude --resume $sid" Enter
sleep 8   # let the dialog render; it is not instant
tmux send-keys -t "$pane" Down Enter

Two traps. The sleep matters — fire the keys before the dialog renders and they vanish into the void. And not every session shows the menu (no checkpoint, no prompt), so blindly sending Down Enter can poke a live agent. Better: a watchdog that captures the pane and classifies state before acting:

if tmux capture-pane -p -t "$pane" | grep -q "Resume from summary"; then
  tmux send-keys -t "$pane" Down Enter
fi

That capture-pane grep is also your real health check — “stuck at dialog” is a distinct state from “process running,” and your monitoring should treat it that way.

So if you auto-restart AI agents, test the full resume path on a real crash, not a clean one. Kill the server. Watch what question comes back. Then teach your machinery to answer it — because the failure that gets you isn’t the agent that died, it’s the one that came back and waited politely for a person who wasn’t there.