Recovery Paths Must Terminate on Durable Ground Truth
The most dangerous failure isn't the crash — it's the recovery that quietly resumes the wrong state. When a system restarts, anything ephemeral becomes a lie…
The machine rebooted, and tmux brought my Claude fleet back — every window restored, every pane in place — and then resumed the wrong conversation in each one. tmux is the terminal multiplexer that runs my panes (the split cells of a terminal window); my Claude fleet is the cluster of Claude Code agent sessions I keep alive across those panes. tmux-resurrect — the plugin that snapshots the layout and replays it after a crash — had faithfully rebuilt the geometry, then guessed which session belonged where. The guess was wrong, and it was silent about being wrong.
That’s the takeaway, and the failure I keep walking into: the most dangerous failure isn’t the crash — it’s the recovery that quietly resumes the wrong state. A crash is loud. A recovery that restores the wrong thing and keeps running is silent.
Silence is where the damage compounds.
Restarts lie, and here’s why. When anything restarts — a terminal, an agent session, a whole machine — anything ephemeral becomes a lie the moment the process dies. Volatile handles (temporary IDs a program mints at runtime, like “this is pane 3,” which mean nothing outside the process that issued them) get reassigned to somebody else. In-memory maps — lookup tables that lived only in RAM — go stale. Fuzzy heuristics — rules of thumb, like “newest file wins” — resume the wrong thread or orphan the right one, and they do it with a straight face.
A recovery path is only as good as the signal it trusts. Make it terminate on durable ground truth — a written registry, an exact identifier, a stable name — never on a fragile side-channel or a guess.
I got that backwards first. My first rebuild matched panes to sessions by mtime — the modification timestamp — of each session’s .jsonl transcript (the log Claude writes for one conversation). Most-recently-touched equals most-recently-active, right? It passed every test I threw at it. Then I rebooted after a stretch of reading old transcripts, and the “most recent” file was a conversation I’d merely opened to skim, not the one I’d been driving. Panes grabbed the wrong threads, and the right ones sat orphaned. A heuristic that feels clever passes every test until the one condition you didn’t think to test — then fails silently.
So stop trusting the volatile thing. Address things by their stable role — “the deploy pane,” not “pane 47” — and resolve to the live handle at the moment you act. Caching a volatile id across a restart boundary is a latent misroute waiting to fire; treat any persisted ephemeral handle as poison the instant a restart is possible, and re-resolve every time. My fix was a durable pane→session-id registry — Claude’s exact identifier for one conversation, captured live at SessionStart, the moment a session begins while the mapping is still true — that the resurrection code reads instead of guessing from mtimes.
Then the second failure, the one that got me after I thought I was done: backups fail silently too. tmux-resurrect’s snapshot is itself a save mechanism, and it rides on a side-effect — it gets written during clean terminal teardown, so a hard crash or a killed process means the save dies unnoticed. You come back to a “last good state” that’s hours or days stale, and because the file is there and looks normal, you trust it. That’s what makes a stale backup worse than no backup: no backup is honest. You know you have nothing.
“Green” must mean “a fresh artifact is actually on disk,” not “the mechanism appears to exist.” The save step being in your code is not the same as the save step having run. Add a deterministic backstop — a save that runs on its own clock, independent of the happy path (the normal nothing-went-wrong flow) — so a killed teardown still leaves something recent behind. And alarm loudly when freshness lapses. A quiet dead backup is worse than no backup, because you trust it — so make it impossible to stay quiet.