My tmux-resurrect Snapshot Lied, So I Rebuilt the Claude Fleet From jsonl mtimes

After my tmux server got killed, the backup tool I trusted was days stale. The session log files themselves held the exact fleet roster.

  • claude-code
  • tmux
  • agents
  • recovery
  • ops

The tmux server died at 8:16 pm and took 25 Claude Code sessions with it.

I run a fleet of long-lived agent sessions on the dev box. Each one gets its own tmux window, each window its own project directory — separate concerns, separate panes. They run for days. Some babysit deployments, some orbit research threads, some hold half-written code I’ll get back to. When the server process vanishes, everything it was holding vanishes too: working directories, scrollback, the session registry that knows which window was which.

My first instinct was the safety net I’d installed: tmux-resurrect plus tmux-continuum, which save snapshots periodically and auto-restore on server restart. I ran the restore. It produced a layout — windows appeared, titles populated — but the windows were wrong. The working directories were stale. The session list didn’t match what had actually been alive. Some panes pointed at directories I hadn’t touched in weeks. Others were missing entirely.

I dug into why. The continuum save had quietly stopped running days earlier. No error, no alert, no crash — it just went silent. I’d been carrying a safety net with a hole in it and didn’t know until I fell.

So I ignored the snapshot and looked for something that couldn’t lie about itself.

The artifacts that timestamped their own death

Claude Code writes a per-session transcript — a .jsonl file under ~/.claude/projects/ — appending one line per turn. It’s not a periodic dump. It’s continuous. When the tmux server process was killed, every one of those 25 sessions flushed one final write at the kill instant before the OS tore them down. That left a tight cluster of files all sharing the same modification time: a window of about six seconds.

That cluster was the roster. Not a guess. Not a mapping from a backup tool that might be three days stale. The files that were alive at the moment of death timestamped themselves.
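One way to make the cluster visible is to count transcripts per whole-second mtime bucket. The following is a minimal sketch, assuming GNU find (for -printf); the function name mtime_histogram is made up for illustration. A crash band should show up as a run of adjacent seconds with unusually high counts.

```shell
# Count .jsonl transcripts per whole-second mtime, busiest second first.
# Assumes GNU find; $1 is the directory to scan.
mtime_histogram() {
  find "$1" -name '*.jsonl' -printf '%TY-%Tm-%Td %TH:%TM:%TS\n' \
    | cut -c1-19 \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2
}
```

Running mtime_histogram ~/.claude/projects | head lists the busiest seconds; %TS carries fractional seconds, which cut trims so that files written within the same second land in the same bucket.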

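Better still, each filename is the session UUID. Claude Code uses that UUID to resume: feed it --resume <uuid> and it picks up the conversation exactly where it left off. And the project directory is right there in the path. If the path is ~/.claude/projects/-opt-ra-some-project/<uuid>.jsonl, the cwd is /opt/ra/some-project. Dashes stand in for slashes in the directory encoding, so the mapping is lossy: a blind dash-to-slash substitution would turn this example into /opt/ra/some/project, and any directory name containing a real dash has to be checked against the filesystem.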
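Because the encoding uses a dash for every slash, a project directory whose name contains a real dash cannot be decoded by substitution alone. The sketch below resolves the ambiguity by probing the filesystem: at each dash it tries "this is a slash" only if that prefix exists as a directory, and otherwise keeps the dash as part of the name. Both function names are hypothetical, the base-directory argument exists only so the resolver can be pointed at a test tree, and the sketch handles only the dash-for-slash rule described here.

```shell
# The session UUID is the transcript's basename without the extension.
session_uuid() {
  basename "$1" .jsonl
}

# Resolve an encoded name (leading dash already stripped) to a real directory.
# $1: directory resolved so far ("" means the filesystem root)
# $2: remaining encoded text, e.g. opt-ra-some-project
resolve_project_dir() {
  local base="$1" tail="$2" comp=""
  while :; do
    case $tail in
      *-*)
        # Grow the current component up to the next dash, then try
        # treating that dash as a path separator.
        comp="${comp:+$comp-}${tail%%-*}"
        tail="${tail#*-}"
        if [ -d "$base/$comp" ] && resolve_project_dir "$base/$comp" "$tail"; then
          return 0
        fi
        ;;
      *)
        # No dashes left: the remainder must be the final component.
        comp="${comp:+$comp-}$tail"
        if [ -d "$base/$comp" ]; then
          printf '%s\n' "$base/$comp"
          return 0
        fi
        return 1
        ;;
    esac
  done
}
```

For a transcript path in $f, the encoded name is basename "$(dirname "$f")"; strip its leading dash and call resolve_project_dir "" with the rest. A non-zero exit means no existing directory matches, which is a cue to fix that one path by hand rather than guess.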

Recovery became mechanical: take every jsonl in the mtime cluster, read the UUID from the filename, read the project directory from the path, open a new tmux window with that cwd, and resume.

Why this works: provenance

The difference between the two approaches is where the truth lives.

tmux-resurrect’s snapshot is a periodic description of state written by a separate process. Its window-to-session mapping is a best-effort guess made minutes or hours ago by something that can fail silently — and did. The snapshot says “here’s what I think was running when I last checked.” If the checker stops checking, the snapshot rots.
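A saver that fails silently can be caught by checking the age of its newest snapshot. This is a minimal sketch, assuming GNU stat; the function names are hypothetical, and the snapshot path you pass in depends on your tmux-resurrect version and configuration (the plugin keeps a "last" symlink to the latest save in its save directory, which is why the sketch follows symlinks).

```shell
# Print the age, in whole seconds, of a file (following symlinks).
file_age_seconds() {
  local path="$1" now mtime
  now=$(date +%s)
  mtime=$(stat -L -c %Y "$path") || return 1
  echo $(( now - mtime ))
}

# Succeed only if the snapshot exists and is at most $2 seconds old.
snapshot_is_fresh() {
  local age
  age=$(file_age_seconds "$1") || return 1
  [ "$age" -le "$2" ]
}
```

Called from cron or a shell prompt hook as snapshot_is_fresh <path-to-last-snapshot> 3600 || echo "resurrect snapshot is stale", it turns a silent stall into a visible one.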

The jsonl files are self-describing. The file is the session. The name is the ID. The mtime is the death certificate. There’s no intermediate mapping to be wrong about, no separate process whose health you have to trust. The session log knows what it is. The backup tool only knows what it last remembered.

Before trusting the whole cluster, I verified one file: I ran tail -1 on a transcript and confirmed it was the conversation I expected. That made it a live artifact, not just a timestamp. Then I resumed all 25.

The principle generalizes: when you recover, prefer a source with exact self-provenance over a snapshot that depends on a fragile mapping you hope is current. Find the artifact that can’t lie about itself, drive recovery off that, and verify with one live read before you go mass-resuming.

Rebuild the fleet from mtime clusters

Find the cluster, then resume each session in its own window. Sort jsonl files by modification time and look for the tight band at the crash instant:

# List recent session transcripts, newest first, with mtimes
find ~/.claude/projects -name '*.jsonl' -printf '%TY-%Tm-%Td %TH:%TM:%TS %p\n' \
  | sort -r | head -40

Pick the timestamp band (e.g. 20:16:46 to 20:16:52), then drive recovery off it. The project dir is encoded in the path; the UUID is the basename:

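# -newermt needs full timestamps and means strictly newer, so bracket the
# band explicitly and pad it by a second on each side.
START='2026-06-22 20:16:45'
END='2026-06-22 20:16:53'
find ~/.claude/projects -name '*.jsonl' -newermt "$START" \! -newermt "$END" \
| while IFS= read -r f; do
    uuid=$(basename "$f" .jsonl)
    # Naive decode: every dash becomes a slash. This mangles directory names
    # that contain a real dash, so skip anything that does not resolve.
    proj=$(basename "$(dirname "$f")" | sed 's|-|/|g')
    if [ ! -d "$proj" ]; then
      echo "skip $uuid: $proj is not a directory, fix the path by hand" >&2
      continue
    fi
    tmux new-window -c "$proj" \
      "claude --dangerously-skip-permissions --resume $uuid"
  done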

Always verify before mass-resuming: tail -1 <file> | jq . on one transcript to confirm it’s the conversation you think it is. --resume <uuid> reattaches Claude Code to that exact session history.
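Alongside the single-transcript check, it can help to print the whole roster before any window opens, so the count and the project names can be eyeballed against what was actually running. This is a sketch, assuming GNU date (for -r); print_roster is a hypothetical helper that reads transcript paths on stdin and opens nothing.

```shell
# Read transcript paths on stdin; print one tab-separated line per session:
# mtime (HH:MM:SS), session UUID, encoded project directory name.
print_roster() {
  local f
  while IFS= read -r f; do
    printf '%s\t%s\t%s\n' \
      "$(date -r "$f" '+%H:%M:%S')" \
      "$(basename "$f" .jsonl)" \
      "$(basename "$(dirname "$f")")"
  done
}
```

Feeding it the same selection used for recovery, for example find ~/.claude/projects -name '*.jsonl' -newermt '2026-06-22 20:16:45' ! -newermt '2026-06-22 20:16:53' | print_roster | sort, should yield one line per session that was alive at the crash: 25 in this case.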