My phone buzzed with a voice alert: “recovery at risk.” Then again two minutes later. Then again. The box it was yelling about was healthy — its terminal sessions were running fine, nothing had crashed. The watchdog was wrong, and it was wrong loudly, on a channel I actually listen to.
Here’s the setup in plain terms. I run a small fleet of headless Linux machines — no monitor, no keyboard, just servers doing work. Each one keeps a saved snapshot of its terminal layout (the windows and panes I’d want back if it rebooted) using a tool called tmux-resurrect. Think of it like your browser remembering which tabs were open. I’d written a little backstop — a timer that fires every 120 seconds to check those snapshots are fresh, so I’d know if recovery was quietly broken.
The backstop checked the wrong thing. It looked at the snapshot file’s age — the timestamp on the file — and screamed if that timestamp was older than a few minutes. Sounds reasonable. Old save means a broken save, right?
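For concreteness, here is a minimal sketch of what an age-based check of this kind looks like. It is an illustration, not the original script: the function name is_stale_by_age, the 300-second threshold, and the placeholder file content are all assumptions, and it relies on GNU stat and touch as found on a typical Linux box.

```shell
#!/usr/bin/env bash
# Sketch of an age-based freshness check (the approach that misfired).
# is_stale_by_age and the threshold value are illustrative assumptions.
STALE_AFTER=300   # seconds; stands in for "a few minutes"

is_stale_by_age() {
    local snap=$1 now mtime
    now=$(date +%s)
    # GNU stat: -c %Y prints mtime as epoch seconds; -L follows a symlink
    mtime=$(stat -L -c %Y "$snap") || return 2
    [ $(( now - mtime )) -gt "$STALE_AFTER" ]
}

# Demo: a snapshot with usable content whose mtime is two hours old.
demo=$(mktemp)
echo "placeholder layout data" > "$demo"
touch -d '2 hours ago' "$demo"
if is_stale_by_age "$demo"; then
    echo "ALARM: snapshot looks stale"
fi
```

The demo file is non-empty and perfectly readable, yet the check alarms, because the only thing it ever inspects is the clock.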
No. And the reason is buried in how tmux-resurrect writes its files.
When it saves, it writes a new timestamped snapshot, then compares it to the previous one. If they’re identical, which they always are on a stable box that hasn’t changed its layout in hours, it deletes the new file to avoid clutter. So the “last” pointer, the name that always refers to the most recent snapshot, never advances. The content is perfectly current; the clock just doesn’t move. On an idle headless box, an old timestamp is the expected, healthy state.
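The write-compare-delete pattern is easy to model in a few lines. This is a simplified sketch under my own assumptions, not tmux-resurrect's code: save_snapshot, the snap_ file naming, and the layout strings are hypothetical, and date +%s%N assumes GNU date.

```shell
#!/usr/bin/env bash
# Simplified model of a save that discards identical snapshots.
# Not tmux-resurrect's code; names and file layout are hypothetical.
dir=$(mktemp -d)

save_snapshot() {
    local content=$1 new
    new="$dir/snap_$(date +%s%N).txt"
    printf '%s\n' "$content" > "$new"
    if [ -e "$dir/last" ] && cmp -s "$new" "$dir/last"; then
        rm "$new"                                 # identical: drop the new file
    else
        ln -sf "$(basename "$new")" "$dir/last"   # changed: advance the pointer
    fi
}

save_snapshot "layout A"
first=$(readlink "$dir/last")
save_snapshot "layout A"      # nothing changed since the last save
second=$(readlink "$dir/last")
save_snapshot "layout B"      # layout changed
third=$(readlink "$dir/last")

if [ "$first" = "$second" ]; then echo "unchanged save: pointer did not move"; fi
if [ "$second" != "$third" ]; then echo "changed save: pointer advanced"; fi
```

Every call succeeded, but only the first and third left a new file behind. A check that watches the pointer's timestamp sees nothing happen on the second save, even though it ran and confirmed the data was current.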
My alarm read that healthy stillness as death. I’d literally compared the saved pane sizes against the live ones — 157 rows matched 157, 85 matched 85, exact — and the box was still getting flagged. The data was current. The mtime was old. Both true at once.
The fix was to stop asking “how old is this file” and start asking “did the save just succeed, and is the artifact usable.” Alarm only when the save returns a non-zero exit code, or when the snapshot is missing or empty. Age never enters into it.
Why run-shell lied to me about the exit code
My first fix attempt (PR #419) regressed because I ran the save through tmux run-shell "$SAVE". That’s async: it fires the command and returns immediately, so the inner exit code never propagates back. My check was reading the return code of launching the save, not of the save itself, so it always saw success. I reverted, then re-fixed it in PR #553 by calling the script directly and capturing the real status:
bash "$SAVE"; rc=$?
snap="$RESURRECT_DIR/last"
if [ "$rc" -ne 0 ] || [ ! -s "$snap" ]; then
    notify "save-backstop: rc=$rc snapshot=$snap"
fi

[ -s FILE ] is “exists and non-empty.” No mtime, no STALE_AFTER, anywhere.
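The run-shell trap comes down to reading the status of a launch instead of the status of the work. The sketch below shows the same effect with an ordinary shell background job. It is an analogy only and does not involve tmux; failing_save is a hypothetical stand-in for a save that fails.

```shell
#!/usr/bin/env bash
# Analogy only: a plain shell background job, not tmux.
failing_save() { return 3; }   # hypothetical stand-in for a save that fails

failing_save &                 # fire and forget
launch_rc=$?                   # status of starting the job, not of the job

real_rc=0
wait "$!" || real_rc=$?        # waiting for the job yields its real status

echo "launch_rc=$launch_rc real_rc=$real_rc"
# prints: launch_rc=0 real_rc=3
```

Any check built on launch_rc would report success forever. Running the command in the foreground, or explicitly waiting for it, is what makes the exit code mean something.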
The general trap: any store that dedupes or content-addresses breaks the assumption that “stale equals old timestamp.” Dedup stores, content-hashed caches, idle headless boxes — all of them keep current data under an old clock. If you’re building a watchdog over a backup or snapshot tool, go read how the tool actually writes files before you alarm on anything. Freshness is proven by the save succeeding and the artifact being usable — never by what its clock says.
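To see that rule hold under an old clock, here is a small sketch that wraps the success-plus-usable test in a function and runs it against four cases. The names should_alarm and check, the file names, and the placeholder content are assumptions for illustration; touch -d assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: alarm on a failed save or an unusable artifact, never on age.
# should_alarm is a hypothetical name; args are the save's rc and the path.
should_alarm() {
    local rc=$1 snap=$2
    [ "$rc" -ne 0 ] || [ ! -s "$snap" ]
}

work=$(mktemp -d)
echo "placeholder layout" > "$work/old_but_valid"
touch -d '3 days ago' "$work/old_but_valid"   # old clock, good data
: > "$work/empty"                             # exists but zero bytes

check() {   # print the verdict for one labelled case
    if should_alarm "$2" "$3"; then echo "$1: ALARM"; else echo "$1: ok"; fi
}
check "old but valid"  0 "$work/old_but_valid"
check "save failed"    1 "$work/old_but_valid"
check "empty snapshot" 0 "$work/empty"
check "missing"        0 "$work/nope"
```

Only the first case stays quiet: a three-day-old file passes because nothing here reads its timestamp, while a failed save, an empty file, and a missing file each raise the alarm.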