
Green Didn't Mean Seeing: The Night My Rescue Daemons Watched Nothing

A linuxbrew-vs-/usr/bin tmux mismatch made my rescue daemons scan zero seats all night while reporting LIVE. Process-alive is not the same as sees-the-work.

  • tmux
  • systemd
  • daemons
  • observability
  • agents

I spent a night rolling out little watchdog programs — rescue daemons — across a fleet of machines. Each one’s job is simple: watch the “seats” on its box (a seat is one running Claude Code agent session, parked in a tmux pane), and if a session falls over, bring it back. The overseer dashboard glowed green. Every daemon: LIVE. I went to bed proud.

In the morning, nothing had been rescued. Not because nothing broke — things broke — but because every single daemon had been watching an empty room all night.

Here’s the part a non-engineer can hold onto: tmux is a program that keeps terminal sessions alive in the background, like a TV that keeps playing when you close the laptop. It has two halves — a server that actually holds the sessions, and a client command you type to ask the server “what’s running?” My daemons were running the client. But — and this is the whole bug — there were two copies of tmux installed on each box, and the two halves were different copies.

The sessions lived inside the tmux server started by linuxbrew’s tmux, off in /home/linuxbrew/.linuxbrew/bin. But my daemons ran under systemd with the plain /usr/bin/tmux first on their PATH. Two binaries, two private channels. When /usr/bin/tmux asked “list every pane,” it got back a polite, confident zero. No error. No crash. Just an empty answer to a question it was asking the wrong server.

So each daemon scanned zero seats, found zero failures, fixed zero things — and reported itself perfectly healthy. Because “healthy” only meant the process is running. It never meant the process can see the fleet.

That gap is the lesson I actually paid for. A liveness check answers “am I alive?” My daemons answered yes, truthfully, while being completely blind. A green light that measures the wrong thing is worse than a red one — it tells you to stop looking.

The fix was small and slightly embarrassing: a systemd drop-in that pins PATH so linuxbrew comes first, so the daemon shells out to the same tmux binary that started the server. One file. Hours of nothing prevented.

The mismatch and the two fixes

The trap: the tmux that command -v tmux resolves in your interactive shell and the tmux a service resolves under systemd --user are often different binaries, because the two see different PATHs. Login shells source zsh/bash profiles that prepend linuxbrew; systemd units don’t.

Check the discrepancy directly:

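# what you see in your interactive shell
command -v tmux            # /home/linuxbrew/.linuxbrew/bin/tmux
# what the daemon sees: the user manager's environment...
systemctl --user show-environment | grep '^PATH='   # /usr/bin first, so the wrong server
# ...plus any Environment= overrides set on the unit itself
systemctl --user show rescue.service -p Environment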
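show-environment reports the user manager's environment block, which is one step removed from the daemon itself. To read the PATH of the process that is actually running, one option is its /proc entry. This is a minimal sketch, assuming Linux and the rescue.service unit name from this post; path_of_environ is a hypothetical helper, not something the daemon ships with.

```shell
# Hypothetical helper: print the PATH value from a NUL-separated
# environ file such as /proc/<pid>/environ.
path_of_environ() {
  tr '\0' '\n' < "$1" | sed -n 's/^PATH=//p'
}

# Usage against the live daemon (unit name assumed from this post):
#   pid=$(systemctl --user show rescue.service -p MainPID --value)
#   path_of_environ "/proc/$pid/environ"
```

If the first directory printed is not the one holding the tmux that started your server, you have found the mismatch.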

Pin PATH with a drop-in (~/.config/systemd/user/rescue.service.d/path.conf):

[Service]
Environment=PATH=/home/linuxbrew/.linuxbrew/bin:/usr/bin:/bin

Then make health mean observed work, not liveness. Have the daemon emit the seat count it actually saw, and treat zero-when-you-expect-many as unhealthy:

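# count the panes this daemon's tmux can actually see
seats=$(tmux list-panes -a 2>/dev/null | wc -l)
# /run/rescue must already exist and be writable by the daemon
echo "{\"status\":\"live\",\"seats_seen\":$seats}" > /run/rescue/health.json
# overseer alerts if seats_seen == 0 on a box that should have sessions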
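One caveat with 2>/dev/null: it also hides the case where the tmux call itself failed, so a failed call and a genuinely empty server both come out as 0. Here is a hedged variant that keeps the two apart. TMUX_BIN and the tmux_ok field are illustrative names I made up for this sketch, not part of the original daemon.

```shell
# Sketch: record whether tmux answered at all, separately from how many
# panes it listed. TMUX_BIN is a hypothetical override; it defaults to
# whatever "tmux" resolves to on the daemon's PATH.
seat_report() {
  local out ok=true seats=0
  if out=$("${TMUX_BIN:-tmux}" list-panes -a 2>/dev/null); then
    # grep -c . counts non-empty lines; it exits 1 when the count is 0
    seats=$(printf '%s\n' "$out" | grep -c . || true)
  else
    ok=false
  fi
  printf '{"status":"live","tmux_ok":%s,"seats_seen":%s}\n' "$ok" "$seats"
}
```

The output can be redirected to the same health file. An overseer can then treat tmux_ok false as a harder failure than a plain zero.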

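After adding the drop-in, run systemctl --user daemon-reload and then systemctl --user restart rescue.service (a plain reload will not give the running process a new PATH), and confirm seats_seen is non-zero before trusting any green.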
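The "confirm seats_seen is non-zero" step is easy to script. This is a rough sketch of what an overseer-side check could look like, assuming the one-line JSON format shown above; check_seats and its arguments are hypothetical names, and it uses sed rather than a JSON parser so it has no dependencies.

```shell
# Sketch of an overseer-side check. Reads the seats_seen field from a
# health file and fails when it is below the expected minimum.
check_seats() {
  local file=$1 min=${2:-1} seen
  seen=$(sed -n 's/.*"seats_seen":[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" 2>/dev/null)
  if [ -z "$seen" ]; then
    echo "UNKNOWN: no seats_seen in $file"
    return 2
  fi
  if [ "$seen" -lt "$min" ]; then
    echo "BLIND: saw $seen seats, expected at least $min"
    return 1
  fi
  echo "OK: saw $seen seats"
}
```

Called as check_seats /run/rescue/health.json 1, it returns non-zero for a daemon that is alive but sees nothing, which is exactly the state my dashboard painted green.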

If your service shells out to any tool with a client/server split — tmux, Docker, a database socket — and that tool has more than one copy installed, diff the binary your shell uses against the binary your service uses. They love to disagree. And make your health signal report the work it can see, in numbers, so a blind monitor can never pass for a healthy one.
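That diff can be done without touching the service. Here is a small sketch that resolves a tool under two explicit PATH strings and reports whether they agree. resolve_under and compare_paths are hypothetical helper names, and the comparison is on the resolved path string only, so it will not notice two paths that are symlinks to the same file.

```shell
# Hypothetical helper: resolve a tool under an explicit PATH. The
# subshell keeps the PATH change from leaking into the caller.
resolve_under() {
  ( PATH=$1; command -v "$2" )
}

# Compare what two PATH strings resolve for the same tool name.
compare_paths() {
  local tool=$1 a b
  a=$(resolve_under "$2" "$tool") || true
  b=$(resolve_under "$3" "$tool") || true
  if [ "$a" = "$b" ]; then
    echo "same: ${a:-not found}"
  else
    echo "MISMATCH: ${a:-not found} vs ${b:-not found}"
    return 1
  fi
}

# Example: compare_paths tmux "$PATH" "/usr/bin:/bin"
```

The second PATH would be whatever your service actually gets, for instance the value systemctl --user show-environment prints.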