
The Process I Killed Was Alive — It Just Had a Different Name

A liveness check that matched the wrong argv form declared every rescued Claude seat dead. The fix was learning all the names a process can wear.

  • claude-code
  • process-monitoring
  • agents
  • reliability
  • debugging

I woke up to a pile of overnight alerts saying half my Claude Code sessions had died. They hadn’t. Every one I checked was sitting there, alive, chewing through work. My monitor was crying wolf, and it did it all night.

Here’s the setup in plain terms. I run a bunch of Claude Code agent sessions — think of each one as a worker at a desk, each in its own terminal pane. Over time some of them stall or get walled (they run out of their hourly usage budget), so I have a rescue routine that quietly relaunches the dead ones in place. And I have a separate little sensor whose only job is to keep asking, “is there actually a live Claude at this desk?” If the answer is no, it fires a dead-seat alert and tries to nudge the session back to life.

The sensor decided liveness by looking at the running program’s name. Every process reports what launched it — the first word of its command line — and mine just checked: is that word claude? If yes, alive. If no, dead. Simple, and it worked great for months.

The bug was in the word “launched.”

A freshly-started session really does show up as claude. But a session that my rescue path relaunched doesn’t. It comes back running as the actual versioned binary on disk — a path like /opt/claude/versions/<ver>.elf instead of the friendly claude name. Same logical program, same worker at the same desk, doing the same work. Different name on its badge.

So the sensor was, in effect, only checking the front door. Anyone who came in through the side door — every single rescued seat — read as an empty chair. And rescued seats are exactly the ones you’d most want the monitor to trust, because they just survived something.

What makes this sneaky is that the false negatives stayed invisible until the exact population I hadn't tested for showed up. I'd only ever tested the sensor against seats I'd just spawned by hand. I never tested it against a seat that had been restarted, which is a different birth story with a different name.

Matching every argv form a process can wear

The check was basically this:

# gets the command name of the pane's foreground process
cmd=$(ps -o comm= -p "$pane_pid")
[[ "$(basename "$cmd")" == "claude" ]] && echo alive

For a rescued session, comm is <ver>.elf, so the basename never equals claude and the seat reads as dead. The fix is to accept both the canonical name and the versioned-binary form:

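# accept the canonical name and the versioned-binary form
# (caveat: *.elf matches any binary with that suffix, not only Claude's)
name=$(basename "$(ps -o comm= -p "$pane_pid")")
if [[ "$name" == "claude" || "$name" == *.elf ]]; then
  echo alive
fi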

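Better still, match on the resolved executable rather than the display name. On Linux, readlink /proc/$pid/exe returns the file the process is actually running, so a symlinked launcher and the real binary resolve to the same identity; ps -o args= gives you the full command line if you also need to see how it was invoked. Two caveats: comm is truncated to 15 characters on Linux, so a long versioned filename can lose its .elf suffix, and a wrapper script only resolves to the real binary if it execs it. Whatever you pick, write a test that restarts a process and asserts it still reads as live, not just one that spawns a fresh one.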
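Here is a minimal sketch of the resolved-path check plus the regression test, not the sensor's actual code. It assumes Linux (it reads /proc/<pid>/exe) and a layout like the example path above: a launcher named claude and versioned binaries under some versions directory with a .elf suffix. The function names is_claude_alive and test_both_launch_forms are made up for illustration, and a copy of the current shell stands in for the real binary.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes Linux (/proc) and a hypothetical layout:
# a launcher named "claude" and versioned binaries under .../versions/*.elf.

# Return 0 if the PID is running an executable we recognise as Claude.
is_claude_alive() {
  local pid=$1 exe
  # /proc/<pid>/exe points at the file actually executing, symlinks resolved.
  # readlink fails if the process is gone or we may not inspect it.
  exe=$(readlink "/proc/$pid/exe" 2>/dev/null) || return 1
  # The kernel appends " (deleted)" if the file was unlinked, e.g. by an upgrade.
  exe=${exe% (deleted)}
  case "$exe" in
    */claude|*/versions/*.elf) return 0 ;;
    *) return 1 ;;
  esac
}

# Regression test: a seat started by name and a seat started from the versioned
# path must both read as alive. A copy of the current shell is the stand-in
# binary, so nothing here depends on Claude being installed.
test_both_launch_forms() {
  local tmp fresh rescued rc=0
  tmp=$(mktemp -d) || return 1
  mkdir "$tmp/versions"
  cp "$(readlink /proc/$$/exe)" "$tmp/versions/0.0.0.elf"
  ln -s "$tmp/versions/0.0.0.elf" "$tmp/claude"

  # "Fresh" launch goes through the friendly name, "rescued" through the path.
  "$tmp/claude" -c 'while :; do sleep 1; done' & fresh=$!
  "$tmp/versions/0.0.0.elf" -c 'while :; do sleep 1; done' & rescued=$!
  sleep 1   # give both children time to exec

  is_claude_alive "$fresh"   || { echo "fresh seat read as dead"; rc=1; }
  is_claude_alive "$rescued" || { echo "rescued seat read as dead"; rc=1; }

  kill "$fresh" "$rescued" 2>/dev/null
  wait "$fresh" "$rescued" 2>/dev/null || true
  rm -rf "$tmp"
  return "$rc"
}
```

Calling test_both_launch_forms returns 0 only when both launch forms are recognised. Running the same test against the original comm == claude check would fail on the second seat, which is the case the overnight alerts were really about.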

The transferable bit: a program can legitimately show up under more than one name depending on how it started — spawned fresh, restarted, execed through a wrapper, or relaunched from a versioned binary on disk. Any check that decides “is this thing alive” by matching a single name will silently misclassify the variant you didn’t think of, and it’ll be the post-rescue variant, because that’s the code path you never manually tested.
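To see those names side by side for one process, a helper like the one below can be run before and after a rescue. It is a sketch that assumes Linux, since it reads /proc directly, and show_names is a made-up helper name. On Linux with procps, /proc/<pid>/comm is the same kernel field that ps -o comm= reports.

```shell
#!/usr/bin/env bash
# Sketch (Linux only): print the names the kernel holds for one PID.
# show_names is a hypothetical helper, not part of the original sensor.
show_names() {
  local pid=$1
  # comm: by default the basename of the path passed to exec,
  # truncated to 15 characters.
  printf 'comm=%s\n' "$(cat "/proc/$pid/comm" 2>/dev/null)"
  # cmdline: the full argv, NUL-separated in /proc, joined with spaces here.
  printf 'args=%s\n' "$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")"
  # exe: the file actually executing, with symlinks resolved.
  printf 'exe=%s\n' "$(readlink "/proc/$pid/exe" 2>/dev/null)"
}
```

Usage would be show_names "$pane_pid" on a fresh seat and again on a rescued one. If comm changes between the two while exe stays recognisable, the check belongs on exe.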

If you ground-truth liveness with pgrep or an argv name match, go restart the thing and watch what name it wears the second time. That’s the one your monitor needs to recognize.