The fix had been on main for two days. Commit 64b8e19, reviewed, merged, green. And the bug it fixed was still happening in production — a downstream queue was quietly emitting everything twice, doubling its output with no error, no crash, nothing that would page anyone. It just… did the wrong thing, calmly, for 48 hours.
I spent the first ten minutes of the incident confused in a very specific way. I pulled up the code on the box. The fix was right there. git log showed the commit. git status was clean. The file on disk contained the corrected logic. So why was the service behaving like the old version?
Because it was running the old version.
Here’s the thing I’d half-forgotten. The service runs under bun run (Bun is the JavaScript runtime we use, like Node), and a long-lived bun run process reads each source file once, when that module is first loaded, which for most code means at startup. After that, it’s executing an in-memory copy. You can git pull a hundred fixes onto that machine and the running process will never see a single one of them. The code on disk and the code actually executing had silently drifted two days apart, and nothing anywhere told me.
Think of it like a printed recipe you memorized. Someone can correct the cookbook on the shelf, but you’re still cooking from the version in your head until you deliberately re-read it. “Merged” updated the shelf. The cook never looked up.
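The same load-once behaviour is easy to reproduce in a few lines of shell. This is only an analogy, not Bun itself: sourcing a file stands in for the process booting, and overwriting the file stands in for git pull. The emit function and its two versions are made up for illustration.

```shell
#!/bin/sh
# Analogy only: a shell that loads its code once, like a long-lived runtime.
src=$(mktemp)

# "Deploy" the buggy version and "boot": the shell reads the file once.
echo 'emit() { echo "old logic"; }' > "$src"
. "$src"

# "git pull": the file on disk now holds the fix...
echo 'emit() { echo "fixed logic"; }' > "$src"
on_disk=$(cat "$src")

# ...but the already-loaded function is unchanged.
before_restart=$(emit)

# "Restart": re-read the source, and only now does the fix take effect.
. "$src"
after_restart=$(emit)

echo "disk:           $on_disk"
echo "before restart: $before_restart"
echo "after restart:  $after_restart"
rm -f "$src"
```

Between the overwrite and the second load, the disk and the loaded function disagree, and nothing reports it.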
The fix for the incident was embarrassingly small: restart the process. It picked up 64b8e19, the double-emit stopped, and the queue went back to normal in seconds. The two days were entirely gap, not work.
Verify the running artifact, not the repo state
The trap is that git tells you about disk, not about the live process. Confirm the running code actually contains your fix before you close the incident.
# Find the service's pid (-o: oldest match, so $pid is exactly one process).
pid=$(pgrep -o -f 'bun run')
# What commit is checked out in the RUNNING process's working dir?
# (Still disk state: it confirms which checkout to inspect, not what is loaded.)
git -C "/proc/$pid/cwd" rev-parse HEAD # want: 64b8e19...
# When did the process START vs. when was the fix committed?
ps -o lstart= -p "$pid" # process start time
git -C "/proc/$pid/cwd" show -s --format=%ci 64b8e19 # commit time
# If start time is BEFORE commit time, you're running stale code.
Bun does have bun --hot, which re-reads source on change, but that’s a dev-mode reloader I don’t want holding state in production. The durable answer is a tracked deploy step: pull, then restart under systemd, and log both. If restart isn’t in your deploy, you don’t have a deploy; you have a repo update.
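A tracked deploy step along those lines might look like the sketch below. The repository path, the unit name queue-worker.service, and the deploy log tag are hypothetical. git pull --ff-only, systemctl restart, and logger are standard commands, but adapt them to however your host is actually managed.

```shell
#!/bin/sh
# Sketch of a deploy that treats the restart as part of the deploy.
# Usage (hypothetical path and unit name):
#   deploy /srv/queue-worker queue-worker.service
deploy() {
  repo_dir=$1
  unit=$2

  # 1. Update the code on disk; refuse anything but a fast-forward.
  git -C "$repo_dir" pull --ff-only || return 1
  sha=$(git -C "$repo_dir" rev-parse --short HEAD) || return 1
  logger -t deploy "pulled $sha into $repo_dir"

  # 2. Restart so the running process actually loads that code.
  systemctl restart "$unit" || return 1
  logger -t deploy "restarted $unit at $sha"
}
```

If the pull or the restart fails, the function stops there, so the log never claims a restart that did not happen.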
What I took away: for any interpreted-but-non-reloading runtime — bun run, plain node, a Python worker — “landed on main” is a claim about your repository and nothing else. The process boots, snapshots your code, and stops caring what you do next. Silent-success bugs love that gap, because nothing errors; the old logic just keeps running with total confidence.
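A minimal sketch of one way to make that gap visible, assuming Linux and a launcher you control: stamp the process with the commit it booted from, then compare that stamp with what is on disk. BOOT_SHA and all three function names are hypothetical, and /proc/<pid>/environ is Linux-specific.

```shell
#!/bin/sh
# At launch: stamp the process with the commit it is about to load.
# BOOT_SHA is a made-up variable name; "$@" is the real start command.
#   start_stamped /srv/queue-worker bun run worker.ts   (hypothetical path and script)
start_stamped() {
  repo_dir=$1; shift
  cd "$repo_dir" || return 1
  BOOT_SHA=$(git rev-parse HEAD) || return 1
  export BOOT_SHA
  exec "$@"
}

# Later: pull BOOT_SHA back out of a NUL-separated environ file,
# e.g. /proc/<pid>/environ on Linux.
boot_sha_from_environ() {
  tr '\0' '\n' < "$1" | sed -n 's/^BOOT_SHA=//p'
}

# Compare what booted with what is on disk; non-zero exit means stale.
check_live() {
  booted=$1; on_disk=$2
  if [ "$booted" = "$on_disk" ]; then
    echo "live: running $booted"
  else
    echo "STALE: running ${booted:-unknown}, disk has $on_disk"
    return 1
  fi
}
```

With a pid in hand, the check would be something like check_live "$(boot_sha_from_environ "/proc/$pid/environ")" "$(git -C "/proc/$pid/cwd" rev-parse HEAD)". Unlike comparing timestamps, this compares the commit the process was started from directly against the checkout.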
So treat the restart as the deploy, not the cleanup after it. And when you’re sure a fix is live, check the running process actually contains it. The shelf being correct never fed anyone.