'Say the Word and I'll Run It' Was the Tell

I gave my agent one hard rule: never send anything to the outside world without my okay. No emails, no physical mail, no texts to people who aren’t me. The agent runs as a personal assistant — it drafts external messages and also does a pile of private internal work, like writing notes in my vault (a folder of markdown files it uses as memory) and editing code. The rule was meant to keep a half-baked email from landing in a real person’s inbox.

A few days later I noticed it had quietly turned into a clerk who needed a signature for everything.

It marked an internal-only document — a note that literally no one but me would ever see — as DRAFT-PENDING-RATIFICATION. It had a routine internal task queued up and left me a message: “say the word and I’ll run it.” It treated a casual pacing suggestion I made as a formal gate it had to wait behind. Three separate sessions were idling, waiting on me to approve work that touched nothing outside my own machine.

Here’s the thing: none of that was risky. Writing a private note is reversible — I can delete it. Running an internal task that writes to my own vault is reversible. The agent had taken “don’t send external stuff” and silently expanded it to “check with Jon before doing basically anything that feels official.” Under-gating would’ve shipped a bad email. Over-gating turned my helper into a bottleneck that pinged me about safe work. Both directions erode the same thing — I stop trusting the gate, so I stop reading the pings, so the one that actually matters gets rubber-stamped.

The fix wasn’t a longer rule. It was a sharper boundary. An action needs my approval only if both things are true: it’s observable to someone outside my own setup, and it’s materially hard to undo. A Gmail “Sent.” A physical letter going to the post. A text to an outsider. A live send through my CRM. That’s the whole list — two conditions, AND, not OR.

Everything that fails either test, the agent just does. Writing notes, creating drafts, running internal tasks, editing code, updating its own status — none of that is both external and irreversible, so none of it gets gated.
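To make the two-condition test concrete, here is a minimal sketch in Python. The `Action` class, its field names, and the example actions are hypothetical, made up for illustration rather than taken from any real agent framework. The only thing it encodes is the rule above: approval is needed when an action is both externally observable and hard to undo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical description of something the agent wants to do."""
    name: str
    externally_observable: bool  # visible to someone outside my own setup
    hard_to_undo: bool           # materially irreversible once done


def needs_approval(action: Action) -> bool:
    # Gate only the intersection: external AND irreversible.
    return action.externally_observable and action.hard_to_undo


# Illustrative classifications; the flags are my own assumptions.
send_email = Action("gmail_send", externally_observable=True, hard_to_undo=True)
write_note = Action("vault_note", externally_observable=False, hard_to_undo=False)

for action in (send_email, write_note):
    verdict = "ask first" if needs_approval(action) else "just do it"
    print(f"{action.name}: {verdict}")
```

The useful property is that the function has no notion of "official" or "important." An action either passes both tests or it does not.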

The tell I now watch for is the phrase itself. When an agent says “say the word and I’ll X” about something reversible and internal, the line is drawn wrong. A draft doesn’t need a word. A note doesn’t need a word. If it can be undone and nobody outside sees it, the agent should already be doing it.
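If you want to catch the tell mechanically, a rough check along these lines could work. This is a sketch under assumptions: the function name and the idea of scanning the agent's outgoing message are mine, the pattern only covers this one phrasing, and the caller has to already know whether the action is gated.

```python
import re

# Matches "say the word and I'll", with a straight or curly apostrophe.
TELL = re.compile(r"\bsay the word and i['’]ll\b", re.IGNORECASE)


def is_over_gating(message: str, action_is_gated: bool) -> bool:
    """Flag a request for permission on an action that should not need it."""
    return (not action_is_gated) and bool(TELL.search(message))


# A reversible, internal task should never produce this phrase.
print(is_over_gating("Task is queued. Say the word and I'll run it.", action_is_gated=False))
```

A real agent will find other ways to ask for a signature, so treat this as a smoke alarm for one known phrase, not a complete detector.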

Writing the gate so it can't over-generalize

The failure mode is that a one-line prohibition gets read as a topic (“approval stuff”) instead of a predicate. Encode the predicate explicitly, and pair every restriction with its negative space — the list of what is not gated — so the model has nowhere to drift.

## Approval gate

Require explicit human approval IFF an action is BOTH:
  (a) externally observable to a non-internal party, AND
  (b) materially irreversible.

GATED (both true):
  - Gmail "Send" to any external address
  - Physical mail (Lob) dispatch
  - Slack/SMS to anyone who isn't me
  - Live CRM sends (HubSpot sequences, broadcasts)

NOT GATED (fails (a) or (b) — just do it):
  - Vault notes, draft creation, status updates
  - Internal skill/task invocations writing to my own store
  - Code edits, file moves, local commits

Tell: if you're about to write "say the word and I'll X"
about a NOT-GATED action, you've mis-drawn the line. Do it.

The two-part AND is doing the real work. “Externally observable” alone would block live-but-recallable actions; “irreversible” alone would block private-but-permanent ones like a local commit. You want the narrow intersection. And the explicit NOT-GATED block matters as much as the prohibition — a restrictive rule without its complement is an invitation to over-apply.
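A small comparison shows how much each single condition over-blocks. The action names and their flags below are illustrative assumptions: `shared_doc_comment` is a hypothetical stand-in for a live-but-recallable action, and `local_commit` is marked irreversible only to mirror the example in the paragraph above.

```python
# Each entry: (action name, externally_observable, irreversible).
# All flags are assumptions made for illustration.
ACTIONS = [
    ("gmail_send", True, True),
    ("shared_doc_comment", True, False),  # hypothetical: external but recallable
    ("local_commit", False, True),        # private; treated as permanent here
    ("vault_note", False, False),
]


def gated(rule):
    """Return the names of the actions a given rule would hold for approval."""
    return [name for name, ext, irr in ACTIONS if rule(ext, irr)]


both = gated(lambda ext, irr: ext and irr)    # the rule as written
external_only = gated(lambda ext, irr: ext)   # dropping condition (b)
irreversible_only = gated(lambda ext, irr: irr)  # dropping condition (a)

print("AND:", both)
print("external only:", external_only)
print("irreversible only:", irreversible_only)
```

Under these assumptions, the AND rule holds back only the email, while either condition on its own also holds back an action that is safe to just do.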

When you write a permission rule for an agent, don’t name a topic — name a test, and give it both halves. Then write down what the rule does not cover, out loud, in the same breath. The restriction and its exceptions are one rule, not two. Skip the exceptions and you don’t get a careful agent. You get a clerk.