My Agent-to-Agent Message Ledger Came Back Out of Order — Clock Skew Was the Bug

I run a fleet of Claude Code agents that coordinate through an append-only message ledger. Ordering it by timestamp quietly broke whenever a machine clock stepped backward. The fix: order by row ID.

  • claude-code
  • agentic-ai
  • ai-agents
  • databases
  • reliability

I run a fleet of Claude Code agents — autonomous sessions working in parallel across multiple machines and Anthropic accounts. They coordinate through an agent-to-agent (A2A) message ledger: an append-only Postgres table where agents post messages to each other and read the conversation history back out.

The ledger was ordered by created_at. For months that worked fine. Then it quietly broke.

Messages started coming back out of insertion order. Not dramatically — just the occasional message appearing a few slots early or late in the conversation history the agent was reading. No error, no exception. The agent just saw a subtly wrong picture of what had been said. The kind of bug you find by noticing something feels off, not by reading a stack trace.

The cause was clock skew. A few of the machines in my fleet are laptops that suspend and resume, and one is a VM. When a laptop wakes from sleep or NTP corrects a drifted VM clock, the system clock can step backward by anywhere from a fraction of a second to a few seconds. A message inserted after that correction gets a created_at timestamp earlier than a message inserted before it. The table’s chronological order now lies about what actually happened.

The fix took about ten minutes and is permanent: stop ordering the ledger by timestamp and order it by the database’s auto-increment integer primary key instead. The row ID is a monotonic counter that only ever goes up. It has no relationship to the wall clock. It can’t step backward. It reflects actual insertion order by construction, not by measurement of a clock I don’t control.
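To make the failure concrete, here is a minimal sketch of the same situation in plain Python. Nothing in it comes from my actual ledger code: SkewedClock, Message, and build_ledger are hypothetical names, and the backward step is simulated rather than caused by a real NTP correction. Each message gets both a wall-clock reading and a counter value, and the two sort orders are compared.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Message:
    id: int            # handed out by a counter, like an identity column
    created_at: float  # wall-clock reading at insert time, in seconds
    body: str


class SkewedClock:
    """Hypothetical wall clock that can be stepped backward, as NTP might do."""

    def __init__(self, start: float) -> None:
        self.now = start

    def tick(self, seconds: float) -> float:
        """Advance the clock and return the new reading."""
        self.now += seconds
        return self.now

    def step_back(self, seconds: float) -> None:
        """Simulate a clock correction that moves time backward."""
        self.now -= seconds


def build_ledger() -> list[Message]:
    clock = SkewedClock(start=1000.0)
    ids = count(1)
    ledger = [
        Message(next(ids), clock.tick(1.0), "first"),   # created_at = 1001.0
        Message(next(ids), clock.tick(1.0), "second"),  # created_at = 1002.0
    ]
    clock.step_back(3.0)  # the clock now reads 999.0
    ledger.append(Message(next(ids), clock.tick(1.0), "third"))  # 1000.0
    return ledger


ledger = build_ledger()
by_timestamp = [m.body for m in sorted(ledger, key=lambda m: m.created_at)]
by_id = [m.body for m in sorted(ledger, key=lambda m: m.id)]

print(by_timestamp)  # ['third', 'first', 'second']
print(by_id)         # ['first', 'second', 'third']
```

The message inserted last sorts first by timestamp, with no error anywhere. The counter never consulted the clock, so its order is unaffected.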

The timestamp stays — it’s real metadata and useful for display and range queries. It just doesn’t get to decide what came first anymore.

This applies to any append-only log: event streams, audit trails, chat histories, agent message queues. If order matters and the table is append-only, the row ID is the right sort key.

Postgres BIGSERIAL, IDENTITY columns, and keyset pagination: the details

Use BIGINT GENERATED ALWAYS AS IDENTITY (or BIGSERIAL) as the primary key, and treat created_at as metadata only:

CREATE TABLE agent_messages (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  created_at  timestamptz NOT NULL DEFAULT now(),  -- metadata, not ordering
  sender      text NOT NULL,
  recipient   text NOT NULL,
  payload     jsonb NOT NULL
);

-- correct ledger order: monotonic, clock-skew-immune
-- ($1 is the reading agent's ID, bound by the client)
SELECT * FROM agent_messages
WHERE recipient = $1
ORDER BY id ASC;

The same id column gives you keyset pagination for free, which you want on a ledger that grows indefinitely. The usual alternative, OFFSET, gets slower the deeper you page, because the database still has to walk past every skipped row, and it can skip or repeat rows when the table is written to concurrently:

-- "give me messages after the last one I saw"
-- ($1 is the reading agent's ID, $2 is the highest id it has already seen)
SELECT * FROM agent_messages
WHERE recipient = $1
  AND id > $2
ORDER BY id ASC
LIMIT 100;
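On the client side, the query above turns into a simple loop: fetch a page, remember the highest id, ask again. The sketch below is an illustration rather than my production reader. It uses Python's built-in sqlite3 with an in-memory table as a stand-in for Postgres (so the column types and the ? placeholders differ from the SQL above), and make_ledger, read_all, and PAGE_SIZE are names invented for the example.

```python
import sqlite3

PAGE_SIZE = 2  # deliberately small so the loop runs several times


def make_ledger() -> sqlite3.Connection:
    """Build an in-memory stand-in for the agent_messages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE agent_messages (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            recipient TEXT NOT NULL,
            payload   TEXT NOT NULL
        )
        """
    )
    rows = [("agent-a", f"msg {n}") for n in range(1, 6)]
    rows.insert(2, ("agent-b", "not for agent-a"))  # interleave another recipient
    conn.executemany(
        "INSERT INTO agent_messages (recipient, payload) VALUES (?, ?)", rows
    )
    return conn


def read_all(conn, agent_id, last_seen_id=0):
    """Yield every (id, payload) for agent_id with id > last_seen_id, page by page."""
    while True:
        page = conn.execute(
            """
            SELECT id, payload FROM agent_messages
            WHERE recipient = ? AND id > ?
            ORDER BY id ASC
            LIMIT ?
            """,
            (agent_id, last_seen_id, PAGE_SIZE),
        ).fetchall()
        if not page:
            return
        yield from page
        last_seen_id = page[-1][0]  # the cursor is the highest id delivered so far


conn = make_ledger()
messages = list(read_all(conn, "agent-a"))
print(messages)
```

Each page costs the same no matter how deep into the ledger the reader is, because the WHERE clause jumps straight to the cursor instead of counting rows from the start. The ids the reader sees are not contiguous (another agent's message sits in between), and the loop never needs them to be.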

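Two things worth knowing. First, Postgres sequences are monotonic but not gapless: a rolled-back transaction leaves a hole in the ID sequence. That’s fine for ordering and cursor pagination; just don’t assume “next ID = previous ID + 1.” Second, an ID is assigned when the INSERT runs, not when its transaction commits, so with concurrent writers a row with a lower ID can become visible after a row with a higher one, and a reader paging with id > last_seen_id can step past it. Keeping insert transactions short makes that window small; serializing writers closes it.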
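Here is a toy model of how a rollback leaves a hole. It is a pure-Python simulation, not Postgres itself: the Sequence class and the insert helper are hypothetical, and the commit flag stands in for a transaction that either commits or rolls back.

```python
class Sequence:
    """Toy stand-in for a database sequence: values are handed out once, never reused."""

    def __init__(self) -> None:
        self._last = 0

    def nextval(self) -> int:
        self._last += 1
        return self._last


def insert(table: list, seq: Sequence, payload: str, commit: bool = True) -> int:
    """Take an ID from the sequence, then keep the row only if the transaction commits."""
    row_id = seq.nextval()  # the ID is consumed before the outcome is known
    if commit:
        table.append((row_id, payload))
    return row_id  # on rollback the row is gone but the ID stays used


seq = Sequence()
table = []
insert(table, seq, "a")
insert(table, seq, "b", commit=False)  # simulated rollback
insert(table, seq, "c")

ids = [row_id for row_id, _ in table]
print(ids)  # [1, 3]

# A cursor that compares with ">" is unaffected by the hole at id 2.
last_seen_id = 1
unread = [row for row in table if row[0] > last_seen_id]
print(unread)  # [(3, 'c')]
```

A reader that waited for id 2 to show up before moving on would wait forever, which is why the cursor should be "greater than the last id I saw" and never "equal to the last id plus one."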

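If you’re on a distributed store with no single sequence (sharded database, multi-writer setup), don’t fall back to the raw wall clock. Be careful with Snowflake IDs and ULIDs too: both put a wall-clock timestamp in their high bits, so across machines they are only as well ordered as the clocks that generate them, though a generator can refuse to emit an ID lower than its last one. For strict ordering, use a single sequencer or a logical clock (a Lamport clock, for example). The property you need is “monotonically increasing by construction,” not “high-resolution timestamp.”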
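One scheme that is monotonic by construction without a shared sequence is a Lamport logical clock. To be clear, this is a technique I am swapping in for illustration, not one of the ID formats named above and not something my ledger uses; the LamportClock class and the node names are invented for this sketch. Each node keeps a counter, bumps it on every local event, and on receiving a message jumps ahead of whatever counter the sender attached.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    node_id: str
    counter: int = 0

    def tick(self) -> tuple[int, str]:
        """Stamp a local event, such as sending a message."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter: int) -> tuple[int, str]:
        """Jump past the sender's counter, then stamp the receipt."""
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)


a = LamportClock("node-a")
b = LamportClock("node-b")

sent = a.tick()                  # (1, 'node-a')
received = b.receive(sent[0])    # (2, 'node-b')
reply = b.tick()                 # (3, 'node-b')
got_reply = a.receive(reply[0])  # (4, 'node-a')

# Sorting by (counter, node_id) puts every cause before its effect,
# whatever order the events are collected in.
events = [got_reply, reply, received, sent]
ordered = sorted(events)
print(ordered)
```

The trade-off is that the resulting order follows causality (a reply always sorts after the message it answers) rather than real arrival time, and events on different nodes that never exchanged messages are ordered only by the node_id tiebreak.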

The general rule: when you need strict ordering, base it on something monotonic by construction, not on a measurement that can be corrected, adjusted, or skewed. A timestamp is a measurement of a clock I don’t control. A row ID is a counter I do. For ordering, reach for the counter.