Design log collectors to be incremental from day one

When you turn agent transcripts or activity logs into a data source, assume the corpus will grow far larger than you expect. A single workspace can accumulate tens of thousands of session files, and a collector that re-scans everything on each run will quietly become unusable.

Build the ingestor to be incremental from the start: track a cursor by modification time or byte offset, and only read what changed since the last pass. Full re-scans should be a rare, deliberate operation — never the default path.

The same discipline applies before ingesting anything sensitive. Machine-generated titles and summaries often embed private or regulated content, so treat scrubbing as a load-bearing requirement, not an afterthought. Prove your assumptions against real artifacts rather than trusting the shape you imagined.