I ran Claude Code against my API repo one afternoon and watched it burn $0.04 before it even touched a single file. The culprit wasn’t my code. It was my CLAUDE.md — all 2,000 words of it, every single one getting loaded into the system prompt on every agent step, multiplying across 40-some tool calls.
I’d written that file with care. Architecture overview. Directory layout. Coding conventions. Style rules. I assumed more context meant better answers. That assumption cost me tokens, wall time, and — as I’d later learn from a paper — actually made the agent worse at its job.
The paper that confirmed my CLAUDE.md was actively hurting me came out of ETH Zurich. “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?” (Gloaguen et al., ICML 2025). The researchers tested Claude Code, OpenAI’s Codex, and Qwen Code across hundreds of real GitHub issues, using both the established SWE-bench benchmark and a new benchmark they built from repos that already had developer-written context files checked in.
The headline finding stopped me mid-scroll: LLM-generated context files reduce task success rates by ~3% while increasing inference cost by over 20%. Even developer-written files — files like mine, written by humans who know the codebase — only improved success by ~4% on average. Barely above noise. And they still added 20%+ to the bill.
The Mechanism, and How to Measure It Yourself
The paper reports the headline numbers (LLM-generated files: −3% task success, +20–23% cost; developer-written: +4% success, +19% cost), but the behavioral breakdown is more useful than the averages.
Why it hurts: context files are prepended to every tool call’s system prompt, so their token cost is paid on every agent step, not just once. A 2,000-token CLAUDE.md injected into a 40-step task costs 80,000 extra tokens — before any of your code is read. Modern coding agents (Claude Code, Codex, OpenHands) use structured tool-use loops: each loop iteration appends tool outputs to the conversation and re-samples. The context file anchors the top of every sample window, and the paper measured that agents issue more grep, read_file, and run_tests calls when it’s present — they’re pattern-matching on instructions rather than reasoning about the actual task.
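The per-step arithmetic above is easy to sketch. This is a rough estimate only: the function names are mine, the price of 1.00 USD per million input tokens is a placeholder chosen for easy arithmetic rather than a quoted rate, and the calculation ignores prompt caching and output tokens.

```python
def context_overhead_tokens(file_tokens: int, agent_steps: int) -> int:
    """Tokens spent re-sending the context file, once per agent step."""
    return file_tokens * agent_steps


def context_overhead_usd(file_tokens: int, agent_steps: int,
                         usd_per_million_input: float) -> float:
    """Dollar cost of that overhead at a given input-token price."""
    tokens = context_overhead_tokens(file_tokens, agent_steps)
    return tokens * usd_per_million_input / 1_000_000


# Figures from the example above: a 2,000-token file over a 40-step task.
tokens = context_overhead_tokens(2_000, 40)

# ASSUMPTION: 1.00 USD per 1M input tokens is a placeholder, not a real price.
cost = context_overhead_usd(2_000, 40, usd_per_million_input=1.00)

print(f"{tokens:,} extra tokens, about ${cost:.2f} at the assumed rate")
```

Swap in your own token count, step count, and your provider's current input price to get a number that means something for your repo.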
Quantify your own file before cutting it:
# Count tokens with tiktoken (pip install tiktoken).
# Note: this is an OpenAI encoding, so treat the result as an approximation for Claude models.
python3 - <<'EOF'
import pathlib
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = pathlib.Path("CLAUDE.md").read_text(encoding="utf-8")
toks = enc.encode(text)
print(f"{len(toks)} tokens; cost is roughly tokens x agent steps x $/1M")
EOF
The one thing that genuinely helps: non-standard CLI tooling the agent can’t infer. If you run bun not npm, uv not pip, or a custom make proto step that must precede builds, say exactly that and nothing else. Those lines have no discoverable substitute. Everything else does.
The behavioral analysis is what actually convinced me. When context files are present, agents run more tests, grep more files, read more files, and write more files. They’re being dutiful. They’re following your instructions. Which sounds virtuous — until you realize that following unnecessary instructions is just busywork. It burns tokens. It doesn’t improve outcomes.
Why Overviews Don’t Help
Eight out of twelve developer-written files the researchers studied included codebase overviews. Over 90% of LLM-generated files did. And when the team measured how many steps it took an agent to first interact with a file from the actual bug fix? No meaningful difference.
Turns out agents are already good at exploring codebases. They grep. They list directories. They follow imports. A section that says “the API routes are in src/routes/” doesn’t help — the agent would’ve found that in one ls command. But it still consumed tokens on every step, and it still added cognitive weight to every prompt.
I had to sit with that for a minute. All those hours I spent writing architecture descriptions. For nothing. Worse than nothing — they were a tax.
What Actually Helps
The paper did find one mechanism where context files genuinely earn their tokens: surfacing non-standard tooling.
When a context file mentions uv (a Python package manager that’s not the default), agents use it 1.6 times per task on average versus fewer than 0.01 times when it’s not mentioned. Repository-specific tools show the same pattern — 2.5 uses per task when mentioned, 0.05 when not.
This makes intuitive sense once you think about it. An agent can discover your project structure by reading code. It cannot discover that you use bun instead of npm from a package.json that lists both as valid options. It cannot discover that make proto must run before cargo build unless something tells it directly.
The 4-Question Filter
Based on the paper’s findings, I built a tool called ContextOptimizer that applies a simple inclusion filter to every line in a context file:
Include a line if and only if:
1. NOT DISCOVERABLE — agent can't learn this from README, configs, or --help
2. ACTIONABLE — it tells the agent to DO something specific
3. PREVENTS SILENT FAILURE — getting it wrong causes hard-to-debug issues
4. BROADLY APPLICABLE — relevant to most tasks, not just one workflow
Everything that fails any of these four questions gets cut. The tool has three workflows: Audit (score existing files against 8 weighted anti-patterns), Optimize (rewrite files using the filter), and Generate (create minimal files from scratch for repos that don’t have one yet).
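The filter's "if and only if" logic can be sketched in a few lines. This is an illustration, not the tool's actual code: the `LineReview` class and `apply_filter` function are hypothetical names, and the four yes/no answers are assumed to come from a human or model reviewer, since nothing here judges discoverability automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineReview:
    """One context-file line plus a reviewer's answers to the four questions."""
    text: str
    not_discoverable: bool
    actionable: bool
    prevents_silent_failure: bool
    broadly_applicable: bool

    def keep(self) -> bool:
        # "If and only if": a single failed question cuts the line.
        return all((
            self.not_discoverable,
            self.actionable,
            self.prevents_silent_failure,
            self.broadly_applicable,
        ))


def apply_filter(reviews: list[LineReview]) -> list[str]:
    """Return only the lines that pass all four questions, in original order."""
    return [review.text for review in reviews if review.keep()]


reviews = [
    LineReview("Use bun, never npm or yarn.", True, True, True, True),
    # Discoverable with a directory listing, so it fails question 1.
    LineReview("The API routes are in src/routes/.", False, False, False, True),
]

print(apply_filter(reviews))
```

Only the bun line survives here. The answers passed in are my own judgment calls for these two example lines.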
The Anti-Pattern Scoring System
The Audit workflow assigns a “bloat score” from 0 to 100 based on detected anti-patterns:
| Anti-Pattern | Weight | Why It Hurts |
|---|---|---|
| Codebase overview | +20 | Proven ineffective at helping navigation |
| Redundant with README/configs | +15 | Wastes tokens on discoverable info |
| Generic boilerplate | +15 | “Write clean code” applies to every repo |
| Linter-enforced style rules | +10 | Already handled by tooling |
| Architecture descriptions | +10 | Agents discover this by reading code |
| Non-actionable statements | +10 | Agent can’t act on “designed for scale” |
| Over 500 words | +10 | Longer files = more cost, not more success |
| Marketing language | +10 | “Best-in-class” helps nobody |
A score of 0 means perfectly minimal. Most files I’ve audited land between 40 and 70. Mine was higher.
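Since the eight weights in the table add up to 100, one plausible reading is that the score is simply the sum of the weights for whichever anti-patterns are detected. The sketch below assumes exactly that. The dictionary keys and function names are hypothetical, and detecting most of these patterns is left to the caller because it takes judgment. Only the word-count check is mechanical.

```python
# Weights copied from the table above; the key names are my own labels.
ANTI_PATTERN_WEIGHTS = {
    "codebase_overview": 20,
    "redundant_with_readme_or_configs": 15,
    "generic_boilerplate": 15,
    "linter_enforced_style_rules": 10,
    "architecture_descriptions": 10,
    "non_actionable_statements": 10,
    "over_500_words": 10,
    "marketing_language": 10,
}


def is_over_500_words(text: str) -> bool:
    """Mechanical check for the length anti-pattern (whitespace-split words)."""
    return len(text.split()) > 500


def bloat_score(detected: set[str]) -> int:
    """ASSUMPTION: the score is the plain sum of detected anti-pattern weights."""
    unknown = detected - set(ANTI_PATTERN_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown anti-patterns: {sorted(unknown)}")
    return sum(ANTI_PATTERN_WEIGHTS[name] for name in detected)


# Example: a file with an overview section that also runs long.
print(bloat_score({"codebase_overview", "over_500_words"}))
```

With every anti-pattern present the sum reaches 100, which matches the top of the scale.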
What a Good Context File Looks Like
After optimization, most context files shrink by 80% or more. Here’s the structure I now use:
# Project Name
## Critical Constraints
Use bun, never npm or yarn.
Never import from @internal/* in test files — causes silent CI failures.
## Testing
Run `bun test --bail -- src/` for unit tests.
Integration tests require REDIS_URL env var.
## Conventions
API routes use kebab-case: /api/user-profiles, not /api/userProfiles.
That’s it. Under 300 words. No overview. No architecture. No style guide that your linter already enforces. Every single line passes the 4-question filter.
What I Learned
Less is more, and the data proves it. Every line in your context file costs tokens and compliance overhead. The paper measured this directly — agents spend more reasoning tokens when context files are present, not because the problems are harder, but because the agent is working harder to follow your instructions.
The 4-question filter works everywhere. CLAUDE.md, AGENTS.md, .cursorrules — the principle holds: only include what can’t be discovered, what’s actionable, what prevents silent failure, and what applies broadly.
If your README already says it, your CLAUDE.md shouldn’t. The paper showed that LLM-generated context files are highly redundant with existing documentation. They only become helpful when all other docs are removed — which never happens in a real repository. Stop duplicating information across files.
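A crude way to spot that duplication is to flag context-file lines that also appear in the README once markdown markers and casing are stripped. This sketch is my own and not something from the paper. The helper names are hypothetical, and exact matching after normalization will miss paraphrases, so treat the output as a starting list rather than a verdict.

```python
import re


def normalize(line: str) -> str:
    """Drop a leading heading/quote/list marker and backticks, then lowercase."""
    line = re.sub(r"^\s*(?:[#>*+-]+|\d+\.)\s*", "", line)
    line = line.replace("`", "")
    return " ".join(line.lower().split())


def redundant_lines(context_text: str, readme_text: str) -> list[str]:
    """Return context-file lines whose normalized form also appears in the README."""
    readme_lines = {normalize(line) for line in readme_text.splitlines()}
    readme_lines.discard("")  # blank lines should never count as a match
    return [
        line for line in context_text.splitlines()
        if normalize(line) and normalize(line) in readme_lines
    ]


# Hypothetical file contents for illustration.
context_md = (
    "# Project\n"
    "Use bun, never npm.\n"
    "Run `bun install` to install dependencies.\n"
)
readme_md = (
    "## Setup\n"
    "- Run bun install to install dependencies.\n"
)

print(redundant_lines(context_md, readme_md))
```

In this example only the install instruction is flagged, because the README already carries it.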
The full paper is at arxiv.org/abs/2602.11988. If you maintain context files in any of your repositories, it’s worth the read. The ContextOptimizer tool is available as a Claude Code skill — just say “audit my CLAUDE.md” and it’ll tell you exactly what to cut.