Implementing the GCC Paper: Giving AI Agents Persistent, Structured Memory

The 30th re-explanation of the ARCS Health Portal’s architecture to a fresh Claude Code session was the one that broke me. I’d burned 4,000 tokens — words the AI reads and I pay for — describing the same database schema, the same auth flow, the same branch I was on yesterday, to an agent that arrived with zero memory of any of it. Not the bug I’d spent an hour debugging at 2 AM. Not the design decision that took three sessions to settle. Nothing.

Every new agent session starts blank. That’s the design. In my experience, it’s also the biggest drain on productivity and budget in AI-assisted coding.

So when I found GCC: Git-Context-Controller by Junde Wu at Oxford, the paper’s framing felt like someone had been reading my terminal logs. Agent memory shouldn’t be a flat text file you dump into a prompt. It should be a version-controlled codebase — branches, commits, merges, and retrieval at different zoom levels, exactly like git. I spent a week building it from scratch. The results surprised me.

The Core Idea — Memory as a Repo

The GCC paper (arXiv 2508.00031) defines four operations an AI agent can call during its own reasoning — modeled directly on git:

Command	When to Call	What It Does
`COMMIT`	After a coherent milestone	Checkpoints progress with a three-block narrative summary
`BRANCH`	Before exploring an alternative	Creates an isolated workspace for experiments
`MERGE`	When an experiment succeeds	Synthesizes branch results back into the main trajectory
`CONTEXT`	To orient or resume work	Retrieves memory at multiple resolutions

The architecture: agent memory lives in a .GCC/ directory. main.md holds the global roadmap. Each branch gets its own commit.md (milestone summaries), log.md (fine-grained traces of what the agent observed, thought, and did), and metadata.yaml (file structure, dependencies, configs). The paper reports that agents equipped with GCC resolved 48% of SWE-Bench-Lite tasks, the best result published at the time and ahead of 26 competing systems.
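To make the layout concrete, here is a minimal sketch of bootstrapping that directory. It is not the gcc-memory code: the function name init_gcc and the branches/ subfolder are assumptions for illustration, and only the four file names come from the description above.

```python
from pathlib import Path


def init_gcc(root: Path, branch: str = "main") -> Path:
    """Create a .GCC/ skeleton: one global main.md plus per-branch files.

    Assumption: branches live under .GCC/branches/<name>/ (hypothetical layout).
    """
    gcc = root / ".GCC"
    branch_dir = gcc / "branches" / branch
    branch_dir.mkdir(parents=True, exist_ok=True)

    # Global roadmap shared by every branch.
    (gcc / "main.md").touch(exist_ok=True)

    # Per-branch memory: milestones, fine-grained traces, structural metadata.
    for name in ("commit.md", "log.md", "metadata.yaml"):
        (branch_dir / name).touch(exist_ok=True)
    return branch_dir
```

Calling init_gcc(Path(".")) twice is harmless, since every step tolerates files that already exist.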

What I Built: gcc-memory

gcc-memory is my open-source implementation. ~2,600 lines of Python in four layers:

src/gcc_memory/
├── store.py       # 757 LOC — core storage engine
├── cli.py         # 447 LOC — Typer CLI (commit, branch, merge, context)
├── utils.py       # Atomic writes, file locks, timestamps
├── server.py      # HTTP + WebSocket for real-time streaming
└── adapters.py    # Codex/Claude/OpenCode transcript parsers

integrations/claude/
├── gcc_memory_observe.py   # UserPromptSubmit hook → observations
├── gcc_memory_stop.py      # Stop hook → thoughts
├── gcc_memory_sync.py      # PostToolUse hook → actions
└── hook_common.py          # Shared: debounce, dynamic import, trimming

scripts/
├── backfill_history.py     # Mine 800+ session transcripts into events
└── run_backfill.sh         # uv-backed runner

The Three-Block Commit

The paper’s signature move is the three-block commit. Every commit captures three things:

Branch Purpose — why this branch exists at all (anchors intent so future sessions know what you were trying to do)
Previous Progress Summary — a compressed chain of all prior summaries on this branch
This Commit’s Contribution — what actually changed in this milestone

### Commit: Implement JWT auth (2026-02-18T10:30:00+00:00 | main)

**Branch Purpose:** Full-stack authentication system

**Previous Progress Summary:** Set up Express server with route structure.
Added PostgreSQL connection pool with migration system.

**This Commit's Contribution:**
Replaced session cookies with JWT tokens. Simplifies the API gateway
and enables stateless horizontal scaling. Validated with integration
tests covering token refresh, expiry, and revocation.
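An entry in that shape can be rendered with a few lines of string formatting. This is a sketch rather than the actual store.py code; render_commit and its parameter names are hypothetical, and only the three block labels and the header format are taken from the example above.

```python
from datetime import datetime, timezone
from typing import Optional


def render_commit(title: str, branch: str, purpose: str, previous: str,
                  contribution: str, when: Optional[datetime] = None) -> str:
    """Format one three-block commit entry as Markdown text."""
    when = when or datetime.now(timezone.utc)
    stamp = when.isoformat(timespec="seconds")
    return (
        f"### Commit: {title} ({stamp} | {branch})\n\n"
        f"**Branch Purpose:** {purpose}\n\n"
        f"**Previous Progress Summary:** {previous}\n\n"
        f"**This Commit's Contribution:**\n{contribution}\n"
    )


entry = render_commit(
    title="Implement JWT auth",
    branch="main",
    purpose="Full-stack authentication system",
    previous="Set up Express server with route structure.",
    contribution="Replaced session cookies with JWT tokens.",
    when=datetime(2026, 2, 18, 10, 30, tzinfo=timezone.utc),
)
```

Appending entry to the branch’s commit.md gives the next session the purpose, the history, and the latest change in one read.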

The secret sauce is _synthesize_progress() — it chains previous summaries with a 1,500-character ceiling, so N commits compress into a fixed-size window. After 50 commits, you still get a coherent summary that fits in a few hundred tokens. The memory doesn’t grow with the project.
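The capping idea is easy to sketch. The version below is an assumption about how such a function could work, not the real _synthesize_progress(): it folds the latest contribution into the running summary and, when the result exceeds the ceiling, keeps the most recent text and drops the oldest. The trimming strategy and the names are mine.

```python
MAX_CHARS = 1500  # the ceiling quoted above


def synthesize_progress(previous_summary: str, last_contribution: str,
                        limit: int = MAX_CHARS) -> str:
    """Chain the prior summary with the latest contribution under a hard cap."""
    parts = [p.strip() for p in (previous_summary, last_contribution) if p.strip()]
    combined = " ".join(parts)
    if len(combined) <= limit:
        return combined
    # Assumed policy: newest text wins; a leading marker shows older text was cut.
    marker = "... "
    return marker + combined[-(limit - len(marker)):]


# Simulate a long-lived branch: the summary never exceeds the cap.
summary = ""
for i in range(200):
    summary = synthesize_progress(summary, f"Commit {i} did something useful.")
```

A real implementation would probably cut at sentence boundaries or ask the agent to re-summarize, but the size guarantee is the same.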

Three Hooks, Three Channels

The paper specifies Observation-Thought-Action (OTA) traces. I capture all three through Claude Code’s hook system — lifecycle triggers that fire at specific moments in every agent session:

Hook	Event Type	Channel	What It Captures
`UserPromptSubmit`	Observation	`claude-hook`	User’s request (the “what”)
`Stop`	Thought	`claude-hook`	Agent’s reasoning (the “why”)
`PostToolUse`	Action	`claude-hook`	Tool execution (the “how”)
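The mapping in the table can be sketched as a small function that turns a hook payload into an event. Claude Code passes hook input as JSON on stdin; I am assuming the payload carries hook_event_name plus prompt or tool_name, and the event field names (timestamp, channel, type, summary) are hypothetical, not the gcc-memory schema.

```python
import json
from datetime import datetime, timezone

# Hook name -> OTA event type, as in the table above.
EVENT_TYPES = {
    "UserPromptSubmit": "observation",
    "Stop": "thought",
    "PostToolUse": "action",
}


def to_event(hook_payload: dict) -> dict:
    """Convert one hook payload into a flat, timestamped event record."""
    name = hook_payload.get("hook_event_name", "")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "channel": "claude-hook",
        "type": EVENT_TYPES.get(name, "unknown"),
        # Assumed payload keys: "prompt" for user input, "tool_name" for tools.
        "summary": hook_payload.get("prompt") or hook_payload.get("tool_name") or "",
    }


def append_event(log_path: str, event: dict) -> None:
    """Append the event as one JSON line to the branch log."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
```

A hook script would then call json.load(sys.stdin), pass the result to to_event, and append it to the active branch’s log.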

The PostToolUse hook is the richest. Instead of just logging “bash” as a tool name, it builds enriched summaries:

# Instead of: "bash"
# We get: "migrate database schema (exit 0)"
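def _build_enriched_summary(tool_name, payload):
    # Key names below are assumed from the hook payload shape.
    tool_input = payload.get("tool_input", {})
    result_obj = payload.get("tool_response", {})
    if tool_name == "bash":
        desc = tool_input.get("description", "")
        cmd = tool_input.get("command", "")
        exit_code = result_obj.get("exit_code", "")
        return f"{desc} (exit {exit_code})" if desc else cmd[:120]
    return tool_name  # other tools fall back to the bare name in this excerpt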

Two filters keep logs from drowning in noise: debouncing (a 3-second window merges rapid-fire duplicate events from back-to-back tool calls) and terse-response filtering (skip anything under 60 characters — “Done.”, “OK.”). Together they cut log noise by ~70% while losing almost nothing of value.
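Both filters fit in one small class. This is a sketch under my own assumptions rather than the hook_common.py code: duplicates are matched on the exact summary string, the 3-second window restarts on every repeat, and the 60-character floor applies only to agent responses.

```python
DEBOUNCE_SECONDS = 3.0
MIN_RESPONSE_CHARS = 60


class EventFilter:
    """Drop rapid-fire duplicate events and terse agent responses."""

    def __init__(self) -> None:
        self._last_seen = {}  # summary text -> time it was last seen

    def accept(self, summary: str, now: float, is_response: bool = False) -> bool:
        # Terse-response filter: "Done.", "OK." and similar carry no memory value.
        if is_response and len(summary.strip()) < MIN_RESPONSE_CHARS:
            return False
        last = self._last_seen.get(summary)
        # Record every sighting, so a steady burst stays suppressed.
        self._last_seen[summary] = now
        if last is not None and now - last < DEBOUNCE_SECONDS:
            return False
        return True


f = EventFilter()
results = [
    f.accept("run tests (exit 0)", now=0.0),
    f.accept("run tests (exit 0)", now=1.0),   # duplicate inside the window
    f.accept("run tests (exit 0)", now=5.0),   # window has passed
    f.accept("Done.", now=6.0, is_response=True),
]
```

Passing now explicitly, instead of calling time.time() inside, keeps the filter easy to test.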

Auto-Commit as Safety Net

Every 300 seconds of continuous tool activity, the PostToolUse hook triggers an auto-commit. But that’s the backup plan. The real value comes from agent-driven narrative commits — the skill I wrote explicitly tells the agent: “Auto-commit is a fallback; your narrative commits and curated summaries are what make this memory useful to future sessions.”
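The fallback timer needs very little state. The sketch below assumes the check runs on each PostToolUse event and measures time since the last commit of any kind; the class and method names are hypothetical.

```python
AUTO_COMMIT_SECONDS = 300


class AutoCommitTimer:
    """Track when the last commit happened and decide if a fallback is due."""

    def __init__(self, now: float) -> None:
        self.last_commit = now

    def note_commit(self, now: float) -> None:
        # Call after ANY commit, narrative or automatic, to reset the clock.
        self.last_commit = now

    def should_auto_commit(self, now: float) -> bool:
        return now - self.last_commit >= AUTO_COMMIT_SECONDS


timer = AutoCommitTimer(now=0.0)
early = timer.should_auto_commit(now=120.0)
due = timer.should_auto_commit(now=300.0)
timer.note_commit(now=300.0)
after_reset = timer.should_auto_commit(now=400.0)
```

Because a narrative commit resets the same clock, an agent that commits regularly never triggers the fallback.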

The Hardest Part: Mining the Past

The storage engine was straightforward. Making the system useful for projects that already had months of history — that was the real problem.

My first attempt used ~/.codex/history.jsonl, thinking that was the source of truth. It only contains user prompts. No agent reasoning. No tool calls. No record of which files changed. Memory built from prompts alone was nearly useless — like reconstructing a conversation when you only have one side of it.

I almost gave up there.

The breakthrough: Claude Code stores full session transcripts at ~/.claude/projects/{project}/*.jsonl. Each transcript is the complete conversation — user messages, assistant reasoning blocks, and every tool call with its inputs and outputs. I wrote a parser that mines these directly.


Three hooks, one OTA channel. The implementation wires three Claude Code lifecycle hooks (UserPromptSubmit, Stop, and PostToolUse) to capture the full Observation-Thought-Action trace. Each hook appends a timestamped JSON event to log.md. A 3-second debounce window collapses burst-fire calls (e.g. rapid PostToolUse events from a multi-step tool) into one log entry. Combined with terse-response filtering, this cuts log noise by roughly 70% with negligible information loss.

Commit compression that actually scales. The _synthesize_progress() function chains the text of previous commit summaries up to a 1,500-character cap, then writes the capped text into the new commit’s “Previous Progress Summary” block. This means you can have 200 commits and the context cost of loading history stays fixed; it doesn’t grow with project age.

Mining full transcripts, not just prompts. Claude Code stores complete session transcripts (user turns, assistant reasoning blocks, tool calls) as newline-delimited JSON under ~/.claude/projects/{project-slug}/. The backfill parser reads these directly:

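# backfill_history.py: mine full reasoning, not just prompts
# `records` is the list of parsed JSONL lines for one session;
# extract_text() and summarize() are helpers not shown in this excerpt.
user_texts, reasoning_parts, tool_calls = [], [], []
files_changed = set()
for record in records:
    if record.get("type") == "user":
        user_texts.append(extract_text(record))
    elif record.get("type") == "assistant":
        content = record.get("message", {}).get("content", [])
        if isinstance(content, str):  # treat plain-string content as text
            content = [{"type": "text", "text": content}]
        for block in content:
            if block.get("type") == "text":
                reasoning_parts.append(block["text"])
            elif block.get("type") == "tool_use":
                tool_calls.append(summarize(block))
                if block["name"] in ("Edit", "Write"):
                    files_changed.add(block["input"]["file_path"])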

Try it. Point the backfill script at any project slug and watch structured commits materialize from months of history:

python scripts/backfill_history.py --project your-project-slug --output .GCC/

The payoff is immediate: instead of a flat list of user prompts, you get OTA-structured commits where the agent’s reasoning — what it tried, what files it changed, why — is preserved alongside the action.

For the ARCS Health Portal — 36 days of development — the backfill mined 655 Claude sessions and 733 Codex prompts, producing commits like:

2026-01-19 (37 sessions)

[16:37] Implement batch lab upload feature
  Reasoning: Let me start by reading the specification
  Files changed: lab_upload.py, lab_upload_store.py, name_extractor.py
[18:24] Let user mark invalid form history and lab results
  Reasoning: Let me look at the data stores and template
  Files changed: ehr.py, filled_form_store.py, patient_detail.html

Prompts-only was a ghost town. This is a living record.

Closing the Gap with the Paper

After the initial implementation, I ran a systematic comparison against the paper. Not everything matched. Here’s what was missing and how I fixed it:

Paper Requirement	Initial State	Fix
Git commit on COMMIT/MERGE	Not implemented	Added `--git` flag
MERGE calls CONTEXT on target first	Missing	Added `context_branch()` call before merge
BRANCH initializes commit.md	Empty file	Writes initial entry with Branch Purpose
main.md has milestones + to-do list	Only Purpose/Decisions/Questions	Added Milestones and To-Do sections
Per-file responsibilities in metadata	Path list only	Documented as optional (paper says “manually added”)

The git integration was the biggest miss. The paper is explicit: COMMIT “finalizes the memory and code changes as a Git commit, using the agent-authored summary as the commit message.” Now gcc-memory commit --git does exactly that — stages all changes and creates a real git commit alongside the GCC commit. The agent’s own words become the commit message.
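Under the hood, a flag like this comes down to two git invocations. The sketch below is not the gcc-memory source: the function names are hypothetical, and I am assuming “stages all changes” means git add -A.

```python
import subprocess
from typing import List


def git_commit_commands(message: str) -> List[List[str]]:
    """Build the git invocations: stage everything, then commit."""
    return [
        ["git", "add", "-A"],
        ["git", "commit", "-m", message],
    ]


def commit_with_git(message: str, repo_dir: str = ".") -> None:
    """Run the commands in repo_dir, raising if either one fails.

    Note: `git commit` exits non-zero when there is nothing to commit,
    so callers may want to catch subprocess.CalledProcessError.
    """
    for cmd in git_commit_commands(message):
        subprocess.run(cmd, cwd=repo_dir, check=True)


commands = git_commit_commands("Implement JWT auth")
```

Passing the message as a list element, rather than through a shell string, means quotes and newlines in the agent’s summary need no escaping.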

What I’d Do Differently

File-based storage scales further than you’d think. For a single workspace with 1–3 agents, Markdown + YAML with file locks is simple, auditable, and enough. Every event is visible in log.md. Every commit is human-readable in commit.md. No database to migrate, no server to keep alive.
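For reference, the two primitives that make file-based storage safe are small. This is a sketch of the general technique, not the utils.py code: atomic_write and file_lock are hypothetical names, and the lock is a simple lock file created with O_EXCL rather than an OS-level advisory lock.

```python
import os
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path


def atomic_write(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


@contextmanager
def file_lock(path: Path, timeout: float = 5.0, poll: float = 0.05):
    """Hold <path>.lock for the duration of the block, or raise TimeoutError."""
    lock = path.with_name(path.name + ".lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock)
```

One caveat of the lock-file approach: a process that crashes while holding the lock leaves the file behind, so a production version would need stale-lock cleanup.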

Structure events for the future you, not the present you. Recording observation/thought/action on every event felt like over-engineering in the moment. I almost skipped it. But when I needed to build the backfill system later, having a consistent OTA schema made it possible to reconstruct structured memories from raw transcripts. Events you don’t structure now are events you can’t reconstruct later.

Agent curation beats automation. This one hurt. My first approach was to auto-generate everything — summaries, highlights, main.md updates — fully automatic. The result was technically correct and completely useless. It read like a robot summarizing another robot. The breakthrough was treating agents as curators: the skill tells them when and how to update main.md, but they write the actual content. The quality difference is not subtle.

Mine transcripts, not prompts. My biggest wrong turn by far. Spent days convinced history.jsonl was the right data source. Session transcripts — the full conversation, reasoning chain, file changes — are where the institutional knowledge actually lives. A user prompt alone tells you what was asked. The transcript tells you what was tried, what failed, and why.

Try It

gcc-memory is open source at github.com/RooseveltAdvisors/gcc-memory

git clone https://github.com/RooseveltAdvisors/gcc-memory
cd gcc-memory && bash install.sh

All credit for the GCC framework goes to Junde Wu’s paper. I just built an implementation and learned a lot along the way.