Claude Code MCP Server: Building Persistent Memory

Everyone’s building AI agents. Almost nobody is building memory for them.

The default Claude Code experience is this: you open a session, you do great work, you close the session, and it’s gone. No Claude Code MCP server ships with the product to fix this. Next morning, you open a new session and explain the same project structure, the same deployment rules, the same “don’t push to production without checking the allowlist” that you’ve explained every day this week. Claude is brilliant. Claude is also an amnesiac.

At one project, that’s annoying. Across a live client portfolio, it’s a wall. I was burning the first ten minutes of every session on logistics that the system already knew and forgot. Same overview. Same rules. Same warnings. The AI equivalent of training a new hire every morning.

So I stopped accepting the default and built a custom Claude Code MCP server with persistent memory. What started as a quick fix turned into the core of how I work. This is context engineering in practice: building the infrastructure that makes AI reliable, rather than hoping better prompts will fix it.

01What MCP Is (And What It Isn’t)

MCP (Model Context Protocol) lets you give Claude tools it doesn’t ship with. You run a server, Claude connects to it, your server exposes capabilities that Claude calls during a session. Anthropic’s docs cover setup.

This post isn’t about the plumbing. It’s about what you build once the plumbing works, and why the interesting problems start after “hello world.”

02What a Claude Code MCP Server Actually Needs (And Why)

My server gives Claude four things it doesn’t have by default: persistent memory, context condensation, delegated file reading, and compliance checking. I didn’t design any of this upfront. Something broke, I fixed it, and the fix became a feature.

Memory That Survives Between Sessions

This came first, because the pain was loudest. I needed Claude to remember things across sessions: project configurations, platform quirks I’d spent hours debugging, deployment rules that came from three broken deploys and a near-miss on a production org.

The server indexes all of that into a vector database. Thousands of knowledge chunks, searchable by meaning and by keyword. When Claude starts a session, the first thing it does is search the memory server. The rule is simple: check what you already know before guessing.

I use hybrid search. Vector similarity finds conceptually related content. Keyword search catches exact terms. Neither alone is reliable, and I learned that the hard way. Semantic-only search kept returning adjacent results that missed the specific command or config value I needed. Adding keyword matching fixed the retrieval quality problems, but only after weeks of wondering why search felt “close but wrong.”

Condensation (The Problem Nobody Warns You About)

When your memory server works too well, it returns too much.

One operation was returning over 200,000 characters of raw project context. That payload literally couldn’t fit in the tool response. Claude would choke before reading a single result. Your memory server becomes a liability the moment it knows more than the context window can hold.

The fix was a condenser. Results pass through a lighter model before reaching Claude. That model reads the full output and returns a distilled summary. Two hundred thousand characters compress down to a few thousand. Claude gets the answer without the bloat.

If you’re building an MCP server and you don’t have a condensation layer, you’ll hit this wall the moment your knowledge base grows past a few hundred entries. I know because I ran without one for weeks and couldn’t figure out why sessions were getting slower and dumber. The condenser was the fourth thing I built. It should have been the first.

Delegated Reading (Keep the Context Window Clean)

Claude’s context window is finite. Every file it reads directly consumes capacity that could be used for reasoning. Big file loads are expensive, and the cost isn’t dollars. It’s degraded output quality three tool calls later, because the window is stuffed with a 2,000-line config file that Claude only needed two lines from.

So I built a reader. A lighter model scans the file and answers specific questions, returning cited answers with line numbers. Claude asks “what are the deployment rules for this project?” and gets back a sourced answer without loading the entire document.

Same principle for writing. Mechanical work (session logs, documentation updates, structured captures) gets delegated to a cheaper model. Claude focuses on reasoning. The formatting happens elsewhere. You don’t pay senior rates for data entry.

Compliance Checking (Because Prompts Drift)

This one came from the most painful failure. I needed Claude to validate proposed actions against rules before executing. Not a prompt instruction. Not “please remember to check the allowlist.” Prompts get compressed. Prompts get forgotten. A prompt is a suggestion. A gate is a wall.

The server accepts a proposed action, checks it against predefined rules, and returns pass or fail. The difference between asking someone to remember a checklist and bolting that checklist to the door so they can’t walk through without completing it.

If you’ve ever told an AI “don’t do X” and then watched it do X forty-five minutes later after a long conversation, you understand why mechanical enforcement exists. The model didn’t disobey. It forgot. Forgetting and disobeying look identical from the outside, but only one of them is fixable with infrastructure.

03Session Loading: How Much Context Is Too Much?

Once you have persistent memory, you face a new question. How much do you load at session start?

Load everything, and you burn half the context window on background knowledge before the session begins. Load nothing, and you’re back to square one.

I built a tiered loader. One call returns exactly what Claude needs. First tier: core rules, security protocols, workflow constraints. Always loaded, always lean. Second tier: project-specific context. Only loaded when relevant. Both tiers pass through the condenser before returning, so the loaded context is measured in thousands of characters, not hundreds of thousands.

Claude starts every session knowing the rules, the recent project activity, and what happened yesterday. One call. Under a second. No re-explaining.

04How Claude Code Loses Memory Mid-Session (And How to Fix It)

This is the problem nobody talks about, and it’s the one that will cost you the most debugging time.

Claude Code compresses your conversation when the context window fills up. Older messages get summarized. In theory, this is efficient. What that means in practice: the deployment rules you loaded at session start can silently vanish mid-session. The behavioral constraints? Gone. The project state? Compressed into a summary that may or may not preserve what matters.

My server detects this. When Claude calls the session loader a second time in the same session, the server includes a recovery hint: the most recently active project. Claude reloads the relevant context surgically. Not the full knowledge base. Just what the current task needs.

Before this existed, long sessions would silently lose their constraints around the two-hour mark. I wouldn’t notice until Claude deployed a metadata package to the wrong org because the deploy rules from session start had been compressed away. The failures were quiet. That’s what made them expensive.

05What Persistent Memory Changes for Claude Code Workflows

Before the memory server, every session started with ten minutes of setup. Reading project files, re-establishing context, reminding Claude which org belongs to which project. Creative time wasted on logistics.

After: one call. Rules load, project context loads, recent activity loads. I start working immediately.

But the real win is compounding. Every session generates learnings. Deployment patterns that worked. API gotchas that burned an hour. Platform quirks that only surface in production. Those learnings get indexed automatically. The next session starts with that knowledge already searchable. The session after that starts with even more.

I co-founded Aether Global Technology, a Salesforce consulting partner in Manila. The memory server runs alongside that work as a personal R&D system. It doesn’t touch client data. What it does is compound operational knowledge across projects and platforms, so Claude rarely encounters a problem it hasn’t seen a version of before.

Mistakes get encoded so they don’t repeat. The memory server doesn’t make Claude smarter. It makes Claude less likely to be stupid in the same way twice.

06What I’d Build Differently

I over-indexed on features and under-indexed on condensation. The memory server had rich search, tiered loading, and compliance checking before it had a condenser. That meant every search returned massive payloads that burned through the context window. If I were starting over, condensation would be the first thing I built, not the fourth.

I’d also start with a smaller embedding model. My instinct was to use the most capable sentence transformer I could find. The difference in search quality between models was marginal. The difference in startup time and memory footprint was not. A lighter model that loads in seconds would have saved weeks of debugging cold-start problems on a machine that was already running six other services.

And I’d design for compression survival from day one, not bolt it on after losing context mid-session three times. That pattern is now the part of the system I trust most, but it didn’t need to take three incidents to build.

07Frequently Asked Questions

What is a Claude Code MCP server?+

An MCP (Model Context Protocol) server is a custom backend that gives Claude Code capabilities it doesn’t have out of the box. You run the server, Claude connects to it, and your server exposes tools that Claude can call during a session. A memory-focused MCP server specifically solves the problem of Claude forgetting everything between sessions by providing persistent, searchable knowledge storage.

Does Claude Code remember between sessions?+

Not by default. Claude Code starts every session fresh. CLAUDE.md files provide some static context, but they don’t scale past a single project. A custom MCP server with a vector database and session loading gives Claude persistent memory across sessions, so it knows your project rules, past learnings, and recent activity without you re-explaining every time.

What is context compression in Claude Code?+

When your conversation with Claude Code fills the context window, older messages get summarized to make room. This is called context compression. The problem is that rules, constraints, and project state loaded at session start can silently disappear during compression. Without a recovery mechanism, Claude forgets its guardrails mid-session.

How do I add persistent memory to Claude Code?+

Build an MCP server that indexes your knowledge into a vector database, then expose search and retrieval as MCP tools. Claude calls these tools at session start to load context. Add a condensation layer so large results don’t overflow the context window. The key insight is that memory alone isn’t enough. You need condensation, tiered loading, and compression recovery to make it work at scale.

08Try It Yourself

I’m working on a lightweight version of this memory server. Stripped to the core: vector search, session loading, and basic condensation. Enough to give Claude Code persistent memory without the full production infrastructure. Follow my GitHub for updates.

If you want something you can use today, I open-sourced the pre-action gate pattern. Mechanical enforcement that blocks your AI agent from executing before checking the rules. Zero dependencies. Works with Claude and Gemini.

github.com/tomtokitajr/ai-agent-gates

Tom Tokita is co-founder of Aether Global Technology and builds AI operations systems in Manila. He writes about what works in production.