Building a Semantic Memory Layer for Multi-Agent AI Systems

A technical deep dive into how we replaced grep with pgvector, Ollama, and semantic search for the OpenClaw agent platform.


The Problem Nobody Talks About

Everyone's excited about AI agents. Autonomous systems that research, write, code, and collaborate — the future of how work gets done. But there's a dirty secret: most AI agents have the memory of a goldfish.

They're smart in the moment. Ask them something in context and they're brilliant. But the next day? The conversation they had with you last week about that API credential, the design decision made in Tuesday's standup, the debugging session that finally cracked the authentication bug — gone. Every session starts fresh. Every agent is an amnesiac genius.

We hit this wall hard with OpenClaw, our multi-agent platform running four specialised agents: Mario, Peach, Toad, and Zelda. Each agent had a MEMORY.md file — a curated log of important things they'd learned. But as the system grew, two problems emerged. First, the files got unwieldy and had to be periodically pruned, discarding knowledge that had felt important when it was written. Second, even when the information was there, the agents could only find it if they searched for the exact right words.

Search for "API keys" and miss the entry that says "credentials for the backend service." Search for "contact limits" and miss "5 contacts on the free tier." The knowledge existed. The agents just couldn't reach it.

We fixed this by building a semantic memory layer — a system where agents find information by meaning, not by keyword. Here's how we did it, and what you can learn from it.


What "Semantic Search" Actually Means

Before we get into the implementation, it's worth understanding the core idea — because it's genuinely clever and not as complicated as it sounds.

Traditional search is syntactic. It looks for matching strings. If your query and your document use different words for the same concept, you get nothing.

Semantic search converts text into numbers — specifically, a long list of floating-point numbers called an embedding. These numbers capture the meaning of the text in mathematical form. Text with similar meaning produces similar number sequences, regardless of the specific words used. You can then measure how "close" two pieces of text are by comparing their number sequences — a technique called cosine similarity.

| Search Type | Input | Finds | Misses |
|---|---|---|---|
| Keyword | "API keys" | "API keys" | "credentials", "tokens", "authentication" |
| Semantic | "API keys" | "credentials", "tokens", "secret access codes" | Nothing relevant |
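The comparison at the heart of this is plain arithmetic. Here is a minimal sketch of cosine similarity over toy 3-dimensional vectors; real nomic-embed-text embeddings have 768 dimensions, and the numbers below are purely illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means same direction (same meaning); near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values, 3 dims instead of 768)
api_keys    = [0.9, 0.1, 0.0]
credentials = [0.8, 0.2, 0.1]
weather     = [0.0, 0.1, 0.9]

print(cosine_similarity(api_keys, credentials))  # high: related meaning
print(cosine_similarity(api_keys, weather))      # low: unrelated meaning
```

The same idea scales unchanged to 768 dimensions; the database just does this comparison for you, with an index.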

The model we used — nomic-embed-text — converts any text into a list of 768 numbers. Those numbers become the agent's memory fingerprint. Store them in a database that supports vector comparisons, and suddenly you have memory that works the way human memory works: by association and meaning rather than exact recall.


The Architecture: Simple by Design

The system has two layers that work together, and the key design decision was keeping them separate.

┌─────────────────────────────────────────────────────────┐
│                    AGENT WORKSPACE                      │
│                                                         │
│   MEMORY.md (full content)   memory/*.md (full content) │
│         │                            │                  │
│         └────────────┬───────────────┘                  │
│                      │  sync every 2h                   │
│                      ▼                                  │
│   ┌─────────────────────────────────────────────────┐   │
│   │            VECTOR DATABASE (pgvector)           │   │
│   │  • Redacted content (secrets removed)           │   │
│   │  • 768-dim embeddings (nomic-embed-text)        │   │
│   │  • Cosine similarity index                      │   │
│   └──────────────────────┬──────────────────────────┘   │
│                          │  semantic search             │
│                          ▼                              │
│   ┌─────────────────────────────────────────────────┐   │
│   │         QUERY RESULTS (ranked by meaning)       │   │
│   │  • Agent name   • Source file path              │   │
│   │  • Content preview   • Similarity score         │   │
│   └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Layer 1 — Local files: The source of truth. Full content including credentials, API keys, and sensitive data. Agents read these directly at startup and for detailed lookups. Fast, always available, never leaves the machine.

Layer 2 — Vector database: The search index. Redacted copies of memory content, converted to embeddings and stored in pgvector. Agents query this when they need to find something. Returns pointers to the relevant local files, not the secrets themselves.

The elegance here is in what the vector DB doesn't do. It's not a replacement for local files. It's a smart index over them. If the database goes down, agents still work — they just fall back to keyword search. The memory never disappears; it just becomes harder to query.


How This Fits With QMD Memory

OpenClaw already has a memory management system called QMD — a structured approach where agents actively curate their MEMORY.md files, deciding what's worth keeping, consolidating duplicates, and retiring stale entries. If you're already using QMD, you might wonder whether this vector system is redundant. It's not. They solve different halves of the same problem.

Think of it this way: QMD is the writer, the vector DB is the librarian.

QMD focuses on quality — it ensures memory files stay clean, well-structured, and relevant. It's an active curation process that keeps agent memory from becoming a dumping ground. But even perfectly curated files don't help if an agent can't find the right entry when they need it. That's the gap vector search fills.

| | QMD Memory | Vector Memory |
|---|---|---|
| Role | Curates and maintains memory files | Makes those files searchable by meaning |
| Strength | Memory quality and structure | Retrieval across large, distributed memories |
| Scope | Per-agent | Cross-agent — all four agents share one index |
| What happens to old memories | Retired or pruned to keep files lean | Permanently indexed — nothing is ever unreachable |

The two systems actually reinforce each other. QMD-maintained files are high-quality inputs for the vector index — well-structured, deduplicated, and meaningful. In return, the vector DB means QMD can prune aggressively without fear of losing knowledge permanently. The pruned content is still indexed and searchable; it just lives in the DB rather than cluttering the active memory file.

There's also the cross-agent dimension that QMD alone can't address. QMD operates within a single agent's workspace. Mario's QMD has no visibility into what Zelda's QMD has curated. The vector DB breaks that silo — a single semantic search spans all four agents simultaneously, surfacing relevant knowledge regardless of which agent originally captured it.

The pipeline looks like this:

QMD curates MEMORY.md files (quality, structure, relevance)
        ↓
Vector sync indexes those files every 2 hours (searchability, scale)
        ↓
Agents query by meaning, get pointed back to QMD-maintained files
        ↓
Nothing is ever pruned into oblivion — it's just moved to the index

If you're building a multi-agent system and already have a memory curation approach, don't replace it with vector search — layer vector search on top of it. The curation keeps signal-to-noise high. The semantic index makes that signal findable.


The Components

pgvector: Vectors Inside PostgreSQL

Rather than spinning up a specialised vector database, we extended our existing PostgreSQL instance with the pgvector extension. This keeps the infrastructure footprint small and leverages a database we already trust.

-- Vector column stores 768-dimensional embeddings
CREATE TABLE memories (
    id            SERIAL PRIMARY KEY,
    agent_name    VARCHAR(50) NOT NULL,
    source_file   VARCHAR(255),
    content       TEXT NOT NULL,
    metadata      JSONB DEFAULT '{}',
    embedding     VECTOR(768),  -- The magic happens here
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- IVFFlat index for approximate nearest neighbour search
CREATE INDEX idx_memories_embedding
    ON memories USING ivfflat (embedding vector_cosine_ops);

The <=> operator computes cosine distance between two vectors. Distance 0 means identical meaning; distance 2 means opposite. We convert to a similarity score (1 minus the distance) so higher numbers mean better matches.
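In application code, that conversion is a single subtraction wrapped around the query. A sketch of the shape of SQL we run, with the distance-to-similarity helper alongside it (the parameter style assumes a psycopg-like driver; column names match the schema above):

```python
def to_similarity(cosine_distance: float) -> float:
    """Flip pgvector's cosine distance into a score where higher = better."""
    return 1.0 - cosine_distance

# Order by raw distance (index-friendly), report the flipped similarity.
NEAREST_MEMORIES_SQL = """
SELECT agent_name, source_file, content,
       1 - (embedding <=> %(query_vec)s::vector) AS similarity
FROM memories
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(k)s
"""

print(to_similarity(0.0))  # 1.0: identical meaning
print(to_similarity(2.0))  # -1.0: opposite meaning
```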

Ollama: Embeddings That Stay Local

To generate embeddings, we use Ollama running nomic-embed-text locally. No cloud API calls, no per-token costs, no data leaving the machine.

# Pull the model once (275MB, MIT license)
ollama pull nomic-embed-text

# Every piece of text becomes a 768-float vector
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "API credentials for service"}'

Local embeddings also mean determinism: the same text always produces the same vector, which matters for deduplication and cache invalidation.

The Sync Pipeline: How Memories Get Indexed

from pathlib import Path

import ollama  # pip install ollama (client for the local Ollama server)

def sync_memory_file(filepath: Path, agent_name: str):
    """One file's journey to the vector DB"""

    # 1. Read original content (with secrets)
    content = filepath.read_text()

    # 2. Redact secrets before they touch the DB
    content_redacted = redact_secrets(content)
    # sk-...         → [REDACTED-OPENAI-KEY]
    # plane_api_...  → [REDACTED-PLANE-TOKEN]

    # 3. Convert to embedding with the local model
    embedding = ollama.embeddings(
        model="nomic-embed-text", prompt=content_redacted
    )["embedding"]
    # Returns: [0.319, -0.299, -3.875, ...] (768 floats)

    # 4. Store in pgvector
    db.execute("""
        INSERT INTO memories (agent_name, source_file, content, embedding)
        VALUES (%s, %s, %s, %s)
    """, (agent_name, str(filepath), content_redacted, embedding))

The sync script runs every 2 hours via a systemd timer and tracks content hashes to avoid re-indexing files that haven't changed.
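The change detection is deliberately simple. A sketch of the hash check, assuming a set of previously seen hashes loaded from the DB (the helper names here are ours, not the actual script's):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable 16-hex-char fingerprint of a memory file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def needs_sync(text: str, seen_hashes: set[str]) -> bool:
    """True only when the content changed since the last sync run."""
    h = content_hash(text)
    if h in seen_hashes:
        return False  # unchanged: skip embedding and insert entirely
    seen_hashes.add(h)
    return True

seen: set[str] = set()
print(needs_sync("## RevenueCat API ...", seen))  # True: first time seen
print(needs_sync("## RevenueCat API ...", seen))  # False: unchanged
```

Skipping unchanged files matters because embedding is the expensive step; hashing a file is effectively free by comparison.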


Security: Keeping Secrets Out of the Index

This was non-negotiable. Agent memory files contain real credentials — API tokens, database passwords, service keys. None of that could touch the vector database.

import re

SECRET_PATTERNS = [
    (r'github_pat_[a-zA-Z0-9]{22}_[a-zA-Z0-9]{59}', '[REDACTED-GITHUB-TOKEN]'),
    (r'plane_api_[a-f0-9]{64}',                       '[REDACTED-PLANE-TOKEN]'),
    (r'sk-[a-zA-Z0-9_-]{40,}',                        '[REDACTED-OPENAI-KEY]'),
    (r'sk_[a-zA-Z0-9]{32}',                           '[REDACTED-REVENUECAT-KEY]'),
    (r'password["\']?\s*[:=]\s*["\']?[^\s"\']{8,}',   '[REDACTED-PASSWORD]'),
]

def redact_secrets(text: str) -> str:
    redacted = text
    for pattern, replacement in SECRET_PATTERNS:
        redacted = re.sub(pattern, replacement, redacted, flags=re.IGNORECASE)
    return redacted

| Layer | Contains Secrets? | Use Case |
|---|---|---|
| Local files | ✅ Yes | Agent startup, credential access |
| Vector DB | ❌ No (redacted) | Semantic search, cross-agent discovery |

The flow is: agent needs to find something → queries vector DB (redacted) → gets back a file path → reads the local file for the full details including credentials. The vector DB tells agents where to look, never what the secret is.
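A toy end-to-end version of that flow (the file path, the entry, and the "sk_live_example123" value are all made up for illustration; the real index is pgvector, not a hand-built dict):

```python
from pathlib import Path
import tempfile

# Step 0: a local memory file holding the real secret
# (path and secret value are invented for this example).
workspace = Path(tempfile.mkdtemp())
memory_file = workspace / "MEMORY.md"
memory_file.write_text("## Backend service\ncredential: sk_live_example123\n")

# Step 1: what a vector-DB hit looks like: a redacted preview plus a pointer.
hit = {
    "source_file": str(memory_file),
    "preview": "## Backend service\ncredential: [REDACTED-KEY]",
    "similarity": 0.85,
}

# Step 2: follow the pointer; the secret only ever comes from local disk.
full_entry = Path(hit["source_file"]).read_text()
print("sk_live_example123" in full_entry)   # True
print("sk_live" in hit["preview"])          # False
```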

Every indexed entry also carries a metadata audit trail:

{
  "hash":      "b44b0fec40a8d638",
  "redacted":  true,
  "file_size": 3642,
  "synced_at": "2026-03-12T21:51:35"
}

How Agents Actually Use It

From an agent's perspective, the interface is simple. Search your own memories, or search across the whole team:

from memory_tool import memory_search, search_cross_agent, format_results

# Search own memories
results = memory_search("RevenueCat API credentials")
print(format_results(results))
# [mario] workspace-mario/MEMORY.md  |  Similarity: 0.85
# ## RevenueCat API (LumenDoc) ...

# Search all agents' memories
results = search_cross_agent("kickoff meeting")
# Returns: Zelda's notes, Peach's summary, Toad's action items

And from the command line:

# Search your own memories
python3 memory_tool.py "API keys"

# Search across all agents
python3 memory_tool.py "Plane migration" --cross

# Target a specific agent
python3 memory_tool.py "design decisions" --agent peach

The fallback is built in — if the vector DB returns nothing, it falls back to local file grep. Agents always get an answer.

def memory_search(query, fallback=True):
    results = search_vector_db(query)
    if not results and fallback:
        return search_local_files(query)  # Safety net
    return results

Deployment

Docker

postgres-vector:
  image: pgvector/pgvector:pg16
  environment:
    POSTGRES_DB:       agent_memory
    POSTGRES_USER:     memory
    POSTGRES_PASSWORD: ${DB_PASSWORD}
  ports:
    - "5435:5432"   # Separate port from the app DB on 5432
  volumes:
    - agent_memory_data:/var/lib/postgresql/data

Systemd Timer

# /etc/systemd/system/agent-memory-sync.timer
[Unit]
Description=Agent Memory Sync — every 2 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=2h
Persistent=true

[Install]
WantedBy=timers.target

Five minutes after boot, then every two hours. If a sync is missed (machine was off), Persistent=true catches it up on next boot.


What We Learned: The Honest Version

It works well. But here's what we'd tell anyone building the same thing:

Redaction patterns need to be exhaustive. Our first pass missed table-formatted credentials — keys stored in Markdown tables rather than inline. Regex alone isn't enough; you need to test against the actual shape of your data, not what you assume the shape is.

Not everything should be indexed. "Store every conversation" sounds appealing until you realise you're indexing routine task logs, weather checks, and small talk alongside genuinely important decisions. Retrieval quality degrades with noise. We index curated memory files and daily summaries — not raw chat transcripts.

Token budget matters. Each agent query fetches relevant chunks and injects them into context. At 5 chunks × 500 tokens each, that's 2,500 tokens of memory overhead per call. Fine for most models, but worth designing chunk sizes and retrieval limits deliberately from the start rather than optimising later.
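That overhead is worth making explicit in code rather than discovering it later. A trivial budget helper; the 500-token chunk size and 5-chunk limit are our numbers, not universal defaults:

```python
def memory_overhead(num_chunks: int, tokens_per_chunk: int) -> int:
    """Tokens of context consumed by injected memory per agent call."""
    return num_chunks * tokens_per_chunk

def max_chunks(budget_tokens: int, tokens_per_chunk: int) -> int:
    """Largest retrieval limit that fits inside a fixed memory budget."""
    return budget_tokens // tokens_per_chunk

print(memory_overhead(5, 500))  # 2500 tokens per call
print(max_chunks(2500, 500))    # 5 chunks
```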

The 2-hour sync lag is rarely a problem. We expected agents to complain about it. They don't. Critical information is always in local files and immediately accessible. The vector DB is for finding things you forgot you knew, not for real-time coordination.


The Results

We went from 4 agents with siloed, grep-searchable memory files to a unified semantic memory layer with over 1,000 memories indexed across the team. Knowledge that would have been pruned and lost is now permanently searchable. Agents regularly surface information from weeks ago that directly answers current questions — without being told to look for it.

The cross-agent capability turned out to be the most valuable part. Mario pulling a relevant note from Zelda's memory without Zelda being in the conversation — that's institutional knowledge working the way it's supposed to.


What's Next

Memory decay: Old memories will eventually clutter results. We'll add time-decay weighting so recent entries score higher, fading older ones without deleting them.
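One way we might implement that weighting is to multiply similarity by an exponential recency factor, so old entries fade rather than vanish. A sketch; the 90-day half-life is a knob we have not tuned yet:

```python
def decayed_score(similarity: float, age_days: float,
                  half_life_days: float = 90.0) -> float:
    """Similarity damped by age: the weight halves every half_life_days."""
    return similarity * 0.5 ** (age_days / half_life_days)

print(decayed_score(0.8, 0))    # 0.8: brand-new memory, full weight
print(decayed_score(0.8, 90))   # 0.4: one half-life old
print(decayed_score(0.8, 180))  # 0.2: two half-lives old
```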

Conversation indexing: Currently only curated memory files are indexed. The next step is summarising raw conversations before embedding — so even informal exchanges get captured.

Proactive surfacing: When Mario starts working on an API integration, the system should surface Peach's notes on a similar problem without being asked. Agents recommending memories to each other based on current task context.


Build It Yourself

If you're running your own agent system and hitting the same memory walls, the stack is straightforward: PostgreSQL with the pgvector extension, Ollama running nomic-embed-text, and a sync script on a timer.

The total model size is 275MB. The sync script is under 200 lines of Python. If you already have PostgreSQL running, you're most of the way there.

The hard part isn't the technology — it's deciding what's worth remembering. Start with your curated memory files, prove retrieval quality works, then expand from there.


Built by the OpenClaw team. Deployed 2026-03-12.