The Truth About 26%: Mem0's Paper, Benchmark Wars, and the Promise vs Reality of Graph Memory

“26% accuracy boost.”

This number shows up in Mem0’s fundraising pitch deck, TechCrunch coverage, AWS partnership announcements, and the GitHub README. It supports a company’s $24M Series A, 41K GitHub stars, and its position as the exclusive memory partner for AWS Strands SDK.

But how was that 26% actually calculated? Compared to what? Under what conditions?

This article takes that number apart, piece by piece.

The Two-Stage Pipeline: Elegant Simplicity

Let’s start with Mem0’s core mechanism — how memory gets “written.”

In the previous article, I gave Mem0’s extraction pipeline a one-line summary. This time we’re opening up every step.

Stage 1: Extraction (extracting candidate memories)

After each conversation, Mem0 doesn’t just throw the entire dialogue at an LLM for “summarization.” Instead, it carefully assembles three context sources, then has the LLM extract facts worth remembering:

Context Source	Content	Why It’s Needed
Latest Exchange	The most recent user-assistant exchange	Freshest information source
Rolling Summary	A rolling summary of the entire conversation	Provides global context, prevents out-of-context extraction
Recent Window	The last m messages (sliding window)	Preserves short-term memory detail

The Rolling Summary updates asynchronously in the background — it doesn’t block inference. This is a solid engineering decision: the summary doesn’t need to be precise in real-time; lagging a few turns behind is fine.

Stage 2: Update (deciding how to update the memory store)

Once candidate facts are extracted, each fact is compared against the top-s most similar memories in the vector DB. The LLM then decides on one of four operations via tool call:

Operation	When	Example
ADD	Brand new fact, nothing similar in memory store	”User works at Google”
UPDATE	Similar memory exists but can be enriched	”Likes cricket” → “Likes playing cricket with friends”
DELETE	New fact contradicts existing memory	”Moved to SF” → Delete “Lives in New York”
NOOP	Already memorized, or not important	Duplicate information

The full pipeline looks like this:

Conversation ends
  ↓
[Extraction LLM] ← Latest Exchange + Rolling Summary + Recent Window
  ↓ Candidate facts
[Vector Search] ← Each fact vs memory store top-s
  ↓ Candidate facts + similar memories
[Update LLM] → Tool Call: ADD / UPDATE / DELETE / NOOP
  ↓
Vector DB (+ optional Graph DB)

Sr/Staff Perspective: Why Use an LLM for Operation Classification?

This is worth pausing on. The four operations (ADD/UPDATE/DELETE/NOOP) look perfect for a rule-based system — say, set a cosine similarity threshold: above 0.95 = NOOP, 0.7-0.95 = UPDATE, below 0.7 = ADD.

But Mem0 chose an LLM. Why?

Because UPDATE and DELETE require semantic understanding. “User moved to SF” and “User lives in New York” might have high cosine similarity (both about residence), but the semantic relationship is contradiction, not complementary — it should be DELETE, not UPDATE. A rule-based system struggles to distinguish “complementary” from “contradictory.”

What’s the trade-off? Each conversation’s ingestion requires two LLM calls. 50 conversation rounds = 100 LLM calls, just for memory management. That’s a non-trivial cost, which is why the cost analysis later matters.

LOCOMO Benchmark Teardown: How That 26% Was Actually Calculated

Mem0’s flagship 26% comes from the LOCOMO (Long-term COnversational MeMOry) benchmark. Let’s first understand what this benchmark actually tests.

Benchmark design:

10 long conversations, each ~600 dialogue turns, ~26,000 tokens
Each conversation comes with ~200 test questions with ground-truth answers
4 question types: single-hop, multi-hop, temporal, open-domain
Scoring: LLM-as-a-Judge (J score) — an LLM compares answers against ground truth

Key results:

System	Overall J Score	Notes
Full Context (brute force)	~73%	No memory system at all
Mem0ᵍ (Graph Memory)	~68.4%	Mem0’s best configuration
Mem0 (Base)	~66.9%	Pure vector version
Zep (as reported by Mem0’s paper)	~65.99%	Later disputed
Best RAG	~60.97%	Chunk-based retrieval
OpenAI Memory	~52.9%	ChatGPT’s built-in memory

Here’s how that 26% is calculated:

Mem0 Base’s 66.9% vs OpenAI Memory’s 52.9%.

(66.9 - 52.9) / 52.9 ≈ 26.5% relative improvement.

Not a 26 percentage point absolute difference. It’s the relative improvement rate.

This isn’t cheating in academic papers — using relative improvement is common practice. But in marketing materials, it easily creates the illusion that “accuracy went up by 26%.”

The More Awkward Part

Full Context scored ~73%, higher than Mem0’s 66.9%.

What is Full Context? No memory system at all — just stuff the entire 26K token conversation history into the context window. A brute force solution with zero cleverness.

But it won.

This isn’t to say Mem0 is useless — Full Context has a p95 latency of 17.12 seconds and token cost of ~26K per query. Mem0 delivers 1.44 seconds latency and ~1.8K tokens. The efficiency advantage is enormous, but accuracy falls short of “just stuff everything in.”

This raises a core question: LOCOMO’s 26K tokens fit easily within modern LLM context windows. GPT-4 Turbo 128K, Claude 200K, Gemini 2M — 26K tokens is nothing. Does this benchmark actually test “memory” capability? Or is it testing “finding answers within 26K tokens”?

Breakdown by Question Type

Type	Mem0	Mem0ᵍ	Full Context	Observation
Single-hop	Best	Slightly lower	High	Vector search’s sweet spot
Multi-hop	51.15	Even lower	—	Requires reasoning, all systems struggle
Temporal	58.13	Slightly better	—	The only category where Graph wins
Open-domain	Good	Best	—	Graph’s structural advantage

Notable: Mem0ᵍ actually performs worse than base Mem0 on single-hop. The paper itself acknowledges this — adding graph actually introduces noise from graph traversal on simple queries.

The Benchmark War: Zep Strikes Back

Academic benchmarks are usually quiet number comparisons. But Mem0’s paper kicked a hornet’s nest — it included Zep as a comparison baseline, and Zep’s numbers didn’t look great.

Zep wasn’t having it.

Zep’s Accusations

Zep’s founder published a blog post titled “Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory?” — quoting Mark Twain, full of gunpowder.

Core accusations:

Incorrect Zep implementation: Mem0’s paper ran Zep’s search sequentially when it should have been concurrent, artificially inflating latency
Deflated scores: Zep ran the same benchmark themselves and got 75.14%, far above the 65.99% Mem0’s paper reported
Inconsistent system prompts: Mem0’s paper modified the system prompt and retrieval template, deviating from the standard configuration used for other baselines

Zep’s conclusion: if implemented correctly, Zep not only beats Mem0 but also beats Full Context.

Mem0’s Counter-Strike

Mem0 co-founder Deshraj Yadav fired back on GitHub Issue #5, with an equally combative title: “Revisiting Zep’s 84% LoCoMo Claim: Corrected Evaluation & 58.44% Accuracy.”

Key finding: Zep included the adversarial category in their J score calculation. The LOCOMO benchmark’s standard practice is to only calculate Categories 1-4 (single-hop, multi-hop, temporal, open-domain), excluding adversarial.

Corrected numbers:

Version	J Score
Zep as reported in Mem0’s paper	65.99% ± 0.16
Zep’s self-reported claim	75.14% ± 0.17
Mem0’s corrected Zep score	58.44% ± 0.20

From 75.14% to 58.44% — the gap comes down to whether the adversarial category is included.

My Verdict

Both sides have problems.

Zep including the adversarial category is a methodological error. This is clear-cut.
But whether Mem0’s Zep implementation was correct is hard for outsiders to verify. Sequential vs concurrent search genuinely affects performance.
The more fundamental issue is the benchmark itself. LOCOMO’s 26K tokens doesn’t qualify as “long-term memory” by 2026 context window standards. Full Context brute-forcing scores ~73%, which shows the benchmark lacks discriminative power.

The real takeaway: don’t pick tools based on benchmark numbers. Two companies accusing each other of cheating ultimately proves one thing — benchmarks are only valid for benchmarks. Your scenario isn’t LOCOMO. Your conversation length, question types, and latency requirements are all different.

Mem0ᵍ: Graph Memory’s Promise vs Reality

This is the part I most wanted to take apart. In the previous article, the Graph-based school’s core claim was: memory isn’t a pile of isolated facts — it’s a structured knowledge network. Mem0ᵍ is the concrete implementation of that claim.

Architecture

Mem0ᵍ adds a Knowledge Graph layer on top of the base pipeline:

Entity Extraction: LLM extracts entities from conversations (names, places, concepts)
Relationship Generation: Another LLM evaluates relationships between entity pairs, labeling them (lives_in, prefers, works_at)
Storage: Entities stored as Nodes, Relations as Edges, written to Graph DB (Neo4j, Memgraph, Neptune, etc.)
Conflict Resolution: When new info contradicts old relationships, old facts are marked as invalidated rather than deleted — preserving the timeline

Retrieval is dual-path:

Entity-centric: Analyze which entities the query involves, pull related nodes and their neighbors from the graph
Semantic triplet: Simultaneously use vector search for similar triplets

Results from both paths are merged and provided to the LLM for answer generation.

The Awkward Numbers

The theory is beautiful. But the data?

Mem0ᵍ only beats base Mem0 by about 2%. (68.4% vs 66.9%)

The per-category breakdown is even more awkward:

Type	Mem0 Base	Mem0ᵍ	Difference
Single-hop	Higher	Lower	Graph actually adds noise
Multi-hop	51.15	Even lower	The scenario Graph should excel at — worse instead
Temporal	Average	Slightly better	The only clear win for Graph
Open-domain	Good	Best	Structured knowledge helps

Multi-hop performing worse is hard to accept. The whole point of Knowledge Graphs is connecting cross-entity relationships, enabling questions like “A knows B, B works at company C, C is in city D.” Yet Mem0ᵍ actually regresses here.

The paper’s own words: “graph memory yields marginal performance drop on single-hop queries.” But it doesn’t deeply explain why multi-hop also fails to improve meaningfully.

Extended Reflection

Remember a finding from the previous article? In Letta’s research, an Agent using basic filesystem operations (74.0%) beat an Agent using Mem0 graph memory (68.5%).

Two independent research results pointing to the same conclusion: Graph Memory’s ROI is far below expectations. Adding Graph DB, Graph Extraction, Graph Traversal — a pile of extra complexity and latency — ultimately yields only ~2% improvement, and even regresses in some scenarios.

This doesn’t mean Knowledge Graphs are useless. Zep’s Graphiti genuinely has an edge in temporal reasoning because it was designed for timelines from day one (bi-temporal model). The problem is that Mem0ᵍ’s graph is “bolted on” — the base pipeline was already working fine, and the graph is bolt-on rather than built-in.

From Paper to Production

Academic data reviewed. Back to the questions engineers care about most: if you’re actually going to use Mem0, how do you deploy? What pitfalls? What costs?

Two Deployment Modes

Mode	Pros	Cons
Managed (api.mem0.ai)	Zero ops, compliance, scale	Data residency, opaque pricing
Self-hosted (Docker)	Data sovereignty, customizable	You maintain the infra

Self-hosted architecture is three Docker containers: FastAPI (API server) + PostgreSQL/pgvector (vector store) + Neo4j (graph store, optional).

Security Pitfall: Pay Attention

The self-hosted default configuration has two problems:

No authentication: The API server ships with zero auth
CORS: allow_origins=["*"]: Any origin can call it

This isn’t unique to Mem0 — many open-source projects default to “developer convenience.” But if you go straight to production without adding a reverse proxy + auth, your memory store is open to the entire internet.

AWS Strands Integration

Mem0 is the official memory provider for AWS Strands Agents SDK. Integration is straightforward:

from strands import Agent
from strands.models import BedrockModel
from mem0 import MemoryClient

# Initialize Mem0
memory = MemoryClient(api_key="m0-xxx")

# Create Agent
model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0")
agent = Agent(model=model)

# Store memories after conversation
response = agent("I prefer dark mode and use Kotlin")
memory.add(response, user_id="user-123")

# Load memories before next conversation
memories = memory.search("user preferences", user_id="user-123")

If you’re already in the AWS ecosystem (Bedrock, Lambda, Fargate), Mem0 is the lowest-friction memory solution.

Cost Model

This is the part many people overlook. Mem0’s memory management isn’t free:

Each conversation = 1 Extraction LLM call + 1 Update LLM call = 2 LLM calls

50 conversation rounds = 100 LLM calls (memory management alone, not counting the conversation itself)

Add vector search, graph operations (if using Mem0ᵍ), and embedding generation — memory management costs could account for 20-30% of your Agent’s total cost.

Compare this with Observational Memory (Mastra)‘s “compression + prompt cache” approach: Mem0’s per-conversation cost is significantly higher. But Mem0’s cross-session capability is something Mastra doesn’t have — different scenarios, different trade-offs.

v2.4.0 New Features (2026-03)

Feature	Purpose
`structured_data_schema`	Specify structured output format for memories
`immutable` memories	Memories that can’t be UPDATEd or DELETEd
Async mode	Asynchronous writes, doesn’t block inference
`filter_memories`	Filter specific memories during search
Memory export	Export the memory store

immutable memories is an interesting feature — certain core facts (user identity, compliance requirements) that you don’t want the LLM to accidentally delete or modify.

Conclusion: My Selection Recommendations

After studying Mem0’s paper, the benchmark controversy, and production realities, here’s my conclusion:

Mem0’s real value isn’t Graph Memory — that ROI is too low. The real value is the base pipeline’s elegant simplicity.

The two-stage pipeline (Extraction → Update) is a clean design. It breaks “what to remember” into explicit steps, each debuggable, each with swappable LLMs, each with tunable parameters. Compared to Letta’s Agent self-management (black box) or Mastra’s compression (lossy), Mem0’s pipeline is the easiest to understand and control.

Specific recommendations:

Your Scenario	Recommendation	Rationale
Simple personalization	Mem0 base (skip graph)	Simple, sufficient; graph’s +2% isn’t worth the extra complexity
Temporal reasoning	Zep / Graphiti	Zep’s bi-temporal model is purpose-built for temporal reasoning
AWS ecosystem	Mem0	Official Strands partnership, lowest friction
Cost sensitive, single session	Mastra Observational	Zero infra + prompt cache = lowest cost
Cross-session + audit	Zep managed	Temporal KG + compliance support

The most important takeaway:

Don’t pick tools based on benchmark numbers.

26% is Mem0 vs OpenAI Memory’s relative improvement on LOCOMO, scored by LLM-as-a-Judge. Zep says they actually scored highest, Mem0 says Zep’s math is wrong. Both accuse the other of cheating. Full Context brute force actually outscores both.

Look at your scenario. How long are your conversations? Do you need cross-session? What’s your latency tolerance? What’s your budget? Answer those questions, and the answer emerges naturally.

The perfect memory system doesn’t exist. But “ship it, then iterate” is the best engineering strategy — far more useful than agonizing over benchmark numbers.

References

Mem0: Memory Layer for AI Agents (arXiv:2504.19413) — Mem0 core paper, source of the 26% accuracy boost data
Zep: Is Mem0 Really SOTA in Agent Memory? — Zep’s challenge to Mem0’s benchmark
Revisiting Zep’s 84% LoCoMo Claim (GitHub Issue #5) — Mem0’s counter to Zep
AWS + Mem0 Strands Partnership — AWS official partnership announcement
Mem0 Raises $24M Series A (TechCrunch) — Funding coverage
Mem0 Official Docs — Technical documentation
Zep: A Temporal Knowledge Graph Architecture (arXiv:2501.13956) — Zep/Graphiti paper
Letta: Benchmarking AI Agent Memory — “Filesystem beats Graph Memory” research

This is part of the “AI Agent Architecture in Practice” series. Previous: 2026 AI Agent Memory Wars: Three Schools of Thought Go Head-to-Head. Stay tuned for the next article.

Git as an External Brain for Claude Code: Beyond MEMORY.md

2026 AI Agent Memory Wars: Three Architectures, Three Philosophies