“26% accuracy boost.”
This number shows up in Mem0’s fundraising pitch deck, TechCrunch coverage, AWS partnership announcements, and the GitHub README. It supports a company’s $24M Series A, 41K GitHub stars, and its position as the exclusive memory partner for AWS Strands SDK.
But how was that 26% actually calculated? Compared to what? Under what conditions?
This article takes that number apart, piece by piece.
The Two-Stage Pipeline: Elegant Simplicity
Let’s start with Mem0’s core mechanism — how memory gets “written.”
In the previous article, I gave Mem0’s extraction pipeline a one-line summary. This time we’re opening up every step.
Stage 1: Extraction (extracting candidate memories)
After each conversation, Mem0 doesn’t just throw the entire dialogue at an LLM for “summarization.” Instead, it carefully assembles three context sources, then has the LLM extract facts worth remembering:
| Context Source | Content | Why It’s Needed |
|---|---|---|
| Latest Exchange | The most recent user-assistant exchange | Freshest information source |
| Rolling Summary | A rolling summary of the entire conversation | Provides global context, prevents out-of-context extraction |
| Recent Window | The last m messages (sliding window) | Preserves short-term memory detail |
The Rolling Summary updates asynchronously in the background — it doesn’t block inference. This is a solid engineering decision: the summary doesn’t need to be precise in real-time; lagging a few turns behind is fine.
Stage 2: Update (deciding how to update the memory store)
Once candidate facts are extracted, each fact is compared against the top-s most similar memories in the vector DB. The LLM then decides on one of four operations via tool call:
| Operation | When | Example |
|---|---|---|
| ADD | Brand new fact, nothing similar in memory store | ”User works at Google” |
| UPDATE | Similar memory exists but can be enriched | ”Likes cricket” → “Likes playing cricket with friends” |
| DELETE | New fact contradicts existing memory | ”Moved to SF” → Delete “Lives in New York” |
| NOOP | Already memorized, or not important | Duplicate information |
The full pipeline looks like this:
Conversation ends
↓
[Extraction LLM] ← Latest Exchange + Rolling Summary + Recent Window
↓ Candidate facts
[Vector Search] ← Each fact vs memory store top-s
↓ Candidate facts + similar memories
[Update LLM] → Tool Call: ADD / UPDATE / DELETE / NOOP
↓
Vector DB (+ optional Graph DB)
Sr/Staff Perspective: Why Use an LLM for Operation Classification?
This is worth pausing on. The four operations (ADD/UPDATE/DELETE/NOOP) look perfect for a rule-based system — say, set a cosine similarity threshold: above 0.95 = NOOP, 0.7-0.95 = UPDATE, below 0.7 = ADD.
But Mem0 chose an LLM. Why?
Because UPDATE and DELETE require semantic understanding. “User moved to SF” and “User lives in New York” might have high cosine similarity (both about residence), but the semantic relationship is contradiction, not complementary — it should be DELETE, not UPDATE. A rule-based system struggles to distinguish “complementary” from “contradictory.”
What’s the trade-off? Each conversation’s ingestion requires two LLM calls. 50 conversation rounds = 100 LLM calls, just for memory management. That’s a non-trivial cost, which is why the cost analysis later matters.
LOCOMO Benchmark Teardown: How That 26% Was Actually Calculated
Mem0’s flagship 26% comes from the LOCOMO (Long-term COnversational MeMOry) benchmark. Let’s first understand what this benchmark actually tests.
Benchmark design:
- 10 long conversations, each ~600 dialogue turns, ~26,000 tokens
- Each conversation comes with ~200 test questions with ground-truth answers
- 4 question types: single-hop, multi-hop, temporal, open-domain
- Scoring: LLM-as-a-Judge (J score) — an LLM compares answers against ground truth
Key results:
| System | Overall J Score | Notes |
|---|---|---|
| Full Context (brute force) | ~73% | No memory system at all |
| Mem0ᵍ (Graph Memory) | ~68.4% | Mem0’s best configuration |
| Mem0 (Base) | ~66.9% | Pure vector version |
| Zep (as reported by Mem0’s paper) | ~65.99% | Later disputed |
| Best RAG | ~60.97% | Chunk-based retrieval |
| OpenAI Memory | ~52.9% | ChatGPT’s built-in memory |
Here’s how that 26% is calculated:
Mem0 Base’s 66.9% vs OpenAI Memory’s 52.9%.
(66.9 - 52.9) / 52.9 ≈ 26.5% relative improvement.
Not a 26 percentage point absolute difference. It’s the relative improvement rate.
This isn’t cheating in academic papers — using relative improvement is common practice. But in marketing materials, it easily creates the illusion that “accuracy went up by 26%.”
The More Awkward Part
Full Context scored ~73%, higher than Mem0’s 66.9%.
What is Full Context? No memory system at all — just stuff the entire 26K token conversation history into the context window. A brute force solution with zero cleverness.
But it won.
This isn’t to say Mem0 is useless — Full Context has a p95 latency of 17.12 seconds and token cost of ~26K per query. Mem0 delivers 1.44 seconds latency and ~1.8K tokens. The efficiency advantage is enormous, but accuracy falls short of “just stuff everything in.”
This raises a core question: LOCOMO’s 26K tokens fit easily within modern LLM context windows. GPT-4 Turbo 128K, Claude 200K, Gemini 2M — 26K tokens is nothing. Does this benchmark actually test “memory” capability? Or is it testing “finding answers within 26K tokens”?
Breakdown by Question Type
| Type | Mem0 | Mem0ᵍ | Full Context | Observation |
|---|---|---|---|---|
| Single-hop | Best | Slightly lower | High | Vector search’s sweet spot |
| Multi-hop | 51.15 | Even lower | — | Requires reasoning, all systems struggle |
| Temporal | 58.13 | Slightly better | — | The only category where Graph wins |
| Open-domain | Good | Best | — | Graph’s structural advantage |
Notable: Mem0ᵍ actually performs worse than base Mem0 on single-hop. The paper itself acknowledges this — adding graph actually introduces noise from graph traversal on simple queries.
The Benchmark War: Zep Strikes Back
Academic benchmarks are usually quiet number comparisons. But Mem0’s paper kicked a hornet’s nest — it included Zep as a comparison baseline, and Zep’s numbers didn’t look great.
Zep wasn’t having it.
Zep’s Accusations
Zep’s founder published a blog post titled “Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory?” — quoting Mark Twain, full of gunpowder.
Core accusations:
- Incorrect Zep implementation: Mem0’s paper ran Zep’s search sequentially when it should have been concurrent, artificially inflating latency
- Deflated scores: Zep ran the same benchmark themselves and got 75.14%, far above the 65.99% Mem0’s paper reported
- Inconsistent system prompts: Mem0’s paper modified the system prompt and retrieval template, deviating from the standard configuration used for other baselines
Zep’s conclusion: if implemented correctly, Zep not only beats Mem0 but also beats Full Context.
Mem0’s Counter-Strike
Mem0 co-founder Deshraj Yadav fired back on GitHub Issue #5, with an equally combative title: “Revisiting Zep’s 84% LoCoMo Claim: Corrected Evaluation & 58.44% Accuracy.”
Key finding: Zep included the adversarial category in their J score calculation. The LOCOMO benchmark’s standard practice is to only calculate Categories 1-4 (single-hop, multi-hop, temporal, open-domain), excluding adversarial.
Corrected numbers:
| Version | J Score |
|---|---|
| Zep as reported in Mem0’s paper | 65.99% ± 0.16 |
| Zep’s self-reported claim | 75.14% ± 0.17 |
| Mem0’s corrected Zep score | 58.44% ± 0.20 |
From 75.14% to 58.44% — the gap comes down to whether the adversarial category is included.
My Verdict
Both sides have problems.
- Zep including the adversarial category is a methodological error. This is clear-cut.
- But whether Mem0’s Zep implementation was correct is hard for outsiders to verify. Sequential vs concurrent search genuinely affects performance.
- The more fundamental issue is the benchmark itself. LOCOMO’s 26K tokens doesn’t qualify as “long-term memory” by 2026 context window standards. Full Context brute-forcing scores ~73%, which shows the benchmark lacks discriminative power.
The real takeaway: don’t pick tools based on benchmark numbers. Two companies accusing each other of cheating ultimately proves one thing — benchmarks are only valid for benchmarks. Your scenario isn’t LOCOMO. Your conversation length, question types, and latency requirements are all different.
Mem0ᵍ: Graph Memory’s Promise vs Reality
This is the part I most wanted to take apart. In the previous article, the Graph-based school’s core claim was: memory isn’t a pile of isolated facts — it’s a structured knowledge network. Mem0ᵍ is the concrete implementation of that claim.
Architecture
Mem0ᵍ adds a Knowledge Graph layer on top of the base pipeline:
- Entity Extraction: LLM extracts entities from conversations (names, places, concepts)
- Relationship Generation: Another LLM evaluates relationships between entity pairs, labeling them (
lives_in,prefers,works_at) - Storage: Entities stored as Nodes, Relations as Edges, written to Graph DB (Neo4j, Memgraph, Neptune, etc.)
- Conflict Resolution: When new info contradicts old relationships, old facts are marked as invalidated rather than deleted — preserving the timeline
Retrieval is dual-path:
- Entity-centric: Analyze which entities the query involves, pull related nodes and their neighbors from the graph
- Semantic triplet: Simultaneously use vector search for similar triplets
Results from both paths are merged and provided to the LLM for answer generation.
The Awkward Numbers
The theory is beautiful. But the data?
Mem0ᵍ only beats base Mem0 by about 2%. (68.4% vs 66.9%)
The per-category breakdown is even more awkward:
| Type | Mem0 Base | Mem0ᵍ | Difference |
|---|---|---|---|
| Single-hop | Higher | Lower | Graph actually adds noise |
| Multi-hop | 51.15 | Even lower | The scenario Graph should excel at — worse instead |
| Temporal | Average | Slightly better | The only clear win for Graph |
| Open-domain | Good | Best | Structured knowledge helps |
Multi-hop performing worse is hard to accept. The whole point of Knowledge Graphs is connecting cross-entity relationships, enabling questions like “A knows B, B works at company C, C is in city D.” Yet Mem0ᵍ actually regresses here.
The paper’s own words: “graph memory yields marginal performance drop on single-hop queries.” But it doesn’t deeply explain why multi-hop also fails to improve meaningfully.
Extended Reflection
Remember a finding from the previous article? In Letta’s research, an Agent using basic filesystem operations (74.0%) beat an Agent using Mem0 graph memory (68.5%).
Two independent research results pointing to the same conclusion: Graph Memory’s ROI is far below expectations. Adding Graph DB, Graph Extraction, Graph Traversal — a pile of extra complexity and latency — ultimately yields only ~2% improvement, and even regresses in some scenarios.
This doesn’t mean Knowledge Graphs are useless. Zep’s Graphiti genuinely has an edge in temporal reasoning because it was designed for timelines from day one (bi-temporal model). The problem is that Mem0ᵍ’s graph is “bolted on” — the base pipeline was already working fine, and the graph is bolt-on rather than built-in.
From Paper to Production
Academic data reviewed. Back to the questions engineers care about most: if you’re actually going to use Mem0, how do you deploy? What pitfalls? What costs?
Two Deployment Modes
| Mode | Pros | Cons |
|---|---|---|
| Managed (api.mem0.ai) | Zero ops, compliance, scale | Data residency, opaque pricing |
| Self-hosted (Docker) | Data sovereignty, customizable | You maintain the infra |
Self-hosted architecture is three Docker containers: FastAPI (API server) + PostgreSQL/pgvector (vector store) + Neo4j (graph store, optional).
Security Pitfall: Pay Attention
The self-hosted default configuration has two problems:
- No authentication: The API server ships with zero auth
- CORS:
allow_origins=["*"]: Any origin can call it
This isn’t unique to Mem0 — many open-source projects default to “developer convenience.” But if you go straight to production without adding a reverse proxy + auth, your memory store is open to the entire internet.
AWS Strands Integration
Mem0 is the official memory provider for AWS Strands Agents SDK. Integration is straightforward:
from strands import Agent
from strands.models import BedrockModel
from mem0 import MemoryClient
# Initialize Mem0
memory = MemoryClient(api_key="m0-xxx")
# Create Agent
model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0")
agent = Agent(model=model)
# Store memories after conversation
response = agent("I prefer dark mode and use Kotlin")
memory.add(response, user_id="user-123")
# Load memories before next conversation
memories = memory.search("user preferences", user_id="user-123")
If you’re already in the AWS ecosystem (Bedrock, Lambda, Fargate), Mem0 is the lowest-friction memory solution.
Cost Model
This is the part many people overlook. Mem0’s memory management isn’t free:
Each conversation = 1 Extraction LLM call + 1 Update LLM call = 2 LLM calls
50 conversation rounds = 100 LLM calls (memory management alone, not counting the conversation itself)
Add vector search, graph operations (if using Mem0ᵍ), and embedding generation — memory management costs could account for 20-30% of your Agent’s total cost.
Compare this with Observational Memory (Mastra)‘s “compression + prompt cache” approach: Mem0’s per-conversation cost is significantly higher. But Mem0’s cross-session capability is something Mastra doesn’t have — different scenarios, different trade-offs.
v2.4.0 New Features (2026-03)
| Feature | Purpose |
|---|---|
structured_data_schema | Specify structured output format for memories |
immutable memories | Memories that can’t be UPDATEd or DELETEd |
| Async mode | Asynchronous writes, doesn’t block inference |
filter_memories | Filter specific memories during search |
| Memory export | Export the memory store |
immutable memories is an interesting feature — certain core facts (user identity, compliance requirements) that you don’t want the LLM to accidentally delete or modify.
Conclusion: My Selection Recommendations
After studying Mem0’s paper, the benchmark controversy, and production realities, here’s my conclusion:
Mem0’s real value isn’t Graph Memory — that ROI is too low. The real value is the base pipeline’s elegant simplicity.
The two-stage pipeline (Extraction → Update) is a clean design. It breaks “what to remember” into explicit steps, each debuggable, each with swappable LLMs, each with tunable parameters. Compared to Letta’s Agent self-management (black box) or Mastra’s compression (lossy), Mem0’s pipeline is the easiest to understand and control.
Specific recommendations:
| Your Scenario | Recommendation | Rationale |
|---|---|---|
| Simple personalization | Mem0 base (skip graph) | Simple, sufficient; graph’s +2% isn’t worth the extra complexity |
| Temporal reasoning | Zep / Graphiti | Zep’s bi-temporal model is purpose-built for temporal reasoning |
| AWS ecosystem | Mem0 | Official Strands partnership, lowest friction |
| Cost sensitive, single session | Mastra Observational | Zero infra + prompt cache = lowest cost |
| Cross-session + audit | Zep managed | Temporal KG + compliance support |
The most important takeaway:
Don’t pick tools based on benchmark numbers.
26% is Mem0 vs OpenAI Memory’s relative improvement on LOCOMO, scored by LLM-as-a-Judge. Zep says they actually scored highest, Mem0 says Zep’s math is wrong. Both accuse the other of cheating. Full Context brute force actually outscores both.
Look at your scenario. How long are your conversations? Do you need cross-session? What’s your latency tolerance? What’s your budget? Answer those questions, and the answer emerges naturally.
The perfect memory system doesn’t exist. But “ship it, then iterate” is the best engineering strategy — far more useful than agonizing over benchmark numbers.
References
- Mem0: Memory Layer for AI Agents (arXiv:2504.19413) — Mem0 core paper, source of the 26% accuracy boost data
- Zep: Is Mem0 Really SOTA in Agent Memory? — Zep’s challenge to Mem0’s benchmark
- Revisiting Zep’s 84% LoCoMo Claim (GitHub Issue #5) — Mem0’s counter to Zep
- AWS + Mem0 Strands Partnership — AWS official partnership announcement
- Mem0 Raises $24M Series A (TechCrunch) — Funding coverage
- Mem0 Official Docs — Technical documentation
- Zep: A Temporal Knowledge Graph Architecture (arXiv:2501.13956) — Zep/Graphiti paper
- Letta: Benchmarking AI Agent Memory — “Filesystem beats Graph Memory” research
This is part of the “AI Agent Architecture in Practice” series. Previous: 2026 AI Agent Memory Wars: Three Schools of Thought Go Head-to-Head. Stay tuned for the next article.