This is a source-level analysis of eight agent memory systems: Letta (v0.16.4), Cognee (v0.5.2), Graphiti (v0.27.1), Tacnode (closed-source), Mem0 (v1.0.3), Hindsight (v0.4.11), EverMemOS (commit 1f2f083), and Hyperspell (YC F25, closed-source). For each open-source system, the analysis covers the full code path from ingestion through storage to retrieval, documenting where the implementation diverges from the published docs. For Tacnode and Hyperspell (closed-source), the analysis draws from their papers, docs, SDK source, and public claims — with the caveat that nothing is verifiable at the code level.

The space is genuinely diverse, with at least four fundamentally different bets on what agent memory should be. Some systems trust the LLM to manage everything (Mem0, Letta). Some build explicit knowledge structures and let the pipeline handle it (Cognee, Graphiti, Hindsight, EverMemOS). One — Tacnode — sidesteps the knowledge question entirely and bets that the real problem is data infrastructure: consistency, transactions, and multi-modal storage underneath whatever memory architecture you choose. And one — Hyperspell — sidesteps both knowledge and infrastructure to solve the data access problem: connecting to 43 live sources and getting the data in the first place. Those last two bets might be the most important ones in the long run.

There are some universal patterns: temporal handling is inconsistently implemented, nobody has combined strong infrastructure with strong knowledge construction, and the gap between what these systems do and what production agents actually need is larger than the docs suggest.

But the thing that kept surfacing throughout the analysis is a more basic problem: every one of these systems defines “memory” as some variation of fact extraction, storage, and retrieval — or in Hyperspell’s case, skips extraction entirely and calls document search “memory.” They differ in how they extract, what they store, and how they retrieve — but the underlying model is the same. Memory as a search index over past conversations. For agents that need to do more than answer questions about what happened — agents that plan, adapt, and maintain context over time — that model is not going to be enough.

The landscape in 30 seconds

If you’re new to this space: as AI agents have moved from stateless chatbots to long-running systems that remember things across conversations, a whole category of tools has emerged to manage that memory. They differ a lot in philosophy — some trust the LLM to handle everything, some build explicit knowledge structures, some focus on database-level guarantees, and one focuses on getting data from external sources into a searchable index. There’s no consensus yet on the right approach, and the design space runs along several axes that are underappreciated: how much to trust the model vs. enforce structure, whether to store memory as text or as richer representations, how much operational infrastructure to put around the memory itself, and where the data comes from in the first place.

| System | Approach | Entity extraction | Storage | Temporal model | Update semantics | Memory consolidation | Data connectivity |
|---|---|---|---|---|---|---|---|
| Mem0 | LLM-managed facts | None (LLM decides) | Vector (Qdrant) + optional graph | None | Overwrite in place | None | None |
| Letta | Agent-managed notebook | None (agent decides) | PostgreSQL + file system | Conversation timestamps | Versioned blocks (ACID) | memory_rethink (sleep-time agent) | None |
| Cognee | Knowledge graph pipeline | Typed entities + relations | Relational + graph + vector | None | Append-only graph | None | Document loaders (PDF, DOCX, images, audio) |
| Graphiti | Temporal knowledge graph | Typed edges, entity dedup | Neo4j | Bi-temporal (4 fields per edge) | Edge invalidation + replacement | None | None |
| Hindsight | Biomimetic typed facts | 4 fact types, causal links | PostgreSQL + pgvector | Temporal search + date augmentation | Consolidation system | observation facts (background) | None |
| EverMemOS | Multi-type taxonomy | 7 memory types | MongoDB + ES + Milvus + Redis | Episode boundaries | Sequential multi-backend writes | None | None |
| Tacnode | Database infrastructure | N/A (user-defined) | Custom multi-modal DB | Native time travel (23h default) | ACID transactions | N/A | None |
| Hyperspell | Data access layer | None | Managed search index | None | Continuous sync from sources | None | 43 OAuth integrations (Gmail, Slack, Notion, etc.) |

System-by-system: what each one does and how it thinks

Mem0 — The simplest architecture in the space

Philosophy: Let the LLM handle everything. Keep the infrastructure minimal.

This is the purest example of what you might call the minimal-structure approach to agent memory — maximum trust in the model’s judgment. Let the LLM figure out what to extract, how to reconcile, what to keep. The bet is that model capability is the binding constraint, not system architecture.

Mem0 has 44K GitHub stars and $24M in funding (as of February 2026). Given that footprint, the architecture is surprisingly lean — it’s the simplest system of the eight.

The core loop: when you call add(), Mem0 makes two LLM calls. The first extracts facts from your conversation (“user likes pizza,” “user works at Acme”). The second compares those facts against what’s already stored and decides what to add, update, delete, or ignore. That’s it. Facts go into a vector store (Qdrant by default), with an optional graph layer via Neo4j.
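
In code, the whole loop is a couple of calls. The sketch below follows Mem0's documented quickstart; exact defaults (Qdrant, OpenAI) and return shapes vary by version and configuration.

```python
from mem0 import Memory

m = Memory()  # default config; stock setups expect an OpenAI key and a local Qdrant

# add() triggers the two LLM calls: fact extraction, then reconciliation against the store
m.add(
    [{"role": "user", "content": "I just moved to Berlin and I work at Acme now."}],
    user_id="alice",
)

# Retrieval is a vector search over the stored facts
results = m.search("Where does the user work?", user_id="alice")
print(results)
```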

There’s no schema on the extracted facts. No structural validation. No provenance chain from a fact back to the conversation that produced it. Updates overwrite the previous version in the vector store — there’s a separate SQLite changelog, but it’s not integrated into the retrieval path.

What Mem0 gets right is adoption friction. It supports 19 vector stores, 15 LLM providers, 10 embedding providers. You can have something running in minutes. That’s a real achievement — the integrations list alone represents a lot of engineering work.

The tradeoff is that every quality guarantee depends on LLM judgment. Whether this matters depends partly on how good the models get. If extraction and reconciliation become near-perfect, Mem0’s architecture is prescient — the minimal infrastructure stays out of the way of improving models. If they stay noisy, the lack of structural guardrails becomes the bottleneck. It’s a genuine bet on a trajectory, not a shortcut.

Hyperspell — The fan-in gateway

Philosophy: The problem is data access. Connect everything, normalize it, make it searchable. The knowledge layer is someone else’s job.

Hyperspell (YC F25, 3-person team) is a managed platform with 43 OAuth data source integrations — Gmail, Slack, Notion, Google Drive, Salesforce, GitHub, Jira, and dozens more. You connect your accounts; Hyperspell handles OAuth, token refresh, and continuous sync, and indexes everything behind a unified search API. The integration breadth is unmatched in this analysis. Cognee has document loaders (PDF, DOCX, images, audio), which is more than anyone else — but those are passive file parsers, not live connectors. You still have to get the file to Cognee. Hyperspell reaches out and pulls data continuously from your connected accounts. No other system does that.

If this sounds familiar, it should. Segment solved the same fan-in problem for analytics: connect all your data sources, normalize the schema, pipe it to whatever downstream system needs it. Hyperspell is solving it for agent context.

The search configuration is thoughtful: per-source filters let you scope by Gmail labels, Slack channels, Notion pages. Cross-source search with per-source weighting handles the “find this across my whole workspace” use case. Multi-tenancy is strong — scoped user tokens with configurable expiry and CSRF protection, safe for frontend embedding. API testing confirmed that memories added via a user token are invisible to app-level searches. It’s the strongest multi-tenancy story of any system in this analysis. There’s also a feedback/evaluation API — unique among the eight systems — that lets you score query results and individual chunks, though what happens with that feedback server-side is opaque.

What makes Hyperspell interesting in this analysis is that it’s solving a genuinely different problem than the other seven systems. Every other system assumes you already have data and asks “how do I turn it into knowledge?” Hyperspell asks “how do I get the data in the first place?” That’s not a lesser question — building and maintaining 43 OAuth connectors with token management, rate limiting, incremental sync, and per-source configuration is a real engineering challenge that nobody else in the agent memory space has attempted.

What Hyperspell does not do — and I think deliberately does not do, at this stage — is knowledge construction. There’s no entity extraction, deduplication, contradiction detection, or graph structure. If contradictory documents enter the index, both are returned. There’s no temporal state model or version history. These aren’t oversights for a system focused on data access — they’re the domain of whatever knowledge layer sits downstream.

The platform is young and some rough edges showed up in API testing — search result ordering, filter behavior on certain endpoints, and answer model availability still have room to mature. The platform is closed-source with limited public technical documentation, though the SDKs are MIT-licensed and well-typed.

Hyperspell is more accurately understood as a data access layer — the first piece of a stack that doesn’t fully exist yet. The architecture it implies is three layers: a connectivity layer that feeds a knowledge construction layer (entity extraction, contradiction detection, temporal tracking) that feeds the agent. The marketing calls it “memory,” but the engineering is solving data access. And data access is a real problem that the knowledge construction systems mostly ignore — Graphiti has the best entity dedup in the space but no way to connect to your Gmail.

Letta — The agent manages its own memory

Philosophy: Memory is something the agent does, not something done to it.

Letta takes a fundamentally different approach. Instead of extracting facts from conversations via a separate pipeline, Letta gives the agent tools to read and write its own memory. The metaphor is closer to a notebook the agent maintains than a database that processes its conversations.

Memory comes in four types: Core Memory (always in context, like the agent’s working notes), Message Memory (session scratchpad), Archival Memory (searchable long-term storage via pgvector), and Recall Memory (searchable conversation history). The agent decides what to save, what to update, and what to search for.
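
A schematic of the division of labor — these are not Letta's actual classes or tool signatures, just an illustration of the four tiers and of the fact that the agent itself calls the write operations.

```python
# Schematic only — not Letta's real classes or tool names.
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    core: dict[str, str] = field(default_factory=dict)   # always in context (persona, user notes)
    session: list[dict] = field(default_factory=list)    # message scratchpad for the current run
    archival: list[str] = field(default_factory=list)    # long-term store, searched via pgvector
    recall: list[dict] = field(default_factory=list)     # full conversation history, searchable

    # Exposed to the agent as tools; nothing is extracted behind its back.
    def core_memory_replace(self, block: str, text: str) -> None:
        self.core[block] = text

    def archival_memory_insert(self, text: str) -> None:
        self.archival.append(text)
```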

What Letta gets right is infrastructure discipline. PostgreSQL with optimistic locking, ACID transactions, proper concurrency handling, versioned blocks with edit history. If the server crashes mid-operation, your state is consistent. The MemFS feature extends this to let agents edit memory as markdown files with git-backed version control.

What it doesn’t do is knowledge construction. Letta doesn’t extract entities, build knowledge graphs, or detect contradictions. The agent manages all of that reasoning itself, which means memory quality scales with the quality of the agent’s own judgment about what’s worth remembering. It’s a bet on agent intelligence over system intelligence — and if agent reasoning keeps improving, it may be the right one.

One feature worth flagging: Letta’s sleep-time agent runs a secondary model between or during conversations with a memory_rethink tool that does a full rewrite of memory blocks. This isn’t fact extraction — it’s the system stepping back and reconsidering what its memory should contain given everything it’s observed. That’s closer to consolidation than retrieval, and it’s a gesture toward a different model of what memory is for. The underlying primitive is still text blocks, so how far this can go is an open question — but the intent is notable.

Cognee — Knowledge graph construction as a pipeline

Philosophy: Turn documents into structured knowledge that any application can query.

Cognee is the most explicitly pipeline-oriented system. Its API has three main steps: add() (ingest documents), cognify() (build a knowledge graph), and search() (query it). It’s not an agent — it’s infrastructure that agents call.
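
The three steps map onto three calls. A minimal sketch following Cognee's documented quickstart — search argument names and defaults shift between versions.

```python
import asyncio
import cognee

async def main():
    await cognee.add("Guido van Rossum created Python in 1991.")  # 1. ingest
    await cognee.cognify()                                         # 2. build the knowledge graph
    results = await cognee.search("Who created Python?")           # 3. query it
    print(results)

asyncio.run(main())
```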

The cognify pipeline processes documents through chunking, LLM-based entity extraction (typed entities like Person, Organization, Concept), relationship extraction, summarization, and storage across three databases: a relational store (SQLite or PostgreSQL), a graph database (Kuzu by default, or Neo4j), and a vector store (LanceDB or PGVector).

The pipeline design is genuinely modular — each step is a composable Task, you can swap backends through clean interfaces, and the system supports 12 different search strategies. The async architecture is well-structured.

Where Cognee struggles is at the boundaries between these components. Each chunk is processed independently — the LLM extracting entities from chunk 3 doesn’t see what was extracted from chunks 1 and 2. Entity deduplication is string-based (uuid5(name.lower())), so “Apple Inc” and “Apple” become different entities. There’s no temporal model at all — the system has no concept of when facts are true or when they stop being true.
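
The dedup problem is easy to see directly. The namespace below is an assumption, not Cognee's actual constant — the point is only that a name-keyed uuid5 treats every surface form as a distinct entity.

```python
from uuid import NAMESPACE_OID, uuid5

# NAMESPACE_OID is illustrative; Cognee's actual namespace may differ.
print(uuid5(NAMESPACE_OID, "apple inc"))  # one entity node
print(uuid5(NAMESPACE_OID, "apple"))      # a different UUID, so a different entity node
```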

There’s something interesting to notice here about representations. When Cognee extracts a triplet like (Guido van Rossum, created, Python), it’s taking a continuous semantic relationship that exists in embedding space and discretizing it into a labeled graph edge. The edge label created is a lossy compression of the actual relationship. The vector embeddings of the entity names are arguably richer representations than the graph structure itself. Whether graph edges are the right primitive, or an intermediate step toward something more continuous, is one of the open questions at the end of this analysis.

Graphiti — A temporal knowledge graph

Philosophy: Time is a first-class citizen. Knowledge evolves, and the graph should track that evolution.

If Mem0 represents the minimal-structure end of the spectrum, Graphiti is closer to the other end — explicit schemas, typed edges, temporal fields, entity dedup heuristics. The system defines what knowledge looks like and enforces it structurally, rather than trusting the LLM to keep it all straight. Though it still uses LLMs heavily in the extraction pipeline, the storage model is deliberate and schema-driven.

Graphiti (by the team behind Zep) has arguably the most thoughtful data model of the eight systems. Every edge carries four temporal fields: t_created (when the edge was added to the graph), t_valid (when the fact became true in the real world), t_invalid (when it stopped being true), and t_expired (when it was superseded in the graph). When a new fact contradicts an old one, the system automatically invalidates the old edge.
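
As a schematic (not Graphiti's actual model classes), the bookkeeping on a contradiction looks roughly like this:

```python
# Field names follow the description above, not Graphiti's source.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Edge:
    fact: str
    t_created: datetime                # when the edge entered the graph
    t_valid: datetime                  # when the fact became true in the world
    t_invalid: datetime | None = None  # when it stopped being true
    t_expired: datetime | None = None  # when a newer edge superseded it in the graph

def supersede(old: Edge, new: Edge) -> None:
    """Record that `new` contradicts `old` without deleting anything."""
    old.t_invalid = new.t_valid                  # the old fact stopped being true in the world
    old.t_expired = datetime.now(timezone.utc)   # and the graph records when it found out
```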

Entity deduplication uses a two-phase approach: first a deterministic pass (exact match, then MinHash/Jaccard similarity with an entropy filter to avoid false positives on short names), then an LLM pass for anything that’s still ambiguous. It’s the most sophisticated entity resolution in the open-source agent memory space as of this writing.
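
A sketch of the deterministic pass, assuming character-shingle Jaccard — Graphiti's actual shingling, thresholds, and the entropy filter for short names differ in detail.

```python
def shingles(name: str, k: int = 3) -> set[str]:
    s = name.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def resolve(a: str, b: str, threshold: float = 0.8) -> bool | None:
    if a.lower() == b.lower():
        return True            # exact match: merge immediately
    if jaccard(a, b) >= threshold:
        return True            # near-identical surface forms: merge
    return None                # ambiguous — hand off to the LLM pass
```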

The search pipeline offers 16 pre-built recipes combining BM25, cosine similarity, BFS graph traversal, and five different rerankers (RRF, MMR, cross-encoder, node distance, episode mentions). And unlike Cognee’s search, Graphiti does real graph traversal — actual Cypher path queries in Neo4j.

The tradeoff is LLM cost. Every episode requires a minimum of 3 + N LLM calls (where N is the number of extracted edges). Entity extraction, entity resolution, edge extraction, per-edge dedup and contradiction detection, summary updates. For a typical episode with 5 edges, that’s 8+ LLM calls. This is a deliberate bet that accuracy is worth the cost — and for many use cases it is. But it means real-time streaming at high volume isn’t feasible.

Hindsight — Biomimetic memory with typed facts

Philosophy: Model memory the way cognitive science suggests it works — different types of knowledge, stored differently, retrieved differently.

Hindsight (by Vectorize.io) runs on a single PostgreSQL database with pgvector — no Neo4j, no Elasticsearch, no Milvus — and implements arguably the most conceptually rich retrieval system of the eight. Facts are typed: world (general knowledge), experience (personal events), opinion (beliefs with confidence scores), and observation (synthesized patterns generated by a background consolidation system). Facts are linked through temporal, semantic, entity, and causal relationships — with causal links getting boosted weight during retrieval.

Retrieval runs four searches in parallel (semantic, BM25, graph traversal with spreading activation, temporal), fuses them via Reciprocal Rank Fusion, and reranks with a cross-encoder. A Reflect mode runs an agentic loop: searching mental models, then observations, then raw facts, gathering evidence before answering. All of this from a single database — the complexity is in the retrieval logic, not the infrastructure.
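
Reciprocal Rank Fusion itself is only a few lines. The sketch below shows how the four ranked lists collapse into one, using the standard k = 60 constant; Hindsight's exact weighting, including the causal-link boost, happens upstream of this step.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings: each item scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, fact_id in enumerate(ranking, start=1):
            scores[fact_id] = scores.get(fact_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: four retrievers, four rankings, one fused list.
semantic = ["f12", "f7", "f3"]
bm25     = ["f7", "f9", "f12"]
graph    = ["f3", "f7"]
temporal = ["f12", "f1"]
print(rrf([semantic, bm25, graph, temporal]))  # f12 and f7 rise to the top
```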

The most interesting part is the observation type. These are synthesized patterns generated by a background consolidation process, not directly extracted from any single conversation. It’s the system noticing something across experiences, not just storing what happened. Like Letta’s memory_rethink, this is a gesture toward memory as representation rather than memory as retrieval. Neither system takes the idea very far — there’s no schema for how synthesized knowledge relates to other synthesized knowledge, no structure around how observations build on each other over time — but the architectural instinct is pointing somewhere important.

EverMemOS — The most comprehensive taxonomy

Philosophy: Different types of memory need different extraction, storage, and retrieval strategies.

EverMemOS is the most architecturally ambitious system of the eight. It defines seven memory types (episodes, profiles, preferences, relationships, semantic knowledge, basic facts, and core memories), each with its own extractor, storage path, and retrieval strategy.

The system introduces the “MemCell” concept — a boundary detection mechanism that determines when a conversation has shifted topics and should be chunked into a new episode. It uses both hard limits (8,192 tokens or 50 messages) and LLM-based boundary detection (evaluating topic change, intent transition, content relevance). It’s a thoughtful approach to the “when does one memory end and another begin” problem.
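
A sketch of that decision, with stand-in helpers (count_tokens, topic_shifted) replacing the real tokenizer and the LLM boundary prompt — the hard limits come from the description above; the rest is illustrative.

```python
MAX_TOKENS, MAX_MESSAGES = 8192, 50

def count_tokens(messages: list[str]) -> int:
    return sum(len(m.split()) for m in messages)   # stand-in for a real tokenizer

def topic_shifted(history: list[str], new_message: str) -> bool:
    return False   # stand-in for the LLM check on topic change / intent transition / relevance

def should_close_episode(history: list[str], new_message: str) -> bool:
    if len(history) + 1 > MAX_MESSAGES:
        return True                                   # hard limit: message count
    if count_tokens(history + [new_message]) > MAX_TOKENS:
        return True                                   # hard limit: token count
    return topic_shifted(history, new_message)        # otherwise, ask the model
```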

Operationally, EverMemOS requires four core containers: MongoDB, Elasticsearch, Milvus, and Redis. Each memory type is saved to up to three of these backends. The system includes five retrieval methods (keyword, vector, hybrid, RRF, and an agentic multi-round approach) and supports bilingual (English/Chinese) processing.

The breadth is impressive. But the multi-backend architecture creates consistency challenges. Writes go sequentially from MongoDB to Elasticsearch to Milvus, and there are no cross-system transactions. If Elasticsearch fails after MongoDB succeeds, you have data that's persisted but unsearchable until the index catches up. Each additional backend is another consistency boundary.

Tacnode — The database-first approach

Philosophy: The problem isn’t knowledge extraction — it’s data infrastructure. Give agents a consistent, fast, multi-modal database and let the application layer handle the rest.

Tacnode takes a completely different approach from every other system on this list. It’s a closed-source, PostgreSQL-compatible database (built from scratch, not a fork) that stores structured rows, semi-structured JSON, unstructured text, and vector embeddings within a single transactional system.

The founder, Xiaowei Jiang, has a verifiable track record contributing to production-scale database systems (co-authored the Hologres paper at Alibaba, Apache Flink/Blink contributions, Microsoft SQL Server). The theoretical foundation is a position paper on “Decision Coherence” that makes a standard distributed-systems argument: composing independently advancing databases can’t guarantee consistent state across all of them without a coordination layer.

Key features: native time travel queries (SELECT * FROM users FOR SYSTEM_TIME AS OF '-2h'), SQL-based semantic operators (llm_classify, llm_extract, llm_summarize), ACID transactions across all data modalities, and compute-storage separation.
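
Because the claim is PostgreSQL wire compatibility, the time travel query should be reachable from a standard driver. This is a hypothetical sketch against the documented syntax, not something verified against a live instance; the connection string and table are made up.

```python
import psycopg  # standard PostgreSQL driver; Tacnode claims wire compatibility

with psycopg.connect("postgresql://user:pass@tacnode.example/agentdb") as conn:
    # Read the table as it looked two hours ago (documented syntax, unverified here).
    rows = conn.execute(
        "SELECT * FROM users FOR SYSTEM_TIME AS OF '-2h' WHERE id = %s", (42,)
    ).fetchall()
    print(rows)
```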

What Tacnode explicitly does NOT do: entity extraction, relationship discovery, graph traversal, community detection, episodic memory. It’s infrastructure under the knowledge layer, not the knowledge layer itself.

The fundamental caveat: it’s closed source. Every ACID guarantee, every performance claim, every consistency promise is based on public documentation and papers, not verified code. For this analysis, that means the Tacnode sections should be read as “what the team claims” rather than “what the implementation does.” Everything else in this post was verified at the source level.

Patterns across all eight

A few patterns show up consistently across the analysis. None of these are failures on anyone’s part. These are hard, early-stage systems being built by smart teams. The gaps feel more like “where the space is” than “what anyone’s doing wrong.”

The infrastructure/knowledge split

There’s a divide that becomes obvious once you look at the code.

On the infrastructure side: systems with real database discipline — ACID transactions, incremental updates, consistency guarantees. Letta and Tacnode are the clearest examples. But neither extracts entities, builds knowledge graphs, or detects contradictions. They provide the substrate, not the knowledge.

On the knowledge side: systems with sophisticated extraction and retrieval pipelines — Graphiti’s two-phase entity dedup, Hindsight’s 4-way parallel retrieval with causal link boosting, Cognee’s 12 search strategies. But their transactional guarantees range from partial to nonexistent.

I didn’t find a single system that does both well.

And Hyperspell reveals a third axis entirely — live data connectivity. It has neither infrastructure guarantees nor knowledge construction, but it solves the data access problem that the other systems mostly ignore. Cognee can parse your PDFs and DOCXs, which is more than anyone else offers on the ingestion side. But none of the knowledge construction systems can connect to your Gmail, sync your Slack, or pull from your Salesforce. Hyperspell is the only one asking “where does the data come from?” as a first-class problem.

It’s not that anyone’s making the wrong choice — these are genuinely different layers of the same problem, and each team started from the layer that felt most urgent. But the full stack (broad data access AND knowledge construction AND infrastructure guarantees) doesn’t ship in any product I could find.

Operational complexity doesn’t predict capability

You might expect more infrastructure to produce better results. It doesn’t always. Hindsight runs on a single PostgreSQL database and has one of the most sophisticated retrieval pipelines. EverMemOS requires four containers and has consistency gaps at every backend boundary. Letta’s simple block model with proper PostgreSQL transactions provides more reliable persistence than systems with far more moving parts.

This isn’t a knock on the more complex systems — sometimes you need multiple backends for different access patterns. But it’s worth noticing that architectural complexity and memory quality aren’t the same thing.

Temporal handling is inconsistently implemented

Cognee has no temporal model at all. Graphiti has a rich bi-temporal schema on every edge. Letta tracks conversation timestamps but nothing about when facts are true or false. Hindsight has causal link tracking with temporal spreading activation — probably the most production-ready temporal retrieval. Tacnode has native time travel queries but with a 23-hour default retention window.

Most agent memory systems can’t answer “what did the agent know last Tuesday?” or “when did the agent first learn X?” These seem like questions you’d want to answer in production.

The word “memory” doesn’t mean much yet

This is the pattern that seems most worth paying attention to. Among the systems that actually do knowledge construction — Mem0, Letta, Cognee, Graphiti, Hindsight, EverMemOS — every one implements the same fundamental model:

  1. Extract facts or knowledge from conversations/documents
  2. Store them in some combination of vectors, graphs, and relational tables
  3. Retrieve them via search when the agent needs them

The differences are in how they extract, what structures they store, and how they retrieve. But the underlying assumption is shared: agent memory is a fact retrieval problem.

Hyperspell stretches the definition further. It skips step 1 entirely — no extraction, no structuring — and calls document search “memory.” Tacnode skips steps 1 and 3 and calls database infrastructure “memory.” The word is covering everything from temporal knowledge graphs to OAuth-connected document search to PostgreSQL-compatible databases. Eight products, at least four different definitions of what “memory” means.

Is any of them what memory actually is? When a person “remembers” something, they’re not running a search query against a fact store. Memory shapes how they interpret new information, what they pay attention to, what patterns they recognize. It’s less about retrieval and more about how past experience changes present behavior.

None of these systems model that. Whether they should is an open question — maybe fact retrieval is sufficient for current agent architectures, and the cognitive analogy is misleading. But it’s striking that eight independent teams all ship under the same label without much visible debate about what the label means.

The prescriptive spectrum

There’s a design axis running through these systems that doesn’t get discussed explicitly enough, and it’s useful to name it: how prescriptive the system is about what memory should look like. At the minimal end, the system imposes little structure and trusts the LLM’s judgment; at the rich end, it defines explicit schemas and enforces them.

Neither end is obviously right. The minimal approach is cheaper, more flexible, and improves as models improve — but it has no floor when the model makes mistakes. The rich approach is auditable, verifiable, and explicit about its data model — but it’s rigid, expensive to maintain, and can’t capture what the schema doesn’t anticipate.

The layout, roughly:

- Data access layer (fan-in gateway: connect sources, normalize, make searchable): Hyperspell (43 OAuth sources, continuous sync, unified search API).
- Knowledge construction layer, ordered from minimal to rich structure — that is, increasing opinion about how memory should work: Mem0 (flat facts, LLM extracts), Letta (agent-managed four-tier blocks), Cognee* (GraphRAG ETL pipeline), Graphiti (temporal KG, bi-temporal edges), Hindsight (biomimetic lifecycle, 4 fact types, typed links), EverMemOS (multi-type taxonomy, multi-backend storage).
- Infrastructure layer (composable primitives, no fixed memory model): Tacnode (SQL + time travel + semantic operators; build whatever you need), Synix (declared pipelines + provenance; iterate until it works).

* Cognee’s modular pipeline architecture is more flexible than its position suggests — you can swap processing steps, though the graph output shape is fixed.

The likely answer is somewhere in the middle, and Graphiti arguably already lives there — strong structural schema, but LLMs doing the heavy lifting within that structure. The question is where on the spectrum the equilibrium settles as models get more capable. Do the guardrails become unnecessary scaffolding? Or do they become more valuable as the foundation that makes memory auditable and testable?

Two systems sit outside this spectrum entirely, for different reasons.

Below it, Tacnode occupies the infrastructure layer: composable database primitives (SQL, time travel, semantic operators) that you build a memory system on top of. Not “minimal structure” — a different abstraction layer.

Above it, Hyperspell occupies the data access layer: 43 OAuth connectors that normalize heterogeneous sources into a searchable index. It doesn’t prescribe a memory model because it doesn’t build one — it’s a fan-in gateway. Segment doesn’t tell you how to analyze your data, it gets your data into one place so analysis is possible. Hyperspell doesn’t tell you how to build knowledge from your documents — it gets your documents into one place so knowledge construction is possible. Whether it eventually moves down the stack toward actual knowledge building is an open question.

What the outliers suggest about where this is heading

Tacnode and Hyperspell are both operating at fundamentally different layers than the six knowledge-construction systems — but they’re different from each other, and together they sketch the outline of a stack that nobody has assembled.

The six systems in the middle (Mem0 through EverMemOS) are all trying to solve knowledge construction — how to extract, structure, and retrieve facts. Tacnode is solving the infrastructure problem underneath: consistency, transactions, multi-modal storage, temporal queries. Hyperspell is solving the data access problem above: get content from 43 different sources into one place where it can be processed.

What’s interesting is that these three layers — data access, knowledge construction, data infrastructure — map cleanly onto a stack that already exists in traditional enterprise architecture. Segment (fan-in data collection) → transformation layer (dbt, Fivetran transforms) → data warehouse (Snowflake, BigQuery). The pattern is the same: normalize heterogeneous sources, pipe them into a processing layer, store the results in something with real guarantees.

In the agent memory space, that stack would look like: a data access layer that pulls from live sources (Hyperspell’s territory), a knowledge construction layer that extracts and structures what comes in (where the six middle systems live), and a data infrastructure layer underneath with real transactional guarantees (Tacnode’s territory).

The knowledge construction systems all assume data is already available and focus on what to do with it. Hyperspell assumes knowledge construction isn’t its problem and focuses on getting data in. Tacnode assumes both layers above it exist and focuses on storing their output reliably. Each layer solves a real problem. None of them compose into a product today.

The more I think about it, the more this three-layer decomposition feels like where the space has to go. You can’t build reliable knowledge without reliable infrastructure. You can’t build knowledge from data you can’t access. And you can’t access data without someone maintaining 43 OAuth integrations. The question isn’t which layer matters most — it’s how long before someone assembles the full stack.

Open problems

The analysis surfaced questions at several levels — from practical engineering tradeoffs to more fundamental issues about what memory should be.

Database relations as proxies for continuous semantics

Every system stores relationships between memories as database primitives: foreign keys, graph edges, similarity scores. These are adjunct structures — discrete, typed proxies for what are actually continuous, high-dimensional semantic relationships.

When Graphiti creates an edge labeled works_at between “John” and “Acme,” it’s compressing a rich contextual relationship into a schema-friendly label. The embedding of the sentence “John has been working at Acme since 2019 as a senior engineer, primarily on their payments team” contains far more information about that relationship than the edge label does.

Graph edges and relational links may be an intermediate representation — useful now because they’re queryable and interpretable — but eventually replaced by something that operates directly in embedding space. Not “entity → relationship_type → entity” but continuous semantic regions where the nature of a relationship is encoded in the geometry of the space itself, not as a discrete label.

This surfaces concretely when you look at Cognee building a knowledge graph and then scoring search results by summing vector distances over flattened triples. The graph structure and the embedding space are doing different jobs. At scale, the embedding space might be the more natural representation.

Token-level vs. latent memory

A related question is what level memory should operate at. Current systems work at the token level — extract text, store text, embed text, retrieve text. The agent’s memory is fundamentally a collection of strings.

But LLMs don’t think in strings. They think in activations across layers. There’s emerging work on persisting and injecting hidden states (not just token sequences) as a form of memory — the idea that an agent’s “memory” of a conversation could be a compressed latent representation rather than a bag of extracted facts. This would be a fundamentally different primitive than what any of these eight systems are building.

Whether that approach will work at scale is an open question. But if it does, it makes the current token-level extraction paradigm look like an expensive detour — converting rich internal representations to text and back to embeddings, losing information at every step.

What does “working correctly” even mean for memory?

All the questions above assume that the fact-extraction-and-retrieval model is correct and the debate is about implementation. But there’s a more basic question underneath them.

How would you evaluate whether an agent’s memory is working? The obvious answer is retrieval accuracy — given a query, does the right fact come back? And every system in this analysis is optimized for some version of that metric.

But think about what you actually want from an agent that “remembers” a long-running interaction. You want it to behave differently because of what it’s experienced. You want its judgments to reflect accumulated context. You want it to notice patterns across conversations, not just recall individual facts. The measure of memory quality isn’t “can you retrieve fact X” — it’s “does the agent’s behavior change appropriately as a function of its experience.”

Almost none of these systems measure that, and the few that gesture toward it — Hindsight’s synthesized observations, Letta’s memory rethink — are still evaluated by retrieval metrics. The consolidation happens, but there’s no framework for asking whether it actually changed the agent’s behavior in the right way.

This is the part of the problem that seems most underexplored. A bag of facts isn’t memory for the same reason that a list of pixel values isn’t an image. What makes memory memory is the compression — the part where some things get ignored, other things get emphasized, and what remains is a working model that shapes how new information gets interpreted. That’s not retrieval. That’s representation. The difference between “I can look up that the user changed jobs” and “I understand this person’s career trajectory well enough that the job change wasn’t surprising” is the difference between a search index and something that actually remembers.

For any agent doing something more complex than answering questions about past conversations — planning, adapting, maintaining relationships over time — fact retrieval is clearly insufficient. The systems that are reaching toward consolidation and synthesis are pointing in the right direction. But the current paradigm is still mostly optimizing an intermediate step and mistaking it for the destination.