Your Agent Has a Goldfish Memory

Most production AI agents in 2026 forget the previous sentence the moment they generate the next one. Feed a frontier model roughly 115,000 tokens of conversation history and its accuracy on the LongMemEval benchmark falls to around 60 percent, close to a 30-point drop from its score when the same facts sit right in front of it (Wu et al., ICLR 2025). We have built systems that pass the bar exam and cannot reliably remember your name on Tuesday.

Here is the uncomfortable part. There is exactly one existence proof in the known universe of a long-running, autonomous, general cognitive system that holds together for decades. It is running between your ears right now. And nearly every serious agent-memory paper of the last three years has, knowingly or not, reinvented a single slice of it.

Generative Agents reinvented importance-weighted recall (Park et al., 2023). A memory-as-operating-system design reinvented paging between fast and slow stores (Packer et al., 2023). Voyager reinvented procedural skill as a first-class library (Wang et al., 2023). Temporal knowledge graphs reinvented memory that knows when a fact stopped being true (Rasmussen et al., 2025). Self-organizing memory reinvented the cross-linked note (Xu et al., 2025). Sleep-time compute reinvented the value of thinking while idle (Lin et al., 2025). Every one is a real result. Not one of them is the whole architecture.

My engineering partner and I have repeated the same line to each other for twenty years: you model the system after what it actually is in the real world. Memory is not a storage problem with a clever index bolted on top. Memory is a cognitive architecture, and biology already shipped the reference implementation. Here is what changes when you take that seriously.

The same architecture, described twice. Every property a serious agent-memory system needs already has a name in neuroscience.

01 — The cost: The amnesia tax

Before the design, the bill. Strip the reflection step out of a generative-agent simulation and its behavior collapses from coherent multi-day planning into repetitive, context-free responses (Park et al., 2023). Voyager, built around a skill library, unlocks key milestones up to fifteen times faster than earlier agents that lack one, and it stalls in the later stages when the library is removed (Wang et al., 2023). Run a long conversation straight into a frontier model’s context window and you pay twice: accuracy falls by tens of points as the history grows, while latency and token cost climb with every turn. Purpose-built memory layers measured against that baseline cut response latency by around 90 percent and improved accuracy at the same time (Chhikara et al., 2025; Rasmussen et al., 2025).

Read those numbers again. Memory is not a feature that makes a good agent slightly better. It is the variable that decides whether an agent can run for a week instead of for a paragraph.

02 — The taxonomy: Tulving was right. So was Squire.

The first thing the brain does with memory is refuse to treat it as one thing. More than fifty years ago, Endel Tulving split declarative memory in two: episodic, what happened to me, bound to a time and place; and semantic, what I know to be true, stripped of when I learned it (Tulving, 1972). Larry Squire drew the deeper line between declarative memory and the nondeclarative, procedural kind, the knowing-how that resides in entirely different brain systems (Squire & Zola, 1996).

This is not taxonomy for its own sake. It is a constraint. An event and a generalization want different storage, different decay, different retrieval. A skill and a fact are not the same kind of object and cannot be queried the same way. A memory system that pours everything into one undifferentiated pile is making a mistake the brain ruled out before we had language for it.
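
As a rough illustration of that constraint, here is a minimal Python sketch in which each kind of memory gets its own record type, its own store, and its own query. Every name in it (`Episode`, `Fact`, `Skill`, `TypedMemory`) is invented for this example and is not drawn from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Episode:
    """What happened: always bound to a time and a place."""
    description: str
    occurred_at: float  # timestamp; the unit is an assumption of this sketch
    place: str


@dataclass
class Fact:
    """What is known to be true, stripped of when it was learned."""
    subject: str
    attribute: str
    value: str


@dataclass
class Skill:
    """Knowing-how: something you run, not something you look up."""
    name: str
    run: Callable[..., object]


class TypedMemory:
    """Routes each kind of memory to its own store and its own query."""

    def __init__(self) -> None:
        self.episodes: List[Episode] = []
        self.facts: Dict[Tuple[str, str], Fact] = {}
        self.skills: Dict[str, Skill] = {}

    def add(self, item: object) -> None:
        if isinstance(item, Episode):
            self.episodes.append(item)
        elif isinstance(item, Fact):
            self.facts[(item.subject, item.attribute)] = item
        elif isinstance(item, Skill):
            self.skills[item.name] = item
        else:
            raise TypeError(f"unsupported memory type: {type(item).__name__}")

    def episodes_between(self, start: float, end: float) -> List[Episode]:
        """Episodes are queried by time window."""
        return [e for e in self.episodes if start <= e.occurred_at <= end]

    def lookup(self, subject: str, attribute: str) -> Optional[str]:
        """Facts are queried by what they are about."""
        fact = self.facts.get((subject, attribute))
        return fact.value if fact else None

    def perform(self, name: str, *args: object) -> object:
        """Skills are not read back; they are executed."""
        return self.skills[name].run(*args)
```

The three query methods have nothing in common, and that is the point: a single `search(text)` over one pile could not express all of them.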

03 — Two systems: One store is never enough

So why keep separate systems at all? Because a single store that learns fast also forgets fast, overwriting old patterns every time it absorbs a new one. The classic answer is Complementary Learning Systems theory (McClelland, McNaughton & O’Reilly, 1995; Kumaran, Hassabis & McClelland, 2016): the hippocampus learns individual episodes in one shot, while the neocortex integrates across thousands of them slowly, extracting the structure they share. Fast and specific on one side, slow and general on the other. Neither alone is enough. The handoff between them is the whole trick.

Fast capture, slow integration, and a handoff that happens offline. The episode is written instantly; the understanding forms later.
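
A toy version of that handoff, sketched in Python under loud assumptions: the fast store is an append-only list, the slow store is a dictionary, and "integration" is nothing smarter than promoting a value once it has been observed a few times. The class name, the `promote_after` threshold, and the promotion rule are all hypothetical; real consolidation would extract far richer structure.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Episode:
    subject: str
    attribute: str
    value: str


class TwoStoreMemory:
    def __init__(self, promote_after: int = 3) -> None:
        self.episodic: List[Episode] = []             # fast, specific
        self.semantic: Dict[Tuple[str, str], str] = {}  # slow, general
        self.promote_after = promote_after

    def record(self, episode: Episode) -> None:
        """One-shot capture: a cheap append, no integration work."""
        self.episodic.append(episode)

    def consolidate(self) -> int:
        """Slow pass meant to run offline. Returns how many facts changed."""
        votes: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
        for ep in self.episodic:
            votes[(ep.subject, ep.attribute)][ep.value] += 1

        changed = 0
        for key, counter in votes.items():
            value, count = counter.most_common(1)[0]
            # Promote only patterns that recur often enough to trust.
            if count >= self.promote_after and self.semantic.get(key) != value:
                self.semantic[key] = value
                changed += 1
        return changed
```

Note that `record` never touches the semantic store. The write path stays fast precisely because understanding is deferred to `consolidate`.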

04 — Decay: Forgetting is a feature, not a bug

The instinct in software is to keep everything and buy a bigger disk. The brain does the opposite, on purpose. Hermann Ebbinghaus measured the forgetting curve in 1885 and it has replicated for more than a century (Murre & Dros, 2015). Robert Bjork’s work shows the forgetting is adaptive: retrieving one memory actively suppresses its competitors, which is why a cluttered memory recalls worse than a pruned one (Anderson, Bjork & Bjork, 1994). There is even direct neural evidence in mammals that retrieval induces the forgetting of competing traces under prefrontal control (Bekinschtein et al., 2018).

So the design rule is not “never forget.” It is “forget on a curve set by importance, and keep a path back.” Let low-value memories fade toward dormancy. Weight the curve by how much a memory has actually mattered. And when a fresh cue makes a faded memory relevant again, resurrect it. Forgetting, done deliberately, is what keeps the important things findable.

Decay weighted by importance, with resurrection on cue. The trivial fades, the consequential endures, and nothing useful is truly gone.
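
One way to make that rule concrete is a half-life that stretches with importance. The sketch below assumes a plain exponential curve, which is a simplification and not Ebbinghaus's fitted function, and every constant in it (the 7-day base half-life, the 10x stretch for critical memories, the 0.1 dormancy threshold) is a placeholder chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

BASE_HALF_LIFE_DAYS = 7.0  # assumed half-life of a trivial memory
DORMANT_BELOW = 0.1        # assumed retention threshold for dormancy


@dataclass
class Memory:
    text: str
    importance: float    # 0.0 (trivial) to 1.0 (critical); assumed scale
    last_touched: float  # time in days


def half_life(memory: Memory) -> float:
    """Important memories decay up to 10x more slowly than trivial ones."""
    return BASE_HALF_LIFE_DAYS * (1.0 + 9.0 * memory.importance)


def retention(memory: Memory, now: float) -> float:
    """Fraction of strength left, halving once per half-life."""
    elapsed = max(0.0, now - memory.last_touched)
    return 0.5 ** (elapsed / half_life(memory))


def active(memories: List[Memory], now: float) -> List[Memory]:
    """Dormant memories drop out of retrieval but are never deleted."""
    return [m for m in memories if retention(m, now) >= DORMANT_BELOW]


def resurrect(memory: Memory, now: float) -> None:
    """A fresh cue made this memory relevant again: restart its curve."""
    memory.last_touched = now
```

After 30 days a trivial memory has fallen to roughly 5 percent retention and gone dormant, while a critical one written the same day still sits near 74 percent. Because dormancy only filters retrieval, `resurrect` has something to bring back.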

05 — The dream state: Consolidation happens at idle, not at query time

Here is the property the field is only starting to take seriously. The brain does its heaviest memory work while you sleep. During slow-wave sleep the hippocampus replays the day’s episodes and hands them to the neocortex, turning raw experience into durable, schema-shaped knowledge (Klinzing, Niethard & Born, 2019; Born et al., 2023). The integration does not happen while you are answering a question. It happens while nothing is being asked.

If a system consolidates in the middle of serving a request, the user pays for it in latency at the worst possible moment. Do it offline, and the next session wakes up sharper for free.

That is not a poetic detail. It is an engineering constraint. The recent sleep-time compute work makes the case directly: pre-process during idle and you cut the work done at query time substantially (Lin et al., 2025). Evolution shipped this a few hundred million years ago. We are catching up.
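
The scheduling half of that constraint is small enough to sketch. The Python below assumes a hypothetical `IdleConsolidator` wrapper that is told about every request and polled periodically; it runs the expensive pass only after a quiet period, and only if there is new material. The 60-second default and the injectable clock are illustration choices, not values taken from the sleep-time compute work.

```python
import time
from typing import Callable


class IdleConsolidator:
    def __init__(
        self,
        consolidate: Callable[[], None],
        idle_after_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._consolidate = consolidate
        self._idle_after_s = idle_after_s
        self._clock = clock
        self._last_request = clock()
        self._dirty = False  # True when there is unconsolidated material

    def on_request(self) -> None:
        """Call on every user request. Does no consolidation work."""
        self._last_request = self._clock()
        self._dirty = True

    def tick(self) -> bool:
        """Call periodically. Consolidates only when idle and dirty."""
        idle_for = self._clock() - self._last_request
        if self._dirty and idle_for >= self._idle_after_s:
            self._consolidate()
            self._dirty = False
            return True
        return False
```

The request path does two assignments and returns. Everything costly is pushed to `tick`, which a background timer or scheduler would drive.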

06 — Reconsolidation: Recall is reconstruction. Never destroy, only supersede.

We imagine memory as playback. It is nothing of the kind. Frederic Bartlett showed in 1932 that remembering is active reconstruction, not replay. Decades later, Karim Nader and Joseph LeDoux demonstrated reconsolidation: the act of recalling a memory returns it to a fragile, rewritable state before it is stored again (Nader, Schafe & LeDoux, 2000). Every read is, quietly, a write.

A memory system honest about this does not overwrite. When a fact changes, it does not delete the old one and lose the history of its own belief. It keeps two senses of time, when something was true in the world and when the system came to believe it, and it supersedes rather than destroys (the bi-temporal model formalized for agent memory by Rasmussen et al., 2025). The old record stays queryable. You can still ask what the system believed last month, and why it changed its mind.

A correction closes the old interval and opens a new record. The history of belief survives, so the system can explain how it changed its mind.
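
A stripped-down sketch of supersede-not-destroy, in Python. It is a simplification of the bi-temporal idea and not the data model of Rasmussen et al.: each record carries a world-time interval (`valid_from`, `valid_to`) and a belief interval (`recorded_at`, `superseded_at`), and a correction closes both on the old record before appending a new one. The class and field names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    key: str
    value: str
    valid_from: float                      # when it became true in the world
    recorded_at: float                     # when the system came to believe it
    valid_to: Optional[float] = None       # None means still true
    superseded_at: Optional[float] = None  # None means still believed


class BiTemporalStore:
    def __init__(self) -> None:
        self.records: List[Record] = []

    def assert_fact(self, key: str, value: str, valid_from: float, now: float) -> None:
        """Supersede the current belief about `key`; never delete it."""
        for r in self.records:
            if r.key == key and r.superseded_at is None:
                r.valid_to = valid_from  # close the old world-time interval
                r.superseded_at = now    # close the old belief interval
        self.records.append(Record(key, value, valid_from, recorded_at=now))

    def current(self, key: str) -> Optional[str]:
        for r in self.records:
            if r.key == key and r.superseded_at is None:
                return r.value
        return None

    def true_at(self, key: str, world_time: float) -> Optional[str]:
        """By today's knowledge, what held in the world at `world_time`?"""
        for r in self.records:
            if r.key != key or world_time < r.valid_from:
                continue
            if r.valid_to is None or world_time < r.valid_to:
                return r.value
        return None

    def believed_at(self, key: str, system_time: float) -> Optional[str]:
        """What did the system believe at `system_time`?"""
        for r in self.records:
            if r.key != key or system_time < r.recorded_at:
                continue
            if r.superseded_at is None or system_time < r.superseded_at:
                return r.value
        return None
```

Suppose an office moved at world time 10 and the system only learned of it at time 20. `true_at("office", 12)` returns the new city, while `believed_at("office", 15)` still returns the old one. That gap is exactly what a single timestamp cannot represent.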

07 — Metamemory: The system must know what it does not know

Humans carry a quiet, second-order sense of their own knowledge. You know that you know your mother’s name without retrieving it, you know that you do not know a stranger’s, and you feel the difference between them. That faculty has a name, metamemory, and a formal model going back decades (Nelson & Narens, 1990). Tulving himself flagged knowledge of one’s own knowledge as one of the truly distinctive features of human memory.

An agent without this is dangerous in a specific way. It answers a question it has no business answering in exactly the same confident tone it uses for something it knows cold. That is the single most expensive failure mode in deployed systems, because a confident wrong answer is the hardest kind for a user to catch. Give every memory a confidence, track staleness, map where coverage is thin, and return that alongside the answer. Hallucination stops being a metaphysical mystery and becomes an engineering signal you can act on.
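
In code, the smallest useful form of this is a recall function that never returns a bare value. The Python sketch below is an assumption-laden illustration: the `Entry` and `Recall` types, the four status labels, and the two thresholds are all invented here, and a real system would estimate confidence from provenance instead of storing a hand-set number.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    value: str
    confidence: float  # 0.0 to 1.0
    updated_at: float  # time in days


@dataclass
class Recall:
    value: Optional[str]
    confidence: float
    age_days: Optional[float]
    status: str  # "known", "uncertain", "stale", or "unknown"


def recall(
    store: Dict[str, Entry],
    key: str,
    now: float,
    min_confidence: float = 0.6,
    max_age_days: float = 30.0,
) -> Recall:
    entry = store.get(key)
    if entry is None:
        # A coverage gap is reported as a gap, not papered over.
        return Recall(None, 0.0, None, "unknown")

    age = now - entry.updated_at
    if entry.confidence < min_confidence:
        status = "uncertain"
    elif age > max_age_days:
        status = "stale"
    else:
        status = "known"
    return Recall(entry.value, entry.confidence, age, status)
```

The caller, whether an agent or a prompt template, can then branch on `status`: answer plainly when it is `known`, hedge or re-verify when it is `stale` or `uncertain`, and say so outright when it is `unknown`.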

08 — Belief: Belief is not knowledge

Related, and just as important: the brain does not file “the deadline is the first of the month” next to “I think this vendor is unreliable” as the same kind of fact. One is an observation with a source. The other is a judgment that should move as evidence arrives. The predictive-processing account treats perception and belief as probabilistic priors, continuously revised by the error between what was expected and what arrived (Bottemanne & Friston, 2024).

So opinions get their own shelf, each carrying an explicit confidence and a record of what it was before the last update, revised by something close to a Bayesian rule as evidence comes in. Facts revise opinions. Opinions do not get to quietly rewrite facts. A system that cannot tell its own beliefs from its own observations cannot reason soundly, and cannot explain itself when challenged.
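
Here is one way that shelf could look, sketched in Python with Bayes' rule doing the revision. The `Observation` and `Opinion` types are hypothetical, and the two likelihoods passed to `revise` (how probable the evidence is if the claim is true versus false) are numbers the caller must supply; estimating them well is the hard part and is out of scope here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Observation:
    """A fact with a source. Frozen: opinions cannot rewrite it."""
    statement: str
    source: str


@dataclass
class Opinion:
    """A judgment with an explicit confidence and a trail of past values."""
    claim: str
    confidence: float  # probability the claim is true
    history: List[Tuple[float, Observation]] = field(default_factory=list)

    def revise(self, evidence: Observation, p_if_true: float, p_if_false: float) -> None:
        """Bayes' rule: posterior = P(e|H)P(H) / [P(e|H)P(H) + P(e|not H)P(not H)]."""
        prior = self.confidence
        numerator = p_if_true * prior
        denominator = numerator + p_if_false * (1.0 - prior)
        if denominator == 0.0:
            raise ValueError("evidence is impossible under both hypotheses")
        # Keep what the opinion was before this update, and why it moved.
        self.history.append((prior, evidence))
        self.confidence = numerator / denominator
```

Starting from a neutral 0.5 on "this vendor is unreliable", one late delivery judged four times likelier under that claim than against it (0.8 versus 0.2) moves the opinion to 0.8. The observation itself is untouched, and the prior it displaced is still on record.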

09 — Causation: Remember cause, not just sequence

Why does episodic memory exist at all, given how reconstructive and error-prone it is? The leading answer is startling: it is not really about the past. Schacter and Addis argue that episodic memory is a tool for simulating the future, flexibly recombining fragments of past experience into possible scenarios (Schacter & Addis, 2007). Memory is for prediction.

That reframes what to store. A log of what happened in what order cannot answer why, and cannot reason about what would have happened otherwise. Store causation as a first-class link, “this led to that,” with its own confidence and its own window of validity, and the system can trace a problem back to the decision that caused it and reason counterfactually about choices it did not make. Sequence tells you the story. Cause lets you learn from it.
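
The tracing half of that idea fits in a few lines of Python. This sketch stores only cause, effect, and confidence (the validity window is left out for brevity), walks links backward from an effect, and scores each root cause by multiplying confidences along the path. That multiplication treats the links as independent, which is an assumption of the example and not a claim about how causal confidence should be combined.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class CausalLink:
    cause: str
    effect: str
    confidence: float  # how sure we are that cause led to effect


def root_causes(
    links: List[CausalLink], effect: str, min_confidence: float = 0.0
) -> Dict[str, float]:
    """Walk 'this led to that' links backward to the earliest known causes."""
    incoming: Dict[str, List[CausalLink]] = {}
    for link in links:
        incoming.setdefault(link.effect, []).append(link)

    roots: Dict[str, float] = {}

    def walk(node: str, confidence: float, seen: FrozenSet[str]) -> None:
        # Skip causes already on this path so cycles cannot loop forever.
        parents = [l for l in incoming.get(node, []) if l.cause not in seen]
        if not parents:
            if node != effect:
                roots[node] = max(roots.get(node, 0.0), confidence)
            return
        for link in parents:
            walk(link.cause, confidence * link.confidence, seen | {link.cause})

    walk(effect, 1.0, frozenset({effect}))
    return {cause: c for cause, c in roots.items() if c >= min_confidence}
```

Given links for "config change" to "cache disabled" (0.9), "cache disabled" to "latency spike" (0.8), and "traffic surge" to "latency spike" (0.3), tracing the latency spike surfaces the config change at about 0.72 and the traffic surge at 0.3. A chronological log holding the same four events could not rank them.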

10 — Retrieval: Many routes, one answer

A single retrieval strategy answers a whole category of questions badly. Pure meaning-based search misses exact identifiers. Pure graph traversal is poor at fuzzy similarity. Pure keyword matching misses paraphrase. The brain does not pick one. It queries many regions in parallel and resolves a single coherent answer. A memory layer should do the same, running meaning, connection, and exact-term routes together and fusing them behind one interface, then returning not a pile of search results but a briefing, shaped by who is asking and what they are doing.

Three retrieval routes, one fused and reranked result. The intelligence lives in one place, so improving it improves every caller at once.
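
The fusion step can be as simple as reciprocal rank fusion, a standard technique from information retrieval that I am substituting here for illustration; nothing above says which fusion method any particular system uses. Each route returns its own ranked list of memory ids, and an id scores 1 / (k + rank) for every list it appears in. The constant k = 60 is the conventional default, and the route names in the usage note are placeholders.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of ids into one ranking.

    An id earns 1 / (k + rank) from each list that contains it,
    so items that several routes agree on rise to the top.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] = scores.get(memory_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda memory_id: (-scores[memory_id], memory_id))
```

With a meaning route returning `["m2", "m1", "m3"]`, a connection route returning `["m1", "m4"]`, and an exact-term route returning `["m1", "m2"]`, the fused order is `["m1", "m2", "m4", "m3"]`. `m1` wins because all three routes found it, even though the meaning route alone ranked it second.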

11 — The loop: The loop that makes it improve

Static retrieval is a guess that never learns from being wrong. The final property closes the loop. After the system uses a memory, it notices whether that memory actually helped, and feeds the result back. Memories that earned their keep are reinforced. Memories that misled are down-weighted. That single outcome signal updates the decay rates, the routing, and which memories get carried forward, so the store gets measurably sharper with use instead of frozen at launch quality (in the spirit of self-organizing agent memory, Xu et al., 2025).

Every retrieval is also a signal about whether retrieval is working. Feed it back, and the system compounds in quality.
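
A deliberately narrow sketch of that feedback, in Python: each memory carries a utility score that drifts toward 1 when it helped and toward 0 when it misled, and ranking multiplies relevance by utility. It covers only the ranking weight, not decay rates or routing, and the `TrackedMemory` type, the 0.5 starting utility, and the 0.2 learning rate are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackedMemory:
    memory_id: str
    utility: float = 0.5  # neutral until outcomes say otherwise


def record_outcome(memory: TrackedMemory, helped: bool, rate: float = 0.2) -> None:
    """Exponential moving average of whether this memory helped."""
    target = 1.0 if helped else 0.0
    memory.utility += rate * (target - memory.utility)


def rerank(candidates: List[Tuple[TrackedMemory, float]]) -> List[TrackedMemory]:
    """Order (memory, relevance) pairs by relevance weighted by earned utility."""
    ranked = sorted(
        candidates,
        key=lambda pair: pair[1] * pair[0].utility,
        reverse=True,
    )
    return [memory for memory, _ in ranked]
```

A memory with relevance 0.8 outranks one with 0.7 on day one. After the first misleads once and the second helps once, their weighted scores become 0.32 and 0.42, and the order flips without anyone retuning the retriever.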

The thesis: The integration is the contribution

Here is the part worth sitting with. Almost every property above exists, somewhere, in a paper or a product. Bi-temporal storage exists. Reflection exists. A skill library exists. Confidence scoring exists. Idle consolidation exists. What does not exist, anywhere I have found, is a single architecture that holds all of them at once, the way a brain does, with each one feeding the next: bi-temporal history giving consolidation something to integrate, provenance giving confidence something honest to weigh, metamemory telling the idle process where to dig, and the outcome signal sharpening everything it touches.

The pillars are not a checklist. They are a loop. And the loop is the point.

In practice: What this buys you

I am not going to walk you through how we built ours. I am going to tell you what it did. The week we put a memory architecture shaped like this underneath our agents, they stopped leaving themselves half-finished breadcrumbs they could never find again. They stopped re-deriving the same procedure every session. They stopped asking me to explain the same context twice. I have run them for weeks at a stretch since, and the thing that changed was not the model. It was that the model finally had a memory built like the only kind of memory we have ever known to work.

You model the system after what it is in the real world. For memory, that means the brain. The field has spent three years rediscovering its parts one at a time. The interesting work begins the moment you stop collecting parts and start building the whole organism.