Learning Library

← Back to Library

Rethinking Memory in AI Agents

Key Points

  • Agentic context engineering, which focuses on how AI agents manage memory and state, is the most critical yet misunderstood topic in current AI development.
  • Many developers incorrectly treat “context” as a large prompt window and “memory” as a simple vector store, overlooking that true agent memory is a dynamic system that stores, filters, and evolves actions.
  • Recent papers (e.g., Google’s ADK) propose tiered memory architectures—working context, session memory, long‑term memory, and artifacts—where prompts are compiled on the fly rather than blindly accumulated.
  • The ACE (Agentic Context Engineering) paper demonstrates that static prompts and one‑shot fine‑tuning fail for long‑horizon tasks, emphasizing the need for adaptive prompts, instructions, and memory that evolve via execution feedback.
  • Practical implementations, such as the Manus approach, show that long‑running agents succeed by aggressively reducing context size and offloading irrelevant data, turning memory management into a first‑class runtime concern.

Sections

Full Transcript

# Rethinking Memory in AI Agents

**Source:** [https://www.youtube.com/watch?v=Udc19q1o6Mg](https://www.youtube.com/watch?v=Udc19q1o6Mg)
**Duration:** 00:20:19

## Sections

- [00:00:00](https://www.youtube.com/watch?v=Udc19q1o6Mg&t=0s) **Understanding Agentic Context Engineering** - The speaker highlights common misconceptions about AI agent memory versus prompt context, explains why larger context windows haven't solved memory issues, and outlines the key concepts and use cases of proper agentic memory management.
- [00:04:03](https://www.youtube.com/watch?v=Udc19q1o6Mg&t=243s) **Tiered Memory Architecture for LLM Agents** - The speaker outlines a four‑layer memory system—working context, session logs, durable searchable memory, and artifact references—that minimizes the LLM's context window while enabling arbitrarily large, structured state for long‑running autonomous agent loops.
- [00:08:07](https://www.youtube.com/watch?v=Udc19q1o6Mg&t=487s) **Offloading State & Isolating Sub‑Agents** - The speaker advises keeping model context lean by writing tool results to disk and using a small orthogonal toolset, and recommends using specialized sub‑agents with isolated state to prevent context explosion and hallucinated teamwork.
- [00:12:19](https://www.youtube.com/watch?v=Udc19q1o6Mg&t=739s) **Avoiding Common LLM Agent Pitfalls** - The speaker warns that excessive error messages, overly large tool sets, anthropomorphizing agents, and static prompts dilute performance and observability, advocating lean, well‑structured memory and adaptable systems for stable agent operation.
- [00:17:02](https://www.youtube.com/watch?v=Udc19q1o6Mg&t=1022s) **Memory Layer Enables Scalable, Auditable Multi-Agent Systems** - The speaker outlines how a well‑designed memory layer supports coordinated multi‑agent orchestration, deep reasoning over massive artifacts, full auditability for regulated domains, and sub‑linear cost scaling through intelligent caching and compaction.
- [00:20:15](https://www.youtube.com/watch?v=Udc19q1o6Mg&t=1215s) **Essential Step for Agent Success** - The speaker emphasizes that a crucial step cannot be omitted if agents are to operate effectively, concluding with a well‑wish.

## Full Transcript
The most critical topic in the world today is agentic context engineering, or how you deal with memory in AI agents. We are overdue for a deep dive on this. So I'm going to go into three key papers that were recently published on agentic context engineering. I'm going to tell you how it works, how people commonly misbuild or mischaracterize their agentic memory systems, and then we're going to talk about the use cases that you can only unlock with agentic context engineering. So let's dive in.

First, people misunderstand memory. When we say context, people often think of a giant prompt window. And when we say memory, they often think, well, that has to be RAG or vectorized embeddings in a database. Really, for agents, memory is the system. The prompt is not the agent. The LLM by itself is not the agent. The state, meaning how the agent's actions are stored, transformed, filtered, reused, and evolved, is the entire difference between a toy demo and something that handles real work. And we misunderstand that.

The last two years have given us longer context windows, and they've given us much, much smarter models. But they did not solve the memory problem. In fact, they intensified it. The naive mental model is that as contexts get bigger, agents get more capable. But what has actually happened is that attention has become scarce, logs have ballooned, and irrelevant history so often drowns out critical signals when we talk about agentic memory. And because we don't handle our memory correctly, performance has often fallen as tasks get longer. That's not the fault of the LLM; it's the fault of our memory construction. So this is for us to shift.
We have to stop trying to stuff everything into a context window, stop assuming everything is RAG, and start engineering memory as a first-class runtime environment. Google's ADK showed the architectural fix here; this is one of the papers that I'm going to be talking about. It's a tiered memory system where you have working context, you have sessions, you have memory, and you have artifacts. So the prompt is dynamically compiled at each step. It's not just accumulated blindly. It's the first mainstream articulation where context is truly computed, not just appended.

The ACE paper showed the adaptive fix side of things: how prompts, instructions, and memory must evolve through execution feedback. ACE, by the way, is the agentic context engineering paper that Anthropic put together. So static prompts and one-shot fine-tunes are not going to survive long-horizon tasks. Many people ask, "Do I need to fine-tune my agent for this or that?" You don't. Agents need systems that update their strategies without collapsing into vagueness. And the system is what matters: the combination of the prompts, instructions, and memory.

Meanwhile, Manus published a paper that showed a very practical fix. Long-running agents only work when they aggressively reduce context, when they offload their heavy state into a file system or a virtual machine, and when they isolate sub-agent scope very cleanly. Without this, long tasks just end up imploding under the bloat of logs, or of tool noise, or of instruction drift. And that's part of what degrades agent performance. So when you put all three of these together, the Anthropic piece, the Google piece, the Manus piece, you get the first coherent blueprint for agentic context engineering in late 2025.
So if I were to summarize it out, and we're going to talk about each of these, we need to talk about memory-first design, context as effectively a compiler output, retrieval over pinning (we'll get into that), schema-driven summarization, offloading heavy state and what that looks like, how you isolate agent scopes effectively, and evolving playbooks that sharpen over time. So let's jump into it.

The first scaling principle I want to talk about is that you should treat context as a compiled view, not as a transcript. Every LLM call should be a freshly computed projection against a durable state. What's relevant now? What instructions apply now? Which artifacts matter now? Which memories should I surface now? You're computing that at runtime. You're not just assuming a durable state that's always the same. This prevents signal dilution. So instead of dragging the last 500 turns of the conversation into the context window for the agent, you rebuild the minimal, very relevant slice that preserves task continuity, so you don't flood the LLM's attention. It's the only way to make multi-hour agent loops work. People often assume that they can get away without this, and you can't.

Principle number two: you need to build a tiered memory model that separates storage from presentation. Working context is a minimal per-call view; that's what we talked about. Sessions are much more like structured event logs for the whole trajectory of action. Memory is durable, searchable insight that's extracted across multiple runs. Meanwhile, artifacts I would define as large objects that are referenced by handle or by tag. They're not just pasted in.
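As a rough sketch of these first two principles, here is what a tiered store plus a per-call context compiler could look like. All names and fields below are my own illustration, not taken from the ADK paper:

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Durable tiered state; none of it is sent to the LLM wholesale."""
    sessions: list = field(default_factory=list)   # structured event log of the run
    memory: list = field(default_factory=list)     # durable, searchable insights
    artifacts: dict = field(default_factory=dict)  # large objects, stored by handle

    def log_event(self, kind, payload):
        self.sessions.append({"kind": kind, "payload": payload})


def compile_context(state, task, relevant=None, max_events=5):
    """Freshly compute a minimal working context for one LLM call.

    Nothing is inherited by default: only the task, the last few session
    events, explicitly retrieved memories, and artifact handles (never
    artifact bodies) are projected into the prompt.
    """
    lines = [f"TASK: {task}"]
    for event in state.sessions[-max_events:]:
        lines.append(f"EVENT[{event['kind']}]: {event['payload']}")
    for insight in (relevant or []):
        lines.append(f"MEMORY: {insight}")
    for handle in state.artifacts:
        lines.append(f"ARTIFACT (by handle): {handle}")
    return "\n".join(lines)
```

The point of the sketch is the shape: the context window is a projection computed at call time, while `sessions`, `memory`, and `artifacts` can grow arbitrarily large behind it.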
So when you define these four correctly, and you have a truly tiered memory model across working context, sessions, memory, and artifacts, you can separate things out so that the context window stays small while the state, the overall memory system, can grow arbitrarily large. It can grow very rich, and that mirrors traditional computer architecture, to be honest with you. We have the idea of a cache, RAM, and a disk drive because the same bottlenecks reappear in LLM agents. So why reinvent the wheel? Let's just apply it correctly in this context.

Principle number three: scope by default. The agent should pull memory when needed, not inherit everything. So default context should contain nearly nothing. I'm going to say it again because almost no one says this: default context should contain nearly nothing. Retrieval then becomes an active decision. The agent chooses when to recall past steps. It chooses when to fetch artifacts. It chooses when to load additional details. This keeps attention focused, it avoids context rot, and it makes long-horizon tasks very feasible, because old information is never passively carried forward.

Principle number four: retrieval beats pinning. Long-term memory needs to be searchable, not pinned and permanent. Attempts to keep everything in context tend to fail because attention constraints bite. If you have a very large context window, like a million tokens (and we'll see even larger ones), it's tempting to just stick everything in the context window, but your retrieval accuracy tends to drop. So treat memory as something that the agent will query on demand, with very clear, relevance-ranked, structured information.
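A toy illustration of retrieval over pinning; the word-overlap scoring is a stand-in of my own, where a real system would use embeddings or BM25:

```python
import re


def _tokens(text):
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def recall(memory, query, top_k=3):
    """Query durable memory on demand instead of pinning it all in context.

    Returns a small, relevance-ranked slice; the context window receives
    only this slice, never the whole history.
    """
    q = _tokens(query)
    scored = [(len(q & _tokens(entry)), entry) for entry in memory]
    scored = [(score, entry) for score, entry in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]
```

So `recall(["never push to prod on friday", "user prefers concise answers"], "can I push to prod today?")` surfaces only the deployment constraint, and everything else stays out of the window.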
It's always the result of a search, not the passive accumulation of history. This is how agents can differentiate a critical constraint from five days ago from noise from five minutes ago. Otherwise, the agent is going to be very recency-biased, very confused in the context window. You have to give the agent clear instructions for retrieval, and enable it to retrieve, in order to get past this idea that you can just sort of throw long-term memory in if you compress it enough.

Principle number five: summarization needs to be schema-driven, and it needs to be structured and ideally reversible. People don't talk about that part. Naive summarization will turn multi-step reasoning into a very vague, glossy, overarching soup. It strips away decision structures; it just kind of compresses it all in. But if you compact intentionally, if you compact using schemas, using templates, using event types, very intentionally, so that you preserve the essential semantics of whatever you have a memory about, then you're going to be dropping surface detail, but you know for a fact that your structure, your schema, guarantees that the relevant parts of the memory are preserved. That is what makes long-run context maintainable. It's what makes it debuggable, because you can inspect not just what was summarized but how it got summarized. Almost no one is talking about this, by the way; it's a really important piece that I don't see in practical agentic conversations.

Principle number six: please, please, please offload your heavy state to tools, to file systems, to sandboxes. Do not feed the model raw tool results, especially at scale. You want to write them to disk and pass pointers. If you can, don't expose 20 overlapping tools.
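A minimal sketch of the write-to-disk-and-pass-pointers pattern; the directory layout and handle format here are my own invention:

```python
import hashlib
import json
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")  # hypothetical sandbox location


def offload(result, label):
    """Write a raw tool result to disk and return only a small handle.

    The model sees the handle, the size, and a short preview, never the
    full payload; a file tool can read slices of it later on demand.
    """
    ARTIFACT_DIR.mkdir(exist_ok=True)
    blob = json.dumps(result)
    digest = hashlib.sha256(blob.encode()).hexdigest()[:12]
    path = ARTIFACT_DIR / f"{label}-{digest}.json"
    path.write_text(blob)
    return {"handle": str(path), "bytes": len(blob), "preview": blob[:80]}
```

Only the returned dict ever enters the context; the full result stays on disk until the agent explicitly asks for part of it.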
Expose a small, orthogonal set of tools, like a shell or a browser or file operations, and then let the agent compose its own workflows. That keeps the context really lean, and it reduces the cognitive burden on the model, which unlocks much more complex chains of behavior. You might think that without 20 overlapping tools you can't get to complex chains of behavior, but it's actually the opposite. When you have a very clearly orthogonal set of tools, the agent is more free to understand what's in the box, and it can allocate more compute toward those workflows.

Principle number seven: use sub-agents to isolate state and scope, not to mimic human organizational charts. Sub-agents should stop context explosion by giving different actors their own working context and responsibilities. Planner, executor, and verifier are all classic agent types, and they need to have narrow, scoped views and communicate through structured artifacts, not sprawling transcripts they pass back and forth. This should eliminate a lot of the cross-talk, the reasoning drift, and the hallucinated teamwork that plagues a lot of naive multi-agent designs. Notice that planner, executor, and verifier are not human job titles. Do not create your agents with human job titles. There is no point. Think of it as an agentic task, and don't throw your human assumptions onto it.

Principle number eight: design the prompt layout around caching and prefix stability. I'll explain what I mean. A stable prefix, like the identity, the instructions, and the static strategy of the agent, should rarely change, so that the cache can be reused across turns. Only the variable suffix, the current user input and the fresh tool outputs, should change.
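Concretely, prefix-stable prompt assembly can be as simple as this sketch; the prefix text itself is illustrative, not a recommended system prompt:

```python
# Stable prefix: identity, instructions, tool list. Kept byte-identical
# across turns so provider-side prompt caching can reuse it.
STABLE_PREFIX = (
    "You are a research agent.\n"
    "Rules: cite sources; write heavy results to files; confirm before destructive actions.\n"
    "Tools: shell, browser, read_file, write_file.\n"
)


def build_prompt(user_input, fresh_tool_output=""):
    """Stable prefix first, variable suffix last; only the suffix changes."""
    suffix = f"USER: {user_input}\n"
    if fresh_tool_output:
        suffix += f"TOOL OUTPUT: {fresh_tool_output}\n"
    return STABLE_PREFIX + suffix
```

Because every prompt in the loop starts with the identical byte sequence, a cache can skip re-reading it on each turn.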
This transforms multi-step agent loops from a very lengthy 200 milliseconds per step, because the model is reading a gigantic prompt layout with lots of prefix instability, into a very stable, thin prefix that can be read quickly, with only the suffixes, like the current user input, changing. So the schema is clean. It's very narrow, very clean, and only the output is changing in a way the model can understand. That can drop your latency 10x, from 200 milliseconds to 20 milliseconds or something like it. And that can drive down cost dramatically, but it also enables more reliable outputs.

Principle number nine: let the agent's own strategies evolve. Static prompts will freeze the agent until you evolve the prompt; it's frozen at version one. Anthropic's approach shows that strategies, memories, and instructions should instead update through execution feedback: small, structured increments that sharpen capabilities instead of overwriting them. That allows agents that actually learn from doing, not from human tinkering. Fundamentally, if your agent is allowed to update its strategy, its memory, and its instructions as it learns, you unlock the possibility of an agent that learns to do its job better.

What are some common pitfalls that break if people do not apply these? And I do mean common. I have done some looking. I have built agents. This often happens. Number one, dumping it all into the prompt. That leads to signal dilution. It leads to rising costs and degraded reasoning. Agents literally become less accurate when you do this. Number two, blind summarization that erases any kind of domain insight.
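The antidote to that pitfall is the schema-driven compaction from principle five. A minimal sketch, with hypothetical event types and fields of my own choosing, might look like:

```python
# Hypothetical event schema: each event type keeps only the fields that
# carry its semantics, so compaction drops surface detail, not decisions.
SCHEMA = {
    "decision": ["choice", "reason"],
    "constraint": ["rule"],
    "tool_call": ["tool", "outcome"],
}


def compact(events):
    """Schema-driven compaction of a session log.

    Unlike 'just summarize it', the schema guarantees that decisions,
    constraints, and outcomes survive, and the result records how the
    compaction was done, so it stays inspectable.
    """
    kept, dropped = [], 0
    for event in events:
        fields = SCHEMA.get(event.get("type"))
        if fields is None:
            dropped += 1  # untyped chatter is safe to drop
            continue
        kept.append({"type": event["type"],
                     **{f: event.get(f) for f in fields}})
    return {"schema_version": 1, "dropped": dropped, "events": kept}
```

The compacted output is smaller than the raw log, but every decision and constraint is still there, field by field, along with a count of what was dropped.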
If you just say "summarize it" and you don't have a schema, you don't have structure, you're losing constraints, you're losing your edge cases, you're losing any kind of causal relationship, the things a capable agent needs to do its job. And this produces context collapse, where the agent becomes increasingly generic rather than doing specific, useful work.

Pitfall number three: treating long context windows as if they're unlimited RAM. Bigger windows actually increase noise, and they increase confusion, unless they're paired with relevance filtering. More tokens does not necessarily mean more clarity, and it often means more distraction. People will treat context windows like the trunk of a car, as if they can just dump stuff in there. Don't do that.

Pitfall four: using the prompt as an observability sink. People will stick debug logs in it. They'll sink error messages into it. They'll put giant tool outputs into the prompt, and that will just pollute attention. Humans need the observability, but the agent just drowns in all of that. So please construct the system for stable agent performance; we will find other ways to get to observability. And actually, a well-constructed memory system is very observable.

Pitfall number five: tool bloat. If you give the model many subtly different tool options and a giant tool schema, you might think you're very sophisticated, but all you're doing is increasing error rates and slowing your system down.

Pitfall number six: anthropomorphizing the agents.
If multiple agents share the same transcript, and they're all trying to talk, and they're trying to assume human roles because you gave them human jobs, you're going to have reasoning drift. You're going to have duplicated effort. You're going to have compounding hallucinations. You want to make the system less fragile. And I think there's probably a lesson for us humans here too: we are designing systems where the agents just have the context needed to do the job. Maybe there's a clue for us at work as well, in how our roles are evolving. Just a thought.

Seventh pitfall: static prompt configurations that never change. People build in no accumulation of knowledge, no sharpening of heuristics. They don't design the system to evolve. You essentially rebuild the agent from scratch every run, and you don't give the agent room to adjust its prompting or thinking as it grows in intentional ways. This is a somewhat sophisticated implementation, but good multi-agent implementations give the system room to learn intentionally. They don't just have static configurations that never change. And as long as you're documenting that learning, you have room to control it, observe it, change it, and manage it safely.

Number eight: if you overstructure the harness, the model will feel boxed in. If a frontier model produces no improvement when it's swapped in, your architecture is usually the bottleneck. Basically, rigid harnesses can kill emerging capability. There's a fine line between that and a useful harness, one with orthogonal tools and a very clear, injectable context prompt that you can configure as needed, with schemas that evolve, that gives the model room to demonstrate what it can do.
If your harness is that locked down, you probably won't see much difference from model to model, and you may naively blame the model. But really, your structure is so narrow and boxed in that there's only one thing the model can do.

Pitfall number nine: ignoring caching and prefix discipline. Remember, I said you have to have caching and prefix discipline. If you don't, if you don't have clean prefix discipline in your prompting schemas, you're going to have really unpredictable latency, because the model has to reprocess that prompt to figure out what's going on. It makes it very difficult to scale as tasks get longer.

So now we've gotten through the pitfalls, and we've gotten through some of the principles for designing agentic context systems with good context engineering and good memory. What do you unlock? What can you get done once you're able to design these systems appropriately? Number one, this unlocks really long-horizon autonomy. If you want to be in the business of multi-hour research, of web browsing, of simulating runs over time, say of repo audits or multi-stage code generation where agents stay coherent, you have to be in the business of giving them relevant slices of their state rather than just throwing all the context at them at once. And so this enables long-horizon autonomy.

Use case number two: this enables true self-improving agents. Agents that get better over time will be able to log and update their strategies, their heuristics, and their domain knowledge, if you construct your memory systems appropriately. And if you scope them right, they will not update those in ways that overscope the agent.
You can still constrain agentic scope, but allow the agent to execute within that scope with increasing intelligence as it learns from each run. This isn't training with weights; we're not changing the weights of the model. This can happen entirely in your memory and instruction layers, if your agents are clearly instructed to record and learn from what they did.

Use case number three: this enables cross-session personalization that actually scales. If you want persistent profiles that remember user preferences, constraints, prior outcomes, or behavior patterns in your agents, you don't have to balloon out the per-call context to get that done, provided you're constructing the memory state correctly, because you just inject the particular slice that matters.

Number four: this enables real multi-agent orchestration, without cross-talk and without drift. You can have a planner, a researcher, an executor, a validator, and a tester all collaborating through structured artifacts, and you don't just have chaotic chatter between them that leads to context poisoning.

Use case number five: you can enable deep reasoning over very large bodies of work. Agents can analyze entire repos, data sets, PDFs, or logs by treating them as artifacts instead of tokenizing all of them, because you can structure sampling, retrieval, and clean summarization in ways that decouple reasoning from the raw size of the repo.

Number six: you can enable auditable, compliant, enterprise-ready agentic systems. This is an outcome, perhaps, but I think it's worth calling out, because memory is what stands in the way of those systems. You can get full reconstructibility of what the model saw and why it acted.
Session logs, compaction events, and memory updates would all be traceable, and that's all happening at the memory layer. If you don't construct your memory layer correctly, it becomes difficult to trace in ways that are critical for finance, for legal, for medical, or really for any production system.

Use case number seven: you need cost-stable agent operations. You need cost growth that isn't linear; in fact, it should be sublinear. And you can get that if you reuse your caches, if you compact views intelligently, and if you keep your working context small. Basically, every token counts if you construct your memory properly. That gets you to things like always-on agent services where you're not just scaling your costs. And that matters if you're trying to really scale agents across the enterprise.

Number eight: you can get to domain-specific agent OS environments. Finance agents can have long-term risk awareness if their memory is constructed correctly. Coding agents can have full workspace history if the memory system is constructed correctly. Medical agents can have durable patient state if the memory system is constructed correctly. Any place where you have persistent workspaces, you can get longer-term intention for agents. And it's not really a function of waiting for LLMs to get smarter.
The LLM can understand the long-term history of a project, or the long-term risk factors, as I called out with finance, or the long-term opportunity factors, because the memory architecture is there. So when I wrap all of this up, what I want to call out is this: think of it as the tradecraft, the lessons learned that we have collectively put together, at the major model makers, in my own practice building agents, and among others building agents in production systems. This represents the best of what we know about constructing memory. And it's critical to get memory right if you want agents that work in production at scale. That's why I wanted to do a really deep dive on context engineering for agents. I don't think there's a substitute here. There's no magic bullet. You have to dig in and understand how memory and context engineering actually work. I did a much longer write-up on this on the Substack, and I encourage you to dive in. It's something we can't skip if we want agents to really work. Best of luck.