
RAG: Best Practices & Pitfalls

Key Points

  • Retrieval‑augmented generation (RAG) promises to turn LLMs into real‑time, data‑driven assistants, unlocking a market projected to grow from ~$2B today to over $40B by 2035.
  • RAG tackles core LLM flaws—knowledge cut‑offs, hallucinations, and lack of access to proprietary data—by retrieving relevant documents, augmenting the query with those facts, and then generating answers grounded in reality.
  • Adoption is already widespread: roughly 80% of enterprises use RAG (preferring it to fine‑tuning), and 73% of AI‑focused firms cite up‑to‑date data access as essential.
  • Real‑world wins include LinkedIn’s dramatic cut in support‑ticket resolution time, demonstrating how RAG acts like an “open‑book exam” for an LLM, while other companies have over‑invested and later regretted poorly scoped implementations.
  • Successful RAG systems rely on high‑dimensional embeddings to match semantic meaning, careful scaling from prototype to millions of queries, and strict avoidance of the common pitfalls that have derailed many projects.

Sections

Full Transcript

**Source:** [https://www.youtube.com/watch?v=z8-0INxN_Hg](https://www.youtube.com/watch?v=z8-0INxN_Hg)
**Duration:** 00:23:24

## Sections

- [00:00:00](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=0s) **RAG's $40B Promise & Pitfalls** - Retrieval‑augmented generation as a solution to LLM limits—offering real‑time, company‑specific knowledge—with implementation steps, success stories, and common mistakes that can cause costly failures.
- [00:03:21](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=201s) **Effective Text Chunking Strategies** - Raw text must be divided into overlapping chunks—using fixed‑size, sentence‑based, semantic, or recursive methods aligned with business goals—to preserve meaning and improve retrieval; vector search relies on cosine similarity of semantics rather than simple keyword matching.
- [00:06:36](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=396s) **RAG Maturity Levels Overview** - Escalating RAG approaches—hybrid keyword‑semantic search; multimodal search across text, images, video, and audio; agentic multi‑step reasoning; and full enterprise‑grade deployment—with their increasing accuracy, speed nuances, implementation complexity, and operational requirements.
- [00:09:48](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=588s) **Graph RAG and Hybrid Search** - How graph‑based RAG preserves entity relationships, how keyword and semantic searches combine through rank‑voting hybrid methods, and why multimodal data such as images and tables needs careful handling.
- [00:13:07](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=787s) **RAG Memory vs Context Windows** - RAG can serve as an advanced memory manager that retains conversation details; OpenAI's seemingly larger context window results from clever memory tricks, whereas Claude has a hard, shorter memory limit.
- [00:16:28](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=988s) **Scaling Secure Enterprise Vector Search** - Building large‑scale, cost‑optimized vector‑search systems: production‑like testing, open‑source options, early pipeline and embedding versioning, then sharding, caching, model cascades, and comprehensive security and compliance reviews.
- [00:20:21](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=1221s) **Future of Retrieval‑Augmented Generation** - RAG will become more agentic, integrate larger context windows and memory advances, remain essential for precise data retrieval, drive market growth, and see democratized fine‑tuning by 2026.
What if ChatGPT had perfect memory and never hallucinated? That is the $40 billion promise that RAG, retrieval-augmented generation, is making to the industry. In this video, you're going to get a one-stop shop that unpacks the current debates and best practices on RAG, how companies are implementing it, a few success stories, and, not to be left out, a few places where you should not use RAG. Because yes, there are companies that have absolutely over-invested in RAG and profoundly regret it. So, let's dive in.

The problem is fundamentally that LLMs are brilliant but jagged. I've talked about this before. They have fatal flaws. They have knowledge cutoff dates, so their knowledge is frozen in time. They have hallucinations, or confident lies. And they obviously can't access your company's data, which in most cases companies very much do mind. So the solution preview is basically this: RAG plus an LLM, a large language model, gives your AI a real-time research assistant. What you're going to learn is how to build your RAG system, how to scale from a prototype up to true scale, meaning millions of queries, and how to avoid the pitfalls that kill so many RAG projects. This is all based on actual deep dives and research I've done on RAG. It's very comprehensive, so bookmark this one and watch it in chunks.

First, why RAG changes everything. We're going to get into the stakes, and then we're going to get into how it actually works. RAG is currently a roughly $2 billion market, although it's exploding so fast it's hard to measure, and it is on track for $40 billion-plus by 2035. Many enterprises use RAG; the loose running number is around 80%. And they use it over fine-tuning because they perceive it as easier.
Fine-tuning a model is perceived as more difficult, at least right now. And of those engaging with AI, 73% say that they need real-time data access. By the way, if you're wondering where I'm getting these statistics, I have a list of links in my Substack that you can follow to find all of the actual stories that underlie them. As an example of a success story, LinkedIn had a significant reduction in support-ticket resolution time because of RAG, because RAG enabled the system to know their business. It's like an LLM taking an open-book exam instead of a closed-book exam. And yes, that story is public.

So, how does it work? What's the magic? Retrieval is searching the knowledge base for relevant info. Augmentation is combining the query with the retrieved facts. And generation is an LLM creating an answer grounded in real data.

So how does RAG really work? Number one, embeddings. Text is embedded as numbers in high-dimensional space. For example, the phrase "refund policy" might be embedded as a series of numbers, a vector. The key insight is that similar meanings cluster together mathematically. If you've watched my previous videos on how large language models work, it's the same darn thing: you're taking the words and encoding them as numbers in high-dimensional vector space. And if you want to know how many dimensions, one of the common choices right now is 1,536 dimensions. That's a lot of dimensions. So if you have the embeddings, you might wonder: is that enough? Can we just feed the text in raw? The answer is no. You want to chunk.
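Before we get to chunking, the embedding idea above can be sketched in a few lines. This toy uses bag-of-words counts as a stand-in for a real embedding model (the example texts are made up); the cosine-similarity geometry is the same either way.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. Real RAG systems use a learned
    # embedding model (often 1,536 dimensions); the geometry below is the same.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    # cos(theta) = (a . b) / (|a| |b|); 1.0 means identical direction.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = embed("refund policy for returns")
refund_doc = embed("our refund policy covers returns within 30 days")
shipping_doc = embed("shipping rates for international orders")

# The semantically related document scores higher than the unrelated one.
print(cosine_similarity(query, refund_doc), cosine_similarity(query, shipping_doc))
```

With real embeddings the vectors are dense and learned, but retrieval is still this comparison: embed the query, embed the chunks, rank by cosine similarity.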
And by chunking, we mean breaking the large blocks of text you give the system into pieces, in ways that help the LLM understand relationships and semantic meaning. Bad chunking ruins so many RAG projects, so pay attention. You have four different strategies here. You can have fixed-size chunks, which can be dangerous because they can cut off mid-sentence. You can have sentence-based chunks, which respect boundaries. You can have semantic chunks, which group by topic. And you can have recursive chunks, which group by hierarchical structure. The key is making sure you understand what you want from a business perspective and driving your chunking strategy off of that.

You should plan to have overlap between chunks. You don't want a hard zero-overlap cutoff, because then the LLM never gets the chance to find something in a neighboring chunk that it might have run across in the original chunk. If you give it a little overlap, you maximize the odds of the AI finding what it needs in a really complicated haystack.

When we're looking at things in vector space, we are not keyword matching. That's a common misunderstanding. People will say, "Oh, the LLM is looking for a keyword match." No, it's not. It's looking to match meaning. It's computing what's called cosine similarity and finding the nearest neighbors in vector space. As an example, a customer might type the query "How do I get my money back?" That might find "refund processing" at 0.95 similarity and "return policy" at 0.93 similarity, while "shipping info" at 0.38 similarity is not retrieved. That one's not a fit.
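The overlap advice can be sketched with the simplest of the four strategies, fixed-size chunking. The chunk and overlap sizes here are illustrative; the point is that each chunk repeats the tail of the previous one, so a fact that straddles a boundary survives intact in at least one chunk.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    # Fixed-size chunking with overlap: step forward by (chunk_size - overlap)
    # so consecutive chunks share `overlap` characters at each boundary.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = " ".join(f"sentence-{n}" for n in range(100))
chunks = chunk_text(doc)
# The last 50 characters of chunk 0 reappear at the start of chunk 1.
print(chunks[0][-50:] == chunks[1][:50])  # True
```

Sentence-based, semantic, and recursive strategies replace the blind character slicing with smarter boundaries, but the overlap principle carries over to all of them.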
Now, if you want to think about how you handle retrieval, you can actually re-rank based on the actual queries you get back, and you can boost accuracy significantly for business purposes if you do re-ranking. It's an advanced technique. But in this situation, if you want the system to retrieve shipping info along with "How do I get my money back?", because maybe the customer needs to ship their item back, then you can re-rank and get to that in what we'll call post-processing, for lack of a better term.

Okay, so how do you build a RAG system? Very simply, I would recommend going to LlamaIndex. You load up your documents, which LlamaIndex gives you a way to do from the command line, create an index, and query. And it's easy to get a stack for this. It's not expensive, and it's not hard: simple RAG is quick. You can use LangChain; it's a Swiss Army knife that does everything. You can use LlamaIndex; it's optimized specifically for RAG. Other vector DBs (Pinecone, Chroma, Qdrant) all work. And if you want something as simple as "What's the warranty period?" answered from a manual or handbook, you're going to get "2 years for EU purchases" or a similarly correct answer very, very quickly. In 2025, it's not hard to build a simple RAG system. The challenge is that most people don't just want a simple one.

If you want a level-one basic Q&A, you can get that done in about a week, even at a company. Simple vector search, single source, a couple of seconds of latency, internal FAQs only, super fast. It's basically a slightly fancier custom GPT. Level two is hybrid search, where you combine a keyword match with a semantic-meaning match. That's a little more complicated to build, but you definitely get better accuracy.
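One common way to merge a keyword ranking with a semantic ranking in a hybrid setup is reciprocal rank fusion, a standard fusion method sketched here with made-up document IDs (other fusion schemes work too):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every result list it appears in;
    # documents that rank well in several searches float to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["error-code-doc", "changelog"]          # exact-match search
semantic_hits = ["faq", "error-code-doc", "blog-post"]  # meaning-based search
print(reciprocal_rank_fusion([keyword_hits, semantic_hits])[0])  # error-code-doc
```

Because "error-code-doc" appears near the top of both lists, it wins the combined ranking even though neither search alone had the full picture.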
It can be faster in some cases because you're handling keywords directly, and it can be helpful for handling edge cases. It's much more complicated to implement, though. And it gets more complicated from there. Level three: let's say you want to search text and images and video and audio, that is, multimodal RAG. It can be quite accurate, and you can get it to be quick, but you are going to have to put a ton of work in on the data side and the chunking side. You think chunking is complex with text? Wait till you're trying to come up with a chunking strategy for text and images and video and audio. One example would be Vimeo's video search with timestamps. That's an interesting one.

Level four, agentic RAG. That's where you actually have an agent do multi-step reasoning and self-improve on what it finds. It's going to be a longer wait, but you can get a more accurate response. You not only have to build a full RAG system, you have to build an agent over the top. And then finally, if you want enterprise production, there's a lot of security, a lot of compliance, a lot of monitoring. You have performance expectations around how fast the thing needs to respond and how it handles load when there are multiple queries, plus all the additional software engineering that goes into putting something on a million boxes. That does not go away. That is still complicated. No AI will magically deploy software that lives on a million or ten million or a hundred million boxes.

Okay. So, we've talked a fair bit about data. Let's do a little more of a deep dive there, because data is the key to a good RAG system.
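Before the data deep dive, here is the basic load, index, and query loop sketched end to end with no framework. In practice LlamaIndex or LangChain replaces most of this and a real embedding model replaces the toy bag-of-words scoring; the document names and texts are made up.

```python
import math
from collections import Counter

DOCS = {  # stand-ins for loaded, cleaned documents
    "warranty.md": "the warranty period is 2 years for eu purchases",
    "refunds.md": "refunds are processed within 14 days of return",
    "shipping.md": "standard shipping takes 3 to 5 business days",
}

def embed(text):
    # Toy bag-of-words vector; a real pipeline calls an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na, nb = (math.sqrt(sum(v * v for v in x.values())) for x in (a, b))
    return dot / (na * nb) if na and nb else 0.0

INDEX = {name: embed(text) for name, text in DOCS.items()}  # "create an index"

def retrieve(query: str, top_k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(INDEX, key=lambda name: cosine(q, INDEX[name]), reverse=True)[:top_k]

def augmented_prompt(query: str) -> str:
    # Augmentation step: prepend retrieved context before handing off to an LLM.
    context = "\n".join(DOCS[name] for name in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(augmented_prompt("what is the warranty period"))
```

The generation step is the only part missing: in a real system the augmented prompt goes to an LLM, which is what grounds the answer in the retrieved data.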
When you are looking at documents for RAG, there are a few things you want to keep in mind, trusty tool tips that will help you go farther. Number one, PDFs often have terrible header and footer pollution. Have you ever copied and pasted from a PDF? That's how the system sees it: it will read those little footers and get confused, and it will read the weird headers and get confused. Next, OCR for scanned documents: are you sure the optical character recognition is correct? This is why Mistral released a dedicated OCR tool for scanning documents. It's difficult to get a good RAG system if you don't have good, clean, digital text. Tables are going to need special handling, because you have to encode spatial relationships. And you need to clean the boilerplate out of documents before thinking about chunking. Do not try to chunk a raw PDF. Get to clean text first. Get to clean markdown first.

Okay. Metadata can be a dramatically impactful choice for accuracy. If you add source, section, and date to each chunk, retrieval is going to be vastly improved. For example, with "policy updated March 20" attached, the system knows it's a 2024 update. And if it finds a 2025 update, it's probably going to choose the 2025 update if it understands you're looking for recency-based retrieval.

So what does this look like? Ten steps. Convert to text with the appropriate parser. Split it into sections. Remove the boilerplate, the crappy headers and footers. Normalize all the whitespace. Extract the section titles. Add the metadata. Chunk with overlap. Embed the chunks. Verify samples. And then iterate.
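A compressed sketch of several of those steps (boilerplate removal, whitespace normalization, metadata, chunking with overlap); the file name, boilerplate strings, date, and sizes are all illustrative:

```python
import re

def clean_page(page: str, boilerplate: set[str]) -> str:
    # Drop known header/footer lines, then normalize whitespace.
    lines = [ln.strip() for ln in page.splitlines()]
    kept = [ln for ln in lines if ln and ln not in boilerplate]
    return re.sub(r"\s+", " ", " ".join(kept))

def to_chunks(doc_text: str, source: str, section: str, date: str,
              size: int = 300, overlap: int = 60) -> list[dict]:
    # Chunk with overlap and attach metadata to every chunk, so retrieval
    # can filter and rank by source, section, and recency.
    step = size - overlap
    return [
        {"text": doc_text[i:i + size],
         "metadata": {"source": source, "section": section, "date": date}}
        for i in range(0, len(doc_text), step)
    ]

page = "ACME Corp Confidential\nRefunds are processed in 14 days.\nPage 3 of 12"
clean = clean_page(page, boilerplate={"ACME Corp Confidential", "Page 3 of 12"})
chunks = to_chunks(clean, source="policy.pdf", section="Refunds", date="2024-03-20")
print(chunks[0]["text"])  # "Refunds are processed in 14 days."
```

The remaining steps, embedding, verifying samples, and iterating, sit on top of output like this; the "verify samples" step is exactly reading a handful of these chunks by hand.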
That's how much work it is. And that is, frankly, for a fairly simple exercise. This is why I say RAG can get complicated. But we're going to get even more advanced, because this is one of those videos.

Let's talk about graph RAG. Traditional RAG is just isolated text chunks. Graph RAG preserves entity relationships as it encodes the data, and LinkedIn saw significantly better retrieval with knowledge graphs from graph RAG. Another hybrid approach that's interesting is a search deep dive: can you catch exact matches, like error codes, with a hybrid search that looks not only at vector space but also at, for example, the literal error codes? The best document often ranks at different positions in different searches, and it's sort of like ranked-choice voting: you're looking for the retrieval answer that scores highest across the different search methods in your hybrid approach, and maybe that's the number one. So maybe the error code we're looking for ranks really highly in our keyword search and not as highly in the semantic-meaning search, but it's still there because we used correct metadata when we chunked it. It all comes out in the wash, it comes out as number one combined, and that improves the accuracy.

For multimodal, you want to be thoughtful about how you handle the relationships between images, tables, and text. Invoices are a good example of this. They will often have tables, they'll definitely have text, and they may have images as well. You want to use something like CLIP for image embeddings, and you want to unify the index across all your modalities: a unified index for text and images and tables.
And if you do it right, you should be able to send a query like "Show me the revenue table from Q3," and it should retrieve both the image and the data, because the index is common across both modalities.

Okay, this is where I mention MCP. MCP is helpful because it ends up being like the USB port for AI: a universal protocol to enable AI data connectivity. It is super, super helpful for enabling systems to plug into and access data that they would not otherwise be able to get. And so a good system that has RAG over internal company data can also extend that search relatively easily to other data sources using MCP.

Let's get to memory management. If you think about it, part of the whole reason we got here is memory and why memory is a problem for AI. We have to get the memory right. Context windows are working memory. It's what every AI ships with. Often it's 100,000, 200,000, 400,000, maybe even a million tokens. Vector stores are long-term memory. We've talked about embeddings: long-term memory is effectively unlimited, with old material compressed and summarized. So if you think about how all of this relates, you can be in a position where you compress old turns of a conversation and summarize them in memory. You can retrieve previous conversation with RAG over the conversation itself. You can have multiple abstraction levels. And one good example is making sure you can encode enough of a previous long-running conversation to not forget key facts.

As an example, let's say you're ordering French fries, and you're talking with an AI bot about ordering them. It is 2025; that could happen. Maybe you're on DoorDash.
I don't know. But say you mention that the order you're working on doesn't have enough fries in it and you want extra fries, and that's in the second message you send. Then, 20 or 30 messages later, because you're having a great conversation with DoorDash, as we all do, it forgets. It forgets that you want French fries. That kind of visceral moment where it forgets the previous conversation is something almost everyone has experienced with AI, and you don't have to experience it with a RAG system, because the RAG system can effectively be used as an advanced memory manager to reduce that sense that the memory is just going to disappear as the context window moves along.

A lot of the fancy work that companies do to keep context windows open a long time basically boils down to this fancy memory management. This is one of the reasons why OpenAI feels like it has a larger context window even though it doesn't. They don't exactly reveal what they do, but basically they do some fancy work with memory management to keep the conversation flowing longer. Claude, on the other hand, has a pretty hard memory cap, and they aren't extending the conversation with a technique like this, at least right now. So you'll run into the "you've run out of memory" message on Claude really fast. And what's fascinating is that people think that means Claude has shorter context windows and shorter memory. But that's not true. OpenAI is fooling you with fancier memory management.

Okay, let's get to evals and testing. Four things I want to call out. Relevance: are we retrieving the right chunks? Faithfulness: is the answer based on actual sources? Quality: would a human rate it as correct? And latency: is this fast enough?
You'll have to set that bar yourself, but oftentimes it's sub-couple-of-seconds. You need to start by building an eval set, a question set for this RAG system that you will consider gold standard. Include edge cases; include things that are tricky. Don't make it easy. You want to measure both retrieval and generation: can it get the right material, and can it write it well? And you want to A/B test improvements in your RAG system. If you're going to move to hybrid search, take it seriously. One example: Notion A/B tested their RAG system when they made changes, and they could prove the improved value of search over time, so they were able to analyze their failures and fix their data and problems. That's another publicly available story.

Okay, I've given you some examples of how to do RAG. Let's talk about how RAG goes wrong. We've talked about chunking going wrong, breaking context mid-sentence. We've talked about LLMs missing info in big chunks and how things get lost in memory. Well, if you set up a bad RAG system, content can actually get lost in the middle because the LLM can't retrieve the info. So that's a way RAG can actually make the memory problem worse if you implement it badly. You have hallucination horror stories, where the system will make up facts despite the context being available. That can happen with poorly labeled context; it can happen for a variety of reasons. Number four, you can have a frankly incorrect vector DB setup. It can be very expensive. Number five, you can have stale data or bad data inside it: there's no update pipeline, the data gets out of date, and then it's useless. Number six, security leaks, PII exposure, compliance failures. It's just not fun. And number seven, mismatching on embeddings.
Using different models for the index versus the query can lead to complete gobbledygook. So, as you would expect, the prevention strategies make a lot of sense here. Always overlap your chunks. Test with production-like data. Let "I don't know" be an acceptable response; that really helps with hallucinations. Start with open-source or cheap options; that prevents you from committing to the wrong vector database. Just check it before you pour the concrete, right? Build update pipelines on day one; don't build them later. Have a security review before you start on the architecture. And track embedding versions so different embedding versions don't screw you over between index and query.

Okay. Now, let's get into some of the challenges that occur with very large systems, enterprise-scale systems scaling to 10 million queries. You have to start to shard your vector DB and replicate it. You have to cache popular queries. You have to figure out how to cascade models: you may want to expand a prompt with one model and then have a different model handle the prompt from there. Cost optimization will save you millions of dollars, because it's so expensive to run these things. This is where you'll be shaving models, figuring out the absolute smallest model you can use, and trading off different models in the system depending on the query. You're going to have a security deep dive like no other: access-control filtering, PII scrubbing, audit trails, compliance. HIPAA, GDPR, SOC 2, you name it; add an acronym. There's going to be a lot. Plan for it to take months, but it's worth it. Another example is RBC, the bank.
They built a RAG system for support agents, another publicly available story. It indexes policies, it indexes past tickets: faster resolution, better consistency. And they rolled it out internally at first and then to customers. It is possible to do RAG at scale.

Let's do just a little bit of a look at RAG versus agentic search as we come toward the end of this long introduction. Thank you for staying with me this far. RAG versus agentic search is a huge question. Fundamentally, RAG is a single retrieve-and-answer modality, whereas agents think and plan in multiple steps. They can be more accurate, but they're much slower and more expensive. So you want to use RAG for simple Q&A and documentation. You want to use RAG plus agents for complex reasoning, multi-source questions, and so on.

And I would be remiss, even in a video that's all about RAG, if I did not say when not to use it. I know of companies that have regretted their RAG implementations, because what they used RAG for was not company data, not something that the LLM could not get any other way. What they used it for was essentially a way to make the LLM temporarily smarter. And what they found, after a very expensive half-million to million-dollar implementation, was: oh no, we implemented RAG and the next general-purpose model was smart enough that it didn't matter; it had a big enough context window that it didn't matter. We still need RAG; it just needs to be intelligent. It needs to be smart. It needs to follow some of the best practices I've outlined here around how you handle data, how you chunk data, why you use it, and how you set it up. So here are some examples of when not to use RAG, things you can do to avoid making those kinds of mistakes.
Number one, check if the base model knows it or almost knows it. I mentioned that already. Number two, and this is more for a personal RAG system: if it's for stories or poems or creative writing, RAG just generally doesn't work well, because semantic meaning doesn't work the same way there. Number three, if you need it to be super, super fast, like gaming-system fast, don't bother with RAG. It's not going to work, ever, because you have to go and get the data, and that takes time. If you have highly volatile data, like stock-market tickers, don't use RAG. It's never going to work. If you have a high maintenance cost and you don't have a really clear benefit, like a small data set, don't use RAG. If you have relatively simple transformations, like basic calculations or basic formatting, there's not a lot to do; don't use RAG. It's not worth it. And if it's privacy-critical, you have to ensure you can't store user data; if you do, you're in trouble.

Now, if we look to the future and what's going to happen, I think there's some clear writing on the wall. One, the models are going to get more agentic and smarter. That means RAG is going to become more and more agentic-search RAG, more and more agentic search plus MCP, and they are going to make active progress on the memory side. Which leads you to ask me: well, heck, if they're going to get the memory figured out, why are we using RAG? And my answer to you is that RAG is a way of talking with data that has a little bit of stability and broad topic coverage, data you can actually query in a way that enriches current conversations.
You actually would not want to populate a magical 10-million-token working memory with your company's entire wiki anyway, because it would just make your answers dirty. What you want, sometimes, is retrieval-augmented generation, because it gives you a precise picture of a larger data set that is relevant to your query. So RAG will have its place even as memory improves, but only if you use it smartly.

So expect bigger context windows; million-plus context windows are going to be typical. Expect a rapid spread of the Model Context Protocol, MCP. Expect huge market growth as companies start to use RAG as a way of bridging the world of their data with AI models. And expect a much more sophisticated relationship between model fine-tuning and RAG. I would expect that fine-tuning becomes much more democratized in 2026, just as RAG is really common now in 2025.

Okay, so we've walked through a lot of this. We've walked through how to set up a RAG system, some of the pitfalls with RAG, how RAG works, how data works, how chunking works. I want to leave you with this. RAG is a way to solve some of AI's biggest problems: hallucination, stale knowledge, lack of memory. It can be started with something as simple as a few lines of code, and it does scale up to the enterprise, although not with 15 lines of code. This is why so many enterprises and businesses are thinking about RAG and moving toward it. It solves problems that are real, if implemented well. The tools do exist today; I've mentioned some of them in this video. You have no excuses not to start if you have a problem that fits in the fairly wide RAG problem space. Most of us have run across hallucination, stale knowledge, and memory issues.
But if you're going to do it, pick a small use case. Build just a prototype to start; don't pour the concrete. Measure the impact and eval it. Evaluate it, and then learn and iterate. The companies that win are not going to be the companies that just have the magical biggest models. Size doesn't matter, right? The smartness of the model is not going to be the magic thing. It's going to be their ability to take AI, integrate it with their company data and knowledge, maybe with RAG, and ultimately enable AI to drive their workflows forward. That's what I would suggest you think about for RAG in your situation. What problems are RAG-shaped for you? And critically, what problems do you want to avoid using RAG on? Because if you've watched this video this far, you know I'm not trying to tell you RAG is the solution for everything. I just want you to understand what it is, so you're not surprised the next time someone talks to you about it. Cheers. I hope you've enjoyed this introduction to RAG.