RAG: Best Practices & Pitfalls
Key Points
- Retrieval‑augmented generation (RAG) promises to turn LLMs into real‑time, data‑driven assistants, unlocking a market projected to grow from ~$2B today to over $40B by 2035.
- RAG tackles core LLM flaws—knowledge cut‑offs, hallucinations, and lack of access to proprietary data—by retrieving relevant documents, augmenting the query with those facts, and then generating answers grounded in reality.
- Adoption is already widespread: roughly 80% of enterprises use RAG (preferring it to fine‑tuning), and 73% of AI‑focused firms cite the need for up‑to‑date data access as essential.
- Real‑world wins include LinkedIn’s dramatic cut in support‑ticket resolution time, demonstrating how RAG acts like an “open‑book exam” for an LLM, while other companies have over‑invested and later regretted poorly scoped implementations.
- Successful RAG systems rely on high‑dimensional embeddings to match semantic meaning, careful scaling from prototype to millions of queries, and strict avoidance of common pitfalls that have derailed many projects.
Sections
- RAG's $40B Promise & Pitfalls - The video outlines retrieval‑augmented generation as a solution to LLM limits—offering real‑time, company‑specific knowledge—while detailing implementation steps, success stories, and common mistakes that can cause costly failures.
- Effective Text Chunking Strategies - The passage explains that raw text must be divided into overlapping chunks—using fixed-size, sentence‑based, semantic, or recursive methods aligned with business goals—to preserve meaning and improve LLM retrieval, emphasizing that vector searches rely on cosine similarity of semantics rather than simple keyword matching.
- RAG Maturity Levels Overview - The speaker walks through four escalating RAG approaches—hybrid keyword‑semantic search, multimodal search across text, images, video and audio, agentic multi‑step reasoning, and full enterprise‑grade deployment—highlighting their increasing accuracy, speed nuances, implementation complexity, and operational requirements.
- Graph RAG and Hybrid Search - The speaker explains how graph‑based Retrieval‑Augmented Generation preserves entity relationships, combines keyword and semantic searches through rank‑voting hybrid methods, and highlights careful handling of multimodal data such as images and tables.
- RAG Memory vs Context Windows - The speaker explains that retrieval‑augmented generation serves as an advanced memory manager to retain conversation details, making OpenAI’s seemingly larger context window a result of clever memory tricks, whereas Claude suffers from a hard, shorter memory limit.
- Scaling Secure Enterprise Vector Search - A guide to building large‑scale, cost‑optimized vector‑search systems—starting with production‑like testing, open‑source options, early pipeline and embedding versioning, then addressing sharding, caching, model cascades, and comprehensive security and compliance reviews.
- Future of Retrieval‑Augmented Generation - The speaker outlines how RAG will become more agentic, integrate larger context windows and memory advances, remain essential for precise data retrieval, drive market growth, and see democratized fine‑tuning by 2026.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=z8-0INxN_Hg](https://www.youtube.com/watch?v=z8-0INxN_Hg)
**Duration:** 00:23:24
Timestamps:
- [00:00:00](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=0s) RAG's $40B Promise & Pitfalls
- [00:03:21](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=201s) Effective Text Chunking Strategies
- [00:06:36](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=396s) RAG Maturity Levels Overview
- [00:09:48](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=588s) Graph RAG and Hybrid Search
- [00:13:07](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=787s) RAG Memory vs Context Windows
- [00:16:28](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=988s) Scaling Secure Enterprise Vector Search
- [00:20:21](https://www.youtube.com/watch?v=z8-0INxN_Hg&t=1221s) Future of Retrieval‑Augmented Generation
What if ChatGPT had perfect memory and never hallucinated? That is the $40 billion promise that RAG, retrieval-augmented generation, is making to the industry. In this video, you're going to get a one-stop shop that unpacks all the current debates and best practices on RAG, how companies are implementing RAG, a few success stories, and, not to be left out, a few places where you should not use RAG, because yes, there are companies that have absolutely overinvested in RAG and profoundly regret it. So, let's dive in.
The problem is fundamentally that LLMs are brilliant but jagged. I've talked about this before. They have fatal flaws. They have knowledge cutoff dates, so their knowledge is frozen in time. They have hallucinations, or confident lies. And they obviously can't access your company's data, which in most cases companies very much do mind. So the solution, in preview, is basically that RAG plus an LLM, a large language model, will give your AI the flavor of a real-time research assistant. So
what you're going to learn is how to
build your RAG system, how to scale from a prototype up to true scale, like millions of queries, and how to avoid the pitfalls that kill so many RAG projects.
This is all based on actual deep dives and lots of research I've done on RAG. It's very comprehensive. So, bookmark this one and watch it in chunks. First, why RAG changes
everything. We're going to actually get
into the stakes and then we're going to
get into how it actually works. RAG is currently a roughly $2 billion market, although that's exploding so fast it's hard to measure. It is on track for $40 billion plus by 2035. Many enterprises use RAG; the loose running number is around 80%. And they use it over fine-tuning because they perceive it as easier: fine-tuning a model is perceived as more difficult, at least right now. And of those engaging with AI, 73% say that they need real-time data access. By the way, if
you're wondering where I'm getting these statistics, I have a list of links in my Substack that you can follow to find all of the actual stories that underlie this. So, as an
example of a success story, LinkedIn had
a significant reduction in support
ticket resolution time because of RAG, because RAG enabled them to know their business. It's like an LLM having an open-book exam instead of a closed-book exam. And yes, that story is public. So,
how does it work? What's the magic?
Retrieval is searching the knowledge
base for relevant info. Augmentation is
combining the query with retrieved
facts. And generation is an LLM creating
an answer grounded in real data. So, how
does rag really work? Number one,
embeddings. So text is embedded as
numbers in dimensional space,
highdimensional space. For example, the
phrase refund policy might be embedded
as a series of numbers or vectors. And
the key insight is that similar meanings
will cluster together mathematically. If
you've watched previous videos of mine
talking about how large language models
work, it's the same darn thing: you're taking the words and encoding them as numbers in high-dimensional vector space. And if you want to know how many dimensions, one of the best practices right now is 1,536 dimensions. That's a lot of dimensions.
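To make "text as numbers in high-dimensional space" concrete, here is a deliberately tiny, hand-rolled sketch. Real systems use a learned embedding model (hence the 1,536 dimensions just mentioned); this toy merely hashes words into a 16-dimensional vector, so it captures surface overlap only, not true semantic meaning. The function name and dimension count are illustrative, not from the video.

```python
import hashlib
import math

def toy_embed(text: str, dims: int = 16) -> list[float]:
    """Toy stand-in for a learned embedding model: hash each word
    (and word bigram) into one of `dims` buckets, then L2-normalize
    so cosine similarity reduces to a dot product."""
    vec = [0.0] * dims
    words = text.lower().split()
    grams = words + [a + " " + b for a, b in zip(words, words[1:])]
    for gram in grams:
        bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

v = toy_embed("refund policy")
print(len(v))  # 16 dimensions, unit length
```

A real pipeline would swap `toy_embed` for calls to an embedding model and store the resulting vectors in a vector database.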
So if you have the dimensions, you might
wonder, is that enough? Can we just feed
the text raw? The answer is no. You want
to chunk. And by chunking, we mean that
you want to break the large blocks of
text that you're giving to the system
into pieces in ways that help the LLM
understand relationships and semantic
meaning. Bad chunking ruins so many RAG projects, so pay attention. You have four different strategies here. You can have fixed-size chunks, which can be dangerous because they can cut off mid-sentence. You can have sentence-based chunks that respect boundaries. You have semantic chunks that group by topic. And you have recursive chunks that follow hierarchical structure. The key is
making sure you understand what you want
to get from a business perspective and
driving your chunking strategy off of
that. You should plan to have overlap between chunks. You don't want a hard, zero-overlap cutoff, because then the LLM loses the chance to find something in a neighboring chunk that it might have run across in the original chunk. If you give it a little overlap, you maximize the odds of the AI finding what it needs in a really complicated haystack. So when we're looking at
things in vector space, we are not
keyword matching. That's often a
misunderstanding. People will say, "Oh,
the LLM is looking for a keyword match."
No, it's not. It's looking to match
meaning. And so, it's actually looking for what's called cosine similarity, finding the nearest neighbors in vector space. As an example, "how do I get my money back?" might be a query that a customer types in. That might find, say, refund processing at 0.95 similarity, return policy at 0.93 similarity, and shipping info at 0.38 similarity, not retrieved. That one's not a fit. Now, if
you want to think about how you handle retrieval, you can actually rerank based on the actual queries you get back, and you can boost accuracy for business purposes significantly if you do re-ranking. It's an advanced technique, but in this situation, if you want the system to retrieve shipping info for "how do I get my money back," because maybe they need to ship their item back, then you can rerank and get to that in what we'll call post-processing, for lack of a better term.
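The fixed-size-chunks-with-overlap idea above can be sketched in a few lines of Python. The 200-character size and 50-character overlap are illustrative numbers; production systems usually chunk by tokens or sentences rather than raw characters.

```python
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that share `overlap`
    characters, so a fact near a boundary lands in two adjacent
    chunks instead of being cut in half."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # final chunk reached the end of the text
    return chunks

doc = "".join(str(i % 10) for i in range(500))
chunks = chunk_with_overlap(doc)
print(len(chunks), chunks[0][-50:] == chunks[1][:50])  # 3 True
```

The tail of each chunk repeats as the head of the next, which is exactly the redundancy that lets retrieval recover a fact straddling a boundary.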
Okay, so how do you build a RAG? Very simply, I would recommend going to LlamaIndex. You're going to load up your documents, which LlamaIndex gives you a way to do from the command line, create an index, and query. And it's easy to get a stack for this. This is not expensive, and it's not hard; a simple RAG is quick. You can use LangChain; it's a Swiss Army knife that's going to do everything. You can use LlamaIndex; it's optimized specifically for RAG. Other vector DBs, Pinecone, Chroma, Qdrant, they all work. And if you want
something as simple as what's the
warranty period on a manual or handbook,
you're going to be able to get 2 years
for EU purchases or a similar answer
that's correct very, very quickly. In 2025, it's not hard to build a simple RAG. The challenge is most people don't just want a simple RAG. So if you want level one, basic Q&A, you can get that done in a week or so, even at a company: simple vector search, single source, a couple of seconds of latency, internal FAQs only, super fast. It's basically a slightly fancier custom GPT.
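To show how small a level-one "retrieve, augment, generate" loop really is, here is a self-contained sketch. Everything in it is a stand-in: the bag-of-words counts replace a real embedding model, the three strings replace a vector database, and the final prompt would be sent to an LLM rather than returned. Frameworks like LlamaIndex and LangChain wrap essentially this pattern.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a real system calls a learned model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in for a vector DB full of document chunks.
docs = [
    "Refund processing: refunds are issued within 5 business days.",
    "Return policy: items may be returned within 30 days.",
    "Shipping info: standard shipping takes 3 to 7 days.",
]

def rag_prompt(query: str, k: int = 2) -> str:
    # Retrieve: rank chunks by similarity to the query.
    ranked = sorted(docs, key=lambda d: cosine(embed(query), embed(d)),
                    reverse=True)
    # Augment: prepend the top-k chunks to the prompt.
    context = "\n".join(ranked[:k])
    # Generate: in a real system, this prompt goes to the LLM.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = rag_prompt("How do I get my refund?")
print(prompt.index("Refund processing") < prompt.index("Question:"))  # True
```

The refund chunk ranks first because it shares the word "refund" with the query; a learned embedding would also catch paraphrases like "money back."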
Level two, hybrid search, where you're
combining both a keyword match and
semantic meaning match. That's a little
bit more complicated. You definitely get
better accuracy. It can be faster in
some cases because you're handling
keywords directly. And it can be helpful for handling edge cases. It's much more complicated to implement, though. But it gets more complicated from there. All right, level three.
Let's say you want to search text and images and video and audio: multimodal RAG. It can be quite accurate, and you can get it to be quick, but you are going to have to put a ton of work in on the data side and the chunking side. You think chunking is complex with text? Wait till you're trying to come up with a chunking strategy for text and images and video and audio. One example would be Vimeo's video search with timestamps. That's an interesting one, right? So, level four, agentic RAG.
That's where you actually have the agent
go in and do multi-step reasoning and
self-improve on what it finds. It's going to be a longer wait, but you can get a more accurate response. You not only have to build a full RAG, you have to build an agent over the top. And then finally, if you want enterprise production, there's a lot of security, a lot of compliance, a lot of monitoring. You have performance expectations around how fast this thing needs to respond and how it handles load when there are multiple queries, and all of that additional software engineering that goes into putting something on a million boxes. That does not go away. That is still complicated. No AI will magically put software that lives on a million or 10 million or 100 million boxes easily. Okay. So, we've talked a
fair bit about data. Let's do a little
bit more of a deep dive there because
data is the key to a good RAG system. When you are looking at documents for a RAG, there are a few things you want to keep in mind as trusty tips that will help you go farther. Number one, PDFs often have terrible header and footer pollution. Have you ever copied and pasted from a PDF? That's how the system sees it, and it will read those little footers and get confused. It'll read the weird headers and get confused. OCR for scanned documents: are
you sure the optical character
recognition is correct? This is why Mistral released a special OCR tool just for scanned documents. It's difficult to get a good RAG if you don't have good, clean, digital text. Tables are going to need special handling because you have to encode spatial relationships. And you need to clean the boilerplate out of documents before thinking about chunking. Do not try to chunk a raw PDF. Get to clean text first; get to clean markdown first. Okay. Metadata
can be a dramatically impactful choice
as far as how you handle accuracy. So if
you add source, section, and date to
each chunk, retrieval is going to be
vastly improved. For example, with "policy updated March 2024" in the metadata, the system knows it's a 2024 update. And if it finds a 2025 update, it's probably going to choose the 2025 update, if it understands you're looking for recency-based retrieval. So what does this look like?
Ten steps. Convert to text with the appropriate parser. Split it into sections. Remove the boilerplate, the crappy headers and footers. Normalize all the whitespace. Extract the section titles. Add the metadata. Chunk with overlap. Embed the chunks. Verify samples. And then iterate. That's how much work it is, and that is, frankly, for a fairly simple exercise. This is why I say RAG can get complicated. But we're
going to get even more advanced because
this is one of those videos. Let's talk
about graph RAG. Traditional RAG is just isolated text chunks; graph RAG preserves entity relationships as it encodes them. And LinkedIn saw significantly better retrieval with knowledge graphs from graph RAG. Another
hybrid approach that's interesting is hybrid search; let's do a deep dive. Can you catch exact matches, like error codes, with a hybrid search that looks not only at the vector space but also at, for example, the error codes themselves? The best document often ranks at different positions in the different searches, and it's sort of like ranked-choice voting: you're looking for the retrieval answer that ranks highest across the different search methods in the hybrid approach, and maybe that's the number one. So maybe the error code we're looking for ranks really highly in our keyword search and not as highly in the semantic meaning search, but it's still there because we used correct metadata when we chunked it. And so it all comes out in the wash, it comes out as number one combined, and that improves the accuracy. Multimodal:
you want to be thoughtful about how you
handle especially the relationship
between image, table, and text. Invoices
are a good example of this. They will
often have tables. They'll definitely
have text. They may have images as well.
You want to use a tool like CLIP for image embeddings, and you want a unified index across all your modalities: text, images, and tables. If you do it right, you should be able to send a query that says something like, "Show me the revenue table from Q3," and it should retrieve both the image and the data, because the index is common across both modalities. Okay, this is where I
mention MCP. MCP is helpful because it
ends up being like the USB port for AI.
It's a universal protocol to enable AI
data connectivity. It is super helpful for enabling systems to plug into and access data they would not otherwise be able to get. And so a good system that has RAG for internal company data can also extend that search relatively easily to other data sources using MCP. Let's get to
memory management. So if you think about
memory, part of the whole reason we got
here is memory and why memory is a
problem for AI. We have to get the
memory right. Context windows are
working memory. It's what every AI ships
with. Often it's 100,000, 200,000,
400,000, maybe even a million tokens.
Vector stores are long-term memory; we've talked about embeddings. Long-term memory is effectively unlimited, but what we store gets compressed and summarized. And
so, if you think about how all of this
relates together, you can be in a position where you compress old turns of conversation and summarize them in memory. You can retrieve previous conversation with a RAG on the conversation itself. You can have multiple abstraction levels. And one good example is making sure that you can encode enough of a previous long-running conversation to not forget key facts. So
as an example, let's say you're ordering
French fries, right? And you're talking
with an AI bot about ordering French
fries. It is 2025. That could happen.
Maybe you're on DoorDash, I don't know. What happens if you mention that the order you're working on doesn't have enough fries in it and you want extra fries, and that's in the second chat you send, and then 20 or 30 chats later, because you're having a great conversation with DoorDash, as we all do, it forgets that you asked for extra fries? That kind of visceral moment where it forgets the previous conversation is something almost everyone has experienced with AI, and you don't have to experience it with a RAG system, because the RAG system can effectively be used as an advanced memory manager to reduce that sense that the memory is just going to disappear as the context window moves along. A lot of the fancy work that
companies do to keep context windows
open a long time basically boils down to this fancy memory management. This is one of the reasons why OpenAI feels like it has a larger context window even though it doesn't. They don't exactly reveal what they do, but basically they do some fancy work with memory management to keep the conversation flowing longer. Whereas Claude has a pretty hard memory cap, and they aren't keeping the conversation going longer with a technique like this, at least right now.
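Nobody outside these companies knows exactly what they do, so treat the following as an illustrative sketch of the general idea rather than anyone's actual implementation: keep recent turns verbatim as working memory, evict older turns into a searchable archive, and retrieve from the archive when a query needs them. The word-overlap scoring here is a stand-in for real vector similarity search.

```python
from collections import deque

class ConversationMemory:
    """Sketch of RAG-style memory management: recent turns stay
    verbatim in the context window; older turns are evicted into a
    long-term store that can be searched and re-injected. A real
    system would embed archived turns into a vector DB and also
    summarize them."""

    def __init__(self, window_turns: int = 4):
        self.recent = deque(maxlen=window_turns)  # working memory
        self.archive = []                         # long-term store

    def add(self, turn: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            self.archive.append(self.recent[0])   # evict oldest turn
        self.recent.append(turn)

    def recall(self, query: str, k: int = 1) -> list:
        """Retrieve archived turns by naive word overlap, a stand-in
        for vector similarity search."""
        qwords = set(query.lower().split())
        scored = sorted(self.archive,
                        key=lambda t: len(qwords & set(t.lower().split())),
                        reverse=True)
        return scored[:k]

    def context(self, query: str) -> list:
        """What actually reaches the model: relevant archived turns
        plus the recent window."""
        return self.recall(query) + list(self.recent)

mem = ConversationMemory(window_turns=2)
for turn in ["I want extra fries with my order",
             "also add a soda",
             "make the soda diet",
             "what time will it arrive"]:
    mem.add(turn)
print(mem.recall("did I ask for extra fries"))
```

Even after the fries request has scrolled out of the two-turn window, `recall` pulls it back from the archive, which is exactly the "it stopped forgetting" effect described above.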
And so you'll run into the "you've run out of memory" message on Claude really fast. What's fascinating is people think that means Claude has shorter context windows and shorter memory. But that's not true; OpenAI is fooling you with fancier memory management. Okay, let's
get to evals and testing. Four things that I want to call out. Relevance: are we retrieving the right chunks? Faithfulness: is the answer based on actual sources? Quality: would a human rate it as correct? And latency: is this fast enough? You'll have to set that bar, but oftentimes it's sub a couple of seconds. And you need to start by building an eval set, a question set for this RAG that you will consider gold standard. Include edge cases; include things that are tricky.
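A gold-standard eval set can be as simple as a list of question/expected-chunk pairs plus a hit-rate metric over your retriever. The questions, chunk IDs, and stub retriever below are all hypothetical placeholders; the shape of the harness is the point.

```python
# Hypothetical gold-standard eval set: each question names the chunk
# a correct retrieval must surface. Include tricky edge cases.
GOLD = [
    {"q": "how do i get a refund", "expect": "refund-policy"},
    {"q": "what is the warranty period in the EU", "expect": "warranty"},
    {"q": "do you ship internationally", "expect": "shipping"},  # edge case
]

def hit_rate(retrieve, k: int = 3) -> float:
    """Retrieval relevance: fraction of gold questions whose
    expected chunk ID appears in the top-k results."""
    hits = sum(1 for case in GOLD if case["expect"] in retrieve(case["q"])[:k])
    return hits / len(GOLD)

def stub_retrieve(query: str) -> list:
    """Stand-in for your real vector (or hybrid) search."""
    index = {"refund": "refund-policy", "warranty": "warranty",
             "ship": "shipping"}
    return [chunk_id for key, chunk_id in index.items() if key in query.lower()]

print(hit_rate(stub_retrieve))  # 1.0 for this stub; rerun after every change
```

Generation quality (faithfulness, human-rated correctness) and latency need their own checks, but the same harness shape works for A/B tests: run retriever A and retriever B over identical gold questions and compare the numbers.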
Don't make it easy. You want to measure
both retrieval and generation. So can it
get it and can it write it well? And you
want to A/B test improvements in your RAG system. If you're going to move to hybrid search, take it seriously. One example: Notion A/B tested their RAG system when they moved, and they could prove the improved value of search over time, and so they were able to analyze their failures and fix their data problems. That's another publicly available story. Okay, I've
given you some examples of how to do
RAG. Let's talk about how RAG goes wrong. We've talked about chunking going wrong, breaking context mid-sentence. We've talked about LLMs missing info in big chunks and how things get lost in memory. Well, it's possible, if you set up a bad RAG, to actually get lost in the middle because the LLM can't retrieve the info. So that's a way RAG can actually make the memory problem worse if you implement it badly. You
have hallucination horror stories where
the RAG will make up facts despite the context being available. That can happen with poorly labeled context; it can happen for a variety of reasons. Number four, you can have a frankly incorrect vector DB setup, which can be very expensive. Number five, you can have stale or bad data inside it: there's no update pipeline, the data gets out of date, and then it's useless. Number six, security leaks, PII exposure, compliance failures. It's just not fun. And number seven, mismatched embeddings: different models for index versus query can lead to complete gobbledygook. So as you would expect,
the prevention strategies make a lot of
sense here. Always overlap your chunks. Test with production-like data. Let it be okay to have "I don't know" responses; that really helps with hallucinations. Start with open-source or cheap options; that prevents you from having the wrong vector database, because you can check it before you pour the concrete. Build update pipelines on day one, not later. Have a security review before you start on the architecture. Track embedding versions so different embedding versions don't screw you up between index and query. Okay. Now, let's get
into some of the challenges that occur
especially when you have very large, enterprise-scale systems scaling to 10 million queries. You have to start to shard your vector DB and replicate it. You have to cache popular queries. You have to figure out how to cascade models: you may want to expand a prompt with one model and then have a different model handle the prompt from there. Cost optimization will save you millions of dollars, because it's so expensive to run these things. This is where you'll be shaving models, figuring out what is the absolute smallest model you can use and how to trade off different models in a system depending on the query. You're going to have a security deep dive like no other: access control filtering, PII scrubbing, audit trails, compliance, HIPAA, GDPR, SOC 2, you name it. Add an acronym, right? There's going to be a lot. Plan for it to take months, but
it's worth it, right? Like another
example is RBC, the bank. They built a RAG for support agents, another publicly available story. It indexes policies, it indexes past tickets: faster resolution, better consistency. And they rolled it out internally at first and then to customers. It is possible to do RAG at scale. Let's do just a little bit of a look at RAG versus agentic search as we come toward the end of this long introduction. Thank you for staying with me this far. RAG versus agentic search is a huge question. Fundamentally, RAG is
a single retrieval-then-answer modality, whereas agents think and plan in multiple steps; they can be more accurate, but they're much slower and more expensive. So you want to use RAG alone for simple Q&A and documentation, and you want to use RAG plus agents for complex reasoning, multi-source questions, et cetera. And I
would be remiss even in a video that's
all about RAG if I did not say when not to use RAG. I know of companies that have regretted their RAG implementations, because what they used RAG for was not company data, not something the LLM could not get any other way. What they used it for was essentially a way to make the LLM temporarily smarter. And what they found after a very expensive half-million or million-dollar implementation is: oh no, we implemented a RAG, and the next general-purpose model was smart enough that it didn't matter; it had a big enough context window that it didn't matter. We still need RAG; it just needs to be intelligent. It needs to be smart. It needs to follow some of the best practices I've outlined here around how you handle data, how you chunk data, why you use it, and how you set it up. So here's
some checks for when not to use RAG, things you can do to avoid making those kinds of mistakes. Number one, check if the base model knows it or almost knows it; I mentioned that already. Number two, and this is more for a personal RAG system: if it's for stories or poems or creative writing, RAG just generally doesn't work well, because the semantic meaning doesn't work the same way. Number three, if you need it to be super fast, like gaming-system fast, don't bother with a RAG; it's never going to work, because you just have to go and get the data, and that takes time. If you have
highly volatile data like stock market
tickers, don't use a RAG; it's never going to work. If you have a high maintenance cost and no really clear benefit, like a small data set, don't use a RAG. If you have relatively simple transformations, like basic calculations or basic formatting, there's not a lot to do; a RAG is not worth it. And if it's privacy-critical, you have to ensure you don't store user data, and if you do, you're in trouble. Now if we
look to the future and what's going to
happen, I think there's some clear
writing on the wall. One, the models are
going to get more agentic and smarter.
That means RAG is going to become more and more agentic-search RAG, more and more agentic search plus MCP RAG, and they are going to make active progress on the memory side. Which leads you to ask me: well, heck, if they're going to get memory figured out, why are we using RAG? And my answer is that RAG is a way of talking with data that has a little bit of stability and widespread, good topic diffusion, where you can actually query against that data in a way that enriches current conversations. You actually would not want to populate a magical 10-million-token working memory with your company's entire wiki anyway, because it would just make your answers dirty. What you want is retrieval-augmented generation sometimes, because it gives you a precise picture of a larger data set that is relevant to your query. So RAG will have its place even as memory improves, but only if you use it smartly. So expect more context windows; million-plus context windows are going to be typical.
Expect a rapid spread of the model context protocol, MCP. Expect huge market growth as companies start to use RAG as a way of bridging the world of their data with AI models. And expect a much more sophisticated relationship between model fine-tuning and RAG; I would expect that fine-tuning becomes much more democratized in 2026, just as RAG is really common now in 2025. Okay, so
we've walked through a lot of this.
We've walked through how to set up a RAG. We've walked through some of the pitfalls with RAGs, how RAG works, how data works, how chunking works. I want to leave you with this: RAG is a way to solve some of AI's biggest problems: hallucination, stale knowledge, lack of memory. It can be started with as little as a few lines of code, and it does scale up to the enterprise, although not with 15 lines of code. This is why so many enterprises and businesses are thinking about RAG and moving toward RAG. It solves problems that are real, if implemented well. The tools do exist today; I've mentioned some of them in this video. You have no excuses not to start if you have a problem that fits in the fairly wide RAG problem space. Most of us have run across hallucination, stale knowledge, and memory issues. But if you're going to do it,
pick a small use case. Build just a
prototype to start. Don't pour the
concrete. Measure the impact and eval it; evaluate it, then learn and iterate. The companies that win are not going to be the companies that just have the magical biggest models. The size doesn't matter; the smartness of the model is not going to be the magic thing. It's going to be their ability to take AI, integrate it into their company data and knowledge, maybe with RAG, and ultimately enable AI to drive their workflows forward. That's what I would suggest you think about for RAG in your situation. What problems are RAG-shaped for you? And critically, what problems do you want to avoid using RAG on? Because if you've watched this video this far, you know I'm not trying to tell you RAG is the solution for everything. I just want you to understand what it is, so you're not surprised the next time someone talks to you about it. Cheers. I hope you've enjoyed this introduction to RAG.