Custom AI Accelerators Drive Innovation
**Source:** [https://www.youtube.com/watch?v=KX0qBM-ByAg](https://www.youtube.com/watch?v=KX0qBM-ByAg) **Duration:** 00:15:27
Key Points
- AI is moving from a single‑purpose technology to a diverse ecosystem, much like the evolution of automobiles from uniform wagons to specialized vehicles such as ambulances, race cars, and refrigerated trucks.
- Hardware AI accelerators—purpose‑built silicon optimized for matrix and tensor calculations—provide faster, more power‑efficient inferencing than general‑purpose processors.
- These accelerators sit at the base of the AI stack, handling high‑compute workloads while integrating with memory systems and parallel‑processing architectures to reduce footprint and latency.
- Effective AI deployment also depends on software layers for task management, governance, and security, ensuring models operate ethically, without bias, and with protected data.
Sections
- [00:00:00](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=0s) **AI Accelerators Transform Modern Landscape** - The speaker compares today’s proliferation of specialized AI hardware accelerators to the early automobile era, stressing that these purpose‑built chips now allow enterprises to right‑size and efficiently deploy diverse AI applications across industries.
- [00:03:24](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=204s) **Placement and Memory of AI Accelerators** - The speaker explains how AI accelerators integrate with system architecture, detailing their various possible locations (on‑die, on‑chip, or external) and how they manage compute and memory either independently or shared with other processors.
- [00:06:34](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=394s) **Choosing the Right‑Sized AI Model** - The speaker contrasts large generative models with smaller predictive ones, emphasizing the need to match model size to task requirements to balance accuracy, cost, sustainability, and system performance.
- [00:09:45](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=585s) **Balancing Accuracy, Performance, and Risk** - The speaker explains that selecting an AI model involves a three‑dimensional trade‑off among response time, accuracy, and the specific risk profile of the use case, and that knowing the model size combined with appropriate hardware accelerators allows optimal deployment for each task.
- [00:12:52](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=772s) **Hybrid On‑Chip/Off‑Chip Fraud Detection** - The speaker explains using a tiny, cache‑resident AI model in the processor to instantly approve most transactions, while routing the ambiguous 20% to a larger, separate AI engine for deeper analysis.
Full Transcript
AI has captured the world's imagination.
The landscape is evolving rapidly
as enterprises begin to leverage the technology's potential for real-world business applications.
Ideas in the space for creating and improving applications are expanding in every direction,
and at the heart of these ideas are AI accelerators, hardware accelerators,
which are specific hardware designed for inferencing and other AI workloads.
These enable the transformation by providing faster and more efficient processing.
I like to think of where we are right now in the AI landscape as where the world was when the automobile was invented.
So at first, all automobiles basically looked the same,
more or less shaped like the wagons and carriages they were meant to replace,
but very quickly...
the world and industry realized, wow,
we could make ambulances and passenger cars and race cars and refrigerated milk trucks,
and all of these things have vastly different customizations in order to achieve their purpose.
That's really where we are with AI in 2025.
One size no longer fits all when it comes to AI.
Hardware accelerators are really great at
helping to right-size solutions for AI.
To really explain the intersection between hardware acceleration
and AI, let's take an admittedly simplified look at the AI stack.
So at its fundamental level, at the bottom, we have infrastructure, which is hard to spell,
so let's just call it hardware.
On top of that, of course, we have our models.
You've all heard of these.
And then at the top, last but most definitely not least, is software: management of the tasks, but also governance,
and governance is making sure that the AI continues to behave correctly,
without bias, ethically, as it evolves and develops and goes forward.
In addition, the software manages security, making sure that
the AI model data and the data that's fed in and out of those models remains private and secure.
Where do the accelerators fit in this stack?
Well, you might have guessed already, they are part of the hardware.
So the accelerators are here.
What specifically is an AI accelerator?
Again, this is purpose-built hardware, silicon, that's built and optimized
for the high-compute matrix mathematics that is necessary to do AI transactions.
So the linear algebra, the tensor calculations, this hardware is laid out specifically,
the silicon is designed to optimize and do that faster so that the inferencing answers can happen faster,
with reduced power consumption and a smaller physical footprint
compared to a general purpose piece of hardware that's designed to do everything and therefore has to cover all bases.
Accelerators do use all of the modern computer-architecture techniques,
like parallel processing and optimized memory bandwidth, and they interact with memory;
they have a large memory component.
It's a specific hardware engine that does the compute,
and that either has its own memory or it shares memory with other processors in the system.
Let's draw a picture of a system to really get a sense of where these accelerators might exist in an actual piece of hardware.
So this is a computer system.
It could be a server or a mainframe,
and like most today, let's make it an SMP so it has a variety of processor chips on it.
These AI accelerators, they could be located anywhere within the system.
For example, you might wanna have an on-die accelerator close to your processing compute.
You could have that on one or more of your processor compute engines.
You could have an AI engine somewhere else in the system, not necessarily on a processor chip, but still within the box.
Additionally, since servers interact with the external world through industry standard IO protocols,
you could actually attach an AI accelerator
externally to the box,
and those accelerators could come with their own compute as well as their own memory.
The other thing to note about these hardware accelerators,
and this is what really provides that flexibility and scale that we really need in computing today
to enable all of these use cases and these ideas that people are coming up with in the AI industry, is the...
concept that each one of these AI engines doesn't have to be the exact same design.
Your on-chip accelerator, for example, could be optimized in a different way for a different type of workload
than the one that's inside the system, which may or may not be different from the ones that you attach to the box.
So there's definitely a heterogeneous system of hardware, and within that hardware,
a heterogeneous set of accelerators is becoming available so that we can meet these growing AI needs.
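The heterogeneous layout the speaker describes can be sketched in code. This is a minimal, illustrative Python model: the class, field names, and the three example engines are assumptions for the sketch, not a real system's inventory.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    """One AI engine in a heterogeneous system (names are illustrative)."""
    location: str       # "on-die", "on-chip", or "off-chip (IO-attached)"
    own_memory: bool    # dedicated memory vs. shared with other processors
    optimized_for: str  # the workload class this engine's silicon targets

# A single SMP box can mix several differently optimized engines.
system = [
    Accelerator("on-die", own_memory=False, optimized_for="small predictive models"),
    Accelerator("on-chip", own_memory=False, optimized_for="medium models"),
    Accelerator("off-chip (IO-attached)", own_memory=True, optimized_for="large generative models"),
]

# Engines with their own memory can host models too big for a shared cache.
standalone = [a for a in system if a.own_memory]
```

The point of modeling it this way is that each engine carries its own placement and memory attributes, so software can pick the engine whose design matches the workload.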
To really see how these accelerators now interact with the rest of the stack,
let's take a couple of minutes to just think about models for a second,
and I think we're all familiar now with our traditional AI models, right?
Machine learning, deep learning.
These are relatively fast.
They're relatively small in terms of model size,
and they're pretty good at coming up with, I'll say, suggestions.
They can't give me life advice necessarily, but they can offer suggestions.
Gen AI, coming out in recent years, is more complex, right?
It's another level, right: they're generative versus predictive.
These are things like our LLMs and simulators; these are much larger models in general.
They take more energy to run and they are more costly,
but they're also more powerful in terms of delivering accuracy and doing more complex things.
They can...
dare I say it, maybe not give life advice, but in some cases give, you know, advice.
Do some advisement.
Just because we have this new, more complicated one,
it doesn't mean that we should abandon the traditional AI, because it does have a lot of value.
Most notably, it's fast, it's small, and it is less expensive, right?
So if we can get a good answer out of that, it is worthwhile to use it.
Really, right-sizing your model is where you want to be when you are doing AI.
If you have an AI task that's this size and your model is little,
you run the risk that you might get an answer fast, but that answer might be wrong,
or it might come back with more risk than your application can tolerate.
So you might say, okay, I'll err on the side of caution and I'll just always use a huge model,
but you're not being cost-effective and you're not being sustainable when you do this.
In addition, you run the risk,
if you happen to be attempting to do something in the same compute space as your running system,
you run the risk of kneecapping other simultaneously running workloads on your box if you use too large of a model.
So ideally, what we wanna go for and try to do is to right size the model to the task that we're doing.
Ok, how do hardware accelerators help with that?
To understand that, it helps to think about the model the way hardware thinks about the model,
which is that the model really, at the end of the day, is data,
and when you think about hardware and how you optimize hardware, what we think about generally is performance,
a common metric of performance in the hardware world,
familiar to me (I'm a hardware person, if you couldn't tell already),
is latency, or what we call response time, right?
How fast do you respond?
And when you're measuring response time as a function of size,
right, the size of your working set, the size of your data, the size of your model,
basically, generally, the bigger that model gets, the worse your response time gets, right.
It takes longer and longer to get the answer out there, right?
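One back-of-envelope way to see why response time grows with model size: at low batch sizes, inference tends to be memory-bandwidth bound, so latency scales roughly with the bytes of weights that must be streamed. The function below is a hedged sketch of that rule of thumb; the parameter defaults (2 bytes per parameter, 100 GB/s) are illustrative assumptions, not measurements of any real system.

```python
def est_latency_ms(params_millions: float, bytes_per_param: float = 2.0,
                   mem_bandwidth_gbs: float = 100.0) -> float:
    """Rough lower bound: time to stream all weights through memory once."""
    model_bytes = params_millions * 1e6 * bytes_per_param
    return model_bytes / (mem_bandwidth_gbs * 1e9) * 1e3  # milliseconds

small = est_latency_ms(10)      # a ~10M-parameter predictive model
large = est_latency_ms(7000)    # a ~7B-parameter generative model
# The bigger the model, the worse the response time, roughly linearly.
```

Under these assumptions, the small model streams in a fraction of a millisecond while the large one takes hundreds of times longer, which is the near-linear trend the speaker is drawing.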
If we try to use a similar graph to measure performance of AI, it's not quite as linear,
because what we're trying to do really when we're measuring how well AI performs is we really say, well, how effective is it?
And if you try to draw a graph like this, it would end up looking a little bit more confusing,
and you kind of have to think, well what's going on there?
What's going on there is that response time,
the hardware performance, is not the only factor
when you're talking about measuring the success or the effectiveness of the AI.
It's really accuracy as well as performance.
And the variable that's hidden here is a third dimension.
Bear with me as I try to draw a third dimension,
which we touched on before; I hinted at it.
It's really your use case,
and that green line is really coming out in the third dimension here.
So the questions you want to be asking when you're trying to decide what model to use are,
how much risk can I tolerate?
If I'm a credit card fraud detection system, you know,
I have a different risk profile, I have different inputs, I have different levels of
tolerance than I do if I'm a batch document summarization process, right?
So those factors drive the need for models that align with specific use cases.
So at the end of the day, what's happening is your use case is dictating your model size,
and if you have knowledge of your model size, and you have access to these hardware accelerators,
then you can leverage these hardware accelerators and run
the model that you need on the optimized hardware accelerator for the task at hand.
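That use-case-to-deployment mapping can be sketched as a small routing function. Everything here is a hypothetical illustration: the tier names, the 10 ms threshold, and the placements are assumptions for the sketch, not any real product's policy.

```python
def choose_deployment(latency_budget_ms: float, needs_high_accuracy: bool):
    """Map a use case's constraints to a model tier and accelerator placement
    (illustrative thresholds and names)."""
    if latency_budget_ms < 10 and not needs_high_accuracy:
        # Tight budget and a fast suggestion is enough: stay close to the data.
        return ("small predictive model", "on-die accelerator, shared cache")
    if needs_high_accuracy:
        # Accuracy first: a bigger model on an engine with its own memory.
        return ("large generative model", "off-chip engine, own memory")
    return ("medium model", "on-chip accelerator")

# Real-time card authorization: milliseconds matter, a suggestion will do.
model, placement = choose_deployment(5, needs_high_accuracy=False)
```

The design point is simply that the use case drives the model size, and the model size in turn drives which accelerator in the heterogeneous system to use.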
Let me illustrate all this with an example.
It's a common example, but it's one of my favorites because it's very personal and it resonates,
I think, with everybody: it's credit card fraud detection, right?
I wanna use my credit card online all the time,
and I want to hit pay now and I want it to go through as long as it's me.
If it's not me, I want it to stop.
So that task is fairly complex.
It also has a high dependence on response time because
I'm gonna navigate away if it takes too long to use this site, I'll go someplace else, right.
So let's illustrate how we might use AI models and hardware accelerators to optimize for that use case.
So we have a task here,
and really, that task is: is it fraud?
All right. I want to get the answer as fast as possible.
So maybe I'll start by using an
ML/DL model.
This will be fast, it'll have a low response time, and it'll have a good forecast of whether or not this is fraud or not.
Maybe it'll be 80% successful; I'm making these numbers up.
That model, if I'm doing something like this,
I will probably, if I have it, want to use a hardware accelerator
that's on the same processor chip as the transaction workload that's running.
The reason is that transaction workload already has the data
that needs to go into that model, and the model's a relatively small size,
so you can pull that data into the cache right there in the processor chip
and enable in real time actually getting, you know, your suggestion of whether or not this is fraud,
very quickly.
Then I'd say 80% of the time, that answer comes back.
They're like, yeah, this is you.
This is a valid purchase.
I'm gonna approve it.
And then you get your toy that you bought and nobody loses any money and everything is good,
but maybe 20% of the time, the model says, hmm, I'm not sure.
Some things look legit, but some things look suspicious.
We need to take a deeper look at this.
At that point, then, you could decide that now you know it's worth the investment
of going into a more expensive gen AI model,
and that model requires a lot more memory because it's a much bigger model.
So you wouldn't be using the AI engine in this case that you had on your die.
You might want to use an AI engine that's on a different die that's not running a
performance critical workload, or maybe one that's in a different place on the system,
or maybe one attached through the IO peripherals that has its own memory.
Based off of that, if you have that flexibility, then you can decide which to use and the hardware can use that,
and then at the end of that transaction, it can come out and say, no, it was good.
It's okay, it turns out that that was the right one.
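The hybrid flow just described can be sketched as a two-stage cascade. Everything below is a hypothetical illustration: the scoring functions are stubs, and the 0.2/0.8 confidence band stands in for the speaker's "ambiguous 20%".

```python
def fast_score(txn):
    """Stub for the small, cache-resident on-die model (hypothetical)."""
    return txn["amount_score"]  # pretend the tiny model emits a fraud probability

def deep_score(txn):
    """Stub for the larger model on a separate engine with its own memory."""
    return min(1.0, txn["amount_score"] * txn["context_factor"])

def authorize(txn, low=0.2, high=0.8):
    """Two-stage cascade: decide on-chip when confident, else escalate."""
    p = fast_score(txn)
    if p <= low:
        return "approve"   # the confident majority: answered in real time
    if p >= high:
        return "decline"
    # The ambiguous band: now worth the cost of the bigger, slower model.
    return "approve" if deep_score(txn) < high else "decline"

clear = authorize({"amount_score": 0.1, "context_factor": 1.0})    # "approve", no escalation
flagged = authorize({"amount_score": 0.5, "context_factor": 2.0})  # escalated, "decline"
```

The key property is that the expensive model only runs on the transactions the cheap one could not settle, which is exactly what makes the hybrid placement pay off at scale.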
So the idea here is using more than one model, this concept of multi-model AI, or ensemble AI,
can really be supported and enabled and optimized
by having these hardware accelerators that are optimized for different model types,
and then, when you think about doing AI, the last thing to think about
here is, you know, we're not doing one of these a day, right?
We're doing millions of these a day, many of them at the same time, right.
So you want to have all of these AI processing engines running all the time, doing different things,
and then the next second or millisecond, each is running a completely different model with a completely different task.
So hardware accelerators give you the flexibility to do that at scale,
and because of that, they intersect well with the AI stack as we know it today
and are really starting to be used to
ensure that we're continuing to expand and meet business needs by deploying scalable,
efficient, and secure AI to meet today's business problems and tomorrow's.