Custom AI Accelerators Drive Innovation
**Source:** [https://www.youtube.com/watch?v=KX0qBM-ByAg](https://www.youtube.com/watch?v=KX0qBM-ByAg) **Duration:** 00:15:27
Key Points
- AI is moving from a single‑purpose technology to a diverse ecosystem, much like the evolution of automobiles from uniform wagons to specialized vehicles such as ambulances, race cars, and refrigerated trucks.
- Hardware AI accelerators—purpose‑built silicon optimized for matrix and tensor calculations—provide faster, more power‑efficient inferencing than general‑purpose processors.
- These accelerators sit at the base of the AI stack, handling high‑compute workloads while integrating with memory systems and parallel‑processing architectures to reduce footprint and latency.
- Effective AI deployment also depends on software layers for task management, governance, and security, ensuring models operate ethically, without bias, and with protected data.
Sections
- [00:00:00](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=0s) **AI Accelerators Transform Modern Landscape** - The speaker compares today’s proliferation of specialized AI hardware accelerators to the early automobile era, stressing that these purpose‑built chips now allow enterprises to right‑size and efficiently deploy diverse AI applications across industries.
- [00:03:24](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=204s) **Placement and Memory of AI Accelerators** - The speaker explains how AI accelerators integrate with system architecture, detailing their various possible locations (on‑die, on‑chip, or external) and how they manage compute and memory either independently or shared with other processors.
- [00:06:34](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=394s) **Choosing the Right‑Sized AI Model** - The speaker contrasts large generative models with smaller predictive ones, emphasizing the need to match model size to task requirements to balance accuracy, cost, sustainability, and system performance.
- [00:09:45](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=585s) **Balancing Accuracy, Performance, and Risk** - The speaker explains that selecting an AI model involves a three‑dimensional trade‑off among response time, accuracy, and the specific risk profile of the use case, and that knowing the model size combined with appropriate hardware accelerators allows optimal deployment for each task.
- [00:12:52](https://www.youtube.com/watch?v=KX0qBM-ByAg&t=772s) **Hybrid On‑Chip/Off‑Chip Fraud Detection** - The speaker explains using a tiny, cache‑resident AI model in the processor to instantly approve most transactions, while routing the ambiguous 20% to a larger, separate AI engine for deeper analysis.
Full Transcript
AI has captured the world's imagination.
The landscape is evolving rapidly
as enterprises begin to leverage the technology's potential for real-world business applications.
Ideas in the space for creating and improving applications are expanding in every direction,
and at the heart of these ideas are AI accelerators, hardware accelerators,
which are specific hardware designed for inferencing and other AI workloads.
These enable the transformation by providing faster and more efficient processing.
I like to think of where we are right now in the AI landscape as where the world was when the automobile was invented.
So at first, all automobiles basically looked the same,
more or less shaped like the wagons and carriages they were meant to replace,
but very quickly...
the world and industry realized, wow,
we could make ambulances and passenger cars and race cars and refrigerated milk trucks,
and all of these things have vastly different customizations in order to achieve their purpose.
That's really where we are with AI in 2025.
One size no longer fits all when it comes to AI.
Hardware accelerators are really great at
helping to right-size solutions for AI.
To really explain the intersection between hardware acceleration
and AI, let's take an admittedly simplified look at the AI stack.
So at its fundamental level, at the bottom, we have infrastructure, which is hard to spell,
so let's just call it hardware.
On top of that, of course, we have our models.
You've all heard of these.
And then at the top, last but most definitely not least, is software: management of the tasks, but also governance,
and governance is making sure that the AI continues to behave correctly,
without bias, ethically, as it evolves and develops and goes forward.
In addition, the software manages security, making sure that
the AI model data and the data that's fed in and out of those models remains private and secure.
Where do the accelerators fit in this stack?
Well, you might have guessed already, they are part of the hardware.
So the accelerators are here.
What specifically is an AI accelerator?
Again, this is purpose-built hardware, silicon, that's built and optimized
for the high-compute matrix mathematics that is necessary to do AI transactions.
So the linear algebra, the tensor calculations, this hardware is laid out specifically,
the silicon is designed to optimize and do that faster so that the inferencing answers can happen faster,
with reduced power consumption and a smaller physical footprint
compared to a general purpose piece of hardware that's designed to do everything and therefore has to cover all bases.
Accelerators do use all of the modern computer-architecture techniques,
like parallel processing and optimized memory bandwidth, and they interact with memory;
they have a large memory component.
It's a specific hardware engine that does the compute,
and that either has its own memory or it shares memory with other processors in the system.
Let's draw a picture of a system to really get a sense of where these accelerators might exist in an actual piece of hardware.
So this is a computer system.
It could be a server or a mainframe,
and like most today, let's make it an SMP so it has a variety of processor chips on it.
These AI accelerators, they could be located anywhere within the system.
For example, you might wanna have an on-die accelerator close to your processing compute.
You could have that on one or more of your processor compute engines.
You could have an AI engine somewhere else in the system, not necessarily on a processor chip, but still within the box.
Additionally, since servers interact with the external world through industry standard IO protocols,
you could actually attach an AI accelerator
externally to the box,
and those accelerators could come with their own compute as well as their own memory.
The other thing to note about these hardware accelerators,
and this is what really provides that flexibility and scale that we really need in computing today
to enable all of these use cases and these ideas that people are coming up with in the AI industry, is the...
concept that each one of these AI engines doesn't have to be the exact same design.
Your on-chip accelerator, for example, could be optimized in a different way for a different type of workload
than the one that's inside the system, which may or may not be different from the ones that you attach to the box.
So there's definitely a heterogeneous system of hardware, and within that hardware,
a heterogeneous set of accelerators is becoming available so that we can meet these growing AI needs.
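The heterogeneous layout the speaker describes can be sketched in code. This is a minimal, illustrative Python model: the class, field names, and the three example engines are assumptions for the sketch, not a real system's inventory.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    """One AI engine in a heterogeneous system (names are illustrative)."""
    location: str       # "on-die", "on-chip", or "off-chip (IO-attached)"
    own_memory: bool    # dedicated memory vs. shared with other processors
    optimized_for: str  # the workload class this engine's silicon targets

# A single SMP box can mix several differently optimized engines.
system = [
    Accelerator("on-die", own_memory=False, optimized_for="small predictive models"),
    Accelerator("on-chip", own_memory=False, optimized_for="medium models"),
    Accelerator("off-chip (IO-attached)", own_memory=True, optimized_for="large generative models"),
]

# Engines with their own memory can host models too big for a shared cache.
standalone = [a for a in system if a.own_memory]
```

The point of modeling it this way is that each engine carries its own placement and memory attributes, so software can pick the engine whose design matches the workload.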
To really see how these accelerators now interact with the rest of the stack,
let's take a couple of minutes to just think about models for a second,
and I think we're all familiar now with our traditional AI models, right?
Machine learning, deep learning.
These are relatively fast.
They're relatively small in terms of model size,
and they're pretty good at coming up with, I'll say, suggestions.
They can't give me life advice necessarily, but they can offer suggestions.
Gen AI, coming out in recent years, is more complex, right?
It's another level, right: they're generative versus predictive.
These are things like our LLMs and simulators; these are much larger models in general.
They take more energy to run and they are more costly,
but they're also more powerful in terms of delivering accuracy and doing more complex things.
They can...
dare I say it, maybe not give life advice, but in some cases give, you know, advice.
Do some advisement.
Just because we have this new, more complicated one,
it doesn't mean that we should abandon the traditional AI, because it does have a lot of value.
Most notably, it's fast, it's small, and it is less expensive, right?
So if we can get a good answer out of that, it is worthwhile to use it.
Really, right-sizing your model is where you want to be when you are doing AI.
If you have an AI task that's this size and your model is little,
you run the risk that you might get an answer fast, but that answer might be wrong,
or it might come back with more risk than your application can tolerate.
So you might say, okay, I'll err on the side of caution and I'll just always use a huge model,
but you're not being cost-effective and you're not being sustainable when you do this.
In addition, you run the risk,
if you happen to be attempting to do something in the same compute space as your running system,
you run the risk of kneecapping other simultaneously running workloads on your box if you use too large of a model.
So ideally, what we wanna go for and try to do is to right size the model to the task that we're doing.
Ok, how do hardware accelerators help with that?
To understand that, it helps to think about the model the way hardware thinks about the model,
which is that the model really, at the end of the day, is data,
and when you think about hardware and how you optimize hardware, what we think about generally is performance,
a common metric of performance in the hardware world,
familiar to me (I'm a hardware person, if you couldn't tell already),
is latency, or what we call response time, right?
How fast do you respond?
And when you're measuring response time as a function of size,
right, the size of your working set, the size of your data, the size of your model,
basically, generally, the bigger that model gets, the worse your response time gets, right.
It takes longer and longer to get the answer out there, right?
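One back-of-envelope way to see why response time grows with model size: at low batch sizes, inference tends to be memory-bandwidth bound, so latency scales roughly with the bytes of weights that must be streamed. The function below is a hedged sketch of that rule of thumb; the parameter defaults (2 bytes per parameter, 100 GB/s) are illustrative assumptions, not measurements of any real system.

```python
def est_latency_ms(params_millions: float, bytes_per_param: float = 2.0,
                   mem_bandwidth_gbs: float = 100.0) -> float:
    """Rough lower bound: time to stream all weights through memory once."""
    model_bytes = params_millions * 1e6 * bytes_per_param
    return model_bytes / (mem_bandwidth_gbs * 1e9) * 1e3  # milliseconds

small = est_latency_ms(10)      # a ~10M-parameter predictive model
large = est_latency_ms(7000)    # a ~7B-parameter generative model
# The bigger the model, the worse the response time, roughly linearly.
```

Under these assumptions, the small model streams in a fraction of a millisecond while the large one takes hundreds of times longer, which is the near-linear trend the speaker is drawing.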
If we try to use a similar graph to measure performance of AI, it's not quite as linear,
because what we're trying to do really when we're measuring how well AI performs is we really say, well, how effective is it?
And if you try to draw a graph like this, it would end up looking a little bit more confusing,
and you kind of have to think, well what's going on there?
What's going on there is that response time,
the hardware performance, is not the only factor
when you're talking about measuring the success or the effectiveness of the AI.
It's really accuracy as well as performance.
And the variable that's hidden here is a third dimension.
Bear with me as I try to draw a third dimension,
which we touched on before; I hinted at it.
It's really your use case,
and that green line is really coming out in the third dimension here.
So the questions you want to be asking when you're trying to decide what model to use are,
how much risk can I tolerate?
If I'm a credit card fraud detection system, you know,
I have a different risk profile, I have different inputs, I have different levels of
tolerance than I do if I'm a batch document summarization process, right?
So those factors drive the need for models that align with specific use cases.
So at the end of the day, what's happening is your use case is dictating your model size,
and if you have knowledge of your model size, and you have access to these hardware accelerators,
then you can leverage these hardware accelerators and run
the model that you need on the optimized hardware accelerator for the task at hand.
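That use-case-to-deployment mapping can be sketched as a small routing function. Everything here is a hypothetical illustration: the tier names, the 10 ms threshold, and the placements are assumptions for the sketch, not any real product's policy.

```python
def choose_deployment(latency_budget_ms: float, needs_high_accuracy: bool):
    """Map a use case's constraints to a model tier and accelerator placement
    (illustrative thresholds and names)."""
    if latency_budget_ms < 10 and not needs_high_accuracy:
        # Tight budget and a fast suggestion is enough: stay close to the data.
        return ("small predictive model", "on-die accelerator, shared cache")
    if needs_high_accuracy:
        # Accuracy first: a bigger model on an engine with its own memory.
        return ("large generative model", "off-chip engine, own memory")
    return ("medium model", "on-chip accelerator")

# Real-time card authorization: milliseconds matter, a suggestion will do.
model, placement = choose_deployment(5, needs_high_accuracy=False)
```

The design point is simply that the use case drives the model size, and the model size in turn drives which accelerator in the heterogeneous system to use.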
Let me illustrate all this with an example.
It's a common example, but it's one of my favorites because it's very personal and it resonates,
I think, with everybody: it's credit card fraud detection, right?
I wanna use my credit card online all the time,
and I want to hit pay now and I want it to go through as long as it's me.
If it's not me, I want it to stop.
So that task is fairly complex.
It also has a high dependence on response time because
I'm gonna navigate away if it takes too long to use this site, I'll go someplace else, right.
So let's illustrate how we might use AI models and hardware accelerators to optimize for that use case.
So we have a task here,
and really, that task is: is it fraud?
All right. I want to get the answer as fast as possible.
So maybe I'll start by using an
ML/DL model.
This will be fast, it'll have a low response time, and it'll have a good forecast of whether or not this is fraud or not.
Maybe it'll be 80% successful; I'm making these numbers up.
That model, if I'm doing something like this,
I will probably, if I have it, want to use a hardware accelerator
that's on the same processor chip as the transaction workload that's running.
The reason is that transaction workload already has the data
that needs to go into that model, and the model's a relatively small size,
so you can pull that data into the cache right there in the processor chip
and enable in real time actually getting, you know, your suggestion of whether or not this is fraud,
very quickly.
Then I'd say 80% of the time, that answer comes back.
They're like, yeah, this is you.
This is a valid purchase.
I'm gonna approve it.
And then you get your toy that you bought and nobody loses any money and everything is good,
but maybe 20% of the time, the model says, hmm, I'm not sure.
Some things look legit, but some things look suspicious.
We need to take a deeper look at this.
At that point, then, you could decide that now you know it's worth the investment
of going into a more expensive gen AI model,
and that model requires a lot more memory because it's a much bigger model.
So you wouldn't be using the AI engine in this case that you had on your die.
You might want to use an AI engine that's on a different die that's not running a
performance critical workload, or maybe one that's in a different place on the system,
or maybe one attached through the IO peripherals that has its own memory.
Based off of that, if you have that flexibility, then you can decide which to use and the hardware can use that,
and then at the end of that transaction, it can come out and say, no, it was good.
It's okay, it turns out that that was the right one.
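The hybrid flow just described can be sketched as a two-stage cascade. Everything below is a hypothetical illustration: the scoring functions are stubs, and the 0.2/0.8 confidence band stands in for the speaker's "ambiguous 20%".

```python
def fast_score(txn):
    """Stub for the small, cache-resident on-die model (hypothetical)."""
    return txn["amount_score"]  # pretend the tiny model emits a fraud probability

def deep_score(txn):
    """Stub for the larger model on a separate engine with its own memory."""
    return min(1.0, txn["amount_score"] * txn["context_factor"])

def authorize(txn, low=0.2, high=0.8):
    """Two-stage cascade: decide on-chip when confident, else escalate."""
    p = fast_score(txn)
    if p <= low:
        return "approve"   # the confident majority: answered in real time
    if p >= high:
        return "decline"
    # The ambiguous band: now worth the cost of the bigger, slower model.
    return "approve" if deep_score(txn) < high else "decline"

clear = authorize({"amount_score": 0.1, "context_factor": 1.0})    # "approve", no escalation
flagged = authorize({"amount_score": 0.5, "context_factor": 2.0})  # escalated, "decline"
```

The key property is that the expensive model only runs on the transactions the cheap one could not settle, which is exactly what makes the hybrid placement pay off at scale.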
So the idea here is using more than one model, this concept of multi-model AI, or ensemble AI,
can really be supported and enabled and optimized
by having these hardware accelerators that are optimized for different model types,
and then, when you think about doing AI, the last thing to think about
here is, you know, we're not doing one of these a day, right?
We're doing millions of these a day, many of them at the same time, right.
So you want to have all of these AI processing engines running all the time, doing different things,
and then the next second or millisecond, each is running a completely different model with a completely different task.
So hardware accelerators give you the flexibility to do that at scale,
and because of that, they intersect well with the AI stack as we know it today
and are really starting to be used to
ensure that we're continuing to expand and meet business needs by deploying scalable,
efficient, and secure AI to meet today's business problems and tomorrow's.