
# Training vs Inference Hardware Landscape

**Source:** [https://www.youtube.com/watch?v=rn1sjMFRzTQ](https://www.youtube.com/watch?v=rn1sjMFRzTQ)
**Duration:** 00:38:18

## Summary

- The episode focuses on how the training and inference hardware stacks are increasingly diverging, raising challenges for designing datacenter‑grade chips that remain viable for 5‑6 years as model architectures evolve.
- Apple’s hybrid approach—running simple tasks on‑device and off‑loading more complex reasoning to the cloud—is highlighted as a potential industry‑wide pattern for improving composability of chips and models.
- Model‑optimization techniques are becoming more accessible, with growing tooling and documentation that help developers squeeze maximal performance out of existing hardware.
- NVIDIA’s recent dominance in AI hardware investment is noted, but speakers emphasize that differing consumer needs and use‑case requirements are driving a split between training‑focused and inference‑focused ecosystems.
- Guests include Volkmar Uhlig (VP of AI Infrastructure), Chris Hay (CTO of Customer Transformation), and Kaoutar El Maghraoui (Principal Research Scientist for AI Hardware), who discuss these trends and their implications.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=0s) **Hardware Trends & Apple Hybrid** - Brian Casey explores the split between training and inference hardware, the longevity challenges of data‑center chips, Apple’s on‑device‑plus‑cloud model, and emerging model‑optimization techniques.
- [00:03:02](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=182s) **Scaling Inference and Training Across GPUs** - The speaker outlines how inference is limited by latency, throughput, and memory—necessitating multi‑GPU (A100/H100) configurations to keep token generation human‑readable—while training emphasizes connecting thousands of GPUs and maximizing inter‑GPU communication speed to achieve cost‑effective performance.
- [00:06:07](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=367s) **Clients Prefer SaaS Over On-Prem GPUs** - Clients avoid managing their own GPUs for inference due to cost inefficiency, opting for token‑based SaaS services, while only regulated sectors require on‑prem GPUs, making cost reductions vital for startups and SMEs.
- [00:09:13](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=553s) **Debating LLM Dominance and Alternatives** - The speaker reflects on Jan Leike’s warning against a singular focus on large language models, asks whether hardware players are similarly fixated, and notes ongoing research into alternative model architectures.
- [00:12:17](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=737s) **From Paper Hype to Industry Impact** - The speaker reflects on the viral excitement surrounding a new research paper, the gap between initial hype and practical reality, and outlines the milestones required to transform the novel approach into commercially relevant hardware.
- [00:15:20](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=920s) **AI Model Lifecycle & Hardware Shift** - The speaker explains that large language models typically have a 5‑6‑year relevance window, with successive 10× performance gains enabling ever more complex architectures, while hardware manufacturers are rapidly re‑optimizing GPUs for the memory‑bandwidth‑intensive transformer stack, shifting from many‑GPU setups to fewer, more specialized cards.
- [00:18:25](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=1105s) **Consumer Hardware Constraints vs Model Innovation** - The speaker explains how consumer‑grade hardware limits model deployment, weighing the choice between sticking with established CNN hardware and adopting large, fast new architectures, and the resulting challenges for hardware vendors.
- [00:21:38](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=1298s) **Stakeholder-Driven Factors in AI Deployment** - The speaker outlines how consumers, model‑training companies, and cloud inference providers each prioritize distinct concerns—cost, hardware compatibility, training speed, and token‑cost efficiency—shaping model design and the growing importance of composable model architectures.
- [00:24:54](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=1494s) **Secure Edge-Cloud Architecture Discussion** - A speaker outlines a client‑focused architecture that extends trust from a handheld device into a data centre, emphasizing secure boot, non‑introspectable admin access, digitally signed binaries, and physical isolation.
- [00:27:57](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=1677s) **Apple M3 Enables Local LLM Fine‑Tuning** - The speaker explains that the M3’s 128 GB unified memory allows them to run and fine‑tune massive models such as Llama 3 70B directly on a MacBook, eliminating the need for cloud resources and positioning Apple as a developer‑focused AI workstation.
- [00:31:15](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=1875s) **LLM Optimization: Techniques & Tools** - The speaker highlights how combining fine‑tuned small models with compression methods like quantization, pruning, and knowledge distillation—supported by libraries such as Hugging Face Optimum—delivers lower inference costs while maintaining performance, reflecting the current market excitement.
- [00:34:21](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=2061s) **Pre‑tokenization and Feature‑Based Freezing** - The speaker explains how pre‑tokenizing data and batching by equal lengths dramatically speeds up fine‑tuning, and predicts that future optimizations will move from freezing whole layers to selectively freezing individual learned features.
- [00:37:25](https://www.youtube.com/watch?v=rn1sjMFRzTQ&t=2245s) **Rapid Gains in Mixture‑of‑Experts** - The speakers wrap up a discussion on continuous batching, noting its low latency and unchanged quality, while celebrating the excitement of sudden 10× performance improvements in the rapidly evolving Mixture‑of‑Experts field.

## Full Transcript
0:00Welcome everyone to this week's Mixture of Experts. 0:02I am your host, Brian Casey. 0:04Um, today we're doing all hardware all the time. 0:07Uh, we are going to start with a deep dive really of the industry 0:12and space right now, specifically about, um, how the training and 0:16inference stacks are diverging. 0:18Model architectures still. 0:19evolve over time. 0:21So how do you build hardware that, you know, you put in the data center and you, 0:24you know, depreciate over like five, six years so that you can still run the model, 0:29which is active in six years with today's hardware, you know, you're putting down. 0:34From there, we're going to move to Apple and talk a little bit about how The 0:38architecture patterns that we see there that are combining on device and cloud 0:42could be a pattern for the industry. 0:44The composability of these chips, the different models 0:47is going to become important. 0:49And I liked the way Apple did this, right? 0:51Here's some of the stuff that you run on device. 0:53But when I want to do something a little bit more complicated than 0:55reasoning, I'm going to come off device and I'm going to push that 0:58on the cloud to perform that action. 1:00Finally today, we're going to talk about model optimization and the things 1:04that people are doing with models. 1:05to better take advantage of the hardware that's available to them. 1:08But I see there's a lot of energy there and a lot of work and documentation 1:13and, uh, which make it very easy, you know, for the end users and the 1:17developers to use these techniques. 1:25Joining us on the show today, we have Volkmar Uhlig, who is 1:28the VP of AI Infrastructure. 1:30We have Chris Hay, who is the CTO of Customer Transformation and we have 1:35Kaoutar El Maghraoui, who is the Principal Research Scientist for AI Hardware. 1:44We're going to start like a little bit big picture. 1:46Um, and Volkmar, I'm going to throw this first one over to you. 1:50Um, but obviously like NVIDIA has been, has absorbed a lot of like the oxygen 1:55and perhaps money in the room, uh, in the hardware space over the last 18 1:59months in a sort of like classic, Um, when there's a gold rush picks and 2:03shovels, um, toward a sort of investment. 2:06Uh, but I think one of the thing that's becoming like more discussed and kind 2:09of more clear is that the ways that like the training and inference stacks might 2:14diverge, um, over time, the extent to which that's kind of already happening. 2:18Um, because I think some of the players that are involved in each 2:21one of those things, and some of the like, the interests of the consumers 2:24are, um, you know, a bit different. 2:26Yeah. 2:26Even at some of the use cases obviously are very different, but I wonder if you 2:29might just like start us off and just like maybe give, like, I don't want to say an 2:33overview of the whole landscape, but maybe start with that kind of piece of talking 2:37about the, like the two stacks of like training and inference, uh, kind of maybe 2:41some of what you see going on in each of those and just like how you see them 2:44kind of diverging both now and over time. 2:47Yeah. 2:48Okay, cool. 2:49So I think the, uh, uh, the main difference between. 2:53inference and training stack is to scale, uh, these GPUs get connected. 2:58And so we like, there's a bunch of accelerators happening, but 3:01let's just stay in the GPU market. 3:03On inference. 
3:04You're primarily limited by, you know, the throughput you want to 3:07achieve the token latency you want to achieve, um, and then there's a 3:12certain amount of memory you need. 3:14And so the bigger the models get, then usually, you know, you start 3:17spanning across multiple GPUs just to maintain, like you have enough 3:20memory and maintain the throughput, um, and then achieve latencies, which 3:25are reasonable for human beings. 3:27So if you look at the big GPUs, which is often what... 3:30what people are using, A100s, H100s, and, you know, what's coming out, um, 3:36you connect, you know, two to four of them to just get into latency bands, um, 3:41where, uh, you know, the human can not read as fast as the tokens get produced. 3:47Um, and then you have the, the other area, um, we're just training and training is 3:53really, uh, how many GPUs can you connect into a single system, and it's much 3:59more like the HPC space, you know, very like, you know, for 20 years, we tried 4:04to figure out how to build networks to connect thousands or tens of thousands of 4:09computers, and now we're doing the same thing again, it's just called the GPU. 4:13And here it's really how fast can you exchange information between these GPUs? 4:19So you have these all gather operations or weight redistribution operations, um, 4:25but you may have 10, 000 GPUs behind it. 4:27So the real value here is how fast can you get data out of these GPUs 4:32to the other GPUs and how fast can you distribute them to all. 4:35And so from a cost perspective, um, in, in basic inferencing is really like, you 4:41know, you cram something in a box and you want to have a power efficient, um, and 4:45then you want to achieve those latencies. 4:48And in the, in the case of training, uh, at least for like the very large 4:51training sets, uh, it's like how many computers can you fit into a 4:55room and connect them with a super latency, uh, super low latency network? 5:00And many cases like You know, the hardware cost in the network is 5:0330 percent of your overall system cost just to interconnect them. 5:07And Chris, maybe a question to you is, um, on the inferencing side, like in 5:13particular, have, have you seen like costs on like an absolute basis already become 5:19like a real issue for clients, or is it the type of thing where, you know, they're 5:24concerned about like the unit economics and as it scales, it will obviously 5:28be an issue, like how real is the cost problem today um, on just like the 5:33production workloads people are working with when it comes to work with LLMs? 5:36Is that like a theoretical problem or real problem? 5:39Um, a very obviously in the future problem, uh, real problem, but how 5:43would you, how would you describe that? 5:45I honestly think from a client perspective, and this is going to sound 5:49really harsh, they don't really care. 5:51Because most of the time they're getting inference from the cloud, right? 5:55And they're playing cost per token, so they're not really provisioning GPUs. 6:00I don't know many clients who are provisioning GPUs? 6:03Maybe some are doing that for fine tuning, but they're certainly 6:06not doing that for inference. 6:07In fact, clients are actively avoiding anywhere where they have to have 6:13a GPU for inference themselves. 6:15And the reason they're avoiding that is they're not going to shove enough 6:18workload through that for inference to be able to justify the cost. 
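As a rough aside on Volkmar's point above that model size alone pushes inference across multiple GPUs, here is a back-of-the-envelope sketch. The parameter counts, fp16 precision, 80 GB card size, and overhead factor are illustrative assumptions, not figures from the episode.

```python
import math

def min_gpus_for_weights(params_billion: float, bytes_per_param: float = 2.0,
                         gpu_memory_gb: float = 80.0, overhead: float = 1.2) -> int:
    """Rough lower bound on the GPUs needed just to hold a model for inference.

    bytes_per_param: 2.0 for fp16/bf16 weights, roughly 0.5 for 4-bit quantized weights.
    overhead: crude multiplier standing in for KV cache, activations, and runtime buffers.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes each ~= GB
    return math.ceil(weight_gb * overhead / gpu_memory_gb)

# Illustrative numbers only (80 GB cards, fp16 weights):
for size in (7, 70, 180):
    print(f"{size}B params -> at least {min_gpus_for_weights(size)} GPU(s)")
```

Even this crude estimate lands in the "two to four big GPUs" range mentioned above once models reach the tens of billions of parameters.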
6:22So they really want to be paying more in a kind of SaaS model 6:26where it's kind of paying for the tokens as opposed to GPU time. 6:29And if you think about it sensibly, GPUs would just sit around doing nothing most 6:34of the time in a client's estate, so I think this is a big technology company 6:39problem, not necessarily a client problem. 6:43Now, don't get me wrong. 6:44There are some clients who have to run on premise workloads. 6:48So think of financial institutions, government institutions where they don't 6:51want that data going to the cloud, and they have that problem where they have 6:55to think about provisioning their own GPUs. 6:57But, but regular clients, I don't think they think about that too much. 7:01But I think I, you know, this is very true, Chris, but I think especially 7:05for startups and like small and medium businesses, this becomes a big challenge, 7:11especially being able to maybe either train or fine tune models for their 7:17whatever business that they care about. 7:20This becomes a real issue. 7:22So cost reductions for them will be significant. 7:25I think like what Chris said is, is right. 7:28To a certain extent, the customer doesn't really care. 7:31Um, you know, they don't want to know that there's a GPU behind it, ultimately. 7:35Um, they only care that they get the service. 7:37And they get the service at a certain cost and at a certain latency. 7:40Because in the end, you're modeling something where, let's say, I have an 7:44interactive workload with a customer and the customer doesn't want to wait like 7:4720 seconds until they get an answer back. 7:50Um, we want to hide that in the data center in the inferencing service. 7:54Now, from a hardware perspective from, you know, companies which are 7:58actually invested in those hardware systems, that's the optimization, right? 8:02And that's, I think, where we will see much more custom hardware going 8:07into the market, um, which is highly optimized for these specific workloads. 8:13And it's interesting to see like, you know, we have a trend of like, okay, 8:16there's, you know, more common models and model architectures, but model 8:20architectures still evolve over time. 8:23So how do you build hardware that, you know, you put in the data center and you, 8:26you know, depreciate over like five, six years so that you can still run the model, 8:31which is active in six years with today's hardware, you know, you're putting down. 8:35So I think there is a, there's an optimization game probably played by 8:40the big players, you know, the AWSs and you know, the Azures and you know, 8:45the IBMs who are putting specialized hardware, highly optimized inferencing 8:51engines with a lot of variety for these different models into the data center. 8:56And I think, and for companies, it'll be hard because, like, 8:59how many GPUs do you wanna host? 9:01Like you want to homogenize, but now you may overpay. 9:04And so the cloud may be your answer, uh, to get cost reductions in place. 9:09To that point about maybe optimizing for five or six year time horizons. 9:13Like one of the themes I've seen come up, um, more recently, 9:17I've heard Jan Leike.... 9:19Um, make versions of the statement, a few others, um, that like LLMs, 9:25even within the research and academia community have like absorbed all of 9:28the oxygen, um, in, in the room and that there's other, there's other 9:32areas and like ways that people could potentially even imagine getting to AGI.
9:37But like right now, it's just everybody's on this one path and so I 9:41think Jan even like famously was like telling new researchers like don't work 9:46with LLMs like go do something else with your life, uh, basically which 9:49like the reaction in the market was like sort of predictable to that but 9:53it was like very Jan on some level. 9:55Um, but I'm curious like do you see like the hardware players like almost following 10:00that same sort of dynamic where like everybody sort of like is looking what's 10:03happening with LLMs right now and they're just sort of all in like the transformer 10:07has been pretty resistant up to up to this point with some like tweaks to it so... 10:12are people kind of looking at the model architectures that are around today and 10:15just sort of like assuming that those are going to hold for at least at least 10:20the medium term or, you know, do you see places where people are like, you 10:24know, exploring, you know, what other, um, um, you know, maybe what other 10:31alternative architectures might play a role at some point in the future? 10:34I think that's a great question, and there's a lot of research 10:37actually exploring alternative architectures to transformers. 10:40Transformers, of course, it's a pretty solid architecture. 10:44Uh, the example, for example, the MatMul-free LLMs. 10:48Which is… which is, I think, a paper that has demonstrated the viability of MatMul-free architectures. 10:55And the research opens up a new frontier in LLM design. 10:59Also encouraging the exploration of alternative operations and computational paradigms. 11:04So this is actually kind of opening the frontier here for a new wave 11:09of innovations with researchers developing novel model architectures. 11:14So this particular paper, uh, really kind of looked at using these ternary 11:20operations, replacing the, uh, the math behind the matrix multiplication with 11:27much simpler operations, and that's actually the biggest bottleneck you 11:31see in these LLMs computationally. 11:33And this also translated into huge reductions in the memory 11:37and the computational, uh, requirements needed for these LLMs, also, they really 11:42showed some very interesting benchmarks and results that show you could get almost 11:47the same accuracy or the same performance as you would get with transformers. 11:52So and I think this is just the beginning. 11:54So there is a huge need to look at alternative architectures, uh, 11:58either using, you know, these novel architectures or using things like, 12:01uh, neural architecture search where you're trying also to use AI 12:07to generate efficient architectures for hardware, to target, you 12:13know, special platforms. 12:15And those are much more energy efficient. 12:18The reaction to that paper when it came out in the market 12:20was like pretty substantial. 12:22Um, I think the original tweet that got shared as well, like announcing the 12:27paper, either just like summarizing it even, um, I think had a couple million 12:30views associated with it and people were very excited and, um, in the way 12:35that Twitter gets, um, around things like, oh, this is, you know, a big 12:39change, it's like, we're not going to do matrix math on GPUs, uh, anymore. 12:43And obviously, like there's the initial hype and then, you know, there's 12:46reality that sort of comes after that.
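As a toy illustration of the ternary-operation idea Kaoutar describes (a sketch, not the paper's actual method), the snippet below replaces a dense matrix-vector product with weights restricted to {-1, 0, +1}, so each output is built from additions and subtractions rather than multiplications.

```python
import numpy as np

def ternarize(w: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Map weights to {-1, 0, +1}: small weights become 0, the rest keep their sign."""
    t = np.zeros_like(w, dtype=np.int8)
    t[w > threshold] = 1
    t[w < -threshold] = -1
    return t

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute W @ x when W holds only {-1, 0, +1}: each output is a sum minus a sum."""
    pos = w_ternary == 1
    neg = w_ternary == -1
    return np.array([x[p].sum() - x[n].sum() for p, n in zip(pos, neg)])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(4, 8))
x = rng.normal(size=8)
print(ternary_matvec(ternarize(w), x))   # additions and subtractions only
print(ternarize(w).astype(float) @ x)    # same result via an ordinary matmul, for comparison
```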
12:48Um, but in, in a space like that where you're doing that sort of 12:51fundamental research and a lot of the indicators were looking like, you 12:55know, this actually could be a thing... 12:57um, what do you, what are like the next steps for an approach like that going 13:02from something that's just like a novel piece of research to something that 13:06is like approaching a place where it might have some commercial relevance? 13:10Like, how does it go, and like, you know, does it end up having any 13:13impact on the hardware ecosystem? 13:15Maybe before we get to that hardware point, you know, maybe just would love 13:19to hear from the group of, you know, what are the things that you would want to 13:22see that would give you some indication that like a different approach like this, 13:26um, might have some, some relevance and some staying power, um, in the industry? 13:30So I think that we go through functional plateaus with these different models. 13:37So if you go back, we had CNNs, and suddenly computers could see. 13:41And then we got transformers, and suddenly they can reason. 13:45And we are going through that, through those phases, I think there will be new 13:50model architectures coming, you know. 13:53Like this is almost like a scientific discovery. 13:55Suddenly, The model exhibits a certain behavior, which we 13:58haven't seen before, right? 14:00And we had the similar effects when CNNs came out and suddenly, you know, all the 14:07image processing algorithms were out the window and now all the, the NLP systems 14:12are out of the window because we don't do entity extraction anymore because 14:16we can directly reason in a network. 14:18And so I think there will be new architectures coming with new 14:21capabilities, and I think this will be a lot of the effort, like the 14:24people will start experimenting with these alternative model architectures. 14:28At the same time, we are seeing that, you know, people are trying to rewrite 14:32the existing capabilities because now you have that new plateau where you are, 14:36you know, we know how Chat GPT needs to behave and we, we can benchmark it. 14:42Before we could benchmark the vision algorithms and we could say, you 14:45know, what's the detection rate. 14:47And so I believe that we will constantly go through these loops. 14:50And if these fundamental changes happen, suddenly you got, you know, a 10 X 14:54performance improvement, but, and then the, the hardware will catch up, but 14:59you will not take something which is in production and necessarily say, Oh, 15:02let's just throw it out because these models have been tuned for, for ages. 15:07So if you look at the introduction from discovery until it goes into the 15:11industry, you know, if you are like at. 15:13Three years now, like, you know, it's like Chat GPT already I mean, it's already 15:18aged, like the transformer model is aged. 15:20And so what's the lifetime? 15:22It's probably not like twenty years, but it's probably like five, six 15:25years until the new model shows up and things, you know, get tweaked, 15:29and each of these 10 X performance improvements now allow you to make the 15:33model 10 X more complex, because now you suddenly get all the computation free. 15:38And so I think those things will happen over and over again. 15:41We are just now in this era of discovery, right before a true architecture or 15:45a lasting architecture was found. 15:48So, and I think we will have this for a couple more years. 
15:52Do you think that basically every, every hardware player in the space right now 15:56is up is like building out what they're doing, optimizing almost entirely for like 16:01the existing transformer and LLM stack? 16:03That's the safe bet, right? 16:05So if you look from a, from a manufacturer, if you build a card, 16:08like you go like all the cards, if you go back five years, cards 16:12don't have enough memory bandwidth. 16:14Then suddenly training of these LLMs comes on online and it's like everybody's 16:18like oh I need two terabytes of memory bandwidth per second and now I need four 16:23and you know if you're growing these cards if you would have optimized the card five 16:27years ago you would have done that for encoder models which are totally compute 16:30bound and not that, not memory bound. 16:33And so we are seeing the shift towards these models as they come online. 16:37And then you go from, oh, I need 32 GPUs to I need one. 16:42But that's the architectural shift because the hardware gets like gets modified. 16:47And you are like, Okay, I'm willing to pay 10 times as much for memory bandwidth. 16:52And so but overall, I'm still getting a better deal. 16:55And so I think those things will happen over and over again, you know, like 16:59effectively what NVIDIA did is like it took the computation of the CPU 17:03and said, okay, I give you 100x for this different architecture, right? 17:06And now we are going through a cycle of, okay, we need more memory. 17:10And so then there will be specialized operations. 17:12We already see it with CUDA cores, right? 17:13So do matrix multiplications in hardware and, you know, you just 17:17do this over and over again. 17:19I, I kind of disagree. 17:21Yeah. Sorry. I kind of disagree. Good. 17:23I think, I think last best model wins. 17:25That's it. 17:26So how many people here is still using LLAMA 1? 17:30How many people here is still using LLAMA 2? 17:33Nobody. 17:33Why? 17:33Because we're using LLAMA 3. 17:35How many people is using Mistral version 1? 17:37Zero. 17:38Cause we've all moved on to Mistral 0.3. Last best model wins. 17:41So therefore, if somebody comes up with a good new architecture that beats the 17:47old architecture, Everybody's going to move to that super, super quickly. 17:51Consumer doesn't care because consumers are not training models, so they 17:54will take the first model that works, they'll run it, as long as it works on 17:58their consumer hardware, they're good. 17:59Slightly different for data centers, but data centers are not going 18:02to want to be left behind, so they're going to run towards it. 18:05So hardware wise, if I'm honest, as long as it runs on my Mac, which 18:10runs MLX, I don't care, right? 18:13So I'm not going to go out and buy a GPU, it's going to run on my MLX. 18:17Now that's different for, uh, folks who run data centers because they're going 18:20to want the most efficient inference that they can and they're going to invest 18:24in the hardware to be able to do that. 18:26But me as a consumer, last best model wins and, you know, it's got 18:30to run on my consumer grade hardware. 18:32Good, but now you just defined what the constraint is and it needs 18:35to run on my consumer hardware. 18:37So there is effectively a, there are two sides of this. 18:40There's like, to what hardware do I build, which you are going to buy 18:44with your next laptop in three years? 18:46And what's the model architecture? 18:48And so you have a deployment, deployment issue, right? 
18:51So CNNs, for example, haven't fundamentally changed. 18:55So CNNs, the hardware is the same, right? 18:58It's an encoder model. 18:59It's not a decoder model. 19:01And so you, you, Still run the same hardware, but now we said, Oh, and by the 19:05way, we have this new workloads, but fund fundamentally CNN's all kind of stuck 19:09where they were a couple of years ago 19:11To a point, because if there's a brand new model with a completely 19:15different architecture, and it could be. 19:17It could be a 5 trillion parameter model for all I care, right? 19:20If the latency is fast, then again, I'm putting more constraints on it, 19:23and it's SaaS and I can't run it on my machine, but it smokes whatever 19:28I can run on my machine by 100x, and the price is so cheap, I don't care. 19:32I don't need to run it on my machine at that point, because That model 19:36there is the new best model, and it wins, and therefore me as a consumer 19:41is going to run right towards it. 19:42I think on the hardware space it is, it has been constantly a challenge 19:46for these hardware vendors to figure out what do we optimize for. 19:50There's always this trade offs between building this general purpose accelerator 19:54that can run all these different architectures Or this really super 19:58optimized things for like the LLMs or for the CNN, so for, you know, the LSTMs 20:04or all of these different architectures. 20:06So that I think that's going to continue to evolve and because it takes a long 20:11time to design hardware, so it's not a software that's, you know, every day you 20:15have new architectures, new, new releases and things like that for hardware. 20:20It is a much longer timeframe, so. 20:23so this really, you know, becomes a challenge for the hardware vendors. 20:26What do we optimize for? 20:27And I think maybe that's going to even change the way these hardware 20:31vendors and the architecture and so forth think about hardware moving 20:35into these composable systems. 20:37Maybe I don't need to build these monolithic chips that are built to design 20:42everything, you know, to optimize for one specific model or architectures. 20:47Can we compose things and then plug plug them or compose 20:50decompose in a dynamic fashion? 20:52So I think this is maybe something a chiplet, for example, it's something 20:56trying to do something similar where we we don't want to have, you know, 21:00these one big chip, but having these, uh, uh, chiplets where that you can 21:05compose, uh, very easily and then scale, especially for different use cases. 21:11I think the, the, this race for the best model is going to continue the, 21:16the, also the race on the AI hardware... 21:19how do we optimize, what do we optimize for? 21:21How do we build the next roadmap for the next five years? 21:24That's also going to be a challenge, but it's going to force, I think, 21:27a much closer hardware software co design kind of cycle that we need 21:32to shorten so we can reduce cost efficiencies and win in the market share. 21:39And I think you're right. 21:40There is, I think there's two different driving factors, right? 21:43So if you're a consumer, there are three different driving factors. 21:47If you're a consumer, you care about what is cheap for you, what 21:50can run on your machine, etc. 21:51That's kind of one factor. 21:53If you are a company who is training these models, then the 21:59biggest factor in your case is going to be, can I train the model? 
22:02Can I pump the data in, how quickly can I get my model out? 22:05And how can I improve it and keep up with the market? 22:08If you're a cloud company serving up inference or a data center serving up 22:11inference, then you're trying to maximize the cost per token or minimize the 22:18cost per token for your architecture. 22:20And that becomes really important because you want to be 22:22competitive with everybody else. 22:24So if you're charging, you know, 0.05 cents per token or something per million tokens, but somebody 22:32else is doing it for 0.01, you need to cut your inference costs, get it as bare as possible so 22:38that you can be as cheap as possible. 22:40But that's not true when training, right? 22:42When training you are trying to get the best possible model that you can and 22:46get it done as quickly as possible. 22:47So I think these different driving forces, uh, really affect it. 22:51And I totally agree with you, Kaoutar, that, that the composability of 22:55these chips, the different models, is going to become important. 22:58And I like the way Apple did this, right? 23:00Here's some of the stuff that you run on device. 23:03But when I want to do something a little bit more complicated than 23:05reasoning, I'm going to come off device and I'm going to push that 23:08onto the cloud to perform that action. 23:10And I think that sort of, 23:11uh, thing is going to be key in the future. 23:14That's a really good statement, Chris. Um, what I like about the Apple 23:17approach is that they effectively figured out an upgrade path, right? 23:21So you can have your five generations of old phones and you effectively 23:25say, whatever I can do on device, I do on device and whatever I can't, 23:28or if I don't have the hardware acceleration, I now have an overflow 23:32bucket and the overflow bucket can be: 23:34the model is too complex or like I want a complex answer to something or 23:39the other one is I have old hardware. 23:40So they effectively created for themselves now an architecture where 23:44they can start innovating despite that the device has a longer lifetime. 23:48So they decoupled the lifetime of the device from the lifetime of 23:53or from from the model evolution. 23:55And so now they can innovate rapidly because they can 23:58always update the data center. 24:01But the device which the customer held in their hand is, it's, 24:03you know, it's kind of fixed. 24:05And so I think it's, it's a really interesting move 24:07to, to do that separation. 24:09I thought that story, just like everything that they announced at WWDC from, from 24:14both that path of smaller, like many smaller models on device to their own 24:20models in the cloud, to only a third party API when they absolutely have to, 24:25and then having like the user opt in on a per interaction basis combined with 24:30their own sort of Silicon was just like... 24:34the best example yet I've seen in the market around how to do, you know, how 24:39to optimize for, for cost for speed, how to do kind of like local, um, you 24:45know, local inference, um, on, on a workload, um, you know, I'm just curious, 24:50like, do the rest of y'all, was that kind of your reaction to it as well?
24:54And I'm curious to just the conversations that you've had with clients like did 24:57that did that stick with them at all about a way to, you know, think about both 25:02like the combination of models they're working with and just like at all in 25:05terms of the way that they're thinking about like infrastructure and compute. 25:08Yeah, my reaction was the, I think it's the first architecture, which 25:12takes security and confidentiality between a device I have in my pocket 25:19and, you know, computation, which is on the other side of a wire, seriously. 25:25And I think they really try to figure out a way, um, to keep like the, you know, 25:32like they're selling a device and like everything happens on the device and they 25:35figured out, well, we cannot do that. 25:37And so they try to come up with an architecture where you, you, you 25:41extend your trust domain into a data center and they, they went, I 25:45think, overboard with like, okay, we, you know, we have secure boot. 25:50Okay. That's what everybody expects. 25:53And then you have, um, non-introspectability of any, any data, which goes 25:58in by the like by an administrator, so you don't have privilege elevation 26:03of administrative accounts, so I cannot extract data at all. 26:06So I think there is. 26:07They really, oh, and then we publish our binaries and the binaries are digitally 26:12signed and they are only buildable through our internal build processes. 26:15And so now anybody can inspect it. 26:18So they really try to say, look what we guarantee you with physical isolation. 26:23We are doing also in the cloud with like 26:25digital isolation, you know, through that, all these things. 26:29I think the second really smart move was to say we are taking the same chip, 26:34we are not buying, you know, an NVIDIA chip or, you know, an AMD chip, but we 26:39are using our own infrastructure to run those things on our, on the same device, 26:44like on the same operating system, right? 26:45So their M3 or whatever they have internally. 26:49Um, and so, and now you can run the code on both ends. 26:52So, like, from a development perspective, it really brings the cost down. 26:56But if you look at those scales, those Apple scales, with, you know, 26:59hundreds of millions of chips, like, it's really cheap for them, right? 27:03So, a chip costs you, like, 60 bucks if you make it 27:05yourself, put a bunch of memory. 27:07The whole system is probably in the ballpark of, like, $200. 27:11And they can stamp them out. 27:12They know how to stamp out at scale, right? 27:14And so, if you would do that alternatively with an NVIDIA card and x86 servers, 27:20the cost would be like super ballooning. 27:22And so I think there's a lot of really smart, uh, design decisions in there, 27:28but they really looked at it end to end. 27:30I think they figured out and they put the bar there, how trustworthy 27:36consumer AI will now look like. 27:39And everybody will be like, okay, if you don't do it like Apple, 27:41I will not use your service. 27:42And I think now what's unclear is what the enterprise answer to that is, because 27:46there are other questions which are asked. 27:49But I don't think that's just in the consumer and inference. 27:51So, uh, I run an Apple M3, right? 27:56MacBook Pro. 27:57It's a 128 gig unified memory. 27:59Um, it's, this machine is a beast, right? 28:03And basically, I can run Llama 3 70B on my local machine...
because of the unified memory, I can say right now, there is no consumer machine 28:16that I can afford, right, other than the Apple M3 that would be able to run 28:20Llama 3 on my machine, not only that, I can fine tune models just as fast as 28:28I can on Google Colab on my MLX machine. 28:33Without quantization, I can just take a base model and I can fine tune it. 28:36So I fine tuned, I think it was 1,000 rows, completely unquantized, so maybe 28:42it was about 10,000 tokens, maybe more. 28:44Um, and I did that in less than 10 minutes on my, on my M3. 28:48I didn't need to go to cloud, just did it on my machine. 28:51Now, I haven't even measured that with quantization or LoRA at that point, but 28:56that is going to be the future for fine tuning as well, even in the enterprise. 29:00Why would I go to a cloud to go and fine tune my data if I'm fine tuning maybe 29:06100,000 rows worth of data, right? 29:08Maybe I'm doing a couple million tokens. 29:11I can run that on a MacBook, and half of my enterprise is sitting with MacBooks on 29:16their machines, so I think it's, I think Apple's making a really smart play, and 29:20I don't think it's just in the inference space, I think it's in the fine tuning 29:24space as well, I think Apple has set themselves up as the real developer, 29:30uh, workstation for AI. 29:32I think combining this with something like what IBM is doing and Red Hat with 29:37InstructLab would be really powerful because it also brings that kind of, 29:43uh, creativity in terms of creating your own models, fine tuning them all in your 29:49local space without having to deal with the complexity or the cost of the cloud. 29:54I think that's going to be really powerful. 29:56It's funny. I think one of the outcomes of just that whole thing was that from the 30:01very start when people started to get educated about models, people were 30:06like, fine tune as a last resort. 30:09Like you should, you should do RAG first. 30:12You should do all these things before you ever think about fine tuning. 30:16And I kind of feel like Apple made fine tuning cool because that was like such 30:21an important part of what they're doing. 30:23And it just, it does strike me as like, as it gets easier and cheaper 30:26to do this stuff that we'll see, we'll see more and more fine tunes. 30:29I certainly like Hugging Face is just like littered with fine tunes at 30:33this point, but I think, um, it does feel to me like there's going to 30:36be sort of more and more of that. 30:38And I think that's actually sort of a great segue for the last segment 30:42where we were going to just talk a little bit about model optimization. 30:50There's a huge amount of activity in the space. 30:53I think, um, you know, the count of and it even seems like the emphasis on 30:58smaller models, even from the big players is getting to be more and more and more 31:03like even some of the messaging I've seen from them recently is like we want 31:06to deliver models that are actually like usable to the community as opposed to 31:10just like this, just ultra AGI path, which is only useful in like some dimensions. 31:15Um, you know, Chris, you just talked a little bit about LoRA and quantization. 31:19Um, I'm just, you know, beyond like, what are you all seeing in terms 31:23of the things people are doing?
31:25And that's having like the most impact in the market right now in terms of 31:30taking a combination of like small models and fine tuning all these techniques 31:35and actually compressing them down to get their inference costs lower, 31:39but still getting kind of the results that they're that they're looking for, 31:42there's like a lot in this space, and there's a lot of energy in the space. 31:45And I'm just curious in terms of like the things you've seen, the things you've 31:48done, the clients you've worked with, um, you know, just where you're seeing 31:51kind of the most bang for the buck and where people are kind of most excited, 31:54um, right now and, you know, happy for, um, anyone to kind of take that one. 31:59I think there's a lot of, you know, excitement around these model 32:02optimizations, you know, reducing the complexity of the models, quantizing 32:06the models, pruning the models, applying things like knowledge distillation. 32:09So there is a suite of techniques right now and tons of great libraries out there. 32:14Hugging Face is doing a, I think they're doing a fabulous job, uh, with their 32:18Optimum, uh, uh, library, where they have multiple hardware vendors, and 32:24they're basically focusing on these, uh, hardware level optimizations, but kind 32:29of abstracting them away where you can have these LLM extensions, for example, 32:33for BetterTransformer, for things like quantization with GPTQ or for 32:39bitsandbytes, so these are libraries that make it very easy with some common 32:44APIs to benefit from quantization, pruning, uh, BetterTransformer or 32:50optimizations in transformers like this: 32:52paged attention. 32:53Uh, you know, some of these quantizations like QLoRA, et cetera. 32:57So those are very important to kind of democratize and make it easy for 33:02people to consume without really having to understand the depth and the 33:07details of different hardware vendors. 33:09So that's a good example of kind of democratizing this, all of 33:14these optimizations and making them accessible to the developers. 33:18Um, of course, that requires a lot of work from different vendors to 33:22be able to implement all of these optimizations and then provide 33:26these common APIs for the end users. 33:29But I see there's a lot of energy there and a lot of work and documentation 33:33and, uh, which make it very easy, you know, for the end users and the 33:37developers to use these techniques. 33:40Yeah, and just to sort of add into that, I mean, there's obvious ones as 33:43well, like kind of caching, for example. 33:46One of the ones that I find quite interesting at the 33:49moment is probably batching. 33:50So there was a project somebody did with MLX earlier this week, where, you 33:56know, if you use MLX out of the box, on my machine certainly, with something 34:00like the Gemma 2B model, 34:03you're going to get something like 100 tokens per second. 34:05That's unquantized. 34:07But just by batching the inference, they got it up to 1300 34:10tokens per second on their machine. 34:12And I've seen similar things with fine tuning. 34:16So one of the things that I've done in my data pipeline is the MLX 34:20pipeline wasn't fast enough for me. 34:22So I rewrote all the Apple data pipeline so that it was just kind of 34:25loading in, and the biggest thing that I found is actually I pre process 34:31all the tokenization for fine tuning.
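To make Kaoutar's point about common quantization APIs concrete, here is a minimal, hedged example of loading a model with 4-bit weights through the Hugging Face transformers and bitsandbytes integration; the model id is just a placeholder, and Optimum, GPTQ, and similar tools expose comparable entry points.

```python
# A hedged sketch of 4-bit weight loading via transformers + bitsandbytes.
# The model id is a placeholder; any causal LM on the Hub works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # illustrative choice

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16 for quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers on available hardware
)

prompt = "Explain speculative decoding in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```

The point is less the specific flags than that a couple of configuration lines buy weight compression without touching the model code, which is the "common API" democratization being described.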
34:33So rather than letting it load, um, you know, as it's trying to fine 34:36tune, I just pre tokenize everything. 34:39I set up all the input and target tensors, get it all done in advance. 34:43But the last thing that I do is then I bucket 34:46all of the batches together on the same sizes, which is kind 34:50of almost like padding free. 34:51But really what I'm doing in this case is bucketing based on the same 34:54padding size to reduce padding. 34:56And that has massively increased, uh, my speed of training. 35:01So I think these techniques like caching, training, you know, uh, Kaoutar mentioned 35:07things like QLoRA, et cetera. 35:10Again, even, even, even if you think of the way LoRA works, 35:13um, you're really looking at things like, um, freezing layers. 35:17I think at some point, if I think of the Golden Gate stuff that Anthropic were 35:22doing, I, I think you're going to get to the point, rather than doing a hard 35:25freeze on layers, I think you're going to get, start to freeze on features, 35:29because if you can understand Feature A and Feature B are being influenced, 35:33uh, or are on a certain topic, why not freeze those features right in the 35:38future and then train areas only in the feature area that you want to train. 35:43So I can see that happening in the future as well. 35:46So I think there's going to be a lot of optimizations 35:48coming in on those areas. 35:50I know we're just about out of time. 35:52So Volkmar, maybe I'll, as the first time guest, um, on the show, maybe I'll, I'll 35:57give it over to you to maybe close us out and talk, you know, of all the sort of 36:00like techniques that are in the market right now that are like really focused 36:04on like optimization, which ones, which ones are you seeing, like... yeah. 36:09Which ones are you most excited about and that you see the market 36:11kind of most excited about? 36:13So I think, similar to what Chris and Kaoutar already said, you know, I think 36:18we get a bag of these optimizations. 36:21What's interesting, uh, what I find interesting is the, uh, uh, like 36:26anything which is giving you like a 2, 3x speed up in whatever dimension. 36:31Like we are not yet, I mean, when you compound them you may get to a 10x. 36:35One of the things that really excites me, um, is speculative decoding. 36:40So we are running a sim, like a simpler model, um, on top of your complex model. 36:45And the simpler model is kind of trying to predict a bunch of tokens, 36:48um, 36:49ahead, and then you just verify it with the big model because all the 36:52quantization methods, right, there's a quality impact and the quality impact, 36:58you know, it's really hard to quantify. 37:00And so we have these models which we're testing and then we go through 37:03quantization and they kind of look good, but you don't know what you don't know. 37:07And with speculation, you effectively verify against the full model and you 37:13get just a 3x performance improvement. 37:15And I think that's probably the one thing where, um, uh, you know, I, 37:20I think that's, that's the biggest impact because it's also like batching 37:24makes things better, et cetera. 37:25And then you have like a continuous batching. 37:28So, you know, requests come in and go out, but there's always, um, like 37:32there's a latency impact with that one. 37:35It's really like, it's just, it's better and you don't have 37:38any, any reduction of quality. 37:41I think that's a great place to end on.
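A small sketch of the pre-tokenization and length-bucketing idea Chris describes above: tokenize everything once up front, then group sequences of similar length so each batch carries almost no padding. The `tokenize` callable and the pad id of 0 are placeholders, not details from the episode.

```python
from typing import Callable, List

def bucketed_batches(texts: List[str], tokenize: Callable[[str], List[int]],
                     batch_size: int = 8) -> List[List[List[int]]]:
    # 1. Pre-tokenize once, outside the training loop.
    encoded = [tokenize(t) for t in texts]
    # 2. Sort by length so neighbours need similar amounts of padding.
    encoded.sort(key=len)
    # 3. Slice into batches; pad each batch only to its own longest sequence.
    batches = []
    for i in range(0, len(encoded), batch_size):
        batch = encoded[i:i + batch_size]
        width = max(len(seq) for seq in batch)
        batches.append([seq + [0] * (width - len(seq)) for seq in batch])  # 0 = pad id (assumption)
    return batches
```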
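And a minimal sketch of the speculative decoding loop Volkmar describes, for greedy decoding only. `draft_next` and `target_next` are hypothetical stand-ins for real model calls; in a real system the target model scores all of the drafted tokens in one batched forward pass, which is where the speed-up comes from.

```python
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       max_new: int = 32, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The small draft model guesses k tokens ahead (cheap).
        guesses, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            guesses.append(t)
            ctx.append(t)
        # 2. The big target model verifies the guesses; keep the longest agreeing prefix.
        accepted = 0
        for i, g in enumerate(guesses):
            if target_next(out + guesses[:i]) == g:
                accepted += 1
            else:
                break
        out += guesses[:accepted]
        # 3. On a mismatch, fall back to one token from the target model, so the output
        #    is exactly what the big model alone would have produced (greedy case).
        if accepted < k:
            out.append(target_next(out))
    return out
```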
37:42And it's also just like thematically, I think one of the most fun parts about the 37:46space, which is, it just feels like you can wake up one day and it's like, oh, 37:50someone strung together a couple of 2Xs 37:52and now we got a 10X improvement. 37:53Um, and something, you know, being at that stage of the journey is a lot 37:59more exciting and interesting than, some of the later ones, potentially. 38:03So, um, it's a fun, fun time right now. 38:05And, um, I think a great, a great place to end on. 38:08So thank you all for, for joining us today, for our listeners, for our guests. 38:12And, uh, we will see you back next time on Mixture of Experts. 38:17Thank you.