
DeepSeek R1: Hype, Costs, Impact

Key Points

  • The panel gave wildly different importance scores for DeepSeek R1 (5, 9, and 7.5), underscoring how contentious its impact currently is.
  • DeepSeek R1, a new open‑source model from a Chinese lab, is being hailed as competitive with leading proprietary systems from Anthropic, OpenAI, etc., and has generated massive buzz—even reaching the hosts’ families.
  • The show’s hosts set out to debunk common myths, noting that mainstream coverage often misstates key facts about the model and its significance.
  • One widely cited claim is that state‑of‑the‑art models can now be trained for about $5.5 million; Kate confirms the figure appears in the paper but warns it is heavily caveated and not a simple “price tag.”
  • Overall, the episode frames DeepSeek R1 as a hot topic that warrants careful analysis beyond hype, with open‑source ambitions and cost narratives that need nuanced interpretation.


**Source:** [https://www.youtube.com/watch?v=jC0MGFDawWg](https://www.youtube.com/watch?v=jC0MGFDawWg)
**Duration:** 00:39:10

## Sections

- [00:00:00](https://www.youtube.com/watch?v=jC0MGFDawWg&t=0s) **DeepSeek R1 Rating Debate** - Panelists humorously rate the significance of DeepSeek R1, giving scores from 5 to 9, as they launch an episode of the Mixture of Experts show focused on the AI model.
- [00:03:07](https://www.youtube.com/watch?v=jC0MGFDawWg&t=187s) **Training Cost vs Inference Efficiency** - The speaker uses a marathon analogy to show that the headline cost figure covers only a single, final training run, not the extensive experimentation leading up to it, warning against misinterpreting it as a low barrier to building new models.
- [00:06:12](https://www.youtube.com/watch?v=jC0MGFDawWg&t=372s) **Chain-of-Thought RL on a Laptop** - The speaker explains that a brief fine-tuning phase with high-quality chain-of-thought data followed by reinforcement learning can dramatically boost model performance using only modest hardware, challenging the need for massive pre-training and fueling discussion of reduced compute demand and its impact on NVIDIA.
- [00:09:21](https://www.youtube.com/watch?v=jC0MGFDawWg&t=561s) **Rethinking Model Scaling & Pre-training Costs** - The speaker argues that although efficiency gains lessen GPU demands, the ever-growing size of foundation models like DeepSeek-V3 still entails massive pre-training expenses, emphasizing the need to separate the modest cost of a single training run from the much larger overall investment.
- [00:12:23](https://www.youtube.com/watch?v=jC0MGFDawWg&t=743s) **Hype vs Experimentation in Model Development** - The speaker argues that AI hype culture prizes flashy results over the extensive experimental work required to train models, before the panel transitions to topics like distillation and the resurgence of reinforcement learning.
- [00:15:29](https://www.youtube.com/watch?v=jC0MGFDawWg&t=929s) **Hybrid Distillation and RL for Small Models** - The speaker discusses how pure reinforcement-learning fine-tuning of a 32-billion-parameter model fails to yield strong reasoning, requiring a hybrid pipeline of synthetic-data distillation, chain-of-thought fine-tuning, and RL to improve small-model performance.
- [00:18:33](https://www.youtube.com/watch?v=jC0MGFDawWg&t=1113s) **Math Compiler Enhances Chain-of-Thought** - The speaker explains how converting automatically generated equations into an abstract syntax tree, walking the tree to produce step-by-step explanations, and fine-tuning a language model with these outputs yields exceptionally precise mathematical reasoning.
- [00:21:35](https://www.youtube.com/watch?v=jC0MGFDawWg&t=1295s) **Model Distillation and Architecture Translation** - The speaker explains how knowledge from a large teacher model, potentially a Mixture-of-Experts, can be transferred to a smaller student model through distillation, highlighting its role in model compression and the translation from sparse to dense neural architectures.
- [00:24:40](https://www.youtube.com/watch?v=jC0MGFDawWg&t=1480s) **Distillation: The Open-Source Secret** - The speakers explain how model distillation, a long-standing technique now openly permitted by DeepSeek's MIT-licensed model, makes it effectively impossible to stop others from replicating large AI models despite the high costs of building them.
- [00:27:46](https://www.youtube.com/watch?v=jC0MGFDawWg&t=1666s) **DeepSeek's Efficient AI and Model Distillation** - The speaker observes that a reasoning-boosted model may lose general performance, but praises DeepSeek's push for cost-effective, task-specific distilled models built on open-source teacher models, while noting that preventing model copying is as hard a problem as biometric security.
- [00:31:12](https://www.youtube.com/watch?v=jC0MGFDawWg&t=1872s) **OpenAI's Steady Response to DeepSeek** - The speaker reviews DeepSeek's launch, cites Sam Altman's tweet affirming OpenAI's unchanged research roadmap, and debates whether the news will alter OpenAI's strategic approach.
- [00:34:38](https://www.youtube.com/watch?v=jC0MGFDawWg&t=2078s) **Choosing and Trusting LLMs Amid Competing Strategies** - The speakers compare OpenAI's market dominance with DeepSeek's hardware-constrained, efficiency-driven approach, debating how these differing strategies affect model selection, trust, and economics for enterprises.
- [00:37:39](https://www.youtube.com/watch?v=jC0MGFDawWg&t=2259s) **Constraints Drive Open-Source Innovation** - The speakers argue that limited hardware resources force researchers to devise creative techniques, and that sharing these solutions in the open source accelerates advancement across the entire AI community.

## Full Transcript
0:00On a scale from 0 to 10, how big of a deal is DeepSeek R1? 0:04Kate Soule is Director of Technical Product Management for Granite. 0:07Kate, welcome to the show. 0:08What do you think? 0:08I'm going to take maybe a little bit of a controversial position. 0:11I'm going to say 5. 0:12Chris Hay, Distinguished Engineer and CTO of Customer Transformation. 0:15Chris, welcome back as always. 0:160 to 10. 0:17What do you think? 0:189.11 or 9.9, 0:20but I'm not sure which is the bigger number. 0:23Wow, that is a niche reference. 0:24And finally, Aaron Baughman is IBM Fellow and Master Inventor, zero to 10. 0:28Aaron, what do you think? 0:30Yeah, that's a great question, and I think we're gonna be right 0:32in between the other two at a 7.5. 0:35All that and more on today's Mixture of Experts. 0:42I'm Tim Hwang, and welcome to Mixture of Experts. 0:44This is episode 40 of the Mixture of Experts show. 0:49I'm really excited to meet this milestone with an all-star cast. 0:53Each week, MoE is the place to tune into to hear the news and analysis 0:56on the biggest headlines and trends in artificial intelligence. 0:59And today we're all going to talk about DeepSeek-R1. 1:02It's basically anything that anyone is talking about right now is DeepSeek-R1. 1:07It's the talk of the AI chatter class, it's rocking markets, and 1:09even my dad is texting me about it. 1:11Um, so what I want to do is start the first segment with a little 1:15bit of DeepSeek-R1 myth busting. 1:17Um, if you've been anywhere around our, uh, AI, uh, in the last 1:21week, you know the basic story. 1:23Um, there is a, a Chinese lab, uh, DeepSeek that has released a new 1:27model called R1, uh, that is both open source and competitive with the 1:31state of the art models coming out of Anthropic, OpenAI and all the names 1:35that we're really familiar with. 1:37And there has been so much hype about this story that as I said, 1:41even my dad's texting me about it.
1:43And a lot of the mainstream coverage has actually been 1:45getting a lot of the facts wrong. 1:47So kind of what I want to start with is just to knock down a bunch of myths, so 1:51we can kind of calibrate as we really kind of peel back and talk about this story. 1:55And, Kate, I want to start with you, because I know you were, uh, angry 1:59about this in the show Slack, and so I wanted to give you a chance to 2:01kind of like, you know, let loose. 2:03Um, I think the first meme that we've heard in a lot of this mainstream 2:06news coverage is we can now train state of the art models for 5.5 million dollars. 2:12And that's so crazy expensive relative to the kind of numbers 2:16that we've heard before, right? 2:17I think the Stargate price was a hundred billion dollars 2:20or something crazy like that. 2:21Um, so Kate, how true is that number? 2:24Can we, can we really train models for 5.5 million dollars now? 2:27So, first, the number is true, it's published in the, uh, presumably, it's 2:32published in the paper, DeepSeek isn't necessarily hiding anything about this 2:35number, it's heavily caveated if you, if you look at it, but the takeaway that 2:40people are driving from this number is a little bit crazy, so, yes, training one 2:47iteration of a base model, DeepSeek-V3, by the way, this all came out in December. 2:52This isn't late breaking news as of last week. 2:55Back in December, they trained this model. 2:57They said one iteration of training it would cost about 5.6 million dollars. 3:02But that's like saying if a startup could go and train a model for the same cost. 3:07That's like saying if I'm going to go run a marathon that the only 3:10distance I'll ever run is 26 miles. 3:12The reality is you're going to train for months. 3:15Practicing, training, running hundreds and thousands of miles potentially leading 3:20up to that one race that then takes 26.2 miles. 
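For context, the headline number is plain rental arithmetic. The DeepSeek-V3 technical report (figures from that paper, not from the episode) prices the final pre-training run at roughly 2.788 million H800 GPU-hours at an assumed $2 per GPU-hour, which is exactly Kate's point: it prices the one "race," not the months of runs before it. A back-of-envelope sketch:

```python
# Back-of-envelope for the headline figure: GPU-hours x rental rate.
# Both inputs are the assumptions the DeepSeek-V3 report itself uses
# for its final pre-training run; they exclude prior research,
# ablations, and failed runs.
gpu_hours = 2.788e6        # reported H800 GPU-hours for the final run
usd_per_gpu_hour = 2.0     # assumed rental price per GPU-hour
final_run_cost = gpu_hours * usd_per_gpu_hour
print(f"${final_run_cost / 1e6:.2f}M")  # prints $5.58M, the quoted ~$5.6M
```

The marathon analogy maps directly: this multiplies only the "miles run in the race," not the training mileage or the breaks.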
3:25And if you look at what that number is and take that metaphor even a step 3:28further, it's like saying, okay, what if I'm running a race, but I take a 3:33break every mile, I stop, I, you know, take a drink of water, I take a nap, I 3:37come back the next day, I keep running. 3:39And you only add up your time from the actual miles that you're running 3:42in the race, not all your breaks. 3:44That's like the equivalent of what this number represents. 3:47It's a really valuable number to understand and impressive. 3:50The parts that they're measuring, they did bring down a lot in efficiency. 3:54But that number does not represent the cost to go and now train a model. 3:57It's not like we're going to have startups now flooding, you know, the ecosystem 4:01with their own version of 600 billion parameter mixture of expert models. 4:04That's super helpful. 4:05Yeah, and I think it's a great calibration. 4:07I want to kind of pick up on that last thing that you said, which 4:09is there is a lot that's new here from an efficiency standpoint. 4:13And maybe Chris, I'll toss it to you for sort of the next meme that we're hearing 4:17kind of flow around in this space, which is DeepSeek-R1 is a huge breakthrough. 4:22Models are running way more efficiently than they used to. 4:25You know, dot, dot, dot, DeepSeek is so far ahead. 4:28I know you said that you, uh, felt like this model was a big deal, like 4:329.11, um, a bunch of a big deal. 4:35Uh, but can you tell us a little bit about, like, has DeepSeek really unlocked 4:39some novel things, and if so, how big of a deal are these novel things that they're 4:44really uncovering with this new model? 4:46I said 9.11 or 9.9, so clearly, Tim, you think 9.11 is the 4:53bigger number out of those two. 4:54Sorry, there's some uncertainty bars there. 
4:57I actually think it is a big deal, so I think there is a few things, which 5:02is, we're sort of joining everything together there, so we're actually 5:07saying, okay, here's the base model, and then there's the RL training 5:11for the R1 part of that, right? 5:13And actually, if we separate out the kind of the DeepSeek-V3 version from 5:18the RL training there for a second, I think there is a big deal there. 5:22Because the reality is... Nevermind the 5.5 million bucks, right? 5:26You are going to be able to take an existing base model that has 5:31been pre-trained and then you are going to be able to do RL 5:34training over the top of that. 5:35You're going to be able to take your cold start fine tune data. 5:39So you can take a relatively small amount of data set and then put that on 5:42top and train it to do amazing tasks. 5:45And I know that myself, right? 5:46Because I took like a tiny model myself, one and a half billion 5:50parameters, absolutely tiny Qwen model. 5:53And then I put maybe a thousand lines of SFT data, right? 5:58And I got that thing to be able to tune math, right? 6:01Basic arithmetic at the kind of same level as GPT 4o, right? 6:06Just myself and I, I'm telling you right now, I love IBM. 6:10They do not pay me five and a half million dollars. 6:12That was on my laptop. 6:13So, so this is a big deal and it's a big deal because the thing that they're 6:18showing there is long chain of thought has a huge impact and accurate data because 6:25actually even to the point of RL, they they started with pure RL training, right? 6:31So they actually just said, here's your rewards for not doing anything else. 6:35And, and we're going to train the model that way. 6:37And then they went to say, actually, if we, if we do one round of fine tuning 6:41with a really good chain of thought, a set of data, maybe a few thousand rows worth 6:46of data, and then do RL training after that, then we get much better results. 
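The "rewards" Chris mentions are typically produced by a rules-based verifier rather than a learned reward model: evaluate the problem, compare against the model's final answer, and hand out the "cookie." A minimal sketch for the arithmetic case (hypothetical function names, not DeepSeek's actual code):

```python
import re

def verify(problem: str, model_answer: str) -> float:
    """Rules-based verifier: parse a toy arithmetic problem, compute
    the ground-truth sum, and return a binary reward for the model's
    final answer (the 'cookie' in Chris's description)."""
    m = re.match(r"what's (\d+) plus (\d+)\?", problem.lower())
    if m is None:
        return 0.0  # problem not in a form the rules can check
    expected = int(m.group(1)) + int(m.group(2))
    try:
        return 1.0 if int(model_answer.strip()) == expected else 0.0
    except ValueError:
        return 0.0  # answer was not a bare integer
```

With pure RL the model must stumble onto reward-earning answers by sampling, which is the "monkeys and typewriters" problem Chris raises later; the cold-start fine-tune shortens that search.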
6:51So actually what they're showing is that we can maybe stop obsessing about 6:56pre-training so much and we can get into this kind of post training world 7:01and Inference time compute world and for that you don't even need five and 7:05a half million dollars. Just your laptop and a little bit of tenacity and a 7:10little bit of GPU is gonna do that job. 7:12That's awesome yeah, we're gonna talk a little bit more about RL and chain 7:14of thought in just a second, but Aaron I think before we move to that 7:17one other question I want to ask you the kind of other third big meme is 7:21that everybody suddenly discovered Jevons paradox, uh, this week. 7:26Um, and I think one of the narratives that popped up is NVIDIA is doomed. 7:30You need a lot less compute, uh, compute for these models now. 7:33You know, NVIDIA's stock price took a tumble. 7:36I bought the dip for what it's worth. 7:38Um, and I want to say to, uh, I guess for Aaron, if you wanted to kind of 7:41respond to this, this question, or whether or not you think it's a myth 7:43at all, is, um, are we going to need a lot less compute in the future? 7:48Is NVIDIA doomed? 7:49Like how should we read this? 7:50And if you want to explain Jevons paradox, you can go ahead too. 7:53Yeah, I mean, I mean, so, so I think, you know, fundamentally, you know, 7:56that that's an interesting notion. 7:57Um, you know, but I tend to follow, um, the dynamics of AI, you know, which 8:01comes in to me three different areas. 8:03One is the scaling law, right? 8:05Is that, you know, it tends to say that, um, as you scale up the train 8:09of AI systems, you get better results, which means bigger models are better. 8:12Generally, right? 8:13And then the shifting curve where new ideas are making 8:17training more efficient, right? 8:18And so this, you know, affects that scaling law. 8:21So the more new ideas you get, you know, the smaller models 8:25become more powerful, right? 
8:27But then there's a third, which is a shifting paradigm of these big 8:30revolutionary ideas and can an order of magnitude change the scale of which 8:35you actually need to train these models in order to get performance, right? 8:39And so I think by having those points one, two, and three laid out, uh, which 8:43is, you know, backed by a lot of research over time, you know, you know, I think 8:47that, yes, there, there's always going to be a demand for GPUs, but I do think 8:51that, um, there's going to be different chip architectures that are coming out, 8:55but also, um, if you look at some of the efficiency gains that- that V2 had, um, 9:00such as a multi-head attention, were they able to cache a lot- a lot of the weights? 9:03The, the token throughput is incredible, uh, that they were able to achieve, 9:07but I think that was one of the bigger innovations that they had. 9:10Uh, and then the second one was there, what they call the DeepSeek, um, MoE, uh, 9:15where they're able to sort of partition out and share uh, knowledge amongst these 9:19different agents that they can have. 9:21And that also helps, but those two things where some of those, um, uh, 9:25pieces that gave us the shifting curve on that scale law, which said, 9:28okay, I don't need as many GPUs now. 9:30But if you look at the foundational model, um, so if you go to, 9:34let's say, DeepSeek-V2, right? 9:37It's big. 9:37It's a very big model and v3 is even bigger with a- what is it? 9:40671 billion parameters, right? 9:43That's a very big model. 9:44Yeah, it's, it's chunky. 9:45Yeah. 9:46So, I mean, I mean, that that's, it's very fun, right? 9:49To watch that curve. 9:50And I, and I think that we'll see agglomeration of models together, we 9:54can do reverse distillation, right? 9:57To create and combine smaller models together. 10:00Uh, you can do model distillation to create smaller models, you know, um, 10:04but, but it's, it's going to be fun. 
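A rough way to quantify Aaron's scaling-law point is the standard rule of thumb that training compute is about C = 6ND FLOPs for N parameters and D tokens. The numbers below are purely illustrative assumptions; note that for a Mixture-of-Experts model like DeepSeek-V3, the roughly 37B activated parameters, not the full 671B, drive per-token compute, which is part of why a "chunky" model can still train relatively cheaply:

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: C ~= 6 * N * D FLOPs, where N is
    the (activated) parameter count and D the number of training tokens."""
    return 6.0 * active_params * tokens

# Illustrative comparison on the same 2T-token corpus (assumed sizes):
dense_7b = training_flops(7e9, 2e12)    # ~8.4e22 FLOPs
moe_37b_active = training_flops(37e9, 2e12)  # ~4.44e23 FLOPs
```

The "shifting curve" Aaron describes corresponds to new techniques lowering the effective constant in this estimate, or reaching the same quality with smaller N.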
10:06I want to maybe pick out a point that Chris actually mentioned that 10:10I think is really important, which is the like, can we just stop 10:14worrying about pre-training now? 10:15Because I think everyone is talking about this 5.5, 5.6 million dollar number. 10:21And they're tagging it to all of these amazing performance improvements 10:25that we're seeing in the R1 model and that, and the distilled models, and 10:29they're kind of equating the two and saying, all right, now we can just 10:32go and we are getting like this crazy performance at a pretty minimal cost. 10:36And I think it's really important to disambiguate these two things, right? 10:46A step in this process costs about 5.6 million, the true cost of building this pre-training 10:51model is likely orders of magnitude higher, but Regardless, it almost 10:56doesn't matter, like this 5.6 million number doesn't even matter because you 11:00can take this big model that's now open source and distill it basically for free 11:05on top of other open source, smaller models that are out there to get crazy 11:10performance improvements and build. 11:12So it's not that startups are going to go and build and pre-train their own 11:15600 billion parameter model because the cost is only 5.6 million dollars. 11:20That's the wrong takeaway. 11:21It's that we now have the ability to distill and thanks to more and more 11:25competitive models being put into the open source that distillation is 11:28becoming even more powerful and use reinforcement learning as a technique 11:32that DeepSeek used really effectively to go and build our own smaller versions of 11:37these models that are really powerful. 11:38And that is where there's actually very low barrier to entry now, as Chris is 11:42saying, you know, doing it on your laptop. 11:44Yeah, yeah, yeah. 11:44I find that, you know, very, very nice because it's like a house of cards, right? 11:48I mean, they're only quoting the top card. 
11:50They're not quoting any of the cards at the bottom. 11:52And if you move one of those cards at the bottom, the whole house collapse, right? 11:54And so that 5.5 million is only the cost associated with maybe you know, 12:00you know, one, one epoch right of this type of training and that's it. 12:04But if you look at the hardware, even that they use, what are the H800's? 12:08Just procuring those alone or using those as a service is expensive, right? 12:13Um, and so, so they're excluding lots of costs associated with 12:15prior research ablation studies and lots of different things, right? 12:19Which um that that number is very, very much misleading. 12:22Uh, Chris, I see you nodding. 12:23Do you want to jump in? 12:24I don't know if you have a comment 12:26Uh, no, I was just nodding in agreement because I'm a very 12:30kind and collaborative person. 12:31Uh, no, I, I, I, no, I, I, I absolutely agree. 12:38I, I think that, um, you're gonna go for the big hit numbers, right? 12:42You're gonna say, we did this super cheap. 12:44And you are really going to miss out all the steps that took you to 12:49get there in the first place, right? 12:50And, um, and as Kate probably knows better than anyone, right, that the 12:55amount of experimentation that it takes for these models, right, to 12:58get to the final version is a lot. 13:01So the, the actual final epoch, as Aaron was saying, that final training 13:05run, that's just, that's just the kind of the end of the road there. 13:08So, um, but. 13:10You know what? 13:11No one wants to hear about the big journey going up there. 13:13They want to hear the big number. 13:14We're in a hype industry, baby. 13:16So we'll, yeah, five and a half million. 13:18Here we go, right? 
13:19Kate, I guess maybe one last myth I've seen kind of popping up that might 13:22be good to address before we, we do a segment on distillation, because it's 13:25already come up a couple times, and I think it is worthwhile to explain 13:28why, what it is and why it matters. 13:30But maybe one last thing to cover before we get to that, Kate, is, um, on the 13:34point about RL, um, it feels like, the DeepSeek narrative has also been a little 13:39bit about like the revenge of RL, like reinforcement learning's back, baby. 13:43Um, and I know some people have gone so far to be like, everything is RL now. 13:46You know, fine tuning is dead. 13:48Um, do you want to talk a little bit about that? 13:50Like even, you know, with everything that we've said, like how much does R1 indicate 13:54to us that really RL will be kind of like the more dominant method for these types 13:58of fine tuning efforts going forward? 14:00Yeah, and I'm really curious to get Chris's take on this because I know he's 14:03just run these experiments right locally on his own laptop, but so DeepSeek in 14:08their paper, they trained two models, uh, in addition to all the smaller distilled 14:12models that they, that they worked on. 14:15One model was trained with just reinforcement learning only. 14:19So there's no additional data that's added. 14:21You've got your pre-trained model, which costs, you know, 14:245.6 million plus, you know, all the arguable buffer on top. 14:28And they just use reinforcement learning using some rules based systems more or 14:32less to be able to verify the results, um, and, and score the responses. 14:37And then, and so they called that R1-Zero, I believe. 14:40Then there's R1, which they also created because in their paper they 14:44mentioned that there were some, you know, rough edges, so to speak, on the, 14:48the reinforcement learning-only model. 
14:50And in that model, they start the model first with some fine tuning, 14:55basically using some structured data in order to better prime it for 15:00this reinforcement learning task. 15:01And that is the model that everyone's now playing with on the DeepSeek app and 15:05that everyone's really excited about. 15:07So I think it's a really interesting look at, you know, the takeaway 15:11shouldn't be that, oh, we, you know, can't do RL only we had to, you know, 15:16resort to this cold start and fine tuning before the model was released. 15:19The takeaway that I think people should have is it's amazing how 15:22far they were able to push just RL. 15:25And yeah, there's still always going to be a need for some 15:27structured data potentially. 15:29And there's, you know, maybe a hybrid approach is best, but it is kind of 15:33crazy how far they were able to push it. 15:35Now, what they also published in their paper, getting to the distilled models, 15:39and you asked about distillation, distillation has been around forever. 15:43It's where, you know, back to, you know, early days of the first Llama model, 15:47you know, a group of students distilled that into Vicuna, um, and it's basically 15:52where you generate a bunch of synthetic structure data from a big model and 15:55use that to fine tune a small model. 15:58So DeepSeek used that same kind of thought process doing just RL only on a small 16:03model, so no big models involved, just RL and try to see how far could they get, you 16:08know, they published numbers on Qwen32B. 16:11So how far can we push Qwen32B's reasoning just on RL? 16:15And they weren't able to, in the paper, they claimed they weren't able to 16:18push it nearly as far, get any real reasoning capabilities out of the model, 16:23they had to resort to distillation, take their big R1 model, generate a 16:27bunch of synthetic data, and tune it. 
16:29So, you know, I'm curious from your perspective, like, what your take 16:32is on that, Chris, based off of some of the RL experiments you've 16:35been doing with small models. 16:36You said you also, I think, did some fine tuning first to start it 16:40off and then with chain of thought reasoning and then RL on top. 16:43Now, for me, the critical thing is the long chain of thought reasoning. 16:47That is actually an accurate long chain of thought reasoning. 16:50That is the thing that really enabled everything. 16:53So, again, if you look at the paper, when they did RL, um, they said they got there. 16:58But, you know, if you think about, especially math problems, LLMs 17:03are not really good at that. 17:04So you're going to say, what's 25 plus 8? 17:06What's this? 17:07Whatever. 17:07And you're going to ask an LLM, you know, to go and generate me this sum. 17:11And it may or may not get it right. 17:13It may or may not get the sums and the length of chain of thoughts that you want. 17:17It may not get its explanations right. 17:19So it's, it's really a, a bit of a crap shoot and getting 17:23an accurate chain of thought. 17:24And then at the end of it, they're using this thing called a verifier. 17:27And what the verifier does is, is, you know, take the answer that you've got 17:32and go you know, run a bit of rules to run the equation and say, yeah, 17:37that was correct or that was wrong. 17:38And then you get a, you get a bit of a reward, you know, 17:41it's like, here's a cookie. 17:42Well done model. 17:43Good job. 17:44But, but if you think about how long that's going to take it, 17:48it, you really are monkeys and typewriters at that point, right? 17:51It's going to take time for the models to, to come back with the right answers. 17:55Now, if, if you run a fine tuning step before that. 
17:59So if you can produce long, accurate chain of thoughts for those math equations, 18:05for example, and I'm picking math because that was in the paper, then the model 18:09is gonna look at that and say, okay, I'm doing this equation here, and you 18:13explain every single step, step one, step two, step three, and then I'm 18:16reflecting back, this was right, this was wrong, and then finally I'm gonna 18:20check the answers back, then you're just going to need less steps for the model 18:24to be able to learn what it has to do. 18:26And then you can use RL afterwards to go, oh, this particular 18:31song- sum you went wrong. 18:33So I'm going to give you a cookie here. 18:35Um, because you've now, you know, we've now given you a different way, you 18:39know, here's the right way of doing it. 18:40And you don't get a cookie if you got it wrong. 18:42So I think, I think that combination between the two is the key thing. 18:46But I actually think the real, uh, take away from that paper 18:51is the long chain of thought. 18:52So when I did my experiment, uh, on my YouTube channel, the thing that 18:56I did is I took a slightly different approach from what DeepSeek did. 19:00And, and I have a thing called a math compiler. 19:02So what I did is I, I automatically generate the, uh, the math equations, 19:08and then I put it into my compiler, and I generate an abstract syntax tree, and 19:13then I walk the tree, and then it, I don't need the LLM to do the math, I'm 19:17just going step zero, step one, step two, and I'm just walking the tree, and I'm 19:20outputting the explanations, and then what I do is I use the LLM to transform that 19:25into something that the model actually unders- you know, is actually human 19:29language, and the explanations behind it, and then that's how I got these 19:32really accurate, uh, chains of thought. 
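Chris's "math compiler" idea, parse a generated equation into an abstract syntax tree, walk the tree, and emit a step-by-step explanation without trusting the LLM to do arithmetic, can be sketched with Python's `ast` module. This is a toy reconstruction of the approach he describes, not his actual code:

```python
import ast
import operator

# Supported operators: evaluation function plus a human-readable verb.
OPS = {
    ast.Add: (operator.add, "add"),
    ast.Sub: (operator.sub, "subtract"),
    ast.Mult: (operator.mul, "multiply"),
}

def explain(expr: str):
    """Parse an arithmetic expression, walk its AST bottom-up, and emit
    numbered chain-of-thought steps alongside the exact final value."""
    steps = []

    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            left, right = walk(node.left), walk(node.right)
            fn, verb = OPS[type(node.op)]
            value = fn(left, right)
            steps.append(f"Step {len(steps)}: {verb} {left} and {right} to get {value}")
            return value
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    result = walk(ast.parse(expr, mode="eval").body)
    return steps, result
```

Because the compiler, not the model, computes every intermediate value, the resulting traces are exactly the "accurate long chain of thought" data that made the fine-tuning step work; an LLM is only needed afterwards to rephrase the steps into natural language.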
19:35And then when I put that in just as an, uh, you know, fine tuning step, I think 19:39I used maybe a hundred different examples and and honestly the math and I did it 19:44on a one and a half billion parameter model the math was incredible, right? 19:48It was like a couple of decimal precisions of uh, you know accuracy out which which 19:57the larger models of six months ago would be nowhere near so so I think the this 20:02The real innovation is the long chain of thought and the accurate chain of thought. 20:06It's not to say RL won't get you there, it will get you there, but 20:09it's just going to take a long time. 20:11So if you can, if you can short set that a little bit, and then have RL 20:15sort of, you know, do the kind of, uh, the smooth and out of the edges, 20:19then you're really going to win. 20:20That's kind of my view on this. 20:21RL is really valuable for tasks like math and things where it's 20:25easy to check the accuracy, right? 20:28Um, as well as, relatively easy to generate that chain of thought, but when 20:34we look like in the paper, for example, they talked about still needing some 20:38instruction tuning for tasks like tool calling, you know, instruction following, 20:43like there's still going to be a need for having like these reasoning models 20:47aren't designed to do every single task. 20:48They're specific for reasoning, and you're still going to potentially 20:52need instruction tuning in order to handle some of those more specific 20:55instruction following tasks. 21:02So I'm going to move us on to our next segment. 21:03Uh, so this is super helpful, I think, in terms of setting the scene, knocking down 21:06some of the myths that have popped up. 21:07We've already talked a bunch about distillation, um, and I think on 21:11the last episode, Skyler actually gave like a short, brief explanation 21:14of it, but for those who weren't listening on the previous episode, 21:17maybe Aaron, I'll toss it to you. 
21:19I think it's worth it for our listeners to, just get a sense of like, what 21:23is distillation in the first place? 21:25Um, and then I think if you want to give that explanation, there's some interesting 21:28things I think that are worth getting into about like, well, what does this 21:30mean for where the industry is going? 21:31But maybe I'll toss it to you to give the quick capsule explainer first. 21:35Yeah. 21:35So, I mean, I mean, model distillation is very powerful technique. 21:38You know, it's about having a teacher model, you know, that could be a bigger 21:42model where it's encoded, you know, much more information, uh, through 21:46weights and through embeddings. 21:47And what you want to do is transfer that knowledge to a student model. 21:51And then usually that student model could be smaller and it requires, then in turn, 21:54less resources to train, right, and to also use for inference, right, and some 21:59people and groups think of this as model compression, right, where you're making 22:02a model smaller, and so on, um, and then, and then there's, there's different things 22:07that you can distill, right, you can distill like response based knowledge, 22:11You can distill feature-based knowledge or even like the relations between all 22:16the different connections within all of the neurons that you have, right?
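The response-based variant Aaron mentions classically minimizes the KL divergence between temperature-softened teacher and student output distributions. The sketch below shows that generic objective only; it is not DeepSeek's recipe, since (as discussed elsewhere in the episode) R1's distilled models were instead fine-tuned on teacher-generated synthetic data:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution; a higher
    temperature softens the distribution, exposing 'dark knowledge'."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student
    distributions: the classic response-based distillation objective."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student exactly matches the teacher's distribution and grows as the two diverge, so gradient descent on it pulls the student toward the teacher's full output behavior, not just its top answer.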
22:53And so, and so just changing that model architecture, um, looked to be, 22:57um, a different way of doing this type of model distillation that, that I 23:02thought also, you know, gave R1, I think, some advantages, especially 23:06whenever you were looking at using like Qwen2.5 and like the Llama 3 23:11series, right, as the base foundational model to pull information out. 23:17Yeah, and I think one of the most interesting elements of distillation is 23:19sort of the idea that, um, you know, you can take any large model 23:24and bring that sort of knowledge into whatever it is that you're building. 23:28Um, you know, I think really just literally, I think in the last 24, 48 23:31hours, there was a little bit of kind of a controversy over did DeepSeek use 23:35effectively kind of OpenAI's chains of thought or other inputs and outputs 23:38to kind of do the distillation here. 23:41Um, I guess, Kate, kind of the question is, like, this makes it very hard for 23:45any model company to kind of protect its models in some ways, right? 23:48Because everything is distillable. 23:50Is that the right way of thinking about it? 23:52Yeah, I mean, I think by releasing a very capable, the most capable model 23:57to date in the open source with a permissive MIT license, DeepSeek is 24:03essentially eroding kind of that competitive moat that all the big 24:08model providers have had to date, keeping their biggest models behind closed doors. 24:13And regardless of whether or not DeepSeek also benefited from distillation from 24:17those bigger models, we're now able to go and take that really big model in the 24:22open and use it indiscriminately, where before people, I mean, this distillation 24:27from GPT has been going on for ages. 24:29Anyone can go to Hugging Face and find tons of data sets that were generated 24:32from GPT models, uh, that are, you know, formatted and designed for training and 24:38likely taken without the rights to do so.
24:41Um, so this is like a secret that's not secret, that's been going on forever. 24:46So yes, it most likely worked its way in some degree or fashion into the 24:50DeepSeek model, but it almost doesn't matter anymore, because DeepSeek now is 24:53out there and that model can be used to run very similar style distillation 24:57with great effect on as many small models as you like. And anyone now has 25:01the rights if they, uh, use DeepSeek's model to do so, according to the 25:05license that it's published under. 25:07That's right. 25:07Yeah, I think one of the funniest parts about the kind of news cycle has been 25:10like, they used a secret, you know, sinister technique called distillation. 25:14Yeah, it's like, actually everybody's been, everybody's 25:16been distilling all the time. 25:17It's just like happening-- 25:18It's been around forever. 25:19And it costs 5.5 million dollars. 25:21That's right. 25:22Yeah, exactly. 25:22What strikes me, I mean, Chris, even to the example that you gave earlier, 25:26right, like, it turns out you don't need a whole lot of data to 25:30make these models much, much better. 25:32And it kind of seems like there's this sort of fundamental thing in the 25:35market where it's like, unless you want to control, and really down to the 25:38nth level, prevent people from getting outputs from a model, there's basically 25:43no way to stop distillation, right? 25:45I don't know if you think there's a realistic way to prevent that at all. 25:48No, I don't think so. 25:49I mean, the reality is, as Kate said, there's open weight models out there, 25:54um, and people are gonna do that. 25:55And I, I think that, and, and I love this by the way, and the reason I love 26:01it is that I'm, I'm all for chaos. 26:04I'm all for open source. 26:06I'm all for sharing and collaborations. 26:08So, you know what, people are going to go off now, they're going 26:11to create their own data sets.
26:12They're going to distill from different models. 26:14They're going to share that out in the community. 26:16And you know what, we're going to all end up with better stuff. 26:19Right? 26:19So I'm, I'm not a big fan of the closed models, personally, in my opinion. 26:24Um, I'm a big fan of sharing and learning from each other. 26:27So that's what gets me excited about kind of the DeepSeek stuff. 26:31And again, it's not just the fact that, um, they put the model out 26:35there that you can distill from; it's that they talked about the techniques that they use. 26:39So, so yeah, it's, it's cool. 26:41We can all start doing interesting things. 26:43And, and you know what? 26:45I don't think everybody's going to, I don't think we're suddenly all 26:47going to be going out competing with OpenAI, Anthropic, blah, blah, blah. 26:51I don't think, you know, all of these people sitting in their 26:53bedroom are going to do that. 26:54But you know what they might be able to do is take one of these out of the box 26:58pre-trained models and then solve one of their own particular tasks that the 27:02general model can't do, that's specific to their use case, and make it easier. 27:07Um, but again, don't, don't undersell this. 27:10I mean, Kate, you know this better than anyone, right? 27:12Fine tuning models is really hard, right? 27:15Because of all of the biases, you might, you might think, hey, my model is now 27:19great at doing this one particular task. 27:21But then you've just ruined that model for doing any other tasks, 27:25because you didn't have the right biases and mixes within that data set. 27:29Yeah. 27:29I mean, just take a look at the Hugging Face Open LLM Leaderboard. 27:33All those distilled versions of Llama and Qwen are on there, and they all rank 27:36significantly lower than the original model that they were distilled from, 27:41uh, on those Open LLM Leaderboard tasks, which are not predominantly 27:45reasoning-based tasks.
27:46So the model was boosted in reasoning, but other general 27:50performance characteristics drop. 27:52But I, I think it's still incredibly powerful. 27:55And as we talk about, you know, DeepSeek's introducing this new 27:59era of efficient open source AI. 28:02It's true. 28:03It's just not true for the reason people think, that they trained this really cost-effective 28:06model during pre-training. 28:07It's true because we now have the methods to create these armies of distilled, 28:13fit-for-purpose models that are specific for the tasks that you care about, because we 28:16have better tooling, like powerful teacher models, out in the open source ecosystem. 28:21Yeah, yeah, yeah. 28:22I think that there's a lot of secret agents, you know, that 28:24are hidden amongst our labs. 28:26And, you know, in the next couple of days or weeks, you'll see them 28:29become super agents that are going to be released so we can all use them. 28:32So, um, I, I really think this might have been one of the impetuses, you know, 28:37to sort of grandstand, you know, what's happening within the field of AI. 28:41Um, DeepSeek just happened to be right time, right place, you know, to do it, to 28:45put all, to connect all the dots together. 28:47Um, but, yeah. 28:48Um, I do think that lots of these technologies and new innovations that 28:53are coming out, inventions, um, you know, you ask the question, can you 28:56prevent someone from distilling a model? 28:59Um, that sort of brings me back to biometrics. 29:01You know, it used to be, can you prevent someone from 29:03stealing a picture of your face? 29:05You know, and we, and we came up with this cancelable biometric 29:08invention so that if someone took your picture, you could revoke your 29:11biometric and create a new one, right? 29:14So I mean, I mean, there, I think there might be some cancelable 29:17technologies, um, and patents, right? 29:19That we could work on together, right? 29:20To achieve some of this.
29:22Final question here, I think for Kate, um, particularly given your work on 29:25Granite. You know, I think there's maybe one point of view, which is, well, 29:28you know, the only reason, you know, investors have put money in towards 29:32building these giant, giant models is kind of the idea that if you build 29:37these giant models, you'll be able to capture all the value from that model. 29:40And it sort of seems to me that, like, if distillation gets good, and you know, 29:43granted, distillation is hard in some respects, but you think you've got 29:46enough eyeballs, someone will eventually figure out ways of cracking it. 29:49You know, is there an argument here that it kind of erodes the incentive 29:53for people to invest in building the big model in the first place? 29:56Uh, like, and there's kind of a really interesting question, which is, you 29:59know, it's almost kind of an accident that, like, we've ended up with these 30:01giant models, and it's partially based on the idea that, like, well, you could 30:04have some exclusive control over this, but it feels like this is rapidly kind 30:08of escaping the ability for anyone to be able to kind of exclusively control. 30:11Yeah, look, I don't think there's any incentive to really build big 30:16models to run at inference time. 30:17The incentive is to build really big models to help you build really small models. 30:22And all it takes, like, it started with Llama releasing, you 30:26know, a 400 billion plus parameter model; NVIDIA released a 400 billion 30:29plus parameter model as a teacher. 30:31And now DeepSeek is releasing their 600 billion plus parameter model. 30:35You know, size isn't everything. 30:36They also have to have high quality post-training, which is 30:38why the reinforcement learning part of DeepSeek is so important. 30:41But we're seeing more and more large models that can be used openly 30:45to train these smaller models.
30:46And I think it's just going to continue to make this more of a teacher model 30:50based commodity. Like, why pay for those big models if we've got similar capabilities 30:55out in the open that you can customize further? And I think we are going to 30:59converge on to a point where we've got powerful enough tools to craft the smaller 31:06models that we need, that are going to run, you know, 80 percent to 90 percent of our 31:09workflows for generative AI in the future. 31:12Yeah, it's kind of a funny world where you 31:14never talk to the giant model that's just inside company headquarters, and 31:17then there are just, like, lots of tiny models that are coming out around it. 31:24Well, in the last few minutes, I want to zoom out a little bit. 31:27We've been talking a lot about DeepSeek and what's going on underneath the hood. 31:30Um, and I want to just take a moment to talk a little bit about what 31:33all the other companies are doing 31:35relative to this development in the AI space. Um, Sam Altman, of course, the 31:40head of OpenAI, put out a little tweet thread kind of responding to this news. 31:45Um, and I'll just quote a little bit of it. 31:46He said: "we are excited to continue to execute on our research roadmap and 31:50believe more compute is more important now than ever to succeed at our mission." 31:54Um, which is really like a statement by a guy that says, steady as she goes, we're 31:59continuing on the research path as, you know, we had planned, um, and nothing 32:03has changed by the DeepSeek, uh, release. 32:06Um, I guess, Chris, maybe I'll kick it to you. 32:09Do you buy that? 32:09Like, is OpenAI pretty much gonna just keep doing its strategy? 32:12Or does this really kind of fundamentally change what they're gonna need to do? 32:15Nah, he's gonna release his model sooner. 32:17He's been holding on to these models for too long, and he needs to get on with it. 32:20And good on you, DeepSeek, right? 32:22Where's my o3?
32:23You showed me it at the end of Christmas. 32:24Do I have it in my hands? 32:26No. So thank you, DeepSeek, maybe we'll get his model out a bit quicker, and 32:29then we'll get o4 and o5, and then maybe we'll get some of these models 32:33in Europe, because guess what, they're releasing vision models and video 32:38models and I don't have any of them, so I'm gonna get them as well, so woohoo! 32:42Uh, so I guess ultimately what you're saying is it just 32:45accelerates his roadmap, right? 32:46To just get him off the fence. 32:49There is no way he's just gonna sit there and go, uh uh uh, I'm not 32:53giving you my model, while DeepSeek is getting all of this press. 32:56He's gonna respond and we're gonna get new models. 32:58But I think, I mean, Aaron, maybe to turn it to you, you don't think this 33:01changes, like, their approach to, I guess, ironically, being kind of 33:04a closed source model here, right? 33:06Like, this is not the kind of situation where you believe that OpenAI or 33:09Anthropic, any of the big kind of providers, would say, hey, now we 33:12need to start switching to open source is the way we play this game. 33:16Um, I don't think so. 33:17I mean, I mean, I think so. 33:19I mean, this, this could go in several directions, you know, but I 33:22think, you know, open versus closed source, you know, I think that there's 33:25advantages and disadvantages to both, but I think ultimately it helps the 33:29academic community, which then in turn fuels, you know, economies of 33:32scale for the average consumer, right? 33:34Because, uh, because if you think about it, you have two groups, right? 33:37You have the open source group and the closed group. They compete, you know, um, to 33:41make sure that one is better than the other, which then spurs innovation. 33:45Okay. Great. 33:45And then within each one of those groups, you have companies and 33:49organizations that then in turn compete.
33:51You have like this second level of competition that further accelerates, 33:56you know, this, uh, innovation. 33:57And so, so Sam Altman, you know, you know, I think he's going to release these 34:01secret agents, right, uh, sooner, right. 34:04Make them available, right. 34:06And, and lots of the techniques that, you know, DeepSeek has shown, you know, 34:09like that caching layer for the keys, values, and queries that they've, you know, come 34:14up with, some of their MoE innovations, and then some of the parallelization 34:18whenever they can share context and information amongst their grid, lots of 34:24that is going to be included, I think, in Sam Altman's models, but pushed even further, 34:28you know, with their own innovations. And it's going to splinter out a bit, but 34:32the fundamental, like, model distillation and so on and so forth, you know, I 34:35think that's going to be, uh, very key. 34:38And then it brings the value proposition down to frameworks, you 34:41know, um, how can I better train the models for my own purposes, 34:45you know, whether I'm an enterprise or a customer, and then 34:48also how can I trust it, right? 34:49Because there's going to be a zoo of models now that are out there. 34:52It's just very confusing to pick which one to use. 34:54Yeah, Kate, uh, so we've been talking about OpenAI. 34:56Obviously they take up a bunch of sort of airtime. 34:59Um, but I guess one thing to kind of, as we think about zooming out to tell this 35:03DeepSeek story, is whether or not we think the other labs are kind of similarly situated. 35:08Like, you know, everything we've been hearing is, okay, OpenAI is 35:11going to continue its strategy, it's just going to move faster. 35:14Do you think it changes the economics at all, or the kind of decision making 35:17at all, for say a Google or a Meta or, um, you know, even like an Anthropic? 35:22I don't think it changes the decision making or strategy, uh, overall.
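For reference, the "caching layer for the keys, values, and queries" mentioned above is the standard KV cache used in autoregressive decoding: keys and values for past tokens are stored so that each new token attends over the cache instead of re-encoding the whole sequence. DeepSeek's multi-head latent attention compresses that cache further; the sketch below is only the hedged baseline idea, with a single head, identity projections, and illustrative shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)
k_cache, v_cache = [], []  # grows by one entry per decoded token

def decode_step(x):
    """One decoding step: 'project' x to q/k/v, extend the cache, attend over it."""
    q, k, v = x, x, x  # identity projections to keep the sketch tiny
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)               # (seq_len, d)
    V = np.stack(v_cache)               # (seq_len, d)
    scores = K @ q / np.sqrt(d)         # new query attends over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over cached positions
    return weights @ V                  # (d,) attention output for the new token

out = None
for _ in range(4):                      # decode four tokens
    out = decode_step(rng.standard_normal(d))
print(len(k_cache), out.shape)          # cache holds one k/v pair per token
```

The cache trades memory for compute: per-token work stays proportional to the sequence length instead of re-running attention over the full prefix from scratch, which is why shrinking the cache is such a valuable optimization.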
35:28I think a lot of DeepSeek's strategy, you know, necessity 35:31is the mother of invention. 35:32They only had access to H800 chips. 35:34So they optimized the hell out of it. 35:36They invested in efficient architectures like MoEs, and DeepSeek was born, right? 35:42So I think the U.S. based labs are operating with very different constraints 35:47and incentives, 35:52and I don't think DeepSeek necessarily changed that calculus. 35:56I also think a lot of, again, what we've talked about today 35:59with DeepSeek is distillation. 36:01And for the labs pursuing AGI, distillation is not 36:05necessarily as relevant, right? 36:07They need to keep training as big a model as possible and have incentives to 36:11try and keep that behind closed doors. 36:13Whereas the business value, again, my take is the business value is all 36:16around these distilled smaller models that are actually what people are going 36:19to deploy in a commercial setting. 36:21And I don't think they're changing, at least at the highest strategy level, 36:25what they're working on, right, in terms of their investment 36:27profiles: that longer term AGI game. 36:30And for that you still need a crap ton of big GPUs. 36:34And they're not going to want to release any of that out in the open. 36:36Right? 36:36Yeah. 36:36It's not like they're going to use Stargate to do, like, 36:38small distilled models. 36:39That would be the funniest thing. 36:41It's actually an inference cluster. 36:43Surprise. 36:45Um, that's, yeah, that's, I think, really fascinating. 36:48I guess, Chris, maybe I'll turn to you for the sort of last 36:50word and last question here. 36:52You know, Kate just talked a little bit about the idea that 36:55Chinese researchers are operating under very different constraints, 36:58so they kind of develop different types of methodology, different types of 37:01models, different types of proficiencies.
37:03Um, and do you think there's something to the idea that, like, almost we're, 37:07like, we have an embarrassment of compute 37:09among the U.S. labs? 37:11And so it actually kind of, like, limits, like, the degree to which we would 37:14ever invest in the kind of thing that DeepSeek would be working on. 37:17Um, I'm really sort of interested in the idea that, like, these constraints really 37:21mean that AI will start to look pretty different in different parts of the world, 37:24as researchers operate under very different constraints of what 37:26they need to do to deploy systems. 37:28I, I think that's exactly the case, right? 37:29And you can see a little bit of reinforcement learning happening 37:32there and reward modeling, right? 37:34Which you, you were saying here. 37:36You're going to have less compute available to you. 37:38And guess what? 37:39They have different incentives at that point, and they've been 37:42rewarded by being more efficient. 37:44So if you've got an abundance of compute, you're not really going 37:47to be optimizing for efficiency. 37:48You're going to be trying to get your models out first. 37:50And I think that's also, you know, speaking from my own 37:53experience, I, you know, I don't 37:55have any H100s kicking 37:56around. 37:57What have I got? 37:57I've got my MacBook Pro, right? Where I've got -- You're more like the DeepSeek researcher, basically. 38:02Exactly. 38:03So you're trying to come up with innovative techniques to work 38:07within the hardware constraints that you run within today. 38:10So I, I, and I think honestly, if they didn't have the chip constraints in China, 38:18I'm not sure that DeepSeek would have come up with those techniques, 38:22because they, they would have been just trying to focus and catch up with 38:25everybody else, as opposed to trying to take things from a different angle.
38:28And, and therefore, again, one of the reasons I believe in open 38:31source very much, and everybody's sharing their papers, everybody's 38:34running under different constraints. 38:35And they're going to find new innovations, and if we share that, 38:38we're all going to learn from each other and be able to contribute. 38:40And that's not just the big labs, but the, the people in the community, 38:44just with their laptops, trying to discover and experiment with new things. 38:48Ah, so I love this panel. 38:49Kate, Aaron, Chris, thank you for joining us on the show as always 38:53and walking us through DeepSeek. 38:55Uh, a lot more to talk about, and we will be tracking the story. 38:58Um, and thanks to you, listeners, for joining us. 39:01Uh, if you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, 39:04and podcast platforms everywhere. 39:06And we will see you next week on a jam-packed episode, 39:08again, of Mixture of Experts.