GPT‑5.2 Rumors Spark OpenAI‑Google Rivalry
Key Points
- OpenAI is rumored to be accelerating a “code‑red” release of GPT‑5.2 to counter Google’s new Gemini model, suggesting the company may be feeling pressure to keep its lead in the AI race.
- The episode’s news roundup highlighted Jeff Bezos and Elon Musk racing to build space‑based data centers, IBM’s $11 billion acquisition of Confluent, OpenAI’s work on models that admit when they hallucinate, and a whimsical “Santa agent” for holiday interaction.
- Panelists noted a sharp shift in narrative from earlier in the year, when OpenAI was viewed as the dominant leader, to now where Google’s advancements are forcing OpenAI into a defensive stance.
- Despite the hype around GPT‑5.2, the experts expressed uncertainty about whether the upcoming model will materially improve the consumer experience over current releases.
**Source:** [https://www.youtube.com/watch?v=nvAScf8YhzE](https://www.youtube.com/watch?v=nvAScf8YhzE)
**Duration:** 00:41:32

## Sections
- [00:00:00](https://www.youtube.com/watch?v=nvAScf8YhzE&t=0s) **AI Rumors, Transparency, and Acquisitions** - The episode covers speculation about OpenAI’s upcoming GPT‑5.2 and its rivalry with Gemini, highlights a new Stanford transparency report and Amazon’s Nova models, and recaps recent AI‑related corporate moves such as space data‑center projects and IBM’s acquisition of Confluent.
- [00:03:13](https://www.youtube.com/watch?v=nvAScf8YhzE&t=193s) **Speculation Over New Model Releases** - Participants debate whether the constant stream of AI model launches—like the rumored 5.2, ChestNet, and HazelNet—actually improves consumer productivity or merely fuels competitive hype.
- [00:06:35](https://www.youtube.com/watch?v=nvAScf8YhzE&t=395s) **Benchmark Race vs Real Impact** - The speakers critique how AI benchmarks dictate corporate competition and incremental model releases, arguing that these metrics are misaligned with genuine, transformative utility.
- [00:10:04](https://www.youtube.com/watch?v=nvAScf8YhzE&t=604s) **Balancing Model Updates and Enterprise Maturity** - The speaker explains that while major step‑function improvements in AI models will eventually necessitate upgrades, frequent switches are impractical for most firms unless they are very small and agile or possess highly automated evaluation pipelines to safely test and deploy new versions.
- [00:13:21](https://www.youtube.com/watch?v=nvAScf8YhzE&t=801s) **Declining Transparency in AI Models** - The speaker outlines Stanford’s yearly AI model transparency survey—covering upstream data curation and training details through downstream safety benchmarks—and notes that the 2025 report shows most labs are sharing fewer details, whereas IBM is pursuing a markedly different, more open approach.
- [00:16:50](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1010s) **Model Lineage and Transparency Trends** - The speakers describe their investment in data‑curation architecture to maintain clear model provenance and meet regulatory demands, and observe a broader industry move toward less transparency despite the goals of the Transparency Index.
- [00:20:25](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1225s) **Enterprise vs Consumer AI Transparency** - The speakers argue that consumer and enterprise perspectives on AI transparency intersect, influencing initiatives like transparency indexes, while cautioning that an over‑emphasis on IP and benchmark metrics may lead the market to ask the wrong questions.
- [00:24:28](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1468s) **Transparency as Emerging Market Driver** - The participants argue that transparency will evolve from a peripheral issue to a major market force for new technologies, paralleling the privacy shift that occurred as social media matured.
- [00:28:33](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1713s) **Nova’s Ongoing Releases & Enterprise Potential** - The speaker explains that Nova is not brand‑new—having launched speech‑to‑speech models and other updates last year—and introduces the upcoming Nova Forge, which aims to democratize custom model creation for enterprises, while questioning how many of these new capabilities will address mainstream business use cases.
- [00:32:22](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1942s) **Agents Over Fine‑Tuning for Enterprises** - The speaker explains that, because most LLMs are trained on outdated public data, enterprises should combine them with retrieval‑augmented methods and ready‑made agent platforms—like Amazon’s one‑stop solution—rather than invest in costly fine‑tuning, which only a handful of data‑science‑intensive companies can effectively execute.
- [00:35:58](https://www.youtube.com/watch?v=nvAScf8YhzE&t=2158s) **Future of Long‑Running AI Agents** - The discussion examines Amazon’s Frontier agents’ claim of multi‑day operation, explores the prospect of AI assistants that work for weeks or longer before delivering results, and highlights ongoing improvements in tool use and alignment.
- [00:39:44](https://www.youtube.com/watch?v=nvAScf8YhzE&t=2384s) **Balancing Runtime and Accuracy** - They discuss how model evaluation must consider both execution time and reliability, noting a shift toward longer yet more accurate runs and the role of self‑evaluation loops.

## Full Transcript
I think OpenAI is going to try and
capture attention back away from the
success of Gemini. They've got to do
that to, you know, save face with their
broader investors and everything else
they're pursuing. But I don't know that
I would agree that at the end of the
day, the consumer is going to be a lot
better off the day after 5.2 is released
than today. All that and more on today's
Mixture of Experts.
I'm Tim Hwang and welcome to Mixture of
Experts. Each week, Moe brings together
a panel of the smartest minds in
technology to distill down what's
important in artificial intelligence.
Joining us today are three incredible
panelists. We've got Mihai Criveti,
distinguished engineer, Agentic AI, Kate
Soule, director of technical product
management, Granite, and Ambi Gatisan,
partner, AI and analytics. Uh, welcome to
you all. We're really ending the year
with a bang. There's a lot to talk about
today. We're going to talk a little bit
about rumors of GPT-5.2, a new
transparency report out of Stanford and
Amazon's newest generation of their Nova
models. But first, we've got Aili with
the news.
Hi everyone, I'm Aili McConnon, a tech
news writer for IBM Think. Here are a
few AI headlines you might have missed
this week. Both Jeff Bezos and Elon Musk
are now racing to develop data centers
in space. IBM has acquired data
streaming platform Confluent for 11
billion to help ramp up agent use in
enterprises.
OpenAI has started training models to
confess when they've made stuff up or
taken shortcuts.
Ho ho ho. A new Santa agent lets users
interact with Santa via text, phone, or
video chat to share what they want for
Christmas and to find out if they're on
the naughty or the nice list. For more,
subscribe to the Think newsletter linked
in our show notes. And now, let's see
what our experts think of GPT-5.2.
This is kind of an interesting story.
Rumors are swirling and by the time you
listen to this, this model actually may
be out, that effectively OpenAI has
called a code red to get its GPT-5.2
model out to go compete largely
with uh the new Google model Gemini
which indeed as we've talked about
before in previous episodes um is very
very impressive. Um and I mean maybe
I'll start with you. This is sort of a
really interesting kind of reversal in
some ways. Had we talked about it in
January 2025 it would have been like
OpenAI is crushing everybody. They've
got the state-of-the-art models. They're
ahead of everyone else. No one's
catching up. But this is kind of weirdly
now in a situation where it's like
Google, which we would have said at the
beginning of the year is like the most
behind, is now the one that's kind of
causing OpenAI to react. And I don't
know, is this just gossip? Like are we
reading too much into this or is it
really kind of a signal that OpenAI is
in some ways falling behind in this
race?
>> Yeah, look, I think we can speculate all
we want. Uh history always suggests that
there's always going to be this up and
down roller coaster, right? I feel like
if you made this entire saga movie, it's
going to be full of plot twists and
turns. So much of you're going to be
>> play, you know, who plays Sam, right?
>> Yeah. So, yeah, it's anybody's guess.
So, yeah, of course, rumors are swirling
and I think the the latest I read that
was, uh, you know, 5.2 is already on
Cursor. There are indications that it may
release soon, right? Um, it's not just
5.2. There's ChestNet and HazelNet as
well accompanying it, um, code names for
a couple of the image gen models to, you
know, compete with Nano Banana Pro. So
yeah of course yeah I think it's
anybody's game at this point in time. Um
we can speculate all we want but hey at
the end of the day you know consumers a
winner here right welcome all the
competition all the the good competition
between the model
>> makers. You're happy for the soap opera
basically.
>> Yeah. Exactly. Kate, would love to get
your reaction on this because I feel
like uh at the end of this year I'm I'm
tired, you know? It's like every week
there's a new model out or it's like
what's the difference between this model
and that model, but do like model
launches matter anymore? Like should we
care about them? Yes.
>> Or is like the game really somewhere
else now?
>> Yeah. And I actually wonder I I don't
know if I quite agree with your
statement the consumer wins at the end
of this. Like are we really in this race
where the consumer is actually
benefiting? Am I going to have this like
huge uptick in my productivity and daily
life with 5.2? Um, I don't think so. Not
for the potential cost that will come
along with it. But, you know, we can we
can always see. I think there definitely
is a little bit of exhaustion uh that's
coming in just broadly around model
releases. Uh, and so, you know, I I
think uh OpenAI is going to try and
capture attention back away from the
success of Gemini. they've got to do
that to, you know, save face with their
broader investors and everything else
they're pursuing. But I don't know that
I would agree that at the end of the
day, the consumer is going to be a lot
better off the day after 5.2 is released
than today.
>> Yeah. So, I get where Kate is coming
from, right? But the way I look at it is
at the end of the day, advances are
going to keep coming. They're going to
keep coming, right? And what I mean by
the consumer is going to win is that
keep those advances coming. You don't
want things to stagnate, right? That's
the only way. So have that competition
flowing, have that healthy competition
flowing. So you keep advancing the um
the boundaries, you keep pushing the
boundaries and so at the end of the day
as consumers of those models, right?
Yeah, there may not be dramatic changes,
but every win counts, right? So you keep
pushing the boundary. You keep pushing
the boundary and that's how the field
advances. So at the end of the day, that
healthy competition is great, right? You
got to have that.
>> Mihai, do you want to get in on this? Do
you feel uh do you have any opinions on
a model that is not yet out? Is this
going to be the model that crushes
everything for the year or?
>> Um, I'm about as excited about this
model as I am for the latest Windows or
macOS hotfix. You know, you'll see it
in the Windows hotfix.
>> Wow, that's harsh.
>> I was joking recently. I was like, they
just dropped a new version of Zoom
recently. Who's excited about the new
Zoom version?
>> Yeah, I guess. Um, my take is this. Many
of these models are going to see minor
updates that try to resolve issues with
performance, with speed, with cost, with
specialized use cases, uh, with usage in,
for example, IDEs like Cursor or in Codex,
or the equivalent of Claude Code. Um,
they're going to try to optimize for
specific benchmarks or for specific
situations. But I don't expect these
updates to be necessarily revolutionary.
they just put maybe OpenAI for the
next two days, 2 hours, 2 minutes, two
months if they're lucky uh ahead of
Gemini in some of these specific
benchmarks. Uh is it going to be world
changing? Likely not. Um it's nice. It's
maintenance. It's going to help with
some of these specialized use cases, but
I don't think it's going to be re
revolutionary. Otherwise, they would
have called it GPT-6.
>> Yeah. Versus point 2, you know.
>> Yeah. Yeah. And I think that's kind of
one of the really interesting sort of
ironies of it feels like the situation
we're sitting in at the end of 2025 is
like everybody kind of agrees there's
something like kind of rotten in the
world of benchmarks, right? Like they
don't really provide us with a whole lot
of traction on what we actually want to
use these tools for. As yet clearly they
are motivating a lot of big corporate
activity, right? like OpenAI wants to be
number one on all those benchmarks and
it doesn't want to be left behind for
any length of time when you know Gemini
comes out and like says hey we're great
against all these benchmarks. Um but
it's almost like we're almost optimizing
for the same thing and it feels like you
end up kind of in this discussion that
Ambi and Kate were just having which is
well there's these maybe downstream
effects where everybody sort of benefits
from us constantly pushing the frontier.
The other one is also kind of like is
the industry focusing on the right the
right thing, right? Because I think you
know Kate, I guess you're nodding. I
don't if you want to respond to that
idea.
>> Well, I think what's really interesting
um, a week or two ago Stanford's Hazy Research lab
put out a report looking at the
intelligence per watt basically and how
much performance we're being able to
drive per kind of watt of electricity uh
powering the compute. And what they
found is that you know a lot of the
adoption and market share is with these
big hosted models like you know the
latest GPT models but that if you
actually look at what you can achieve if
you move some of those workloads locally
you can get the same amount of
performance at a lot lower energy
consumption a lot lower cost and so you
know I think they argue that there's a
huge opportunity for disruption here uh
that you know the model providers might
not be focused on the right metric And I
would tend to agree with that. I think
that right now we're uh you know chasing
a lot of investment dollars and
prioritizing fancy benchmarks but a lot
of the future development is going to be
incentivized more by performance per
cost and you don't see that quite in the
conversation today with these model
releases that are coming out. One angle
I wanted to bring to this before we move
on to the next topic is you work with a
lot of like customers and enterprise
right um and you know I think all of
this comes on the backdrop of like these
companies are obviously ultimately
competing for enterprise dollars and so
I guess I'm I'm curious just like so I
genuinely don't know like one of these
new models drops like 5.2, are our
customers like, oh man, this
one's launching on all the benchmarks I
got to move my entire stack over to the
new model like what's the influence of
these types of kind of competitions even
very incrementally on who chooses to
adopt what like is the market influenced
I guess is what I'm saying by these
kinds of launches
>> yeah there there are two lenses in which
you look at it right enterprises are not
going to immediately switch to the
latest model at the drop of a hat right
so you pick a stable
workhorse, you build your applications
on top of that. You got to have some
stability. There's all the, you know,
you put it into production um and then
you you start realizing value, right?
It's it'll be very very tricky. It would
be very very um problematic to go and
keep changing models at the drop of a
hat. So, you know, it's not going to
happen immediately, right? But does it
happen? Of course, it'll happen, right?
Because let's say you track it over the
period of time right let's say you're
tracking things over a period of, you
know 6 months a year and over the course
of time the pace at which these advances
are happening there is a fundamental
step function change in the performance
of the models right bunch of new
capabilities have gotten accumulated
which means okay now you're exploring
and you're looking at okay uh you know
from a maintenance perspective from an
application maintenance perspective I do
want to have a road map into okay you
there is a certain time window at which
I say okay I've got a step function
change and I'm going to go and make a
switch to the latest model right so yes
those model changes will happen and do
happen but it's not going to happen for
every single release that happens
>> I would say the following if you're able
to switch models at the drop of a hat
either your enterprise maturity is very
low where you're an independent
developer or a small shop and are able
to just quick quick quickly switch
models or your maturity is very high
where you have all of your evals fully
automated and you're able to switch the
model. You push a button, all your evals
get done. You can test your request on
the new model and then you're able to
see, oh yeah, this one performs 17.3%
better for my use case. It's more cost
effective. I see the data my
observability platform in my dashboard.
I make the switch overnight. So, uh
if you're in the middle, it's
going to be tough.
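The fully automated eval setup described here can be sketched roughly as follows. This is a toy illustration, not any vendor's API: the function names, the eval set, the stand-in "models," and the 2% gain threshold are all hypothetical.

```python
# Toy sketch of an automated model-switch gate: run a candidate model over a
# fixed eval set and only migrate if it beats the incumbent on quality AND cost.
# All names, data, and thresholds here are hypothetical.

def run_evals(model, eval_set):
    """Score a model over (prompt, expected) pairs; returns accuracy in [0, 1]."""
    correct = sum(1 for prompt, expected in eval_set if model(prompt) == expected)
    return correct / len(eval_set)

def should_switch(incumbent, candidate, eval_set,
                  incumbent_cost, candidate_cost, min_gain=0.02):
    """Switch only if the candidate is meaningfully better and no more expensive."""
    gain = run_evals(candidate, eval_set) - run_evals(incumbent, eval_set)
    return gain >= min_gain and candidate_cost <= incumbent_cost

# Stand-in "models" (plain functions) for demonstration:
eval_set = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
incumbent = lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, "?")
candidate = lambda p: {"2+2": "4", "capital of France": "Paris", "3*3": "9"}[p]

print(should_switch(incumbent, candidate, eval_set,
                    incumbent_cost=1.00, candidate_cost=0.80))  # True
```

In practice the "push a button" step Mihai describes would wire this kind of gate into an observability dashboard and CI pipeline, but the decision logic reduces to exactly this comparison.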
>> Well, we'll just have to see. Uh I guess
this announcement 5.2, I'm sure we'll be
talking about it potentially next week
when it actually launches. uh and we'll
see how all these predictions play out.
But I think that's really interesting
and I think, Mihai, it's very
helpful to have this discussion on just
like so much of this is like we see the
competition but it's also on the
backdrop of the customers and just
seeing kind of what they do or how they
react to this stuff. Yeah. Mh. Sorry.
>> I'm just hoping OpenAI is going to do
the 12 days of Christmas thing again
like uh what was it last year?
>> You like that? That was a good gimmick
from last year.
>> 5.2 5.3 5.4 one model release every day.
>> Yeah. Exactly. until we get to 5.12 and
then they'll roll those six.
>> You just tweak the prompt every day and
you call it a five.
>> Exactly.
>> I'm going to move us on to our next
topic. Uh so we've talked about this
report before, but a number of
researchers at Stanford have come out
with the latest edition of their
transparency index. And so if you're not
familiar with this discussion from last
year, um the idea is that they're taking
a bunch of uh available models and
trying to rank and assess basically how
well these models do um from the point
of view of transparency, right? What
kinds of documentation do they provide?
What kinds of um you know data
disclosures do they have? Um and I've
always thought about this as like a very
interesting project. Um because when we
say transparency, it's a little bit like
open source, right? Where it's like what
do we exactly mean by that? And these
are attempts, I think, to kind of get a
lot more granular about, you know, what
we mean when we say transparency. Um,
and Kate, it was good to have you on the
show because I understand Granite was a
part of this transparency report. And do
you want to just talk a little bit about
how you all kind of approached it and
how it all turned out?
>> Yeah, so this is a report as you
mentioned that Stanford does annually.
We've participated in the past and it
really tries to break down model
development into three components. So
kind of upstream of the model, the training
itself, and downstream of the model. And
what they do is they send a survey out
to model developers like IBM uh training
our granite models both closed and open
model developers and they invite people
to participate and share information
about everything you know upstream of
model development like around data
curation. What models are you using to
generate data to train your models? Down to
the actual training
process? You know, do you release your
training code? Do you release different
repositories? Do you release um
different details about the architecture
of the model? And then uh downstream
of model use. So things around like
do you release benchmarks on safety? Do
you release details on uh gaps and
performance? do you release prompts that
were uh successfully used to attack the
model? So that type of thing and you
know what they've done and found is that
over the years you know transparency has
actually greatly diminished. If you look
between 2024 and this report that just
came out last week, uh, in 2025,
most labs have reduced the degree that
they are quote unquote transparent to
the degree that they share details about
these different facets of model
development. Um IBM's taking a very
different approach which I I'm really
proud of really focusing on transparency
and trust and being as open as possible.
I think it speaks to the rigor of which
we put together our strategy and
policies around how we train and develop
our models which is reflected in our ISO
42001 certification that we also received
this year. Uh and it allows us to just
be very forthcoming with what we're
working on, how we're building it, and
how we're contributing it to the open
source ecosystem. So we're really proud
that Granite got the top score 95 out of
100 I believe. uh and where other labs
were kind of going down in transparency
over time, you know, IBM demonstrated
that we are actually doubling down and
increasing the degree to which we're
transparent in model development.
>> Yeah. And that's 95 out of 100 like
different criteria basically.
>> Yes. Exactly. Different indicators uh
different questions do we answer and
provide detail. So it's not actually
looking at you know what was the result
on this safety benchmark. It's how
transparent are you on your safety
benchmarks? Do you share the benchmarks?
do you share this type of data? Uh,
which is a really cool approach.
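The indicator-based scoring Kate describes boils down to counting yes/no disclosures. A minimal sketch, with entirely hypothetical indicator names (the real index uses its own 100 indicators):

```python
# Hypothetical sketch of indicator-based transparency scoring: each indicator
# is a yes/no disclosure question, and the score is the count satisfied.

def transparency_score(answers):
    """answers: dict mapping indicator name -> bool (disclosed or not).
    Returns (satisfied, total)."""
    return sum(answers.values()), len(answers)

indicators = {
    "training data sources disclosed": True,
    "training code released": True,
    "safety benchmarks published": True,
    "known attack prompts shared": False,
}
satisfied, total = transparency_score(indicators)
print(f"{satisfied}/{total} indicators satisfied")  # prints "3/4 indicators satisfied"
```

The point is that the index measures whether a detail is shared at all, not how good the underlying result is, which is why a lab can score highly on safety-benchmark indicators regardless of its benchmark numbers.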
>> And I think one of the things I want to
if you want to speak if you could speak
a little bit more about it is um you
know especially because across a hundred
of these metrics um you know you have to
almost pick and choose right like the
team can't afford to try to do
everything or move everything forwards
on a year-to-year basis. Uh or or maybe
that is how the team is thinking about
it. I guess I'm interested in whether or
not there's like particular aspects of
transparency that the team said, "Okay,
this is here is what we're really going
to prioritize."
>> Yeah. So, I think over the past year and
a half, if you look at from where we
were in 2024 to 2025, we've done a lot
of work on automating uh and
standardizing our training and
development process so that there are
automated records of everything. That
makes it much easier to be transparent
and share because there's so many minute
details that go into these models.
Everything from, you know, when was a
data set acquired and what was the
license it was acquired on and what was
the source it was acquired and what was
the review process for it. And so we
actually invested heavily in the
architecture around all of that uh data
curation and training so that we can
have a very streamlined you know uh
lineage of our models that makes it
really easy to just be transparent and
open and have that information at our
fingertips. That also helps us with our
own regulatory compliance requirements
where we want to be obviously
best-in-class and able to respond to
changing regulations uh as they evolve.
and that made it uh possible for us to
be just a lot more open uh when it came
to the transparency index this year.
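The per-dataset record keeping Kate describes (acquisition date, license, source, review status) can be sketched as a small data structure; with that in place, a lineage report for a model is just a lookup over the datasets it was trained on. The class and field names below are illustrative, not IBM's actual schema.

```python
# Hypothetical sketch of dataset provenance records for model lineage.
# Capture acquisition details once per dataset, then transparency reports
# become a simple flattening of the records a model depends on.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    name: str
    source: str          # where the data was acquired
    license: str         # license it was acquired under
    acquired: str        # acquisition date (ISO 8601)
    review_status: str   # outcome of the review process

@dataclass
class ModelLineage:
    model: str
    datasets: list

    def report(self):
        """Flatten provenance into rows for a transparency disclosure."""
        return [(self.model, d.name, d.source, d.license, d.acquired)
                for d in self.datasets]

lineage = ModelLineage("granite-example", [
    DatasetRecord("webcorpus-v1", "public crawl", "CC-BY-4.0",
                  "2024-03-01", "approved"),
])
print(lineage.report())
```

Keeping these records automated at curation time, rather than reconstructing them later, is what makes both the transparency survey and regulatory responses cheap to produce.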
>> Um yeah, if I can bring you in, I mean I
think um Kate's already pointing out I
think one of the interesting trends,
right, which is obviously Granite
doubled down on this, but the general
trend is less transparency that we're
seeing. And you know, this actually goes
back to what we were talking about a
little bit earlier about like what the
market incentivizes. You know how I read
the transparency index is it's sort of a
dream of saying look people will be able
to look at the index and say I want the
more transparent model here's how I find
that right and the market will reward
people who are more transparent but if
anything it feels like there's there's
actually been a pullback on transparency
do you think that that means that the
market doesn't really value transparency
all that much
>> I think it depends on the type of
business they serve so I've noticed in
the report that B2B for example
companies tend to be more transparent
than B2C because regular consumers may
not care if they're running a 100
billion 200 billion, 500 billion
parameter model, how many GPUs it uses,
how much water or, whatever other metrics,
CO2 emissions are used by the model. Uh,
they may not necessarily care about the
cost to run the model itself. They care
about the cost to the end user. Uh while
B2B companies do need to care if they
make these models available to other
companies who are consuming it, who may
be running it on their own
infrastructure. Um the second
interesting trend I've seen is, like
you've pointed out uh it went from 74%
of the companies responding last year to
only 30% responding this year. So models
it's kind of curious, if you look, you
know, at xAI models, or, uh, models from
Anthropic, models from OpenAI, you don't
even know how many billion parameters
they have and you might not care. Um I
would see it from one perspective this
kind of information can be used against
them. Oh, look how much you know CO2 or
emissions this model is generating or
how inefficient it is. It can be used in
calculating
uh how viable their business is long
term. So for example, are they actually
um subsidizing a lot of their end users?
So a lot of this information I see um is
likely to become more transparent
in B2B companies. So, you know, AWS with
their Nova models and IBM with their
granite models and Nvidia and so on and
so forth are going to become likely more
transparent over time. Uh, while models
that are focused more on the consumer
market don't necessarily need to publish
those details. They probably will not
publish them anymore. Ambi, it almost
feels like there's going to be like um
on the consumer side almost like the
appleification of the world. Kind of
what I mean by that is like you know uh
uh you know if you go back 20 years
right it'd be like okay well we have
these open computing platforms and
you've got Apple and it's a battle
between open and closed and then over
time it kind of feels like everybody has
been like yeah actually like for the
consumer the general preference is
they're happy to pay more for a pretty
closed system that's pretty opaque. you
know, you have to go to a store and find
a genius to fix these computers for you.
Um, you know, that's kind of like the
state of play in consumer land. And then
on enterprise, of course, open source
has this like long and robust legacy and
is a huge huge huge business. Do you
sort of see that happening in the world
of kind of AI applications as well where
it's like it turns out that from a
consumer standpoint? Transparency is not
so important that it really is forcing
say forcing is a little strong but
encouraging companies like Anthropic and
OpenAI to say hey we're going to
participate in this index and we're
going to try to get a good score on this
index.
>> Well, partially, right? I always say that at the end of the day, we all sit in enterprises, but we're also consumers. We wear those two hats at the same time. It's not like we immediately switch on and off between a consumer hat and an enterprise hat; even when we're sitting in an enterprise, we think with a consumer lens, and vice versa. So the ways we think bleed into each other's domains. And here's what I've noticed: I feel like the market in general is maybe asking the wrong questions. Yes, there is the prioritization of IP, which is why, in these benchmarks, if you look at the downward trend in the metrics for most of the labs, there was a huge hit on the upstream component. But rather than debating whether there's a reward for labs to be transparent, I feel the right thesis is whether the market is asking the right question. What I mean by that is, I'll give you an example. Just earlier this week I was with a client, and they were talking about DeepSeek and asking, hey, we want to see if we should be using open-source models; what do you think about DeepSeek, and should we be using it right now? And this was in an enterprise setting. We talked about some of this in one of the earlier episodes: what DeepSeek did was open up the mindshare for open source. Everyone started thinking about open-source models and open-weight models and started talking about them. But I think there's a conflation of transparency with open source and open weights, which is not necessarily true. What most consumers and most enterprises are inherently asking for are transparent models, but they're phrasing it as, hey, can I get open-source and open-weight models? And those aren't necessarily the same thing. So I don't fully buy the argument that the market isn't asking for transparency or favoring it. Yes, of course there's an inherent tension between the labs saying, hey, I'm going to optimize for my IP, and the market saying, hey, I need some transparency. But there is definitely a demand for that transparency, I would say. It's just that people are asking the wrong questions, which means the signals aren't really coming up into these reports appropriately.
>> Well, I will say what's
interesting about the parallel you brought up, Tim, comparing to Apple: Apple has taken away a lot of the configurability and user visibility into the hardware, but they also have one of the best reputations for privacy when it comes to devices, and for responsible use of data and information. Deservedly or not, they've built a strong reputation there, and I think it's paying off with consumers. I don't see that quite yet in model development, but I think it's going to become more and more of a priority. Transparency is one way you can signal it, but it's not the only way. Anthropic didn't score as well on transparency, but they have ISO 42001 certification, and I think they're also very well known for their principles around ethical AI. So I think transparency is just one tool for addressing some of the broader societal and ethical questions, which are maybe not going to be the singular driving market factor, but will be an important market factor in the future.
>> Just to add on to that, I do agree with Kate, and I do think that will become a trend. Look back at social media as a parallel. When it started with MySpace and the early days of social media, privacy probably wasn't at the center of everyone's thoughts; it was about the cool thing and the ability to network. The capabilities were at the forefront. But when those capabilities matured and saturated, privacy moved up front. You had the shenanigans with Cambridge Analytica and things of that nature, the Congressional hearings popping up, and you started to see that pivotal shift happen. I feel like you're going to see the same with any new technology: the capabilities come up front, and once those become mainstream, you start seeing the privacy concerns and transparency aspects come to the forefront fairly soon.
>> Kate, maybe to wrap this section up: you're already scoring 95 out of 100. Where do you go next year? Do you work on that last remaining five? Are we already, in some sense, saturating the benchmark for transparency?
>> I think there are certainly always going to be new ways to think about transparency. We're moving from models being just a bag of weights released openly, in the case of Granite at least open-weight models, to systems of models and software built together. And that's going to introduce new aspects of being transparent: not just on the weights themselves and how the weights were created, but particularly around deployment, and the systems and software executing that deployment, because those details can have huge impacts on performance. I'd love to see the transparency index evolve to encompass those aspects. It's certainly something IBM is thinking about. One project we're working on is thinking through how you create a standardized AI bill of materials and make that a standard artifact that can be released with models. I don't want to give away too much, but expect some work from IBM on that to come out in 2026. I think there's going to be a lot more focus on standardization and a lot more focus on the deployment of these models. So there's still lots to do that we're eager to work on with the community.
>> Not done yet for sure.
>> I'd love to see more transparency over the infrastructure as well, right? The APIs they put in front of the models.
>> Absolutely.
>> Even the system prompt is kind of invisible if you're comparing the OpenAI model to ChatGPT as an end-user application. There's a lot of other stuff going on in there that's unknown.
I'm going to push us on to our final topic. The big Amazon AWS re:Invent conference was just the other week, with a number of really interesting announcements that we didn't get a chance to cover in previous episodes. Actually, it occurs to me that I'm running against myself now: I started the episode saying I'm bored of all these new model releases, and we're going to end with "and Amazon released some new models." So I'm a hypocrite, I suppose. The big news coming out of the conference is that Amazon announced its latest generation of Nova frontier models. I think Amazon has always been really interesting in this discussion, because they've always been kind of looming in the background: they have huge infrastructure, and they have incredible data from all the e-commerce business. So it seems very natural that at some point they would start making some very big swings in the AI space and the model space. Ambi, I guess the question for you is: is this the big swing? With Nova, it really feels like they're touting this as, we're now in the game. Are they in the game?
>> Well, there were Nova releases even last year, right? So Nova isn't completely new. Technically they're saying, hey, we were already in the game last year with the Nova releases. Some of the advances are par for the course: they're releasing speech-to-speech models, which others are releasing as well. But a couple of new advances came out. One is Nova Forge, which they're touting as, we're going to democratize multiple different mechanisms for you to go and build your own models. It's not just fine-tuning mechanisms. It's still murky exactly how they do this, but it's almost like, hey, we'll give you checkpoints, you come and blend in your data, and you build your own custom pre-trained models, almost from scratch. We're going to democratize it so enterprises can just go and do it; you don't have to have a complete research lab. So some of that is really exciting. The question, again, if I put an enterprise lens on it: great, but how many of those capabilities are going to be used, for how many enterprise use cases? A large mainstream set of use cases can be largely driven with your models out of the box, with appropriate integrations. You may not need custom fine-tuned models, or even custom pre-trained models, for a good chunk of the use cases. So: great capabilities, a great push on the engineering side of things, fantastic if you look at it as an engineer, but we're also trying to think about the enterprise value and how that slots in. There's another one, Nova Act, which is the enterprise equivalent of the OpenAI browser use or the Gemini browser use. The differentiation they talk about is, hey, we have trained it on enterprise screens. So it's not doing Instacart shopping; you're training it on CRM screens, and we think we're way more equipped to handle those sorts of enterprise screens. It's still early days, but I think that piece is actually exciting, because let's all be honest: there's always going to be a data and API issue, and there are always going to be questions of, hey, do I have the most clean and hygienic data in my enterprise? So we're all thinking, hey, the browser-use applications and capabilities can be fairly promising where you don't have ready access to data; you just mimic the human actions. It's a promising capability, but there are obviously a lot of open questions about the security of how that will work. Good, promising, still to be seen.
>> I'm not a fan of training or fine-tuning models for most enterprise use cases, mostly because whenever you talk to an enterprise, they first assume they have the data, second they assume they have the GPUs, and third they assume they have the investment necessary to continuously fine-tune or retrain a model every single time their data evolves or changes. The reality is that large language models on their own are insufficient for the vast majority of enterprise use cases. Why? They've been trained on last year's data, and they've been trained on public data. So you want to blend that with your enterprise data. But we've seen that techniques like RAG, graph RAG, and agentic RAG, as well as tool use, whether through MCP servers or all sorts of other techniques, provide sufficiently good access to real-time data and real-time information without the need for expensive tuning, training, or fine-tuning.
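The pattern described here, retrieval plus tool use instead of tuning, can be sketched in a few lines of Python. Everything below is a hypothetical stand-in: `search_documents` fakes retrieval with keyword overlap, `get_inventory` stands in for a live tool call (say, an MCP server in front of an inventory system), and in a real system the assembled prompt would be sent to an LLM.

```python
# Minimal sketch of retrieval plus tool use as an alternative to fine-tuning.
# All names (search_documents, get_inventory, answer) are hypothetical stand-ins.

def search_documents(query, documents, top_k=3):
    """Naive keyword retrieval: score each document by query-term overlap."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def get_inventory(sku):
    """Stand-in for a live tool call (e.g., an MCP server hitting an ERP system)."""
    return {"SKU-42": 17}.get(sku, 0)

def answer(question, documents):
    # 1. Retrieve relevant enterprise text (the RAG step).
    context = search_documents(question, documents)
    # 2. Call a tool for real-time data the model was never trained on.
    live = get_inventory("SKU-42")
    # 3. Assemble a grounded prompt; a real system would send this to an LLM.
    return f"Context: {context} | Live inventory: {live} | Q: {question}"

docs = ["SKU-42 is our flagship widget", "Returns policy lasts 30 days"]
print(answer("How many SKU-42 widgets are in stock?", docs))
```

The point of the sketch is the division of labor: fresh, private data enters through retrieval and tool calls at request time, so the underlying model never needs retraining as the data changes.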
I think the proposition is for the very, very few companies that employ hundreds of data scientists who really make it their passion to train and fine-tune models. Even if you're doing it on, quote unquote, somebody else's infrastructure, and even if you're not starting from scratch but from a checkpoint, you shouldn't underestimate the effort it takes to properly train or even fine-tune a model to a specific domain. And you shouldn't underestimate the vast amount of data, or the quality of data, that's required. So I would say most folks should stick to agents. That's why I like the fact that Amazon provides a one-stop shop for everything.
>> Not that you're biased or anything.
>> No, no, but look, they have the other option, right? They have their AgentCore; they have agents. You don't like this, we have that. So I would say: don't fine-tune or train a model unless you really have to and you know what you're doing. It's very unlikely that the resulting model is going to outperform a frontier model plus tool use. And even if it does, now you have to do that every week or month, or whatever the refresh rate of your data is. Still, it's exciting if you're in that space. So if you're in that 1% of companies that do need that service, and you can't buy the GPUs required and you need to run it as a service, it's awesome.
>> This is actually kind of fun, because it flips the narrative from what we were talking about earlier, right? Earlier I was saying, consumers don't want complexity, they don't want transparency, but enterprises do. And you're coming back and basically saying that, actually, most enterprises don't want that either. So Kate, do you have any thoughts on this?
>> Well, I agree with everything that's been said. The only comment I'd add is that where I do think there could be something interesting is around the research and academic community, when it comes to these new reinforcement-learning-as-a-service and tuning-as-a-service capabilities in Nova Forge. I thought it was really cool that they're offering early checkpoints, partially trained versions of the Nova Lite model that can then be further customized. That's one benefit that could come out of this. While I'm skeptical of the direct enterprise value, and I think it's going to be a lot harder than people anticipate to get a specialized model using SFT or RL, I do think that offering more of these components could enable more engagement from the research community, which is otherwise hampered because they don't have access to early checkpoints. They even have a part of the service where you can mix your own data with the training data for continued training. Those are all really interesting things that could hopefully spur more innovation that the field could benefit from, and engage a user group that's kind of been left on the sidelines, not able to participate fully.
>> Yeah, definitely a constituency we don't talk about enough on this show, but we should. Mihi, maybe I'll give you the final word of this episode, and maybe a little bit of a peek into the future. One of the fun tidbits Amazon announced when releasing the Nova models is that they've been making the claim that their frontier agents can operate for hours or even days on end, which is, I think, very intriguing. Regardless of how credible you think that claim is, we seem to be headed toward this really fun world where you say, "Okay, computer, I need you to help me out with something," and it comes back three weeks later and says, "Here's what I did." Are we headed for that world? Certainly the technology will be able to do something in those three weeks, but I'm curious whether you feel we're finally getting these agents aligned enough to get there.
>> Yeah, my agents can operate for weeks on end, too. It doesn't mean I'm getting good results out of them.
>> Actually, you can have them run for years if you want.
>> It's not an issue. I have a timeout I can tweak. I can keep it going and going and never return a final answer. Just tell me how many tokens you want me to consume.
>> Yeah, that's right. So look, what is improving is tool use. What we're seeing is improvement in tool use: the number of tools that can be called, the number of tools that can be called in parallel, and the number of sequential tool calls. And techniques like map-reduce, or using vector search or tool search to find the right tool, enable these kinds of continuous use cases. Say you're building a document, or take a PowerPoint deck because it's even easier to visualize, and you're building slide one, slide two, slide three, slide four. Each of those can be an independent tool call, and it can keep going and going if you're managing your context right. If you think about what's preventing us from running agents continuously today, it's just how difficult it is to properly manage that context. You're working with the limited context of the LLM doing the tool orchestration: everything needs to fit in the context within an execution, and then you need techniques to manage that context, like compaction. If you've used Claude Code or Codex, you've seen that at some point it starts to compact: it literally summarizes what you have in your context down to a state that's good enough for it to continue from. All of these techniques are coming together, and we're seeing longer and longer running agents.
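The compaction behavior described here can be sketched as a token budget plus a summarization step. This is a minimal illustration, not how Claude Code or Codex actually implement it: `summarize` is a stub (a real agent would ask an LLM for a faithful summary), and the word-count tokenizer is deliberately crude.

```python
# Sketch of context compaction for a long-running agent, assuming a token
# budget and a summarizer; both are hypothetical stand-ins for real components.

def count_tokens(text):
    """Crude token estimate: whitespace-delimited words."""
    return len(text.split())

def summarize(messages):
    """Stub summarizer: keep the first sentence of each message.
    A real agent would ask an LLM for a faithful summary instead."""
    return " ".join(m.split(".")[0] + "." for m in messages)

def append_with_compaction(history, message, budget=50):
    """Add a message; if the context exceeds the budget, compact older turns."""
    history.append(message)
    while sum(count_tokens(m) for m in history) > budget and len(history) > 1:
        # Fold everything except the newest turn into one summary message.
        history[:] = [summarize(history[:-1]), history[-1]]
    return history

history = []
for step in range(20):
    append_with_compaction(history, f"Step {step}: produced slide {step}. Details follow here.")
print(len(history), sum(count_tokens(m) for m in history))
```

The invariant is what matters: after every call the history fits the budget, so the slide-by-slide loop can, in principle, keep running indefinitely, at the cost of older detail being progressively summarized away.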
Microsoft has Researcher. ChatGPT and Gemini have their deep research functionality. Amazon has similar techniques, and we have similar techniques; we've built our own deep researchers. I think, at the end of the day, this is something we're going to see more and more of, because if you want good results on enterprise use cases from AI, you want it to touch all of your data, and that means hundreds, potentially thousands, of tool calls. RAG is not enough. With RAG, what you're doing is selecting ten paragraphs, give or take, from whatever you're searching, giving them to the model, and hoping for the best. What I would like to do is give it all of the data: summarize this, and this, and this, and keep going and going. It's expensive. But in some cases, if you're putting together a complex deliverable, like an RFI response document, an RFP response document, or a go-write-me-a-book-and-come-back-with-300-pages-on-this-topic request, you need that depth. So I do see a natural evolution of agents within the enterprise space adopting this kind of deep researcher functionality, with agents that can run for ten minutes, an hour, perhaps even overnight, to come back with a very complex response.
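The "touch all of the data" idea, as opposed to RAG's top-k slice, is essentially map-reduce over documents. A minimal sketch, with `summarize_chunk` as a hypothetical stand-in for an LLM call that extracts what's relevant to a focus term:

```python
# Sketch of map-reduce summarization over all documents, rather than a top-k
# retrieval slice. summarize_chunk stands in for an LLM summarization call.

def summarize_chunk(chunk, focus):
    """Hypothetical LLM call: keep sentences mentioning the focus term."""
    hits = [s.strip() for s in chunk.split(".") if focus.lower() in s.lower()]
    return ". ".join(hits)

def map_reduce_summary(documents, focus, batch=2):
    """Map: summarize every document. Reduce: merge summaries in batches."""
    partials = [summarize_chunk(d, focus) for d in documents]  # map step (parallelizable)
    partials = [p for p in partials if p]
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), batch):
            merged.append(summarize_chunk(". ".join(partials[i:i + batch]), focus))
        partials = merged  # each reduce round shrinks the list
    return partials[0] if partials else ""

docs = [
    "Revenue grew 12% in Q3. Headcount was flat.",
    "The Q3 revenue growth came from the EMEA region. Offices moved.",
    "Churn fell in Q3. Revenue per seat also rose.",
]
print(map_reduce_summary(docs, "revenue"))
```

Each map call is independent, so the expensive step parallelizes, and the reduce rounds shrink the partial summaries until one answer fits in a single context window. That is what makes the cost defensible for deliverables like RFP responses, where every document has to be touched.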
>> Tim, I want to add a nuance to what Mihi said, and Mihi is absolutely right: you have to contextualize all of this. But that's not to discount the advances the field is seeing. You have to look at this in two dimensions. It's not just about the amount of time an agent or a model or a system is running; it's also, when it's running for that long, how reliable or how accurate is the outcome of the task it's accomplishing. That curve has definitely shifted to the right. A couple of years back, we would have said high accuracy holds on the order of a few seconds; then it became a few minutes; and now we're definitely in the realm of a few hours. So the curve is definitely shifting, but it's important to recognize it's not just how long it's running; it's how long it's running while doing it reliably, with high accuracy.
>> Yeah, and in-the-loop evals also help with this. If you have agents that can self-evaluate at intermediate checkpoints, and retry or take a different direction, that's going to help improve them over a longer-running execution cycle.
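The self-evaluate-and-retry loop mentioned here can be sketched as follows. The approach names and judge scores are made up for illustration; in practice `evaluate` would be an LLM-as-judge or a battery of checks at each intermediate checkpoint.

```python
# Sketch of an agent that scores its own intermediate result at a checkpoint
# and, on a low score, retries with a different approach. The approaches and
# their scores are hypothetical stand-ins for real strategies and a real judge.

APPROACH_SCORES = {          # pretend judge scores per strategy
    "draft-directly": 0.55,
    "outline-first": 0.70,
    "retrieve-then-write": 0.90,
}

def attempt(task, approach):
    """Stand-in for running the agent one step with a given strategy."""
    return {"task": task, "approach": approach, "score": APPROACH_SCORES[approach]}

def evaluate(result, threshold=0.8):
    """Self-evaluation checkpoint: in practice an LLM judge or unit tests."""
    return result["score"] >= threshold

def run_with_retries(task, approaches):
    trace = []
    for approach in approaches:    # on failure, "take a different direction"
        result = attempt(task, approach)
        trace.append(approach)
        if evaluate(result):
            return result, trace   # passed the checkpoint; stop retrying
    return None, trace             # budget of approaches exhausted

result, trace = run_with_retries("draft section 3", list(APPROACH_SCORES))
print(trace, result["score"])
```

The design point is that the loop converts a long-running execution from "hope the first pass was right" into a sequence of checkpointed bets, which is what makes multi-hour runs worth their token cost.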
>> Yeah, I think that's right. Part of it is just going to be these tradeoffs, but I do think the frontier is going to keep increasing. It's something to pay attention to, particularly because I think this will be the new frontier of claims made about agents: you can run them for a week, you can run them for two weeks. So the question now will be, how do we measure that? How do we quantify it? It'll be very interesting to see. Well, that's all the time we have for today. Kate, Ambi, Mihi, thanks for joining us as always, and happy holidays. And thanks to all you listeners. If you liked what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we'll see you next week on Mixture of Experts.