GPT‑5.2 Rumors Spark OpenAI‑Google Rivalry
Key Points
- OpenAI is rumored to be accelerating a “code‑red” release of GPT‑5.2 to counter Google’s new Gemini model, suggesting the company may be feeling pressure to keep its lead in the AI race.
- The episode’s news roundup highlighted Jeff Bezos and Elon Musk racing to build space‑based data centers, IBM’s $11 billion acquisition of Confluent, OpenAI’s work on models that admit when they hallucinate, and a whimsical “Santa agent” for holiday interaction.
- Panelists noted a sharp shift in narrative from earlier in the year, when OpenAI was viewed as the dominant leader, to now where Google’s advancements are forcing OpenAI into a defensive stance.
- Despite the hype around GPT‑5.2, the experts expressed uncertainty about whether the upcoming model will materially improve the consumer experience over current releases.
**Source:** [https://www.youtube.com/watch?v=nvAScf8YhzE](https://www.youtube.com/watch?v=nvAScf8YhzE)
**Duration:** 00:41:32

## Sections
- [00:00:00](https://www.youtube.com/watch?v=nvAScf8YhzE&t=0s) **AI Rumors, Transparency, and Acquisitions** - The episode covers speculation about OpenAI’s upcoming GPT‑5.2 and its rivalry with Gemini, highlights a new Stanford transparency report and Amazon’s Nova models, and recaps recent AI‑related corporate moves such as space data‑center projects and IBM’s acquisition of Confluent.
- [00:03:13](https://www.youtube.com/watch?v=nvAScf8YhzE&t=193s) **Speculation Over New Model Releases** - Participants debate whether the constant stream of AI model launches—like the rumored 5.2, ChestNet, and HazelNet—actually improves consumer productivity or merely fuels competitive hype.
- [00:06:35](https://www.youtube.com/watch?v=nvAScf8YhzE&t=395s) **Benchmark Race vs Real Impact** - The speakers critique how AI benchmarks dictate corporate competition and incremental model releases, arguing that these metrics are misaligned with genuine, transformative utility.
- [00:10:04](https://www.youtube.com/watch?v=nvAScf8YhzE&t=604s) **Balancing Model Updates and Enterprise Maturity** - The speaker explains that while major step‑function improvements in AI models will eventually necessitate upgrades, frequent switches are impractical for most firms unless they are very small and agile or possess highly automated evaluation pipelines to safely test and deploy new versions.
- [00:13:21](https://www.youtube.com/watch?v=nvAScf8YhzE&t=801s) **Declining Transparency in AI Models** - The speaker outlines Stanford’s yearly AI model transparency survey—covering upstream data curation and training details through downstream safety benchmarks—and notes that the 2025 report shows most labs are sharing fewer details, whereas IBM is pursuing a markedly different, more open approach.
- [00:16:50](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1010s) **Model Lineage and Transparency Trends** - The speakers describe their investment in data‑curation architecture to maintain clear model provenance and meet regulatory demands, and observe a broader industry move toward less transparency despite the goals of the Transparency Index.
- [00:20:25](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1225s) **Enterprise vs Consumer AI Transparency** - The speakers argue that consumer and enterprise perspectives on AI transparency intersect, influencing initiatives like transparency indexes, while cautioning that an over‑emphasis on IP and benchmark metrics may lead the market to ask the wrong questions.
- [00:24:28](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1468s) **Transparency as Emerging Market Driver** - The participants argue that transparency will evolve from a peripheral issue to a major market force for new technologies, paralleling the privacy shift that occurred as social media matured.
- [00:28:33](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1713s) **Nova’s Ongoing Releases & Enterprise Potential** - The speaker explains that Nova is not brand‑new—having launched speech‑to‑speech models and other updates last year—and introduces the upcoming Nova Forge, which aims to democratize custom model creation for enterprises, while questioning how many of these new capabilities will address mainstream business use cases.
- [00:32:22](https://www.youtube.com/watch?v=nvAScf8YhzE&t=1942s) **Agents Over Fine‑Tuning for Enterprises** - The speaker explains that, because most LLMs are trained on outdated public data, enterprises should combine them with retrieval‑augmented methods and ready‑made agent platforms—like Amazon’s one‑stop solution—rather than invest in costly fine‑tuning, which only a handful of data‑science‑intensive companies can effectively execute.
- [00:35:58](https://www.youtube.com/watch?v=nvAScf8YhzE&t=2158s) **Future of Long‑Running AI Agents** - The discussion examines Amazon’s Frontier agents’ claim of multi‑day operation, explores the prospect of AI assistants that work for weeks or longer before delivering results, and highlights ongoing improvements in tool use and alignment.
- [00:39:44](https://www.youtube.com/watch?v=nvAScf8YhzE&t=2384s) **Balancing Runtime and Accuracy** - They discuss how model evaluation must consider both execution time and reliability, noting a shift toward longer yet more accurate runs and the role of self‑evaluation loops.

## Full Transcript
I think OpenAI is going to try and
capture attention back away from the
success of Gemini. They've got to do
that to, you know, save face with their
broader investors and everything else
they're pursuing. But I don't know that
I would agree that at the end of the
day, the consumer is going to be a lot
better off the day after 5.2 is released
than today. All that and more on today's
Mixture of Experts.
I'm Tim Hwang and welcome to Mixture of
Experts. Each week, Moe brings together
a panel of the smartest minds in
technology to distill down what's
important in artificial intelligence.
Joining us today are three incredible
panelists. We've got Mihai Criveti,
distinguished engineer, Agentic AI, Kate
Soule, director of technical product
management, Granite, and Ambi Gatisan,
partner, AI and analytics. Uh, welcome to
you all. We're really ending the year
with a bang. There's a lot to talk about
today. We're going to talk a little bit
about rumors of GPT-5.2, a new
transparency report out of Stanford and
Amazon's newest generation of their Nova
models. But first, we've got Aili with
the news.
Hi everyone, I'm Aili McConnon, a tech
news writer for IBM Think. Here are a
few AI headlines you might have missed
this week. Both Jeff Bezos and Elon Musk
are now racing to develop data centers
in space. IBM has acquired data
streaming platform Confluent for 11
billion to help ramp up agent use in
enterprises.
OpenAI has started training models to
confess when they've made stuff up or
taken shortcuts.
Ho ho ho. A new Santa agent lets users
interact with Santa via text, phone, or
video chat to share what they want for
Christmas and to find out if they're on
the naughty or the nice list. For more,
subscribe to the Think newsletter linked
in our show notes. And now, let's see
what our experts think of GPT-5.2.
This is kind of an interesting story.
Rumors are swirling and by the time you
listen to this, this model actually may
be out, that effectively OpenAI has
called a code red to get its GPT-5.2
model out to go compete largely
with uh the new Google model Gemini
which indeed as we've talked about
before in previous episodes um is very
very impressive. Um and I mean maybe
I'll start with you. This is sort of a
really interesting kind of reversal in
some ways. Had we talked about it in
January 2025 it would have been like
OpenAI is crushing everybody. They've
got the state-of-the-art models. They're
ahead of everyone else. No one's
catching up. But this is kind of weirdly
now in a situation where it's like
Google, which we would have said at the
beginning of the year is like the most
behind, is now the one that's kind of
causing OpenAI to react. And I don't
know, is this just gossip? Like are we
reading too much into this or is it
really kind of a signal that OpenAI is
in some ways falling behind in this
race?
>> Yeah, look, I think we can speculate all
we want. Uh history always suggests that
there's always going to be this up and
down roller coaster, right? I feel like
if you made this entire saga movie, it's
going to be full of plot twists and
turns. So much of you're going to be
>> play, you know, who plays Sam, right?
>> Yeah. So, yeah, it's anybody's guess.
So, yeah, of course, rumors are swirling
and I think the the latest I read that
was, uh, you know, 5.2 is already on
Cursor. There are indications that it may
release soon, right? Um, it's not just
5.2. There's ChestNet and HazelNet as
well accompanying it, um, code names for
a couple of the image gen models to, you
know, compete with Nano Banana Pro. So
yeah of course yeah I think it's
anybody's game at this point in time. Um
we can speculate all we want but hey at
the end of the day you know consumers a
winner here right welcome all the
competition all the the good competition
between the model
>> makers. You're happy for the soap opera
basically.
>> Yeah. Exactly. Kate, would love to get
your reaction on this because I feel
like uh at the end of this year I'm I'm
tired, you know? It's like every week
there's a new model out or it's like
what's the difference between this model
and that model, but do like model
launches matter anymore? Like should we
care about them? Yes.
>> Or is like the game really somewhere
else now?
>> Yeah. And I actually wonder I I don't
know if I quite agree with your
statement the consumer wins at the end
of this. Like are we really in this race
where the consumer is actually
benefiting? Am I going to have this like
huge uptick in my productivity and daily
life with 5.2? Um, I don't think so. Not
for the potential cost that will come
along with it. But, you know, we can we
can always see. I think there definitely
is a little bit of exhaustion uh that's
coming in just broadly around model
releases. Uh, and so, you know, I I
think uh OpenAI is going to try and
capture attention back away from the
success of Gemini. they've got to do
that to, you know, save face with their
broader investors and everything else
they're pursuing. But I don't know that
I would agree that at the end of the
day, the consumer is going to be a lot
better off the day after 5.2 is released
than today.
>> Yeah. So, I get where Kate is coming
from, right? But the way I look at it is
at the end of the day, advances are
going to keep coming. They're going to
keep coming, right? And what I mean by
the consumer is going to win is that
keep those advances coming. You don't
want things to stagnate, right? That's
the only way. So have that competition
flowing, have that healthy competition
flowing. So you keep advancing the um
the boundaries, you keep pushing the
boundaries and so at the end of the day
as consumers of those models, right?
Yeah, there may not be dramatic changes,
but every win counts, right? So you keep
pushing the boundary. You keep pushing
the boundary and that's how the field
advances. So at the end of the day, that
healthy competition is great, right? You
got to have that.
>> Mihai, do you want to get in on this? Do
you feel uh do you have any opinions on
a model that is not yet out? Is this
going to be the model that crushes
everything for the year or?
>> Um, I'm about as excited about this
model as I am for the latest Windows or
macOS hotfix. You know, you'll see it
in the Windows hotfix.
>> Wow, that's harsh.
>> I was joking recently. I was like, they
just dropped a new version of Zoom
recently. Who's excited about the new
Zoom version?
>> Yeah, I guess. Um, my take is this. Many
of these models are going to see minor
updates that try to resolve issues with
performance, with speed, with cost, with
specialized use cases, uh, with usage in,
for example, IDEs like Cursor or in Codex,
or the equivalent of Claude Code. Um,
they're going to try to optimize for
specific benchmarks or for specific
situations. But I don't expect these
updates to be necessarily revolutionary.
they just put maybe OpenAI for the
next two days, 2 hours, 2 minutes, two
months if they're lucky uh ahead of
Gemini in some of these specific
benchmarks. Uh is it going to be world
changing? Likely not. Um it's nice. It's
maintenance. It's going to help with
some of these specialized use cases, but
I don't think it's going to be re
revolutionary. Otherwise, they would
have called it GPT-6.
>> Yeah. Versus point 2, you know.
>> Yeah. Yeah. And I think that's kind of
one of the really interesting sort of
ironies of it feels like the situation
we're sitting in at the end of 2025 is
like everybody kind of agrees there's
something like kind of rotten in the
world of benchmarks, right? Like they
don't really provide us with a whole lot
of traction on what we actually want to
use these tools for. As yet clearly they
are motivating a lot of big corporate
activity, right? like OpenAI wants to be
number one on all those benchmarks and
it doesn't want to be left behind for
any length of time when you know Gemini
comes out and like says hey we're great
against all these benchmarks. Um but
it's almost like we're almost optimizing
for the same thing and it feels like you
end up kind of in this discussion that
Ambi and Kate were just having which is
well there's these maybe downstream
effects where everybody sort of benefits
from us constantly pushing the frontier.
The other one is also kind of like is
the industry focusing on the right the
right thing, right? Because I think you
know Kate, I guess you're nodding. I
don't if you want to respond to that
idea.
>> Well, I think what's really interesting
um, a week or two ago Stanford's Hazy Research lab
put out a report looking at the
intelligence per watt basically and how
much performance we're being able to
drive per kind of watt of electricity uh
powering the compute. And what they
found is that you know a lot of the
adoption and market share is with these
big hosted models like you know the
latest GPT models but that if you
actually look at what you can achieve if
you move some of those workloads locally
you can get the same amount of
performance at a lot lower energy
consumption a lot lower cost and so you
know I think they argue that there's a
huge opportunity for disruption here uh
that you know the model providers might
not be focused on the right metric And I
would tend to agree with that. I think
that right now we're uh you know chasing
a lot of investment dollars and
prioritizing fancy benchmarks but a lot
of the future development is going to be
incentivized more by performance per
cost and you don't see that quite in the
conversation today with these model
releases that are coming out. One angle
I wanted to bring to this before we move
on to the next topic is you work with a
lot of like customers and enterprise
right um and you know I think all of
this comes on the backdrop of like these
companies are obviously ultimately
competing for enterprise dollars and so
I guess I'm I'm curious just like so I
genuinely don't know like one of these
new models drops like 5.2, are our
customers like, oh man, this
one's launching on all the benchmarks I
got to move my entire stack over to the
new model like what's the influence of
these types of kind of competitions even
very incrementally on who chooses to
adopt what like is the market influenced
I guess is what I'm saying by these
kinds of launches
>> yeah there there are two lenses in which
you look at it right enterprises are not
going to immediately switch to the
latest model at the drop of a hat right
so you pick a stable
workhorse, you build your applications
on top of that. You got to have some
stability. There's all the, you know,
you put it into production um and then
you you start realizing value, right?
It's it'll be very very tricky. It would
be very very um problematic to go and
keep changing models at the drop of a
hat. So, you know, it's not going to
happen immediately, right? But does it
happen? Of course, it'll happen, right?
Because let's say you track it over the
period of time right let's say you're
tracking things over a period of, you
know 6 months a year and over the course
of time the pace at which these advances
are happening there is a fundamental
step function change in the performance
of the models right bunch of new
capabilities have gotten accumulated
which means okay now you're exploring
and you're looking at okay uh you know
from a maintenance perspective from an
application maintenance perspective I do
want to have a road map into okay you
there is a certain time window at which
I say okay I've got a step function
change and I'm going to go and make a
switch to the latest model right so yes
those model changes will happen and do
happen but it's not going to happen for
every single release that happens
>> I would say the following if you're able
to switch models at the drop of a hat
either your enterprise maturity is very
low where you're an independent
developer or a small shop and are able
to just quick quick quickly switch
models or your maturity is very high
where you have all of your evals fully
automated and you're able to switch the
model. You push a button, all your evals
get done. You can test your request on
the new model and then you're able to
see, oh yeah, this one performs 17.3%
better for my use case. It's more cost
effective. I see the data my
observability platform in my dashboard.
I make the switch overnight. So, uh
if you're in the middle, it's
going to be tough.
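The fully automated eval setup described here can be sketched roughly as follows. This is a toy illustration, not any vendor's API: the function names, the eval set, the stand-in "models," and the 2% gain threshold are all hypothetical.

```python
# Toy sketch of an automated model-switch gate: run a candidate model over a
# fixed eval set and only migrate if it beats the incumbent on quality AND cost.
# All names, data, and thresholds here are hypothetical.

def run_evals(model, eval_set):
    """Score a model over (prompt, expected) pairs; returns accuracy in [0, 1]."""
    correct = sum(1 for prompt, expected in eval_set if model(prompt) == expected)
    return correct / len(eval_set)

def should_switch(incumbent, candidate, eval_set,
                  incumbent_cost, candidate_cost, min_gain=0.02):
    """Switch only if the candidate is meaningfully better and no more expensive."""
    gain = run_evals(candidate, eval_set) - run_evals(incumbent, eval_set)
    return gain >= min_gain and candidate_cost <= incumbent_cost

# Stand-in "models" (plain functions) for demonstration:
eval_set = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
incumbent = lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, "?")
candidate = lambda p: {"2+2": "4", "capital of France": "Paris", "3*3": "9"}[p]

print(should_switch(incumbent, candidate, eval_set,
                    incumbent_cost=1.00, candidate_cost=0.80))  # True
```

In practice the "push a button" step Mihai describes would wire this kind of gate into an observability dashboard and CI pipeline, but the decision logic reduces to exactly this comparison.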
>> Well, we'll just have to see. Uh I guess
this announcement 5.2, I'm sure we'll be
talking about it potentially next week
when it actually launches. uh and we'll
see how all these predictions play out.
But I think that's really interesting
and I think, Mihai, it's very
helpful to have this discussion on just
like so much of this is like we see the
competition but it's also on the
backdrop of the customers and just
seeing kind of what they do or how they
react to this stuff. Yeah. Mh. Sorry.
>> I'm just hoping OpenAI is going to do
the 12 days of Christmas thing again
like uh what was it last year?
>> You like that? That was a good gimmick
from last year.
>> 5.2 5.3 5.4 one model release every day.
>> Yeah. Exactly. until we get to 5.12 and
then they'll roll those six.
>> You just tweak the prompt every day and
you call it a five.
>> Exactly.
>> I'm going to move us on to our next
topic. Uh so we've talked about this
report before, but a number of
researchers at Stanford have come out
with the latest edition of their
transparency index. And so if you're not
familiar with this discussion from last
year, um the idea is that they're taking
a bunch of uh available models and
trying to rank and assess basically how
well these models do um from the point
of view of transparency, right? What
kinds of documentation do they provide?
What kinds of um you know data
disclosures do they have? Um and I've
always thought about this as like a very
interesting project. Um because when we
say transparency, it's a little bit like
open source, right? Where it's like what
do we exactly mean by that? And these
are attempts, I think, to kind of get a
lot more granular about, you know, what
we mean when we say transparency. Um,
and Kate, it was good to have you on the
show because I understand Granite was a
part of this transparency report. And do
you want to just talk a little bit about
how you all kind of approached it and
how it all turned out?
>> Yeah, so this is a report as you
mentioned that Stanford does annually.
We've participated in the past and it
really tries to break down model
development into three components. So
kind of upstream of the model, the training
itself, and downstream of the model. And
what they do is they send a survey out
to model developers like IBM uh training
our granite models both closed and open
model developers and they invite people
to participate and share information
about everything you know upstream of
model development like around data
curation. What models are you using to
generate data to train your models? Down to
the actual training
process? You know, do you release your
training code? Do you release different
repositories? Do you release um
different details about the architecture
of the model? And then uh downstream
of model use. So things around like
do you release benchmarks on safety? Do
you release details on uh gaps and
performance? do you release prompts that
were uh successfully used to attack the
model? So that type of thing and you
know what they've done and found is that
over the years you know transparency has
actually greatly diminished. If you look
between 2024 and this report that just
came out last week, uh, in 2025,
most labs have reduced the degree that
they are quote unquote transparent to
the degree that they share details about
these different facets of model
development. Um IBM's taking a very
different approach which I I'm really
proud of really focusing on transparency
and trust and being as open as possible.
I think it speaks to the rigor of which
we put together our strategy and
policies around how we train and develop
our models which is reflected in our ISO
42001 certification that we also received
this year. Uh and it allows us to just
be very forthcoming with what we're
working on, how we're building it, and
how we're contributing it to the open
source ecosystem. So we're really proud
that Granite got the top score 95 out of
100 I believe. uh and where other labs
were kind of going down in transparency
over time, you know, IBM demonstrated
that we are actually doubling down and
increasing the degree to which we're
transparent in model development.
>> Yeah. And that's 95 out of 100 like
different criteria basically.
>> Yes. Exactly. Different indicators uh
different questions do we answer and
provide detail. So it's not actually
looking at you know what was the result
on this safety benchmark. It's how
transparent are you on your safety
benchmarks? Do you share the benchmarks?
do you share this type of data? Uh,
which is a really cool approach.
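The indicator-based scoring Kate describes boils down to counting yes/no disclosures. A minimal sketch, with entirely hypothetical indicator names (the real index uses its own 100 indicators):

```python
# Hypothetical sketch of indicator-based transparency scoring: each indicator
# is a yes/no disclosure question, and the score is the count satisfied.

def transparency_score(answers):
    """answers: dict mapping indicator name -> bool (disclosed or not).
    Returns (satisfied, total)."""
    return sum(answers.values()), len(answers)

indicators = {
    "training data sources disclosed": True,
    "training code released": True,
    "safety benchmarks published": True,
    "known attack prompts shared": False,
}
satisfied, total = transparency_score(indicators)
print(f"{satisfied}/{total} indicators satisfied")  # prints "3/4 indicators satisfied"
```

The point is that the index measures whether a detail is shared at all, not how good the underlying result is, which is why a lab can score highly on safety-benchmark indicators regardless of its benchmark numbers.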
>> And I think one of the things I want to
if you want to speak if you could speak
a little bit more about it is um you
know especially because across a hundred
of these metrics um you know you have to
almost pick and choose right like the
team can't afford to try to do
everything or move everything forwards
on a year-to-year basis. Uh or or maybe
that is how the team is thinking about
it. I guess I'm interested in whether or
not there's like particular aspects of
transparency that the team said, "Okay,
this is here is what we're really going
to prioritize."
>> Yeah. So, I think over the past year and
a half, if you look at from where we
were in 2024 to 2025, we've done a lot
of work on automating uh and
standardizing our training and
development process so that there are
automated records of everything. That
makes it much easier to be transparent
and share because there's so many minute
details that go into these models.
Everything from, you know, when was a
data set acquired and what was the
license it was acquired on and what was
the source it was acquired and what was
the review process for it. And so we
actually invested heavily in the
architecture around all of that uh data
curation and training so that we can
have a very streamlined you know uh
lineage of our models that makes it
really easy to just be transparent and
open and have that information at our
fingertips. That also helps us with our
own regulatory compliance requirements
where we want to be obviously
best-in-class and able to respond to
changing regulations uh as they evolve.
and that made it uh possible for us to
be just a lot more open uh when it came
to the transparency index this year.
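The per-dataset record keeping Kate describes (acquisition date, license, source, review status) can be sketched as a small data structure; with that in place, a lineage report for a model is just a lookup over the datasets it was trained on. The class and field names below are illustrative, not IBM's actual schema.

```python
# Hypothetical sketch of dataset provenance records for model lineage.
# Capture acquisition details once per dataset, then transparency reports
# become a simple flattening of the records a model depends on.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    name: str
    source: str          # where the data was acquired
    license: str         # license it was acquired under
    acquired: str        # acquisition date (ISO 8601)
    review_status: str   # outcome of the review process

@dataclass
class ModelLineage:
    model: str
    datasets: list

    def report(self):
        """Flatten provenance into rows for a transparency disclosure."""
        return [(self.model, d.name, d.source, d.license, d.acquired)
                for d in self.datasets]

lineage = ModelLineage("granite-example", [
    DatasetRecord("webcorpus-v1", "public crawl", "CC-BY-4.0",
                  "2024-03-01", "approved"),
])
print(lineage.report())
```

Keeping these records automated at curation time, rather than reconstructing them later, is what makes both the transparency survey and regulatory responses cheap to produce.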
>> Um yeah, if I can bring you in, I mean I
think um Kate's already pointing out I
think one of the interesting trends,
right, which is obviously Granite
doubled down on this, but the general
trend is less transparency that we're
seeing. And you know, this actually goes
back to what we were talking about a
little bit earlier about like what the
market incentivizes. You know how I read
the transparency index is it's sort of a
dream of saying look people will be able
to look at the index and say I want the
more transparent model here's how I find
that right and the market will reward
people who are more transparent but if
anything it feels like there's there's
actually been a pullback on transparency
do you think that that means that the
market doesn't really value transparency
all that much
>> I think it depends on the type of
business they serve so I've noticed in
the report that B2B for example
companies tend to be more transparent
than B2C because regular consumers may
not care if they're running a 100
billion 200 billion, 500 billion
parameter model, how many GPUs it uses,
how much water or, whatever other metrics,
CO2 emissions are used by the model. Uh,
they may not necessarily care about the
cost to run the model itself. They care
about the cost to the end user. Uh while
B2B companies do need to care if they
make these models available to other
companies who are consuming it, who may
be running it on their own
infrastructure. Um the second
interesting trend I've seen is, like
you've pointed out uh it went from 74%
of the companies responding last year to
only 30% responding this year. So models
it's kind of curious, if you look, you
know, at xAI models, or, uh, models from
Anthropic, models from OpenAI, you don't
even know how many billion parameters
they have and you might not care. Um I
would see it from one perspective this
kind of information can be used against
them. Oh, look how much you know CO2 or
emissions this model is generating or
how inefficient it is. It can be used in
calculating
uh how viable their business is long
term. So for example, are they actually
um subsidizing a lot of their end users?
So a lot of this information I see um is
likely to become more transparent
in B2B companies. So, you know, AWS with
their Nova models and IBM with their
granite models and Nvidia and so on and
so forth are going to become likely more
transparent over time. Uh, while models
that are focused more on the consumer
market don't necessarily need to publish
those details. They probably will not
publish them anymore. Ambi, it almost
feels like there's going to be like um
on the consumer side almost like the
appleification of the world. Kind of
what I mean by that is like you know uh
uh you know if you go back 20 years
right it'd be like okay well we have
these open computing platforms and
you've got Apple and it's a battle
between open and closed and then over
time it kind of feels like everybody has
been like yeah actually like for the
consumer the general preference is
they're happy to pay more for a pretty
closed system that's pretty opaque. you
know, you have to go to a store and find
a genius to fix these computers for you.
Um, you know, that's kind of like the
state of play in consumer land. And then
on enterprise, of course, open source
has this like long and robust legacy and
is a huge huge huge business. Do you
sort of see that happening in the world
of kind of AI applications as well where
it's like it turns out that from a
consumer standpoint? Transparency is not
so important that it really is forcing
say forcing is a little strong but
encouraging companies like Anthropic and
OpenAI to say hey we're going to
participate in this index and we're
going to try to get a good score on this
index.
>> Well, partially, right? I always say that at the end of the day, we all sit in enterprises, but we're also consumers. We wear those two hats at the same time. It's not like we immediately switch on and off between a consumer hat and an enterprise hat; even when we're sitting in an enterprise, we think with a consumer lens, and vice versa. So the ways we think bleed into each other's domains. And here's what I've noticed: I feel like the market in general is maybe asking the wrong questions. Yes, there is the prioritization of IP, which is why, in these benchmarks, if you look at the downward trend in the metrics for most of the labs, there was a huge hit on the upstream component. But rather than debating whether there's a reward for labs to be transparent, I feel the right thesis is whether the market is asking the right question. What I mean by that is, I'll give you an example. Just earlier this week I was with a client, and they were talking about DeepSeek and asking, hey, we want to see if we should be using open-source models; what do you think about DeepSeek, and should we be using it right now? And this was in an enterprise setting. We talked about some of this in one of the earlier episodes: what DeepSeek did was open up the mindshare for open source. Everyone started thinking about open-source models and open-weight models and started talking about them. But I think there's a conflation of transparency with open source and open weights, which is not necessarily true. What most consumers and most enterprises are inherently asking for are transparent models, but they're phrasing it as, hey, can I get open-source and open-weight models? And those aren't necessarily the same thing. So I don't fully buy the argument that the market isn't asking for transparency or favoring it. Yes, of course there's an inherent tension between the labs saying, hey, I'm going to optimize for my IP, and the market saying, hey, I need some transparency. But there is definitely a demand for that transparency, I would say. It's just that people are asking the wrong questions, which means the signals aren't really coming up into these reports appropriately.
>> Well, I will say what's
interesting about the parallel you brought up, Tim, comparing to Apple: Apple has taken away a lot of the configurability and user visibility into the hardware, but they also have one of the best reputations for privacy when it comes to devices, and for responsible use of data and information. Deservedly or not, they've built a strong reputation there, and I think it's paying off with consumers. I don't see that quite yet in model development, but I think it's going to become more and more of a priority. Transparency is one way you can signal it, but it's not the only way. Anthropic didn't score as well on transparency, but they have ISO 42001 certification, and I think they're also very well known for their principles around ethical AI. So I think transparency is just one tool for addressing some of the broader societal and ethical questions, which are maybe not going to be the singular driving market factor, but will be an important market factor in the future.
>> Just to add on to that, I do agree with Kate, and I do think that will become a trend. Look back at social media as a parallel. When it started with MySpace and the early days of social media, privacy probably wasn't at the center of everyone's thoughts; it was about the cool thing and the ability to network. The capabilities were at the forefront. But when those capabilities matured and saturated, privacy moved up front. You had the shenanigans with Cambridge Analytica and things of that nature, the Congressional hearings popping up, and you started to see that pivotal shift happen. I feel like you're going to see the same with any new technology: the capabilities come up front, and once those become mainstream, you start seeing the privacy concerns and transparency aspects come to the forefront fairly soon.
>> Kate, maybe to wrap this section up: you're already scoring 95 out of 100. Where do you go next year? Do you work on that last remaining five? Are we already, in some sense, saturating the benchmark for transparency?
>> I think there are certainly always going to be new ways to think about transparency. We're moving from models being just a bag of weights released openly, in the case of Granite at least open-weight models, to systems of models and software built together. And that's going to introduce new aspects of being transparent: not just on the weights themselves and how the weights were created, but particularly around deployment, and the systems and software executing that deployment, because those details can have huge impacts on performance. I'd love to see the transparency index evolve to encompass those aspects. It's certainly something IBM is thinking about. One project we're working on is thinking through how you create a standardized AI bill of materials and make that a standard artifact that can be released with models. I don't want to give away too much, but expect some work from IBM on that to come out in 2026. I think there's going to be a lot more focus on standardization and a lot more focus on the deployment of these models. So there's still lots to do that we're eager to work on with the community.
>> Not done yet for sure.
>> I'd love to see more transparency over the infrastructure as well, right? The APIs they put in front of the models.
>> Absolutely.
>> Even the system prompt is kind of invisible if you're comparing the OpenAI model to ChatGPT as an end-user application. There's a lot of other stuff going on in there that's unknown.
I'm going to push us on to our final topic. The big Amazon AWS re:Invent conference was just the other week, with a number of really interesting announcements that we didn't get a chance to cover in previous episodes. Actually, it occurs to me that I'm running against myself now: I started the episode saying I'm bored of all these new model releases, and we're going to end with "and Amazon released some new models." So I'm a hypocrite, I suppose. The big news coming out of the conference is that Amazon announced its latest generation of Nova frontier models. I think Amazon has always been really interesting in this discussion, because they've always been kind of looming in the background: they have huge infrastructure, and they have incredible data from all the e-commerce business. So it seems very natural that at some point they would start making some very big swings in the AI space and the model space. Ambi, I guess the question for you is: is this the big swing? With Nova, it really feels like they're touting this as, we're now in the game. Are they in the game?
>> Well, there were Nova releases even last year, right? So Nova isn't completely new. Technically they're saying, hey, we were already in the game last year with the Nova releases. Some of the advances are par for the course: they're releasing speech-to-speech models, which others are releasing as well. But a couple of new advances came out. One is Nova Forge, which they're touting as, we're going to democratize multiple different mechanisms for you to go and build your own models. It's not just fine-tuning mechanisms. It's still murky exactly how they do this, but it's almost like, hey, we'll give you checkpoints, you come and blend in your data, and you build your own custom pre-trained models, almost from scratch. We're going to democratize it so enterprises can just go and do it; you don't have to have a complete research lab. So some of that is really exciting. The question, again, if I put an enterprise lens on it: great, but how many of those capabilities are going to be used, for how many enterprise use cases? A large mainstream set of use cases can be largely driven with your models out of the box, with appropriate integrations. You may not need custom fine-tuned models, or even custom pre-trained models, for a good chunk of the use cases. So: great capabilities, a great push on the engineering side of things, fantastic if you look at it as an engineer, but we're also trying to think about the enterprise value and how that slots in. There's another one, Nova Act, which is the enterprise equivalent of the OpenAI browser use or the Gemini browser use. The differentiation they talk about is, hey, we have trained it on enterprise screens. So it's not doing Instacart shopping; you're training it on CRM screens, and we think we're way more equipped to handle those sorts of enterprise screens. It's still early days, but I think that piece is actually exciting, because let's all be honest: there's always going to be a data and API issue, and there are always going to be questions of, hey, do I have the most clean and hygienic data in my enterprise? So we're all thinking, hey, the browser-use applications and capabilities can be fairly promising where you don't have ready access to data; you just mimic the human actions. It's a promising capability, but there are obviously a lot of open questions about the security of how that will work. Good, promising, still to be seen.
>> I'm not a fan of training or fine-tuning models for most enterprise use cases, mostly because whenever you talk to an enterprise, they first assume they have the data, second they assume they have the GPUs, and third they assume they have the investment necessary to continuously fine-tune or retrain a model every single time their data evolves or changes. The reality is that large language models on their own are insufficient for the vast majority of enterprise use cases. Why? They've been trained on last year's data, and they've been trained on public data. So you want to blend that with your enterprise data. But we've seen that techniques like RAG, graph RAG, and agentic RAG, as well as tool use, whether through MCP servers or all sorts of other techniques, provide sufficiently good access to real-time data and real-time information without the need for expensive tuning, training, or fine-tuning.
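The pattern described here, retrieval plus tool use instead of tuning, can be sketched in a few lines of Python. Everything below is a hypothetical stand-in: `search_documents` fakes retrieval with keyword overlap, `get_inventory` stands in for a live tool call (say, an MCP server in front of an inventory system), and in a real system the assembled prompt would be sent to an LLM.

```python
# Minimal sketch of retrieval plus tool use as an alternative to fine-tuning.
# All names (search_documents, get_inventory, answer) are hypothetical stand-ins.

def search_documents(query, documents, top_k=3):
    """Naive keyword retrieval: score each document by query-term overlap."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def get_inventory(sku):
    """Stand-in for a live tool call (e.g., an MCP server hitting an ERP system)."""
    return {"SKU-42": 17}.get(sku, 0)

def answer(question, documents):
    # 1. Retrieve relevant enterprise text (the RAG step).
    context = search_documents(question, documents)
    # 2. Call a tool for real-time data the model was never trained on.
    live = get_inventory("SKU-42")
    # 3. Assemble a grounded prompt; a real system would send this to an LLM.
    return f"Context: {context} | Live inventory: {live} | Q: {question}"

docs = ["SKU-42 is our flagship widget", "Returns policy lasts 30 days"]
print(answer("How many SKU-42 widgets are in stock?", docs))
```

The point of the sketch is the division of labor: fresh, private data enters through retrieval and tool calls at request time, so the underlying model never needs retraining as the data changes.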
I think the proposition is for the very, very few companies that employ hundreds of data scientists who really make it their passion to train and fine-tune models. Even if you're doing it on, quote unquote, somebody else's infrastructure, and even if you're not starting from scratch but from a checkpoint, you shouldn't underestimate the effort it takes to properly train or even fine-tune a model to a specific domain. And you shouldn't underestimate the vast amount of data, or the quality of data, that's required. So I would say most folks should stick to agents. That's why I like the fact that Amazon provides a one-stop shop for everything.
>> Not that you're biased or anything.
>> No, no, but look, they have the other option, right? They have their AgentCore; they have agents. You don't like this, we have that. So I would say: don't fine-tune or train a model unless you really have to and you know what you're doing. It's very unlikely that the resulting model is going to outperform a frontier model plus tool use. And even if it does, now you have to do that every week or month, or whatever the refresh rate of your data is. Still, it's exciting if you're in that space. So if you're in that 1% of companies that do need that service, and you can't buy the GPUs required and you need to run it as a service, it's awesome.
>> This is actually kind of fun, because it flips the narrative from what we were talking about earlier, right? Earlier I was saying, consumers don't want complexity, they don't want transparency, but enterprises do. And you're coming back and basically saying that, actually, most enterprises don't want that either. So Kate, do you have any thoughts on this?
>> Well, I agree with everything that's been said. The only comment I'd add is that where I do think there could be something interesting is around the research and academic community, when it comes to these new reinforcement-learning-as-a-service and tuning-as-a-service capabilities in Nova Forge. I thought it was really cool that they're offering early checkpoints, partially trained versions of the Nova Lite model that can then be further customized. That's one benefit that could come out of this. While I'm skeptical of the direct enterprise value, and I think it's going to be a lot harder than people anticipate to get a specialized model using SFT or RL, I do think that offering more of these components could enable more engagement from the research community, which is otherwise hampered because they don't have access to early checkpoints. They even have a part of the service where you can mix your own data with the training data for continued training. Those are all really interesting things that could hopefully spur more innovation that the field could benefit from, and engage a user group that's kind of been left on the sidelines, not able to participate fully.
>> Yeah, definitely a constituency we don't talk about enough on this show, but we should. Mihi, maybe I'll give you the final word of this episode, and maybe a little bit of a peek into the future. One of the fun tidbits Amazon announced when releasing the Nova models is that they've been making the claim that their frontier agents can operate for hours or even days on end, which is, I think, very intriguing. Regardless of how credible you think that claim is, we seem to be headed toward this really fun world where you say, "Okay, computer, I need you to help me out with something," and it comes back three weeks later and says, "Here's what I did." Are we headed for that world? Certainly the technology will be able to do something in those three weeks, but I'm curious whether you feel we're finally getting these agents aligned enough to get there.
>> Yeah, my agents can operate for weeks on end, too. It doesn't mean I'm getting good results out of them.
>> Actually, you can have them run for years if you want.
>> It's not an issue. I have a timeout I can tweak. I can keep it going and going and never return a final answer. Just tell me how many tokens you want me to consume.
>> Yeah, that's right. So look, what is improving is tool use. What we're seeing is improvement in tool use: the number of tools that can be called, the number of tools that can be called in parallel, and the number of sequential tool calls. And techniques like map-reduce, or using vector search or tool search to find the right tool, enable these kinds of continuous use cases. Say you're building a document, or take a PowerPoint deck because it's even easier to visualize, and you're building slide one, slide two, slide three, slide four. Each of those can be an independent tool call, and it can keep going and going if you're managing your context right. If you think about what's preventing us from running agents continuously today, it's just how difficult it is to properly manage that context. You're working with the limited context of the LLM doing the tool orchestration: everything needs to fit in the context within an execution, and then you need techniques to manage that context, like compaction. If you've used Claude Code or Codex, you've seen that at some point it starts to compact: it literally summarizes what you have in your context down to a state that's good enough for it to continue from. All of these techniques are coming together, and we're seeing longer and longer running agents.
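The compaction behavior described here can be sketched as a token budget plus a summarization step. This is a minimal illustration, not how Claude Code or Codex actually implement it: `summarize` is a stub (a real agent would ask an LLM for a faithful summary), and the word-count tokenizer is deliberately crude.

```python
# Sketch of context compaction for a long-running agent, assuming a token
# budget and a summarizer; both are hypothetical stand-ins for real components.

def count_tokens(text):
    """Crude token estimate: whitespace-delimited words."""
    return len(text.split())

def summarize(messages):
    """Stub summarizer: keep the first sentence of each message.
    A real agent would ask an LLM for a faithful summary instead."""
    return " ".join(m.split(".")[0] + "." for m in messages)

def append_with_compaction(history, message, budget=50):
    """Add a message; if the context exceeds the budget, compact older turns."""
    history.append(message)
    while sum(count_tokens(m) for m in history) > budget and len(history) > 1:
        # Fold everything except the newest turn into one summary message.
        history[:] = [summarize(history[:-1]), history[-1]]
    return history

history = []
for step in range(20):
    append_with_compaction(history, f"Step {step}: produced slide {step}. Details follow here.")
print(len(history), sum(count_tokens(m) for m in history))
```

The invariant is what matters: after every call the history fits the budget, so the slide-by-slide loop can, in principle, keep running indefinitely, at the cost of older detail being progressively summarized away.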
Microsoft has Researcher. ChatGPT and Gemini have their deep research functionality. Amazon has similar techniques, and we have similar techniques; we've built our own deep researchers. I think, at the end of the day, this is something we're going to see more and more of, because if you want good results on enterprise use cases from AI, you want it to touch all of your data, and that means hundreds, potentially thousands, of tool calls. RAG is not enough. With RAG, what you're doing is selecting ten paragraphs, give or take, from whatever you're searching, giving them to the model, and hoping for the best. What I would like to do is give it all of the data: summarize this, and this, and this, and keep going and going. It's expensive. But in some cases, if you're putting together a complex deliverable, like an RFI response document, an RFP response document, or a go-write-me-a-book-and-come-back-with-300-pages-on-this-topic request, you need that depth. So I do see a natural evolution of agents within the enterprise space adopting this kind of deep researcher functionality, with agents that can run for ten minutes, an hour, perhaps even overnight, to come back with a very complex response.
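The "touch all of the data" idea, as opposed to RAG's top-k slice, is essentially map-reduce over documents. A minimal sketch, with `summarize_chunk` as a hypothetical stand-in for an LLM call that extracts what's relevant to a focus term:

```python
# Sketch of map-reduce summarization over all documents, rather than a top-k
# retrieval slice. summarize_chunk stands in for an LLM summarization call.

def summarize_chunk(chunk, focus):
    """Hypothetical LLM call: keep sentences mentioning the focus term."""
    hits = [s.strip() for s in chunk.split(".") if focus.lower() in s.lower()]
    return ". ".join(hits)

def map_reduce_summary(documents, focus, batch=2):
    """Map: summarize every document. Reduce: merge summaries in batches."""
    partials = [summarize_chunk(d, focus) for d in documents]  # map step (parallelizable)
    partials = [p for p in partials if p]
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), batch):
            merged.append(summarize_chunk(". ".join(partials[i:i + batch]), focus))
        partials = merged  # each reduce round shrinks the list
    return partials[0] if partials else ""

docs = [
    "Revenue grew 12% in Q3. Headcount was flat.",
    "The Q3 revenue growth came from the EMEA region. Offices moved.",
    "Churn fell in Q3. Revenue per seat also rose.",
]
print(map_reduce_summary(docs, "revenue"))
```

Each map call is independent, so the expensive step parallelizes, and the reduce rounds shrink the partial summaries until one answer fits in a single context window. That is what makes the cost defensible for deliverables like RFP responses, where every document has to be touched.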
>> Tim, I want to add a nuance to what Mihi said, and Mihi is absolutely right: you have to contextualize all of this. But that's not to discount the advances the field is seeing. You have to look at this in two dimensions. It's not just about the amount of time an agent or a model or a system is running; it's also, when it's running for that long, how reliable or how accurate is the outcome of the task it's accomplishing. That curve has definitely shifted to the right. A couple of years back, we would have said high accuracy holds on the order of a few seconds; then it became a few minutes; and now we're definitely in the realm of a few hours. So the curve is definitely shifting, but it's important to recognize it's not just how long it's running; it's how long it's running while doing it reliably, with high accuracy.
>> Yeah, and in-the-loop evals also help with this. If you have agents that can self-evaluate at intermediate checkpoints, and retry or take a different direction, that's going to help improve them over a longer-running execution cycle.
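The self-evaluate-and-retry loop mentioned here can be sketched as follows. The approach names and judge scores are made up for illustration; in practice `evaluate` would be an LLM-as-judge or a battery of checks at each intermediate checkpoint.

```python
# Sketch of an agent that scores its own intermediate result at a checkpoint
# and, on a low score, retries with a different approach. The approaches and
# their scores are hypothetical stand-ins for real strategies and a real judge.

APPROACH_SCORES = {          # pretend judge scores per strategy
    "draft-directly": 0.55,
    "outline-first": 0.70,
    "retrieve-then-write": 0.90,
}

def attempt(task, approach):
    """Stand-in for running the agent one step with a given strategy."""
    return {"task": task, "approach": approach, "score": APPROACH_SCORES[approach]}

def evaluate(result, threshold=0.8):
    """Self-evaluation checkpoint: in practice an LLM judge or unit tests."""
    return result["score"] >= threshold

def run_with_retries(task, approaches):
    trace = []
    for approach in approaches:    # on failure, "take a different direction"
        result = attempt(task, approach)
        trace.append(approach)
        if evaluate(result):
            return result, trace   # passed the checkpoint; stop retrying
    return None, trace             # budget of approaches exhausted

result, trace = run_with_retries("draft section 3", list(APPROACH_SCORES))
print(trace, result["score"])
```

The design point is that the loop converts a long-running execution from "hope the first pass was right" into a sequence of checkpointed bets, which is what makes multi-hour runs worth their token cost.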
>> Yeah, I think that's right. Part of it is just going to be these tradeoffs, but I do think the frontier is going to keep increasing. It's something to pay attention to, particularly because I think this will be the new frontier of claims made about agents: you can run them for a week, you can run them for two weeks. So the question now will be, how do we measure that? How do we quantify it? It'll be very interesting to see. Well, that's all the time we have for today. Kate, Ambi, Mihi, thanks for joining us as always, and happy holidays. And thanks to all you listeners. If you liked what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we'll see you next week on Mixture of Experts.