Europe's Mistral Medium 3: AI Contender
Key Points
- Europe’s AI landscape may not lead in building the largest models, but it can “define the rules of the road,” offering a strategic advantage despite trailing the U.S. and China.
- Mistral’s new Medium 3 model claims 8× lower operating costs and on‑premises deployment capability, positioning “medium is the new large” for enterprises seeking more affordable, locally‑hosted AI.
- Critics note that Mistral’s focus on larger‑scale models (up to 70 B parameters) leaves a gap in the open‑source ecosystem, which still heavily relies on smaller models (e.g., 3–8 B) for broad developer adoption.
- The episode also touches on broader industry moves, including a major chip shipment to Saudi Arabia, fresh benchmarking releases, and an experimental initiative involving AI‑generated advertisements.
Sections
- Untitled Section
- Mistral's Quiet Open-Source Resurgence - The discussion weighs Mistral's medium‑sized, open‑weight models, performance focus, and low inference costs against its lower public profile, questioning if the company remains competitive in the rapidly evolving open‑source AI landscape.
- Local AI Trends and Model Shrinking - The speaker warns that launching AI firms in Europe is risky due to heavy regulation, predicts a shift toward region‑specific, language‑optimized models and rapid reductions in model size as performance improves, and cites Mistral’s strategy as an example of adapting to these economic and technological pressures.
- Debating AI Geopolitics and Model Performance - A speaker challenges US‑centric narratives by highlighting European AI breakthroughs such as Mistral’s strong performance, questioning memory biases, and criticizing limited access to advanced AI resources.
- Saudi AI Factories Power Scale - Panelists examine NVIDIA's partnership with Saudi investors to deploy hundreds of thousands of GPUs and a 500 MW data‑center capacity, comparing it to conventional ≈20 kW racks to illustrate the challenges of scaling AI infrastructure.
- Saudi Arabia's AI Power Play - The speaker notes Saudi Arabia’s aggressive acquisition of AI chips, compute capacity, and language models to become a global AI contender, while warning that success will also require building an open ecosystem, skilled talent, and responsible development frameworks.
- Sovereign Wealth Funds Powering Digital Infrastructure - The speaker explains how nations like Singapore deploy sovereign wealth fund capital to build fiber‑optic networks and data‑center hubs, converting oil money into hardware assets such as GPUs to attract tech businesses and establish regional dominance.
- Emerging Health Benchmarks and Fragmentation - The speakers discuss OpenAI’s new Health Bench and IBM’s IT Bench, then debate how the growing number of specialized benchmarks may fragment evaluation standards and complicate model selection for health applications.
- Rethinking State‑of‑the‑Art AI - The speakers critique the relevance of universal “state‑of‑the‑art” claims and benchmark charts, arguing that the diversity of AI use cases demands tailored, self‑created evaluations instead of generic marketing hype.
- Fine‑Tuned Models vs Benchmark Marketing - The speaker argues that specialized, fine‑tuned models will outpace general ones, questions how future releases will prove superiority beyond standard benchmark charts, and highlights the diminishing marketing impact of benchmark bragging in favor of concrete performance demonstrations.
- Shifting Benchmarks Toward Specialized Agents - The speaker argues that evaluation should move from broad, generic models to narrow, task‑specific agents—allowing deeper assessment—and then transitions to a forthcoming anecdote about an Amazon Prime Video AI hook.
- AI-Driven Hyper-Personalized Advertising Concerns - The speakers discuss worries about intrusive, AI-generated contextual ads and speculate on Amazon’s new native advertising feature that lets AI create seamless, hyper‑personalized placements.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=kjRHpuXGmH0](https://www.youtube.com/watch?v=kjRHpuXGmH0)
**Duration:** 00:36:35

Timestamps:

- [00:00:00](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=0s) **Untitled Section**
- [00:03:02](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=182s) **Mistral's Quiet Open-Source Resurgence**
- [00:06:09](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=369s) **Local AI Trends and Model Shrinking**
- [00:09:15](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=555s) **Debating AI Geopolitics and Model Performance**
- [00:12:19](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=739s) **Saudi AI Factories Power Scale**
- [00:15:24](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=924s) **Saudi Arabia's AI Power Play**
- [00:18:30](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=1110s) **Sovereign Wealth Funds Powering Digital Infrastructure**
- [00:21:35](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=1295s) **Emerging Health Benchmarks and Fragmentation**
- [00:24:39](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=1479s) **Rethinking State-of-the-Art AI**
- [00:27:44](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=1664s) **Fine-Tuned Models vs Benchmark Marketing**
- [00:30:53](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=1853s) **Shifting Benchmarks Toward Specialized Agents**
- [00:34:00](https://www.youtube.com/watch?v=kjRHpuXGmH0&t=2040s) **AI-Driven Hyper-Personalized Advertising Concerns**
Mistral is France's National champion in AI will it make Europe a global contender
for the technology in the years to come?
Chris Hay is a Distinguished Engineer and CTO of Customer Transformation.
Chris, welcome back to the show.
What do you think?
The US is just following in Europe's footsteps.
All right, great.
Volkmar Uhlig is Vice president AI Infrastructure Portfolio Lead.
Uh, Volkmar, what do you think?
I think the jury is still out, but I hope for the best.
Alright, great.
And Kaoutar El Maghraoui is a Principal Research Scientist and Manager for
Hybrid Cloud Platform, uh, Kaoutar.
Welcome back.
What's your take?
I think Europe may not win the race to build the biggest model, like what's
happening in the US and China, but it has a big, uh, opportunity
to still define the rules of the road.
All right?
All that and more on today's Mixture of Experts.
I am Tim Hwang and welcome to Mixture of Experts.
Each week, MoE brings together a world-class team of researchers,
engineers, and product leaders to discuss and debate the biggest
news in artificial intelligence.
As always, we have a ton to talk about.
We're gonna talk about a big shipment of chips to Saudi Arabia.
Some new releases in the world of benchmarking and a new
experiment around AI generated ads.
But first I really wanted to talk a little bit about Mistral
Medium 3, which was a launch that happened, uh, just a few weeks back.
Uh, it's part of a, a class of models that they've been working on for some time.
Um, they tout 8x lower costs and really kind of the opportunity to do on-premises deployment with this Mistral Medium series of models, and their tagline is "medium is the new large."
Um, and so I guess Chris, maybe I'll start with you.
You know, we haven't talked about Mistral on the show for quite a while, and I think
one of the reasons I wanted to bring them back up again was obviously there's this
new release, but it kind of offers the question of like, is Mistral still kind
of a contender in this open source space?
Um, and curious about how you size that up.
I think so I love Mistral first of all, and to come back to my earlier point,
right, the, the Mistral team were the original folks that came up with the Llama
models in the first place, Llama one.
So they are great innovators.
I love the Mistral models.
I especially love the Mistral 7B from a few years ago.
Um, and I think the new Mistral Medium
is great, but maybe the criticism I would give is that have we
had that sort of 7, 8B model or even a 3B type model from them?
They've been focusing on like what the, the Mistral Medium three is probably
what, a 70 billion parameter model.
Even they're small model.
They really
recently was a 24 billion parameter model.
So when we think about the world that most open source developers
live in, it is the world of a Llama.
It's the world of Hugging Face, and therefore you need the smaller models.
So by not having those smaller models out there, that maybe just sort of
fell out of our thinking a little bit, but what they're doing in the lab.
Mistral Medium 3 is a fabulous model.
It's super fast, it's, it really is state of the art.
But again.
Another thing I would criticize is they haven't put a reasoning
model with that as well.
So it is as good as it is.
We've also moved on to reasoning models and I think they just
need to kind of push that.
But I am, I'm confident they're gonna do some great stuff and, uh, and, and
we're gonna see a resurgence of them.
Yeah, for sure.
Uh, Kaoutar I'm curious if you agree, I mean, I think Chris is making a sort
of interesting point, which is like,
Medium is maybe still too large for a lot of what's happening in, uh, in enterprise.
Um, you know, it kind of sounds like at least the story that Chris is telling is
almost that they, they kind of like are sort of missing the boat on where the kind
of current competition is and where the current heat is in the open source space.
Do you think that becomes a problem for them going forwards?
I, I think so, actually.
Um, you know, I think Mistral's strategy, you know, has been these open weights. They also focus on high performance, and also on low inference costs. Those have all been great, and while, you know, they haven't been making as many headlines as OpenAI or Anthropic right now, I think their consistency in releasing these strong open models is worth noting.
So the Medium 3, for example, ranks competitively on standard leaderboards.
And I think their commitment to, um, the open source community is a very rare stance in today's increasingly closed, you know, ecosystem.
So the question is, you know, has Mistral really faded, uh, from the race?
Or are they quietly building the foundations for long-term impact
in the open weight AI ecosystem?
Which I think, you know, I think they're heading that way.
Of course, like, uh, Chris mentioned, you know, they, they need to fix, you
know, the reasoning, uh, aspect of their models, but I think they'll get there.
Yeah, for sure.
Volkmar, I think one of the interesting parts about Mistral for
me is that it feels like, you know.
Countries increasingly have like their big AI champion, right?
So there's like DeepSeek in China and the US arguably OpenAI.
But a number of companies, um, and I think for a long time Mistral really was kind of like the big hope of Europe, and certainly of France, on the idea that they would have a national champion that'd be able to kind of lift many boats and help build sort of an AI industry and sort of leadership for the continent.
I'm curious if you kind of buy that as a thesis.
I know you said in your opening remarks that you wish them all the best, which I guess the more critical way of saying that is you don't know if they have a very strong hope.
Um, but curious if you wanna talk a little bit more about that.
Okay.
That's a loaded question.
Um, if you look where GPUs get deployed, there's a very strong concentration in the States and in a couple of countries which are trying, like we are talking about this with the Saudis. Um, there's, you know, there's investment going on, but if you look at the majority of GPU deployment, it still happens in very, very few places in the world. Like, you know, the 50,000 or hundred-thousand GPU clusters are pretty much only in the States.
And so I think there is a concentration of capital and a concentration of skills, uh, which is clearly not in Europe's advantage.
Uh, Europe is really good at writing regulations right now,
but they're regulating something which they cannot build themselves.
And I think this is a real big danger.
And so I think starting an AI company today in Europe, um, is kind of insane, because, you know, it's a not-well-understood market and already overregulated, and so I think Mistral is kind of, like, you know, walking the line.
I think in general what we are going to see is that, um, companies
are going to focus on their local markets.
Um, if you look just from a language perspective, all these models are very
English focused and then all the other languages are almost translations.
And so I, and you know, the Chinese said, okay, we wanna have a Chinese first model.
Um, and so I think there will be kind of local champions.
Uh, I think also that, um, we see a trend, to what you're saying: like, medium is the new large, I would say, and small is the new medium.
And every six months that's the case, right?
So if you look at the capabilities of the models, what we could only
do in huge models is now moving into something you can run on your laptop.
But it took like two or three iterations.
And so I think Mistral is just adapting to that general trend, you know, that the technology is now at a point that I don't need a 70B model anymore to get that type of performance.
Um, and it's just, you know, the nature of, of the beast.
I think also the smaller models are just much more economic.
And so there is just economic pressure, right?
The moment you deploy this at scale and you don't have money to burn
anymore, then you need to actually look at what the cost footprint is.
And I think that's probably a reaction of Mistral as well.
Like in some ways, like you have to be close to where the clusters are
to build the talent and expertise.
Yeah.
I, I think the, the capital follows the talent.
The talent follows the capital, and so you are in a world where Europe just doesn't have deployments.
So the people who are really good, they come here.
I mean, you need to go somewhere where someone is willing to pay for 50,000 GPUs.
That's not happening in Europe.
Yeah.
I think I second Volkmar's point, you know. Where Europe is actually lagging today is, you know, the compute infrastructure. What we see is the continent is really short on sovereign AI compute; there is no European equivalent of the A100- and H100-scale clusters that you find in the US and even in China.
Uh, there's also, I think, a lack in VC funding, so the startups in Europe today are struggling to access the scale of funding that fuels Silicon Valley and the Chinese tech ecosystems. These are the kinds of things that are slowing their progress in foundation model development.
So, you know, Europe doesn't have a direct equivalent of OpenAI or Google DeepMind. So Mistral, you know, even if they're the champion, like, you know, as Volkmar pointed out, access to compute is very important.
So we need to have both.
Chris, do you buy this?
I mean, I think my note of skepticism is like.
We live in the 21st century, right?
Like there's the internet, like the idea that you have to be kind of
proximate to all the compute in order to build a, a strong AI industry.
That I guess, I know it works a little bit against my intuitions,
but Chris, do you wanna jump in?
I don't buy any of that in the slightest.
Have we all got short term memories or something?
What were we saying about China?
What, six months ago, we're like, oh, the US is the greatest, China doesn't have access to any H100s, and then DeepSeek comes along and we all go, uh, well, okay.
Uh, whoops.
And I think that's the exact same case.
I mean, let's look at what Mistral has done there.
Right?
We're sitting arguing about medium, but it's one of the strongest models
that is out there that's non reasoning.
So they've shown that in the benchmarks.
It is a great model if you go and use it.
It is a fantastic model, right?
And, and again, I'm guessing it's at the 70 billion parameter scale, but that's phenomenal. What they've done is outperform the Llama 4 Maverick model, for example, and we're saying, Europe, you're useless?
Sorry.
You just created one of the best models, and again, let's go back
to being Europe for a second.
Who is leading Google's Gemini model?
Demis Hassabis, right?
Where does he come from?
Not America, right?
If we look at OpenAI, who started that from an architectural point of view?
Ilya Sutskever.
Where did he come from?
Not America, right?
So there is lots of European innovation coming through.
European companies are building stuff and they don't have access to AI clusters.
The first part is right. Maybe not the second part.
Do you wanna get into that? Yeah. You just called out individuals, and if you look at, uh, you know, the top companies founded in the Bay Area, like 50% or so are not founded by Americans, like American-born.
That's a non-argument, because what you are saying is, like, my first citizenship matters.
It's like, no, it matters where you actually build your company.
Okay, but, but let's take the Llama models, which were built in France, right? That was the team that was doing that.
So innovation is coming from Europe, right?
Hang on.
Two things: where are the businesses, and is that innovation?
Europe has the brain power.
Europe has no money to actually fund it, and that's why it's
all funded in the United States.
If you look at the amount of capital which is spent in the US on new technology and the amount of capital spent in Europe, there's nothing spent in Europe.
So therefore, I'm very happy we got a full brain drain.
Let's take all these people and make it happen in the United States.
But we literally had this argument six months ago about DeepSeek right?
So I'm telling you that's a
different story.
Completely different
story. I don't think
it is.
I think so.
I do not think it is,
Yes, because in one case you have a wall and people cannot get out, and in the other case you don't.
No.
The DeepSeek team, they have the GPUs, but they probably didn't have the latest, you know, top-scale GPUs, but they were very clever in how to take those limits and even work at the PTX level of NVIDIA, you know, of the CUDA stack, so they can basically overcome the limitations that they have.
So I think, you know, I think there are two things here we're talking about.
There is the brain power, and there is, you know, the infrastructure and the deployments.
So the research, the R&D, yeah, it could come from anywhere. But then when you wanna deploy these things at scale and see the business value, that's, I think, where we see the lag.
So this is a very nice segment or a nice segue into our next segment, actually.
So, uh, it's another story I wanted to kind of bring up and have
the panel react to, but I think it'll actually be a continuation
of this discussion in some ways.
Um, super interesting news coming out of Saudi Arabia.
Uh, this week, uh, NVIDIA announced that it will be collaborating
with the Saudi investment funds to build what they call AI factories.
Um, and they are projecting sort of the deployment of several hundred thousand,
uh, advanced processors in Saudi Arabia over, you know, the next period of time.
Um, and they're promising sort of a data center capacity of
as much as quote 500 megawatts.
Um, the first step of this is gonna be a shipment of 18,000, uh, GB300 Grace Blackwell chips.
Um, and I think maybe Volkmar, just to turn it to you first, because I know this is your world.
Can you gimme a sense of like.
What is 500 megawatts compared to where we are today?
There are two, there are two viewpoints on this.
That's a lot and it's nothing.
Okay.
Um, if you take the latest announced NVIDIA rack... so first, a typical rack in a data center is about 20 kilowatts. If you're really trying to push the envelope, it's 30 or 35, but that's kind of on the edge. So, 20 kilowatts. If you take 500 megawatts and do the math, it's thousands of racks.
Now, if you look on the flip side at what NVIDIA announced, uh, with the next-generation rack, which is, you know, a petaflop in a rack, it's 600 kilowatts per rack. Okay? So if you take your 500 megawatts, then you can fit, you know, 800 of these racks in a data center. Now, 800 racks is still a very large installation, but, like, it's not 20,000 racks, right?
So we are in a world where it's a sufficiently large
deployment to actually make a dent.
Um, for a whole country, that's probably not enough.
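[Editor's note: Volkmar's back-of-the-envelope numbers can be checked directly. A minimal sketch, using the figures as quoted in the discussion (500 MW of capacity, roughly 20 kW for a conventional rack, roughly 600 kW for the announced next-generation rack):]

```python
# Rack counts for a 500 MW build-out, using the per-rack
# power figures quoted in the conversation.
capacity_w = 500e6           # 500 megawatts of data-center capacity

conventional_rack_w = 20e3   # ~20 kW for a typical rack today
next_gen_rack_w = 600e3      # ~600 kW for the announced next-gen NVIDIA rack

print(f"conventional racks: {capacity_w / conventional_rack_w:,.0f}")  # 25,000
print(f"next-gen racks:     {capacity_w / next_gen_rack_w:,.0f}")      # 833
```

[That matches both halves of the point: "thousands of racks" at today's densities, but only about 800 of the 600 kW racks.]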
Kaoutar, one of the things we were talking about a little bit earlier, I guess, was kind of Volkmar's thesis that, like, the talent follows the capital.
And so I guess maybe you could draw a comparison where you say,
well, Europe's got the talent, but it doesn't have the capital.
This seems to be a case where, you know, the Gulf states, they have the capital, and now the question is whether they're gonna be able to bring sort of the talent to really build out a much broader AI kind of industry.
I'm curious about how you think the prospects are of, you know, these kinds
of deployments being the thing that sort of allows you to sort of trigger sort
of global leadership in the technology.
Yeah, that, that's a very interesting point here.
It's like what we're seeing in Europe, you know: talent is there, the capital is lagging.
I think what's happening in Saudi Arabia, so the deal that's happening, you know, with NVIDIA and AMD and the US, you know, it's marking a major milestone in the rise of what we call sovereign AI infrastructure.
So Saudi Arabia here is not just buying chips. They're taking kind of a stake, and they're trying to claim, you know, uh, basically their place in the future of compute.
So, uh, with, you know, the Grace Blackwell deployments, they're trying also to commit, you know, to these Arabic-language LLMs and, uh, nearly, like, I think about two gigawatts of AI power on the horizon. So it's not just a regional experiment; I think it's a global power play that they're trying to make here.
And, um, I think from the US side, this is also reflecting a clear shift in AI export strategy: with China restricted, Saudi Arabia and the Gulf nations are emerging as, you know, the US's preferred strategic partners for high-end AI.
But the issue here is, you know, these sovereign AI efforts, they need more than just the silicon, more than just the compute. They'll also need the open ecosystem, the talent, the responsible development frameworks to really, truly compete.
So I think we'll, we'll have to see, you know, what Saudi Arabia is doing.
You know, they're getting this huge infrastructure, they're
building this huge infrastructure.
But I think the ecosystem and then the talent, you know, they
still need to work on that.
Well, Chris, maybe I'll bring you in.
I mean, 'cause I think you were a Europe booster, um, and I'm kind of curious if you think you're kind of a Gulf States booster now, you know, given these investments.
You know, will they be able to bring the talent like in the future?
Will a European researcher say, well, I could go work for a company in the US or
I could go work for a company in the Gulf.
Like do you think those dynamics will start to happen as we see these
deployments get bigger and bigger and bigger in say, Saudi Arabia?
I think they already have the talent.
So I've spent quite a bit of time in Saudi Arabia with a bunch of these companies, and they're heavily investing in AI.
And again, even if we go back a couple of years, um, just after the Llama models came out, the first ones, what was the most popular model at that point?
It was the Falcon models.
And where did they come from?
Saudi Arabia.
So they already have the talent within the region, um, between Saudi and the UAE, who have been sort of developing, uh, some of these models in region.
And I think, therefore, we're gonna see talent flock towards that infrastructure, you know, to Volkmar's point earlier.
Um, so I, I think you're gonna start to see stuff coming out of them.
And, and back to my earlier point, I think even with that amount of
infrastructure, when you place constraints, people get more creative.
So I think they're gonna do interesting things. And then, again, maybe taking sort of language-first models as well, you're gonna start to get different flavors as opposed to an English-first model.
Again, it was brought up earlier, so I think we're gonna start to see
different flavors of models that may have different and newer capabilities,
which will become a plus to the overall world ecosystem of models.
So I'm positive on that, and I think it's a good thing.
If you look 20 years back, we had a similar, uh, kind of investment cycle. And this was, uh, the massive build-out of internet infrastructure. So this was fiber optics in the ground, data centers, you know, computers connected.
And so I think there's a certain repeat, like, with those nations, because they, you know, have a different investment philosophy. Uh, like, look at Singapore. Singapore made the decision: we wanna be the data center hub and the fiber optic hub of that region, and they poured billions of dollars into it, and that attracted a lot of business.
I think those nations are, because of, you know, their central command structure, very, very advantaged in pushing those types of large-scale infrastructure investments out of, let's say, a sovereign wealth fund, and saying, okay, we put this capital to work.
Um, they usually are more challenged if it's a, you know, pure IP play.
Like, okay, we need to build some software, uh, because, you know, then
they don't have a competitive advantage.
So I think they're all playing to their strengths here, saying, okay, we take, you know, oil money and convert it into GPUs, uh, and thereby create that, that suction sound of getting, you know, people into the region.
So I think it's a very natural play for those types of, um, regional players trying to establish, you know, dominance in a new emerging field, which has this, oh, by the way, we need to put 20, 30, 40 billion dollars down. Um, you know, which you don't do out of normal private funding; you need someone like the US.
Right. It's like government scale.
Exactly.
It's government scale and that's exactly where they can shine and
that's where you're seeing it.
Yeah, for sure.
And maybe a final question on this before we move to the next topic.
I mean, Volkmar is someone who's, you know, very much
in the infrastructure game.
My friend was making the argument to me recently that, you know, in some ways actually it's possible that, you know, Saudi Arabia might actually be advantaged even against the US on data center build-out, um, because of, for example, the ability to access and move energy assets around; they're just not gonna face the same kind of, you know, permitting issues and construction issues that you have in the US. Do you buy that as kind of an advantage that they have?
No, I don't.
Okay.
Because we have 50 states. I mean, I moved out of California to Texas and it's totally different.
Permitting is so much easier.
So I think we will see a similar thing like, you know, they're
playing it on a, on a country level.
We will play this on a regional level.
And if you look at the US, there is pretty much just a network ring which goes through pretty much all states. And so you can put your data centers in Arizona, in Oregon, in Texas; you go where the power is cheap for these types of build-outs.
Yeah. I like that take.
It's like we have Saudi, it's, it's called Texas.
Exactly.
We also have the oil.
Yeah. Right.
Yeah.
So exactly.
Well, great.
I'm gonna move us on to our next topic.
Uh, moving us a little bit away from the world of, um, you know, chips and
national competition to something a little bit more kind of close to home.
Two sort of interesting releases that happened fairly recently.
One of them was from OpenAI.
Um, this, uh, benchmark they released called HealthBench, uh,
which is specifically a curated set of about 5,000 conversations.
So the interactions between AI models and users or clinicians, and the idea
is to kind of create standard benchmarks for AI's use in the health domain.
Um, there's also a really interesting benchmark that came out of IBM
called ITBench, which is looking at sort of benchmarking agents.
Um, and I guess, Kaoutar, maybe I'll turn it to you.
I mean, I think the funny thing that I have when a new model releases now,
whether it be Mistral Medium or what have you, is that they always say,
we're very good against all these benchmarks, and then there's a list of
like 50 benchmarks, and it's very difficult to tell what it actually means.
There's like lots and lots of benchmarks, but it's like very, very
difficult to say, okay, like I'm gonna deploy this in the health space.
You know, is this actually a good model for me?
Um, and I guess I'm kind of curious.
I wanted to kind of put to you the idea that in the future, these benchmarks
might end up becoming a lot more fragmented than they are right now.
Right?
Where you might imagine, you know, we say, oh, we're gonna release a
model, but it's gonna be specifically for health applications, and then
here are the benchmarks for it.
Do you think that's where we're headed, or are we gonna kind of keep developing
out this ever more comprehensive benchmark suite, I suppose, for every single
model that comes out?
Yeah.
I think this race towards building these AI models and benchmarking
against them is gonna continue.
But now that we're entering the age of agentic AI, the benchmarking
still seems like it's in the chatbot era.
So if you want to deploy, for example, trustworthy AI agents, I think we
need new evaluation frameworks that combine general reasoning metrics
with domain task completion.
So think of it like this: the general benchmarks test the IQ, but the
sector-specific ones test the job performance.
So I think ultimately the future of benchmarking lies in
these hybrid evaluation stacks.
So general foundation models that are tested across standard reasoning
tasks, but that we also stress test in realistic operational settings,
like the examples that you mentioned, OpenAI's HealthBench or IBM's
ITBench. Those are very domain specific, and we need more of those.
So we need the general evaluation stacks and frameworks, but we also
need the stress tests for these realistic, operational settings.
And those will become, I think, industry standards.
Not because they're broad, but because they're real.
So I think the new wave that we're seeing, with OpenAI's HealthBench and
IBM's IT agent benchmark, these really are raising critical questions.
Are we really measuring the right things?
And as I see these models shifting from static chatbots to dynamic
agents, the traditional benchmarks need to change.
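Kaoutar's idea of a hybrid evaluation stack, general reasoning plus domain task completion, can be sketched in a few lines of Python. Everything here, the function name, the weighting, and the scores, is an illustrative assumption, not taken from any real benchmark suite:

```python
# Illustrative sketch of a "hybrid evaluation stack": blend a general
# reasoning score (the "IQ test") with a domain task-completion score
# (the "job performance" test). The weighting and numbers are assumptions.

def hybrid_score(general_scores, domain_scores, domain_weight=0.7):
    """Blend general-benchmark results with domain task pass rates (all 0..1)."""
    general = sum(general_scores) / len(general_scores)
    domain = sum(domain_scores) / len(domain_scores)
    # Weight domain results more heavily: they test the job, not the IQ.
    return (1 - domain_weight) * general + domain_weight * domain

# A model with strong general reasoning but weaker domain task completion:
print(round(hybrid_score([0.9, 0.8], [0.6, 0.5]), 2))  # 0.64
```

The point of the weighting is exactly the "not because they're broad, but because they're real" argument: the domain tasks dominate the final score.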
Yeah.
What I like about that, and Chris, I'm curious about your comment on this:
you know, does the idea of state-of-the-art even make sense anymore?
Like it kind of feels like, well, I don't know.
It just, there's so many use cases now for AI that it'll be very
difficult to imagine that one model is state-of-the-art across all applications.
And so are we kind of maturing our thinking here?
Like, I don't know, Chris, do you buy that? That state-of-the-art
actually doesn't make any sense because of the number of applications?
I don't think state of the art makes any sense.
I think benchmarks and everything, it's all marketing hype, isn't it?
We are the greatest at this.
And look, we are the topping this chart here.
Yeah. You need the chart
that shows that you're ahead on all the benchmarks.
That's what you do.
Exactly.
Every single one. Even Mistral did that.
Right?
Every single model provider releases their chart.
And they beat everybody else, right?
And they select which models they do and don't pit themselves against.
So it's like I'm selecting this, this, this, and this.
Look, I lead on all of this and I win at this benchmark on that one, et cetera.
Because it's gotta look like you've got the greatest model ever.
And I understand that, but I truly think: don't let somebody else
tell you what model is good for your use case, right?
So if you need a benchmark, go create a benchmark for yourself.
I'm in this domain, these are the sort of things I'm gonna do.
I'm gonna test for it, and I'll create my own evals, and then I'll make sure
that I'm using that model for the purpose and tasks that I want.
Now, don't get me wrong, benchmarks are kind of useful in some regards,
because they allow you to basically go, yeah, this model's probably around
the right level, and therefore I can go and try it on this different stuff.
So it gives you an idea of whether it's something you're gonna be useful for.
But the reality is most of us know what model is good just from using it.
So I'm a viber all the way, so, yeah.
Well, but Chris, when you say you create your own benchmark,
isn't that like a biased approach?
I mean, I can craft it in a way that's gonna show my model as the best
in whatever I'm doing, so there might be some bias there.
As opposed to having an external party create maybe a set of benchmarks.
When I say create my own benchmark, I mean for my specific task, right?
So let's say I'm creating a chatbot for tourism that
answers FAQ questions on flights, right?
You may have the best coding model in the world, but if it
is giving me flights for another company, then it's not really good.
Or if it's not reading the FAQ, then it's not really any good, right?
So, like any normal software application, I'm gonna want to create test
cases to say, is this thing coming back with the answers that I'm
expecting for my business use case?
That's the kind of eval.
So it's not about testing for generality; it's actually, is the model
any good at the task that I want it to perform?
And rather than relying on somebody else to tell you whether it's any good,
you know, for the application you're building, go create some
evals and test it out for yourself.
And most of the time, back to the point, a smaller model that is fine-tuned
or is designed for a specific application will do better than a general
model anyway.
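Chris's "build your own evals" approach could look something like this minimal sketch. The `ask_model` function, the tourism FAQ cases, and the expected answers are all hypothetical stand-ins; in practice `ask_model` would wrap a real model or API call:

```python
# Minimal task-specific eval harness, in the spirit of "create a benchmark
# for your own use case". `ask_model` is a canned stand-in for a real model.

def ask_model(question):
    # Stand-in for an actual model/API call (hypothetical answers).
    canned = {
        "What is the checked-bag limit?": "Two bags up to 23 kg each.",
        "Can I change my flight online?": "Yes, via Manage My Booking.",
    }
    return canned.get(question, "I don't know.")

# Business-specific test cases: (question, substring the answer must contain).
EVAL_CASES = [
    ("What is the checked-bag limit?", "23 kg"),
    ("Can I change my flight online?", "Manage My Booking"),
    ("Do you sell train tickets?", "I don't know."),  # out of scope: should refuse
]

def run_evals(model):
    """Return the fraction of business test cases the model gets right."""
    passed = sum(expected in model(q) for q, expected in EVAL_CASES)
    return passed / len(EVAL_CASES)

print(run_evals(ask_model))  # 1.0 for this canned stand-in
```

Note that the harness tests nothing about generality; it only checks whether the model comes back with the answers expected for this one business use case, which is exactly Chris's point.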
Volkmar, I'm wondering if you have any predictions on where the kind
of meta of this moves over time?
'cause I, I guess the world that Chris is describing, which I really agree
with, is everybody sees these charts of state-of-the-art performance and
yeah, I think the main thing that I get from them now is, okay, you
did your homework, you're at least as good as everyone else.
Right.
It doesn't necessarily make me stand up and be like, wow,
this is the most incredible model ever.
But it just says like you're doing as good as everyone else.
If that's the case, then like we're living in a world where like these
kind of marketing statements don't really have that much value anymore.
And I'm kind of curious how, in the future, an OpenAI or a Mistral or whoever
is gonna kind of demonstrate that, like, oh, this is the model you really
should be using, if not these benchmarks.
So I'm very much aligned with what Chris said.
Um, I think there's the outside marketing.
I wanna bring my product to market.
I need to communicate what it does.
And I think what we will see is, and this is what you get with the medical
benchmark, that first it was: how good are you on math tests and physics
tests and, you know, logic tests? And now what's happening, as we are
widening the use cases, is every industry will come and say, hey, you know,
I'm the medical guy, or I'm the rocket ship guy.
I wanna know how good that model works in the end when you are
putting things in production.
You build your own regression tests, because what happens is
that you have something that fails.
It goes into the regression test. And then, with projects we're
running internally here, what I'm seeing is:
Every six months you upgrade your model.
So if you don't build your regression test, you actually don't know what's going
to fail when you're switching from the old model to a new model, and then you have
a problem with the customers behind it.
And so over time, you are building effectively a benchmark.
Call it a benchmark, call it a regression test, call it a unit test.
It doesn't matter, but you're building out your way of validating that your
fine-tuned model, or the next version of the open-source model
you are using, actually works for your use case.
And if it doesn't work, you cannot upgrade.
It's a quality assurance thing.
And that quality assurance is extremely domain specific.
That's right.
Yeah.
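Volkmar's regression-test gate for model upgrades could be sketched roughly like this. The suite contents, the helper names, and the stand-in answer dicts are all illustrative assumptions, not a real QA setup:

```python
# Sketch of a regression gate for model upgrades: every past failure becomes
# a test case, and the new model only ships if it passes everything the old
# one passed. Answer dicts stand in for real model calls.

REGRESSION_SUITE = {  # case name -> substring the answer must contain
    "refund policy": "within 24 hours",
    "baggage fee": "35 USD",
    "contact number": "+1-800-555-0100",
}

def passes(answers, suite):
    return {case: expected in answers.get(case, "")
            for case, expected in suite.items()}

def can_upgrade(old_answers, new_answers, suite):
    """Block the upgrade if the new model regresses on a case the old one passed."""
    old, new = passes(old_answers, suite), passes(new_answers, suite)
    regressions = [case for case in suite if old[case] and not new[case]]
    return len(regressions) == 0, regressions

old = {"refund policy": "Refunds are issued within 24 hours.",
       "baggage fee": "The fee is 35 USD.",
       "contact number": "Call +1-800-555-0100."}
new = {"refund policy": "Refunds are issued within 24 hours.",
       "baggage fee": "Fees vary by route.",  # the new model regresses here
       "contact number": "Call +1-800-555-0100."}

ok, regressions = can_upgrade(old, new, REGRESSION_SUITE)
print(ok, regressions)  # False ['baggage fee']
```

If the gate fails, you cannot upgrade, which is the quality-assurance point: the suite is extremely domain specific by construction.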
It almost kind of suggests also a world where, yeah, there'll be a lot more
eval work that needs to happen in-house.
And then maybe finally, people have been talking about this for years,
but it maybe actually finally creates a market for like, specialized
businesses that just do evals.
Um, because you live in a world where, basically, your big foundation
model company's not gonna run every single eval in the whole world, but
if you have a specific use case, you need someone with eval expertise.
You know, it almost kind of feels like that suddenly starts to become viable in
this world as the, the market matures.
But I wanna bring it back to Kaoutar's point, and I think it's
really, really relevant, which is: we focus on the model world.
But as Kaoutar was saying, right, in a world of agents, it's gonna be
less about whether this model is capable of performing this task.
It's gonna be more: is the agent capable of doing this task,
and how well is it performing?
So I sort of agree with that point, that the benchmarks kind of need to
move on and really look at it from a kind of agent perspective, as opposed
to just always looking at the models, which I think will also give a
better quality, because agents usually have a much more narrow band.
This is the problem with the, with the generic model, right?
So I have OpenAI; it can do anything.
And so how do you test that it does what you want it to do, specifically?
But once you go to an agent, you really narrow the applicability of the
model, because now you're saying: you, agent, you do flight booking
and nothing else.
And now I can go deep.
Right?
And not just broad.
Right now we just go so broad.
It's like a little bit of math and a little bit of medical, right?
But now, if you have an agent, you can actually say:
do your task, or you don't.
I'm gonna move us on to our final story, uh, in the last
few minutes of the episode.
This is kind of a fun one.
You know, on MoE we cover, you know, infrastructure and chips and
health and, you know, model evals.
We don't really talk about showbiz all that much.
Um, but there was this kind of interesting story. Amazon was doing its
upfront event, where it talks to advertisers about Prime Video's upcoming
season, and there was a bunch of announcements, new shows they're doing,
and all the usual kind of show business stuff.
But there was one interesting AI hook that was mentioned in this news story,
where Amazon announced that they were gonna start using generative AI to
create contextual advertising on Prime Video.
So they didn't really provide too many details, but the idea would be that
you'd be watching a show on Prime Video, an ad would come on, but it would
be sort of an ad generated on the fly using AI.
Um, and in a way that's presumably contextually related to both
what you're watching and what it knows about you, um, which is.
A really kind of weird, interesting change.
Uh, we've had targeted advertising for, for many years, of course.
Um, but this seems to be a little bit of a qualitative shift where
like the actual ad itself will be kind of generated, uh, on the fly.
Um, Chris, are you a fan of this?
Are you excited about this?
I am not excited about this
whatsoever.
I mean, come on.
It's like, it's not that loaded a question, but yeah.
Chris, go ahead.
I, no, seriously.
I mean, how many times on Amazon, like, I don't know, maybe I buy a comb
off of Amazon, and then for the next three weeks I am seeing combs appear
everywhere. Every website I click on, everything I look at on Amazon:
do you want a comb? Here's a comb, here's a comb.
I'm like, I just bought a comb.
Stop trying to sell me combs.
The last thing I want is to stick on the TV, and there I am,
about to watch the New York Giants be terrible.
And then guess what?
More combs arrive.
You're like, oh, oh great.
Oh no.
So no, I, I, once you can stop the combs following me around the internet, then
I will be excited about Gen AI adverts.
Okay.
Quick hit, Kaoutar.
Are you excited by this?
Yeah. I'm also worried about this.
You know, I think, you know, this contextualized advertising
might be a little too much.
Um, also this hyper-personalized advertising.
Uh, I'm not sure, you know, I think we'll have to see how the viewers react to this.
You know, I think we've seen clearly Chris's reaction.
Um, and also, it's machine-generated content inside the show.
So, yeah, I'm also a bit worried about this experience and
how it's gonna play out.
So I feel like they're just getting more into our minds and what we see,
and, you know, it's a bit creepy.
And last but not least, to close out the episode, Volkmar, your take
on this incredible new feature that Amazon's about to launch into our lives.
So, um, two startups before this one I had an ed tech company, so I
probably know too much about it.
So there's a big challenge in creating really good creatives.
And I think that is already now at a point where AI is really helping, um,
you know, not someone internally clicking something together, but actually
creating really good advertising.
Um, it's going to be really interesting to see.
Like, they can now do native formats.
So native formats in print is when the ad kind of looks like it's part of
the article, and so you are reading it despite it being advertising.
Uh, and so what you now could do is go full native, so you have the
movie and you could even plug something in.
You could change the plot of the movie if you take it to an extreme right.
Um, so it's going to be interesting to see.
I wanna see all the issues they have there, like the model accidentally
rendering something it shouldn't render.
So how do you do quality control?
But I think this is the path where we are going on.
So, you know, image rendering is obvious, but video rendering is the next step.
Um, I think it's, uh, hopefully a long path until this becomes reality.
Hopefully.
Yeah, we'll see.
Yeah, I just envisioned that you're, like, watching Star Wars
in 2030 and Luke Skywalker is like, you should really buy a comb.
Exactly.
It was like not a good outcome for sure.
The comb scene.
Yeah.
Everybody remembers that one.
Yeah.
So, um, well that's all the time that we have for today.
Uh, Kaoutar, Volkmar, Chris Pleasures always to have you on the show.
This is one of my favorite kinds of panels on MoE.
And uh, thanks to all you listeners for joining us.
Uh, if you enjoyed what you heard, you can get us on Apple Podcasts, Spotify,
and podcast platforms everywhere.
And we'll see you next week on Mixture of Experts.