Gemini 3 Dominates AI Benchmarks
Key Points
- Gemini 3 is now recognized as the world’s leading LLM, ahead of rivals like GPT-5.1 and Sonnet 4.5 on every benchmark and in anecdotal user reports.
- It dominates a range of tests—including Humanity’s Last Exam, ARC-AGI-2, MathArena Apex, MMMU-Pro, OCR, and especially ScreenSpot-Pro, where it scores roughly double the competition—showcasing superior abstract visual, mathematical, and multimodal understanding.
- Many older benchmarks are saturated, yet Gemini 3 still achieves massive leaps on newer, unsaturated tasks, proving that progress isn’t plateauing.
- The data contradicts claims of an “AI wall” or bubble; both pre‑training and fine‑tuning continue to yield substantial, not incremental, improvements.
- While casual conversations may not reveal the gap, the underlying advancements in reasoning, perception, and tool‑free problem solving are dramatically clearer in these benchmark results.
Sections
- Gemini 3: Unambiguous Number One - The speaker explains how Gemini 3 surpasses all competitors on benchmarks and user reports, establishing it as the clear, world‑leading AI model.
- Rethinking AI Leadership with Gemini 3 - The speaker previews a second video on integrating Gemini 3 into workflows, emphasizing that claiming the top spot in the AI race remains achievable and urging continual reassessment of AI’s capabilities and practical applications.
- Visual Reasoning Breakthrough in Gemini 3 - The speaker highlights how Gemini 3’s enhanced visual acuity and reasoning eliminate previous multimodal weaknesses, enabling truly integrated image‑text‑code understanding.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=nktAnCHK94I](https://www.youtube.com/watch?v=nktAnCHK94I) · **Duration:** 00:09:06
- [00:00:00](https://www.youtube.com/watch?v=nktAnCHK94I&t=0s) Gemini 3: Unambiguous Number One
- [00:03:18](https://www.youtube.com/watch?v=nktAnCHK94I&t=198s) Rethinking AI Leadership with Gemini 3
- [00:07:26](https://www.youtube.com/watch?v=nktAnCHK94I&t=446s) Visual Reasoning Breakthrough in Gemini 3
Gemini 3 is the number one model in the
world and it's not close. This first
video I'm doing is going to talk about
what it means to be the number one
model, why we should care, and then the
next video I'll do tomorrow is going to
talk about my takeaways. So, what does
it mean to be the number one model in
the world? I actually want to get into
that because we haven't had an
unambiguous number one model that
everybody agrees on in a while now.
Gemini 3 is that model. It wins on every
benchmark I can find. And it also wins
the anecdotal wars. The conversations on
Reddit, the conversations on X about
what this model can do. Users are
reporting it's a very, very strong model, just like the benchmarks say. So what do
these benchmarks say and why do they
matter? On Humanity's Last Exam, it has the highest published score. What I noticed is that it didn't use tools to get to that published score. That is the model's brain doing it. On ARC-AGI-2, again a clear
lead. And the thing I noticed is it does
well on abstract visual puzzles and that
visual puzzle theme will come back. It
does well on the math and the science.
These feel somewhat exhausted as scores.
So, like, getting 95% without code on AIME: it may edge GPT-5.1 by a point technically, but the point is that the entire benchmark is saturated. MathArena Apex is a different mathematical
benchmark. It's not saturated. 1 to 2%
has been the average score from LLMs in
that benchmark. And you know what Gemini 3 scored? [inaudible] percent. Is it perfect? No. Is
it a whole lot better than 1 or 2%? Yes.
Multimodal understanding: it's ahead of GPT-5.1 and Sonnet in MMMU-Pro. It has the best reported score on Video-MMMU. It's also got the best OCR recognition rates. And then this is the one that blows me away: ScreenSpot-Pro, which is a measure of the ability of the model to read real screens. It
dunks on the competition. It scores
72.7%
versus half of that, about 36% for
Sonnet 4.5 versus just 3.5% for GPT 5.1.
And by the way, as you see these scores,
you may start to think, "Wow, the model
I have in front of me, GPT 5.1 or Sonnet
4.5, is terrible." No, the models you
have in front of you did not get
magically worse. We are just seeing that
there is no wall. And that is the thing
I want you to remember. Anyone who tells
you that there is a wall and we are
seeing an AI bubble because labs cannot
make progress is wrong. They're wrong.
And we have all of these benchmarks to
show it. I am not making it up. I am
just reporting to you what Google has
already shipped. We do not see a wall on
pre-training. We do not see a wall on post-training. These models continue to get better, and they don't get better by a little bit. Progress is not slowing down; it's not inching forward in small increments. This is a massive leap forward
on the state-of-the-art. And the thing
is, you may not see it if you were just
chatting with Gemini about casual
subjects. Like, if you're asking about
planning the soccer game, you're not
going to notice this. If you're asking
about writing even a one-pager, you may
not notice a huge difference. This model
is very very good, but it's good in ways
that are more suitable to complex work.
And that's what my second video is going
to focus on. I'm going to focus on a
larger perspective and the takeaways
that we can have as to where Gemini 3
sort of slots into our workflows. I want
to test. I want to figure all that out.
In the meantime, number one means three
things for you to keep in mind. First,
it is possible to be number one. And I know that sounds circular, but I promise you
it matters. Number one means that it is
possible to take a formidable lead in
the AI race that everyone agrees on. I
think we had lost that belief a little
bit. We thought that we were in a tight
horse race and it was just going to be a
tight horse race. This is not a tight
horse race. It's like we were all neck and neck and all of a sudden a new model
launches several lengths ahead. It is
possible to have those big jumps still.
So don't get lulled into a false sense
of security. That's my first takeaway
for you. My second takeaway for you is
that we continue to need to rethink what
AI is capable of doing every couple of
months. People wonder why I can still
find material, still be able to keep
talking about AI. Guys, today is why. We still have so much to learn about how to use these models in practical workflows: how to integrate them, when to call multiple models, whether you need the smartest model, when to use Gemini 3. And every time we advance the
state-of-the-art, which is what Gemini 3
did today, we expand the surface area of
possible workflows that we can cover
with AI. And today that expanded by a
really meaningful margin. And so when
you are thinking about what AI can do, I
want to encourage you to think about it
in terms of a trajectory. You understand AI today as covering a particular area of your work or of the workflows you're trying to build. Assume it will get better. I keep
saying this and I know you can't predict
it to the day, but it will get better.
AI will continue to cover more
workflows. And none of that contradicts
the idea that there are still places
where AI really struggles. The areas
where we see Gemini 3 getting better are
fantastic. In some ways, they do reflect
the claim that this is a PhD-level
researcher, right? It is able to think
and be blunt and push and ask questions
and all of that stuff. But I don't see
indications either from anecdotes or
from tests that this is a situation
where a model is suddenly going to be
able to do everybody's work for all
time. The areas of ambiguity that humans
thrive in, the tough calls we have to
make, the stakeholders we have to
manage, the questions we ask, the
creativity we bring, we all still need
to do those things. Gemini 3 is very
good as a language model, but it's not
good at the things that a language
model's not good at. And so, as much as
I'm excited about number one, and I'm
going to do another video on my larger
set of takeaways and where to use it,
right? Like, where the value is. Today's takeaway is really: believe that it's number one. Keep an eye out for those
advanced workflows that you can now
unlock, but don't overindex and believe
it takes over the world or believe that
it takes your job tomorrow because that
is not at all the indication that it
gives. And I think it should be
encouraging to you that a model can get
better in important ways like math or
science or coding. But there are aspects
to the work that we do that are not just
those very narrowly defined tasks. So
get excited. You live in a world where you get a colleague who can help you in your work, who is not really going to be able to take your job, but who can help you do a whole lot more, a whole lot faster. And that colleague keeps getting
smarter all the time. That is what today
is about. It is about a colleague who
keeps getting smarter all the time. And
Gemini 3 reminds us with that
unambiguous number one that that is why
we get excited about AI. That is why
these days matter. The last takeaway I
have and I'll probably explore this more
tomorrow is that some of the areas where
we see the biggest leaps are in areas
that we have hoped we would see progress
but that have proven really difficult.
And I think there's something to be
explored there that I'm still
formulating. So to be specific, we see
giant jumps in visual acuity, visual
reasoning, ability to navigate visual
interfaces, visual understanding at the
same time as leaps in reasoning, leaps
in coding ability. What this reinforces
is a set of use cases where the model
needs to both see and think, see and
reason. That's really exciting because
the promise of these models has always
been that they are multimodal. That they
can take in image data, sound data, text
data, and put out a variety of things,
maybe code, maybe something else. Well,
that's becoming more and more true. You start to get a model that treats some of these other input modes, not just text, as native, as something it can reason across at a high level.
You get a truly multimodal experience
where it feels smart all the way around
and it doesn't feel like it has a weak
spot. There's probably a lot to explore
in a model that doesn't have a visual
weak spot, which we've known is the case
for some of these other models. It's
like if you ask ChatGPT to draw up a web page, or Sonnet to draw up a web page, they're good. They get you
somewhere, but they're not as good as
Gemini 3. And so having a model that is
starting to get smart at visuals unlocks
a lot that I'm still thinking about. And
I'd be curious for your take as we start
to think about what it means to have a
multimodal AI out there. But for now,
Gemini 3 is the number one model in the
world. Everyone just about agrees on
that. And I'd be curious to hear what
you're building with it. I will come up
with a more detailed set of takeaways
once I complete my testing today.