Learning Library

← Back to Library

Gemini 3 Dominates AI Benchmarks

Key Points

  • Gemini 3 is now recognized as the world’s leading LLM, outperforming rivals like GPT 5.1 and Sonnet on every benchmark as well as in anecdotal user reports.
  • It dominates a range of tests, including Humanity’s Last Exam, ARC AGI2, Math Arena Apex, MMU Pro, OCR, and especially Screenspot Pro, where it scores roughly double its nearest rival, showcasing superior abstract visual, mathematical, and multimodal understanding.
  • Many older benchmarks are saturated, yet Gemini 3 still achieves massive leaps on newer, unsaturated tasks, proving that progress isn’t plateauing.
  • The data contradicts claims of an “AI wall” or bubble; both pre‑training and fine‑tuning continue to yield substantial, not incremental, improvements.
  • While casual conversations may not reveal the gap, the underlying advancements in reasoning, perception, and tool‑free problem solving are dramatically clearer in these benchmark results.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=nktAnCHK94I](https://www.youtube.com/watch?v=nktAnCHK94I)
**Duration:** 00:09:06

## Sections

- [00:00:00](https://www.youtube.com/watch?v=nktAnCHK94I&t=0s) **Gemini 3: Unambiguous Number One** - The speaker explains how Gemini 3 surpasses all competitors on benchmarks and in user reports, establishing it as the clear, world-leading AI model.
- [00:03:18](https://www.youtube.com/watch?v=nktAnCHK94I&t=198s) **Rethinking AI Leadership with Gemini 3** - The speaker previews a second video on integrating Gemini 3 into workflows, emphasizing that claiming the top spot in the AI race remains achievable and urging continual reassessment of AI’s capabilities and practical applications.
- [00:07:26](https://www.youtube.com/watch?v=nktAnCHK94I&t=446s) **Visual Reasoning Breakthrough in Gemini 3** - The speaker highlights how Gemini 3’s enhanced visual acuity and reasoning eliminate previous multimodal weaknesses, enabling truly integrated image-text-code understanding.

## Full Transcript
Gemini 3 is the number one model in the world, and it's not close. This first video is going to talk about what it means to be the number one model and why we should care; the next video, which I'll do tomorrow, will cover my takeaways. So, what does it mean to be the number one model in the world? I want to get into that, because we haven't had an unambiguous number one model that everybody agrees on in a while. Gemini 3 is that model. It wins on every benchmark I can find, and it also wins the anecdotal wars: the conversations on Reddit and on X about what this model can do. Users are reporting it's a very, very strong model, just as the benchmarks show.

So what do these benchmarks say, and why do they matter? On Humanity's Last Exam, it has the highest published score, and what I noticed is that it didn't use tools to get there; that is the model's brain doing it. On ARC AGI2, again, a clear lead, and the thing I noticed is that it does well on abstract visual puzzles. That visual-puzzle theme will come back. It does well on the math and the science, though those feel somewhat exhausted as scores. Getting 95% without code on AIME may technically edge GPT 5.1 by a point, but the point is that the entire benchmark is saturated. Math Arena Apex is a different mathematical benchmark, and it's not saturated: 1 to 2% has been the average LLM score on it. And you know what Gemini 3 scored? [inaudible] percent. Is it perfect? No. Is it a whole lot better than 1 or 2%? Yes. On multimodal understanding, it's ahead of GPT 5.1 and Sonnet in MMU Pro, it has the best reported benchmark for Video MMU, and it also has the best OCR recognition rates. And then this is the one that blows me away:
Screenspot Pro, which is a measure of the model's ability to read real screens. It dunks on the competition: it scores 72.7%, versus about half that, around 36%, for Sonnet 4.5, and just 3.5% for GPT 5.1. And by the way, as you see these scores, you may start to think, "Wow, the model I have in front of me, GPT 5.1 or Sonnet 4.5, is terrible." No, the models in front of you did not magically get worse. We are just seeing that there is no wall, and that is the thing I want you to remember. Anyone who tells you that there is a wall, and that we are seeing an AI bubble because labs cannot make progress, is wrong. They're wrong, and we have all of these benchmarks to show it. I am not making it up; I am just reporting what Google has already shipped. We do not see a wall on pre-training. We do not see a wall on post-training. These models continue to get better, and they don't get better by a little bit. Progress is not slowing down; it's not just inching forward in small increments. This is a massive leap forward in the state of the art.

And the thing is, you may not see it if you're just chatting with Gemini about casual subjects. If you're asking about planning the soccer game, you're not going to notice this. If you're asking about writing even a one-pager, you may not notice a huge difference. This model is very, very good, but it's good in ways that are more suited to complex work, and that's what my second video is going to focus on: a larger perspective and the takeaways about where Gemini 3 slots into our workflows. I want to test; I want to figure all that out. In the meantime, number one means three things for you to keep in mind. First, it is possible to be number one.
I know that sounds circular, but I promise you it matters. Number one means that it is possible to take a formidable lead in the AI race that everyone agrees on. I think we had lost that belief a little bit. We thought we were in a tight horse race and that it was just going to stay a tight horse race. This is not a tight horse race. It's as if we were all neck and neck and all of a sudden a new model launches several lengths ahead. It is still possible to have those big jumps, so don't get lulled into a false sense of security. That's my first takeaway for you.

My second takeaway is that we need to keep rethinking what AI is capable of doing every couple of months. People wonder how I can still find material, still keep talking about AI. Guys, today is why: we still have so much to learn about how to use these models in practical workflows. How do we integrate them? When do we call multiple models? Do you need the smartest model? When do you use Gemini 3? Every time we advance the state of the art, which is what Gemini 3 did today, we expand the surface area of possible workflows that we can cover with AI, and today that expanded by a really meaningful margin. So when you are thinking about what AI can do, I want to encourage you to think about it in terms of a trajectory. You understand AI today as covering a particular area of your work, or of the workflows you're trying to build. Assume it will get better. I keep saying this, and I know you can't predict it to the day, but it will get better. AI will continue to cover more workflows. And none of that contradicts the idea that there are still places where AI really struggles. The areas where we see Gemini 3 getting better are fantastic.
In some ways, they do reflect the claim that this is a PhD-level researcher, right? It is able to think, be blunt, push, ask questions, and all of that. But I don't see indications, either from anecdotes or from tests, that a model is suddenly going to be able to do everybody's work for all time. The areas of ambiguity that humans thrive in, the tough calls we have to make, the stakeholders we have to manage, the questions we ask, the creativity we bring: we all still need to do those things. Gemini 3 is very good as a language model, but it's not good at the things a language model isn't good at. So, as much as I'm excited about number one, and I'm going to do another video on my larger set of takeaways and where to use it, where the value is, today's takeaway is: really believe that it's number one. Keep an eye out for those advanced workflows you can now unlock, but don't overindex and believe it takes over the world, or that it takes your job tomorrow, because that is not at all the indication it gives. And I think it should be encouraging that a model can get better in important ways like math or science or coding, yet there are aspects of the work we do that are not just those narrowly defined tasks. So get excited. You live in a world where you get a colleague who can help you in your work, who is not really going to be able to take your job, but who can help you do a whole lot more, a whole lot faster. And that colleague keeps getting smarter all the time. That is what today is about: a colleague who keeps getting smarter all the time. And Gemini 3 reminds us, with that unambiguous number one, that this is why we get excited about AI. That is why these days matter.
The last takeaway I have, and I'll probably explore this more tomorrow, is that some of the biggest leaps are in areas where we have hoped to see progress but that have proven really difficult. I think there's something to be explored there that I'm still formulating. To be specific, we see giant jumps in visual acuity, visual reasoning, the ability to navigate visual interfaces, and visual understanding, at the same time as leaps in reasoning and leaps in coding ability. What this reinforces is a set of use cases where the model needs to both see and think, see and reason. That's really exciting, because the promise of these models has always been that they are multimodal: that they can take in image data, sound data, and text data, and put out a variety of things, maybe code, maybe something else. Well, that's becoming more and more true. You start to get a model that treats some of these other input modes, not just text, as native, as something it can reason across at a high level. You get a truly multimodal experience where it feels smart all the way around and doesn't feel like it has a weak spot. There's probably a lot to explore in a model that doesn't have a visual weak spot, which we've known is the case for some of these other models. If you ask ChatGPT to draw up a web page, or Sonnet to draw up a web page, they're good; they get you somewhere, but they're not as good as Gemini 3. And so having a model that is starting to get smart at visuals unlocks a lot that I'm still thinking about. And I'd be curious for your take as we start to think about what it means to have a multimodal AI out there. But for now, Gemini 3 is the number one model in the world. Just about everyone agrees on that.
And I'd be curious to hear what you're building with it. I will come up with a more detailed set of takeaways once I complete my testing today.
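As a back-of-the-envelope check on the Screenspot Pro comparison quoted in the transcript, here is a small Python sketch that computes the size of Gemini 3's reported lead over the other two models. The percentages are simply the figures stated in the video (the 36% for Sonnet 4.5 is the speaker's approximation), not an official leaderboard query.

```python
# Screenspot Pro scores as reported in the video (percent).
scores = {
    "Gemini 3": 72.7,
    "Sonnet 4.5": 36.0,  # "about 36%" per the video
    "GPT 5.1": 3.5,
}

leader = scores["Gemini 3"]
for model, score in scores.items():
    # How many times larger the leading score is than each score.
    ratio = leader / score
    print(f"{model}: {score}% (lead factor: {ratio:.1f}x)")
```

This confirms the transcript's framing: roughly a 2x lead over Sonnet 4.5 and more than a 20x lead over GPT 5.1 on this particular benchmark.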