Gemini 3 Launch and AI Hallucinations

Key Points

  • Gemini 3 was unveiled with dramatically higher benchmark scores—especially on the notoriously difficult Humanity’s Last Exam and ARC‑AGI benchmarks—signaling a major performance leap for Google’s model.
  • Early user feedback notes that Gemini 3 still tends to “hallucinate” and prefers to give an answer rather than admit uncertainty, though it appears less aggressive about making false claims than earlier versions.
  • This week’s AI roundup highlighted big moves: Microsoft and Nvidia teaming with Anthropic on a $15 billion infrastructure pact, CMU researchers finding AI agents fail ~70% of real‑world corporate tasks, IBM launching a live‑alert AI platform for UFC events, and OpenAI releasing a ChatGPT variant for K‑12 teachers.
  • The expert panel—Marina Danielewski, Gabe Goodhart, and newcomer Marve Univar—discussed the implications of Gemini 3’s capabilities and its lingering hallucination issue, reflecting a mix of excitement and caution about the model’s real‑world reliability.

**Source:** [https://www.youtube.com/watch?v=7T_TjH6P8CE](https://www.youtube.com/watch?v=7T_TjH6P8CE)
**Duration:** 00:46:32

## Sections

- [00:00:00](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=0s) **AI News Roundup & Expert Panel** - The Mixture of Experts podcast introduces its guest panel and reviews the week’s AI headlines, covering Gemini 3, a Claude attack, the Microsoft‑Nvidia‑Anthropic $15 billion deal, CMU’s finding that AI agents fail 70% of corporate tasks, and IBM’s AI‑driven UFC live‑alert platform.
- [00:03:14](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=194s) **Assessing Gemini 3’s Ecosystem Edge** - The speaker sees Gemini 3 as Google’s move to strengthen its AI moat by offering novel ecosystem tools—such as the Antigravity editing platform and a management‑of‑agents framework—rather than just a superior model, and remains uncertain about recommending it.
- [00:06:33](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=393s) **AI Creates Custom Workout Dashboard** - The speaker describes using Gemini on the Antigravity platform to quickly generate a personalized workout plan and an interactive Streamlit dashboard, highlighting the model’s multimodal code, UI, and artifact generation capabilities.
- [00:09:40](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=580s) **Balancing Generalist and Specialist AI** - The speaker discusses the trade‑offs between a single all‑purpose model and specialized agents, noting precision‑recall tension, the value of multimodality, and how task‑specific automation may reshape AI deployment.
- [00:12:50](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=770s) **IBM's Kuga Enterprise Agent Development** - The speakers discuss IBM's newly announced Kuga generalist agent, outlining its progression from simple domain-specific bots to a multi‑agent, task‑decomposing architecture designed for enterprise readiness.
- [00:16:34](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=994s) **Agents Becoming the New API** - The speaker predicts that by 2025, building AI agents will be as routine and standardized as creating REST API services, with open, configurable tools and built‑in management handling deployment, security, and scalability.
- [00:20:07](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1207s) **Spokes and Hubs Metaphor** - The speaker argues that even with accelerated technology, effective problem‑solving still relies on a hub‑and‑spoke structure of specialists and managers, and outlines future AI agent initiatives (Kuga, ALTK) aimed at benchmarks like WebArena.
- [00:23:39](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1419s) **AI Benchmarking and Economic Impact** - The speaker directs listeners to Kuga’s resources, then explains that discussions about AI’s economic impact have moved from alarmist job‑loss predictions to systematic evaluation, highlighting OpenAI’s GDPval benchmark that tests AI against human experts on real‑world professional tasks.
- [00:27:12](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1632s) **Human vs AI in Benchmarking** - The speaker critiques AI benchmarks that rely on approximations, human graders, and proxy models, highlighting the paradox of using humans to evaluate AI while aiming to eliminate human labeling.
- [00:32:04](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1924s) **Evaluating AI Benchmarks and Real-World Tasks** - The speaker critiques headline‑driven metrics, urges readers to examine paper appendices and prompt‑based evaluations, discusses the complexity of multi‑model assessments, and highlights the need for deeper analysis of AI’s impact on jobs.
- [00:35:18](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=2118s) **Measuring AI Impact in Clinical Conversations** - The speaker highlights the difficulty of quantifying benefits from extracting insights in doctor‑patient dialogues—such as best‑practice identification, supply‑chain optimization, and personalized follow‑ups—and stresses the need for realistic benchmarks, human evaluation, and KPI‑driven ROI tracking.
- [00:39:19](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=2359s) **Observability as Security for LLM Agents** - The speaker argues that, since perfect alignment is unlikely, embedding robust telemetry and monitoring into LLM systems can provide transparency, enable rollback, and build trust—especially in controlled enterprise settings.
- [00:43:59](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=2639s) **AI-Driven Exploit Scaling** - The speakers explain that AI can rapidly automate attacks on known, unpatched vulnerabilities, underscoring the urgent need for tighter patch timelines and built‑in safeguards in defensive AI systems.

## Full Transcript
0:01It was interesting to me that a couple of people 0:04reported that it's still hallucinating and it still really likes 0:07to give answers rather than say that it doesn't know 0:10the answers, although it's not so sycophantic about it. But 0:13it still really likes to give answers. So it's an 0:15interesting combo. All that and more on today's Mixture of 0:19Experts. I'm Tim Hwang and welcome to Mixture of Experts. 0:27Each week MOE brings together a panel of the finest 0:30minds of technology to distill down what's important in 0:32the latest news in artificial intelligence. Joining us today are 0:36three incredible panelists. We've got Marina Danielewski, Senior Research Scientist, 0:39Gabe Goodhart, Chief Architect, AI Open Innovation. And joining us 0:43for the very first time is Marve Univar, Director, Agentic 0:46Middleware and Applications Research, AI. All right, lots of interesting 0:51topics for today's episode. We're going to talk a little 0:53bit about, of course, the drop of Gemini 3, some 1:03attack using Claude. But first we've got Eily with the 1:06news. Hi, I'm Illy McConnell, a tech news writer for 1:14IBM Think. Here are this week's AI headlines. Microsoft and 1:18chipmaker Nvidia will partner with AI startup Anthropic in a 1:22$15 billion AI infrastructure deal. Carnegie Mellon researchers discovered that 1:27AI agents fail a shocking 70% of the time 1:32on corporate real world tasks. IBM and the Ultimate Fighting 1:36Championship have launched an AI driven live alert platform that 1:40delivers real time records and milestones during UFC events to 1:44viewers. OpenAI announced ChatGPT for teachers, a version of its 1:49popular chatbot for K-12 educators. For more, subscribe to the 1:54Think newsletter linked in the show notes. And now let's 1:56see what our experts think of Google's Gemini 3.0. First, 2:04I want to start with the big news of the 2:06week which is the launch of Gemini 3. So long 2:11rumored, long teased, but finally out.
And it's a remarkable 2:16model. I mean from some of the benchmarks Google is 2:19reporting explosively good performance on. I think what has been 2:22considered some of the most difficult evals and benchmarks that 2:25are out there. So huge leaps on Humanity's Last Exam, 2:28really big jumps on ARC-AGI. But I guess maybe 2:32let's just kind of start with like the vibe check 2:35I guess. Marina, have you had a chance to play 2:36with the model yet? I'm curious about what you think 2:38about it and if it feels like substantively very different 2:41from what came before. I haven't played with it. I've 2:43looked at a few digests about it. It does seem 2:46like there's a lot of interest in making the more 2:49complicated benchmarks be something that's handleable. It was interesting to 2:53me that a couple of people reported that it's still 2:56hallucinating and it still really likes to give answers rather 2:59than say that it doesn't know the answers. Although it's 3:02not so sycophantic about it. But it still really likes 3:05to give answers. Right. It's still making mistakes. Yeah. It 3:09doesn't like to admit that it doesn't know something and 3:11maybe that's saying something about this new set of models. 3:14Yeah, for sure. And Gabe, quick question. Would you recommend 3:18Gemini 3? Have you played around with it yet? Yeah, 3:21I've played just a little bit with it this morning 3:23and I don't know. I think my take on this 3:26is we're really starting to see sort of the ecosystem 3:30moats evolving. Like I think this is a necessary step 3:34for Google in their AI ecosystem to have a model 3:38that is at par or better than all of their 3:40competitors so that they can truly claim to be running 3:44ahead in the front of offerings here. But what really 3:47struck me about the announcement was that they actually took 3:50a swing at a piece of differentiation because frankly a 3:53really great model is not that differentiated anymore. It's like 3:56that line but.
But for what? I want to use 3:59a model for what most people want to use a 4:01model for. We don't need something better than what we've 4:03already got. I thought the thing about the Anti Gravity 4:08editing platform was really interesting because it actually looked like 4:12something novel that they're adding to their ecosystem that you 4:17can't get anywhere else. And in particular, you know, the 4:20idea of an agentic IDE is not at all new. 4:23There are well known startups out there doing that. There 4:26are open source ways to pull that together on your 4:29own. The part that I think was novel here was 4:32the intentional transition to framing it as a management of 4:36agents problem and tying this all back to would I 4:40recommend the model itself? I haven't played with this capability, 4:44but a colleague of mine has already. And the idea 4:47of being able to launch a fleet of delegate worker 4:51agents that can all work on separate tasks in parallel 4:55and you can manage them is something that I think 4:58has some real compelling chops. So if the model can 5:02actually hold up to that level of independence and you 5:05know, parallel analysis, it could be a real breakthrough there 5:09on a net new capability that you couldn't get with 5:12any other ecosystem. So I'm really excited to try that 5:15out and to see where it goes. But from a 5:17pure model perspective, I'll play with it. Yeah, for sure. 5:22Yeah, I think that's kind of the one. The really 5:23interesting things coming out of all this is like, you 5:26know, I think we used to marvel even earlier this 5:28year. We're like oh my God, new model, incredible benchmarks, 5:31like look at all this progress. But here we are, 5:33you know, sitting in November of 2025 and we're like 5:36eh, it's the benchmarks, whatever, awesome. The science is amazing 5:41and I can accomplish the same set of tasks that 5:45I had before with a much, much smaller model running 5:49on my laptop. 
So Marva, do you want to talk 5:51a little bit about Antigravity? I did think that this 5:53was sort of like the big interesting differentiator on the 5:56announcement was to say, hey, you know, we acquired Windsurf, 5:59we're going to do kind of our IDE and it's 6:02going to be an agentic IDE. What's your take? I 6:05mean, where is this all going? And I guess if 6:07you want to give our listeners an intuition of like 6:08why Google is even investing in that kind of differentiation, 6:12I think Gabe alluded a little bit to it, right. 6:15The ecosystem play. But I think this, I haven't played 6:18with Antigravity. I played with the model and I'll tell 6:21or share my experience. But I think they're aiming for 6:25the advanced tool use so the whole agentic applications and 6:29making tool calls more robust and also increase the modalities. 6:33Right. Like they're claiming you can do editor, terminal, web 6:37browser and like many different execution modes. And this means 6:41you can plan code, execute, verify different tasks more autonomously. 6:46And I think agents in Antigravity will also generate 6:49artifacts. That's what they claim. So this can be task lists, 6:53plans, screenshots, I think browser recordings and they can also 7:03when you want to take agents to the next stage beyond 7:06benchmarks, from academic benchmarks to put it out there to 7:09reality. So it's quite promising. But I did play with 7:13the model, I didn't play with the Antigravity platform yet. 7:16So as Marina said, it's not hallucinating fully, but just 7:20like other big models, it's really, really good with 7:23the initial prompt, like the way you describe your first 7:25set of things. And they have a build section. Right. 7:28Like you can build artifacts, you can build UI elements. 7:31So I asked Gemini to create me a workout plan 7:33and an interactive dashboard to track my workout sessions. And 7:37I told it my weight, height, and age to customize for myself.
And the very first UI it produced was really nice and 7:44literally like I was multitasking in a meeting and it 7:46like locally I was able to build in Streamlit in 7:49less than two minutes. Then I asked it to add 7:51some more, you know, personalized pictures, like motivation pictures, customized 7:55with my name. And then I realized it added a 7:57reminder section saying that I should eat high nutrition food 8:01after the workout to grow. To grow. I'm a mother 8:06with two children. If this was for my kids, I 8:08think it would make more sense. But for me, I'm 8:11not growing. It totally messed that part up. It already 8:14knew my age, so it was way past, I'm way 8:16past growing age. So again, the overall performance I 8:19think on benchmarks is quite impressive, and the claims they 8:23are making. And I think this excited me because I 8:26think this is the largest capability jump we've seen in 8:28a few months. Right. Like, it's nice, but it has 8:31some flaws which I personally experienced in my first UI 8:35dashboard that I built with it. Marina, hopefully you 8:37can help me kind of square this circle here because 8:40I think it's really interesting that your first reaction to 8:43this model is that it's still hallucinating. And we kind of have 8:47this very funny sort of split screen experience of these 8:50models where they are just kind of performing better and 8:53better against these benchmarks. But yet our kind of everyday 9:03that? Is it true that the more powerful models just 9:07simply hallucinate more or is this kind of just like. 9:09I'm just kind of curious about. They're seemingly really strong 9:12in some things, but remain kind of like amazingly weak 9:15in other domains. Right. So I'm going to agree with 9:17what Gabe said, which is the whole point of, well, 9:19plenty of tasks.
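Merve's two-minute dashboard is the kind of artifact a short Streamlit script produces. The sketch below is a hypothetical reconstruction for illustration: the plan data, field names, and helper are invented, not the code Gemini actually generated.

```python
# Hypothetical sketch of a prompt-generated workout tracker like the one
# described above. Data and names are invented for illustration.
from dataclasses import dataclass

@dataclass
class Session:
    day: str
    focus: str
    minutes: int

def weekly_minutes(plan):
    """Total training minutes across the week."""
    return sum(s.minutes for s in plan)

PLAN = [
    Session("Mon", "Strength", 45),
    Session("Wed", "Cardio", 30),
    Session("Fri", "Mobility", 25),
]

# In a Streamlit app this data would be rendered with, e.g.:
#   import streamlit as st
#   st.title("My Workout Tracker")
#   st.bar_chart({s.day: s.minutes for s in PLAN})
total = weekly_minutes(PLAN)  # 100
```

The point is less the code itself than how little of it is needed: a dataclass, one aggregate, and a couple of Streamlit calls cover the whole "interactive dashboard" artifact.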
I don't need this really large model 9:21and I can do a better thing with a smaller 9:23model because hopefully we're finally getting beyond this 9:27idea that we're going to have one model to rule 9:30them all. This never should have been a goal. It's 9:32never going to work for an LLM with the architecture 9:34that we have. What you want is a suite. Either 9:37you give them different instructions or different preferences or whatever. 9:41If you wanted to do better on these very 9:43complicated benchmarks which really want the model to think through 9:47a lot of things, generate a lot of thoughts and 9:49attempts and whatever, then you don't want it to be 9:51reticent and sit there saying, I don't know anything. I'm 9:54going to twiddle my thumbs because I don't have a 9:55citation to offer you. These are different tasks. This is 9:59the same thing as if you were to have the 10:01statistical tension between precision and recall. You're going to do 10:04better in one, you're going to do worse in the 10:06other. This is going to be a consistent thing. Use 10:08them for different things. And the fact that the more 10:12interesting thing now is automation, the more interesting thing now 10:14is really the multimodality. Yeah. Lean into that because that 10:18really is more interesting. Having one model is always going 10:20to do the best job at giving you the right 10:21information? Why? Review your Karl Marx, review your division of 10:25labor. Let's reinvent normal civilization of people working together. That's 10:32going to be more effective. It always will be. Yeah, 10:35yeah. And I think it's kind of funny where this 10:37is all resolving too, because I agree, one of the 10:40things that was like, oh man, this is going to 10:42change everything was one model to rule them all. But 10:45if we end up in a world where, I don't 10:47know, we have very specialized agents for very specialized kinds 10:50of tasks, are we back to app-land again?
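Marina's precision-recall analogy can be made concrete: the higher the confidence threshold at which a model answers instead of saying "I don't know," the more precise its answers but the fewer questions it covers. A small synthetic illustration (the scores and labels are made up):

```python
# Illustration of the precision-recall tension Marina describes.
# Raising the answer threshold improves precision but hurts recall.
# All data here is synthetic.

def precision_recall(scores, correct, threshold):
    """scores: model confidence per question; correct: 1 if its answer is right."""
    answered = [(s, c) for s, c in zip(scores, correct) if s >= threshold]
    if not answered:
        return 1.0, 0.0
    right = sum(1 for _, c in answered if c)
    precision = right / len(answered)          # of answers given, how many right
    recall = right / sum(correct)              # of answerable questions, how many answered right
    return precision, recall

scores  = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
correct = [1,   1,   0,   1,   0,   0]

low_p, low_r = precision_recall(scores, correct, 0.4)     # answer everything
high_p, high_r = precision_recall(scores, correct, 0.75)  # answer only when confident
# low threshold:  precision 0.5, recall 1.0
# high threshold: precision 1.0, recall ~0.67
```

A chatty model that never abstains sits at the low-threshold corner; a reticent one that "twiddles its thumbs" sits at the high-threshold corner. Neither setting is right for every task, which is Marina's point.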
Are we 10:53back to software again? In some ways we're kind of 10:56reconstructing applications like specialized software again, which I guess is 11:02like everything old is new again in some sense. I 11:05think it's a healthy tension. I mean, I think frankly, 11:07one of the reasons we're in the AI moment. We 11:09are, is that the pendulum really swung with the introduction 11:13of Transformers and suddenly you didn't need a complicated suite 11:16of Software to get 80% of the solution. And that 11:20was a real game changer. Right. I think what we're 11:22seeing here is there's. I mean, the general purpose populace 11:26is still going to use one chat window. Right. They're 11:29going to jump to a chat window and they're going 11:31to enter some things. Now if that chat window becomes 11:35an increasingly complicated software machine behind the hood, the user 11:39doesn't need to know. So the interface change of one 11:42model to rule them all I think is sticky. I 11:44don't think that's going anywhere but the actual implementation behind 11:47the scenes. I think we've already seen that with the 11:51GPT 5 series. I think we will almost certainly see 11:55it with other Frontier model offerings, let's call them that, 12:01and I think we'll see the open equivalent of a 12:05software Stack emerging that allows you to ensemble models for 12:09specific parts of your workflow and specific elements of how 12:12you want this all to work together, exposing that nice 12:15single entry point that users want to interact with. So 12:19I think it's a healthy tension. The nice thing here 12:22as a software architect you get an abstract interface which 12:25is your chat box and then you get to implement 12:27it however you want. So we'll iterate on that implementation 12:31because we're software folks and we'd like to do that. 12:33But I think we'll swing back and forth a little 12:38bit on the complexity behind the scenes. Yeah, for sure. 
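Gabe's "abstract interface" point, one chat box in front with an increasingly sophisticated ensemble behind it, is essentially a router or facade pattern. A minimal sketch, with invented backend names and a toy keyword-routing rule standing in for whatever a real system would use:

```python
# Sketch of the single-chat-box, many-implementations idea discussed above.
# The routing heuristic and backend names are invented for illustration.
from typing import Callable, Dict

Backend = Callable[[str], str]

def make_router(backends: Dict[str, Backend], default: str) -> Backend:
    """Return one chat entry point that dispatches to specialist backends."""
    def route(prompt: str) -> str:
        for keyword, backend in backends.items():
            if keyword in prompt.lower():
                return backend(prompt)
        return backends[default](prompt)
    return route

chat = make_router(
    {
        "code": lambda p: "[code-specialist] " + p,
        "math": lambda p: "[math-specialist] " + p,
        "general": lambda p: "[generalist] " + p,
    },
    default="general",
)

# One entry point, many implementations behind it:
reply = chat("write code to sort a list")  # dispatched to the code specialist
```

Swapping the lambda backends for real model calls, or the keyword rule for a learned router, changes the implementation without touching the interface, which is exactly the flexibility Gabe describes.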
I think it'll be so funny if we end up 12:41in a few years people being like, well rather than 12:43a chat window, what if we had like a desktop 12:45with like icons you can click on and it's just 12:47like we'll be back to where we were. So. Yep. 12:54Two very interesting announcements out of IBM recently about agents 13:02life really the last few months and I think particularly 13:06on Gabe's last comment. I understand one of the announcements 13:09is around a project called Kuga, C-U-G-A, 13:12which is billed as an enterprise ready generalist agent. So 13:17do you want to talk a little bit about kind 13:18of like what the team was working on and thinking 13:20about for this, for this launch? Sure. Happy to. As 13:22you said, this has been my life since the launch 13:25and it's been quite. I think we got good feedback 13:27as well. We're trying to become an enterprise ready generalist agent 13:31and it's not an easy, easy task to take on. 13:34But where we started is like everybody else who starts 13:38to build enterprise ready agents, you start from some simple 13:42traditional ways, build a domain specific agent. So maybe ReAct 13:46or CodeAct, the simple pattern that you take, and then you 13:49start evolving it to oh, my task is too complex 13:52and my single agent cannot handle this. So let me 13:55go and build a task decomposer on top. Oh, this 13:58is now becoming this multi agent architecture where you have 14:01the layer up top that picks the right sub 14:04agent to do it. It's, I mean, classical engineering design 14:07principles because it's easier to distribute to the sub agents. 14:10We believe it's going to work faster. And then what 14:12we realized is we're not the only one that does 14:14this. Like my peer groups in IBM Research, when they 14:17build sophisticated agents they go through this experience as well. 14:20Let's start simple and then it all of a sudden 14:22becomes this very complex. Exactly.
And then we stepped back 14:27and we thought like maybe we can create a generalized 14:30version of this where people can jumpstart with using the Kuga 14:35architecture rather than building all these things by themselves. Right. 14:38So we can give Kuga, which is this multi agent 14:41supervisory layer already embedded in a multi agent architecture, and 14:46people can configure it for their own domain and users. 14:50So rather than the traditional way, like build a domain specific 14:53agent, evolve it, do some custom benchmarks, and a long 15:04bring your own domain onboard your own tools and configure 15:07your own domain to do your own benchmarks and then 15:10deploy. So that's our vision. It's open, it's outside in 15:14the open now for people to try and give us 15:17feedback and see if it works for their domain. So 15:19we're very excited that we launched in the open so 15:21we can capture if what we experimented in research actually 15:26can be mimicked in real world application uses. Yeah, 15:30that's really exciting. And I think one of the things 15:32that we've been watching really closely here at MOE is 15:35what I love about the kind of agent competition world 15:37is right now we're very much in the world of 15:40norm setting. We're doing it this way. We hope you 15:43do it this way as well. And I think there's 15:46various projects that are more or less successful at attempting 15:50to build those standards. I think what's really intriguing, and 15:52Gabe, I'm curious if you want to talk about this 15:54in the context of Kuga, is it seems here what's 15:57really intriguing, Marvi, if I'm hearing you right, is that everybody 16:00starts by building an agent and they all discover exactly 16:04the same problems over and over again. And everybody's going 16:07through that process of rediscovery right now. I guess, Gabe, 16:09that's pretty promising from the point of view of, okay, 16:12let me just shortcut this. Here's a standard framework.
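The progression Merve describes, from a single agent to a task decomposer with a supervisory layer dispatching sub-agents, can be sketched in a few lines. This is an illustrative pattern only, not Kuga's actual architecture or API; the toy decomposer and keyword dispatch stand in for LLM-driven planning:

```python
# Sketch of the supervisor/decomposer pattern described above.
# Names and dispatch logic are invented for illustration, not Kuga's API.
from typing import Callable, Dict, List

SubAgent = Callable[[str], str]

def decompose(task: str) -> List[str]:
    # Toy decomposer: a real system would have an LLM plan the subtasks.
    return [t.strip() for t in task.split(" then ")]

def supervisor(task: str, agents: Dict[str, SubAgent]) -> List[str]:
    """Decompose the task, then route each subtask to a matching sub-agent."""
    results = []
    for subtask in decompose(task):
        agent = next(
            (a for domain, a in agents.items() if domain in subtask),
            agents["general"],  # fall back to the generalist
        )
        results.append(agent(subtask))
    return results

agents = {
    "search": lambda t: f"search-agent handled: {t}",
    "summarize": lambda t: f"summarizer handled: {t}",
    "general": lambda t: f"general agent handled: {t}",
}

outputs = supervisor("search for Gemini 3 reviews then summarize them", agents)
```

Each layer here (decomposer, dispatcher, sub-agent) is exactly one of the pieces teams keep rediscovering, which is the argument for shipping them as a reusable framework.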
Yeah, 16:16I've been also thinking a lot about this, having a 16:19lot of conversations with different teams building different components. And 16:23one of the things that really seems to be true 16:26is that there are emerging slots for an abstract architecture 16:30for agents in open source and presumably in closed source. 16:34But we don't know how those tools are implemented necessarily. 16:37The generalist agent is absolutely one of those slots. And 16:40having an open offering that's configurable and permissively licensed is 16:45a really awesome place for people to start collaborating and 16:47building. On top of this tool management is another big 16:51piece of this and it just seems to be coming 16:55up over and over again that this sort of emergent 16:57architecture is there and to the point about refining and 17:01iterating on the actual agentic architecture itself. The analogy that 17:05I keep coming to when I talk about this stuff 17:08is if I asked anybody out there working at a, 17:12well, really any company, please go build me a REST 17:15API server for X, Y and Z. I wouldn't have 17:19to tell them the architecture of that thing. I wouldn't 17:22have to tell them what programming language to use. I 17:23wouldn't have to tell them like it's just kind of 17:26a well established pattern that everybody knows how to do 17:29if they've ever touched cloud software. Agents aren't there yet, 17:32but as many people have said, you know, 2020, the 17:35year of the agent. I think by the end of 17:372025 we're going to be close to actually hitting that 17:40point where we can just say, hey, build an agent 17:42for this. And everybody just knows what you mean. And 17:44as you exactly described it, Merve, I think the decomposition 17:50is exactly that step from I got it running in 17:53flask on my local machine with HTTP to now I've 17:57got a server that has middleware for authentication and serves 18:01TLS and can actually be horizontally replicated. 
Those are the 18:06steps you take when you're building a microservice after 18:08you get your demo app running. The same thing you're 18:10describing with Kuga is exactly what people are hitting after 18:14they get their first ReAct agent off the ground. You 18:16mentioned like, oh, here is the agent, go use it. 18:18We're really trying not to push people to, like, okay, 18:23Kuga is this and you have to use 18:25it. It's also flexible. The configuration piece makes it, I 18:28think, easier for people to say, okay, I need this, but 18:32I can configure it this way. And also what we 18:34did is, which is I think maybe it's a good 18:37time to introduce the ALTK, which is the Agent Lifecycle 18:40development toolkit that we also released in the open. We 18:43componentize Kuga and we build different components to support Kuga's 18:47different capabilities like memory, guardrails and other things that make 18:52Kuga function in the real world. But some people may 18:55not want to start from Kuga. They may still have 18:57their own sophisticated agent that they built and they don't 19:00want to move it to Kuga. So they can reuse 19:02these components under this Agent Lifecycle toolkit, and if they 19:06want the memory piece, they can take it and 19:08apply it to their agent. And this is, again, like democratizing 19:12and not really pushing people to: this is what 19:14I have, it works, use it. No, 19:17it's a flexible design. You can take the different components and 19:20apply them to your current agentic implementation if you want to 19:23improve certain aspects of your agent. Marina, I want to 19:25go back to kind of the comment that you made 19:27a little bit earlier when we were talking about Gemini 19:303 and kind of this movement to sort of like 19:32more specialized agents.
Over time it kind of strikes me 19:35that we will almost reproduce human org structures in agents 19:39because it's generally this agent is kind of like it's 19:41the middle manager. Right. Its role is to manage other 19:45agents. Do you think that's kind of where we're headed 19:47ultimately is like we're moving away from one agent to 19:50rule them all. But there will still be these kind 19:52of generalist agents and really their role will be sort 19:55of that middle manager, I guess, in the org chart 20:04biology to software, you have this combination of hubs and. 20:08And spokes. There's a real reason that you end up 20:10settling. And maybe it's going to be a different number 20:12of spokes, more hubs, fewer hubs. But sort of like 20:15as you figure out the way to solve a particular 20:16problem, that's still the place that you solve. You need 20:19some specialists and you need somebody doing the managing and 20:22the planning. So yeah, what's interesting about this era is 20:28that we are able to go faster, further than we 20:32thought. But if you take that 10,000 foot view, it's 20:34still a. All right, I've got a task I maybe 20:37can do. I'll think the spokes part a lot faster. 20:40But you still somewhere in there need a hub where 20:42you say, okay, this is what you do next. This 20:44is how you know that you're done. This is what 20:45you try. It's very natural and very correct. The technology 20:49to get us to go faster is great and it's 20:52very exciting. But yeah, this is the normal pattern of 20:55problem solving. Yeah. Like the future is kind of like 20:58figuring out how you staff your project with different kinds 21:00of agents. It almost feels like. And maybe a couple 21:04of people in there just to keep an eye on. 21:06There's some actual humans in there. So I guess maybe 21:11a last point, Marv, where are you headed next with 21:13all this? 
I know you had said this is your 21:15life since the launch, but where does Kuga, where does 21:18ALTK go next? So just like the Gemini 3 benchmark results. 21:22So we started with Kuga and then we went out 21:24there and found the most challenging and most representative benchmarks 21:27that we could go after, which were WebArena and AppWorld. 21:31So we were number one on both of them for 21:33a long time. And we kept like, oh, let's keep 21:35our position as number one. But no, I think it's 21:38very different to keep our position in benchmarks 21:41as number one versus putting it out there and hearing 21:44directly from the users where it breaks. Latency is, 21:47for example, a problem right now. Apparently like when we 21:49built Kuga, we really focused on the accuracy and how 21:53good Kuga behaves at completing the task. But latency is 21:57one of the requirements, for example, for us, that came 21:59from the real users when we launched outside, that said 22:02like, okay, this is too slow for me to use. 22:05So we have a bunch of things that we captured 22:07from the community that we would like to incorporate. But 22:10also when I mentioned the ALTK, which is the 22:13core components that help, I think, agents or agent builders 22:17boost their agent performance. There are a couple things that 22:20we're working actively on. One of them is the memory 22:23that I mentioned. And I'm not talking about storage and 22:25data structures. I'm talking about what can you make out 22:28of this memory, like what you want to remember, what 22:30you want to forget. And what is the middle ground 22:32that you want to keep learning from. Because some tool 22:37combinations may never work and then you already did it 22:39and your trajectory is you have this like it's saved 22:42in memory and can you bring it up and do 22:45some self learning for Kuga or other agents?
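The failure-memory idea Merve raises, remembering tool combinations that never work so the agent stops retrying them, might look roughly like this. The class and method names are hypothetical, not ALTK's real components:

```python
# Hypothetical sketch of the failure-memory idea discussed above:
# record tool combinations that failed so the agent skips them next time.
# This is illustrative only, not ALTK's actual memory component.
class FailureMemory:
    def __init__(self):
        self._failed = set()

    def record_failure(self, tools):
        """Remember that this combination of tools did not work."""
        self._failed.add(frozenset(tools))  # order-independent

    def should_try(self, tools):
        """Return False for combinations already known to fail."""
        return frozenset(tools) not in self._failed

memory = FailureMemory()
memory.record_failure(["web_search", "pdf_reader"])

known_bad = memory.should_try(["pdf_reader", "web_search"])   # False: same combo
untried = memory.should_try(["web_search", "calculator"])     # True: go ahead
```

The interesting design questions Merve points at sit on top of this: when to forget, and how to generalize from one failed trajectory to similar ones, rather than the storage itself.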
And the other one is consistency. This is extremely important, and there is not a single definition of consistency in the literature. When you look, people sometimes define it as repeatability. But in the enterprise setting especially, and also in the consumer setup, it's important: you don't want your agent to do something one way one day and a very different way another day. So how can you bring this consistency to real-world agents, so that they are consistent within their own world, with their own behavior, and they don't throw really ridiculous answers one day when you ask the same question? So these are the two main topics that we are working on. And we're excited that we're making progress toward getting real feedback from the community and also advancing the capabilities of KUGA with these components.

That's great. Well, we'll have to track the project. How do people find out more about it if they want to keep up with your work?

Sure. There's kuga.dev, the KUGA website, where you can go to the GitHub and read all the blog posts and other things. And we have ALTK AI, the individual components that constitute, and help, KUGA perform better. So if they go to these two websites, they're all good.

Nice. Yeah, those are solid TLDs right there. Well, I'm going to move us on to our next topic. This is kind of a recurring theme for us in 2025, an interesting ongoing set of discussions about how AI will impact the economy. And I think overall the discussion has matured over the course of the last 12 months. In the beginning of the year we were still very much in the mode of: all the jobs are going to disappear because of AI.
And I think now we're getting more to the mode of, well, let's do an eval on that. We're approaching it in a very machine learning way. So OpenAI announced a benchmark that they call GDPval. Essentially what they're trying to say is: we have all sorts of benchmarks for evaluating AI capabilities, but one knock against a lot of these benchmarks is that they don't tend to be very realistic; people don't frequently sit down and solve complex math theory problems at work. What they do instead is basically curate a set of tasks from a number of actual professions, and they evaluate whether or not AI is able to produce outputs on par with a human expert. They run this as a way of trying to get an assessment of what the effects of AI are going to be on the economy, particularly against these economically valuable tasks.

And there are some interesting results. I think the big headline is that, even though it's a benchmark from OpenAI, they discovered that Claude Opus 4.1 is the strongest performer against these tasks, and in some cases it is able to reach near parity with human experts. So I guess maybe, Gabe, I'll turn it to you. I'm curious what you read from these types of results. Are we still back where we were in, like, December 2024, which is, oh God, AI is going to replace all the jobs? These are certainly really impressive results, but how do you parse through it?

Yeah, I mean, I think in professional settings the promise of AI has been: take away the stuff I don't want to do so I can spend more time on the stuff I do want to do. I think it's really easy to poke holes in benchmarks, because measuring things is really difficult. So I want to say up front: this is a really good stab at a new aspect of benchmarking.
And I think especially the reliance on human experts is important in this space. The holes that I saw immediately looking at this were that they're still doing basically one-shot artifact creation as the benchmark. And I don't know about you guys, but most of the time I spend at my job doing things I don't want to do is stuff that involves investigation, asynchronous this, that, and the other thing. In fact, when it comes time to create artifacts, that's the stuff I do want to do, right? Writing code is my happy place, even if I'm using an assistant for it. What is less happy is walking around trying to find the correct way to implement some little corner case on the Internet, or looking through a giant pile of corporate docs to figure out the right official approved tool to do a certain piece of my job. So every benchmark makes approximations of the space so that it can actually translate from a fuzzy human space into math; that's just the nature of benchmarking. And in this case they've made some approximations that, while valuable, still have some holes in them.

I think the other part that I found interesting was that they actually used human graders, at least as the gold standard for evaluating. Because even if you put aside the fact that these are canned problems, very well curated canned problems, but still canned problems, it's very hard to say whether the answer is right. At the end of the day you need a thumbs up or a thumbs down, or at best a value between 0.0 and 1.0, to say how well this thing did. And that's much harder the more complex the thing it's trying to do is. So using a human grader is in one sense the right approach.
But the whole reason we're in this GenAI boom is that we figured out ways to not have humans labeling data, and this sounds like we're back to humans evaluating data. So they also created a proxy for this with another model that could approximate what their human graders did. But now you've got AI evaluating AI, and you're kind of in a recursive loop there. So I think it's really interesting to see this try to tackle real-world problems.

The other piece that they didn't really articulate very well was: among the classes of problems that a given profession has to tackle, is this tackling the hardest ones or the simplest ones? Responding to email, responding to Slack, is not the most mentally challenging thing I do, but it's a lot of what I do all day, and I'm sure the same is true for a lawyer or a doctor or a nurse. Anyone in a professional capacity spends a lot of their time in the long tail of less mentally taxing work. So I'd be curious whether there's any way they evaluate these tasks against whether it's the low lift or the high lift that they're measuring.

Marina, you're smiling through Gabe's explication of GDPval. Curious about your take.

What I'm hearing from Gabe is: better, but maybe not good enough, right? We're assessing what is ultimately a very complex thing, like what a job is, you know. So yeah, I really found it interesting diving into this. First of all, props to whoever on their comms or marketing team thought several months ahead about what the write-up headlines were going to be, because the headlines of "AI can now do half of our jobs"? Great.
It's not what you're supposed to get from this, but fantastic. How they chose the jobs, the fact that they actually went to the BLS types of jobs, and how they went about it: very, very nice choices. So I like what they're trying to do very much. If you actually go and read through the data points, which they made available on Hugging Face, very good on transparency, it's mostly planning tasks, summarization tasks, the kinds of things that LLMs are actually pretty good at. And reading between the lines, it did seem to me that they had a lot of different submissions and they narrowed them down, and narrowed them down, until they had a set of maybe similar-looking tasks, even though it was across a number of jobs. And we're still only talking about a couple hundred tasks that really got made. So that is, first of all, a point: it's only going to be so many, and I completely agree with Gabe, there are going to be some tasks that are harder and some tasks that are easier. When you look at the prompts themselves, as I read them, they looked very detailed, very refined, written by somebody with a very clear idea of what they want. So I think a lot of pre-work already goes in before you ask, hey, write me this summary, write me this schedule; I have prepared some files for you, some reference Excel sheets, some sites I need you to go to. So there's a lot of pre-planning, and then this does the "all right, fine, put it all together." Probably this still helps a good amount. And if anything, it would give an example to people who don't understand what it means, what kinds of jobs the AI can be kind of decent at, which is: look at these.
Please read these 200 and take a look at what kind of thing this is actually pretty good at. So I like that part of it. Now, as far as the evaluation goes, look, it's pairwise comparisons. Is it the most sophisticated, detailed thing you could dive into? No. If I had to guess, having done evaluations for years and years, they probably tried a variety of other things and got a lot of noise, because the artifact you get out of each of these is going to be very detailed and very noisy, and very difficult even for a human to judge: this part is really better, this part is really not that much better. So they finally just ended up going with a win rate. So I wouldn't really listen to the headlines. I would look with interest at the paper, especially the appendices, where they really go into the process of how they did the evaluations, and again, what it means with the automatic grader versus not the automatic grader. I'll add one more thing: these are all prompts. So there is not a lot of breaking it down, exactly what Gabe said, exactly what we were just talking about with Merve, of first do this, then do this, here's an agentic plan, here's this kind of thing we can do. This is a prompt, and it primarily asks the model to present something akin to a plan that perhaps you would then execute on downstream. So again, this is a very particular type of task. It may be interesting in the future to say, actually, let's throw multiple models at this with multiple capabilities, maybe even specific models, et cetera. Now, is evaluation of that going to be harder? Exponentially, which is probably why their path at this benchmark was the way that it was.
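The "finally just went with a win rate" idea above can be made concrete with a tiny sketch. The label scheme (`"model"`, `"human"`, `"tie"`) and the tie-counts-as-half-a-win convention are assumptions for illustration, not GDPval's exact protocol: each judgment is a grader's pairwise preference between a model deliverable and a human expert deliverable for the same task.

```python
def win_rate(judgments):
    """Aggregate pairwise grader judgments into a single win rate.

    Each judgment is 'model' (model deliverable preferred), 'human'
    (human expert preferred), or 'tie'. Ties count as half a win,
    which is one common convention for pairwise evals."""
    if not judgments:
        return 0.0
    score = sum(
        1.0 if j == "model" else 0.5 if j == "tie" else 0.0
        for j in judgments
    )
    return score / len(judgments)

# Four graded tasks: two model wins, one human win, one tie.
print(win_rate(["model", "human", "tie", "model"]))  # 0.625
```

This collapses a very noisy, detailed artifact comparison into a single number per model, which is exactly why it reads so well in headlines and why the per-task data is worth reading directly.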
It was very intelligently done, very ready for headlines, ready for write-ups, ready for all of that. But I do like the interest in real-world tasks and the lens this can hopefully shine on which jobs we mean when we talk about AI taking our jobs. I wish there were more write-ups that dove into the actual data. I find that more interesting than the final evaluation.

Yeah, absolutely. That was a great analysis. Merve, maybe a final question for you. It strikes me that there's a really interesting tension in what Marina said, which is: you want the eval to measure real-world tasks, but it turns out the real world is really, really messy, and so methodologically it's just very hard to do a good eval here. There's a part of me that thinks we're going to spend a lot of time in the next few years developing increasingly sophisticated evaluations to measure the economic impact of AI, but the economic impact of AI is also just going to kind of happen to us. And I'm curious what you think about the value of this kind of exercise. There's almost a point of view which is: we can spend a lot of time trying to create better and better proxies, but we're also just in the middle of it right now, and the effect of AI on the economy we're just about to see. So I guess the question is: do you think these evals become more important with time, or less important, as a research area? Where should we be going with this kind of work?

Well, when we were discussing this, it made me remember a client conversation. A client who is also a friend of mine, last week in Istanbul; he is one of the CEOs of one of the leading hospitals in Istanbul.
And he told me that it makes him very uncomfortable to see that each doctor has a medical secretary attending patient visits. This is one of those 44 occupations they include in GDPval. And this is exactly the description of the people who are ready to adopt LLMs, because for certain tasks this is a perfect use of an LLM: it will listen to the conversation and it will summarize. Summarization is a very popular use case. And if you just take this task, fine, we can benchmark and evaluate it. But there are implications for other things. There are other added benefits that a benchmark would find almost impossible to measure. Consider the other aspects of this particular example: you can derive insights from doctor-patient conversations. You can extract, for example, best practices; one doctor may be doing something that works better for certain diseases, for a subset of patients. You can do better process optimization, like supply planning for a surgery, if you know what's happening in the conversation, or customize follow-up content. There are many different added benefits you can go after, but each of them is so difficult to measure. To Marina's point, even non-human evaluations are not that easy, and now we're adding this human component and derivatives of it. This is one occupation that I gave as an example, but it's real; I had this conversation last week with a person who is looking to implement this in their hospital. So now it comes down to where the real world is heading versus what we benchmark, and how we benchmark. When you implement these systems in the real world, you can also track the added benefits as ROI, like how much better you are in your supply management for your surgeries and so forth.
So you can have KPIs that help you understand the added benefit. But on paper, purely scientifically, trying to mimic and understand the combination of things these tasks can lead to, and then trying to benchmark against all of them with some data, whether human-annotated or AI-annotated, is to me going to be an extremely difficult and overwhelming exercise. But I do like this, because it's helping us get out of our academic mindsets when we compare models and usages, toward real-world examples of where industries are starting to adapt and change.

Yeah, I love that. It's almost like the eval is also just a useful exercise in asking, how do we decompose this task anyway? What do we do with our days is an important part of it.

Great. I'm going to move us on to our last topic of the day, which, now that I think of it, is weirdly related to what we were just talking about. The final news is this really interesting story that was actually tackled in more depth on the Security Intelligence podcast; Chris Hay, a frequent MoE panelist, was on there debriefing it. I'll give the summary of the story, and then I'd love to link it specifically to what we were just talking about. Anthropic disclosed that they discovered an actor, which they believe to be a state actor, misusing Claude to launch a sophisticated cyber attack. And they have this long, very interesting blog post breaking down what they discovered about the attack.
But I think the thing that really stood out to me was this quote, which I'll just read: "the threat actor was able to use AI to perform 80 to 90% of the campaign with human intervention required only sporadically, perhaps four to six critical decision points per hacking campaign." And as far as I know, this is the first real dead-to-rights example of what people have been theorizing as vibe hacking: the idea that as all of this agentic technology gets used for good stuff, at some point it also gets used for bad stuff.

And to link it to what we were talking about earlier, Merve, maybe I'll kick it back to you as our agent expert here. It does feel like, if we had not GDPval but a version of it for evil tasks, AI really is making a real impact in cybersecurity and cyber attack operations right now. I guess the question to ask is: do you think agents are going to favor illegitimate use cases faster than legitimate ones?

Well, even though I work on agentic systems, I'm not a security expert. But I did chat about this with one of my co-workers, Ian Molloy, who is leading our agent security work, and in his own words, it will be extremely difficult, maybe impossible, to prevent malicious use of these agents while preserving their legitimate use, because they're designed that way, right? We want them to be flexible, we want them to listen to instructions, we want them to do what we tell them to do. On the other hand, when we tell them malicious things, with these current alignment approaches I think it's going to be impossible to stop. But there is something we can do. I think security researchers have been predicting this from the beginning of the LLM era, even before agents, because you can also mount these model attacks without agents.
But what I love is that Anthropic had, I think, the perfect telemetry and monitoring to be able to talk about this and show what happened. And I think we can instrument our systems with these observability layers, or monitoring capabilities, that will basically show us transparently what is happening in the system. You can maybe revert back, or talk about it and learn about what happened. To me, we can't fully control these things. We can build additional components for security, like authentication and other things, and in enterprise settings that might even be easier than broad consumer use, because you can add controls when you implement agents in an enterprise setting; in broad use, the agent just does what the user says. But if we instrument the systems with these components, one of which could be, as I mentioned, observability and robust telemetry and monitoring, that's what I think can bring trust to users while they're using these very powerful models, whatever they do, so they feel comfortable with and can trust these systems.

Gabe, I was joking with a friend that there's probably a team at OpenAI which is kind of relieved they weren't involved in this attack, but also kind of jealous that the attackers chose Claude over OpenAI. I have a kind of interesting question here, which is: why not use open source? It feels incredibly risky for an actor of this kind to go with a model that's provided through the cloud and monitored the way that Claude is monitored. Do you have a hypothesis on why that's the case, versus saying, we're going to run our own on-premise solution, if you will, to run this attack?

My answer is that they probably are.
This is just the one that got caught, you know. And to your point about OpenAI feeling bummed that they weren't part of it, maybe they just don't have the telemetry to notice. We've seen one example of this and had it exposed to the sunlight, but that does not mean it's the only one out there. In fact, I strongly suspect, and they mentioned in the article, that it is not the only one out there. So I do think, you know, the frontier models are generally closed. Even now that we have extremely capable open models, like the recent Kimi K2 Thinking and MiniMax models, they're extremely hard to run at scale. So depending on the division of labor, if you've got a team that's expert on the cyber hacking side of things, they probably aren't the experts at running the expensive GPU rigs needed for these extremely large models. And what you get by running all of this through a frontier model is the hands-off nature of it, versus trying to put it together more piecemeal.

The interesting thing, too, is that in the cybersecurity domain the attackers always have the advantage, unless they are targeting one very specific needle in the haystack. If their goal is just to go get what they can, steal what's available, there's no penalty for screwing up, right? You just keep trying. So this same set of actors may very well also be banging on an open source version, and one using GPT Star, and one, you know, just try them all, why not? What's the downside, other than it costing money? But the defenders have the much more difficult task of catching everything that slips through, right?
Like, you have to have ironclad practices to avoid being in the spotlight of these things, so screwing up has huge penalties. The thing that did strike me as interesting in all of this, just to look at the defender view of it, is that even though the agent was doing a lot of the decision making and the scripting and the basics of how to craft this attack, it still all exploited standard vulnerabilities, right? It seemed to me that most of what they were doing was looking for systems that were running exposed and vulnerable versions of known exploits, and then exploiting them. So if anything, this just puts a finer point on enterprises needing to stay up to date with their CVE patch fixes, and all the best security practices that we've all had hammered into us. Just really do them, for real, do them like someone's going to find it. Really do them. So I think, on the defensive side of this, there's the AI element of the defense, which is: how in the world do we catch these things? What do we put into our AI systems as we're building them to make sure they can't be exploited? And then there's just the good old-fashioned cybersecurity of: patch your software, people. Do it for real, or you're going to get hacked. So I think it's an all-of-the-above strategy on the defense side. But if anything, this just makes it more urgent that that patch-fix timeline tightens up.

I love that we're kind of landing in a place which is: at the end of the day, these are not very complex tasks that are being implemented. Marina, I'll give you the last word if you have any hot takes before we close the episode today.
Nah, I agree with Gabe. Again, I'm not a cybersecurity expert, but it seemed like what mattered here wasn't creativity but scale: just try the same well-known things, but try them a lot faster than a human could. One thing I kind of liked was that the Anthropic team slipped in there that they actually used Claude to analyze the logs of what was going on. So something that comes to mind as well: should we maybe be thinking that the models themselves know what they would use if they were turned to evil, so they're more likely to catch themselves than what a human might come up with? And the different models have different biases, so maybe you could have OpenAI models check the Anthropic models and see what you find. But at the end of the day, I agree with Gabe: just do the basics. Because now you can have your basics broken a lot faster, so you really need to get the basics right. Please, please.

Well, with that bit of very good advice, we'll close the episode today. Marina, Gabe, Merve, thank you for joining us for the show today. That's all the time we have, and thanks for joining, all you listeners. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we'll see you all next week on Mixture of Experts.