
AI Code Generation: Past, Present, Future

Key Points

  • The episode frames code generation as the year’s biggest AI story, noting rapid shifts in software engineering driven by tools like Cursor, Windsurf, and Claude Code, and the rise of vibe coding.
  • Adoption has moved beyond early adopters; even former skeptics now rely on AI for project kick‑offs, and hiring processes are beginning to assess candidates’ proficiency with AI tooling.
  • Standardization efforts, such as the use of AGENTS.md files, are emerging to give projects a consistent way for AI systems to interpret and act on codebases.
  • Despite growing usage, panelists stress that current models remain “lazy” and untrustworthy for end‑to‑end tasks, with notable limitations and occasional catastrophic failures.
  • The future is seen as a blend of maturation—where AI becomes a staple in daily workflows—and ongoing turbulence as the technology continues to grapple with reliability and capability gaps.


# AI Code Generation: Past, Present, Future

**Source:** [https://www.youtube.com/watch?v=oRRDAZtJLmk](https://www.youtube.com/watch?v=oRRDAZtJLmk)
**Duration:** 00:35:09

## Sections

- [00:00:00](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=0s) **Lazy AI Models & Code Generation** - In this Mixture of Experts episode, Tim Hwang and three AI engineers examine the evolution, shortcomings, and future impact of AI-driven code generation, from model "laziness" and context-window limits to its reshaping of software development.
- [00:03:03](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=183s) **AI Tools Bridge Advanced Coding Gap** - The speakers discuss how generative AI, exemplified by Claude Opus 4.5, is moving beyond day-to-day or junior coding tasks to solve deep, performance-critical problems in low-level systems like llama.cpp.
- [00:06:38](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=398s) **Learning the Limits of AI Tools** - The speaker reflects on the difficulty of forming accurate intuitions about what AI systems can and cannot handle, emphasizing that the gap is rooted more in engineering culture and evolving norms than in inherent technical shortcomings, and that this understanding will improve with experience.
- [00:10:36](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=636s) **Model Differentiation in Code Generation** - The speaker explores whether models from OpenAI and Anthropic are developing distinct code-generation behaviors, and how any emerging differences might influence programmers' roles, tool preferences, and community identities.
- [00:14:39](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=879s) **Claude Code Simplifies Optimization Workflow** - The speaker praises Claude Code for automating file selection and parallel execution, emphasizing a powerful tooling layer that outperforms manual, mathematically driven attempts at optimization.
- [00:18:15](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1095s) **Debating Claude Code's Minimal Toolset** - The speakers discuss how Claude Code operates efficiently with very few built-in tools, questioning its design choices, context-window usage, and potential future enhancements.
- [00:21:45](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1305s) **Autonomous Deep-T Expert Agents** - The speaker predicts that as AI models become more capable, specialized "deep-T" (deeply skilled, T-shaped) expert agents will require minimal oversight, enabling parallel deployment without constant supervision.
- [00:24:59](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1499s) **Challenges Integrating Open-Weight Models** - The speakers explain that despite the power of open-weight models, most code-generation tools don't support seamless plug-and-play use, instead relying on hybrid agent architectures, and cite Continue as an example that works well with some models (e.g., Gemini) but poorly with others (e.g., Granite).
- [00:28:39](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1719s) **Open-Source Integration Challenges and Optimism** - The speakers contend that open-source solutions demand significant configuration and vertical integration, limiting their plug-and-play appeal, yet they remain confident that community effort can ultimately overcome these constraints.
- [00:31:47](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1907s) **Subscription Lock-in vs Inference Costs** - The speaker highlights how subscription-only AI coding tools force users to pay per-token API fees, arguing that only smaller, locally runnable models could break this costly lock-in.

## Full Transcript
>> The models are really lazy. Here's my favorite Claude Code one at the moment: "Due to context window limitations, I'm stopping right now." You're like, dude, come on, try harder. Do you know what I mean?
>> We were right in the middle of it!
>> Imagine a junior engineer just went, "No, it's 4:30 in the afternoon. I'm going to knock off in 30 minutes, so there's no point me looking at this. I'm going home now."
>> All that and more on today's Mixture of Experts.

I'm Tim Hwang, and welcome to Mixture of Experts. Each week, MoE brings together a panel of the smartest minds in technology to distill down what's important in the crazy world of artificial intelligence. Joining us today are three incredible panelists: Chris Hay, distinguished engineer; Olivia Boozek, lead developer advocate for AI; and Gabe Goodhart, chief architect, AI open innovation. This is going to be a fun episode. It's one of our end-of-year episodes, so we are departing from our usual news story format. I wanted to get this group together specifically to talk about the past, present, and future of code generation.

In my opinion, code generation is basically one of the biggest stories of the year for AI. From January to now, the work of engineering has changed in a very significant way: from Cursor and Windsurf to Claude Code and the rise of vibe coding. If you ask where the most salient impact of AI is happening right now, it's in software engineering and code generation. So, Olivia, maybe I'll start with you. The question is: what do you think comes next?
Are we now entering a mature space, or is next year going to be as crazy and tumultuous as 2025 was, in your opinion?
>> I think it's a little bit of both. What I've seen over the last year is that even the AI skeptics are starting to use AI in their work almost every day. When you're starting a project, you definitely use AI to get things off the ground. We have the evolution of things like the AGENTS.md files that people are putting in their projects so that you have a standard way for your particular project to be interpreted by the AI. And in hiring processes, we're seeing people actually checking whether candidates understand how to use the AI tools. All of that points in a strong direction of: this thing is here, and it's here to stay. At the same time, I think we see a lot of limitations as well. So far I have yet to hear anybody say, "This is as capable as a human. I hand off all sorts of tasks to it. I literally just tell it to look at my board, take off the next task, and take care of things for me," because it's just not there yet, and it's just not that trustworthy yet. And we have seen a few catastrophic failures over the course of the year.
>> Right. Yeah, and I did want to get into that, because it's almost a little bit barbell-shaped in my mind. Gabe, one of the things I wanted to talk to you about: I'm not a day-in, day-out coder. I used to be, but I'm terrible at coding, so I stopped doing it and moved to a different profession: podcast host.
And I guess the question for you is: at least for me, these tools have kind of revolutionized the game, because I can just sit down and start having fun. But to Olivia's point, do we still feel like there's a gap to using this at the frontier, on the most complex, hardest software applications? Is this still more in the realm of: if you're doing day-in, day-out coding work, or you're a junior engineer, that's where most of the action is happening? Do you buy that as a premise?
>> I would have said yes last week.
>> What changed?
>> I used Claude 4.5 Opus to crack a problem that I had been trying to crack for months, and it nailed it in under an hour. And this is a problem that is deep in the guts of llama.cpp: trying to get better performance out of the recurrent models, optimizing the Metal kernels, understanding the shape of grid layouts and threadgroup dynamics and SIMD group memory sharing, just the gnarliest corners of gorpy bits. And let me be clear: the official internet documentation for Apple Metal programming is a 2,000-page PDF. That's it. You go to CUDA and the internet is full of good information; I expect AI models to nail CUDA. But for Metal, I was blown away at how strong it was. So to that end, I will say I think the barbell-shaped analogy, or something shaped that is not nice and uniform, whatever physical shape you choose, is exactly my experience right now. These models can do some amazing stuff, and they can fall really, really flat in what should be really simple use cases.
So on the opposite end of the spectrum, the reason I would have said yes a week ago is that I also heavily used Claude Code to try to build out a CLI for a pretty straightforward REST API. And it did a fantastic job of cranking out a beautiful CLI with lots of nice pretty colors and inline JSON highlighting and all sorts of awesome stuff, with 100% code coverage, too. Like, the tests were great. They all just mocked everything and didn't actually test anything, and nothing worked, but it was so cool. And then I spent a week... somebody, I think from the Continue team, coined this phrase of "chiseling": you use it to create the rough block, and then you have to chisel out the shape of what you actually want your statue to look like underneath the very rough block that you just splatted out with a code assistant. So this was my experience. It actually probably took much longer. Now, the end product might be prettier, and I probably wouldn't have come up with all the latest, coolest CLI libraries myself. But the process of actually fixing all of the stuff that it just mocked away in the unit tests, and that said "sure, if I pretend that this is the right answer, then I got the right answer when I did this in my code," was really frustrating. In some ways I would have expected exactly the opposite experience. Generating a CLI against a well-defined REST API is bread and butter; that should just be point-and-shoot, forget about it. And then deep in the gnarly weeds of Metal optimization is where it would fall over and have no idea what to do.
>> So it's actually the reverse. Yeah, I think that's right.
>> Yeah. You're pointing out that the intuitions are flipped, right?
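The "tests that mock everything and test nothing" failure mode Gabe describes is easy to reproduce. Here is a minimal illustrative sketch (the `get_user_name` function and payload shape are hypothetical, not from the episode): because the test supplies the very response shape the code assumes, it passes with full coverage while proving nothing about the real API.

```python
from unittest.mock import MagicMock

# Hypothetical client code under test.
def get_user_name(session, user_id):
    # Assumes a flat {"name": ...} payload; nothing here verifies
    # that the real endpoint actually returns that shape.
    return session.get(f"/users/{user_id}").json()["name"]

# The failure mode: the mock is configured to return exactly the shape
# the code expects, so this "100% coverage" test can never catch a
# mismatch with the real API.
def test_mocked_into_meaninglessness():
    session = MagicMock()
    session.get.return_value.json.return_value = {"name": "Ada"}
    assert get_user_name(session, 1) == "Ada"  # passes; proves nothing

test_mocked_into_meaninglessness()
```

A test that exercised a real (or contract-verified) server would fail the moment the payload shape drifted, which is exactly what the mocked version hides.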
You're like, how could you not get this right?
>> For me personally, what it points to is that I just don't quite have a good intuition about what it's going to be good at and what it's not going to be good at. And I've tried to build that intuition, which means there's still some misalignment between the capabilities of these tools and the day-to-day mental model that at least I, as a developer, have around the complexity and difficulty of a task. So there's still some learning on my part to do.
>> Well, and I do want to go to that point, because, Chris, typically when we've seen these high-profile failures, people are like, "Haha, look at the terrible AI." And I'm kind of of the view that maybe those get ironed out over time as engineers understand what these systems are good and bad at. So it's less of a technical problem and more of an engineering culture and norms and understanding problem. We've got this hammer, but we're still not really sure what hammers are good for yet, so we're swinging it around going, "Oh, it wasn't really good for that." And maybe those problems disappear with time as we get a little more mature about how to use these applications. Do you buy that?
>> Yeah, I think so. One of the questions I like to ask myself with the coding models is: who is the architect? And I ask that question because if the coding assistant is the architect, then it's going to choose the framework. It's going to choose which libraries. It's going to choose whether to mock or not mock, et cetera. Right?
You're putting all of the decisions onto the model. And that is okay. If you don't really know a language, or you don't know the frameworks, or you're not a UI person or whatever, then you don't really have much of a choice. You're saying, "Actually, I don't quite know what I'm doing here, so I'm more in the vibing world; go do that." And I think it can therefore make bad decisions in that sense. And Gabe, to your point, the models are really lazy, right? If they think they can get away with just mocking something up, or they can just go, "Ah, I can't..." You know, here's my favorite Claude Code one at the moment: "Due to context window limitations, I'm stopping right now." You're like, dude, come on, try harder. Do you know what I mean? In the middle of it!
>> Imagine a junior engineer just went, "No, it's 4:30 in the afternoon. I'm going to knock off in 30 minutes, so there's no point me looking at this. I'm going home now." You'd be like, "Ah, you're fired." But so I think there's a fair point in asking who is the architect in this case. And sometimes it's okay not to be the architect, right? You're vibing, you're prototyping, you're doing whatever. But I think you're in a different paradigm when you want to start productionizing, and that's where you have to really use things like the rules: if you're in Claude Code, your CLAUDE.md or your AGENTS.md; if you're in Cursor, you need to use rules, or Cline or whatever, to really guide the model and say, "This is the architecture that I expect; these are the standards that I want you to follow." And therefore you probably have to put as much effort into architecting as you would normally do with architecture.
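The rules files Chris mentions (CLAUDE.md, AGENTS.md, Cursor rules) are plain Markdown that the coding agent reads before acting on a repository. A hypothetical sketch of the kind of architectural guardrails he describes (the project layout, commands, and rules below are illustrative assumptions, not from the episode):

```markdown
# AGENTS.md (illustrative example)

## Architecture
- This is a FastAPI service. HTTP handlers live in `app/routes/`;
  business logic lives in `app/services/`. Keep routes thin.
- Do not introduce new frameworks or HTTP clients without asking.

## Standards
- Tests use `pytest` and must exercise real code paths. Do NOT mock a
  function and then assert against the mock's own return value.
- Run `make lint test` and confirm it passes before declaring a task done.
- If context limits force you to stop, say exactly where you stopped
  and what remains, rather than silently truncating the work.
```

The point of the format is that the human stays the architect: framework, library, and mock-or-not decisions are written down once, instead of being re-decided by the model on every task.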
So I think maybe that is the big paradigm shift that is probably happening: architecture is going to become more important, but so is being able to write your architecture in a way that is AI-friendly and agent-friendly, as opposed to in a Word document or a UML diagram sitting somewhere on the cloud. It is really about orchestrating with the AI, and then you're going to get really fast feedback loops.
>> Well, and that's actually one of the things I do want to talk a little bit more about: the evolving role of the engineer or the programmer in all this. One of the things I'm really interested in is how all these models are differentiating with time. Gabe, you might have made this comment on a previous episode: we live in a world of model abundance. There are all these models, and they're all really, really good. So I'm interested in whether you're seeing differentiation. Olivia, maybe I'll toss this question to you: just take OpenAI and Anthropic for a second. Do you feel like these models are approaching codegen differently? Is the kind of code they're producing different in flavor, or is everybody converging on the same kind of code generation over time? I ask because you can imagine saying, "Oh, I really understand what OpenAI is good and bad at, but I have no idea what Anthropic is good and bad at." And that has big implications for how these models almost become a kind of programming language over time, right?
It's almost like a tribe, the way you'd say, "Oh, I'm a Pythonista." I'm wondering if that kind of thing is on its way, or if you're not seeing that.
>> Currently I'm seeing a lot of people just in an experimental phase with a whole bunch of different ones, because I don't know that we have solved that characterization; it may evolve more over time. I also think it's in some ways less about the models themselves and more about the agent architecture underlying those code assistants, and I think that is making a much larger difference. For example, when I'm just playing with the model itself in something like Continue in my IDE, I'm not getting that agent experience, and because it doesn't have very many agents to it, it almost doesn't matter what model I throw at a particular problem; it can only do so much. Where you see a huge difference, though, is in the actual planning for a task. In one assistant, the planning tends to be focused more around security and optimization problems, so it'll get stuck on that part; another agent will be more interested in doing this mocking thing that Gabe is talking about. So you'll see tendencies because of the agent architecture underlying it, which is of course completely opaque to the user, other than the way you sort of start feeling it out.
>> Yeah, that's actually really interesting. You're almost saying this is less a function of the model; it's the agent orchestration that is producing these differentiations over time.
>> Gabe, you're nodding and shaking your head; jump in if you want.
>> Yeah, I'm doing this weird nod-shake head thing at the same time, because, Olivia, when you said that, that's exactly the comment I wanted to make here as well. I've said this on many episodes, but the user experience of any one of these AI tools is a combination of the quality of the model and the quality of the system that is built around the model. And in this case, I have seen tools, literally multiple different tools using the Claude family of models, behave extremely differently with exactly the same flavor of problem thrown at them. And it comes down to the implementation of simple things like context compaction. What do you do when you get a 20,000-line C++ file thrown at you? Do you just explode, or do you carefully read it in chunks of 100 lines and keep going? What do you do when you are unable to find an answer on the internet, or when you try an experiment and it fails? How do you back up and try again? These things are all at that orchestration layer, and I think this is where the individual tools are going to differentiate themselves. It's why I keep coming back to Claude Code: of all the tools I've tried, it has the it-just-works experience nailed. Everything else has required so much more finagling and babysitting from me. Whereas with Claude Code, I don't have to select what mode it's in. I don't have to carefully say, "Oh, I'm only going to send you files with context that I know you can handle." I don't have to do any of that stuff. I just point it at files on the internet; I point it at files on my local machine.
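The "read it in chunks of 100 lines" strategy Gabe attributes to the orchestration layer can be sketched in a few lines. This is a hedged illustration of the idea, not any tool's actual implementation; the function name and chunk size are assumptions:

```python
def iter_chunks(path, lines_per_chunk=100):
    """Yield (start_line, text) pieces of a file so that a very large
    source file never has to enter the model's context all at once."""
    buf, start = [], 1
    with open(path, encoding="utf-8", errors="replace") as f:
        for i, line in enumerate(f, 1):
            buf.append(line)
            if len(buf) == lines_per_chunk:
                yield start, "".join(buf)
                buf, start = [], i + 1
    if buf:  # trailing partial chunk
        yield start, "".join(buf)

# An agent harness would feed each chunk to the model in turn, carrying
# forward a running summary instead of the raw 20,000-line file.
```

The interesting design work is everything around this loop: what summary to carry between chunks, and when to go back and re-read a region at full resolution.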
It asks me at the right times when to do what operations, and it goes to town. So that's my personal favorite these days, and I really think the tooling layer is really important. That said, the reason I was doing the funky head-shake thing is that, again using this example from the last couple of days, I did this work on the Metal optimization with Claude Code and 4.5 Opus and was blown away. Literally in parallel, I had...
>> You did it, Gabe. I mean, you just said you knew nothing about it, and it was nice that Opus figured it out for you. How much did you actually do?
>> Okay, all right, we'll unpack that one. So, actually, yes. And this goes to your comment, Chris: I love your framing of who's the architect here. I have been banging my head against this problem for months now. I've been trying to tackle it from the mathematical perspective of reformulating the SSM scan operation as SSD, following the Mamba paper, blah blah blah, and it turns out I was looking in the wrong place. The right place to look was the very inefficient SSM conv implementation that didn't take advantage of thread grouping. Who knew? So I actually was the one who figured that out myself, by carefully commenting out chunks of code and realizing that if I took away the SSM conv operation, I got double the performance, which was the light bulb that said, "Ah, shoot, I've been looking in the wrong place." Then I went over... and I've read this kernel many times myself. I had not seen anything that says "this is clearly a problem," because I don't know the ins and outs of how the Metal GPU is architected.
So I got all the way to the point of: I found the problem, but I don't know what to do with it. I pointed Claude at that and said, "Claude, here's what I'm experiencing. Here are literally the commands I've been running to isolate this. Here's the line I had to comment out to get to this point of discovery. Please take it from here." And it was able to say, "Oh, I read that code. Thank you for the pointers. The problem is right there, this line." And that was where we got to. So I did a lot of the work to get there; Claude did the work to actually solve the problem.
>> And actually, Gabe, to your point, I think you're really stating where we are just now, which is: if you don't know what you're doing at all, you will only get so far. It won't be the most maintainable code. It will be a bit muddy, whatever, et cetera. Today, you still need the human part of that loop. You need to guide Claude. And actually, to your point, Claude Code really does deal in a couple hundred lines at a time. So it's a very, very narrow window, and you can direct and push it to different places. If you need that broader view, and you need it to look at the larger context, you're either having to do some thinking yourself, or you're going to the Claude web interface and typing in there, "Think a little bit further for me." But the point is, you do need to do that thinking today. I'm not so sure that is going to be so necessary in the future.
>> What I was going to ask is: the hundred-line thing, that's a design choice by Anthropic, right?
>> Yeah. Yeah. Yeah.
>> So how do we read that? I mean, the most generous interpretation is that they actually want you to do some thinking, but I don't know if you would read it that way.
>> I think they just don't want you to burn your context window. Do you know what I mean? I really think it's as simple as that. But it is actually remarkably efficient at it, right? I mean, if you look at the tools that Claude Code has, it actually has very few tools. It's got fetch, it's got grep, it's got bash.
>> I mean, what more do you need? What more do you need?
>> I'm going to be honest, that's actually good enough.
>> Yeah, exactly.
>> Well, you get them via bash, Gabe, so you know.
>> That's true. That's true. Yes. So, in reality, it actually has very few tools, but it is incredible in the way it's able to execute. I sometimes question my lifestyle choices in building MCP servers every so often, going, "Am I wasting my time here?", because Claude Code does so well with so few tools. But I just think that reflects where we are today. I do believe that in the future the tools are going to get more efficient. They're not only going to be using grep; if you look at things like Cursor, for example, they are indexing your code base. I'm sure Claude Code is on that path already; in fact, I think they released something recently. And therefore I think a lot of those constraints we're talking about are going to go away. Ultimately, I do think that you are still going to be part of that loop. You still want to be that architect.
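Chris's point that fetch, grep, and bash go a surprisingly long way can be sketched as a tiny tool-dispatch table. This is a hedged illustration of the general agent-tooling pattern, not Claude Code's actual implementation; only the three tool names come from the conversation, everything else is assumed:

```python
import subprocess
import urllib.request

# A minimal agent tool belt: three primitives cover most code work.
def bash(cmd):
    # Run a shell command and return stdout. Builds, git, tests, and
    # grep itself are all reachable through this single tool.
    return subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=30).stdout

def grep(pattern, path):
    # Search files; delegating to the system grep keeps the agent thin.
    return bash(f"grep -rn {pattern!r} {path!r}")

def fetch(url):
    # Pull documentation or source from the web.
    with urllib.request.urlopen(url, timeout=10) as r:
        return r.read().decode("utf-8", errors="replace")

TOOLS = {"bash": bash, "grep": grep, "fetch": fetch}

def dispatch(name, *args):
    # The model emits a tool name plus arguments; the harness dispatches.
    return TOOLS[name](*args)
```

The observation in the episode is that the differentiation lives less in the tool list than in the loop around it: when to call which tool, how to compact results, and how to recover from failures.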
But I think the progression is... just treat it like another compiler. I hate to say it this way, but we went from punch cards to assembly to C to C++ to Java, and then to Python, to JavaScript, to TypeScript, et cetera. You've gone up and up the stack. Apart from Gabe and his story about looking at Apple Metal there, do we really go and look at the assembly code that often now? No, because we trust the tools to do that job. And I just wonder... in fact, I don't wonder, I'm pretty sure, we're in that kind of paradigm shift. You still need to know what's going on under the hood, right? But I think we're moving to another, higher level of abstraction going forward.
>> Well, yeah. And to wrap up that where-we-are-today point, what I was going to say is that in parallel to doing this with 4.5 Opus, I did a simpler task with 4.5 Sonnet, and it needed way more oversight than what I had to give Opus. Opus actually did a great job of looking at git history, looking at adjacent files, looking at all the pointers I gave it, both on the web and locally, and ultimately needed very little oversight in solving a complex problem. Sonnet, on the other hand: I pointed it at a gnarly problem that's very hard to test, because it crashed my terminal every time it was triggered. Literally the whole terminal app just died, which was a pain. But Sonnet claimed success like three or four times, and I had to keep going back and saying, "No, I'm pretty sure that's not right. I'm pretty sure that's not right."

So there is a model capability difference here, and I think that's the difference we're going to see. I haven't tried this against Gemini 3 or the latest versions of the OpenAI Codex models or any of the other latest-gen ones, but I suspect that the capability difference we're going to see is essentially: how much oversight do you have to give this sort of deep-T expert that you're pointing at a specific problem? The thing that I think is going to be really interesting for next year is that individual, task-oriented deep-T expert. When I say deep-T, I'm referring to deep, T-shaped skill sets. What we're seeing right now is that if you give one of these models a well-researched problem in a domain where you are not yourself a deep-T expert who could go into great depth, it can actually do a very, very good job of solving it. And the more capable the model, the less you have to supervise that solution. Going forward, I think we're going to see the paradigm we see peeking out from under the covers with Google's Antigravity: I've hit a point where I can actually reliably trust that that deep-T expert is going to get the problem correct, so now I can start launching a bunch of these in parallel and not babysit them. I think that's the holy grail to get to next year. We're definitely not there yet. From what I hear from Antigravity users, and from other attempts at fleets of agents and becoming an agent manager, I don't think we're there yet. But I just smell-test based on the capability gap from 4.5 Sonnet to 4.5 Opus.
I'm curious whether we will. I feel at least some optimism that we'll get to the point next year where you can queue up large quantities of tasks, have them operate independently, and basically only supervise them when they come back and tell you they're completed.

>> I'm going to move us on to a final topic, particularly fitting given the folks on this panel, but I do want to take the last few minutes of this episode to talk about open source. One of the big meta-narratives of 2025 is that open source keeps catching up. It used to be, oh, give it a few months and open source will have what the state of the art had. Now it feels like it's basically at par, or even getting ahead of, the proprietary models. Olivia, just to hear from you on your experience with this, the question I have is whether we're going to see that pattern also happen in the codegen space, where these open models start to be able to do code generation at, I don't know, kind of a Claude level. Is that in the offing? Why or why not?

>> Yeah. So I think we have to make a little bit of a distinction here between open weights models and open-source frameworks that are being used to do codegen. I mentioned Continue, which is an open-source framework you can use for codegen. As I mentioned, though, they haven't really leaned into the agentic pieces yet, and so you're kind of on your own in terms of making that model highly performant.
But then a lot of these models we're talking about are in fact open weights models, where you can download the weights and put them behind a whole bunch of different things. What we're not seeing yet, I think, is an openness within the most common tools to just use any open weights model on the open market. We're not seeing every code generation tool saying, well, you can just pop in whatever open weights model you want. They end up doing this hybrid synthesis of an open weights model combined with an agent architecture designed for that particular model. So I think we're still seeing a lot of those combinations being more successful than the open weights models themselves. That doesn't mean the open weights models aren't powerful; it just means they need a lot more guidance rather than being usable off the shelf.

>> The one delta I would offer on that is that Continue actually has leaned heavily into agents. But Continue, like many open-source tooling layers, is trying to split the difference between running against local models and against hosted closed models. Their agents work great if you plug in Claude or Gemini. I spent a bunch of time last week trying to get it to work with Granite 4 Small, and it does not work very well. There are others out there, like OpenCode, which I also tried extensively with Granite 4 Small, to similar effect. Now, part of this could simply be the size of these models. I haven't tried running it against a really large, frontier-level open model, because I can't run that on my dev box.
But I also think there is an inherent advantage for closed ecosystems in being able to co-evolve the model and the tooling together, so you're not trying to maintain this level of separation between the model's capabilities and the actual agentic patterns around it. All of these agentic tooling layers, for coding or otherwise, involve a great deal of prompt engineering and a great deal of manual tuning: ah, I've seen that it tends to fail in this corner case, so either I need to code my way or prompt my way out of that corner case. It's just really hard to do that in a model-agnostic way, so I think that's one of the big advantages. I haven't personally tried Qwen's direct Qwen 2.5 Coder local CLI. I probably should give that one a shot, because I think that's an example of an open ecosystem trying to do this, where they have a model-specific open tooling layer. The one I have tried is pointing OpenAI's Codex at GPT-OSS-120B, and I would say that is a solid step up from running Continue or OpenCode against Granite 4 Small. Again, model size is a big element here, but so is the pairing of the model's capabilities with the agent side. So I don't have a clear, decisive answer, but I do think you're spot on to point out that this really has to do with the software layer, and that's probably where there's the most catch-up to be done on the open side relative to the model capabilities. But because of the loose coupling in open source, I think it's going to be a little harder to get to those peak performance capabilities.
>> Yeah, Olivia, this almost feels like a story of vertical integration, where maybe there's a structural advantage. The dream of open source is that you take a bunch of components off the shelf, click them together, and with a little spit and polish it works. But it feels like here, the amount of work that still needs to go into getting the model and the software to work together is something where open source almost structurally has a problem. It's strong at other things, but in this particular case it might have some limitations. Does that mean we should be a little pessimistic about fully open ecosystems for codegen, or do you feel like there are things the community will do to deal with this?

>> So I'm still heavily optimistic about it. I just don't think we can draw the conclusion that open source is ever an off-the-shelf solution. You're always, and I think this is true in every space, you know, if I look at an Ubuntu GUI, I can do anything, but I have to know what I'm doing. Even to this day, if I'm using Linux for something, it's going to require more configuration, but I can configure the heck out of it and get exactly what I want from it. So I think we'll see more of that.
So imagine a world where you end up having a lot more control over that agent architecture, and you also get to choose your open weights model. Then you can say, well, I do these particular types of tasks, this is 90% of my work, and it turns out Claude Code doesn't necessarily get me there, or Codex doesn't necessarily get me there, but I'm doing this particular type of task all the time. I'm not going to speculate about exactly which task that would be, but I believe it's going to exist, and these open ecosystem pieces are going to enable it. I also think the development of this open ecosystem enables rapid innovation sharing and makes sure we're always working at the state of the art, which is not something you get when everybody is fully closed. In a world where everybody is completely closed, we can never make the comparison: is this model actually making the difference, or is this agent architecture making the difference? Once it's that tightly held together, you're only able to look at whether the two together are succeeding. You can never say, okay, I'm just going to swap out the open weights model underneath, change Opus to Sonnet, and then try GPT-OSS. If you're completely unable to make those comparisons, you'll never know: is this caused by my agent architecture, or is this caused by my model?

>> Hey, cool.
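The comparison being described amounts to an A/B harness: hold the agent loop fixed and swap the model underneath it. A minimal sketch of the idea, where both "models" are stubs rather than real endpoints, and the loop itself is a deliberately toy plan-act-check cycle:

```python
def agent_loop(model, task, max_steps=3):
    """A fixed, model-agnostic agent loop: act, record, check for completion."""
    transcript = []
    for _ in range(max_steps):
        action = model(task, transcript)
        transcript.append(action)
        if action == "done":
            return True, transcript
    return False, transcript

# Two stub "models" behind the exact same loop. Because only the
# model changes between runs, a difference in outcome is attributable
# to the model, not the agent architecture.
def strong_model(task, transcript):
    return "done" if transcript else "edit file"

def weak_model(task, transcript):
    return "edit file"  # never signals completion

ok_strong, _ = agent_loop(strong_model, "fix bug")
ok_weak, _ = agent_loop(weak_model, "fix bug")
```

In a fully closed stack, only the combined `agent_loop(model, ...)` outcome is observable, which is exactly why the attribution question becomes unanswerable.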
And I think the thing I would say is that the biggest problem, in my mind, is the cost of inference. If we really analyze this for a second: the folks using Claude Code are all sitting on our various plans. I sit on my Max plan, whatever tier I'm on, and therefore I'm never worrying about the cost of tokens. I would not pay the API cost in a million years. So I'm using Claude Code, and the only way I can use my Max plan is through Claude Code. I can't use an open-source tool to connect to it; that's not allowed. And they're not the only ones that do that. Gemini is the same. Codex is the same. Even Qwen and Kimi K2 offer similar plans. You're on a subscription plan, and you can only go through those tools, so you are locked into that tool. Now, you can go and talk to the other models, but you're going to be paying the API costs, and that is a problematic element. Whether you're Cursor or Windsurf, that's why they're all developing their own models: they need something that can satisfy the subscription plan, because people don't want to pay per use, they want to pay per subscription. So when is this realistically going to change? The models need to be much, much smaller.
If you have a coding model that is 3 billion parameters, or 7 billion parameters at most, that can run on your machine and is as capable as Opus 4.5 is today, then at that point all of the open-source tools can go wild. But until then, given the cost of tokens, I think you're in that vertical stack, and it's hard to mix and match. Technically you can do it, but economically it just doesn't make a lot of sense. Now, Gabe, to cover your points about the capability of the models, the good news is I have played with a bunch of those models. I'm a model connoisseur; I love playing with different models. When I did my Kimi K2 video, I was genuinely surprised at how good that model was. I was like, whoa. I mean, it's not at Claude 4.5 Sonnet or Opus levels, but it was pretty darn good. And I would say the same about DeepSeek: I was playing with the DeepSeek V3.2 reasoner model at the weekend, and again, an incredible model. But do you know what? Can I run that open weights model on my machine? No. I can download it, sure, after a few days, but I've got nothing I can run it on. So I think inference needs to be sorted out, and until then we're going to be sitting on these vertical stacks.

>> Yeah, absolutely.

>> 100% agree.

>> Yeah, same.

>> Well, on that note of unanimity: Gabe, Olivia, Chris, this panel is fire. I wish I could bring it together once a quarter. It's amazing to have you all on the show. And that's all the time we have for today. Thank you to all our listeners.
If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we'll see you next week on Mixture of Experts.