
Apple's Modest AI Rollout

Key Points

  • Apple’s new AI rollout is modest, focusing on privacy‑centric on‑device LLM features like text rewriting, email summarization, and emoji generation, but it isn’t compelling enough to drive immediate iPhone upgrades.
  • The panel stresses that the success of autonomous AI agents will hinge on robust control mechanisms and clear benchmarks, warning that insufficient safeguards could spur increased fraud.
  • While Siri is expected to improve—thanks to Apple’s history of polished user experiences and upcoming customization options—many users remain skeptical about its practical usefulness.
  • The discussion highlights a broader industry need for agreed‑upon evaluation standards to reliably measure AI progress, noting that current benchmarks are necessary but not sufficient.

Full Transcript

# Apple's Modest AI Rollout

**Source:** [https://www.youtube.com/watch?v=j9vTEhimRqk](https://www.youtube.com/watch?v=j9vTEhimRqk)
**Duration:** 00:38:30

## Sections

- [00:00:00](https://www.youtube.com/watch?v=j9vTEhimRqk&t=0s) **Apple AI, Agent Challenges, and Siri Outlook** - The panel discusses Apple's modest entry into AI, the difficulty of building reliable autonomous agents and the need for unified benchmarks, and debates whether Siri will ever become genuinely useful.

## Full Transcript
So it turns out that Apple is starting with AI pretty modestly. It's not going to get me to buy a phone. All right, so level with us: how hard is it to make agents actually work? I think control is going to be the key thing that makes or breaks autonomous agents. I think there's going to be a lot more fraud coming. Benchmarks are necessary; we need to all agree on a particular thing we're looking at. They are not sufficient. All that and more on today's episode of Mixture of Experts.

Hello everybody, I'm Tim Hwang, and I'm joined today, as I am every Friday, by a world-class panel of researchers, product leaders, and more to hash out the week's news in AI. Kate Soule is a program director for generative AI research. Marina Danilevsky is a senior research scientist, and Maya Murad is a product manager, AI incubation.

Okay, so as we usually do for Mixture of Experts, we're going to start with a quick round-the-horn question, and the question I want you all to answer is this: if you have an iPhone, will Siri ever be any good? Maya, what do you think, yes or no?

I think so. I think Apple have a great track record at amazing user experiences, and I know they took their time with their AI, but I know it's for the benefit of how I interact with my phone. Definitely.

Marina?

Maybe. I still find myself fighting with Siri a whole lot and giving up on it most of the time. It's really great at reminders, for sure.

And Kate, what do you think?

Yes, assuming they can get to the user customization.

Well, let's just get into it, because our first story of the day is to talk a little bit about the Apple Intelligence updates of this week. The background, of course, is that there was a big WWDC announcement earlier in the year that announced Apple's long-awaited drive into artificial intelligence, and,
you know, this week we basically saw a slate of announcements continuing to hype Apple Intelligence, but in the context of the new iPhone 16 release. And they really announced a whole slate of things. I was looking at the blog posts: LLM assistance on text, image search, image generation, and, what we were just talking about, a Siri update. And I guess I kind of want to ask you, almost to begin, as a user of all these products. We were talking a little bit before the show, and it sounds like all of the panelists, myself also included, have an iPhone. And I guess, Maya, I'm curious: are you excited about any of the features that are on the way, and if so, why?

It's not really, actually, right now enough for me to go out and buy a new iPhone and try to get those features. The features that are there, while it's nice to have the LLM locally (and I always do like Apple's stance on privacy), I mean, what can they do? All right, we can rephrase text, we can summarize email, we can, I think, generate emojis, which would be really fun when texting my kids. None of it is still in that "I'm going to pay $1,000 now for a phone" territory. I'm not going to pay $1,000 for custom emojis. So they're helpers, and the helpers are nice, and I'll appreciate them when I get them, but it's not enough for me to go, "Wow, this was something that was a game changer by Apple." At least, that was my feeling.

Yeah, and I think that's kind of the most interesting thing. There are almost two points of view. One of them is that Apple's getting it all wrong: AI is the killer feature and it really will sell phones. The other one is that they have it exactly right: none of the AI features that are currently on the market are good enough to motivate someone to actually
buy a phone. The software is not really pushing the hardware here.

Well, I mean, I think from my perspective, it's not going to get me to buy a phone. It's not something that's going to push the boundaries significantly, but at least it gives Apple an entry into what they've largely been standing away from. So, you know, I think it's definitely helping them move in the right direction, hopefully. I just don't even use Siri today, because it's always a 50/50 shot: is Siri going to...

Yeah, I don't even use it either. They're pushing it, with the button and everything, but I never touch it.

You need to get to the point of basic Siri usability, and I don't think you can get there without LLMs. So, you know, I think they're making the right move on that call, and then later on I think they have more opportunity, once they lock that in, to actually differentiate both hardware and software with AI.

Yeah, for sure. Maya, what do you think? I mean, one way of looking at some of this is that Apple, I mean, they're a hardware company, right? So they're very careful by nature, because if you mess up hardware, you really mess it up. But are they maybe too slow? By the time Apple gets all this stuff done, you know, it's going to be OpenAI on the phone and Anthropic on the phone and Perplexity on the phone.

I think that's a good question. I think part of Apple's ethos and appeal, especially when they started decades ago, is this focus on design and the user experience, and doing less. They always did less compared to the competitors. So I think we're in a space where user experience matters. I think we're overloaded cognitively with too many things on our phones. I'm really excited about Siri being able to navigate
different apps. So if they nail that, for me, that delivers tremendous value. And then, I don't see them being in the hot spot of having to respond to generative AI. They're not search; their core is not search. They're not a chatbot in themselves. They're hardware; they're a way for me to communicate, with Siri and add-ons. So I don't think they're feeling the same pressure as some of the other tech players are.

Yeah, for sure. I think it's going to be an interesting situation, because, I don't know, the other week I thought, let's try out all these AI services. So I signed up for a bunch of subscriptions, and the first month's bill just came in, and I'm just like, this is really bad. But it kind of feels like, maybe if Apple can release some of these features for free, it totally changes the economics of some of this stuff. Kate, I see you nodding.

Well, I was just wondering, Maya, if you had a perspective on this: does Apple have an advantage, given that they are the integrator between all of these apps, beyond what any one app, like a Perplexity or Anthropic app that you could also install on your phone, might have?

Yeah, I think absolutely. They own the ecosystem of apps, so they own that App Store. And I think that would be, if they could lower the barrier in the interface of connecting between different apps, that would be really interesting. And I wonder if, a year from now, Apple will be the AI killer in the same way that when OpenAI launches a release, it kills a bunch of startups.

Yeah, that's for sure. I mean, one of the things I think a little bit about is, there have been all these demos floating around, and I think they're often more impressive than they actually are in practice, but it's like, type what you want
and a new app emerges. And it kind of feels like that sort of thing might eventually happen on Apple, but it also is this enormous threat to this whole edifice of the App Store that they've constructed. So, I don't know, navigating that, I think, is going to be really complex and challenging.

I think Maya brings up a really good point: they're more likely to gatekeep the App Store until they've got their own integrations working, and then it'll be all about how seamless it is. Otherwise, if you're going to have different services having to try to talk to different apps, there's so much under the hood in trying to navigate different kinds of middleware that you'll have so many points of failure that people will say, all right, well, look, the Apple version maybe can't do as much, but at least it works. And there seems to be a real opportunity potentially there.

Yeah, for sure. And it reminds me a little bit of a debate from many years ago about, okay, how do we get the self-driving car thing to work?

And one of the ideas was, they're always five years away.

It's always five years away, yeah. And I think one of the most interesting things was this debate over, do we need to reconstruct the whole environment to make it simpler for the robot cars to work, or do we just let the robot car roam and try to train it around all sorts of environments? And there's a similar thing here for agents, I guess, in the AI case for Apple, where they confront this very heterogeneous kind of situation in the App Store, which really prevents them from enforcing clean agent experiences. And it's like, well, I guess if anyone can do it, it's Apple, because they have the most control, at least, over
the space.

Well, it's a really good segue, because I think, Maya, one of the reasons we were really excited to have you on the show was the second topic I wanted to touch on today, which is, you know, literally in the last ten episodes of Mixture of Experts, people have been saying: agents! Agents are on the way, agents are going to be the new big thing. And we've kind of debated it back and forth. And, you know, you've actually been working on agents, right? And I feel like, so infrequently, the circle of people talking about agents and the people actually working on agents is a very different kind of delta, and you're pretty rare in that respect, because I think you've been really in the trenches on it. Do you want to tell us a little bit about that work? I was curious both to learn a little bit about it and then kind of what you're learning.

Yeah, of course. So I sit in a really interesting team in research. My team focuses on incubating new technologies and opening up market opportunities for IBM, and we've been focusing on agents for several months now, so very much in the trenches. And one of the first things that we did this month is we open-sourced a framework for building agentic applications. So, still very early days; we did a silent drop, but we think we have some interesting features that we can offer in this space that reflect some of our learnings. And I don't know how much time we have, but there are a lot of things that we learned along the way, and I think it's very hard to bring agentic applications into production, and it's very easy to take that for granted. I think in terms of operational complexity, this is a step change from fixed flow. It's not incremental... it's
another paradigm. It's much harder than fixed flows in terms of implementation.

Yeah, we'd love to talk about learnings, and we've got time for it. I mean, I feel like this is where Mixture of Experts can shine: lots of people are talking about Apple; let's talk about really building agents. I mean, maybe one way to cut through it is, is there something that you found surprisingly hard? Where, before you went into it, you thought, ah, we can nail that, no problem, and then: actually, this is really difficult?

So there were two things that kind of took us... they were a blind spot for us, and we didn't expect how hard they would be. And then I think there's one thing that made us very clear on how we bring this into production. So I'm just going to talk about it at a high level. First thing is, an agent is underpinned by a prompt; think of a set of instructions that tell it how to behave. So let's say you built an agent around model A, and then you optimized it around model A. Now, if you want to bring it to model B, the whole thing breaks. And we had this experience firsthand. We started dabbling with Llama 3; we moved to Llama 3.1, which we expected to be an incremental change (nothing much changed), and the whole thing broke. It took us three weeks to re-optimize everything under the hood, and we're still not fully there. So I think this is critical, because if you want to stay on top of the latest and best models, you have all of this cost related to changing models, and that makes it really prohibitive to try new models. And I think this is pushing us in another direction, where once you've picked your model of choice and built something in production with it, it's going to be
really hard for you to change model providers. And I don't know that I'm very happy with that being the status quo. So that's one part of it, and we have some ideas about how to overcome it. The other part is something I was actually discussing with Kate Soule this morning, which is: we take for granted how to build with AI. Meaning, if I'm building traditional software engineering applications, I specify my features, then I code my features, then I test my features, and I'm done. With LLMs, I have features I maybe didn't sign up for in the first place, like outputting hate speech; and features that are useful, like summarization, features that kind of come out of the box. But did I test for them? No, I kind of took them for granted. And I think we made the mistake initially of taking those features for granted and not following a test-driven approach. And I actually want to pass it over to Kate, because she works a lot in prompt optimization, and I'm sure you've had your own struggles with this.

Yeah, thanks, Maya. You know, one of the things that I found really interesting when we started to get into the weeds is how you think about this kind of hierarchy of how a model or agent is aligned. So there's all of the work that the model provider does in order to train it to be safe, to make sure it's good at things like summarization and basic language tasks; they have a perspective that they enforce on the model and how the model should behave in situations. Then you have the alignment preferences and priorities set by the agent builder, and so you have all of these behaviors about how the agent is supposed to behave: how this model with this system prompt together are now going to interact in this new agentic environment,
what patterns they're going to follow. And then there's even a third level, which is, when a user is interacting with the agent, I might have my own preferences on how I want the agent to behave. And so this gets back to, for example, Siri, and the question of whether Apple has the ability to actually personalize to an individual user. As a user, I might want a very specific way of interacting with my agent: I want things to be short, always, and in bullets; versus, I want everything to be in markdown format and much longer, with tables and everything else inserted where possible. So there are these different tiers you have to start to account for when you're building models, and you have control over some parts. You can impact your system prompt design for the agent; you can try to create tools and different parameters that users can play with to impact their own personalization and alignment of the agent. But then there are things you don't have control over that are defined by the model builder. And so, I think that's where some of the interesting challenges come out: trying to navigate those different levels of control that you have, and trying to work within the system that different model providers are setting up, particularly if you envision ever needing to switch models to something new, where a different provider might have set a different process.

Yeah, it almost presages this really interesting thing of, like, legacy code or legacy models, where the intention will always be, let's move to the next great capability model. But the story that you're telling is, it changes so much about the way the agent behaves that it's almost
like, in many cases, you may not want to, because of the uncertainty, or at the very least the evaluation burden of trying to figure out how to get that to work.

Well, one thought here: the notion of backward compatibility is not something that exists in LLMs. We haven't really thought about it that much; it has not been a thing. But we know that it's a really, really big deal in everything having to do with software: the notion of backward compatibility, and did everything immediately break, and is that really all that useful? So if we're going to be doing this seriously, I think you're going to have to start taking that into account, and have people create a whole new slew of benchmarks and functionalities and tests: it was working this way before; how is it going to work if you're trying to plug it into something old? Another thing, and I completely agree with what Kate was saying, is that notion of control. Think about generative AI with art. You can give it a prompt to make a picture, but then you can't tell it, okay, I love it, but just change that little thing over there to something else. It won't work; that's not how these models work. That's a real challenge, because in software it's like, all right, well, you did a lot of this right, but I need you to change this little bit here and this little bit there. It's not going to work like that, at a very basic level. So both of these things, I think, are a very different way of looking at software building, and they need to be thought through very carefully, to make sure this is actually practical over time.

Marina, I was wondering if you had a perspective: does getting towards something like GPT structured output start to solve some of the backwards
compatibility? If we can now have greater structure on exactly how we prompt the model and exactly what the outputs are, does that start to solve some of the problem in your mind?

It's a step. I think part of it certainly is the structure of the output, but another very large part of it is: what are the acceptable states, and the constraints on what makes sense or not? Even if you have the structure, the content of that output could still, theoretically, be anything. We're still talking strings or other primitives, and there's still a notion of what kinds of states in your application, as you're going through, are okay and not okay, valid and invalid. In a deterministic flow, you write it all up and you know exactly what will and won't happen; you've got tests, you've got catchers, you've got things like that. How does that look here? That is, I think, the next thing on my mind.

Yeah, I think both of you raised a really great point, and I loved how much time you spent speaking about control. I think control is going to be the key thing that makes or breaks autonomous agents. These things can run wild and have consequences that could be costly. We've seen firsthand how data that you might have thought was proprietary gets sent to an external tool and now goes to a third party, and maybe you didn't intend that in the first place. So one of the thoughts at the back of my head is, I don't know that it's fully autonomous agents that will go into production this year. Rather, it's more of a hybrid sort of compound AI system, where some parts are agentic, meaning there are degrees of freedom the LLM can take, and other parts are more prescriptive: verifiers, things that allow us to get the level of control we want. And I think that's some
of the ideas we're starting to explore here, because I just don't think, with the underlying models that we have, we can have fully autonomous agents safe in production. That's where I would put my money right now.

Yeah, for sure. It almost seems like we're going to enter an era of almost pseudo-agents, where you have kind of agenty elements, but it's actually quite deterministic in some ways, and that could persist for a very long time. I don't know, Maya, if ultimately your dream is, you know, the agent unshackled, but it kind of seems like the issues you're pointing out are pretty deep and categorical. I don't know if you'd agree with that.

Yeah, I'm definitely not in the camp of agents and AI unshackled. I think AI has to serve us and fit our needs, and I think we need to understand how it works, and make sure that it works in ways that are in accordance with our values. I think that's the type of AI that I personally would prescribe, and that would align with my own worldview and values.

Yeah, for sure. Well, before we move on to the next topic, I know, Maya, you said you sort of soft-launched this. Do you want to direct our listeners to it yet, or are you still just teasing it, and it's going to come out soon?

No, I can do more than a teaser. So we launched it; it was a silent drop, and we haven't shared it broadly.

So this is the first moment you're sharing it?

Yeah, this is the first moment I'm sharing it. So, if you're listening to this: it's called the Bee Stack, and specifically (maybe we can drop the URL later) it's called the Bee Agent Framework. You can do some cool things out of the box with it right now. You can already create an agent that
can 18:24plan and use tools and maybe correct 18:27itself um so we have some use cases but 18:29we also have exciting uh updates to 18:31bring so one along the lines of solving 18:33for the cost of switching models I don't 18:36think we'll have the ultimate solution 18:37but we want to reduce the friction of 18:39switching models and um we're working on 18:41bringing some of these Enterprise 18:43controls that I've mentioned and um if 18:45there are people who are interested in 18:46joining in this um Journey I'm very open 18:49to it still the very beginning steps but 18:51I think we're excited about what we can 18:52learn and my it's be as in the 18:55insect our team yeah yeah like bees our 18:58team really likes naming things puns so 19:01bees as worker bees and then maybe 19:03there's hives and all of that 19:05[Music] 19:09so so to move us on to our next story uh 19:12this week there was a New York City 19:13based startup called hyper write um that 19:15released a model uh called reflection 19:1870b um and it was kind of widely touted 19:22the leader of the company came out 19:23saying that you know this model 19:25integrated this new method called 19:27reflection tuning and this was the 19:29secret sauce that allowed this model to 19:31hit crazy good um uh metrics on all the 19:34major benchmarks um there was a lot of 19:36hype about it as most models kind of you 19:38know Rising on the leaderboard get 19:40nowadays um and then immediately there 19:42was kind of a turn where people said 19:44wait a minute we tried to reproduce some 19:45of these results and like this seems 19:47nowhere near what you're claiming uh 19:50furthermore doing some digging this 19:52seems like you just did some like 19:53Bargain Bin fine tuning on some open 19:55source models um and by now in true kind 19:58of like Twitter driven media cycle 20:00fashion there's just been this big cycle 20:02of mutual recrimination and um and uh 20:04and dispute um but the end result seems 20:06to be 
that we have a startup that went on the publicly available benchmarks that everybody is using to evaluate model quality, and may have engaged in at least some shading of the numbers to make their model look better than it actually was. And I think that's so interesting, just because these leaderboards have become, in some ways, the benchmark that we use to tell who's actually advancing the state of the art, and which models are good or bad. And so I think the main thing I wanted to raise, and Marina, maybe we'll toss it over to you first, is: should we be worried that we're going to see more of this? It seems like the value of gaming these metrics is always rising, and so, regardless of whether or not there was intentional fraud here, it seems like there are going to be a lot of bad incentives in the space soon. But I don't know if this is something you think we should care about, or if this is just kind of what happens.

I mean, to some extent this is kind of what happens, but I was actually really happy to see so many other folks jump on right away and say: no, I'm going to try to reproduce the results; you need to upload your weights; what about this, what about that? That is science acting correctly. Good science is supposed to be reproducible, and while it is possible to have something and have a hype cycle, it was really nice to see that there was immediate checking going on from third parties and everything else, as quickly as there was. In previous years, decades, centuries, the cycles of what it took for other people to check someone's work were very, very slow. And it's certainly still the case, maybe, in other fields, like biochemistry, things I don't know about. At least in our field, it's actually very easy
to quickly start checking. So that was a nice thing to see. Another thing I'll say about benchmarks, and you know I can always go off about benchmarks: they are a temporal proxy of a really specific slice of the world, with a lot of things held constant. We are trying to check performance on a particular thing, and it's being artificially controlled. Are they useful? Absolutely. Are they sufficient? No. In science we always like to talk about necessary and sufficient. Benchmarks are necessary; we need to all agree on a particular thing we're looking at. They are not sufficient, nor should they be. The whole point is, you have a benchmark, then you have another one, and another one, and another one. We keep checking each other, keeping each other honest, and motivating each other to explore: what are the holes, what is the next thing to look at? So actually, I found this to be very satisfying.

Yeah, the system is working, basically, a little bit.

Yeah. And maybe, I mean, this is maybe another way at the problem: I remember going to NeurIPS a number of years ago, and there was a push at the time to say, look, machine learning has this big reproducibility crisis. And if anything, this story is almost the other direction. It turns out that it isn't reproducibility in the academic sense, but we are seeing a kind of emergent reproducibility happening in the rough and tumble of Twitter, or X. I guess, Kate, you're nodding. I don't know, do you think the reproducibility problem is solved? Are we kind of there, or is this still a persistent thing that we should worry about?

I mean, I think it's encouraging
that we have such an active community focused on this, and I really like how Marina said it, on good science, right, making sure that we check and validate. I think from my perspective, one of the bigger issues here isn't necessarily just the reproducibility aspect but the transparency aspect: if we think of how a model was trained, how it's communicated. So, you know, we need to move as a field, and I think there are a lot of good actors here, but clearly there are some cases where we haven't quite gotten there yet, where it's not just a matter of dropping a benchmark. You have to be transparent and open about how, like, good science also means sharing your methods and your approach, and how it was trained, and what it was trained on, and getting into a lot more open detail. Whereas right now the norm is to kind of train these behind a black box, put an API out there, and say, here, we did this really cool thing, trust us, it works. Can you imagine that happening in other products or industries? It's like, here's an airplane, trust us, we tested it, it's fine. You know, I think we need a lot more openness in general in how these are trained, because there are always going to be misaligned incentives when it comes to these benchmarks. They just get so much traction and commentary that we need to have them always paired with a really open discourse of what was actually done and the ability to inspect what was actually done.

Yeah, and this seems to be the crux of it, right? Because I think benchmarks have almost represented the compromise position for the field, right? Which is, okay, well, we have all these companies that have trade secrets and they want to keep their innovations, like, you know,
reflection tuning or whatever they want to do. And so we're like, okay, well, so long as you can show us your performance on the benchmarks, that's the transparency we're looking for. I guess, Kate, just to push you a little bit further on it: you're almost saying that we should actually expect more than just the benchmarks.

Yeah, and I think there's a lot of transparency in name only, like companies saying, oh, we're going to publish a paper on this to follow, or here is a high-level overview of what we did. But if you don't actually share the weights, share more details, really get into the weeds of exactly what was done, a lot of this can just be kind of surface-level stuff that you can talk about publicly as being transparent. But did you actually deliver details that have helped scientists reproduce what was done? That is the level of transparency we need to drive toward.

So I'm in the business of building AI applications for production, and so benchmarks are something we were just discussing this morning. For me, it's not useful to see a certain model's performance on a benchmark, because there could be a number of things happening: it could be that the model has seen this data before; it could be that even though it does well on this, it might not generalize to my own use cases and what I care about. So my ethos around benchmarks is: if there is a test data set that fits a feature I'm trying to develop, maybe I'm trying to improve reasoning capabilities, maybe I'm trying to improve tool calling, a test data set that helps me identify my blind spots, I'm going to go all in for it. But it's so important, like, we're investing in building our own test cases and
our own eval criteria, because we have a very specific thing we're going after, and nothing beats doing that. And yeah, even when selecting LLMs, I feel like benchmarks are nice for maybe narrowing down a sub-selection of models to look at, but, like Marina said, it's helpful but not a complete signal.

Yeah, there's almost a kind of interesting phenomenon that I've been chasing after, which is a little bit like the idea that it used to be that the limiting reagent was getting new models out, but models are everywhere now, right? And so it's almost like the new limiting reagent is a well-crafted eval or a well-crafted benchmark set, and that's increasingly becoming where a lot of the bottleneck is, it feels like, in some of the workflows that you're seeing all over the place.

Well, and I wonder, if you just look at the state of the field, models across the board are getting better and better, able to do what last year required a model ten times bigger. So model performance is continuing to improve, and we're starting to get into the zone of: is this becoming commoditized? And therefore it's this race to the bottom, trying to inch up and get a 0.01 percent increase in a metric that might not actually be informative of your use case, because everyone's trying to show some level of differentiation. It's becoming increasingly difficult as performance improves over the, I'll call them the workhorse tasks, that are kind of the bread-and-butter, low-hanging fruit that pretty much any model can do now.

Yeah, for sure. Yeah, sometimes I look at some of these benchmarks
and think, what are we doing here, exactly? What are we spending time on? Yeah.

[Music]

Yeah, so at Mixture of Experts we always try to do a paper as one of our stories, and I do want to end today by focusing on a paper that just came out that I thought was pretty fascinating. It's entitled "Can LLMs Generate Novel Research Ideas?", and this is almost a continuation of a paper that we talked about last week, which was about using AI for science. And the big debate there was basically this very interesting question of: can LLMs be creative, right? Can they become a partner that pushes us in new research directions that we would not have gone in before? And this is particularly interesting because I think there's also a parallel discussion. Some of you may have seen the article by the sci-fi writer Ted Chiang that came out, I think, about a month or so ago in The New Yorker, arguing on the creativity side, right, that LLMs can't be creative in some ways because they don't make these kinds of intentional choices about their outputs. And so, anyway, to quickly sum up the paper, their argument is: look, we played around with LLMs, and it does seem like they can generate creative ideas. Their kind of spicy claim is sometimes even more creative ideas than humans themselves. And, dot dot dot, we should be really bullish on the promise of LLMs assisting research almost at the very, very beginning of the workflow, right, in the most human part of the work. And I guess the first question is just: do we buy it, right? And I guess, Kate, maybe to start the story with you, because I think you've
kind of come in as the second commenter on a number of the stories: do you buy this case? Do you buy this claim?

So I think the paper makes a lot of really interesting assessments of how a model can support research tasks, but what struck me the most about it is that they had to search through 4,000-plus examples to get 200 unique, actual research topics in this paper. So while the models were pretty good at creating, and I think specifically they focused on novelty, on new ideas that the other human subjects in this study hadn't thought of or come up with, is it really that models are more creative, or are we just able to brute-force automate searching through thousands and thousands of scenarios until we somehow, you know... you can roll dice until a unique number comes up.

Exactly. So that's where I think there's still some room left for debate on what it means to be creative or novel, and whether the humans got a fair shake of it, if you think of it in those terms. I guess, Marina, as a researcher, do you think this is the kind of tool you might be using in the future, or is this still mostly just game playing? Because I think, Kate, and maybe I'm being uncharitable to your position, but you're sort of saying this is just like a Magic 8 Ball, right? And if it generates something unique and great that inspires people, then awesome, but there's almost nothing uniquely creative there, I guess, is what you're saying.

I mean, I don't want to say a broken clock is correct twice a day, but if you do something enough, you're going to stumble across something new. Does that actually mean something is more creative? And especially if
you think about the New Yorker article that you brought up, where creativity is a choice: are we actually saying models are creative, or are they a tool with which we can brute-force search through a larger number of ideas that are being randomly generated? It's the second one to me, although that's valuable. That is very, very valuable, absolutely. Creativity requires intent, and there is no intent here. It's valuable because it can take humans a lot of time, and we come in with particular biases about how to think about things, whereas if something is brought up to you, you can immediately say, oh, that makes no sense, versus, oh, I didn't think about it that way. That's valuable, but there's no judgment from these models. So I know this has been used a lot recently in other science fields, like medicine or chemical compounds, where there are just thousands and thousands and thousands of things you can try to put together and humans just don't have the time. What's interesting is they are being used as a filter, as a brute force, and you get it down to: this makes no sense, this is physically impossible, and so on. And then you start applying human intuition, like, well, based on my experience in the field for twenty years, this is an idea worth pursuing and this one is not. Can I tell you exactly why, to the extent that I could, like, prompt an LLM with it? No, but it's sort of a sum of my experiences, a kind of progress. That's great, that's a very useful thing, but creativity means intent, and there is no intent.

Yeah, that's right. So you kind of buy the paper in some ways, I guess. Is the problem almost about calling it creativity? Like, the problem is that we're giving it this word that has all this baggage,
right? It's kind of on a pedestal, like the great creative artist, the great creative scientist, that implication.

Yeah, and the paper itself is not creative, sorry, because asking machine learning to help you go through a whole bunch of stuff and say which of it is definitely not garbage, or here are some other things I've tried, that's not new. They're applying it maybe to this extremely specific use case, and that's great, but that part, to me, is not new, while being useful.

Yeah, for sure. Maya, maybe I'll turn to you. This conversation almost makes me think about the debates we were having almost ten years ago now about interpretability, right? I think there was one line of debate which was, well, you don't really want to use these systems if you don't understand how they do what they do. And then there was a certain group of machine learning chauvinists, let's say, who were basically like, well, if the model always succeeds at the task, why do you care how it gets done? And I think there's almost a similar bias when we start talking about AI and creativity, where we say, we don't just want you to get to the right answer, we want you to get to the right answer in the right way. And I think what we mean by that in the creative context is that we want it to somehow be a little bit more than a random number generator. And I don't know if you fall on one side of that debate or the other, where you're kind of like, it actually doesn't matter, right? If these tools help us get more creative results, then I'm happy. Versus, no, part of our research agenda really should be to get these
systems to be capital-C creative, right? What that even means seems to be a big question.

Yeah, I don't know if I have a blanket answer for this one, but it's context-dependent. So I know that in the arts world, source attribution is really important. A lot of artworks are done in the style of such-and-such artist, and there should be credit given to a certain artist in that case, and how do you give that credit? And we want to assign this characteristic to AI because we hold social values regarding IP, regarding giving the right credit. So we want to align these AI systems with how our world functions and how we give credit where credit is due. So I think it really depends: what is the cost of not having that explainability baked in? And not just the explainability, but these other values that matter to us and that would make this technology align with our societies, rather than us having to adapt around it.

Yeah, that makes a lot of sense, and I think that'll be an interesting struggle, right? Because where some of this goes is: oh, you know, model, you came up with a counterintuitive, very creative result; can you explain why that's the case, or how you reached that result? Right? That kind of interpretability, which starts to look a little bit more like chain of thought.

I think we're just giving too much credit to these systems. Take the paper that we talked about, on generating novel ideas: well, first of all, what do you mean by novelty? Is it just something net new? Is that something that's useful? And it's, like, how the system works is just the statistical probability of the next word. Whereas for us, when we're coming up with novel ideas, there's meaning and value and intent behind it, right? I'm trying to put an
idea out there that maybe I care about, that I want to push forward. So it's really tough to equate these two and put them on the same standing. One can be a tool for the other, to maybe give you an idea that you didn't think about, but I don't think one replaces the other.

Yeah, for sure. And this also plays into this very interesting debate, which is, like, they're just stochastic parrots, and then people saying, no, they're more than stochastic parrots. But I think, Maya, you're almost outlining a third path, which is: yeah, they're stochastic parrots, but that's really powerful, actually. Let's not downplay that too much; the stochastic parrot is actually incredibly useful in certain domains, and we almost shouldn't sell that short, right? That's a little bit of what you're saying.

Yeah, absolutely. I think this whole industry of generative AI was unlocked by how good these stochastic parrots were, and I read that initial paper when it came out, but I think what took us all aback was that this is actually going to have really useful applications if implemented in the right way. But it doesn't mean that it can inherently assign reason and intent; I don't think there's a world where we're there.

Well, so, Marina, I'll let you have the last word: in four or five years, are you going to have an LLM co-author on a paper, or is this still a total pipe dream?

No, and I don't think that's the right aim, in that sense. I don't think I'm going to have that co-author, but it could very much be that, and we're actually asked now, when we publish, to say whether we've used AI in our work or anything of that kind. An LLM is helping you sift through the related work, an LLM is
helping you, you know, manage your bibliography and figure out things that are similar and different, and things of that kind, sure. But, you know, also to go along with what Maya and Kate were saying about intent: that's not what technology is able to do. Intent is something that you get from beings that are alive, beings that are actually able to give that. That's why you can't have actual AI art. I agree with Ted Chiang very strongly: you're not going to have AI art. Art moves us because of the intent of the person that made it; whether it was actually originally their intent or not doesn't matter, you know that that's what it was. Versus AI art... now, what you can actually feel, again, is the care and the intent of the people who made the LLM. Think about how much effort we all pour into making something that is useful. It's not the technology itself, though; it's the people who are trying to create something that is intended to be used, and helpful, and efficient, and effective. That's where the intent is, not in the tech itself.

That's great. Yeah, I'm applauding. So, Maya, Kate, Marina: in a nightmare landscape of jargon and hype, this panel is just a light in the darkness, so I appreciate you all taking the time this morning to stop by Mixture of Experts, and hopefully we'll have you on again at some point in the future. And for all you listeners out there, if you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere, and we will see you next week on Mixture of Experts.