
Apple's Modest AI Rollout

Key Points

  • Apple’s new AI rollout is modest, focusing on privacy‑centric on‑device LLM features like text rewriting, email summarization, and emoji generation, but it isn’t compelling enough to drive immediate iPhone upgrades.
  • The panel stresses that the success of autonomous AI agents will hinge on robust control mechanisms and clear benchmarks, warning that insufficient safeguards could spur increased fraud.
  • While Siri is expected to improve—thanks to Apple’s history of polished user experiences and upcoming customization options—many users remain skeptical about its practical usefulness.
  • The discussion highlights a broader industry need for agreed‑upon evaluation standards to reliably measure AI progress, noting that current benchmarks are necessary but not sufficient.

Full Transcript

# Apple's Modest AI Rollout

**Source:** [https://www.youtube.com/watch?v=j9vTEhimRqk](https://www.youtube.com/watch?v=j9vTEhimRqk)
**Duration:** 00:38:30

## Sections

- [00:00:00](https://www.youtube.com/watch?v=j9vTEhimRqk&t=0s) **Apple AI, Agent Challenges, and Siri Outlook** - The panel discusses Apple's modest entry into AI, the difficulty of building reliable autonomous agents and the need for unified benchmarks, and debates whether Siri will ever become genuinely useful.

## Full Transcript
So it turns out that Apple is starting with AI pretty modestly. It's not going to get me to buy a phone. All right, so level with us: how hard is it to make agents actually work? I think control is going to be the key thing that makes or breaks autonomous agents. I think there's going to be a lot more fraud coming. Benchmarks are necessary; we need to all agree on a particular thing we're looking at. They are not sufficient. All that and more on today's episode of Mixture of Experts.

Hello everybody, I'm Tim Hwang, and I'm joined today, as I am every Friday, by a world-class panel of researchers, product leaders, and more to hash out the week's news in AI. Kate Soule is a program director for generative AI research. Marina Danilevsky is a senior research scientist, and Maya Murad is a product manager, AI incubation.

Okay, so as we usually do for Mixture of Experts, we're going to start with a quick round-the-horn question, and the question I want you all to answer is this: if you have an iPhone, will Siri ever be any good? Maya, what do you think, yes or no?

I think so. I think Apple have a great track record at amazing user experiences, and I know they took their time with their AI, but I know it's for the benefit of how I interact with my phone. Definitely.

Marina?

Maybe. I still find myself fighting with Siri a whole lot and giving up on it most of the time. It's really great at reminders, for sure.

And Kate, what do you think?

Yes, assuming they can get to the user customization.

Well, let's just get into it, because our first story of the day is to talk a little bit about the Apple Intelligence updates of this week. The background, of course, is that there was a big WWDC announcement earlier in the year that announced Apple's long-awaited drive into artificial intelligence, and,
you know, this week we basically saw a slate of announcements continuing to hype Apple Intelligence, but in the context of the new iPhone 16 release. And they really announced a whole slate of things. I was looking at the blog posts: LLM assistance on text, image search, image generation, and, what we were just talking about, a Siri update. And I guess I kind of want to ask you, almost to begin, as a user of all these products. We were talking a little bit before the show, and it sounds like all of the panelists, myself also included, have an iPhone. And I guess, Maya, I'm curious: are you excited about any of the features that are on the way, and if so, why?

It's not really, actually, right now enough for me to go out and buy a new iPhone and try to get those features. The features that are there, while it's nice to have the LLM locally (and I always do like Apple's stance on privacy), I mean, what can they do? All right, we can rephrase text, we can summarize email, we can, I think, generate emojis, which would be really fun when texting my kids. None of it is still in that "I'm going to pay $1,000 now for a phone" territory. I'm not going to pay $1,000 for custom emojis. So they're helpers, and the helpers are nice, and I'll appreciate them when I get them, but it's not enough for me to go, "Wow, this was something that was a game changer by Apple." At least, that was my feeling.

Yeah, and I think that's kind of the most interesting thing. There are almost two points of view. One of them is that Apple's getting it all wrong: AI is the killer feature and it really will sell phones. The other one is that they have it exactly right: none of the AI features that are currently on the market are good enough to motivate someone to actually
buy a phone. The software is not really pushing the hardware here.

Well, I mean, I think from my perspective, it's not going to get me to buy a phone. It's not something that's going to push the boundaries significantly, but at least it gives Apple an entry into what they've largely been standing away from. So, you know, I think it's definitely helping them move in the right direction, hopefully. I just don't even use Siri today, because it's always a 50/50 shot: is Siri going to...

Yeah, I don't even use it either. They're pushing it, with the button and everything, but I never touch it.

You need to get to the point of basic Siri usability, and I don't think you can get there without LLMs. So, you know, I think they're making the right move on that call, and then later on I think they have more opportunity, once they lock that in, to actually differentiate both hardware and software with AI.

Yeah, for sure. Maya, what do you think? I mean, one way of looking at some of this is that Apple, I mean, they're a hardware company, right? So they're very careful by nature, because if you mess up hardware, you really mess it up. But are they maybe too slow? By the time Apple gets all this stuff done, you know, it's going to be OpenAI on the phone and Anthropic on the phone and Perplexity on the phone.

I think that's a good question. I think part of Apple's ethos and appeal, especially when they started decades ago, is this focus on design and the user experience, and doing less. They always did less compared to the competitors. So I think we're in a space where user experience matters. I think we're overloaded cognitively with too many things on our phones. I'm really excited about Siri being able to navigate
different apps. So if they nail that, for me, that delivers tremendous value. And then, I don't see them being in the hot spot of having to respond to generative AI. They're not search; their core is not search. They're not a chatbot in themselves. They're hardware; they're a way for me to communicate, with Siri and add-ons. So I don't think they're feeling the same pressure as some of the other tech players are.

Yeah, for sure. I think it's going to be an interesting situation, because, I don't know, the other week I thought, let's try out all these AI services. So I signed up for a bunch of subscriptions, and the first month's bill just came in, and I'm just like, this is really bad. But it kind of feels like, maybe if Apple can release some of these features for free, it totally changes the economics of some of this stuff. Kate, I see you nodding.

Well, I was just wondering, Maya, if you had a perspective on this: does Apple have an advantage, given that they are the integrator between all of these apps, beyond what any one app, like a Perplexity or Anthropic app that you could also install on your phone, might have?

Yeah, I think absolutely. They own the ecosystem of apps, so they own that App Store. And I think that would be, if they could lower the barrier in the interface of connecting between different apps, that would be really interesting. And I wonder if, a year from now, Apple will be the AI killer in the same way that when OpenAI launches a release, it kills a bunch of startups.

Yeah, that's for sure. I mean, one of the things I think a little bit about is, there have been all these demos floating around, and I think they're often more impressive than they actually are in practice, but it's like, type what you want
and a new app emerges. And it kind of feels like that sort of thing might eventually happen on Apple, but it also is this enormous threat to this whole edifice of the App Store that they've constructed. So, I don't know, navigating that, I think, is going to be really complex and challenging.

I think Maya brings up a really good point: they're more likely to gatekeep the App Store until they've got their own integrations working, and then it'll be all about how seamless it is. Otherwise, if you're going to have different services having to try to talk to different apps, there's so much under the hood in trying to navigate different kinds of middleware that you'll have so many points of failure that people will say, all right, well, look, the Apple version maybe can't do as much, but at least it works. And there seems to be a real opportunity potentially there.

Yeah, for sure. And it reminds me a little bit of a debate from many years ago about, okay, how do we get the self-driving car thing to work?

And one of the ideas was, they're always five years away.

It's always five years away, yeah. And I think one of the most interesting things was this debate over, do we need to reconstruct the whole environment to make it simpler for the robot cars to work, or do we just let the robot car roam and try to train it around all sorts of environments? And there's a similar thing here for agents, I guess, in the AI case for Apple, where they confront this very heterogeneous kind of situation in the App Store, which really prevents them from enforcing clean agent experiences. And it's like, well, I guess if anyone can do it, it's Apple, because they have the most control, at least, over
the space.

Well, it's a really good segue, because I think, Maya, one of the reasons we were really excited to have you on the show was the second topic I wanted to touch on today, which is, you know, literally in the last ten episodes of Mixture of Experts, people have been saying: agents! Agents are on the way, agents are going to be the new big thing. And we've kind of debated it back and forth. And, you know, you've actually been working on agents, right? And I feel like, so infrequently, the circle of people talking about agents and the people actually working on agents is a very different kind of delta, and you're pretty rare in that respect, because I think you've been really in the trenches on it. Do you want to tell us a little bit about that work? I was curious both to learn a little bit about it and then kind of what you're learning.

Yeah, of course. So I sit in a really interesting team in research. My team focuses on incubating new technologies and opening up market opportunities for IBM, and we've been focusing on agents for several months now, so very much in the trenches. And one of the first things that we did this month is we open-sourced a framework for building agentic applications. So, still very early days; we did a silent drop, but we think we have some interesting features that we can offer in this space that reflect some of our learnings. And I don't know how much time we have, but there are a lot of things that we learned along the way, and I think it's very hard to bring agentic applications into production, and it's very easy to take that for granted. I think in terms of operational complexity, this is a step change from fixed flow. It's not incremental... it's
another paradigm. It's much harder than fixed flows in terms of implementation.

Yeah, we'd love to talk about learnings, and we've got time for it. I mean, I feel like this is where Mixture of Experts can shine: lots of people are talking about Apple; let's talk about really building agents. I mean, maybe one way to cut through it is, is there something that you found surprisingly hard? Where, before you went into it, you thought, ah, we can nail that, no problem, and then: actually, this is really difficult?

So there were two things that kind of took us... they were a blind spot for us, and we didn't expect how hard they would be. And then I think there's one thing that made us very clear on how we bring this into production. So I'm just going to talk about it at a high level. First thing is, an agent is underpinned by a prompt; think of a set of instructions that tell it how to behave. So let's say you built an agent around model A, and then you optimized it around model A. Now, if you want to bring it to model B, the whole thing breaks. And we had this experience firsthand. We started dabbling with Llama 3; we moved to Llama 3.1, which we expected to be an incremental change (nothing much changed), and the whole thing broke. It took us three weeks to re-optimize everything under the hood, and we're still not fully there. So I think this is critical, because if you want to stay on top of the latest and best models, you have all of this cost related to changing models, and that makes it really prohibitive to try new models. And I think this is pushing us in another direction, where once you've picked your model of choice and built something in production with it, it's going to be
really hard for you to change model providers. And I don't know that I'm very happy with that being the status quo. So that's one part of it, and we have some ideas about how to overcome it. The other part is something I was actually discussing with Kate Soule this morning, which is: we take for granted how to build with AI. Meaning, if I'm building traditional software engineering applications, I specify my features, then I code my features, then I test my features, and I'm done. With LLMs, I have features I maybe didn't sign up for in the first place, like outputting hate speech; and features that are useful, like summarization, features that kind of come out of the box. But did I test for them? No, I kind of took them for granted. And I think we made the mistake initially of taking those features for granted and not following a test-driven approach. And I actually want to pass it over to Kate, because she works a lot in prompt optimization, and I'm sure you've had your own struggles with this.

Yeah, thanks, Maya. You know, one of the things that I found really interesting when we started to get into the weeds is how you think about this kind of hierarchy of how a model or agent is aligned. So there's all of the work that the model provider does in order to train it to be safe, to make sure it's good at things like summarization and basic language tasks; they have a perspective that they enforce on the model and how the model should behave in situations. Then you have the alignment preferences and priorities set by the agent builder, and so you have all of these behaviors about how the agent is supposed to behave: how this model with this system prompt together are now going to interact in this new agentic environment,
what patterns they're going to follow. And then there's even a third level, which is, when a user is interacting with the agent, I might have my own preferences on how I want the agent to behave. And so this gets back to, for example, Siri, and the question of whether Apple has the ability to actually personalize to an individual user. As a user, I might want a very specific way of interacting with my agent: I want things to be short, always, and in bullets; versus, I want everything to be in markdown format and much longer, with tables and everything else inserted where possible. So there are these different tiers you have to start to account for when you're building models, and you have control over some parts. You can impact your system prompt design for the agent; you can try to create tools and different parameters that users can play with to impact their own personalization and alignment of the agent. But then there are things you don't have control over that are defined by the model builder. And so, I think that's where some of the interesting challenges come out: trying to navigate those different levels of control that you have, and trying to work within the system that different model providers are setting up, particularly if you envision ever needing to switch models to something new, where a different provider might have set a different process.

Yeah, it almost presages this really interesting thing of, like, legacy code or legacy models, where the intention will always be, let's move to the next great capability model. But the story that you're telling is, it changes so much about the way the agent behaves that it's almost
like, in many cases, you may not want to, because of the uncertainty, or at the very least the evaluation burden of trying to figure out how to get that to work.

Well, one thought here: the notion of backward compatibility is not something that exists in LLMs. We haven't really thought about it that much; it has not been a thing. But we know that it's a really, really big deal in everything having to do with software: the notion of backward compatibility, and did everything immediately break, and is that really all that useful? So if we're going to be doing this seriously, I think you're going to have to start taking that into account, and have people create a whole new slew of benchmarks and functionalities and tests: it was working this way before; how is it going to work if you're trying to plug it into something old? Another thing, and I completely agree with what Kate was saying, is that notion of control. Think about generative AI with art. You can give it a prompt to make a picture, but then you can't tell it, okay, I love it, but just change that little thing over there to something else. It won't work; that's not how these models work. That's a real challenge, because in software it's like, all right, well, you did a lot of this right, but I need you to change this little bit here and this little bit there. It's not going to work like that, at a very basic level. So both of these things, I think, are a very different way of looking at software building, and they need to be thought through very carefully, to make sure this is actually practical over time.

Marina, I was wondering if you had a perspective: does getting towards something like GPT structured output start to solve some of the backwards
compatibility? If we can now have greater structure on exactly how we prompt the model and exactly what the outputs are, does that start to solve some of the problem in your mind?

It's a step. I think part of it certainly is the structure of the output, but another very large part of it is: what are the acceptable states, and the constraints on what makes sense or not? Even if you have the structure, the content of that output could still, theoretically, be anything. We're still talking strings or other primitives, and there's still a notion of what kinds of states in your application, as you're going through, are okay and not okay, valid and invalid. In a deterministic flow, you write it all up and you know exactly what will and won't happen; you've got tests, you've got catchers, you've got things like that. How does that look here? That is, I think, the next thing on my mind.

Yeah, I think both of you raised a really great point, and I loved how much time you spent speaking about control. I think control is going to be the key thing that makes or breaks autonomous agents. These things can run wild and have consequences that could be costly. We've seen firsthand how data that you might have thought was proprietary gets sent to an external tool and now goes to a third party, and maybe you didn't intend that in the first place. So one of the thoughts at the back of my head is, I don't know that it's fully autonomous agents that will go into production this year. Rather, it's more of a hybrid sort of compound AI system, where some parts are agentic, meaning there are degrees of freedom the LLM can take, and other parts are more prescriptive: verifiers, things that allow us to get the level of control we want. And I think that's some
of the ideas we're starting to explore here, because I just don't think, with the underlying models that we have, we can have fully autonomous agents safe in production. That's where I would put my money right now.

Yeah, for sure. It almost seems like we're going to enter an era of almost pseudo-agents, where you have kind of agenty elements, but it's actually quite deterministic in some ways, and that could persist for a very long time. I don't know, Maya, if ultimately your dream is, you know, the agent unshackled, but it kind of seems like the issues you're pointing out are pretty deep and categorical. I don't know if you'd agree with that.

Yeah, I'm definitely not in the camp of agents and AI unshackled. I think AI has to serve us and fit our needs, and I think we need to understand how it works, and make sure that it works in ways that are in accordance with our values. I think that's the type of AI that I personally would prescribe, and that would align with my own worldview and values.

Yeah, for sure. Well, before we move on to the next topic, I know, Maya, you said you sort of soft-launched this. Do you want to direct our listeners to it yet, or are you still just teasing it, and it's going to come out soon?

No, I can do more than a teaser. So we launched it; it was a silent drop, and we haven't shared it broadly.

So this is the first moment you're sharing it?

Yeah, this is the first moment I'm sharing it. So, if you're listening to this: it's called the Bee Stack, and specifically (maybe we can drop the URL later) it's called the Bee Agent Framework. You can do some cool things out of the box with it right now. You can already create an agent that
can 18:24plan and use tools and maybe correct 18:27itself um so we have some use cases but 18:29we also have exciting uh updates to 18:31bring so one along the lines of solving 18:33for the cost of switching models I don't 18:36think we'll have the ultimate solution 18:37but we want to reduce the friction of 18:39switching models and um we're working on 18:41bringing some of these Enterprise 18:43controls that I've mentioned and um if 18:45there are people who are interested in 18:46joining in this um Journey I'm very open 18:49to it still the very beginning steps but 18:51I think we're excited about what we can 18:52learn and my it's be as in the 18:55insect our team yeah yeah like bees our 18:58team really likes naming things puns so 19:01bees as worker bees and then maybe 19:03there's hives and all of that 19:05[Music] 19:09so so to move us on to our next story uh 19:12this week there was a New York City 19:13based startup called hyper write um that 19:15released a model uh called reflection 19:1870b um and it was kind of widely touted 19:22the leader of the company came out 19:23saying that you know this model 19:25integrated this new method called 19:27reflection tuning and this was the 19:29secret sauce that allowed this model to 19:31hit crazy good um uh metrics on all the 19:34major benchmarks um there was a lot of 19:36hype about it as most models kind of you 19:38know Rising on the leaderboard get 19:40nowadays um and then immediately there 19:42was kind of a turn where people said 19:44wait a minute we tried to reproduce some 19:45of these results and like this seems 19:47nowhere near what you're claiming uh 19:50furthermore doing some digging this 19:52seems like you just did some like 19:53Bargain Bin fine tuning on some open 19:55source models um and by now in true kind 19:58of like Twitter driven media cycle 20:00fashion there's just been this big cycle 20:02of mutual recrimination and um and uh 20:04and dispute um but the end result seems 20:06to be 
that we have a startup that went on the publicly available benchmarks that everybody is using to evaluate model quality, and may have engaged in at least some shading of the numbers to make their model look better than it actually was. And I think that's so interesting, just because these leaderboards have become, in some ways, the benchmark that we use to tell who's actually advancing the state of the art, and which models are good or bad. And so I think the main thing I wanted to raise, and Marina, maybe we'll toss it over to you first, is: should we be worried that we're going to see more of this? It seems like the value of gaming these metrics is always rising, and so, regardless of whether or not there was intentional fraud here, it seems like there are going to be a lot of bad incentives in the space soon. But I don't know if this is something you think we should care about, or if this is just kind of what happens.

I mean, to some extent this is kind of what happens, but I was actually really happy to see so many other folks jump on right away and say: no, I'm going to try to reproduce the results; you need to upload your weights; what about this, what about that? That is science acting correctly. Good science is supposed to be reproducible, and while it is possible to have something and have a hype cycle, it was really nice to see that there was immediate checking going on from third parties and everything else, as quickly as there was. In previous years, decades, centuries, the cycles of what it took for other people to check someone's work were very, very slow. And it's certainly still the case, maybe, in other fields, like biochemistry, things I don't know about. At least in our field, it's actually very easy
to quickly start checking. So that was a nice thing to see. Another thing I'll say about benchmarks, and you know I can always go off about benchmarks: they are a temporal proxy of a really specific slice of the world, with a lot of things held constant. We are trying to check performance on a particular thing, and it's being artificially controlled. Are they useful? Absolutely. Are they sufficient? No. In science we always like to talk about necessary and sufficient. Benchmarks are necessary; we need to all agree on a particular thing we're looking at. They are not sufficient, nor should they be. The whole point is, you have a benchmark, then you have another one, and another one, and another one. We keep checking each other, keeping each other honest, and motivating each other to explore: what are the holes, what is the next thing to look at? So actually, I found this to be very satisfying.

Yeah, the system is working, basically, a little bit.

Yeah. And maybe, I mean, this is maybe another way at the problem: I remember going to NeurIPS a number of years ago, and there was a push at the time to say, look, machine learning has this big reproducibility crisis. And if anything, this story is almost the other direction. It turns out that it isn't reproducibility in the academic sense, but we are seeing a kind of emergent reproducibility happening in the rough and tumble of Twitter, or X. I guess, Kate, you're nodding. I don't know, do you think the reproducibility problem is solved? Are we kind of there, or is this still a persistent thing that we should worry about?

I mean, I think it's encouraging
that we have such an active community focused on this, and I really like how Marina said it, on good science, right, making sure that we check and validate. I think from my perspective, one of the bigger issues here isn't necessarily just the reproducibility aspect but the transparency aspect: if we think of how a model was trained, how it's communicated. So, you know, we need to move as a field, and I think there are a lot of good actors here, but clearly there are some cases where we haven't quite gotten there yet, where it's not just a matter of dropping a benchmark. You have to be transparent and open about how, like, good science also means sharing your methods and your approach, and how it was trained, and what it was trained on, and getting into a lot more open detail. Whereas right now the norm is to kind of train these behind a black box, put an API out there, and say, here, we did this really cool thing, trust us, it works. Can you imagine that happening in other products or industries? It's like, here's an airplane, trust us, we tested it, it's fine. You know, I think we need a lot more openness in general in how these are trained, because there are always going to be misaligned incentives when it comes to these benchmarks. They just get so much traction and commentary that we need to have them always paired with a really open discourse of what was actually done and the ability to inspect what was actually done.

Yeah, and this seems to be the crux of it, right? Because I think benchmarks have almost represented the compromise position for the field, right? Which is, okay, well, we have all these companies that have trade secrets and they want to keep their innovations, like, you know,
reflection tuning or whatever they want to do. And so we're like, okay, well, so long as you can show us your performance on the benchmarks, that's the transparency we're looking for. I guess, Kate, just to push you a little bit further on it: you're almost saying that we should actually expect more than just the benchmarks.

Yeah, and I think there's a lot of transparency in name only, like companies saying, oh, we're going to publish a paper on this to follow, or here is a high-level overview of what we did. But if you don't actually share the weights, share more details, really get into the weeds of exactly what was done, a lot of this can just be kind of surface-level stuff that you can talk about publicly as being transparent. But did you actually deliver details that have helped scientists reproduce what was done? That is the level of transparency we need to drive toward.

So I'm in the business of building AI applications for production, and so benchmarks are something we were just discussing this morning. For me, it's not useful to see a certain model's performance on a benchmark, because there could be a number of things happening: it could be that the model has seen this data before; it could be that even though it does well on this, it might not generalize to my own use cases and what I care about. So my ethos around benchmarks is: if there is a test data set that fits a feature I'm trying to develop, maybe I'm trying to improve reasoning capabilities, maybe I'm trying to improve tool calling, a test data set that helps me identify my blind spots, I'm going to go all in for it. But it's so important, like, we're investing in building our own test cases and
our own eval criteria, because we have a very specific thing we're going after, and nothing beats doing that. And yeah, even when selecting LLMs, I feel like benchmarks are nice for maybe narrowing down a sub-selection of models to look at, but, like Marina said, it's helpful but not a complete signal.

Yeah, there's almost a kind of interesting phenomenon that I've been chasing after, which is a little bit like the idea that it used to be that the limiting reagent was getting new models out, but models are everywhere now, right? And so it's almost like the new limiting reagent is a well-crafted eval or a well-crafted benchmark set, and that's increasingly becoming where a lot of the bottleneck is, it feels like, in some of the workflows that you're seeing all over the place.

Well, and I wonder, if you just look at the state of the field, models across the board are getting better and better, able to do what last year required a model ten times bigger. So model performance is continuing to improve, and we're starting to get into the zone of: is this becoming commoditized? And therefore it's this race to the bottom, trying to inch up and get a 0.01 percent increase in a metric that might not actually be informative of your use case, because everyone's trying to show some level of differentiation. It's becoming increasingly difficult as performance improves over the, I'll call them the workhorse tasks, that are kind of the bread-and-butter, low-hanging fruit that pretty much any model can do now.

Yeah, for sure. Yeah, sometimes I look at some of these benchmarks
and think, what are we doing here, exactly? What are we spending time on? Yeah.

[Music]

Yeah, so at Mixture of Experts we always try to do a paper as one of our stories, and I do want to end today by focusing on a paper that just came out that I thought was pretty fascinating. It's entitled "Can LLMs Generate Novel Research Ideas?", and this is almost a continuation of a paper that we talked about last week, which was about using AI for science. And the big debate there was basically this very interesting question of: can LLMs be creative, right? Can they become a partner that pushes us in new research directions that we would not have gone in before? And this is particularly interesting because I think there's also a parallel discussion. Some of you may have seen the article by the sci-fi writer Ted Chiang that came out, I think, about a month or so ago in The New Yorker, arguing on the creativity side, right, that LLMs can't be creative in some ways because they don't make these kinds of intentional choices about their outputs. And so, anyway, to quickly sum up the paper, their argument is: look, we played around with LLMs, and it does seem like they can generate creative ideas. Their kind of spicy claim is sometimes even more creative ideas than humans themselves. And, dot dot dot, we should be really bullish on the promise of LLMs assisting research almost at the very, very beginning of the workflow, right, in the most human part of the work. And I guess the first question is just: do we buy it, right? And I guess, Kate, maybe to start the story with you, because I think you've
kind of come in as the second commenter on a number of the stories: do you buy this case? Do you buy this claim?

So I think the paper makes a lot of really interesting assessments of how a model can support research tasks, but what struck me the most about it is that they had to search through 4,000-plus examples to get 200 unique, actual research topics in this paper. So while the models were pretty good at creating, and I think specifically they focused on novelty, on new ideas that the other human subjects in this study hadn't thought of or come up with, is it really that models are more creative, or are we just able to brute-force automate searching through thousands and thousands of scenarios until we somehow, you know... you can roll dice until a unique number comes up.

Exactly. So that's where I think there's still some room left for debate on what it means to be creative or novel, and whether the humans got a fair shake of it, if you think of it in those terms. I guess, Marina, as a researcher, do you think this is the kind of tool you might be using in the future, or is this still mostly just game playing? Because I think, Kate, and maybe I'm being uncharitable to your position, but you're sort of saying this is just like a Magic 8 Ball, right? And if it generates something unique and great that inspires people, then awesome, but there's almost nothing uniquely creative there, I guess, is what you're saying.

I mean, I don't want to say a broken clock is correct twice a day, but if you do something enough, you're going to stumble across something new. Does that actually mean something is more creative? And especially if
you think about the New Yorker article that you brought up, where creativity is a choice: are we actually saying models are creative, or are they a tool with which we can brute-force search through a larger number of ideas that are being randomly generated? It's the second one to me, although that's valuable. That is very, very valuable, absolutely. Creativity requires intent, and there is no intent here. It's valuable because it can take humans a lot of time, and we come in with particular biases about how to think about things, whereas if something is brought up to you, you can immediately say, oh, that makes no sense, versus, oh, I didn't think about it that way. That's valuable, but there's no judgment from these models. So I know this has been used a lot recently in other science fields, like medicine or chemical compounds, where there are just thousands and thousands and thousands of things you can try to put together and humans just don't have the time. What's interesting is they are being used as a filter, as a brute force, and you get it down to: this makes no sense, this is physically impossible, and so on. And then you start applying human intuition, like, well, based on my experience in the field for twenty years, this is an idea worth pursuing and this one is not. Can I tell you exactly why, to the extent that I could, like, prompt an LLM with it? No, but it's sort of a sum of my experiences, a kind of progress. That's great, that's a very useful thing, but creativity means intent, and there is no intent.

Yeah, that's right. So you kind of buy the paper in some ways, I guess. Is the problem almost about calling it creativity? Like, the problem is that we're giving it this word that has all this baggage,
right? It's kind of on a pedestal, like the great creative artist, the great creative scientist, that implication.

Yeah, and the paper itself is not creative, sorry, because asking machine learning to help you go through a whole bunch of stuff and say which of it is definitely not garbage, or here are some other things I've tried, that's not new. They're applying it maybe to this extremely specific use case, and that's great, but that part, to me, is not new, while being useful.

Yeah, for sure. Maya, maybe I'll turn to you. This conversation almost makes me think about the debates we were having almost ten years ago now about interpretability, right? I think there was one line of debate which was, well, you don't really want to use these systems if you don't understand how they do what they do. And then there was a certain group of machine learning chauvinists, let's say, who were basically like, well, if the model always succeeds at the task, why do you care how it gets done? And I think there's almost a similar bias when we start talking about AI and creativity, where we say, we don't just want you to get to the right answer, we want you to get to the right answer in the right way. And I think what we mean by that in the creative context is that we want it to somehow be a little bit more than a random number generator. And I don't know if you fall on one side of that debate or the other, where you're kind of like, it actually doesn't matter, right? If these tools help us get more creative results, then I'm happy. Versus, no, part of our research agenda really should be to get these
systems to be capital-C creative, right? What that even means seems to be a big question.

Yeah, I don't know if I have a blanket answer for this one, but it's context-dependent. So I know that in the arts world, source attribution is really important. A lot of artworks are done in the style of such-and-such artist, and there should be credit given to a certain artist in that case, and how do you give that credit? And we want to assign this characteristic to AI because we hold social values regarding IP, regarding giving the right credit. So we want to align these AI systems with how our world functions and how we give credit where credit is due. So I think it really depends: what is the cost of not having that explainability baked in? And not just the explainability, but these other values that matter to us and that would make this technology align with our societies, rather than us having to adapt around it.

Yeah, that makes a lot of sense, and I think that'll be an interesting struggle, right? Because where some of this goes is: oh, you know, model, you came up with a counterintuitive, very creative result; can you explain why that's the case, or how you reached that result? Right? That kind of interpretability, which starts to look a little bit more like chain of thought.

I think we're just giving too much credit to these systems. Take the paper that we talked about, on generating novel ideas: well, first of all, what do you mean by novelty? Is it just something net new? Is that something that's useful? And it's, like, how the system works is just the statistical probability of the next word. Whereas for us, when we're coming up with novel ideas, there's meaning and value and intent behind it, right? I'm trying to put an
idea out there that maybe I care about, that I want to push forward. So it's really tough to equate these two and put them on the same standing. One can be a tool for the other, to maybe give you an idea that you didn't think about, but I don't think one replaces the other.

Yeah, for sure. And this also plays into this very interesting debate, which is, like, they're just stochastic parrots, and then people saying, no, they're more than stochastic parrots. But I think, Maya, you're almost outlining a third path, which is: yeah, they're stochastic parrots, but that's really powerful, actually. Let's not downplay that too much; the stochastic parrot is actually incredibly useful in certain domains, and we almost shouldn't sell that short, right? That's a little bit of what you're saying.

Yeah, absolutely. I think this whole industry of generative AI was unlocked by how good these stochastic parrots were, and I read that initial paper when it came out, but I think what took us all aback was that this is actually going to have really useful applications if implemented in the right way. But it doesn't mean that it can inherently assign reason and intent; I don't think there's a world where we're there.

Well, so, Marina, I'll let you have the last word: in four or five years, are you going to have an LLM co-author on a paper, or is this still a total pipe dream?

No, and I don't think that's the right aim, in that sense. I don't think I'm going to have that co-author, but it could very much be that, and we're actually asked now, when we publish, to say whether we've used AI in our work or anything of that kind. An LLM is helping you sift through the related work, an LLM is
helping you, you know, manage your bibliography and figure out things that are similar and different, and things of that kind, sure. But, you know, also to go along with what Maya and Kate were saying about intent: that's not what technology is able to do. Intent is something that you get from beings that are alive, beings that are actually able to give that. That's why you can't have actual AI art. I agree with Ted Chiang very strongly: you're not going to have AI art. Art moves us because of the intent of the person that made it; whether it was actually originally their intent or not doesn't matter, you know that that's what it was. Versus AI art... now, what you can actually feel, again, is the care and the intent of the people who made the LLM. Think about how much effort we all pour into making something that is useful. It's not the technology itself, though; it's the people who are trying to create something that is intended to be used, and helpful, and efficient, and effective. That's where the intent is, not in the tech itself.

That's great. Yeah, I'm applauding. So, Maya, Kate, Marina: in a nightmare landscape of jargon and hype, this panel is just a light in the darkness, so I appreciate you all taking the time this morning to stop by Mixture of Experts, and hopefully we'll have you on again at some point in the future. And for all you listeners out there, if you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere, and we will see you next week on Mixture of Experts.