Apple's Modest AI Rollout
Key Points
- Apple’s new AI rollout is modest, focusing on privacy‑centric on‑device LLM features like text rewriting, email summarization, and emoji generation, but it isn’t compelling enough to drive immediate iPhone upgrades.
- The panel stresses that the success of autonomous AI agents will hinge on robust control mechanisms and clear benchmarks, warning that insufficient safeguards could spur increased fraud.
- While Siri is expected to improve—thanks to Apple’s history of polished user experiences and upcoming customization options—many users remain skeptical about its practical usefulness.
- The discussion highlights a broader industry need for agreed‑upon evaluation standards to reliably measure AI progress, noting that current benchmarks are necessary but not sufficient.
Full Transcript
# Apple's Modest AI Rollout

**Source:** [https://www.youtube.com/watch?v=j9vTEhimRqk](https://www.youtube.com/watch?v=j9vTEhimRqk)
**Duration:** 00:38:30

## Sections

- [00:00:00](https://www.youtube.com/watch?v=j9vTEhimRqk&t=0s) **Apple AI, Agent Challenges, and Siri Outlook** - The panel discusses Apple's modest entry into AI, the difficulty of building reliable autonomous agents and the need for unified benchmarks, and debates whether Siri will ever become genuinely useful.

## Full Transcript
so it turns out that apple is starting
with AI pretty modestly it's not going
to get me to buy a phone all right so
level with us how hard is it to make
agents actually work I think control is
going to be the key thing that makes or
breaks autonomous agents I think there's
going to be a lot more fraud coming
benchmarks are necessary we need to all
agree on a particular thing we're
looking at they are not sufficient all
that and more on today's episode of
mixture of experts
hello everybody I'm Tim Hwang and I'm
joined today as I am every Friday by a
world-class panel of researchers product
leaders and more to hash out the week's
news in AI Kate Soule is a program
director for generative AI research
hello Kate Marina Danilevsky is a senior
research scientist and Maya Murad is a
product manager AI
incubation okay so as we usually do for
mixture of experts we're going to start
with a quick round the horn
question and that question I want you
all to answer is this if you have an
iPhone will Siri ever be any good Maya
what do you think yes or no I think so I
think Apple has a great track
record at amazing user experiences and I
know they took their time with their AI
but I know it's for the benefit of how I
interact with my phone definitely
marina maybe I still find myself
fighting with Siri a whole lot and
giving up on it most of the time it's
really great for
reminders for sure and Kate what do you
think yes assuming they can get to the
user customization uh well let's just
get into it because our first story of
the day is to talk a little bit about
the Apple intelligence uh updates of
this week um background is of course
that there was a big WWDC announcement
earlier in the year that announced kind
of Apple's long awaited drive into
artificial intelligence and um you know
this week we basically saw a slate of
announcements uh continuing to hype
Apple Intelligence but in the context of the
new iPhone 16 uh release um and they
really announced like a whole slate of
things I was looking at the blog posts
uh you know llm assistance on text image
search image generation um and what
we're just talking about a Siri update
um and I guess I kind of want to ask you
just almost like to begin as like a a
user of all these products yeah I mean
uh we were talking a little bit before
the show and it sounds like all of the
panelists uh myself also included um
have an iPhone and I guess Marina curious
are you excited about any of the features
that are on their way and if so like
why um it's not really actually right
now enough for me to go out and buy a
new iPhone and try to get those features
um yeah the the features that are there
while it's nice to have the llm locally
and I always do like Apple's stance on
privacy I mean what can they do all
right we can rephrase text we can
summarize email we can I think generate
emojis which would be really fun when
texting my kids none of it is still in
that
I'm going to pay $1,000 now for a phone
I'm not going to pay $1,000 for custom
emojis um but so like it it's helpers
and the helpers are nice and I'll
appreciate it when I get it but it's not
enough for me to go wow this was
something that was a game changer by Apple
at least that was what my my feeling was
yeah and I think that's kind of the most
interesting thing there's almost like
two points of view one of them is like
Apple's getting it all wrong like AI is
the killer feature and it really will
sell phones the other one is like they
have it exactly right like none of the
AI features that are currently on the
market are good enough to like motivate
someone to actually buy a phone like the
the software is not really pushing the
hardware here well I mean I think from
my perspective it it's not going to get
me to buy a phone it's not something
that's going to push the the boundaries
like significantly but at least gives
Apple an entry into what they've largely
been kind of standing away from so you
know I think it's definitely helping
them move in the right direction
hopefully like I just don't even use
Siri today because it's always like a
50/50 shot is Siri gonna work yeah I don't even
use it either they're like
pushing it with the button and
stuff but I never touch it they need to get to
the point of like basic Siri usability
and I don't think you can get there
without llms so you know I think they're
making the right call there and
then later on I think they have more
opportunity once they lock that in to
actually differentiate both hardware and
software with AI yeah for sure Maya what
do you think I mean I think you know one
way of looking at some of this is that
like Apple I mean they're a hardware
company right so they're like very
careful by nature because if you mess up
hardware you really mess it up um
but are they maybe like too slow like by
the time Apple gets out this stuff done
you know it's going to be like open AI
on the phone and anthropic on the phone
and perplexity on the phone I think
that's a good question I think part of
Apple's ethos and appeal especially like
when they started decades ago is this
focus on design and the user experience
and doing less they always did less
compared to the competitors so I think
we're in the space where user experience
matters I think we're overloaded
cognitively with too many things on our
phones I'm really
excited about Siri being able to
navigate different apps so if they nail
that for me that delivers tremendous
value and then I don't see them being in
the hot spot to be able to respond to
generative AI they're not search their
core is not search they're not a chat
bot in itself they're hardware a
way for me to communicate and Siri is an
add-on so I don't think they're feeling
the same pressure as some of the
other tech players are yeah for sure
yeah I think it's going to be an
interesting situation because like I
don't know so the other week I was like
let's try out all these AI services so I
like signed up for a bunch of
subscriptions and like the first month's
bill just came in and I'm just like this
is really bad but it kind of feels like
maybe if Apple can release
some of these features for free like it
totally changes the economics of some of
this stuff um K I see you nodding well I
was just wondering Maya if you had a
perspective on like does Apple have an
advantage given that they are the
integrator between all of these apps
beyond what any one app like a
perplexity or anthropic app that you
could also install on your phone might
have yeah I think absolutely like they
they own the ecosystem of apps so
they own that App Store and I think that
would be like if they could lower the
barrier in the interface of connecting
between different apps that would be
really interesting and I wonder if in a
year from now if Apple will be the AI
killer in the same way as when like OpenAI
launches a new release and that kills
a bunch of startups yeah that's for sure
I mean I think one of the things I think
a little bit about is there's been all
these demos that have been floating
around and I think they they're often
more impressive than they actually are
in practice but it's like type what you
want and a new app emerges and it kind
of feels like that sort of thing might
eventually happen on Apple but it also
is like this enormous threat to this
like whole edifice of the App Store that
they've constructed um so I don't know
navigating that I think is going to be
really complex and challenging I think
Maya brings up actually a really good
point that they're more likely to
gatekeep the App Store until they've got
their own Integrations working and then
it'll be all about how seamless it is
otherwise if you're going to have
different services having to try to talk
to different apps there's so much under
the hood of trying to navigate different
kinds of middleware that you'll have so
many points of failure that people will be
like all right well look the Apple
version can't do maybe as much but at
least it works and there seems to be a
real opportunity potentially there yeah
for sure yeah and it reminds me a little
bit of I know there's a debate many
years ago about like okay how do we get
the self-driving car thing to work um
and like one of the ideas was like
they're always five years away it's
always five years away yeah and I think
one of the most interesting things was
this debate over like oh do we need to
like reconstruct the whole environment
to make it simpler for the robot cars to
work or do we just like kind of let the
robot car like roam and try to train
it around all sorts of environments and
there's like a similar thing here for
like agents I guess in the AI case for
Apple where it's like they confront this
like very heterogeneous kind
of like situation for the App Store um
which really prevents them from kind of
like enforcing kind of clean agent
experiences and it's like well I guess
if anyone can do it it's Apple because
they have the most control at least over
the
space well it's a really good segue
because I think Maya one of the reasons
were really excited to have you on the
show was kind of the second topic I
wanted to sort of touch on today which
is you know I think literally in the
last 10 episodes of mixture of experts
um people have been like and agents
agents are on the way agents are going
to be the new big thing and we've kind
of debated it back and forth and you
know you've actually been working on
agents right and I feel like there's
such a delta between the circle of people
talking about agents and the people
actually working on agents um and
you're pretty rare in that
respect because I think you've been really
kind of like in the trenches on it um do
you want to tell us a little bit about
that work I was curious both to learn a
little bit about it and then kind of
what you're learning yeah of course so
um I sit in a really interesting team in
research so my team focuses on
incubating new technologies and opening
up Market opportunities for IBM and
we've been focusing on agents for
several months now um so very much in
the trenches and one of the first things
that we did this month is we
open-sourced a framework for building
agentic applications so still very early
days we did a silent drop um but we
think we have some interesting features
that we can offer in this place um that
reflect some of our learnings and I
don't know how much time we have but um
there's a lot of things that we learned
along the way and I think it's
very hard to bring agentic applications
into production and it's very easy to
take it for granted I think in terms of
operational complexity this is a step
change from fixed flows like it's
not incremental it's uh
it's another paradigm it's much harder
than fixed flows in terms of
implementation yeah we'd love to talk
about learnings and we've got time for
it I mean I feel like this is where like
Mixture of Experts can shine like lots of
people are talking about Apple let's
like talk about like really building
agents I mean maybe one way to cut
through it is is there something that
you found like
surprisingly hard right
where you're like before you went into
it you're like ah we can nail that it's
no problem and you're like actually this
is like really difficult so there's two
things that
were kind of a blind spot for us
and we didn't expect how hard they would
be and then I think there's one thing
that made very clear to us how we
bring this into production so I'm just
going to talk about it at the high level
first thing is an agent is underpinned by
a prompt so think of a set of
instructions that tell it how to behave
so let's say you
built an agent around Model A and then
you optimized it around Model A now if
you want to bring it to model B the
whole thing breaks and we had this
experience firsthand so we started
dabbling with llama 3 we moved to llama
3.1 which we expected to be an
incremental change nothing much changed
the whole thing broke took us three
weeks to reoptimize everything under the
hood and we're still not fully there so
this I think this is a bit critical
because if you want to stay on top of
the latest and the best models you have
all of this cost related to changing
models and that makes it really
prohibitive to try new models and I
think this is pushing us in another
direction where once you've picked your
model of choice and built something in
production with it it's going to be
really hard for you to change model
providers and I don't know if I'm very
happy with that being the status quo so
that's one part of it and we have some
ideas about how to overcome it the other
part is something we were actually
discussing with Kate Soule this morning
which is we take for granted how to
build with AI meaning if I'm building
traditional software engineering
applications I specify my features and
then I code in my features and then I
test my features and I'm done with llms
I have features I didn't maybe sign up
for in the first place maybe like
outputting hate speech um things that
are useful like summarization features
features that kind of come out of the
box but did I test for them no I kind of
take them for granted and I think we
made the mistake initially of taking
those features for granted and not
following a test driven approach and I
actually want to pass it over to Kate
because she works a lot in prompt
optimization and I'm sure you've had
your own struggles with this yeah yeah
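Maya's model-switching problem above — prompts get tuned to one model and break on the next — is often handled by keeping model-specific conventions out of the agent logic. Here's a minimal sketch of that idea; the profile names, templates, and stop tokens are hypothetical illustrations, not the actual API of any framework mentioned in this episode:

```python
from dataclasses import dataclass, field

@dataclass
class ModelProfile:
    """Model-specific prompt conventions, isolated from agent logic."""
    model_id: str
    system_template: str        # how this model expects instructions phrased
    stop_tokens: list = field(default_factory=list)

# Hypothetical registry: one tuned profile per supported model, so
# re-optimizing for a new model means editing a profile, not the agent.
PROFILES = {
    "llama-3": ModelProfile(
        model_id="llama-3",
        system_template="You are an agent. Tools:\n{tools}\nThink step by step.",
        stop_tokens=["<|eot_id|>"],
    ),
    "llama-3.1": ModelProfile(
        model_id="llama-3.1",
        system_template="Environment: agent loop.\nAvailable tools:\n{tools}",
        stop_tokens=["<|eot_id|>", "<|eom_id|>"],
    ),
}

def build_system_prompt(model_name: str, tools: list) -> str:
    """Agent code stays model-agnostic; only the profile changes."""
    profile = PROFILES[model_name]
    return profile.system_template.format(tools="\n".join(tools))
```

This doesn't eliminate the re-optimization cost Maya describes — each profile still has to be tuned per model — but it confines the breakage to one place.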
thanks Maya um you know I think one of
the things that I found really
interesting when we started to get into
the weeds is how do you think about this
kind of hierarchy of how a model or
agent is aligned so there's all of the
work that the model provider does in
order to train it to be safe make sure
it's good at things like summarization
and basic language tasks and they have a
a perspective that they enforce on the
model and how the model should behave in
situations then you have the alignment
preferences and priorities set by the
agent Builder and so you have all of
these behaviors about how the agent is
supposed to behave so how this model
with this system prompt together are now
going to interact in this new agentic
environment what patterns they're going
to follow and then there's even a third
level which is like when a user is
interacting with the agent I might have
my own preferences on how I want the
agent to behave and so this gets back to for
example like Siri and whether
Apple has the ability to actually
personalize to an individual user like
as a user I might want a very specific
way of interacting with my agent I want
things to be short always and bullets
versus I want you know everything to be
in markdown format and much longer and
you know have tables and everything else
inserted where possible so there's kind
of these different tiers you have to
start to account for when you're
building models and you have control
over some parts you can impact your
system prompt design for the agent you
can try and create tools and uh
different parameters that users can play
with for impacting their own
personalization alignment of the agent
but then there's things you don't have
control over that's defined by the model
builder and so you know I think that's
where some of the interesting
challenges come out of is trying to
navigate kind of those different levels
of control that you have and and trying
to work within the system that different
model providers are setting up
particularly if you envision needing to
ever be able to switch models to
something new where a different provider
might have set a different process yeah
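Kate's tiers of control can be illustrated with a toy prompt composer. The first tier (the model provider's alignment) is baked into the weights and isn't editable here, so only the lower two tiers are composable at build time; the function name and section headings are invented for illustration:

```python
def compose_prompt(builder_rules: list, user_prefs: list) -> str:
    """Stack the controllable tiers: agent-builder behavior rules
    first, then per-user preferences layered on top. The model
    provider's own alignment sits beneath both, in the weights."""
    sections = []
    if builder_rules:
        sections.append("## Agent behavior\n" +
                        "\n".join(f"- {r}" for r in builder_rules))
    if user_prefs:
        sections.append("## User preferences\n" +
                        "\n".join(f"- {p}" for p in user_prefs))
    return "\n\n".join(sections)

prompt = compose_prompt(
    ["only call approved tools", "cite sources"],
    ["keep answers short", "always use bullets"],
)
```

Ordering matters in this sketch: builder rules come first so user preferences can refine, but not silently override, the agent's core behavior — one possible design choice, not the only one.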
it almost kind of presages there's sort
of this really interesting thing of like
kind of like Legacy code or Legacy
models where you know like the intention
will always be like let's move to the
next great capability model but my kind
of the story that you're telling is like
it changes so much about the way the
agent behaves that it's almost like you
in many cases you may not want to
because of the uncertainty or like in
the very least the kind of like
evaluation burden of trying to figure
out how to get that to work well one
thought is here the notion of backward
compatibility is not something that
exists in llms we haven't really thought
about it that much it has not been a
thing but we know that
that's a really really big deal in
everything having to do with software
the notion of backward compatibility and
did everything immediately break and is
that really all that useful so if we're
going to be really doing this seriously
I think you're going to have to start uh
having to take that into account and
have people create a whole new slew of
benchmarks and functionalities and tests
and oh it was working this way before
how is it going to work if you're trying
to plug it into something old another
thing that I completely agree with what
Kate was saying that notion of control
think about generative AI with art you
can give it a prompt to make a picture
but then you can't tell it okay I love
it but like just change that little
thing over there to something else it
won't work that's not how these
models work that's a real challenge that
in software is like all right well you
did a lot of this right but I need you
to change this little bit to there and
this little bit to there it's not going
to work like that at a very basic level
so both of these things I think are a
very different way of looking at
software building and I think it needs to
be thought of very carefully to make
sure that it's actually practical over
time Marina I was wondering if you had a
perspective like does getting towards
like GPT structured output start to
solve some of the backwards
compatibility like if we can now have
greater structure on exactly how we
prompt the model and exactly the outputs
does that start to solve some of the
problem in your mind it's a step I think
that part of it certainly is the
structure of the output but another very
large part of it is what are sort of the
acceptable States and the constraints on
what makes sense or not even if you have
the structure the content of that output
could still theoretically kind of be
anything we're still talking strings or
or other Primitives and there's still a
notion of what kind of states in your
application as you're going through are
okay and not okay valid and invalid in a
deterministic flow you write it all up
and you know exactly what will and won't
happen you've got tests you've got
catchers you've got things like that how
does that look here that is I think the
next thing that's on my mind yeah
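Marina's distinction — structured output guarantees the *shape*, not that the *content* is an acceptable state — might look like this in practice. The action schema, allowed set, and limits are made up for illustration:

```python
import json

# Hypothetical action schema for one agent step; structured output
# gives us parseable JSON but says nothing about validity.
ALLOWED_ACTIONS = {"check_balance", "transfer", "cancel"}

def validate_action(raw: str) -> dict:
    """Check both layers: does it parse, and is it a valid state?"""
    action = json.loads(raw)                       # structure: it parses
    if action.get("name") not in ALLOWED_ACTIONS:  # content: allowed action?
        raise ValueError(f"disallowed action: {action.get('name')}")
    # content: constraint on the values themselves
    if action.get("name") == "transfer" and not 0 < action.get("amount", 0) <= 1000:
        raise ValueError("transfer amount outside allowed range")
    return action
```

In a deterministic flow these constraints would be enforced by construction; with an LLM in the loop they have to be re-checked explicitly at every step.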
I I think both of you raised a really
great point and I loved how much you
spent time speaking about control I
think control is going to be the key
thing that makes or breaks autonomous
agents um these things can run wild and
have consequences that could be costly
so like we've seen firsthand how data
that you might have thought was proprietary
gets sent to an external tool and now
goes to a third party and maybe you
didn't intend that in the first place so
one of my thoughts at the back of my
head is I don't know if it's fully
autonomous agents that will go into
production this year whereas it's more
of a hybrid sort of compound AI system
where some parts are agentic meaning
there's degrees of freedom that the llm
can take and other parts are more
prescriptive there are verifiers things
that allow us to get the level of
control we want and I think that's some
of the ideas we're starting to explore
here because I I just don't think with
the underlying models that we have we
can have autonomous agents fully
autonomous agents safe in production
that's where I would put my money
right now yeah for sure it almost kind
of seems like we're going to enter an
era of like almost like pseudo agents
where you have kind of like agenty
elements but like actually is quite
deterministic in some ways and that that
actually could persist for a very long
time I don't know Maya if ultimately your
dream is you know the agent unshackled
um but it kind of seems like the
issues you're pointing out are very like
pretty deep and categorical right I
don't know if you'd agree with that yeah
definitely not in the camp of agents and
AI unshackled I think um AI has
to serve us and fit our needs and I
think we need to understand how it works
and we need to make sure that
it works in ways that are in accordance with
our values and I think that's the type
of AI that I personally would prescribe
and would align with my own worldview
and values yeah yeah for sure well
before we move on to the next topic I
know Maya you said you sort of soft
launched this uh do you want to direct
our listeners to it yet or is it still
like you're just teasing it it's going
to come out soon so no I can uh more
than a teaser so we launched it we just
it's a silent drop we haven't shared it
broadly um if this is the first moment
you're sharing it yeah first moment I'm
sharing this so if you're listening to
this it's called the Bee stack and
specifically maybe we can drop the URL
later it's called the Bee Agent Framework
um you can do some cool things out of
the box with it right now so you
can already uh create an agent that can
plan and use tools and maybe correct
itself um so we have some use cases but
we also have exciting uh updates to
bring so one along the lines of solving
for the cost of switching models I don't
think we'll have the ultimate solution
but we want to reduce the friction of
switching models and um we're working on
bringing some of these Enterprise
controls that I've mentioned and um if
there are people who are interested in
joining in this um Journey I'm very open
to it still the very beginning steps but
I think we're excited about what we can
learn and Maya it's Bee as in the
insect yeah yeah like bees our
team really likes naming things with puns so
Bee as in worker bees and then maybe
there's hives and all of that
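The hybrid pattern Maya described earlier in the segment — agentic steps with degrees of freedom, gated by prescriptive verifiers before anything executes — could be sketched like this. The verifier, field names, and stub proposal function are all hypothetical:

```python
def run_step(propose, verifiers, state):
    """The LLM proposes an action freely; every deterministic
    verifier must approve before the action is executed."""
    proposal = propose(state)
    for check in verifiers:
        ok, reason = check(proposal, state)
        if not ok:
            return {"status": "rejected", "reason": reason}
    return {"status": "approved", "action": proposal}

def no_external_leak(proposal, state):
    """Guard against the failure mode from the discussion:
    proprietary data quietly flowing out to a third-party tool."""
    if proposal.get("tool") == "external_api" and state.get("sensitivity") == "proprietary":
        return False, "proprietary data cannot leave the system"
    return True, ""

# Stub standing in for an LLM's tool-call proposal.
result = run_step(
    lambda s: {"tool": "external_api", "payload": "customer list"},
    [no_external_leak],
    {"sensitivity": "proprietary"},
)
```

The point of the pattern is that the prescriptive parts are ordinary code: auditable, testable, and deterministic, regardless of what the model proposes.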
so to move us on to our next story uh
this week there was a New York City
based startup called HyperWrite um that
released a model uh called Reflection
70B um and it was kind of widely touted
the leader of the company came out
saying that you know this model
integrated this new method called
reflection tuning and this was the
secret sauce that allowed this model to
hit crazy good um uh metrics on all the
major benchmarks um there was a lot of
hype about it as most models kind of you
know Rising on the leaderboard get
nowadays um and then immediately there
was kind of a turn where people said
wait a minute we tried to reproduce some
of these results and like this seems
nowhere near what you're claiming uh
furthermore doing some digging this
seems like you just did some like
Bargain Bin fine tuning on some open
source models um and by now in true kind
of like Twitter driven media cycle
fashion there's just been this big cycle
of mutual recrimination and um and uh
and dispute um but the end result seems
to be that we have gotten a startup that
uh went on the publicly available
benchmarks that everybody is using to
evaluate model quality um and may have
engaged at least in some shading of the
numbers uh to make their model look
better than it actually was and I think
that's so interesting just because like
these leaderboards have become in some
ways like the benchmark that we use to
tell who's actually advancing the
state-of-the-art and what models are good
or what models are bad um and so I think
the main thing I wanted to kind of raise
and Marina maybe we'll toss it over to
you first is should we be worried that
we're going to see more of this like it
seems like the value of gaming these
metrics is always rising and so you know
regardless of whether or not there was
intentional fraud here seems like
there's going to be a lot of bad
incentives in the space soon but I don't
know if this is something you think we
should care about or this is just kind
of what happens I mean to some extent
this is kind of what happens but I was
actually really happy to see so many
other folks jump on right away and say
no I'm going to try to reproduce the
results you need to upload your weights
uh what about this what about that that
is science acting correctly good
science is supposed to be reproducible
and while it is possible to have
something and have a hype cycle it was
really nice to see that there was
immediate checking going on from you
know third parties and and everything
else as quickly as it was in previous
years decades centuries the Cycles were
very very slow of what it took for other
people to check their work and it's
certainly still the case maybe in other
fields like biochem things I don't know
about at least in our field it's
actually very easy to quickly start
checking so that was a nice thing to see
another thing I'll say about benchmarks
and you know I can always go off about
benchmarks is they are I mean it's great
they are a temporal
proxy of a really specific slice of the
world with a lot of things held constant
we are trying to check performance on a
particular thing and they're being
artificially controlled are they useful
absolutely are they sufficient no
so we always like in science to talk
about necessary and sufficient benchmarks
are necessary we need to all agree on
a particular thing we're looking at they
are not sufficient nor should they be
the whole point is you have a benchmark
then you have another one you have
another one you have another one we keep
sort of you know checking each other keep
each other honest and motivate each
other to explore what are the holes what
is the next thing to look at what is the
next thing to look at so actually I I
found this to be a very satisfying uh
yeah the system is working basically a
little bit Yeah Yeah and maybe I mean
this is maybe another way at the problem
I mean I remember going to NeurIPS a number
of years ago and there was a push a
number of years ago to say look machine
learning has this big reproducibility
crisis um and if anything this story is
like you know almost the other direction
like it actually turns out that it isn't
reproducibility in the kind of academic
sense but we are seeing kind of like an
emergent reproducibility happening in
like the rough and tumble of
Twitter or X uh I guess Kate you're
nodding I don't know do you think like
is the problem solved for
reproducibility like are we kind of
there or is this still kind of a
persistent thing that we should worry
about I mean I I think it's encouraging
that we have such an active Community
focused and I really like how Marina
said it on good science right making
sure that we check and validate I think
from my perspective one of the
bigger issues here isn't necessarily
just the reproducibility aspect but the
transparency aspect if we think of how
uh this model was trained how it's
communicated so you know we need to move
as a field and I I think there's a lot
of good actors here but clearly there's
um some cases where we haven't quite got
there yet where it's not just a matter
of dropping a benchmark you have to be
transparent and open about how like
good science also means sharing your
methods and your approach and how it was
trained and what it was trained on and
you know getting into a lot more open
detail where uh right now you know the
the norm is to kind of train these
behind a black box put an API out there
and say here we did this really cool
thing Trust it works like can you
imagine that happening in like other
products or Industries it's like here's
an airplane yeah trust us we tested it
it's fine um you know I I think we need
a lot more openness in just in general
and how these are trained because
there's always going to be misaligned
incentives when it comes to these
benchmarks they just get so much
traction and commentary that you know
we need to have it always paired with a
really open discourse of what was
actually done and the ability to inspect
what was actually done yeah and this
seems to be the crux of it right because
think you know benchmarks have almost
represented like the compromise position
for the field right which is okay well
we have all these companies that have
trade secrets and they want to kind of
keep their Innovations like you know
reflection tuning or whatever they want
to do um and so we're like okay well so
long as you can kind of like show us
your performance on the benchmarks
that's the transparency we're looking
for I guess Kate is just to kind of push
you a little bit further on it is that
like you're almost saying that like we
should actually expect more than just
the benchmarks yeah and I think there's
a lot of like transparency in name only
like companies say oh we're going to
publish a paper on this to follow or
here is a high-level overview of what we did
but like if you don't actually
share the weights share more details
like really you know get into the
weeds of um exactly what was done you
know a lot of this can just be kind of
surface level that you know you can talk about
publicly as being transparent but did
you actually deliver details that have
helped scientists reproduce What was
done you know that is the level of
transparency we need to drive to so I'm
in the business of building AI
applications for production and so
benchmarks is something we were just
discussing this
morning for me it's not useful to see a
certain model's performance on a
benchmark because there could be a
number of things that are happening it
could be that the model has seen this
data before it could be that even though
it does good on this it might not
generalize to my own use cases and what
I care about so my ethos around
benchmarks is if there is a test data
set that fits a feature that I'm trying
to develop maybe I'm trying to improve
reasoning capabilities maybe I'm trying
to improve tool calling there's a test
data set that helps me identify my blind
spots I'm going to go all in for it but
it's so important like we're investing
or building our own test cases and our
own email criteria because we have a
very specific thing we're going after
and nothing beats doing that and um yeah
even When selecting llms it's I yeah I
feel like benchmarks is nice to like
maybe have a sub selection of models to
look at but it's like Marina said it's
an it's helpful but not a complete
signal yeah there's almost a kind of
interesting phenomenon that I've been
sort of chasing after is a little bit
like the idea that you know it used to
be that the limiting reagent was getting
new models out but then like models are
everywhere now right and so it's almost
like now the new limiting reagent is
like a well-crafted eval or like a
well-crafted benchmark set um and like
that's increasingly becoming like where
a lot of the the the bottleneck is it
feels like in some of the workflows that
you're seeing all over the place well
and I wonder like as our models are
across the board you know if you just
look at the state the field models are
getting better and better being able to
do what last year a model you know could
do but it needed to be 10 times bigger
you know so these model performances are
continuing to improve you know we're
starting to get into the zone of you
know is this becoming commoditized and
so therefore it's this race to the
bottom trying to inch up and get a 0.01
you know percent increase in a in a
metric that might not actually
informative of your use case because
everyone's trying to you know show some
level of differentiation it's becoming
increasingly more difficult as
performance improves over the you know
I'll call it like the workforce tasks
that are kind of the um bread and butter
low hanging fruit that pretty much any
model can do now yeah for sure yeah
sometimes I look at some of these
benchmarks and like what are we what are
we doing here exactly what are we
spending time on yeah
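The model-selection workflow described above (use public benchmark scores only as a coarse shortlist, then rank the survivors against your own test cases and eval criteria) can be sketched roughly like this. Everything here is a hypothetical illustration, not anything from the episode: the model callables, the cutoff, and the judge function are placeholders for whatever APIs and scoring rules you actually use.

```python
def select_model(models, benchmark_scores, test_cases, judge, cutoff=0.7):
    """Pick a model by custom evals, using benchmarks only as a filter.

    models: dict of name -> callable(prompt) -> answer (hypothetical API).
    benchmark_scores: public benchmark scores, used only to shortlist.
    test_cases: list of (prompt, reference) pairs you wrote yourself.
    judge: callable(answer, reference) -> score in [0, 1].
    """
    # Step 1: coarse shortlist from public benchmark numbers.
    shortlist = [m for m in models if benchmark_scores.get(m, 0) >= cutoff]
    # Step 2: score each surviving model on your own test cases.
    results = {}
    for name in shortlist:
        scores = [judge(models[name](prompt), ref) for prompt, ref in test_cases]
        results[name] = sum(scores) / len(scores)
    # Pick the model that does best on *your* cases, not on the benchmark.
    return max(results, key=results.get), results
```

The design choice this encodes is exactly the panel's point: the benchmark number never decides the winner, it only trims the candidate pool before your own eval criteria take over.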
Yeah. So, at Mixture of Experts we always try to do a paper as one of our stories, and I want to end today by focusing on a paper that just came out that I thought was pretty fascinating. It's entitled "Can LLMs Generate Novel Research Ideas?" This is almost a continuation of a paper we talked about last week on using AI for science, and the big debate there was this very interesting question: can LLMs be creative? Can they become a partner that pushes us in new research directions we would not have gone in before? It's particularly interesting because there's a parallel discussion; some of you may have seen the article by the sci-fi writer Ted Chiang that came out about a month or so ago in The New Yorker, arguing on the creativity question that LLMs can't be creative, in some ways, because they don't make intentional choices about their outputs. Anyway, to quickly sum up the paper, its argument is: look, we played around with LLMs, and it does seem like they can generate creative ideas; the spicy claim is that sometimes they're even more creative than humans themselves. So, dot dot dot, we should be really bullish on the promise of LLMs assisting research at the very beginning of the workflow, in the most human part of the work. I guess the first question is just: do we buy it? Kate, maybe I'll start the story with you, because you've come in as the second commenter on a number of the stories. Do you buy this claim?

So, I think the paper makes a lot of really interesting assessments of how a model can support research tasks, but what struck me most about it is that they had to search through 4,000-plus examples to get 200 unique research topics. So while the models were pretty good at creating, and I think they focused specifically on novelty, on new ideas that the human subjects in the study hadn't thought of or come up with, is it really that models are more creative, or are we just able to brute-force an automated search through thousands and thousands of scenarios, like rolling dice until a unique number comes up, exactly? So that's where I think there's still some room left for debate on what it means to be creative or novel, and whether the humans got a fair shake, if you think of it in those terms.

I guess, Marina, as a researcher, do you think this is the kind of tool you might be using in the future, or is this still mostly game playing? Because, Kate, and maybe I'm being uncharitable to your position, you're sort of saying this is just a Magic 8-Ball: if it generates something unique and great that inspires people, then awesome, but there's almost nothing uniquely creative about it. I guess that's almost what you're saying.

I mean, I don't want to say a broken clock is right twice a day, but if you do something enough times, you're going to stumble across something new. Does that actually mean something is more creative? Especially if you think about the New Yorker article you brought up, where creativity is a choice: are we actually saying models are creative, or are they a tool that lets us brute-force search through a larger number of randomly generated ideas? It's the second one to me, although that's valuable, very valuable.

Absolutely. Creativity requires intent, and there is no intent here. It's valuable because it can save humans a lot of time, and we come in with particular biases about how to think about things, whereas if something is brought up to you, you can immediately say, "Oh, that makes no sense," versus, "Oh, I didn't think about it that way." That's valuable, but there's no judgment from these models. I know this has been used a lot recently in other science fields, like medicine or chemical compounds, where there are just thousands and thousands of combinations you can try and humans simply don't have the time. What's interesting is that they are being used as a filter on the brute force: you get it down to "this makes no sense," "this is physically impossible," and so on, and then human intuition comes in: "Based on my twenty years of experience in the field, this idea is worth pursuing and this one is not." Can I tell you exactly why, to the extent that I could prompt an LLM for it? No, but it's the sum of my experiences. That kind of progress is great; it's a very useful thing. But creativity means intent, and there is no intent.

Yeah, that's right. So you kind of buy the paper in some ways. I guess the question is almost about calling it creativity: the problem is that we're giving it this word that carries all this baggage, that sits on a pedestal, the great creative artist, the great creative scientist.

Yeah. And, sorry, it's not that the paper itself isn't creative, but asking machine learning to help you go through a whole bunch of material and tell you which of it is definitely not garbage, or "here are some other things I've tried," that's not new. They're applying it to this extremely specific use case, which is great, but that part, to me, is not new, while being useful.

Yeah, for sure.
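The generate-and-filter pattern the panel keeps returning to (sample many candidate ideas, automatically discard duplicates and obviously infeasible ones, and leave the final judgment to a human expert) is, mechanically, something like the sketch below. The generator and feasibility check are hypothetical stand-ins for an LLM sampling call and domain rules; nothing here comes from the paper itself.

```python
def brute_force_ideas(generate, is_feasible, n_samples=4000, target=200):
    """Collect unique, feasible candidates for human review.

    generate(): returns one candidate idea (e.g., an LLM sampling call).
    is_feasible(idea): cheap filter for "this is physically impossible".
    """
    seen, kept = set(), []
    for _ in range(n_samples):
        idea = generate()
        if idea in seen or not is_feasible(idea):
            continue  # duplicate or obviously impossible: discard
        seen.add(idea)
        kept.append(idea)
        if len(kept) >= target:
            break
    # Nothing above is "creative": it is sampling plus filtering.
    # The judgment call (which survivor is worth pursuing) stays with people.
    return kept
```

This makes Kate's point concrete: the loop can surface something no human happened to write down, but the novelty comes from volume and deduplication, not from intent.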
Maya, maybe I'll turn to you. This conversation almost makes me think of the debates we were having almost ten years ago about interpretability. One line of debate said you don't really want to use these systems if you don't understand how they do what they do, and then there was a certain group of machine-learning chauvinists who said, well, if the model always succeeds at the task, why do you care how it gets done? I think there's a similar split when we start talking about AI and creativity: we don't just want you to get to the right answer, we want you to get to the right answer in the right way, and what we mean by that in the creative context is that we want it to be somehow a little more than a random number generator. I don't know if you fall on one side of that debate or the other: is it "it doesn't matter, if these tools help us get more creative results then I'm happy," or "no, part of our research agenda really should be to get these systems to be capital-C Creative"? What that even means seems to be a big question.

Yeah, I don't know if I have a blanket answer for this one; it's context dependent. I know that in the arts world, source attribution is really important. A lot of artworks are done in the style of an existing artist, so there should be credit given to a certain artist, and how do you give that credit? We want to assign this characteristic to AI because we hold social values regarding IP, regarding giving the right credit, so we want to align these AI systems with how our world functions and how we give credit where credit is due. So I think it really depends on the cost of not having that explainability baked in, and not just explainability but these other values that matter to us, the ones that would make this technology align with our societies rather than us having to adapt around it.

Yeah, that makes a lot of sense, and I think that will be an interesting struggle, because where some of this goes is: "Model, you came up with a counterintuitive, very creative result; can you explain why that's the case, or how you reached it?" That kind of interpretability starts to look a little more like chain of thought.

I think we're just giving these systems too much credit. Take the paper we talked about on generating novel ideas: first of all, what do you mean by novelty? Is it just something net new, and is that something useful? And consider how the system works: it's just the statistical probability of the next word. When we come up with novel ideas, there's meaning, value, and intent behind them; I'm trying to put forward an idea that I care about. So it's really tough to equate the two and put them on the same standing. One can be a tool for the other, to maybe give you an idea you didn't think about, but I don't think one replaces the other.

Yeah, for sure. This also plays into a very interesting debate: on one side, "they're just stochastic parrots," and on the other, "no, they're more than stochastic parrots." But Maya, you're almost outlining a third path: yes, they're stochastic parrots, but that's really powerful, actually. Let's not downplay it; the stochastic parrot is incredibly useful in certain domains, and we almost shouldn't sell that short. Is that a bit of what you're saying?

Yeah, absolutely. This whole industry of generative AI was unlocked by how good these stochastic parrots are. I read that initial paper when it came out, and I think what took us all aback was that this is going to have really useful applications if implemented in the right way. But that doesn't mean the technology can inherently have reason and intent; I don't think there's a world where we're there.

Well, so, Marina, I'll let you have the last word: in four or five years, are you going to have an LLM co-author on a paper, or is that still a total pipe dream?

No, and I don't think that's the right aim, in that sense. I don't think I'm going to have that co-author. But it could very much be a tool, and in fact when we publish now, we're asked to say whether we've used AI in our work or anything of that kind: an LLM helping you sift through the related work, an LLM helping you manage your bibliography and figure out what's similar and what's different, things of that kind, sure. But to go along with what Maya and Kate were saying about intent: that's not what the technology is able to do. Intent is something you get from beings that are alive, beings that are actually able to give it. That's why you can't have actual art. I agree with Ted Chiang very strongly: you're not going to have AI art. Art moves us because of the intent of the person who made it, whether it was originally their intent or not. With AI art, where you can actually feel the care and the intent again is in the people who made the LLM; think about how much effort we all pour into making something that is useful. It's not the technology itself, it's the people trying to create something intended to be used, something helpful, efficient, and effective. That's where the intent is, not in the tech itself.

That's great. I'm applauding. So, Maya, Kate, Marina: in a nightmare landscape of jargon and hype, this panel is a light in the darkness. I appreciate you all taking the time this morning to stop by Mixture of Experts, and hopefully we'll have you on again at some point in the future. And for all you listeners out there: if you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere, and we will see you next week on Mixture of Experts.