AI Roundup: Rabbit Hardware, GPT-2 Bot, FT Deal
Key Points
- “Mixture of Experts” is a weekly AI‑focused programme that brings together a rotating panel of specialists to cut through the flood of news and highlight the most consequential developments.
- The current episode features three IBM‑affiliated experts – Chris Hay (Distinguished Engineer, IBM), Kush Varshney (IBM Fellow, AI governance), and Shobhit Varshney (Senior Partner, AI & IoT consulting) – each representing a different AI domain.
- The first headline covers Rabbit’s AI‑companion hardware launch, which quickly ran into technical glitches (a battery‑drain firmware issue) and credibility concerns after it was revealed the device was essentially just an Android app that could run on a phone.
- The second story examines the surprise appearance of a “GPT‑2 chatbot” on Chatbot Arena, raising questions about the reliability of current LLM evaluation methods and the transparency of model provenance.
- The final discussion notes OpenAI’s new agreement with the Financial Times to license the newspaper’s content for training, illustrating the growing trend of commercial media data being used to improve large‑language models.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=hwNkFnR1U0I](https://www.youtube.com/watch?v=hwNkFnR1U0I) **Duration:** 00:41:34
- [00:00:00](https://www.youtube.com/watch?v=hwNkFnR1U0I&t=0s) **AI News Roundup with Experts** - The show introduces an IBM-led panel to dissect recent AI headlines, beginning with the troubled rollout of Rabbit AI hardware.
hello and welcome to mixture of experts
uh on this show we're going to be
meeting weekly to review the sort of
deluge of news that's happening in the
world of AI um and the goal here is to
distill down right it can be really hard
to keep track of uh everything that's
flying around on a weekly basis but the
Hope here is by bringing together a
group of experts we can distill what's
happening um and give you an
understanding of what's happening in the
world of AI and what to be looking for
uh in the week ahead and so today I'm
joined with a great panel of um three uh
experts uh really hailing from different
areas of the AI World um so just to
quickly run through them Chris hay who's
a distinguished engineer at IBM um he's
the CTO of their customer transformation
operation um Chris welcome to the show
hey thanks for having me looking forward
to this yeah absolutely uh Kush Varshney
he's an IBM fellow uh working on AI
governance issues uh Kush welcome yeah
thanks Tim yeah and uh Shobhit he's the
senior partner Consulting uh running the
AI and iot business uh in the US Canada
and Latin America so welcome to the
show thanks guys thanks for having me um
well let's go ahead and get started
we're going to cover three really big
stories uh of the last few weeks um the
first one is going to be uh the uh
recent release of the rabbit AI Hardware
uh product uh and we're going to talk a
little bit about some of the trouble
they've been having on the roll out and
what it all means for the future of AI
enabled Hardware secondly we're going to
talk about uh what's happening with gpt2
chatbot which is a mysterious chatbot
that has just appeared on chatbot Arena
um what it is and what it really tells
us about the uh evals in the AI and llm
space in particular uh and then finally
uh we're going to talk about open ai's
uh concluding of a deal with the
financial times to license their data uh
for training uh purposes
so I'd like to kind of start first with
uh the rabbit story so rabbit if you've
been watching is a a really sort of
widely discussed Hardware startup uh
whose bid is basically to say you know
in the future we're going to have ai
first hardware and rabbit effectively is
a little device that is intending to be
kind of an AI companion for you um they
rolled out just recently um and have run
into immediately a number of problems so
they you know had to push a firmware
update to deal with a a battery problem
right the battery was draining too
quickly um they've been criticized
recently because it turned out that
their product was essentially an Android
app that could be running on a phone um
and so they've received a lot of
criticism and I wanted to bring up this
story just because it feels like it's
the second data point right so
there was the release of rabbit uh
Humane which is another company that
released an AI enabled pin um has
similarly run into a lot of criticism
people saying you know why would I buy
this is this a good product at all um
you know what is what is this all for
and so Shobhit I was hoping I would bring
you in kind of maybe to kick us off here
um because I think what's really
interesting is you know personally I'm
like very excited by the future of AI
Hardware right like I think there's just
so many cool things that can happen once
AI is on device and it's a thing that
you can carry around with you but
clearly kind of some of the first forays
some of the most talked about forays
that are happening um today um are are
clearly having some some teething issues
some some issues um and so kind of want
to get your take as someone who's like
deep in the AI and iot space and kind of
thinking about the relationship between
Ai and and Hardware how you see these
recent stories and and what do you think
it tells us about how this Market is
evolving thanks Tim I had the pleasure
of playing with the rabbit R1 at the CES
this year and I obviously as a geek I am
I did drop my $200 I received my
rabbit you own one right so it is a
fantastic effort uh if you think about
uh the direction that AI is moving it'll
go to the edge more and
more right you're seeing the models
getting smaller there's a lot of work
that's happening in on device Computing
Apple breaking its uh walled gardens
and open sourcing its OpenELM you'll
see Google with its Gemma models all of
those are moving closer and closer to
the edge so I generally love the
direction that it's taking that way you
get addressing things around privacy
your data is being computed on the
device and that stays with you so the
direction that the tech is taking is
fantastic I'm all for it I think there's
a lack of appreciation of what problem
are you really trying to solve for and
are the other devices that are better
suited for it so when you start to uh
react to a device like that in your in
your brain you're trying to create a set
of things that you're going to evaluate
this on I think that's where the problem
is with R1 or from when you look at even
Meta's Ray-Ban glasses and the Humane pin
right we have a set of things that
we are looking to evaluate it against as
an example I would appreciate that it
understands me as a person really well
if it's attempting to be a personal
assistant I would appreciate that it
would have instant responses if I'm
looking if I can do something in half a
second or Split Second faster on a
regular mobile phone I'm going to tend
to go do that so we're all
optimizing for how to make it more
effective in our own lives then you
start to look at things around I already
carrying a cell phone in my pocket so it
has to be net new like for
example the watch I'm wearing is adding
something to the ecosystem right versus
when you when you create the set of
criteria and then you start to evaluate
rabbits R1 it starts to fail on some of
the basic uh capabilities that we're
expecting from it the direction is great
but I think the the battery life being
very low the the fact that the screen
itself is they're teasing you with
certain things you can do with a
touchscreen like typing on the terminal
but you can't really interact with the
menus the menus kind of remind you of
how we had the old scroll wheel iPods
those are amazing to scroll through
music but they're terrible at changing
settings things of that nature and we see
that Paradigm Shift over to the rabbit
R1 as well there are a few things around
taking images the visual recognition of
what's in front of you that has been
pretty decent like I've had good
response when I'm pointing it to certain
things in front of me documents is a hit
or miss right now this handwriting
recognition is still taking a lot more
time
so yeah I was curious if you've got like
have you had like what's your most
magical experience so far right I I'm
almost interested in like the Steel Man
case of like what's the what's the most
exciting thing you've done so far with
it right because it's gotten so much
hate online that it's almost interesting
thinking about like the the killer
application like I remember when I
bought my smartphone for the first time
I was like this has maps on it I'm
literally never going to be lost again
and like that's like a huge deal um and
I guess in the AI space I'm still kind
of like waiting for that and I'm curious
about as as someone who owns it and uses
it and is playing around with it if
there's like things where you're like oh
this is starting to be really cool yeah
so I think the promise of the large
action model is pretty cool uh it has
solved this to a decent extent with a
few apps like Uber and others but
the fact that a lot of the services that
we that we use today are hidden behind
applications and not all of those
capabilities are exposed through apis so
it's difficult for say a personal
assistant Siri or ChatGPT or others to be
able to go call those and do some
actions so the large action model I
think that has a lot of Promise uh the
training data becomes a constraint for
them that's their Achilles' heel so so far
they've had hundreds of people manually
go and train these models right and
they're going to open up this catalog of
hundreds of different models uh over
time but in the current form it's very
limited in what actions I can take on it
but the fact that you can delegate a end
to end process that's very complex and
otherwise you couldn't have done it with
APIs and that's what really excites
me I see a lot of applications I mean uh
so I don't have one of these I'm not as
much of a Gadget Guru as Shobhit is
but um uh but yeah I mean uh I think the
um uh I mean there's going to be fits
and starts with any sort of new paradigm
right and uh uh things have to start
somewhere I'm more of an optimist on on
things generally so um to me what
this is leading to is actually like a
fourth Paradigm of how we interact with
Computing right I mean there were punch
cards there was command line then there
were GUIs and this is now I mean like
we're in this fourth sort of era of
natural language interactions
and so forth and I think I mean yeah I
mean maybe there's no killer app yet but
the killer app maybe is the fact that we
have this new way of interacting and
that's what these devices are going to
uh start us uh on the road down and uh I
mean having having this more like Mutual
theory of mind like this system
interacts with us it understands us we
understand it I mean I think that's
where we're headed and um uh the more we
can just keep down that road I mean of
course the first instantiation of
anything isn't always the like the most
perfect or the best but um but I think
that's where I'm optimistic about it
yeah and I think there's kind of this
interesting sort of hill climbing right
because I think you know my friend was
like this is this is like Google Glass
all over again right like you're going
to have like a couple products that have
like such a bad rep that they kind of
taint the entire market for like you
know a decade plus but Kush I was kind of
like agreeing with you I was kind of
like well it's not like these products
are failing so hard right like if you
remember when Google Glass came out
people like went into bars and like got
beat up because they were wearing the
Google Glass like we're we're not there
yet and so it feels like we are kind of
more in this like hill climbing um
scenario I mean I I I don't have the
device but I think it's utter nonsense
if if I'm honest right well tell us why
you know well if you think about what it
is right what do you need to
have here a camera right a touchscreen
right you need access to Wi-Fi and then
for it to be useful you need a cell
connection as well as you move around
it's going to do image recognition and
then it needs AI Hardware on board what
is it it's a phone okay so this is why
you can't find a killer app cuz the
killer app is a phone and and when I
look at it and and I I'm going to give a
practical example so Apple silicon is
absolutely incredible right so last
night on my M3 I fine-tuned the Mistral 7B
model with my own data set in 15 minutes
at 250 tokens per second the GPU is
incredible and that same technology is
coming into the phones Apple's going to
go on device they've got the hardware
with apple silicon you know and then the
mobile phone manufacturers are going
to do the same so as as far as I'm
concerned and I agree with the Paradigm
but it it's like trying to sell a pager
to somebody today it's like here's this
thing that's got the things you need you
can get messages and you know but nobody
has a pager right because it was
replaced by the phone and and so I do
think there will be AI on Hardware
devices I I just don't get that one yeah
yeah and I think you're also raising I
think one final point I wanted to hit on
before we move to the next story is you
know obviously the thousand-pound gorilla
in the room is Apple it's it's not even
a thousand pounds it's like the 100,000-pound
gorilla basically in the room because it
it's got the hardware it's got the data
you know should be able to execute on
all this they they haven't yet really
right and so this is I think where all
the other Hardware companies see an
opening is like well Apple's going to be
so conservative that there's an opening
in the market for them at least to get in
at least maybe be a good acquisition
Target right um I guess Shobhit do you just
to bring it back to you I'm curious if
you've got any thoughts on Chris's uh
attack on this whole idea because
clearly you were bullish enough to to
buy the products and experiment with it
and um and I'm I'm curious if
you've got the Chris takedown here like
what's what's the what's the thing he's
not seeing no like he's on the right
track in the current state it's not a
great product right but just being an
optimist on where the tech is going
I'm more of the vibe that I see the
promise of what this can bring but uh I
I think that these uh devices will
evolve and apple takes a while to come
into this industry right same thing goes
with the Vision Pro glasses right I
again I was a big fan of them and I
bought them early on and 30 days in I
did return them so I just the fact that
I found some experiences and the promise
of where this can be at this point I'm
just waiting for the next version to
come out but I'm With You Chris in the
current state yes the $200 I I would
have used it elsewhere but uh I'm just a
sucker for a good Tech man yeah me too
Shobhit but maybe my challenge back to you is
let's fast forward 6 months time post
WWDC right when all of the uh AI
capabilities start to move on to iPhone
regular which I think has already got
the hardware that it needs to do these
scenarios and let's see if the rabbit
actually comes out of your drawer at
that point or whether you're just doing
those same scenarios on your phone yeah
I think one of the funny scenarios I was
thinking too is people right now
obviously are focused on like the top
end of the market right which is like
who's willing to pay hundreds of dollars
for the rabbit I also kind of think that
as models get more efficient and more
energy efficient uh we may just end up
putting really small models in all sorts
of existing Technologies and you know I
think this could be both an interesting
thing right and then also potentially
like a bad thing like it's like 2035 and
you're like arguing with your toaster to
get it working because at some point like
someone made like the only interface for
this is language basically
so um let's go ahead and move to our
second story so um I wanted to talk a
little bit about gpt2 chatbot so if
you're not familiar with this uh
essentially this mysterious thing
happens there's this platform called
chatbot Arena which has become in some
ways kind of the gold standard for
evaluating models and it's a really
simple idea you basically have people
talk to two models um and you tell them
which one you like more right and this
is basically allowed the comparison
cross product of a lot of different sort
of Open Source models and proprietary
models that are floating around the
space and this kind of mysterious one
emerged uh gpt2 chatbot which everybody
claims is incredible it's amazing and I
agree actually playing around with it
it's actually like quite impressive um
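(Editor's aside: Chatbot Arena turns those pairwise human votes into a leaderboard using Elo-style ratings; LMSYS has also used related Bradley-Terry fits. A minimal sketch of the update rule, with made-up model names, K-factor, and battle outcomes rather than any real Arena data:)

```python
# Elo-style ratings from pairwise "which answer did you prefer?" votes.
# The starting rating, K, and the battle list are illustrative only.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings toward the observed outcome of one battle."""
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)
    ratings[winner] = ra + k * (1 - ea)
    ratings[loser] = rb - k * (1 - ea)

# Hypothetical battles: each tuple is (winner, loser) from one human vote.
battles = [("model-x", "model-y"), ("model-x", "model-z"),
           ("model-y", "model-z"), ("model-x", "model-y")]
ratings = {"model-x": 1000.0, "model-y": 1000.0, "model-z": 1000.0}
for w, l in battles:
    update(ratings, w, l)
# model-x wins every battle it plays, so it ends with the highest rating.
```

Each vote nudges the winner up and the loser down, weighted by how surprising the result was; after enough battles the ratings order models by human preference.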
and it was accompanied by this sort of
mysterious kind of opaque tweet from Sam
Altman saying that he also you know has
good feelings about gpt2 and so it
immediately has led to kind of this like
fanfiction if you will about what this
model is and whether or not it is kind
of a Trojan horse a quiet you know stealth
release of what could be GPT 4.5 or GPT
5 um and so Chris I want to kind of
throw it to you on like what are we
seeing here is gpt2 chatbot really like
the Next Generation model um and if
you've got any kind of theories about
that a would just love to get your take
on like what are we seeing do you buy
the hype I don't know I mean I had a
play with it it's pretty good actually
to be fair is it GPT-5 I don't know I
think they've hyped GPT-5 so much that
if that's it at this point it has to be
AGI or it's not even going to impress us
exactly so maybe it's GPT-4.5 but I I
don't think that I I I read a Theory
online I can't say who said it but I
actually like it I somebody said that uh
take the gpt2 llm which they've open
sourced you can download that on Hugging
Face and they reckon that they may have
trained gpt2 on the uh latest uh data
that trained GPT-4 and I think that's an
interesting theory right you know gpt2
with GPT-4 data so maybe it's something
like that um I don't know um but I don't
think it's GPT-5 it probably is GPT-4.5
and as you say you you've got to put it
in some sort of Arena to to see how well
it's actually performing and you know
they'll have run all the kind of MMLU
benchmarks and the twenty thousand other
benchmarks that are out there so you know
sticking it in the Chatbot Arena to see
how it performs there is is probably
quite a smart move right it's it's a
good way of testing out how that model
is um so I I think you know seriously
it's probably GPT-4.5 but I I really
like the idea that it's gpt2 with GPT-4.5
data I think that's a cool
Theory yeah I think I mean there's two
things there one of them is like if that
actually turns out to be the case and
the model performance is like really
good in the arena it's like yeah do
these architectures really matter like
is it just like you you have enough data
and you can actually make this like
amazing um like that ends up being the
bigger lever um well curious so I do
have a follow-up question here but just
to quickly pause I mean Kush I'm
curious if you got thoughts on like
first do you just buy the hype like do
you think this is the next model is it
all overhyped curious if you got any
thoughts on that it could be um I mean
anyone's guess is as good as anyone
else's so um yeah I mean I'm sure it is
something that's coming up next but uh
yeah why speculate I mean I'm sure
they'll tell us pretty
soon exactly if you guys followed the
the talk by Andrew on how agentic flows
are going to be the way we get to AGI I
think over time the next set of models
that you bring out they will have a
decent router that will go pick the
right models and you're seeing these
kind of things come out from open AI
already right they have they're
automatically picking the right model
based on the queries and things of that
nature right so I think the gpt2 would
be a step in the direction of getting
towards the 4.5 and 5 but I think it'll
not be just one big model that's going
to be able to solve all of that so I
think they may be testing out in public
and getting some feedback on how people
are reacting what kind of questions
people are asking and things of that
nature in these open LMSYS arenas
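(Editor's aside: the router idea can be sketched as a few heuristics sitting in front of a model catalog. The model names, keywords, and thresholds below are hypothetical illustrations of the pattern, not how OpenAI actually routes queries:)

```python
# A toy query router: send hard queries to a big model, easy ones to a
# small one. Model tiers and heuristics are hypothetical placeholders.

def route(query: str) -> str:
    """Pick a model tier for a query using crude heuristics."""
    q = query.lower()
    # Code and math hints tend to need the strongest model.
    if any(kw in q for kw in ("def ", "traceback", "prove", "integral")):
        return "big-model"
    # Long or multi-part questions get a mid-tier model.
    if len(q.split()) > 40 or q.count("?") > 1:
        return "mid-model"
    # Everything else goes to the small, cheap model.
    return "small-model"

print(route("What's the capital of France?"))            # small-model
print(route("Why does this code raise a Traceback?"))    # big-model
```

A real router would more likely be a small learned classifier trained on query/outcome pairs, but the dispatch shape is the same.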
it's very entertaining it's great drama
in the AI world I love it I just pulled
up popcorn and just enjoy what's
happening uh there was somebody who
posted that uh originally Sam Altman
had tweeted gpt-2 and then edited it to
be gpt2 and they're
just leaving breadcrumbs to just make
this more entertaining so I love the
direction that is going in I think over
time it will not be one big 4.5 or five
model you'll end up with a mixture of
experts the way that uh that they will
solve for this yeah for sure and just
before we move on you mentioned Andrew
who who is Andrew and is that stuff
public if people want to check it out or
is that internal yeah Andrew is God of
AI so he's like if you look at
DeepLearning.AI he started the Google
Brain team and whatnot he has co-founded
Coursera he is a great guy to
follow on AI yeah and it's also very
funny to me it kind of occurs to me that
it's like whether or not it's Sam Altman
or like Taylor Swift like both are
basically like dropping breadcrumbs on
social media as a way of like driving
engagement around their products so uh
and just for folks earlier that's that's
Andrew Ng um if you want to check out
his stuff he's he's great I agree um so
I uh I want to put on my tin foil hat
for a moment right to kind of go on the
next sort of turn of the screw with this
story is basically uh let's assume for a
moment that gpt2 chatbot is the next
greatest thing that open AI is
going to release I think it's actually
very indicative that one major way they
want to do an evaluation around this
model is to release it on chatbot Arena
right because like I think one of the
interesting things I see evolving in the
space is that you meet a lot of people
who are like you know basically
real deep machine learning
people who like I think desperately kind
of hate the idea that like in order to
tell whether or not a model is good you
just talk to it for a bit and then
they tell you whether or not it's
good right like they would prefer
to have some kind of much more
structured evaluation for measuring kind
of like conversational quality um but as
it kind of turns out Chatbot Arena is
kind of like dominating the space over
time and um you know it kind of leads
this very interesting world where it's
like become more and more difficult to
like quantify model quality and we're
almost just kind of falling back to like
almost the most one brain cell way of
evaluating models which is well I don't
know you talk to it for 10 minutes and
then you say whether or not you think
it's good or not and I was joking with a
friend recently I was like oh what we
should do is we should we should start
reviewing models like we review like
fine wine where you're like oh this is
like a model with like you know oaky
overtones and it's a little bit more
conversational and like I think we're
like moving in that direction um but I'd
be curious to hear from you know
particularly like Chris like whether or
not you agree that like that is the
future because it is so funny that like
what's happening is basically you have
the super advanced technology but our
eval methods remain like very
rudimentary and I think some people
would say that's good some people would
say like well that's not how it's always
going to be I think it's a good thing I
mean I mean is it really that different
from the original Turing test right
that's what Alan Turing said right
which is you go have a conversation if
you can't tell the difference then you
know then is it human or not and and
actually if I think about the problems
with the benchmarks then this is why I
quite like the Elo ratings again it's
not perfect uh LMSYS I think they do a
good job there with the leaderboard but
the the problem is that because all the
benchmarks are published we know
everybody is fine-tuning to the
benchmarks right so you know so how
valuable are the benchmarks really so
everybody's like I'm 84 or I'm 85 and
you're like well you know but if I then
ask a query that's completely different
that's not on a benchmark then it starts
to mess up right so one of my favorite
tests is I play Hangman with um the
various models right and and sometimes
sometimes uh I'm playing the game and
some times I'm choosing the word and and
I can tell you straight up none of the
models play Hangman very well including
GPT 4 right so if I give it the word so
I use cheese as my test and it very
quickly guesses the E so you get blank
blank e e blank e right there are
no other vowels and then every model is
like I don't know an R I'm like what why
are you guessing an R you know and and
therefore you know know that sort of
viess that you get from the kind of uh
from the arena is really important
because they're the sort of things the
creative sort of tests that we'll have
we'll play Hangman we'll play uh
tic-tac-toe we'll you know ask different
questions but if if you are literally
training within an inch of its virtual
life you know being fine- tuned to The
Benchmark then how valuable are those
benchmarks really so I think the future
has got to be tests where you can't
fine-tune i.e. you don't know what the
questions and the answers are in advance
and it's got to be a little bit more
creative whether that turns into
benchmarks whether it turns into kind of
um you know LMSYS as we're doing today
whether it turns into as you're saying
there's like this model has this kind of
vibe you know it's a little bit chatty
it's good at classifications Etc I think
you're right it might move in that
direction but I I I think at the moment
the arena is probably the only sensible
place where you can actually rank these
models sure and I I hadn't really thought
about that so like if I hear you
right I mean one of the arguments is
like basically all of the benchmarks
we've been using are now kind of useless
is what you're saying so like almost
like there's been this collective action
problem where like
everybody's gaming the benchmark and so
the only thing you can really trust ends
up being like I don't know you have like
a 12-year-old talk to it for a little
bit and tell you whether or not you know
they like it or not you can't trust it if
everybody's at the same
level of benchmark that's no indication
between model and model but obviously if a model is low
you know then you can say they got a
little bit of work to do on that model
but at the higher end you know if you go
oh I'm 0.2 better in this benchmark in
reality who cares yeah that's right
that's right yeah and I think um yeah
and I also buy that right which is
basically maybe it's actually a sign of
the success of these models is that like
a lot of the leading ones are so good
now right that the benchmarks are a lot
less useful because like yeah we're
talking about these gradations that in
terms of like actual experience of the
model like very limited um uh most of my
my accounts of when I'm partnering with
clients these are 1400 companies and and
where we're doing gen at scale and we
putting things into production there's a
much higher threshold of what good looks
like and how can you define it
especially in the regulated industries
that I work in financial services federal
government and things of that nature
right in those cases you have to be very
precise on how do you measure the
accuracy how to measure whether the
answer is correct whether it's grounded whether it's
hallucinating things of that nature
right so for majority of my Fortune 100
companies we have when we partner with
them we create a system of benchmarks
that are very tailored to the way the
use case that they're putting into
production so for example if you're
looking at a RAG pattern you're looking
at pulling the right content is that is
that content correct from
those snippets are they ranked the right
way given those Snippets can I reliably
create the answer is grounded what's the
grounded score and given the answer that
you retrieve does that really answer the
question that was asked and stuff like
that right across each one of those it's
based on the kind of domain that they're
working on if you're looking at say uh
contracts and you're trying to analyze
if the answers it's pulling out are correct
or not the question itself will define
what is a good metric for it I may ask a
question about given a contract tell me
if I can order this particular part from
that contract or not which means it's
looking at the top of the contract it's
looking at an exception on page 19 and so
on so forth so it's more of a Chain of
Thought to understand how things are
connected but then if I say can you
contrast these two contracts that's no
longer a RAG pattern you're now asking
a question where it's pulling out the
right information comparing it together
giving it to an llm so that each query
has its own set of metrics that we need
to evaluate at each query type right so
we've created some very robust metrics
to evaluate these models whenever we
have a new model like llama 3 came out
and Snowflake Arctic came out in the last
few days we need to plop that model into
that workflow 10 step process step
number four I'm going to call an llm
everything that comes before and after
we need to have a good set of metrics to
evaluate it the majority of my big Fortune
companies that we're working with
they kind of ignore the hype when a
new cool model
comes in for the evaluation we do not
look at HumanEval scores we do not look
at the scores that are public in
nature because those are not as
meaningful to Enterprise use cases so
partnering with Consulting our clients
have built these really robust
benchmarking mechanisms and that's how
we've been bringing these to production
during experimentation and in production we're
continuously evaluating that throughout
the day yeah totally and I think it's
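(Editor's aside: the per-query-type evaluation described above can be sketched as a tiny harness where each query type gets its own metric set. The documents, query, and metric definitions below are hypothetical; real systems use NLI models or LLM judges for groundedness rather than word overlap:)

```python
# Toy RAG evaluation harness: score retrieval and groundedness per query.
# All data and metric choices here are illustrative placeholders.

def retrieval_hit_rate(retrieved: list, relevant: set) -> float:
    """Fraction of the known-relevant snippets that were retrieved."""
    if not relevant:
        return 1.0
    return len(relevant.intersection(retrieved)) / len(relevant)

def groundedness(answer: str, snippets: list) -> float:
    """Crude proxy: share of answer words that appear in retrieved text."""
    source = " ".join(snippets).lower().split()
    words = answer.lower().split()
    if not words:
        return 0.0
    return sum(w in source for w in words) / len(words)

# One hypothetical eval record for a contract-lookup query type.
record = {
    "retrieved": ["clause 4 permits ordering part 19-A",
                  "exception on page 19"],
    "relevant": {"clause 4 permits ordering part 19-A"},
    "answer": "clause 4 permits ordering part 19-A",
}
hit = retrieval_hit_rate(record["retrieved"], record["relevant"])
ground = groundedness(record["answer"], record["retrieved"])
```

Swapping a new model into step four of a pipeline then just means re-running records like this per query type and comparing the scores, which is closer to the bespoke benchmarking described than any public leaderboard number.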
Yeah, totally. And I think it's one of the interesting things — I was talking with a friend recently, and I was like, in the future we're probably going to have these agencies that just focus on evaluation. It feels like an emerging business. Essentially, as models and pre-training become more and more commodified, the big question will be: well, which one should I actually choose? It feels like there's a whole industry to be built in terms of bespoke evaluations, even in the curation of the people who evaluate your model. It seems like a critical question.
We're going to start interviewing models like we interview humans, right? "You, model, are applying for this HR job, so you're going to be evaluated on your HR skills. You, model, are a developer — let's see what your React coding skills are like."

And actually, I think that's a fair point, right, Shobhit — it doesn't matter how good a model is on a benchmark. All that matters is whether it's good at the task you need it to do. If you need it to do legal contract comparisons, it doesn't matter if it's the best poetry writer in Snoop Dogg style. What matters is: can it evaluate contracts, because that's the job you want it to do, and can it do it reliably?
One point I just wanted to make, related to what Chris was saying about people fine-tuning to the benchmarks: the Haifa research lab of IBM is putting together kind of a hidden benchmark — not releasing it anywhere — and they have this thing open-sourced called Unitxt, which is a way to actually construct these very quickly and easily. So I think one direction that's also going to be emerging is that the job interview itself gets hidden away, so that people can't train to it. And I think what Shobhit was saying is precisely right: these have to be right on point, on task, for the sort of usage that you want. One thing we talk a lot about with customers is something called usage governance, and that is precisely that — you don't care about what else this model is doing, just about what is important for your application, for your industry, and things like that. So yeah, I think it's a great area, and government regulations are going to require a lot of this third-party testing and evaluation too, very soon. So I think everything is headed in that direction.
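The hidden-benchmark idea — keep the test items private and expose only an aggregate score, so nobody can train to the test — can be illustrated with a toy harness. This is purely a sketch with invented data; it is not how Unitxt or IBM's hidden benchmark actually work:

```python
# Sketch of a hidden benchmark: callers submit a model function and get back
# only an aggregate score -- never the private test items themselves.
# The question set below is invented for illustration.

_PRIVATE_SET = [  # kept server-side, never released
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def score(model) -> float:
    """model: callable mapping a prompt string to an answer string.
    Returns only the fraction correct, leaking nothing about the items."""
    correct = sum(model(q).strip() == a for q, a in _PRIVATE_SET)
    return correct / len(_PRIVATE_SET)

# A toy "model" that memorized one answer:
toy = {"What is 2 + 2?": "4"}.get
print(score(lambda q: toy(q, "")))
```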
Yeah, totally. One of the stories I was thinking about covering — which we probably will end up doing in a future episode, and Shobhit, it'd be great to have you back on for it — is NIST, and the development of all of these federal standards in the space, and what that's going to look like. I hadn't really thought it would look like a standard HR interview. I love the idea that in 2035 you're basically asking the model, "What's your greatest weakness?" or, "I'm going to need you to do this bubble sort" — that's the question you're going to ask the models. And hopefully models will find it as frustrating as humans do.
Models are going to evaluate models too, I think — we're already seeing that quite a bit. One thing our team is working on a little bit: the Arena is actually just a pairwise comparison, with a human judging two things. But if you have three models, we can actually figure out smart ways of having the models figure out which ones are good, because when you have three models — let's say one is an expert, one is a novice, and one is intermediate — the expert can tell which one is the novice, and the intermediate one can also tell which one is the novice. So using three at a time, you can actually figure out kind of a total ranking of the models. It's a fun game to be in.
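Turning those three-at-a-time judgments into a total ranking can be done with a simple win-count tally (a Copeland-style score) over the pairwise verdicts. A toy sketch with hard-coded verdicts, not the team's actual method:

```python
from collections import Counter

# Each tuple (judge, better, worse) means: judge preferred `better` over `worse`.
# Toy verdicts consistent with expert > intermediate > novice.
verdicts = [
    ("expert", "intermediate", "novice"),
    ("intermediate", "expert", "novice"),
    ("novice", "expert", "intermediate"),  # even the novice judge agrees here
]

def rank(verdicts):
    """Copeland-style ranking: order models by how many comparisons they win."""
    wins = Counter()
    for _judge, better, worse in verdicts:
        wins[better] += 1
        wins[worse] += 0  # ensure losers still appear in the tally
    return [model for model, _ in wins.most_common()]

print(rank(verdicts))
```

With more models, the same tally works over every triple's pairwise outcomes; ties would need a tie-break rule, which this sketch omits.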
Fantastic. Hey, Kush — yesterday we had a new paper from Cohere talking about a panel of LLMs, PoLL, right? That's a very, very interesting way of looking at it: instead of having one LLM as a judge, when you start to mix different LLMs as panelists, you get better accuracy in making that determination. I also think that for the tasks we're asking an LLM to do, we fundamentally need a better appreciation of which steps an LLM is better at versus humans. I feel there's a bit of a flaw in our benchmarking systems today: if you look at a 10-step process, there are certain steps along the way that humans do that are incredibly easy for an LLM to take on — it just slams through those — and then something very, very fundamental will be so darn hard for an LLM to get right. So I think we are projecting what we are good at as a good benchmark for evaluating LLMs. As we play around with these more, we'll have a better understanding of what we should evaluate an LLM on, so I think the benchmarks themselves will start to change. I'll probably not ask them what their strengths and weaknesses are, or to tell me a joke, and things of that nature; but I'm sure we'll have a better sense of what kinds of questions we should be evaluating these LLMs on — in context, and grounded in the use cases that are in production for our enterprise clients.
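The panel-of-LLMs idea — several different judge models voting instead of one — reduces, in its simplest form, to a majority vote over per-judge verdicts. A minimal sketch with stubbed-out judges standing in for real model calls:

```python
from collections import Counter

# Stub judges: in practice each would be a call to a different LLM.
# Each judge returns "A" or "B" for which candidate answer it prefers.
judges = [
    lambda q, a, b: "A",  # judge 1 prefers answer A
    lambda q, a, b: "A",  # judge 2 agrees
    lambda q, a, b: "B",  # judge 3 dissents
]

def panel_verdict(question, ans_a, ans_b):
    """Majority vote across the panel; diverse judges reduce single-judge bias."""
    votes = Counter(judge(question, ans_a, ans_b) for judge in judges)
    return votes.most_common(1)[0][0]

print(panel_verdict("Summarize clause 3", "draft A", "draft B"))
```

Using an odd number of judges avoids ties; weighting votes by judge reliability is a natural extension this sketch leaves out.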
Let's move on to the final story. This will be a quick final conversation, but I think it was a big enough story that it's worth bringing in. The news broke earlier this week that OpenAI had signed a licensing deal with the Financial Times, basically to license their content for training purposes. Obviously this happens against the backdrop of OpenAI getting sued by the New York Times and a number of other rights holders, and the question of what you're allowed to train on — is it copyright infringement, and what do companies like OpenAI owe to the people whose data is integrated into their models — is a really big one. Kush, I know you work on AI governance, so I want to throw this over to you for the first take. I'm curious if you have any reactions to this news: do you think we're going to see more of these types of licensing deals going forward, and if so, why? I'm interested in what's driving the business decision here.
Yeah. So the content creators certainly need to receive something in order for them to just exist, right? Because pretty soon we'll have run out of all the tokens in the world for these things to be trained on, so the new content needs to come from somewhere — it can't be just fully synthetically generated data; that will lead to model collapse and all sorts of other things. As for how that happens: copyright was never meant to be a permanent thing; it's there to protect creators during their lifetimes so that they have some livelihood. So I think that's the idea we need to keep going with, and it might just lead to a completely different business model. Local journalism has kind of died around the world, and maybe this is a way to resurrect it, because you need this information to come from somewhere. When we have a RAG pattern or anything else, there needs to be timely information as well. So I think the crux of it is simply that the content needs to be there, and we need a way to incentivize it. At IBM, we do a great job of trying to eliminate all copyrighted content from our Granite models, and we do a lot of work there, but eventually I think it's not a question of who's infringing, who's getting sued, or who's doing licensing deals, but of how we build an ecosystem in which the creators are valued as much as anyone else.
Yeah, for sure. And I think this is actually really at the crux of whether or not this AI economy can work, because if you want to use Google as an earlier template, there was this interesting moment where we said: okay, you'll allow us to index all of the web, and in exchange we're going to send traffic to you, because we're a search engine. That actually created a trade by which an ad economy could work. It feels like the challenge here is that we haven't yet built the infrastructure to create that kind of symbiosis — so essentially there's nothing incentivizing the creation of new high-quality tokens, which is going to be a structural issue for the market ultimately. I guess, Kush, the question I'd have for you — to push back a little bit, because I was debating this with a friend of mine — is this: my friend argued it's all just window dressing, because it turns out that Financial Times tokens are just not that valuable to OpenAI — (a) it's not a whole lot of tokens, and (b) they already have a lot of news stories. Do you buy that? How valuable are the kinds of tokens we're talking about here, when it comes to a newspaper? Is it maybe even more valuable to get, say, movie scripts than news stories? I think this ends up being a really interesting question about where the most valuable tokens are coming from, and I'm curious whether you think these journalism tokens, which have been the focus of so much attention, really are where it's at.
Yeah — journalism tokens have been in the news, but comedians have been suing as well, so it's not one or the other. I think it's simply the fact that it has to be new tokens. And there's distribution shift, right? The world is changing — now we have, well, whatever we talked about at the beginning, these new gadgets; the terminology for those isn't going to exist in any historical documents. So we need to keep up with the world: the way the meanings of words change, any sort of new thing that comes up. News tends to be one place; I guess comedy routines tend to be another. Wherever the world changes, however we can bring that in — I think that's where the value is, because it's not about the number of tokens, it's the quality — quality in terms of getting these models to keep adapting to the world as it exists.
Yeah, for sure. Well, we're probably in wrap-up mode right now, Chris and Shobhit, but I'm curious if you've got any final takes on this before we close up.

I'll go with one take, and it's probably going to be on the opposite side — we had this chat before, Tim — which is: I think there's going to be a whole business in data washing coming out of this. If you really look at it — yes, I think there will be some folks like the Financial Times who will license their data, and that's great. But say you take five news articles on the same subject, run them through a model, get them summarized, and then maybe open-source that data set. Then somebody else who's training a model goes and uses that open-source data set, and where the data originally came from is gone. As far as the model trainer is concerned, it's "oh no, I used this open-source data set, which is MIT-licensed, et cetera, that I pulled off the internet." You're now one step removed from the original content sources. So I think that's going to become a big thing, and I see people doing it commercially as well. As much as we're all good people and we want the provenance to travel with the data, I just don't see a world where everything is so lovely and we're all high-fiving each other about how good we are. I see this data-washing world coming really, really quickly.
Yeah, for sure. I have a slightly different take on this: I think it is quite dangerous. I'm a big proponent of decentralized AI — I'm in Emad's camp, right? He left Stability AI with this mission of decentralizing AI. The fact that OpenAI is making a decision to partner with one news outlet — if they had picked Fox, or picked CNN, they would have a different set of news articles being trained on. There's quite a bit of bias in the media itself; look at the elections coming up and whatnot. What decisions are you making about which data you think is high quality? It's a single entity's definition of high-quality data, whereas human beings are exposed to both ends of the spectrum of news articles. So I think decentralizing is going to be very important, and this also speaks to open models, where people can add data on the fly and personalize it more. I think it is a little bit dangerous when large organizations that have become centers of power with AI are making unilateral decisions about what good looks like. The news articles the FT produces may not be representative of the culture or the demographics of a particular country or region. So there's a real question of: if I'm using a gen AI model from a particular vendor, do I get to go personalize it and say, "I'm left-wing," "I'm right-wing," or "here are my thoughts on particular topics," so that it gets more and more personalized to me, and that defines how it responds and becomes my personalized version of it?
I wonder, Shobhit, whether training is actually the important part of this for OpenAI. I wonder if RAG-ing that data is actually the key thing for them, because you're getting the up-to-date article. As you move into those agentic platforms, training the model is not going to be that valuable; what's valuable is being able to serve it up and say, "this is the latest news from the Financial Times, it's valid, and it's not hallucinating." I think that's probably where the value is.
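Chris's point — that the value may lie less in training and more in retrieving fresh, attributed FT content at query time — can be sketched as a minimal retrieval step that returns context together with its source and date. Everything here (the articles, fields, and matching rule) is invented for illustration; a real system would use a vector store and a licensed content feed:

```python
from datetime import date

# Toy in-memory "index" of licensed articles. All entries are invented.
articles = [
    {"source": "FT", "date": date(2024, 4, 29), "text": "openai signs licensing deal"},
    {"source": "FT", "date": date(2024, 4, 28), "text": "markets rally on earnings"},
]

def retrieve(query: str):
    """Return the freshest article sharing a term with the query, plus provenance."""
    terms = set(query.lower().split())
    hits = [a for a in articles if terms & set(a["text"].split())]
    if not hits:
        return None
    best = max(hits, key=lambda a: a["date"])
    return {"context": best["text"], "cite": f'{best["source"]} {best["date"]}'}

print(retrieve("What licensing deal did OpenAI sign?"))
```

Carrying the `cite` field through to the model's answer is what lets the platform say "this is the latest from the FT, and it's valid" rather than relying on what the model memorized in training.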
Chris, I think it's just like kids, right? It's nature and nurture, both of them. So I think that's where we're heading: what was the nature of the kid that was born, and how were they nurtured? But thanks so much, Tim — this was extremely helpful, and a great set of questions.

Yeah, absolutely. Well, thanks, everybody — I could not have asked for a better panel to start with our inaugural episode. So, Chris, Kush, Shobhit — thanks for joining us, and we hope to have you on a future episode.

Thanks, Tim. Thanks, everyone.