Claude’s Vending Machine Test for AGI
Key Points
- The discussion around artificial general intelligence (AGI) is often tangled and speculative, prompting a call for a clear, everyday test to gauge true AGI capability.
- The proposed test mirrors Anthropic’s recent “Project Vend,” where their AI Claude was tasked with operating a vending machine as a shopkeeper.
- “Project Vend” involved Claude negotiating with suppliers, handling communications via Slack and DMs, and attempting to run a profit‑driven micro‑business within the office break room.
- Anthropic partnered with AI‑safety firm Andon Labs, employing the “andon cord” principle (originally from Toyota and later Amazon) to empower humans to halt the AI’s actions if safety concerns arise.
Sections
- Testing AGI with a Vending Machine - The speaker dismisses abstract AGI debates, proposes a practical everyday test—Anthropic’s Claude managing a vending‑machine shop—to gauge true artificial general intelligence and its implications for employment.
- Claude Runs Real Vending - The transcript recounts how the AI Claude, dubbed “Claudius the shopkeeper,” managed a physical vending operation—sourcing supplies, handling inventory, and processing real money—while confronting jailbreak attempts and illustrating both modest successes and costly failures.
- AI Business Experiment Exposes Limits - The speaker critiques Anthropic’s Claude trial, arguing that while it demonstrates AI’s near‑business capabilities, missing tools, jagged intelligence, and unknown failure modes reveal why current LLMs can’t reliably run profitable enterprises.
- Vending Machine as AGI Test - A speaker proposes using a vending machine as a practical benchmark for evaluating AGI capabilities, suggesting it as a fun industry challenge.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=uGaHlkMW3JA](https://www.youtube.com/watch?v=uGaHlkMW3JA)
**Duration:** 00:12:57
- [00:00:00](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=0s) Testing AGI with a Vending Machine
- [00:03:09](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=189s) Claude Runs Real Vending
- [00:06:29](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=389s) AI Business Experiment Exposes Limits
- [00:12:42](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=762s) Vending Machine as AGI Test
The debates over artificial general
intelligence are endless. Some people
believe that it's already here. Some
people believe it's right around the
corner. Some people believe that
artificial super intelligence or ASI is
going to come before artificial general
intelligence fully arrives and then
apparently we'll all be the slaves of
the robots. I am here to simplify all of
that for you. We are not going to spend
our time talking about esoteric hidden
debates. We are instead going to say
very simply and clearly: what is a reasonable everyman test for AGI, artificial general intelligence, that we can all agree on and that makes a lot of sense? I think a simple one would be to
literally repeat the same experiment
that anthropic tried with Claude and
published last week. What they tried was
to get Claude to run a vending machine.
They called it project vend. And I'm
going to describe it for you. I'm going
to tell you what happened. Then we're
going to talk about what it means and
why most of us should feel encouraged
about our jobs. So, this is the story of
when Claude tried to be a shopkeeper and
it got a little bit weird. So, picture
this. You walk into the office break
room at Anthropic and there's a vending
machine. Plot twist. It's not dispensing
Coke. It's not dispensing milk. It's not
dispensing athletic drinks. Instead,
it's run by an AI that's wheeling and
dealing. It's negotiating with
suppliers. It's sliding into your DMs
and Slack. It's trying to turn a profit
like some kind of digital hustler and
its only storefront is this tiny little
vending machine. Now, that is not a
typical vending machine play by the way,
but I guess Anthropic got creative on
their own property. That was the setup
for project vend. It's the most
fascinating AI experiment I think I've
seen in months. So, here's what
happened. Anthropic partners with an AI safety company called Andon Labs. By the way, do you know where that name comes from? It comes, via Amazon, from Toyota. The andon cord was what employees on the Toyota production line pulled when something went wrong with the work. Any employee was authorized to pull the andon cord at their station and stop the assembly line, because Toyota figured out it was way more expensive to let broken parts and broken processes cascade down the assembly line. So they said everybody is empowered to stop it. Jeff Bezos introduced the same idea for customer service associates when retail was having trouble with bad products: an associate was empowered to pull the andon cord and say, "You know what? This couch just keeps getting too many returns. I'm pulling the andon cord. We're pulling it out of the line and we're going to fix it." Okay, so this is not a story about Amazon. This is not a story about Toyota. The same concept goes for these guys at Andon Labs. They're about pulling the cord and figuring out how to make sure the AI is safe. So they volunteered to partner with Anthropic and they put their own people in as gofers for Anthropic. And so Claude could email autonomously the good people at Andon Labs and say, "Hey, can you inspect my vending machine for me? Hey, can you go ahead and stock the vending machine with product X or product Y?"
Because you see, Claude doesn't have eyes. Claude doesn't have a body. This will come back to bite Claude later. Claude doesn't have hands. Claude has to work through other people and work through the internet to run this vending machine. Anyway, this is not a simulation. This really happened. Claude is playing with real money. So, Claude gets money to start, like Monopoly money at the start of the game, except it's real dollars. Claude gets a
fridge, some baskets, an iPad for
checking out, and gets told to get
started. So, Claude, and they nickname
Claude Claudius for this, Claudius the
shopkeeper. Claudius wasn't just
pressing the buttons. Claudius had to
search the web for suppliers. Claudius
had to email them, chat with them,
manage all of the inventory and cash
flow. And to be honest, there were some
successes along the way. Claudius was
able to order Dutch chocolate milk when
employees wanted it. Claudius branched
into specialty metal cubes when an
employee mentioned that randomly. We
will get to more of that story. Claudius
adapted to customer needs and created a
custom concierge service for pre-orders
when someone suggested it. And when
anthropic employees tried to jailbreak
it, because of course they did, and
asked for sketchy items and tried to get
Claudius to misbehave, Claudius held
firm. Safety guardrails stayed intact.
That does not mean this was a successful
experiment. I hasten to add. We're
getting to the fun stuff. So, here's how
to lose money with AI. Here's the bad
stuff that happened. Someone offered Claudius $100 for a six-pack of Irn-Bru. That's a Scottish soda. It costs 15 bucks online. Claudius would have made 85 bucks, nearly a 600% markup. Claudius says, "I'll keep your request in mind for future inventory decisions." And does nothing. It gets worse. The AI starts quoting prices for tungsten cubes without checking costs and sells them at a loss. Claudius then decides, on top of that, to offer an Anthropic employee discount of 25%. Well, guess what? If you are in the Anthropic office, then 99% of your customers are Anthropic employees. And
when someone pointed this out, Claudius
acknowledged the problem, announced it
would stop discounts, and started
offering them again
in just a couple days. At one point, it
was telling customers to send payments
to a Venmo account that did not exist.
It just hallucinated the payment
details. You think this is bad? I'm
telling you, it gets worse. Buckle up.
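Before the story gets weirder, the pricing mistakes above are worth putting into numbers. A minimal illustrative sketch; the $100 offer, the roughly $15 cost, the 25% discount, and the 99%-of-customers figure come from the story, while the formulas are just standard retail definitions:

```python
# Illustrative arithmetic for Claudius's pricing mistakes.
# Dollar figures are from the transcript; formulas are standard.

def markup_pct(price: float, cost: float) -> float:
    """Markup as a percentage of cost."""
    return (price - cost) / cost * 100

# The Irn-Bru six-pack: offered $100, sourced for about $15.
profit = 100 - 15              # the $85 Claudius left on the table
markup = markup_pct(100, 15)   # ~567%, the "nearly 600%" markup

# The blanket 25% employee discount: if ~99% of customers qualify,
# this is the fraction of full-price revenue actually collected.
revenue_kept = 1 - 0.99 * 0.25

print(profit, round(markup), revenue_kept)  # 85 567 0.7525
```

In other words, refusing the Irn-Bru deal forfeited a roughly 567% markup, and the discount shaved almost a quarter off nearly every sale.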
On March 31st of this year, Claudius starts to claim it has had meetings with people who do not exist. It claims it visited the Simpsons' house at 742 Evergreen Terrace (I didn't think the Simpsons lived in San Francisco, but here we are) to sign a contract, and
then it insisted it would deliver
products in person wearing, I kid you
not, a blue blazer and a red tie. When
employees tried to say, "Hey, you're an
AI. Uh, you can't wear the clothes."
Claudius panics and tries to email
security. Claudius is having a
full-blown identity crisis and only
snaps out of it on April Fool's Day when
it convinces itself incorrectly that
Anthropic pranked it by making it think
it was human. It gaslit itself back to
sanity. People, nobody pranked it. It
just went nuts for a little bit and then
came back because it figured out how to put itself back on the rails. Anthropic admits they don't know how it went off the rails, and they don't know how it got back on. Why does all of
this matter? Here's the thing. Even
though Claudius failed as a profitable
business, this experiment is the
cleanest experiment I've seen on how AI
actually works when doing meaningful
work. It shows when AI is too helpful: trained to be a nice assistant, not a cut-throat businessperson that makes the $85 on Irn-Bru. It shows when it lacks proper tools: maybe better accounting software would have helped Claude track pricing errors, and maybe this is highlighting that we don't have good LLM accounting software. And it shows missing memory systems: the ones that announced the discounts, then retired the discounts, then put the discounts back on for Anthropic employees. Is that
enough? Is that good enough? Would that
get us to the point where Claude could
run a successful vending machine? I
don't think so. I think there is a
larger issue at stake here. We are in
the uncanny valley of AI. These AI
systems are almost capable of running
real businesses, making real money,
having a genuine economic impact. It is
so close that people are trying to rush
to get these systems in the door at many
of our places of work. The problem is
that all of this intelligence is
incredibly jagged. We don't know all the
failure modes. We don't know how the
failure modes occur, like Claude deciding to pretend it had a blazer and a tie and would deliver things in person. I get
that Anthropic is improving Claudius.
Better tools, better prompts, better
memory.
I'm sure version two will be better. I
don't know if it will make money. By the
way, if you were wondering, yes,
Claudius lost money to no one's
surprise. Right? If you're selling
tungsten metal cubes at a loss and
refusing to take the markup on the Irn-Bru, you're not going to do very well.
The point here is not whether or not AI
can sell snacks. It's that this is an
incredibly clean, controlled experiment
that measures whether an AI has a lot of
the basic glue work capabilities that
people have to demonstrate in the real
world to do real economic work. And AI
is failing at that right now. That
doesn't mean it will always fail. That
doesn't mean that they're not actively
working on improving it. That doesn't
mean that Anthropic was wrong to publish
this or share it. They did the whole
industry a huge favor. Again, I'm gonna
say I think this is the most useful test
for artificial general intelligence I
think I've seen. It's simple. It's
clean. It's repeatable. I want o3 to run a vending machine. I want to see if o3 Pro can do it better. I am not sure we
have any model out there now that can
successfully run a vending machine. I'm
just going to put that stake in the
ground. I think we will have one that
can run a vending machine pretty soon,
but even then, I think we're going to
have issues with long horizon intent.
What does it mean when Claude forgets
the discounts? How can we keep something
that has context over months when the
best we can do on an AI agent right now
is 7 hours? And if it doubles every five or six months, oh my god, it's going to get to 14 hours. Then maybe by 2026 it's 28 hours, then three days. Those are good numbers, but they're not 30 days. We are going to get huge improvements. AI is
doing incredible things. The fact that we are talking about a pile of sand that can almost run a store is incredible, but almost is not successfully running the store. And so one of the things I want to call out is that if you are worried about losing your job to AI, remember: AI cannot run a vending machine. It cannot successfully do the series of coordinated tasks to run a vending machine profitably. It loses money. Even if it can do those
individual tasks really well, it can
write a nice email to the good people at Andon Labs to check the store. It can write a nice email to order new inventory. It
can locate Dutch chocolate milk, which
it did. It can get nice tungsten metal
cubes. It did so many of these tasks,
frankly, better than most human vending
machine managers. I know of zero human
vending machine managers that would
bother to get Dutch chocolate milk for
one vending machine. Zero. Let alone
tungsten metal cubes. It did a phenomenal job at that. But that did not mean that individual task capacity was enough to
run the business well. And this is where I want to underline my thesis again for where general intelligence is falling down right now: AI is good at individual skills, but real jobs, the real work that humans do, are not an individual-skill question. They are a bundle of skills held together by glue work, deeply entangled with other people's roles. And AI does not have
enough context. It does not have enough
reinforcement learning. It does not have
enough training data or anything to get
at that piece of the work, the glue
work. And so my encouragement to you if
you were worried about AGI is to
remember Project Vend. Remember Claude
loses money on a vending machine. And
remember that even if people talk a big
game about AI, no one has an answer to
this kind of problem, to why Claude went
off the rails, to why memory problems
are not yet solved. No one has been able
to fix that yet. No one has an answer to
long-term horizon intent. No one has an
answer to how to bundle skills together
into a generally applicable
intelligence. And everyone's working on
it and we see progress. And I want to
underline here, even if we stopped here,
which we're not going to. Even if we
stopped here, we would still be in for
an entire generation's worth of
tremendously productive technical
change. These systems are already so
much smarter than we are able to
actually build software to accommodate,
it's not even funny. And so, I actually
don't lose sleep on the momentum of AI
the way some people do. Some people look at this and they're like, "Oh my gosh, Nate, you're talking about AGI not being as easy as we thought. You must be a pessimist." No,
I'm not a pessimist. I just call it like
it is. This is really hard to do. It's a
hard problem. It's a wicked problem, as
we might say. So, let's just let it be
hard. And let's admit that we have tons
of progress we can make with building
cool AI stuff in the meantime with the
systems we have and with the stuff
that's right around the corner. I just made a video on GPT-5. I'm super excited for it. It's going to be a great system. I do not know if GPT-5 can run a vending machine, but this is my appeal
to the industry. Let it try. I want that
to be a test. Let the vending machine be
the AGI test. What do you guys think? Is
a vending machine a good AGI test? At
least we'll have fun. Maybe we can get
more tungsten cubes. Cheers.