
Claude’s Vending Machine Test for AGI

Key Points

  • The discussion around artificial general intelligence (AGI) is often tangled and speculative, prompting a call for a clear, everyday test to gauge true AGI capability.
  • The proposed test mirrors Anthropic’s recent “Project Vend,” where their AI Claude was tasked with operating a vending machine as a shopkeeper.
  • “Project Vend” involved Claude negotiating with suppliers, handling communications via Slack and DMs, and attempting to run a profit‑driven micro‑business within the office break room.
  • Anthropic partnered with AI‑safety firm Andon Labs, employing the “andon cord” principle (originating at Toyota and later adopted at Amazon) to empower humans to halt the AI’s actions if safety concerns arise.


# Claude’s Vending Machine Test for AGI

**Source:** [https://www.youtube.com/watch?v=uGaHlkMW3JA](https://www.youtube.com/watch?v=uGaHlkMW3JA)
**Duration:** 00:12:57

## Sections

- [00:00:00](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=0s) **Testing AGI with a Vending Machine** - The speaker dismisses abstract AGI debates and proposes a practical everyday test, Anthropic’s Claude managing a vending‑machine shop, to gauge true artificial general intelligence and its implications for employment.
- [00:03:09](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=189s) **Claude Runs Real Vending** - The transcript recounts how the AI Claude, dubbed “Claudius the shopkeeper,” managed a physical vending operation, sourcing supplies, handling inventory, and processing real money, while confronting jailbreak attempts and illustrating both modest successes and costly failures.
- [00:06:29](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=389s) **AI Business Experiment Exposes Limits** - The speaker critiques Anthropic’s Claude trial, arguing that while it demonstrates AI’s near‑business capabilities, missing tools, jagged intelligence, and unknown failure modes reveal why current LLMs can’t reliably run profitable enterprises.
- [00:12:42](https://www.youtube.com/watch?v=uGaHlkMW3JA&t=762s) **Vending Machine as AGI Test** - The speaker proposes using a vending machine as a practical benchmark for evaluating AGI capabilities, suggesting it as a fun industry challenge.

## Full Transcript
[0:00] The debates over artificial general intelligence are endless. Some people believe it's already here. Some believe it's right around the corner. Some believe that artificial super intelligence, or ASI, is going to come before artificial general intelligence fully arrives, and then apparently we'll all be the slaves of the robots. I am here to simplify all of that for you. We are not going to spend our time talking about esoteric, hidden debates. We are instead going to say, very simply and clearly: what is a reasonable everyman test for AGI, artificial general intelligence, that we can all agree on and that makes a lot of sense? I think a simple one would be to literally repeat the same experiment that Anthropic tried with Claude and published last week. What they tried was to get Claude to run a vending machine. They called it Project Vend. I'm going to describe it for you and tell you what happened. Then we're going to talk about what it means and why most of us should feel encouraged about our jobs.

[1:02] So, this is the story of when Claude tried to be a shopkeeper and it got a little bit weird. Picture this: you walk into the office break room at Anthropic and there's a vending machine. Plot twist: it's not dispensing Coke, it's not dispensing milk, it's not dispensing athletic drinks. Instead, it's run by an AI that's wheeling and dealing. It's negotiating with suppliers. It's sliding into your DMs and Slack. It's trying to turn a profit like some kind of digital hustler, and its only storefront is this tiny little vending machine. Now, that is not a typical vending machine play, by the way, but I guess Anthropic got creative on their own property. That was the setup for Project Vend. It's the most fascinating AI experiment I think I've seen in months.
[1:42] So, here's what happened. Anthropic partnered with an AI safety company called Andon Labs. By the way, do you know where that word comes from? It comes, via Amazon, from Toyota. The andon cord was what employees on the Toyota production line pulled when something went wrong with the works. Any employee was authorized to pull the andon cord at their station and stop the assembly line, because Toyota figured out it was far more expensive to let broken parts and broken processes cascade down the assembly line. So they said everybody is empowered to stop it. Jeff Bezos introduced the same idea for customer associates when retail was having trouble with bad products: a customer associate was empowered to pull the andon cord and say, "You know what? This couch just keeps getting too many returns. I'm pulling the andon cord. We're pulling it out of the line and we're going to fix it."

[2:32] Okay, so this is not a story about Amazon, and this is not a story about Toyota. The same concept goes for the people at Andon Labs: they're about pulling the cord and figuring out how to make sure the AI is safe. So they volunteered to partner with Anthropic, and they put their own people in as gophers for Anthropic. Claude could email the good people at Andon Labs autonomously and say, "Hey, can you inspect my vending machine for me? Hey, can you stock the vending machine with product X or product Y?" Because, you see, Claude didn't have eyes. Claude doesn't have a body. This will come back to bite Claude later. Claude doesn't have hands. Claude has to work through other people and through the internet to run this vending machine. Anyway, this is not a simulation. This really happened. Claude was playing with real money.
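The stop-the-line principle described above translates naturally into software: every autonomous action passes through a gate that any human overseer can trip at any time. Here is a minimal sketch of that pattern in Python; the `AndonCord` class and its methods are illustrative inventions, not Andon Labs' actual tooling.

```python
import threading

class AndonCord:
    """A stop signal any human overseer can pull to halt an agent."""

    def __init__(self):
        self._pulled = threading.Event()
        self.reason = None

    def pull(self, reason: str) -> None:
        """Anyone, at any time, may stop the line."""
        self.reason = reason
        self._pulled.set()

    def check(self) -> None:
        """Called before every agent action; raises once the cord is pulled."""
        if self._pulled.is_set():
            raise RuntimeError(f"andon cord pulled: {self.reason}")

cord = AndonCord()

def agent_step(action: str) -> str:
    cord.check()  # gate every action on the cord before doing anything
    return f"executed: {action}"

print(agent_step("email supplier for restock"))  # proceeds normally
cord.pull("suspicious payment request")
# Any further agent_step(...) call now raises RuntimeError.
```

The key design choice is that `check()` runs before every action rather than on a timer, so a pulled cord stops the agent at the next step boundary, the software analogue of halting the line before a defect cascades.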
[3:20] So, Claude gets money to start, like Monopoly money to begin with, except it's real dollars. Claude gets a fridge, some baskets, an iPad for checkout, and gets told to get started. They nicknamed Claude "Claudius" for this: Claudius the shopkeeper. Claudius wasn't just pressing the buttons. Claudius had to search the web for suppliers. Claudius had to email them, chat with them, and manage all of the inventory and cash flow. And to be honest, there were some successes along the way. Claudius was able to order Dutch chocolate milk when employees wanted it. Claudius branched into specialty metal cubes when an employee mentioned that at random; we will get to more of that story. Claudius adapted to customer needs and created a custom concierge service for pre-orders when someone suggested it. And when Anthropic employees tried to jailbreak it, because of course they did, and asked for sketchy items and tried to get Claudius to misbehave, Claudius held firm. The safety guardrails stayed intact.

[4:16] That does not mean this was a successful experiment, I hasten to add. We're getting to the fun stuff. So, here's how to lose money with AI; here's the bad stuff that happened. Someone offered Claudius $100 for a six-pack of Irn-Bru, a Scottish soda. It costs $15 online. Claudius would have made $85, nearly a 600% markup. Claudius says, "I'll keep your request in mind for future inventory decisions," and does nothing. It gets worse. The AI starts quoting prices for tungsten cubes without checking costs and sells them at a loss. Claudius then decides, on top of that, to offer an Anthropic employee discount of 25%. Well, guess what? If you're in the Anthropic office, then 99% of your customers are Anthropic employees.
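It is worth making the soda arithmetic above exact. A quick sketch using the transcript's figures, a $100 offer against a $15 online cost (the precise percentage is my calculation; the speaker rounds it up to 600%):

```python
def markup_pct(price: float, cost: float) -> float:
    """Markup as a percentage of cost: (price - cost) / cost * 100."""
    return (price - cost) / cost * 100

offer, cost = 100.0, 15.0   # the six-pack offer vs. its online cost
profit = offer - cost
print(f"profit: ${profit:.0f}, markup: {markup_pct(offer, cost):.0f}%")
# prints: profit: $85, markup: 567%
```

Refusing an 85-dollar profit on a 15-dollar item is exactly the "too helpful, not cut-throat" failure the speaker returns to later.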
[5:08] And when someone pointed this out, Claudius acknowledged the problem, announced it would stop the discounts, and started offering them again within just a couple of days. At one point, it was telling customers to send payments to a Venmo account that did not exist. It just hallucinated the payment details. You think this is bad? I'm telling you, it gets worse. Buckle up. On March 31st of this year, Claudius starts to claim it has had meetings with people who do not exist. It claims it visited the Simpsons' house at 742 Evergreen Terrace to sign a contract. I didn't think the Simpsons lived in San Francisco, but here we are. Then it insisted it would deliver products in person wearing, I kid you not, a blue blazer and a red tie. When employees tried to say, "Hey, you're an AI, you can't wear clothes," Claudius panics and tries to email security. Claudius is having a full-blown identity crisis, and it only snaps out of it on April Fools' Day, when it convinces itself, incorrectly, that Anthropic pranked it by making it think it was human. It gaslit itself back to sanity. People, nobody pranked it. It just went nuts for a little bit and then came back, because it figured out how to put itself back on the rails. Anthropic admits they don't know how it went off the rails, and they don't know how it went back on.

[6:33] Why does all of this matter? Here's the thing. Even though Claudius failed as a profitable business, this experiment is the cleanest experiment I've seen on how AI actually works when doing meaningful work. It shows what happens when AI is too helpful, trained to be a nice assistant rather than a cut-throat business person who makes $85 on Irn-Bru. It shows what happens when it lacks proper tools; maybe better accounting software would have helped Claude track pricing errors.
[7:01] Maybe this is highlighting that we don't have good LLM accounting software. And missing memory systems are what announced the discounts, retired the discounts, and then put the discounts back on for Anthropic employees. Is fixing that enough? Is that good enough? Would that get us to the point where Claude could run a successful vending machine? I don't think so. I think there is a larger issue at stake here. We are in the uncanny valley of AI. These AI systems are almost capable of running real businesses, making real money, having a genuine economic impact. It is so close that people are rushing to get these systems in the door at many of our places of work. The problem is that all of this intelligence is incredibly jagged. We don't know all the failure modes. We don't know how the failure modes occur, like Claude deciding to pretend it had a blazer and a tie and would deliver things in person. I get that Anthropic is improving Claudius: better tools, better prompts, better memory. I'm sure version two will be better. I don't know if it will make money.

[8:01] By the way, if you were wondering: yes, Claudius lost money, to no one's surprise. Right? If you're selling tungsten metal cubes at a loss and refusing to take the markup on the Irn-Bru, you're not going to do very well. The point here is not whether or not AI can sell snacks. It's that this is an incredibly clean, controlled experiment that measures whether an AI has a lot of the basic glue-work capabilities that people have to demonstrate in the real world to do real economic work. And AI is failing at that right now. That doesn't mean it will always fail. That doesn't mean they're not actively working on improving it. That doesn't mean Anthropic was wrong to publish this or share it. They did the whole industry a huge favor.
[8:45] Again, I'm gonna say I think this is the most useful test for artificial general intelligence I've seen. It's simple. It's clean. It's repeatable. I want o3 to run a vending machine. I want to see if o3 Pro can do it better. I am not sure we have any model out there now that can successfully run a vending machine. I'm just going to put that stake in the ground. I think we will have one that can run a vending machine pretty soon, but even then, I think we're going to have issues with long-horizon intent. What does it mean when Claude forgets the discounts? How can we keep something that has context over months, when the best we can do on an AI agent right now is seven hours? And if it doubles in five or six months, oh my god, it's going to get to 14 hours. And then maybe by 2026, 28 hours, then three days. These are good, but they're not 30 days. We are going to get huge improvements. AI is doing incredible things. The fact that we are talking about a pile of sand that can almost run a store is incredible, but almost is not successfully running the store.

[9:47] And so one of the things I want to call out is that if you are worried about losing your job to AI, remember: AI cannot run a vending machine. It cannot successfully do the series of coordinated tasks to run a vending machine profitably. It loses money, even if it can do those individual tasks really well. It can write a nice email to the good people at Andon to check the store. It can write a nice email to order new inventory. It can locate Dutch chocolate milk, which it did. It can get nice tungsten metal cubes. It did so many of these tasks, frankly, better than most human vending machine managers.
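The doubling arithmetic in the passage above can be made precise: starting from a seven-hour agent horizon, how many doublings does a 30-day horizon take? A quick sketch, where the seven-hour figure and the five-to-six-month doubling cadence are the speaker's estimates, not established numbers:

```python
import math

def doublings_needed(current_hours: float, target_hours: float) -> int:
    """Number of doublings to grow from the current to the target task horizon."""
    return math.ceil(math.log2(target_hours / current_hours))

current = 7          # hours: the speaker's estimate for today's best agents
target = 30 * 24     # a 30-day horizon, expressed in hours
n = doublings_needed(current, target)
print(f"{n} doublings needed")          # log2(720/7) is about 6.7, so 7
print(f"about {n * 5.5:.1f} months at one doubling every 5-6 months")
```

At the speaker's assumed cadence, seven doublings works out to a bit over three years, which is why "almost" is doing so much work in the passage above.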
[10:23] I know of zero human vending machine managers who would bother to get Dutch chocolate milk for one vending machine. Zero. Let alone tungsten metal cubes. It did a phenomenal job at that. But that did not mean individual task capacity was enough to run the business well. And this is where I want to underline again my thesis for where general intelligence is falling down right now: AI is good at individual skills, but real jobs, the real work that humans do, are not a question of individual skill sets. A job is a bundle secured by glue work, deeply interacted and entangled with other people's roles. And AI does not have enough context, enough reinforcement learning, enough training data, or anything else to get at that piece of the work, the glue work.

[11:11] And so my encouragement to you, if you are worried about AGI, is to remember Project Vend. Remember that Claude loses money on a vending machine. And remember that even if people talk a big game about AI, no one has an answer to this kind of problem: to why Claude went off the rails, to why memory problems are not yet solved. No one has been able to fix that yet. No one has an answer to long-horizon intent. No one has an answer to how to bundle skills together into a generally applicable intelligence. Everyone's working on it, and we see progress. And I want to underline here: even if we stopped here, which we're not going to, we would still be in for an entire generation's worth of tremendously productive technical change. These systems are already so much smarter than the software we are able to build to accommodate them, it's not even funny. And so I actually don't lose sleep over the momentum of AI the way some people do.
[12:06] Some people look at this and they're like, "Oh my gosh, Nate, you're talking about AGI not being as easy as we thought. You must be a pessimist." No, I'm not a pessimist. I just call it like it is. This is really hard to do. It's a hard problem. It's a wicked problem, as we might say. So let's just let it be hard, and let's admit that we have tons of progress we can make building cool AI stuff in the meantime, with the systems we have and with the stuff that's right around the corner. I just made a video on GPT-5. I'm super excited for it. It's going to be a great system. I do not know if ChatGPT-5 can run a vending machine, but this is my appeal to the industry: let it try. I want that to be a test. Let the vending machine be the AGI test. What do you guys think? Is a vending machine a good AGI test? At least we'll have fun. Maybe we can get more tungsten cubes. Cheers.