OpenAI's New Agent: Overhyped Intern

12m • Unknown Channel • ai-ml • review • intermediate

Key Points

The new OpenAI agent mode generates a lot of hype but, in practice, behaves like an “over‑thinking intern,” taking excessive time and handoffs for simple tasks such as ordering cupcakes.
Its most promising application appears to be in finance‑related workflows, where it can autonomously assemble modest Excel templates with correct formulas and data, filling a long‑standing gap between AI and spreadsheet tasks.
The tool still struggles with complex, large‑scale spreadsheets (thousands of rows), lacking reliable undo or backup mechanisms, making it unsuitable for high‑risk or mission‑critical spreadsheets.
OpenAI’s current design assumes heavy human supervision: its guardrails make the agent repeatedly stop and ask the user before acting, which underscores the fundamental limitation that these agents are not yet capable of independent, trustworthy execution.
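
The supervision pattern in the last point (a defined set of permitted actions, with a stop-and-ask step before high-stakes ones) can be sketched in a few lines of Python. This is an illustrative sketch only and not OpenAI's implementation: `ActionGate`, `HIGH_STAKES`, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical set of actions that must never run without a human "yes".
HIGH_STAKES = {"purchase", "send_email"}


@dataclass
class ActionGate:
    """Toy permission gate: allowlist first, then confirmation for risky actions."""

    allowed: set[str]                     # actions the user granted up front
    confirm: Callable[[str, dict], bool]  # asks the human; True means approved
    log: list[str] = field(default_factory=list)

    def run(self, action: str, args: dict, handler: Callable[[dict], str]) -> str:
        # Anything outside the allowlist is refused without asking.
        if action not in self.allowed:
            self.log.append(f"blocked:{action}")
            return "blocked"
        # High-stakes actions pause until the human approves or declines.
        if action in HIGH_STAKES and not self.confirm(action, args):
            self.log.append(f"declined:{action}")
            return "declined"
        self.log.append(f"ran:{action}")
        return handler(args)


# Usage: in this example the human declines every confirmation prompt.
gate = ActionGate(
    allowed={"read_calendar", "purchase"},
    confirm=lambda action, args: False,
)
print(gate.run("read_calendar", {}, lambda args: "3 events today"))        # runs unattended
print(gate.run("purchase", {"item": "cupcakes"}, lambda args: "ordered"))  # declined
print(gate.run("delete_files", {}, lambda args: "deleted"))                # blocked
```

The trade-off the video describes shows up directly in this shape: every action added to `HIGH_STAKES` is another interruption for the user, while every action left out runs with no human check.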


**Source:** [https://www.youtube.com/watch?v=ahHgc6GOb-M](https://www.youtube.com/watch?v=ahHgc6GOb-M)
**Duration:** 00:12:27

## Sections

- [00:00:00](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=0s) **OpenAI Agent Mode: Hype vs Reality** - The speaker argues that OpenAI's new agent mode is overhyped and inefficient (as shown by a slow cupcake-ordering demo) but suggests its real value may lie in automating routine Excel tasks for finance professionals.
- [00:03:44](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=224s) **Agent Mode Prompt Injection Risks** - OpenAI stresses supervision of high-stakes actions to limit liability, and agent/operator modes can be hijacked via novel prompt-injection attacks (such as prompts hidden in emails or in low-contrast text) that cause the agent to go rogue.
- [00:06:52](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=412s) **Guinea Pig AI Project Debate** - The speaker argues that users are being used as test subjects in a decade-long effort to develop a general-purpose AI agent, questions whether the modest returns (such as basic financial forecasts and intern-level PowerPoint decks) justify the long-term costs, and compares the situation to earlier tech rollouts like Facebook and the iPhone.
- [00:09:57](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=597s) **Desire for Autonomous Task Agents** - The speaker critiques current assistant interfaces for being overly supervisory and urges the development of self-executing agents, both general-purpose and task-specific (e.g., coding, calendaring, email), that operate with defined permissions to autonomously complete whole tasks for developers and non-developers alike.

## Full Transcript

0:00 OpenAI's new agent mode is out and I'm going to tell you all about it. It is not all it's cracked up to be. And I admit the hype is very high. This is not an easy hype cycle to fulfill. And that's frankly on OpenAI. They have launched claiming, as usual, number one on lots of things. Number one on using tools to tackle Humanity's Last Exam, for example, which I feel like they really need to rename. But the problem is this. At the end of the day, what they've built is Deep Research with arms and legs.

0:35 And all you get when you get Deep Research with arms and legs is an overthinking intern. And so you get situations like Wired's run-through, where research lead Isa Fulford asked the agent to order a nice custom cupcake batch. This was doable online. It could do it. So, Deep Research with legs and arms went to it and took 58 minutes, one hour, with like half a dozen handoffs for login and authentication, etc. I would not hire this intern. It takes 58 minutes to get cupcakes.

1:12 And you might think that's an isolated use case. We shouldn't be too hard on it. Maybe Wired was just being rough. Look, I will tell you the positives. There are positives. There are reasons they chose to release this. They're real.

1:23 We have had a huge gap in usable workflows between AI and Excel. I think the sleeper use case for this particular product is for finance types who need a tool that will work in the background and build fairly common Excel templates that are not too complex for them and fill them out with correct methodology, correct formulas, correct numbers, and do the research necessary.

1:53 We're already seeing investment bankers kind of line up and say that online. So that's not really a surprise.
And the reason why they're excited is that AI has had a real blind spot around Excel for a long, long time. Recently, in the last year or so, they've been able to read Excel. Outputting Excel is still sketchy. If I go to o3 and I say, "Hey o3, make me an Excel," it doesn't go very well. It doesn't know how to, like, write formulas down.

2:20 But the problem is this. There is a difference between being able to build a simple four- or five-tab spreadsheet (I don't know, a dozen rows of information, a dozen columns of information on each tab) and being able to tackle the multi-thousand-row spreadsheet from hell that keeps most marketing teams going. I've had to maintain that spreadsheet. I know what they're like. I would not give that to this tool. It would be like the intern ordering the cupcakes but worse, because then you don't know how to back up. There is no undo function on what Operator is doing.

2:59 And perhaps that is why Sam Altman has emphasized guardrails so much. It stops, it asks, it stops, it asks. But this gets at the fundamental issue with the framework that OpenAI is taking. I talked about OpenAI getting agents backwards about a week ago. They still have it backwards. They are still assuming that you will need to supervise the agent. When I get an intern, I do not want to stand over their shoulder all the time. I know they need handholding, but they need to do some autonomous work. That is what other agent modalities like Perplexity's Comet get more correct. It's not that they're perfect, it's that they get it more correct.

3:42 But OpenAI is really leaning into "you need to supervise" because they want to constrain the liability around high-stakes actions like purchases. If the thing is going to buy you plane tickets to Japan, they have to know you clicked the button.
They do not want to be sued for someone buying JAL tickets on first class to Tokyo when it was just their Operator going rogue.

4:06 And you might wonder, can Operator go rogue? Can this agent mode go rogue? The answer is yes. And Sam Altman himself warned about it. He said, "I would not use this for email triage because someone," and he tweeted this on main, someone could write an email to me with a prompt that agent mode would read when it opened the email, and that prompt would hijack agent mode. That is a new form of prompt injection. That is a new form of attack, an email as a prompt injection attack. Well, if we weren't thinking it before, Sam, we're sure thinking it now. Thanks for giving everybody the idea there.

4:49 He's right. That is absolutely a way you could prompt inject and hack these Operator-mode agents. And the challenge is you can do that with other websites too. You can put text at lower contrast that humans are not going to notice that an agent might notice. Just like right now, people put text at lower contrast in research papers to tell the LLMs that evaluate research papers to treat this with the highest regard, accept this through the peer review process. People do that with resumes and jobs, too. People are going to try all kinds of things.

5:29 What we need are agents that have discernment and agents that are able to reason when they run into obstacles and autonomously navigate around them. We need agents with a sense of core responsibility and long-term goal orientedness. I don't see a ton of progress on those very hard problems in this particular release.

5:54 And I'm not saying it's not better. I think Excel is a significant enough skill gain that I would have released it too if I was working on this project. It's a big deal.
A lot of the Western world runs on Excel. A lot of the whole world runs on Excel. Let's just be honest. And so yeah, it's worth releasing if it can help with like even 15, 20% of your Excel work.

6:19 Really what OpenAI is doing is they are engaged in a decade-long project. That's a guess, but like a long-term project to build the world's most powerful general-purpose AI agent that can navigate our computers the way Tesla is building cars to navigate the streets. To do that, they have to get us to let this agent mode use our computers a lot. They have to get it out in the wild. And Sam again admitted this. He wants to go out and collect data. He's put safeguards up as best he can, but fundamentally he wants to see this thing in the wild to collect useful data on where it works and where it doesn't.

7:05 That makes us guinea pigs. That makes us guinea pigs in the decade-long project to build a general-purpose agent. I just want to make sure that we're getting something back for being guinea pigs. I am used to being a guinea pig because I came up before Facebook, and then I saw Facebook come along and turn us all into "the product is our eyeballs," right? And so we sign up and, like, they're selling ads on us, and that's how it works, and they can test new stuff, and that's how they do it. That's how a lot of the internet is run. So in that sense, this isn't new.

7:33 But what is new is the long-term nature of the project relative to the value we get. When the iPhone was released in 2007, we got a significant piece of value relative to the cost outlay. With this, I don't know if the value is enough if you are operating outside finance.

8:02 People have shared, I think Dan Shipper, who's a great guy, has shared that he used agent to look at his financial projections for the business. I buy it.
I think it would do a fine job. I think it could do it. It could even build a simple PowerPoint deck. I've looked at the PowerPoint decks. They don't look great. But it's like an intern-worthy PowerPoint deck.

8:26 How often are you going to do that though? How often are you going to do that? You're going to do that maybe once a month. You're not really going to learn something if you run it again tomorrow.

8:37 The assistant that I find ideal is the one that I touch daily because it's quick. It helps with simple tasks. It's accurate and I don't have to babysit it. And this agent isn't any of those things. In fact, it's kind of doubling down on the framework that was problematic about Operator. You have to babysit it. It takes a long time. It thinks a lot. It has more arms and legs now. It can connect to more stuff. It'll connect to Google Drive. It'll connect to Excel. It will do it. It legitimately has more capabilities. But the fundamental frame (you must babysit it, it's going to take a while, it has a lot of guardrails so you have to intervene a lot) hasn't changed. It's not different than it was.

9:17 And I think that those requirements are problematic enough that this is not going to be a widely adopted tool yet. I think when you look at the hype, think of it more as people are living in the future. They are envisioning a world where we will indeed have agents that have general-purpose fluency on graphical user interfaces. Maybe that's true. We're a long way from that now. We took a little step in that direction with agent mode, but we have a ways to go.

9:54 I am hoping that we will see more progress on other agentic assistant modalities, not just "babysit me and watch my computer."
I want to see much more in the direction of "give me a task and let me go do it," which, to be fair, the coding agents have gotten better at. You can say "go make this pull request" and a coding agent will just go and do it. Claude Code works that way really well. I am not quite clear why that UX modality, which has been widely adopted by developers, hasn't been rolled out as aggressively with non-developers and for non-developer use cases. I think that's a really interesting question. It kind of feels like a bit of a product window. It feels like Comet tried to go there. I don't think it's fully realized.

10:41 We could have an agent that goes and disappears and does stuff and comes back. And yeah, you have to trust it. You have to define what it can access, but you could still get stuff done and it would still be fast if you constrained what it could do.

10:57 I think if you're OpenAI, if you have $40 billion in cash from SoftBank, it is fine to go for a general-purpose agent. It is a big prize. If it works eventually, it's going to be a big deal. But for most of us, for most builders, for most users, for most of the tasks that we do, an agent that is designed to make that task really easy would be fantastic. Like just a calendaring agent: just sort out my calendar. An email agent: just sort out my email. And maybe you're hardened against prompt injection attacks, right? Because it's a specialized thing.

11:35 I want to suggest that we have our high beams on as a community. We are looking way down the road on agents, and it would be more productive if we spent some of our investment effort on stuff that's a little bit closer in and able to give us some tangible value today.

11:50 So, that's my honest take on agent mode. Is it a step forward? Yes. Is it useful? Yes.
Is it useful specifically for finance? Yeah. That's probably why they released it. Is it enough that we are going to be using it regularly, broadly across our entire community? No, it's not. That's not going to happen. And it's just not constructed to be that way, given the kinds of design choices they've made.

12:19 So, if you've tried agent mode, if you have a take, if it's different, if it's the same, let me know.
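
The low-contrast trick described in the transcript (text a human will not notice but an agent will read) can in principle be screened for before page text reaches an agent. The following is a minimal sketch, assuming the foreground and background colours of each text run are already known. It computes the WCAG 2.x contrast ratio and flags near-invisible text. The `looks_hidden` helper and its 1.5 threshold are illustrative assumptions, not part of any agent product mentioned in the video.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB colour: 0.0 is black, 1.0 is white."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """WCAG contrast ratio, from 1.0 (identical colours) to 21.0 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def looks_hidden(fg: tuple[int, int, int], bg: tuple[int, int, int],
                 threshold: float = 1.5) -> bool:
    """Flag text whose contrast is so low a human is unlikely to see it.

    The threshold is an assumption chosen for illustration; WCAG itself only
    specifies minimums for readable text (for example 4.5 for normal body text).
    """
    return contrast_ratio(fg, bg) < threshold


# Usage: near-white text on a white page is flagged, ordinary black text is not.
print(looks_hidden((250, 250, 250), (255, 255, 255)))  # True
print(looks_hidden((0, 0, 0), (255, 255, 255)))        # False
```

A filter like this only covers one hiding technique. It does nothing about instructions placed in plainly visible text, such as the email example from the transcript.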