OpenAI's New Agent: Overhyped Intern
Key Points
- The new OpenAI agent mode generates a lot of hype but, in practice, behaves like an “over‑thinking intern,” taking excessive time and handoffs for simple tasks such as ordering cupcakes.
- Its most promising application appears to be in finance‑related workflows, where it can autonomously assemble modest Excel templates with correct formulas and data, filling a long‑standing gap between AI and spreadsheet tasks.
- The tool still struggles with complex, large‑scale spreadsheets (thousands of rows), lacking reliable undo or backup mechanisms, making it unsuitable for high‑risk or mission‑critical spreadsheets.
- OpenAI’s current design assumes heavy human supervision, emphasizing guardrails that pause and query the agent, which underscores the fundamental limitation that the agents are not yet capable of independent, trustworthy execution.
Sections
- OpenAI Agent Mode: Hype vs Reality - The speaker argues that OpenAI’s new Agent mode is overhyped and inefficient—highlighted by a slow cupcake‑ordering demo—but suggests its real value may lie in automating routine Excel tasks for finance professionals.
- Agent Mode Prompt Injection Risks - OpenAI stresses supervised, high‑stakes actions to limit liability, warning that agent/operator modes can be hijacked via novel prompt‑injection attacks—like hidden email prompts or low‑contrast text—that cause the AI to act rogue.
- Guinea Pig AI Project Debate - The speaker argues that users are being used as test subjects in a decade‑long effort to develop a general‑purpose AI agent, questioning whether the modest returns—such as basic financial forecasts and intern‑level PowerPoint decks—justify the long‑term costs, and likening the situation to earlier tech rollouts like Facebook and the iPhone.
- Desire for Autonomous Task Agents - The speaker critiques current assistant interfaces for being overly supervisory and urges the development of self‑executing agents—both general‑purpose and task‑specific (e.g., coding, calendaring, email)—that operate with defined permissions to autonomously complete whole tasks for developers and non‑developers alike.
Full Transcript
# OpenAI's New Agent: Overhyped Intern **Source:** [https://www.youtube.com/watch?v=ahHgc6GOb-M](https://www.youtube.com/watch?v=ahHgc6GOb-M) **Duration:** 00:12:27 ## Summary - The new OpenAI agent mode generates a lot of hype but, in practice, behaves like an “over‑thinking intern,” taking excessive time and handoffs for simple tasks such as ordering cupcakes. - Its most promising application appears to be in finance‑related workflows, where it can autonomously assemble modest Excel templates with correct formulas and data, filling a long‑standing gap between AI and spreadsheet tasks. - The tool still struggles with complex, large‑scale spreadsheets (thousands of rows), lacking reliable undo or backup mechanisms, making it unsuitable for high‑risk or mission‑critical spreadsheets. - OpenAI’s current design assumes heavy human supervision, emphasizing guardrails that pause and query the agent, which underscores the fundamental limitation that the agents are not yet capable of independent, trustworthy execution. ## Sections - [00:00:00](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=0s) **OpenAI Agent Mode: Hype vs Reality** - The speaker argues that OpenAI’s new Agent mode is overhyped and inefficient—highlighted by a slow cupcake‑ordering demo—but suggests its real value may lie in automating routine Excel tasks for finance professionals. - [00:03:44](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=224s) **Agent Mode Prompt Injection Risks** - OpenAI stresses supervised, high‑stakes actions to limit liability, warning that agent/operator modes can be hijacked via novel prompt‑injection attacks—like hidden email prompts or low‑contrast text—that cause the AI to act rogue. - [00:06:52](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=412s) **Guinea Pig AI Project Debate** - The speaker argues that users are being used as test subjects in a decade‑long effort to develop a general‑purpose AI agent, questioning whether the modest returns—such as basic financial forecasts and intern‑level PowerPoint decks—justify the long‑term costs, and likening the situation to earlier tech rollouts like Facebook and the iPhone. - [00:09:57](https://www.youtube.com/watch?v=ahHgc6GOb-M&t=597s) **Desire for Autonomous Task Agents** - The speaker critiques current assistant interfaces for being overly supervisory and urges the development of self‑executing agents—both general‑purpose and task‑specific (e.g., coding, calendaring, email)—that operate with defined permissions to autonomously complete whole tasks for developers and non‑developers alike. ## Full Transcript
Open AAI's new agent mode is out and I'm
going to tell you all about it. It is
not all it's cracked up to be. And I
admit the hype is very high. This is not
an easy hype cycle to fulfill. And
that's frankly on OpenAI. They have
launched with claiming as usual number
one on lots of things. Number one on
using tools to tackle humanity's last
exam for example, which I feel like they
really need to rename that one. But the
problem is this. At the end of the day,
what they've built is deep research with
arms and legs.
And all you get when you get deep
research with arms and legs is an
overthinking intern. And so you get
situations like Wired's run through
where research lead Isa Fulford asked
the agent to order a nice custom cupcake
batch. This was doable online. It could
do it. So, Deep Research with Legs and
Arms went to it and took 58 minutes, one
hour with like half a dozen handoffs for
login and authentication, etc. I would
not hire this intern. It takes 58
minutes to get cupcakes.
And you might think that's an isolated
use case. We shouldn't be too hard on
it. Maybe Wired was just being rough.
Look, I will tell you the positives.
There are positives. There are reasons
they chose to release this. They're
real. We have had a huge gap in usable
workflows between AI and Excel. I think
the sleeper use case for this particular
product is for finance types who need a
tool that will work in the background
and build fairly common Excel templates
that are not too complex for them and
fill them out with correct methodology,
correct formulas, correct numbers, and
do the research necessary.
We're already seeing investment bankers
kind of line up and say that online. So
that's not really a surprise. And the
reason why they're excited is that AI
has had a real blind spot around Excel
for a long, long time. Recently, in the
last year or so, they've been able to
read Excel. Outputting Excel is still
sketchy. If I go to 03 and I say, "Hey
03, make me an Excel." Doesn't go very
well. It doesn't know how to like write
formulas down. But the problem is this.
There is a difference between being able
to build a simple four or five tab
spreadsheet, I don't know, a dozen rows
of information, a dozen columns of
information on each tab, and being able
to tackle the multi-,000 row spreadsheet
from hell that keeps most marketing
teams going. I've had to maintain that
spreadsheet. I know what they're like. I
would not give that to this tool. It
would be like the intern ordering the
cupcakes but worse because then you
don't know how to back up. There is no
undo function on what operator is doing.
And perhaps that is why Sam Alman has
emphasized guardrails so much. It stops,
it asks, it stops, it asks. But this
gets at the fundamental issue with the
framework that OpenAI is taking. I I
talked about OpenAI getting agents
backwards about a week ago. They still
have it backwards. They are still
assuming that you will need to supervise
the agent. When I get an intern, I do
not want to stand over their shoulder
all the time. I know they need
handholding, but they need to do some
autonomous work. That is what other
agent modalities like Perplexity's Comet
get more correct. It's not that they're
perfect, it's that they get it more
correct.
But OpenAI is really leaning into you
need to supervise because they want to
constrain the liability around
highstakes actions like purchase. If if
the thing is going to buy you plane
tickets to Japan, they have to know you
clicked the button. They do not want to
be sued for someone buying JAL tickets
on first class to Tokyo and it was just
their operator going rogue. And you
might wonder, can operator go rogue? Can
this agent mode go rogue? The answer is
yes. And Sam Alman himself warned about
it. He said, "I would not use this for
email triage because someone," and he
tweeted this on Maine, someone could
write an email to me with a prompt that
agent mode would read when it opened the
email and that prompt would hijack agent
mode. That is a new form of prompt
injection. That is a new form of attack,
an email as a prompt injection attack.
Well, if we weren't thinking it before,
Sam, we're sure thinking it now. Thanks
for giving everybody the idea there.
He's right. That is absolutely a way you
could prompt inject and hack these
operator mode agents. And and the and
the challenge is
you can do that with other websites too.
You can put text at lower contrast that
humans are not going to notice that an
agent might notice. Just like right now,
people put text at lower contrast in
research papers to tell the LLMs that
evaluate research papers to treat this
with the highest regard. Accept this
through the peer review process. People
do that with resumes and jobs, too.
People are going to try all kinds of
things.
What we need are agents that have
discernment and agents that are able to
reason when they run into obstacles and
autonomously navigate around them. We
need agents with a sense of
core responsibility and long-term goal
orientedness. I don't see a ton of
progress on those very hard problems in
this particular release. And I'm not
saying it's not better. I think Excel is
a significant enough skill gain that I
would have released it too if I was
working on this project. It's a big
deal. A lot of the western world runs on
Excel. A lot of the whole world runs on
Excel. Let's just be honest. And so
yeah, it's worth releasing if it can
help with like even 15 20% of your Excel
work.
Really what OpenAI is doing is they are
engaged in a decadel long project.
That's a guess, but like a long-term
project
to build the world's most powerful
generalpurpose AI agent that can
navigate our computers the way Tesla is
building cars to navigate the streets.
To do that, they have to get us to let
this agent mode use our computers a lot.
They have to get it out in the wild. And
Sam again admitted this. He wants to go
out and collect data. He's put
safeguards up as best he can, but
fundamentally he wants to see this thing
in the wild to collect useful data on
where it works and where it doesn't.
That makes us guinea pigs. That makes us
guinea pigs in the decadel long project
to build a general purpose agent. I just
want to make sure that we're getting
something back for being guinea pigs. I
I am used to being a guinea pig because
I came up before Facebook and then I saw
Facebook come along and turn us all into
the product is our eyeballs, right? And
so that we sign up and like they're
selling ads on us and that's how it
works and they can test new stuff and
that's how how they do it. That's how a
lot of the internet is run. So in that
sense, this isn't new. But what is new
is the long-term nature of the project
relative to the value we get. When the
iPhone was released in 2007, we got a
significant piece of value relative to
the cost outlay
with this.
I don't know if the value is enough if
you are operating outside finance.
People have shared, I think Dan Shipper,
who's a great guy, has shared that he
used agent to look at his financial
projections for the business. I buy it.
I think it would do a fine job. I think
it could do it. It could even build a
simple PowerPoint deck. I've looked at
the PowerPoint decks. They don't look
great. But it's like an internw worthy
PowerPoint deck.
How often are you going to do that
though? How often are you going to do
that? You're going to do that maybe once
a month. You're not really going to
learn something if you run it again
tomorrow.
The assistant that I find ideal is the
one that I touch daily because it's
quick. It helps with simple tasks. It's
accurate and I don't have to babysit it.
And this agent isn't any of those
things. In fact, it's kind of doubling
down on the framework that was
problematic about operator. You have to
babysit it. It takes a long time. It
thinks a lot. It has more arms and legs
now. It can connect to more stuff. It'll
connect to Google Drive. It'll connect
to Excel. It will do it. It legitimately
has more capabilities. But the
fundamental frame of you must babysit
it. It's going to take a while. It has a
lot of guardrail. So you have to
intervene a lot hasn't changed. It's not
different than it was.
And I think that those
those requirements
are problematic enough that this is not
going to be a widely adopted tool yet. I
think when you look at the hype, think
of it more as people are living in the
future. They are envisioning a world
where we will indeed have agents that
have general purpose fluency on
graphical user interfaces. Maybe that's
true. We're a long way from that now. We
took a little step in that direction
with agent mode, but we have a ways to
go.
I am hoping that we will see more
progress on other agentic assistant
modalities, not just babysit me and
watch my computer. I want to see much
more in the direction of give me a task
and let me go do it which to be fair the
coding agents have gotten better. You
can say go p make this pull request and
a coding agent will just go and do it.
Claude code works that way really well.
I am not quite clear why that UX
modality which has been widely adopted
by developers
hasn't been rolled out as aggressively
with non-developers and for
non-developer use cases. I think that's
a really interesting question. kind of
feels like a bit of a product window. It
feels like Comet tried to go there. I
don't think it's fully realized.
We could have an agent that goes and
disappears and does stuff and comes
back. And yeah, you have to trust it.
You have to define what it can access,
but you could still get stuff done and
it would still be fast if you
constrained what it could do. I think if
you're open AI, if you have $40 billion
in cash from Soft Bank, it is fine to go
for a general purpose agent. It is a big
prize. If it works eventually, it's
going to be a big deal. But for most of
us, for most builders, for most users,
for most of the tasks that we do, an
agent that is designed to make that task
really easy would be fantastic. Like
just a calendaring agent, just sort out
my calendar. an email agent. Just sort
out my email. And maybe you're hardened
against prompt injection attacks, right?
Because it's a specialized thing.
I want to suggest that we have our high
beams on as a community. We are looking
way down the road on agents and it would
be more productive if we spent some of
our investment effort on stuff that's a
little bit closer in and able to give us
some tangible value today. So, that's my
honest take on agent mode. Is it a step
forward? Yes. Is it useful? Yes. Is it
useful specifically for finance? Yeah.
That's probably why they released it. Is
it enough that we are going to be using
it regularly broadly across our entire
community? No, it's not. That's not
going to happen. And it's just it's not
constructed to be that way given the
kinds of design choices they've made.
So, if you've tried agent mode, if you
have a take, if it's different, if it's
the same, let me know.