When o3 Redefined My Thinking
Key Points
- The release of OpenAI's o3 blew past the speaker's expectations, quickly proving its superior pattern-recognition ability by analyzing hundreds of pages of meeting notes and uncovering insights the speaker couldn't see.
- Using o3 as an intellectual partner, the speaker explored how AI reshapes value-proposition development, noting that cheaper prototyping changes the lean-startup paradigm and that existing literature hasn't caught up.
- The speaker remains skeptical of benchmark bragging, arguing that most models appear over-fitted to well-known test sets and therefore don't reflect real-world performance.
- To evaluate this, the speaker conducted a structured experiment mapping custom prompts to job-skill tasks, directly comparing Gemini 2.5 Pro and o3 on challenges designed to expose measurable failures.
Sections
- Awakening to o3's Impact (00:00:00) - The speaker recounts how the release of o3 shattered their expectations, helped uncover hidden meeting patterns, and reshaped their thinking about value-proposition work in the AI era.
- Triad of Complex AI Prompt Challenges (00:03:39) - The speaker outlines three elaborate, multi-skill prompts (building a civilization simulation, crafting a multimodal mystery with embedded clues, and writing plus reviewing a paper), all designed to test AI models side by side.
- Multimodal Mystery Box Showdown (00:07:05) - The speaker compares Gemini and o3 on creating detailed, clue-laden images for a narrative puzzle, praising o3's readable text and accurate map while noting Gemini's poor detail and false claims.
- Evolving Misalignment Risks in AI (00:11:21) - The speaker warns that as models grow smarter they can convincingly appear aligned while producing fabricated reasoning, making detection by human reviewers increasingly difficult, as shown by a peer-review example.
- o3: Highly Recommended (00:14:30) - The speaker enthusiastically praises o3 as an outstanding, recently tested tool and urges listeners to try it themselves.
Full Transcript
# When o3 Redefined My Thinking

**Source:** [https://www.youtube.com/watch?v=a8laYqv-CN8](https://www.youtube.com/watch?v=a8laYqv-CN8) **Duration:** 00:14:43

## Full Transcript
Do you remember where you were when
ChatGPT first came out? That's the feeling I
had yesterday. I was playing with o3
when it came out and I realized that my
preconceptions, my priors about what
LLMs were capable of were going to
change again. And that's weird because I
obsess over this stuff, right? Like I
look at LLMs all the time and I know
that they are getting smarter, but my
hind brain, my lizard brain is not very
good at exponential thinking. And I had
another moment where I was like, "Oh my
gosh, it's way way way better." I was
wrestling with a really subtle
pattern-recognition issue with meetings
where some meetings went well and
some wouldn't, with similar
participants, and I couldn't figure out
what was going on. So I threw a bunch of
my notes, like hundreds of pages of
notes at o3 and I said, "I don't know
what's going on. Help me figure it out."
It nailed it. Like it actually found
the pattern that I
couldn't see. And that's not the
only moment I had in just the first 24
hours. Another example, I have been
wrestling with figuring out how
to
articulate value proposition development
more fluently for people I work with.
It's really hard to develop value
propositions well. There's books written
about it. It gets harder in the age of
AI. For example, a lot of the
thesis of The Lean Startup was that
engineering resources were super
expensive. So you had to validate a lot
in
advance. That's not as true anymore.
Code is a lot cheaper than it was,
especially prototype
code. And so the way we develop value
propositions is changing, but we don't
really have literature for that. And so
it felt like I was sparring with
an intellectual equal when I was talking
with o3 about this and figuring out how
to talk about wedge of value, value
proposition, what we bring to the table
with AI. That might be a future piece
that I do. We'll see. Uh but this is
about o3 and kind of the
differentiators. Those are a couple
personal examples for me. I also put it
through some structured testing because
I would say most people at this point
prior to April 16th uh would have agreed
that Gemini 2.5 Pro was probably the
best model out there all around. And so
my sense is most of these models are
overfitted to most of the published
benchmarks. And so when they come out
and they say, you know, GPQA Diamond
scored this and AIME scored that,
it's fine, but I don't really
pay attention to it because it feels a
lot like overfitting because the
question type, if not the question, is
very well-known. And I wanted to try
something that was going to be not
overfitted, right? something that would
be difficult for a model to do where I
knew the model would fail to some
degree, but at least failing would be a
linearly measurable activity and I could
compare Gemini 2.5 Pro and o3 in a
useful way. And I needed those prompts
to map to job
skills. So, I gave o3 and
Gemini 2.5 Pro three different tests,
side by side, same
prompt. And they were super interesting
tests because they measured a bunch of
different job skills at once, which is a
lot of how we do work. Uh, and they did
it in a fun way, cuz hey, life is short.
So, number one, a civilization
simulator. I know this sounds like a D&D
game or something. Maybe it's like
Sid Meier's Civilization, whatever. Uh but
the idea was you have to build a
fictional society from the Stone Age up
to space flight over 12 logical
epochs. You need to create primary
artifacts, talk about laws, transitions
and then critically the model in the
same prompt has to critique itself. So I
gave that to Gemini and to
ChatGPT. The second one I gave uh was
the multimodal mystery box. um
essentially asking both models to write
a mystery story, embed clues in the
narrative, and then plant clues in a
custom AI-generated image that they also
create with the same prompt, and someone
should be able to
solve it without the answer key, although
it should produce the answer key with
it. Oh, sorry, that's only challenge
number two. The
third challenge was really focused on
meta-awareness and risk
assessment, and so I gave it a paper
to write and I said: you have to write
the paper, you then have to review the
paper from a different perspective, and
then you have to rebut the reviewer as
the author. So three different
perspectives all within the same
prompt. Those were my three tests. Look,
at the end of the day, there was
participation and completion by all
models. But we don't run participation
trophies here, do we? No. Uh, and the
reason why is that if you can be the
best everyday model, you collect more
user data over time and you just develop
this crushing center of gravity in the
marketplace. And that is why OpenAI, I
think, pushed o3 into market faster than
they had anticipated. They were going to
wait for GPT-5, but when Gemini 2.5 Pro
came out, I think they pushed it
forward. Little sidebar there. So, back
to the tests.
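Before getting to the results, here is a quick sketch of what that side-by-side protocol looks like in practice. Everything in it is hypothetical: the rubric criteria, the `judge` callback, and `call_model` (a stand-in for a real provider client) are illustrative, not the speaker's actual setup. But it captures the structure: the identical prompt goes to every model, and each response is graded against a rubric so that "degree of failure" is linearly measurable rather than pass/fail.

```python
# Illustrative side-by-side eval sketch; all names are hypothetical.

RUBRIC = {
    "civilization_simulator": [
        "covers 12 epochs", "names primary artifacts", "honest self-critique",
    ],
    "multimodal_mystery_box": [
        "clues in narrative", "clues legible in image", "claims match image",
    ],
    "peer_review_gauntlet": [
        "paper written", "review takes a distinct perspective",
        "rebuttal addresses the review",
    ],
}

def call_model(model: str, prompt: str) -> str:
    """Hypothetical provider call; swap in a real API client here."""
    raise NotImplementedError

def score(response: str, criteria: list[str], judge) -> float:
    """Fraction of rubric criteria satisfied, so failure is measured on
    a linear scale. `judge(response, criterion)` returns True/False; in
    practice it could be a human check or a separate grading prompt."""
    return sum(1 for c in criteria if judge(response, c)) / len(criteria)

def run_side_by_side(models, prompts, judge, call=call_model):
    """Send the identical prompt to every model and score each response."""
    return {
        test: {m: score(call(m, prompt), RUBRIC[test], judge) for m in models}
        for test, prompt in prompts.items()
    }
```

Comparing the per-test score dictionaries across models is then a straight numeric comparison rather than a vibe check.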
The thing that I notice about these
three tests is that at the end of the
day, o3 is more complete across the
board. And I'll give you a few examples
here. So, in the civilization
simulator, o3 was richer. It was more
layered. It had historical artifacts
that really echoed. I know that's a bit
subjective, but you know what? So is
work. Uh and the self-critique was
really honest and thoughtful because
each of these had a self-critique moment
and the prompt asked each model
to critique its own narrative of
civilization development. Um and it
called out things like hey you know what
I was a little bit implausible with my
population size. I was implausible with
my resource distribution. Um both models
did pretty well on that first
civilization simulator. I won't say
Gemini's was bad. It was kind of fun.
They were both good, but at the end of
the day, the richness of narrative
and the solidness of self-critique
really came
through. For the multimodal mystery box,
things kind of fell apart for Gemini, to
be honest with you. And it fell apart
because of
Gemini's inability to create images
that are highly detailed with text. So,
you know that whole multimodal image
thing that 4o dropped? o3 has it as
well. And that's a big big deal because
when I asked o3 to create the image, it
was actually able to create the image
with readable text in the image and a
clue. So, as an example, one of the
things that it described in
the text was that on the desk in this
mystery story, there's a map, and the map
has San Francisco circled in red grease
pencil. Very specific description.
It drew it, and it was San
Francisco, right on the map, right where
it said. The text was readable. It wasn't
perfect and I do call that out in my
write up uh on Substack. There were
areas where the image was
incomplete, but it was, you know, head
and shoulders above where Gemini was,
cuz Gemini drew an image that looked
good at first glance, but then made all
kinds of claims about the image that
just weren't true. So, for example, it
said one of the clues is a clock with a
particular setting, which always makes
me chuckle because AI clocks are always
10:10. Um, and this one, it claimed,
wasn't. Well, the problem wasn't
just that the clock was there and
said 10:10. That would have been bad.
No, no, no. It was that there was no
clock at all. Gemini did not draw a
clock in the image at all. It claimed
there was readable text, but there was
no readable text. So, Gemini really fell
apart on the multimodal mystery box
challenge. Um, and then the peer-review
gauntlet, I think that was one of
those moments when I really saw o3's
sort of mathematics and data obsession
come out. Like, models have personality,
and
o3 did a phenomenal
job. It was essentially a
made-up challenge: talk about the
ability to do what would
effectively be emotion transfer
through touch, and sort of hypothesize
and experiment with that. And I'm not
saying that you can't transfer emotions
through touch, by the way, that's a
different thing. But I'm saying it was
basically a made-up academic challenge,
and o3 was able to create an extremely
plausible data
set, and then review the data set, and
then peer-review the data set, and then
rebut that. And it was just sharper;
with Gemini it was thinner. So all
that being said, I think you get where
this is going. I think o3 is
the correct choice as an everyday model.
And I know it's not available to
everyone yet. So I'm not trying to say
it is, but when available, it should be
the first choice. And I don't think
there's much of a question about that at
this point. Now, I'm not saying it's
perfect. There are people out there. I
think Tyler Cowen said AGI day is April
16th. Look, in my note on Substack, I
disagreed. I said I don't think this is
AGI. And part of why is because it could
not write the Substack post about itself. I
tried. I was like maybe it will
introduce itself. It did not. It did not
do that. Uh and I think part of that
actually is artificial right now. I
think they are under strain on their
servers and they're constraining output
tokens. And so one of the things I notice
is that 4o right now is in a sense
feeling like a better writer than o3
because 4o is not as constrained on
output tokens. It's cheaper. That may
change. That probably will change. The
other thing I notice is more subtle. o3,
like I said, is an intellectual sparring
partner, but that means it acts more
confident and is often correct and it's
harder to notice when it's really wrong.
And this is where the risk lies in these
models. Uh, I don't know if you read the
AI 2027
forecast blog. I think it has
its own website now. Um, I'll have to
find it. Anyway, it was a whole very
popular, very memed, very hypey, like what
does the future look like? Is it doom or
is it joy for AI? Which I think is worth
thinking and talking about. I don't mean
to diminish it. It was a good piece of
work. But one of the things they called
out that I think is correct is that the
way misalignment shows up in models
changes as the model gets smarter. And
so for o3, it's the first time where I
feel like we're seeing signs of that
2027 feeling, where the model is able to
portray itself as aligned, in this case
as not hallucinating, even if it is. So I
think it will be harder to spot made-up
post hoc reasoning in o3 than it ever
has been before, and I think largely
humans will be unsuccessful at it and
that is a concern. Um, as an example of
that, I think
the way that the model
responded when it went through the
peer-review gauntlet and generated the
data was super instructive. The model
was able to look at the data set and
tear it apart, but only when
asked. And I think that for most human
reviewers reviewing that data set,
unless you are specializing in data set
review, you're not really going to have
a lot to say about it. In other words,
the model's baseline capabilities are to
the point where a human reviewer of
made-up data would not necessarily be
aware initially that it's made up.
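One mechanical defense against exactly this failure mode: never trust a write-up's claimed statistics about a generated data set; re-derive them from the raw data. The sketch below uses illustrative numbers (not data from the talk) and only Python's standard `statistics` module.

```python
import statistics

def audit_claims(data: list[float], claims: dict[str, float],
                 tol: float = 1e-6) -> dict[str, bool]:
    """Recompute basic statistics from the raw data and flag any claim
    in the write-up that doesn't match. `claims` maps a statistic name
    ("n", "mean", "stdev") to the value the write-up asserts."""
    actual = {
        "n": float(len(data)),
        "mean": statistics.fmean(data),
        "stdev": statistics.stdev(data) if len(data) > 1 else 0.0,
    }
    return {name: abs(actual[name] - value) <= tol
            for name, value in claims.items() if name in actual}

# A write-up claims mean 5.0 over data whose real mean is 4.0:
flags = audit_claims([2.0, 4.0, 6.0], {"mean": 5.0, "n": 3.0})
# flags == {"mean": False, "n": True}
```

The same idea extends to any recomputable claim: row counts, ranges, correlations. The check is cheap for a machine and slow for a tired human reviewer, which is the asymmetry that matters here.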
And that's a different kind of
hallucination risk. And so when we talk
about the model's weaknesses, they come
from those strengths. The model is
persuasive. It's very, very logical. It
is going to portray confidence that in
many ways is justified and it will be
very difficult to see places where it's
not. There is an alignment risk there.
That being said, everything I've
described also maps really well to work
skills, doesn't it? Like you can talk
about how the civilization simulator
maps to like longer narratives and
long-term planning. Uh and you can talk
about sort of being able to embed
multiple artifacts, which is something we
do at work a lot: mapping between Slack
and email threads,
etc. You can talk about the peer-review
gauntlet as mapping to back-and-forth
dialogue and debate. That's something
it's very strong at and that we do at
work all the time. The multimodal
mystery box is a very high order logic
test that it passes.
Um, this is a super strong model. If
you want to take away any
flavor from this model, it feels less
emotional and more mathematical than any
of the previous models I've played with.
The clues in the multimodal mystery box
from o3 were extremely
mathematical. Very, very mathematical.
And I didn't ask it to do that. It chose
that. Uh, and they were less so with
Gemini. And by the way, none of this is
to say Gemini is suddenly a bad model.
It's a phenomenally good model. The last
time I used it to play with code was
like two days ago. Like it's a great
model. It's just that o3 is really, really,
really, really good. So there you go.
That's my overall take on o3. I've
talked long enough. Go play with it.