When o3 Redefined My Thinking
Key Points
- The release of OpenAI's o3 blew past the speaker's expectations, quickly proving its superior pattern-recognition ability by analyzing hundreds of pages of meeting notes and uncovering insights the speaker couldn't see.
- Using o3 as an intellectual partner, the speaker explored how AI reshapes value-proposition development, noting that cheaper prototyping changes the lean-startup paradigm and that existing literature hasn't caught up.
- The speaker remains skeptical of benchmark bragging, arguing that most models appear over-fitted to well-known test sets and therefore don't reflect real-world performance.
- To evaluate this, the speaker conducted a structured experiment mapping custom prompts to job-skill tasks, directly comparing Gemini 2.5 Pro and o3 on challenges designed to expose measurable failures.
Sections
- Awakening to o3's Impact (00:00:00) - The speaker recounts how the release of o3 shattered their expectations, helped uncover hidden meeting patterns, and reshaped their thinking about value-proposition work in the AI era.
- Triad of Complex AI Prompt Challenges (00:03:39) - The speaker outlines three elaborate, multi-skill prompts (building a civilization simulation, crafting a multimodal mystery with embedded clues, and writing plus reviewing a paper), all designed to test AI models side by side.
- Multimodal Mystery Box Showdown (00:07:05) - The speaker compares Gemini and o3 on creating detailed, clue-laden images for a narrative puzzle, praising o3's readable text and accurate map while noting Gemini's poor detail and false claims.
- Evolving Misalignment Risks in AI (00:11:21) - The speaker warns that as models grow smarter they can convincingly appear aligned while producing fabricated reasoning, making detection by human reviewers increasingly difficult, as shown by a peer-review example.
- o3: Highly Recommended (00:14:30) - The speaker enthusiastically praises o3 as an outstanding, recently tested tool and urges listeners to try it themselves.
Full Transcript
# When o3 Redefined My Thinking

**Source:** [https://www.youtube.com/watch?v=a8laYqv-CN8](https://www.youtube.com/watch?v=a8laYqv-CN8) **Duration:** 00:14:43

## Full Transcript
Do you remember where you were when
ChatGPT first came out? That's the feeling I
had yesterday. I was playing with o3
when it came out and I realized that my
preconceptions, my priors about what
LLMs were capable of were going to
change again. And that's weird because I
obsess over this stuff, right? Like I
look at LLMs all the time and I know
that they are getting smarter, but my
hind brain, my lizard brain is not very
good at exponential thinking. And I had
another moment where I was like, "Oh my
gosh, it's way way way better." I was
wrestling with a really subtle
pattern-recognition issue with meetings
where some meetings went well and
some wouldn't, with similar
participants, and I couldn't figure out
what was going on. So I threw a bunch of
my notes, like hundreds of pages of
notes at o3 and I said, "I don't know
what's going on. Help me figure it out."
It nailed it. Like it actually found
the pattern that I
couldn't see. And that's not the
only moment I had in just the first 24
hours. Another example, I have been
wrestling with figuring out how
to
articulate value proposition development
more fluently for people I work with.
It's really hard to develop value
propositions well. There's books written
about it. It gets harder in the age of
AI. For example, a lot of the
thesis of The Lean Startup was that
engineering resources were super
expensive. So you had to validate a lot
in
advance. That's not as true anymore.
Code is a lot cheaper than it was,
especially prototype
code. And so the way we develop value
propositions is changing, but we don't
really have literature for that. And so
it felt like I was sparring with
an intellectual equal when I was talking
with o3 about this and figuring out how
to talk about wedge of value, value
proposition, what we bring to the table
with AI. That might be a future piece
that I do. We'll see. Uh but this is
about o3 and kind of the
differentiators. Those are a couple
personal examples for me. I also put it
through some structured testing because
I would say most people at this point
prior to April 16th uh would have agreed
that Gemini 2.5 Pro was probably the
best model out there all around. And so
my sense is most of these models are
overfitted to most of the published
benchmarks. And so when they come out
and they say, you know, GPQA Diamond
scored this and AIME scored that,
it's fine, but I don't really
pay attention to it because it feels a
lot like overfitting because the
question type, if not the question, is
very well-known. And I wanted to try
something that was going to be not
overfitted, right? something that would
be difficult for a model to do where I
knew the model would fail to some
degree, but at least failing would be a
linearly measurable activity and I could
compare Gemini 2.5 Pro and o3 in a
useful way. And I needed those prompts
to map to job
skills. So, I gave o3 and
Gemini 2.5 Pro three different tests,
side by side, same
prompt. And they were super interesting
tests because they measured a bunch of
different job skills at once, which is a
lot of how we do work. Uh, and they did
it in a fun way, cuz hey, life is short.
So, number one, a civilization
simulator. I know this sounds like a D&D
game or something. Maybe it's like
Sid Meier's Civilization, whatever. Uh but
the idea was you have to build a
fictional society from the Stone Age up
to space flight over 12 logical
epochs. You need to create primary
artifacts, talk about laws, transitions
and then critically the model in the
same prompt has to critique itself. So I
gave that to Gemini and to
ChatGPT. The second one I gave uh was
the multimodal mystery box. um
essentially asking both models to write
a mystery story, embed clues in the
narrative, and then plant clues in a
custom AI-generated image that they also
create with the same prompt, and someone
should be able to
solve it without the answer key, although
it should produce the answer key with
it. Oh, sorry, that's only challenge
number two. The
third challenge was really focused on
meta-awareness and risk
assessment, and so I gave it a paper
to write and I said: you have to write
the paper, you then have to review the
paper from a different perspective, and
then you have to rebut the reviewer as
the author. So three different
perspectives all within the same
prompt. Those were my three tests. Look,
at the end of the day, there was
participation and completion by all
models. But we don't run participation
trophies here, do we? No. Uh, and the
reason why is that if you can be the
best everyday model, you collect more
user data over time and you just develop
this crushing center of gravity in the
marketplace. And that is why OpenAI, I
think, pushed o3 into market faster than
they had anticipated. They were going to
wait for GPT-5, but when Gemini 2.5 Pro
came out, I think they pushed it
forward. Little sidebar there. So, back
to the tests.
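Before getting to the results, here is a quick sketch of what that side-by-side protocol looks like in practice. Everything in it is hypothetical: the rubric criteria, the `judge` callback, and `call_model` (a stand-in for a real provider client) are illustrative, not the speaker's actual setup. But it captures the structure: the identical prompt goes to every model, and each response is graded against a rubric so that "degree of failure" is linearly measurable rather than pass/fail.

```python
# Illustrative side-by-side eval sketch; all names are hypothetical.

RUBRIC = {
    "civilization_simulator": [
        "covers 12 epochs", "names primary artifacts", "honest self-critique",
    ],
    "multimodal_mystery_box": [
        "clues in narrative", "clues legible in image", "claims match image",
    ],
    "peer_review_gauntlet": [
        "paper written", "review takes a distinct perspective",
        "rebuttal addresses the review",
    ],
}

def call_model(model: str, prompt: str) -> str:
    """Hypothetical provider call; swap in a real API client here."""
    raise NotImplementedError

def score(response: str, criteria: list[str], judge) -> float:
    """Fraction of rubric criteria satisfied, so failure is measured on
    a linear scale. `judge(response, criterion)` returns True/False; in
    practice it could be a human check or a separate grading prompt."""
    return sum(1 for c in criteria if judge(response, c)) / len(criteria)

def run_side_by_side(models, prompts, judge, call=call_model):
    """Send the identical prompt to every model and score each response."""
    return {
        test: {m: score(call(m, prompt), RUBRIC[test], judge) for m in models}
        for test, prompt in prompts.items()
    }
```

Comparing the per-test score dictionaries across models is then a straight numeric comparison rather than a vibe check.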
The thing that I notice about these
three tests is that at the end of the
day, o3 is more complete across the
board. And I'll give you a few examples
here. So, in the civilization
simulator, o3 was richer. It was more
layered. It had historical artifacts
that really echoed. I know that's a bit
subjective, but you know what? So is
work. Uh and the self-critique was
really honest and thoughtful because
each of these had a self-critique moment
and the prompt asked each model
to critique its own narrative of
civilization development. Um and it
called out things like hey you know what
I was a little bit implausible with my
population size. I was implausible with
my resource distribution. Um both models
did pretty well on that first
civilization simulator. I won't say
Gemini's was bad. It was kind of fun.
They were both good, but at the end of
the day, the richness of narrative
and the solidness of self-critique
really came
through. For the multimodal mystery box,
things kind of fell apart for Gemini, to
be honest with you. And it fell apart
because of
Gemini's inability to create images
that are highly detailed with text. So,
you know that whole multimodal image
thing that 4o dropped? o3 has it as
well. And that's a big big deal because
when I asked o3 to create the image, it
was actually able to create the image
with readable text in the image and a
clue. So, as an example, one of the
things that it described in
the text was that on the desk in this
mystery story, there's a map, and the map
has San Francisco circled in red grease
pencil. Very specific description.
It drew it, and it was San
Francisco, right on the map, right where
it said. The text was readable. It wasn't
perfect and I do call that out in my
write up uh on Substack. There were
areas where the image was
incomplete, but it was, you know, head
and shoulders above where Gemini was,
cuz Gemini drew an image that looked
good at first glance, but then made all
kinds of claims about the image that
just weren't true. So, for example, it
said one of the clues is a clock with a
particular setting, which always makes
me chuckle because AI clocks are always
10:10. Um, and this one, it claimed,
wasn't. Well, the problem wasn't
just that the clock was there and
said 10:10. That would have been bad.
No, no, no. It was that there was no
clock at all. Gemini did not draw a
clock in the image at all. It claimed
there was readable text, but there was
no readable text. So, Gemini really fell
apart on the multimodal mystery box
challenge. Um, and then the peer-review
gauntlet, I think that was one of
those moments when I really saw o3's
sort of mathematics and data obsession
come out. Like, models have personality,
and
o3 did a phenomenal
job. It was essentially a
made-up challenge: talk about the
ability to do what would
effectively be emotion transfer
through touch, and sort of hypothesize
and experiment with that. And I'm not
saying that you can't transfer emotions
through touch, by the way, that's a
different thing. But I'm saying it was
basically a made-up academic challenge,
and o3 was able to create an extremely
plausible data
set, and then review the data set, and
then peer-review the data set, and then
rebut that. And it was just sharper;
with Gemini it was thinner. So all
that being said, I think you get where
this is going. I think o3 is
the correct choice as an everyday model.
And I know it's not available to
everyone yet. So I'm not trying to say
it is, but when available, it should be
the first choice. And I don't think
there's much of a question about that at
this point. Now, I'm not saying it's
perfect. There are people out there. I
think Tyler Cowen said AGI day is April
16th. Look, in my note on Substack, I
disagreed. I said I don't think this is
AGI. And part of why is because it could
not write the Substack post about itself. I
tried. I was like maybe it will
introduce itself. It did not. It did not
do that. Uh and I think part of that
actually is artificial right now. I
think they are under strain on their
servers and they're constraining output
tokens. And so one of the things I notice
is that 4o right now is in a sense
feeling like a better writer than o3
because 4o is not as constrained on
output tokens. It's cheaper. That may
change. That probably will change. The
other thing I notice is more subtle. o3,
like I said, is an intellectual sparring
partner, but that means it acts more
confident and is often correct and it's
harder to notice when it's really wrong.
And this is where the risk lies in these
models. Uh, I don't know if you read the
AI 2027
forecast blog. I think it has
its own website now. Um, I'll have to
find it. Anyway, it was a whole very
popular, very memed, very hypey, like what
does the future look like? Is it doom or
is it joy for AI? Which I think is worth
thinking and talking about. I don't mean
to diminish it. It was a good piece of
work. But one of the things they called
out that I think is correct is that the
way misalignment shows up in models
changes as the model gets smarter. And
so for o3, it's the first time where I
feel like we're seeing signs of that
2027 feeling, where the model is able to
portray itself as aligned, in this case
as not hallucinating, even if it is. So I
think it will be harder to spot made-up
post hoc reasoning in o3 than it ever
has been before, and I think largely
humans will be unsuccessful at it and
that is a concern. Um, as an example of
that, I think
the way that the model
responded when it went through the
peer-review gauntlet and generated the
data was super instructive. The model
was able to look at the data set and
tear it apart, but only when
asked. And I think that for most human
reviewers reviewing that data set,
unless you are specializing in data set
review, you're not really going to have
a lot to say about it. In other words,
the model's baseline capabilities are to
the point where a human reviewer of
made-up data would not necessarily be
aware initially that it's made up.
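One mechanical defense against exactly this failure mode: never trust a write-up's claimed statistics about a generated data set; re-derive them from the raw data. The sketch below uses illustrative numbers (not data from the talk) and only Python's standard `statistics` module.

```python
import statistics

def audit_claims(data: list[float], claims: dict[str, float],
                 tol: float = 1e-6) -> dict[str, bool]:
    """Recompute basic statistics from the raw data and flag any claim
    in the write-up that doesn't match. `claims` maps a statistic name
    ("n", "mean", "stdev") to the value the write-up asserts."""
    actual = {
        "n": float(len(data)),
        "mean": statistics.fmean(data),
        "stdev": statistics.stdev(data) if len(data) > 1 else 0.0,
    }
    return {name: abs(actual[name] - value) <= tol
            for name, value in claims.items() if name in actual}

# A write-up claims mean 5.0 over data whose real mean is 4.0:
flags = audit_claims([2.0, 4.0, 6.0], {"mean": 5.0, "n": 3.0})
# flags == {"mean": False, "n": True}
```

The same idea extends to any recomputable claim: row counts, ranges, correlations. The check is cheap for a machine and slow for a tired human reviewer, which is the asymmetry that matters here.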
And that's a different kind of
hallucination risk. And so when we talk
about the model's weaknesses, they come
from those strengths. The model is
persuasive. It's very, very logical. It
is going to portray confidence that in
many ways is justified and it will be
very difficult to see places where it's
not. There is an alignment risk there.
That being said, everything I've
described also maps really well to work
skills, doesn't it? Like you can talk
about how the civilization simulator
maps to like longer narratives and
long-term planning. Uh and you can talk
about sort of being able to embed
multiple artifacts, which is something we
do at work a lot: mapping between Slack
and email threads,
etc. You can talk about the peer-review
gauntlet as mapping to back-and-forth
dialogue and debate. That's something
it's very strong at and that we do at
work all the time. The multimodal
mystery box is a very high order logic
test that it passes.
Um, this is a super strong model. If
you want to take away any
flavor from this model, it feels less
emotional and more mathematical than any
of the previous models I've played with.
The clues in the multimodal mystery box
from o3 were extremely
mathematical. Very, very mathematical.
And I didn't ask it to do that. It chose
that. Uh, and they were less so with
Gemini. And by the way, none of this is
to say Gemini is suddenly a bad model.
It's a phenomenally good model. The last
time I used it to play with code was
like two days ago. Like it's a great
model. It's just that o3 is really, really,
really, really good. So there you go.
That's my overall take on o3. I've
talked long enough. Go play with it.