Why OpenAI Holds Back Better Models
Key Points
- OpenAI released ChatGPT 4.1, bundling previously hidden improvements (sequential task handling, numeric reasoning, coding) while pulling the newer 4.5 model from availability, even claiming 4.1 outperforms 4.5.
- Despite being an upgrade over GPT‑4o, 4.1 still lags behind competitors like Gemini 2.5 on benchmarks (55% vs. 64% on SWE-bench, a software-engineering benchmark).
- OpenAI is withholding its strongest models (e.g., Deep Research) from the API, limiting developers to weaker versions and pushing users toward the proprietary chat app.
- The speaker argues that this restricts ecosystem growth, contrasting OpenAI’s approach with Google’s more open strategy of releasing Gemini 2.5 via its API, which they find more usable and innovative.
Sections
- [00:00:00](https://www.youtube.com/watch?v=Jl7wX80nxlo&t=0s) Critique of GPT‑4.1 Release - The speaker argues that OpenAI’s newly released GPT‑4.1, which replaces the short‑lived 4.5, offers only modest improvements and falls short of state‑of‑the‑art performance, especially compared to rivals like Gemini 2.5.
- [00:03:19](https://www.youtube.com/watch?v=Jl7wX80nxlo&t=199s) Mixed Feelings on 4.1 Release - The speaker finds the new 4.1 version functional but less smooth and confident than 2.5, labeling it a necessary yet insufficient upgrade while questioning its impact against rivals like Claude and Google.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=Jl7wX80nxlo](https://www.youtube.com/watch?v=Jl7wX80nxlo) **Duration:** 00:04:16
Okay, it's time to talk about ChatGPT 4.1. It dropped today. I'm actually
not going to spend most of my time
talking about all the features they
announced with it because to be honest
with you, we've heard all those features
before. What they did was they took most of the features they had quietly stuffed into GPT-4o, like better following of sequential tasks, better handling of numbers, and somewhat better coding abilities. And they said,
"Hey, maybe the coding abilities should
go with the API because I don't know,
maybe developers would want it. That
sounds like a reasonable plan." And so
they put it in 4.1. And then in their
infinite wisdom, they decided to
deprecate 4.5 because, and I kid you
not, actually from the live stream, they
said 4.1 is better than
4.5. Why? The naming continues to get
more insane every single time. I was
just getting to know 4.5 as a model. It
feels really weird to me to release it
as a research preview and then yank
it back, but here we are. Maybe this is
OpenAI's admission that it was a flop. I
kind of liked it. So
anyway, regardless, 4.1 is not a
state-of-the-art model. And I want to
emphasize that we are used to, and I
think OpenAI likes this, we are used to
thinking of OpenAI as always releasing a
state-of-the-art model. It's not. Gemini
2.5, for example, scores like 64% on SWE-bench. GPT-4.1, even though it's doing much better than 4o, is scoring only 55%. And if you're wondering what SWE-bench is, it's just
an independent measure of engineering
capability. It measures your ability to
do engineering tasks. And Gemini 2.5 Pro
is better at
it. So, this is the question I have.
Why is OpenAI choosing to release worse
models in the API than they release in
their chat app? They have models in
their chat app that they are choosing to
not release in the
API. Deep Research, for instance, is a
model you cannot call in the API. It's a
good model, not in the
API. And I think that's really
interesting. And I think it makes it
more difficult to build
out infrastructure that we can use to
advance artificial intelligence
capabilities across the ecosystem and it
pushes people more into a particular
app. Now that may be good from a
consumer perspective, but it's not good
from an ecosystem perspective. From a
consumer perspective, if you're always
getting the best models in your app, in your ChatGPT app, and you don't have to
think about it, you're fine. But if you
want an overall healthy AI ecosystem,
you need state-of-the-art models getting
released that enable you to build in
ways that drive the ecosystem forward.
And I got to give credit. Google has
done a better job here. Gemini 2.5 is a
fine model and they've released it in
the API. I've been playing with it for a
couple weeks now and I'm really enjoying
it. It feels like a thoughtful model. It
can be opinionated. It's clear. Sometimes I like to bounce ideas back and forth between it and Claude 3.7 in my IDE.
It's working pretty well. I have played with 4.1; I played with it this afternoon when it came out, also in my IDE. I found it a little bit wordy, not as confident. It's fine. I made progress, but it didn't feel as smooth sailing as when I was running with 2.5. With 2.5, the vibe felt as good as the test results showed. It felt better, just like it scored better, right? So, for what it's worth, I think that 4.1 was
probably a necessary release because the
only other API that they had in that
class was 4o, which clearly wasn't up
to it, but it's not a sufficient
release. You know, necessary but not
sufficient. It's a step forward, but
it's not enough. ChatGPT has some catching up to do on the tech ecosystem side, and I know they have a big week ahead, but it really remains to be seen if they're going to release something that moves the needle versus Claude, and really versus Google.