Why OpenAI Holds Back Better Models
Key Points
- OpenAI released ChatGPT 4.1, bundling previously hidden improvements (sequential task handling, numeric reasoning, coding) while pulling the newer 4.5 model from availability, even claiming 4.1 outperforms 4.5.
- Despite being an upgrade over GPT‑4o, 4.1 still lags behind competitors like Gemini 2.5 on benchmarks (55% vs. 64% on SWE-bench, a software-engineering benchmark).
- OpenAI is withholding its strongest models (e.g., Deep Research) from the API, limiting developers to weaker versions and pushing users toward the proprietary chat app.
- The speaker argues that this restricts ecosystem growth, contrasting OpenAI’s approach with Google’s more open strategy of releasing Gemini 2.5 via its API, which they find more usable and innovative.
Sections
- [00:00:00](https://www.youtube.com/watch?v=Jl7wX80nxlo&t=0s) Critique of GPT‑4.1 Release - The speaker argues that OpenAI’s newly released GPT‑4.1, which replaces the short‑lived 4.5, offers only modest improvements and falls short of state‑of‑the‑art performance, especially compared to rivals like Gemini 2.5.
- [00:03:19](https://www.youtube.com/watch?v=Jl7wX80nxlo&t=199s) Mixed Feelings on 4.1 Release - The speaker finds the new 4.1 version functional but less smooth and confident than 2.5, labeling it a necessary yet insufficient upgrade while questioning its impact against rivals like Claude and Google.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=Jl7wX80nxlo](https://www.youtube.com/watch?v=Jl7wX80nxlo) **Duration:** 00:04:16
Okay, it's time to talk about ChatGPT 4.1. It dropped today. I'm actually
not going to spend most of my time
talking about all the features they
announced with it because to be honest
with you, we've heard all those features
before. What they did was they took most of the features they had quietly stuffed into GPT-4o, like better following of sequential tasks, better handling of numbers, and somewhat better coding abilities. And they said,
"Hey, maybe the coding abilities should
go with the API because I don't know,
maybe developers would want it. That
sounds like a reasonable plan." And so
they put it in 4.1. And then in their
infinite wisdom, they decided to
deprecate 4.5 because, and I kid you
not, actually from the live stream, they
said 4.1 is better than
4.5. Why? The naming continues to get
more insane every single time. I was
just getting to know 4.5 as a model. It
feels really weird to me to release it
as a research preview and then yank
it back, but here we are. Maybe this is
OpenAI's admission that it was a flop. I
kind of liked it. So
anyway, regardless, 4.1 is not a
state-of-the-art model. And I want to
emphasize that we are used to, and I
think OpenAI likes this, we are used to
thinking of OpenAI as always releasing a
state-of-the-art model. It's not. Gemini
2.5, for example, scores like 64% on SWE-bench. GPT-4.1, even though it's doing much better than 4o, is scoring only 55%. And if you're wondering what SWE-bench is, it's just
an independent measure of engineering
capability. It measures your ability to
do engineering tasks. And Gemini 2.5 Pro
is better at
it. So, this is the question I have.
Why is OpenAI choosing to release worse
models in the API than they release in
their chat app? They have models in
their chat app that they are choosing to
not release in the
API. Deep Research, for instance, is a
model you cannot call in the API. It's a
good model, not in the
API. And I think that's really
interesting. And I think it makes it
more difficult to build
out infrastructure that we can use to
advance artificial intelligence
capabilities across the ecosystem and it
pushes people more into a particular
app. Now that may be good from a
consumer perspective, but it's not good
from an ecosystem perspective. From a
consumer perspective, if you're always
getting the best models in your app, in your ChatGPT app, and you don't have to
think about it, you're fine. But if you
want an overall healthy AI ecosystem,
you need state-of-the-art models getting
released that enable you to build in
ways that drive the ecosystem forward.
And I got to give credit. Google has
done a better job here. Gemini 2.5 is a
fine model and they've released it in
the API. I've been playing with it for a
couple weeks now and I'm really enjoying
it. It feels like a thoughtful model. It
can be opinionated. It's clear. Sometimes I like to bounce ideas back and forth between it and Claude 3.7 in my IDE.
It's working pretty well. I have played with 4.1; I played with it this afternoon when it came out, also in my IDE. I found it a little bit wordy, not as confident. It's fine. I made progress, but it didn't feel as smooth sailing as when I was running with 2.5. With 2.5, the vibe felt as good as the test results showed. It felt better, just like it scored better, right? So, for what it's worth, I think that 4.1 was
probably a necessary release because the
only other API that they had in that
class was 4o, which clearly wasn't up
to it, but it's not a sufficient
release. You know, necessary but not
sufficient. It's a step forward, but
it's not enough. ChatGPT has some catching up to do on the tech ecosystem side, and I know they have a big week ahead, but it really remains to be seen if they're going to release something that moves the needle versus Claude, and really versus Google.