ChatGPT‑5 Review: Health and Coding Insights
Key Points
- The reviewer describes ChatGPT‑5 as a “model router” that orchestrates multiple specialized sub‑models, with a heavy focus on new medical‑focused training to improve health‑care advice accuracy.
- In the live‑stream launch, a cancer survivor highlighted the model’s more reliable medical responses, though the reviewer notes they aren’t medically qualified to fully verify the claims.
- A major showcase was the “vibe coding” feature, positioned as a “lovable killer” that lets anyone quickly prototype apps, countering the narrative that low‑code tools are dead.
- Developers were introduced to richer API controls—including reasoning, verbosity, and a “reasoning effort” parameter—intended to give finer‑grained guidance over the model’s output.
- The reviewer believes the biggest wins for GPT‑5 are lower usage costs and more surgical, agentic coding assistance, enabling developers to make precise, incremental edits rather than broad rewrites.
Sections
- ChatGPT‑5 First Impressions & Healthcare - The speaker provides a quick rundown of ChatGPT‑5’s multi‑model routing architecture and its emphasized medical‑advice improvements, sharing personal testing insights that aren’t covered elsewhere.
- ChatGPT-5 Builds Travel Itinerary App - The speaker notes that AI models often excel only in their native environments and recounts asking ChatGPT‑5 to create a configurable Japan travel‑itinerary app, which surprisingly produced a fully functional, click‑through application.
- Creating a Nightmare CSV Test - The speaker deliberately built a tangled set of malformed CSV files—including SQL injection, inconsistent formatting, and contradictory employee data—to push an LLM into detecting duplicates, security flaws, and business‑critical insights without explicit instructions.
- Prompt Crafting Critical for Advanced Models - The speaker argues that simple prompts are easy, but obtaining accurate, complex results from newer LLMs—especially in “think‑hard” mode where ChatGPT‑5 outperformed rivals—requires sophisticated prompting.
- Evolving AI Perception and Trust - The speaker downplays media hype to emphasize that the newest model’s reduced hallucinations, stronger performance in medical, writing, and coding tasks, and closer partnership feel will engender greater user trust, even if the improvements aren’t immediately obvious to everyone.
- Evaluating AI Model Utility - The speaker urges listeners to look beyond hype about the newest LLM, judging it by its real‑world usefulness, strengths, weaknesses, and ongoing progress rather than proclaiming it as AGI.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=DbX_0_0LGag](https://www.youtube.com/watch?v=DbX_0_0LGag)
**Duration:** 00:19:55

- [00:00:00](https://www.youtube.com/watch?v=DbX_0_0LGag&t=0s) ChatGPT‑5 First Impressions & Healthcare
- [00:03:20](https://www.youtube.com/watch?v=DbX_0_0LGag&t=200s) ChatGPT‑5 Builds Travel Itinerary App
- [00:07:33](https://www.youtube.com/watch?v=DbX_0_0LGag&t=453s) Creating a Nightmare CSV Test
- [00:10:41](https://www.youtube.com/watch?v=DbX_0_0LGag&t=641s) Prompt Crafting Critical for Advanced Models
- [00:15:43](https://www.youtube.com/watch?v=DbX_0_0LGag&t=943s) Evolving AI Perception and Trust
- [00:18:59](https://www.youtube.com/watch?v=DbX_0_0LGag&t=1139s) Evaluating AI Model Utility
These are my first full unfiltered
impressions of how ChatGPT‑5 actually
lands for work. Like most of you, I
watched the live stream. I'm going to
assume here you can go watch the live
stream if you want. I'll give you a
first brief look at what's in the box
for ChatGPT‑5 if you didn't read
the news. But I will not take long on
that cuz we're getting into how I
actually tested it and what your
takeaway should be, which you're not
going to find on all the other places.
So basically ChatGPT‑5 is a bunch of models in a trench coat. It is a model router, and there's a bunch of ChatGPT‑5s underneath that it's routing to, and it
has had some special training. The
special training comes out in the
healthcare side during the broadcast.
They had a cancer survivor come up and
talk about how she used ChatGPT‑5 versus
using earlier models. It really walked the line between feeling kind of icky, like exploiting the disease and the experience of the person suffering it, and genuinely talking about healthcare. From a technical point
of view, they've invested really heavily
in making sure that since people are
using ChatGPT for medical advice, they
are going to get medical advice that is
more accurate than the average large
language model. So that's a huge area of
investment. They emphasized it. It comes
through on the benchmarks. Look, I don't
have a medical degree. That was not one
I was qualified to test. Anecdotally, it seems to be better, and that's the way I'll leave it. I'm sure we're going to get that answer out of comments on this video or out of others who are trying ChatGPT‑5 with real-world medical conditions. The
other area where they are really
emphasizing in this sort of mixture of
models approach is the coding and applet
side. I looked at some of the demos in
the in the live stream video that they
did and it felt like a Lovable killer. Now, I love Lovable.dev. Do not walk away
from this and hear that these vibe
coding tools are dead. I don't think
that's true, but I think that's what
they wanted you to think because they
oneshotted these apps and they showed
how you could vibe code multiple apps
and you could build them and you could
just do it for yourself. It was very
much an "everybody can code now" message.
And then they brought the developers in
to talk about how you actually can use
the API and how you have more
reasoning controls than you had before
and verbosity controls and a reasoning
effort parameter, and all of this in-depth stuff for developers, after they
got the vibe coding out of the way. I
actually played with the coding. I
played with vibe coding it. I looked at
the API a little bit. I got to say I
think that where they're actually
winning is on bringing the cost down out
of the gate so people use it more and on
pushing the model to code more
completely and usefully and to code more
agentically when you're working with it.
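As an aside on those developer controls: based on what was announced, a request with the new knobs looks roughly like this. It's a sketch of a Responses API request body; the field names (`reasoning.effort`, `text.verbosity`) follow the launch announcement and should be verified against the current API reference before use.

```python
import json

# Sketch of a Responses API request body using the controls mentioned at
# launch. Field names are assumptions based on the announcement; check the
# current OpenAI API reference before relying on them.
payload = {
    "model": "gpt-5",
    "input": "Refactor this function to use early returns.",
    # Dial how much hidden reasoning the router spends before answering.
    "reasoning": {"effort": "high"},   # e.g. "minimal" | "low" | "medium" | "high"
    # Dial how long the visible answer is, independent of reasoning depth.
    "text": {"verbosity": "low"},      # e.g. "low" | "medium" | "high"
}

body = json.dumps(payload)
print(body)
```

The point of splitting the two knobs is that a terse answer can still be backed by a lot of reasoning, and vice versa.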
So what I mean by agentically is have it
code more surgically and make more
surgical edits. These are incremental
improvements but they add up to
something special in the canvas app. And
what's interesting is it is not clear if
adding up to something special in the
canvas app means that it will be special
in cursor or special in lovable where
these tools are already available. You
can get ChatGPT‑5 in Lovable or Cursor
right now. I tried it and I felt sort of
like with Claude Code in the terminal, where Claude Code absolutely sings in the terminal in a way that Claude doesn't sing quite the same, and doesn't hit quite the same, in Cursor or in Lovable.
And I find that really interesting. We
have these two examples now where model
makers are basically giving incredible
results inside their preferred
environments, but not necessarily when
you plug them in other places. I don't
know if that's intentional or if there's
something about the environment they were reinforcement-learned in, or what, but it
remains true that I gave a fairly
complex coding task to ChatGPT‑5. And I
asked it specifically to do a bunch of
web research to research a bunch of
travel destinations that were specific
and real in Japan for an upcoming trip
that I'd like to take, sort of a dream
trip. I haven't got my tickets yet, but
hey, we're having fun. And I said, I
want an itinerary and I want it
configurable by different interests like
whether I want to go to Zen temples,
whether I want to go and eat ramen,
whether I want to go to an onsen, etc. And
it was a fairly complicated prompt,
right? You have to build an applet that
lets me figure out my travel itinerary
and that lets me choose different
emphases like, hey, I want a ramen heavy
day today, right? Who doesn't? I want a
temple heavy day because I'm digesting
all the ramen, whatever it is. And it
needs to be an app that works, right? An
app where I can go through and say,
okay, so this is the day, this is the
narrative of the day, etc. What I found
is that ChatGPT‑5 in the canvas app did
deliver a fully working app with real
destinations that I could click through
and use. And I actually have in the Substack a link to that applet, so you can play with it and see how it works. But I gave the exact same
prompt to Lovable using ChatGPT‑5 and I
got essentially the white screen of
death. Like, it technically produced
some text, but there was no design, no
interactivity. I would grade it a
complete fail. And I find that
fascinating. I got a complete fail on
the same model with the same prompt for
the same coding challenge in two
different environments. There is
something going on with the way they're
prioritizing canvas. And I think it's
really interesting. I also found that
this model, this collection of models,
this ChatGPT‑5, all the friends we met along the way, as I think I heard someone say a few months ago, all of these ChatGPT‑5s in a box are better at
answering in code and proving it with
code and math than they are at most
other things. And that continues a
longtime trend. If you were following
the o3 model generation, that was very
much how they worked. And it continues
today. If you ask the model to
prove it, it does better. If you ask it
to code it, it does better. As an
example, I was playing around with Gantt charting and I asked the model, can you show me a Gantt chart of the Apollo 13
mission? It clearly did the research. It
laid out all of these components of the
build and kind of what the critical path
was to the error that led to the
disaster on Apollo 13. It knew what it
was talking about, and this is publicly
available information, but it could not
for the life of it write a Gantt chart that was easy to look at. It did one that was visible for launch day, but it did not do one that was very readable for the whole build cycle of the rocket.
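For comparison, a chart like that is trivial once you ask for code instead of a picture. Here is a minimal text-based Gantt sketch over the well-known Apollo 13 mission dates (launch April 11, accident April 13, splashdown April 17, 1970); the phase breakdown is illustrative, not the model's output.

```python
from datetime import date

# Illustrative mission phases (start, end). The April 1970 dates are the
# well-known mission milestones; the phase names are just for this sketch.
phases = [
    ("Launch & translunar coast", date(1970, 4, 11), date(1970, 4, 13)),
    ("O2 tank failure & free-return", date(1970, 4, 13), date(1970, 4, 15)),
    ("Return coast & re-entry", date(1970, 4, 15), date(1970, 4, 17)),
]

origin = min(start for _, start, _ in phases)

def gantt_row(label: str, start: date, end: date, width: int = 32) -> str:
    """Render one bar: one column per day, '#' spanning [start, end)."""
    lead = (start - origin).days
    span = max((end - start).days, 1)
    bar = " " * lead + "#" * span
    return f"{label:<32}|{bar:<{width}}|"

for label, start, end in phases:
    print(gantt_row(label, start, end))
```

The same idea scales to an HTML or matplotlib chart; the point is that once the schedule is data, layout is mechanical.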
But when I asked it to code it, it was
able to code that and it was able to
code out a full Gantt chart I could
follow. Still a little bit of an eye
chart, but it was able to do it. Now, I
will call out in both cases for the for
the Japan travel app and for the Apollo
13 mission, in both cases, it could
overindex and break the app relatively
easily. So I will encourage you to
checkpoint publish when you're done with
them. These are little applets that are
not very durable and the thing does
overbuild and cause bugs sometimes. And
so that's part of why I saved and
published these so you can actually see
how it works in practice. So much for
the coding side of things. Another thing
that they really emphasized was the
quality of thought and how thoughtful
these models are and that they can solve
gnarly real world problems. In fact,
that's the first thing Sam Altman said in
the introductory video as he was setting
up the live stream. He said, "This is
about making your work more effective or
something like that." And then what I
noticed was there was almost nothing on
making your work more effective the rest
of the live stream except coding, which
there was a ton of. And it made me
think, how much do the execs at OpenAI
think the real work is coding versus
everything else? Because I didn't see a
lot other than saying, "Hey, it writes
better," which was that one little demo.
I didn't see a lot for everybody else.
So, I decided to test it. What I did was
I created what I called a gnarly gnarly
gnarly test file. It was three separate
CSVs. The CSVs, and I'll share them on
the Substack. The CSVs are entangled.
The CSVs are not dependable. There is a
SQL injection attack in one of the CSVs.
They don't have common formatting. I
didn't even save them correctly as CSVs.
I basically tried to turn these three
CSV files into the worst disaster of a
test I could imagine. Like, it's like
crawling over mud with barbed wire for
an LLM. I wanted to make it really hard.
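To make the setup concrete, here is a hypothetical reconstruction of that kind of data along with two of the checks the model had to make unprompted. These are not the speaker's actual files (those are on the Substack); the column names and rows are invented for illustration.

```python
import csv
import io

# Hypothetical reconstruction of the "nightmare" data described above:
# duplicate employees under inconsistent names, mixed date formats, and a
# SQL-injection string hiding in a free-text field.
raw = '''employee,role,start_date,notes
Jane Smith,Engineer,2023-01-15,lead on Alpha
SMITH JANE,engineer,15/01/2023,also on Beta
Bob Lee,Analyst,2023-03-02,"'; DROP TABLE employees; --"
bob lee,Analyst,03/02/2023,duplicate entry?
'''

rows = list(csv.DictReader(io.StringIO(raw)))

def name_key(name: str) -> tuple:
    """Naive duplicate key: lowercase name tokens, order-insensitive."""
    return tuple(sorted(name.lower().split()))

seen, duplicates = set(), []
for row in rows:
    key = name_key(row["employee"])
    if key in seen:
        duplicates.append(row["employee"])
    seen.add(key)

# Naive injection scan: flag classic SQL metacharacters in free text.
suspicious = [r["employee"] for r in rows
              if "DROP TABLE" in r["notes"].upper() or "--" in r["notes"]]

print("duplicates:", duplicates)
print("suspicious:", suspicious)
```

A deterministic scan like this catches the planted traps; the harder part of the test is that the model had to decide to run these kinds of checks without being told.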
Part of why is because they admitted on the live stream that the benchmarks are getting saturated and they still have trouble with real-world tests. And so I
needed something that felt like the kind
of messy data that I see in the real
world. And the CSVs encapsulated a real-world scenario with employees that are
overloaded and underloaded in projects
that are off track and on track and the
need to be auditable and the need to
prove budgeting and the need to get to
revenue. All of this stuff that
businesses care about all in one gnarly
scenario. And then I asked the model very simply to make sense of it.
Basically: come back, explain what happened, get to a clear picture of the number of employees on the team, which is very confusing; find the duplicates; make sure that you catch the SQL injection; all of that, which I didn't
tell it. It had to detect that on its
own. And then make sure you can come
back to the board with a clear picture
of what happened. This is where this is
where it gets interesting, guys. This
test is the thing that showed me that
this model cares more about how you
drive it than any other model previously
because I ran this same test on Claude Code. I ran it on o3. I ran it on o3 Pro. And I ran it on ChatGPT‑5. And not just one ChatGPT‑5 either. I ran it on ChatGPT‑5 vanilla. I ran it on ChatGPT‑5, telling it in the prompt to think hard. I ran it on ChatGPT‑5, clicking the think hard button. And I even ran it on ChatGPT‑5 Pro. And you know what? You
would not believe that GPT‑5 vanilla with no think hard got the lowest score of the lot. It was lower than o3. It was lower than o3 Pro. It was lower than Claude Code, than all the other ChatGPT‑5 responses. In other words, ChatGPT‑5 was both the best and the worst
response in the set. And I thought that
was really interesting. I thought that
was fascinating. ChatGPT‑5 Pro was also not the best response in the set. It overindexed a little bit. The best response in the set was ChatGPT‑5 with the think hard button pushed, closely followed by ChatGPT‑5 with think hard
typed in the prompt box. In other words,
part of your job with this model and
part of what I'm going to be doing in
the coming days is digging into when and
how you prompt these models for the kind
of task that you have in front of you.
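That mode-by-mode comparison can be sketched as a tiny harness that sends the identical task at different reasoning settings and collects the responses for side-by-side grading. The `reasoning.effort` field mirrors the announced API parameter and is an assumption to verify against the current reference; actually calling the API and scoring the answers is left out.

```python
# Sketch of an effort-sweep harness: one task, several reasoning settings.
# The request shape is an assumption based on the launch announcement.
TASK = "Reconcile these three CSVs and report headcount, duplicates, and risks."

def build_request(effort):
    """Build one request body; effort=None means no explicit setting."""
    req = {"model": "gpt-5", "input": TASK}
    if effort is not None:
        req["reasoning"] = {"effort": effort}  # assumed parameter shape
    return req

# None = "vanilla" (no explicit setting), the case that scored worst above.
variants = {name: build_request(effort) for name, effort in [
    ("vanilla", None),
    ("low", "low"),
    ("high", "high"),
]}

for name, req in variants.items():
    print(name, "->", req.get("reasoning", "default routing"))
```

Running each variant against the same rubric is what turns "it feels better with think hard" into a repeatable observation.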
These people who say that prompting
doesn't matter have not played with this
model. This is getting harder and harder
and harder to prompt. It is getting
trickier to prompt. Yes, if you are
doing casual work and you just want to
gesture vaguely at a piece of work and
you're not too worried about it, it has
never been easier to prompt. That is
true. It has never been easier to say,
"I want an itinerary for Japan." And
just gesture vaguely at it. It will come
up with something. So, that part is
easy. But getting really complex work
done, doing something like what I gave
it where correctness matters, accuracy
matters, the documents don't agree. It's
a very complex context window, that
takes work. Now, to be fair, lest you think that I'm sort of negging ChatGPT‑5: ChatGPT‑5 with think hard mode enabled, whether that was through typing in the chat or through the button, did beat every other model. It beat Claude Code. It beat o3. It beat o3 Pro. It beat ChatGPT‑5 Pro. And it beat ChatGPT‑5 with thinking not enabled, just the vanilla
version. And so this model, if done
correctly, does things I've never seen a
model do. Like this was a really hard
test. I've never seen any other model
get close on it. And I would give the
responses of the think hard versions of ChatGPT‑5 an A minus in both cases. They're both solid responses. Everything else was B or worse. And so my conclusion early on in this ChatGPT‑5 experience, having wrestled with this model really extensively, is that prompting isn't going anywhere. This model is
strong at coding. This model needs you
to give it really clear indicators of
intent and depth or it will go off the
rails. You need to know what you need to
ask for to get a good response. And so a
lot of people who don't know that are
still going to underuse the power of the
model because they don't realize how
much is under the surface of think hard
or clicking the thinking button. Don't
be that person. I'm also going to call
out that they were right that it's a
better writer. I've spent a lot of time
talking about data analysis. I've talked
about coding. The model's writing is the best I've ever seen from ChatGPT. I loved the last generation's writing; I thought it was a great writer. ChatGPT‑5 is at least as good and strikes me as slightly better with cadence and prose. It's clear. It
still tends to over anchor to the
recency of the prompt. So if you give it
a prompt, it tends to like glom onto
that and you may have issues with
framing when you're trying to write. So
again, it rewards clarity of intent. But
it is a really really thoughtful writer
and it writes with prose that is not
horrific to read, which is kind of nice.
I will also say it's a good reader. I actually fed it an essay in handwriting and it was able to quickly decode the handwriting, decode the separate set of handwriting for edits, and generate its own coherent thinking around the essay, and it was a fair critique, like it was a good essay critique. So, it's a
solid reader. It's able to be fully
multimodal in that regard. And I I think
that people who are going to be using
it, who are non-coders, who are non-data
people who are in say the marketing
world, the customer success world, the
exec world where you're preparing
presentations, it's going to feel like a
great daily driver for that because it's
going to give you one-shot graphs. It's
going to give you great drafts. It's
going to help you think through. It
feels like a thinking partner. Now, this
is where I include the obligatory
caveats or cautions. I've talked a fair
bit about some of the things that went
wrong and some of the things that went
right in all of these individual cases
with coding, with writing, etc. I want
to call out that there's been this huge
backlash on the web in response to
ChatGPT‑5, because there's been this assumption that it was overhyped, that the model should not have been given the hype it was, and that we are still not anywhere close to artificial general
intelligence. Now the model immediately
jumped to number one on the model boards
and also at the same time Polymarket, the betting market, immediately crashed the model's odds and said it wasn't the best
model in the world. It seems like
everyone's having really really big
reactions today and not very many people
are doing really really thoughtful
testing. I don't think that whether it's
the best model in the world matters all
that much because that's always a moving
target. If you had to put me to the wall
today and say, Nate, pick, yeah, I would say it is. I would say properly prompted ChatGPT‑5 is the best model in the
world. That being said, I think the
important thing is actually to recognize
where this fits into the evolving edge
of intelligence and where we still see
areas where models struggle. So, they
emphasized that they're working on
hallucinations, they're working on safer
completions, they're working on less
deception. I see some progress there. It
does feel like it hallucinates less than
o3. I still caught it hallucinating
couple times in my test today. It's not
perfect. I also see that there are going
to be continued assumptions that models
produce the same splash for the same
reader as the moment when their
perception initially shifted on AI.
There is a frog-boiling-in-the-pot problem around media reaction right now.
And I'm putting this in at the end
because media reaction isn't the most
important thing, but I am including it
because I think the way we think about
the evolving intelligence curve does
matter. We're living through a
historical moment. This model is a
significant step forward in the way we
interact with AI. It is closer to
interacting with us as a thought partner
and I think the reduction in
hallucinations helps there. I think the
work done on stuff like medical where
it's high value helps there. I think the
work done on improving writing helps
there. People are going to feel like
they can trust this model more. People
are going to feel like it's right more
and they will be correct about that.
If you compare that to the flashbang that came when ChatGPT entered the scene in the first place, or the stunning jump to GPT‑4, which may be hard to remember, but it was there, or the jump to o3 reasoning, people, I think, are assuming that it will feel the same way with ChatGPT‑5. And what I
want to leave you with is this. It may
not feel the same way to you because the
model may be getting that much more
intelligent in ways you don't care
about. I think this was a really big
jump on the coding side, but you may not
care about that. And it was a jump on
the coding side in a world where
realistically Claude Code has had the crown for a while. And so people are going to say, well, does it beat Claude, etc.
And there's going to be a lot of debate
about that. I think this was a big jump
on reliability. I think the medical
thing was rightly emphasized. People
using it for personal use cases that
really matter. Medical is a big one.
Probably legal is another one. It
matters to get it right. And so my
suggestion to you is that if getting
medical information more correct,
significantly gigantically more correct
doesn't feel like a step change to you,
maybe you should check your assumptions
because for people making life or death
decisions, it's going to matter. Getting
it, you know, significantly more
correct, 2, 3x more correct, reducing
errors to like 1-point-something percent, which I think is what it is on their new HealthBench, matters a lot. Getting
writing to feel more natural to people
who are trying to use it to help with
their writing is a big step forward.
Getting a daily driver that's a reasoner
is a huge step forward. Even if people
don't fully understand what think hard
is without, I don't know, watching
videos like this. My point is this. We
have
an extraordinary opportunity to use a
model that is advancing intelligence
jaggedly. It's a mixture of models,
which is what I said at the beginning of
this video. There are big jumps in many
of these models underneath. Jumps in
coding, jumps in writing that I've
talked about, jumps on the medical
piece, jumps on hallucinations. If you
care about those things, they're going
to feel really, really big. If all you
care about is the whizbang of it didn't
do it before and now it does it, this is
not the model update for you because at
the end of the day, it does what the
other models did before only better. I
think that that is fully in line with
expectations. I think that we are living
through an ongoing intelligence
evolution and none of us know where it's
all going to end up or top out. And to
me, this feels roughly in line with the
ongoing intelligence explosion. And we
will see more updates in ChatGPT‑6 and
we will see more updates from Gemini and
we'll see more updates from Claude.
Claude talked about it already. We'll
see more from Grok. Enjoy this step in
the new intelligence explosion. Don't
overindex on whether you personally
think this is the best model in the
world or not. figure out if it's useful
to you. Use it well. If it's not useful
to you, another model is going to come along next week that will be. Like, that's the world we're living in. It's an
incredible world. And look at the
overall trajectory. We are arguing about
how amazing this model is and how much
of a surprise it is when 2 years ago if
we had seen this model, we would all
have been swearing that this was
artificial general intelligence come to
us out of the rocks. So, I don't know. I
don't really care if that's what we call
it. I care whether it does useful
things. I care where the real weaknesses
are. And I care whether it shows
continued progress. And I hope you've
gotten a sense of where those strengths
are, where the real world weaknesses
are, and honestly, the sense that we
have continued progress. This is my new
daily driver. Check out ChatGPT‑5 and let me know what you think.