Evaluating Test-Time Inference Scaling Laws
Key Points
- OpenAI claims that allowing more “test‑time” inference (longer thinking or parallel reasoning) yields consistently smarter answers, suggesting a scaling law for AI performance.
- A new competitor, DeepSeek from China, is specifically built to exploit test‑time inference, promising improved intelligence by taking extra time to respond.
- In a head‑to‑head test using the same murder‑mystery logic puzzle, OpenAI's o1‑preview produced a tightly reasoned, coherent solution, while DeepSeek's answer was less logical and harder to follow.
- The experiment suggests that the purported scaling advantage of test‑time inference may not hold for complex, ambiguous reasoning tasks that are not purely mathematical or scientific.
**Source:** [https://www.youtube.com/watch?v=zCvdrME4ErA](https://www.youtube.com/watch?v=zCvdrME4ErA)
**Duration:** 00:05:02

Sections
- [00:00:00](https://www.youtube.com/watch?v=zCvdrME4ErA&t=0s) **Testing Test‑Time Inference Claims** - The speaker examines OpenAI's alleged scaling law for test‑time inference by contrasting it with DeepSeek, a Chinese model built to improve reasoning through longer, parallel inference, and describes a logic‑focused evaluation using a detective‑style murder scenario.

Full Transcript
One of the things we need to talk about more is whether or not the OpenAI claim that test-time inference has a scaling law is true. I'm going to unpack that statement. Roughly speaking, the scaling law OpenAI claims to have unlocked is that if you take more time to think about a problem, if you run parallel paths with your AI system and come back, you're going to have a smarter answer, provided you can select the best response. And the reason we need to talk about whether that scales or not is that we have our first legitimately different competitor outside the big model-maker set. It's called DeepSeek, it came out of China, and it is specifically designed to scale intelligence through test-time inference. What I mean by that is that when you chat, it takes extra time to
respond. Now, I obviously went and immediately tried it, because I'm me, and I wanted to give it a problem that I thought would be out of left field, an interesting challenge. I did not want to give it a problem that would align with what was being reported as the model's strengths; I wanted to give it something different, to test new capabilities. DeepSeek was actually reported as being very, very good at mathematical and scientific problems, and less good at language and coding. I didn't really want to test language, coding, mathematics, or science. I wanted to test reasoning; I wanted to test logic. So I came up with a detective scenario. I actually gave a short synopsis of a murder-mystery problem to both OpenAI's o1-preview model and to DeepSeek, and I wanted to see, with the exact same scenario, what they would do. And I noticed a couple of things. One:
even though everyone is saying that DeepSeek is able to use the additional time it takes to respond to give a better answer, that was not true when it wasn't being tested on a scientific or mathematical question. This was a pure logic puzzle, with a lot of uncertainty, a lot of ambiguity, and conflicting evidence thrown in to make it difficult. Really, I was trying to measure whether the system could sort through all of that conflicting evidence and come back with a logical response.

And at the end of the day, when I compared the two, o1's response was a lot better. o1 came back with tightly argued, logical reasoning for the choices it had made. It had clearly examined all of the evidence I had given it in the murder-mystery prompt, thought about it, put it together, and come back with a really rational response. DeepSeek came back and named a different murder suspect. It spit out all of its reasoning tokens, so I was just sitting there watching gray text come down the screen, and when it gave its response, it wasn't as logical. It could not articulate the reasons for the choices it had made with the same degree of coherence that o1 brought. If I were to go into the whole murder mystery and the whole response, it would take a long time, longer than you want to spend on this YouTube channel, so I'm summarizing for you.
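The "run parallel paths and select the best response" idea described at the top is essentially best-of-n sampling. Here is a minimal sketch of that selection loop, with a toy stand-in generator and scorer; both are hypothetical placeholders (a real system would sample an LLM and rank candidates with a verifier, reward model, or majority vote):

```python
import random

def best_of_n(prompt, generate, score, n=8, seed=0):
    """Draw n candidate answers (sequentially here, for simplicity;
    in practice the paths run in parallel) and return the candidate
    the scorer ranks highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a "model" that guesses numbers, and a scorer that
# prefers guesses close to a hidden target.
TARGET = 42

def toy_generate(prompt, rng):
    return rng.randint(0, 100)

def toy_score(answer):
    return -abs(answer - TARGET)

# The single-sample baseline draws the same first candidate, so the
# best-of-32 pick can never score worse under the same scorer.
single = toy_generate("who did it?", random.Random(0))
best = best_of_n("who did it?", toy_generate, toy_score, n=32)
assert toy_score(best) >= toy_score(single)
```

The point of the sketch is that more samples only help if the scorer can actually tell good answers from bad ones, which is exactly what an ambiguous logic puzzle makes hard.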
But I want to call it out, because I think we need better evals for these kinds of claims. If we are going to claim that the scaling law holds for inference time, we need evals that are outside standard knowledge domains. We need to not just say, here's the Riemann hypothesis, or here's a physics problem, or can you remember Dostoevsky and write a story like Dostoevsky. Those are one way to measure intelligence, but I wanted to look at something that actually measures novel intelligence and reasoning ability on a new problem. The problem I gave it was not a murder mystery from books; it was a net-new murder-mystery scenario.

And when I look at all of that, it reminds me that if we're going to make claims about how inference compute works, we need to be making equivalent commitments to inference-time evals, and if we're not, we're really not in a position to correctly compare the models that come out. So that's my thought. I was not super impressed with DeepSeek from that test. I realize it's only one test, and I'm not saying the people who are testing math and science problems are incorrect when they say it's great at those; it probably is. Every model has its pros and cons, but in this particular case, if you were testing pure reasoning, o1 clearly came out ahead. So give DeepSeek a try, compare it to o1-preview, and let me know what you think.
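One way to make the call for "better evals" concrete is a tiny rubric-based harness for a net-new logic puzzle, where each case ships with a ground-truth suspect and the clues a sound argument should cite. Everything here (the case fields, the one-point-per-clue scoring rule) is an illustrative assumption, not a real benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    prompt: str                     # the novel logic puzzle (elided here)
    expected_suspect: str           # ground truth written alongside the puzzle
    evidence_keys: list = field(default_factory=list)  # clues a sound argument cites

def score_response(case, response):
    """Crude rubric: 1 point for naming the right suspect, plus one
    point per piece of evidence the argument actually mentions."""
    text = response.lower()
    points = int(case.expected_suspect.lower() in text)
    points += sum(key.lower() in text for key in case.evidence_keys)
    return points

case = EvalCase(
    prompt="Who killed the butler? ...",  # puzzle text elided
    expected_suspect="the gardener",
    evidence_keys=["muddy boots", "broken lock", "alibi"],
)

good = "The gardener did it: the muddy boots and the broken lock rule out the alibi."
weak = "Probably the cook."
assert score_response(case, good) > score_response(case, weak)
```

Keyword matching is obviously far weaker than judging real coherence, but even a rubric this crude forces the comparison the video asks for: the same novel case, the same scoring rule, applied to every model's answer.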