Evaluating Test-Time Inference Scaling Laws

Key Points

  • OpenAI claims that allowing more “test‑time” inference (longer thinking or parallel reasoning) yields consistently smarter answers, suggesting a scaling law for AI performance.
  • A new competitor, DeepSeek from China, is specifically built to exploit test‑time inference, promising improved intelligence by taking extra time to respond.
  • In a head‑to‑head test using the same murder‑mystery logic puzzle, OpenAI’s o1‑preview model produced a tightly reasoned, coherent solution, while DeepSeek’s answer named a different suspect and was less logical and harder to follow.
  • The experiment suggests that the purported scaling advantage of test‑time inference may not hold for complex, ambiguous reasoning tasks that are not purely mathematical or scientific.
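The “run parallel paths and select the best response” idea from the first bullet can be sketched as best‑of‑n sampling. This is a minimal illustration, not OpenAI’s actual method: the generator and scorer below are toy stand‑ins for a model that produces candidate answers and a verifier that ranks them.

```python
import random

def best_of_n(generate, score, prompt, n=5, seed=0):
    """Sample n candidate answers (the 'parallel paths') and keep the
    highest-scoring one -- the 'select the best response' step."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a 'model' that guesses numbers, and a scorer that
# prefers guesses close to a hidden target of 42.
def toy_generate(prompt, rng):
    return rng.randint(0, 100)

def toy_score(answer):
    return -abs(answer - 42)

best = best_of_n(toy_generate, toy_score, "guess the number", n=20)
print(best)
```

The point of the sketch is only that more samples can't make the selected answer worse under the same scorer, which is the intuition behind the claimed test‑time scaling law, provided the scorer can actually tell good answers from bad ones.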

Full Transcript

**Source:** [https://www.youtube.com/watch?v=zCvdrME4ErA](https://www.youtube.com/watch?v=zCvdrME4ErA)
**Duration:** 00:05:02

Sections:

- [00:00:00](https://www.youtube.com/watch?v=zCvdrME4ErA&t=0s) **Testing Test‑Time Inference Claims** — the speaker examines OpenAI’s claimed scaling law for test‑time inference by contrasting it with DeepSeek, a Chinese model built to improve reasoning through longer, parallel inference, and describes a logic‑focused evaluation using a detective‑style murder scenario.
[0:00] One of the things that we need to talk more about is whether or not the OpenAI claim that test‑time inference has a scaling law is true. I'm going to unpack that statement. Roughly speaking, the scaling law that OpenAI is claiming they've unlocked is that if you take more time to think about the problem, if you run parallel paths with your AI system and you come back, you're going to have a smarter answer, provided you can select the best response.

[0:28] The reason why we need to talk about whether that scales or not is because we have our first legitimately different competitor outside the big model maker set. It's called DeepSeek, it came out of China, and it is specifically designed to scale intelligence through test‑time inference. What I mean by that is when you chat, it takes extra time to respond.

[0:53] Now, I obviously went and immediately tried it, because I'm me, and I wanted to give it a problem that I thought would be out of left field, an interesting challenge. I did not want to give it a problem that would align with what was being reported as the model's strengths; I wanted to give it something different to test new capabilities. DeepSeek was reported as being very, very good at mathematical and scientific problems, and less good at language and coding. I didn't really want to test language, coding, mathematics, or science. I wanted to test reasoning, I wanted to test logic, and so I came up with a detective scenario.

[1:39] I gave a short synopsis of a murder‑mystery problem to both OpenAI's o1‑preview model and DeepSeek, and I wanted to see what they would do with the exact same scenario. I noticed a couple of things. One: even though everyone is saying that DeepSeek is able to use the additional time it takes to respond to give a better answer, that was not true when it wasn't being tested with a scientific or mathematical question. This was just a pure logic puzzle, with a lot of uncertainty, a lot of ambiguity, and conflicting evidence thrown in to make it difficult. Really, I was trying to measure whether the system could sort through all of that conflicting evidence and come back with a logical response.

[2:32] At the end of the day, when I compared the two, o1's response was a lot better. o1 came back with tightly argued, logical reasoning for the choices it had made. It had clearly examined all of the evidence I had given it in the murder‑mystery prompt, thought about it, put it together, and it came back with a really rational response. DeepSeek came back and named a different murder suspect. It spit out all of its reasoning tokens, so I was just sitting there watching gray text come down the screen, and when it gave its response, it wasn't as logical. It could not articulate the reasons for the choices it had made with the same degree of coherence that o1 brought.

[3:19] If I were to go through the whole murder mystery and the whole response, it would take a long time, longer than you want to spend on this YouTube channel, so I'm summarizing for you. But I want to call it out because I think we need better evals for these kinds of claims. If we are going to claim that the scaling law holds for inference time, we need evals that are outside standard knowledge domains. We need to not just say "here's the Riemann hypothesis" or "here's a physics problem," or "can you remember Dostoevsky and write a story like Dostoevsky?" Those are one way to measure intelligence, but I wanted to look at something that actually measures novel intelligence and reasoning ability on a new problem. So the problem I gave it was not a murder mystery from books; it was a net‑new murder‑mystery scenario.

[4:10] When I look at all of that, it reminds me that if we're going to make claims about how inference compute works, we need to be making equivalent commitments to inference‑time evals, and if we're not, we're really not in a position to correctly compare the models that come out. So that's my thought. I was not super impressed with DeepSeek from that test. I realize it's only one test, and I'm not saying that the people testing math and science problems are incorrect when they say it's great at those; it probably is. Every model has its pros and cons, but in this particular case, if you were testing just reasoning, o1 clearly came out ahead. So give DeepSeek a try, compare it to o1‑preview, and let me know what you think.
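The call for inference‑time evals could start as small as a harness that runs the same out‑of‑distribution prompt through each model and records the answer alongside wall‑clock "thinking" time. This is a hypothetical sketch: the model callables here are mock stand‑ins, and in real use they would wrap API calls to o1‑preview and DeepSeek.

```python
import time

def run_eval(models, prompt):
    """Run the same prompt through each model and record the answer
    and how long the model took to respond (wall-clock seconds)."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        answer = model(prompt)
        elapsed = time.perf_counter() - start
        results[name] = {"answer": answer, "seconds": elapsed}
    return results

# Mock stand-ins; a real harness would call each provider's API here.
mock_models = {
    "o1-preview": lambda p: "Suspect A, because the alibi conflicts with ...",
    "deepseek": lambda p: "Suspect B, because ...",
}

report = run_eval(mock_models, "A net-new murder-mystery logic puzzle ...")
for name, r in report.items():
    print(name, "->", r["answer"])
```

Recording latency next to each answer is what makes this an inference‑time eval rather than a plain accuracy benchmark: it lets you ask whether extra thinking time actually bought a better response.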