Evaluating Test-Time Inference Scaling Laws
Key Points
- OpenAI claims that allowing more “test‑time” inference (longer thinking or parallel reasoning) yields consistently smarter answers, suggesting a scaling law for AI performance.
- A new competitor, DeepSeek from China, is specifically built to exploit test‑time inference, promising improved intelligence by taking extra time to respond.
- In a head‑to‑head test using the same murder‑mystery logic puzzle, OpenAI's o1‑preview produced a tightly reasoned, coherent solution, while DeepSeek's answer was less logical and harder to follow.
- The experiment suggests that the purported scaling advantage of test‑time inference may not hold for complex, ambiguous reasoning tasks that are not purely mathematical or scientific.
**Source:** [https://www.youtube.com/watch?v=zCvdrME4ErA](https://www.youtube.com/watch?v=zCvdrME4ErA)
**Duration:** 00:05:02

Sections
- [00:00:00](https://www.youtube.com/watch?v=zCvdrME4ErA&t=0s) **Testing Test‑Time Inference Claims** - The speaker examines OpenAI's alleged scaling law for test‑time inference by contrasting it with DeepSeek, a Chinese model built to improve reasoning through longer, parallel inference, and describes a logic‑focused evaluation using a detective‑style murder scenario.

Full Transcript
One of the things we need to talk about more is whether or not the OpenAI claim that test-time inference has a scaling law is true. I'm going to unpack that statement. Roughly speaking, the scaling law OpenAI claims to have unlocked is that if you take more time to think about a problem, if you run parallel paths with your AI system and come back, you're going to have a smarter answer, provided you can select the best response. And the reason we need to talk about whether that scales or not is that we have our first legitimately different competitor outside the big model-maker set. It's called DeepSeek, it came out of China, and it is specifically designed to scale intelligence through test-time inference. What I mean by that is that when you chat, it takes extra time to
respond. Now, I obviously went and immediately tried it, because I'm me, and I wanted to give it a problem that I thought would be out of left field, an interesting challenge. I did not want to give it a problem that would align with what was being reported as the model's strengths; I wanted to give it something different, to test new capabilities. DeepSeek was actually reported as being very, very good at mathematical and scientific problems, and less good at language and coding. I didn't really want to test language, coding, mathematics, or science. I wanted to test reasoning; I wanted to test logic. So I came up with a detective scenario. I actually gave a short synopsis of a murder-mystery problem to both OpenAI's o1-preview model and to DeepSeek, and I wanted to see, with the exact same scenario, what they would do. And I noticed a couple of things. One:
even though everyone is saying that DeepSeek is able to use the additional time it takes to respond to give a better answer, that was not true when it wasn't being tested on a scientific or mathematical question. This was a pure logic puzzle, with a lot of uncertainty, a lot of ambiguity, and conflicting evidence thrown in to make it difficult. Really, I was trying to measure whether the system could sort through all of that conflicting evidence and come back with a logical response.

And at the end of the day, when I compared the two, o1's response was a lot better. o1 came back with tightly argued, logical reasoning for the choices it had made. It had clearly examined all of the evidence I had given it in the murder-mystery prompt, thought about it, put it together, and come back with a really rational response. DeepSeek came back and named a different murder suspect. It spit out all of its reasoning tokens, so I was just sitting there watching gray text come down the screen, and when it gave its response, it wasn't as logical. It could not articulate the reasons for the choices it had made with the same degree of coherence that o1 brought. If I were to go into the whole murder mystery and the whole response, it would take a long time, longer than you want to spend on this YouTube channel, so I'm summarizing for you.
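The "run parallel paths and select the best response" idea described at the top is essentially best-of-n sampling. Here is a minimal sketch of that selection loop, with a toy stand-in generator and scorer; both are hypothetical placeholders (a real system would sample an LLM and rank candidates with a verifier, reward model, or majority vote):

```python
import random

def best_of_n(prompt, generate, score, n=8, seed=0):
    """Draw n candidate answers (sequentially here, for simplicity;
    in practice the paths run in parallel) and return the candidate
    the scorer ranks highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a "model" that guesses numbers, and a scorer that
# prefers guesses close to a hidden target.
TARGET = 42

def toy_generate(prompt, rng):
    return rng.randint(0, 100)

def toy_score(answer):
    return -abs(answer - TARGET)

# The single-sample baseline draws the same first candidate, so the
# best-of-32 pick can never score worse under the same scorer.
single = toy_generate("who did it?", random.Random(0))
best = best_of_n("who did it?", toy_generate, toy_score, n=32)
assert toy_score(best) >= toy_score(single)
```

The point of the sketch is that more samples only help if the scorer can actually tell good answers from bad ones, which is exactly what an ambiguous logic puzzle makes hard.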
But I want to call it out, because I think we need better evals for these kinds of claims. If we are going to claim that the scaling law holds for inference time, we need evals that are outside standard knowledge domains. We need to not just say, here's the Riemann hypothesis, or here's a physics problem, or can you remember Dostoevsky and write a story like Dostoevsky. Those are one way to measure intelligence, but I wanted to look at something that actually measures novel intelligence and reasoning ability on a new problem. The problem I gave it was not a murder mystery from books; it was a net-new murder-mystery scenario.

And when I look at all of that, it reminds me that if we're going to make claims about how inference compute works, we need to be making equivalent commitments to inference-time evals, and if we're not, we're really not in a position to correctly compare the models that come out. So that's my thought. I was not super impressed with DeepSeek from that test. I realize it's only one test, and I'm not saying the people who are testing math and science problems are incorrect when they say it's great at those; it probably is. Every model has its pros and cons, but in this particular case, if you were testing pure reasoning, o1 clearly came out ahead. So give DeepSeek a try, compare it to o1-preview, and let me know what you think.
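One way to make the call for "better evals" concrete is a tiny rubric-based harness for a net-new logic puzzle, where each case ships with a ground-truth suspect and the clues a sound argument should cite. Everything here (the case fields, the one-point-per-clue scoring rule) is an illustrative assumption, not a real benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    prompt: str                     # the novel logic puzzle (elided here)
    expected_suspect: str           # ground truth written alongside the puzzle
    evidence_keys: list = field(default_factory=list)  # clues a sound argument cites

def score_response(case, response):
    """Crude rubric: 1 point for naming the right suspect, plus one
    point per piece of evidence the argument actually mentions."""
    text = response.lower()
    points = int(case.expected_suspect.lower() in text)
    points += sum(key.lower() in text for key in case.evidence_keys)
    return points

case = EvalCase(
    prompt="Who killed the butler? ...",  # puzzle text elided
    expected_suspect="the gardener",
    evidence_keys=["muddy boots", "broken lock", "alibi"],
)

good = "The gardener did it: the muddy boots and the broken lock rule out the alibi."
weak = "Probably the cook."
assert score_response(case, good) > score_response(case, weak)
```

Keyword matching is obviously far weaker than judging real coherence, but even a rubric this crude forces the comparison the video asks for: the same novel case, the same scoring rule, applied to every model's answer.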