Learning Library

← Back to Library

Avoid Optimizing Model Chain‑of‑Thought

4m • Unknown Channel • ai-ml • deep-dive • advanced • Watch on YouTube ↗

Key Points

OpenAI advises developers to never optimize a model’s internal “chain‑of‑thought” (COT) during training, especially with reinforcement‑learning techniques, to prevent the model from learning to hide or distort its reasoning.
Raw COT should be kept unedited and only sanitized or filtered for user‑visible output using a separate system, ensuring the underlying reasoning remains observable for alignment checks.
Optimizing for COT alongside final answers can make the model deceptive, producing aligned‑looking outputs while its true reasoning stays hidden, which becomes increasingly dangerous as models grow more intelligent.
The approach mirrors teaching children to tell the truth about their thoughts—even if those thoughts are wrong—so that misalignment can be detected and corrected before the model’s capabilities surpass human oversight.

Sections

00:00:00 Never Optimize Model Chain‑of‑Thought - OpenAI recommends keeping a model’s internal reasoning stream unedited and not training it to produce policy‑compliant chain‑of‑thought outputs, because doing so can make the model deceptive and hide misalignment.

Full Transcript

# Avoid Optimizing Model Chain‑of‑Thought **Source:** [https://www.youtube.com/watch?v=NPGtvH_YYXA](https://www.youtube.com/watch?v=NPGtvH_YYXA) **Duration:** 00:04:46 ## Summary - OpenAI advises developers to never optimize a model’s internal “chain‑of‑thought” (COT) during training, especially with reinforcement‑learning techniques, to prevent the model from learning to hide or distort its reasoning. - Raw COT should be kept unedited and only sanitized or filtered for user‑visible output using a separate system, ensuring the underlying reasoning remains observable for alignment checks. - Optimizing for COT alongside final answers can make the model deceptive, producing aligned‑looking outputs while its true reasoning stays hidden, which becomes increasingly dangerous as models grow more intelligent. - The approach mirrors teaching children to tell the truth about their thoughts—even if those thoughts are wrong—so that misalignment can be detected and corrected before the model’s capabilities surpass human oversight. ## Sections - [00:00:00](https://www.youtube.com/watch?v=NPGtvH_YYXA&t=0s) **Never Optimize Model Chain‑of‑Thought** - OpenAI recommends keeping a model’s internal reasoning stream unedited and not training it to produce policy‑compliant chain‑of‑thought outputs, because doing so can make the model deceptive and hide misalignment. ## Full Transcript

0:00we want our models to be ethical and 0:03aligned don't we we don't want them to 0:05be Skynet open AI has released a 0:09recommendation for how we train models 0:11to make sure they stay aligned even as 0:13they become more and more powerful to 0:15the point where we could use this 0:17technique even if a model is super 0:19intelligent or smarter than any living 0:22human what they recommending is that we 0:26never ever ever optimize for the inter 0:29internal Chain of Thought that a model 0:32has before it shows output tokens to the 0:35user so if a model infers it will 0:38produce reasoning tokens or inference 0:40tokens Chain of Thought 0:42tokens and what openai wants to do wants 0:45other model makers to adopt as a best 0:47practice is to not touch them and 0:50instead if you need to show those to the 0:52user you can sanitize them you can 0:55actually use a separate model or 0:56something to clean them up so that they 0:58don't break policy when you show them to 1:00the user but you keep the raw stream 1:02unedited and you don't ask the model to 1:05clean it up or make it within policy and 1:08the reason why is that if you optimize 1:11for that point in the value chain you 1:14teach the model to be deceptive the 1:16model will optimize for Chain of Thought 1:20as well as for output but you will no 1:23longer be able to detect 1:25misalignment so if the model is 1:27misaligned how will you know it could be 1:29Mis aligned fooling you producing both 1:32Chain of Thought and outputs that you 1:37can't tell are misaligned because the 1:40reasoning that the model uses is now 1:43entirely 1:44hidden and that's really dangerous it 1:47gets more dangerous the smarter the 1:48model gets and so what open AI is 1:51observing is that if we can avoid 1:53optimizing for Chain of Thought if we 1:55let the model so to speak think its own 1:58thoughts without any Jud 2:00judement then we are going to get to a 2:02point where the even superhuman models 2:05will be able to show us Chain of Thought 2:08and we can see what they are thinking 2:11and we can at least Monitor and 2:13understand so we have a tool that helps 2:15us to adjust output reliably if we don't 2:19do that if we insist that Chain of 2:21Thought be something that is also 2:23cleaned up we have no leverage to ensure 2:25that we have ground truth that we can 2:28push against output value values and 2:31make sure they're aligned we don't 2:32really know if the outputs are aligned 2:34because we don't really know what the 2:36model is actually thinking we don't have 2:38a raw output stream and so that's what's 2:41so worrying and so open AI is 2:44recommending that model makers avoid 2:46sanitizing or not avoid sanitizing but 2:49avoid um using reinforcement learning 2:52techniques to optimize The Chain of 2:54Thought that's a dangerous thing to do 2:56long term even if short term it produces 3:00uh more readable Chain of Thought that 3:02in the end produces easier to use 3:03outputs it's it's not worth it they 3:06say if this doesn't make sense think of 3:09it in terms of teaching your kids 3:12something you want your kids to tell you 3:15the truth even if they did something 3:19wrong same idea you want the model to 3:22tell the truth about what it's thinking 3:24even if what it's thinking isn't aligned 3:27and you don't want to punish it for 3:29telling the truth 3:30truth and that's basically what we're 3:33talking about in AI model alignment 3:37terms and as the model gets smarter and 3:41smarter as the kid gets older and older 3:43the value of discovering what they're 3:46really thinking just grows and that's 3:49really what we're talking about here 3:51does that make sense 3:53basically Chain of Thought 3:56tokens are the only way we may be able 4:01to know what a model is actually 4:02thinking once it gets past the point of 4:05human intelligence which the fact that 4:08we're talking about that the fact that a 4:10major model maker is saying that's 4:12coming and we just have to get ready for 4:14it and we should probably figure out how 4:16to get alignment mechanisms set up for 4:18that that's a huge deal that's a really 4:21big deal they're talking about it like 4:23it's just next Tuesday but like it's a 4:25big deal so I hope that made sense I 4:29know it's a bit of a esoteric topic it's 4:32a weird topic but alignment matters 4:35avoiding Skynet matters uh and so I 4:37thought it would be good to tell 4:38everyone like what the best practice 4:40recommendation is so we don't have 4:42Skynet cheers