Data Lockdown Threatens AI Training
Key Points
- Ilya Sutskever’s claim that “data is the new oil” is being tested by emerging trends showing data becoming increasingly locked away, forcing a rethink of AI data‑availability assumptions.
- OpenAI’s acquisition of Windsurf prompted Anthropic to cut off Claude model access to Windsurf, illustrating how competitive moves are deliberately restricting streams of user‑generated artificial data.
- Salesforce has barred Glean from accessing Slack messages, underscoring the growing practice of “ring‑fencing” valuable internal communications to protect proprietary AI‑training material.
- High‑profile lawsuits—such as the New York Times vs. OpenAI and Disney vs. Midjourney—signal a rising legal pushback against the use of copyrighted or proprietary content in AI models.
- Together, these data‑lockouts and legal pressures suggest a faster‑than‑expected depletion of usable training data, which could reshape the future trajectory of AI development.
Sections
- Data Lockdowns Challenge AI Assumptions - The speaker argues that active data lockouts are depleting trainable data even faster than Ilya Sutskever anticipated, with user‑generated artificial data emerging as a major renewable source that is now being fenced off.
- Data Lockdown and Synthetic Iteration - The speaker contends that corporate efforts to restrict large swaths of data (e.g., Disney, NYT) marginally impact model capabilities yet risk accelerating data scarcity, while highlighting a swift shift toward using AI‑generated synthetic data to iteratively improve training processes.
- Executives Cutting Data, Models Innovate - The speaker observes that in 2025 executives are increasingly shutting off data access, while AI developers counter this by employing synthetic-token reinforcement learning that makes real data less essential.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=v6tZ1Wcg-YU](https://www.youtube.com/watch?v=v6tZ1Wcg-YU) **Duration:** 00:06:59

Timestamps: [00:00:00](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=0s) Data Lockdowns Challenge AI Assumptions · [00:03:08](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=188s) Data Lockdown and Synthetic Iteration · [00:06:26](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=386s) Executives Cutting Data, Models Innovate
You know, Ilya Sutskever said data is the new oil, which is, I suppose, not a new thing to say, but it's Ilya, and he's one of the leading lights of the AI generation. And when he said it at NeurIPS in November 2024, we all paid attention.
There have been two major trend lines since then that make me think that Ilya needs to update his priors. I know, who am I to say that? I'm nobody to say that. So, let me just finish my thing.
Number one: data is getting locked off, actively. And this is something that's actually in favor of Ilya's contention, right? So Ilya's contention is that we're running out of data to train on, because there's only so much data in the world to train on. What if we're running out faster than he thought, because people are actively locking off sources of data? We see that in a few places. One: if you've been on this channel, you know OpenAI bought Windsurf. Anthropic cut off Windsurf's access to their Claude models, because you don't want to let those guys have the models, right? Then the model output tokens would go to their rival, OpenAI, and that would be a problem. It's a data lockoff. And this is for future data streams, for data that users generate in the tool, what I would call user-generated artificial data: it's artificial because it's coming from a model, but it's user-generated because you're prompting for it. I think it's one of the biggest sources of renewable data that we have. ChatGPT generates a ton of user-generated artificial data; I would hypothesize it's one of their greatest sources of data. And Anthropic did not want that valuable code data to go to OpenAI.
Okay.
Second development in that direction: Salesforce has reportedly cut the rapidly growing company Glean off from access to Slack messages. And why does that matter? Glean specializes in providing a single pane of glass where executives can see what's happening in the whole business, and they need Slack to do it. So Salesforce is cutting Glean off at the knees, because Salesforce views Slack data as highly valuable in the age of AI. And I would agree; I think it is. But Salesforce is making moves they didn't make last year, because they understand better at the executive level that they need to ring-fence that data to prevent people from getting a hold of it. And again, this is not necessarily user-generated artificial tokens in this case; this is just user-generated data. We're typing in Slack all day. Well, now Glean can't get a hold of it. I would expect those restrictions to continue to tighten.
The third one is on the legal side. I talked about the New York Times suing OpenAI, and some of the problems with that. Disney suing Midjourney is sort of the next-generation lawsuit here. They're basically saying that Midjourney allows users to create proprietary Disney characters. And without getting into the merits of the case, I think the story from the data side is essentially that Disney is trying to lock off huge chunks of the data landscape from model trainers, similar to what the New York Times is doing: chunks of the data landscape.
Now, individually, calling these "huge chunks" maybe isn't fair. They're huge in nominal terms, gigabytes and terabytes, maybe petabytes, but they're not huge on the scale of zettabytes, not huge on the scale of the internet. So I think if you removed them, the models would go on much as before.
So: three different ways in which we're cutting off access to data, making data effectively less common, less available, more of a resource that Ilya would say is not coming back and that we have to be careful with. Right? So this is in favor of Ilya's contention, and in fact suggests that we may be accelerating the scarcity of data.
But theme two is a little bit different. Theme two is this idea that we are taking data and we are very rapidly using machines to iterate on machine-generated data. Now, there was a very famous study done too long ago to matter, which means 2023 in AI terms, that suggested synthetic data generated by AI can't be reliably used to train AI.
That's not true anymore. You have reports coming out of multiple major model makers that synthetic tokens are useful in the process of training. And now we've gone the next step.
There are reports of automated, iterative improvements at multiple major model makers, driven by autonomous AI. I've seen leaks from both Anthropic and OpenAI on the internet that independently suggest they are working through the concept of AI-driven reinforcement learning. So instead of humans saying "is this a good response or not" and driving reinforcement learning, which trains the model on which responses to prefer, they're getting better responses out of the model by letting AI do the reinforcement learning on AI. AI taking over the training of AI, which, if you're wondering, is one of the big steps toward a generalized intelligence explosion, because if AI can do it, there's really no barrier except power, right? You just turn on more power and AI can do more of it.
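To make the idea concrete, here is a minimal, hypothetical sketch of AI-driven reinforcement learning in the RLAIF style the speaker describes: a judge model ranks candidate responses in place of human raters, producing preference pairs that a reward model or preference-tuning step could train on. Every function name and the toy scoring rule below are made up for illustration; the real model calls at Anthropic or OpenAI are not public.

```python
# Hypothetical sketch of AI-driven reinforcement learning (RLAIF-style).
# Real model calls are replaced with toy stand-ins so the sketch runs.

def generate_candidates(prompt, n=4):
    # Stand-in for sampling n responses from the model being trained.
    return [f"response {i} to {prompt!r}" for i in range(n)]

def judge_score(prompt, response):
    # Stand-in for a judge model scoring a response's quality; here a
    # toy heuristic (response length) just to keep the sketch runnable.
    return len(response)

def build_preference_pairs(prompt):
    """Rank candidates with the AI judge and emit (chosen, rejected) pairs."""
    candidates = generate_candidates(prompt)
    ranked = sorted(candidates, key=lambda r: judge_score(prompt, r), reverse=True)
    # Pair the judge's top pick against each weaker candidate, the shape
    # of training signal a preference-tuning step typically consumes.
    best = ranked[0]
    return [(best, worse) for worse in ranked[1:]]

pairs = build_preference_pairs("Explain recursion.")
print(len(pairs))  # 4 candidates yield 3 preference pairs
```

The key point of the design is that no human appears anywhere in the loop: the same family of models both generates and judges, which is why the only remaining bottleneck the speaker identifies is compute.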
And so, in that world, does data still matter, if you can use AI to generate synthetic tokens and then use AI to reinforcement-learn your way through? And the answer may be that you don't really need the data, but you do have alignment questions, and that's a separate conversation.
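The closed loop the speaker is asking about can be sketched as follows: a model generates synthetic samples from its current corpus, an AI judge filters out low-quality generations, and the survivors feed the next round of training, with no outside data required. All names and the filtering rule are hypothetical stand-ins, not any lab's actual pipeline.

```python
# Hypothetical sketch of a self-sustaining synthetic-data loop:
# generate, filter with an AI judge, absorb, repeat.

def synthesize(corpus, n=6):
    # Stand-in for a model generating new samples conditioned on its
    # current training corpus.
    return [f"synthetic sample {i} derived from {len(corpus)} docs"
            for i in range(n)]

def judge_keep(sample):
    # Stand-in for an AI judge discarding low-quality generations;
    # here a toy rule that keeps even-numbered samples.
    return int(sample.split()[2]) % 2 == 0

def iterate(corpus, rounds=3):
    """Run several generate-filter-absorb rounds; return the grown corpus."""
    for _ in range(rounds):
        kept = [s for s in synthesize(corpus) if judge_keep(s)]
        corpus = corpus + kept
    return corpus

corpus = iterate(["human-written seed doc"])
print(len(corpus))  # 1 seed doc + 3 kept samples per round x 3 rounds = 10
```

Note that the loop still starts from a human-written seed, which is where the alignment and quality questions the speaker defers would enter: the judge's standards, not fresh human data, determine what the corpus drifts toward.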
But it's one of those things where I've been watching the longer-term trend line since NeurIPS, the conference where Ilya talked about the data piece, and I've been asking: how is the data story evolving in 2025? I think the story is basically that executives are waking up and shutting data off wherever they can, and at the same time model makers are innovating and basically finding ways for it not to matter. They're finding ways for AI to drive reinforcement learning on synthetic tokens. So it's like, well, if you cut off the data, it doesn't matter. You tell me: where do you think the data story is going in the rest of 2025?