
Data Lockdown Threatens AI Training

Key Points

  • Ilya Sutskever’s claim that “data is the new oil” is being challenged by emerging trends that suggest data is becoming increasingly locked away, forcing a rethink of AI data‑availability assumptions.
  • OpenAI’s acquisition of Windsurf prompted Anthropic to cut off Windsurf’s access to its Claude models, illustrating how competitive moves are deliberately restricting streams of user‑generated artificial data.
  • Salesforce has barred Glean from accessing Slack messages, underscoring the growing practice of “ring‑fencing” valuable internal communications to protect proprietary AI‑training material.
  • High‑profile lawsuits—such as the New York Times vs. OpenAI and Disney vs. Midjourney—signal a rising legal pushback against the use of copyrighted or proprietary content in AI models.
  • Together, these data‑lockouts and legal pressures suggest a faster‑than‑expected depletion of usable training data, which could reshape the future trajectory of AI development.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=v6tZ1Wcg-YU](https://www.youtube.com/watch?v=v6tZ1Wcg-YU)

**Duration:** 00:06:59

## Sections

- [00:00:00](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=0s) **Data Lockdowns Challenge AI Assumptions** - The speaker argues that data lockouts and the rise of user‑generated artificial data point to a faster depletion of trainable data than Ilya Sutskever anticipated.
- [00:03:08](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=188s) **Data Lockdown and Synthetic Iteration** - The speaker contends that corporate efforts to restrict large swaths of data (e.g., Disney, NYT) only marginally impact model capabilities yet risk accelerating data scarcity, while highlighting a swift shift toward using AI‑generated synthetic data to iteratively improve training.
- [00:06:26](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=386s) **Executives Cutting Data, Models Innovate** - The speaker observes that in 2025 executives are increasingly shutting off data access, while AI developers counter this by employing synthetic‑token reinforcement learning that makes real data less essential.

## Full Transcript
You know, Ilya Sutskever said data is the new oil, which is, I suppose, not a new thing to say, but it's Ilya, and he's one of the leading lights of the AI generation. And when he said it at NeurIPS in November 2024, we all paid attention.

There have been two major trend lines since then that make me think that Ilya needs to update his priors. I know, who am I to say that? I'm nobody to say that. So let me just finish my thing.

Number one: data is getting locked off, actively. And this is something that's actually in favor of Ilya's contention, right? Ilya's contention is that we're running out of data to train on, because there's only so much data in the world to train on. What if we're running out faster than he thought, because people are actively locking off sources of data? We see that in a few places.

One: if you've been on this channel, you know OpenAI bought Windsurf. Anthropic cut off Windsurf's access to their Claude models, because you don't want to let those guys get the models, right? Then the model output tokens would go to their rival OpenAI, and that would be a problem. It's a data lockoff. And this is for future data streams; this is for data that users generate in the tool, what I would call user-generated artificial data: it's artificial because it's coming from a model, but it's user-generated because you're prompting for it. I think it's one of the biggest sources of renewable data that we have. ChatGPT generates a ton of user-generated artificial data; it's one of their greatest sources of data, I would hypothesize. And Anthropic did not want that valuable code data to go to OpenAI.

Okay.
Second development in that direction: Salesforce has reportedly cut the rapidly growing company Glean off from getting access to Slack messages. And why does that matter? Glean specializes in providing a single pane of glass where executives can see what's happening in the whole business. They need Slack to do it. And so Salesforce is cutting Glean off at the knees, because Salesforce views Slack data as highly valuable in the age of AI. And I would agree; I think it is. But Salesforce is making moves they didn't make last year, because they understand better at the executive level that they need to ring-fence that data in order to prevent people from getting a hold of it. And again, this is not necessarily user-generated artificial tokens in this case; this is just user-generated data. We're typing in Slack all day. Well, now Glean can't get a hold of it. I would expect those restrictions to continue to tighten.

Third one is on the legal side. I talked about the New York Times suing OpenAI and some of the problems with that. Disney suing Midjourney is sort of the next-generation lawsuit here. They're basically saying that Midjourney allows users to create proprietary Disney characters. And without getting into the merits of the case, I think the story from the data side is essentially that Disney is trying to lock off huge chunks of the data landscape to trainers and models, similar to what the New York Times is doing: chunks of the data landscape. Now, individually, maybe "huge chunks" isn't fair.
They're huge in terms of, you know, nominal value: gigabytes and so on, terabytes, petabytes maybe. But they're not huge on the scale of zettabytes, not huge on the scale of the internet. So I think if you removed them, the models would go on much as before.

So: three different ways in which we're cutting off access to data, making data effectively less common, less available, more of a resource that Ilya would say is not coming back and that we have to be careful with. Right? So this is in favor of Ilya's contention, and in fact suggests that we may be accelerating the scarcity of data.

But theme two is a little bit different. Theme two is this idea that we are taking data, and we are very rapidly using machines to iterate on machine-generated data. Now, there was a very famous study done too long ago to matter (which means 2023, in AI terms) that suggested that synthetic data generated by AI can't be reliably used by AI. That's not true anymore. You have reports coming out of multiple major model makers that synthetic tokens are useful in the process of training. And now we've gone the next step. There are reports of automated iterative improvements at multiple major model makers, driven by autonomous AI. I've seen leaks from both Anthropic and from OpenAI on the internet that both suggest, independently, that they are working through the concept of AI-driven reinforcement learning. So instead of humans saying "is this a good response or not" and driving reinforcement learning, which trains the model what responses to prefer, they're getting better responses out of the model by letting AI do the reinforcement learning on AI.
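The loop the speaker is describing, where an AI judge rather than a human ranks model outputs and the rankings feed back into training, can be sketched in miniature. This is a toy illustration under stated assumptions, not any lab's actual pipeline: the generator, the judging heuristic, and all function names are invented for the example, and a real system would use a learned reward or preference model as the judge.

```python
import random

# Toy sketch of AI-driven reinforcement learning ("RLAIF"-style):
# an AI judge, rather than a human, ranks model outputs, and the
# preferred outputs become preference data for further training.
# Everything here (names, scoring rule) is invented for illustration.

def generate_candidates(prompt, n=4):
    """Stand-in for sampling n responses from a model."""
    return [f"{prompt} -> answer #{i} (seed {random.random():.2f})"
            for i in range(n)]

def ai_judge(response):
    """Stand-in for a judge model; returns a scalar, higher is better."""
    return len(response)  # trivial placeholder for a learned reward model

def rlaif_step(prompt, preference_data):
    """One loop iteration: sample, judge, record a (chosen, rejected) pair."""
    candidates = generate_candidates(prompt)
    ranked = sorted(candidates, key=ai_judge, reverse=True)
    preference_data.append((prompt, ranked[0], ranked[-1]))
    return preference_data

preferences = []
for p in ["explain oil", "explain data"]:
    rlaif_step(p, preferences)
# preferences now holds (prompt, chosen, rejected) triples that could
# drive a DPO/RLHF-style update with no human labeler in the loop.
```

The point of the sketch is the shape of the loop: once the judge is itself a model, the whole sample-rank-train cycle can run without waiting on human raters.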
AI taking over the training of AI, which, if you're wondering, is one of the big steps toward a generalized intelligence explosion, because if AI can do it, there's really no barrier except power, right? You just turn on more power and AI can do more of it.

And so, in that world, does data still matter, if you can use AI to generate synthetic tokens and then use AI to reinforcement-learn your way through? And the answer may be: you don't really need the data. But you do have alignment questions, and that's a separate conversation.

But it's one of those things where I've been watching the longer-term trend line since NeurIPS, which is the conference Ilya spoke at where he talked about the data piece, and I've been asking: how is the data story evolving in 2025? And I think the story is basically that executives are waking up and shutting data off wherever they can, and at the same time, model makers are innovating and basically finding ways for it not to matter. They're finding ways for AI to drive reinforcement learning on synthetic tokens. So it's like: well, if you cut off the data, it doesn't matter. You tell me: where do you think the data story is going in the rest of 2025?
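As a back-of-envelope check on the earlier scale argument (that individually locked-off corpora are petabyte-class, while web-scale training data is zettabyte-class), here is the arithmetic with hypothetical placeholder figures; neither number is a measured corpus size.

```python
# Hypothetical sizes only; chosen to illustrate orders of magnitude,
# not to describe any real publisher's archive or any real training set.
PB = 10**15  # bytes in a petabyte
ZB = 10**21  # bytes in a zettabyte

locked_off_corpus = 2 * PB   # assume one publisher's archive: a few petabytes
web_scale_data = 1 * ZB      # assume a zettabyte-class web-scale training pool

fraction_removed = locked_off_corpus / web_scale_data
print(f"share of the pool removed: {fraction_removed:.6%}")
```

Under these assumptions the locked-off corpus is millionths of the pool, which is the sense in which the models "would go on much as before" if it were removed.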