Data Lockdown Threatens AI Training
Key Points
- Ilya Sutskever’s claim that “data is the new oil” is being tested by emerging trends showing data becoming increasingly locked away, forcing a rethink of AI data‑availability assumptions.
- OpenAI’s acquisition of Windsurf prompted Anthropic to cut off Claude model access to Windsurf, illustrating how competitive moves are deliberately restricting streams of user‑generated artificial data.
- Salesforce has barred Glean from accessing Slack messages, underscoring the growing practice of “ring‑fencing” valuable internal communications to protect proprietary AI‑training material.
- High‑profile lawsuits—such as the New York Times vs. OpenAI and Disney vs. Midjourney—signal a rising legal pushback against the use of copyrighted or proprietary content in AI models.
- Together, these data‑lockouts and legal pressures suggest a faster‑than‑expected depletion of usable training data, which could reshape the future trajectory of AI development.
Sections
- Data Lockdowns Challenge AI Assumptions - The speaker argues that active data lockouts are depleting trainable data even faster than Ilya Sutskever anticipated, with user‑generated artificial data emerging as a major renewable source that is now being fenced off.
- Data Lockdown and Synthetic Iteration - The speaker contends that corporate efforts to restrict large swaths of data (e.g., Disney, NYT) marginally impact model capabilities yet risk accelerating data scarcity, while highlighting a swift shift toward using AI‑generated synthetic data to iteratively improve training processes.
- Executives Cutting Data, Models Innovate - The speaker observes that in 2025 executives are increasingly shutting off data access, while AI developers counter this by employing synthetic-token reinforcement learning that makes real data less essential.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=v6tZ1Wcg-YU](https://www.youtube.com/watch?v=v6tZ1Wcg-YU) **Duration:** 00:06:59

Timestamps: [00:00:00](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=0s) Data Lockdowns Challenge AI Assumptions · [00:03:08](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=188s) Data Lockdown and Synthetic Iteration · [00:06:26](https://www.youtube.com/watch?v=v6tZ1Wcg-YU&t=386s) Executives Cutting Data, Models Innovate
You know, Ilya Sutskever said data is the new oil, which is, I suppose, not a new thing to say, but it's Ilya, and he's one of the leading lights of the AI generation. And when he said it at NeurIPS in November 2024, we all paid attention.
There have been two major trend lines since then that make me think that Ilya needs to update his priors. I know, who am I to say that? I'm nobody to say that. So, let me just finish my thing.
Number one: data is getting locked off, actively. And this is something that's actually in favor of Ilya's contention, right? So Ilya's contention is that we're running out of data to train on, because there's only so much data in the world to train on. What if we're running out faster than he thought, because people are actively locking off sources of data? We see that in a few places. One: if you've been on this channel, you know OpenAI bought Windsurf. Anthropic cut off Windsurf's access to their Claude models, because you don't want to let those guys have the models, right? Then the model output tokens would go to their rival, OpenAI, and that would be a problem. It's a data lockoff. And this is for future data streams, for data that users generate in the tool, what I would call user-generated artificial data: it's artificial because it's coming from a model, but it's user-generated because you're prompting for it. I think it's one of the biggest sources of renewable data that we have. ChatGPT generates a ton of user-generated artificial data; I would hypothesize it's one of their greatest sources of data. And Anthropic did not want that valuable code data to go to OpenAI.
Okay.
Second development in that direction: Salesforce has reportedly cut the rapidly growing company Glean off from access to Slack messages. And why does that matter? Glean specializes in providing a single pane of glass where executives can see what's happening in the whole business, and they need Slack to do it. So Salesforce is cutting Glean off at the knees, because Salesforce views Slack data as highly valuable in the age of AI. And I would agree; I think it is. But Salesforce is making moves they didn't make last year, because they understand better at the executive level that they need to ring-fence that data to prevent people from getting a hold of it. And again, this is not necessarily user-generated artificial tokens in this case; this is just user-generated data. We're typing in Slack all day. Well, now Glean can't get a hold of it. I would expect those restrictions to continue to tighten.
The third one is on the legal side. I talked about the New York Times suing OpenAI, and some of the problems with that. Disney suing Midjourney is sort of the next-generation lawsuit here. They're basically saying that Midjourney allows users to create proprietary Disney characters. And without getting into the merits of the case, I think the story from the data side is essentially that Disney is trying to lock off huge chunks of the data landscape from model trainers, similar to what the New York Times is doing: chunks of the data landscape.
Now, individually, calling these "huge chunks" maybe isn't fair. They're huge in nominal terms, gigabytes and terabytes, maybe petabytes, but they're not huge on the scale of zettabytes, not huge on the scale of the internet. So I think if you removed them, the models would go on much as before.
So: three different ways in which we're cutting off access to data, making data effectively less common, less available, more of a resource that Ilya would say is not coming back and that we have to be careful with. Right? So this is in favor of Ilya's contention, and in fact suggests that we may be accelerating the scarcity of data.
But theme two is a little bit different. Theme two is this idea that we are taking data and we are very rapidly using machines to iterate on machine-generated data. Now, there was a very famous study done too long ago to matter, which means 2023 in AI terms, that suggested synthetic data generated by AI can't be reliably used to train AI.
That's not true anymore. You have reports coming out of multiple major model makers that synthetic tokens are useful in the process of training. And now we've gone the next step.
There are reports of automated, iterative improvements at multiple major model makers, driven by autonomous AI. I've seen leaks from both Anthropic and OpenAI on the internet that independently suggest they are working through the concept of AI-driven reinforcement learning. So instead of humans saying "is this a good response or not" and driving reinforcement learning, which trains the model on which responses to prefer, they're getting better responses out of the model by letting AI do the reinforcement learning on AI. AI taking over the training of AI, which, if you're wondering, is one of the big steps toward a generalized intelligence explosion, because if AI can do it, there's really no barrier except power, right? You just turn on more power and AI can do more of it.
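To make the idea concrete, here is a minimal, hypothetical sketch of AI-driven reinforcement learning in the RLAIF style the speaker describes: a judge model ranks candidate responses in place of human raters, producing preference pairs that a reward model or preference-tuning step could train on. Every function name and the toy scoring rule below are made up for illustration; the real model calls at Anthropic or OpenAI are not public.

```python
# Hypothetical sketch of AI-driven reinforcement learning (RLAIF-style).
# Real model calls are replaced with toy stand-ins so the sketch runs.

def generate_candidates(prompt, n=4):
    # Stand-in for sampling n responses from the model being trained.
    return [f"response {i} to {prompt!r}" for i in range(n)]

def judge_score(prompt, response):
    # Stand-in for a judge model scoring a response's quality; here a
    # toy heuristic (response length) just to keep the sketch runnable.
    return len(response)

def build_preference_pairs(prompt):
    """Rank candidates with the AI judge and emit (chosen, rejected) pairs."""
    candidates = generate_candidates(prompt)
    ranked = sorted(candidates, key=lambda r: judge_score(prompt, r), reverse=True)
    # Pair the judge's top pick against each weaker candidate, the shape
    # of training signal a preference-tuning step typically consumes.
    best = ranked[0]
    return [(best, worse) for worse in ranked[1:]]

pairs = build_preference_pairs("Explain recursion.")
print(len(pairs))  # 4 candidates yield 3 preference pairs
```

The key point of the design is that no human appears anywhere in the loop: the same family of models both generates and judges, which is why the only remaining bottleneck the speaker identifies is compute.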
And so, in that world, does data still matter, if you can use AI to generate synthetic tokens and then use AI to reinforcement-learn your way through? And the answer may be that you don't really need the data, but you do have alignment questions, and that's a separate conversation.
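The closed loop the speaker is asking about can be sketched as follows: a model generates synthetic samples from its current corpus, an AI judge filters out low-quality generations, and the survivors feed the next round of training, with no outside data required. All names and the filtering rule are hypothetical stand-ins, not any lab's actual pipeline.

```python
# Hypothetical sketch of a self-sustaining synthetic-data loop:
# generate, filter with an AI judge, absorb, repeat.

def synthesize(corpus, n=6):
    # Stand-in for a model generating new samples conditioned on its
    # current training corpus.
    return [f"synthetic sample {i} derived from {len(corpus)} docs"
            for i in range(n)]

def judge_keep(sample):
    # Stand-in for an AI judge discarding low-quality generations;
    # here a toy rule that keeps even-numbered samples.
    return int(sample.split()[2]) % 2 == 0

def iterate(corpus, rounds=3):
    """Run several generate-filter-absorb rounds; return the grown corpus."""
    for _ in range(rounds):
        kept = [s for s in synthesize(corpus) if judge_keep(s)]
        corpus = corpus + kept
    return corpus

corpus = iterate(["human-written seed doc"])
print(len(corpus))  # 1 seed doc + 3 kept samples per round x 3 rounds = 10
```

Note that the loop still starts from a human-written seed, which is where the alignment and quality questions the speaker defers would enter: the judge's standards, not fresh human data, determine what the corpus drifts toward.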
But it's one of those things where I've been watching the longer-term trend line since NeurIPS, the conference where Ilya talked about the data piece, and I've been asking: how is the data story evolving in 2025? I think the story is basically that executives are waking up and shutting data off wherever they can, and at the same time model makers are innovating and basically finding ways for it not to matter. They're finding ways for AI to drive reinforcement learning on synthetic tokens. So it's like, well, if you cut off the data, it doesn't matter. You tell me: where do you think the data story is going in the rest of 2025?