# Protecting Data for AI Adoption

**Source:** [https://www.youtube.com/watch?v=LyfG7SGRiZA](https://www.youtube.com/watch?v=LyfG7SGRiZA)
**Duration:** 00:15:07

## Summary

- AI’s power comes from data, so protecting that data is the first critical step before integrating AI into products or business processes.
- The evolution of data storage, from ancient writings to relational databases (Codd 1970) to server farms, cloud, hybrid cloud, data lakes, and lakehouses, has continually improved how we keep and retrieve information.
- Modern data ecosystems still rely on structured data stored in databases, but they also incorporate less-structured data in lakes and lakehouses to support diverse AI workloads.
- Effective AI initiatives require specialized roles: data engineers design and manage the data architecture, while data scientists transform and analyze the data to generate insights.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=0s) **Data Foundations for AI Security** - The speaker stresses that AI depends on data and must be safeguarded, while sketching the historical progression from ancient record-keeping to relational databases as the groundwork for today’s AI-driven business initiatives.
- [00:03:02](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=182s) **Evolution of Data Roles to AI** - The speaker outlines the progression from data engineers, scientists, and admins managing and securing data, to modern business applications and AI systems that now extract, train, and operationalize data.
- [00:06:13](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=373s) **Classifying Data and Controlling Access** - The speaker stresses that identifying the sensitivity of data is the first step to protection and recommends using role-based permissions instead of direct access to manage how users and systems interact with that data.
- [00:09:18](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=558s) **Ensuring Identity Management for Access** - The speaker explains that managing access requires authenticating and authorizing users, including privileged accounts, through robust identity management, applying least-privilege principles, and avoiding shared or generic IDs.
- [00:12:25](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=745s) **Governance, Risk, and Encryption Strategies** - The speaker links low risk to reduced monitoring, outlines the comprehensive governance umbrella covering data classification, cataloging, and identity/access controls, and emphasizes encrypting data with independently managed keys to render stolen information useless.

## Full Transcript
Howdy everyone.
If you're like me, everywhere you're turning now,
you're hearing about AI this, AI that. How do I get AI into automation?
How do I leverage and use AI in my products?
How do I use it in my business?
The thing about AI is AI doesn't exist without data.
You have to have data.
And the thing that you need to think about is, how am I going to protect that data?
So what I want to talk about now is some of the fundamentals that you can use
to protect your data as you start to use and build out AI in your business.
Now, information and data have been around pretty much
since the beginning of human history.
We wrote on hieroglyphs, we wrote scrolls, we wrote papers in books,
books went in libraries.
We tried to access all that. In the 60s, when mainframes and computers
started really getting into the mainstream of business,
we really started formalizing how we stored data.
So you had the
integrated data systems, you had information management systems.
These all ran to store data, but they weren't very good
at retrieving that data. In 1970, E.F.
Codd from IBM actually wrote the seminal paper around
relational database management.
And this was the first time where we really had data
that we could retrieve easily and use it for business purposes.
And that's really the foundation by which everything that we're doing today
is built off of.
So when we started back in the 60s
with stuff, it was basically structured data.
We knew
exactly what it looked like, we knew what the fields were.
It was very organized into a database.
So we had structured data and we had a database.
Also, as we started expanding
and growing, those databases became based on a server
so that we could access all our data off the server. Servers became overloaded,
so we started distributing the data
to many servers. That evolved into cloud.
And of course the evolution just keeps going.
And now we're into hybrid cloud.
All of these models basically provide ways that we can store data.
Now we've done certain things, like we've taken,
you know,
a little bit of data structure here, and we've built data lakes on top of that.
We realized we still
wanted to have some of the benefits of using databases and servers,
so we expanded that and have the lakehouse.
But at the
end of the day, it's really all about data.
And data is stored in some sort of a system.
Whatever the system is, we store data.
Now we have a user
and users want to extract data and use that data.
They're going to query it for information they need.
They're going to write reports. Whatever it is.
People pull information out of data to use. Now,
they can't just take normal data that's just been dumped in there and use it.
It has to be manipulated.
It has to be stored and structured in certain ways.
So that brings in the need to have engineers that work on that data.
So this is data engineers, people that can go and manipulate the structure of the data.
We have data scientists.
So data scientists go in and they work on the data.
There's a lot of people that can come in and interact.
And we also have admins
that operate on the data.
So all of these are coming in.
They're changing data, manipulating data.
They're making it so that a user can get the data they need.
Now we've also evolved to the point where we have business applications
that we want to run and work on data,
either to manipulate it, change it, or read it and use it.
So we have our business applications and they're also interacting with data.
So this is all really good.
We've come from our evolution. We've built data. We know how to store it.
We have all sorts of models for storing it.
So now if we fast forward a little bit, in the last,
you know, decade, couple of decades, for a while
we've also worried about the security of this data.
So we really are worrying about whether somebody like
David Lightman comes in, hacks into our database, and breaches it.
They breach the data, they steal the data.
There's ransomware put against it.
And so all of this
we have known what to do with and we've built systems around this.
Now let's get up to where we're at today.
Now we have AI systems.
These have come into play.
And we have AI that needs to extract data out.
We're training models.
We're building systems out of it.
These can be data that we want to train from.
It can be vector databases.
It can be a whole set of kinds of data that we need for our AI systems.
We also may have AI data
that needs to actually interact with our business processes,
because we want it to flow back in, into our data and manipulate data.
Look at enterprise data, extract that out.
This supports our RAG models of AI.
All of this supports our gen AI systems.
Whatever it is, we're using automation, whatever that is.
We have now introduced AI into this.
With AI, we also have security concerns around that as well.
There are things that people can come and do.
They can do data poisoning, so it poisons the data
that we use to train, and that manipulates how AI works.
You know, there's lots of different things as we talk about data
that we need to be concerned about and how it's going to operate,
not just in our normal business operations, but now as we're
starting to leverage AI more and more, how are we going to protect that?
So how do we build our walls around our data so that we can protect it?
So what I want to talk about is go through a few strategies,
some fundamental strategies around protecting data
that you can use to make sure that as you're engaging AI and you're building out
all these systems, you're at least being aware of what it is
that you need to do to make sure the data you're building off of is being protected.
So let's talk about the strategies
that we can use for protecting our data.
So, the first one and this is probably the simplest and sounds
the simplest and the most fundamental is classification of data.
This is extremely important.
And what this means is do you understand
what kind of data you have that you're extracting out.
Is it sensitive personal information?
Is it personally identifiable information?
Do we have confidential information?
What kind of sensitive information do we need to be aware of
so we know how to protect it?
It seems easy, but this is one thing that actually often times gets overlooked.
You don't know what kind of data you have,
so you don't know how and what you should be protecting. So
data classification is the first thing that you need to be aware of.
What kind of data do you have?
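
The classification step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Sensitivity` levels, their ordering, and the `FIELD_LABELS` catalog are assumptions made up for the example, and a real catalog would come from a governance tool.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels; the ordering is an assumption."""
    PUBLIC = 0
    CONFIDENTIAL = 1
    PII = 2
    SENSITIVE_PII = 3


# Hypothetical catalog mapping field names to sensitivity labels.
FIELD_LABELS = {
    "product_name": Sensitivity.PUBLIC,
    "contract_terms": Sensitivity.CONFIDENTIAL,
    "email": Sensitivity.PII,
    "health_record": Sensitivity.SENSITIVE_PII,
}


def classify(record: dict) -> Sensitivity:
    """Return the highest sensitivity among a record's fields.

    Unlabeled fields are treated as the most sensitive level so that
    gaps in the catalog are not silently ignored.
    """
    return max(
        (FIELD_LABELS.get(name, Sensitivity.SENSITIVE_PII) for name in record),
        default=Sensitivity.PUBLIC,
    )
```

Treating unknown fields as highly sensitive is one possible design choice; it forces someone to label new data before it can be handled loosely.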
The second strategy is really about managing access.
So users access the data.
The engineers are accessing it. Systems are accessing data.
So the first thing that you want to think about when you're talking about
how to manage access to the data is no direct access.
So a user should not
actually be entitled to go in and actually hit things directly.
What we really want to do is put a role in there,
and that role has privilege against a set of data so that they can go in
and they assign themselves to a role, or governance assigns them to a role.
And that role lets them know what they can do.
And you do these roles everywhere, right?
AI would use roles, the business applications use roles,
everyone that's trying to manipulate the data and get it ready,
they would all have roles assigned to them.
So no direct access. Actually work through a layer of indirection,
a layer of abstraction,
and have roles that actually are the things that manipulate the data.
And then and then whatever it is that's coming in would go through that role.
The second thing that you should think about
is really to the extent,
where possible, always make the data read only.
And this really talks about the stack area up here.
If we're looking at people who are reading the data using the data or AI,
this part should be read only as much as possible.
We'll talk about this in just a minute, because that won't be read only.
But really try to make this read only.
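
To make the "no direct access" and "read only where possible" ideas concrete, here is a small Python sketch. The `Role` and `DataGateway` names are hypothetical; the point is only that callers present a role to a gateway instead of touching the store, and that a role has no write grants unless someone adds them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """A role carries the privileges; users and systems are assigned to it."""
    name: str
    readable: frozenset           # dataset names this role may read
    writable: frozenset = frozenset()  # empty by default: read only


class DataGateway:
    """Only entry point to the data: callers present a role, never a raw handle."""

    def __init__(self, datasets: dict):
        self._datasets = datasets

    def read(self, role: Role, dataset: str) -> list:
        if dataset not in role.readable:
            raise PermissionError(f"{role.name} may not read {dataset}")
        # Return a copy so callers cannot mutate the underlying store.
        return list(self._datasets[dataset])

    def append(self, role: Role, dataset: str, row) -> None:
        if dataset not in role.writable:
            raise PermissionError(f"{role.name} may not write {dataset}")
        self._datasets[dataset].append(row)
```

For example, a role such as `Role("rag_reader", frozenset({"orders"}))` could read `orders` through the gateway but would get a `PermissionError` on any `append`.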
The next thing you want to think about
when we're talking about managing access is to use least privilege.
And what this says is if a user or even an AI
system is coming in, they shouldn't get access to everything.
They should only get access to what they need to do the job
that they're trying to execute.
So that would be a very specific role that says that I can only access
a few different things. Right.
And if you need to access something else, then you have another role.
And then maybe that person
could have multiple roles, or an AI system could have multiple roles,
that get them
just to the narrow piece of information that they need to perform
some sort of task. So that's least privilege.
That's the next thing
we need to make sure we're doing when we manage access.
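
One way to check least privilege is to compare what an identity's roles grant against what its task actually needs. The sketch below assumes permissions are plain strings such as `"invoices:read"`; the role names and the `ROLES` table are hypothetical.

```python
# Hypothetical role definitions: role name -> set of permission strings.
ROLES = {
    "invoice_reader": {"invoices:read"},
    "customer_reader": {"customers:read"},
    "finance_admin": {"invoices:read", "invoices:write", "ledger:write"},
}


def effective_permissions(roles: dict, assigned: list) -> set:
    """Union of the permissions granted by every role assigned to an identity."""
    permissions = set()
    for role_name in assigned:
        permissions |= roles[role_name]
    return permissions


def excess_permissions(granted: set, required: set) -> set:
    """Permissions that are granted but not needed for the task at hand."""
    return granted - required
```

A user or AI system that holds several narrow roles ends up with exactly the union of what those roles allow, and anything returned by `excess_permissions` is a candidate for removal.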
The last thing under managing access is identity management.
And what this really
says is that for a user, we should know who they are.
There's lots of good videos on this. We authenticate them.
That tells us who they are.
There's another system that tells us what they're entitled to do; that
maps to the roles.
But all this is around identity management.
And whether it's a business application coming in, you know,
through a cert or APIs, however that is done, all of these
entities that are accessing data need to have their identities managed.
So we know who it is. We trust who they are.
We know they've been authenticated, we know they've been authorized,
and now we can feed up the data.
So that's the last piece that we're talking about with managing access.
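
The two separate questions here, "who are you?" (authentication) and "what are you entitled to do?" (authorization), can be sketched as follows. The `IdentityStore` class is a hypothetical in-memory stand-in for a real identity provider; only the standard-library calls (`hashlib.pbkdf2_hmac`, `hmac.compare_digest`, `os.urandom`) are real APIs, and the iteration count is an assumption for the example.

```python
import hashlib
import hmac
import os


def hash_secret(secret: str, salt: bytes) -> bytes:
    """Derive a salted hash so the raw secret is never stored."""
    return hashlib.pbkdf2_hmac("sha256", secret.encode(), salt, 100_000)


class IdentityStore:
    """Toy in-memory identity store (a stand-in for a real identity provider)."""

    def __init__(self):
        self._credentials = {}   # identity -> (salt, hashed secret)
        self._entitlements = {}  # identity -> set of role names

    def register(self, identity: str, secret: str, roles) -> None:
        salt = os.urandom(16)
        self._credentials[identity] = (salt, hash_secret(secret, salt))
        self._entitlements[identity] = set(roles)

    def authenticate(self, identity: str, secret: str) -> bool:
        """Who are you? Verify the presented secret."""
        entry = self._credentials.get(identity)
        if entry is None:
            return False
        salt, expected = entry
        return hmac.compare_digest(hash_secret(secret, salt), expected)

    def authorize(self, identity: str, role: str) -> bool:
        """What are you entitled to do? Check the role mapping."""
        return role in self._entitlements.get(identity, set())


def request_access(store: IdentityStore, identity: str, secret: str, role: str) -> bool:
    """Data is served only when both checks pass."""
    return store.authenticate(identity, secret) and store.authorize(identity, role)
```

The same two checks apply whether the identity is a person, a business application, or an AI system.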
Now let's talk about this group here.
These are our privileged users
and we need to think about them as well.
Now we have to have all this stuff with them.
Maybe you know we may alter some of this.
Like obviously we're not going to do read only,
but they still should have least privilege.
We should still have identity management.
The other things that we need to bring in when we're thinking about this now
with privileged users: let's talk about these business applications for a second.
Let's limit or eliminate shared IDs.
An application shouldn't just be using a generic ID
that many people on the other end of this have access to as well, right?
Because then we then we lose control of knowing under identity management
who is actually trying to do stuff.
So we need to think about things like, do we have vaults?
Can we rotate secrets?
You know, how do we make sure that the business application uses a one-to-one
ID or credential to get in and access the data?
So limit any kind of shared credentials that would
provide access, even on the engineering side.
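
As a rough illustration of one-to-one credentials with rotation, the following Python sketch keeps exactly one current token per application. `CredentialVault` is a hypothetical in-memory toy, not the API of any real vault product; `secrets.token_urlsafe` and `secrets.compare_digest` are standard-library functions.

```python
import secrets
from typing import Optional


class CredentialVault:
    """Toy in-memory vault: one current credential per application, never shared."""

    def __init__(self):
        self._tokens = {}  # app_id -> current token

    def rotate(self, app_id: str) -> str:
        """Issue a fresh token for this application; any old token stops working."""
        token = secrets.token_urlsafe(32)
        self._tokens[app_id] = token
        return token

    def identify(self, token: str) -> Optional[str]:
        """Map a presented token back to exactly one application, or None."""
        for app_id, current in self._tokens.items():
            if secrets.compare_digest(current, token):
                return app_id
        return None
```

Because each token maps to a single application, activity can be attributed to that application, and rotating the secret invalidates a leaked copy.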
The next thing that we need to do then is to monitor.
This set of people has more access
privileges than we do up on the operational side of the stack.
And so because of that, we need to make sure that we're looking for
any anomalies in their behavior.
Did an ID of somebody get compromised?
They're coming in at two in the morning.
It's not a regular time for them.
So can we look at that as anomaly and think, okay,
there may be a breach or something going on here.
So basically just monitor the activity, make sure it's within the patterns
of what we expect the behavior to be.
So therefore we know that everything is proceeding the way
that we want it to be and that something else isn't happening.
Now if you do detect something along this line, then we can take an action.
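
The "two in the morning" example can be sketched as a very simple baseline check. This is only an illustration under stated assumptions: `LoginMonitor` is hypothetical, it looks at hour of day alone, and the `min_history` threshold is arbitrary. Real monitoring tools use far richer signals.

```python
from collections import defaultdict


class LoginMonitor:
    """Flag logins at an hour of day never seen before for that identity."""

    def __init__(self, min_history: int = 5):
        self._hours = defaultdict(set)   # identity -> hours of day seen so far
        self._counts = defaultdict(int)  # identity -> number of baseline logins
        self._min_history = min_history  # logins needed before flagging starts

    def observe(self, identity: str, hour: int) -> bool:
        """Record a login; return True if it looks anomalous."""
        seen = self._hours[identity]
        anomalous = self._counts[identity] >= self._min_history and hour not in seen
        if anomalous:
            # Do not learn from anomalous events, so a repeat is flagged again.
            return True
        seen.add(hour)
        self._counts[identity] += 1
        return False
```

When `observe` returns `True`, that is the point at which an action, such as an alert or a forced re-authentication, could be taken.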
One of the things as we talk about monitoring, it's also associated with risk.
And this gets back to our classification.
If it's more sensitive information, there's higher risk.
And therefore we really kind of want to monitor certain things
to make sure nothing is happening with that sensitive information.
If the risk is really low, then maybe the monitoring goes down with that.
Right?
So monitoring and risk are always kind of associated together.
Now when we think about these sets of things right here,
this is really about governance.
This is about data governance.
Classification is about data governance, as is cataloging what the data is.
All of this falls under a governance umbrella, as does identity governance.
You get into IAM, you get into IGA, you get into access governance.
Really, all of this falls under the umbrella of governance.
And there are some really good tools for providing this level of governance.
If you're trying to implement,
you know, these strategies,
there are some really good ways for doing that,
and there's some really good videos on how to do that as well.
All right.
Next thing that we want to do as a strategy is encrypt
our data.
If we encrypt everything in here, then if there's a breach and something gets stolen,
there's a better likelihood
that the data will then be useless because they don't have the keys
to unlock it.
And that's actually an important topic as well.
Make sure that the keys that you have are independently or third-party managed.
In other words, your admins shouldn't have the keys to unlock the data.
The admins should be manipulating the system.
If you talk about a role, an admin can have a special role
that says that they can build structures, they can build tables, they can build,
you know, whatever it is you need.
If you're using an object database,
whatever it is, admins build structure, but they can't see or manipulate data.
If they do, then that data is encrypted,
and they don't have the keys to unlock it,
which is why we want to separate that out.
So encryption is a very important topic when we come to protecting data.
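
The separation between the people who run the system and the party that holds the keys can be sketched as below. Everything here is hypothetical: `KeyCustodian` stands in for an independently managed key service, and `_keystream_xor` is a toy stand-in cipher built from SHA-256 so the example runs with only the standard library. It is for illustration only; a real system would use a vetted authenticated cipher from an established library.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stand-in cipher (SHA-256 keystream XOR). Illustration only, not for real use."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class KeyCustodian:
    """Holds the key outside the data platform; releases it only to allowed roles."""

    def __init__(self, allowed_roles):
        self._key = secrets.token_bytes(32)
        self._allowed = set(allowed_roles)

    def key_for(self, role: str) -> bytes:
        if role not in self._allowed:
            raise PermissionError(f"role {role!r} may not obtain the data key")
        return self._key


def encrypt(custodian: KeyCustodian, role: str, plaintext: bytes):
    """Return (nonce, ciphertext); a fresh nonce is drawn for every message."""
    nonce = secrets.token_bytes(16)
    return nonce, _keystream_xor(custodian.key_for(role), nonce, plaintext)


def decrypt(custodian: KeyCustodian, role: str, nonce: bytes, ciphertext: bytes) -> bytes:
    return _keystream_xor(custodian.key_for(role), nonce, ciphertext)
```

With a custodian created as `KeyCustodian({"payments_service"})`, a caller using an `"admin"` role can still store and move the ciphertext, but any attempt to decrypt raises `PermissionError`.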
And then finally
repeat all of this.
This is just this is just good security hygiene.
You know, it's not enough to just say, look, I did this.
I checked my box. I did that, I checked my box. Okay.
Check check check.
You know, I'm all good.
Things are constantly changing.
It's a very dynamic nature of data as it changes,
what's coming in, and what kind of systems we're using to store it.
So you should be continually reassessing.
Do I have the right classification? Have I cataloged this correctly?
Have I set up all my access right? Did somebody get an access
because they used to be over here,
and then they moved over here and they retained that access?
So just continually repeat
these strategies to make sure your data is properly protected.
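
The retained-access case mentioned above (someone moves teams and keeps their old access) is the kind of thing a recurring review can catch. A minimal sketch, assuming each role is owned by one team and each user currently belongs to one team; the function name and data shapes are hypothetical:

```python
def stale_grants(current_team: dict, role_team: dict, grants: list) -> list:
    """Return (user, role) grants whose role belongs to a team the user is not on.

    current_team: user -> team the user belongs to today
    role_team:    role -> team that owns the role
    grants:       list of (user, role) pairs currently in effect
    """
    return [
        (user, role)
        for user, role in grants
        if role_team[role] != current_team.get(user)
    ]
```

Running a check like this on a schedule, and reviewing whatever it returns, is one way to turn "repeat" into a habit instead of a one-time checkbox.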
And really, as we look at building out AI systems,
which are built off of data,
this is a set of strategies that can help you
to make sure that the data you're using is protected.
Thank you.