Securing Merged Enterprise Data for AI
Key Points
- Enterprises are moving beyond isolated siloed data toward unified data warehouses and marts that blend financial, HR, operational, and sales information for easier consumption.
- Traditional access‑control models (request‑and‑approve per database) are being superseded by consolidated views, snapshots, and dashboards that deliver ready‑to‑query insights to users.
- The rise of generative AI and retrieval‑augmented generation (RAG) is driving the need to pull this merged enterprise data into AI models for training and real‑time inference.
- Protecting the “merged” data set becomes a critical challenge, requiring new strategies that secure data across the entire pipeline—from source systems through warehouses, marts, and AI workloads.
- Understanding this evolution—from siloed access, to BI‑driven data marts, to AI‑enabled unified data—helps organizations design safeguards that preserve confidentiality while unlocking the value of enterprise‑wide analytics and AI.
Sections
- Protecting Merged Enterprise Data - The speaker outlines the shift from traditional business intelligence to AI‑driven uses of enterprise data and stresses the need for strategies to secure the integrated data spanning finance, HR, operations, and sales.
- Access Controls for Enterprise Vector Data - The speaker outlines how to protect merged enterprise information stored in a vector database—used for RAG‑style queries—by treating it as a new asset and applying comprehensive access‑control strategies.
- From Data Objects to Virtualization - The speaker explains treating dashboard views and logical groupings as access‑controlled data objects, then shifting from traditional ETL to a data‑virtualization layer that dynamically generates user‑specific query results across enterprise data lakes.
- Birthright Access and Data Governance - The speaker explains how centralized or decentralized access controls enable prompt filtering and emphasizes using role‑based “birthright” permissions to automatically determine data access based on a user’s organization, role, location, and job function.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=_K5YRvW4PKA](https://www.youtube.com/watch?v=_K5YRvW4PKA) **Duration:** 00:12:52

Section timestamps: [00:00:00](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=0s) Protecting Merged Enterprise Data · [00:03:20](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=200s) Access Controls for Enterprise Vector Data · [00:06:33](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=393s) From Data Objects to Virtualization · [00:09:44](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=584s) Birthright Access and Data Governance
Howdy, everyone.
We've been talking for a while now about data.
But as we look at what has been emerging over the last 10 plus years, we're really starting to use data in an enterprise in different ways than we traditionally have.
We're looking at it from business intelligence.
We're generating insights out of our data.
What can we learn?
What can we take action on?
And more recently, we're using data for AI.
We're bringing data from all over across the enterprise
and using it to train models, to run gen AI systems, or run RAG models.
So as we start thinking about how we've been using data, the big key thing that's happening, especially with insights, and especially with AI,
is we are merging data from across the enterprise together, and now we have a blend of information from everywhere.
And the question is, how do we protect that merged data?
As we look at how to answer this question and generate some strategies for protecting that merged data,
we need to step back a little bit and look at how we've gotten to where we're at.
So let's start by thinking about the data that we have in the enterprise.
We have financial information.
We have HR information.
We have operational information.
We have sales information and a lot of other data everywhere across the enterprise, and as we want to consume that data,
we'll have a user or we'll have an application that basically does queries and retrieves information out of these data systems.
Now, what we typically have done in the past is someone who needs to access something, they request access and they get approved.
They request access to HR data, and it gets approved.
Whatever it is, they have to make sure that they have the proper rights to access certain information,
and we control the access
for that information within the database.
Now, as we've gone forward and we have lots of information and we want to bring it together and make it more consumable,
what we've done is we've started creating data warehouses that actually merge certain data together into certain pools of information
that make it a little bit more consumable for us; if we want to query that information, it's in one place now.
Now, as we started looking at business intelligence and we wanted to create views and snapshots into data,
we actually went another step and started creating data marts,
which are very specific organizations of information that can then generate reports and feed a dashboard that a user can query, get access to, and see.
So now they can see, instead of having to go into all these individual data locations and try to find the data they need,
now they just go to a quick snapshot, a quick dashboard, and get the specific information they need to do their job.
So that's kind of our traditional path that we've had in getting access to data.
As we look at what's happening with AI, especially as we look at gen AI and RAG models,
what we have is an AI system, and within it is a large language model that helps interpret the question the user is asking.
They may be asking a question through an assistant.
It then retrieves information and presents it back to the user in a very consumable fashion.
Now, in this situation, what is happening is we have a vector database,
and we are embedding information from our traditional enterprise data systems so that we can query it, get the answers, and respond back.
We can also, in RAG models, directly access the information we have in the enterprise if we need to complement and supplement the responses that we're bringing back.
So again, when we're doing this, this information is all getting merged together into a vector database and that's feeding back.
So what privileges does that user have and how do we control the access to this?
So this is really the question that we are trying to answer.
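To make the problem concrete, here is a minimal sketch of what an uncontrolled merged index looks like. All file names, vectors, and scores are made up for illustration: every document from HR, finance, and operations sits in one embedding index, and nearest-neighbour retrieval has no notion of who is asking.

```python
import math

# Hypothetical in-memory "vector database": documents from several source
# systems are embedded into one merged index, with no access metadata.
MERGED_INDEX = [
    ("hr/salaries.txt",       [0.9, 0.1, 0.0]),
    ("finance/q3-report.txt", [0.1, 0.9, 0.0]),
    ("ops/runbook.txt",       [0.0, 0.1, 0.9]),
]

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=2):
    # Nearest-neighbour search over the merged index: note there is no
    # "user" parameter -- any caller can reach any document.
    ranked = sorted(MERGED_INDEX, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

The strategies that follow are different ways of putting a "who is asking, and what may they see?" check around exactly this kind of retrieval.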
So let's look at a set of strategies that we can use then to protect
this merged data that we're seeing across our enterprise.
So let's think about what are a set of strategies that we can use.
So the first strategy that we could apply is the one that we've been using really for decades, and this is all around just access controls.
How do we make sure that we have the right access controls in place across our entire data flow?
So first thing that we kind of think about doing this is we can treat
some of this as a new data type, right?
We can think of it as a new asset, a new place that we wanna give access to.
And the way to think about this is as we merge all this data into a data warehouse,
that data warehouse now becomes its own data asset that we then give access to.
So somebody is responsible for understanding what the user's trying to do with that warehouse and providing them access.
So they do not necessarily have access to the individual data sources.
So this is one way that we can look at this, is how do we just create these as separate assets and kind of run it that way.
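This "warehouse as its own asset" idea can be sketched as a simple grant table, where access is approved per merged asset rather than per source database. The asset and user names here are purely illustrative:

```python
# Hypothetical asset-level grants: the merged warehouse is its own asset,
# approved by its owner, independent of the underlying source systems.
ASSET_GRANTS = {
    "sales_warehouse": {"alice", "bob"},   # access to the merged asset
    "hr_source_db":    {"carol"},          # access to an individual source
}

def can_query(user: str, asset: str) -> bool:
    # A grant on the warehouse does NOT imply a grant on the source
    # databases that feed it, and vice versa.
    return user in ASSET_GRANTS.get(asset, set())
```

So "alice" can query the merged sales warehouse without ever holding rights on the HR source database that contributed to it.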
The other thing that we've done quite a bit in the past is to say that you can only have access if you have all access.
In other words, the way to think about this is: say we want to flow information back from an AI prompt,
the only way that they can actually get the response back is if they have access to everything, right?
So if they have access to all this information, then they can have access to all the information that's in here.
Same kind of thing here.
They can view everything in the dashboards as long as they have access to all the originating source information.
Now, in practice, I realize this is really hard to do, because it's not likely we're going to provide access to everything, right?
So in some cases, when we've implemented this, we do some data engineering, so we restructure things that are happening in,
say, the vector database or the warehouse,
so that groupings of things can then become an asset that we can provide access to, and this is actually the next piece of this.
We can really start thinking of things as data objects.
And the way to think about that is like in a particular dashboard here,
this may be a very specific view from a set of information and we think of that as a data object and that object then is what you're allowed to get access to.
We could sort of apply this to a vector database, but we can have many, many vectors, and it's really hard to think of each one as an object.
The way to really look at this is that you have logical groupings of things that you then provide access to.
So that's the data object.
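One plausible sketch of this, with invented object and group names: documents or vectors are bundled into logical "data objects" (a dashboard view, a topic grouping), and the ACL is attached to the object rather than to each row or vector.

```python
# Hypothetical data objects: logical groupings of documents/vectors, each
# with an ACL. Users get access through objects, never row by row.
DATA_OBJECTS = {
    "emea_sales_view": {"members": {"doc-101", "doc-102"}, "allowed": {"sales-emea"}},
    "hr_comp_view":    {"members": {"doc-201"},            "allowed": {"hr-comp"}},
}

def visible_members(user_groups: set) -> set:
    # A document is visible only if some object containing it is
    # accessible to at least one of the user's groups.
    out = set()
    for obj in DATA_OBJECTS.values():
        if obj["allowed"] & user_groups:
            out |= obj["members"]
    return out
```

A retrieval layer could then intersect its nearest-neighbour results with `visible_members(...)` before returning anything.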
Now, the next strategy that we can do is start applying data virtualization.
Now, in our traditional view of how we work with data, we do a lot of extract, transform, load;
we do a lot of moving data and shaping it to what we need, so it finally gets to the output that we need.
With the emergence of data lakes, which take a big, collective view of our enterprise data and look at that as a set, not necessarily a warehouse, we look at how and what we're allowed to do within this.
We can actually create a data virtualization layer that basically says instead of always doing ETLs and moving data, every time we want to do a query,
we can create a virtualized output that is specific to what that user has access to.
So that virtualization is kind of a runtime control and view of what they're allowed to see.
And that really allows us to get a much better ability to start controlling
access to things.
So we really start thinking when we're looking at data virtualization,
this is really about data lakes,
and it's also about data governance.
If we're going to do this properly...
We have to understand what kind of data is here.
Is there PII data?
Is there SPI data?
Where's the data from?
What's the lineage of the data?
We really have to govern the data very well so that as we do things like data virtualization,
we can apply it and make it so that it's very consumable with the right access controls in place.
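A rough sketch of that combination, under stated assumptions: rows carry governance tags (here just "pii"), and a per-user view is generated at query time with row filtering and column masking, instead of materializing an ETL copy. All field names and entitlements are illustrative.

```python
# Hypothetical governed rows: governance metadata ("tags") travels with
# the data, which is what makes runtime virtualization possible.
ROWS = [
    {"employee": "E-1", "region": "EMEA", "salary": 90000, "tags": {"pii"}},
    {"employee": "E-2", "region": "APAC", "salary": 80000, "tags": {"pii"}},
]

def virtual_view(rows, entitlements):
    # Build a user-specific view at query time: no copies, just a
    # filtered and masked projection of the governed data.
    view = []
    for row in rows:
        if row["region"] not in entitlements["regions"]:
            continue  # row-level filtering by entitlement
        out = dict(row)
        if "pii" in row["tags"] and not entitlements["may_see_pii"]:
            out["salary"] = None  # column-level masking of governed fields
        del out["tags"]           # governance metadata stays internal
        view.append(out)
    return view
```

Without the tags (the governance), the masking step has nothing to act on, which is the speaker's point about governance being a prerequisite.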
Now, data virtualization does get really hard when we start thinking about embeddings and vector database
because we're really not doing this in a runtime fashion.
The embeddings happen and are set up ahead of time, so it really doesn't lend itself towards data virtualization.
So what can we do here?
So the next strategy really then starts talking about filtering.
And this is really filtering the query results.
We can either do pre-filtering or post-filtering.
In other words, pre-filtering means we send a prompt in, and the results that are returned are only the results that a user is allowed to access.
So that all flows back.
Post filtering is another option here.
Post filtering really says that we do the prompt, we find all the nearest neighbors in our vector database,
and then after the results come back, then we filter it based on what that user is allowed to do.
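The two modes can be sketched over a toy scored index (document names, scores, and ACL groups are all invented). The practical difference shows up in the result count: post-filtering can come back short, because permitted documents were crowded out of the top-k by ones the user cannot see.

```python
# Toy index: (doc_id, similarity score for some fixed query, acl groups).
SCORED = [
    ("hr-doc",    0.95, {"hr"}),
    ("sales-doc", 0.90, {"sales"}),
    ("ops-doc",   0.40, {"sales", "ops"}),
]

def pre_filter_search(groups, k=2):
    # Pre-filtering: restrict candidates to permitted docs, THEN rank.
    allowed = [e for e in SCORED if e[2] & groups]
    return [d for d, _, _ in sorted(allowed, key=lambda e: e[1], reverse=True)[:k]]

def post_filter_search(groups, k=2):
    # Post-filtering: rank all nearest neighbours first, THEN drop the
    # ones the user may not see -- the result set can end up smaller.
    top = sorted(SCORED, key=lambda e: e[1], reverse=True)[:k]
    return [d for d, _, acl in top if acl & groups]
```

For a "sales" user, pre-filtering fills the top-2 with both permitted documents, while post-filtering returns only one, since the highest-scoring neighbour was an HR document that got discarded.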
Now, to make this filtering really work well requires that we have some good knowledge of access controls,
and of what people throughout the enterprise are allowed to do.
This can be centralized access control, it can be decentralized access control.
So there's a lot of challenges with this as well, but it does allow us to do some sort of filtering on a prompt and on the results that are coming back.
This gets us to a certain extent.
And even when we're talking about filtering, the pre- and post-filtering, and the access controls, it again comes back to data governance.
Let me mark this up a little bit.
This piece becomes very critical and key to anything we're trying to do around access controls and protecting merged data.
But now let's start thinking about where we can go from here.
The next thing is birthright access.
Now birthright is not a new concept either.
It's been around for a while.
And really what this concept says is that it's not about the person individually requesting access to things;
it's really about who they are, what organization or division or group they're part of, what their role is, and where they're located.
What job are they trying to accomplish? That then drives what data they're allowed to see.
So they don't have to go ask.
It's really driven by who they are and what they're trying to do.
And that's how we give access.
And that simplifies the problem greatly, whether we're trying to do it for insights and dashboards,
or from the AI side, determining what data they're allowed to access that we retrieve back in a RAG model.
If we know who they are and we know about the data, we can merge that together,
do our pre-filtering, do our post-filtering, whatever it is that we need to do, and drive it that way.
Now again, that relies heavily on having good, strong data governance in place to be able to accomplish that,
but it does simplify our problem in protecting merged data and provides a way for us to do that.
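Birthright access can be sketched as attribute-driven rules: entitlements fall out of who the user is (organization, role, location) rather than from individual request-and-approve workflows. The rules and scope names below are illustrative, not a real policy set.

```python
# Hypothetical birthright rules: (attribute predicate, granted data scope).
BIRTHRIGHT_RULES = [
    (lambda u: u["org"] == "finance",  "finance_reports"),
    (lambda u: u["role"] == "manager", "team_hr_summary"),
    (lambda u: u["location"] == "EU",  "eu_sales_mart"),
]

def birthright_scopes(user: dict) -> set:
    # No per-asset request step: scopes are derived directly from the
    # user's attributes, and change automatically when the user moves.
    return {scope for pred, scope in BIRTHRIGHT_RULES if pred(user)}
```

These derived scopes are exactly what a pre- or post-filtering step would consult when deciding which retrieved results flow back to the user.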
So these are four strategies that we can use to start solving the problem of how to protect merged data,
and in a lot of ways, this is really about least privilege.
Who is the user?
What are they trying to do?
And what are they allowed to access?
The last thing I want to bring in, and we should always be thinking about this any time we're talking about protecting data, is really around compliance.
So we need to make sure that anything that we are doing is observable.
And that we monitor all the actions and activities taking place.
And this is something that I'm hoping is a standard process with everyone.
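The observability piece can be as simple as an append-only audit trail recording who touched what and whether the decision was allow or deny; field names here are illustrative.

```python
import json
import time

# Append-only audit trail: every access decision over the merged data is
# recorded so compliance can observe who did what, and when.
AUDIT_LOG = []

def audited_access(user: str, asset: str, allowed: bool) -> bool:
    AUDIT_LOG.append({
        "ts": time.time(),                          # when it happened
        "user": user,                               # who asked
        "asset": asset,                             # what they touched
        "decision": "allow" if allowed else "deny", # what was decided
    })
    return allowed

def export_log() -> str:
    # Compliance teams typically want the trail in a portable format.
    return json.dumps(AUDIT_LOG)
```

Logging denials as well as allows matters: a pattern of denied attempts against the merged data is often the earliest signal of misuse.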
So again, just kind of looking at this whole thing, as we look at, especially in our AI world, and we merge data from across an enterprise,
we need to make sure that we're providing the right access, and that the data is only what users need
to accomplish the task they've set out to do,
and these are a set of strategies that we can use to do that and to protect that merged data.