Securing Merged Enterprise Data for AI
Key Points
- Enterprises are moving beyond isolated siloed data toward unified data warehouses and marts that blend financial, HR, operational, and sales information for easier consumption.
- Traditional access‑control models (request‑and‑approve per database) are being superseded by consolidated views, snapshots, and dashboards that deliver ready‑to‑query insights to users.
- The rise of generative AI and retrieval‑augmented generation (RAG) is driving the need to pull this merged enterprise data into AI models for training and real‑time inference.
- Protecting the “merged” data set becomes a critical challenge, requiring new strategies that secure data across the entire pipeline—from source systems through warehouses, marts, and AI workloads.
- Understanding this evolution—from siloed access, to BI‑driven data marts, to AI‑enabled unified data—helps organizations design safeguards that preserve confidentiality while unlocking the value of enterprise‑wide analytics and AI.
Sections
- Protecting Merged Enterprise Data - The speaker outlines the shift from traditional business intelligence to AI‑driven uses of enterprise data and stresses the need for strategies to secure the integrated data spanning finance, HR, operations, and sales.
- Access Controls for Enterprise Vector Data - The speaker outlines how to protect merged enterprise information stored in a vector database—used for RAG‑style queries—by treating it as a new asset and applying comprehensive access‑control strategies.
- From Data Objects to Virtualization - The speaker explains treating dashboard views and logical groupings as access‑controlled data objects, then shifting from traditional ETL to a data‑virtualization layer that dynamically generates user‑specific query results across enterprise data lakes.
- Birthright Access and Data Governance - The speaker explains how centralized or decentralized access controls enable prompt filtering and emphasizes using role‑based “birthright” permissions to automatically determine data access based on a user’s organization, role, location, and job function.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=_K5YRvW4PKA](https://www.youtube.com/watch?v=_K5YRvW4PKA) **Duration:** 00:12:52

Section timestamps: [00:00:00](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=0s) Protecting Merged Enterprise Data · [00:03:20](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=200s) Access Controls for Enterprise Vector Data · [00:06:33](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=393s) From Data Objects to Virtualization · [00:09:44](https://www.youtube.com/watch?v=_K5YRvW4PKA&t=584s) Birthright Access and Data Governance
Howdy, everyone.
We've been talking for a while now about data.
But as we look at what has been emerging over the last 10 plus years, we're really starting to use data in an enterprise in different ways than we traditionally have.
We're looking at it from business intelligence.
We're generating insights out of our data.
What can we learn?
What can we take action on?
And more recently, we're using data for AI.
We're bringing data from all over across the enterprise
and using it to train models, to run gen AI systems, or run RAG models.
So as we start thinking about how we've been using data, the big key thing that's happening, especially with insights, and especially with AI,
is we are merging data from across the enterprise together, and now we have a blend of information from everywhere.
And the question is, how do we protect that merged data?
As we look at how to answer this question and generate some strategies for protecting that merged data,
we need to step back a little bit and look at how we've gotten to where we're at.
So let's start by thinking about the data that we have in the enterprise.
We have financial information.
We have HR information.
We have operational information.
We have sales information and a lot of other data everywhere across the enterprise, and as we want to consume that data,
we'll have a user or we'll have an application that basically does queries and retrieves information out of these data systems.
Now, what we typically have done in the past is someone who needs to access something, they request access and they get approved.
They request access to HR data, and it gets approved.
Whatever it is, they have to make sure that they have the proper rights to access certain information,
and we control the access
for that information within the database.
Now, as we've gone forward and we have lots of information and we want to bring it together and make it more consumable,
what we've done is we've started creating data warehouses that actually merge certain data together into certain pools of information
that make it a little bit more consumable for us; if we want to query that information, it's in one place now.
Now, as we started looking at business intelligence and we wanted to create views and snapshots into data,
we actually went another step and started creating data marts,
which are very specific organizations of information that can then generate reports and feed a dashboard that a user can query, get access to, and see.
So now they can see, instead of having to go into all these individual data locations and try to find the data they need,
now they just go to a quick snapshot, a quick dashboard, and get the specific information they need to do their job.
So that's kind of our traditional path that we've had in getting access to data.
As we look at what's happening with AI, especially as we look at gen AI and RAG models,
what we have is an AI system, and within it is a large language model that helps interpret the question the user is asking.
They may be asking a question through an assistant.
It then retrieves information and presents it back to the user in a very consumable fashion.
Now, in this situation, what is happening is we have a vector database,
and we are embedding information from our traditional enterprise data systems so that we can query it, get the answers, and respond back.
We can also, in RAG models, directly access the information we have in the enterprise if we need to complement and supplement the responses that we're bringing back.
So again, when we're doing this, this information is all getting merged together into a vector database and that's feeding back.
So what privileges does that user have and how do we control the access to this?
So this is really the question that we are trying to answer.
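To make the problem concrete, here is a minimal sketch of what an uncontrolled merged index looks like. All file names, vectors, and scores are made up for illustration: every document from HR, finance, and operations sits in one embedding index, and nearest-neighbour retrieval has no notion of who is asking.

```python
import math

# Hypothetical in-memory "vector database": documents from several source
# systems are embedded into one merged index, with no access metadata.
MERGED_INDEX = [
    ("hr/salaries.txt",       [0.9, 0.1, 0.0]),
    ("finance/q3-report.txt", [0.1, 0.9, 0.0]),
    ("ops/runbook.txt",       [0.0, 0.1, 0.9]),
]

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=2):
    # Nearest-neighbour search over the merged index: note there is no
    # "user" parameter -- any caller can reach any document.
    ranked = sorted(MERGED_INDEX, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

The strategies that follow are different ways of putting a "who is asking, and what may they see?" check around exactly this kind of retrieval.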
So let's look at a set of strategies that we can use then to protect
this merged data that we're seeing across our enterprise.
So let's think about what are a set of strategies that we can use.
So the first strategy that we could apply is the one that we've been using really for decades, and this is all around just access controls.
How do we make sure that we have the right access controls in place across our entire data flow?
So first thing that we kind of think about doing this is we can treat
some of this as a new data type, right?
We can think of it as a new asset, a new place that we wanna give access to.
And the way to think about this is as we merge all this data into a data warehouse,
that data warehouse now becomes its own data asset that we then give access to.
So somebody is responsible for understanding what the user's trying to do with that warehouse and providing them access.
So they do not necessarily have access to the individual data sources.
So this is one way that we can look at this, is how do we just create these as separate assets and kind of run it that way.
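This "warehouse as its own asset" idea can be sketched as a simple grant table, where access is approved per merged asset rather than per source database. The asset and user names here are purely illustrative:

```python
# Hypothetical asset-level grants: the merged warehouse is its own asset,
# approved by its owner, independent of the underlying source systems.
ASSET_GRANTS = {
    "sales_warehouse": {"alice", "bob"},   # access to the merged asset
    "hr_source_db":    {"carol"},          # access to an individual source
}

def can_query(user: str, asset: str) -> bool:
    # A grant on the warehouse does NOT imply a grant on the source
    # databases that feed it, and vice versa.
    return user in ASSET_GRANTS.get(asset, set())
```

So "alice" can query the merged sales warehouse without ever holding rights on the HR source database that contributed to it.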
The other thing that we've done quite a bit in the past is to say that you can only have access if you have all access.
In other words, the way to think about this is: say we want to flow information back from an AI prompt,
the only way that they can actually get the response back is if they have access to everything, right?
So if they have access to all this information, then they can have access to all the information that's in here.
Same kind of thing here.
They can view everything in the dashboards as long as they have access to all the originating source information.
Now, in practice, I realize this is really hard to do, because it's not likely we're going to provide access to everything, right?
So in some cases, when we've implemented this, we do some data engineering, so we restructure things that are happening in,
say, the vector database or the warehouse,
so that groupings of things can then become an asset that we can provide access to, and this is actually the next piece of this.
We can really start thinking of things as data objects.
And the way to think about that is like in a particular dashboard here,
this may be a very specific view from a set of information and we think of that as a data object and that object then is what you're allowed to get access to.
We could sort of apply this to a vector database, but we can have many, many vectors, and it's really hard to think of each one as an object.
The way to really look at this is that you have logical groupings of things that you then provide access to.
So that's the data object.
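One plausible sketch of this, with invented object and group names: documents or vectors are bundled into logical "data objects" (a dashboard view, a topic grouping), and the ACL is attached to the object rather than to each row or vector.

```python
# Hypothetical data objects: logical groupings of documents/vectors, each
# with an ACL. Users get access through objects, never row by row.
DATA_OBJECTS = {
    "emea_sales_view": {"members": {"doc-101", "doc-102"}, "allowed": {"sales-emea"}},
    "hr_comp_view":    {"members": {"doc-201"},            "allowed": {"hr-comp"}},
}

def visible_members(user_groups: set) -> set:
    # A document is visible only if some object containing it is
    # accessible to at least one of the user's groups.
    out = set()
    for obj in DATA_OBJECTS.values():
        if obj["allowed"] & user_groups:
            out |= obj["members"]
    return out
```

A retrieval layer could then intersect its nearest-neighbour results with `visible_members(...)` before returning anything.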
Now, the next strategy that we can do is start applying data virtualization.
Now, in our traditional view of how we work with data, we do a lot of extract, transform, load;
we do a lot of moving data and shaping it to what we need, so it finally gets to the output that we need.
With the emergence of data lakes, which take a big, collective view of our enterprise data and look at that as a set, not necessarily a warehouse, we look at how and what we're allowed to do within this.
We can actually create a data virtualization layer that basically says instead of always doing ETLs and moving data, every time we want to do a query,
we can create a virtualized output that is specific to what that user has access to.
So that virtualization is kind of a runtime control and view of what they're allowed to see.
And that really allows us to get a much better ability to start controlling
access to things.
So we really start thinking when we're looking at data virtualization,
this is really about data lakes,
and it's also about data governance.
If we're going to do this properly...
We have to understand what kind of data is here.
Is there PII data?
Is there SPI data?
Where's the data from?
What's the lineage of the data?
We really have to govern the data very well so that as we do things like data virtualization,
we can apply it and make it so that it's very consumable with the right access controls in place.
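A rough sketch of that combination, under stated assumptions: rows carry governance tags (here just "pii"), and a per-user view is generated at query time with row filtering and column masking, instead of materializing an ETL copy. All field names and entitlements are illustrative.

```python
# Hypothetical governed rows: governance metadata ("tags") travels with
# the data, which is what makes runtime virtualization possible.
ROWS = [
    {"employee": "E-1", "region": "EMEA", "salary": 90000, "tags": {"pii"}},
    {"employee": "E-2", "region": "APAC", "salary": 80000, "tags": {"pii"}},
]

def virtual_view(rows, entitlements):
    # Build a user-specific view at query time: no copies, just a
    # filtered and masked projection of the governed data.
    view = []
    for row in rows:
        if row["region"] not in entitlements["regions"]:
            continue  # row-level filtering by entitlement
        out = dict(row)
        if "pii" in row["tags"] and not entitlements["may_see_pii"]:
            out["salary"] = None  # column-level masking of governed fields
        del out["tags"]           # governance metadata stays internal
        view.append(out)
    return view
```

Without the tags (the governance), the masking step has nothing to act on, which is the speaker's point about governance being a prerequisite.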
Now, data virtualization does get really hard when we start thinking about embeddings and vector database
because we're really not doing this in a runtime fashion.
The embeddings happen and are set up ahead of time, so it really doesn't lend itself towards data virtualization.
So what can we do here?
So the next strategy really then starts talking about filtering.
And this is really filtering the query results.
We can either do pre-filtering or post-filtering.
In other words, pre-filtering means we send a prompt in, and the results that are returned are only the results that a user is allowed to access.
So that all flows back.
Post filtering is another option here.
Post filtering really says that we do the prompt, we find all the nearest neighbors in our vector database,
and then after the results come back, then we filter it based on what that user is allowed to do.
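The two modes can be sketched over a toy scored index (document names, scores, and ACL groups are all invented). The practical difference shows up in the result count: post-filtering can come back short, because permitted documents were crowded out of the top-k by ones the user cannot see.

```python
# Toy index: (doc_id, similarity score for some fixed query, acl groups).
SCORED = [
    ("hr-doc",    0.95, {"hr"}),
    ("sales-doc", 0.90, {"sales"}),
    ("ops-doc",   0.40, {"sales", "ops"}),
]

def pre_filter_search(groups, k=2):
    # Pre-filtering: restrict candidates to permitted docs, THEN rank.
    allowed = [e for e in SCORED if e[2] & groups]
    return [d for d, _, _ in sorted(allowed, key=lambda e: e[1], reverse=True)[:k]]

def post_filter_search(groups, k=2):
    # Post-filtering: rank all nearest neighbours first, THEN drop the
    # ones the user may not see -- the result set can end up smaller.
    top = sorted(SCORED, key=lambda e: e[1], reverse=True)[:k]
    return [d for d, _, acl in top if acl & groups]
```

For a "sales" user, pre-filtering fills the top-2 with both permitted documents, while post-filtering returns only one, since the highest-scoring neighbour was an HR document that got discarded.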
Now, to make this filtering really work well requires that we have some good knowledge of access controls,
and of what people throughout the enterprise are allowed to do.
This can be centralized access control, it can be decentralized access control.
So there's a lot of challenges with this as well, but it does allow us to do some sort of filtering on a prompt and on the results that are coming back.
This gets us to a certain extent.
And even when we're talking about filtering, the pre- and post-filtering, and the access controls, it again comes back to data governance.
Let me mark this up a little bit.
This piece becomes very critical and key to anything we're trying to do around access controls and protecting merged data.
But now let's start thinking about where we can go from here.
The next thing is birthright access.
Now birthright is not a new concept either.
It's been around for a while.
And really what this concept says is that it's not about the person individually requesting access to things;
it's really about who they are, what organization or division or group they're part of, what their role is, and where they're located.
What job are they trying to accomplish? That then drives what data they're allowed to see.
So they don't have to go ask.
It's really driven by who they are and what they're trying to do.
And that's how we give access.
And that simplifies the problem greatly, whether we're trying to do it for insights and dashboards,
or from the AI side, determining what data they're allowed to access that we retrieve back in a RAG model.
If we know who they are and we know about the data, we can merge that together,
do our pre-filtering, do our post-filtering, whatever it is that we need to do, and drive it that way.
Now again, that relies heavily on having good, strong data governance in place to be able to accomplish that,
but it does simplify our problem in protecting merged data and provides a way for us to do that.
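Birthright access can be sketched as attribute-driven rules: entitlements fall out of who the user is (organization, role, location) rather than from individual request-and-approve workflows. The rules and scope names below are illustrative, not a real policy set.

```python
# Hypothetical birthright rules: (attribute predicate, granted data scope).
BIRTHRIGHT_RULES = [
    (lambda u: u["org"] == "finance",  "finance_reports"),
    (lambda u: u["role"] == "manager", "team_hr_summary"),
    (lambda u: u["location"] == "EU",  "eu_sales_mart"),
]

def birthright_scopes(user: dict) -> set:
    # No per-asset request step: scopes are derived directly from the
    # user's attributes, and change automatically when the user moves.
    return {scope for pred, scope in BIRTHRIGHT_RULES if pred(user)}
```

These derived scopes are exactly what a pre- or post-filtering step would consult when deciding which retrieved results flow back to the user.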
So these are four strategies that we can use to start solving the problem of how to protect merged data,
and in a lot of ways, this is really about least privilege.
Who is the user?
What are they trying to do?
And what are they allowed to access?
The last thing I want to bring in, and we should always be thinking about this any time we're talking about protecting data, is really around compliance.
So we need to make sure that anything that we are doing is observable.
And that we monitor all the actions and activities taking place.
And this is something that I'm hoping is a standard process with everyone.
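The observability piece can be as simple as an append-only audit trail recording who touched what and whether the decision was allow or deny; field names here are illustrative.

```python
import json
import time

# Append-only audit trail: every access decision over the merged data is
# recorded so compliance can observe who did what, and when.
AUDIT_LOG = []

def audited_access(user: str, asset: str, allowed: bool) -> bool:
    AUDIT_LOG.append({
        "ts": time.time(),                          # when it happened
        "user": user,                               # who asked
        "asset": asset,                             # what they touched
        "decision": "allow" if allowed else "deny", # what was decided
    })
    return allowed

def export_log() -> str:
    # Compliance teams typically want the trail in a portable format.
    return json.dumps(AUDIT_LOG)
```

Logging denials as well as allows matters: a pattern of denied attempts against the merged data is often the earliest signal of misuse.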
So again, just kind of looking at this whole thing, as we look at, especially in our AI world, and we merge data from across an enterprise,
we need to make sure that we're providing the right access, and that the data is only what users need
to accomplish the task they've set out to do,
and these are a set of strategies that we can use to do that and to protect that merged data.