Governed Data Architecture for AI

Key Points

  • High‑quality, well‑governed data is the foundation of the AI lifecycle, reducing time spent on collection and cleaning so teams can focus on model work.
  • Modern data architectures—whether data lakes, data fabrics, or other repositories—must adopt AI‑specific guardrails such as standardized organization, clear classification (personal, financial, etc.), and documented ownership.
  • Data should be ingested from batch loads or event‑driven streams into a single, agreed‑upon location, where metadata records the data’s uniqueness, lineage, and sensitivity to satisfy compliance and improve model accuracy.
  • The principles of data cataloging, documentation, and strict schema enforcement, once optional, are now essential investments to meet today’s AI performance and regulatory demands.
  • Applying these governance practices across any existing data environment enables more accurate AI results and a faster, more reliable development pipeline.

Full Transcript

# Governed Data Architecture for AI

**Source:** [https://www.youtube.com/watch?v=AtXqpveCWQU](https://www.youtube.com/watch?v=AtXqpveCWQU)
**Duration:** 00:11:57

## Sections

- [00:00:00](https://www.youtube.com/watch?v=AtXqpveCWQU&t=0s) **Data Governance for AI Development** - The speaker explains how aggregating, governing, and architecting high-quality data, using data lakes, fabrics, and management principles, reduces AI lifecycle cycle time, enables model experts to focus on modeling, and meets modern compliance and accuracy demands.
- [00:03:15](https://www.youtube.com/watch?v=AtXqpveCWQU&t=195s) **Pre-Ingestion Documentation and Automated Data Quality** - The speaker stresses documenting dataset relationships, timestamps, and retention policies before loading data, then automating standardized, tested ingestion to enforce data-quality controls in the data lake.
- [00:06:23](https://www.youtube.com/watch?v=AtXqpveCWQU&t=383s) **Ensuring Data Quality for AI** - The speaker emphasizes rigorous monitoring, change tracking, and proactive alerts during data ingestion to protect investment, control costs, and guarantee reliable, high-quality data for downstream AI development.
- [00:09:30](https://www.youtube.com/watch?v=AtXqpveCWQU&t=570s) **Tagging, Vectorizing, and Governance in Generative AI** - The speaker explains how to tag data before vectorizing for RAG, manage vector reuse through governance, and apply tagged data for fine-tuning LLMs such as with Instruct Lab.

## Full Transcript
0:00 I'm sure you're interested in developing more AI, and how are you going to do that? High-quality data. Let's talk about how we can aggregate and govern your data for AI development, and let's highlight the high-level architecture and guardrails for your data, for AI results.

0:16 A majority of the AI lifecycle really involves data collection and data gathering, as well as data cleaning. We want to reduce that cycle time so our professionals can focus on what they do best, which is working with models.

0:33 There's been a lot of information already put out there about AI development, as well as data lake architecture. Today we're going to be blending those two topics together and talking about the best data management technologies and how those can actually set you up for better AI development and more accurate AI results.

0:55 What we're going to be talking about today can be applied to any kind of data architecture you may be working with currently in your environment. This could be a data lake, a data fabric, any kind of data repository. A lot of what we're going to go through here are principles that can be applied to those architectures. A few years ago they might have just been a nice-to-have, and might not even have been worth the investment. But given the importance of AI today and the compliance needed around AI, these are now a must-have, and they're going to be important investments that we make in our data architectures to bring them up to the standards that we need.

1:36 So let's get started at the beginning. We have a data repository.
1:41 All data should be stored in a single, agreed-upon location where we all agree data is going to be collected, and from there it's going to be brought in from other sources. These other sources may be other data repositories that are brought in on batch, or maybe other systems of engagement that exist in the cloud and are brought over on an event trigger, so more on a streaming schedule.

2:10 No matter what kind of data or data behavior you have going into your data lake, you're going to make sure you're applying our first guardrail, which is standard organization. From here, we can make sure that all data that's coming in matches exactly what you expect.

2:35 And what do I mean by that? What data is this? Is it personal information? Is it sensitive personal information? Is it financial data? You want to define what that data is and why you're expecting it, so that as we bring it in, we're adhering to those standards.

2:54 So we want to make sure we're very clearly documenting this. Again, no matter what the source or the format, it has to go through this process. A few other recommendations for what you should record beyond just what the data is and who owns it: you should document exactly what makes a row unique, how data sets can be joined or merged together, and also which timestamp defines when the data was actually created, so we know when it can be removed as per our retention policies.

3:32 Knowing all of that before you actually bring anything into your data lake is essential and is going to save us so much time down the road. You'll see that this documentation phase really becomes the keystone for the rest of our process as we go on here.
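The "document before you ingest" guardrail described above can be sketched as a small per-dataset contract. This is a minimal illustration, not a standard schema: the field names, the example `payments` dataset, and the classification labels are all assumptions for demonstration.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetContract:
    """Documented expectations recorded before any data enters the lake."""
    name: str
    owner: str
    classification: str                 # e.g. "personal", "sensitive-personal", "financial"
    unique_key: list                    # columns that make a row unique
    join_keys: dict = field(default_factory=dict)   # other dataset -> shared columns
    event_time_column: str = "created_at"           # timestamp that defines creation
    retention_days: int = 365                       # drives removal per retention policy

# Hypothetical dataset registered with its contract up front:
contract = DatasetContract(
    name="payments",
    owner="finance-data-team",
    classification="financial",
    unique_key=["payment_id"],
    join_keys={"customers": ["customer_id"]},
    event_time_column="paid_at",
    retention_days=2555,  # roughly a seven-year retention policy
)

print(contract.classification)  # financial
```

The point is simply that uniqueness, join relationships, creation timestamps, and retention are written down as data, so later ingestion and cleanup steps can read them instead of guessing.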
3:48 Once we actually have that documentation (and that documentation can be kept in any way you want, any kind of code repository), you're then going to actually bring in this data, and that's going to come in at an ingestion layer. This is probably going to be your first area of enforcement of data quality, to make sure that as you're building your data lake, you are protecting it as much as you can.

4:18 What that really means is we're going to be automating all ingestions, basically all writes to your data lake. We want to make sure that all of the writes that are coming in are standardized, tested, and deployed in an automated fashion. It's not just the Wild West; we're not just letting anyone upload a spreadsheet and moving on with their day. They have to go through the standardized process, so that everything comes in how we expect it.

4:58 Not only does this become easier to manage and easier to monitor, it also allows us to easily link back to those data standards we wrote before and make sure they are actually being enforced on ingestion, so that nothing hits your data lake that you need to fix later. It's just coming in the right way at the beginning.

5:17 Once we actually have your data lake here, it needs to be stored in an efficient manner for AI. This is a bit different than a transactional database. Most storage technologies will involve document storage or object storage, so we can organize data in large pockets of information that can handle the data behavior we see from a data lake. That really is large queries that happen occasionally, very different than what you would see in a transactional database, for example.
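The automated-writes guardrail can be illustrated with a single gatekeeping ingest function that validates rows against the documented expectations before anything lands in the lake. The schema, column names, and in-memory "lake" below are stand-ins for whatever pipeline tooling you actually use.

```python
# Hypothetical expected schema, derived from the pre-ingestion documentation.
EXPECTED_SCHEMA = {"payment_id": str, "amount": float, "paid_at": str}

def validate_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row is accepted."""
    problems = []
    for column, col_type in EXPECTED_SCHEMA.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], col_type):
            problems.append(f"bad type for {column}: {type(row[column]).__name__}")
    return problems

def ingest(rows: list, lake: list) -> list:
    """The only write path into the lake: reject bad rows instead of fixing them later."""
    rejected = []
    for row in rows:
        if validate_row(row):
            rejected.append(row)
        else:
            lake.append(row)
    return rejected

lake = []
rejected = ingest(
    [
        {"payment_id": "p1", "amount": 9.99, "paid_at": "2024-01-01"},
        {"payment_id": "p2", "amount": "oops", "paid_at": "2024-01-02"},  # wrong type
    ],
    lake,
)
print(len(lake), len(rejected))  # 1 1
```

Because every writer goes through `ingest`, nothing reaches the lake without passing the same tested checks, which is exactly the "no spreadsheet uploads" point above.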
5:51 As we're storing this, we're then going to make sure we're enforcing that any changes get tracked while the data is in the data lake. Very rarely do we see something come in on ingestion in the perfect format that's ready for AI; usually there's some kind of post-processing. Maybe you're aggregating something that came in at a minute rate up to an hour, maybe you're adding in more computations or new calculations. Again, those are new writes, so those writes have to come under the same scrutiny as our second guardrail of automated writes. But also, we need to track any changes that occur, to make sure that we are tracking everything efficiently.

6:46 From there, we can guarantee that the data will be ready, because by the time you're ready for development, you've already made an investment. You've probably already had many meetings, you've gotten a budget approved, you have resources assigned. They're looking at the data; they're ready to build AI. You don't want to do all this work with ingestion and then have something happen to the data that results in data corruption, or just mystery data, or some kind of missing information that would cause more cycles down the line.

7:19 Also think of this as an investment. It costs the same amount of money to store poor-quality data as it does high-quality data. So by investing at the ingestion layer, and then maintaining that throughout the whole lifecycle so the data is always in that state and cannot fall out of that state without proactive alerting, you're basically protecting your investment. And this, again, is going to be an important cost control as well, to ensure quality results.

7:51 So lastly, we have the data-and-AI joint use case, where we actually are going to be using that data for AI.
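One way to picture the change-tracking idea is an audit log written alongside every post-processing write, so "mystery data" can never appear without a recorded author and reason. The log shape and field names here are illustrative assumptions, not a specific product's format.

```python
import datetime

audit_log = []

def tracked_write(lake: dict, key: str, value, author: str, reason: str):
    """Record who changed what, when, and why, alongside the write itself."""
    audit_log.append({
        "key": key,
        "old": lake.get(key),          # previous value, None if this is a new key
        "new": value,
        "author": author,
        "reason": reason,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    lake[key] = value

lake = {}
tracked_write(lake, "events_per_hour", 120, "etl-job-7",
              "aggregate minute-rate data up to hourly")
tracked_write(lake, "events_per_hour", 118, "etl-job-7",
              "recompute after late-arriving records")

print(len(audit_log))  # 2
```

Proactive alerting then becomes a query over this log (for example, flagging writes from unknown authors or unexpected value jumps) rather than a forensic hunt after a model misbehaves.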
8:01 This is where your data scientists and AI professionals come into your data lake, and they're going to be querying it to actually use that data. All the work that you've done before gets carried through here. What we're going to do is tag our data, to make sure that as we use it, we know exactly where it was used and what model it was combined with.

8:27 This is really the part of the organization where we can apply AI governance, to make sure we know exactly what data was combined with which model, and then what AI product was built out of that. That becomes the auditability we need to make better decisions about what data we're using for AI, but also so we can learn what data we're maybe not using, and how we can improve it further.

8:56 The tagging is going to depend a little bit on what kind of AI you're developing. For example, we have traditional AI. With traditional AI, this is the AI that we've been doing for decades: this is our regression, this is our optimization. We have our training and our testing set, and that's what we can actually bring in to train a model. We're going to make sure we're tagging that data and getting all this good documentation linked.

9:32 Now, for generative AI, you might be following something like a RAG pattern, because you probably aren't training a large language model from scratch. You're probably taking some kind of data out of your repository and vectorizing it, so we can enrich an existing LLM. What's important here is that you're going to tag your data before you vectorize it, because once you vectorize your data, it is hard to understand what was in that data before it was vectorized.
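The "tag before you vectorize" point can be sketched as attaching governance tags as metadata that travels next to the embedding, since the vector itself no longer reveals classification or lineage. The toy embedding below is a stand-in for a real embedding model, and all the field names are assumptions.

```python
def toy_embed(text: str) -> list:
    """Stand-in embedding: average character codes over thirds of the string."""
    thirds = [text[i::3] for i in range(3)]
    return [sum(map(ord, t)) / max(len(t), 1) for t in thirds]

def vectorize_with_tags(doc_id: str, text: str, tags: dict) -> dict:
    """Vectorize a document, keeping its governance tags alongside the vector."""
    return {"id": doc_id, "vector": toy_embed(text), "tags": tags}

record = vectorize_with_tags(
    "doc-42",
    "Refund policy: payments are refundable within 30 days.",
    {
        "source": "payments",            # which lake dataset this came from
        "classification": "financial",   # from the pre-ingestion documentation
        "used_by_model": "support-rag",  # which AI product consumed it
    },
)
print(record["tags"]["classification"])  # financial
```

With tags captured at this step, you can later answer "which model was combined with which data" directly from the vector store's metadata, which is the auditability the transcript describes.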
10:03 That's why all this pre-work is especially important with the use of generative AI. We also have to think about the fact that vectorizing is such a computationally heavy operation. It's something we don't want to do lightly, and you want to make sure you're vectorizing the right data first. This is also where we can apply governance again, so that we know what data we've already vectorized. You probably don't have to vectorize it again, and you can maybe find more opportunities for reuse within your organization.

10:38 And then lastly, along similar lines, we may want to fine-tune a model. This is along similar lines of the generative AI use case. We'll be tagging the data that is being used to enrich a large language model through a fine-tuning process, for example with an open source technology like Instruct Lab, so that those LLMs can be more educated on certain pieces of data that you would be adding to them.

11:10 We want to make sure that, through this whole process, everything not only leads to easier development, so there's less time needed for that data cleaning and for figuring out: where did this come from? Did I fill out the right paperwork? Is all the compliance okay? It allows all that work to be done upfront, so that you can move faster at the end.

11:34 So that's really the entire process here. It all, of course, starts with our documentation, comes through our ingestion layer, and then we enforce it through the whole process, making sure the data is tagged and stamped, including for our generative AI. By following all these data management principles, we can make sure that we're supporting more accurate and faster development.
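The vector-reuse point above can be sketched as a small registry keyed on content plus embedding model, so a vector you have already paid to compute is found instead of recomputed. The registry layout, key scheme, and model name are illustrative assumptions, not a specific vector database's API.

```python
import hashlib

registry = {}  # vector key -> previously computed vector

def vector_key(text: str, model: str) -> str:
    """Same content embedded with the same model yields the same key."""
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

def get_or_vectorize(text: str, model: str, embed_fn) -> tuple:
    """Return (vector, reused): reused is True when recomputation was skipped."""
    key = vector_key(text, model)
    if key in registry:
        return registry[key], True
    vector = embed_fn(text)     # the computationally heavy step
    registry[key] = vector
    return vector, False

fake_embed = lambda text: [float(len(text))]  # stand-in for a real embedding call
_, reused_first = get_or_vectorize("refund policy", "embed-v1", fake_embed)
_, reused_again = get_or_vectorize("refund policy", "embed-v1", fake_embed)
print(reused_first, reused_again)  # False True
```

Including the model name in the key matters: the same text embedded by a different model produces a different vector, so it must not be "reused" across models.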