
Building Governed Data Lakes for AI

Key Points

  • Data lakes serve as centralized repositories that ingest and store diverse data sources—streaming, batch, internal, and external—to enable powerful user and business insights.
  • A flexible ingestion framework standardizes and copies data into the lake, allowing analysts to work on the data without affecting the original sources.
  • Raw data typically requires extensive cleansing, preparation, and feature extraction before it can be used for advanced analytics or machine learning.
  • Each processing step generates new derived datasets that remain linked to the original data, which is essential for tracing impacts and updating models when source data changes.
  • Embedded governance captures metadata, enforces usage policies, and maintains lineage throughout the pipeline, ensuring data is used correctly and responsibly.

Full Transcript

# Building Governed Data Lakes for AI

**Source:** [https://www.youtube.com/watch?v=LxcH6z8TFpI](https://www.youtube.com/watch?v=LxcH6z8TFpI)
**Duration:** 00:05:15

## Sections

- [00:00:00](https://www.youtube.com/watch?v=LxcH6z8TFpI&t=0s) **Understanding Data Lakes and Ingestion** - Adam Kocoloski explains what data lakes are, how they consolidate diverse data sources via a common ingestion framework, and the preparation steps needed to enable intelligent applications.
- [00:03:11](https://www.youtube.com/watch?v=LxcH6z8TFpI&t=191s) **Data Lake Enables Intelligent Applications** - The passage outlines how a data lake supports creating dashboards, recommendation engines, and automated processes by following the AI ladder's four stages (collect, organize, analyze, and infuse), resulting in a continuous feedback loop of new data and smarter models.

## Full Transcript
0:00 Hi everyone, my name's Adam Kocoloski with IBM Cloud, and I'm here to talk to you today about data lakes: what they are, how you use one, and the kinds of things you ought to be thinking about as you set one up to power your applications and create more intelligent experiences for users.

0:14 Data lakes exist because we're all awash with data. We've got systems of record, systems of engagement, streaming data, batch data, internal and external data, and it's really the combination of these different kinds of data sources that leads us to powerful insights about what our users are doing and about the way the world works around us, and lets us develop more intelligent applications.

0:38 Data lakes start by collecting all those different types of data sources through a common ingestion framework. That ingestion framework typically needs to support a diverse array of data types, and it standardizes and centralizes everything into a common storage repository. That's not always required, but typically you don't want to analyze the source data directly; you want to take a copy of it, so that you have the flexibility to do the kinds of things you need to do with that data.

1:07 And speaking of that, the data typically doesn't come in a form you can use right out of the box. There's a lot of data cleansing and data preparation required. Oftentimes there's the ability, or the requirement, to create new features, something we call feature extraction: combinations of different types of data that need to be pulled together in order to create the right bits of information to analyze.
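The ingest-then-prepare flow described above can be sketched in a few lines of Python. Everything here is illustrative (the `Record` type, field names like `total_spend`, and the `spend_per_visit` feature are invented for the example, not an IBM Cloud API):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Record:
    """One row copied into the lake, tagged with its origin."""
    source: str
    payload: dict[str, Any]

def ingest(raw_rows: list[dict], source: str, lake: list[Record]) -> None:
    """Copy rows into the lake so analysts never touch the source system."""
    for row in raw_rows:
        lake.append(Record(source=source, payload=dict(row)))  # defensive copy

def extract_features(rec: Record) -> dict[str, float]:
    """Feature extraction: combine raw fields into a derived value."""
    p = rec.payload
    return {"spend_per_visit": p["total_spend"] / max(p["visits"], 1)}

lake: list[Record] = []
ingest([{"user": "a", "total_spend": 120.0, "visits": 4}], "orders_db", lake)
print(extract_features(lake[0]))  # {'spend_per_visit': 30.0}
```

The copy-on-ingest step is what gives analysts freedom to reshape the data without risk to the originating system.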
1:40 And once you cleanse the data, prep the data, and model the right kind of features for your analysis, then you get to the fun part, which is actually going in and doing the machine learning model training and your advanced analytics.

1:56 Each of these steps is typically creating new derived data sets that tie back to the original one. And that relationship is a really important thing to capture, because, let's say, there was a problem with one of your data sources and a correction needed to be made. You need to understand how that flows through the entire pipeline of more refined data sets and models you're producing, so that you can go back and correct it.

2:20 And that's where this governance stuff comes into play. This is something that's really infused at every step of the journey. It means collecting metadata, data about your data: the right kinds of information about the tables in your data sets and how they relate to one another. It means being able to enforce policies so that, as an organization, we use the data the way it's meant to be used, the way it's intended to be used, the way it's acceptable to be used to drive the business forward. That's really something that can't be bolted on after the fact; it has to be present throughout the entire life cycle.

2:53 If we stop here, we haven't really changed anything. It's only by getting these insights we're producing in this data lake back out into the real world that we're able to deliver on the business promise of these data lakes we're all investing in, and that's where this apply step comes in.

3:11 This can take a few different forms. You might simply be building dashboards that help business executives make smarter decisions about where to take the business forward, with new projects to invest in.
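The derived-dataset relationship described above amounts to a small lineage graph. A minimal sketch, assuming a parent map like the one a metadata catalog would maintain (the dataset names here are hypothetical):

```python
# Each derived dataset or model records the datasets it was built from.
parents = {
    "cleaned_orders": ["raw_orders"],
    "order_features": ["cleaned_orders"],
    "churn_model_v1": ["order_features", "cleaned_users"],
}

def downstream_of(source: str) -> set[str]:
    """Return every dataset/model that transitively depends on `source`."""
    hits: set[str] = set()
    changed = True
    while changed:  # keep sweeping until no new dependents are found
        changed = False
        for child, deps in parents.items():
            if child not in hits and (source in deps or hits & set(deps)):
                hits.add(child)
                changed = True
    return hits

print(sorted(downstream_of("raw_orders")))
# ['churn_model_v1', 'cleaned_orders', 'order_features']
```

This is exactly the query you run when a source needs a correction: everything `downstream_of` the fixed dataset must be rebuilt or retrained.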
3:25 Or you might be building smarter applications that are able to make intelligent recommendations to the users of those apps based on historical purchase data.

3:38 Increasingly we're also seeing a lot of process automation, where an intelligent model can smooth over some typically manual business processes and create a more intelligent experience based on a rich, data-driven understanding of the problem at hand.

3:58 And really, this whole process iterates back, right? Those more intelligent applications end up generating new data, and the cycle continues. And so that, in a nutshell, at a very high level, is what a data lake does.

4:12 Some of you may have heard us talk about "the ladder to AI", the "AI ladder". We talk about collecting data. We talk about organizing data. We talk about analyzing. And we talk about infusing. And really, those four steps on this ladder are things you can see represented throughout this data lake environment. Clearly, over here we're doing a lot of collection of these individual sources of data. This data preparation and feature extraction step, done in a governed fashion, is absolutely what we mean by the organizing of data. ML model training is a key example of data analysis. And infusing the insights from the data lake into the applications is really this last step here. And so there is very much a clear linkage between climbing this AI ladder and a data lake as a vehicle that can help you make that journey.

5:11 Thanks for watching. If you have any questions or comments, please drop us a line below. If you enjoyed this content, please consider liking or subscribing. Thank you.
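The four rungs of the AI ladder can be lined up as a toy pipeline. The stage functions below are placeholders invented for illustration (the "analysis" is just an average standing in for model training), showing only how each rung feeds the next:

```python
def collect(sources):
    """Collect: gather raw rows from every source into one stream."""
    return [row for src in sources for row in src]

def organize(rows):
    """Organize: cleanse and govern, dropping rows missing required fields."""
    return [r for r in rows if "user" in r and "value" in r]

def analyze(rows):
    """Analyze: stand-in for ML training, here a plain average."""
    return sum(r["value"] for r in rows) / len(rows)

def infuse(threshold):
    """Infuse: push the insight back into an application-facing rule."""
    return f"recommend if score > {threshold:.1f}"

sources = [
    [{"user": "a", "value": 2.0}],
    [{"bad": True}, {"user": "b", "value": 4.0}],  # one malformed row
]
print(infuse(analyze(organize(collect(sources)))))
# recommend if score > 3.0
```

The feedback loop from the transcript is the part this sketch omits: in practice the infused application emits new events that land back in `collect`.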