Data Lake Persistence and Ingestion Overview

Key Points

  • The core of a cloud‑based data lake is persistent storage of the raw data, its indexes, and catalog metadata in object storage.
  • Existing data from relational, NoSQL, or other operational databases is brought into the lake primarily via batch ETL (SQL‑as‑a‑service) followed by replication of change feeds for ongoing updates.
  • Real‑time sources such as IoT devices, connected cars, and application/service logs are streamed continuously into the lake, where they are stored in object storage for later analysis.
  • On‑premises data—whether on local disks, legacy Hadoop clusters, or traditional data lakes—can be migrated to the cloud to unify storage and leverage the lake’s persistent, indexed, and cataloged architecture.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=IPkQpBdde5Y](https://www.youtube.com/watch?v=IPkQpBdde5Y)
**Duration:** 00:14:32

## Sections

- [00:00:00](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=0s) **Persisting and Ingesting Cloud Data Lakes** - Torsten Steinbach explains that cloud data lakes store data, indexes, and catalog metadata in object storage, and acquire data via batch ETL (SQL-as-a-service) and replication mechanisms.
- [00:03:20](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=200s) **Streaming and Uploading Logs to the Cloud** - The speaker highlights the necessity of streaming logs and providing efficient upload mechanisms to transfer on-premises data, including raw, volatile device and application data, into cloud object storage for subsequent processing.
- [00:06:31](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=391s) **Leveraging FaaS and SQL for Data Pipelines** - The speaker outlines using a data catalog to track assets, employing Functions-as-a-Service and SQL-as-a-service to transform, index, and store data, and then delivering business insights via BI reporting and dashboards.
- [00:09:44](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=584s) **SQL-as-a-Service Data Pipeline** - The passage outlines how cloud SQL-as-a-service can ETL data from a prepared data lake into a traditional data warehouse for BI reporting, while also allowing advanced analytics and machine-learning tools to directly access the lake's object storage for end-to-end insight generation.
- [00:12:50](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=770s) **Data Lake Governance and Automation** - The speaker stresses the importance of tracking data lineage, enforcing governance policies and access controls (including anonymization), and using cloud Functions-as-a-Service to automate and operationalize the entire data-lake pipeline.

## Full Transcript
0:00 Hello, this is Torsten Steinbach, an architect at IBM for Data and Analytics in the cloud, and I'm going to talk to you about data lakes in the cloud. The center of a data lake in the cloud is the data persistency itself. So, we talk about persistency of data, and the data itself in the data lake in the cloud is persisted in object storage. But we don't just persist the data itself, we also persist information about the data, which is, on one side, information about indexes. So, we need to index the data so that we can make use of this data in the cloud data lake efficiently. And we also need to store metadata about the data in the catalog. So, this is our persistency of the data lake.

1:02 Now the question is: how do we get this data into the data lake? There are different types of data that we can ingest, so we need to talk about ingestion of data. We can have a situation where some of your data is already persisted in databases. These can be relational databases, and can also be other operational databases, NoSQL databases, and so on. We get this data into a data lake via two fundamental mechanisms. One is basically ETL, which stands for "Extract-Transform-Load", and this is done in a batch fashion. The typical mechanism to do ETL is using SQL, and since we're talking about cloud data lakes, this is "SQL-as-a-service" now. But there is also, in addition, and often you combine those things, the mechanism of replication, which basically covers the change feeds. So after you may have done a batch ETL on the initial data set, we talk about how you replicate all of the changes that come in after this initial batch ETL.

2:24 Next, we may have data that has not been persisted yet at all, which is generated as we are speaking here, for instance, from devices.
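The batch-ETL-plus-replication flow just described could be sketched as follows. This is a minimal illustration, not IBM's implementation: `sqlite3` stands in for a cloud SQL-as-a-service engine, and all table names, columns, and the change-feed format are invented for the example.

```python
import sqlite3

# sqlite3 stands in for SQL-as-a-service; tables and columns are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL)")
db.execute("CREATE TABLE lake_orders (id INTEGER PRIMARY KEY, amount REAL)")
db.executemany("INSERT INTO source_orders VALUES (?, ?)",
               [(1, 10.0), (2, 25.5), (3, 7.25)])

# Step 1: initial batch ETL - copy the full data set into the lake table.
db.execute("INSERT INTO lake_orders SELECT id, amount FROM source_orders")

# Step 2: replication - apply the change feed of inserts and updates that
# arrived after the initial batch load.
change_feed = [("insert", 4, 99.0), ("update", 2, 30.0)]
for op, oid, amount in change_feed:
    if op == "insert":
        db.execute("INSERT INTO lake_orders VALUES (?, ?)", (oid, amount))
    else:
        db.execute("UPDATE lake_orders SET amount = ? WHERE id = ?",
                   (amount, oid))

rows = db.execute("SELECT id, amount FROM lake_orders ORDER BY id").fetchall()
print(rows)  # [(1, 10.0), (2, 30.0), (3, 7.25), (4, 99.0)]
```

In a real lake the target would be files in object storage rather than a table, but the two-phase pattern (bulk copy first, then a continuous stream of deltas) is the same.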
2:34 So, we may have things like IoT devices, connected cars, and the like. They are actually producing a lot of IoT messages, all the time, continuously, and these also need to basically land in, and stream into, the data lake. So, here we're talking about a streaming mechanism.

3:06 In a very similar manner, we also have data that originates from applications that are running in the cloud, or from services that are used by your applications. They're all producing logs, and that's very valuable information, especially if you're talking about operational optimizations and getting business insights from your user behavior, and these kinds of things. This is very important data that we need to get hold of. So, we're talking about logs, and these also need a streaming mechanism to basically get streamed and stored in object storage.

3:49 And finally, you may have a situation where you already have data sitting around on local disks. So, you may have local disks, maybe on your own machine. You may even have a local data lake, a classical data lake, not in a cloud; typically these are Hadoop clusters that you have on-premises in your enterprise. Or it can be as simple as, and you find this very frequently, NFS shares that are used in your team, in your enterprise, to store certain data. And if you want to basically get them into a data lake, you also need a mechanism, and it's basically an upload mechanism. So, a data lake needs to provide you with an efficient mechanism to upload data from the ground, on-premises, to the cloud, into the object storage.

4:51 Now, the next thing we need to do, once the data is here, is process it. We need to process the data.
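The streaming ingestion the speaker describes is often implemented by buffering incoming messages and flushing each micro-batch as one object. Here is a toy sketch under that assumption; the object store is simulated with a dict, the `StreamIngestor` class and the `logs/batch-*.json` key layout are invented, and a real system would use an object-storage client and a managed streaming service.

```python
import json

class StreamIngestor:
    """Toy sketch: buffer streamed messages and flush each micro-batch as
    one object. The store is a plain dict standing in for object storage."""

    def __init__(self, store, batch_size=3):
        self.store = store
        self.batch_size = batch_size
        self.buffer = []
        self.batch_no = 0

    def ingest(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        key = f"logs/batch-{self.batch_no:05d}.json"  # hypothetical key layout
        self.store[key] = json.dumps(self.buffer)
        self.buffer = []
        self.batch_no += 1

store = {}
ing = StreamIngestor(store, batch_size=2)
for i in range(5):
    ing.ingest({"device": "car-42", "reading": i})
ing.flush()  # flush the final, partially filled batch
print(sorted(store))
# ['logs/batch-00000.json', 'logs/batch-00001.json', 'logs/batch-00002.json']
```

Micro-batching like this is what makes high-volume device and log streams cheap to land in object storage: one object per batch instead of one object per message.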
5:02 This is especially important if you're talking about data that hasn't gone through an initial processing, like for instance device data or application data. This is pretty raw data that has a very raw format, that is very volatile, that has very different structures and changing schemas, and sometimes it doesn't have a real structure at all. It can be binary data, let's say images that are being taken by a device's cameras, and you need to extract features from that. So, we're talking about feature extraction from this data.

5:47 But even once you have the structure extracted, the data might still need a lot of cleansing: you may have to basically normalize it to certain units, you may have to round it to certain time boundaries, get rid of null values, and these kinds of things. So, there are a lot of things that you need to do about transformation; you need to transform the data.

6:10 Once you have transformed the data, you basically now have data that you can potentially use for other analytics, but one additional thing is advisable that you should do with this data: you should basically create indexes. So, you should index this data so that we know more about the data and can do performant analytics. And finally, you also have a catalog, and you need to leverage it: you need to tell the data lake about this data by cataloging it.

6:43 So, there are multiple steps, and often we talk about the pipeline of data transformations that need to be done here. Now the question is: what do we use here? There are actually two mechanisms, two services or types of services, that are especially suited for this type of processing. One is Functions-as-a-Service (FaaS) and the other one is SQL-as-a-service again.
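The cleansing steps just listed (unit normalization, rounding to time boundaries, dropping null values) could be sketched as a single transformation function. The record shape, field names, and thresholds here are invented for illustration; a real pipeline would run this logic inside a cloud function or a SQL statement.

```python
from datetime import datetime

def clean_record(rec):
    """Toy cleansing step: drop records with null readings, normalize the
    unit (Fahrenheit to Celsius), and round timestamps down to the minute.
    Field names ('ts', 'temp_f') are invented for the example."""
    if rec.get("temp_f") is None:
        return None  # get rid of null values
    ts = datetime.fromisoformat(rec["ts"]).replace(second=0, microsecond=0)
    return {
        "ts": ts.isoformat(),                              # time boundary
        "temp_c": round((rec["temp_f"] - 32) * 5 / 9, 2),  # unit normalization
    }

raw = [
    {"ts": "2023-05-01T10:15:42", "temp_f": 68.0},
    {"ts": "2023-05-01T10:16:07", "temp_f": None},  # null reading, dropped
]
cleaned = [r for r in (clean_record(x) for x in raw) if r is not None]
print(cleaned)  # [{'ts': '2023-05-01T10:15:00', 'temp_c': 20.0}]
```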
7:24 So, with SQL and Functions-as-a-Service you can do this whole range of things here: you can basically create indexes through SQL DDLs, you can also create tables through SQL DDLs, and you can transform data. You can use functions with custom libraries and custom code to do feature extraction from the format of the data that you need to process.

7:47 Once we have gone through this pipeline, the question is: what's next now? We have prepared, we have processed all of this data, and we have probably cataloged it, so we know what data we have. Now it comes to the point where we really harvest all of this work by basically generating insights.

8:15 Generating insights is, on one side, the whole group of business intelligence, which consists of things like doing reporting or creating dashboards, and that's what's typically referred to as BI (Business Intelligence). One option that is possible now is to simply do BI directly against this data in the data lake. But, actually, it turns out that this is mainly useful, or an option, for batch workloads, like creating reports in a batch fashion. Because when it comes to more interactive requirements, where you're basically sitting in front of the screen and you need to refresh, let's say, a dashboard in a subsecond, there is actually another very important mechanism that is very well established and is part of this whole data lake ecosystem, and this is a data warehouse.

9:32 So, a data warehouse, or a database more generally, is highly optimized and has a lot of mechanisms for giving you low latency and also guaranteed response times for your queries. So, the question is: how do we do that? Now, we obviously need to move this data one step further, after it has gone through all of the data preparation in the data lake, with an ETL again.
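Creating tables and indexes through SQL DDLs, as mentioned above, might look like this. Again `sqlite3` is only a stand-in for a cloud SQL-as-a-service engine, and the table and index names are invented.

```python
import sqlite3

# Sketch of the DDL step: create a table and an index through SQL.
# sqlite3 stands in for SQL-as-a-service; all names are invented.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE sensor_events (
        device_id TEXT,
        event_ts  TEXT,
        temp_c    REAL
    )
""")
# An index on the common query key is what later makes analytics performant.
db.execute(
    "CREATE INDEX idx_events_device ON sensor_events (device_id, event_ts)")

# Inspect the schema objects the DDL created.
names = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master ORDER BY name")]
print(names)  # ['idx_events_device', 'sensor_events']
```

In a cloud lake the "index" is typically metadata over object-storage files (for example, min/max statistics per file) rather than a B-tree, but it is created and maintained through the same kind of SQL statements.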
10:01 And it happens to be, again, that SQL-as-a-service is a useful mechanism, because it's a service we have available in the cloud: we already used it to ETL data into the data lake, and now we can also use it to ETL data out of this data lake into a data warehouse. So that it's now in this, I would say, more traditional, established stack of doing BI, where it can be used by your BI tools, reporting tools, and dashboarding tools to do interactive BI with performance and response-time SLAs.

10:42 So, that's one end-to-end flow now. But, very obviously, insight is more than just doing reporting and dashboarding. There's a whole domain of tools and frameworks out there for more advanced types of analytics, such as machine learning, or simply using data science tools and frameworks, that can also do analytics and AI, artificial intelligence, against the data that we've prepared and cataloged here. And machine learning tools and data science tools basically all have very strong support for accessing data in object storage. So, that's why it's a good fit to basically let them connect directly here to this data lake.

11:50 Now, that is the end-to-end process: basically getting from your data, with the help of a data lake, to insights. One of the big problems today, for people who do that, is to prove and explain how they got to this insight. How can you trust this insight? How can you reproduce this insight? So, one of the key things that needs to be part of this picture is data governance.

12:24 Data governance, in this context, has two main things that we need to take care of. One is that we need to be able to track the lineage of data, because, as you've seen, the data is traveling from different sources, through preparation, into some insight in the form of a report.
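The second ETL leg, out of the lake and into a warehouse, is again just SQL. A minimal sketch, with `sqlite3` standing in for both the lake and the warehouse, and invented table names; the point is that warehouse tables are often pre-aggregated so interactive dashboards can hit them with subsecond latency.

```python
import sqlite3

# sqlite3 stands in for SQL-as-a-service; lake and warehouse tables are
# invented for the example.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lake_events (device_id TEXT, temp_c REAL)")
db.executemany("INSERT INTO lake_events VALUES (?, ?)",
               [("a", 20.0), ("a", 22.0), ("b", 18.0)])

# ETL out of the lake: build a pre-aggregated warehouse table that BI
# dashboards can query with low latency.
db.execute("""
    CREATE TABLE warehouse_device_avg AS
    SELECT device_id, AVG(temp_c) AS avg_temp
    FROM lake_events
    GROUP BY device_id
""")

rows = db.execute(
    "SELECT device_id, avg_temp FROM warehouse_device_avg ORDER BY device_id"
).fetchall()
print(rows)  # [('a', 21.0), ('b', 18.0)]
```

Machine-learning and data-science tools, by contrast, would skip this step and read the prepared files straight out of object storage, as the speaker notes.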
12:50 And you always need to be able to track back: where did this report come from? Why does it look like this? What's the data that basically produced it? The other thing is that you, or rather the data lake, need to be able to enforce policies, governance policies. Who is able to access what? Who is able to see personal information, and can I access it directly, or only in an anonymized and masked form? So, these are all governance rules, and there are governance services available, also in the cloud, that a data lake basically needs to apply and use in order to track all of this.

13:38 So, we're almost done with this overall data lake introduction, but there is just one more thing that I want to highlight, and this is, since we're talking about the cloud: in the cloud, how can I deploy my entire pipeline of data traveling through this whole infrastructure, and how can I automate that? And here, basically, Functions-as-a-Service plays a special role, because Functions-as-a-Service has a lot of mechanisms that I can use to schedule and automate things like, for instance, a batch ETL step, or generating a report. So, this is the final thing that we need in our data lake in order to automate and operationalize, eventually, the entire data and analytics using a data lake.

14:31 Thank you very much.
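The masking and anonymization rules mentioned above can be thought of as a policy applied to every record before it leaves the lake. A toy sketch under that assumption; the record fields, the policy format, and the `mask_record` helper are all invented, and real cloud governance services enforce far richer rules than this.

```python
def mask_record(rec, policy):
    """Toy governance sketch: anonymize fields named in the policy before
    data is handed to a consumer; everything else passes through unchanged.
    Field names and the policy format are invented for the example."""
    out = {}
    for field, value in rec.items():
        if field in policy["mask_fields"]:
            out[field] = "***"  # masked / anonymized form
        else:
            out[field] = value
    return out

policy = {"mask_fields": {"email", "name"}}
rec = {"name": "Ada", "email": "ada@example.com", "purchases": 3}
masked = mask_record(rec, policy)
print(masked)  # {'name': '***', 'email': '***', 'purchases': 3}
```

Which fields get masked would depend on who is asking, which is exactly the "who is able to see personal information" question the speaker raises; lineage tracking then records that the report was built from the masked view, not the raw data.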