Data Lake Persistence and Ingestion Overview
Key Points
- The core of a cloud‑based data lake is persistent storage of the raw data, its indexes, and catalog metadata in object storage.
- Existing data from relational, NoSQL, or other operational databases is brought into the lake primarily via batch ETL (SQL‑as‑a‑service) followed by replication of change feeds for ongoing updates.
- Real‑time sources such as IoT devices, connected cars, and application/service logs are streamed continuously into the lake, where they are stored in object storage for later analysis.
- On‑premises data—whether on local disks, legacy Hadoop clusters, or traditional data lakes—can be migrated to the cloud to unify storage and leverage the lake’s persistent, indexed, and cataloged architecture.
Sections
- Persisting and Ingesting Cloud Data Lakes - Torsten Steinbach explains that cloud data lakes store data, indexes, and catalog metadata in object storage, and acquire data via batch ETL (SQL‑as‑a‑service) and replication mechanisms.
- Streaming and Uploading Logs to Cloud - The speaker highlights the necessity of streaming logs and providing efficient upload mechanisms to transfer on‑premises data—including raw, volatile device and application data—into cloud object storage for subsequent processing.
- Leveraging FaaS and SQL for Data Pipelines - The speaker outlines using a data catalog to track assets, employing Functions‑as‑a‑Service and SQL‑as‑a‑Service to transform, index, and store data, and then delivering business insights via BI reporting and dashboards.
- SQL-as-a-Service Data Pipeline - The passage outlines how cloud SQL‑as‑a‑service can ETL data from a prepared data lake into a traditional data warehouse for BI reporting, while also allowing advanced analytics and machine‑learning tools to directly access the lake’s object storage for end‑to‑end insight generation.
- Data Lake Governance and Automation - The speaker stresses the importance of tracking data lineage, enforcing governance policies and access controls (including anonymization), and using cloud function‑as‑a‑service to automate and operationalize the entire data‑lake pipeline.
**Source:** [https://www.youtube.com/watch?v=IPkQpBdde5Y](https://www.youtube.com/watch?v=IPkQpBdde5Y)
**Duration:** 00:14:32
Timestamps
- [00:00:00](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=0s) Persisting and Ingesting Cloud Data Lakes
- [00:03:20](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=200s) Streaming and Uploading Logs to Cloud
- [00:06:31](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=391s) Leveraging FaaS and SQL for Data Pipelines
- [00:09:44](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=584s) SQL-as-a-Service Data Pipeline
- [00:12:50](https://www.youtube.com/watch?v=IPkQpBdde5Y&t=770s) Data Lake Governance and Automation
Full Transcript
Hello, this is Torsten Steinbach,
an architect at IBM for Data and Analytics in the cloud
and I'm going to talk to you about data lakes in the cloud.
The center of a data lake in the cloud is the data persistency itself.
So, we talk about persistency of data,
and the data itself in the data lake in the cloud is persisted in object storage.
But we don't just persist the data itself,
we also persist information about the data,
which is, on one side, information about indexes.
So, we need to index the data
so that we can make use of this data in the cloud data lake efficiently.
And we also need to store metadata about the data in the catalog.
So, this is our persistency of the data lake.
Now the question is: how do we get this data into the data lake?
So, there are different types of data that we can ingest,
so we need to talk about ingestion of data,
and we can have a situation where some of your data is already persisted in databases.
So, these can be relational databases and can also be other operational databases, NoSQL databases and so on.
And we get this data into a data lake,
actually via 2 fundamental mechanisms.
One is basically an ETL,
which stands for "Extract-Transform-Load",
and this is done in a batch fashion.
And the typical mechanism to do ETL is using SQL,
and since we're talking about cloud data lakes, this is "SQL-as-a-service" now.
But there's also, in addition, and often you combine those things,
the mechanism of replication
which is basically more of the change feeds
so after you may have done a batch ETL on the initial data set,
we talk about, "how do you replicate all of the changes that come in after this initial batch ETL?"
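The combination of an initial batch ETL and a later change feed can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the in-memory dictionary standing in for the lake, the `(operation, key, row)` change format, and the function names `batch_load` and `apply_changes` are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]
Change = Tuple[str, int, Row]  # (operation, primary key, row payload) - hypothetical format


def batch_load(source_rows: Iterable[Row]) -> Dict[int, Row]:
    """Initial batch ETL: copy every existing row into the lake copy."""
    return {int(row["id"]): dict(row) for row in source_rows}


def apply_changes(lake: Dict[int, Row], changes: Iterable[Change]) -> None:
    """Replication: replay changes that arrived after the batch load."""
    for op, key, row in changes:
        if op == "delete":
            lake.pop(key, None)
        else:  # "insert" and "update" are both treated as upserts
            lake[key] = dict(row)


# Step 1: batch ETL of the data that already exists in the source database.
source = [{"id": 1, "city": "Berlin"}, {"id": 2, "city": "Paris"}]
lake = batch_load(source)

# Step 2: replicate the changes that came in after the initial load.
apply_changes(lake, [
    ("update", 1, {"id": 1, "city": "Munich"}),
    ("insert", 3, {"id": 3, "city": "Rome"}),
    ("delete", 2, {}),
])
print(lake)
```

Treating inserts and updates as the same upsert keeps the replay idempotent, so the same change can be applied twice without corrupting the copy.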
Next, we may have data that has not been persisted yet at all
which is generated as we are speaking here, for instance, from devices.
So, we may have things like IoT devices, driving cars, and the like.
And they are actually producing a lot of IoT messages
- all the time, continuously,
and they also need to basically land in, and stream in, to the data lake.
So, here we're talking about a streaming mechanism.
In a very similar manner, we have data that originates from applications
that are running in the cloud
or services that are used by your applications.
They're all producing logs
and that's very valuable information, especially if you're talking about
operational optimizations and getting business insights into your user behavior, and these kinds of things.
This is very important data that we need to get hold of.
So, we're talking about logs
and these also need a streaming mechanism
to basically get streamed and stored in object storage.
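One way to picture streamed messages landing in object storage is micro-batching: buffer incoming messages and write each full buffer as one object. The sketch below assumes this technique; the talk does not say how the streaming mechanism works internally. The `MicroBatchWriter` class, the key layout, and the dictionary that stands in for a bucket are all hypothetical.

```python
import json
from typing import Dict, List


class MicroBatchWriter:
    """Buffers streamed messages and writes them to storage in batches."""

    def __init__(self, store: Dict[str, bytes], prefix: str, batch_size: int) -> None:
        self.store = store            # stand-in for an object storage bucket
        self.prefix = prefix
        self.batch_size = batch_size
        self.buffer: List[dict] = []
        self.sequence = 0

    def write(self, message: dict) -> None:
        """Accept one message and flush once the buffer is full."""
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write the buffered messages as one newline-delimited JSON object."""
        if not self.buffer:
            return
        key = f"{self.prefix}/part-{self.sequence:05d}.jsonl"
        body = "\n".join(json.dumps(m) for m in self.buffer)
        self.store[key] = body.encode("utf-8")
        self.buffer = []
        self.sequence += 1


bucket: Dict[str, bytes] = {}
writer = MicroBatchWriter(bucket, "logs/app", batch_size=2)
for i in range(5):
    writer.write({"event": "click", "n": i})
writer.flush()  # write the last, partially filled batch
print(sorted(bucket))
```

Batching trades a little latency for far fewer, larger objects, which is usually easier to query later than one object per message.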
And finally,
you may have a situation where you already have data sitting around on local disks.
So, you may have local disks, maybe on your own machine.
You may have even a local data lake, a classical data lake, not in a cloud
and typically these are Hadoop clusters
that you have on-premises in your enterprise,
or it can be as simple as
- and you find this very frequently - just NFS shares that are used in your team, in your enterprise, to store certain data.
And if you want to basically get them to a data lake,
you also need a mechanism,
and it's basically an upload mechanism.
So, a data lake needs to provide you with
an efficient mechanism to upload data
from ground, on-premises, to the cloud, into the object storage.
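The talk does not say what makes an upload mechanism efficient. One common approach, assumed here purely for illustration, is to split a large file into fixed-size parts so that parts can be sent or retried independently. The helper names `read_parts` and `upload_in_parts` and the dictionary bucket are hypothetical; a real object storage client would replace them.

```python
import io
from typing import BinaryIO, Dict, Iterator, List


def read_parts(stream: BinaryIO, part_size: int) -> Iterator[bytes]:
    """Yield the stream's content in chunks of at most part_size bytes."""
    while True:
        part = stream.read(part_size)
        if not part:
            break
        yield part


def upload_in_parts(stream: BinaryIO, store: Dict[str, bytes], key: str, part_size: int) -> int:
    """Upload a stream part by part and return the number of parts sent."""
    parts: List[bytes] = []
    for part in read_parts(stream, part_size):
        # A real client could send each part in parallel or retry it on failure.
        parts.append(part)
    store[key] = b"".join(parts)  # stand-in for completing the upload
    return len(parts)


bucket: Dict[str, bytes] = {}
local_file = io.BytesIO(b"abcdefghij")  # stand-in for a file on a local disk or NFS share
part_count = upload_in_parts(local_file, bucket, "uploads/data.bin", part_size=4)
print(part_count, bucket["uploads/data.bin"])
```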
Now, the next thing we need to do, once the data is here, is process it.
We need to process the data.
This is especially important if you're talking about data that hasn't gone through an initial processing,
like for instance device data, application data, this is pretty raw data
that has a very raw format, that is very volatile,
that has very different structures, changing schema,
and sometimes it doesn't have a real structure,
it can be binary data - let's say, images that are being taken by
a device's cameras and you need to extract features from that.
So, we're talking about feature extraction from this data.
But even if you have structure extracted already, it might still need a lot of cleansing,
you may have to basically normalize it to certain units,
you may have to round it up to certain time boundaries,
get rid of null values, and these kinds of things.
So, there's a lot of things that you need to do about transformation,
you need to transform the data.
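The cleansing steps just listed (normalizing units, aligning to time boundaries, dropping null values) can be sketched in a few lines. The record layout with `ts`, `temp`, and `unit` fields is an assumption for the example, and the timestamps are aligned down to the start of the minute, which is only one possible choice of boundary.

```python
from datetime import datetime
from typing import Any, Dict, List


def fahrenheit_to_celsius(value: float) -> float:
    """Normalize a Fahrenheit reading to Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def align_to_minute(ts: datetime) -> datetime:
    """Align a timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


def clean(readings: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop null readings, normalize units, and align timestamps."""
    cleaned = []
    for reading in readings:
        if reading["temp"] is None:
            continue  # get rid of null values
        temp = reading["temp"]
        if reading["unit"] == "F":
            temp = fahrenheit_to_celsius(temp)
        cleaned.append({"ts": align_to_minute(reading["ts"]), "temp_c": round(temp, 2)})
    return cleaned


raw = [
    {"ts": datetime(2024, 1, 1, 12, 0, 42), "temp": 212.0, "unit": "F"},
    {"ts": datetime(2024, 1, 1, 12, 1, 5), "temp": None, "unit": "C"},
    {"ts": datetime(2024, 1, 1, 12, 2, 59), "temp": 21.5, "unit": "C"},
]
cleaned = clean(raw)
print(cleaned)
```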
Once you have transformed the data,
basically you now have the data that
you can potentially now use for other analytics,
but one additional thing is advisable that you should do with this data:
you should basically create indexes.
So, you should index this data
so that we know more about the data
and can do performant analytics.
And finally, you have a catalog, and you need to leverage it:
you need to tell the data lake about this data by cataloging it.
So, there are multiple steps
and often we talk about the pipeline of data transformations
that need to be done here.
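To make the indexing and cataloging steps concrete, here is a small sketch of one kind of index: a min/max summary per stored object, which lets a query skip objects that cannot contain matching values. The talk does not name a specific index type, so min/max indexing is an assumption, as are the object keys and the shape of the `catalog` entry.

```python
from typing import Dict, List, Tuple

# Stand-in for three data objects in object storage, each holding one column of values.
objects: Dict[str, List[int]] = {
    "readings/part-0": [3, 7, 9],
    "readings/part-1": [12, 15, 18],
    "readings/part-2": [21, 25, 30],
}


def build_min_max_index(objs: Dict[str, List[int]]) -> Dict[str, Tuple[int, int]]:
    """Record the smallest and largest value stored in each object."""
    return {key: (min(values), max(values)) for key, values in objs.items()}


def objects_to_scan(index: Dict[str, Tuple[int, int]], low: int, high: int) -> List[str]:
    """Return only the objects whose value range overlaps [low, high]."""
    return [key for key, (lo, hi) in index.items() if hi >= low and lo <= high]


index = build_min_max_index(objects)

# Hypothetical catalog entry: tells the lake which objects and index belong to a table.
catalog = {
    "readings": {
        "columns": ["value"],
        "objects": sorted(objects),
        "index": "min_max(value)",
    }
}

print(objects_to_scan(index, 10, 20))
```

With the index in place, a query for values between 10 and 20 reads one object instead of three.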
Now the question is what do we use here?
And there are actually two processes, two mechanisms, two services or types of services
that are especially suited for this type of processing.
One is Functions-as-a-service (FaaS)
and the other one is SQL-as-a-service again.
So, with SQL and functions-as-a-service you can do this whole range of things here:
you can create indexes through SQL DDLs, you can also create tables through SQL DDLs,
you can transform data,
and you can use functions with custom libraries and custom code to do feature extraction
from the format of the data that you need to process.
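The SQL side of this can be illustrated with plain DDL and a transforming query. The sketch uses Python's built-in `sqlite3` module as a local stand-in for a cloud SQL service, which is an assumption; the table and index names are hypothetical, and a real SQL-as-a-service offering will have its own dialect for tables and indexes over object storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for a cloud SQL service
conn.executescript("""
    CREATE TABLE raw_readings (device TEXT, temp_f REAL);
    INSERT INTO raw_readings VALUES ('a', 212.0), ('b', NULL), ('a', 32.0);

    -- Transform with SQL: drop null values and convert Fahrenheit to Celsius.
    CREATE TABLE readings AS
        SELECT device, (temp_f - 32.0) * 5.0 / 9.0 AS temp_c
        FROM raw_readings
        WHERE temp_f IS NOT NULL;

    -- Create an index through SQL DDL.
    CREATE INDEX idx_readings_device ON readings (device);
""")

rows = conn.execute(
    "SELECT device, temp_c FROM readings ORDER BY temp_c"
).fetchall()
indexes = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"
).fetchall()
print(rows, indexes)
```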
Once we have gone through this pipeline, the question is what's next now?
So, we have prepared, we have processed all of this data, and we have probably cataloged it,
so we know of what data we have.
Now it comes to the point that we really harvest all of this work by basically generating insights.
So, generating insights is on one side the whole group of business intelligence,
which consists of things like doing reporting, or creating dashboards,
and that's what's typically often referred to as BI (Business Intelligence).
And one option that is possible now
is to simply directly do BI against this data in a data lake.
But, actually, it turns out that this is especially useful, or an option, for batch use cases
- like creating reports in a batch fashion.
When it comes to more interactive requirements, where you are basically sitting in front of the screen
and a dashboard, let's say, needs to refresh in under a second, direct BI against the lake is a weaker fit.
There is actually another very important mechanism
that is very well established and it is part of this whole data lake ecosystem
and this is a data warehouse.
So, a data warehouse - or a database, maybe more generally - is highly optimized
and has a lot of mechanisms for giving you low latency
and also guaranteed response times for your queries.
So, the question is, how do we do that?
Now, we obviously need to move this data one step further
after it has gone through all of the data preparation
in the data lake with an ETL again.
And it happens to be again that SQL-as-a-service is a useful mechanism
because it's a service we have available on the cloud, we already use it to ETL data into the data lake,
now we can also use it to ETL data out of this data lake into a data warehouse.
So that it's now in this - I would say more traditional, established stack of doing BI
that can be used by your BI tools, reporting tools, dashboarding tools,
to do interactive BI with performance and response time SLAs.
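The ETL step out of the lake and into a warehouse can be sketched as an aggregating query whose result is loaded into a separate database. Both sides are in-memory `sqlite3` databases here, which is purely an assumption for the sake of a runnable example; the table names `readings` and `device_summary` are hypothetical.

```python
import sqlite3

# Stand-in for prepared data in the lake.
lake = sqlite3.connect(":memory:")
lake.executescript("""
    CREATE TABLE readings (device TEXT, temp_c REAL);
    INSERT INTO readings VALUES ('a', 20.0), ('a', 22.0), ('b', 30.0);
""")

# Stand-in for the data warehouse that serves interactive dashboards.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE device_summary (device TEXT PRIMARY KEY, avg_temp_c REAL, n INTEGER)"
)

# Extract and transform with SQL in the lake ...
summary = lake.execute(
    "SELECT device, AVG(temp_c), COUNT(*) FROM readings GROUP BY device ORDER BY device"
).fetchall()

# ... then load the small, pre-aggregated result into the warehouse.
warehouse.executemany("INSERT INTO device_summary VALUES (?, ?, ?)", summary)
warehouse.commit()

dashboard_rows = warehouse.execute(
    "SELECT device, avg_temp_c, n FROM device_summary ORDER BY device"
).fetchall()
print(dashboard_rows)
```

The dashboard then queries only the compact summary table, which is what makes low and predictable response times realistic.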
So, that's one end-to-end flow now,
but, very obviously, insight is more than just doing reporting and dashboarding.
So, there's a whole domain of tools and frameworks out there
for more advanced types of analytics such as machine learning,
or simply using data science tools and frameworks
that now, basically, can also do analytics
and do AI, artificial intelligence,
against the data that we've prepared here in a catalog.
And machine learning tools and data science tools,
basically they all have very strong support for accessing data in an object storage.
So, that's why this is a good fit: basically, let them connect directly here to this data lake.
Now, that is the end-to-end process - basically getting from your data, with the help of a data lake, into insights.
One of the big problems that is there today is for people to do that,
to prove and explain how they got to this insight.
How can you trust this insight?
How can you reproduce this insight?
So, one of the key things that need to be part of this picture is data governance.
So, data governance, in this context, has two main things that we need to take care of.
One is we need to be able to track the lineage of data
- because you've seen the data is traveling from different sources,
through preparation, into some insights in the form of a report.
And you always need to be able to track back: where did this report come from?
Why is it looking like this?
What's the data that basically produced it?
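Tracking back from a report to its sources is a walk over a lineage graph. The sketch below assumes lineage is recorded as a mapping from each asset to the assets it was derived from; that representation, the asset names, and the `trace_sources` function are all hypothetical.

```python
from typing import Dict, List, Set

# Each asset maps to the assets it was derived from (hypothetical lineage records).
lineage: Dict[str, List[str]] = {
    "report/monthly_usage": ["table/usage_clean"],
    "table/usage_clean": ["raw/device_messages", "raw/app_logs"],
    "raw/device_messages": [],
    "raw/app_logs": [],
}


def trace_sources(asset: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Walk the lineage graph upstream and return the original sources."""
    seen: Set[str] = set()
    sources: Set[str] = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = graph.get(current, [])
        if not parents:
            sources.add(current)  # nothing upstream: this is an original source
        stack.extend(parents)
    return sources


print(sorted(trace_sources("report/monthly_usage", lineage)))
```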
And the other thing is: you need to be able to enforce
- or rather, a data lake needs to be able to enforce -
policies, governance policies.
Who is able to access what?
Who is able to see personal information?
- and can I access it directly, or only in an anonymized and masked form?
So, these are all governance rules,
and there are governance services available, also in the cloud,
that basically a data lake needs to apply and use in order to track all of this.
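A governance rule of the kind "who may see personal information, and in what form" can be sketched as a per-role policy applied on read. The roles, the `POLICY` table, and the `read_record` function are hypothetical, and the truncated hash is only a stand-in for real masking or anonymization, which needs more care than this.

```python
import hashlib
from typing import Any, Dict

# Hypothetical policy: which fields each role may only see in masked form.
POLICY: Dict[str, Dict[str, str]] = {
    "analyst": {"email": "mask", "name": "mask"},
    "privacy_officer": {},  # may see every field directly
}


def mask(value: str) -> str:
    """Replace a value with a short, stable token (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def read_record(record: Dict[str, Any], role: str) -> Dict[str, Any]:
    """Return the record as the given role is allowed to see it."""
    if role not in POLICY:
        raise PermissionError(f"role {role!r} has no access")
    rules = POLICY[role]
    return {
        field: mask(str(value)) if rules.get(field) == "mask" else value
        for field, value in record.items()
    }


record = {"name": "Ada", "email": "ada@example.com", "country": "DE"}
analyst_view = read_record(record, "analyst")
officer_view = read_record(record, "privacy_officer")
print(analyst_view)
```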
So, we're almost done with this overall Data Lake introduction, but there is just one more thing that I want to highlight
and this is, since we're talking about the cloud:
In the cloud, how can I deploy my entire pipeline of data traveling through this whole infrastructure,
- how can I automate that?
And here, basically, function-as-a-service plays a special role
because function-as-a-service has a lot of mechanisms
that I can use to schedule and automate
things like, for instance, a batch ETL step,
- or like generating a report.
So, this is the final thing that we need in our data lake
in order to automate and operationalize, eventually,
my entire data and analytics using a data lake.
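The automation idea can be pictured as a handler that a function-as-a-service platform invokes on a schedule and that runs the pipeline steps in order. Everything here is an assumption for illustration: the handler name `on_schedule`, the event shape, and the two placeholder steps do not correspond to any specific platform's API.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def batch_etl(state: State) -> None:
    """Placeholder for a batch ETL step."""
    state["loaded"] = [1, 2, 3]


def generate_report(state: State) -> None:
    """Placeholder for a report generation step."""
    state["report"] = f"{len(state['loaded'])} rows loaded"


# The pipeline is an ordered list of steps.
PIPELINE: List[Callable[[State], None]] = [batch_etl, generate_report]


def on_schedule(event: Dict[str, Any]) -> State:
    """Hypothetical handler that a scheduler trigger would invoke."""
    state: State = {"trigger": event}
    for step in PIPELINE:
        step(state)
    return state


result = on_schedule({"cron": "0 2 * * *"})  # for example, a nightly trigger
print(result["report"])
```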
Thank you very much.