Data Pipelines Explained Through Water
Key Points
- Data pipelines move raw, “dirty” data from sources (data lakes, databases, streaming feeds) to where it can be used, much like water pipelines transport untreated water to treatment plants.
- Like water treatment, data must be cleaned, de‑duplicated, and formatted before it becomes useful for business decision‑making.
- The primary method for this is ETL (extract, transform, load), which extracts data, applies transformations to resolve mismatches and missing values, and loads it into a target repository such as an enterprise data warehouse.
- Pipelines can operate in batch mode on a schedule or in continuous streaming mode to handle real‑time data feeds, and they may also employ techniques like data replication or virtualization.
**Source:** [https://www.youtube.com/watch?v=6kEGUCrBEU0](https://www.youtube.com/watch?v=6kEGUCrBEU0)
**Duration:** 00:08:30

Sections
- [00:00:00](https://www.youtube.com/watch?v=6kEGUCrBEU0&t=0s) **Water Pipeline Analogy for Data Pipelines** - The speaker compares water treatment and distribution pipelines to data flows from lakes, databases, and streams, illustrating how raw data is collected, processed, and delivered to where it's needed in an organization.

Full Transcript
Let's talk about data pipelines: what they are, and when and how they're used.

I want to start with a simple idea. Most of us are fortunate enough to turn on the tap whenever we like and have fresh, clean water come out. But have you ever thought about how that water actually gets to you? Water starts out in our lakes, our oceans, and even our rivers, and most of us probably wouldn't drink straight from the lake. We have to treat and transform this water into something that's safe to use, and we do that using treatment facilities. We get the water from where it is to where it needs to go using water pipelines.
Once that water has gotten from the source to the treatment plants, it's cleansed and made safe to use, and then it's sent out through even more pipelines to where we need it: for drinking water, for cleaning, and for agriculture. So water pipelines take water from where it is to where it's needed.
Now we can start to think about data in organizations in a very similar way. Data in an organization starts out in data lakes, in databases that belong to different SaaS applications, in applications that run on premises, and in streaming data, which is a bit like our river here. Streaming data arrives in real time: one example is sensor data from factories, where readings are collected every second and sent back up to our repositories. Just like our water sources, this data is dirty and contaminated, and it must be cleaned and transformed before it's useful in helping us make business decisions.
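To make "dirty" concrete, here is a hypothetical sample of raw sensor readings showing the kinds of problems a pipeline has to fix (all names and values are made up for illustration):

```python
# Hypothetical raw sensor readings, straight from the source:
raw_readings = [
    {"sensor_id": "A1", "temp_c": 21.4, "ts": "2024-01-01T00:00:01"},
    {"sensor_id": "A1", "temp_c": 21.4, "ts": "2024-01-01T00:00:01"},  # exact duplicate
    {"sensor_id": "A2", "temp_c": None, "ts": "2024-01-01T00:00:02"},  # missing value
    {"sensor_id": "A3", "temperature": 70.9, "ts": "2024-01-01T00:00:02"},  # mismatched column name
]
```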
So how do we do this work? We do it using not water pipelines, but data pipelines. When we talk about data pipelines, we have a few different processes we can use to handle the task of transforming and cleaning this data: we can use ETL, we can use data replication, and we can also use something called data virtualization.
One of the most common processes is ETL, which stands for extract, transform, and load, and it does exactly what it sounds like. It extracts data from where it is; it transforms it by cleaning up mismatched data, taking care of missing values, getting rid of duplicated data, and making sure the right columns are there; and then it loads it into a landing repository for ready-to-use business data. An example of one of these repositories is an enterprise data warehouse.
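Here is a minimal sketch of those three steps, assuming a hypothetical orders.csv export as the source and a local SQLite file standing in for the warehouse (the file names and columns are made up):

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from the source system.
raw = pd.read_csv("orders.csv")

# Transform: fix the problems described above.
clean = (
    raw.drop_duplicates()                        # remove duplicated rows
       .dropna(subset=["order_id"])              # drop rows missing the key column
       .rename(columns={"cust": "customer_id"})  # fix a mismatched column name
)
clean["amount"] = clean["amount"].fillna(0.0)    # fill in missing values

# Load: write the cleaned data into the landing repository.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```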
Most of the time we use something called batch processing, which means that on a given schedule we load data into our ETL tool and then load it to where it needs to be. But we could also have stream ingestion, which supports the streaming data I mentioned earlier: the pipeline continuously takes data in, transforms it, and continuously loads it to where it needs to be.
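The two modes differ mainly in when the work happens. A minimal sketch, assuming hypothetical extract, transform, and load callables like the ones above:

```python
import time

def run_batch(extract, transform, load):
    # Batch: process everything available, then exit; a scheduler
    # (cron, Airflow, and so on) invokes this on a fixed schedule.
    load(transform(extract()))

def run_stream(next_event, transform, load, poll_seconds=1.0):
    # Streaming: loop forever, transforming and loading each event
    # as it arrives instead of waiting for the next scheduled run.
    while True:
        event = next_event()
        if event is None:
            time.sleep(poll_seconds)  # nothing new yet; wait briefly
        else:
            load(transform(event))
```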
Another tool we might see is data replication, which involves continuously replicating and copying data into another repository before it's loaded or used. So we could have a repository in the middle that copies data from our source. Why would we do that? One reason could be that the application or use case that needs this data requires a highly performant back end, and our source system can't support that kind of load. Another reason is backup and disaster recovery: if our source goes offline for some reason, we still have this copy to keep running our business processes against.
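A minimal sketch of continuous replication, assuming two SQLite files and an events table with a monotonically increasing id column (all hypothetical):

```python
import sqlite3
import time

def replicate_forever(source_path="source.db", replica_path="replica.db",
                      poll_seconds=5.0):
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(replica_path)
    dst.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")
    while True:
        # High-water mark: copy only rows we haven't replicated yet.
        (last_id,) = dst.execute("SELECT COALESCE(MAX(id), 0) FROM events").fetchone()
        rows = src.execute("SELECT id, payload FROM events WHERE id > ?",
                           (last_id,)).fetchall()
        if rows:
            dst.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", rows)
            dst.commit()
        time.sleep(poll_seconds)  # then poll the source again
```

Real replication tools usually read the source's change log rather than polling, but the effect is the same: the replica stays a near-real-time copy that can absorb query load or serve as a fallback.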
The last one I want to touch on is data virtualization. All of the methods I've described so far require you to copy data from where it is and move it into another repository. But what if we want to test out a new data use case and don't want to go through a large data transformation project? In that case we can use a technology called data virtualization to simply virtualize access to our data sources, querying them in real time only when we need them, without copying them over. Data virtualization lets us access all of these disparate data sources without building permanent data pipelines, and once we're satisfied with the results of our test use case, we can go back and build a formal data pipeline that can support the massive amounts of data we need in a production use case.
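A minimal sketch of the no-copy idea, assuming a hypothetical flat file and database as the disparate sources; both are read in place, at query time, and nothing is moved into a new repository:

```python
import sqlite3
import pandas as pd

def virtual_query(customer_id):
    # Source 1: a flat file, read in place at query time.
    orders = pd.read_csv("orders.csv")
    # Source 2: a database, also queried in place.
    with sqlite3.connect("crm.db") as conn:
        customers = pd.read_sql_query("SELECT * FROM customers", conn)
    # Join the live views of both sources for this one query.
    combined = orders.merge(customers, on="customer_id")
    return combined[combined["customer_id"] == customer_id]
```

A real virtualization platform would push filters down to each source rather than reading whole tables, but the key property is the same: the data stays where it is until the moment it's queried.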
Now, unfortunately, we haven't figured out how to virtualize water, but we can definitely do it with data in our organizations.
So after we've used all of these different processes to get data ready for analysis or for different applications, we can start using it. What are the different ways we can use this data? We might need it for business intelligence platforms that power different types of reporting. We might also need it for machine learning use cases: machine learning requires tons and tons of high-quality data, so we use these data pipeline tools to feed clean data into our machine learning models and help us start making better, smarter decisions in our business.
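To close the loop, a minimal sketch of a consumer reading the pipeline's output, assuming the hypothetical warehouse.db from earlier and made-up feature columns:

```python
import sqlite3
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read the cleaned, pipeline-produced table from the warehouse.
with sqlite3.connect("warehouse.db") as conn:
    df = pd.read_sql_query("SELECT amount, num_items, churned FROM orders", conn)

# Train a simple model on the clean features.
model = LogisticRegression()
model.fit(df[["amount", "num_items"]], df["churned"])
```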
So as we can see, data pipelines take data from data producers and deliver it to data consumers.

Thank you! If you have questions, please drop us a line below, and if you want to see more videos like this in the future, please like and subscribe.