Data Observability Explained with Train Analogy
Key Points
- Ryan introduces the IBM Technology Channel video, asks viewers to like, subscribe, and share, and promises a train‑analogy demo to illustrate data pipelines and observability.
- He outlines the rapid evolution of software engineering over the past 5‑8 years—CI/CD, DevOps, infrastructure‑as‑code, cloud microservices—making observability a standard practice for application performance monitoring (APM).
- Ryan points out that just as every organization became a software company, today every organization is becoming a data company, leading many software engineers to transition into data‑engineering roles.
- This shift has created a new “data observability” movement, applying the same monitoring, tracing, and alerting principles from APM to data pipelines to ensure data quality, reliability, and timely issue resolution.
Sections
- Data Observability Intro with Train Analogy - Ryan from IBM introduces a video on data observability—encouraging channel interaction, teasing a train‑themed illustration of pipelines, and briefly contextualizing the trend within the rise of CI/CD in software engineering.
- Bridging Software and Data Engineering - The speaker explains how software engineers are shifting to data‑engineer roles, using familiar coding tools to build data pipelines, but unlike application development they lack mature observability, making data observability the next essential capability.
- Observability Challenges in Data Pipelines - The speaker explains how ML pipelines aim to deliver trustworthy data but frequent failures force engineers to spend half their time on maintenance, highlighting the need for observability, illustrated with a train‑movement analogy.
- Data Pipeline Observability and Lineage - The speaker explains how monitoring process quality, data quality, and lineage in data pipelines enables rapid detection of issues and alerts downstream users to prevent cascading failures.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=jfg9wBJBtKk](https://www.youtube.com/watch?v=jfg9wBJBtKk) **Duration:** 00:11:57
Timestamps:
- [00:00:00](https://www.youtube.com/watch?v=jfg9wBJBtKk&t=0s) Data Observability Intro with Train Analogy
- [00:03:02](https://www.youtube.com/watch?v=jfg9wBJBtKk&t=182s) Bridging Software and Data Engineering
- [00:06:04](https://www.youtube.com/watch?v=jfg9wBJBtKk&t=364s) Observability Challenges in Data Pipelines
- [00:09:05](https://www.youtube.com/watch?v=jfg9wBJBtKk&t=545s) Data Pipeline Observability and Lineage
Hey, everyone, this is Ryan with IBM.
Excited you came by the IBM Technology Channel today.
We're gonna be talking about what is data observability.
One of the hottest topics in the data space today.
But before we get there, I want to remind you,
please subscribe, like the channel, interact with us in the comments.
It really helps us produce the videos you want to see from us in the future.
So go ahead.
Like it, share it.
Send it to your Mom and Dad.
Let's get the word out there.
All right.
So before we get into data
observability, I want to tease you a
little bit.
I'm going to use a
train analogy later on
in this demonstration.
So what I want you to do
is if you are really excited about
trains like I am, I love trains.
I used to play with trains all the
time during Christmas time with my
grandpa growing up.
This is going to be a cool example
to show you how we're connecting
everything together with data
pipelines, data engineering
and also observability.
So we're going to get back to that
promise.
Okay. So let's give a little quick
history lesson of what's going
on in the industry of how we're
getting to this data observability
moment that we're seeing right now.
And it really comes down to something comparable to the software engineering growth and explosion that's happened over the last 5 to 8 years, which is that software engineers are basically ruling the world.
These developers are ruling
with frameworks and
methodologies like CI and CD.
They're pioneering things like dev
ops.
They're doing things like infra
as code, all these
really advanced things
in the software development space.
It's really blown up.
And these are all like table stakes nowadays. If you're a software engineer, you're basically doing all these things, and you're doing them while building applications: building applications in the cloud, building microservices.
So all these things are great.
And recently, around five
years ago, there was a blow
up in the space around observability
itself. There's another video that
we've done at IBM around
observability that I encourage
you to check that out.
And it's really around this idea
of application performance
monitoring or APM.
APM is really around being able
to detect problems and performance
issues in your application so
developers can be alerted right away
and go resolve something.
Whether a server was down or you had an application hiccup in production, they know right away, through the trace logs and issues, exactly where to go to fix it, right?
And that's awesome that they've got
all these tools kind of helping them
out.
Well, what's going on now is this: it used to be that every company was becoming a software company. Now every company is becoming a data company.
And what's going on is that a lot of
these software engineering
folks or developers or engineers,
they're actually moving to become
data engineers.
And so they're using a lot of the same skill sets: you know, code-heavy skill sets like Python, and methodologies like DevOps, all these things they're used to as software engineers. Now, becoming data engineers, they're taking the same things we're doing around continuous delivery for application development into data development within the organization today.
So where software engineers
really care about the applications
that they're building like
cloud microservices and
applications out there, what data
engineers are actually really
focused on is their data
pipelines.
These data pipelines are the things
that are moving the data from the
source all the way to the end
state of the consumer.
And they're in charge of around 80
to 90% of all
the data flow within their
organization today.
But the problem is this, is that
whereas software engineers have
APM application performance
monitoring as an observability tool
to help them find things right away,
alert them, resolve them;
data engineering really isn't there
yet. And this is where
observability comes into play.
Data observability is the next step for data engineers to operationalize incident detection within their data pipelines.
So let's walk through, and we're going to use the train analogy as a pipeline. I view this train as an actual pipeline that's going to be moving down the tracks here.
But this is what data engineers do: they're building pipelines, orchestrating the data from the source all the way to their end consumer. And a data engineer is going to have hundreds to thousands of different data pipelines. Most commonly, they're using open source tools like Apache Airflow to take the data and move it.
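To make the source-to-consumer flow concrete, here is a minimal, hypothetical sketch of the kind of pipeline a data engineer orchestrates: pull from a source, transform, and land the rows in a warehouse table. All names are illustrative assumptions; in practice each step would be a task in an orchestrator such as Apache Airflow, but plain functions keep the flow easy to see.

```python
# Hypothetical three-step pipeline: extract -> transform -> load.
def extract(source_rows):
    """Pull raw records from an upstream source (stubbed as a list here)."""
    return list(source_rows)

def transform(rows):
    """Normalize records; real pipelines would clean, join, and enrich."""
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(rows, warehouse):
    """Append transformed rows to a warehouse table (stubbed as a dict)."""
    warehouse.setdefault("orders", []).extend(rows)
    return len(rows)

# Run the pipeline end to end on a couple of sample records.
warehouse = {}
raw = [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "3.25"}]
loaded = load(transform(extract(raw)), warehouse)
print(loaded)  # number of rows landed in the warehouse table
```

An orchestrator would schedule these steps, retry failures, and fan them out across hundreds of such pipelines; the observability questions below are about watching exactly this kind of flow.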
And just to illustrate real quick, this becomes a complicated problem, because what you're doing is taking all different types of sources that you maybe can and cannot control, from third party applications, from, you know, web server APIs, things like that... You're taking these things, and then you're expected to funnel them into storage areas like a data warehouse, or something that's unstructured, like a data lake.
So data engineers, again, are in charge of moving this data so that eventually, when the data is in the right state, it can power the data products you're building today.
And those can be things like
business analytics, finance,
marketing, sales, having
predictive analytics around, you
know, the success of their business
and making decisions off of that
data.
That's one use case.
Another use case is actually building data products themselves: building applications that are mobile or web, or that have a high volume of transactions associated with them. A sports betting company, for example, would be an example of building a data product.
ML and deep learning pipelines are another one, using these pipelines to really drive and take the business to the next step in their AI journey with trustworthy data.
So all this at the end, the goal,
though, what they're really trying
to do is they're trying to deliver
the data in a trustworthy way
to eventually get to
their consumers.
Now, this would be great if everything worked well in these pipelines, right? But I hate to break it to you: that's not how things work. We also know that data engineers are spending around 50% of their time actually maintaining these pipelines, because things break, and things are constantly holding them back from building these really cool, data-driven products in their organization today.
So where does observability come in? Well, it all goes back to being able to monitor and observe a data pipeline. So let's walk through and examine the problems these data engineering teams will face, with this train analogy that I drew up here. The first one is this: if you're on a train and you're sitting there waiting for it to move, the first thing you're going to ask is, is the train moving or not?
You want to understand right away: hey, is the train moving? And that relates back to a pipeline. Is the pipeline actually operational? Is it executing, or is it failing, halted, or stalled? If the data pipeline is not moving correctly, or not moving at all, this data cannot get to the end consumer.
The next question is, well, how fast is this train going? You know, is it going at 90%, 80%, 20%? Did it take ten hours when it should have taken one hour? This is a problem if we don't know exactly when the data is going to be operational and running in orchestration with the pipelines. And if we have a data SLA where the data needs to get there at a certain time, well, we've got to know that. We've got to know when that's going to be a problem.
Right. So that's the first part.
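The "is the train moving, and how fast" questions can be sketched as a small process-quality check: given a pipeline run's status and timing, flag it when it has failed, stalled, or blown past its data SLA. The function name, statuses, and thresholds are illustrative assumptions, not from any particular tool.

```python
from datetime import datetime, timedelta

def check_run(status, started_at, finished_at, sla):
    """Return a list of process-quality alerts for one pipeline run."""
    alerts = []
    if status != "success":
        alerts.append(f"pipeline run ended with status={status}")
    if finished_at is None:
        # Still running: call it stalled once it exceeds the SLA window.
        if datetime.now() - started_at > sla:
            alerts.append("pipeline appears stalled (no finish, SLA exceeded)")
    elif finished_at - started_at > sla:
        alerts.append("pipeline finished late (data SLA missed)")
    return alerts

# A run that took ten hours against a one-hour SLA should raise an alert.
start = datetime(2024, 1, 1, 0, 0)
late = check_run("success", start, start + timedelta(hours=10),
                 sla=timedelta(hours=1))
print(late)
```

In a real setup these checks would run on every orchestrator run and feed the alerting described later, rather than being called by hand.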
The second part is around the cargo
on the train.
So all trains, hopefully they're
taking cargo.
Maybe they got cars, maybe they got
computers, whatever it is.
Well, the next question is: okay, if our pipeline is running really well, what's going on with the actual data sets there? That's basically the cargo on that pipeline. This gets into understanding whether anything is going on with the data at the data set level.
So for example, there could have been a schema change that we had no idea about. Say we were expecting ten columns and we got eleven, or we were expecting ten columns and we got nine. That's a problem, because that data is going to impact whatever consumes that data set downstream.
We also could have seen a null record come through. Say it was a really high-value piece of data that we expected to come through every single day, and it didn't happen. That's a problem. Or the actual data values could have changed unexpectedly. That's a problem too: it's going to corrupt the data downstream, along with the data we're moving in these data pipelines.
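The "cargo" checks above can be sketched as a small data-quality function: does each record match the schema contract we expect, and did the high-value fields actually arrive? The column names and rules below are hypothetical, not from the video.

```python
# Assumed schema contract for the data set: three expected columns.
EXPECTED_COLUMNS = {"id", "customer", "amount"}

def check_dataset(rows):
    """Return data-quality alerts for a batch of records (list of dicts)."""
    alerts = []
    for i, row in enumerate(rows):
        cols = set(row)
        if cols != EXPECTED_COLUMNS:
            # Schema drift: extra or missing columns versus the contract.
            alerts.append(f"row {i}: schema changed, got {sorted(cols)}")
        elif row.get("amount") is None:
            # A null in a field we expect to arrive populated every day.
            alerts.append(f"row {i}: null 'amount' value")
    return alerts

batch = [
    {"id": 1, "customer": "a", "amount": 9.99},
    {"id": 2, "customer": "b", "amount": None},  # null high-value field
    {"id": 3, "customer": "c"},                  # missing a column
]
alerts = check_dataset(batch)
print(alerts)
```

Real observability tools also profile value distributions and freshness rather than just column counts and nulls, but the shape of the check is the same.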
So the first one is really around the process quality. The second one is around the data quality within the actual pipeline. And the next one is around the lineage. If you are on a train, the train is going to go somewhere. Eventually it's going to drop off the cargo, and it's going to go on different tracks. It may hook up to different cars or change trains; if you're going from Georgia to New York City, maybe there's even a stop in Delaware, for example. This really gets into data lineage, which is how things are connected to dependent pipelines and data sets downstream.
So for example, say we know the pipeline is working well, but then we find that there's a problem in the actual data within the pipeline. The next question is: how does this problem impact something downstream, another data set that's consuming from another pipeline? We want to know right away, if this fails, that it's going to impact this person down here, so we can let them know.
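One way to picture the lineage idea is a small dependency graph: record which data sets feed which, so when one pipeline's output goes bad we can walk the graph and alert every downstream consumer. The graph and data set names below are a made-up example, not from the video.

```python
from collections import deque

# Lineage edges: data set -> data sets that consume it downstream.
LINEAGE = {
    "source_orders": ["warehouse.orders"],
    "warehouse.orders": ["analytics.revenue", "ml.churn_features"],
    "ml.churn_features": ["app.churn_model"],
}

def downstream_of(dataset):
    """Breadth-first walk of the lineage graph from a failing data set."""
    impacted, queue = set(), deque([dataset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return sorted(impacted)

# If warehouse.orders is corrupted, everyone below it should be alerted.
print(downstream_of("warehouse.orders"))
# → ['analytics.revenue', 'app.churn_model', 'ml.churn_features']
```

Production lineage is usually harvested automatically from pipeline code and query logs, but the traversal that answers "who do I need to warn?" looks like this.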
And this is what data observability is all about. What we're doing is operationalizing the incident detection. So whenever we see a problem at the source, the warehouse, or even downstream at the product level, we want to be able to alert the data engineering team so they can be notified right away when a problem occurs, fix it, and prevent it from impacting downstream consumers and ultimately having big, costly impacts to the business.
And so really, the tenets of data observability are to detect issues earlier, catching them right away at the source; resolve them faster, knowing exactly where they are and how they affect other people downstream; and ultimately deliver trusted data that hits your expected data SLAs for your end consumer. So let's wrap it up.
We talked a little bit about what's going on in the industry, with a lot of software engineers moving into data and data engineering. They need the tools to help them operationalize any incidents that occur within their data pipelines. We gave a simplistic example of a very complex workflow that data engineers are constantly dealing with: pumping source data all the way to their end consumers, and dealing with tons of different tools in between that could introduce a bad issue or a dirty-data problem downstream to the consumers.
And we also had a little fun talking
about how this connects into a
train analogy.
So I really hope you enjoyed this video about what is data observability
and check back for the channel for more videos in the future.
Thanks, everyone.
If you like this video and want to see more like it, please like and subscribe.
If you have any questions, please drop them in the comments below.