Simplifying Monitoring with Golden Signals
Key Points
- The traditional approach to monitoring complex micro‑service environments forces owners to chase numerous technology‑specific metrics and call in multiple experts, which slows root‑cause identification and prolongs the latency that end users experience.
- Site Reliability Engineering (SRE) recommends focusing on only four “golden signals” – latency, errors, traffic, and saturation – rather than tracking every possible metric across heterogeneous services.
- By applying the golden signals and leveraging APM tools that surface the immediate downstream dependencies (one hop away), teams can quickly eliminate services that are healthy and narrow the search space.
- This streamlined, signal‑driven monitoring dramatically reduces mean time to recovery (MTTR) and helps maintain consistent end‑user performance despite a diverse tech stack.
**Source:** [https://www.youtube.com/watch?v=rnnhtzIgjvQ](https://www.youtube.com/watch?v=rnnhtzIgjvQ)

**Duration:** 00:05:12

## Full Transcript
Today I'd like to talk a little about the site reliability engineering, or SRE, discipline and how we can apply it to simplifying monitoring for complex modern applications. This will help us identify root causes more quickly and drastically reduce the mean time to recovery, so that we can maintain the end-user performance that we want for our applications.

First, let's take a look at what happens before we've applied these SRE principles to our monitoring. Let's say that I'm the owner of an application and I've gotten an alert that says I'm having a latency issue. My application is really critical for the business, so I need to find the root cause quickly. But because I'm part of this complex microservice topology, it can be really difficult to figure out where exactly the root cause is coming from. And to make things more complex, all of my dependencies could be based on different technologies: let's say one is built on Node.js, one is a Db2 database, another is written in Swift, and so on.

All of these have different metrics that are typically monitored, and I may not be an expert in any of these technologies, so it may be difficult for me personally to go in and figure out what the problem is. I would have to call in an expert for each of these technologies. As you can imagine, this is time-consuming: everyone has to go through their service and figure out whether there is a problem or whether I need to keep going downstream, and all the while my users are still experiencing this latency issue.

Now, what if there was a better way? This is what we can learn from the SRE discipline, which tells us that there are really only four key performance indicators that we need to monitor, not all the different metrics for each technology. We call these the golden signals.
The golden signals are latency, which is the time it takes to service a request; errors, which is a view of the request error rate; traffic, which is the demand placed on the system; and saturation, which is our utilization versus max capacity.

Now let's go back to our initial example and see how this would work applying the golden signals. My service, we'll call it service A, has a latency issue. We know that latency is typically a symptom, and if we examine the service, let's say we're not seeing any of the causes, so we know we have to keep looking downstream. But we don't want to go back to this complicated microservice topology and try to figure it all out.
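As a rough illustration of the four golden signals defined above, here is a minimal Python sketch that computes them over one window of requests. The `Request` and `GoldenSignals` types, the choice of 95th-percentile latency, and the capacity parameters are all assumptions made for this example; they do not come from the talk or from any particular APM tool.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request ended in an error


@dataclass
class GoldenSignals:
    latency_ms: float   # 95th-percentile request duration (assumed choice)
    error_rate: float   # failed requests / total requests
    traffic_rps: float  # requests per second over the window
    saturation: float   # utilization / max capacity, from 0.0 to 1.0


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_s: float,
                   used_capacity: float, max_capacity: float) -> GoldenSignals:
    """Summarize one observation window as the four golden signals."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return GoldenSignals(
        latency_ms=percentile([r.duration_ms for r in requests], 95),
        error_rate=errors / total if total else 0.0,
        traffic_rps=total / window_s,
        saturation=used_capacity / max_capacity,
    )


# Hypothetical two-second window with four requests, one of which failed.
window = [Request(100, True), Request(200, True),
          Request(300, False), Request(400, True)]
signals = golden_signals(window, window_s=2.0, used_capacity=30, max_capacity=40)
print(signals)
```

The point of the sketch is that the same four numbers can be produced for any service, whatever technology it is built on, which is what makes them comparable across a mixed topology.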
Some APM tools can help you out with this by identifying only the services that are one hop away from the service in question. Let's say we have services B, C, and D that are connected to my service A, which is having the problem. No matter what technology these services are built on, all we need to do is look at the golden signals. Let's say we look at the golden signals for service B and everything looks fine, so we know service B is not the problem. Service C, same scenario: we don't see any issues, so we can eliminate that as the problem. Now, for service D, let's say we're seeing an issue with saturation, which is trending upwards. Right there, after only a few minutes, we've identified that service D is likely our root cause. Instead of having to pull in the experts for each of these different services, we can go directly to the owners of service D and let them know that we've identified that they're likely a cause of the issue we're having, and they can go about fixing it.
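The elimination walk through services B, C, and D can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Signals` type, the threshold constants, the `calls` dependency map, and the sample numbers are all hypothetical, and a real APM tool would supply the one-hop view and the signal values itself.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical thresholds; real values depend on each service's objectives.
LATENCY_LIMIT_MS = 500.0
ERROR_RATE_LIMIT = 0.01
SATURATION_LIMIT = 0.80
SATURATION_RISE = 0.10  # rise across the window that counts as "trending upwards"


@dataclass
class Signals:
    latency_ms: float
    error_rate: float
    traffic_rps: float               # kept for context; not thresholded here
    saturation_history: List[float]  # oldest to newest, each 0.0 to 1.0


def unhealthy_signals(s: Signals) -> List[str]:
    """Return the names of the golden signals that look unhealthy."""
    problems = []
    if s.latency_ms > LATENCY_LIMIT_MS:
        problems.append("latency")
    if s.error_rate > ERROR_RATE_LIMIT:
        problems.append("errors")
    history = s.saturation_history
    if history and (history[-1] > SATURATION_LIMIT
                    or history[-1] - history[0] >= SATURATION_RISE):
        problems.append("saturation")
    return problems


def triage(service: str, calls: Dict[str, List[str]],
           signals: Dict[str, Signals]) -> Dict[str, List[str]]:
    """Check only the dependencies one hop away and keep the unhealthy ones."""
    suspects = {}
    for dep in calls.get(service, []):  # one hop only; deeper services are ignored
        problems = unhealthy_signals(signals[dep])
        if problems:
            suspects[dep] = problems
    return suspects


# Service A calls B, C, and D; D's own dependency E is two hops away.
calls = {"A": ["B", "C", "D"], "D": ["E"]}
signals = {
    "B": Signals(120.0, 0.001, 50.0, [0.40, 0.41, 0.40]),
    "C": Signals(90.0, 0.000, 20.0, [0.30, 0.32, 0.31]),
    "D": Signals(200.0, 0.002, 80.0, [0.55, 0.65, 0.78]),
}
suspects = triage("A", calls, signals)
print(suspects)  # {'D': ['saturation']}
```

Services B and C drop out because none of their signals cross a threshold, while D is flagged on its rising saturation even though the latest sample is still under the hard limit. Service E is never inspected, which mirrors the one-hop view described in the talk.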
What's even better is that if they're using golden signals to monitor their service, it's very likely they've already identified this and are already working on the fix. As you can see, this process drastically reduces the time it takes to go through this complex topology and many different technologies to figure out where your root cause is and identify exactly how to fix it. So when you're choosing an APM tool, make sure that it offers the ability to use these golden signals and this one-hop dependency view, so that you can quickly identify root causes and get your service restored as quickly as possible. Thanks for watching this video on simplifying monitoring for modern applications.