SRE Golden Signals Explained
Key Points
- The speaker likens SRE Golden Signals to a car’s check‑engine light, warning of issues early so they don’t turn into costly, catastrophic failures.
- Golden Signals for microservices are defined as latency, error rate (with severity differences like 500 vs 400 errors), traffic volume, and saturation (resource utilization versus capacity).
- An example microservices stack is described, showing web apps that call backend services, which in turn rely on authentication, transaction, and database‑wrapper services, all abstracted behind APIs.
- Developers favor microservices because they can pick the best technology for each component, but this flexibility creates operational complexity that requires broader expertise to diagnose and resolve problems.
- Effective SRE monitoring therefore hinges on tracking the four signals to detect anomalies early, balancing rapid development benefits with the need for robust ops support.
Sections
- Check Engine Light for SRE - The speaker likens a car’s check‑engine warning to SRE’s Golden Signals, explaining how early alerts—specifically latency, error rates, and traffic—help detect service issues before they cause severe failures.
- Microservices, Ops Complexity, and SRE Signals - The speaker explains how microservice architecture increases operational expertise demands, but using SRE’s four golden signals enables quick root‑cause isolation of latency and error issues, such as an oversaturated data service.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=-U9E1PhrM3o](https://www.youtube.com/watch?v=-U9E1PhrM3o) **Duration:** 00:05:32
**Section timestamps:** [00:00:00](https://www.youtube.com/watch?v=-U9E1PhrM3o&t=0s) Check Engine Light for SRE · [00:03:06](https://www.youtube.com/watch?v=-U9E1PhrM3o&t=186s) Microservices, Ops Complexity, and SRE Signals
If you're like me and you own an
old car, you've probably had
this experience.
You're driving down the road and you
see that dreaded check engine light.
You may be thinking, "OK, this is bad news. I'm going to need a tow truck, maybe a repair,"
and you might be right. But imagine that same scenario where you didn't have that alert to the problem,
and you just kept driving. What would happen?
You'd have smoke blowing out the back. You'd potentially need a tow
truck and have an even more expensive repair.
That same sort of analogy applies
to SRE Golden Signals.
Their purpose is to alert you to a problem before it becomes serious.
Let's go ahead and put a formal definition behind these.
Now these are for your microservice applications,
and the first metric that you're looking for is called latency.
That refers to the time between when you make a request and you actually get a response.
So for example, with a web application, it might be 200 to 400 milliseconds
and for an API call, it could be a fraction of that, say, 20 milliseconds.
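The latency check described above can be sketched in a few lines. This is a minimal illustration, not an implementation from the talk; the threshold values are the rough figures the speaker mentions (200–400 ms for a web page, ~20 ms for an API call), and all names are hypothetical:

```python
import time

# Hypothetical thresholds based on the talk's examples:
# up to ~400 ms for a web request, ~20 ms for an API call.
LATENCY_SLO_MS = {"web": 400, "api": 20}

def measure_latency_ms(request_fn):
    """Time a single request and return its latency in milliseconds."""
    start = time.perf_counter()
    request_fn()
    return (time.perf_counter() - start) * 1000.0

def breaches_slo(latency_ms, kind):
    """True when an observed latency exceeds the threshold for this kind of call."""
    return latency_ms > LATENCY_SLO_MS[kind]
```

In practice a monitoring agent would record these timings per request and alert on percentiles, not single samples, but the comparison against an expected threshold is the same idea.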
The next metric is errors.
Now, errors happen, that's a normal thing. But if there are too many,
if there's a sudden spike, it can indicate a problem.
Also, you have to keep in mind that not all errors are created equal.
For example, a 500 error, which means something failed on the server side, is much more serious than, say, a 400 error,
where the client can often correct the request, retry, and the problem will resolve itself.
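The two distinctions the speaker draws, error class and error rate, can be sketched like this. This is an illustrative sketch, not code from the talk; the function names are made up:

```python
def is_server_error(status):
    """5xx: a server-side failure -- the more serious class."""
    return 500 <= status < 600

def is_client_error(status):
    """4xx: a problem with the request; often fixable and retriable."""
    return 400 <= status < 500

def error_rate(statuses):
    """Fraction of responses in a window that were errors of any kind.

    Errors are normal; what matters is a sudden spike in this rate."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s >= 400) / len(statuses)
```

An alerting rule would typically weight these differently, paging on a rise in 5xx while only logging a drift in 4xx.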
T is for traffic.
Traffic refers to the amount of requests coming in versus your expectations.
And finally, S for saturation.
Saturation is the actual load versus your expected capacity.
You could think of it as the tachometer on your car.
It has a red line, and when it's oversaturated, you're receiving more requests than you can really handle.
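The traffic and saturation signals both reduce to a ratio against an expectation, which a short sketch makes concrete. Names and the 0.9 "red line" default are hypothetical illustrations, not values from the talk:

```python
def traffic_deviation(observed_rpm, expected_rpm):
    """Relative deviation of observed traffic from the expected rate."""
    return abs(observed_rpm - expected_rpm) / expected_rpm

def saturation(current_load, capacity):
    """Load as a fraction of capacity; 1.0 is the tachometer's red line."""
    return current_load / capacity

def is_oversaturated(current_load, capacity, redline=0.9):
    """Alert a bit before the red line, not after the engine blows."""
    return saturation(current_load, capacity) >= redline
```

Alerting slightly below the red line is the check-engine-light idea: you want the warning before the service is receiving more requests than it can handle, not after.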
Now with that definition out of the way,
let's look at an example of a microservices architecture.
Example application.
And we have several web apps
running on a public cloud.
And they're talking to a back end service.
The back end service, in turn, relies on another microservice to handle authentication.
The back end service uses a transaction service to get the information that's requested by the user,
and it in turn relies on a data microservice, which might be wrapping a database like DB2 or MySQL.
The idea is that each of these boxes wraps a specific service as a microservice,
and its consumers don't have to concern themselves with the underlying implementation.
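The wrapper idea above can be sketched with a toy data service. Everything here is hypothetical, the in-memory driver stands in for a real engine like DB2 or MySQL, and the point is only that callers see a narrow API rather than the storage underneath:

```python
class InMemoryDriver:
    """Stand-in for a real database engine (DB2, MySQL, ...), for illustration."""
    def __init__(self, rows):
        self._rows = rows

    def query(self, user_id):
        return [r for r in self._rows if r["user_id"] == user_id]

class DataService:
    """Toy data microservice: exposes a small API and hides the storage engine."""
    def __init__(self, driver):
        self._driver = driver  # the engine can be swapped without callers noticing

    def get_transactions(self, user_id):
        return self._driver.query(user_id)
```

Because the transaction service only calls `get_transactions`, the team owning the data service can change the database underneath without coordinating a change with its consumers.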
Now, from an ops and devops perspective, there are tradeoffs to consider.
For developers, they really like microservices because it allows them to choose the best technology.
That's because the microservice has encapsulated all the implementation, and they simply have an API.
On the other hand, more technology choices means you have more need for expertise on the ops team
to be able to diagnose, pull logs, and find out what is the root cause.
The dev team likes that you can have microservices with more frequent deployments.
That means that they can deploy on a schedule that is convenient for the development schedule
versus a production schedule.
But that means potentially more change the ops team has to be aware of.
Now change isn't a bad thing inherently.
It could be that there are more frequent changes and they're smaller, and thus when you have to diagnose a problem,
it actually becomes easier.
So now let's go through and use the four golden signals - SRE golden signals - to diagnose a potential problem.
So say, for example, your end user is reporting slow response times.
In SRE golden signals terms, that is a latency problem.
So you look at the services it's immediately dependent upon, and you find that these two
are both within specification for latency, errors, traffic, and saturation.
However, the transaction service is reporting latency
and error problems. And going a little bit further down the line,
you see the data service reporting that it's oversaturated.
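The diagnosis just walked through, following breached signals down the dependency chain until you hit a service whose own dependencies are healthy, can be sketched as a small graph walk. The graph mirrors the example stack from the talk, but the data structures and function are hypothetical illustrations:

```python
# Dependency graph mirroring the example stack in the talk.
DEPENDENCIES = {
    "web-app": ["backend"],
    "backend": ["auth", "transaction"],
    "transaction": ["data"],
    "auth": [],
    "data": [],
}

# Hypothetical signal readings: which golden signals each service is breaching.
BREACHES = {
    "backend": ["latency"],
    "transaction": ["latency", "errors"],
    "data": ["saturation"],
}

def root_causes(service, deps=DEPENDENCIES, breaches=BREACHES):
    """Walk the dependency tree depth-first; a breaching service whose
    dependencies are all healthy is a candidate root cause."""
    causes = []
    for dep in deps.get(service, []):
        causes.extend(root_causes(dep, deps, breaches))
    if not causes and breaches.get(service):
        causes.append(service)
    return causes
```

Starting from the web app, the walk passes over the backend's and transaction service's latency and error symptoms and lands on the oversaturated data service, the same conclusion the speaker reaches by eye.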
So just by monitoring these four different signals, you're able to isolate the potential cause
and get to the root cause more quickly. But to do that in a more organized fashion,
you can use what's called an APM dashboard,
or application performance management.
That dashboard
includes views that encapsulate the four SRE golden signals.
The ops team can use that to monitor problems and be able to address them before they become more serious.
But they're not limited to that.
For example, the dev team could have their own custom dashboards just for their services.
And so when they have a problem, they may potentially even know about it before the ops team is reporting a problem.
And that ultimately can mean that you drive down your mean time to resolution and find your root cause faster,
or in terms of a car, that means your users will see fewer check engine lights.