Learning Library

SRE Golden Signals Explained

Key Points

  • The speaker likens SRE Golden Signals to a car’s check‑engine light, warning of issues early so they don’t turn into costly, catastrophic failures.
  • Golden Signals for microservices are defined as latency, error rate (with severity differences like 500 vs 400 errors), traffic volume, and saturation (resource utilization versus capacity).
  • An example microservices stack is described, showing web apps that call backend services, which in turn rely on authentication, transaction, and database‑wrapper services, all abstracted behind APIs.
  • Developers favor microservices because they can pick the best technology for each component, but this flexibility creates operational complexity that requires broader expertise to diagnose and resolve problems.
  • Effective SRE monitoring therefore hinges on tracking the four signals to detect anomalies early, balancing rapid development benefits with the need for robust ops support.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=-U9E1PhrM3o](https://www.youtube.com/watch?v=-U9E1PhrM3o)

**Duration:** 00:05:32

## Sections

- [00:00:00](https://www.youtube.com/watch?v=-U9E1PhrM3o&t=0s) **Check Engine Light for SRE** - The speaker likens a car’s check‑engine warning to SRE’s Golden Signals, explaining how early alerts (latency, errors, traffic, and saturation) help detect service issues before they cause severe failures.
- [00:03:06](https://www.youtube.com/watch?v=-U9E1PhrM3o&t=186s) **Microservices, Ops Complexity, and SRE Signals** - The speaker explains how microservice architecture increases operational expertise demands, but using SRE’s four golden signals enables quick root‑cause isolation of latency and error issues, such as an oversaturated data service.

## Full Transcript
0:00 If you're like me and you own an old car, you've probably had this experience. You're driving down the road and you see that dreaded check engine light. You may be thinking, "OK, this is bad news. I'm gonna need a tow truck, maybe have a repair", and you might be right. But imagine that same scenario where you didn't have that alert to the problem, and you just kept driving. What would happen? You'd have smoke blowing out the back. You'd potentially need a tow truck and have an even more expensive repair.

0:29 That same sort of analogy applies to SRE Golden Signals. Their purpose is to alert you to a problem before it becomes serious.

0:38 Let's go ahead and put a formal definition behind these. Now these are for your microservice applications, and the first metric that you're looking for is called latency. That refers to the time between when you make a request and you actually get a response. So for example, with a web application, it might be 200 to 400 milliseconds, and for an API call, it could be a fraction of that, say, 20 milliseconds.

1:06 The next metric is errors. Now, errors happen, that's a normal thing. But if there are too many, if there's a sudden spike, it can indicate a problem. Plus also, you have to keep in mind that not all errors are created equal. For example, a 500 error, where the server's down, is much more serious than, say, a 400 error, which means you can simply retry and potentially the problem will resolve itself.

1:29 T is for traffic. Traffic refers to the amount of requests coming in versus your expectations.

1:38 And finally, S for saturation. Saturation is the actual load versus your expected capacity. You could think of it as the tachometer on your car. It has a red line, and when it's oversaturated, you're receiving more requests than you can really handle.
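The four definitions above can be made concrete with a small calculation. The following is a minimal sketch, not the speaker's implementation: the `Request` record, the `golden_signals` function, the choice of a 95th-percentile latency, and the use of requests per second as the load measure for saturation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time between sending the request and getting a response
    status: int         # HTTP status code returned


def golden_signals(requests, window_s, expected_rps, capacity_rps):
    """Summarize one monitoring window into the four golden signals.

    All parameter names are hypothetical: `expected_rps` is the traffic you
    anticipate and `capacity_rps` is the load the service is sized for.
    """
    if not requests:
        raise ValueError("need at least one request in the window")

    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Nearest-rank 95th percentile, computed with integer math: ceil(0.95 * n) - 1.
    p95_index = (95 * n + 99) // 100 - 1

    # Not all errors are equal: count server (5xx) and client (4xx) errors apart.
    server_errors = sum(1 for r in requests if r.status >= 500)
    client_errors = sum(1 for r in requests if 400 <= r.status < 500)

    rps = n / window_s
    return {
        "latency_p95_ms": durations[p95_index],
        "server_error_rate": server_errors / n,
        "client_error_rate": client_errors / n,
        "traffic_vs_expected": rps / expected_rps,  # 1.0 means exactly as expected
        "saturation": rps / capacity_rps,           # above 1.0 is past the red line
    }
```

Keeping 5xx and 4xx rates separate mirrors the point that a server-side failure usually deserves more attention than a client-side one. A real monitoring system would typically measure saturation from resource utilization (CPU, memory, connection pools) rather than from request rate alone.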
1:58 Now with that definition out of the way, let's look at an example of a microservices architecture. Example application.

2:08 And we have several web apps running on a public cloud. And they're talking to a back end service. The back end service, in turn, relies on another microservice to handle authentication. The back end service uses a transaction service to get the information that's requested by the user, and it in turn relies on a data microservice, which might be wrapping a database like DB2 or MySQL.

2:40 The idea is these boxes here are wrappers of specific services for microservices, and the user of that doesn't have to concern themselves with the underlying implementation.

2:53 Now, from an ops and devops perspective, there are tradeoffs to consider. Developers really like microservices because it allows them to choose the best technology. That's because the microservice has encapsulated all the implementation, and they simply have an API. On the other hand, more technology choices means you have more need for expertise on the ops team to be able to diagnose, pull logs, and find out what is the root cause.

3:29 The dev team likes that you can have microservices with more frequent deployments. That means that they can deploy on a schedule that is convenient for the development schedule versus a production schedule. But that means potentially more change the ops team has to be aware of. Now change isn't a bad thing inherently. It could be that there are more frequent changes and they're smaller, and thus when you have to diagnose a problem, it actually becomes easier.

3:56 So now let's go through and use the four golden signals - SRE golden signals - to diagnose a potential problem. So say, for example, your end user is reporting response time errors. That, in SRE golden signals terms, is a latency error.
4:16 So you look at the services that it's immediately dependent upon, and you find that these two are within specification for latency, errors, traffic, and saturation. However, the transaction service is reporting latency and error problems. And going a little bit further down the line, you see the data service reporting that it's oversaturated.

4:36 So just by monitoring these four different signals, you're able to isolate the potential cause and get to the root cause more quickly.

4:42 But to do that in a more organized fashion, you can use what's called an APM dashboard, or application performance management. That dashboard includes views that encapsulate the four SRE golden signals. The ops team can use that to monitor problems and be able to address them before they become more serious.

5:08 But they're not limited to that. For example, the dev team could have their own custom dashboards just for their services. And so when they have a problem, they may potentially even know about it before the ops team is reporting a problem. And that ultimately can mean that you drive down your mean time to resolution and find your root cause faster, or in terms of a car, that means your users will see fewer check engine lights.
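The diagnosis walk described in the transcript, following dependencies downstream until the unhealthy signals stop, can be sketched in code. This is an illustrative sketch under assumptions: the service names, the `DEPENDS_ON` graph, the `OUT_OF_SPEC` readings, and both helper functions are hypothetical, chosen to match the example application, with the back end and authentication services assumed to be the two that are within specification.

```python
# Hypothetical dependency graph for the example application.
DEPENDS_ON = {
    "web-app": ["backend"],
    "backend": ["auth", "transaction"],
    "transaction": ["data"],
    "auth": [],
    "data": [],
}

# Hypothetical readings: which golden signals are out of specification per service.
OUT_OF_SPEC = {
    "web-app": {"latency"},                # what the end user is reporting
    "transaction": {"latency", "errors"},
    "data": {"saturation"},
}


def reachable(service, depends_on):
    """All services that `service` depends on, directly or transitively."""
    found, stack = [], list(depends_on.get(service, []))
    while stack:
        name = stack.pop()
        if name not in found:
            found.append(name)
            stack.extend(depends_on.get(name, []))
    return found


def likely_root_causes(service, depends_on, out_of_spec):
    """Unhealthy services at or below `service` with no unhealthy dependency.

    A service whose own dependencies are all healthy has nothing further
    downstream to blame, so it is the best place to start investigating.
    """
    candidates = [service] + reachable(service, depends_on)
    return sorted(
        name
        for name in candidates
        if out_of_spec.get(name)
        and not any(out_of_spec.get(dep) for dep in reachable(name, depends_on))
    )
```

With the readings above, `likely_root_causes("web-app", DEPENDS_ON, OUT_OF_SPEC)` returns `["data"]`, matching the transcript's conclusion that the oversaturated data service sits behind the latency the user sees. The result is only a starting point for investigation, since a healthy-looking dependency can still be the cause if its signals are not being measured well.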