# Curated AI Prevents Hallucinations in Incident Response

**Source:** [https://www.youtube.com/watch?v=5Igexz7kzMo](https://www.youtube.com/watch?v=5Igexz7kzMo)
**Duration:** 00:11:44

## Sections

- [00:00:00](https://www.youtube.com/watch?v=5Igexz7kzMo&t=0s) **Agentic AI Mitigates Sleep-Inertia Incidents** - The speaker explains how sleep inertia delays SRE response to critical 2 a.m. system anomalies and argues that agentic AI can accelerate anomaly detection and root-cause analysis, while cautioning about potential pitfalls of over-reliance.
- [00:03:23](https://www.youtube.com/watch?v=5Igexz7kzMo&t=203s) **Topology-Aware Context Curation for AI** - The speaker explains how an agentic AI uses a real-time service dependency graph to filter MELT data (metrics, events, logs, and traces) so that only incident-relevant signals are fed to the model, boosting efficiency and relevance.
- [00:08:04](https://www.youtube.com/watch?v=5Igexz7kzMo&t=484s) **AI-Driven Validation and Runbook Generation** - The passage explains how an agentic AI supports SREs by supplying evidence for root-cause hypotheses, offering verification steps, and automatically generating step-by-step runbooks to remediate incidents.

## Full Transcript
**0:00** Did you know that if I woke you up and then asked you to work on a problem, like resolving an IT system anomaly, it would take you on average something like 22 minutes to move from a sleeping state to a cognitive state where you're actually productive? And with every minute of downtime potentially costing thousands of dollars, that can be an expensive result of sleep inertia.

**0:32** Fortunately, agentic AI can help with anomaly detection and with resolution. But before I get to how, let's discuss where the magic bullet of "let's use AI" could actually lead you astray. So, imagine it's 2:00 a.m. and an observability tool just detected a business-impacting incident. Something pretty bad, like your authentication service rejecting 90% of user logins, or maybe a super laggy payment gateway. So now it's up to our sleepy SRE, that's site reliability engineer, to come to the rescue. And what they need to do is, first of all, identify what the specific problem is. When they've done that, they need to figure out the cause of the problem. And once they've figured that out, they need to come to a resolution.

**1:38** On the face of it, this is an area where AI is well suited to help, because IT environments generate massive volumes of telemetry data, like logs and traces, and traditional incident response requires SREs to manually sift through all of this noisy data and diagnose the probable root cause. So, lots of data; we're looking for a needle in a haystack. Gosh, why not just send all of this telemetry data into a large language model and ask it to sift through everything and figure out the root cause for us? Well, many LLMs have pretty huge context windows, but they're not bottomless pits, and a single node cluster can crank out gigabytes of log lines per hour.

**2:24** So, if you pipe that fire hose straight into the large language model and ask it to come up with a cause, well, welcome to hallucination city. Because LLMs rely on statistical patterns, and if you overfeed them unrelated noise, they will confidently fabricate causal links that just don't exist. Because the LLM's goal is to predict plausible words rather than to verify facts, it will happily stitch together all sorts of coincidences, like CPU blips and benign restarts and old warning logs, into a neat but entirely imaginary narrative. So, if we want to use AI in anomaly detection and resolution, this brute-force, dump-it-all-into-the-model kind of approach isn't going to get us very far. What we actually need is context curation.

**3:23** And that's the first of several areas where agentic AI can help. So we have here a whole bunch of collected data: we've got metrics, events, logs, and traces. That's actually better known as MELT. And then we've got this thing that I mentioned already called context curation, which is going to take a look at all of this MELT data. Instead of just dumping all of that collected data straight into our AI model, we're going to do an intermediary step: we are going to strategically feed it only the signals that actually matter for the incident at hand. And we're going to do this through topology-aware correlation.

**4:17** Now, an observability platform maintains a real-time map of how all the services connect and depend on each other. It knows that an authentication service talks to a user database, and that it sits behind a load balancer which connects to, well, a load of stuff; you get the picture. But when an incident fires, the agent doesn't just grab random logs from everywhere.
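As a minimal sketch of this topology-aware filtering, here is a toy dependency graph traversal. The service names and graph are illustrative assumptions, not from any real observability platform:

```python
from collections import deque

# Hypothetical service dependency graph: service -> services it depends on.
DEPENDENCIES = {
    "auth-service": ["user-db", "session-cache"],
    "payment-gateway": ["payment-db", "auth-service"],
    "reporting-service": ["reporting-db"],  # unrelated to auth incidents
}

def curate_context(incident_service, deps, max_depth=2):
    """Return only the services whose telemetry is relevant: the failing
    service plus everything it (transitively) depends on, up to max_depth."""
    relevant = {incident_service}
    queue = deque([(incident_service, 0)])
    while queue:
        svc, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for dep in deps.get(svc, []):
            if dep not in relevant:
                relevant.add(dep)
                queue.append((dep, depth + 1))
    return relevant

# The 2 a.m. auth incident: fetch telemetry only from these components.
scope = curate_context("auth-service", DEPENDENCIES)
```

Note that the unrelated reporting microservice never enters the scope, so its logs are never fed to the model.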
**4:39** It uses this dependency graph to pull in telemetry data only from the components that are actually involved. So when that authentication service starts rejecting logins at 2 a.m., the agent knows to check the user database it depends on, and maybe the Redis cache it uses for sessions, and any recent deployments to those specific services. What it's not doing is wasting time analyzing logs from a completely unrelated reporting microservice that just happens to be running on the same cluster.

**5:08** So let's work through how an AI incident investigation agent might work using this contextually correlated data. The process starts when an anomaly triggers an alert. So the incoming thing that starts this all off is an incident alert coming in. And to be clear, this is a post-incident-detection scenario; we're not predicting when something might go wrong, we're diagnosing what happened so that we can fix it. From the incident alert, the agentic AI considers this curated context that we have been talking about, which is specific to the actual problem we're trying to solve.

**5:51** Now, we've talked on this channel before about how AI agents work. This basically works in a number of different phases. We start with agents perceiving their environment. Once they've perceived their environment, they can reason about the best steps they should take. Once they've reasoned, they can act on that action plan they've built, and then they observe the results of that action. And then round and round we go, back to perceive, in a feedback loop. That is what's happening here, where we're going to form a hypothesis as to what the actual problem is, using causal AI to analyze those disparate MELT data sources.

**6:53** Now, the agent will systematically request additional data and evidence needed to validate or refine its hypothesis. So we might have more curated context coming in here. For example, if a web service is slow, the agent might go and fetch some related logs. Then it might notice that there's an error connecting to the database, so that prompts it to retrieve some database metrics. Then it realizes the database was recently updated, so now it's going to prompt a check of configuration changes, and so on and on we go. Ultimately this leads to the big moment: the identification of the probable root cause. This is what we think was the problem all along; it's where the agent pinpoints the most likely root cause of the incident.

**7:43** But it doesn't stop there. We also have explainability. Now, explainability actually puts some weight behind this root cause, because the agent's reasoning process, how it arrived at the probable root cause, can be made transparent to the human operator, showing its chain of thought. The agent can also provide some supporting evidence for that probable root cause that led to the conclusion. Together, the chain-of-thought reasoning and the supporting evidence can be reviewed by an SRE to supervise and validate the agent's analysis.

**8:19** So we figured out the probable root cause. Great. But the ultimate goal here is to actually resolve the incident, and agentic AI can assist an SRE in four ways to do that. One way it can assist is validation. Rather than taking a probable root cause at face value, agentic AI can generate steps to help an SRE validate that the identified root cause is actually correct. We still want human input into a production system before remediation, so providing verification steps will be a step in the right direction.
**9:02** Now, upon validating that we're on the right path, an agentic AI can then produce a step-by-step runbook. That is kind of an action plan to fix the issue; this runbook is essentially an ordered list of recommended remediation steps. For example, if the root cause is a full disk causing a database to crash, the runbook might first archive old log files on the DB server to free space, then restart the database service, and then monitor disk usage growth and configure an alert if it exceeds a certain capacity. The idea is that the SRE can follow the script in this runbook quickly, and even if they're not deeply familiar with that component, it will still guide them through what to do.

**9:57** Agentic AI can also take these suggested actions and build automation scripts, or what we can call workflows, to help as well. For example, the agent could turn each of these runbook steps into a bash script or an Ansible playbook snippet, and here the AI provides the exact command syntax and the exact parameters for each step.

**10:18** Now, another helpful byproduct of these AI agents is the automatic documentation that comes from all of this. After resolution, the AI can generate a summary incident report, essentially writing the post-incident review for you. Agentic AI can also document an ongoing summary of the incident in progress, so that new people working on the incident are brought up to speed.

**10:46** So, agentic AI can help redefine how IT teams handle anomalies and outages. These agents operate under human oversight; they're augmenting rather than replacing human decision-makers. An SRE can verify the AI's findings and then focus energy on the remediation steps, many of which the AI might have already helped set up. And all of this leads to a substantial reduction in the all-important MTTR, that's mean time to repair. Not to mention less operational stress, and a bit less sleep inertia when those 2 a.m. system alerts come in.
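As a rough illustration of the runbook-to-script idea from the full-disk example above, here is a sketch that renders structured runbook steps into a reviewable bash script. The commands and paths are illustrative assumptions, not vetted remediation:

```python
# Hypothetical runbook for the full-disk / crashed-database scenario:
# (description, shell command) pairs an SRE can review before running.
RUNBOOK = [
    ("archive old log files on the DB server to free space",
     "tar czf /tmp/old-logs.tar.gz /var/log/db/*.log && rm /var/log/db/*.log"),
    ("restart the database service",
     "systemctl restart postgresql"),
    ("check disk usage after the restart",
     "df -h /var/lib/postgresql"),
]

def render_script(steps):
    """Render runbook steps as a commented bash script for SRE review."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
    for i, (description, command) in enumerate(steps, start=1):
        lines.append(f"# Step {i}: {description}")
        lines.append(command)
        lines.append("")
    return "\n".join(lines)

script = render_script(RUNBOOK)
```

The generated script keeps the human in the loop: each command carries the runbook step it implements as a comment, so the SRE can audit it line by line before execution.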