# Curated AI Prevents Hallucinations in Incident Response

**Source:** [https://www.youtube.com/watch?v=5Igexz7kzMo](https://www.youtube.com/watch?v=5Igexz7kzMo)
**Duration:** 00:11:44

## Sections

- [00:00:00](https://www.youtube.com/watch?v=5Igexz7kzMo&t=0s) **Agentic AI Mitigates Sleep-Inertia Incidents** - The speaker explains how sleep inertia delays SRE response to critical 2 a.m. system anomalies and argues that agentic AI can accelerate anomaly detection and root-cause analysis, while cautioning about potential pitfalls of over-reliance.
- [00:03:23](https://www.youtube.com/watch?v=5Igexz7kzMo&t=203s) **Topology-Aware Context Curation for AI** - The speaker explains how an agentic AI uses a real-time service dependency graph to filter MELT data (metrics, events, logs, and traces) so that only incident-relevant signals are fed to the model, boosting efficiency and relevance.
- [00:08:04](https://www.youtube.com/watch?v=5Igexz7kzMo&t=484s) **AI-Driven Validation and Runbook Generation** - The passage explains how an agentic AI supports SREs by supplying evidence for root-cause hypotheses, offering verification steps, and automatically generating step-by-step runbooks to remediate incidents.

## Full Transcript
**0:00** Did you know that if I woke you up and then asked you to work on a problem, like resolving an IT system anomaly, it would take you on average something like 22 minutes to move from a sleeping state to a cognitive state where you're actually productive? And with every minute of downtime potentially costing thousands of dollars, that can be an expensive result of sleep inertia.

**0:32** Fortunately, agentic AI can help with anomaly detection and with resolution. But before I get to how, let's discuss where the magic bullet of "let's use AI" could actually lead you astray. So, imagine it's 2:00 a.m. and an observability tool just detected a business-impacting incident. Something pretty bad, like your authentication service rejecting 90% of user logins, or maybe a super laggy payment gateway. So now it's up to our sleepy SRE, that's site reliability engineer, to come to the rescue. And what they need to do is, first of all, identify what the specific problem is. When they've done that, they need to figure out the cause of the problem. And once they've figured that out, they need to come to a resolution.

**1:38** On the face of it, this is an area where AI is well suited to help, because IT environments generate massive volumes of telemetry data, like logs and traces, and traditional incident response requires SREs to manually sift through all of this noisy data and diagnose the probable root cause. So, lots of data; we're looking for a needle in a haystack. Gosh, why not just send all of this telemetry data into a large language model and ask it to sift through everything and figure out the root cause for us? Well, many LLMs have pretty huge context windows, but they're not bottomless pits, and a single node cluster can crank out gigabytes of log lines per hour.

**2:24** So, if you pipe that fire hose straight into the large language model and ask it to come up with a cause, well, welcome to hallucination city. Because LLMs rely on statistical patterns, and if you overfeed them unrelated noise, they will confidently fabricate causal links that just don't exist. Because the LLM's goal is to predict plausible words rather than to verify facts, it will happily stitch together all sorts of coincidences, like CPU blips and benign restarts and old warning logs, into a neat but entirely imaginary narrative. So, if we want to use AI in anomaly detection and resolution, this brute-force, dump-it-all-into-the-model kind of approach isn't going to get us very far. What we actually need is context curation.

**3:23** And that's the first of several areas where agentic AI can help. So we have here a whole bunch of collected data: we've got metrics, events, logs, and traces. That's actually better known as MELT. And then we've got this thing that I mentioned already called context curation, which is going to take a look at all of this MELT data. Instead of just dumping all of that collected data straight into our AI model, we're going to do an intermediary step: we are going to strategically feed it only the signals that actually matter for the incident at hand. And we're going to do this through topology-aware correlation.

**4:17** Now, an observability platform maintains a real-time map of how all the services connect and depend on each other. It knows that an authentication service talks to a user database, and that it sits behind a load balancer which connects to, well, a load of stuff; you get the picture. But when an incident fires, the agent doesn't just grab random logs from everywhere.
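As a minimal sketch of this topology-aware filtering, here is a toy dependency graph traversal. The service names and graph are illustrative assumptions, not from any real observability platform:

```python
from collections import deque

# Hypothetical service dependency graph: service -> services it depends on.
DEPENDENCIES = {
    "auth-service": ["user-db", "session-cache"],
    "payment-gateway": ["payment-db", "auth-service"],
    "reporting-service": ["reporting-db"],  # unrelated to auth incidents
}

def curate_context(incident_service, deps, max_depth=2):
    """Return only the services whose telemetry is relevant: the failing
    service plus everything it (transitively) depends on, up to max_depth."""
    relevant = {incident_service}
    queue = deque([(incident_service, 0)])
    while queue:
        svc, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for dep in deps.get(svc, []):
            if dep not in relevant:
                relevant.add(dep)
                queue.append((dep, depth + 1))
    return relevant

# The 2 a.m. auth incident: fetch telemetry only from these components.
scope = curate_context("auth-service", DEPENDENCIES)
```

Note that the unrelated reporting microservice never enters the scope, so its logs are never fed to the model.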
**4:39** It uses this dependency graph to pull in telemetry data only from the components that are actually involved. So when that authentication service starts rejecting logins at 2 a.m., the agent knows to check the user database it depends on, and maybe the Redis cache it uses for sessions, and any recent deployments to those specific services. What it's not doing is wasting time analyzing logs from a completely unrelated reporting microservice that just happens to be running on the same cluster.

**5:08** So let's work through how an AI incident investigation agent might work using this contextually correlated data. The process starts when an anomaly triggers an alert. So the incoming thing that starts this all off is an incident alert coming in. And to be clear, this is a post-incident-detection scenario; we're not predicting when something might go wrong, we're diagnosing what happened so that we can fix it. From the incident alert, the agentic AI considers this curated context that we have been talking about, which is specific to the actual problem we're trying to solve.

**5:51** Now, we've talked on this channel before about how AI agents work. This basically works in a number of different phases. We start with agents perceiving their environment. Once they've perceived their environment, they can reason about the best steps they should take. Once they've reasoned, they can act on that action plan they've built, and then they observe the results of that action. And then round and round we go, back to perceive, in a feedback loop. That is what's happening here, where we're going to form a hypothesis as to what the actual problem is, using causal AI to analyze those disparate MELT data sources.

**6:53** Now, the agent will systematically request additional data and evidence needed to validate or refine its hypothesis. So we might have more curated context coming in here. For example, if a web service is slow, the agent might go and fetch some related logs. Then it might notice that there's an error connecting to the database, so that prompts it to retrieve some database metrics. Then it realizes the database was recently updated, so now it's going to prompt a check of configuration changes, and so on and on we go. Ultimately this leads to the big moment: the identification of the probable root cause. This is what we think was the problem all along; it's where the agent pinpoints the most likely root cause of the incident.

**7:43** But it doesn't stop there. We also have explainability. Now, explainability actually puts some weight behind this root cause, because the agent's reasoning process, how it arrived at the probable root cause, can be made transparent to the human operator, showing its chain of thought. The agent can also provide some supporting evidence for that probable root cause that led to the conclusion. Together, the chain-of-thought reasoning and the supporting evidence can be reviewed by an SRE to supervise and validate the agent's analysis.

**8:19** So we figured out the probable root cause. Great. But the ultimate goal here is to actually resolve the incident, and agentic AI can assist an SRE in four ways to do that. One way it can assist is validation. Rather than taking a probable root cause at face value, agentic AI can generate steps to help an SRE validate that the identified root cause is actually correct. We still want human input into a production system before remediation, so providing verification steps will be a step in the right direction.
**9:02** Now, upon validating that we're on the right path, an agentic AI can then produce a step-by-step runbook. That is kind of an action plan to fix the issue; this runbook is essentially an ordered list of recommended remediation steps. For example, if the root cause is a full disk causing a database to crash, the runbook might first archive old log files on the DB server to free space, then restart the database service, and then monitor disk usage growth and configure an alert if it exceeds a certain capacity. The idea is that the SRE can follow the script in this runbook quickly, and even if they're not deeply familiar with that component, it will still guide them through what to do.

**9:57** Agentic AI can also take these suggested actions and build automation scripts, or what we can call workflows, to help as well. For example, the agent could turn each of these runbook steps into a bash script or an Ansible playbook snippet, and here the AI provides the exact command syntax and the exact parameters for each step.

**10:18** Now, another helpful byproduct of these AI agents is the automatic documentation that comes from all of this. After resolution, the AI can generate a summary incident report, essentially writing the post-incident review for you. Agentic AI can also document an ongoing summary of the incident in progress, so that new people working on the incident are brought up to speed.

**10:46** So, agentic AI can help redefine how IT teams handle anomalies and outages. These agents operate under human oversight; they're augmenting rather than replacing human decision-makers. An SRE can verify the AI's findings and then focus energy on the remediation steps, many of which the AI might have already helped set up. And all of this leads to a substantial reduction in the all-important MTTR, that's mean time to repair. Not to mention less operational stress, and a bit less sleep inertia when those 2 a.m. system alerts come in.
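As a rough illustration of the runbook-to-script idea from the full-disk example above, here is a sketch that renders structured runbook steps into a reviewable bash script. The commands and paths are illustrative assumptions, not vetted remediation:

```python
# Hypothetical runbook for the full-disk / crashed-database scenario:
# (description, shell command) pairs an SRE can review before running.
RUNBOOK = [
    ("archive old log files on the DB server to free space",
     "tar czf /tmp/old-logs.tar.gz /var/log/db/*.log && rm /var/log/db/*.log"),
    ("restart the database service",
     "systemctl restart postgresql"),
    ("check disk usage after the restart",
     "df -h /var/lib/postgresql"),
]

def render_script(steps):
    """Render runbook steps as a commented bash script for SRE review."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
    for i, (description, command) in enumerate(steps, start=1):
        lines.append(f"# Step {i}: {description}")
        lines.append(command)
        lines.append("")
    return "\n".join(lines)

script = render_script(RUNBOOK)
```

The generated script keeps the human in the loop: each command carries the runbook step it implements as a comment, so the SRE can audit it line by line before execution.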