AI‑Driven Incident Resolution with Watson AIOps
Key Points
- Faster, more frequent cloud deployments boost delivery speed but also increase incident volume and resolution time, straining IT operations and potentially upsetting customers.
- Incident resolution is measured by metrics such as Mean Time to Resolution (MTTR), Mean Time to Fix (MTTF), and especially Mean Time to Identify (MTTI), which can vary widely depending on operator knowledge and system complexity.
- IBM Cloud Pak for Watson AIOps leverages machine‑learning‑driven anomaly detection and unsupervised learning on multi‑source data (logs, PagerDuty, Splunk, ServiceNow, etc.) to automatically surface likely causes and cut MTTI without extensive model training.
- The platform provides out‑of‑the‑box, pre‑trained models and intelligent search of prior run‑books to quickly resolve “lucky‑day” incidents, while for more complex “not‑so‑lucky” cases it auto‑generates incident summaries, hypothesizes impacted services, and groups change‑related events to speed root‑cause analysis.
Sections
- [00:00:00](https://www.youtube.com/watch?v=ph8p-eP9Y90&t=0s) AI-Driven Incident Resolution Acceleration - The speaker highlights that faster cloud deployments lengthen incident resolution times for ops teams, then demonstrates how IBM Cloud Pak for Watson AIOps leverages automation and machine‑learning‑based anomaly detection to shorten mean time to identify (MTTI) and overall mean time to resolution.
- [00:03:06](https://www.youtube.com/watch?v=ph8p-eP9Y90&t=186s) AI-Driven Incident Resolution - The passage explains how Cloud Pak for Watson AIOps uses smart topology, AI, and NLP to summarize diagnostics, consolidate siloed data, surface similar tickets, and guide run‑book execution to resolve incidents efficiently.
- [00:06:11](https://www.youtube.com/watch?v=ph8p-eP9Y90&t=371s) Proactive Root Cause Analysis with AIOps - The segment explains how correlating metrics—such as a spiking disk‑busy rate—helps pinpoint a database overload as the app’s slowdown cause, and how Cloud Pak for Watson AIOps automates this detection and resolution to boost operational efficiency.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=ph8p-eP9Y90](https://www.youtube.com/watch?v=ph8p-eP9Y90)
**Duration:** 00:07:43
- First the good news:
cloud development strategies mean more
and faster deployments.
Now the bad news:
more and faster deployments can impact your IT operations
team, increasing the time to resolve incidents.
That potentially means unhappy customers and more resources
dedicated to keeping your systems running smoothly.
Hi, I'm Dan Kehn from IBM Cloud®.
So what's your ops team to do?
I'll cover that question in two quick demonstrations,
but first let's briefly review the phases
of incident resolution, then I'll explain how automation
and AI can help.
Mean time to resolution is the big picture.
It covers everything from the time a problem starts
until it's finally resolved.
The longer it takes to resolve,
the worse the impact to your organization.
Some parts of resolution time are consistent,
like mean time to fix, or MTTF.
Others vary significantly, like mean time to identify,
or MTTI, which can run from hours to days.
That's because it relies on operators' experience
and knowledge of the system relationships.
To help lower MTTI, IBM Cloud Pak® for Watson AIOps
comes with built-in anomaly identification strategies
that use machine learning.
Of course, machine learning works best when there's lots
of varied, high-quality data.
That's why Cloud Pak for Watson AIOps consumes data
from many sources. It then uses AI to discover
relationships across these different data sources
and weigh possible causes.
This unsupervised learning reduces the time needed
to realize the value of AI,
so instead of requiring extensive training,
you can get started right away,
with out-of-the-box, pre-trained models.
Okay, with that intro out of the way,
I'd like to walk you through two incidents and how
Cloud Pak for Watson AIOps can help you resolve
them more quickly.
The first I call lucky day.
With the smart search of prior solutions,
you close the incident using documented steps in a run book.
The second, which unfortunately happens more often
than we'd like to admit, I call not-so-lucky day.
This is an undiscovered problem that requires sifting
through misbehaving servers and confirming the root cause.
Imagine you're at lunch and a notification
from Cloud Pak for Watson AIOps comes in.
You double click to check it out.
In the problem summary, you recognize one of the services
you monitor, so you decide to investigate.
Cloud Pak for Watson AIOps shows you a summary view
based on data gathered directly from your app monitoring
logs and from integrated tools like PagerDuty, Splunk,
and ServiceNow.
The chat entry shows you several different fields.
First, the impacted application,
the train ticket application.
Next, a hypothesized source of the problem,
the ticket info service.
It also shows you incident severity and status.
Finally, you can see a summary ticket that was automatically
generated by Watson AIOps.
Two key questions for resolving a problem are what changed
and what happened nearby?
Cloud Pak for Watson AIOps groups events
that represent change and shows a topology map
of nearby connected services.
That is, what changed, how recently it changed,
and how frequently it changed gives you hints
about the source of the problem.
With smart topology and an understanding of the context,
you now know where to start.
Okay, I've shown you how Cloud Pak for Watson AIOps
provides a summary of key diagnostic information,
but it also helps consolidate information
from multiple tools across different data silos.
This view shows a summary of the anomalies that underlie
the problem report.
This helps reduce information overload
and avoid notification flooding.
It also saves you from the hassle of chasing problems
across different tools.
Now that you have a better understanding of the incident,
you take action to resolve it.
Cloud Pak for Watson AIOps has identified similar tickets
based on data interpreted with natural language processing
and pre-trained AI models.
This can help you quickly identify relevant tickets
with possible solutions.
By pinpointing specific actions that your team
has taken in the past, you don't have to deal with the
tedium of manually reviewing a list of prior tickets.
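As a rough illustration of ranking prior tickets by text similarity (a toy bag-of-words cosine measure, not the pre-trained NLP models Watson AIOps actually uses; the ticket texts below are invented):

```python
import math
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, using bag-of-words counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def rank_similar_tickets(incident: str, tickets: list[str]) -> list[tuple[float, str]]:
    """Order prior tickets from most to least similar to the new incident."""
    return sorted(((similarity(incident, t), t) for t in tickets), reverse=True)

# Hypothetical incident text and ticket history.
incident = "ticket info service high latency after deploy"
history = [
    "ticket info service latency spike following deploy",
    "payment gateway timeout on checkout",
    "database connection pool exhausted",
]
ranked = rank_similar_tickets(incident, history)
```

The top-ranked ticket here shares five terms with the new incident, so it surfaces first, which is the same idea as pinpointing prior actions without manually reviewing the whole list.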
You confirm the run book matches the current problem
and resolve the incident.
Excellent.
In the prior investigation, we got lucky.
The problem had been resolved once before,
so we only had to execute a run book.
But what if it wasn't so easy?
And that's where AI and machine learning really shine.
Cloud Pak for Watson AIOps consumes huge volumes
of your system data,
structured data like configuration topology,
semi-structured data like logs and ticket information,
even unstructured data like commit comments.
Based on this data, it learns what normal looks like so
it can alert you when metrics are outside expected bounds.
But the metric manager in Cloud Pak for Watson AIOps
doesn't rely on fixed threshold tracking.
This avoids the trap where a high fixed threshold
generates too few alerts and real problems
are ignored until they become serious
or a low fixed threshold generates too many alerts
and your operators simply tune them out.
Instead, Cloud Pak for Watson AIOps uses machine learning
to understand what's normal behavior
for key performance metrics and automatically sets
adaptive thresholds based on actual system experience.
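A minimal sketch of that adaptive-threshold idea, assuming a simple mean ± k standard deviations rule over a metric's recent history (Watson AIOps' actual learned models are more sophisticated, and the baseline values here are made up):

```python
import statistics

def adaptive_threshold(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Bounds derived from observed behavior: mean ± k standard deviations."""
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)
    return mean - k * spread, mean + k * spread

def is_anomalous(value: float, history: list[float], k: float = 3.0) -> bool:
    """Flag a metric reading that falls outside the learned bounds."""
    low, high = adaptive_threshold(history, k)
    return not (low <= value <= high)

# Hypothetical response-time baseline in milliseconds.
baseline = [42.0, 45.0, 43.5, 44.0, 41.0, 46.0, 44.5, 43.0]
```

Because the bounds come from the data itself, they move with the system: a reading near the baseline passes, while a large spike is flagged, with no hand-tuned threshold to maintain.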
Now let's look at how Cloud Pak for Watson AIOps
helps you with your not-so-lucky day.
It's later in the week and you're handling another incident.
This time it's a claims app and users are reporting
really bad response times.
You start your investigation by opening
the events dashboard.
Cloud Pak for Watson AIOps recognizes many data sources
for event correlation.
For example, data from LogDNA, ServiceNow, PagerDuty,
and hundreds of other integrations.
The dashboard groups related events based
on inferred associations like topology,
time of occurrence, and location.
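To illustrate just one of those associations, here is a toy sketch of grouping events that occur close together in time; the event names and timestamps are invented, and real correlation also weighs topology and location:

```python
from datetime import datetime, timedelta

Event = tuple[datetime, str]

def group_by_time(events: list[Event], window: timedelta) -> list[list[Event]]:
    """Place events in the same group when they occur within `window` of the previous one."""
    groups: list[list[Event]] = []
    for event in sorted(events):
        if groups and event[0] - groups[-1][-1][0] <= window:
            groups[-1].append(event)  # close in time: join the current group
        else:
            groups.append([event])    # gap too large: start a new group
    return groups

t0 = datetime(2023, 5, 1, 12, 0, 0)
events = [
    (t0, "disk_busy high on storage-svc"),
    (t0 + timedelta(seconds=30), "slow queries on claims-db"),
    (t0 + timedelta(minutes=20), "deploy finished on auth-svc"),
]
groups = group_by_time(events, timedelta(minutes=5))
```

The first two events land in one group and the later deploy event in another, which is the kind of clustering that keeps the dashboard from presenting every event in isolation.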
Let's take a closer look at the events leading up
to the claims app slowing down.
This metrics timeline can help you
with problem determination and assess potential impacts.
The green indicates normal behavior over time.
You can visualize baseline performance compared
to the recently-captured data.
The secondary view shows metrics
based on application observability.
These are metrics discovered and identified
as being related to the app's response time.
Here we see the disk busy metric is unstable,
let's add it to the timeline for further investigation.
Now the primary view shows that just before
the response time problems, the disk busy
for the storage service went up
to nearly 100% utilization and it stayed there.
That's never good.
Based on this brief analysis, you know the data store
for the database was overloaded.
It's a prime candidate for the root cause
of the app's slowdown.
The next step is confirming your analysis by checking
the service logs and then proposing a proper solution.
The analysis of the relationships between these metrics
helps you understand the full scope of the problem.
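That metric-relationship analysis can be illustrated with a Pearson correlation between the two series from this scenario; the sample values below are invented for the sketch:

```python
import math
import statistics

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length metric series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented samples: disk-busy utilization (%) and app response time (ms),
# both jumping at the same point in the timeline.
disk_busy = [20.0, 22.0, 25.0, 90.0, 95.0, 98.0]
resp_time = [40.0, 42.0, 45.0, 300.0, 350.0, 400.0]
r = pearson(disk_busy, resp_time)
```

A coefficient near 1.0 is the numeric version of what the timeline showed visually: disk-busy and response time move together, supporting the overloaded data store as the prime root-cause candidate.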
Once the fix is delivered, you can say with confidence,
you've identified and resolved the true cause.
Okay, let's recap.
When it comes to IT Ops, it's better to be proactive
than to be constantly forced into a reactive posture.
With dynamically-determined rules, the data analysis
by Cloud Pak for Watson AIOps helps you get
to resolution quicker, potentially before your users
even notice a problem.
And you don't have to manage rules,
consider how they interact with each other,
or worry about how rules should change
when the environment changes.
What can automation mean to your company?
How about spending 25% more time on work that drives
your business, or reducing manual labor costs by 50%?
Thanks for watching.
If you'd like to see more videos like this in the future,
please click like and subscribe.
If you want to learn more about Cloud Pak for Watson AIOps,
make sure to check out the links in the description.