AIOps: Preventing Downtime Costs

Key Points

  • Unplanned IT downtime can cost businesses millions, damage their brand, and even trigger regulatory penalties.
  • AIOps (Artificial Intelligence for Operations) leverages AI, machine learning, and advanced analytics on operational data to give IT teams faster, data‑driven decision‑making power.
  • In a typical outage—like an invoicing app failing for a real‑estate firm—AIOps helps pinpoint the problem and accelerate restoration.
  • The platform ingests key sources (events, metrics, logs, alerts), creates baselines and thresholds, and defines what “normal” looks like for the application.
  • It then contextualizes and surfaces actionable insights through collaborative tools, enabling operators and SREs to engage, troubleshoot, and restore service efficiently.

Full Transcript

# AIOps: Preventing Downtime Costs

**Source:** [https://www.youtube.com/watch?v=XbYKAJc5jhg](https://www.youtube.com/watch?v=XbYKAJc5jhg)
**Duration:** 00:05:31

## Sections

- [00:00:00](https://www.youtube.com/watch?v=XbYKAJc5jhg&t=0s) **AIOps: Preventing Costly Downtime** - The speaker highlights the multi‑million‑dollar, brand‑damage, and regulatory risks of unexpected IT outages before defining AIOps as the application of AI, machine learning, and advanced analytics to operational data to empower IT professionals to detect, diagnose, and remediate incidents quickly.
- [00:03:06](https://www.youtube.com/watch?v=XbYKAJc5jhg&t=186s) **AIOps-Driven ChatOps Incident Resolution** - The passage outlines how AIOps ingests logs, provides contextual insights via chat‑ops to SREs, and enables automated script execution to quickly resolve incidents.

## Full Transcript
0:00 So what can a few minutes of unplanned downtime cost a business? Six or seven figures in lost revenue? A damaged brand? Regulatory action? Hi, my name is Albert Traylor and I'm with IBM Cloud, and today I'm here to talk about artificial intelligence for operations, also known as AIOps.

0:23 So before we go into the definition of AIOps, let's think about a scenario. Imagine that you are an IT operations professional at a successful real estate company. Let's call that company "Housing For All."

0:38 At H4A you support a portfolio of applications, one of which is an invoicing application used by tens of thousands of partners every day. For this specific application, your focus is to make sure it's up and that partners can deliver invoices on it consistently, on a regular basis.

0:59 One day you settle into your desk, you get a cup of coffee, and you get a phone call, just like that. Out of nowhere, a sales rep is calling to complain that a partner is trying to upload an invoice and has been unable to all morning because the application is down. What do you do in that scenario to get this application back up and running?

1:19 So before we continue down that scenario, let's talk about AIOps and the textbook definition. Artificial intelligence for operations is about the application of artificial intelligence, machine learning models, and advanced analytics to IT operational data. The objective is to empower IT professionals and operations professionals with the data they need to make decisions and ultimately resolve and restore service to an application faster.

1:50 So with that definition in mind, let's talk about how we can get this invoice application back up and running. Now let's think about the key data sources for our invoicing application.
2:06 Let's start with events, metrics, logs, alerts, and a few others, but for now let's focus on these data sources for our specific invoicing application. In the real world this will look different depending on your application architecture, the type of data sources that are applicable for your application, and data regulatory requirements.

2:32 So we've got our key data sources. Now how does this fit into our model for AIOps? Let's think about three key steps. The first one we're going to call monitor and discover.

2:52 So in this first step, the data is ingested by the AIOps platform, which applies thresholds and creates baselines for your specific application. So let's think about it this way: what is normal for my invoicing application? What is the log ingestion rate? How many errors are acceptable based on our SLO, or service level objective?

3:12 So we now have that information. The next step of the process is about engage and context. This is where AIOps really shines: it takes all of that ingested data and surfaces it to an IT operations professional or site reliability engineer in the form of a collaboration solution, also known as ChatOps. So up until now everything's been done in the background, and as soon as this incident pops up with our invoicing application, it's surfaced via ChatOps to our site reliability engineer.

3:56 Less is more here. They have the context on where the incident is located in the application, what specific actions are recommended to resolve it, and most importantly, how those actions are based on incidents like this that have come up in the past.

4:11 So now our SRE is armed with that information, and it's our last phase, which is act and automate.
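The monitor and discover step described above could be sketched roughly as follows. This is a minimal illustration, not how any particular AIOps platform works: the mean-plus-three-standard-deviations rule, the function names, the sample error counts, and the SLO limit of 10 errors per minute are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Baseline:
    """What "normal" looks like for one metric of one application."""
    mean: float
    stdev: float
    threshold: float  # values above this are treated as anomalous


def build_baseline(samples: list[float], sigmas: float = 3.0) -> Baseline:
    """Derive a baseline and an upper threshold from historical samples.

    Assumption: "normal" is modeled as mean + N standard deviations.
    """
    mu = mean(samples)
    sd = stdev(samples)
    return Baseline(mean=mu, stdev=sd, threshold=mu + sigmas * sd)


def find_anomalies(values: list[float], baseline: Baseline, slo_limit: float) -> list[int]:
    """Return indexes of values above the learned threshold or the SLO limit,
    whichever is stricter."""
    limit = min(baseline.threshold, slo_limit)
    return [i for i, v in enumerate(values) if v > limit]


# Hypothetical history: errors per minute for the invoicing application.
history = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3]
baseline = build_baseline(history)

# Hypothetical live readings; the assumed SLO allows at most 10 errors per minute.
live = [3, 2, 4, 25, 31]
anomalies = find_anomalies(live, baseline, slo_limit=10)
print(anomalies)
```

With this sample data the last two readings are flagged, which is the point at which an incident would be surfaced to the SRE.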
4:24 So everything's been done in the background, this information has been surfaced to our ITOps professional or SRE, and then finally we have to act and automate to resolve this issue. The suggested options that are available via ChatOps enable the ITOps professional to select what has worked in the past and, with a click, activate a script or runbook to resolve this issue as soon as it's detected. This gets our invoicing application back up and running faster and makes sure that our partners are happier with this experience.

4:58 So now we've got the overview: three key steps and the type of data that's ingested by AIOps. In summary, this system allows IT professionals to solve problems faster, to keep applications up and running, and to help protect the business in the long run.

5:14 Thank you. If you have any questions, please drop us a line below. If you want to see more videos like this in the future, please like and subscribe, and don't forget you can grow your skills and earn a badge with IBM CloudLabs, which are free browser-based interactive Kubernetes labs.
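The act and automate step from the transcript, where the operator picks what has worked in the past and triggers a runbook, might look something like this sketch. Everything here is hypothetical: the incident history, the symptom and action names, and the runbook functions are placeholders invented for illustration, and ranking by simple frequency is an assumed stand-in for whatever a real platform does.

```python
from collections import Counter
from typing import Callable

# Hypothetical incident history: (symptom, action that resolved it).
PAST_INCIDENTS = [
    ("upload_errors", "restart_upload_service"),
    ("upload_errors", "restart_upload_service"),
    ("upload_errors", "clear_invoice_queue"),
    ("slow_response", "scale_out_web_tier"),
]


def recommend_actions(symptom: str, history: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Rank actions by how often they resolved this symptom in the past."""
    counts = Counter(action for seen, action in history if seen == symptom)
    return counts.most_common()


def restart_upload_service() -> str:
    # Placeholder: a real runbook would call deployment tooling here.
    return "upload service restarted"


def clear_invoice_queue() -> str:
    # Placeholder runbook body.
    return "invoice queue cleared"


# Hypothetical registry mapping action names to runbook functions.
RUNBOOKS: dict[str, Callable[[], str]] = {
    "restart_upload_service": restart_upload_service,
    "clear_invoice_queue": clear_invoice_queue,
}


def run_runbook(action: str) -> str:
    """Execute the runbook the operator selected (the "click")."""
    return RUNBOOKS[action]()


# The options that would be shown to the operator in ChatOps.
options = recommend_actions("upload_errors", PAST_INCIDENTS)
# The operator selects the top-ranked option.
result = run_runbook(options[0][0])
print(options, result)
```

Keeping the human selection step between `recommend_actions` and `run_runbook` mirrors the transcript's description: the system suggests, and the operator chooses and activates.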