# Ensuring AI Behavior in Production

**Source:** [https://www.youtube.com/watch?v=4gC3oueK9Gc](https://www.youtube.com/watch?v=4gC3oueK9Gc)
**Duration:** 00:05:13

## Summary

- AI models can drift after deployment, exhibiting unintended behaviors (e.g., speaking like a toddler or using profanity), so safeguards are essential.
- Data scientists rigorously test models in a "development sandbox" to ensure outputs match expectations before moving them to production.
- One key monitoring method is comparing model outputs in production to known ground-truth results (e.g., actual churn outcomes or human-written references).
- Another method is checking that production performance mirrors the development baseline (e.g., similar churn prediction rates), and any deviation signals the need for model review.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=4gC3oueK9Gc&t=0s) **Guarding AI Against Unexpected Output** - The speaker explains why AI models can deviate from intended behavior after deployment and outlines three methods to ensure they remain consistent with their design.
- [00:03:12](https://www.youtube.com/watch?v=4gC3oueK9Gc&t=192s) **Detecting Model Drift in Production** - The speaker outlines how to monitor models by comparing production outputs and input data to development baselines and ground truth to identify performance drift or mismatches.

## Full Transcript
So if you have AI and it's designed to talk like your basic 10th grader
and it starts talking like a two-year-old, that's not good. That's a problem.
If you have AI and it starts cursing like a sailor after it's deployed and out in the world, that's not a good thing either.
So how do you keep that from happening?
So today in this video we're going to walk through three different ways
that you can keep your AI doing what it's supposed to do.
Before we get into that, though, I think it's important to understand what data scientists and AI engineers do.
AI engineers and data scientists, we build models,
and typically we build these models in what we call the development space,
and you can think of our development space as kind of like a sandbox.
It's our happy little world where we take inputs,
build models,
and these models create output.
And while we're developing these models, we're very meticulous.
We wanna make sure that this output is exactly what we want it to be, right?
If we want the AI to speak like your average 10th grader
and the output is speaking like a two-year-old, we're gonna go back and we're gonna fix that.
If the model is designed to predict customer churn and it's not, we're gonna go back and fix that.
But once we get to a point where we're happy with the model, we think it's wonderful,
we put it into production, or we deploy it, we put it out into the world.
Typically we'll call this our production space, or the deployment space.
A model is going to have input
and output.
So how do we ensure that this model is doing what it's supposed to do?
So we have really three different methods.
The first is what we call comparing the model output to ground truth, right?
So if we've built a model to predict churn,
and those predictions are not accurate,
like the people we've predicted to cancel their service don't cancel their service,
we know there's a problem.
So we can compare this output to some sort of ground truth.
In the generative world,
like if we have AI that's writing emails based on some kind of prompt
or some kind of stimulus, we can have a human write an email to the same stimulus,
and the output coming from the AI and the human based on the same stimulus should be similar.
If they're not, we've got an issue and we need to go back and look at our model.
But again, in both situations, we're comparing the output of the model to some sort of ground truth.
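The ground-truth comparison for a churn model can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the prediction lists, and the 90% alert threshold, are assumptions made up for the example.

```python
# Hypothetical sketch: comparing churn predictions against ground truth.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]   # model's call: will this customer churn?
actuals     = [1, 0, 0, 1, 0, 0, 1, 0]   # what actually happened (ground truth)

# Fraction of predictions that matched reality
correct = sum(p == a for p, a in zip(predictions, actuals))
accuracy = correct / len(actuals)
print(f"accuracy: {accuracy:.2f}")

if accuracy < 0.90:   # alert threshold chosen during development (illustrative)
    print("accuracy below baseline -- review the model")
```

In practice the "actuals" arrive with a delay (you only learn who churned after the billing cycle), so this check usually runs on a lag.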
The other thing we can do,
remember over here in the development space where the data scientists were
very, very rigorous in terms of what their model was doing, we can compare
the output in deployment to the output in development.
So if we have a model in deployment that's predicting on average a churn rate of 4%,
but in development, the average churn rate was, let's say, 0.4%, that's an issue, right?
What we're seeing in the deployment that is different than what we saw when we developed the model.
Likewise, if we built a model to talk like your average 10th grader
and all of a sudden it's talking like a two-year-old, that's an issue.
And we could tell that by comparing it to the output that was created in the development.
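A minimal sketch of that output-drift check, comparing the production churn rate to the development baseline. The rates and the 2x tolerance are illustrative assumptions, not values from the video.

```python
dev_churn_rate = 0.004                 # 0.4% average predicted churn in development
prod_predictions = [0] * 990 + [1] * 10   # hypothetical batch of production predictions

# Average churn rate the model is producing in production
prod_churn_rate = sum(prod_predictions) / len(prod_predictions)

# Flag if production deviates from the development baseline beyond a
# tolerance; the 2x factor here is an illustrative choice.
if prod_churn_rate > 2 * dev_churn_rate:
    print(f"possible drift: {prod_churn_rate:.1%} in production vs {dev_churn_rate:.1%} in dev")
```

Unlike the ground-truth check, this one needs no labels, so it can run as soon as predictions are made.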
We can also compare the input data from production to development.
Like if the average age of the data going into our model in development was 25 years
but in deployment, or in production, we're noticing that it's 50 years.
Whoa, hold on, something could be wrong.
We're feeding a much older group of people into this model.
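The input-side version of the check looks much the same, but on the data flowing into the model. The age samples and the 20% shift threshold below are made-up illustrations; real monitoring would use a proper distribution test rather than just the mean.

```python
from statistics import mean

dev_ages = [22, 24, 25, 26, 27, 23, 28]    # ages seen during development (illustrative)
prod_ages = [48, 52, 50, 49, 55, 47, 51]   # ages arriving in production (illustrative)

dev_mean, prod_mean = mean(dev_ages), mean(prod_ages)

# Simple check: has the mean input shifted by more than, say, 20%?
if abs(prod_mean - dev_mean) / dev_mean > 0.20:
    print(f"input drift: mean age {dev_mean:.0f} in dev vs {prod_mean:.0f} in production")
```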
So we have comparing it to ground truth and comparing it to the model in development.
Comparing it to ground truth is sometimes called checking accuracy.
Comparing production behavior to the model in development is sometimes called checking for model drift.
The third thing we can do is we can create flags or filters around this output.
For example, we can have a PII flag.
So if somebody's social security number shows up in this output,
this flag is gonna get triggered and we know, okay, we can't send this out into the world.
Likewise, if it's hate, abuse, or profanity, HAP, we can identify in this output, flag it, get it out of there.
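An output filter like that can be sketched as a simple gate on the model's text before it leaves the system. The regex and the one-word blocklist are placeholder assumptions; production PII and HAP filters are far more sophisticated.

```python
import re

# Hypothetical output filter: flag U.S. Social Security numbers (PII)
# and a tiny illustrative blocklist (stand-in for a real HAP filter).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
BLOCKED_WORDS = {"damn"}   # placeholder; real filters use large curated lists/models

def safe_to_send(text: str) -> bool:
    """Return True only if the output raises neither the PII nor the HAP flag."""
    if SSN_PATTERN.search(text):
        return False   # PII flag raised
    if any(word in text.lower().split() for word in BLOCKED_WORDS):
        return False   # HAP flag raised
    return True

print(safe_to_send("Your SSN 123-45-6789 is on file"))  # prints False
```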
So anyway, those are three ways that you can ensure that your AI is doing what it's supposed to do.
I hope this was helpful.
Thank you so much.