
Taming AI Slop with Automated Quality Checks

Key Points

  • Companies are overwhelmed by an “AI slop” problem, where AI can produce massive amounts of content—PRDs, marketing copy, blogs—but there’s no reliable way to ensure that output meets quality standards.
  • Human reviewers simply don’t have the capacity to examine dozens or hundreds of AI‑generated items, forcing many teams to either eyeball everything or skip review altogether.
  • The fix is to treat AI “eyeballs” as a primary quality gate, using LLMs to evaluate and filter work, but this requires carefully crafted, robust prompts rather than vague commands like “Is this good?”
  • The effectiveness of AI‑based quality checks varies by model, its training data, and past interactions, so organizations must standardize prompting to achieve consistent, predictable assessments.
  • Successful implementation balances the “supply side” (prompting AI to create higher‑quality content) with the “filter side” (prompting AI to reliably evaluate and approve that content).

Full Transcript

**Source:** [https://www.youtube.com/watch?v=rY6MOdXv02M](https://www.youtube.com/watch?v=rY6MOdXv02M)
**Duration:** 00:12:11

## Sections

- [00:00:00](https://www.youtube.com/watch?v=rY6MOdXv02M&t=0s) **Tackling AI Content Slop** - The speaker explains that businesses are overwhelmed by low-quality AI-generated output, shifting the challenge from getting AI to help to establishing effective quality gates and a mindset that treats LLM attention as the primary means of assessing and ensuring brand-aligned content.
- [00:03:46](https://www.youtube.com/watch?v=rY6MOdXv02M&t=226s) **AI Prompt for PRD Triage** - The speaker outlines how to use an AI prompt to assess product requirements documents, quickly identifying high-value content and eliminating unnecessary clarification meetings.
- [00:07:23](https://www.youtube.com/watch?v=rY6MOdXv02M&t=443s) **Designing an AI Slop Evaluation Framework** - The speaker outlines a structured process, leveraging prompts, scoring rubrics, dependency mapping, element checks, and JSON schemas, to convert AI slop assessment results into clear, actionable plain-English feedback.
- [00:11:04](https://www.youtube.com/watch?v=rY6MOdXv02M&t=664s) **Attention Gatekeeping for Quality** - The speaker urges directing the limited 2% of human attention toward business mechanisms, via a curated prompt pack, to filter low-quality AI output and raise overall work standards, rather than relying on ineffective AI detectors.

## Full Transcript
[0:00] We have a supply explosion of AI content, and it is an AI slop issue across every business I talk to. Every single one, they're like, "What do we do with AI slop?" We have product managers producing PRDs that are not good. We have marketers producing marketing copy, and it's not good. But we as managers, or frankly any individual who wants to know they're actually writing well with AI, we don't know what to do. How do we know that what we're shipping is good quality? It's not about "did the AI help?" We're past that point. The AI is helping. It's about whether the quality of the output is good, and we humans don't have the time to assess all of it because AI is so good at producing it.

[0:39] So we are in a world now where marketers have gone from "can I write a blog post today?" to "I could stamp out 50," and now you have to ask yourself: can I assess the 50 that have been produced? Are they actually good? Are they brand aligned? Is the content right? And you know what people are doing for that? They're using their eyes, or they're skipping it, because there hasn't really been a good quality gate.

[1:03] But that's fixable, and what I want to tell you about is how I'm fixing it. It is really simple. What you need to do is adopt the mindset that LLM attention is going to be the predominant attention mode in your business, and everybody else's business, going forward. You want to make sure you use that to your advantage. Think about a world where your organization has only so many human eyeballs but nearly infinite AI eyeballs. Well, use the AI eyeballs to your advantage and get them to check the work. But it's not as simple as saying, "Please check this. Is it good?" I've seen people do that, and you get widely varying results. It depends on the AI you're using.
Is it ChatGPT? Is it Claude? It also depends on what the LLM already thinks of as good, based on its training data and maybe its past conversations with you. It's not predictable. So we need to get to a point where we have much more robust prompting for quality.

[2:00] Think of it this way. A lot of what I talk about is robust prompting for what we would call the supply side: robust prompting to make the work that you produce better. I've talked about Excel in the past and written up guides for that, for PowerPoint, and so on, and I've talked about Claude Code. But what about the other side, the filter side? The side where you need to test work and see whether it's good? Maybe it's your own work. Maybe it's someone else's. Maybe you're a manager testing your team's work, and you could get a hundred blog posts in today from two or three team members and not have time to go through them. You'd be up all night. Well, that's where the prompt comes in. That's where you can get help. Or you're an engineer looking at this explosion of Lovable vibe-coded prototypes or PRDs coming from PMs, and you don't know whether they're good. Again, a prompt can help fix it.

[2:51] I think specifics are really useful here. So in a moment I'm actually going to go through a real prompt with you that I wrote for this, and I'm going to explain how it works. And I'm going to do several of these, because this is very much a case where the attention filter, the quality filter, needs to be per use case and per job family, or else it's no good. You need one that is PRD specific and one that is blog post specific, because otherwise you don't have enough context to effectively ask the AI to help filter for you.
[3:21] And your goal should be to have the AI do most of the reading: to do what Andrej Karpathy has suggested, which is to have LLMs take 98% of our attention while human eyes get the 2% of attention that matters. So I want to give you the right 2%, the highest-quality pieces, and effectively filter out the slop. That's my goal. If you are in a world where you have a hundred blog posts, what are the two that matter? What are the two that are super high quality and should be surfaced now? That's super valuable information. Or, where is the PRD that is really well thought through? That is promotable information: if a PM is able to do that with AI, they should be on a promotion track, and you want to know that quickly. So that's what I'm setting out to do.

[4:05] How do I do it? Let me show you a sample prompt and we'll work through it together. Okay, here we are, right on the prompt. We set the role first: you're evaluating a product requirements document, and your job is to determine if an engineering team can build this without needing three clarifying meetings. I love that specificity. You obviously could flip it around and change it a bit, but what I want to be frank about in this prompt is how painful the current version is. Because whether it's three clarifying meetings for PRDs, or "I can't tell which blog post to publish," or "I don't know which of these customer email drafts to use," or "I can't tell which of these sales outreach drafts," or "I don't know about the webinar invite and the webinar event schedule": for all of them, you were having meetings. You were having human conversations. You were putting human eyeballs on them when they didn't work. So let's set the stakes, and then I define the axes.
[5:00] This gets very specific to the PRD, and that is the point. So we ask about the completeness of the document. By the way, this is somewhat scalable: you can ask about the completeness of other artifacts too. In this case, we ask about acceptance criteria, we ask whether edge cases are present, and we ask whether non-goals are explicit or implicit. And we can extend this. If I wanted to add to this, I could say, "Do we have at least seven requirements clearly enumerated?" That's a bit of a random number, and I chose not to include it here, but if you have a particular format for a particular artifact, it's easy to modify and add that in.

[5:38] Then, right in the prompt, I give a scoring rubric. A five score looks like: measurable criteria, edge cases documented, non-goals present. It's a good score. A zero is untestable: it has vague statements. And I have seen this as a PM. I have seen other people write really vague goals, and it's just terrible, and we all know it's bad.

[5:59] By the way, I think the AI slop conversation is somewhat overhyped, because I remember seeing phrases like this long before AI. In a sense, what we complain about as AI slop is the latest version, at volume, of a larger issue: we have always had bad-work and sloppy-work problems. And maybe now we have the tools to address it regardless of who wrote it. I don't care if AI wrote it. I care if it's good.

[6:26] So then we go down to the next one: is it testable? This is a very PRD-specific thing, but you can do something similar with other types of documents. For example, you could ask if it's readable for a sales follow-up email.
"Does it read at an eighth-grade level?" would be a reasonable test to have, and you can go through and define that. And I've done that: this is not the only prompt I have. I have a bunch of prompts for different goals, and my goal is to give you a complete pack, a filter that you can apply to various business functions. So again, we have the scoring, and we have test cases, success rates, failure states, and example inputs and outputs. This ensures that it's actually testable.

[7:04] Scope clarity. This is another example of something that is PRD specific. We have our scoring, and you can imagine how this can be somewhat different: if it's a customer announcement email that you're sending for a new product, you can ask if the product is clearly explained. The point here, and the reason I'm pulling in these examples from across the business, is that this is how many different places the AI slop problem touches. Everywhere I look, there's a slop problem, and we need to have a filter, and that's why I said: you know what, we're just going to write some prompts and make it easy.

[7:38] Decision framework: key choices. Is the rationale explained? Are the trade-offs acknowledged? Very PRD specific. And then we get into scoring. So we go through these five, with dependency mapping at the end there, and we have an elements check: is everything here? And then we give it an output. Look, I am not someone who's going to tell you JSON is magic. JSON is not magic. But it is certainly useful in cases where you want the LLM to understand a particular rubric and output schema and follow it. In this case, what we really want is to understand a grading score.
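As an aside, the rubric walkthrough so far (a role, per-artifact axes, and 0-to-5 scoring anchors) can be sketched as a small data structure plus a prompt renderer. This is a hypothetical illustration in Python; the axis names and anchor wording are paraphrased from the talk, not taken from the speaker's actual prompt pack:

```python
# Hypothetical sketch of the PRD-evaluation rubric described in the talk.
# Axis names and scoring anchors are paraphrased, not the exact prompt.
PRD_RUBRIC = {
    "role": (
        "You are evaluating a product requirements document. Your job is to "
        "determine whether an engineering team could build this without "
        "needing three clarifying meetings."
    ),
    "axes": {
        "completeness": {
            5: "Measurable acceptance criteria; edge cases documented; non-goals explicit.",
            0: "Vague, untestable statements; no acceptance criteria.",
        },
        "testability": {
            5: "Test cases, success and failure states, example inputs and outputs.",
            0: "No way to verify whether a requirement is met.",
        },
        "scope_clarity": {
            5: "In-scope and out-of-scope items clearly separated.",
            0: "Scope implied, missing, or contradictory.",
        },
        "decision_framework": {
            5: "Key choices come with rationale and acknowledged trade-offs.",
            0: "Decisions asserted with no reasoning.",
        },
        "dependency_mapping": {
            5: "External systems, APIs, and versions enumerated.",
            0: "Dependencies unmentioned.",
        },
    },
}


def render_prompt(rubric: dict, document: str) -> str:
    """Flatten the rubric into a single evaluation prompt for an LLM."""
    lines = [rubric["role"], "", "Score each axis from 0 to 5:"]
    for axis, anchors in rubric["axes"].items():
        lines.append(f"- {axis}: 5 = {anchors[5]} 0 = {anchors[0]}")
    lines += ["", "Document to evaluate:", document]
    return "\n".join(lines)
```

Keeping the rubric as data rather than as a frozen prompt string makes the per-use-case tweaking the speaker describes (say, swapping `dependency_mapping` for a readability axis on a sales email) a one-line change.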
[8:12] So we want to be able to say, based on these scores: how do we write this in plain English? And then, how do we write a key sentence that someone can use as feedback? This is where the magic lies, because it's no good if you just go through the elements at the top here and say, "Oh, this is bad. It scored zero, or it scored one." What you want to do is say what actionable feedback someone can take and actually cycle through. Sophisticated writers will be able to pull this JSON, feed it into an LLM along with the draft, and get very actionable feedback for how to improve, using AI to actually power better, higher-quality writing, and come back. So I think this is the beginning of a really interesting feedback loop to actually build an anti-slop machine at work.

[9:07] So for each of these we have specific, actionable feedback that we're specifying as an example. And then we have thresholds, which you can be specific about: you can say, "I accept an overall score of three or more," or a 4.8, or whatever. But you have a score, you have a revise/accept/reject decision, and you have feedback. And really, isn't that all we need? We need someone who can say, "You've got to specify the Stripe API version, because if you don't specify the version, engineering is going to have to come back." And it's not that hard: specify the most recent API version, just be clear about it. That is the kind of feedback that LLMs are very good at providing if properly prompted, but that humans are stuck with providing now because we haven't had prompts to fix it. And that's why I built this.

[9:52] So there you go. I think slop is a fixable problem. I don't think it can be fixed with one magic bullet.
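The JSON output and accept/revise/reject thresholds described above can be sketched as a small parsing-and-gating step. The field names and cutoff values here are assumptions for illustration, not the speaker's actual schema:

```python
import json

# Hypothetical example of an evaluator response; field names and values
# are illustrative assumptions, not the speaker's actual schema.
EXAMPLE_OUTPUT = """
{
  "scores": {"completeness": 4, "testability": 2, "scope_clarity": 3,
             "decision_framework": 3, "dependency_mapping": 1},
  "overall": 2.6,
  "plain_english": "Mostly complete, but hard to test and dependencies are unclear.",
  "key_feedback": "Specify the Stripe API version, or engineering will have to come back."
}
"""


def parse_evaluation(raw: str) -> dict:
    """Parse and lightly validate the evaluator's JSON response."""
    result = json.loads(raw)
    missing = {"scores", "overall", "plain_english", "key_feedback"} - result.keys()
    if missing:
        raise ValueError(f"evaluator response missing fields: {missing}")
    return result


def verdict(overall: float, accept_at: float = 4.0, revise_at: float = 3.0) -> str:
    """Map an overall rubric score to accept/revise/reject; cutoffs are team-configurable."""
    if overall >= accept_at:
        return "accept"
    if overall >= revise_at:
        return "revise"
    return "reject"
```

With these illustrative cutoffs, the sample evaluation above would come back as a reject, with the `key_feedback` sentence ready to hand back to the author.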
There's not one magic prompt for this, nor do I think just instituting this as a filter is the only way to fix it. I've written lots of prompts for this; you have to write better too. But let's assume, because every organization I've met has this problem, that you have an AI slop problem. Great. Safe assumption. How do you filter? That's what this is about. How do you make sure that you can use AI as a weapon in your favor, so that you focus your attention where it matters and scale useful feedback? And I think walking through this prompt structure should help you understand how we think about grading a piece of work in context.

[10:37] You'll notice I don't need to know a ton about your organization to do this. I can just assess the gates and say, "Okay, I think that from a good, best-practice PRD perspective, this is probably weaker." And the nice thing about a prompt is that you can tweak it and say, "Okay, we're not an API business; maybe that's not what we care about. We care about front end." You can adjust that pretty quickly, with the same degree of specificity, and it works.

[11:01] So, my challenge to you is this. Assume you live in a world where 2% of human attention matters and you need to put it in the right place. Find the mechanisms in your business, the mechanisms in your workflow, that enable you to put that 2% of attention where it matters. This is one of those. This is one of those that helps you weed out AI slop. It's so much more useful than the deceptive AI-detector stuff, because the AI detector pretends that if you can detect AI, which they can't, then you will stop slop. But slop has always been a problem.
We humans have produced poor-quality work; as I've been saying, I saw bad-quality PRDs long before AI. This is really about raising the quality bar. We don't care how you wrote it. We care that you're accountable to raising the quality bar. And then this becomes a really useful feedback tool for yourself, for your manager, for whoever, to improve the quality of the outputs.

[11:53] So I'm putting together a complete prompt pack for different business purposes, so you get a sense of how this works for marketing, for CS, for sales, for product, for engineering. And that's the goal: to give you a starter on a filter gate so you can put your attention where it matters and stop drowning in the slop. That's the vlog.