
Taming AI Slop with Automated Quality Checks

Key Points

  • Companies are overwhelmed by an “AI slop” problem, where AI can produce massive amounts of content—PRDs, marketing copy, blogs—but there’s no reliable way to ensure that output meets quality standards.
  • Human reviewers simply don’t have the capacity to examine dozens or hundreds of AI‑generated items, forcing many teams to either eyeball everything or skip review altogether.
  • The fix is to treat AI “eyeballs” as a primary quality gate, using LLMs to evaluate and filter work, but this requires carefully crafted, robust prompts rather than vague commands like “Is this good?”
  • The effectiveness of AI‑based quality checks varies by model, its training data, and past interactions, so organizations must standardize prompting to achieve consistent, predictable assessments.
  • Successful implementation balances the “supply side” (prompting AI to create higher‑quality content) with the “filter side” (prompting AI to reliably evaluate and approve that content).

Full Transcript

**Source:** [https://www.youtube.com/watch?v=rY6MOdXv02M](https://www.youtube.com/watch?v=rY6MOdXv02M)
**Duration:** 00:12:11

## Sections

- [00:00:00](https://www.youtube.com/watch?v=rY6MOdXv02M&t=0s) **Tackling AI Content Slop** - The speaker explains that businesses are overwhelmed by low-quality AI-generated output, shifting the challenge from getting AI to help to establishing effective quality gates and a mindset that treats LLM attention as the primary means of assessing and ensuring brand-aligned content.
- [00:03:46](https://www.youtube.com/watch?v=rY6MOdXv02M&t=226s) **AI Prompt for PRD Triage** - The speaker outlines how to use an AI prompt to assess product requirements documents, quickly identifying high-value content and eliminating unnecessary clarification meetings.
- [00:07:23](https://www.youtube.com/watch?v=rY6MOdXv02M&t=443s) **Designing an AI Slop Evaluation Framework** - The speaker outlines a structured process, leveraging prompts, scoring rubrics, dependency mapping, element checks, and JSON schemas, to convert AI slop assessment results into clear, actionable plain-English feedback.
- [00:11:04](https://www.youtube.com/watch?v=rY6MOdXv02M&t=664s) **Attention Gatekeeping for Quality** - The speaker urges directing the limited 2% of human attention toward business mechanisms, via a curated prompt pack, to filter low-quality AI output and raise overall work standards, rather than relying on ineffective AI detectors.

## Full Transcript
[0:00] We have a supply explosion of AI content, and it is an AI slop issue across every business I talk to. Every single one, they're like, "What do we do with AI slop?" We have product managers producing PRDs that are not good. We have marketers producing marketing copy, and it's not good. But we as managers, or frankly any individual who wants to know they're actually writing well with AI, we don't know what to do. How do we know that what we're shipping is good quality? It's not about "did the AI help?" We're past that point. The AI is helping. It's about whether the quality of the output is good, and we humans don't have the time to assess all of it because AI is so good at producing it.

[0:39] So we are in a world now where marketers have gone from "can I write a blog post today?" to "I could stamp out 50," and now you have to ask yourself: can I assess the 50 that have been produced? Are they actually good? Are they brand aligned? Is the content right? And you know what people are doing for that? They're using their eyes, or they're skipping it, because there hasn't really been a good quality gate.

[1:03] But that's fixable, and what I want to tell you about is how I'm fixing it. It is really simple. What you need to do is adopt the mindset that LLM attention is going to be the predominant attention mode in your business, and everybody else's business, going forward. You want to make sure you use that to your advantage. Think about a world where your organization has only so many human eyeballs but nearly infinite AI eyeballs. Well, use the AI eyeballs to your advantage and get them to check the work. But it's not as simple as saying, "Please check this. Is it good?" I've seen people do that, and you get widely varying results. It depends on the AI you're using.
Is it ChatGPT? Is it Claude? It also depends on what the LLM already thinks of as good, based on its training data and maybe its past conversations with you. It's not predictable. So we need to get to a point where we have much more robust prompting for quality.

[2:00] Think of it this way. A lot of what I talk about is robust prompting for what we would call the supply side: robust prompting to make the work that you produce better. I've talked about Excel in the past and written up guides for that, for PowerPoint, and so on, and I've talked about Claude Code. But what about the other side, the filter side? The side where you need to test work and see whether it's good? Maybe it's your own work. Maybe it's someone else's. Maybe you're a manager testing your team's work, and you could get a hundred blog posts in today from two or three team members and not have time to go through them. You'd be up all night. Well, that's where the prompt comes in. That's where you can get help. Or you're an engineer looking at this explosion of Lovable vibe-coded prototypes or PRDs coming from PMs, and you don't know whether they're good. Again, a prompt can help fix it.

[2:51] I think specifics are really useful here. So in a moment I'm actually going to go through a real prompt with you that I wrote for this, and I'm going to explain how it works. And I'm going to do several of these, because this is very much a case where the attention filter, the quality filter, needs to be per use case and per job family, or else it's no good. You need one that is PRD specific and one that is blog post specific, because otherwise you don't have enough context to effectively ask the AI to help filter for you.
[3:21] And your goal should be to have the AI do most of the reading: to do what Andrej Karpathy has suggested, which is to have LLMs take 98% of our attention while human eyes get the 2% of attention that matters. So I want to give you the right 2%, the highest-quality pieces, and effectively filter out the slop. That's my goal. If you are in a world where you have a hundred blog posts, what are the two that matter? What are the two that are super high quality and should be surfaced now? That's super valuable information. Or, where is the PRD that is really well thought through? That is promotable information: if a PM is able to do that with AI, they should be on a promotion track, and you want to know that quickly. So that's what I'm setting out to do.

[4:05] How do I do it? Let me show you a sample prompt and we'll work through it together. Okay, here we are, right on the prompt. We set the role first: you're evaluating a product requirements document, and your job is to determine if an engineering team can build this without needing three clarifying meetings. I love that specificity. You obviously could flip it around and change it a bit, but what I want to be frank about in this prompt is how painful the current version is. Because whether it's three clarifying meetings for PRDs, or "I can't tell which blog post to publish," or "I don't know which of these customer email drafts to use," or "I can't tell which of these sales outreach drafts," or "I don't know about the webinar invite and the webinar event schedule": for all of them, you were having meetings. You were having human conversations. You were putting human eyeballs on them when they didn't work. So let's set the stakes, and then I define the axes.
[5:00] This gets very specific to the PRD, and that is the point. So we ask about the completeness of the document. By the way, this is somewhat scalable: you can ask about the completeness of other artifacts too. In this case, we ask about acceptance criteria, we ask whether edge cases are present, and we ask whether non-goals are explicit or implicit. And we can extend this. If I wanted to add to this, I could say, "Do we have at least seven requirements clearly enumerated?" That's a bit of a random number, and I chose not to include it here, but if you have a particular format for a particular artifact, it's easy to modify and add that in.

[5:38] Then, right in the prompt, I give a scoring rubric. A five score looks like: measurable criteria, edge cases documented, non-goals present. It's a good score. A zero is untestable: it has vague statements. And I have seen this as a PM. I have seen other people write really vague goals, and it's just terrible, and we all know it's bad.

[5:59] By the way, I think the AI slop conversation is somewhat overhyped, because I remember seeing phrases like this long before AI. In a sense, what we complain about as AI slop is the latest version, at volume, of a larger issue: we have always had bad-work and sloppy-work problems. And maybe now we have the tools to address it regardless of who wrote it. I don't care if AI wrote it. I care if it's good.

[6:26] So then we go down to the next one: is it testable? This is a very PRD-specific thing, but you can do something similar with other types of documents. For example, you could ask if it's readable for a sales follow-up email.
"Does it read at an eighth-grade level?" would be a reasonable test to have, and you can go through and define that. And I've done that: this is not the only prompt I have. I have a bunch of prompts for different goals, and my goal is to give you a complete pack, a filter that you can apply to various business functions. So again, we have the scoring, and we have test cases, success rates, failure states, and example inputs and outputs. This ensures that it's actually testable.

[7:04] Scope clarity. This is another example of something that is PRD specific. We have our scoring, and you can imagine how this can be somewhat different: if it's a customer announcement email that you're sending for a new product, you can ask if the product is clearly explained. The point here, and the reason I'm pulling in these examples from across the business, is that this is how many different places the AI slop problem touches. Everywhere I look, there's a slop problem, and we need to have a filter, and that's why I said: you know what, we're just going to write some prompts and make it easy.

[7:38] Decision framework: key choices. Is the rationale explained? Are the trade-offs acknowledged? Very PRD specific. And then we get into scoring. So we go through these five, with dependency mapping at the end there, and we have an elements check: is everything here? And then we give it an output. Look, I am not someone who's going to tell you JSON is magic. JSON is not magic. But it is certainly useful in cases where you want the LLM to understand a particular rubric and output schema and follow it. In this case, what we really want is to understand a grading score.
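As an aside, the rubric walkthrough so far (a role, per-artifact axes, and 0-to-5 scoring anchors) can be sketched as a small data structure plus a prompt renderer. This is a hypothetical illustration in Python; the axis names and anchor wording are paraphrased from the talk, not taken from the speaker's actual prompt pack:

```python
# Hypothetical sketch of the PRD-evaluation rubric described in the talk.
# Axis names and scoring anchors are paraphrased, not the exact prompt.
PRD_RUBRIC = {
    "role": (
        "You are evaluating a product requirements document. Your job is to "
        "determine whether an engineering team could build this without "
        "needing three clarifying meetings."
    ),
    "axes": {
        "completeness": {
            5: "Measurable acceptance criteria; edge cases documented; non-goals explicit.",
            0: "Vague, untestable statements; no acceptance criteria.",
        },
        "testability": {
            5: "Test cases, success and failure states, example inputs and outputs.",
            0: "No way to verify whether a requirement is met.",
        },
        "scope_clarity": {
            5: "In-scope and out-of-scope items clearly separated.",
            0: "Scope implied, missing, or contradictory.",
        },
        "decision_framework": {
            5: "Key choices come with rationale and acknowledged trade-offs.",
            0: "Decisions asserted with no reasoning.",
        },
        "dependency_mapping": {
            5: "External systems, APIs, and versions enumerated.",
            0: "Dependencies unmentioned.",
        },
    },
}


def render_prompt(rubric: dict, document: str) -> str:
    """Flatten the rubric into a single evaluation prompt for an LLM."""
    lines = [rubric["role"], "", "Score each axis from 0 to 5:"]
    for axis, anchors in rubric["axes"].items():
        lines.append(f"- {axis}: 5 = {anchors[5]} 0 = {anchors[0]}")
    lines += ["", "Document to evaluate:", document]
    return "\n".join(lines)
```

Keeping the rubric as data rather than as a frozen prompt string makes the per-use-case tweaking the speaker describes (say, swapping `dependency_mapping` for a readability axis on a sales email) a one-line change.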
[8:12] So we want to be able to say, based on these scores: how do we write this in plain English? And then, how do we write a key sentence that someone can use as feedback? This is where the magic lies, because it's no good if you just go through the elements at the top here and say, "Oh, this is bad. It scored zero, or it scored one." What you want to do is say what actionable feedback someone can take and actually cycle through. Sophisticated writers will be able to pull this JSON, feed it into an LLM along with the draft, and get very actionable feedback for how to improve, using AI to actually power better, higher-quality writing, and come back. So I think this is the beginning of a really interesting feedback loop to actually build an anti-slop machine at work.

[9:07] So for each of these we have specific, actionable feedback that we're specifying as an example. And then we have thresholds, which you can be specific about: you can say, "I accept an overall score of three or more," or a 4.8, or whatever. But you have a score, you have a revise/accept/reject decision, and you have feedback. And really, isn't that all we need? We need someone who can say, "You've got to specify the Stripe API version, because if you don't specify the version, engineering is going to have to come back." And it's not that hard: specify the most recent API version, just be clear about it. That is the kind of feedback that LLMs are very good at providing if properly prompted, but that humans are stuck with providing now because we haven't had prompts to fix it. And that's why I built this.

[9:52] So there you go. I think slop is a fixable problem. I don't think it can be fixed with one magic bullet.
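The JSON output and accept/revise/reject thresholds described above can be sketched as a small parsing-and-gating step. The field names and cutoff values here are assumptions for illustration, not the speaker's actual schema:

```python
import json

# Hypothetical example of an evaluator response; field names and values
# are illustrative assumptions, not the speaker's actual schema.
EXAMPLE_OUTPUT = """
{
  "scores": {"completeness": 4, "testability": 2, "scope_clarity": 3,
             "decision_framework": 3, "dependency_mapping": 1},
  "overall": 2.6,
  "plain_english": "Mostly complete, but hard to test and dependencies are unclear.",
  "key_feedback": "Specify the Stripe API version, or engineering will have to come back."
}
"""


def parse_evaluation(raw: str) -> dict:
    """Parse and lightly validate the evaluator's JSON response."""
    result = json.loads(raw)
    missing = {"scores", "overall", "plain_english", "key_feedback"} - result.keys()
    if missing:
        raise ValueError(f"evaluator response missing fields: {missing}")
    return result


def verdict(overall: float, accept_at: float = 4.0, revise_at: float = 3.0) -> str:
    """Map an overall rubric score to accept/revise/reject; cutoffs are team-configurable."""
    if overall >= accept_at:
        return "accept"
    if overall >= revise_at:
        return "revise"
    return "reject"
```

With these illustrative cutoffs, the sample evaluation above would come back as a reject, with the `key_feedback` sentence ready to hand back to the author.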
There's not one magic prompt for this, nor do I think just instituting this as a filter is the only way to fix it. I've written lots of prompts for this; you have to write better too. But let's assume, because every organization I've met has this problem, that you have an AI slop problem. Great. Safe assumption. How do you filter? That's what this is about. How do you make sure that you can use AI as a weapon in your favor, so that you focus your attention where it matters and scale useful feedback? And I think walking through this prompt structure should help you understand how we think about grading a piece of work in context.

[10:37] You'll notice I don't need to know a ton about your organization to do this. I can just assess the gates and say, "Okay, I think that from a good, best-practice PRD perspective, this is probably weaker." And the nice thing about a prompt is that you can tweak it and say, "Okay, we're not an API business; maybe that's not what we care about. We care about front end." You can adjust that pretty quickly, with the same degree of specificity, and it works.

[11:01] So, my challenge to you is this. Assume you live in a world where 2% of human attention matters and you need to put it in the right place. Find the mechanisms in your business, the mechanisms in your workflow, that enable you to put that 2% of attention where it matters. This is one of those. This is one of those that helps you weed out AI slop. It's so much more useful than the deceptive AI-detector stuff, because the AI detector pretends that if you can detect AI, which they can't, then you will stop slop. But slop has always been a problem.
We humans have produced poor-quality work; as I've been saying, I saw bad-quality PRDs long before AI. This is really about raising the quality bar. We don't care how you wrote it. We care that you're accountable to raising the quality bar. And then this becomes a really useful feedback tool for yourself, for your manager, for whoever, to improve the quality of the outputs.

[11:53] So I'm putting together a complete prompt pack for different business purposes, so you get a sense of how this works for marketing, for CS, for sales, for product, for engineering. And that's the goal: to give you a starter on a filter gate so you can put your attention where it matters and stop drowning in the slop. That's the vlog.