# Text Classification: Types and Techniques

## Key Points
- Text classification transforms raw text—like emails or Netflix movie descriptions—into automated categories such as spam vs. not‑spam or comedy vs. drama, reducing the need for manual labeling.
- The three main classification tasks are binary (two classes), multiclass (one of many exclusive classes), and multi‑label (assigning multiple categories to a single item, e.g., an action‑adventure film).
- The workflow centers on heavy preprocessing of raw text (cleaning punctuation, tokenization, etc.) before converting it into numerical vectors via word embeddings.
- A pre‑trained language model (e.g., BERT, ChatGPT, Granite) is then fine‑tuned to the specific classification problem, leveraging its learned representations to predict the appropriate labels.
## Sections
- [00:00:00](https://www.youtube.com/watch?v=hHiPs_wICsE&t=0s) **Understanding Text Classification Types** - A brief overview of text classification, illustrating binary, multiclass, and multi‑label approaches using spam email and Netflix movie genre examples.
- [00:03:05](https://www.youtube.com/watch?v=hHiPs_wICsE&t=185s) **Text Classification Pipeline Steps** - The speaker outlines the end‑to‑end workflow for text classification, covering raw‑text preprocessing, feature extraction via word embeddings, choosing an appropriate language model, and producing labeled outputs.
- [00:06:17](https://www.youtube.com/watch?v=hHiPs_wICsE&t=377s) **AI Email Sorting & Sentiment** - The speaker outlines how AI models can automatically filter spam, gauge sentiment, categorize topics, and interpret customer feedback from email and social‑media messages.
- [00:09:28](https://www.youtube.com/watch?v=hHiPs_wICsE&t=568s) **Balancing Data and Handling Ambiguity** - The speaker explains how to ensure a well‑balanced model by maintaining proper class ratios, clarifying ambiguous terms like “bank,” and providing a diverse spread of examples across sentiment subcategories.
- [00:12:30](https://www.youtube.com/watch?v=hHiPs_wICsE&t=750s) **Model Validation and Drift Management** - The speaker explains the need to continuously validate text classification models against data drift caused by real‑world changes, ensuring they keep correctly categorizing incoming business communications.
**Source:** [https://www.youtube.com/watch?v=hHiPs_wICsE](https://www.youtube.com/watch?v=hHiPs_wICsE) **Duration:** 00:13:52

## Full Transcript
So let's jump in with a quick question.
How many of you have come across spam in your email,
or, while on Netflix, the different categories of a movie?
Well, that's text classification.
Text classification takes raw text, like these documents,
and funnels them into a computational engine
that then outputs different classifications.
So it could be, in the two examples mentioned,
a spam email or simply a not-spam.
Or, in the Netflix examples, a comedy,
drama, etc.
So in today's world, we're constantly bombarded with
tons of information.
And what text classification provides
is a means to simplify and automate
the classification of different types of text without human input.
Types of text classification.
There are three major types.
Starting with the least complex: binary classification.
That can be expressed as either a one or a zero.
Or in the email example, a spam versus not-spam.
The second is multiclass classification, I'll put that.
And that can be expressed as a 2, a 1, or a 0.
Or if using the email example,
a business related email,
a customer related email or an order email.
The third, and the most complex,
is what's called multi-label classification.
And this is the most complex because
you can assign a specific email, or a specific type of text,
multiple classifications.
So switching over to the Netflix example,
a movie can be classified as an action adventure.
And it has those two classifications as just that one entity.
So depending on the business use case,
and text complexity,
you'll go through and determine if you need to use
one of these three major types.
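The three types above can be sketched as toy label structures in Python. A minimal illustration only: the example texts and label names are made up, not from any real dataset.

```python
# Binary: each item gets exactly one of two labels (1 = spam, 0 = not-spam).
binary_labels = {
    "Win a free prize now!": 1,
    "Meeting moved to 3pm": 0,
}

# Multiclass: each item gets exactly one of several mutually exclusive labels.
multiclass_labels = {
    "Invoice #1234 attached": "order",
    "My login is broken": "technical",
    "Where is my refund?": "customer_service",
}

# Multi-label: a single item can carry several labels at once,
# like a movie classified as both action and adventure.
multilabel_labels = {
    "Indiana Jones": ["action", "adventure"],
}
```

The shape of the label is what distinguishes the three: a single bit, a single category out of many, or a list of categories per item.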
Key techniques of text classification.
There are four key techniques.
So the first one is how do you handle the raw text?
Most of your time is spent preparing and preprocessing the text,
and you usually do that in scripting languages such as Python.
So you take the raw text,
you extract it from the document,
depending on your use case, you remove periods, hyphens,
apostrophe S's, that sort of thing.
It all depends on the use case.
But, again, this is where most of your time is spent,
working through and preprocessing that text.
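A minimal sketch of that preprocessing step in Python, using only the standard library. The cleaning rules here (lowercasing, stripping all punctuation) are one illustrative choice; as the speaker notes, what you remove depends on the use case.

```python
import re

def preprocess(raw_text):
    """Lowercase, strip punctuation, and split into tokens.

    A minimal sketch; a real pipeline is use-case specific
    (e.g. you may want to keep hyphens or apostrophes).
    """
    text = raw_text.lower()
    # Replace anything that is not a word character or whitespace
    # (periods, hyphens, apostrophes, etc.) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

tokens = preprocess("Win a FREE prize -- claim it now!")
```

Here `tokens` comes out as a clean list of lowercase words, ready for the feature-extraction step.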
Before the next step,
which is called feature extraction, I'll just put F.E. for short.
So this is where you take the text and send it into
what's called word embeddings.
So this is a bit of a black box,
and the details of it are outside of the scope of today's discussion,
but you're essentially taking the raw text
and then converting it into a long list of numbers.
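Learned word embeddings are out of scope here, as the speaker says, but the idea of "text becoming a long list of numbers" can be illustrated with the much simpler hashing trick. This is not an embedding model, just a toy fixed-length count vector; the dimension of 8 is arbitrary.

```python
def hashed_features(tokens, dim=8):
    """Map a token list to a fixed-length vector of counts.

    Not a learned embedding -- just the 'hashing trick', shown only
    to illustrate text turning into a list of numbers.
    """
    vec = [0.0] * dim
    for tok in tokens:
        # Each token is hashed to one of `dim` buckets and counted there.
        vec[hash(tok) % dim] += 1.0
    return vec

features = hashed_features(["win", "a", "free", "prize"])
```

Real embeddings place similar words near each other in the vector space, which this toy version does not do; it only shows the shape of the transformation.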
The third is the model.
So when I say model, I mean a large language model
like ChatGPT or Granite model or BERT model.
And depending on what you're trying to classify,
different types of models are pre-trained
on their own types of text.
So in other words, there could be a model that's built on
just classifying spam versus non-spam emails,
or classifying different types of movies
or different types of news documents.
This is where you would select that type of model
that's specific for your use case.
Then the fourth type, or the fourth step,
is the labeled output.
So I'll just write "output".
This part you need to work through iteratively.
So this is just the types of classifications
that you're receiving from each of these steps.
Depending on your output,
you might have to go all the way back to your text
and rework it.
You might have to go back to your feature extraction and adjust it.
Or as mentioned,
you might have the wrong model selected,
so you might have to go back to that model
and select a different one.
So these four key techniques give you an idea
of what iterative steps are required in order to take the raw text,
turn it into features, pass it through the model
and then get our labeled output.
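The four steps can be sketched end to end with a toy keyword "model". Everything here is made up for illustration: the spam vocabulary, the threshold of 2, and the labels; a real pipeline would swap the `model` function for a fine-tuned language model.

```python
import re

# Toy vocabulary standing in for a trained model's knowledge.
SPAM_WORDS = {"free", "prize", "winner", "urgent"}

def preprocess(raw_text):
    # Step 1: clean and tokenize the raw text.
    return re.sub(r"[^\w\s]", " ", raw_text.lower()).split()

def extract_features(tokens):
    # Step 2: the "feature" here is simply the count of spammy tokens.
    return sum(tok in SPAM_WORDS for tok in tokens)

def model(feature):
    # Step 3: stand-in for a real language model -- a hard threshold.
    return "spam" if feature >= 2 else "not-spam"

def classify(raw_text):
    # Step 4: the labeled output.
    return model(extract_features(preprocess(raw_text)))

label = classify("URGENT: claim your FREE prize!")
```

The iterative loop the speaker describes maps onto this directly: a bad output might mean fixing `preprocess`, changing `extract_features`, or picking a different `model`.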
So what are real world applications of text classification?
I'm going to go through just a couple here,
but the first one is, as mentioned previously, spam detection.
So you get a bunch of emails.
You're not sure if they're relevant to you
or whether someone is sending you something inappropriate.
Well, you can add an AI text classification model
onto your inbox and classify those emails
as spam or not-spam.
The next one is what's called sentiment analysis.
The classic examples are positive, negative or neutral.
So a string of text can be happy, sad or neutral,
and you can use that in the business world
to determine how customers feel about something,
how they feel about a product.
Whether it's how they post about it on Twitter or X,
or how they post about it on Instagram,
you can determine how they're feeling about something like that
through sentiment analysis.
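A tiny lexicon-based sketch of the positive/negative/neutral idea. The word lists are invented for illustration; real sentiment models learn these associations from data rather than from a hand-written lexicon.

```python
# Toy lexicons, illustrative only -- not a real sentiment resource.
POSITIVE = {"great", "love", "happy", "excellent"}
NEGATIVE = {"terrible", "hate", "awful", "return"}

def sentiment(text):
    """Classify a string as positive, negative or neutral by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

result = sentiment("this product is terrible")
```

A customer email like the speaker's example ("this product is terrible, I want to return it") would score negative and could be routed for immediate follow-up.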
The next one, and this is a more business-specific
and internal-specific type of application,
but it's what's called topic categorization.
So let's say, for example,
a business is receiving emails from customers,
and instead of having an administrator go through
and manually classify those emails as, let's say,
an order, a technical request or a customer service request,
you can have an AI model go through and classify
each of those automatically into those categories.
The fourth is what's called customer feedback.
So this ties into, as mentioned, the others,
such as with sentiment.
But if you're trying to determine
how a customer is feeling about something,
let's say for example, they email you and they say,
"this product is terrible, I want to return it and never buy something again".
Well, from a business standpoint,
you want to make sure that you speak to that customer
immediately to try to rectify the situation.
Whereas on the flip side of that,
if a customer is happy with the product and just wants to send out a thank you,
you don't need to prioritize that as immediately as you would with
something a little more negative.
So these are, I feel, the four major real-world categories
of text classification.
Obviously there is a lot out there that you can do with this
and the applications are almost limitless.
So challenges and best practices when it comes to classifying text.
The first one is what's termed as imbalanced data sets.
So, you need to make sure that you have the right number of examples
for each type of thing that you're trying to classify.
If you have too many of one type, or too few of another,
your output and your model won't be as balanced as you want it to be.
So you need to make sure that you have
the right number relative to the output that you're expecting.
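One simple way to spot an imbalanced dataset is to check the class ratios before training. The labels and the 20% cutoff below are hypothetical; the right ratio depends on the output you're expecting.

```python
from collections import Counter

# Hypothetical training labels: 90 spam examples, 10 not-spam.
labels = ["spam"] * 90 + ["not-spam"] * 10

counts = Counter(labels)
total = sum(counts.values())
ratios = {label: n / total for label, n in counts.items()}

# A simple heuristic: flag any class holding under 20% of the data.
underrepresented = [label for label, r in ratios.items() if r < 0.2]
```

Here `not-spam` gets flagged: with only 10% of the examples, the model would see far too few of them to learn that class well.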
The second is what's called ambiguous text.
And this one's a little bit gray and it's relative to each use case.
But the example I like to give is the word "bank".
The word bank can have a couple of meanings,
like the physical location where you store money
or the side of a river.
The model might not necessarily know what you want it to mean.
So going in, within the text that you're using,
you need to make sure that you have that meaning specified.
The third one is diversity.
Diversity, meaning you have a wide spread of different types of examples.
So using sentiment analysis as an example here,
positive, negative and neutral.
You need to make sure that the training examples you have within each
span both extremes, from "extremely positive"
to "kind of positive", and then into the negative,
from "extremely negative" to "kind of negative",
and then everything else in the middle, neutral.
So it's important that you have the spread within each.
Because if you don't, you might only be receiving
classifications of extremely positive or extremely negative,
whereas you're going to want to capture sentiments
across the full spread of each category and subcategory.
So what can we do to fix this?
Each of these components?
So one of the things that we can do to fix this
is through what's called, well, proper labeling.
This can be really time intensive.
But what I mean by that is you go through each of your training examples
and manually read and discern:
using the sentiment example: is this positive? Is this negative?
And then manually label it yourself.
Don't rely on somebody else who might not be versed in
the task that you're trying to perform.
Do it yourself.
Do it by hand.
And then the last one within this, is validation.
So what I mean by validation is
making sure, once you train that first model
and send it out into the real world,
that the data it's receiving is still being classified
in the way that you want.
So there's this thing called "drift"
where if a world event comes along
and changes what the sentiment of a particular idea or a topic would be,
the model would perform differently.
So you need to be constantly going back and reviewing it
to make sure that the model is classifying what you want it to classify.
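A minimal sketch of that ongoing validation: compare accuracy on a recent labelled sample against the accuracy you measured at training time. The predictions, labels, and the 0.15 drop threshold are all hypothetical, chosen only to illustrate the check.

```python
def accuracy(predictions, truths):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)

# Hypothetical labelled samples: one from training time, one from production today.
baseline_acc = accuracy(["spam", "not-spam", "spam", "spam"],
                        ["spam", "not-spam", "spam", "spam"])
recent_acc = accuracy(["not-spam", "not-spam", "spam", "not-spam"],
                      ["spam", "not-spam", "spam", "spam"])

# Flag possible drift when recent accuracy falls well below the baseline.
DRIFT_THRESHOLD = 0.15
drift_suspected = (baseline_acc - recent_acc) > DRIFT_THRESHOLD
```

When the flag trips, that's the cue to go back, relabel a fresh sample, and retrain or adjust the model, closing the loop the speaker describes.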
So to wrap things up, let's revisit why text classification is so powerful.
Businesses are flooded with tons of information daily:
thousands of emails, phone calls, etc.
So these text classification models are able to classify these things
without human intervention,
quickly and efficiently and repeatedly.