# Text Classification: Types and Techniques

## Key Points
- Text classification transforms raw text—like emails or Netflix movie descriptions—into automated categories such as spam vs. not‑spam or comedy vs. drama, reducing the need for manual labeling.
- The three main classification tasks are binary (two classes), multiclass (one of many exclusive classes), and multi‑label (assigning multiple categories to a single item, e.g., an action‑adventure film).
- The workflow centers on heavy preprocessing of raw text (cleaning punctuation, tokenization, etc.) before converting it into numerical vectors via word embeddings.
- A pre‑trained language model (e.g., BERT, ChatGPT, Granite) is then fine‑tuned to the specific classification problem, leveraging its learned representations to predict the appropriate labels.
## Sections
- [00:00:00](https://www.youtube.com/watch?v=hHiPs_wICsE&t=0s) **Understanding Text Classification Types** - A brief overview of text classification, illustrating binary, multiclass, and multi‑label approaches using spam email and Netflix movie genre examples.
- [00:03:05](https://www.youtube.com/watch?v=hHiPs_wICsE&t=185s) **Text Classification Pipeline Steps** - The speaker outlines the end‑to‑end workflow for text classification, covering raw‑text preprocessing, feature extraction via word embeddings, choosing an appropriate language model, and producing labeled outputs.
- [00:06:17](https://www.youtube.com/watch?v=hHiPs_wICsE&t=377s) **AI Email Sorting & Sentiment** - The speaker outlines how AI models can automatically filter spam, gauge sentiment, categorize topics, and interpret customer feedback from email and social‑media messages.
- [00:09:28](https://www.youtube.com/watch?v=hHiPs_wICsE&t=568s) **Balancing Data and Handling Ambiguity** - The speaker explains how to ensure a well‑balanced model by maintaining proper class ratios, clarifying ambiguous terms like “bank,” and providing a diverse spread of examples across sentiment subcategories.
- [00:12:30](https://www.youtube.com/watch?v=hHiPs_wICsE&t=750s) **Model Validation and Drift Management** - The speaker explains the need to continuously validate text classification models against data drift caused by real‑world changes, ensuring they keep correctly categorizing incoming business communications.
**Source:** [https://www.youtube.com/watch?v=hHiPs_wICsE](https://www.youtube.com/watch?v=hHiPs_wICsE) **Duration:** 00:13:52

## Full Transcript
So let's jump in with a quick question.
How many of you have come across spam in your email,
or, while on Netflix, the different categories of a movie?
Well, that's text classification.
Text classification takes raw text, like these documents,
and funnels them into a computational engine
that then outputs different classifications.
So it could be, in the two examples mentioned,
a spam email or simply a not-spam.
Or, in the Netflix examples, a comedy,
drama, etc.
So in today's world, we're constantly bombarded with
tons of information.
And what text classification provides
is a means to simplify and automate
the classification of different types of text without human input.
Types of text classification.
There are three major types.
Starting with the least complex: binary classification.
That can be expressed as either a one or a zero.
Or in the email example, a spam versus not-spam.
The second is multiclass classification, I'll put that.
And that can be expressed as a 2, a 1, or a 0.
Or if using the email example,
a business related email,
a customer related email or an order email.
The third, and the most complex,
is what's called multi-label classification.
And this is the most complex because
you can assign a specific email, or a specific type of text,
multiple classifications.
So switching over to the Netflix example,
a movie can be classified as an action adventure.
And it has those two classifications as just that one entity.
So depending on the business use case,
and text complexity,
you'll go through and determine if you need to use
one of these three major types.
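The three types above can be sketched as toy label structures in Python. A minimal illustration only: the example texts and label names are made up, not from any real dataset.

```python
# Binary: each item gets exactly one of two labels (1 = spam, 0 = not-spam).
binary_labels = {
    "Win a free prize now!": 1,
    "Meeting moved to 3pm": 0,
}

# Multiclass: each item gets exactly one of several mutually exclusive labels.
multiclass_labels = {
    "Invoice #1234 attached": "order",
    "My login is broken": "technical",
    "Where is my refund?": "customer_service",
}

# Multi-label: a single item can carry several labels at once,
# like a movie classified as both action and adventure.
multilabel_labels = {
    "Indiana Jones": ["action", "adventure"],
}
```

The shape of the label is what distinguishes the three: a single bit, a single category out of many, or a list of categories per item.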
Key techniques of text classification.
There are four key techniques.
So the first one is how do you handle the raw text?
Most of your time is spent preparing and preprocessing the text,
and you usually do that in scripting languages such as Python.
So you take the raw text,
you extract it from the document,
depending on your use case, you remove periods, hyphens,
apostrophe S's, that sort of thing.
It all depends on the use case.
But, again, this is where most of your time is spent,
working through and preprocessing that text.
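A minimal sketch of that preprocessing step in Python, using only the standard library. The cleaning rules here (lowercasing, stripping all punctuation) are one illustrative choice; as the speaker notes, what you remove depends on the use case.

```python
import re

def preprocess(raw_text):
    """Lowercase, strip punctuation, and split into tokens.

    A minimal sketch; a real pipeline is use-case specific
    (e.g. you may want to keep hyphens or apostrophes).
    """
    text = raw_text.lower()
    # Replace anything that is not a word character or whitespace
    # (periods, hyphens, apostrophes, etc.) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

tokens = preprocess("Win a FREE prize -- claim it now!")
```

Here `tokens` comes out as a clean list of lowercase words, ready for the feature-extraction step.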
Before the next step,
which is called feature extraction, I'll just put F.E. for short.
So this is where you take the text and send it into
what's called word embeddings.
So this is a bit of a black box,
and the details of it are outside of the scope of today's discussion,
but you're essentially taking the raw text
and then converting it into a long list of numbers.
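Learned word embeddings are out of scope here, as the speaker says, but the idea of "text becoming a long list of numbers" can be illustrated with the much simpler hashing trick. This is not an embedding model, just a toy fixed-length count vector; the dimension of 8 is arbitrary.

```python
def hashed_features(tokens, dim=8):
    """Map a token list to a fixed-length vector of counts.

    Not a learned embedding -- just the 'hashing trick', shown only
    to illustrate text turning into a list of numbers.
    """
    vec = [0.0] * dim
    for tok in tokens:
        # Each token is hashed to one of `dim` buckets and counted there.
        vec[hash(tok) % dim] += 1.0
    return vec

features = hashed_features(["win", "a", "free", "prize"])
```

Real embeddings place similar words near each other in the vector space, which this toy version does not do; it only shows the shape of the transformation.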
The third is the model.
So when I say model, I mean a large language model
like ChatGPT or Granite model or BERT model.
And depending on what you're trying to classify,
different types of models are pre-trained
on their own types of text.
So in other words, there could be a model that's built on
just classifying spam versus non-spam emails,
or classifying different types of movies
or different types of news documents.
This is where you would select that type of model
that's specific for your use case.
Then the fourth type, or the fourth step,
is the labeled output.
So I'll just write "output".
This part you need to work through iteratively.
So this is just the types of classifications
that you're receiving from each of these steps.
Depending on your output,
you might have to go all the way back to your text
and rework it.
You might have to go back to your feature extraction and adjust it.
Or as mentioned,
you might have the wrong model selected,
so you might have to go back to that model
and select a different one.
So these four key techniques give you an idea
of what iterative steps are required in order to take the raw text,
turn it into features, pass it through the model
and then get our labeled output.
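The four steps can be sketched end to end with a toy keyword "model". Everything here is made up for illustration: the spam vocabulary, the threshold of 2, and the labels; a real pipeline would swap the `model` function for a fine-tuned language model.

```python
import re

# Toy vocabulary standing in for a trained model's knowledge.
SPAM_WORDS = {"free", "prize", "winner", "urgent"}

def preprocess(raw_text):
    # Step 1: clean and tokenize the raw text.
    return re.sub(r"[^\w\s]", " ", raw_text.lower()).split()

def extract_features(tokens):
    # Step 2: the "feature" here is simply the count of spammy tokens.
    return sum(tok in SPAM_WORDS for tok in tokens)

def model(feature):
    # Step 3: stand-in for a real language model -- a hard threshold.
    return "spam" if feature >= 2 else "not-spam"

def classify(raw_text):
    # Step 4: the labeled output.
    return model(extract_features(preprocess(raw_text)))

label = classify("URGENT: claim your FREE prize!")
```

The iterative loop the speaker describes maps onto this directly: a bad output might mean fixing `preprocess`, changing `extract_features`, or picking a different `model`.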
So what are real world applications of text classification?
I'm going to go through just a couple here,
but the first one is, as mentioned previously, spam detection.
So you get a bunch of emails.
You're not sure if they're relevant to you
or whether someone is sending you something inappropriate.
Well, you can add an AI text classification model
onto your inbox and classify those emails
as spam or not-spam.
The next one is what's called sentiment analysis.
The classic examples are positive, negative or neutral.
So a string of text can be happy, sad or neutral,
and you can use that in the business world
to determine how customers feel about something,
how they feel about a product.
Whether it's how they post about it on Twitter or X,
or how they post about it on Instagram,
you can determine how they're feeling about something like that
through sentiment analysis.
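A tiny lexicon-based sketch of the positive/negative/neutral idea. The word lists are invented for illustration; real sentiment models learn these associations from data rather than from a hand-written lexicon.

```python
# Toy lexicons, illustrative only -- not a real sentiment resource.
POSITIVE = {"great", "love", "happy", "excellent"}
NEGATIVE = {"terrible", "hate", "awful", "return"}

def sentiment(text):
    """Classify a string as positive, negative or neutral by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

result = sentiment("this product is terrible")
```

A customer email like the speaker's example ("this product is terrible, I want to return it") would score negative and could be routed for immediate follow-up.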
The next one, and this is a more business-specific
and internal-specific type of application,
but it's what's called topic categorization.
So let's say, for example,
a business is receiving emails from customers,
and instead of having an administrator go through
and manually classify those emails as, let's say,
an order, a technical request or a customer service request,
you can have an AI model go through and classify
each of those automatically into those categories.
The fourth is what's called customer feedback.
So this ties into, as mentioned, the others,
such as with sentiment.
But if you're trying to determine
how a customer is feeling about something,
let's say for example, they email you and they say,
"this product is terrible, I want to return it and never buy something again".
Well, from a business standpoint,
you want to make sure that you speak to that customer
immediately to try to rectify the situation.
Whereas on the flip side of that,
if a customer is happy with the product and just wants to send out a thank you,
you don't need to prioritize that as immediately as you would with
something a little more negative.
So these are, I feel, the four major real-world categories
of text classification.
Obviously there is a lot out there that you can do with this
and the applications are almost limitless.
So challenges and best practices when it comes to classifying text.
The first one is what's termed as imbalanced data sets.
So, you need to make sure that you have the right number of examples
for each type of thing that you're trying to classify.
If you have too many of one type, or too few of another,
your output and your model won't be as balanced as you want it to be.
So you need to make sure that you have
the right number relative to the output that you're expecting.
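One simple way to spot an imbalanced dataset is to check the class ratios before training. The labels and the 20% cutoff below are hypothetical; the right ratio depends on the output you're expecting.

```python
from collections import Counter

# Hypothetical training labels: 90 spam examples, 10 not-spam.
labels = ["spam"] * 90 + ["not-spam"] * 10

counts = Counter(labels)
total = sum(counts.values())
ratios = {label: n / total for label, n in counts.items()}

# A simple heuristic: flag any class holding under 20% of the data.
underrepresented = [label for label, r in ratios.items() if r < 0.2]
```

Here `not-spam` gets flagged: with only 10% of the examples, the model would see far too few of them to learn that class well.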
The second is what's called ambiguous text.
And this one's a little bit gray and it's relative to each use case.
But the example I like to give is the word "bank".
The word bank can have a couple of meanings,
like the physical location where you store money
or the side of a river.
The model might not necessarily know what you want it to mean.
So going in, within the text that you're using,
you need to make sure that you have that meaning specified.
The third one is diversity.
Diversity, meaning you have a wide spread of different types of examples.
So using sentiment analysis as an example here,
positive, negative and neutral.
You need to make sure that the training examples you have within each
span both extremes, from "extremely positive"
to "kind of positive", and then into the negative,
from "extremely negative" to "kind of negative",
and then everything else in the middle, neutral.
So it's important that you have the spread within each.
Because if you don't, you might only be receiving
classifications of extremely positive or extremely negative,
whereas you're going to want to capture sentiments
across the full spread of each category and subcategory.
So what can we do to fix this?
Each of these components?
So one of the things that we can do to fix this
is through what's called, well, proper labeling.
This can be really time intensive.
But what I mean by that is you go through each of your training examples
and manually read and discern:
using the sentiment example: is this positive? Is this negative?
And then manually label it yourself.
Don't rely on somebody else who might not be versed in
the task that you're trying to perform.
Do it yourself.
Do it by hand.
And then the last one within this, is validation.
So what I mean by validation is
making sure, once you train that first model
and send it out into the real world,
that the data it's receiving is still being classified
in the way that you want.
So there's this thing called "drift"
where if a world event comes along
and changes what the sentiment of a particular idea or a topic would be,
the model would perform differently.
So you need to be constantly going back and reviewing it
to make sure that the model is classifying what you want it to classify.
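A minimal sketch of that ongoing validation: compare accuracy on a recent labelled sample against the accuracy you measured at training time. The predictions, labels, and the 0.15 drop threshold are all hypothetical, chosen only to illustrate the check.

```python
def accuracy(predictions, truths):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)

# Hypothetical labelled samples: one from training time, one from production today.
baseline_acc = accuracy(["spam", "not-spam", "spam", "spam"],
                        ["spam", "not-spam", "spam", "spam"])
recent_acc = accuracy(["not-spam", "not-spam", "spam", "not-spam"],
                      ["spam", "not-spam", "spam", "spam"])

# Flag possible drift when recent accuracy falls well below the baseline.
DRIFT_THRESHOLD = 0.15
drift_suspected = (baseline_acc - recent_acc) > DRIFT_THRESHOLD
```

When the flag trips, that's the cue to go back, relabel a fresh sample, and retrain or adjust the model, closing the loop the speaker describes.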
So to wrap things up, let's revisit why text classification is so powerful.
Businesses are flooded with tons of information daily:
thousands of emails, phone calls, etc.
So these text classification models are able to classify these things
without human intervention,
quickly and efficiently and repeatedly.