Stemming vs Lemmatization Explained
Key Points
- Stemming is the process of reducing related word forms (e.g., “connected,” “connection,” “connect”) to a common base or “stem,” which acts like the stem of a plant.
- Search engines rely on stemming to return results that include all morphological variants of a query term (e.g., “invest,” “invested,” “investment”) so users find relevant information.
- In natural‑language processing, stemming is a token‑level preprocessing step that follows tokenization, which breaks documents into paragraphs, sentences, and ultimately individual word tokens.
- Unlike stemming, lemmatization aims to map words to their proper dictionary (lemma) forms rather than simply chopping off suffixes, providing a more linguistically accurate normalization.
- The lecture will cover how stemming works, compare it to lemmatization, demonstrate a stemming algorithm (a “stemmer”), and discuss practical caveats.
Sections
- [00:00:00] Untitled Section
- [00:03:17] Stemming vs Lemmatization Explained - The passage contrasts token-level stemming, a rule-based heuristic that chops word endings, with lemmatization, a context-aware process that maps words to their dictionary forms using resources like WordNet.
- [00:06:25] Benefits of Stemming in NLP - Stemming groups morphological variants of words to improve search recall, reduce vocabulary size and feature dimensions, and consequently increase the accuracy and efficiency of statistical NLP models.
- [00:09:30] Longest-Match Stemming and Its Limits - The speaker explains how the stemming algorithm selects the longest matching suffix (e.g., "caresses" → "caress"), applies rules about consonant collapsing and the "measure" exponent, and demonstrates a failure case where the rule to drop a trailing "e" incorrectly stems "therefore."
- [00:12:44] Stemming Pitfalls: Over- and Under-Stemming - The speaker outlines common stemming issues, including over-stemming, under-stemming, and failures with proper nouns and homonyms, illustrating how these errors distort word meanings.
- [00:15:50] Invitation to Comment - The speaker encourages viewers to post any questions or thoughts about the topic in the comment section below.
Source: https://www.youtube.com/watch?v=L5S6YPZcJt8
Duration: 00:15:56
Full Transcript
What do plants and words have in common?
I'll give you a hint. It's on the
whiteboard.
Both of them have
stems.
For a plant, the stem is the central
part that connects it to the leaves, the
flowers, the fruits. And for a word,
each of the words have stems too. Today,
we'll be talking about stemming.
Consider the word connect.
This is the stem for words like connected,
connection, and of course the word
connect itself.
Reducing each of these different words
that I've listed over here to connect is
the process of stemming, in which case
connect is the base form of the word.
Let's say for instance, you want to
become a millionaire. Honestly, who
doesn't?
So, what's the first step that you take
to know how to be a millionaire? Lots of
questions, right? Perhaps you start off
with a search query. You pull up your
favorite search engine and you type in
how to invest so that I can become a
millionaire. And what you'll notice is
the search results that pop up don't
just have the word invest but also have
words related to invest like invested,
investing, investment
and so on. The process that is making
all of this happen so that you can
receive relevant search results which
cover all the different variations and
all the different forms of the words
like invest in this case is stemming.
That's what the magic is.
So today we'll be looking at stemming:
what it entails, how it's used,
comparing it with an alternative,
seeing a stemming algorithm, called a
stemmer, in action, and ending
with some caveats.
Stemming is a text pre-processing
technique that's used in natural
language processing. Natural language
processing, or NLP, is a sub-branch of
artificial intelligence.
It's the way our computers and machines can
understand how you and I communicate
using text or speech. Natural language
processing includes different tasks to
take all of our documents or all of our
data set and break it down into smaller
components.
Let's say you have a set of documents.
You continue breaking it down into
smaller components to make it more
easily digestible for your machine.
So each document can be broken down into
paragraphs.
Each paragraph can be broken down into
sentences.
And finally each sentence can be broken
down into different words.
And these words over here are what are
called tokens. And this entire process
that we have done of taking the data set
from the documents to the paragraph to
the sentences to the words is called
tokenization.
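The document-to-tokens pipeline described above can be sketched in a few lines of Python. This is a simplified illustration, not the video's code; real tokenizers (such as NLTK's `word_tokenize`) handle punctuation and edge cases far more carefully.

```python
# Toy tokenization: document -> paragraphs -> sentences -> word tokens.
# Simplified sketch using only stdlib string methods.

def tokenize(document: str) -> list[str]:
    tokens = []
    for paragraph in document.split("\n\n"):              # paragraphs
        for sentence in paragraph.replace("!", ".").replace("?", ".").split("."):
            for word in sentence.split():                  # word tokens
                tokens.append(word.strip(",;:").lower())
    return [t for t in tokens if t]

doc = "Stemming reduces words. Lemmatization maps words to dictionary forms."
print(tokenize(doc))
# ['stemming', 'reduces', 'words', 'lemmatization', 'maps', 'words',
#  'to', 'dictionary', 'forms']
```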
Stemming as a technique operates at the
level of tokens. And now we'll take a
deeper look into how that works. So you
got a glimpse of stemming but there's
also another text pre-processing
technique called lemmatization.
Let's take a look at the differences
between the two.
Stemming tries to cut the ends of the
word in the hope of getting to its base
word or stem. In this case, stemming
would cut into the ending of happy,
leaving a stem like "happi," which is
not a dictionary word.
But in lemmatization, it tries to get to
the normalized form of a word, that is,
the word form that already exists in the
dictionary. In which case happy will
just stay happy.
As you can imagine, stemming is more of
a heuristic algorithm and is very
rule-based. It looks at the ends of the
words and tries to guess or estimate
what the base word could be. For
example, it would look at words ending
in "ing" and remove the "ing"
to get the base form, which in this case
works.
However, consider the word nothing.
It tries to apply the same logic to it
and you end up with "noth," which is not
really correct.
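The suffix-chopping heuristic just described can be sketched as a toy stemmer. This is written purely for illustration and is not the actual Porter algorithm; the suffix list and minimum-length check are my own choices.

```python
# Naive rule-based stemmer: chop a common ending, illustrating why stemming
# is a heuristic - it works for "connected" but mangles "nothing".

def naive_stem(word: str) -> str:
    for suffix in ("ing", "ed", "ion", "s"):   # tried in this fixed order
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]        # chop the suffix off
    return word

print(naive_stem("connected"))   # connect
print(naive_stem("connecting"))  # connect
print(naive_stem("nothing"))     # noth  <- the heuristic fails here
```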
Lemmatization, on the other hand, would
actually give you nothing.
But the caveat here is that it requires
more context. It requires information
like the part of speech, the context of
the word, how it's being used, and it
uses all of that in relation to
something called WordNet. WordNet is
a huge graph which gives you
relationships amongst different words,
their synonyms, the types of definitions
that they have, and so on and so forth.
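A lemmatizer's reliance on a dictionary plus part-of-speech context can be illustrated with a toy lookup table standing in for WordNet. The table entries and function here are hypothetical, purely for illustration; a real lemmatizer consults the actual WordNet graph.

```python
# Toy lemmatizer: a lookup table plays the role WordNet plays in a real
# lemmatizer, and the part-of-speech tag supplies the needed context.
# The entries are illustrative, not a real WordNet excerpt.

LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("rose", "verb"): "rise",
    ("nothing", "noun"): "nothing",
}

def lemmatize(word: str, pos: str) -> str:
    # Fall back to the word itself when no dictionary entry exists.
    return LEMMA_TABLE.get((word, pos), word)

print(lemmatize("better", "adj"))    # good
print(lemmatize("nothing", "noun"))  # nothing
```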
That goes to say that stemming is fairly
easy and simple to implement,
whereas lemmatization is computationally
more expensive
but also more accurate.
Think of an example
where you have the word better.
The stem of it would simply remain
"better," which is also incorrect. But
lemmatization, powered by that
additional context and additional
knowledge, would actually be able to tell
us that better is a form of good.
Choosing one or the other really depends
on your use case. If you want high
accuracy and you are okay with it being
computationally expensive, go with
lemmatization. If you want something
that's simpler and easier to implement
while compromising a little bit on
accuracy, go with stemming.
Now let's see what the use
cases of stemming are.
Why should we use stemming?
First one is search engine or
information retrieval.
Think back to our example of wanting to
become a millionaire and putting in a
search query of how to invest to become
a millionaire.
Even though your query has the word
invest, the search results and documents
that come up have words like investment,
or maybe investing, or invests.
This is where stemming comes into play.
Trying to get all of those different
forms, the different morphological
variants of that particular word invest.
This in turn gives you more relevant
results thereby increasing the
efficiency of the search while also
increasing the accuracy of the results
that you get.
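This recall effect can be sketched by stemming both the query term and the document words before matching. The stemmer and the tiny document set below are made up for illustration; a real search engine would of course use a proper stemmer and index.

```python
# Sketch of stemming-based recall: "invest" in the query matches "invested",
# "investments", and "investing" in the documents once both sides are stemmed.

def stem(word: str) -> str:
    # Toy suffix list, longest first, chosen just for this example.
    for suffix in ("ments", "ment", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

docs = ["they invested early", "smart investments grow", "a guide to investing"]
query = "invest"
hits = [d for d in docs if stem(query) in {stem(w) for w in d.split()}]
print(hits)  # all three documents match
```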
The next use is dimensionality reduction.
All of the different unique words that
exist in your documents are what
comprise the vocabulary, the entire
vocabulary.
Let's say for instance you have change
changing
and changed in your vocabulary
which are three unique words making your
vocabulary size three.
If instead you were to use only the stem,
which in this case is "chang,"
your vocabulary would now reduce to just
one word.
This leads to a reduction in dimensions
or the number of features that your
machine learning model will now have. It
again increases the accuracy
as well as the precision.
This helps you get better performances
with statistical NLP models, especially
the ones concerned with word embeddings
and topic models.
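The vocabulary shrinkage can be shown concretely with a toy stemmer. The trailing-"e" removal below is a crude stand-in for Porter's real e-handling rules, used only so that change, changing, and changed collapse to the same stem.

```python
# Toy stemmer for illustration only: strip one common suffix, then drop any
# trailing "e" so change/changing/changed all collapse to "chang".

def stem(word: str) -> str:
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")  # crude stand-in for Porter's e-dropping rules

words = ["change", "changing", "changed"]
print(len(set(words)))                # 3 unique surface forms
print(len({stem(w) for w in words}))  # 1 unique stem: "chang"
```

Fewer unique tokens means fewer features (dimensions) for a bag-of-words model to handle.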
Well, enough talk about all of this.
Let's go and see how exactly a stemming
algorithm or a stemmer works.
So hopefully you're on board with
stemming. But now let's see the
algorithm in action. A stemmer is what a
stemming algorithm is called. And let's
see how that works.
One of the most widely used stemmers is
called the Porter stemmer. The way it
works is that it looks at each word,
identifies the consonants and vowels
present in there, and then does a
bunch of substitutions and eliminations
based on the number of consonant and
vowel pairs present in there. Let's say
for example we have the word
caresses
and these are the three rules that might
apply to caresses. You look at SSES
being present and replace it with SS. Or
you look at SS being present and keep it
the same, don't replace it. Or you look
at S and just eliminate it, that is,
replace it with nothing.
As you can see for the word caresses,
more than one of these rules would
apply.
But the trick over here is to look at
the longest matching suffix, which in
our case is SSES, and apply the first
rule over here.
So caresses will then become caress,
which is correct.
Let's say instead the word was caress. In
that case, SS would stay the same, which
gives caress again, which is also correct.
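The three rules and the longest-match trick can be written out directly. These are the Step 1a rules of the Porter algorithm as the speaker presents them; the function is a minimal sketch of just this step, not the whole stemmer.

```python
# Porter Step 1a as described:  SSES -> SS,  SS -> SS,  S -> ""
# When several rules match, the rule with the longest matching suffix wins,
# so the list is ordered longest-first.

RULES = [("sses", "ss"), ("ss", "ss"), ("s", "")]

def step1a(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(step1a("caresses"))  # caress  (SSES -> SS)
print(step1a("caress"))    # caress  (SS stays SS)
print(step1a("cats"))      # cat     (S -> "")
```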
However, nothing in life is perfect, and
the same goes for the Porter stemmer.
Let's look at one example which showcases
a limitation that the stemmer has.
Consider the word therefore.
I've identified each of the consonants
and vowels present in this particular
word. Let's look at these. Two
consonants coming together can be
collapsed and made into one, so
that's C. Let's also look at the pairs
of VCs, that is, a vowel and a consonant
coming together. As you can see, there
are three of those occurrences:
one, two, and three. In which case we
would denote it as VC raised to the power
of the number of times it appears, in
this case three, finally ending with the
V over here.
This particular exponent, three over
here, is called the measure, and based
on the algorithm there are certain rules
that apply to the exponent.
One such rule, like the ones that we
saw over here, says that if the exponent
is non-zero, which means it is greater
than zero, which is the case for us,
and if the word in consideration ends
with an "e," you must eliminate the "e,"
which then gives us "therefor" as our
stem: therefore minus the "e," which is
incorrect.
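The measure computation can be sketched in Python. This simplified version always treats "y" as a consonant (the full Porter rules are context-sensitive about "y") and applies the e-dropping rule exactly as the speaker states it.

```python
# Porter "measure" m: rewrite the word as a consonant/vowel pattern
# [C](VC)^m[V] and count the VC pairs. "therefore" is C(VC)^3V, so m = 3,
# and the stated rule "if m > 0 and the word ends in 'e', drop the 'e'"
# produces the incorrect stem "therefor".

def measure(word: str) -> int:
    pattern = ""
    for ch in word.lower():
        symbol = "V" if ch in "aeiou" else "C"   # 'y' treated as C here
        if not pattern or pattern[-1] != symbol:  # collapse runs: CC -> C
            pattern += symbol
    return pattern.count("VC")

word = "therefore"
m = measure(word)
print(m)                                                     # 3
print(word[:-1] if m > 0 and word.endswith("e") else word)   # therefor
```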
Even though this is a limitation of the
Porter stemming algorithm, the Porter
stemmer still remains one of the most
widely used because of the number of
rules that it has and the amount of
correct results it does give, even if
stemming as a whole is heuristic and
simplistic in nature.
Another stemmer that you might have
heard of is called the Snowball stemmer.
It's a modification of the Porter
stemmer. The Porter stemmer was created
only to be used with English words.
However, the Snowball stemmer is
multilingual, which means it does work
on languages other than English.
The Python implementation of the
Snowball stemmer, which you can use
through NLTK, which stands for Natural
Language Toolkit, an NLP package that's
available, also has an option to ignore
stop words, so stop words would then be
excluded from the process of stemming.
So those are two ways that the Snowball
stemmer is different from the Porter
stemmer.
Let's continue looking at some more
issues and limitations that exist with
stemming. Two of the most common issues
that arise with stemming are over-stemming
and under-stemming.
Let's take a look at some examples.
Consider universal, universe, and
university.
Over-stemming, as it sounds, overdoes
the stemming part, or removes too much,
more than necessary, to the point where
the words can lose their meaning.
The stem for each of those
would be
"univers," universe without the "e." And
based on our knowledge of the English
language, we know that this is incorrect
because universal, universe, and
university are three words that mean
totally different things.
The opposite end of this is
under-stemming, where not enough removal
has happened. Consider the example of
alumnus, alumna, and alumni. Alumna and
alumni would remain the same, whereas
alumnus would become "alumnu," giving us
three different stems or three different
base words, which in this case is also
incorrect because each of these three
words means the same thing and so should
have the same stem.
Some of the other challenges that exist
with stemming are
it fares badly with named entity
recognition.
For example, you might have some proper
nouns. Consider
Boeing.
Based on the examples that we have seen
previously, stemming would remove the
"ing" from it, giving you the stem "Boe,"
which we know is incorrect because it's
a proper noun. Another example might be
that of homonyms.
Consider rose the flower and rose, the
past tense of rise. In this case
too, the stemming algorithm would
inaccurately give the same base, rise,
for each of those,
which does make sense for the sun rose,
but does not make sense for the flower
rose.
You might also run into issues trying to
attempt stemming on languages like
Arabic, which have complex word forms,
as it can be difficult to understand
what suffixes and what prefixes are
present.
Stemming is a simple yet powerful
technique when used in the right way.
Hopefully you learned something useful
along the way as we talked about
stemming, and may it continue to grow
strong roots as you go on in your
journey of artificial intelligence.
If you like this video and want to see
more like it, please like and subscribe.
If you have any questions or want to
share your thoughts about this topic,
please leave a comment below.