Stemming vs Lemmatization Explained
Key Points
- Stemming is the process of reducing related word forms (e.g., “connected,” “connection,” “connect”) to a common base or “stem,” which acts like the stem of a plant.
- Search engines rely on stemming to return results that include all morphological variants of a query term (e.g., “invest,” “invested,” “investment”) so users find relevant information.
- In natural‑language processing, stemming is a token‑level preprocessing step that follows tokenization, which breaks documents into paragraphs, sentences, and ultimately individual word tokens.
- Unlike stemming, lemmatization aims to map words to their proper dictionary (lemma) forms rather than simply chopping off suffixes, providing a more linguistically accurate normalization.
- The lecture will cover how stemming works, compare it to lemmatization, demonstrate a stemming algorithm (a “stemmer”), and discuss practical caveats.
Sections
- [00:00:00] Untitled Section
- [00:03:17] Stemming vs Lemmatization Explained - The passage contrasts token-level stemming, a rule-based heuristic that chops word endings, with lemmatization, a context-aware process that maps words to their dictionary forms using resources like WordNet.
- [00:06:25] Benefits of Stemming in NLP - Stemming groups morphological variants of words to improve search recall, reduce vocabulary size and feature dimensions, and consequently increase the accuracy and efficiency of statistical NLP models.
- [00:09:30] Longest-Match Stemming and Its Limits - The speaker explains how the stemming algorithm selects the longest matching suffix (e.g., "caresses" → "caress"), applies rules about consonant collapsing and the "measure" exponent, and demonstrates a failure case where the rule to drop a trailing "e" incorrectly stems "therefore."
- [00:12:44] Stemming Pitfalls: Over- and Under-Stemming - The speaker outlines common stemming issues, including over-stemming, under-stemming, and failures with proper nouns and homonyms, illustrating how these errors distort word meanings.
- [00:15:50] Invitation to Comment - The speaker encourages viewers to post any questions or thoughts about the topic in the comment section below.
Source: https://www.youtube.com/watch?v=L5S6YPZcJt8
Duration: 00:15:56
Full Transcript
What do plants and words have in common?
I'll give you a hint. It's on the
whiteboard.
Both of them have
stems.
For a plant, the stem is the central
part that connects it to the leaves, the
flowers, the fruits. And for a word,
each of the words have stems too. Today,
we'll be talking about stemming.
Consider the word connect.
This is the stem for words like connected,
connection, and of course the word
connect itself.
Reducing each of these different words
that I've listed over here to connect is
the process of stemming, in which case
connect is the base form of the word.
Let's say for instance, you want to
become a millionaire. Honestly, who
doesn't?
So, what's the first step that you take
to know how to be a millionaire? Lots of
questions, right? Perhaps you start off
with a search query. You pull up your
favorite search engine and you type in
how to invest so that I can become a
millionaire. And what you'll notice is
the search results that pop up don't
just have the word invest but also have
words related to invest like invested,
investing, investment
and so on. The process that is making
all of this happen so that you can
receive relevant search results which
cover all the different variations and
all the different forms of the words
like invest in this case is stemming.
That's what the magic is.
So today we'll be looking at stemming:
what it entails, how it's used,
comparing it with an alternative,
seeing a stemming algorithm, called a
stemmer, in action, and ending
with some caveats.
Stemming is a text pre-processing
technique that's used in natural
language processing. Natural language
processing, or NLP, is a sub-branch of
artificial intelligence.
It's the way our computers and machines can
understand how you and I communicate
using text or speech. Natural language
processing includes different tasks to
take all of our documents or all of our
data set and break it down into smaller
components.
Let's say you have a set of documents.
You continue breaking it down into
smaller components to make it more
easily digestible for your machine.
So each document can be broken down into
paragraphs.
Each paragraph can be broken down into
sentences.
And finally each sentence can be broken
down into different words.
And these words over here are what are
called tokens. And this entire process
that we have done of taking the data set
from the documents to the paragraph to
the sentences to the words is called
tokenization.
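The document-to-tokens pipeline described above can be sketched in a few lines of Python. This is a simplified illustration, not the video's code; real tokenizers (such as NLTK's `word_tokenize`) handle punctuation and edge cases far more carefully.

```python
# Toy tokenization: document -> paragraphs -> sentences -> word tokens.
# Simplified sketch using only stdlib string methods.

def tokenize(document: str) -> list[str]:
    tokens = []
    for paragraph in document.split("\n\n"):              # paragraphs
        for sentence in paragraph.replace("!", ".").replace("?", ".").split("."):
            for word in sentence.split():                  # word tokens
                tokens.append(word.strip(",;:").lower())
    return [t for t in tokens if t]

doc = "Stemming reduces words. Lemmatization maps words to dictionary forms."
print(tokenize(doc))
# ['stemming', 'reduces', 'words', 'lemmatization', 'maps', 'words',
#  'to', 'dictionary', 'forms']
```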
Stemming as a technique operates at the
level of tokens. And now we'll take a
deeper look into how that works. So you
got a glimpse of stemming but there's
also another text pre-processing
technique called lemmatization.
Let's take a look at the differences
between the two.
Stemming tries to cut the ends of the
word in the hope of getting to its base
word or stem. In this case, stemming
would cut into the ending of happy,
leaving a stem like "happi," which is
not a dictionary word.
But in lemmatization, it tries to get to
the normalized form of a word, that is,
the word form that already exists in the
dictionary. In which case happy will
just stay happy.
As you can imagine, stemming is more of
a heuristic algorithm and is very
rule-based. It looks at the ends of the
words and tries to guess or estimate
what the base word could be. For
example, it would look at words ending
in "ing" and remove the "ing"
to get the base form, which in this case
works.
However, consider the word nothing.
It tries to apply the same logic to it
and you end up with "noth," which is not
really correct.
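The suffix-chopping heuristic just described can be sketched as a toy stemmer. This is written purely for illustration and is not the actual Porter algorithm; the suffix list and minimum-length check are my own choices.

```python
# Naive rule-based stemmer: chop a common ending, illustrating why stemming
# is a heuristic - it works for "connected" but mangles "nothing".

def naive_stem(word: str) -> str:
    for suffix in ("ing", "ed", "ion", "s"):   # tried in this fixed order
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]        # chop the suffix off
    return word

print(naive_stem("connected"))   # connect
print(naive_stem("connecting"))  # connect
print(naive_stem("nothing"))     # noth  <- the heuristic fails here
```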
Lemmatization, on the other hand, would
actually give you nothing.
But the caveat here is that it requires
more context. It requires information
like the part of speech, the context of
the word, how it's being used, and it
uses all of that in relation to
something called WordNet. WordNet is
a huge graph which gives you
relationships amongst different words,
their synonyms, the types of definitions
that they have, and so on and so forth.
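A lemmatizer's reliance on a dictionary plus part-of-speech context can be illustrated with a toy lookup table standing in for WordNet. The table entries and function here are hypothetical, purely for illustration; a real lemmatizer consults the actual WordNet graph.

```python
# Toy lemmatizer: a lookup table plays the role WordNet plays in a real
# lemmatizer, and the part-of-speech tag supplies the needed context.
# The entries are illustrative, not a real WordNet excerpt.

LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("rose", "verb"): "rise",
    ("nothing", "noun"): "nothing",
}

def lemmatize(word: str, pos: str) -> str:
    # Fall back to the word itself when no dictionary entry exists.
    return LEMMA_TABLE.get((word, pos), word)

print(lemmatize("better", "adj"))    # good
print(lemmatize("nothing", "noun"))  # nothing
```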
That goes to say that stemming is fairly
easy and simple to implement,
whereas lemmatization is computationally
more expensive
but also more accurate.
Think of an example
where you have the word better.
The stem of it would simply remain
"better," which is also incorrect. But
lemmatization, powered by that
additional context and additional
knowledge, would actually be able to tell
us that better is a form of good.
Choosing one or the other really depends
on your use case. If you want high
accuracy and you are okay with it being
computationally expensive, go with
lemmatization. If you want something
that's simpler and easier to implement
while compromising a little bit on
accuracy, go with stemming.
Now let's see what the use
cases of stemming are.
Why should we use stemming?
First one is search engine or
information retrieval.
Think back to our example of wanting to
become a millionaire and putting in a
search query of how to invest to become
a millionaire.
Even though your query has the word
invest, the search results and documents
that come up have words like investment,
or maybe investing, or invests.
This is where stemming comes into play.
Trying to get all of those different
forms, the different morphological
variants of that particular word invest.
This in turn gives you more relevant
results thereby increasing the
efficiency of the search while also
increasing the accuracy of the results
that you get.
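This recall effect can be sketched by stemming both the query term and the document words before matching. The stemmer and the tiny document set below are made up for illustration; a real search engine would of course use a proper stemmer and index.

```python
# Sketch of stemming-based recall: "invest" in the query matches "invested",
# "investments", and "investing" in the documents once both sides are stemmed.

def stem(word: str) -> str:
    # Toy suffix list, longest first, chosen just for this example.
    for suffix in ("ments", "ment", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

docs = ["they invested early", "smart investments grow", "a guide to investing"]
query = "invest"
hits = [d for d in docs if stem(query) in {stem(w) for w in d.split()}]
print(hits)  # all three documents match
```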
The next use is dimensionality reduction.
All of the different unique words that
exist in your documents are what
comprise the vocabulary, the entire
vocabulary.
Let's say for instance you have change
changing
and changed in your vocabulary
which are three unique words making your
vocabulary size three.
If instead you were to use only the stem,
which in this case is "chang,"
your vocabulary would now reduce to just
one word.
This leads to a reduction in dimensions
or the number of features that your
machine learning model will now have. It
again increases the accuracy
as well as the precision.
This helps you get better performances
with statistical NLP models, especially
the ones concerned with word embeddings
and topic models.
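The vocabulary shrinkage can be shown concretely with a toy stemmer. The trailing-"e" removal below is a crude stand-in for Porter's real e-handling rules, used only so that change, changing, and changed collapse to the same stem.

```python
# Toy stemmer for illustration only: strip one common suffix, then drop any
# trailing "e" so change/changing/changed all collapse to "chang".

def stem(word: str) -> str:
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")  # crude stand-in for Porter's e-dropping rules

words = ["change", "changing", "changed"]
print(len(set(words)))                # 3 unique surface forms
print(len({stem(w) for w in words}))  # 1 unique stem: "chang"
```

Fewer unique tokens means fewer features (dimensions) for a bag-of-words model to handle.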
Well, enough talk about all of this.
Let's go and see how exactly a stemming
algorithm or a stemmer works.
So hopefully you're on board with
stemming. But now let's see the
algorithm in action. A stemmer is what a
stemming algorithm is called. And let's
see how that works.
One of the most widely used stemmers is
called the Porter stemmer. The way it
works is that it looks at each word,
identifies the consonants and vowels
present in there, and then does a
bunch of substitutions and eliminations
based on the number of consonant and
vowel pairs present in there. Let's say
for example we have the word
caresses
and these are the three rules that might
apply to caresses. You look at SSES
being present and replace it with SS. Or
you look at SS being present and keep it
the same, don't replace it. Or you look
at S and just eliminate it, that is,
replace it with nothing.
As you can see for the word caresses,
more than one of these rules would
apply.
But the trick over here is to look at
the longest matching suffix, which in
our case is SSES, and apply the first
rule over here.
So caresses will then become caress,
which is correct.
Let's say instead the word was caress. In
that case, SS would stay the same, which
gives caress again, which is also correct.
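The three rules and the longest-match trick can be written out directly. These are the Step 1a rules of the Porter algorithm as the speaker presents them; the function is a minimal sketch of just this step, not the whole stemmer.

```python
# Porter Step 1a as described:  SSES -> SS,  SS -> SS,  S -> ""
# When several rules match, the rule with the longest matching suffix wins,
# so the list is ordered longest-first.

RULES = [("sses", "ss"), ("ss", "ss"), ("s", "")]

def step1a(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(step1a("caresses"))  # caress  (SSES -> SS)
print(step1a("caress"))    # caress  (SS stays SS)
print(step1a("cats"))      # cat     (S -> "")
```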
However, nothing in life is perfect, and
the same goes for the Porter stemmer.
Let's look at one example which showcases
a limitation that the stemmer has.
Consider the word therefore.
I've identified each of the consonants
and vowels present in this particular
word. Let's look at these. Two
consonants coming together can be
collapsed and made into one, so
that's C. Let's also look at the pairs
of VCs, that is, a vowel and a consonant
coming together. As you can see, there
are three of those occurrences:
one, two, and three. In which case we
would denote it as VC raised to the power
of the number of times it appears, in
this case three, finally ending with the
V over here.
This particular exponent, three over
here, is called the measure, and based
on the algorithm there are certain rules
that apply to the exponent.
One such rule, like the ones that we
saw over here, says that if the exponent
is non-zero, which means it is greater
than zero, which is the case for us,
and if the word in consideration ends
with an "e," you must eliminate the "e,"
which then gives us "therefor" as our
stem: therefore minus the "e," which is
incorrect.
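The measure computation can be sketched in Python. This simplified version always treats "y" as a consonant (the full Porter rules are context-sensitive about "y") and applies the e-dropping rule exactly as the speaker states it.

```python
# Porter "measure" m: rewrite the word as a consonant/vowel pattern
# [C](VC)^m[V] and count the VC pairs. "therefore" is C(VC)^3V, so m = 3,
# and the stated rule "if m > 0 and the word ends in 'e', drop the 'e'"
# produces the incorrect stem "therefor".

def measure(word: str) -> int:
    pattern = ""
    for ch in word.lower():
        symbol = "V" if ch in "aeiou" else "C"   # 'y' treated as C here
        if not pattern or pattern[-1] != symbol:  # collapse runs: CC -> C
            pattern += symbol
    return pattern.count("VC")

word = "therefore"
m = measure(word)
print(m)                                                     # 3
print(word[:-1] if m > 0 and word.endswith("e") else word)   # therefor
```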
Even though this is a limitation of the
Porter stemming algorithm, the Porter
stemmer still remains one of the most
widely used because of the number of
rules that it has and the amount of
correct results it does give, even if
stemming as a whole is heuristic and
simplistic in nature.
Another stemmer that you might have
heard of is called the Snowball stemmer.
It's a modification of the Porter
stemmer. The Porter stemmer was created
only to be used with English words.
However, the Snowball stemmer is
multilingual, which means it does work
on languages other than English.
The Python implementation of the
Snowball stemmer, which you can use
through NLTK, which stands for Natural
Language Toolkit, an NLP package that's
available, also has an option to ignore
stop words, so stop words would then be
excluded from the process of stemming.
So those are two ways that the Snowball
stemmer is different from the Porter
stemmer.
Let's continue looking at some more
issues and limitations that exist with
stemming. Two of the most common issues
that arise with stemming are over-stemming
and under-stemming.
Let's take a look at some examples.
Consider universal, universe, and
university.
Over-stemming, as it sounds, overdoes
the stemming part, or removes too much,
more than necessary, to the point where
the words can lose their meaning.
The stem for each of those
would be
"univers," universe without the "e." And
based on our knowledge of the English
language, we know that this is incorrect
because universal, universe, and
university are three words that mean
totally different things.
The opposite end of this is
under-stemming, where not enough removal
has happened. Consider the example of
alumnus, alumna, and alumni. Alumna and
alumni would remain the same, whereas
alumnus would become "alumnu," giving us
three different stems or three different
base words, which in this case is also
incorrect because each of these three
words means the same thing and so should
have the same stem.
Some of the other challenges that exist
with stemming are
it fares badly with named entity
recognition.
For example, you might have some proper
nouns. Consider
Boeing.
Based on the examples that we have seen
previously, stemming would remove the
"ing" from it, giving you the stem "Boe,"
which we know is incorrect because it's
a proper noun. Another example might be
that of homonyms.
Consider rose the flower and rose, the
past tense of rise. In this case
too, the stemming algorithm would
inaccurately give the same base, rise,
for each of those,
which does make sense for the sun rose,
but does not make sense for the flower
rose.
You might also run into issues trying to
attempt stemming on languages like
Arabic, which have complex word forms,
as it can be difficult to understand
what suffixes and what prefixes are
present.
Stemming is a simple yet powerful
technique when used in the right way.
Hopefully you learned something useful
along the way as we talked about
stemming, and may it continue to grow
strong roots as you go on in your
journey of artificial intelligence.
If you like this video and want to see
more like it, please like and subscribe.
If you have any questions or want to
share your thoughts about this topic,
please leave a comment below.