Stemming vs Lemmatization Explained

Key Points

  • Stemming is the process of reducing related word forms (e.g., “connected,” “connection,” “connect”) to a common base or “stem,” which acts like the stem of a plant.
  • Search engines rely on stemming to return results that include all morphological variants of a query term (e.g., “invest,” “invested,” “investment”) so users find relevant information.
  • In natural‑language processing, stemming is a token‑level preprocessing step that follows tokenization, which breaks documents into paragraphs, sentences, and ultimately individual word tokens.
  • Unlike stemming, lemmatization aims to map words to their proper dictionary (lemma) forms rather than simply chopping off suffixes, providing a more linguistically accurate normalization.
  • The lecture will cover how stemming works, compare it to lemmatization, demonstrate a stemming algorithm (a “stemmer”), and discuss practical caveats.
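The contrast in the points above can be sketched with a toy example. This is a minimal illustration, not a real stemmer or lemmatizer: the suffix list and the lemma dictionary are invented stand-ins for Porter-style rules and a WordNet-style lexicon.

```python
# Toy contrast between stemming (rule-based suffix chopping) and
# lemmatization (dictionary lookup). Both tables are illustrative
# assumptions, not a real stemmer or a real lexicon.

SUFFIXES = ["ation", "ion", "ed", "ing", "s"]  # longest first

def toy_stem(word):
    """Chop the longest matching suffix, if the remainder is long enough."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# A tiny hand-made lemma dictionary standing in for WordNet.
LEMMAS = {"connected": "connect", "better": "good", "nothing": "nothing"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself."""
    return LEMMAS.get(word, word)

print(toy_stem("connected"))    # "connect"
print(toy_stem("connection"))   # "connect"
print(toy_lemmatize("better"))  # "good"
```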

Full Transcript

# Stemming vs Lemmatization Explained

**Source:** [https://www.youtube.com/watch?v=L5S6YPZcJt8](https://www.youtube.com/watch?v=L5S6YPZcJt8)
**Duration:** 00:15:56

## Sections

- [00:03:17](https://www.youtube.com/watch?v=L5S6YPZcJt8&t=197s) **Stemming vs Lemmatization Explained** - Contrasts token-level stemming, a rule-based heuristic that chops word endings, with lemmatization, a context-aware process that maps words to their dictionary forms using resources like WordNet.
- [00:06:25](https://www.youtube.com/watch?v=L5S6YPZcJt8&t=385s) **Benefits of Stemming in NLP** - Stemming groups morphological variants of a word to improve search recall, shrink vocabulary size and feature dimensions, and thereby improve the accuracy and efficiency of statistical NLP models.
- [00:09:30](https://www.youtube.com/watch?v=L5S6YPZcJt8&t=570s) **Longest-Match Stemming and Its Limits** - Explains how the stemming algorithm selects the longest matching suffix rule (e.g., “caresses” → “caress”), applies rules about consonant collapsing and the measure exponent, and demonstrates a failure case where the rule to drop a trailing “e” incorrectly stems “therefore.”
- [00:12:44](https://www.youtube.com/watch?v=L5S6YPZcJt8&t=764s) **Stemming Pitfalls: Over- and Under-Stemming** - Outlines common stemming issues, including over-stemming, under-stemming, and failures with proper nouns and homonyms, illustrating how these errors distort word meanings.
- [00:15:50](https://www.youtube.com/watch?v=L5S6YPZcJt8&t=950s) **Invitation to Comment** - The speaker encourages viewers to post questions or thoughts about the topic in the comment section.

## Full Transcript
What do plants and words have in common? I'll give you a hint: it's on the whiteboard. Both of them have stems. For a plant, the stem is the central part that connects it to the leaves, the flowers, the fruits. And words have stems too. Today, we'll be talking about stemming.

Consider the word "connect." This is the stem for words like "connected," "connection," "connecting," and of course the word "connect" itself. Reducing each of these different words to "connect" is the process of stemming, in which case "connect" is the base form of the word.

Let's say, for instance, you want to become a millionaire. Honestly, who doesn't? So what's the first step you take to learn how to become a millionaire? Lots of questions, right? Perhaps you start off with a search query. You pull up your favorite search engine and type in "how to invest so that I can become a millionaire." What you'll notice is that the search results that pop up don't just have the word "invest" but also words related to it, like "invested," "investing," "investment," and so on. The process that makes all of this happen, so that you receive relevant search results covering all the different variations and forms of a word like "invest," is stemming. That's the magic.

So today we'll look at stemming: what it entails, how it's used, how it compares with an alternative, a stemming algorithm (called a "stemmer") in action, and finally some caveats.

Stemming is a text preprocessing technique used in natural language processing. Natural language processing, or NLP, is a subbranch of artificial intelligence.
It's the way our computers and machines can understand how you and I communicate using text or speech. Natural language processing includes different tasks that take all of our documents, our entire data set, and break them down into smaller components.

Let's say you have a set of documents. You keep breaking it down into smaller components to make it more easily digestible for your machine. Each document can be broken down into paragraphs. Each paragraph can be broken down into sentences. And finally, each sentence can be broken down into individual words. These words are what are called tokens, and this entire process of taking the data set from documents to paragraphs to sentences to words is called tokenization.

Stemming as a technique operates at the level of tokens, and now we'll take a deeper look into how that works. So you've had a glimpse of stemming, but there's also another text preprocessing technique called lemmatization. Let's take a look at the differences between the two.

Stemming tries to cut the ends off a word in the hope of getting to its base word, or stem. In this case, "happy" would have its "y" cut off, leaving "happi." But lemmatization tries to get to the normalized form of a word, that is, the form that already exists in the dictionary, in which case "happy" would just stay "happy."

As you can imagine, stemming is more of a heuristic algorithm and is very rules-based. It looks at the ends of words and tries to guess, or estimate, what the base word could be. For example, it would look at words ending in "ing" and remove the "ing" to get the base form, which in many cases works. However, consider the word "nothing."
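The "-ing" rule just described can be written as a one-line heuristic. This is a deliberately naive sketch, not the Porter algorithm, and it shows why rule-based chopping is a guess:

```python
# Naive "-ing" stripping, as described in the lecture. This is a toy
# heuristic, not a real stemmer: it has no notion of whether the
# remainder is an actual base word.

def strip_ing(word):
    return word[:-3] if word.endswith("ing") else word

print(strip_ing("connecting"))  # "connect" -- a genuine gerund, works
print(strip_ing("nothing"))     # "noth"    -- not a real word, fails
```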
It tries to apply the same logic, and you end up with "noth," which is not really correct. Lemmatization, on the other hand, would actually give you "nothing." But the caveat is that it requires more context. It needs information like the part of speech and the context in which the word is being used, and it uses all of that in relation to something called WordNet. WordNet is a huge graph that gives you relationships among different words: their synonyms, the definitions they have, and so on.

That is to say, stemming is fairly easy and simple to implement, whereas lemmatization is computationally more expensive but also more accurate.

Think of an example where you have the word "better." The stem of it would just be "better," which is also incorrect. But lemmatization, powered by that additional context and knowledge, would actually be able to tell us that "better" is a form of "good."

Choosing one or the other really depends on your use case. If you want high accuracy and are okay with it being computationally expensive, go with lemmatization. If you want something simpler and easier to implement, while compromising a little on accuracy, go with stemming.

Now let's see what the use cases of stemming are. Why should we use stemming? The first one is search engines, or information retrieval. Think back to our example of wanting to become a millionaire and typing in the search query "how to invest to become a millionaire." Even though your query has the word "invest," the search results and documents that come up have the word "investment," or maybe "investing," or "invests." This is where stemming comes into play.
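A minimal sketch of that retrieval idea: stem both the query term and the document tokens, then match on stems so every morphological variant is found. The `crude_stem` rules and the sample documents are invented for illustration; real search engines use proper stemmers and inverted indexes.

```python
# Stemming for retrieval: match on stems rather than surface forms,
# so "invest" in the query also hits "investing", "invested", and
# "investment" in the documents. Toy stemmer, illustrative only.

def crude_stem(word):
    for suf in ("ment", "ing", "ed", "s"):   # longest first
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

docs = ["smart investing tips", "how he invested early", "investment basics"]
query = "invest"

hits = [d for d in docs
        if crude_stem(query) in {crude_stem(w) for w in d.split()}]
print(hits)  # all three documents match on the stem "invest"
```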
It tries to get all of those different forms, the different morphological variants of that particular word "invest," which in turn gives you more relevant results, thereby increasing the efficiency of the search while also increasing the accuracy of the results you get.

The next use is vocabulary and dimensionality reduction. All of the different unique words that exist in your documents comprise the vocabulary. Let's say, for instance, you have "change," "changing," and "changed" in your vocabulary. These are three unique words, making your vocabulary size three. If instead you were to use only the stem, which in this case is "change," your vocabulary would reduce to just one word. This leads to a reduction in the dimensions, or the number of features, that your machine learning model has to handle, which again helps the accuracy as well as the precision. This gets you better performance with statistical NLP models, especially the ones concerned with word embeddings and topic models.

Well, enough talk. Let's go and see how exactly a stemming algorithm, or stemmer, works. So hopefully you're on board with stemming; now let's see the algorithm in action.

One of the most widely used stemmers is the Porter stemmer. The way it works is that it looks at each word, identifies the consonants and vowels present, and then performs a series of substitutions and eliminations based on the number of consonant-vowel pairs. Let's say, for example, we have the word "caresses." These are three rules that might apply to it: you look for "SSES" and replace it with "SS"; or you look for "SS" and keep it the same.
Don't replace it. Or you look for "S" and simply eliminate it, that is, replace it with nothing. As you can see, for the word "caresses," more than one of these rules would apply. The trick is to look at the longest matching suffix, which in our case is "SSES," and apply that first rule. So "caresses" becomes "caress," which is correct. Let's say instead the word was "caress." In that case the "SS" would stay the same, giving "caress" again, which is also correct.

However, nothing in life is perfect, and the same goes for the Porter stemmer. Let's look at one example which showcases a limitation the stemmer has. Consider the word "therefore." I've identified each of the consonants and vowels present in this particular word. Let's look at them. Two consonants coming together can be collapsed and treated as one, so that's a single C. Let's also look at the VC pairs, that is, a vowel and a consonant coming together. As you can see, there are three of those occurrences: one, two, and three. We denote this as (VC) raised to the power of the number of times it appears, in this case three, finally ending with a V.

This exponent, three in our case, is called the measure, and the algorithm has certain rules that depend on it. One such rule, like the ones we saw earlier, says that if the exponent is non-zero, meaning greater than zero, which is true in our case, and the word in question ends with an "e," you must eliminate the "e." That gives us our stem: "therefore" minus the "e," which is "therefor" and is incorrect.
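The two mechanics just walked through, longest-match rule selection and the measure exponent, can be sketched as follows. This is a simplification of the real Porter algorithm: treating "y" as a consonant everywhere is an assumption (Porter handles it contextually), and only the three plural rules from the example are included.

```python
# Two pieces of Porter-style machinery from the lecture, simplified:
# (1) longest-match selection among SSES->SS, SS->SS, S->"";
# (2) the measure m = number of VC pairs in the word's C?(VC)^m V?
#     pattern. "y" is treated as a consonant here for simplicity.

VOWELS = set("aeiou")

def plural_rules(word):
    """Apply the longest matching plural rule."""
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS
    if word.endswith("ss"):
        return word               # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]          # S -> "" (eliminated)
    return word

def measure(word):
    """Count vowel->consonant transitions: the exponent m in C?(VC)^m V?."""
    pattern = ["v" if ch in VOWELS else "c" for ch in word]
    m = 0
    for prev, cur in zip(pattern, pattern[1:]):
        if prev == "v" and cur == "c":
            m += 1
    return m

print(plural_rules("caresses"))  # "caress" (longest match: SSES)
print(plural_rules("caress"))    # "caress" (SS rule leaves it alone)
print(measure("therefore"))      # 3, so the m > 0 "drop final e" rule misfires
```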
Even though this is a limitation of the Porter stemming algorithm, the Porter stemmer still remains one of the most widely used, because of the number of rules it has and the number of correct results it gives, even if stemming as a whole is heuristic and simplistic in nature.

Another stemmer that you might have heard of is the Snowball stemmer. It's a modification of the Porter stemmer. The Porter stemmer was created only to be used with English words, whereas the Snowball stemmer is multilingual, meaning it works on languages other than English. The Python implementation of the Snowball stemmer, available through NLTK (the Natural Language Toolkit, an NLP package), also gives you the option to ignore stop words, so stop words would be excluded from the process of stemming. Those are two ways in which the Snowball stemmer differs from the Porter stemmer.

Let's continue looking at some more issues and limitations that exist with stemming. Two of the most common issues that arise are over-stemming and under-stemming. Let's take a look at some examples. Consider "universal," "universe," and "university." Over-stemming, as it sounds, overdoes the stemming, removing more than necessary, to the point where words can lose their meaning. The stem for each of those three would be "univers," that is, "universe" without the "e." Based on our knowledge of the English language, we know this is incorrect, because "universal," "universe," and "university" mean totally different things. The opposite end of this is under-stemming, where not enough removal has happened. Consider the example of "alumnus," "alumna," and "alumni."
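Before working through the alumnus example, the over-stemming case just shown can be reproduced with a toy suffix stripper. The suffix list is invented for illustration; real stemmers over-stem for analogous reasons.

```python
# Toy reproduction of over-stemming: an aggressive suffix stripper
# collapses three unrelated words to one stem, destroying the
# distinction between their meanings.

def aggressive_stem(word):
    for suf in ("ity", "al", "e"):   # longest first
        if word.endswith(suf):
            return word[: -len(suf)]
    return word

words = ["universal", "universe", "university"]
stems = {w: aggressive_stem(w) for w in words}
print(stems)  # all three map to "univers"
```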
"Alumna" and "alumni" would remain the same, whereas "alumnus" would become "alumnu," giving us three different stems, or three different base words, which is also incorrect, because these three words mean the same thing and so should share the same stem.

Some of the other challenges that exist with stemming: it fares badly with named entity recognition. For example, you might have proper nouns. Consider "Boeing." Based on the examples we've seen previously, stemming would remove the "ing," giving you the stem "Boe," which we know is incorrect because it's a proper noun. Another example is homonyms. Consider "rose" the flower and "rose" as in the sun rising, the past tense of "rise." In this case too, the stemming algorithm would inaccurately resolve both of those to "rise," which makes sense for "the sun rose" but does not make sense for the flower "rose."

You might also run into issues attempting stemming on languages like Arabic, which have complex word forms, as it can be difficult to work out which suffixes and prefixes are present.

Stemming is a simple yet powerful technique when used in the right way. Hopefully you learned something useful along the way as we talked about stemming, and it will continue to grow strong roots as you go on in your journey through artificial intelligence. If you like this video and want to see more like it, please like and subscribe. If you have any questions or want to share your thoughts about this topic, please leave a comment below.