Why do you visit Google or your favourite search engine whenever you seek an answer to a burning question? I assume it comes down to the reliability, or acceptable accuracy, of these search engines in providing content or rendering websites that meet your need. As humans, it can be somewhat difficult for those outside our immediate circle to understand our thoughts or intentions, owing to our unique upbringing, genes and the impact of our immediate environment. To some extent, people with a biological and experiential history different from ours might initially find it challenging to understand or deduce our intent from our language.
Our environment and biological makeup play a critical role in how we communicate as humans, but this also breeds a certain level of complexity. It is argued that human language differs considerably from that of other animals for two reasons. Firstly, human language is believed to contain tens of thousands of inconsistently learned symbols (words), far more than the communication systems of other animals. Secondly, human language has a complex compositional syntax, with parts of speech such as nouns, verbs, adverbs and more. The understanding we derive from a sentence is a product of the individual meanings of its constituent words. Language is complex, and we are seeing a proliferation of new words, slang and acronyms in this internet age. A good example is the word ‘bad’, which can mean ‘good’ or ‘bad’ depending on the age group, location and setting of the user. This illustrates how complex words can be: a single misunderstood word can change the entire sentiment and meaning of a sentence.
This complexity has led to the emergence of the field of natural language processing (NLP), defined as the interaction between computers and human language, or the training of computers to understand human language. Search engines use NLP to train their machines to understand human language and serve the most relevant pages for a given user. Stemming and lemmatisation are two key techniques in NLP, and search engines use them to gain a better understanding of our search queries and serve the appropriate content or web pages.
Stemming is commonly used in the field of information retrieval and refers to the process of truncating words to their stem or root form. The stemmed word is not expected to be semantically identical to the original base or root word. Stemming algorithms used by search engines treat stemmed words as synonyms and expansions of the original root version. For example, the word ‘going’ stems to the base form ‘go’. In this case, the two words are related and likely to share a similar contextual meaning.
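As a minimal sketch of this, NLTK’s PorterStemmer (one widely used stemming algorithm) strips suffixes by rule, with no knowledge of context:

```python
# Stemming with NLTK's Porter stemmer: suffixes are stripped
# by rule, without any knowledge of part of speech or context.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["going", "jumping", "payments"]:
    print(word, "->", stemmer.stem(word))
```

Note that the output of a stemmer is not guaranteed to be a dictionary word; it is simply a truncated form shared by related inflections.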
In linguistics, lemmatisation refers to the process of grouping the different inflected forms of a word so they can be analysed as a single item. In computational linguistics, lemmatisation is a multi-layered process that uses a word’s part of speech within a sentence to produce an accurate, contextual base word. For example, the verb ‘to dive’ could show up as ‘dive’, ‘dived’ or ‘diving’. In lemmatisation, the part of speech and context of a word determine its base form or lemma.
Examples of Stemming and Lemmatisation with code implementation:
I’ll be using simple Python code with the NLTK (Natural Language Toolkit) library to illustrate the difference between stemming and lemmatisation.
In the above example, I am looking for the root forms of the words ‘paid’ and ‘is’. You could easily guess that ‘paid’ is the past tense of ‘pay’ and ‘is’ is the third-person singular of ‘be’. Stemming could not identify the root words: it returned ‘paid’ and ‘is’ in their original inputted form. This clearly shows that stemming operates on a single word without taking its part of speech or context into consideration. It is almost akin to garbage in, garbage out.
On the other hand, the example clearly indicates that lemmatisation takes part of speech and context into consideration and returns the appropriate base or root word. For ‘is’, the lemma ‘be’ is returned, and ‘paid’ rightly yields the root word ‘pay’.
To be clear, search engines use lemmatisation to gain a better understanding of a user’s query and serve the most relevant results. They look at the part of speech (e.g. verb, adverb, noun) of the words within each query, identify content or web pages containing the most relevant root words, and rank them accordingly.
It is now worth taking a quick look at search results on Google to ascertain how the search giant uses stemming or lemmatisation to render the top results. I ran a quick search for how to know you’ve paid the right price for your holiday.
The goal was to check the top pages and ascertain whether the lemma of ‘paid’ (which is ‘pay’) was used more often in the content than the actual word ‘paid’. Based on the top pages, it is quite clear that Google also uses lemmatisation in understanding a user’s query and rendering results accordingly. Within the top three pages on Google, the lemmatised root word ‘pay’ appeared more often than the actual typed word ‘paid’.
In conclusion, stemming and lemmatisation are two important techniques in natural language processing and information retrieval. They enable search engines to gain a better understanding of our queries and serve the most relevant web pages where applicable.