Learn through the super-clean Baeldung Pro experience:
>> Membership and Baeldung Pro.
No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.
Last updated: March 18, 2024
In this tutorial, we’ll explain one methodology in natural language processing (NLP) that we can use as a text preprocessing step in our projects. Specifically, it’s about stemming and the difference between Porter and Lancaster stemming algorithms.
In most tutorials, stemming is presented as a process of reducing words into base form but rarely or never are mentioned different types of stemming methods.
Natural language processing (NLP) is a field of computer science and artificial intelligence that deals with the relationships between computers and human (natural) languages. In particular, the goal of NLP is how to program computers to process and analyze large amounts of natural language data in the same way that a human does. It has grown in importance in recent years due to the availability of digital text data and new statistical techniques for language processing.
There are many examples of NLP projects, but some of the most popular are related to:
As a result, these are very diverse tasks that require different text preprocessing techniques.
For a computer to understand text, we would need to convert it into numbers and then apply some mathematical operations. But before that, there are some preprocessing steps that are common to many NLP projects:
In brief, stemming is the process of reducing a word to its word stem. Word stem is a base or root form of the word and doesn’t need to be an existing word. For example, the Porter algorithm reduces the words “argue”, “argued”, “argues” and “arguing” to the stem “argu” which isn’t an existing word.
With stemming, we’re able to extract meaningful information from vast sources, like big data, and afterward help search engines to query information. It’s possible to get more results if we recognize and search more word forms. Also, when a word form is recognized, it may be possible to return search results that would otherwise be missed. Because of that, stemming is essential to search queries since, with stemming, we’re able to retrieve additional information.
There are many ways how stemming algorithms work. Some simple methods will only recognize prefixes and suffixes and strip them. Because of this simplicity, they are prone to errors. In many cases, they will strip wrong prefixes and suffixes. Also, it might be difficult to handle some word forms like irregular verbs, for example, words such as “saw” and “see”.
More complex stemming algorithms use lookup tables of different word forms in combination with well-known suffixes and prefixes. Some of them firstly determine the part of speech of a word and after apply different normalization rules for each part of speech.
In this article, we’ll explain in more details Porter and Lancaster stemming algorithms.
Porter stemmer is a suffix stripping algorithm. In short, it uses predefined rules to strip words into their base forms.
Every word can be represented as a sequence of consonants and vowels. Let’s denote a consonant with “c”, and a sequence of consonants of length greater than 0 with “C”. Similarly, “v” is a vowel and “V” a sequence of vowels of length greater than 0. Then, every word has one of the four forms
or as a single form
where square brackets denote the arbitrary presence of their contents. The above expression also can be written as
where is called the measure of the word. Some world examples with different
are:
To remove common suffixes, Porter stemmer applies more than 50 rules, grouped in 5 steps and some substeps. All rules have a form
(condition) S1 -> S2
This means that if a word has the suffix S1 and the part before suffix (stem) satisfies the condition, we replace S1 with S2. Also, some rules don’t have conditions. Below are some rules with word stemming examples:
Lancaster is one of the most aggressive stemmers as it tends to over stem many words. Like the Porter stemmer, the Lancaster stemmer consists of a set of rules where each rule specifies either deletion or replacement of an ending. Also, some rules are restricted to intact words, and some rules are applied iteratively as the word goes through them.
Because of more strict rules, there are two additional conditions to prevent the stemming of various short-rooted words:
Lancaster stemmer has more than 100 rules, around double that of Porter stemmer. Also, the authors defined rules using different notation than Porter’s stemming rules. Each rule has five components, two of which are optional:
[ending in reverse][optional intact flag “*”][remove total letters][optional append string][continuation symbol, “>” or “.”]
In particular, here are some examples of the rules:
In this article, we’ve explained Porter and Lancaster stemming algorithms. In brief, there are two main differences between them:
In most NLP projects, Porter’s stemming algorithm will give more meaningful results, but sometimes Lancaster stemmer might be worth trying.