1. Introduction

In this tutorial, we’ll explain the n-gram, one of the common concepts in natural language processing (NLP). It’s a basic term that most NLP courses and lectures cover. Besides that, data scientists, machine learning engineers, and developers often use n-grams in their NLP projects.

Therefore, in addition to defining the concept, we’ll also explain the real-world applications in which we can use n-grams.

2. Natural Language Processing (NLP)

Natural language processing (NLP) is a branch of artificial intelligence that deals with the interactions between computers and human (natural) languages. The goal of NLP is to give a computer system the ability to comprehend human languages as well as a human being can. It has been in use for decades, but it has grown in popularity as a result of advances in computer hardware and software.

NLP includes a diverse range of sub-disciplines, spanning from the study of linguistic meaning to statistical machine translation, speech recognition, and question answering.

These are very different tasks, and we use many different types of algorithms to solve NLP problems.

3. How Does NLP Work?

Generally, computers are not good at understanding text. First of all, we need to convert text into numbers and then apply mathematical operations in a specific way, so that computers can process it. Even at this point, computers can identify words and their order, but they cannot understand their meaning.

There are some complex transformer-based systems that are able to hold a conversation with humans, but they require a huge amount of data and time to learn. Also, even if such a system outputs a meaningful response, this doesn’t mean that it understands the text. It only gives the statistically most probable answer based on the input data.

3.1. Phases of Developing NLP Systems

In any case, NLP systems need to convert text into numbers. Besides model development, the most important phase in developing NLP systems is text preprocessing. Some common steps that we may apply include:

  • Text cleaning – converting text into lowercase and cleaning it by removing irrelevant parts such as HTML tags, symbols, or sometimes numbers.
  • Stop words removal – excluding some common words that don’t provide useful information.
  • Lemmatization and stemming – simplifying words to their base or root form by applying dictionary-based rules, cutting off common prefixes and suffixes, and similar techniques.
  • Tokenization – separating cleaned text into smaller units, such as words, characters, or some combinations of them.
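As a rough illustration of these steps, here is a minimal Python sketch that uses only the standard library. The function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical choices made for this example. Lemmatization and stemming are left out because they usually rely on a dedicated library, and the sketch tokenizes before removing stop words so that it can filter word by word.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real projects use larger ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}

def preprocess(text):
    """Clean a raw string, split it into word tokens, and drop stop words."""
    text = text.lower()                     # text cleaning: lowercase
    text = re.sub(r"<[^>]+>", " ", text)    # text cleaning: remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)   # text cleaning: remove symbols and digits
    tokens = text.split()                   # tokenization on whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # stop words removal

print(preprocess("<p>The 2 quick foxes are jumping!</p>"))
# ['quick', 'foxes', 'jumping']
```

The character filter assumes English text in plain ASCII letters, so other languages would need a different pattern.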

After preprocessing, we are ready to convert text into a computer-readable format. The converted words are then fed directly into NLP models. The representation of a word, typically in the form of a vector, is called a word embedding. Some of the common embedding methods are:

  • One-hot encoding – representing a word with a vector constructed of all zeros except a single component equal to one.
  • TF-IDF – term frequency-inverse document frequency is a measure for estimating the importance of words in a document among a collection of documents.
  • Word2vec – word representations learned by neural networks, where semantically similar words have similar vector representations.
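The first two methods are simple enough to sketch in plain Python. The toy corpus `docs` and the helper names `one_hot` and `tf_idf` below are assumptions made for illustration. The TF-IDF variant shown (relative term frequency times the natural logarithm of the inverse document frequency, without smoothing) is only one of several in common use, and it assumes the word occurs in at least one document.

```python
import math

# Hypothetical toy corpus of already tokenized documents.
docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog"],
]

# Sorted vocabulary: ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']
vocab = sorted({word for doc in docs for word in doc})

def one_hot(word):
    """All zeros except a single 1 at the word's position in the vocabulary."""
    return [1 if v == word else 0 for v in vocab]

def tf_idf(word, doc):
    """Relative term frequency in doc times log inverse document frequency."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)  # number of documents containing word
    return tf * math.log(len(docs) / df)

print(one_hot("fox"))           # [0, 0, 1, 0, 0, 0]
print(tf_idf("the", docs[0]))   # 0.0, because "the" occurs in every document
print(tf_idf("fox", docs[0]))   # positive, because "fox" occurs in one document only
```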

Instead of embedding single words, it’s possible to divide the observed text into sequences of 2, 3, or N consecutive words and embed those sequences, which we call n-grams.

4. What Is an N-Gram?

Explained in one sentence, an n-gram is a sequence of N adjacent words or letters from a particular source of text. For instance, if we consider the sentence

The quick brown fox jumps over the lazy dog.

and we want to find all 5-grams constructed from the sentence’s words, then we have

The quick brown fox jumps

quick brown fox jumps over

brown fox jumps over the

fox jumps over the lazy

jumps over the lazy dog.

The number of n-grams in a sentence with K words, assuming K is at least N, is

(1)   \begin{align*} \text{N-Grams}_{K} = K - (N - 1). \end{align*}
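To make the definition and the count concrete, the following sketch extracts word n-grams with a sliding window and checks the result against the formula. The function name `ngrams` is a hypothetical choice, and splitting on whitespace is a simplifying assumption that keeps the final period attached to "dog.", just as in the listing above.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog."
words = sentence.split()          # K = 9 words
five_grams = ngrams(words, 5)     # N = 5

for gram in five_grams:
    print(" ".join(gram))

# The count agrees with K - (N - 1) = 9 - 4 = 5.
assert len(five_grams) == len(words) - (5 - 1)
```

Because the function only slices its input, passing a string instead of a list of words yields character n-grams. When N is larger than the number of tokens, the result is an empty list.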

In practice, we mostly use n-grams with a small N, such as 1-grams (unigrams), 2-grams (bigrams), and 3-grams (trigrams). In general, an n-gram is a very simple concept, but it’s used for a variety of things in text mining and NLP.

One special generalization of n-grams is the skip-gram. In skip-grams, the components don’t need to be consecutive in the observed text, so they may leave gaps. For example, a k-skip-n-gram is commonly defined as a sequence of N words taken in order from the text while allowing up to k words to be skipped. We use skip-grams to obtain a higher level of generalization than n-grams provide. Some researchers use skip-grams as features in classification models or as a method in language modeling to decrease perplexity.
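As a small sketch of the idea, the code below builds skip-bigrams, meaning pairs of words in their original order with at most k words skipped between them. The function name `skip_bigrams` is hypothetical, and the sketch is restricted to pairs because definitions of skip-grams for larger N differ in how they count the skipped words.

```python
def skip_bigrams(tokens, k):
    """Ordered pairs of tokens with at most k tokens skipped between them."""
    pairs = []
    for i in range(len(tokens)):
        # The partner may sit 1 to k + 1 positions to the right of token i.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

words = "the quick brown fox".split()

print(skip_bigrams(words, 0))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

print(skip_bigrams(words, 1))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'brown'), ('quick', 'fox'), ('brown', 'fox')]
```

With k = 0 the result is the ordinary list of bigrams, which shows why skip-grams can be seen as a generalization of n-grams.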

In general, there are many applications of n-grams. We’ll present some of them below.

5. N-Gram Applications

With n-grams, it’s possible to develop probabilistic models based on word occurrences. For instance, if the goal is to predict which word will follow the word “United”, the most likely answer is “States”, because in many text corpora most bigrams that start with the word “United” end with the word “States”. Based on all bigrams in the text corpus that start with the word “United”, a model can learn the probability of each following word.
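A minimal sketch of such a model counts the bigrams in a corpus and turns the counts into relative frequencies. The three-sentence `corpus` and the names `following` and `next_word_probs` are hypothetical and serve only as an illustration. A real model would need far more text and usually some form of smoothing for unseen bigrams.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model would need far more text.
corpus = [
    "the united states is large",
    "the united nations met today",
    "she lives in the united states",
]

# following[first][second] = how often the bigram (first, second) occurs
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for first, second in zip(words, words[1:]):
        following[first][second] += 1

def next_word_probs(word):
    """Estimate P(next word | word) as a relative bigram frequency."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("united")
print(probs)                         # 'states' gets 2/3, 'nations' gets 1/3
print(max(probs, key=probs.get))     # states
```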

Of course, to achieve reasonable results, models need a huge corpus of text. With this logic, it’s possible to construct auto-completion systems. There are similar systems in Gmail and Google Docs. Besides that, n-grams can be used in models related to:

  • Spelling correction
  • Text summarization
  • Part-of-speech tagging and others

Also, as we mentioned before, instead of representing single words as vectors, we can represent n-grams. For example, if the goal is to construct a sentiment analysis model, and we have the following two sentences:

No, this service is good.
This service is no good.

with unigrams, we’ll have the same set of embedding vectors in both cases. But with bigrams, models will most likely be able to distinguish the sentiment of the two sentences, because the bigrams containing the negative word “no” differ: “No, this” in the first sentence and “no good” in the second.
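We can check this observation with a few lines of Python. The helper names `tokens` and `ngram_set` are hypothetical, and the sketch compares sets of n-grams rather than actual embedding vectors, assuming lowercase word tokens with punctuation removed.

```python
import re

def tokens(text):
    """Lowercase word tokens with punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())

def ngram_set(text, n):
    """Set of word n-grams that occur in the text."""
    words = tokens(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = "No, this service is good."
b = "This service is no good."

# Both sentences contain exactly the same unigrams.
print(ngram_set(a, 1) == ngram_set(b, 1))   # True

# The bigrams differ, and the differing ones carry the word "no".
print(ngram_set(a, 2) - ngram_set(b, 2))    # ('no', 'this') and ('is', 'good')
print(ngram_set(b, 2) - ngram_set(a, 2))    # ('is', 'no') and ('no', 'good')
```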

Lastly, n-grams are also used outside of natural language, for example in analyzing protein sequences and DNA sequences.

6. Conclusion

In this article, we briefly introduced NLP and how it works. Following the NLP workflow, we came to the concept of n-grams and explained it in detail with a few examples. Lastly, we presented some of the applications where n-grams can be used.