1. Introduction

In this tutorial, we’ll discuss the primary methods for feature selection and feature reduction in text classification.

2. Curse of Dimensionality and Blessings of Selection

All machine learning is affected by a curse: the curse of dimensionality. Today, for the first time in history, recording new information has a cost that’s close to zero. Consequently, datasets that originate from the real world, whether they relate to texts or not, tend to include many more features than those that we deem informative.

For this reason, we’d like to reduce the number of features in a dataset and select only those that matter the most. This is particularly important for texts: a typical text corpus can have several thousand unique words, while only a small fraction of them figures in any given text.

Therefore, we’d like to select only the features from a text that lead to the largest increase in classification accuracy and ignore the others. Three simple techniques can help us: the chi-squared test, information gain, and the usage of n-grams instead of uni-grams.

3. Chi-Squared Distributions

The chi-squared test is one of the foundational techniques for selecting features in classification tasks. It relies on very few assumptions, which makes it easy to remember and implement.

Let’s assume that we’re conducting a classification task. We then have N observations that we want to classify into k mutually exclusive classes. Each observation belongs to one and only one class, so there’s no multi-label classification here.

Let’s call \boldsymbol{x_i} the number of observations that belong to each class \boldsymbol{i}:

    \[\sum_{i=1}^k x_i = N\]

We can now calculate the probability p_i that a given observation belongs to the i-th class of the k available. To do this, we divide the number of observations in a given class \boldsymbol{i} by the total number of observations. In this manner, p_i = x_i / N for each class i.
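
For instance, if N = 100 observations are split across k = 3 classes as (x_1, x_2, x_3) = (50, 30, 20), the empirical probabilities are:

    \[p_1 = \frac{50}{100} = 0.5, \quad p_2 = \frac{30}{100} = 0.3, \quad p_3 = \frac{20}{100} = 0.2\]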

The sum of the observations across all classes must amount to N. Therefore, we can also write this formula:

    \[\frac{\sum_{i=1}^k x_i}{N} = \sum_{i=1}^k p_i = 1\]

Because \sum_{i=1}^k p_i = 1, we can then treat p as the probability distribution that describes how likely a generic observation is to fall into each of the \boldsymbol{k} classes. Therefore, if p_i is the probability of belonging to class i, we can call m_i = N \times p_i the expected number of observations that are associated with the i-th class.
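
With the same illustrative counts as before, the expected numbers of observations are:

    \[m_1 = 100 \times 0.5 = 50, \quad m_2 = 100 \times 0.3 = 30, \quad m_3 = 100 \times 0.2 = 20\]

At this stage, the expected counts coincide exactly with the observed ones.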

4. Chi-Squared Test

We can note that, in the construction we’ve developed so far, x_i and m_i have the same value for each of the k classes, so it may seem redundant to have both. The distinction becomes useful as soon as the probabilities p_i describe a hypothesis rather than the sample itself: the x_i then remain the observed counts, while the m_i = N \times p_i are the counts we’d expect if that hypothesis were true. Comparing the two, we can compute this statistic:

    \[\chi^2 = \sum_{i=1}^k \frac{(x_i - m_i)^2}{m_i} = N \times \sum_{i=1}^k \frac{(x_i / N - p_i)^2}{p_i}\]

The right-hand side of the equation shows a division and multiplication by N so that the probability distribution emerges again. The advantage of reasoning in this manner is that we can work directly with the observed relative frequencies \boldsymbol{x_i / N} and the hypothesized probabilities \boldsymbol{p_i}, since the expected counts m_i no longer figure explicitly on the right-hand side.

If the number N of observations is sufficiently large, this statistic approximately follows the \chi^2 distribution with k - 1 degrees of freedom. We can therefore compare the value computed from the real-world data against that abstract distribution and obtain a p-value. This is indeed the definition of Pearson’s chi-squared test.
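
As a minimal sketch of the computation, assuming the observed counts and the hypothesized probabilities are already available as plain Python lists (the uniform hypothesis and the numbers below are purely illustrative choices), we can evaluate the statistic directly and cross-check it against SciPy’s chisquare function:

    from scipy.stats import chisquare

    # Hypothetical observed counts x_i for k = 3 classes (N = 100)
    observed = [50, 30, 20]
    N = sum(observed)

    # Hypothesized probabilities p_i: a uniform distribution, chosen only for illustration
    p = [1 / 3, 1 / 3, 1 / 3]
    expected = [N * p_i for p_i in p]  # m_i = N * p_i

    # Chi-squared statistic, computed exactly as in the formula above
    chi2_stat = sum((x - m) ** 2 / m for x, m in zip(observed, expected))

    # The same test via SciPy, which also returns the p-value
    result = chisquare(f_obs=observed, f_exp=expected)
    print(chi2_stat, result.statistic)  # both are approximately 14.0
    print(result.pvalue)                # about 0.0009: the counts deviate significantly from the hypothesis

In a text-classification setting, the same comparison is typically run between a word’s occurrence counts and the class labels, and the words for which the test rejects independence are the ones we keep as features.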

5. Information Gain

Another method for selecting features in a dataset requires us to think in terms of entropy. We can ask ourselves: “By how much would the entropy of the class distribution be reduced if we knew the value assumed by one of the features?”

This suggests that we can calculate the information we gain as the difference between two entropies: the entropy of the class distribution, and the entropy of that same distribution conditioned on the value of the feature.

Let’s call H(Y) the entropy for the random variable Y. Let’s also call H(Y|X) the conditional entropy for Y given the value assumed by X. We can then define the information gain IG(Y,X) as:

    \[IG(Y,X) = H(Y) - H(Y|X)\]

We can refer to our other article on this website if we need a refresher on computing the entropy and the conditional entropy for random variables. The information gain is just the difference between these two.
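
As a small sketch, assuming that the class labels and a single discrete feature are available as plain Python lists (the toy labels and the helper names below are our own, illustrative choices), we can estimate both entropies from the sample frequencies and take their difference:

    from collections import Counter
    from math import log2

    def entropy(values):
        # H(Y): Shannon entropy estimated from the sample frequencies
        total = len(values)
        return -sum((c / total) * log2(c / total) for c in Counter(values).values())

    def conditional_entropy(y, x):
        # H(Y|X): entropy of y within each value of x, weighted by how often that value occurs
        groups = {}
        for xi, yi in zip(x, y):
            groups.setdefault(xi, []).append(yi)
        return sum(len(g) / len(y) * entropy(g) for g in groups.values())

    def information_gain(y, x):
        # IG(Y, X) = H(Y) - H(Y|X)
        return entropy(y) - conditional_entropy(y, x)

    # Toy example: class labels and a binary feature ("does a given word appear in the text?")
    labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
    feature = [1, 1, 1, 0, 0, 0]
    print(information_gain(labels, feature))  # 1.0: knowing the feature removes all uncertainty

We’d then compute this score for every candidate feature and keep the ones with the highest information gain.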

6. N-Grams and Their Frequency

The final method that we discuss is the simplest, both intellectually and computationally. When we work with texts, we can create a frequency distribution of the words contained therein. These constitute the uni-grams for those texts, and they’re the standard features for text classification.

However, some texts may contain words that are more meaningful when analyzed in pairs rather than individually. Pairs of consecutive words take the name of bi-grams, and we can use them as features to classify texts. In general, an n-gram comprising \boldsymbol{n} words can be more informative than the \boldsymbol{n} individual words taken alone, at the price of a much larger, and therefore more memory-hungry, feature space.
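
As a final sketch, assuming a whitespace-tokenized toy sentence (the text below is, of course, only an example), we can build uni-gram and bi-gram frequency distributions with the standard library alone:

    from collections import Counter

    text = "the quick brown fox jumps over the lazy dog"
    tokens = text.split()

    # Uni-gram frequencies: individual words
    unigrams = Counter(tokens)

    # Bi-gram frequencies: pairs of consecutive words
    bigrams = Counter(zip(tokens, tokens[1:]))

    # "the" is the only repeated uni-gram; every bi-gram occurs exactly once
    print(unigrams.most_common(3))
    print(bigrams.most_common(3))

In practice, libraries such as scikit-learn expose the same idea through a CountVectorizer with an ngram_range parameter, which builds uni-grams and longer n-grams in a single pass.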

7. Conclusions

In this article, we studied the most common techniques for feature selection and reduction for text classification.
