Sentiment analysis is the task of automatically classifying texts according to the emotions they express. In the simplest scenario, we want to classify a text as positive, negative, or neutral. In more complex situations, we might identify specific emotions or compute the sentiment with respect to a specific entity.
In either case, sentiment analysis is widely used to analyze users’ opinions about brands, movies, and books, for example by conducting large-scale analyses of online reviews or social media activity.
There are many ways to perform sentiment analysis, and using a dictionary is possibly the simplest.
A sentiment analysis dictionary contains information about the emotions or polarity expressed by words, phrases, or concepts. In practice, a dictionary usually provides one or more scores for each word. We can then combine these scores to compute the overall sentiment of an input sentence from its individual words.
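As a minimal sketch of this idea, the snippet below averages the scores of the words it finds in a dictionary. The lexicon, its scores, and the function names are all made up for illustration; real dictionaries are far larger and may use different score ranges.

```python
import re

# Toy dictionary with made-up polarity scores in [-1, 1] (illustration only).
TOY_LEXICON = {"great": 0.8, "good": 0.6, "boring": -0.5, "terrible": -0.9}


def sentence_polarity(sentence, lexicon):
    """Average the scores of the words found in the lexicon (0.0 if none match)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [lexicon[word] for word in words if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def label(polarity, threshold=0.1):
    """Map a numeric polarity to a coarse class; the threshold is arbitrary."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


print(label(sentence_polarity("The plot was boring but the acting was great", TOY_LEXICON)))
```

Averaging is only one possible way to aggregate; summing the scores or taking the most extreme one are equally simple alternatives.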
In this tutorial, we’ll see an overview of some dictionaries for English and we’ll analyze common shortcomings of dictionary-based sentiment analysis.
2. Sentiment Analysis Dictionaries
As the name suggests, SentiWordNet assigns scores to WordNet synsets instead of words. In particular, each synset has both a positivity and a negativity score, each lying between 0 and 1. This resource has very high coverage, with more than 150k words. Moreover, since it’s based on WordNet, it inherits its useful features, such as the distinction between different parts of speech.
Since this dictionary works with WordNet synsets, it’ll assign different scores to the same word depending on its different meanings. For example, the word “attractive” can have at least the following two meanings:
- pleasing to the eye or mind especially through beauty or charm;
- having the properties of a magnet; the ability to draw or pull.
Thus, in SentiWordNet, we’ll find two sets of scores, one for each meaning. In particular, the first meaning has scores of (pos=0.875, neg=0.0), while the second is neutral (pos=0.0, neg=0.0).
In order to use this dictionary, we need to know the specific meaning of each word in a sentence. To do this, we can use Word Sense Disambiguation algorithms to predict the meaning of words based on their context.
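To make the disambiguation step concrete, here is a toy sketch in the spirit of the simplified Lesk algorithm: it picks the sense whose definition shares the most words with the context. The two definitions and their scores are the ones quoted above; the stopword list, the data layout, and the function names are assumptions made for this example, and a real system would use a proper Word Sense Disambiguation library.

```python
import re

# Tiny, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "of", "to", "or", "is", "and"}

# The two senses of "attractive" discussed above, with their scores.
SENSES = [
    {"gloss": "pleasing to the eye or mind especially through beauty or charm",
     "pos": 0.875, "neg": 0.0},
    {"gloss": "having the properties of a magnet; the ability to draw or pull",
     "pos": 0.0, "neg": 0.0},
]


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = content_words(context)
    return max(senses, key=lambda s: len(context_words & content_words(s["gloss"])))


sense = disambiguate("The attractive force of a magnet can pull iron", SENSES)
print(sense["pos"], sense["neg"])
```

When no definition overlaps with the context, this sketch simply falls back to the first sense in the list.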
Here’s what the two lines for the word “attractive” in SentiWordNet look like:
The first column is the part-of-speech tag, with “a” meaning “adjective”. Then, we have the WordNet synset ID, followed by the positivity and negativity scores. The fifth column contains the set of terms in the synset, and the last one holds the WordNet definition.
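The column layout described here can be read with a few lines of Python. This is only a sketch: it assumes tab-separated columns and terms written as `word#sense_number`, and the synset ID and sense number in the sample line are placeholders, not values copied from the real file. Only the scores and the definition come from the text above.

```python
from typing import List, NamedTuple


class SentiSynset(NamedTuple):
    pos_tag: str
    synset_id: str
    pos_score: float
    neg_score: float
    terms: List[str]
    gloss: str


def parse_line(line):
    """Split one tab-separated line into its six columns."""
    pos_tag, synset_id, pos_score, neg_score, terms, gloss = line.rstrip("\n").split("\t")
    return SentiSynset(pos_tag, synset_id, float(pos_score), float(neg_score),
                       terms.split(), gloss)


# Placeholder synset ID ("00000000") and sense number ("#1"): illustration only.
SAMPLE_LINE = ("a\t00000000\t0.875\t0.0\tattractive#1\t"
               "pleasing to the eye or mind especially through beauty or charm")

entry = parse_line(SAMPLE_LINE)
print(entry.pos_tag, entry.pos_score, entry.neg_score, entry.terms)
```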
SentiWords is very similar to SentiWordNet and, in fact, it’s derived from it. As opposed to SentiWordNet, SentiWords assigns scores directly to words rather than synsets. We call these prior polarities, i.e., polarities of the words independent of their context (and thus their meaning). This allows us to use this dictionary without having to disambiguate the input text first.
To compute these polarities, the authors used a high-coverage, high-precision algorithm that builds on the data from SentiWordNet.
Since it’s indirectly derived from WordNet, this dictionary covers more than 150k words as well, making it one of the most extensive dictionaries for English. It’s also a great choice if we want to avoid using a Word Sense Disambiguation algorithm.
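To illustrate what a prior polarity is, the sketch below collapses the per-sense scores of a word into a single number by averaging the difference between positivity and negativity over all senses. This plain average is a deliberately naive stand-in and is not the algorithm used to build SentiWords; the data layout and function name are assumptions for this example.

```python
# Per-sense (positivity, negativity) scores for "attractive", as quoted earlier.
SENSE_SCORES = {"attractive": [(0.875, 0.0), (0.0, 0.0)]}


def naive_prior_polarity(word, sense_scores):
    """Average (pos - neg) over all senses of the word, ignoring context."""
    senses = sense_scores[word]
    return sum(pos - neg for pos, neg in senses) / len(senses)


print(naive_prior_polarity("attractive", SENSE_SCORES))
```

With a score like this, the word can be looked up directly, without first deciding which of its meanings is intended.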
VADER is a lexicon and rule-based sentiment analysis tool for social media text. The lexicon was built manually by aggregating ratings from 10 human annotators.
For this reason, it’s not as extensive as our previous examples, as it contains just over 7000 words. Nevertheless, its precision should be higher than that of automatically created resources. Moreover, being specifically tuned for social media, it also covers emojis and abbreviations (e.g., “lmao”, “lol”) that other dictionaries normally don’t.
Here’s a sample of the VADER dictionary:
Although it contains four columns, we only care about the first two: the target word and its polarity. The polarity is simply the average of the 10 individual scores listed in the fourth column. The third column is the standard deviation of the individual scores.
The last two columns are only provided so that statistical tests can be performed on the data, while the sentiment analysis tool only uses the polarity itself. Keep in mind that in this dictionary the scores range from -4 to 4, rather than the more common -1 to 1 or 0 to 1 ranges.
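The relationship between the columns can be checked with a short sketch. It assumes tab-separated columns with the individual ratings written as a bracketed list, and the sample line uses made-up numbers rather than an actual entry from the lexicon. The final division by 4 is just a plain linear rescaling to the -1 to 1 range, not necessarily how the VADER tool itself normalizes its output.

```python
import ast
import math


def parse_vader_line(line):
    """Return (token, mean polarity, standard deviation, individual ratings)."""
    token, mean, std, ratings = line.rstrip("\n").split("\t")
    return token, float(mean), float(std), ast.literal_eval(ratings)


# Made-up ratings for illustration; not copied from the real lexicon file.
SAMPLE_LINE = "lol\t1.9\t0.7\t[2, 2, 1, 3, 2, 1, 2, 3, 1, 2]"

token, polarity, std, ratings = parse_vader_line(SAMPLE_LINE)

# The polarity in the second column is the mean of the individual ratings.
recomputed = sum(ratings) / len(ratings)
print(token, polarity, math.isclose(polarity, recomputed))

# Linear rescaling from the [-4, 4] range to [-1, 1].
normalized = polarity / 4
print(normalized)
```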
3. Limitations of Dictionary-based Approaches
Using dictionaries is likely the simplest way to perform sentiment analysis. However, this approach often fails to handle the complexities of language.
For example, let’s consider a simple sentence like “it gets very hot”. It doesn’t express any sentiment in and of itself. Nevertheless, we can consider it negative when talking about a laptop, and positive when talking about a stove.
Moreover, these systems have to correctly handle negations and other variations of language that can change the sentiment of words otherwise taken in isolation.
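A very rough way to handle negation is to flip the sign of a word’s score when the previous token is a negator. The sketch below does exactly that; the lexicon, the negator list, and the function name are made up for illustration.

```python
import re

# Made-up scores and a small, incomplete negator list (illustration only).
TOY_LEXICON = {"good": 0.6, "bad": -0.6}
NEGATORS = {"not", "no", "never"}


def polarity_with_negation(sentence, lexicon):
    """Sum word scores, flipping the sign of a word that directly follows a negator."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token in lexicon:
            score = lexicon[token]
            if i > 0 and tokens[i - 1] in NEGATORS:
                score = -score
            total += score
    return total


print(polarity_with_negation("The movie was good", TOY_LEXICON))
print(polarity_with_negation("The movie was not good", TOY_LEXICON))
```

Even this small rule has obvious gaps: in “not very good”, the negator isn’t adjacent to the scored word, so the sketch misses it. This shows how quickly hand-written rules grow when we try to cover real language.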
Last but not least, text can often contain sarcasm, which is really hard to detect automatically and can skew the results of a sentiment analysis algorithm.
Based on these considerations, if training data is available, it’s usually preferable to use a machine learning-based approach, which often outperforms dictionary-based methods.
4. Conclusion
Sentiment analysis dictionaries can be a very useful aid when implementing a sentiment analysis system. They have their shortcomings but can provide value for a lot of use cases.
In this article, we’ve described three popular, free dictionaries and briefly discussed the limitations of dictionary-based approaches to sentiment analysis.