Last updated: June 11, 2024
Given how quickly natural language processing methods have been advancing, developing good language models is more important than ever. These models power a wide range of AI applications.
Therefore, a working knowledge of the evaluation metrics that NLP researchers use to compare language models can come in handy. One metric that has drawn particular praise and attention is perplexity.
In this tutorial, we’ll cover perplexity’s mathematical foundations, a practical use case, and how it helps us understand the performance of language models. Along the way, we’ll also touch on the limitations of perplexity and how we can combine it with other evaluation metrics to get a better picture of a model’s capabilities.
Perplexity is a quantity used in probabilistic model inference to measure how well a probability distribution predicts a sample. It’s the exponentiation of the entropy of the distribution, which tracks the average number of bits required to encode the samples.
For a given probability distribution $p$ over a sequence $X$, the perplexity is calculated as $\mathrm{PP}(p) = 2^{H(p)}$, where $H(p) = -\sum_{x} p(x) \log_2 p(x)$ is the entropy.
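To make this relationship concrete, here’s a minimal Python sketch (the distribution is an invented example): a uniform distribution over four outcomes has an entropy of 2 bits, so its perplexity is $2^2 = 4$:

import math

# Uniform distribution over four outcomes: H = 2 bits, so perplexity = 2^2 = 4
distribution = [0.25, 0.25, 0.25, 0.25]

entropy = -sum(p * math.log2(p) for p in distribution)  # Shannon entropy in bits
perplexity = 2 ** entropy                               # exponentiate the entropy

print(entropy, perplexity)  # 2.0 4.0

In general, a uniform distribution over $k$ outcomes has perplexity $k$, which matches the branching-factor intuition we’ll discuss below.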
In the context of language models, it’s the exponentiation of the average negative log-likelihood of a sequence of words. For a given language model, it’s defined as:

$$\mathrm{PP}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1, \dots, w_{i-1})\right)$$

where $W = w_1 w_2 \dots w_N$ is a sequence of words, $P(w_1, w_2, \dots, w_N)$ is the probability assigned to the sequence by the language model, and $N$ is the total number of words in the sequence.
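As a quick worked example (with invented probabilities), suppose a model assigns conditional probabilities of $0.5$, $0.25$, and $0.25$ to the three tokens of a sequence. Then:

$$\mathrm{PP}(W) = (0.5 \times 0.25 \times 0.25)^{-\frac{1}{3}} = 32^{\frac{1}{3}} \approx 3.17$$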
Intuitively, it’s just the weighted branching factor of the language model: the geometric mean of the number of plausible next tokens the model is choosing among at each point. Lower perplexity means the model is predicting better, assigning a higher probability to the correct tokens; higher perplexity means the model is less certain about the sequence.
To get a sense of how perplexity is computed, consider the following pseudo-code:
import math

def calculate_perplexity(model, test_data):
    total_log_proba = 0.0
    n_tokens = 0
    for sequence in test_data:
        seq_log_proba = 0.0
        for i, token in enumerate(sequence):
            # The context is simply the tokens preceding the current one
            context = sequence[:i]
            seq_log_proba += math.log(model.predict_proba(token, context))
            n_tokens += 1
        total_log_proba += seq_log_proba
    avg_log_proba = total_log_proba / n_tokens
    perplexity = math.exp(-avg_log_proba)
    return perplexity
To calculate perplexity, we first initialize counters for the total log probability and the number of test tokens. This setup allows us to compute the average log probability later. We then iterate over the sequences in the test data, processing one sequence at a time so that each prediction is conditioned only on the context within its own sequence.
We start by initializing a variable that keeps track of the log probability for the sequence, then loop over the sequence’s tokens. For each token, we call the language model’s prediction function to get the probability of that token given the context (in this case, the context is just the previous tokens) and take its logarithm.
We then add this value to the cumulative log probability of the sequence and increment the counter for the number of tokens. In that way, we can capture the probability contribution from each token.
We compute the average log probability across all sequences by dividing the total log probability by the total number of tokens. Finally, we obtain perplexity by exponentiating the negative average log probability.
We perform the exponentiation to transform the log probability scale back to a more comprehensible scale. Then, we can return the perplexity, which tells us how well our language model can predict the test data.
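As a quick sanity check of the function above, we can run it against a toy model (UniformModel is a made-up stand-in, not a real library class) that assigns every token the same probability. For a vocabulary of 100 tokens, the perplexity should come out as exactly 100:

class UniformModel:
    # Toy model: every token gets probability 1 / vocab_size, regardless of context
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def predict_proba(self, token, context):
        return 1.0 / self.vocab_size

test_data = [["the", "cat", "sat"], ["on", "the", "mat"]]
print(calculate_perplexity(UniformModel(vocab_size=100), test_data))  # ~100.0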
The flowchart below illustrates this process step by step, from initializing the counters, through iterating over sequences and tokens and accumulating their log probabilities, to computing and returning the final perplexity.
Suppose that we’re dealing with the perplexity scores produced by a language model trained on a corpus of English text, and we want to test that model’s performance on a held-out test set. If the perplexity score for this test set is 100, it means that, on average (in the geometric-mean sense), the model assigns the correct next token a probability of 1/100 (or 0.01) at each position in the sequence. This reflects a relatively high level of uncertainty in the model’s predictions.
Thus, a perplexity score of 1 would indicate that the model is perfectly certain about the sequence (by giving the right tokens a probability of 1 at all positions), but in practice this is unrealistic. Language models will always be probabilistic and need to account for the inherent ambiguity and variation of natural language.
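We can verify the perfect-certainty case directly from the formula: if the model assigns probability 1 to the correct token at every position, then

$$\mathrm{PP}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log 1\right) = e^{0} = 1$$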
While perplexity is popular and quite useful as a metric, it has some theoretical flaws. The first is fairly trivial: while it quantifies the probability of the observed sequence of tokens (or words), it does not quantify anything else meaningful about natural language, like semantic coherence, pragmatic appropriateness, or text quality.
A language model can assign a high probability to a sequence of words for purely syntactic reasons: the chain of words may look elegant yet be inadequate in meaning or pragmatics within its context. Such text receives a low perplexity score, but from a human point of view, the generated output may read as drivel.
Because of limitations like these, perplexity is not used in isolation; it’s typically paired with other evaluation metrics, such as BLEU or ROUGE for generation quality, as well as human judgments.
Perplexity has other failings. It’s highly sensitive to vocabulary discrepancies, so models with different vocabulary sizes often can’t be compared fairly. Likewise, perplexity scores aren’t comparable across datasets, as one dataset may naturally yield higher perplexity than another.
It’s also a poor metric for evaluating a model’s handling of ambiguity and creativity, because it rewards outputs that resemble the training data over more novel interpretations or generations. A model optimized solely for low perplexity would be overly constrained to reproducing training-data-like text rather than handling ambiguity or generating genuinely creative text.
Complementing perplexity with these additional metrics gives researchers and developers a better sense of how a given language model performs.
An important component of conversational AI is a language model whose generated text is conversational and sounds like it was written by a human. Suppose we’re building several language models using neural networks trained on an extensive corpus of conversational data.
These data can include threads of complaints or questions and answers posted by human users in forums. While developing our chatbot, we can weigh factors such as each model’s perplexity on a held-out conversational test set, complementary metrics like those discussed above, and human judgments of response quality.
Based on this comprehensive evaluation, we can then decide which model to deploy into our conversational AI application.
The formula for computing perplexity is well-known. However, much work in computational linguistics revolves around various extensions that can make it more accurate and more efficient (and hence applicable to larger corpora).
In the context of language modeling, sampling techniques are important because they reduce the computational cost of perplexity calculations on large datasets: rather than scoring every sequence in the corpus, we can estimate perplexity from a representative random subset.
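Here’s a minimal sketch of that idea (sampled_perplexity and its sample_fraction parameter are illustrative names, reusing the calculate_perplexity function from earlier):

import random

def sampled_perplexity(model, test_data, sample_fraction=0.1, seed=42):
    # Estimate perplexity on a random subset of the test sequences
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = max(1, int(len(test_data) * sample_fraction))
    sample = rng.sample(test_data, k)  # draw sequences without replacement
    return calculate_perplexity(model, sample)

The larger the sample fraction, the closer the estimate gets to the full-corpus perplexity, at a correspondingly higher computational cost.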
An example of a more sophisticated method is hierarchical (or multi-level) perplexity computation. Instead of computing perplexity over the entire corpus all at once, this method computes it at different levels of granularity, such as words, phrases, and sentences. This provides a more detailed evaluation of the model and helps diagnose issues.
For our conversational AI system, rather than reporting a single overall perplexity for each model on the test set, we can decompose it across the word, phrase, and sentence levels.
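One way to sketch this decomposition (multilevel_perplexity and the segmenter functions are illustrative, not a standard API) is to reuse calculate_perplexity once per granularity level:

def multilevel_perplexity(model, test_data, segmenters):
    # `segmenters` maps a level name (e.g., 'word', 'phrase', 'sentence')
    # to a function that re-segments each test sequence at that granularity
    return {
        level: calculate_perplexity(model, [segment(seq) for seq in test_data])
        for level, segment in segmenters.items()
    }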
Using these multi-level perplexity scores, we can map out the strengths and relative weaknesses of each model at different granularity levels.
In this article, we learned about perplexity, a metric for assessing how well language models perform on linguistic tasks.
It’s a common metric and an accepted standard by which language models from different engineers and codebases can be compared with each other.
However, it’s far from perfect and doesn’t provide information about more subtle aspects of human-sounding text. To conclude, every metric has its limits, and perplexity is no exception; the other approaches described in this article complement it.