1. Introduction

In this tutorial, we explain the BLEU score in Natural Language Processing (NLP).
2. Why Do We Need the BLEU Score?
BLEU (Bilingual Evaluation Understudy) is a quantitative metric for measuring the quality of an output text based on multiple reference texts.
We need it in NLP tasks for estimating the performance of systems with textual output, e.g., tools for image summarization, question-answering systems, and chatbots. We train them using datasets in which inputs (such as questions) are paired with the reference texts we expect at the output (such as the correct answers to those questions). So, a metric for estimating output quality is necessary for training and evaluating such models.
3. The BLEU Score
To calculate the BLEU score, we need to be familiar with N-grams, precision, and clipped precision.
3.1. Precision

Precision is the fraction of words in the output text that also appear in the reference text:
- Output = She drinks the milk.
- Reference = She drank the milk.
The words “She”, “the”, and “milk” in the output text occurred in the reference text. Since the output contains four words, the precision is 3/4.
But, there are two downsides to the above precision measure. First, it doesn’t catch repetition. For example, if the output text was “milk milk milk”, the precision would be a perfect 3/3 = 1.
Second, it doesn’t take multiple reference texts into account. To address these issues, we use clipped precision.
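To make the computation concrete, here’s a minimal Python sketch of this simple (unclipped) precision. The tokenization (lowercasing and stripping the period) is a simplification for this example, not part of the metric itself:

```python
def unigram_precision(output, reference):
    """Unclipped precision: fraction of output words that appear in the reference."""
    out_words = output.lower().replace(".", "").split()
    ref_words = set(reference.lower().replace(".", "").split())
    return sum(1 for w in out_words if w in ref_words) / len(out_words)

# "She", "the", and "milk" match: 3 of 4 words
print(unigram_precision("She drinks the milk.", "She drank the milk."))  # prints 0.75

# Repetition is rewarded: every "milk" matches the single "milk" in the reference
print(unigram_precision("milk milk milk", "She drank the milk."))  # prints 1.0
```

The second call illustrates the repetition problem: a degenerate output gets a perfect score.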
3.2. Clipped Precision
Let’s take a look at an example of an output text with repetition that also has multiple reference texts:
- Output text: She She She eats a sour cherry.
- Reference text 1: She is eating a blueberry as she loves it.
- Reference text 2: She eats a fruit of her favorite.
In this example, the words “She”, “eats”, and “a” in the output text occur in at least one reference text. However, “She” is repeated three times in the output, even though no reference repeats it that often.
In clipped precision, we bound the count of each word in the output text from above by the maximum count of that word in any of the reference texts:

Count_clip(w) = min(Count_output(w), max_r Count_ref_r(w))
In our example, the maximum count of “She” in the reference texts is 2 (in reference text 1). Therefore, the clipped count of “She” becomes min(3, 2) = 2. Conversely, if a reference text contained three or more occurrences of “She”, the clipped count would equal the output count of 3. Similarly, the output words “eats” and “a” occur at most once in each reference text. So, their clipped count is 1.
Since there are seven words in the output text, the clipped precision is:

(2 + 1 + 1) / 7 = 4/7 ≈ 0.571
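The same calculation can be sketched in a few lines of Python, again with simplified tokenization (lowercasing and stripping periods):

```python
from collections import Counter

def clipped_unigram_precision(output, references):
    """Clip each output word's count by its maximum count in any reference."""
    out_counts = Counter(output.lower().replace(".", "").split())
    ref_counts = [Counter(r.lower().replace(".", "").split()) for r in references]
    clipped = sum(min(c, max(rc[w] for rc in ref_counts))
                  for w, c in out_counts.items())
    return clipped / sum(out_counts.values())

references = [
    "She is eating a blueberry as she loves it.",
    "She eats a fruit of her favorite.",
]
score = clipped_unigram_precision("She She She eats a sour cherry.", references)
print(score)  # prints 0.5714285714285714, i.e., 4/7
```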
3.3. BLEU Score Calculation
The BLEU score is always defined with a particular N in mind. It uses n-grams with n ≤ N, and we denote it as BLEU-N:

BLEU-N = BP · exp(∑_{n=1}^{N} w_n · log p_n)
The exponential term is a weighted geometric mean of the clipped n-gram precisions p_n for n = 1, 2, …, N, and the brevity penalty BP favors longer output texts.
The formula for the clipped precision of n-grams is:

p_n = ∑_g Count_clip(g) / ∑_g Count(g)

where g ranges over the n-grams of the output text.
So, the weighted geometric mean is:

exp(∑_{n=1}^{N} w_n · log p_n) = ∏_{n=1}^{N} p_n^{w_n}
The w_n values are weights we give to each clipped precision. Typically, we use uniform weights. So, for N = 4 with 4 clipped precisions, we’ll use w_1 = w_2 = w_3 = w_4 = 1/4.
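The weighted geometric mean translates directly from the formula above. The precision values below are made up purely for illustration:

```python
import math

# Hypothetical clipped precisions p_1..p_4 (illustrative values only)
precisions = [0.8, 0.6, 0.4, 0.3]
weights = [0.25] * 4  # uniform weights for N = 4

# exp(sum of w_n * log p_n); assumes every p_n > 0
gmp = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
print(round(gmp, 3))  # identical to (0.8 * 0.6 * 0.4 * 0.3) ** 0.25
```

With uniform weights, this is just the ordinary geometric mean of the four precisions.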
3.4. Brevity Penalty
We penalize short output texts to avoid high scores when they don’t make sense. For example, if we have an output text with just one word that also occurs in the reference text, we’ll end up with p_1 = 1. To solve this issue, we need a new factor to penalize short output texts. That’s what the brevity penalty does:

BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r
In the formula, c is the number of words in the output text, and r is the number of words in the reference text.
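The brevity penalty above is a two-line function:

```python
import math

def brevity_penalty(c, r):
    """c: number of words in the output text, r: number of words in the reference."""
    return 1.0 if c > r else math.exp(1 - r / c)

print(brevity_penalty(8, 8))  # prints 1.0: equal lengths, no penalty
print(brevity_penalty(1, 8))  # e^(1 - 8) ≈ 0.0009: a one-word output is heavily penalized
```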
4. An Example

Let’s calculate BLEU-4 for the following example:
- Output Text: The match was postponed because of the snow.
- Reference Text: The match was postponed because it was snowing.
First, we calculate the clipped precisions:

p_1 = 5/8 = 0.625, p_2 = 4/7 ≈ 0.571, p_3 = 3/6 = 0.5, p_4 = 2/5 = 0.4
Then, we compute the weighted geometric mean precision and get approximately 0.517. Next, we compute the brevity penalty with c = 8 and r = 8, which results in e^(1 − 1) = 1. Finally, the BLEU-4 score is 1 · 0.517 ≈ 0.517.
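Putting all the pieces together, here’s a self-contained sketch of BLEU-4 for a single reference, using the same simplified tokenization as before:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(out_tokens, ref_tokens, n):
    out_counts = Counter(ngrams(out_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    return clipped / sum(out_counts.values())

def bleu(output, reference, N=4):
    out = output.lower().replace(".", "").split()
    ref = reference.lower().replace(".", "").split()
    precisions = [clipped_precision(out, ref, n) for n in range(1, N + 1)]
    gmp = math.exp(sum(math.log(p) / N for p in precisions))  # uniform weights 1/N
    bp = 1.0 if len(out) > len(ref) else math.exp(1 - len(ref) / len(out))
    return bp * gmp

score = bleu("The match was postponed because of the snow.",
             "The match was postponed because it was snowing.")
print(round(score, 3))  # prints 0.517
```

A production implementation (for example, NLTK’s sentence_bleu) additionally handles multiple references and smoothing for the case where some p_n is zero, which this sketch does not.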
5. Pros and Cons of BLEU Score
Thanks to its simplicity, explainability, and language independence, the BLEU score has been widely used in NLP.
However, this metric neither considers the meanings of words nor understands their significance in the text. For example, prepositions usually have the lowest level of importance, but BLEU treats them as being just as important as noun and verb keywords.
To add to these downsides, BLEU doesn’t recognize variants of the same word (such as “drinks” and “drank” in our first example) and can’t fully take the word order into account.
6. Conclusion

In this article, we examined the BLEU score as a widely used metric for evaluating text outputs in NLP. It’s simple to compute but falls short when it comes to text semantics.