1. Introduction

In this tutorial, we explain the BLEU score in Natural Language Processing (NLP).
2. Why Do We Need the BLEU Score?
BLEU (Bilingual Evaluation Understudy) is a quantitative metric for measuring the quality of an output text based on multiple reference texts.
We need it in NLP tasks for estimating the performance of systems with textual output, e.g., tools for image summarization, question-answering systems, and chatbots. We train them using datasets in which inputs (such as questions) are paired with the reference texts we expect at the output (such as the correct answers to those questions). So, a metric for estimating output quality is necessary for training and evaluating such models.
3. The BLEU Score
To calculate the BLEU score, we need to be familiar with N-grams, precision, and clipped precision.
3.1. Precision

Precision is the fraction of words in the output text that also appear in the reference text:
- Output = She drinks the milk.
- Reference = She drank the milk.
The words “She”, “the”, and “milk” in the output text occurred in the reference text. Since the output contains four words, the precision is 3/4.
But, there are two downsides to the above precision measure. First, it doesn’t catch repetition. For example, if the output text was “milk milk milk”, the precision would be a perfect 3/3 = 1.
Second, it doesn’t take multiple reference texts into account. To address these issues, we use clipped precision.
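To make the computation concrete, here’s a minimal Python sketch of this simple (unclipped) precision. The tokenization (lowercasing and stripping the period) is a simplification for this example, not part of the metric itself:

```python
def unigram_precision(output, reference):
    """Unclipped precision: fraction of output words that appear in the reference."""
    out_words = output.lower().replace(".", "").split()
    ref_words = set(reference.lower().replace(".", "").split())
    return sum(1 for w in out_words if w in ref_words) / len(out_words)

# "She", "the", and "milk" match: 3 of 4 words
print(unigram_precision("She drinks the milk.", "She drank the milk."))  # prints 0.75

# Repetition is rewarded: every "milk" matches the single "milk" in the reference
print(unigram_precision("milk milk milk", "She drank the milk."))  # prints 1.0
```

The second call illustrates the repetition problem: a degenerate output gets a perfect score.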
3.2. Clipped Precision
Let’s take a look at an example of an output text with repetition that also has multiple reference texts:
- Output text: She She She eats a sour cherry.
- Reference text 1: She is eating a blueberry as she loves it.
- Reference text 2: She eats a fruit of her favorite.
In this example, the words “She”, “eats”, and “a” in the output text occur in at least one reference text. However, “She” is repeated three times in the output, even though no reference repeats it that often.
In clipped precision, we bound the count of each word in the output text from above by the maximum count of that word in any of the reference texts:

Count_clip(w) = min(Count_output(w), max_r Count_ref_r(w))
In our example, the maximum count of “She” in the reference texts is 2 (in reference text 1). Therefore, the clipped count of “She” becomes min(3, 2) = 2. Conversely, if a reference text contained three or more occurrences of “She”, the clipped count would equal the output count of 3. Similarly, the output words “eats” and “a” occur at most once in each reference text. So, their clipped count is 1.
Since there are seven words in the output text, the clipped precision is:

(2 + 1 + 1) / 7 = 4/7 ≈ 0.571
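The same calculation can be sketched in a few lines of Python, again with simplified tokenization (lowercasing and stripping periods):

```python
from collections import Counter

def clipped_unigram_precision(output, references):
    """Clip each output word's count by its maximum count in any reference."""
    out_counts = Counter(output.lower().replace(".", "").split())
    ref_counts = [Counter(r.lower().replace(".", "").split()) for r in references]
    clipped = sum(min(c, max(rc[w] for rc in ref_counts))
                  for w, c in out_counts.items())
    return clipped / sum(out_counts.values())

references = [
    "She is eating a blueberry as she loves it.",
    "She eats a fruit of her favorite.",
]
score = clipped_unigram_precision("She She She eats a sour cherry.", references)
print(score)  # prints 0.5714285714285714, i.e., 4/7
```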
3.3. BLEU Score Calculation
The BLEU score is always defined with a particular N in mind. It uses n-grams with n ≤ N, and we denote it as BLEU-N:

BLEU-N = BP · exp(∑_{n=1}^{N} w_n · log p_n)
The exponential term is a weighted geometric mean of the clipped n-gram precisions p_n for n = 1, 2, …, N, and the brevity penalty BP favors longer output texts.
The formula for the clipped precision of n-grams is:

p_n = ∑_g Count_clip(g) / ∑_g Count(g)

where g ranges over the n-grams of the output text.
So, the weighted geometric mean is:

exp(∑_{n=1}^{N} w_n · log p_n) = ∏_{n=1}^{N} p_n^{w_n}
The w_n values are weights we give to each clipped precision. Typically, we use uniform weights. So, for N = 4 with 4 clipped precisions, we’ll use w_1 = w_2 = w_3 = w_4 = 1/4.
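The weighted geometric mean translates directly from the formula above. The precision values below are made up purely for illustration:

```python
import math

# Hypothetical clipped precisions p_1..p_4 (illustrative values only)
precisions = [0.8, 0.6, 0.4, 0.3]
weights = [0.25] * 4  # uniform weights for N = 4

# exp(sum of w_n * log p_n); assumes every p_n > 0
gmp = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
print(round(gmp, 3))  # identical to (0.8 * 0.6 * 0.4 * 0.3) ** 0.25
```

With uniform weights, this is just the ordinary geometric mean of the four precisions.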
3.4. Brevity Penalty
We penalize short output texts to avoid high scores when they don’t make sense. For example, if we have an output text with just one word that also occurs in the reference text, we’ll end up with p_1 = 1. To solve this issue, we need a new factor to penalize short output texts. That’s what the brevity penalty does:

BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r
In the formula, c is the number of words in the output text, and r is the number of words in the reference text.
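The brevity penalty above is a two-line function:

```python
import math

def brevity_penalty(c, r):
    """c: number of words in the output text, r: number of words in the reference."""
    return 1.0 if c > r else math.exp(1 - r / c)

print(brevity_penalty(8, 8))  # prints 1.0: equal lengths, no penalty
print(brevity_penalty(1, 8))  # e^(1 - 8) ≈ 0.0009: a one-word output is heavily penalized
```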
4. An Example

Let’s calculate BLEU-4 for the following example:
- Output Text: The match was postponed because of the snow.
- Reference Text: The match was postponed because it was snowing.
First, we calculate the clipped precisions:

p_1 = 5/8 = 0.625, p_2 = 4/7 ≈ 0.571, p_3 = 3/6 = 0.5, p_4 = 2/5 = 0.4
Then, we compute the weighted geometric mean precision and get approximately 0.517. Next, we compute the brevity penalty with c = 8 and r = 8, which results in e^(1 − 1) = 1. Finally, the BLEU-4 score is 1 · 0.517 ≈ 0.517.
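Putting all the pieces together, here’s a self-contained sketch of BLEU-4 for a single reference, using the same simplified tokenization as before:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(out_tokens, ref_tokens, n):
    out_counts = Counter(ngrams(out_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    return clipped / sum(out_counts.values())

def bleu(output, reference, N=4):
    out = output.lower().replace(".", "").split()
    ref = reference.lower().replace(".", "").split()
    precisions = [clipped_precision(out, ref, n) for n in range(1, N + 1)]
    gmp = math.exp(sum(math.log(p) / N for p in precisions))  # uniform weights 1/N
    bp = 1.0 if len(out) > len(ref) else math.exp(1 - len(ref) / len(out))
    return bp * gmp

score = bleu("The match was postponed because of the snow.",
             "The match was postponed because it was snowing.")
print(round(score, 3))  # prints 0.517
```

A production implementation (for example, NLTK’s sentence_bleu) additionally handles multiple references and smoothing for the case where some p_n is zero, which this sketch does not.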
5. Pros and Cons of BLEU Score
Thanks to its simplicity, explainability, and language independence, the BLEU score has been widely used in NLP.
However, this metric neither considers the meanings of words nor understands their significance in the text. For example, prepositions usually have the lowest level of importance, but BLEU treats them as being just as important as noun and verb keywords.
To add to these downsides, BLEU doesn’t recognize variants of the same word (such as “drinks” and “drank” in our first example) and can’t fully take the word order into account.
6. Conclusion

In this article, we examined the BLEU score as a widely used metric for evaluating text outputs in NLP. It’s simple to compute but falls short when it comes to text semantics.