BLEU ( bilingual evaluation understudy ) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
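For a quick sense of how the score is used in practice, here is a minimal usage sketch with NLTK's BLEU implementation (an assumption of this example: the third-party nltk package is installed, and naive whitespace splitting is an adequate tokenizer for the toy sentences):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# two human reference translations and one machine candidate, naively tokenized
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
candidate = "the cat sat on the mat".split()

# smoothing avoids a hard zero when some n-gram order has no matches in short sentences
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)  # a number between 0 and 1; higher means closer to the references
```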
Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good-quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Intelligibility and grammatical correctness are not taken into account. BLEU's output is always a number between 0 and 1. This value indicates how similar
$m_{max}=2$, thus $m_{w}$ is clipped to 2. These clipped counts $m_{w}$ are then summed over all distinct words in the candidate. This sum is then divided by the total number of unigrams in the candidate translation. In
a benchmark for the assessment of any new evaluation metric. There are, however, a number of criticisms that have been voiced. It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries. Although designed to be used with several reference translations, in practice it is often used with only a single one. BLEU
a choice of $w$, the BLEU score is $BLEU_{w}({\hat {S}};S):=BP({\hat {S}};S)\cdot \exp \left(\sum _{n=1}^{\infty }w_{n}\ln p_{n}({\hat {S}};S)\right)$. In words, it
a list of reference candidate strings $S_{i}:=(y^{(i,1)},...,y^{(i,N_{i})})$. Given any string $y=y_{1}y_{2}\cdots y_{K}$, and any integer $n\geq 1$, we define
a translation which consisted of all the words in each of the references. To produce a score for the whole corpus, the modified precision scores for the segments are combined using the geometric mean, multiplied by a brevity penalty to prevent very short candidates from receiving too high a score. Let r be the total length of the reference corpus, and c the total length of the translation corpus. If $c\leq r$,
is a weighted geometric mean of all the modified n-gram precisions, multiplied by the brevity penalty. We use the weighted geometric mean, rather than the weighted arithmetic mean, to strongly favor candidate corpuses that are simultaneously good according to multiple n-gram precisions. The most typical choice, the one recommended in the original paper, is $w_{1}=\cdots =w_{4}={\frac {1}{4}}$. This
is a probability distribution on $\{1,2,3,\cdots \}$, that is, $\sum _{i=1}^{\infty }w_{i}=1$, and $\forall i\in \{1,2,3,\cdots \},w_{i}\in [0,1]$. With
is a set of unique elements, not a multiset allowing redundant elements, so that, for example, $G_{2}(abab)=\{ab,ba\}$. Given any two strings $s,y$, define the substring count $C(s,y)$ to be
is illustrated in the following example from Papineni et al. (2002): Of the seven words in the candidate translation, all of them appear in the reference translations. Thus the candidate text is given a unigram precision of $m/w_{t}$, where $m$ is the number of words from the candidate that are found in the reference, and $w_{t}$
is infamously dependent on the tokenization technique, and scores achieved with different tokenizations are not comparable (a point that is often overlooked); to improve reproducibility and comparability, the SacreBLEU variant was designed. It has been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality.
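As a concrete illustration of the reproducibility point, the sketch below uses the sacrebleu Python package (an assumption of this example: the package is installed separately; it reports BLEU on a 0-100 scale and applies its own standardized tokenization, which is what makes its scores comparable across systems):

```python
import sacrebleu  # assumes the third-party sacrebleu package is installed

hypotheses = ["the cat sat on the mat"]  # system outputs, one string per segment
# each inner list is one reference translation stream, aligned with the hypotheses
references = [["the cat is on the mat"],
              ["there is a cat on the mat"]]

result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)  # corpus BLEU on a 0-100 scale
```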
is merely a straightforward generalization of the prototypical case: one candidate sentence and one reference sentence. In this case, it is $p_{n}(\{{\hat {y}}\};\{y\})={\frac {\sum _{s\in G_{n}({\hat {y}})}\min(C(s,{\hat {y}}),C(s,y))}{\sum _{s\in G_{n}({\hat {y}})}C(s,{\hat {y}})}}$. To work up to this expression, we start with
is not normalized. If both the reference and the candidate sentences are long, the count could be big, even if the candidate is of very poor quality. So we normalize it: ${\frac {\sum _{s\in G_{n}({\hat {y}})}\min(C(s,{\hat {y}}),C(s,y))}{\sum _{s\in G_{n}({\hat {y}})}C(s,{\hat {y}})}}$. The normalization
is one. The modified n-gram precision unduly gives a high score for candidate strings that are "telegraphic", that is, containing all the n-grams of the reference strings, but as few times as possible. In order to punish candidate strings that are too short, define the brevity penalty to be $BP({\hat {S}};S):=e^{-(r/c-1)^{+}}$ where $(r/c-1)^{+}=\max(0,r/c-1)$
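To make these definitions concrete, here is a minimal from-scratch sketch of the corpus-level computation (modified n-gram precision, brevity penalty with effective reference length, and the weighted geometric mean). It follows the formulas in this article but is only an illustration, not a reference implementation; the function names are ad hoc, and sentences are assumed to be pre-tokenized lists of words:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # C(s, y) for every n-gram s of y, i.e. the n-grams of G_n(y) with their multiplicities
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidates, reference_lists, n):
    # p_n: clipped n-gram matches summed over the corpus, divided by total candidate n-grams
    matches, total = 0, 0
    for cand, refs in zip(candidates, reference_lists):
        cand_counts = ngram_counts(cand, n)
        max_ref_counts = Counter()  # max count of each n-gram over all references of this segment
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matches += sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total += sum(cand_counts.values())
    return matches / total if total > 0 else 0.0

def brevity_penalty(candidates, reference_lists):
    # c: total candidate length; r: effective reference length (closest-length reference per segment)
    c = sum(len(cand) for cand in candidates)
    r = sum(min((len(ref) for ref in refs), key=lambda length: abs(length - len(cand)))
            for cand, refs in zip(candidates, reference_lists))
    return math.exp(-max(0.0, r / c - 1.0))  # equals 1 whenever c >= r

def bleu(candidates, reference_lists, weights=(0.25, 0.25, 0.25, 0.25)):
    # weighted geometric mean of the modified precisions, times the brevity penalty
    precisions = [modified_precision(candidates, reference_lists, n + 1)
                  for n in range(len(weights))]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; without smoothing the score collapses to 0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(candidates, reference_lists) * math.exp(log_mean)
```

With the default weights $(1/4,1/4,1/4,1/4)$ this corresponds to the usual BLEU-4 configuration; production implementations additionally smooth the zero-precision case rather than returning 0.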
is retained. The longer n-gram scores account for the fluency of the translation, or to what extent it reads like "good English". An example of a candidate translation for the same references as above might be: "the cat". In this example, the modified unigram precision would be $2/2$, as the word "the" and the word "cat" appear once each in the candidate, and the total number of words is two. The modified bigram precision would be $1/1$, as
is similar to $y^{(1)},...,y^{(N)}$, and close to 0 if not. As an analogy, the BLEU score is like a language teacher trying to score the quality of a student translation ${\hat {y}}$ by checking how closely it follows
is such that it is always a number in $[0,1]$, allowing meaningful comparisons between corpuses. It is zero if none of the n-substrings in the candidate appears in the reference. It is one if every n-gram in the candidate appears in the reference at least as many times as in the candidate. In particular, if the candidate is a substring of the reference, then it
is the length of $y$. $r$ is the effective reference corpus length, that is, $r:=\sum _{i=1}^{M}|y^{(i,j)}|$ where $y^{(i,j)}=\arg \min _{y\in S_{i}}||y|-|{\hat {y}}^{(i)}||$, that is,
is the positive part of $r/c-1$. $c$ is the length of the candidate corpus, that is, $c:=\sum _{i=1}^{M}|{\hat {y}}^{(i)}|$ where $|y|$
is the total number of words in the candidate. This is a perfect score, despite the fact that the candidate translation above retains little of the content of either of the references. The modification that BLEU makes is fairly straightforward. For each word in the candidate translation, the algorithm takes its maximum total count, $m_{max}$, in any of
the modified n-gram precision function to be $p_{n}({\hat {S}};S):={\frac {\sum _{i=1}^{M}\sum _{s\in G_{n}({\hat {y}}^{(i)})}\min(C(s,{\hat {y}}^{(i)}),\max _{y\in S_{i}}C(s,y))}{\sum _{i=1}^{M}\sum _{s\in G_{n}({\hat {y}}^{(i)})}C(s,{\hat {y}}^{(i)})}}$. The modified n-gram precision, which looks complicated,
the BLEU score. A basic first attempt at defining the BLEU score would take two arguments: a candidate string ${\hat {y}}$ and a list of reference strings $(y^{(1)},...,y^{(N)})$. The idea is that $BLEU({\hat {y}};y^{(1)},...,y^{(N)})$ should be close to 1 when ${\hat {y}}$
the above example, the modified unigram precision score would be $2/7$. In practice, however, using individual words as the unit of comparison is not optimal. Instead, BLEU computes the same modified precision metric using n-grams. The n-gram length which has the "highest correlation with monolingual human judgements" was found to be four. The unigram scores are found to account for the adequacy of the translation, that is, how much information
the bigram "the cat" appears once in the candidate. It has been pointed out that precision is usually twinned with recall to overcome this problem, as the unigram recall of this example would be $3/6$ or $2/7$. The problem is that, as there are multiple reference translations, a bad translation could easily have an inflated recall, such as
the brevity penalty applies, defined to be $e^{(1-r/c)}$. (In the case of multiple reference sentences, r is taken to be the sum of the lengths of the sentences whose lengths are closest to the lengths of the candidate sentences. However, in the version of the metric used by NIST evaluations prior to 2009,
the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations will increase
the count is 6, not 2. In the above situation, however, the candidate string is too short. Instead of 3 appearances of $ab$ it contains only one, so we add a minimum function to correct for that: $\sum _{s\in G_{n}({\hat {y}})}\min(C(s,{\hat {y}}),C(s,y))$. This count summation cannot be used to compare between sentences, since it
the most obvious n-gram count summation: $\sum _{s\in G_{n}({\hat {y}})}C(s,y)={\text{number of n-substrings in }}{\hat {y}}{\text{ that appear in }}y$. This quantity measures how many n-grams in
the number of appearances of $s$ as a substring of $y$. For example, $C(ab,abcbab)=2$. Now, fix a candidate corpus ${\hat {S}}:=({\hat {y}}^{(1)},\cdots ,{\hat {y}}^{(M)})$, and a reference candidate corpus $S=(S_{1},\cdots ,S_{M})$, where each $S_{i}:=(y^{(i,1)},...,y^{(i,N_{i})})$. Define
the reference answers $y^{(1)},...,y^{(N)}$. Since in natural language processing one should evaluate a large set of candidate strings, one must generalize the BLEU score to the case where one has a list of M candidate strings (called a "corpus") $({\hat {y}}^{(1)},\cdots ,{\hat {y}}^{(M)})$, and, for each candidate string ${\hat {y}}^{(i)}$,
the reference sentence are reproduced by the candidate sentence. Note that we count the n-substrings, not n-grams. For example, when ${\hat {y}}=aba,y=abababa,n=2$, all the 2-substrings in ${\hat {y}}$ (ab and ba) appear in $y$ 3 times each, so
the reference translations. In the example above, the word "the" appears twice in reference 1, and once in reference 2. Thus $m_{max}=2$. For the candidate translation, the count $m_{w}$ of each word is clipped to a maximum of $m_{max}$ for that word. In this case, "the" has $m_{w}=7$ and
the sentence from $S_{i}$ whose length is as close to $|{\hat {y}}^{(i)}|$ as possible. There is not a single definition of BLEU, but a whole family of them, parametrized by the weighting vector $w:=(w_{1},w_{2},\cdots )$. It
the set of its n-grams to be $G_{n}(y)=\{y_{1}\cdots y_{n},y_{2}\cdots y_{n+1},\cdots ,y_{K-n+1}\cdots y_{K}\}$. Note that it
the shortest reference sentence had been used instead.) iBLEU is an interactive version of BLEU that allows a user to visually examine the BLEU scores obtained by candidate translations. It also allows comparing two different systems in a visual and interactive manner, which is useful for system development. BLEU has frequently been reported as correlating well with human judgement, and remains