Metrics for evaluating text summarization performed by Transformers: how to evaluate the quality of summaries

Fabiano Falcão
7 min read · Apr 22, 2023


Text summarization performed by Transformers is one of the most fascinating and advanced technologies in the field of natural language processing. But how do you know whether the summaries generated by these models are of high quality? That’s where evaluation metrics come in. In this article, we’ll explore the most popular metrics used to evaluate the quality of summaries generated by Transformers, so you’ll know how to measure the effectiveness of your work.

Text summarization evaluation metrics are crucial to ensure that the summaries generated are accurate, cohesive and relevant. These metrics help quantify the quality of the language model’s work and improve it over time. Here are some of the most popular metrics used in evaluating text summarization performed by Transformers:

  • ROUGE
  • BLEU
  • BERTScore
  • METEOR

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is one of the most popular metrics for evaluating summarization quality. It measures the n-gram overlap between the generated summary and the reference summary. ROUGE is computed for different n-gram sizes (unigrams, bigrams, and so on) and for the longest common subsequence, and each variant is usually reported as recall, precision, and F1.
Here is an example of how to calculate ROUGE:

# the rouge package can be installed with: pip install rouge
from rouge import Rouge

# Define the generated summary and the reference summary
generated_summary = "Text summarization with Transformers is efficient for producing precise and relevant summaries."
reference_summary = "Text summarization with Transformers can be used to produce precise and relevant summaries."
# Initialize the ROUGE object
rouge = Rouge()
# Calculate ROUGE for the generated and reference summaries
scores = rouge.get_scores(generated_summary, reference_summary)
# Print the results
print(scores)

The result is a list containing, for each summary pair, a dictionary with recall (r), precision (p), and F1 (f) scores for ROUGE-1, ROUGE-2, and ROUGE-L:

[{'rouge-1': {'r': 0.6153846153846154, 'p': 0.6666666666666666, 'f': 0.6399999950080001}, 'rouge-2': {'r': 0.5, 'p': 0.5454545454545454, 'f': 0.5217391254442345}, 'rouge-l': {'r': 0.6153846153846154, 'p': 0.6666666666666666, 'f': 0.6399999950080001}}]

This example illustrates how easy it is to calculate ROUGE. The ROUGE metric is one of the many options available to evaluate the quality of text summarization performed by Transformers.

There are several variations of the ROUGE metric, including ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum. Each of these variations computes the similarity between the reference and generated word sequences a little differently.

  • ROUGE-1: also known as unigram overlap, measures the overlap of unigrams (individual words) between the generated summary and the reference summary. Its recall is the proportion of words in the reference summary that also appear in the generated summary. Example: Reference text: “The cat is on the rug” Generated text: “The dog is on the rug”. Five of the six reference words are matched, so ROUGE-1 recall = 5/6 ≈ 0.83 (see the sketch after this list).
  • ROUGE-2: also known as bigram overlap, measures the overlap of bigrams (pairs of consecutive words) between the generated summary and the reference summary. Its recall is the proportion of bigrams in the reference summary that also appear in the generated summary. For the same example, three of the five reference bigrams (“is on”, “on the”, “the rug”) are matched, so ROUGE-2 recall = 3/5 = 0.6.
  • ROUGE-L: measures the similarity between the word sequences of the generated summary and the reference summary using their longest common subsequence (LCS). Unlike ROUGE-1 and ROUGE-2, which use a simple n-gram counting approach, ROUGE-L rewards long in-order matches that do not need to be contiguous.
  • ROUGE-Lsum: is a variation of ROUGE-L that divides the generated summary and the reference summary into sentence units and measures the similarity between these sentence units.
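To make these definitions concrete, here is a minimal hand-rolled sketch (not the rouge library used above) that reproduces the recall figures from the cat/rug example for ROUGE-1, ROUGE-2, and ROUGE-L:

from collections import Counter

def rouge_n_recall(generated, reference, n=1):
    # ROUGE-N recall: clipped n-gram overlap divided by the number of reference n-grams
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    overlap = ngrams(generated, n) & ngrams(reference, n)
    return sum(overlap.values()) / sum(ngrams(reference, n).values())

def rouge_l_recall(generated, reference):
    # ROUGE-L recall: length of the longest common subsequence divided by the reference length
    g, r = generated.lower().split(), reference.lower().split()
    table = [[0] * (len(r) + 1) for _ in range(len(g) + 1)]
    for i, gw in enumerate(g, 1):
        for j, rw in enumerate(r, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if gw == rw else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1] / len(r)

reference = "The cat is on the rug"
generated = "The dog is on the rug"
print(rouge_n_recall(generated, reference, n=1))  # 5/6 = 0.833...
print(rouge_n_recall(generated, reference, n=2))  # 3/5 = 0.6
print(rouge_l_recall(generated, reference))       # LCS "the is on the rug" -> 5/6 = 0.833...

In practice you would rely on a library such as the rouge package shown earlier, but the hand computation makes the difference between the variants easy to see.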

In summary, the main difference between ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum is how they measure the overlap between the word sequences of the generated summary and the reference summary. While ROUGE-1 and ROUGE-2 use a simple n-gram counting approach, ROUGE-L and ROUGE-Lsum use a longest-common-subsequence matching approach.

Experiment with other metrics and find out which one works best for your project.

BLEU (Bilingual Evaluation Understudy)

BLEU is a metric originally created to evaluate the quality of machine translation from one language to another, but it can also be used to evaluate the quality of automatic text summarization. The BLEU metric compares the model-generated text to a reference text (the original text or, more commonly, a human-written reference summary) and assigns a score based on word overlap between the two texts.

The BLEU score ranges from 0 to 1 (it is often reported on a 0–100 scale); the closer to 1, the better the quality of the summary. BLEU is computed from the modified n-gram precisions between the generated text and the reference text, combined as a geometric mean and multiplied by a brevity penalty: the more n-grams in common, the higher the BLEU score.
Let’s take a look at a Python code example:

from torchtext.data.metrics import bleu_score

# define the reference summary and the summary generated by the model
reference_text = "This is an example message for summarizing."
generated_text = "This is an example message for automatic summarization."
# bleu_score expects tokenized input: a list of candidate token lists and,
# for each candidate, a list of tokenized reference texts
candidate_corpus = [generated_text.split()]
references_corpus = [[reference_text.split()]]
# calculate the BLEU score
score = bleu_score(candidate_corpus, references_corpus)
print(f'BLEU Score: {score*100:.2f}')
BLEU Score: 68.04

In this example, the reference summary is “This is an example message for summarizing.” and the summary generated by the model is “This is an example message for automatic summarization.”. We tokenize both with a simple split() and then calculate the BLEU score using the bleu_score function from the torchtext library.

In addition to the n-gram precisions, BLEU includes a brevity penalty (BP), which penalizes summaries that are too short. BP is computed from the ratio of the length of the model-generated summary to the length of the reference summary: it is 1 when the generated summary is at least as long as the reference, and drops below 1 when the generated summary is shorter than expected.
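A minimal sketch of this penalty (mirroring its standard definition rather than any particular library’s implementation):

import math

def brevity_penalty(candidate_len, reference_len):
    # 1.0 when the generated summary is at least as long as the reference,
    # otherwise an exponential penalty for summaries that are too short
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(8, 7))   # 1.0
print(brevity_penalty(5, 10))  # about 0.37 for a candidate half the reference length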

Overall, it’s important to choose the right evaluation metric for the summarization task at hand, and to remember that no metric is perfect. BLEU is just one of many metrics used to assess summary quality, and it’s important to understand its limitations and to combine several metrics to get a more complete picture of model quality.

BERTScore

BERTScore is a text evaluation metric based on BERT language models. It measures the similarity between two sentences by comparing the contextual BERT embeddings of their tokens and computing precision, recall, and the F1 score, which is the harmonic mean of precision and recall. Because it matches embeddings rather than exact words, BERTScore can handle word ambiguity and synonyms, which makes it especially useful for assessing the quality of summaries generated by Transformer models.
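Conceptually, each token of the generated summary is matched, by cosine similarity of its embedding, to the most similar token of the reference, and vice versa. The sketch below illustrates just this matching step on made-up embedding matrices; it omits details of the real implementation such as IDF weighting and baseline rescaling:

import numpy as np

# hypothetical contextual embeddings, one 768-dimensional vector per token
# (in practice these come from a BERT model)
cand_emb = np.random.randn(6, 768)   # 6 tokens in the generated summary
ref_emb = np.random.randn(9, 768)    # 9 tokens in the reference summary
cand_emb /= np.linalg.norm(cand_emb, axis=1, keepdims=True)
ref_emb /= np.linalg.norm(ref_emb, axis=1, keepdims=True)

sim = cand_emb @ ref_emb.T            # pairwise cosine similarities
precision = sim.max(axis=1).mean()    # best reference match for each generated token
recall = sim.max(axis=0).mean()       # best generated match for each reference token
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)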

BERTScore is easy to use and can be computed with the bert_score library in Python. First, install the library using the following command:

!pip install bert_score

You can then calculate the BERT score for a list of sentence pairs using the score function:

from bert_score import score

hypotheses = ["A brown fox jumps over a dog"]
references = ["A quick brown dog jumps over the lazy fox"]
# the lang argument selects the underlying model (required unless model_type is passed)
P, R, F1 = score(hypotheses, references, lang="en")
print(list(zip(P.tolist(), R.tolist(), F1.tolist())))

Here, hypotheses is a list of sentences generated by the Transformer model and references is a list of reference sentences. The printed result groups precision, recall, and F1 for each sentence pair:

[(0.7849645614624023, 0.9661419987678528, 0.8652906413078308)]

The score function returns three tensors, with one entry per sentence pair: the precision (P), the recall (R), and the F1 score. The higher the F1 score, the better the quality of the summary.
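When evaluating a whole test set, P, R, and F1 contain one entry per sentence pair; a simple way to get a single system-level figure (assuming the tensors returned by the call above) is to average them:

# average the per-sentence scores to report one system-level number
print(f"System-level BERTScore F1: {F1.mean().item():.4f}")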

In short, BERTScore compares summaries at the level of contextual embeddings rather than exact word matches, and the bert_score library makes it straightforward to compute for a list of sentence pairs.

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR is a metric that was originally developed to assess the quality of machine translations, but it can also be used to assess the quality of machine summaries. It evaluates the similarity between the generated summary and the reference summary, taking into account grammar and semantics.

METEOR is capable of handling different types of translation or summarization errors, such as word order problems, synonyms and ambiguities. It also takes into account the fluency of the text, that is, the ability of the generated text to be natural and well-written.

METEOR scores are given on a scale of 0 to 1, with higher values indicating greater similarity between the generated summary and the reference summary. However, it is important to remember that the interpretation of the score depends on the data set and the context in which the assessment is being carried out.

To calculate METEOR, it is necessary to have one or more reference summaries for each input text, and the texts must be tokenized; stemming and synonym matching (via WordNet) are handled internally by the metric. The following is a Python code example to calculate METEOR using the nltk package:

import nltk
# WordNet is needed for METEOR's synonym matching
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.translate.meteor_score import meteor_score

# a list of tokenized reference summaries and a tokenized generated summary
reference = [['This', 'is', 'a', 'reference', 'summary']]
generated = ['This', 'is', 'a', 'generated', 'summary']
score = meteor_score(reference, generated)
print(score)
0.7500000000000001

In this example, we have a reference summary (reference) and a generated summary (generated). The meteor_score function from the nltk package takes these two parameters and returns the METEOR score.
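The function can also take several reference summaries for the same generated summary and keeps the score of the best-matching one, which is useful when more than one human-written reference is available. A minimal sketch, reusing the generated list from the example above:

# with several references, meteor_score returns the score of the best match
references = [
    ['This', 'is', 'a', 'reference', 'summary'],
    ['This', 'is', 'another', 'reference', 'summary'],
]
print(meteor_score(references, generated))  # still 0.75, from the closer reference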

It is important to emphasize that METEOR should not be used as the only evaluation metric. It should be used in conjunction with other metrics to provide a more complete and accurate assessment of the quality of summaries generated by Transformers models.

In summary, METEOR is a useful metric to evaluate the quality of summaries generated by Transformers models. It takes into account the grammar and semantics of the text and is able to deal with different types of errors. The use of metrics such as METEOR is essential to ensure that the generated summaries are relevant and consistent with the original text.

Conclusion

When using these metrics, it’s important to remember that none of them are perfect, and that human judgment is still essential to ensure the quality of summaries generated by Transformers. However, these metrics can help you measure and improve the quality of your summaries and ensure they are accurate and relevant to your users.


Written by Fabiano Falcão

Software Engineer at Brazilian Chamber of Deputies / Artificial Intelligence Research at University of Brasília (UnB)