Large Language Model (LLM) Evaluation Metrics – BLEU and ROUGE

How are language models evaluated? In traditional machine learning we have metrics like accuracy, precision, recall and F1-score. But how do you objectively score a model when the label is "I like to drink coffee over tea" and the model's output is "I prefer coffee to tea"? As humans, we can clearly see that these two sentences have the same meaning, but how can a machine make the same judgement?

Well, there are two approaches –

  1. BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the references based on the n-grams (contiguous sequences of n words) present in both. BLEU is precision-oriented: it asks how many of the n-grams in the generated output also appear in the references, and multiplies the result by a brevity penalty so that very short outputs are not rewarded. The score ranges from 0 to 1, with a higher score indicating a better match; a score of 1 means a perfect match, while a score of 0 means no overlap between the generated and reference translations.
  2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a widely used metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries by comparing their n-gram units (such as words or sequences of words), and can report both precision and recall. Its emphasis is on recall: how much of the important information in the reference summaries is captured by the generated summary. Both scores are written out as formulas just after this list.
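
In formulas (this is the standard textbook formulation, not tied to any particular implementation), ROUGE-N is the clipped n-gram recall, while BLEU is a brevity-penalized geometric mean of clipped n-gram precisions:

\[
\text{ROUGE-N} = \frac{\sum_{g \in \text{ref}} \min\big(\mathrm{count}_{\mathrm{gen}}(g),\ \mathrm{count}_{\mathrm{ref}}(g)\big)}{\sum_{g \in \text{ref}} \mathrm{count}_{\mathrm{ref}}(g)}
\qquad
\text{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{k=1}^{N} w_k \log p_k\Big)
\]

where the sums run over the distinct n-grams \(g\) of the reference, \(p_k\) is the clipped k-gram precision of the generated text, \(w_k\) are weights (usually uniform, \(w_k = 1/N\)), and \(\mathrm{BP} = \min\big(1,\ e^{1 - r/c}\big)\) is the brevity penalty for a candidate of length \(c\) scored against a reference of length \(r\).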

Let us take an example and calculate both metrics. Suppose the label is "The cat sat on the mat." and the model's output is "The dog slept on the couch."

Here is the Python code to calculate the ROUGE score –

from collections import Counter
import re


def calculate_ROUGE(generated_summary, reference_summary, n):
    # Tokenize the generated summary and reference summary into n-grams
    generated_ngrams = generate_ngrams(generated_summary, n)
    reference_ngrams = generate_ngrams(reference_summary, n)

    # Count overlapping n-grams, clipping each count at the number of times
    # the n-gram appears in the reference (so repeated n-grams are handled correctly)
    generated_counts = Counter(generated_ngrams)
    reference_counts = Counter(reference_ngrams)
    matching_ngrams = sum(
        min(count, reference_counts[ngram])
        for ngram, count in generated_counts.items()
    )

    # Recall: the fraction of reference n-grams captured by the generated summary
    recall_score = matching_ngrams / len(reference_ngrams)

    return recall_score


def generate_ngrams(text, n):
    # Preprocess text by removing punctuation and converting to lowercase
    text = re.sub(r'[^\w\s]', '', text.lower())

    # Generate n-grams from the preprocessed text
    words = text.split()
    ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]

    return ngrams


# Example usage
generated_summary = "The dog slept on the couch."
reference_summary = "The cat sat on the mat."
n = 2  # bigram

rouge_score = calculate_ROUGE(generated_summary, reference_summary, n)
print(f"ROUGE-{n} score: {rouge_score}")
>> ROUGE-2 score: 0.2

If we use n = 2 (bigrams), the ROUGE-2 score is 0.2: only one of the five reference bigrams ("on the") also appears in the generated sentence.
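
The same function can be reused for other n-gram sizes. With n = 1, three of the six reference unigrams ("the" twice and "on") are matched, so the expected ROUGE-1 recall is 0.5:

# ROUGE-1 (unigram recall) on the same pair of sentences
rouge_1 = calculate_ROUGE(generated_summary, reference_summary, 1)
print(f"ROUGE-1 score: {rouge_1}")
>> ROUGE-1 score: 0.5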

Similarly, let's calculate the BLEU score –

import nltk.translate.bleu_score as bleu


def calculate_BLEU(generated_summary, reference_summary, n):
    # Tokenize with a simple whitespace split (punctuation stays attached to the words)
    generated_tokens = generated_summary.split()
    reference_tokens = reference_summary.split()

    # Calculate the BLEU score; sentence_bleu expects a list of reference token lists
    weights = [1.0 / n] * n  # Equal weights for 1-gram up to n-gram precision
    bleu_score = bleu.sentence_bleu([reference_tokens], generated_tokens, weights=weights)

    return bleu_score


# Example usage
generated_summary = "The dog slept on the couch."
reference_summary = "The cat sat on the mat."
n = 2  # Bigram

bleu_score = calculate_BLEU(generated_summary, reference_summary, n)
print(f"BLEU-{n} score: {bleu_score}")
>> BLEU-2 score: 0.316227766016838

So we get two different scores from the two approaches: ROUGE-2 reports bigram recall (1 of the 5 reference bigrams is matched, giving 0.2), while BLEU-2 is the geometric mean of the unigram and bigram precisions of the generated sentence (3/6 and 1/5, giving √0.1 ≈ 0.316).
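
A practical caveat with BLEU on short sentences: once the weights include higher-order n-grams that have no matches at all (here there are no common 3-grams or 4-grams), the geometric mean collapses to zero. NLTK ships smoothing functions for exactly this case; below is a minimal sketch, where the choice of method1 is just one of several available options:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# method1 adds a small count to n-gram orders with zero matches
smoother = SmoothingFunction().method1

smoothed_bleu_4 = sentence_bleu(
    [reference_summary.split()],
    generated_summary.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=smoother,
)
print(f"Smoothed BLEU-4 score: {smoothed_bleu_4}")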

The evaluation metric you choose to fine-tune your LLM will depend on the task at hand, but usually BLEU is used for machine translation and ROUGE for summarization tasks.
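
For real projects you would normally reach for an established implementation rather than hand-rolled counting. Here is a minimal sketch, assuming Google's rouge-score package is installed (pip install rouge-score) alongside NLTK:

from rouge_score import rouge_scorer
from nltk.translate.bleu_score import corpus_bleu

reference = "The cat sat on the mat."
candidate = "The dog slept on the couch."

# ROUGE: the scorer reports precision, recall and F-measure for each requested variant
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # arguments are (target, prediction)
print(scores["rouge2"].recall)

# BLEU: corpus_bleu aggregates n-gram counts over a whole test set before averaging
list_of_references = [[reference.split()]]  # one list of reference token lists per candidate
hypotheses = [candidate.split()]
print(corpus_bleu(list_of_references, hypotheses, weights=(0.5, 0.5)))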

Comments

  1. Anonymous

     I think the ROUGE score calculation contains one problem. The set does not keep duplicates, so when calculating it with two sentences that are exactly the same, it will never be one. One correction would be: matching_ngrams = sum(1 for ng in generated_ngrams if ng in reference_ngrams)

     1. sahaymaniceet

        Thanks for pointing out the mistake, I'll check and update the article.