How are language models evaluated? In traditional machine learning we have metrics like accuracy, F1-score, precision, recall, and so on. But how can you objectively score a model when the label is "I like to drink coffee over tea" and the model's output is "I prefer coffee to tea"? As humans, we can clearly see that these two sentences have the same meaning, but how can a machine make the same judgement?
Well, there are two approaches –
- BLEU – Bilingual Evaluation Understudy – is a metric used to evaluate the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the references based on the n-grams (contiguous sequences of n words) present in both. The BLEU score ranges from 0 to 1, with a higher score indicating a better match: a score of 1 means a perfect match, while a score of 0 means no n-gram overlap between the generated and reference translations.
- ROUGE – Recall-Oriented Understudy for Gisting Evaluation – is a widely used metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries by comparing the n-gram units (such as words or sequences of words) in each. ROUGE reports precision and recall, but the emphasis is on recall, which measures how much of the important information from the reference summaries is captured by the generated summary. Both metrics boil down to counting shared n-grams, as the short sketch after this list shows.
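To make the n-gram idea concrete, here is a minimal, purely illustrative sketch (the bigrams helper below is not part of either metric's official implementation) applied to the coffee/tea sentences from the introduction. It shows why surface-level overlap can score a semantically correct answer poorly:

def bigrams(sentence):
    # Lowercase, split on whitespace, and slide a window of two words
    words = sentence.lower().split()
    return [tuple(words[i:i+2]) for i in range(len(words) - 1)]

label = "I like to drink coffee over tea"
output = "I prefer coffee to tea"

print(bigrams(label))   # [('i', 'like'), ('like', 'to'), ('to', 'drink'), ...]
print(bigrams(output))  # [('i', 'prefer'), ('prefer', 'coffee'), ('coffee', 'to'), ('to', 'tea')]
print(set(bigrams(label)) & set(bigrams(output)))  # set() – no shared bigrams, despite the same meaning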
Let us take an example and calculate both metrics. Suppose the label is "The cat sat on the mat." and the model's output is "The dog slept on the couch."
Here is the Python code to calculate the ROUGE score –
import re

def calculate_ROUGE(generated_summary, reference_summary, n):
    # Tokenize the generated summary and reference summary into n-grams
    generated_ngrams = generate_ngrams(generated_summary, n)
    reference_ngrams = generate_ngrams(reference_summary, n)
    # Calculate the recall score
    matching_ngrams = len(set(generated_ngrams) & set(reference_ngrams))
    recall_score = matching_ngrams / len(reference_ngrams)
    return recall_score

def generate_ngrams(text, n):
    # Preprocess text by removing punctuation and converting to lowercase
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Generate n-grams from the preprocessed text
    words = text.split()
    ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
    return ngrams

# Example usage
generated_summary = "The dog slept on the couch."
reference_summary = "The cat sat on the mat."
n = 2  # bigram
rouge_score = calculate_ROUGE(generated_summary, reference_summary, n)
print(f"ROUGE-{n} score: {rouge_score}")
>> ROUGE-2 score: 0.2
If we use n = 2, i.e. bigrams, the ROUGE-2 score is 0.2.
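The function above is a simplified ROUGE-N recall. As a sanity check, you can compare it against the rouge-score package (assuming it is installed, e.g. via pip install rouge-score), which implements the standard variant and also reports precision and F1:

from rouge_score import rouge_scorer

# score(target, prediction) returns precision, recall and F1 for each ROUGE type
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
scores = scorer.score("The cat sat on the mat.", "The dog slept on the couch.")

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f}, recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
# The rouge2 recall should agree with the 0.2 we computed by hand above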
Similarly, let's calculate the BLEU score –
import nltk.translate.bleu_score as bleu

def calculate_BLEU(generated_summary, reference_summary, n):
    # Tokenize the generated summary and reference summary
    generated_tokens = generated_summary.split()
    reference_tokens = reference_summary.split()
    # Calculate the BLEU score
    weights = [1.0 / n] * n  # Equal weights for the n-gram precision calculation
    bleu_score = bleu.sentence_bleu([reference_tokens], generated_tokens, weights=weights)
    return bleu_score

# Example usage
generated_summary = "The dog slept on the couch."
reference_summary = "The cat sat on the mat."
n = 2  # Bigram
bleu_score = calculate_BLEU(generated_summary, reference_summary, n)
print(f"BLEU-{n} score: {bleu_score}")
>> BLEU-2 score: 0.316227766016838
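One practical caveat with sentence_bleu: with the default BLEU-4 weights, short sentences often share no 3-grams or 4-grams at all, so the geometric mean collapses to effectively zero and NLTK emits a warning. Passing one of NLTK's smoothing functions avoids this; here is a minimal sketch on the same pair of sentences:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference_tokens = "The cat sat on the mat.".split()
generated_tokens = "The dog slept on the couch.".split()

smoother = SmoothingFunction().method1  # Chen & Cherry smoothing, method 1
raw = sentence_bleu([reference_tokens], generated_tokens)  # default BLEU-4: effectively zero here
smoothed = sentence_bleu([reference_tokens], generated_tokens, smoothing_function=smoother)
print(f"BLEU-4 without smoothing: {raw}")
print(f"BLEU-4 with smoothing: {smoothed}")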
Coming back to the example: the two approaches give us two different scores for the same pair of sentences, because BLEU rewards n-gram precision while the ROUGE variant computed here measures n-gram recall.
The evaluation metric you choose for your fine-tuned LLM will depend on the task at hand, but usually BLEU is used for machine translation and ROUGE is used for summarisation tasks.