Tag: GPT-4

  • Is Llama 3 Really Better Than Mistral?

    With the recent launch of the much-anticipated Llama 3, I decided to compare Mistral, one of the best small (7B) language models out there, with Llama 3, which according to its benchmark scores outperforms Mistral. But is it really better when used as the LLM in your RAG application? To test this, I put the same question to both models, and the results may surprise you.

    Link to Colab

    I created a RAG application using Ollama; if you want to build one yourself, you can check out this post. I used the Elden Ring Wikipedia article as the document for contextual retrieval, together with conversation buffer memory, which simply passes the entire conversation history back to the language model as context. I then asked the same question to both LLMs and, at the end, to the current king of LLMs, GPT-4. The question was –

    "How many awards did Elden Ring Win, and did it win Game of the year award ?"

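    Under the hood, the chain can be wired up roughly as below. This is a minimal sketch rather than the notebook's exact code: it assumes LangChain with the community integrations, faiss-cpu, a local Ollama server with the mistral and llama3 models pulled, and the Wikipedia article saved as elden_ring.txt.

    from langchain_community.llms import Ollama
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.document_loaders import TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    from langchain.memory import ConversationBufferMemory
    from langchain.chains import ConversationalRetrievalChain

    # Load the Elden Ring article and split it into chunks for retrieval
    docs = TextLoader("elden_ring.txt").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

    # Build a vector store and retriever over local Ollama embeddings
    retriever = FAISS.from_documents(chunks, OllamaEmbeddings(model="mistral")).as_retriever()

    # Conversation buffer memory: the full chat history is passed back to the model as context
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

    # Swap model="mistral" for model="llama3" to run the same test against Llama 3 8B
    chain = ConversationalRetrievalChain.from_llm(
        llm=Ollama(model="mistral"),
        retriever=retriever,
        memory=memory,
    )

    result = chain.invoke({"question": "How many awards did Elden Ring win, "
                                       "and did it win the Game of the Year award?"})
    print(result["answer"])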
    The entire prompt with the context was –

    Be precise in your response. Given the context - Elden Ring winning Game of the Year at the 23rd Game
    Developers Choice Awards. Some reviewers criticized a number of the game's menu and accessibility
    systems.[84][85] Reviewers complained about the poor performance of the Windows version; framerate
    issues were commonly mentioned.[81][86] Reviewers noted the story of Elden Ring lacks Martin's writing
    style. Kyle Orland of Ars Technica said the game's storytelling is "characteristically sparse and cryptic", and
    differs from the expectations of Martin's fans.[76] Chris Carter of Destructoid called the story "low key" but
    said it is better-told than those of previous FromSoftware games.[80] Aoife Wilson of Eurogamer said
    George R. R. Martin's heavy inclusion in the marketing was "baffling" when his contributions to the overall
    narrative were unclear.[72] Mitchell Saltzman did not mind the lack of Martin's style, saying the side-stories
    rather than any grand, overarching plot kept him "enthralled".[70]

    120. Mejia, Ozzie (January 26, 2023). "Elden Ring & Stray lead Game Developers Choice Awards 2023 nominees" (https://www.shacknews.com/article/133863/gdc-2023-award-nominees). Shacknews. Archived (https://web.archive.org/web/20230127040625/https://www.shacknews.com/article/133863/gdc-2023-award-nominees) from the original on January 27, 2023. Retrieved January 27, 2023.
    121. Beth Elderkin (March 22, 2023). "'Elden Ring' Wins Game Of The Year At The 2023 Game Developers Choice Awards" (https://gdconf.com/news/elden-ring-wins-game-year-2023-game-developers-choice-awards). Game Developers Choice Conference. Archived (https://web.archive.org/web/20230323091858/https://gdconf.com/news/elden-ring-wins-game-year-2023-game-developers-choice-awards) from the original on March 23, 2023. Retrieved March 23, 2023.
    122. "gamescom award 2021: These were the best games of the year" (https://www.gamescom.global/en/gamescom/gamescom-award/gamescom-award-review-2021). Gamescom.

    Tin Pan Alley Award for Best Music in a Game – Nominated
    The Steam Awards (January 3, 2023): Game of the Year – Won,[132] Best Game You Suck At – Won
    The Streamer Awards (March 11, 2023): Stream Game of the Year – Won[133]

    1. Sawyer, Will; Franey, Joel (April 8, 2022). "Where Elden Ring takes place and the story explained" (https://www.gamesradar.com/elden-ring-where-does-it-take-place-setting-story-lore/). gamesradar. Archived (https://web.archive.org/web/20220402212714/https://www.gamesradar.com/elden-ring-where-does-it-take-place-setting-story-lore/) from the original on April 2, 2022. Retrieved July 26, 2022.
    2. Knapp, Mark (June 16, 2021). "Elden Ring: Release Date, Gameplay, and What We Know So Far" (https://www.ign.com/articles/elden-ring-release-date-news-gameplay-trailer-story). IGN. Ziff Davis. Archived (https://web.archive.org/web/20220303124310/https://www.ign.com/articles/elden-ring-release-date-news-gameplay-trailer-story) from the original on March 3, 2022. Retrieved July 6, 2021.

    Elden Ring sold 13.4 million copies worldwide by the end of March 2022,[87] and 20 million by February
    2023.[88] It was the best-selling game in several regions between February and March 2022,[89][90] and is
    the fastest-selling Bandai Namco game of all time.[91][92] It was the second-bestselling game of 2022 in the
    US after Call of Duty: Modern Warfare II,[93] the third-bestselling in Europe,[94] and the tenth-bestselling in
    Japan at retail.[95]
    Elden Ring won many awards and accolades. It was listed as 2022's Game of the Year by several
    publications, including Ars Technica,[96] Destructoid,[97] EGM,[98] Eurogamer,[99] Game Informer,[100]
    GamesRadar+,[101] GameSpot,[102] IGN,[103] PC Gamer,[104] and Polygon.[105] It is among the most
    awarded games in history with 324 Game of the Year awards.[106][107] Sales Awards, Answer the
    following question - How many awards did Elden Ring Win, and did it win Game of the year award ?.
    
    

    Mistral

    1. "Elden Ring" (https://www.metacritic.com/game/pc/elden-ring). Metacritic. Archived (https://web.
    archive.org/web/20220402212714/https://www.metacritic.com/game/pc/elden-ring) from the
    original on April 2, 2022. Retrieved April 2, 2022.
    2. "Elden Ring" (https://www.met

    Llama 3 – 8B

    Elden Ring winning Game of the Year at the 23rd Game Developers Choice Awards. Some reviewers
    criticized a number of the game's menu and accessibility systems.[84][85] Reviewers complained about the
    poor performance of the Windows version; framerate issues were commonly mentioned.[81][86] Reviewers
    noted the story of Elden Ring lacks Martin's writing style. Kyle Orland of Ars Technica said the game's
    storytelling is "characteristically sparse and cryptic", and differs from the expectations of Martin's fans.[76]
    Chris Carter

    Mistral simply regurgitated part of the retrieved reference list instead of answering the question. Llama 3 – 8B got one part right – that Elden Ring won Game of the Year at the Game Developers Choice Awards – but from the context it could not work out that the game won 324 Game of the Year awards in total.

    GPT-4

    'Elden Ring won 324 Game of the Year awards. Yes, it did win the Game of the Year award at the 23rd Game Developers Choice Awards.'

    GPT-4 is still far ahead of these smaller LLMs, but Llama 3 8B is a clear improvement over Mistral.

  • GPT-4 Vision API – How to Guide

    At its developer conference, OpenAI announced the GPT-4 Vision API. With access to it, you can build a wide range of tools with the GPT-4 Turbo vision model as the engine, with use cases ranging from information retrieval to classification.

    In this article, we will go over how to use the Vision API, how to pass multiple images in a single request, and some tricks you should use to improve the response.

    First, you need an OpenAI account with billing enabled and some credits to use this API. Unlike ChatGPT, you are charged per token rather than a flat fee, so be careful with your experiments.

    The API –

    The API consists of two parts –

    1. Header – where you pass your authentication key and, optionally, your organisation ID.
    2. Payload – where the meat of your request lies. The image can be passed either as a URL or as a base64-encoded string; I prefer the latter.
    import base64

    # Function to encode an image file as a base64 string
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    # Path to your image
    image_path = "./sample.png"

    # Getting the base64 string
    base64_image = encode_image(image_path)

    Let’s look at the API format

    import requests

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY_HERE}"
    }

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": <user or system>,
                "content": [
                    {
                        "type": <text or image_url>,
                        "text or image_url": <text or image_url>
                    }
                ]
            }
        ],
        "max_tokens": <max tokens here>
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

    Let’s take an example.
    Suppose I want a request with a system prompt and a user prompt that together extract JSON output from an image. My payload will look like this.

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            # First, define the system prompt
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a system that always extracts information from an image in a json_format"
                    }
                ]
            },
            # Then define the user prompt
            {
                "role": "user",
                # Under the user prompt, I pass two content items: one text and one image
                "content": [
                    {
                        "type": "text",
                        "text": """Extract the grades from this image in a structured format. Only return the output.
                                   ```
                                   [{"subject": "<subject>", "grade": "<grade>"}]
                                   ```"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 500  # Return no more than 500 completion tokens
    }
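
    To get the reply, send the payload to the chat completions endpoint as in the template above and read the message content out of the JSON response. A minimal sketch, assuming headers and payload are defined as above:

    import requests

    # Send the request and pull the model's reply out of the JSON response
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    reply = response.json()["choices"][0]["message"]["content"]
    print(reply)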

    The response I get from the API is exactly what I wanted.

    ```json
    [
      {"subject": "English", "grade": "A+"},
      {"subject": "Math", "grade": "B-"},
      {"subject": "Science", "grade": "B+"},
      {"subject": "History", "grade": "C+"}
    ]
    ```
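
    The API also accepts more than one image per request: you simply append additional image_url entries to the user message’s content list. Here is a minimal sketch; the two report-card images and the question are hypothetical, and base64_image_1 and base64_image_2 are assumed to come from encode_image above.

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Are the grades in these two report cards the same?"},
                    # Each image goes in as its own image_url content item
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_1}"}},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_2}"}}
                ]
            }
        ],
        "max_tokens": 300
    }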

    This is just an example of how, with the right prompt, we can build an information retrieval system over images using the Vision API.

    In the next article, we will build an image classifier using just this API, with no machine learning knowledge required.

  • Large Language Model (LLM) Evaluation Metrics – BLEU and ROUGE

    How are language models evaluated? In traditional machine learning, we have metrics like accuracy, F1-score, precision, and recall. But how can you objectively measure how a model performed when the label is "I like to drink coffee over tea" and the model’s output is "I prefer coffee to tea"? As humans, we can clearly see that these two have the same meaning, but how can a machine make the same evaluation?

    Well, there are two approaches –

    1. BLEU – Bilingual Evaluation Understudy is a metric used to evaluate the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the reference translations based on the n-grams (contiguous sequences of n words) present in both. BLEU score ranges from 0 to 1, with a higher score indicating a better match between the generated translation and the references. A score of 1 means a perfect match, while a score of 0 means no overlap between the generated and reference translations.
    2. ROUGE – Recall-Oriented Understudy for Gisting Evaluation is a widely used evaluation metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries. ROUGE calculates precision and recall scores by comparing the n-gram units (such as words or sequences of words) in the generated summary with those in the reference summaries. It focuses on the recall score, which measures how much of the important information from the reference summaries is captured by the generated summary.

    Let us take an example and calculate both metrics. Suppose the label is "The cat sat on the mat." and the model’s output is "The dog slept on the couch."

    Here is the Python code to calculate the ROUGE score –

    import re
    
    
    def calculate_ROUGE(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary into n-grams
        generated_ngrams = generate_ngrams(generated_summary, n)
        reference_ngrams = generate_ngrams(reference_summary, n)
    
        # Calculate the recall score
        matching_ngrams = len(set(generated_ngrams) & set(reference_ngrams))
        recall_score = matching_ngrams / len(reference_ngrams)
    
        return recall_score
    
    
    def generate_ngrams(text, n):
        # Preprocess text by removing punctuation and converting to lowercase
        text = re.sub(r'[^\w\s]', '', text.lower())
    
        # Generate n-grams from the preprocessed text
        words = text.split()
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
    
        return ngrams
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # bigram
    
    rouge_score = calculate_ROUGE(generated_summary, reference_summary, n)
    print(f"ROUGE-{n} score: {rouge_score}")
    >> ROUGE-2 score: 0.2
    

    If we use n = 2, i.e. bigrams, the ROUGE-2 score is 0.2.

    Similarly, let’s calculate the BLEU score –

    import nltk.translate.bleu_score as bleu
    
    
    def calculate_BLEU(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary
        generated_tokens = generated_summary.split()
        reference_tokens = reference_summary.split()
    
        # Calculate the BLEU score
        weights = [1.0 / n] * n  # Weights for n-gram precision calculation
        bleu_score = bleu.sentence_bleu([reference_tokens], generated_tokens, weights=weights)
    
        return bleu_score
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # Bigram
    
    bleu_score = calculate_BLEU(generated_summary, reference_summary, n)
    print(f"BLEU-{n} score: {bleu_score}")
    >> BLEU-2 score: 0.316227766016838
    

    So, we get two different scores from these two different approaches.

    The evaluation metric you choose to evaluate or fine-tune your LLM will depend on the task at hand, but typically BLEU is used for machine translation and ROUGE for summarisation tasks.
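
    In practice, you would usually reach for existing implementations rather than hand-rolling the metrics. Here is a minimal sketch assuming the rouge-score and nltk packages are installed; the smoothing function is an extra choice on my part to avoid zero BLEU scores on very short texts.

    import nltk.translate.bleu_score as bleu
    from rouge_score import rouge_scorer

    reference = "The cat sat on the mat."
    generated = "The dog slept on the couch."

    # ROUGE-2 recall using Google's rouge-score package
    scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    print(scorer.score(reference, generated)["rouge2"].recall)

    # BLEU-2 using NLTK, with smoothing for short sentences
    smoothie = bleu.SmoothingFunction().method1
    print(bleu.sentence_bleu([reference.split()], generated.split(),
                             weights=(0.5, 0.5), smoothing_function=smoothie))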