Tag: GPT-4

  • Is Llama 3 Really Better Than Mistral?

    With the recent launch of the much-anticipated Llama 3, I decided to compare Mistral, one of the best small (7B) language models out there, with Llama 3, which according to its benchmark scores outperforms Mistral. But is it really better when used as the LLM in your RAG application? To test this, I put the same question to both models, and the results may surprise you.

    Link to Colab

    I created a RAG application using Ollama; if you want to build one yourself, you can check out this post. I used the Elden Ring Wikipedia article as the document for contextual retrieval, together with conversation buffer memory, which simply passes the entire conversation history back to the language model as context. I then asked the same question to both LLMs and, at the end, to the current king of LLMs, GPT-4. The question was –

    "How many awards did Elden Ring Win, and did it win Game of the year award ?"

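    Under the hood, the chain can be wired up roughly as below. This is a minimal sketch rather than the notebook's exact code: it assumes LangChain with the community integrations, faiss-cpu, a local Ollama server with the mistral and llama3 models pulled, and the Wikipedia article saved as elden_ring.txt.

    from langchain_community.llms import Ollama
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.document_loaders import TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    from langchain.memory import ConversationBufferMemory
    from langchain.chains import ConversationalRetrievalChain

    # Load the Elden Ring article and split it into chunks for retrieval
    docs = TextLoader("elden_ring.txt").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

    # Build a vector store and retriever over local Ollama embeddings
    retriever = FAISS.from_documents(chunks, OllamaEmbeddings(model="mistral")).as_retriever()

    # Conversation buffer memory: the full chat history is passed back to the model as context
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

    # Swap model="mistral" for model="llama3" to run the same test against Llama 3 8B
    chain = ConversationalRetrievalChain.from_llm(
        llm=Ollama(model="mistral"),
        retriever=retriever,
        memory=memory,
    )

    result = chain.invoke({"question": "How many awards did Elden Ring win, "
                                       "and did it win the Game of the Year award?"})
    print(result["answer"])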
    The entire prompt with the context was –

    Be precise in your response. Given the context - Elden Ring winning Game of the Year at the 23rd Game
    Developers Choice Awards. Some reviewers criticized a number of the game's menu and accessibility
    systems.[84][85] Reviewers complained about the poor performance of the Windows version; framerate
    issues were commonly mentioned.[81][86] Reviewers noted the story of Elden Ring lacks Martin's writing
    style. Kyle Orland of Ars Technica said the game's storytelling is "characteristically sparse and cryptic", and
    differs from the expectations of Martin's fans.[76] Chris Carter of Destructoid called the story "low key" but
    said it is better-told than those of previous FromSoftware games.[80] Aoife Wilson of Eurogamer said
    George R. R. Martin's heavy inclusion in the marketing was "baffling" when his contributions to the overall
    narrative were unclear.[72] Mitchell Saltzman did not mind the lack of Martin's style, saying the side-stories
    rather than any grand, overarching plot kept him "enthralled".[70]

    120. Mejia, Ozzie (January 26, 2023). "Elden Ring & Stray lead Game Developers Choice Awards 2023 nominees" (https://www.shacknews.com/article/133863/gdc-2023-award-nominees). Shacknews. Archived (https://web.archive.org/web/20230127040625/https://www.shacknews.com/article/133863/gdc-2023-award-nominees) from the original on January 27, 2023. Retrieved January 27, 2023.
    121. Beth Elderkin (March 22, 2023). "'Elden Ring' Wins Game Of The Year At The 2023 Game Developers Choice Awards" (https://gdconf.com/news/elden-ring-wins-game-year-2023-game-developers-choice-awards). Game Developers Choice Conference. Archived (https://web.archive.org/web/20230323091858/https://gdconf.com/news/elden-ring-wins-game-year-2023-game-developers-choice-awards) from the original on March 23, 2023. Retrieved March 23, 2023.
    122. "gamescom award 2021: These were the best games of the year" (https://www.gamescom.global/en/gamescom/gamescom-award/gamescom-award-review-2021). Gamescom.

    Tin Pan Alley Award for Best Music in a Game – Nominated
    The Steam Awards (January 3, 2023): Game of the Year – Won,[132] Best Game You Suck At – Won
    The Streamer Awards (March 11, 2023): Stream Game of the Year – Won[133]

    1. Sawyer, Will; Franey, Joel (April 8, 2022). "Where Elden Ring takes place and the story explained" (https://www.gamesradar.com/elden-ring-where-does-it-take-place-setting-story-lore/). gamesradar. Archived (https://web.archive.org/web/20220402212714/https://www.gamesradar.com/elden-ring-where-does-it-take-place-setting-story-lore/) from the original on April 2, 2022. Retrieved July 26, 2022.
    2. Knapp, Mark (June 16, 2021). "Elden Ring: Release Date, Gameplay, and What We Know So Far" (https://www.ign.com/articles/elden-ring-release-date-news-gameplay-trailer-story). IGN. Ziff Davis. Archived (https://web.archive.org/web/20220303124310/https://www.ign.com/articles/elden-ring-release-date-news-gameplay-trailer-story) from the original on March 3, 2022. Retrieved July 6, 2021.

    Elden Ring sold 13.4 million copies worldwide by the end of March 2022,[87] and 20 million by February
    2023.[88] It was the best-selling game in several regions between February and March 2022,[89][90] and is
    the fastest-selling Bandai Namco game of all time.[91][92] It was the second-bestselling game of 2022 in the
    US after Call of Duty: Modern Warfare II,[93] the third-bestselling in Europe,[94] and the tenth-bestselling in
    Japan at retail.[95]
    Elden Ring won many awards and accolades. It was listed as 2022's Game of the Year by several
    publications, including Ars Technica,[96] Destructoid,[97] EGM,[98] Eurogamer,[99] Game Informer,[100]
    GamesRadar+,[101] GameSpot,[102] IGN,[103] PC Gamer,[104] and Polygon.[105] It is among the most
    awarded games in history with 324 Game of the Year awards.[106][107] Sales Awards, Answer the
    following question - How many awards did Elden Ring Win, and did it win Game of the year award ?.
    
    

    Mistral

    1. "Elden Ring" (https://www.metacritic.com/game/pc/elden-ring). Metacritic. Archived (https://web.
    archive.org/web/20220402212714/https://www.metacritic.com/game/pc/elden-ring) from the
    original on April 2, 2022. Retrieved April 2, 2022.
    2. "Elden Ring" (https://www.met

    Llama 3 – 8B

    Elden Ring winning Game of the Year at the 23rd Game Developers Choice Awards. Some reviewers
    criticized a number of the game's menu and accessibility systems.[84][85] Reviewers complained about the
    poor performance of the Windows version; framerate issues were commonly mentioned.[81][86] Reviewers
    noted the story of Elden Ring lacks Martin's writing style. Kyle Orland of Ars Technica said the game's
    storytelling is "characteristically sparse and cryptic", and differs from the expectations of Martin's fans.[76]
    Chris Carter

    Mistral simply regurgitated part of the retrieved reference list instead of answering the question. Llama 3 – 8B got one part right – that Elden Ring won Game of the Year at the Game Developers Choice Awards – but from the context it could not work out that the game won 324 Game of the Year awards in total.

    GPT-4

    'Elden Ring won 324 Game of the Year awards. Yes, it did win the Game of the Year award at the 23rd Game Developers Choice Awards.'

    GPT-4 is still far ahead of these smaller LLMs, but Llama 3 8B is a clear improvement over Mistral.

  • GPT-4 Vision API – How to Guide

    At its developer conference, OpenAI announced the GPT-4 Vision API. With access to it, you can build a wide range of tools with the GPT-4 Turbo vision model as the engine, with use cases ranging from information retrieval to classification.

    In this article, we will go over how to use the Vision API, how to pass multiple images in a single request, and some tricks you should use to improve the response.

    First, you need an OpenAI account with billing enabled and some credits to use this API. Unlike ChatGPT, you are charged per token rather than a flat fee, so be careful with your experiments.

    The API –

    The API consists of two parts –

    1. Header – where you pass your authentication key and, optionally, your organisation ID.
    2. Payload – where the meat of your request lies. The image can be passed either as a URL or as a base64-encoded string; I prefer the latter.
    import base64

    # Function to encode an image file as a base64 string
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    # Path to your image
    image_path = "./sample.png"

    # Getting the base64 string
    base64_image = encode_image(image_path)

    Let’s look at the API format

    import requests

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY_HERE}"
    }

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": <user or system>,
                "content": [
                    {
                        "type": <text or image_url>,
                        "text or image_url": <text or image_url>
                    }
                ]
            }
        ],
        "max_tokens": <max tokens here>
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

    Let’s take an example.
    Suppose I want a request with a system prompt and a user prompt that together extract JSON output from an image. My payload will look like this.

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            # First, define the system prompt
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a system that always extracts information from an image in a json_format"
                    }
                ]
            },
            # Then define the user prompt
            {
                "role": "user",
                # Under the user prompt, I pass two content items: one text and one image
                "content": [
                    {
                        "type": "text",
                        "text": """Extract the grades from this image in a structured format. Only return the output.
                                   ```
                                   [{"subject": "<subject>", "grade": "<grade>"}]
                                   ```"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 500  # Return no more than 500 completion tokens
    }
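
    To get the reply, send the payload to the chat completions endpoint as in the template above and read the message content out of the JSON response. A minimal sketch, assuming headers and payload are defined as above:

    import requests

    # Send the request and pull the model's reply out of the JSON response
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    reply = response.json()["choices"][0]["message"]["content"]
    print(reply)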

    The response I get from the API is exactly what I wanted.

    ```json
    [
      {"subject": "English", "grade": "A+"},
      {"subject": "Math", "grade": "B-"},
      {"subject": "Science", "grade": "B+"},
      {"subject": "History", "grade": "C+"}
    ]
    ```
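
    The API also accepts more than one image per request: you simply append additional image_url entries to the user message’s content list. Here is a minimal sketch; the two report-card images and the question are hypothetical, and base64_image_1 and base64_image_2 are assumed to come from encode_image above.

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Are the grades in these two report cards the same?"},
                    # Each image goes in as its own image_url content item
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_1}"}},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_2}"}}
                ]
            }
        ],
        "max_tokens": 300
    }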

    This is just an example of how, with the right prompt, we can build an information retrieval system over images using the Vision API.

    In the next article, we will build an image classifier using just this API, with no machine learning knowledge required.

  • Large Language Model (LLM) Evaluation Metrics – BLEU and ROUGE

    How are language models evaluated? In traditional machine learning, we have metrics like accuracy, F1-score, precision, and recall. But how can you objectively measure how a model performed when the label is "I like to drink coffee over tea" and the model’s output is "I prefer coffee to tea"? As humans, we can clearly see that these two have the same meaning, but how can a machine make the same evaluation?

    Well, there are two approaches –

    1. BLEU – Bilingual Evaluation Understudy is a metric used to evaluate the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the reference translations based on the n-grams (contiguous sequences of n words) present in both. BLEU score ranges from 0 to 1, with a higher score indicating a better match between the generated translation and the references. A score of 1 means a perfect match, while a score of 0 means no overlap between the generated and reference translations.
    2. ROUGE – Recall-Oriented Understudy for Gisting Evaluation is a widely used evaluation metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries. ROUGE calculates precision and recall scores by comparing the n-gram units (such as words or sequences of words) in the generated summary with those in the reference summaries. It focuses on the recall score, which measures how much of the important information from the reference summaries is captured by the generated summary.

    Let us take an example and calculate both metrics. Suppose the label is "The cat sat on the mat." and the model’s output is "The dog slept on the couch."

    Here is the Python code to calculate the ROUGE score –

    import re
    
    
    def calculate_ROUGE(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary into n-grams
        generated_ngrams = generate_ngrams(generated_summary, n)
        reference_ngrams = generate_ngrams(reference_summary, n)
    
        # Calculate the recall score
        matching_ngrams = len(set(generated_ngrams) & set(reference_ngrams))
        recall_score = matching_ngrams / len(reference_ngrams)
    
        return recall_score
    
    
    def generate_ngrams(text, n):
        # Preprocess text by removing punctuation and converting to lowercase
        text = re.sub(r'[^\w\s]', '', text.lower())
    
        # Generate n-grams from the preprocessed text
        words = text.split()
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
    
        return ngrams
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # bigram
    
    rouge_score = calculate_ROUGE(generated_summary, reference_summary, n)
    print(f"ROUGE-{n} score: {rouge_score}")
    >> ROUGE-2 score: 0.2
    

    If we use n = 2, i.e. bigrams, the ROUGE-2 score is 0.2.

    Similarly, let’s calculate the BLEU score –

    import nltk.translate.bleu_score as bleu
    
    
    def calculate_BLEU(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary
        generated_tokens = generated_summary.split()
        reference_tokens = reference_summary.split()
    
        # Calculate the BLEU score
        weights = [1.0 / n] * n  # Weights for n-gram precision calculation
        bleu_score = bleu.sentence_bleu([reference_tokens], generated_tokens, weights=weights)
    
        return bleu_score
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # Bigram
    
    bleu_score = calculate_BLEU(generated_summary, reference_summary, n)
    print(f"BLEU-{n} score: {bleu_score}")
    >> BLEU-2 score: 0.316227766016838
    

    So, we get two different scores from these two different approaches.

    The evaluation metric you choose to evaluate or fine-tune your LLM will depend on the task at hand, but typically BLEU is used for machine translation and ROUGE for summarisation tasks.
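
    In practice, you would usually reach for existing implementations rather than hand-rolling the metrics. Here is a minimal sketch assuming the rouge-score and nltk packages are installed; the smoothing function is an extra choice on my part to avoid zero BLEU scores on very short texts.

    import nltk.translate.bleu_score as bleu
    from rouge_score import rouge_scorer

    reference = "The cat sat on the mat."
    generated = "The dog slept on the couch."

    # ROUGE-2 recall using Google's rouge-score package
    scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    print(scorer.score(reference, generated)["rouge2"].recall)

    # BLEU-2 using NLTK, with smoothing for short sentences
    smoothie = bleu.SmoothingFunction().method1
    print(bleu.sentence_bleu([reference.split()], generated.split(),
                             weights=(0.5, 0.5), smoothing_function=smoothie))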