Category: AI

  • Q-Learning in Python: Reinforcement Learning on Frozen Lake

    Q-Learning in Python: Reinforcement Learning on Frozen Lake

    Ever seen an AI agent go from stumbling around cluelessly to mastering its environment, making perfect moves every single time? In this blog post, we'll explore how to train an agent to do just that, transforming random, chaotic actions into smooth, optimal choices. We'll dive into the fascinating world of Q-learning and discover how it empowers AI agents to learn and adapt. In case you want to follow along, here is the link to the Colab notebook.

    What Is Q-Learning?

    Q-learning is a type of reinforcement learning where an agent learns to make optimal decisions by interacting with its environment. The agent explores its surroundings, tries different actions, and observes the outcomes. It uses a Q-table to store Q-values, which represent the expected reward for taking a specific action in a given state. Over time, the agent updates its Q-values based on its experiences, gradually learning the best actions to take in each situation.

    source: HuggingFace

    The Q-value update formula takes our former estimate of the Q-value and adds the temporal difference (TD) error, which adjusts the prediction based on new information. We multiply the TD error by a learning rate so that each update is a small, manageable step, similar to the incremental updates used elsewhere in machine learning, allowing the estimate to be refined gradually. The TD error itself combines the immediate reward received for the action with the discounted estimate of the best Q-value in the next state that the action leads to; this look-ahead is what lets future consequences influence current decisions. From this target we subtract the former Q-value estimate, so repeated updates correct biases in the initial estimates and steadily improve the agent's ability to predict and maximise long-term reward.
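
    In symbols, the update described above is the standard Q-learning rule, where alpha is the learning rate and gamma the discount factor (it mirrors the code later in this post):

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]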

    The Frozen Lake Environment

    Enough theory, now it's time to train our agent on the Frozen Lake environment. Imagine a frozen lake with slippery patches. Our agent's goal is to navigate across the lake without falling into any holes. The agent can move up, down, left, or right, but the slippery surface makes its actions unpredictable. This simple environment provides a great starting point for understanding Q-learning. We will go over training in the non-slippery environment; to see how the agent performs in the slippery environment, check out the YouTube video for this post.

    The first thing we will have to do is to initialize the environment.

    # Importing libraries
    import gymnasium as gym
    import numpy as np
    from matplotlib import pyplot as plt
    
    np.set_printoptions(precision=3)
    
    env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode="rgb_array")
    print(f"There are {env.action_space.n} possible actions")
    print(f"There are {env.observation_space.n} states")
    >>>There are 4 possible actions
    >>>There are 16 states
    

    We can see that our world is 4×4 in size and thus has 16 possible states and there are 4 possible actions – up, down, left and right. We can take a look at the world.
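
    Since we created the environment with render_mode="rgb_array", here is a small sketch of how to display the grid with matplotlib, using the imports from the cell above:

    state, _ = env.reset()
    
    # env.render() returns an RGB array we can show as an image.
    plt.imshow(env.render())
    plt.axis("off")
    plt.show()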

    The goal of our agent is to reach the prize at the bottom-right. We can clearly see that it can do so by either going right->right->down->down->down->right or by following down->down->right->right->down->right. But how do we train the agent to come up with either of these paths on its own?

    We do so by initially letting the agent explore the environment randomly, trying different actions without any predefined strategy. This exploration phase is crucial: it lets the agent gather diverse experiences and build a basic understanding of the environment's dynamics. As it gains experience, it starts exploiting its learned knowledge, choosing the actions with the higher Q-values it has found to be beneficial. Throughout training the agent balances exploration and exploitation, so it keeps discovering new strategies while making increasingly good use of what it already knows.
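
    To get a feel for how quickly exploration gives way to exploitation, here is a small illustration of the epsilon decay schedule we use later in this post (epsilon starts at 1 and is multiplied by 0.99 after every episode):

    # Probability of taking a random action after a given number of episodes.
    for episode in [0, 100, 500, 1000]:
        print(f"episode {episode}: epsilon = {1.0 * 0.99 ** episode:.4f}")
    
    # episode 0: epsilon = 1.0000
    # episode 100: epsilon = 0.3660
    # episode 500: epsilon = 0.0066
    # episode 1000: epsilon = 0.0000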

    To do so, let's establish some helper functions first –

    def get_action(epsilon, state, q_table):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table.
        if np.random.rand() < epsilon:
            return np.random.randint(0, env.action_space.n)
        else:
            return np.argmax(q_table[state])
    
    def get_td_error(state, next_state, action, reward, q_table):
        # Temporal difference error: the TD target minus our former Q-value estimate.
        former_q_est = q_table[state, action]
        td_target = reward + gamma * np.max(q_table[next_state])
        td_error = td_target - former_q_est
        return td_error
    
    # As seen, we first define the Q-table and update its values during training.
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    

    We created two functions. The first, get_action, chooses the action based on epsilon, which controls how random our actions are. Initially during training we keep epsilon very high and lower it as the agent learns. The second, get_td_error, calculates the temporal difference error after each step. We also created our Q-table, which has shape n_states x n_actions = 16 x 4.

    We also have to establish training hyper-parameters.

    num_epochs = 1000    # number of training episodes
    gamma = 0.99         # discount factor
    lr = 0.1             # learning rate
    decay_rate = 0.99    # multiplicative epsilon decay per episode
    epsilon = 1          # initial exploration probability
    

    During training, in each episode we update our Q-table after every action. The episode ends when we either fall into a hole or reach the prize. After the episode ends, we decay epsilon a bit and repeat the process. Once training is done, our Q-table should have converged to near-optimal Q-values for each state-action pair.

    for i in range(num_epochs):
        state, _ = env.reset()
        done = False
        while not done:
            # Pick an action (epsilon-greedy), take it, and update the Q-table.
            action = get_action(epsilon, state, q_table)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            td_error = get_td_error(state, next_state, action, reward, q_table)
            q_table[state, action] = q_table[state, action] + lr * td_error
            state = next_state
        # Reduce exploration after every episode.
        epsilon *= decay_rate
    

    Now that we've trained our agent, let's see what its behaviour looks like. The code for creating the animation is in the Colab notebook.

    We can see that it always now follows the optimal path.
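
    As a quick sanity check (separate from the animation), here is a minimal sketch that rolls out the greedy policy from the learned Q-table and prints the states it visits:

    # Follow the greedy policy (epsilon = 0) using the learned Q-table.
    state, _ = env.reset()
    path = [state]
    done = False
    while not done:
        action = int(np.argmax(q_table[state]))
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        path.append(state)
    
    print("Visited states:", path)
    # On the non-slippery 4x4 map this should be one of the two six-move optimal routes,
    # e.g. [0, 1, 2, 6, 10, 14, 15] or [0, 4, 8, 9, 10, 14, 15].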

    Conclusion

    Q-learning is a powerful technique for training AI agents to make optimal decisions. By interacting with their environment and learning from their experiences, agents can master even complex tasks. As we’ve seen, the environment plays a crucial role in shaping the agent’s behavior.

    However, in complex environments with a vast number of states, traditional Q-learning becomes impractical. That’s where deep Q-learning comes in. By using deep neural networks, we can approximate Q-values without relying on an enormous Q-table. Stay tuned for our next blog post, where we’ll explore the intricacies of deep Q-learning.

  • Embed Documents Using Ollama – OllamaEmbeddings

    You can now create document embeddings using Ollama. Once these embeddings are created, you can store them in a vector database. You can read this article where I go over how to do so.

    from langchain_community.embeddings import OllamaEmbeddings
    
    ollama_emb = OllamaEmbeddings(
        model="mistral",
    )
    
    r1 = ollama_emb.embed_documents(
        [
            "Alpha is the first letter of Greek alphabet",
            "Beta is the second letter of Greek alphabet",
            "This is a random sentence",
        ]
    )
    
    r2 = ollama_emb.embed_query(
        "What is the second letter of Greek alphabet"
    )

    Let’s inspect the array shapes-

    import numpy as np
    
    print(np.array(r1).shape)
    >>> (3, 4096)
    print(np.array(r2).shape)
    >>> (4096,)

    Now we can also find the cosine similarity between the vectors –

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_similarity(np.array(r1), np.array(r2).reshape(1,-1))
    >>>array([[0.62087283],
    [0.65085897],
    [0.36985642]])

    Here we can clearly see that the second of our three reference documents is the closest to our question. Similarly, you can create embeddings from your own text documents, store them, and later query them using Ollama and LangChain, as sketched below.
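
    For example, here is a minimal sketch of indexing these documents in a vector store and querying it (this assumes the chromadb package is installed; any LangChain vector store works the same way):

    from langchain_community.vectorstores import Chroma
    
    docs = [
        "Alpha is the first letter of Greek alphabet",
        "Beta is the second letter of Greek alphabet",
        "This is a random sentence",
    ]
    
    # Embed the documents with Ollama and index them in Chroma.
    vectorstore = Chroma.from_texts(texts=docs, embedding=ollama_emb)
    
    # Retrieve the document most similar to the query.
    results = vectorstore.similarity_search("What is the second letter of Greek alphabet", k=1)
    print(results[0].page_content)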

  • Temperature In Language Models – A way to control for Randomness

    Temperature In Language Models – A way to control for Randomness

    Temperature is a parameter that you can access with open-source LLMs (and most LLM APIs) which essentially tells the model how random its behaviour should be.

    Here is an image from cohere.ai

    In this image, we can see that if you increase the temperature, the probability distribution of the softmax output of the next token is changed. So when you sample from this probability distribution, there is a chance that you can select the output token which has a very low probability score in the original distribution.

    Similarly, there are also the top-k and top-p sampling parameters.

    They work alongside temperature: top-k restricts sampling to the k most probable tokens, while top-p (nucleus sampling) restricts it to the smallest set of tokens whose cumulative probability exceeds p. The higher their values, the more candidate tokens are considered and the more random your output will be.

    Let’s take an example. What do you expect the completion of this sentence to be – The cat sat on the _____

    I think most of us will think mat, followed by other things where you can sit like porch, floor, etc. and not sky.

    Suppose we feed this to a text-generation model and the softmax probability distribution looks like this –

    token prob
    mat 0.6
    floor 0.2
    porch 0.1
    car 0.05
    bus 0.03
    sky 0.02

    If you set temperature = 0, then the most likely completion of the sentence that the model will return will be The cat sat on the mat

    But when we set temperature = 1, or an even higher value, the model could give the output The cat sat on the sky, because the softmax distribution is flattened so that less likely tokens have a better chance of being sampled. This can be good or bad depending on the context of the problem.
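
    As a rough illustration, here is a minimal sketch of how temperature reshapes the toy distribution from the table above: we turn the probabilities back into logits, divide them by the temperature, and re-apply softmax.

    import numpy as np
    
    tokens = ["mat", "floor", "porch", "car", "bus", "sky"]
    probs = np.array([0.6, 0.2, 0.1, 0.05, 0.03, 0.02])
    
    def apply_temperature(probs, temperature):
        # Recover logits, rescale them by the temperature, then re-normalise with softmax.
        logits = np.log(probs) / temperature
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()
    
    for t in [0.1, 1.0, 2.0]:
        print(f"T={t}:", dict(zip(tokens, np.round(apply_temperature(probs, t), 3))))
    
    # A low temperature sharpens the distribution towards "mat";
    # a high temperature flattens it, so tokens like "sky" become more likely to be sampled.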

    In the video below, we ran through a couple of settings and saw the effect these parameters had on the output of Llama-2-7b.

    #loading the model 
    
    import torch
    from peft import PeftModel, PeftConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
    
    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False)
    
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config = bnb_config,device_map={"":0})

    Then we create the prompt template and a function to create a text-generation pipeline –

    import json
    import textwrap
    
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    DEFAULT_SYSTEM_PROMPT = """
    """
    
    
    
    def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):
        SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
        prompt_template =  B_INST + SYSTEM_PROMPT + instruction + E_INST
        return prompt_template
    
    def create_pipeline(temperature = 0, top_p = 0.1, top_k = 3, max_new_tokens=512):
        pipe = pipeline("text-generation",
                    model=model,
                    tokenizer = tokenizer,
                    max_new_tokens = max_new_tokens,
                    temperature = temperature,
                    do_sample = True, 
                    top_p = top_p,
                    top_k = top_k)
        return pipe

    Now let’s see the model output when we pass this prompt to the model with different configurations.

    [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]
    # Model with all params set low.
    prompt = get_prompt("Complete the sentence - The cat sat on the ")  # the prompt shown above
    pipe = create_pipeline(0.1)
    output = pipe.predict(prompt)
    print(output[0]['generated_text'])
    
    >>> [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]  The cat sat on the mat.

    With all the sampling parameters set low, the model's output was in line with our expectations.

    # Model with all params as high.
    pipe = create_pipeline(0.8, top_p = 0.8, top_k = 100)
    output = pipe.predict(prompt)
    print(output[0]['generated_text'])
    
    >>> [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]  The cat sat on the windowsill.

    Here we can see that changing the sampling parameters influenced the model's output as well.

  • Gorilla – A LLM to output API calls, paper walkthrough with a working example

    In the YouTube video, I go over Gorilla, an LLM which is fine-tuned on API calls.

    Let me know in the comments below if you want to learn more about such LLM or ML concepts.

  • PDF ChatBot Demo with Gradio, Llama-2 and LangChain

    PDF ChatBot Demo with Gradio, Llama-2 and LangChain

    In this post, we will learn how you can create a chatbot which can read through your documents and answer any question. In addition, we will learn how to create a working demo using Gradio that you can share with your colleagues or friends.

    The Google Colab notebook can be found here.

  • Fine Tune Llama-2-13b on a single GPU on custom data.

    In this tutorial, we will walk through each step of fine-tuning the Llama-2-13b model on a single GPU. I'll be using a Colab notebook, but you can use your local machine; it just needs around 12 GB of VRAM.

    The required libraries can be installed by running this in your notebook.

    !pip install -q transformers trl peft huggingface_hub datasets bitsandbytes accelerate

    First, log in to your Hugging Face account.

    from huggingface_hub import login
    login("<your token here>")

    Loading the tokenizer.

    model_id = "meta-llama/Llama-2-13b-chat-hf"
    import torch
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    
    from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, BitsAndBytesConfig
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    Now we will load the model in its quantised form; this reduces the memory required to fit the model, so it can run on a single GPU.

    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False)

    If you have a bit more GPU memory to play with, you can load the model in 8-bit instead. Play around with this configuration based on your hardware specifications; a sketch of the 8-bit option is shown below.
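
    For example, a minimal sketch of the 8-bit configuration (it needs roughly twice the VRAM of the 4-bit version):

    # Alternative: load the weights in 8-bit instead of 4-bit.
    # Pass this as quantization_config instead of the 4-bit config above.
    bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)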

    model = AutoModelForCausalLM.from_pretrained(model_id,  quantization_config=bnb_config, use_cache=False)

    The lines below prepare the model for 4-bit or 8-bit training; without them you will get an error.

    from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
    
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    Then you define your LoRA config. There are mainly two parameters to play around with – the rank (r) and lora_alpha. For more details, you can read about the params here.

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        inference_mode=False, 
        r=64, 
        lora_alpha=32, 
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)
    

    The cell below is a helper function that shows how many trainable parameters there are.

    def print_trainable_parameters(model):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(
            f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
        )
    
    print_trainable_parameters(model)
    
    >>> trainable params: 52428800 || all params: 6724408320 || trainable%: 0.7796790067620403

    We can see with LoRA, there are very few parameters to train.

    To prepare your data, you can keep it in almost any form you want, as long as it is loaded with the datasets library; you can then pass a formatting function during training that combines the text parts of each example.
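
    For example, a minimal sketch of loading the data (the file name "train.jsonl" is just a placeholder for your own dataset):

    from datasets import load_dataset
    
    # Hypothetical local JSONL file with one training example per line.
    dataset = load_dataset("json", data_files="train.jsonl", split="train")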

    Here you can change the training configuration. With LoRA you can start with a higher learning rate because the original weights are frozen, so you don't have to worry about catastrophic forgetting. The arguments you will want to play around with are per_device_train_batch_size and gradient_accumulation_steps: if you run out of memory, lower per_device_train_batch_size and increase gradient_accumulation_steps.

    max_seq_length = 512
    
    from transformers import TrainingArguments, EarlyStoppingCallback
    from trl import SFTTrainer
    output_dir = "./results"
    optim = "paged_adamw_32bit"
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        optim=optim,
        learning_rate=1e-4,
        logging_steps=10,
        max_steps=300,
        warmup_ratio=0.3,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        save_total_limit = 5,
        fp16=True
        
    )
    

    Here I’m writing an example of a formatting function. My data already had a text field which had all the text data.

    def format_function(example):
        return example['text']

    But in case you don't have a text field, you can write the function so that it combines the relevant fields into a single string, as sketched below.
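
    For example, a minimal sketch assuming hypothetical instruction and response fields (substitute your own column names):

    def format_function(example):
        # Combine the (hypothetical) fields of each example into one training string.
        return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"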

    Now we define the trainer.

    from trl import SFTTrainer
    peft_trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        args=training_args,
        formatting_func=format_function)
    
    peft_trainer.train()

    Once the model has been trained, you can store it locally or push it to the Hugging Face Hub.
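
    For example (the directory and repository names below are just placeholders):

    # Save the LoRA adapter locally...
    peft_trainer.save_model("./llama-2-13b-adapter")
    
    # ...or push it to the Hugging Face Hub.
    model.push_to_hub("your-username/llama-2-13b-adapter")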

    Hope this tutorial cleared any doubts you had around fine-tuning LLMs on a single GPU.

  • Fine Tune Llama-2-7b with a custom dataset on google collab

    Fine Tune Llama-2-7b with a custom dataset on google collab

    I'll add the code and explanations as text here, but everything is explained in the YouTube video.

    Link to the Colab notebook.

  • Large Language Model (LLM) Evaluation Metrics – BLEU and ROUGE

    How are language models evaluated? In traditional machine learning, we have metrics like accuracy, f1-score, precision, recall etc. But how can you calculate objectively how the model performed, when the label is I like to drink coffee over tea and the model’s output is I prefer coffee to tea. As humans, we can clearly see that these two have the same meaning, but how can a machine make the same evaluation?

    Well, there are two approaches –

    1. BLEU – Bilingual Evaluation Understudy is a metric used to evaluate the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the reference translations based on the n-grams (contiguous sequences of n words) present in both. BLEU score ranges from 0 to 1, with a higher score indicating a better match between the generated translation and the references. A score of 1 means a perfect match, while a score of 0 means no overlap between the generated and reference translations.
    2. ROUGE – Recall-Oriented Understudy for Gisting Evaluation – is a widely used evaluation metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries. ROUGE calculates precision and recall scores by comparing the n-gram units (such as words or sequences of words) in the generated summary with those in the reference summaries. It focuses on the recall score, which measures how much of the important information from the reference summaries is captured by the generated summary.

    Let us take an example and calculate both the metrics. Suppose the label is "The cat sat on the mat." and the model’s output is "The dog slept on the couch."

    Here is the python code to calculate ROUGE score –

    from collections import Counter
    import re
    
    
    def calculate_ROUGE(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary into n-grams
        generated_ngrams = generate_ngrams(generated_summary, n)
        reference_ngrams = generate_ngrams(reference_summary, n)
    
        # Calculate the recall score
        matching_ngrams = len(set(generated_ngrams) & set(reference_ngrams))
        recall_score = matching_ngrams / len(reference_ngrams)
    
        return recall_score
    
    
    def generate_ngrams(text, n):
        # Preprocess text by removing punctuation and converting to lowercase
        text = re.sub(r'[^\w\s]', '', text.lower())
    
        # Generate n-grams from the preprocessed text
        words = text.split()
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
    
        return ngrams
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # bigram
    
    rouge_score = calculate_ROUGE(generated_summary, reference_summary, n)
    print(f"ROUGE-{n} score: {rouge_score}")
    >> ROUGE-2 score: 0.2
    

    If we use n = 2 (bigrams), the ROUGE-2 score is 0.2: the reference contains five bigrams and only one of them ("on the") also appears in the generated sentence, giving 1/5 = 0.2.

    Similarly, let's calculate the BLEU score –

    from collections import Counter
    import nltk.translate.bleu_score as bleu
    
    
    def calculate_BLEU(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary
        generated_tokens = generated_summary.split()
        reference_tokens = reference_summary.split()
    
        # Calculate the BLEU score
        weights = [1.0 / n] * n  # Weights for n-gram precision calculation
        bleu_score = bleu.sentence_bleu([reference_tokens], generated_tokens, weights=weights)
    
        return bleu_score
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # Bigram
    
    bleu_score = calculate_BLEU(generated_summary, reference_summary, n)
    print(f"BLEU-{n} score: {bleu_score}")
    >> BLEU-2 score: 0.316227766016838
    

    So, we get two different scores from these two different approaches. The BLEU-2 score here is the geometric mean of the unigram precision (3/6 = 0.5) and the bigram precision (1/5 = 0.2), i.e. sqrt(0.5 x 0.2) ≈ 0.316.

    The evaluation metric you choose to fine-tune your LLMs with will depend on the task at hand, but usually BLEU is used for machine translation and ROUGE is used for summarisation tasks.

  • I asked ChatGPT to write a language model

    I asked ChatGPT to write a language model. Here is the code that it returned.

    from keras.layers import Embedding, LSTM, Dense
    from keras.models import Sequential
    
    # Define the model
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_length))
    model.add(LSTM(units=hidden_size))
    model.add(Dense(units=vocab_size, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    
    # Fit the model to the training data
    model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs)
    
    

    So I decided to build a language model using it, but first I had to write a couple of pieces of code myself, starting with the Tokenizer.

    class Tokenizer():
        def __init__(self, 
                     oov_token ='<unk>',
                     pad_token ='<pad>'):
            self.vocab = {}
            self.reverse_vocab = {}
            self.oov_token = oov_token
            self.pad_token = pad_token
            self.__add_to_dict(self.oov_token)
            self.__add_to_dict(self.pad_token)
            for i in range(10):
                self.__add_to_dict(str(i))
            for i in range(26):
                self.__add_to_dict(chr(ord('a') + i))
    
            # Add space and punctuation to the dictionary
            self.__add_to_dict('.')
            self.__add_to_dict(' ')
        
        def __add_to_dict(self, character):
            if character not in self.vocab:
                self.vocab[character] = len(self.vocab)
                self.reverse_vocab[self.vocab[character]] = character
            
        def tokenize(self, text):
            # Fall back to the out-of-vocabulary token for unknown characters.
            return [self.vocab.get(c, self.vocab[self.oov_token]) for c in text]
    
        def detokenize(self, text):
            return [self.reverse_vocab[c] for c in text]
        
        def get_vocabulary(self):
            return self.vocab
        
        def vocabulary_size(self):
            return len(self.vocab)
        
        def token_to_id(self,character):
            return self.vocab[character]
        
        def id_to_token(self , token):
            return self.reverse_vocab[token]
        
        def pad_seq(self,seq, max_len):
            return seq[:max_len] + [self.token_to_id(self.pad_token)]*(max_len-len(seq))
    

    Then I added the config, created a small corpus of text, and prepared the training data needed to train the model. To prepare the training data, I also asked how the corpus should be structured and created it as ChatGPT suggested.

    import numpy as np
    
    t = Tokenizer()
    vocab_size = t.vocabulary_size()
    embedding_size = 64
    max_length = vocab_size
    num_epochs = 50
    batch_size = 16
    
    corpus = ["this is a dog",
             "dogs live with humans", 
             "they are called what is known as pets",
             "cats are also pets",
             "there are also wolves in the jungle",
             "there are many animals in the jungle",
             "the lion is called the king of the jungle",
             "the largest animal in the jungle is the elephant",
             "we have horses also there",
             "foxes are quite quick and they are also very clever",
             "the zebra also looks like a horse but is not domesticated",
             "we go to the zoo to look at monkeys",
             "we all live together on this earth",
             'monkeys like bananas',
             'pandas eat bamboo',
             'tigers are dangerous',
             'zebras have stripes',
             'lions are the kings of the savannah',
             "it is upon humans to take care of this planet"]
    tokenized_corpus = [t.tokenize(line) for line in corpus]
    #Padding the tokenized corpus
    x_train = np.array([t.pad_seq(sentence, max_length) for sentence in tokenized_corpus])
    
    y_train = np.array([np.roll(sentence, -1) for sentence in x_train])
    
    
    

    It also had a mistake in the code, which I corrected, and then I ran the training loop. You can find the full code here on my GitHub.

    I then asked ChatGPT to write the inference code for the model. Here is what it returned.

    def generate_text(seed_text, model, max_gen_length, tokenizer):
        # Tokenize the seed text
        seed_text = tokenizer.tokenize(seed_text)
        # Pad the seed text to the maximum length
        seed_text = tokenizer.pad_seq(seed_text, max_length=30)
        # Initialize an empty list to store the generated words
        generated_text = []
        # Append the seed text to the generated text
        generated_text.extend(seed_text)
        # Loop for the specified number of words to generate
        for i in range(max_gen_length):
            # Use the model to predict the next word
            next_word_probs = model.predict(np.array([seed_text]))[0]
            # Get the index of the most probable next word
            next_word_idx = np.argmax(next_word_probs)
            # Append the generated word to the generated text
            generated_text.append(next_word_idx)
            # update the seed text
            seed_text = np.delete(seed_text, 0)
            seed_text = np.append(seed_text, next_word_idx)
        # Convert the generated text from indices to words
        generated_text = [tokenizer.id_to_token(word) for word in generated_text]
        return "".join(generated_text)
    
    # Initialize the seed text
    seed_text = "The sky is"
    # Generate new text
    generated_text = generate_text(seed_text, model, max_gen_length=10, tokenizer=tokenizer)
    print(generated_text)
    
    

    Making a few changes to the code to suit our tokenizer class and model, here are a few inputs and outputs.

    Input - the sky is
    Output - the sky is<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
    Input - "lion is the king of the jungle"
    Output - lion is the king of the jungle<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444

    Sure the output is terrible, but remember it is a very basic model architecture and we’ve not used transformers or temperature sampling to improve our language model. In my future posts, I’ll use ChatGPT to build upon these blocks to train even bigger and more complex language models.

    This shows how ChatGPT and similar large language models can help developers write code and develop models in a short amount of time.

  • Why Tanh is a better activation function than sigmoid?

    You might be asked why, in neural networks, tanh is often considered a better activation function than sigmoid.

    Sigmoid
    Tanh

    Andrew Ng also mentions in his Deep Learning Specialization course that tanh is almost always a better activation function than sigmoid. So why is that the case?

    There are a few reasons why the hyperbolic tangent (tanh) function is often considered to be a better activation function than the sigmoid function:

    1. The output of the tanh function is centred around zero: negative inputs are mapped to negative values and positive inputs to positive values. Zero-centred activations keep the inputs to the next layer roughly zero-mean, which makes the learning process more stable.
    2. The tanh function has a well-behaved derivative that is easy to compute, and its gradient is larger around zero (the maximum is 1, versus 0.25 for sigmoid), so gradients shrink less as they are propagated backwards (see the short sketch after this list).
    3. The sigmoid function, on the other hand, maps all inputs to values between 0 and 1, so its outputs are never zero-centred. Combined with its smaller gradient, this makes it easier for the network to saturate, the gradients can become very small, and the network becomes difficult to train.
    4. Because the range of tanh is [-1, 1] while the range of sigmoid is [0, 1], tanh outputs are closer to zero mean, which makes saturation less likely, gives better gradient flow, and leads to faster convergence.
    5. Note that both functions are smooth and differentiable at all points; the practical advantage of tanh comes from its zero-centred output and larger gradient, not from differentiability.
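
    To make the gradient comparison concrete, here is a minimal NumPy sketch that evaluates both derivatives at a few points:

    import numpy as np
    
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    
    x = np.array([-2.0, 0.0, 2.0])
    
    # Derivatives: sigmoid'(x) = s(x) * (1 - s(x)); tanh'(x) = 1 - tanh(x)^2
    sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))
    tanh_grad = 1 - np.tanh(x) ** 2
    
    print("sigmoid gradient:", np.round(sigmoid_grad, 3))  # peaks at 0.25 at x = 0
    print("tanh gradient:   ", np.round(tanh_grad, 3))     # peaks at 1.00 at x = 0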

    All that being said, whether to use sigmoid or tanh depends on the specific problem and context, and it’s not always the case that one is clearly better than the other.