Tag: KerasNLP

  • I asked ChatGPT to write a language model

    I asked ChatGPT to write a language model. Here is the code that it returned.

    from keras.layers import Embedding, LSTM, Dense
    from keras.models import Sequential
    
    # Define the model
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_length))
    model.add(LSTM(units=hidden_size))
    model.add(Dense(units=vocab_size, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    
    # Fit the model to the training data
    model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs)
    
    

    So I decided to build a language model using it, but before I had to write a couple of lines of code myself. First the Tokenizer.

    class Tokenizer():
        def __init__(self, 
                     oov_token ='<unk>',
                     pad_token ='<pad>'):
            self.vocab = {}
            self.reverse_vocab = {}
            self.oov_token = oov_token
            self.pad_token = pad_token
            self.__add_to_dict(self.oov_token)
            self.__add_to_dict(self.pad_token)
            for i in range(10):
                self.__add_to_dict(str(i))
            for i in range(26):
                self.__add_to_dict(chr(ord('a') + i))
    
            # Add space and punctuation to the dictionary
            self.__add_to_dict('.')
            self.__add_to_dict(' ')
        
        def __add_to_dict(self, character):
            if character not in self.vocab:
                self.vocab[character] = len(self.vocab)
                self.reverse_vocab[self.vocab[character]] = character
            
        def tokenize(self, text):
            return [self.vocab[c] for c in text]
    
        def detokenize(self, text):
            return [self.reverse_vocab[c] for c in text]
        
        def get_vocabulary(self):
            return self.vocab
        
        def vocabulary_size(self):
            return len(self.vocab)
        
        def token_to_id(self,character):
            return self.vocab[character]
        
        def id_to_token(self , token):
            return self.reverse_vocab[token]
        
        def pad_seq(self,seq, max_len):
            return seq[:max_len] + [self.token_to_id(self.pad_token)]*(max_len-len(seq))
    

    Then I added the config, created a small corpus of text and prepared the training data it needs to train the model. To prepare the training data I also asked how should the corpus be created and created the corpus as it showed me.

    t = Tokenizer()
    vocab_size = t.vocabulary_size()
    embedding_size = 64
    max_length = vocab_size
    num_epochs = 50
    batch_size = 16
    
    corpus = ["this is a dog",
             "dogs live with humans", 
             "they are called what is known as pets",
             "cats are also pets",
             "there are also wolves in the jungle",
             "there are many animals in the jungle",
             "the lion is called the king of the jungle",
             "the largest animal in the jungle is the elephant",
             "we have horses also there",
             "foxes are quite quick and they are also very clever",
             "the zebra also looks like a horse but is not domesticated",
             "we go to the zoo to look at monkeys",
             "we all live together on this earth",
             'monkeys like bananas',
             'pandas eat bamboo',
             'tigers are dangerous',
             'zebras have stripes',
             'lions are the kings of the savannah',
             "it is upon humans to take care of this planet"]
    tokenized_corpus = [t.tokenize(line) for line in corpus]
    #Padding the tokenized corpus
    x_train = np.array([t.pad_seq(sentence, max_length) for sentence in tokenized_corpus])
    
    y_train = np.array([np.roll(sentence, -1) for sentence in x_train])
    
    
    

    Also it had a mistake in the code, which I corrected, and ran the training loop. You can find the full code here on my Github.

    I then asked ChatGPT to write the inference code for the model. Here is what it returned.

    def generate_text(seed_text, model, max_gen_length, tokenizer):
        # Tokenize the seed text
        seed_text = tokenizer.tokenize(seed_text)
        # Pad the seed text to the maximum length
        seed_text = tokenizer.pad_seq(seed_text, max_length=30)
        # Initialize an empty list to store the generated words
        generated_text = []
        # Append the seed text to the generated text
        generated_text.extend(seed_text)
        # Loop for the specified number of words to generate
        for i in range(max_gen_length):
            # Use the model to predict the next word
            next_word_probs = model.predict(np.array([seed_text]))[0]
            # Get the index of the most probable next word
            next_word_idx = np.argmax(next_word_probs)
            # Append the generated word to the generated text
            generated_text.append(next_word_idx)
            # update the seed text
            seed_text = np.delete(seed_text, 0)
            seed_text = np.append(seed_text, next_word_idx)
        # Convert the generated text from indices to words
        generated_text = [tokenizer.id_to_token(word) for word in generated_text]
        return "".join(generated_text)
    
    # Initialize the seed text
    seed_text = "The sky is"
    # Generate new text
    generated_text = generate_text(seed_text, model, max_gen_length=10, tokenizer=tokenizer)
    print(generated_text)
    
    

    Making a few changes to the code to suit our tokenizer class and model, here are a few inputs and outputs.

    Input - the sky is
    Output - the sky is<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
    Input - "lion is the king of the jungle"
    Output - lion is the king of the jungle<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444

    Sure the output is terrible, but remember it is a very basic model architecture and we’ve not used transformers or temperature sampling to improve our language model. In my future posts, I’ll use ChatGPT to build upon these blocks to train even bigger and more complex language models.

    This shows how ChatGPT or similar large language models can enable developers in writing code or develop models in a short amount of time. It is