KerasNLP – ML EXPLAINED

I asked ChatGPT to write a language model. Here is the code that it returned.

from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential

# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_length))
model.add(LSTM(units=hidden_size))
model.add(Dense(units=vocab_size, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Fit the model to the training data
model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs)

So I decided to build a language model using it, but first I had to write a few lines of code myself, starting with the tokenizer.

class Tokenizer():
    def __init__(self, 
                 oov_token ='<unk>',
                 pad_token ='<pad>'):
        self.vocab = {}
        self.reverse_vocab = {}
        self.oov_token = oov_token
        self.pad_token = pad_token
        self.__add_to_dict(self.oov_token)
        self.__add_to_dict(self.pad_token)
        for i in range(10):
            self.__add_to_dict(str(i))
        for i in range(26):
            self.__add_to_dict(chr(ord('a') + i))

        # Add space and punctuation to the dictionary
        self.__add_to_dict('.')
        self.__add_to_dict(' ')
    
    def __add_to_dict(self, character):
        if character not in self.vocab:
            self.vocab[character] = len(self.vocab)
            self.reverse_vocab[self.vocab[character]] = character
        
    def tokenize(self, text):
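        # Map characters outside the vocabulary to the OOV token instead of raising KeyError
        return [self.vocab.get(c, self.vocab[self.oov_token]) for c in text]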

    def detokenize(self, text):
        return [self.reverse_vocab[c] for c in text]
    
    def get_vocabulary(self):
        return self.vocab
    
    def vocabulary_size(self):
        return len(self.vocab)
    
    def token_to_id(self,character):
        return self.vocab[character]
    
    def id_to_token(self , token):
        return self.reverse_vocab[token]
    
    def pad_seq(self,seq, max_len):
        return seq[:max_len] + [self.token_to_id(self.pad_token)]*(max_len-len(seq))
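
To make the tokenizer's behaviour concrete, here is a minimal standalone sketch that rebuilds the same character vocabulary in the same order and round-trips a short string. The names `symbols`, `ids`, `padded` and `decoded` are only for illustration and are not part of the class above.

```python
import string

# Same symbols, in the same order, as the Tokenizer class above:
# <unk>, <pad>, digits 0-9, letters a-z, '.', ' '
symbols = ["<unk>", "<pad>"] + list(string.digits) + list(string.ascii_lowercase) + [".", " "]
vocab = {symbol: idx for idx, symbol in enumerate(symbols)}
reverse_vocab = {idx: symbol for symbol, idx in vocab.items()}

def pad_seq(seq, max_len):
    # Truncate to max_len, then right-pad with the <pad> id
    return seq[:max_len] + [vocab["<pad>"]] * (max_len - len(seq))

ids = [vocab[c] for c in "a dog"]
padded = pad_seq(ids, 8)
decoded = "".join(reverse_vocab[i] for i in padded)

print(len(vocab))  # 40
print(ids)         # [12, 39, 15, 26, 18]
print(decoded)     # a dog<pad><pad><pad>
```

Note that `detokenize` returns a list of characters, so the caller has to join them, as done in the last step here.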

Then I added the config, created a small corpus of text, and prepared the training data the model needs. I also asked ChatGPT how the corpus should be created and built it the way it showed me.

import numpy as np

t = Tokenizer()
vocab_size = t.vocabulary_size()
embedding_size = 64
max_length = vocab_size
num_epochs = 50
batch_size = 16

corpus = ["this is a dog",
         "dogs live with humans", 
         "they are called what is known as pets",
         "cats are also pets",
         "there are also wolves in the jungle",
         "there are many animals in the jungle",
         "the lion is called the king of the jungle",
         "the largest animal in the jungle is the elephant",
         "we have horses also there",
         "foxes are quite quick and they are also very clever",
         "the zebra also looks like a horse but is not domesticated",
         "we go to the zoo to look at monkeys",
         "we all live together on this earth",
         'monkeys like bananas',
         'pandas eat bamboo',
         'tigers are dangerous',
         'zebras have stripes',
         'lions are the kings of the savannah',
         "it is upon humans to take care of this planet"]
tokenized_corpus = [t.tokenize(line) for line in corpus]
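# Pad (or truncate) every tokenized sentence to max_length
x_train = np.array([t.pad_seq(sentence, max_length) for sentence in tokenized_corpus])

# Targets: every input sequence rotated left by one position, so position i holds token i + 1
y_train = np.array([np.roll(sentence, -1) for sentence in x_train])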
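
One detail of `np.roll` is easy to miss: it rotates the array, so the first token of each sentence ends up at the very end of its target sequence. The sketch below compares it with a plain shift that appends a pad id instead. The toy ids and the name `shifted` are assumptions for illustration, not part of the original code.

```python
import numpy as np

PAD_ID = 1  # id of '<pad>' in the Tokenizer above

# A toy padded sequence of token ids
x = np.array([20, 21, 22, PAD_ID, PAD_ID])

# np.roll shifts left by one but wraps the first token around to the end
rolled = np.roll(x, -1)

# Alternative: drop the first token and append a pad, so nothing wraps around
shifted = np.append(x[1:], PAD_ID)

print(rolled.tolist())   # [21, 22, 1, 1, 20]
print(shifted.tolist())  # [21, 22, 1, 1, 1]
```

For fully padded sentences the two differ only in the last position, but for sentences truncated to `max_length` the wrapped token becomes a misleading final target.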

The generated code also had a mistake, which I corrected before running the training loop. You can find the full code here on my GitHub.
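
One mismatch worth noting in the listings above: the model emits a single softmax per sequence (the LSTM returns only its last state), while `y_train` holds one target id per position. A minimal sketch of a per-position variant follows. It is an illustration under my own assumptions, not necessarily the fix applied in the repository: the value of `hidden_size`, the random stand-in data, and the switch to `return_sequences=True` with `sparse_categorical_crossentropy` are all choices made for this example.

```python
import numpy as np
from keras import Input
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential

# Illustrative values; hidden_size in particular is an assumption
vocab_size, embedding_size, hidden_size, max_length = 40, 64, 128, 40

model = Sequential([
    Input(shape=(max_length,)),
    Embedding(input_dim=vocab_size, output_dim=embedding_size),
    # return_sequences=True yields one hidden state per input position
    LSTM(units=hidden_size, return_sequences=True),
    # Dense is applied to every position, giving a distribution per token
    Dense(units=vocab_size, activation="softmax"),
])
# The sparse loss accepts integer token ids as targets, no one-hot encoding needed
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random stand-in data with the same layout as x_train / y_train
x_demo = np.random.randint(0, vocab_size, size=(4, max_length))
y_demo = np.random.randint(0, vocab_size, size=(4, max_length))
model.fit(x_demo, y_demo, batch_size=2, epochs=1, verbose=0)

print(model.output_shape)  # (None, 40, 40)
```

With this layout, the prediction for the next character is read from the output at the last real (non-pad) position of the input.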

I then asked ChatGPT to write the inference code for the model. Here is what it returned.

def generate_text(seed_text, model, max_gen_length, tokenizer):
    # Tokenize the seed text
    seed_text = tokenizer.tokenize(seed_text)
    # Pad the seed text to the maximum length
    seed_text = tokenizer.pad_seq(seed_text, max_length=30)
    # Initialize an empty list to store the generated words
    generated_text = []
    # Append the seed text to the generated text
    generated_text.extend(seed_text)
    # Loop for the specified number of words to generate
    for i in range(max_gen_length):
        # Use the model to predict the next word
        next_word_probs = model.predict(np.array([seed_text]))[0]
        # Get the index of the most probable next word
        next_word_idx = np.argmax(next_word_probs)
        # Append the generated word to the generated text
        generated_text.append(next_word_idx)
        # update the seed text
        seed_text = np.delete(seed_text, 0)
        seed_text = np.append(seed_text, next_word_idx)
    # Convert the generated text from indices to words
    generated_text = [tokenizer.id_to_token(word) for word in generated_text]
    return "".join(generated_text)

# Initialize the seed text
seed_text = "The sky is"
# Generate new text
generated_text = generate_text(seed_text, model, max_gen_length=10, tokenizer=tokenizer)
print(generated_text)

After making a few changes to the code to suit our tokenizer class and model, I got the following inputs and outputs.

Input - the sky is
Output - the sky is<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
Input - "lion is the king of the jungle"
Output - lion is the king of the jungle<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
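
The `<pad>` tokens show up in the output because the seed is padded before it is copied into the generated text. A minimal sketch of a greedy loop that pads only the model input is shown below. `generate_ids` and `dummy_predict` are hypothetical names, and `dummy_predict` is a stand-in for `model.predict` that assumes one probability distribution per sequence, as the inference code above does.

```python
import numpy as np

PAD_ID = 1  # id of '<pad>' in the Tokenizer above

def generate_ids(seed_ids, predict_fn, num_new, window):
    """Greedy decoding that keeps padding out of the returned ids."""
    generated = list(seed_ids)
    for _ in range(num_new):
        # Use at most the last `window` ids as context
        context = generated[-window:]
        # Pad only the model input, never the text we return
        model_input = context + [PAD_ID] * (window - len(context))
        probs = predict_fn(np.array([model_input]))[0]
        generated.append(int(np.argmax(probs)))
    return generated

def dummy_predict(batch):
    # Hypothetical stand-in for model.predict: always favours id 6 (the character '4')
    probs = np.full((len(batch), 40), 0.01)
    probs[:, 6] = 0.61
    return probs

# 31, 19, 16 are the ids of 't', 'h', 'e' in the Tokenizer above
out = generate_ids([31, 19, 16], dummy_predict, num_new=3, window=40)
print(out)  # [31, 19, 16, 6, 6, 6]
```

Whether the model then produces anything better than a run of one character still depends on how it was trained; this only keeps the padding out of the printed text.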

Sure, the output is terrible, but remember that this is a very basic model architecture trained on a tiny corpus, and we’ve not used transformers or temperature sampling to improve our language model. In future posts, I’ll use ChatGPT to build on these blocks to train bigger and more complex language models.
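
For reference, temperature sampling replaces the `np.argmax` step with a random draw from a rescaled distribution. The sketch below shows one common way to do it, assuming the model returns a probability vector; `sample_with_temperature` is a hypothetical helper, not something from the code above.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample a token id from probs after rescaling by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; the small epsilon avoids log(0)
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-9) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = [0.7, 0.2, 0.1]
rng = np.random.default_rng(0)

# Higher temperature flattens the distribution, so less likely ids appear more often
draws = [sample_with_temperature(probs, temperature=1.5, rng=rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts)

# Very low temperature behaves like argmax
print(sample_with_temperature(probs, temperature=0.01, rng=rng))  # 0
```

A temperature of 1.0 samples from the model's distribution unchanged, values below 1.0 make the output more conservative, and values above 1.0 make it more varied.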

This shows how ChatGPT and similar large language models can help developers write code or build models in a short amount of time.
