Deep learning is usually associated with classification problems such as text or image classification, or with related tasks like segmentation and language modeling. But you can also do simple linear regression with deep learning libraries. I’ve also attached the GitHub Gist in case you want to explore the working notebook.
In this post I’ll go over the model and explain how you can do linear regression with Keras.
In Keras, linear regression can be implemented using the Sequential model and a Dense layer. Here’s an example of how to implement it:
First we take a toy regression problem from scikit-learn datasets.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
Now we need to define the model using Keras. That is actually very simple: you just need a Sequential model with a single Dense layer. The activation for this layer is linear, since we’re building a linear model, and the loss is mean squared error.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# define the model
model = Sequential()
model.add(Dense(units=1, activation='linear'))
# compile the model
model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['mae'])
# fit the model
model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
          epochs=100, batch_size=128)
That’s it. All that is left is to call model.predict(X_test).
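To make the link to classical linear regression concrete, here is a minimal NumPy sketch of what a single linear Dense unit trained on mean squared error amounts to: gradient descent on a weight vector and a bias. The synthetic data, the function name fit_linear_gd, and the learning rate are assumptions made for illustration and are not part of the notebook. The sketch also uses full-batch updates rather than the mini-batch SGD that Keras runs.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.1, epochs=2000):
    """Full-batch gradient descent on mean squared error for y ~ X @ w + b."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        residual = X @ w + b - y                      # prediction error per sample
        w -= lr * (2.0 / n_samples) * (X.T @ residual)  # gradient of MSE w.r.t. w
        b -= lr * (2.0 / n_samples) * residual.sum()    # gradient of MSE w.r.t. b
    return w, b

# Synthetic data (hypothetical, not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

w_gd, b_gd = fit_linear_gd(X_demo, y_demo)

# Closed-form least squares on the same data, for comparison
A = np.column_stack([X_demo, np.ones(len(X_demo))])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("gradient descent:", w_gd, b_gd)
print("least squares:   ", coef[:-1], coef[-1])
```

On this toy data the two sets of coefficients should agree closely, which is the sense in which the Keras model above is "just" linear regression.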
I asked ChatGPT to write a language model. Here is the code that it returned.
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_length))
model.add(LSTM(units=hidden_size))
model.add(Dense(units=vocab_size, activation='softmax'))
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy')
# Fit the model to the training data
model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs)
So I decided to build a language model using it, but before that I had to write a couple of lines of code myself. First, the Tokenizer.
class Tokenizer():
    def __init__(self,
                 oov_token='<unk>',
                 pad_token='<pad>'):
        self.vocab = {}
        self.reverse_vocab = {}
        self.oov_token = oov_token
        self.pad_token = pad_token
        self.__add_to_dict(self.oov_token)
        self.__add_to_dict(self.pad_token)
        # Add digits and lowercase letters to the dictionary
        for i in range(10):
            self.__add_to_dict(str(i))
        for i in range(26):
            self.__add_to_dict(chr(ord('a') + i))
        # Add space and punctuation to the dictionary
        self.__add_to_dict('.')
        self.__add_to_dict(' ')
    def __add_to_dict(self, character):
        # Assign the next free id to a character seen for the first time
        if character not in self.vocab:
            self.vocab[character] = len(self.vocab)
            self.reverse_vocab[self.vocab[character]] = character
    def tokenize(self, text):
        # Characters outside the vocabulary map to the OOV token instead of raising KeyError
        oov_id = self.vocab[self.oov_token]
        return [self.vocab.get(c, oov_id) for c in text]
    def detokenize(self, text):
        return [self.reverse_vocab[c] for c in text]
    def get_vocabulary(self):
        return self.vocab
    def vocabulary_size(self):
        return len(self.vocab)
    def token_to_id(self, character):
        return self.vocab[character]
    def id_to_token(self, token):
        return self.reverse_vocab[token]
    def pad_seq(self, seq, max_len):
        # Truncate to max_len, then right-pad with the pad token id
        return seq[:max_len] + [self.token_to_id(self.pad_token)] * (max_len - len(seq))
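As a quick sanity check of the scheme, here is a compact, standalone sketch of the same character-level encoding with a round trip through encode, pad, and decode. The names VOCAB, CHAR_TO_ID, encode, decode, and pad are hypothetical helpers written for this illustration; they mirror the vocabulary order of the class above but are not part of it, and the fallback to '<unk>' for unknown characters is a choice made in this sketch.

```python
import string

# Same vocabulary order as the Tokenizer above: specials, digits, letters, '.', ' '
VOCAB = ['<unk>', '<pad>'] + list(string.digits) + list(string.ascii_lowercase) + ['.', ' ']
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}

def encode(text):
    """Map each character to its id; unknown characters fall back to '<unk>'."""
    return [CHAR_TO_ID.get(ch, CHAR_TO_ID['<unk>']) for ch in text]

def decode(ids):
    """Map ids back to characters and join them into one string."""
    return ''.join(VOCAB[i] for i in ids)

def pad(ids, max_len):
    """Truncate to max_len, then right-pad with the '<pad>' id."""
    return ids[:max_len] + [CHAR_TO_ID['<pad>']] * (max_len - len(ids))

ids = encode("a dog")
padded = pad(ids, 8)
print(len(VOCAB))      # 40
print(ids)             # [12, 39, 15, 26, 18]
print(decode(padded))  # a dog<pad><pad><pad>
```

The vocabulary has 40 entries, which is also the value the config below uses for vocab_size.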
Then I added the config, created a small corpus of text, and prepared the data needed to train the model. To prepare the training data, I also asked ChatGPT how the corpus should be created and built it the way it showed me.
t = Tokenizer()
vocab_size = t.vocabulary_size()
embedding_size = 64
max_length = vocab_size
num_epochs = 50
batch_size = 16
corpus = ["this is a dog",
          "dogs live with humans",
          "they are called what is known as pets",
          "cats are also pets",
          "there are also wolves in the jungle",
          "there are many animals in the jungle",
          "the lion is called the king of the jungle",
          "the largest animal in the jungle is the elephant",
          "we have horses also there",
          "foxes are quite quick and they are also very clever",
          "the zebra also looks like a horse but is not domesticated",
          "we go to the zoo to look at monkeys",
          "we all live together on this earth",
          "monkeys like bananas",
          "pandas eat bamboo",
          "tigers are dangerous",
          "zebras have stripes",
          "lions are the kings of the savannah",
          "it is upon humans to take care of this planet"]
tokenized_corpus = [t.tokenize(line) for line in corpus]
#Padding the tokenized corpus
x_train = np.array([t.pad_seq(sentence, max_length) for sentence in tokenized_corpus])
y_train = np.array([np.roll(sentence, -1) for sentence in x_train])
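To see what the np.roll call does to a single row, here is a small sketch on a made-up padded sequence. The ids in x_row are hypothetical and not taken from the corpus, and PAD_ID = 1 assumes the Tokenizer above, where '<pad>' is the second entry added to the vocabulary. The no-wrap variant at the end is an alternative shown only for comparison, not what the post uses.

```python
import numpy as np

PAD_ID = 1  # id of '<pad>' in the Tokenizer above (assumption: second entry added)

# A hypothetical padded sequence of token ids
x_row = np.array([31, 19, 20, 30, PAD_ID, PAD_ID])

# np.roll(..., -1) shifts every id one position to the left,
# so position i of the target holds the id at position i + 1 of the input ...
y_row = np.roll(x_row, -1)
# ... and the first id wraps around to the last position
print(y_row.tolist())      # [19, 20, 30, 1, 1, 31]

# Alternative for comparison: fill the last slot with padding instead of wrapping
y_no_wrap = np.append(x_row[1:], PAD_ID)
print(y_no_wrap.tolist())  # [19, 20, 30, 1, 1, 1]
```

The wraparound means the last target position always holds the first character of the sentence, which is worth keeping in mind when reading the results later on.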
ChatGPT’s code also had a mistake, which I corrected before running the training loop. You can find the full code here on my GitHub.
I then asked ChatGPT to write the inference code for the model. Here is what it returned.
def generate_text(seed_text, model, max_gen_length, tokenizer):
    # Tokenize the seed text
    seed_text = tokenizer.tokenize(seed_text)
    # Pad the seed text to the maximum length
    seed_text = tokenizer.pad_seq(seed_text, max_length=30)
    # Initialize an empty list to store the generated words
    generated_text = []
    # Append the seed text to the generated text
    generated_text.extend(seed_text)
    # Loop for the specified number of words to generate
    for i in range(max_gen_length):
        # Use the model to predict the next word
        next_word_probs = model.predict(np.array([seed_text]))[0]
        # Get the index of the most probable next word
        next_word_idx = np.argmax(next_word_probs)
        # Append the generated word to the generated text
        generated_text.append(next_word_idx)
        # update the seed text
        seed_text = np.delete(seed_text, 0)
        seed_text = np.append(seed_text, next_word_idx)
    # Convert the generated text from indices to words
    generated_text = [tokenizer.id_to_token(word) for word in generated_text]
    return "".join(generated_text)
# Initialize the seed text
seed_text = "The sky is"
# Generate new text
generated_text = generate_text(seed_text, model, max_gen_length=10, tokenizer=tokenizer)
print(generated_text)
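The core of this function is a greedy loop over a fixed-size sliding window: predict the next id, append it to the output, drop the oldest id from the window, and repeat. Here is a minimal sketch of just that loop, with a toy function standing in for model.predict plus argmax. Both greedy_generate and toy_predict_next are hypothetical names introduced for this illustration.

```python
def greedy_generate(seed_ids, predict_next, num_steps):
    """Greedy decoding with a fixed-size sliding window over token ids."""
    window = list(seed_ids)      # what the model sees at each step
    generated = list(seed_ids)   # seed followed by everything generated so far
    for _ in range(num_steps):
        next_id = predict_next(window)
        generated.append(next_id)
        # Slide the window: drop the oldest id, append the new one
        window = window[1:] + [next_id]
    return generated

def toy_predict_next(window):
    """Hypothetical stand-in for the model: returns (last id + 1) modulo 5."""
    return (window[-1] + 1) % 5

print(greedy_generate([0, 1, 2], toy_predict_next, num_steps=4))
# [0, 1, 2, 3, 4, 0, 1]
```

Because the window always keeps the same length, the model input shape stays constant across steps, which is what the np.delete and np.append pair does in the code above.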
After making a few changes to the code to suit our tokenizer class and model, here are a few inputs and outputs.
Input - the sky is
Output - the sky is<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
Input - "lion is the king of the jungle"
Output - lion is the king of the jungle<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
Sure, the output is terrible, but remember that this is a very basic model architecture, and we haven’t used transformers or temperature sampling to improve our language model. In my future posts, I’ll use ChatGPT to build upon these blocks to train even bigger and more complex language models.
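For readers who have not met temperature sampling, here is a rough sketch of the idea, assuming the model returns a probability vector over the vocabulary: rescale the log-probabilities by a temperature, renormalize, and sample from the result instead of always taking the argmax. The function names apply_temperature and sample_next_id and the example probabilities are hypothetical; this is not code from the notebook.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12)  # avoid log(0)
    scaled = logits / temperature
    scaled -= scaled.max()            # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def sample_next_id(probs, temperature=1.0, rng=None):
    """Draw one token id from the temperature-adjusted distribution."""
    if rng is None:
        rng = np.random.default_rng()
    adjusted = apply_temperature(probs, temperature)
    return int(rng.choice(len(adjusted), p=adjusted))

# Hypothetical next-token probabilities over a 3-token vocabulary
next_probs = [0.1, 0.2, 0.7]
rng = np.random.default_rng(0)
for temperature in (0.05, 1.0, 100.0):
    print(temperature, apply_temperature(next_probs, temperature).round(3),
          sample_next_id(next_probs, temperature, rng))
```

A very low temperature behaves almost like the argmax used in the inference code above, while a very high temperature approaches uniform sampling. Whether this would help a model as small as the one in this post is an open question.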
This shows how ChatGPT and similar large language models can help developers write code or develop models in a short amount of time. It is