Tag: ChatGPT

How does ChatGPT remember? LLM Memory Explained.
In the fascinating world of conversational AI, the ability of systems like ChatGPT to remember and refer back to earlier parts of a conversation is nothing short of magic. But how does this seemingly simple act of recollection work under the hood? Let’s dive into the concept of memory in large language models (LLMs) and uncover the mechanisms that enable these digital conversationalists to keep track of our chats.

The Essence of Memory in Conversational AI

Memory in conversational AI systems is the ability to store and recall information from earlier interactions. This capability is crucial for maintaining the context and coherence of a conversation, allowing the LLM to reference past exchanges and build upon them meaningfully. It also makes the LLM appear intelligent, when in reality the model is stateless and has no built-in memory.

LangChain, a framework for building conversational AI applications, highlights the importance of memory in these systems. It distinguishes between two fundamental actions that a memory system needs to support: reading and writing.

In practice, the stored memory is added to the prompt alongside your new input, so the LLM can process the request as if it had all the context from the get-go.
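A minimal sketch of this idea, assuming a hypothetical `fake_llm` function that stands in for a real model call (the function names and the `Human:`/`AI:` prompt layout are illustrative, not any library's API). The application, not the model, keeps the history and re-sends it on every turn:

```python
# Illustrative only: `fake_llm` is a hypothetical stand-in for a real model API.
def fake_llm(prompt: str) -> str:
    """Pretend model: reports how many lines of context it received."""
    return f"(reply based on {len(prompt.splitlines())} lines of context)"


def build_prompt(history: list[tuple[str, str]], user_input: str) -> str:
    """Prepend the stored conversation to the new user message."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"Human: {user_input}")
    lines.append("AI:")
    return "\n".join(lines)


def chat(history: list[tuple[str, str]], user_input: str) -> str:
    """One turn: read memory, call the stateless model, write memory."""
    prompt = build_prompt(history, user_input)  # read
    reply = fake_llm(prompt)
    history.append(("Human", user_input))       # write
    history.append(("AI", reply))
    return reply


history: list[tuple[str, str]] = []
chat(history, "My name is Sam.")
chat(history, "What is my name?")
```

On the second call the prompt already contains "My name is Sam.", which is the only reason a stateless model could answer the question.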

Building Memory into Conversational Systems

The development of an effective memory system involves two key design decisions: how the state is stored and how it is queried.

Storing: The Backbone of Memory

Underneath any memory system lies a history of all chat interactions. The storage can range from a simple in-memory list to a sophisticated persistent database. Storing is the simple part: you can keep all past conversations in a database, either as plain text documents or as embeddings in a vector database.
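As a rough sketch of the plain-text option, the snippet below keeps messages in SQLite using Python's standard `sqlite3` module. The table layout and the `save_message` / `load_history` helpers are assumptions made for illustration, not part of LangChain or any other framework:

```python
import sqlite3

# In-memory database for the sketch; pass a file path to persist across runs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "session_id TEXT, role TEXT, content TEXT)"
)


def save_message(session_id: str, role: str, content: str) -> None:
    """Append one chat message to the history table."""
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()


def load_history(session_id: str) -> list[tuple[str, str]]:
    """Return all (role, content) pairs for a session, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return rows.fetchall()


save_message("demo", "human", "Hi, I am Sam.")
save_message("demo", "ai", "Hello Sam!")
print(load_history("demo"))
```

A vector database would follow the same write/read shape, but store an embedding next to each message so that history can be looked up by similarity instead of by order.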

Querying: The Brain of Memory

Storing chat messages is only one part of the equation. The real magic happens in the querying phase, where data structures and algorithms work together to present a view of the message history that is most useful for the current context. This might involve returning the most recent messages, summarizing past interactions, or extracting and focusing on specific entities mentioned in the conversation.
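Two of the simplest query strategies can be sketched in a few lines. The helper names and the character budget below are hypothetical, and a real system would usually count tokens rather than characters:

```python
def recent_messages(history: list[str], k: int) -> list[str]:
    """Return the last k messages, oldest first."""
    return history[-k:] if k > 0 else []


def within_budget(history: list[str], max_chars: int) -> list[str]:
    """Keep the newest messages whose combined length fits max_chars."""
    kept: list[str] = []
    used = 0
    for message in reversed(history):
        if used + len(message) > max_chars:
            break
        kept.append(message)
        used += len(message)
    return list(reversed(kept))


history = ["Hi, I am Sam.", "Hello Sam!", "I like tea.", "Noted."]
print(recent_messages(history, 2))
print(within_budget(history, 20))
```

Both calls return only the two newest messages here. Summarising or entity extraction would replace these functions with something that condenses the older messages instead of dropping them.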

Practical Implementation with LangChain

Here we will take a look at one way to store memory using LangChain.
```
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
```
Now you can attach this memory to any LLM chain, and it will add the entire previous conversation as context for the LLM each time the chain is invoked. The advantage of this kind of memory is that it’s simple to implement. The disadvantage is that in longer conversations you pass more and more tokens and the input prompt keeps growing, which means slower responses and, if you’re using paid models like GPT-4, higher costs.
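To get a feel for how quickly the prompt grows, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every human and AI message is 50 tokens long and that the full history is re-sent on each turn; `tokens_sent` is a hypothetical helper:

```python
def tokens_sent(turns: int, tokens_per_message: int) -> list[int]:
    """Prompt size at each turn when the full history is re-sent.

    Assumes every human and AI message has the same length, and counts
    the history (two messages per completed turn) plus the new input.
    """
    return [
        (2 * completed + 1) * tokens_per_message
        for completed in range(turns)
    ]


per_turn = tokens_sent(turns=5, tokens_per_message=50)
total = sum(per_turn)
print(per_turn)  # [50, 150, 250, 350, 450]
print(total)     # 1250
```

The prompt for a single turn grows linearly with the length of the conversation, so the total number of input tokens billed over the whole conversation grows roughly with the square of the number of turns.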

Conclusion

The ability of systems like ChatGPT to remember past interactions is a cornerstone of effective chatbots. By leveraging sophisticated memory systems, developers can create applications that not only understand the current context but can also draw on previous exchanges to provide more coherent and engaging responses. As we continue to push the boundaries of what conversational AI can achieve, the exploration and enhancement of memory mechanisms will remain a critical area of focus.
March 3, 2024

Large Language Model (LLM) Evaluation Metrics – BLEU and ROUGE
How are language models evaluated? In traditional machine learning, we have metrics like accuracy, F1-score, precision, and recall. But how can you objectively measure a model’s performance when the label is "I like to drink coffee over tea" and the model’s output is "I prefer coffee to tea"? As humans, we can clearly see that the two have the same meaning, but how can a machine make the same evaluation?

Well, there are two approaches –
1. BLEU – Bilingual Evaluation Understudy is a metric used to evaluate the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the reference translations based on the n-grams (contiguous sequences of n words) present in both. BLEU score ranges from 0 to 1, with a higher score indicating a better match between the generated translation and the references. A score of 1 means a perfect match, while a score of 0 means no overlap between the generated and reference translations.
2. ROUGE – Recall-Oriented Understudy for Gisting Evaluation is a widely used evaluation metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries. ROUGE calculates precision and recall scores by comparing the n-gram units (such as words or sequences of words) in the generated summary with those in the reference summaries. It focuses on the recall score, which measures how much of the important information from the reference summaries is captured by the generated summary.
Let us take an example and calculate both the metrics. Suppose the label is "The cat sat on the mat." and the model’s output is "The dog slept on the couch."

Here is the Python code to calculate the ROUGE score –
```
from collections import Counter
import re


def generate_ngrams(text, n):
    # Preprocess text by removing punctuation and converting to lowercase
    text = re.sub(r'[^\w\s]', '', text.lower())

    # Generate n-grams from the preprocessed text
    words = text.split()
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]


def calculate_ROUGE(generated_summary, reference_summary, n):
    # Count the n-grams in the generated summary and the reference summary
    generated_ngrams = Counter(generate_ngrams(generated_summary, n))
    reference_ngrams = Counter(generate_ngrams(reference_summary, n))

    # Recall = overlapping n-grams / n-grams in the reference.
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so repeated n-grams are counted consistently on both sides.
    matching_ngrams = sum((generated_ngrams & reference_ngrams).values())
    total_reference_ngrams = sum(reference_ngrams.values())
    if total_reference_ngrams == 0:
        return 0.0  # the reference has fewer than n words

    return matching_ngrams / total_reference_ngrams


# Example usage
generated_summary = "The dog slept on the couch."
reference_summary = "The cat sat on the mat."
n = 2  # bigram

rouge_score = calculate_ROUGE(generated_summary, reference_summary, n)
print(f"ROUGE-{n} score: {rouge_score}")
# Output: ROUGE-2 score: 0.2
```
Only one of the five reference bigrams, "on the", also appears in the model’s output, so with n = 2 (bigrams) the ROUGE-2 score is 1/5 = 0.2.
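ROUGE is usually reported with precision and F1 next to recall. The sketch below extends the same n-gram counting to all three numbers; `ngram_counts` and `rouge_n` are names chosen for this illustration and are not taken from any ROUGE package:

```python
from collections import Counter
import re


def ngram_counts(text: str, n: int) -> Counter:
    """Lowercase, strip punctuation, and count the n-grams in text."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def rouge_n(generated: str, reference: str, n: int) -> dict[str, float]:
    """ROUGE-N precision, recall, and F1 using clipped n-gram overlap."""
    gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
    overlap = sum((gen & ref).values())  # min count per shared n-gram
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("The dog slept on the couch.", "The cat sat on the mat.", 2)
print(scores)
```

Because both sentences have five bigrams and share exactly one, precision, recall, and F1 all come out at about 0.2 for this pair. They diverge as soon as the generated text and the reference differ in length.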

Similarly, let’s calculate the BLEU score –
```
import nltk.translate.bleu_score as bleu


def calculate_BLEU(generated_summary, reference_summary, n):
    # Tokenize the generated summary and reference summary on whitespace
    generated_tokens = generated_summary.split()
    reference_tokens = reference_summary.split()

    # Calculate the BLEU score
    weights = [1.0 / n] * n  # Equal weights for the 1-gram to n-gram precisions
    bleu_score = bleu.sentence_bleu([reference_tokens], generated_tokens, weights=weights)

    return bleu_score


# Example usage
generated_summary = "The dog slept on the couch."
reference_summary = "The cat sat on the mat."
n = 2  # Bigram

bleu_score = calculate_BLEU(generated_summary, reference_summary, n)
print(f"BLEU-{n} score: {bleu_score}")
# Output: BLEU-2 score: 0.316227766016838
```
So, we get two different scores from these two different approaches.
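To see where the BLEU number comes from, the sketch below recomputes it by hand with only the standard library. It is a simplified single-reference version written for this example (no smoothing, hypothetical helper names), not NLTK's implementation:

```python
from collections import Counter
import math


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision: matches are capped by reference counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum((cand & ref).values())
    return clipped / max(sum(cand.values()), 1)


def bleu(candidate: list[str], reference: list[str], max_n: int) -> float:
    """Uniformly weighted BLEU for one candidate and one reference."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: only penalises candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)


candidate = "The dog slept on the couch.".split()
reference = "The cat sat on the mat.".split()
p1 = modified_precision(candidate, reference, 1)  # 3 of 6 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 1 of 5 bigrams match
score = bleu(candidate, reference, 2)             # sqrt(0.5 * 0.2)
print(p1, p2, score)
```

BLEU-2 is the geometric mean of the unigram precision (0.5) and the bigram precision (0.2), which is about 0.316, while ROUGE-2 above looks only at bigram recall (0.2). That is why the two metrics disagree on the same pair of sentences.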

The evaluation metric you choose when fine-tuning your LLM will depend on the task at hand, but usually BLEU is used for machine translation and ROUGE is used for summarisation tasks.
July 8, 2023

I asked ChatGPT to write a language model

I asked ChatGPT to write a language model. Here is the code that it returned.

```
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential

# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_length))
model.add(LSTM(units=hidden_size))
model.add(Dense(units=vocab_size, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Fit the model to the training data
model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs)
```

So I decided to build a language model using it, but first I had to write a few pieces of code myself. First, the Tokenizer.

```
class Tokenizer:
    """Character-level tokenizer over digits, lowercase letters, '.' and ' '."""

    def __init__(self, oov_token='<unk>', pad_token='<pad>'):
        self.vocab = {}
        self.reverse_vocab = {}
        self.oov_token = oov_token
        self.pad_token = pad_token
        self.__add_to_dict(self.oov_token)
        self.__add_to_dict(self.pad_token)
        for i in range(10):
            self.__add_to_dict(str(i))
        for i in range(26):
            self.__add_to_dict(chr(ord('a') + i))

        # Add space and punctuation to the dictionary
        self.__add_to_dict('.')
        self.__add_to_dict(' ')

    def __add_to_dict(self, character):
        # Assign the next free id to a character that is not in the vocabulary yet
        if character not in self.vocab:
            self.vocab[character] = len(self.vocab)
            self.reverse_vocab[self.vocab[character]] = character

    def tokenize(self, text):
        # Characters outside the vocabulary (e.g. uppercase letters) map to the OOV token
        oov_id = self.vocab[self.oov_token]
        return [self.vocab.get(c, oov_id) for c in text]

    def detokenize(self, ids):
        return [self.reverse_vocab[i] for i in ids]

    def get_vocabulary(self):
        return self.vocab

    def vocabulary_size(self):
        return len(self.vocab)

    def token_to_id(self, character):
        return self.vocab[character]

    def id_to_token(self, token):
        return self.reverse_vocab[token]

    def pad_seq(self, seq, max_len):
        # Truncate to max_len, then right-pad with the pad token id
        return seq[:max_len] + [self.token_to_id(self.pad_token)] * (max_len - len(seq))
```

Then I added the config, created a small corpus of text, and prepared the training data needed to train the model. I also asked ChatGPT how the corpus should be created and built it the way it showed me.

```
import numpy as np

t = Tokenizer()
vocab_size = t.vocabulary_size()
embedding_size = 64
max_length = vocab_size
num_epochs = 50
batch_size = 16

corpus = ["this is a dog",
          "dogs live with humans",
          "they are called what is known as pets",
          "cats are also pets",
          "there are also wolves in the jungle",
          "there are many animals in the jungle",
          "the lion is called the king of the jungle",
          "the largest animal in the jungle is the elephant",
          "we have horses also there",
          "foxes are quite quick and they are also very clever",
          "the zebra also looks like a horse but is not domesticated",
          "we go to the zoo to look at monkeys",
          "we all live together on this earth",
          "monkeys like bananas",
          "pandas eat bamboo",
          "tigers are dangerous",
          "zebras have stripes",
          "lions are the kings of the savannah",
          "it is upon humans to take care of this planet"]
tokenized_corpus = [t.tokenize(line) for line in corpus]
# Pad (or truncate) every tokenized sentence to max_length
x_train = np.array([t.pad_seq(sentence, max_length) for sentence in tokenized_corpus])

# Targets are the inputs shifted one position to the left (next-character prediction)
y_train = np.array([np.roll(sentence, -1) for sentence in x_train])
```
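The relationship between `x_train` and `y_train` is easier to see on a tiny example. The sketch below uses plain Python lists and a made-up four-entry vocabulary (`toy_vocab` is hypothetical, not the vocabulary built by the Tokenizer class) to mimic what `np.roll(sentence, -1)` does to one padded sentence:

```python
def shift_left(sequence: list[int]) -> list[int]:
    """Rotate a sequence one step left, like np.roll(sequence, -1)."""
    return sequence[1:] + sequence[:1]


# Hypothetical toy vocabulary, not the one built by the Tokenizer class.
toy_vocab = {"<pad>": 0, "c": 1, "a": 2, "t": 3}
x = [toy_vocab[ch] for ch in "cat"] + [toy_vocab["<pad>"]] * 2
y = shift_left(x)
print(x)  # [1, 2, 3, 0, 0]
print(y)  # [2, 3, 0, 0, 1]
```

Each position in `y` holds the character that follows the same position in `x`. Note that the rotation wraps around, so the last target is the first character of the sentence, and most targets in a short sentence are padding.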

The generated code also had a mistake, which I corrected before running the training loop. You can find the full code here on my GitHub.

I then asked ChatGPT to write the inference code for the model. Here is what it returned.

```
def generate_text(seed_text, model, max_gen_length, tokenizer):
    # Tokenize the seed text
    seed_text = tokenizer.tokenize(seed_text)
    # Pad the seed text to the maximum length
    seed_text = tokenizer.pad_seq(seed_text, max_length=30)
    # Initialize an empty list to store the generated words
    generated_text = []
    # Append the seed text to the generated text
    generated_text.extend(seed_text)
    # Loop for the specified number of words to generate
    for i in range(max_gen_length):
        # Use the model to predict the next word
        next_word_probs = model.predict(np.array([seed_text]))[0]
        # Get the index of the most probable next word
        next_word_idx = np.argmax(next_word_probs)
        # Append the generated word to the generated text
        generated_text.append(next_word_idx)
        # update the seed text
        seed_text = np.delete(seed_text, 0)
        seed_text = np.append(seed_text, next_word_idx)
    # Convert the generated text from indices to words
    generated_text = [tokenizer.id_to_token(word) for word in generated_text]
    return "".join(generated_text)

# Initialize the seed text
seed_text = "The sky is"
# Generate new text
generated_text = generate_text(seed_text, model, max_gen_length=10, tokenizer=tokenizer)
print(generated_text)
```

After I made a few changes to the code to suit our tokenizer class and model, here are a few inputs and outputs.

Input - the sky is
Output - the sky is<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
Input - "lion is the king of the jungle"
Output - lion is the king of the jungle<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444

Sure, the output is terrible, but remember that it is a very basic model architecture, and we haven’t used transformers or temperature sampling to improve our language model. In future posts, I’ll use ChatGPT to build on these blocks to train bigger and more complex language models.
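For reference, temperature sampling replaces the `np.argmax` step with a random draw from a rescaled distribution. The sketch below is a standalone illustration in pure Python; the scores in `logits` are made up, the helper names are hypothetical, and it assumes access to raw scores rather than the softmax output of the model above:

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Draw a token id at random in proportion to its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for three tokens
rng = random.Random(0)    # seeded so the run is repeatable
greedy = max(range(len(logits)), key=logits.__getitem__)
samples = [sample_token(logits, 1.0, rng) for _ in range(200)]
print(greedy, sorted(set(samples)))
```

Greedy decoding always picks the same token for the same input, which is one reason the model above repeats a single character. Sampling with a temperature around 1 spreads the choices across tokens, and a temperature close to 0 behaves like greedy decoding again.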

This shows how ChatGPT and similar large language models can help developers write code or develop models in a short amount of time. It is

January 15, 2023
