Tag: NLP

  • Ultimate Guide to Chroma Vector Database: Everything You Need to Know – Part 1

    In this tutorial, we will walk through how to use Chromadb as your vector database for all your Retrieval-Augmented Generation (RAG) tasks.

    But before that, you need to install Chromadb. If you’re using Python, all you need to do is:

    pip install chromadb

    Now that you’ve installed Chromadb, let’s begin. We will use a PDF file as an example. For the PDF we will be using this research paper, but feel free to use the PDF of your choice.

    The first step is to create a persistent client, i.e., the storage which can be used at multiple places. While creating the client remember to add the setting to allow resetting the client should you require this functionality.

    import chromadb
    
    client = chromadb.PersistentClient(
                path="<path of persistent storage>", settings=chromadb.config.Settings(allow_reset=True)
            )

    Once you have a client set up, you can define collections within it. If you use BigQuery or any SQL product, think of the client as the project and the collection as the dataset. Within the collection, you store documents as embeddings. In this example, we will call our collection “research”.

    One very important thing to remember is that each collection should have a single, fixed embedding function. The query is also converted to an embedding when you search for the most similar documents, so if you use embedding function X to add the documents and embedding function Y to query them, the similarity scores will be meaningless. We will be using the OpenAI text-embedding-3-small model.

    Another point to remember is that a single document should contain no more tokens than the embedding function can embed. For example, all-MiniLM-L6-v2 from HuggingFace handles a maximum sequence length of 256 tokens, so if you try to vectorise a longer document, it will simply be clipped to the first 256 tokens before embedding. The OpenAI model accepts much longer inputs; OpenAI’s embeddings guide lists a maximum input of roughly 8K tokens for text-embedding-3-small.

    import os
    from chromadb.utils import embedding_functions
    
    # Defining the embedding function
    embedding_func = embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ.get("OPENAI_API_KEY"),
                                                                 model_name="text-embedding-3-small")
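
To see why the sequence limit matters, here is a minimal sketch of what clipping does to two long documents. It assumes whitespace-separated words as a rough stand-in for model tokens (real embedding models use subword tokenizers), and `clip_to_max_tokens` is a hypothetical helper, not part of Chromadb.

```python
def clip_to_max_tokens(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens whitespace-separated words of text."""
    words = text.split()
    return " ".join(words[:max_tokens])


# Two documents that share their first 256 words and differ only at the end.
shared_prefix = " ".join(f"word{i}" for i in range(256))
doc_a = shared_prefix + " the conclusion is positive"
doc_b = shared_prefix + " the conclusion is negative"

clipped_a = clip_to_max_tokens(doc_a, 256)
clipped_b = clip_to_max_tokens(doc_b, 256)

# After clipping, the differing endings are gone, so a model limited to
# 256 tokens would embed both documents identically.
print(clipped_a == clipped_b)  # True
```

This is the reason to split long files into chunks that fit the model before adding them to a collection.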
    

    Next, create the collection. It’s best practice to specify the embedding function when creating the collection; otherwise, Chromadb falls back to its default embedding function, which is based on sentence-transformers.
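
The rest of this post uses a `collection` object, so here is a minimal sketch of how it could be created. It assumes the `client` and `embedding_func` defined earlier and uses the client's `get_or_create_collection` method; the wrapper name `get_research_collection` is hypothetical and only there to keep the example self-contained.

```python
def get_research_collection(client, embedding_func, name="research"):
    """Return the collection called `name`, creating it if it does not exist.

    Passing embedding_function here ties the collection to one embedding
    function, so documents and queries are embedded the same way.
    """
    return client.get_or_create_collection(
        name=name,
        embedding_function=embedding_func,
    )


# Usage with the objects defined earlier in this post:
# collection = get_research_collection(client, embedding_func)
```

Using get-or-create rather than a plain create means the snippet can be re-run against the same persistent storage without failing because the collection already exists.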

    Now we need to add a document to this collection. For this, we will use some helper functions from langchain.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader
    
    data_path = "./data/2311.04635v1.pdf"
    
    pdf_loader = PyPDFLoader(data_path)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    documents = pdf_loader.load_and_split(text_splitter=text_splitter)

    The PyPDFLoader loads the PDF file and the RecursiveCharacterTextSplitter splits it into chunks. We are using a chunk size of 1000 with an overlap of 50. By default the splitter measures length in characters, not tokens, so each chunk will be at most roughly 1000 characters long, with about 50 characters of overlapping text between consecutive chunks. You can learn more about how the text splitter works here.
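
To make the chunk size and overlap concrete, here is a simplified sketch of a fixed-size sliding window over characters. It is not how RecursiveCharacterTextSplitter is implemented (the real splitter prefers to break on the separators you pass in); `split_with_overlap` is a hypothetical helper that only illustrates what the two numbers mean.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into chunks of chunk_size characters.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a little shared context at each boundary, so a sentence cut in half by one chunk is still partly visible in the next.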

    Now that we have our documents loaded, it’s time to add them to the Chromadb collection. Since we’ve already specified the embedding function in the collection, we can simply add the text chunks and they will be stored as embeddings. You have to specify “ids”; think of them as primary keys in a SQL table. You can also specify metadata for each document. Both are useful when you want to upsert or delete documents in the collection.

    from datetime import datetime, timezone

    collection.add(
        documents=[doc.page_content for doc in documents],
        ids=[f"pdf_chunk_{i}" for i in range(len(documents))],
        metadatas=[
            {
                "file_name": "research_paper",
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            for _ in documents
        ],
    )

    Here we use the page contents of the loaded documents as documents. Since they are text, the collection will automatically convert them into embeddings using its embedding function. If you already have embeddings, you can add them directly instead. I’ve also specified some ids, which are very rudimentary here for illustration purposes, and some metadata for each document. Both can be used to query, upsert or delete individual documents in the vector database.
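
If the positional ids above feel too rudimentary, one possible alternative is to derive each id from the chunk's content, so the same chunk always gets the same id when you re-run the pipeline. This is only a sketch of that idea, not something Chromadb requires; `make_chunk_id` is a hypothetical helper.

```python
import hashlib


def make_chunk_id(file_name, chunk_text):
    """Build a stable id from the file name and a hash of the chunk text."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()[:16]
    return f"{file_name}_{digest}"


first = make_chunk_id("research_paper", "Attention is computed over all tokens.")
second = make_chunk_id("research_paper", "Attention is computed over all tokens.")
other = make_chunk_id("research_paper", "A different chunk of text.")

print(first == second)  # True: same content, same id
print(first == other)   # False: different content, different id
```

One caveat with this approach: two chunks with identical text in the same file would produce the same id, so you would need to deduplicate them or add the chunk position to the id.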

    Read more about how you can upsert documents, query a collection and delete individual documents in Part II of this Ultimate Guide to ChromaDB

  • How does ChatGPT remember? LLM Memory Explained.

    In the fascinating world of conversational AI, the ability of systems like ChatGPT to remember and refer back to earlier parts of a conversation is nothing short of magic. But how does this seemingly simple act of recollection work under the hood? Let’s dive into the concept of memory in large language models (LLMs) and uncover the mechanisms that enable these digital conversationalists to keep track of our chats.

    The Essence of Memory in Conversational AI

    Memory in conversational AI systems is about the ability to store and recall information from earlier interactions. This capability is crucial for maintaining the context and coherence of a conversation, allowing the LLM to reference past exchanges and build upon them meaningfully. It also gives the appearance that the LLM is intelligent, when in reality LLMs are stateless and have no built-in memory.

    LangChain, a framework for building conversational AI applications, highlights the importance of memory in these systems. It distinguishes between two fundamental actions that a memory system needs to support: reading and writing.

    In practice, the stored memory is passed to the LLM as additional context alongside your input in the prompt, so that it can process the request as if it had all the context from the get-go.
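
A minimal sketch of this idea, with no LLM involved: the earlier turns are simply pasted in front of the new input. The `Human:`/`AI:` labels and the `build_prompt` helper are assumptions for illustration, not a format that any particular model or library requires.

```python
def build_prompt(history, user_input):
    """Prepend earlier (speaker, text) turns to the new user input."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"Human: {user_input}")
    lines.append("AI:")  # the model is expected to continue from here
    return "\n".join(lines)


history = [
    ("Human", "My name is Sam."),
    ("AI", "Nice to meet you, Sam."),
]
prompt = build_prompt(history, "What is my name?")
print(prompt)
```

The model only "remembers" the name because the earlier turn is physically present in the prompt it receives.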

    Building Memory into Conversational Systems

    The development of an effective memory system involves two key design decisions: how the state is stored and how it is queried.

    Storing: The Backbone of Memory

    Underneath any memory system lies a history of all chat interactions. This can range from a simple in-memory list to a sophisticated persistent database. Storage is simple: you can store all past conversations in a database, either as plain text documents or as embeddings in a vector database.

    Querying: The Brain of Memory

    Storing chat messages is only one part of the equation. The real magic happens in the querying phase, where data structures and algorithms work together to present a view of the message history that is most useful for the current context. This might involve returning the most recent messages, summarizing past interactions, or extracting and focusing on specific entities mentioned in the conversation.
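
Two of these querying strategies can be sketched in a few lines of plain Python. Both helpers below are hypothetical and deliberately naive: the first returns the most recent messages, the second does a simple keyword match as a stand-in for entity extraction.

```python
def recent_messages(history, k):
    """Return the last k messages (a sliding window over the history)."""
    return history[-k:] if k > 0 else []


def messages_mentioning(history, keyword):
    """Return every message that mentions keyword, ignoring case."""
    needle = keyword.lower()
    return [message for message in history if needle in message.lower()]


history = [
    "Human: I adopted a dog called Rex.",
    "AI: Congratulations on adopting Rex!",
    "Human: What is the weather like today?",
    "AI: I cannot check live weather.",
]

print(recent_messages(history, 2))
print(messages_mentioning(history, "rex"))
```

A summarising memory would follow the same pattern, except that the view returned would be a condensed text produced by another LLM call rather than a slice of the raw history.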

    Practical Implementation with LangChain

    Here we will take a look at one way to store memory using LangChain.

    from langchain.memory import ConversationBufferMemory

    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

    Now you can attach this memory to any LLM chain, and it will add the entire previous conversation as context to the LLM on each chain invocation. The advantage of this kind of memory is that it’s simple to implement. The disadvantage is that in longer conversations you pass more and more tokens and the input prompt grows with every turn, meaning slower responses and, if you’re using paid models like GPT-4, higher costs.
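
The growth is easy to see with a rough calculation. The sketch below counts whitespace-separated words as a crude proxy for tokens (an assumption; real token counts depend on the model's tokenizer) and tracks how much history is sent on each turn when the whole buffer is replayed.

```python
def cumulative_prompt_words(turns):
    """Return the history size (in words) sent with each successive turn."""
    sizes = []
    total = 0
    for turn in turns:
        total += len(turn.split())
        sizes.append(total)
    return sizes


turns = ["hello there"] * 5  # five turns of two words each
sizes = cumulative_prompt_words(turns)

print(sizes)       # [2, 4, 6, 8, 10]
print(sum(sizes))  # 30 words sent in total across the five calls
```

Each individual prompt grows linearly with the number of turns, so the total number of tokens sent over a whole conversation grows roughly quadratically. Windowed or summarising memories exist to keep that in check.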

    Conclusion

    The ability of systems like ChatGPT to remember past interactions is a cornerstone of effective chatbots. By leveraging sophisticated memory systems, developers can create applications that not only understand the current context but can also draw on previous exchanges to provide more coherent and engaging responses. As we continue to push the boundaries of what conversational AI can achieve, the exploration and enhancement of memory mechanisms will remain a critical area of focus.

  • Fine Tune Llama-2-13b on a single GPU on custom data.

    In this tutorial, we will walk through each step of fine-tuning the Llama-2-13b model on a single GPU. I’ll be using a Colab notebook, but you can use your local machine as long as it has around 12 GB of VRAM.

    The required libraries can be installed by running this in your notebook.

    !pip install -q transformers trl peft huggingface_hub datasets bitsandbytes accelerate

    First, log in to your Hugging Face account.

    from huggingface_hub import login
    login("<your token here>")

    Loading the tokenizer.

    model_id = "meta-llama/Llama-2-13b-chat-hf"
    import torch
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    
    from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, BitsAndBytesConfig
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    Now we will load the model in its quantised form. This reduces the memory needed to hold the model, so it can run on a single GPU.

    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False)

    If you have more GPU memory to spare, you can load the model in 8-bit instead. Adjust this configuration based on your hardware specifications.

    model = AutoModelForCausalLM.from_pretrained(model_id,  quantization_config=bnb_config, use_cache=False)

    The lines below prepare the model for 4-bit or 8-bit training; without them, you get an error.

    from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
    
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    Then you define your LoRA config. The two main parameters to tune are the rank (r) and lora_alpha. For more details, you can read about the params here.

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        inference_mode=False, 
        r=64, 
        lora_alpha=32, 
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)
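
To give some intuition for what r and lora_alpha do, here is a toy sketch of the LoRA forward pass using plain Python lists. It follows the standard LoRA formulation, where the frozen weight W is left untouched and a low-rank update B·A scaled by lora_alpha / r is added on top. The tiny matrices are made up for illustration; this is not how PEFT implements it internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight, 2 x 2
A = [[1.0, 1.0]]              # r x d, with r = 1
B_zero = [[0.0], [0.0]]       # d x r, zero-initialised
x = [1.0, 2.0]

# With B at zero the adapter contributes nothing, so training starts
# from exactly the pretrained behaviour.
print(lora_forward(W, A, B_zero, x, r=1, lora_alpha=2))  # [1.0, 2.0]

# With the config used in this post, the update is scaled by 32 / 64.
print(32 / 64)  # 0.5
```

Raising lora_alpha relative to r makes the adapter's contribution larger, which is why the two are usually tuned together.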
    

    The cell below defines a helper function that shows how many parameters are trainable.

    def print_trainable_parameters(model):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(
            f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
        )
    
    print_trainable_parameters(model)
    
    >>> trainable params: 52428800 || all params: 6724408320 || trainable%: 0.7796790067620403

    We can see that with LoRA there are very few parameters to train.
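
The trainable count can be reproduced by hand. The sketch below assumes that the adapters are attached to the query and value projections of every layer (the usual PEFT default for Llama models when no target modules are given) and that Llama-2-13b has 40 layers with a hidden size of 5120. Both are assumptions that are not stated in this post.

```python
def lora_param_count(num_layers, r, module_shapes):
    """Count LoRA parameters: each adapted module adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes)
    return num_layers * per_layer


hidden_size = 5120  # assumed hidden size of Llama-2-13b
num_layers = 40     # assumed number of transformer layers
adapted = [(hidden_size, hidden_size)] * 2  # q_proj and v_proj

trainable = lora_param_count(num_layers, r=64, module_shapes=adapted)
print(trainable)  # 52428800

# Share of the total reported by print_trainable_parameters above.
print(round(100 * trainable / 6724408320, 2))  # 0.78
```

Since the count scales linearly with r, halving the rank to 32 would halve the number of trainable parameters.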

    To prepare your data, you can keep it in any form you like, as long as it is loaded as a Hugging Face datasets Dataset. You can pass a formatting function to the trainer, which combines all the text fields of each example into a single string.

    Here you can change the training configuration. With LoRA you can start with a higher learning rate because the original weights are frozen, which makes catastrophic forgetting less of a concern. The arguments you will most likely tune are per_device_train_batch_size and gradient_accumulation_steps: when you run out of memory, lower per_device_train_batch_size and increase gradient_accumulation_steps.

    max_seq_length = 512
    
    from transformers import TrainingArguments, EarlyStoppingCallback
    from trl import SFTTrainer
    output_dir = "./results"
    optim = "paged_adamw_32bit"
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        optim=optim,
        learning_rate=1e-4,
        logging_steps=10,
        max_steps=300,
        warmup_ratio=0.3,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        save_total_limit=5,
        fp16=True,
    )
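
The trade-off between the two batch arguments comes down to simple arithmetic: the number of examples contributing to each optimizer step is their product. The helper below is a hypothetical illustration, with `num_devices` assumed to be 1 for the single-GPU setup in this post.

```python
def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# The configuration used above.
print(effective_batch_size(8, 4))   # 32

# Out of memory? Halve the batch and double the accumulation steps.
print(effective_batch_size(4, 8))   # 32
print(effective_batch_size(2, 16))  # 32
```

Keeping the product constant means each optimizer step still sees the same number of examples, while peak memory falls with the smaller per-device batch. The cost is more forward and backward passes per step.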
    

    Here I’m writing an example of a formatting function. My data already had a text field which had all the text data.

    def format_function(example):
        return example['text']

    If you don’t have a single text field, you can write the function so that it combines the relevant fields and returns them as one string.
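
As a sketch of that second case, suppose each example has separate `instruction` and `response` fields. Both field names and the prompt template are assumptions for illustration, not something your dataset or trl requires. The function simply joins them into one training string.

```python
def format_instruction_example(example):
    """Combine hypothetical 'instruction' and 'response' fields into one string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )


example = {
    "instruction": "Translate 'bonjour' to English.",
    "response": "Hello.",
}
print(format_instruction_example(example))
```

Depending on the trl version, the trainer may call the formatting function on a batch of examples and expect a list of strings back, so check the behaviour of the version you have installed.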

    Now we define the trainer.

    from trl import SFTTrainer
    peft_trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        args=training_args,
        formatting_func=format_function)
    
    peft_trainer.train()

    Once the model has been trained, you can store it locally or push it to the Hugging Face Hub.

    Hope this tutorial cleared any doubts you had around fine-tuning LLMs on a single GPU.

  • I asked ChatGPT to write a language model

    I asked ChatGPT to write a language model. Here is the code that it returned.

    from keras.layers import Embedding, LSTM, Dense
    from keras.models import Sequential
    
    # Define the model
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_length))
    model.add(LSTM(units=hidden_size))
    model.add(Dense(units=vocab_size, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    
    # Fit the model to the training data
    model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs)
    
    

    So I decided to build a language model using it, but first I had to write a couple of lines of code myself. First, the Tokenizer.

    class Tokenizer():
        def __init__(self, 
                     oov_token ='<unk>',
                     pad_token ='<pad>'):
            self.vocab = {}
            self.reverse_vocab = {}
            self.oov_token = oov_token
            self.pad_token = pad_token
            self.__add_to_dict(self.oov_token)
            self.__add_to_dict(self.pad_token)
            for i in range(10):
                self.__add_to_dict(str(i))
            for i in range(26):
                self.__add_to_dict(chr(ord('a') + i))
    
            # Add space and punctuation to the dictionary
            self.__add_to_dict('.')
            self.__add_to_dict(' ')
        
        def __add_to_dict(self, character):
            if character not in self.vocab:
                self.vocab[character] = len(self.vocab)
                self.reverse_vocab[self.vocab[character]] = character
            
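        def tokenize(self, text):
            # Fall back to the out-of-vocabulary token for unknown characters
            oov_id = self.vocab[self.oov_token]
            return [self.vocab.get(c, oov_id) for c in text]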
    
        def detokenize(self, text):
            return [self.reverse_vocab[c] for c in text]
        
        def get_vocabulary(self):
            return self.vocab
        
        def vocabulary_size(self):
            return len(self.vocab)
        
        def token_to_id(self,character):
            return self.vocab[character]
        
        def id_to_token(self , token):
            return self.reverse_vocab[token]
        
        def pad_seq(self,seq, max_len):
            return seq[:max_len] + [self.token_to_id(self.pad_token)]*(max_len-len(seq))
    

    Then I added the config, created a small corpus of text and prepared the training data needed to train the model. I also asked ChatGPT how the corpus should be created, and built it the way it showed me.

    import numpy as np
    
    t = Tokenizer()
    vocab_size = t.vocabulary_size()
    embedding_size = 64
    max_length = vocab_size
    num_epochs = 50
    batch_size = 16
    
    corpus = ["this is a dog",
             "dogs live with humans", 
             "they are called what is known as pets",
             "cats are also pets",
             "there are also wolves in the jungle",
             "there are many animals in the jungle",
             "the lion is called the king of the jungle",
             "the largest animal in the jungle is the elephant",
             "we have horses also there",
             "foxes are quite quick and they are also very clever",
             "the zebra also looks like a horse but is not domesticated",
             "we go to the zoo to look at monkeys",
             "we all live together on this earth",
             'monkeys like bananas',
             'pandas eat bamboo',
             'tigers are dangerous',
             'zebras have stripes',
             'lions are the kings of the savannah',
             "it is upon humans to take care of this planet"]
    tokenized_corpus = [t.tokenize(line) for line in corpus]
    #Padding the tokenized corpus
    x_train = np.array([t.pad_seq(sentence, max_length) for sentence in tokenized_corpus])
    
    y_train = np.array([np.roll(sentence, -1) for sentence in x_train])
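
The `np.roll(sentence, -1)` call above builds the targets by rotating each padded sequence one position to the left, so the target at position i is the token that follows position i in the input. Here is the same operation on a plain list; the token ids are made up and `roll_left` is a hypothetical helper.

```python
def roll_left(sequence):
    """Rotate a list one position to the left, like np.roll(sequence, -1)."""
    return sequence[1:] + sequence[:1]


inputs = [7, 8, 9, 1, 1]  # made-up token ids, with 1 standing in for padding
targets = roll_left(inputs)
print(targets)  # [8, 9, 1, 1, 7]
```

Note the wrap-around: the first token ends up as the target for the last position. That is harmless when the sequence ends in padding, but it is worth knowing about.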
    
    
    

    The code also had a mistake, which I corrected before running the training loop. You can find the full code here on my GitHub.

    I then asked ChatGPT to write the inference code for the model. Here is what it returned.

    def generate_text(seed_text, model, max_gen_length, tokenizer):
        # Tokenize the seed text
        seed_text = tokenizer.tokenize(seed_text)
        # Pad the seed text to the maximum length
        seed_text = tokenizer.pad_seq(seed_text, max_length=30)
        # Initialize an empty list to store the generated words
        generated_text = []
        # Append the seed text to the generated text
        generated_text.extend(seed_text)
        # Loop for the specified number of words to generate
        for i in range(max_gen_length):
            # Use the model to predict the next word
            next_word_probs = model.predict(np.array([seed_text]))[0]
            # Get the index of the most probable next word
            next_word_idx = np.argmax(next_word_probs)
            # Append the generated word to the generated text
            generated_text.append(next_word_idx)
            # update the seed text
            seed_text = np.delete(seed_text, 0)
            seed_text = np.append(seed_text, next_word_idx)
        # Convert the generated text from indices to words
        generated_text = [tokenizer.id_to_token(word) for word in generated_text]
        return "".join(generated_text)
    
    # Initialize the seed text
    seed_text = "The sky is"
    # Generate new text
    generated_text = generate_text(seed_text, model, max_gen_length=10, tokenizer=tokenizer)
    print(generated_text)
    
    

    Making a few changes to the code to suit our tokenizer class and model, here are a few inputs and outputs.

    Input - the sky is
    Output - the sky is<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444
    Input - "lion is the king of the jungle"
    Output - lion is the king of the jungle<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>4444444444

    Sure, the output is terrible, but remember that this is a very basic model architecture, and we have not used transformers or temperature sampling to improve our language model. In my future posts, I’ll use ChatGPT to build upon these blocks to train even bigger and more complex language models.
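
For reference, temperature sampling replaces the `np.argmax` step in the inference code with a random draw from a reshaped distribution. Below is a minimal sketch using only the standard library. It works on raw scores (logits), so for a model that already outputs softmax probabilities you would take their logarithm first. Both helper names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature, rng):
    """Draw one token index at random according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)  # close to greedy decoding
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform

rng = random.Random(0)  # seeded so the draw is repeatable
print(sample_next_token(logits, temperature=1.0, rng=rng))
```

Because the draw is random, the model no longer has to pick the same most likely character at every step, which is one way to break out of repeated output like the run of 4s above.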

    This shows how ChatGPT or similar large language models can help developers write code or develop models in a short amount of time. It is