Tag: LLM

  • Fine Tune Llama-2-13b on a single GPU on custom data.

    In this tutorial, we will walk through each step of fine-tuning the Llama-2-13b model on a single GPU. I’ll be using a Colab notebook, but you can use your local machine; it just needs around 12 GB of VRAM.

    The required libraries can be installed by running this in your notebook.

    !pip install -q transformers trl peft huggingface_hub datasets bitsandbytes accelerate

    First, log in to your Hugging Face account.

    from huggingface_hub import login
    login("<your token here>")

    Loading the tokenizer.

    model_id = "meta-llama/Llama-2-13b-chat-hf"
    import torch
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    
    from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, BitsAndBytesConfig
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    Now we load the model in its quantised form. This reduces the memory needed to fit the model, so it can run on a single GPU.

    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False)

    If you have a bit more GPU memory to play with, you can load the model in 8-bit instead. Tune this configuration based on your hardware specifications.
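    For reference, here is a minimal sketch of the 8-bit variant (the rest of this tutorial sticks with the 4-bit config above):

    # Alternative: 8-bit quantisation – needs more VRAM than 4-bit, but is less lossy
    # bnb_config = BitsAndBytesConfig(load_in_8bit=True)

    Either way, the model is then loaded with the chosen config.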

    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False)

    The lines below prepare the model for 4- or 8-bit training; skipping this step will result in an error.

    from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
    
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    Then you define your LoRA config. The two parameters you will mainly play around with are the rank (r) and lora_alpha. For more details, you can read about the params here.

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        inference_mode=False, 
        r=64, 
        lora_alpha=32, 
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)
    

    The cell below defines a helper function that prints how many trainable parameters there are.

    def print_trainable_parameters(model):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(
            f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
        )
    
    print_trainable_parameters(model)
    
    >>> trainable params: 52428800 || all params: 6724408320 || trainable%: 0.7796790067620403

    We can see that with LoRA, only a small fraction of the parameters need to be trained.

    To prepare your data, you can keep it in almost any form you like, as long as it is loaded with the datasets library. You can then pass a formatting function to the trainer, which combines the text parts of each example into a single string.
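    As a rough sketch, assuming your examples live in a local JSON Lines file with a text field (the file name here is just a placeholder), loading it with datasets looks like this:

    from datasets import load_dataset
    
    # Hypothetical file: one JSON object per line, each with a "text" field
    dataset = load_dataset("json", data_files="my_custom_data.jsonl", split="train")
    print(dataset[0]["text"])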

    Here you can change the training configuration. For LoRA you can start with a higher learning rate, since the original weights are frozen and you don’t have to worry about catastrophic forgetting. The arguments you will want to play around with are per_device_train_batch_size and gradient_accumulation_steps: when you run out of memory, lower per_device_train_batch_size and increase gradient_accumulation_steps.

    max_seq_length = 512
    
    from transformers import TrainingArguments
    from trl import SFTTrainer
    output_dir = "./results"
    optim = "paged_adamw_32bit"
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        optim=optim,
        learning_rate=1e-4,
        logging_steps=10,
        max_steps=300,
        warmup_ratio=0.3,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        save_total_limit = 5,
        fp16=True
        
    )
    

    Here is an example of a formatting function. My data already had a text field containing all the text.

    def format_function(example):
        # My dataset already has a 'text' column, so just return it as-is
        return example['text']

    But in case you don’t have a single text field, you can write the function so that it combines the relevant fields into one string.
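    For example, here is a rough sketch of an alternative format_function, assuming hypothetical instruction and response columns (the field names and prompt template are placeholders, not from my dataset):

    def format_function(examples):
        # Batched version: build one prompt/response string per example in the batch
        output_texts = []
        for instruction, response in zip(examples["instruction"], examples["response"]):
            output_texts.append(f"### Instruction:\n{instruction}\n\n### Response:\n{response}")
        return output_texts

    Note that the trainer typically calls the formatting function on batches of examples, so it should return a list of strings (which is also why returning example['text'] directly works above).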

    Now we define the trainer.

    from trl import SFTTrainer
    peft_trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        args=training_args,
        formatting_func=format_function)
    
    peft_trainer.train()

    Once the model has been trained, you can save it locally or push it to the Hugging Face Hub.
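    A minimal sketch of both options (the Hub repository name is a placeholder):

    # Save the trained LoRA adapter locally
    peft_trainer.save_model(output_dir)
    
    # Or push the adapter to the Hugging Face Hub (requires being logged in, as above)
    peft_trainer.model.push_to_hub("<your-username>/llama-2-13b-custom-lora")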

    I hope this tutorial cleared up any doubts you had about fine-tuning LLMs on a single GPU.

  • Fine Tune Llama-2-7b with a custom dataset on Google Colab

    I’ll add the code and explanations as text here, but everything is explained in the YouTube video.

    Link to the Colab notebook.

  • Large Language Model (LLM) Evaluation Metrics – BLEU and ROUGE

    How are language models evaluated? In traditional machine learning, we have metrics like accuracy, F1-score, precision, recall, etc. But how can you objectively measure how well the model performed when the label is "I like to drink coffee over tea" and the model’s output is "I prefer coffee to tea"? As humans, we can clearly see that these two have the same meaning, but how can a machine make the same evaluation?

    Well, there are two approaches –

    1. BLEU – Bilingual Evaluation Understudy is a metric used to evaluate the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the reference translations based on the n-grams (contiguous sequences of n words) present in both. BLEU score ranges from 0 to 1, with a higher score indicating a better match between the generated translation and the references. A score of 1 means a perfect match, while a score of 0 means no overlap between the generated and reference translations.
    2. ROUGE – Recall-Oriented Understudy for Gisting Evaluation is a widely used evaluation metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries. ROUGE calculates the precision and recall scores by comparing the n-gram units (such as words or sequences of words) in the generated summary with those in the reference summaries. It focuses on the recall score, which measures how much of the important information from the reference summaries is captured by the generated summary.

    Let us take an example and calculate both the metrics. Suppose the label is "The cat sat on the mat." and the model’s output is "The dog slept on the couch."

    Here is the Python code to calculate the ROUGE score –

    import re
    
    
    def calculate_ROUGE(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary into n-grams
        generated_ngrams = generate_ngrams(generated_summary, n)
        reference_ngrams = generate_ngrams(reference_summary, n)
    
        # Calculate the recall score
        matching_ngrams = len(set(generated_ngrams) & set(reference_ngrams))
        recall_score = matching_ngrams / len(reference_ngrams)
    
        return recall_score
    
    
    def generate_ngrams(text, n):
        # Preprocess text by removing punctuation and converting to lowercase
        text = re.sub(r'[^\w\s]', '', text.lower())
    
        # Generate n-grams from the preprocessed text
        words = text.split()
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
    
        return ngrams
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # bigram
    
    rouge_score = calculate_ROUGE(generated_summary, reference_summary, n)
    print(f"ROUGE-{n} score: {rouge_score}")
    >> ROUGE-2 score: 0.2
    

    If we use n = 2, which means bigrams, then the ROUGE-2 score is 0.2.

    Similarly, let’s calculate the BLEU score –

    import nltk.translate.bleu_score as bleu
    
    
    def calculate_BLEU(generated_summary, reference_summary, n):
        # Tokenize the generated summary and reference summary
        generated_tokens = generated_summary.split()
        reference_tokens = reference_summary.split()
    
        # Calculate the BLEU score
        weights = [1.0 / n] * n  # Weights for n-gram precision calculation
        bleu_score = bleu.sentence_bleu([reference_tokens], generated_tokens, weights=weights)
    
        return bleu_score
    
    
    # Example usage
    generated_summary = "The dog slept on the couch."
    reference_summary = "The cat sat on the mat."
    n = 2  # Bigram
    
    bleu_score = calculate_BLEU(generated_summary, reference_summary, n)
    print(f"BLEU-{n} score: {bleu_score}")
    >> BLEU-2 score: 0.316227766016838
    

    So, we get two different scores from these two different approaches.

    Which evaluation metric you choose for your fine-tuned LLM will depend on the task at hand, but usually BLEU is used for machine translation and ROUGE for summarisation tasks.
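
    In practice, instead of hand-rolling these calculations, you can also use the Hugging Face evaluate library. Here is a rough sketch (it assumes pip install evaluate rouge_score for the ROUGE metric):

    import evaluate
    
    predictions = ["The dog slept on the couch."]
    references = ["The cat sat on the mat."]
    
    # ROUGE: returns rouge1 / rouge2 / rougeL scores as a dict
    rouge = evaluate.load("rouge")
    print(rouge.compute(predictions=predictions, references=references))
    
    # BLEU: each prediction can have multiple references, hence the nested list
    bleu = evaluate.load("bleu")
    print(bleu.compute(predictions=predictions, references=[references]))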