In the video below, we go over how to calculate the value of a state when the actions are probabilistic.
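To summarize the model used in the snippet below: each move goes in the intended direction with probability 0.5, and the remaining 0.5 is split evenly across all four directions. Since the immediate reward for a non-terminal transition is 0, the update implemented in calculate_state_value is

V(s) = max over actions a of  gamma * [ 0.5 * V(next(s, a)) + sum over actions a' of 0.125 * V(next(s, a')) ]

where next(s, a) is the cell reached by taking action a from state s (the agent stays put if the move would leave the grid). This is the Bellman optimality update for this transition model.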
If you are wondering how to get the values for all states, here is the code snippet that does it.
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple


class StochasticGridWorld:
    def __init__(self, size: int = 3, gamma: float = 0.9):
        self.size = size
        self.gamma = gamma

        # Initialize state values; terminal states hold their rewards
        self.values = np.zeros((size, size))
        self.values[0, 2] = -1  # Cat
        self.values[2, 2] = 1   # Cheese

        # Track value history for convergence visualization
        self.value_history = {(i, j): [] for i in range(size) for j in range(size)}

        # Movement probabilities
        self.p_intended = 0.5    # Probability of moving in the intended direction
        self.p_random = 0.5 / 4  # Remaining probability split among all four directions

    def get_next_state(self, current_state: Tuple[int, int],
                       action: Tuple[int, int]) -> Tuple[int, int]:
        """Calculate the next state given the current state and an action."""
        next_i = current_state[0] + action[0]
        next_j = current_state[1] + action[1]
        # If the move would leave the grid, stay in place
        if 0 <= next_i < self.size and 0 <= next_j < self.size:
            return (next_i, next_j)
        return current_state

    def get_possible_actions(self) -> List[Tuple[int, int]]:
        """Return all possible actions as (di, dj) offsets."""
        return [(0, 1), (0, -1), (1, 0), (-1, 0)]  # Right, Left, Down, Up

    def calculate_state_value(self, state: Tuple[int, int]) -> float:
        """Calculate the value of a state by maximizing over all actions."""
        if state == (0, 2) or state == (2, 2):  # Terminal states keep their values
            return self.values[state]

        max_value = float('-inf')
        actions = self.get_possible_actions()
        for action in actions:
            value = 0  # Immediate reward is 0 for non-terminal transitions

            # Intended movement
            next_state = self.get_next_state(state, action)
            value += self.p_intended * self.values[next_state]

            # Random movements
            for random_action in actions:
                random_next_state = self.get_next_state(state, random_action)
                value += self.p_random * self.values[random_next_state]

            value = self.gamma * value  # Apply discount factor
            max_value = max(max_value, value)
        return max_value

    def value_iteration(self, num_iterations: int = 100,
                        threshold: float = 1e-4) -> np.ndarray:
        """Perform value iteration and store the value history."""
        for iteration in range(num_iterations):
            delta = 0
            new_values = np.copy(self.values)
            for i in range(self.size):
                for j in range(self.size):
                    if (i, j) not in [(0, 2), (2, 2)]:  # Skip terminal states
                        old_value = self.values[i, j]
                        new_values[i, j] = self.calculate_state_value((i, j))
                        delta = max(delta, abs(old_value - new_values[i, j]))
                        self.value_history[(i, j)].append(new_values[i, j])
            self.values = new_values

            # Check convergence
            if delta < threshold:
                print(f"Converged after {iteration + 1} iterations")
                break
        return self.values

    def plot_convergence(self):
        """Plot value convergence for each non-terminal state."""
        plt.figure(figsize=(12, 8))
        for state, history in self.value_history.items():
            if state not in [(0, 2), (2, 2)]:  # Skip terminal states
                plt.plot(history, label=f'State {state}')
        plt.title('Value Convergence Over Iterations')
        plt.xlabel('Iteration')
        plt.ylabel('State Value')
        plt.legend()
        plt.grid(True)
        plt.show()


# Run the simulation
grid_world = StochasticGridWorld()
final_values = grid_world.value_iteration(num_iterations=100)
print("\nFinal Values:")
print(np.round(final_values, 3))
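If you also want to see how each state's value settles over the iterations, you can call the plotting helper defined above after running value iteration:

grid_world.plot_convergence()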