Tag: Data Science

  • From Certain to Uncertain | Stochastic Bellman Equation Made Easy

    In the video below, we will go over how to calculate the value of a state when the actions are probabilistic.
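
    For reference, the update the code below implements is the stochastic Bellman optimality equation with a zero immediate reward (only the terminal cat and cheese cells carry value), where the agent moves in the intended direction with probability 0.5 and the remaining 0.5 is split evenly across the four directions:

    V(s) = \max_{a} \, \gamma \sum_{s'} P(s' \mid s, a) \, V(s')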

    If you're wondering how to get the values for all the states, here is the code snippet for it.

    import numpy as np
    import matplotlib.pyplot as plt
    from typing import List, Tuple
    
    class StochasticGridWorld:
        def __init__(self, size: int = 3, gamma: float = 0.9):
            self.size = size
            self.gamma = gamma
            # Initialize states
            self.values = np.zeros((size, size))
            self.values[0, 2] = -1  # Cat
            self.values[2, 2] = 1   # Cheese
            
            # Track value history for convergence visualization
            self.value_history = {(i, j): [] for i in range(size) for j in range(size)}
            
            # Movement probabilities
            self.p_intended = 0.5  # Probability of moving in intended direction
            self.p_random = 0.5 / 4  # Split remaining probability among all directions
            
        def get_next_state(self, current_state: Tuple[int, int], 
                           action: Tuple[int, int]) -> Tuple[int, int]:
            """Calculate next state given current state and action"""
            next_i = current_state[0] + action[0]
            next_j = current_state[1] + action[1]
            
            # Check if next state is within grid
            if 0 <= next_i < self.size and 0 <= next_j < self.size:
                return (next_i, next_j)
            return current_state
        
        def get_possible_actions(self) -> List[Tuple[int, int]]:
            """Return all possible actions as (dx, dy)"""
            return [(0, 1), (0, -1), (1, 0), (-1, 0)]  # Right, Left, Down, Up
        
        def calculate_state_value(self, state: Tuple[int, int]) -> float:
            """Calculate value for a given state considering all actions"""
            if state == (0, 2) or state == (2, 2):  # Terminal states
                return self.values[state]
            
            max_value = float('-inf')
            actions = self.get_possible_actions()
            
            for action in actions:
                value = 0 # We know this as the immediate reward is 0
                # Intended movement
                next_state = self.get_next_state(state, action)
                value += self.p_intended * self.values[next_state]
                
                # Random movements
                for random_action in actions:
                    random_next_state = self.get_next_state(state, random_action)
                    value += self.p_random * self.values[random_next_state]
                
                value = self.gamma * value  # Apply discount factor
                max_value = max(max_value, value)
                
            return max_value
        
        def value_iteration(self, num_iterations: int = 100, 
                           threshold: float = 1e-4) -> np.ndarray:
            """Perform value iteration and store history"""
            for iteration in range(num_iterations):
                delta = 0
                new_values = np.copy(self.values)
                
                for i in range(self.size):
                    for j in range(self.size):
                        if (i, j) not in [(0, 2), (2, 2)]:  # Skip terminal states
                            old_value = self.values[i, j]
                            new_values[i, j] = self.calculate_state_value((i, j))
                            delta = max(delta, abs(old_value - new_values[i, j]))
                            self.value_history[(i, j)].append(new_values[i, j])
                
                self.values = new_values
                
                # Check convergence
                if delta < threshold:
                    print(f"Converged after {iteration + 1} iterations")
                    break
            
            return self.values
        
        def plot_convergence(self):
            """Plot value convergence for each non-terminal state"""
            plt.figure(figsize=(12, 8))
            for state, history in self.value_history.items():
                if state not in [(0, 2), (2, 2)]:  # Skip terminal states
                    plt.plot(history, label=f'State {state}')
            
            plt.title('Value Convergence Over Iterations')
            plt.xlabel('Iteration')
            plt.ylabel('State Value')
            plt.legend()
            plt.grid(True)
            plt.show()
    
    # Run the simulation
    grid_world = StochasticGridWorld()
    final_values = grid_world.value_iteration(num_iterations=100)
    
    print("\nFinal Values:")
    print(np.round(final_values, 3))
    
  • Exploring Data Distribution Differences in Machine Learning: An Adversarial Approach

    First, a shout-out to Santiago, whose tweet inspired this post.

    In the realm of machine learning, ensuring that models perform well not only on training data but also on unseen test data is crucial. A common challenge that arises is the difference in data distribution between training and testing datasets, known as dataset shift. This discrepancy can significantly degrade the performance of a model when deployed in real-world scenarios. To tackle this issue, researchers and practitioners have developed various methods to detect and quantify differences in data distribution. One innovative approach is the adversarial method, which leverages concepts from adversarial training to assess and address these differences.

    Understanding Dataset Shift

    Before diving into the adversarial methods, it is essential to understand what dataset shift entails. Dataset shift occurs when the joint distribution of inputs and outputs differs between the training and testing phases. This shift can be categorised into several types, including covariate shift, prior probability shift, and concept shift, each affecting the model in different ways.

    • Covariate Shift: The distribution of input features changes between the training and testing datasets.
    • Prior Probability Shift: The distribution of the output variable changes.
    • Concept Shift: The relationship between the input features and the output variable changes.

    Detecting and correcting for these shifts is crucial for developing robust machine learning models.

    Adversarial Methods for Detecting Dataset Shift

    Adversarial methods for dataset shift detection are inspired by adversarial training in neural networks, where models are trained to be robust against intentionally crafted malicious input. Similarly, in dataset shift detection, these methods involve creating a scenario where a model tries to distinguish between training and testing data based on their data distributions.

    The way to do this is –

    1. Combine your train and test data.
    2. Create a new column, where you label training data as 1 and test data as 0.
    3. Train a classifier on this using your new column as the target.

    If the data in both train and test comes from the same distribution, the AUC will be close to 0.5, but if they are from different distributions, then the model will learn to differentiate the data points and the AUC will be close to 1.

    Example

    In this example, the training data will contain height and weight in metres and kilograms, while the test data will contain the same measurements in centimetres and grams. If we then train a simple logistic regression on a dummy target, which is 1 on the training set and 0 on the test set, then, given that we are not scaling the variables, the model should achieve an AUC close to 1.

    #Loading required libraries
    import numpy as np 
    import pandas as pd
    import seaborn as sns
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from matplotlib import pyplot as plt
    

    Then we define our features for train and test

    # Set random seed for reproducibility
    np.random.seed(42)
    
    # Generate synthetic data
    # Training data (height in meters, weight in kilograms)
    train_height = np.random.normal(1.75, 0.1, 1000)  # Average height 1.75 meters
    train_weight = np.random.normal(70, 10, 1000)    # Average weight 70 kg
    
    # Test data (height in centimeters, weight in grams)
    test_height = train_height * 100  # Convert meters to centimeters
    test_weight = train_weight * 1000  # Convert kilograms to grams
    

    Once we've defined our features, all we need to do is create a combined dataset, train our classifier and check the AUC score.

    # Combine data into feature matrices
    X_train = np.column_stack((train_height, train_weight))
    X_test = np.column_stack((test_height, test_weight))
    
    # Create labels: 1 for training data, 0 for test data
    y_train = np.ones(X_train.shape[0])
    y_test = np.zeros(X_test.shape[0])
    
    # Combine into a single dataset
    X = np.vstack((X_train, X_test))
    y = np.concatenate((y_train, y_test))
    
    # Train logistic regression model
    model = LogisticRegression()
    model.fit(X, y)
    
    # Predict probabilities for ROC AUC calculation
    y_pred_proba = model.predict_proba(X)[:, 1]
    
    # Calculate AUC
    auc = roc_auc_score(y, y_pred_proba)
    print(f"The AUC is: {auc:.2f}")
    
    

    The AUC here comes out to be 1.0 as expected. Since the train and test data comes from different distributions, the model was easily able to identify the difference in the distribution between train and test.
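
    Since the classifier above is scored on the same data it was fitted on, a slightly more careful variant is to estimate this adversarial AUC with cross-validation. Here is a minimal sketch of that, reusing the combined X and y from above.

    # A minimal sketch: estimate the adversarial AUC with 5-fold cross-validation
    # instead of scoring the classifier on the data it was fitted on.
    from sklearn.model_selection import cross_val_score
    
    cv_auc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring='roc_auc')
    print(f"The cross-validated AUC is: {cv_auc.mean():.2f}")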

    Using this approach you can also easily test whether the train and test data come from the same distribution.

  • Mastering Time: Unlock Hyper-Parameter Tuning with Time Series Cross-Validation

    We all know how to do hyper-parameter tuning using scikit-learn, but you might be struggling with how to tune your hyper-parameters using time-series cross-validation. First, let's understand what time-series cross-validation is.

    Time series cross-validation is a technique used to evaluate the performance of predictive models on time-ordered data. Unlike traditional cross-validation methods, which randomly split the dataset into training and testing sets, time series cross-validation maintains the chronological order of observations. This approach is crucial for time series data, where the relationship between past and future data points is essential for accurate predictions.

    In time series cross-validation, the dataset is split into a series of training and testing sets over time. For example, in a simple walk-forward validation, the model might be trained on the first year of data and tested on the following month, then trained on the first year plus one month and tested on the next month, and so on. This method allows for the evaluation of the model's performance over different time intervals, ensuring that the model can adapt to changes in the data over time.

    We will be utilising TimeSeriesSplit from scikit-learn to get these splits on our data.
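
    To see what TimeSeriesSplit actually produces, here is a minimal sketch on a toy array of 20 rows (the numbers are made up purely for illustration): each successive fold trains on a longer prefix of the data and validates on the block that follows it.

    # A minimal sketch of the expanding-window splits that TimeSeriesSplit generates.
    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit
    
    toy = np.arange(20).reshape(-1, 1)  # 20 time-ordered dummy rows
    for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(toy)):
        print(f"Fold {fold}: train rows {train_idx.min()}-{train_idx.max()}, "
              f"test rows {test_idx.min()}-{test_idx.max()}")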

    Suppose we have our train and test data ready with all the features, and the data also has a timestamp column. The first step is to set this column as the index and sort the dataframe.

    # Supposing X is our dataframe and timestamp_ is the column name which has the time-related information.
    import pandas as pd
    
    X.set_index(keys='timestamp_', drop=True, inplace=True)
    X.sort_index(inplace=True)
    y = X[<target col>]
    X.drop([<target col>], axis=1, inplace=True)

    Once you've sorted the DataFrame, you need to create your hyper-parameter grid. For this, too, we will use scikit-learn, and we will also use it to create the time-series splits. You can write this to run in parallel, but since this is a demo example, we will use for loops. First, though, we will write a training function, assuming our task is a classification one and we're using CatBoost.

    from catboost import CatBoostClassifier
    import pandas as pd
    import numpy as np
    from sklearn.metrics import roc_auc_score
    
    def train(param: dict, X: pd.DataFrame, y: pd.Series, train_index: np.ndarray, test_index: np.ndarray) -> float:
        X_train, X_val = X.iloc[train_index], X.iloc[test_index]
        y_train, y_val = y.iloc[train_index], y.iloc[test_index]
        
        model = CatBoostClassifier(max_depth=param['max_depth'],
                                   subsample=param['subsample'],
                                   verbose=0)  # Set verbose to 0 for silent training
        
        model.fit(X_train, y_train,
                  eval_set=(X_val, y_val))
        
        # Predict probabilities for the positive class
        y_pred_proba = model.predict_proba(X_val)[:, 1]
        
        # Calculate AUC score
        score = roc_auc_score(y_val, y_pred_proba)
        
        return score

    Here the function takes the parameter dictionary, the feature matrix, the labels and the indices which we get from TimeSeriesSplit, and then fits a model. I have used AUC as an example metric, but you're free to use any metric. After this, all we need to do is run the training over all possible combinations of parameters and keep track of the best score and best parameters.

    from sklearn.model_selection import TimeSeriesSplit, ParameterGrid
    
    params = {'max_depth': [6, 7, 8],
              'subsample': [0.8, 1]}
    
    # Create the time-series splitter (5 splits is an arbitrary choice here)
    tscv = TimeSeriesSplit(n_splits=5)
    
    # Initialising the best_score and best_params
    best_score = -999
    best_params = None
    
    # Looping over the parameters
    for param in ParameterGrid(params):
        scores = [train(param=param, train_index=train_index, test_index=test_index, X=X, y=y)
                  for train_index, test_index in tscv.split(X)]
        cv_score = np.mean(scores)
        if cv_score > best_score:
            best_score = cv_score
            best_params = param

    In the above block, we define a grid, and ParameterGrid gives us a generator which yields a parameter dict on each iteration of the for loop. Inside the loop, we calculate the score on each split produced by TimeSeriesSplit. TimeSeriesSplit creates the indices to use for the splits, but it has to be fed data that is already sorted by time, which is why we sorted the DataFrame at the beginning.

    Once we have the score for each split, we compare the average to the existing best_score; if it's greater, we update both best_score and best_params. Once all possible combinations are done, we have hyper-parameters tuned using time-series cross-validation. With the final hyper-parameters in hand, all that's left is to train your final model.

    # Assuming best_params contains the best hyper-parameter values found
    # from the tuning process
    
    # Initialize the model with the best parameters
    final_model = CatBoostClassifier(max_depth=best_params['max_depth'],
                                     subsample=best_params['subsample'])
    
    # Fit the model on the entire dataset
    final_model.fit(X, y)
    
    # Now, the final_model is trained with the best hyper-parameters on the full dataset
    # You can proceed to make predictions or further evaluate the model as needed

  • Embed Documents Using Ollama – OllamaEmbeddings

    You can now create document embeddings using Ollama. Once these embeddings are created, you can store them in a vector database. You can read this article where I go over how you can do so.

    import numpy as np  # used below to inspect the embedding shapes
    from langchain_community.embeddings import OllamaEmbeddings
    
    ollama_emb = OllamaEmbeddings(
        model="mistral",
    )
    r1 = ollama_emb.embed_documents(
        [
            "Alpha is the first letter of Greek alphabet",
            "Beta is the second letter of Greek alphabet",
            "This is a random sentence",
        ]
    )
    r2 = ollama_emb.embed_query(
        "What is the second letter of Greek alphabet"
    )

    Let's inspect the array shapes –

    print(np.array(r1).shape)
    >>> (3,4096)
    print(np.array(r2).shape)
    >>> (4096,)

    Now we can also find the cosine similarity between the vectors –

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_similarity(np.array(r1), np.array(r2).reshape(1,-1))
    >>>array([[0.62087283],
    [0.65085897],
    [0.36985642]])

    Here we can clearly see that the second of our 3 reference documents is the closest to our question. Similarly, you can create embeddings from your own text documents, store them, and later query them using Ollama and LangChain, as sketched below.
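
    As a minimal sketch of that (the persist directory is just a placeholder), you could store the same three sentences in a Chroma vector store using the Ollama embeddings and query it:

    # A minimal sketch: store documents in a Chroma vector store using the same
    # Ollama embeddings, then retrieve the closest match for a query.
    from langchain_community.vectorstores import Chroma
    
    db = Chroma.from_texts(
        texts=[
            "Alpha is the first letter of Greek alphabet",
            "Beta is the second letter of Greek alphabet",
            "This is a random sentence",
        ],
        embedding=ollama_emb,
        persist_directory="./chroma_db",  # placeholder path
    )
    print(db.similarity_search("What is the second letter of Greek alphabet", k=1))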

  • Custom Objective Function in XGBoost

    In the previous post, we covered how you can create a custom loss function in CatBoost, but you might not be using CatBoost, so how can you do the same if you're using XGBoost to train your models? In this post, I'll walk through an example using the famous Titanic dataset, where we'll recreate the LogLoss function and compare the results with the standard implementation in the library.

    First, we have to set up the data.

    import numpy as np 
    import seaborn as sns
    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import log_loss

    data = sns.load_dataset('titanic')

    Then some data cleaning and setting up the training dataset. The goal is not to get the best model but to demonstrate the custom loss function, so not much feature engineering is being done.

    data['embarked'].fillna('S', inplace=True)
    
    X, y = data[[c for c in data.columns
                 if c not in ['survived', 'alive', 'deck', 'embark_town']]], data['survived']
    
    cat_columns = ['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'class',
                   'who', 'adult_male', 'alone']
    
    X = pd.get_dummies(X, columns=cat_columns, drop_first=True)

    Let's say there were no built-in loss function like LogLoss; how would you define LogLoss as an objective function?

    LogLoss = -\frac{1}{N}\sum_{i}\left(y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i})\right)

    You'll have to calculate the first and second derivatives of the per-sample loss with respect to \hat{y}_{i}:

    \frac{\partial LogLoss}{\partial \hat{y}_{i}} = -\frac{y_{i}}{\hat{y}_{i}} + \frac{1-y_{i}}{1-\hat{y}_{i}}
    
    \frac{\partial^2 LogLoss}{\partial \hat{y}_{i}^2} = \frac{y_{i}}{\hat{y}_{i}^{2}} + \frac{1-y_{i}}{(1-\hat{y}_{i})^{2}}

    Now we will write these up as Python functions and create a function that returns the gradient and hessian (second derivative) values. In the XGBoost library, the first argument passed to a custom objective is the predictions and the second is the training matrix.

    def log_loss_derivative(y_pred, dtrain):
        y = dtrain.get_label()
        return (-y / y_pred) + ((1 - y) / (1 - y_pred))
    
    def log_loss_second_derivative(y_pred, dtrain):
        y = dtrain.get_label()
        return (y / np.power(y_pred, 2)) + ((1 - y) / np.power((1 - y_pred), 2))
    
    def custom_log_loss(predt, dtrain):
        y_pred = np.clip(predt, a_max=1 - 1e-5, a_min=1e-5)
        grad = log_loss_derivative(y_pred=y_pred, dtrain=dtrain)
        hess = log_loss_second_derivative(y_pred=y_pred, dtrain=dtrain)
        return grad, hess

    We clip the predictions to avoid division by zero errors. Now let’s train.

    import xgboost as xgb
    
    dtrain = xgb.DMatrix(data=X, label=y)
    
    model = xgb.train({'tree_method': 'hist', 'seed': 1994},
                      dtrain=dtrain,
                      num_boost_round=10,
                      obj=custom_log_loss)
    
    log_loss(y_pred=np.clip(model.predict(dtrain), a_max=1, a_min=0), y_true=y)
    >>> 0.24912

    Comparison with the standard implementation.

    clf = xgb.XGBClassifier(n_estimators = 10, **{'tree_method': 'hist', 'seed': 1994})
    clf.fit(X,y)

    log_loss(y_pred=np.clip(clf.predict_proba(X)[:,1], a_max=1, a_min=0), y_true=y)

    >>>0.2861

    As we can see, the metric from our implementation of LogLoss is very close to that of the standard implementation. Of course, you should use the standard implementation when it's available, but in case you want to use a custom loss function, you now know how to do so.

  • Creating a Custom Loss Function For Machine Learning Models

    While standard machine learning libraries provide a vast array of loss functions out of the box, sometimes we need to create our own custom loss function. In this blog post, I'll go over a simple example and create a custom loss function in CatBoost.

    First we will create the data for training.

    # Importing libraries
    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_squared_error
    from catboost import CatBoostRegressor, Pool
    from sklearn.datasets import fetch_california_housing

    raw_data = fetch_california_housing()

    data = pd.concat([pd.DataFrame(raw_data['data'], columns=raw_data['feature_names']),
                      pd.Series(raw_data['target'], name='target')], axis=1)

    features = [i for i in data.columns.tolist() if i != 'target']

    Since the objective is not to create the best model possible, we won't be doing any feature engineering. Let's use CatBoost and create a model with a standard loss function.

    model = CatBoostRegressor(loss_function='RMSE', n_estimators=100, eval_metric='RMSE')

    cb_pool = Pool(data=data[features], label=data['target'], feature_names=features)

    model.fit(cb_pool)

    predictions = model.predict(cb_pool)

    mean_squared_error(y_true=data['target'], y_pred=predictions)

    Upon evaluating the model, we find that the mean squared error is 0.15. This is definitely an overfitting model, since we evaluate on the training data, but that's not a concern for this tutorial.

    But what if you don't want to use RMSE as a loss function, and instead want to use something like this –

    loss = \frac{\sum (y - \hat{y})^{4}}{n}

    Then how do you create a loss function in catboost?

    For this, you’ll need to calculate the first derivative and the second derivative of the loss function with respect to \hat{y}.

    Using the chain rule, the first derivative is

    \frac{\partial (y-\hat{y})^4}{\partial \hat{y}} = \frac{\partial (y-\hat{y})^4}{\partial (y-\hat{y})} \cdot \frac{\partial (y - \hat{y})}{\partial \hat{y}} = 4(y - \hat{y})^{3} \cdot (-1) = -4(y - \hat{y})^{3}

    And similarly, using the chain rule, the second derivative is \frac{\partial}{\partial \hat{y}}\left(-4(y-\hat{y})^{3}\right) = -4 \cdot 3(y-\hat{y})^{2} \cdot (-1) = 12(y-\hat{y})^{2}

    The catboost template for a custom objective is as follows –

    class UserDefinedObjective(object):
        def calc_ders_range(self, approxes, targets, weights):
            """
            Computes first and second derivative of the loss function 
            with respect to the predicted value for each object.
    
            Parameters
            ----------
            approxes : indexed container of floats
                Current predictions for each object.
    
            targets : indexed container of floats
                Target values you provided with the dataset.
    
            weight : float, optional (default=None)
                Instance weight.
    
            Returns
            -------
                der1 : list-like object of float
                der2 : list-like object of float
    
            """
            pass
    

    Using this template, we can write the custom objective –

    class CustomLossObjective(object):
        def calc_ders_range(self, approxes, targets, weights):
            assert len(approxes) == len(targets)
            if weights is not None:
                assert len(weights) == len(approxes)
    
            result = []
            for index in range(len(targets)):
                error = targets[index] - approxes[index]
                der1 = -4 * error**3
                der2 = 12 * error**2
    
                if weights is not None:
                    der1 *= weights[index]
                    der2 *= weights[index]
    
                result.append((der1, der2))
            return result

    Now let’s use this custom loss in our model

    model = CatBoostRegressor(loss_function=CustomLossObjective(), n_estimators=100, eval_metric='RMSE')
    model.fit(cb_pool)

    predictions = model.predict(cb_pool)
    mean_squared_error(y_true=data['target'], y_pred=predictions)

    Using this loss, we see that the mean squared error is 0.735. This is clearly inferior to using RMSE, but as mentioned before, the objective of this blog post is not to build the best model but to showcase how one can create a custom loss objective in CatBoost.

  • Temperature In Language Models – A way to control for Randomness

    Temperature is a parameter that you can access with open-source LLMs; it essentially controls how random the model's behaviour is.

    Here is an image from cohere.ai

    In this image, we can see that if you increase the temperature, the probability distribution of the softmax output over the next token is flattened. So when you sample from this distribution, there is a higher chance of selecting an output token which had a very low probability score in the original distribution.

    Similarly, there are also the sampling parameters known as top k and top p.

    They work alongside temperature: the higher their values, the more random your output will be.

    Let’s take an example. What do you expect the completion of this sentence to be – The cat sat on the _____

    I think most of us will think mat, followed by other things where you can sit like porch, floor, etc. and not sky.

    Suppose we feed this to a text-generation model and the softmax probability distribution looks like this –

    token   prob
    mat     0.60
    floor   0.20
    porch   0.10
    car     0.05
    bus     0.03
    sky     0.02

    If you set temperature = 0, then the model will return the most likely completion of the sentence: The cat sat on the mat

    But when we set temperature = 1, or some higher number, we could get the model to output The cat sat on the sky, because the softmax probability distribution is artificially flattened so that less likely tokens get sampled more often. This can be good or bad depending on the context of the problem.
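
    To make this concrete, here is a minimal sketch with made-up logits for the six tokens above: dividing the logits by the temperature before the softmax sharpens the distribution at low temperature and flattens it at high temperature.

    # A minimal sketch with made-up logits: temperature rescales the logits
    # before the softmax, changing how peaked the sampling distribution is.
    import numpy as np
    
    tokens = ["mat", "floor", "porch", "car", "bus", "sky"]
    logits = np.array([3.0, 1.9, 1.2, 0.5, 0.0, -0.5])  # hypothetical values
    
    for temperature in (0.5, 1.0, 2.0):
        scaled = logits / temperature
        probs = np.exp(scaled) / np.exp(scaled).sum()
        print(f"temperature={temperature}:", dict(zip(tokens, np.round(probs, 3))))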

    In the video below, we ran through a couple of settings and saw the effect these parameters had on the output of Llama-2-7b.

    #loading the model 
    
    import torch
    from peft import PeftModel, PeftConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
    
    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False)
    
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config = bnb_config,device_map={"":0})

    Then we create the prompt template and a function to create a text-generation pipeline –

    import json
    import textwrap
    
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    DEFAULT_SYSTEM_PROMPT = """
    """
    
    
    
    def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):
        SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
        prompt_template =  B_INST + SYSTEM_PROMPT + instruction + E_INST
        return prompt_template
    
    def create_pipeline(temperature = 0, top_p = 0.1, top_k = 3, max_new_tokens=512):
        pipe = pipeline("text-generation",
                    model=model,
                    tokenizer = tokenizer,
                    max_new_tokens = max_new_tokens,
                    temperature = temperature,
                    do_sample = True, 
                    top_p = top_p,
                    top_k = top_k)
        return pipe

    Now let’s see the model output when we pass this prompt to the model with different configurations.

    [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]
    # Model with all params as low.
    pipe = create_pipeline(0.1)
    output = pipe.predict(prompt)
    print(output[0]['generated_text'])
    
    >>> [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]  The cat sat on the mat.

    As expected, the model returned the most likely completion.

    # Model with all params as high.
    pipe = create_pipeline(0.8, top_p = 0.8, top_k = 100)
    output = pipe.predict(prompt)
    print(output[0]['generated_text'])
    
    >>> [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]  The cat sat on the windowsill.

    Here we can see that changing the parameters influenced the model's output.

  • Gorilla – A LLM to output API calls, paper walkthrough with a working example

    In the YouTube video, I go over Gorilla, an LLM which is fine-tuned on API calls.

    Let me know in case you want to learn more about such LLM or ML concepts in the comments below.

  • PDF ChatBot Demo with Gradio, Llama-2 and LangChain

    In this post, we will learn how you can create a chatbot which can read through your documents and answer any question. In addition, we will learn how to create a working demo using Gradio that you can share with your colleagues or friends.

    The Google Colab notebook can be found here.

  • Fine Tune Llama-2-13b on a single GPU on custom data.

    In this tutorial, we will walk through each step of fine-tuning the Llama-2-13b model on a single GPU. I'll be using a Colab notebook, but you can use your local machine; it just needs around 12 GB of VRAM.

    The required libraries can be installed by running this in your notebook.

    !pip install -q transformers trl peft huggingface_hub datasets bitsandbytes accelerate

    First login to your huggingface account.

    from huggingface_hub import login
    login("<your token here>")

    Loading the tokenizer.

    model_id = "meta-llama/Llama-2-13b-chat-hf"
    import torch
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    
    from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, BitsAndBytesConfig
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    Now we will load the model in its quantised form; this reduces the memory required to fit the model, so it can run on a single GPU.

    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False)

    If you have a bit more GPU memory to play with, you can load the model in 8-bit instead. Play around with this configuration based on your hardware specifications.
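
    As a minimal sketch of that 8-bit alternative (this walkthrough continues with the 4-bit config):

    # A minimal sketch of the 8-bit alternative, if your GPU has more memory.
    bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)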

    model = AutoModelForCausalLM.from_pretrained(model_id,  quantization_config=bnb_config, use_cache=False)

    The lines below prepare the model for 4-bit or 8-bit training; without them, you get an error.

    from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
    
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    Then you define your LoRA config. There are mainly two parameters to play around with – the rank r and lora_alpha. For more details, you can read about the params here.

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        inference_mode=False, 
        r=64, 
        lora_alpha=32, 
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)
    

    The cell below is a helper function that shows how many trainable parameters there are.

    def print_trainable_parameters(model):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(
            f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
        )
    
    print_trainable_parameters(model)
    
    >>> trainable params: 52428800 || all params: 6724408320 || trainable%: 0.7796790067620403

    We can see that with LoRA, there are very few parameters to train.

    To prepare your data, you can have it in any form you want, as long as it is loaded with the datasets library; you can then pass a formatting function while training, which combines all the text parts of the data, as sketched below.
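
    A minimal sketch of the loading step, assuming a hypothetical JSONL file where each record already has a text field; the resulting dataset is what we pass to the trainer further down.

    # A minimal sketch (hypothetical file name) of loading the training data
    # with the datasets library; each record is assumed to have a "text" field.
    from datasets import load_dataset
    
    dataset = load_dataset("json", data_files="train.jsonl", split="train")
    print(dataset[0]["text"])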

    Here you can change the training configuration. For LoRA, you can start with a higher learning rate, as the original weights are frozen, so you don't have to worry about catastrophic forgetting. The arguments you want to play around with are per_device_train_batch_size and gradient_accumulation_steps: when you run out of memory, lower per_device_train_batch_size and increase gradient_accumulation_steps.

    max_seq_length = 512
    
    from transformers import TrainingArguments, EarlyStoppingCallback
    from trl import SFTTrainer
    output_dir = "./results"
    optim = "paged_adamw_32bit"
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        optim=optim,
        learning_rate=1e-4,
        logging_steps=10,
        max_steps=300,
        warmup_ratio=0.3,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        save_total_limit = 5,
        fp16=True
        
    )
    

    Here I’m writing an example of a formatting function. My data already had a text field which had all the text data.

    def format_function(example):
        return example['text']

    But in case you don't have a text field, you can write the function so that it combines the relevant fields and returns all the text as one string, as in the sketch below.
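
    A minimal sketch, assuming hypothetical instruction and response fields instead of a single text field:

    # A minimal sketch (hypothetical field names): combine separate fields into
    # one training string when the dataset has no single "text" column.
    def format_function(example):
        return f"{example['instruction']}\n{example['response']}"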

    Now we define the trainer.

    from trl import SFTTrainer
    peft_trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        args=training_args,
        formatting_func=format_function)
    
    peft_trainer.train()

    Once the model has been trained, you can store it locally or push it to the Hugging Face Hub.
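
    As a minimal sketch (the directory and repo names are placeholders):

    # A minimal sketch: save the trained LoRA adapter locally, or push it to the
    # Hugging Face Hub.
    peft_trainer.model.save_pretrained("./llama-2-13b-adapter")
    # peft_trainer.model.push_to_hub("<your-username>/llama-2-13b-adapter")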

    Hope this tutorial cleared any doubts you had around fine-tuning LLMs on a single GPU.