Author: sahaymaniceet

  • Cohen’s Kappa and its use in ML

    Suppose you’re building a classification model on an imbalanced dataset and want a measure beyond accuracy, F1-score, and the ROC-AUC curve to be confident in your results. What else can you use? One answer is Cohen’s kappa.

    Cohen’s Kappa is a statistical measure that quantifies the level of agreement between two annotators or, in the context of ML, the agreement between the model’s predictions and the true labels. It accounts for the possibility of agreement occurring by chance, providing a more nuanced evaluation than traditional accuracy metrics.

    The Formula:
    The formula for Cohen’s Kappa is –

    \kappa = \frac{p_{0} - p_{e}}{1-p_{e}}

    Where p_{0} is the observed agreement between the model’s predictions and true labels and p_{e} is the expected agreement by chance.
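
    To make p_{0} and p_{e} concrete, here is a small worked example with made-up counts (a hypothetical confusion matrix, not from any real model); sklearn’s cohen_kappa_score returns the same value when given the corresponding label vectors.

    import numpy as np

    # Hypothetical confusion matrix for 100 emails (rows = true label, columns = predicted label)
    #              pred ham   pred spam
    # true ham          45          5
    # true spam         10         40
    cm = np.array([[45, 5],
                   [10, 40]])
    n = cm.sum()

    p0 = np.trace(cm) / n            # observed agreement: (45 + 40) / 100 = 0.85

    p_true = cm.sum(axis=1) / n      # marginal distribution of the true labels  [0.50, 0.50]
    p_pred = cm.sum(axis=0) / n      # marginal distribution of the predictions  [0.55, 0.45]
    pe = np.sum(p_true * p_pred)     # chance agreement: 0.5*0.55 + 0.5*0.45 = 0.50

    kappa = (p0 - pe) / (1 - pe)
    print(kappa)                     # 0.7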

    Let’s take an example to understand this better: a binary classification scenario where you’re building a spam email classifier. The task is to distinguish between spam and non-spam (ham) emails, and we’ll use a simple logistic regression model.

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, cohen_kappa_score

    # Sample data for spam and non-spam emails
    data = [
        ("Get rich quick! Claim your prize now!", "spam"),
        ("Meeting at 3 pm in the conference room.", "ham"),
        ("Exclusive offer for you!", "spam"),
        ("Reminder: Project deadline tomorrow.", "ham"),
        # ... more data ...
    ]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        [text for text, label in data],
        [label for text, label in data],
        test_size=0.2,
        random_state=42
    )

    # Vectorize the text data
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train a logistic regression classifier
    classifier = LogisticRegression()
    classifier.fit(X_train_vec, y_train)

    # Make predictions on the test set
    y_pred = classifier.predict(X_test_vec)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    kappa_score = cohen_kappa_score(y_test, y_pred)

    # Print the results
    print(f"Accuracy: {accuracy}")
    print(f"Confusion Matrix:\n{conf_matrix}")
    print(f"Cohen's Kappa: {kappa_score}")

    After running this, you get a kappa of 1, which means the predictions agree perfectly with the true labels and none of the agreement can be attributed to chance. Be aware that this is an ideal scenario; with only a handful of examples, a perfect score is easy to hit.

    At the other extreme, a score of 0 means the model’s agreement with the labels is no better than random chance, i.e. your features aren’t capturing any meaningful patterns in the data.

    In the context of model evaluation:

    Kappa scores closer to 1 indicate a high level of agreement and are generally considered desirable.

    Kappa scores around 0 or below suggest poor agreement, and the model’s predictions might not be reliable.

    It’s essential to interpret Cohen’s Kappa alongside other evaluation metrics, such as accuracy, precision, recall, and the confusion matrix, to comprehensively understand the model’s performance. Additionally, the interpretation of Kappa may vary depending on the specific problem and the level of difficulty in the classification task.

  • GPT-4 Vision API – How to Guide

    At its developer conference, OpenAI announced the GPT-4 Vision API. With access to it you can build many tools, with the GPT-4-turbo model as the engine. Use cases range from information retrieval to classification.

    In this article, we will go over how to use the Vision API, how to pass multiple images in a request, and some tricks you should be using to improve the response.

    First, you need a billing account with OpenAI and some credits to use this API; unlike ChatGPT, you’re charged per token rather than a flat fee, so be careful with your experiments.

    The API –

    The API consists of two parts –

    1. Header – Here you pass your authentication key and, optionally, the organisation id.
    2. Payload – This is where the meat of your request lies. The image can be passed either as a URL or as a base64-encoded string; I prefer the latter.
    # To encode the image in base64
    import base64

    # Function to encode the image
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    
    # Path to your image
    image_path = "./sample.png"
    
    # Getting the base64 string
    base64_image = encode_image(image_path)

    Let’s look at the API format

    import requests

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY_HERE}"
    }

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": <"user" or "system">,
                "content": [
                    {
                        "type": <"text" or "image_url">,
                        <"text" or "image_url">: <the text or the image url>
                    }
                ]
            }
        ],
        "max_tokens": <max tokens here>
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

    Let’s take an example.
    Suppose I want a request with both a system prompt and a user prompt that extracts JSON output from an image. My payload will look like this.

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            # First define the system prompt
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a system that always extracts information from an image in a json_format"
                    }
                ]
            },
            # Define the user prompt
            {
                "role": "user",
                # Under the user prompt, I pass two content items, one text and one image
                "content": [
                    {
                        "type": "text",
                        "text": """Extract the grades from this image in a structured format. Only return the output.
                                   ```
                                   [{"subject": "<subject>", "grade": "<grade>"}]
                                   ```"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 500  # Return no more than 500 completion tokens
    }

    The response I get from the API is exactly what I wanted.

    ```json
    [
      {"subject": "English", "grade": "A+"},
      {"subject": "Math", "grade": "B-"},
      {"subject": "Science", "grade": "B+"},
      {"subject": "History", "grade": "C+"}
    ]
    ```
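
    If you want to work with the result programmatically rather than reading the raw response, the reply sits in the usual chat-completions fields (choices → message → content). A small sketch; the fence-stripping step is only needed if, as here, the model wraps the JSON in a ```json block:

    import json

    # Pull the model's reply out of the chat-completions response
    content = response.json()["choices"][0]["message"]["content"]

    # Strip the ```json fence if present, then parse into Python objects
    cleaned = content.strip().removeprefix("```json").removesuffix("```").strip()
    grades = json.loads(cleaned)
    print(grades[0]["subject"], grades[0]["grade"])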

    This is just an example of how, with the right prompt, we can build an information-retrieval system over images using the Vision API.

    In the next article, we will build a state-of-the-art image classifier using just this API, with no machine learning knowledge required.

  • Deploy Machine Learning Model on Spaces by HuggingFace Using Gradio

    Once your Gradio application is ready and tested in a notebook, the next step is to deploy it using Spaces.

    In this demo, we will deploy the Titanic model on Spaces. You can visit the space here. There are only 4 steps involved –

    1. Create a new space – We will call this space titanic_demo; you can also use paid GPU instances if required.

    2. Create app.py – Here lies the code which runs the Gradio application. Below is the code used to run the space.

    import numpy as np
    import pandas as pd
    import gradio as gr
    from catboost import CatBoostClassifier
    
    clf = CatBoostClassifier()
    clf.load_model("./titanic_model.bin")
    
    def predict(pclass:int = 3,
                sex:str = "male",
                age:float = 30,
                fare:float = 100,
                embarked:str = "S"):
      prediction_array = np.array([pclass, sex, age, fare, embarked])
      survived = clf.predict(prediction_array)
      if survived == 1:
        return f"The passenger survived"
      else:
        return f"The passenger did not survive"
    
    
    
    with gr.Blocks() as demo:
      # Keeping the three categorical feature input in the same row
      with gr.Row() as row1:
        pclass = gr.Dropdown(choices=[1,2,3], label= "pclass")
        sex = gr.Dropdown(choices =["male", "female"], label = "sex")
        embarked = gr.Dropdown(choices =["C", "Q", "S"], label = "embarked")
      # Creating slider for the two numerical inputs and also defining the limits for both
      age = gr.Slider(1, 100, label = "age", interactive = True)
      fare = gr.Slider(10, 600, label = "fare", interactive = True)
    
      submit = gr.Button(value = 'Predict')
    
      # Showing the output
      output = gr.Textbox(label = "Whether the passenger survived ?", interactive = False,)
    
      # Defining what happens when the user clicks the submit button
      submit.click(predict, inputs = [pclass,sex, age,fare,embarked], outputs = [output])
    
    demo.launch(share = False, debug = False)

    Remember to set share=False, as on Spaces you don’t need to create a shareable link.

    3. Create requirements.txt (optional) – This is needed only if you’re using packages that aren’t pre-installed in the space. We’re using catboost for the model, so we will specify it in requirements.txt, as shown below.
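
    gradio and its dependencies (numpy, pandas) are already available in a Gradio space, so catboost is the only package we really need to list; a minimal requirements.txt could be just:

    catboost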

    4. Add your model file (optional) – This is again optional, as your application might not load from a saved model file. Here we’ve stored our Titanic model in a .bin file, so we add it to the space’s files.

    That’s it. Once you’ve followed these steps your ML model is up and running on Spaces and you don’t have to worry about the link expiring.

  • What is MLaaS (Machine Learning as a Service)

    Like SaaS (Software as a Service), MLaaS is a cloud-based offering that allows users to access and utilize machine learning capabilities without needing to invest in the underlying infrastructure or have extensive expertise in machine learning. With MLaaS, users can access pre-built machine learning models, algorithms, and tools through APIs or web interfaces, enabling them to integrate machine learning capabilities into their applications, processes, or products. This approach simplifies the deployment of machine learning solutions and lowers the barriers for organizations to leverage the power of machine learning in their operations.

    Examples of MLaaS products are –

    1. AWS Sagemaker by Amazon
    2. AutoML by Google
    3. Azure Machine Learning by Microsoft
    4. Watson ML by IBM
    5. AWS Rekognition

    But can’t one train ML models on a local machine as well?

    Training ML models is just one step within the larger machine learning pipeline. While training ML models on local machines is possible and often done during development, the machine learning process involves several stages that go beyond just training:

    1. Data Collection and Preparation
    2. Feature Engineering
    3. Model Selection and Architecture Design
    4. Model Training
    5. Model Evaluation
    6. Hyperparameter Tuning
    7. Deployment
    8. Monitoring and Maintenance

    Using MLaaS can simplify various stages of this pipeline by providing pre-built models, tools, and infrastructure for these tasks, allowing developers and businesses to focus more on the specific problem they’re solving.

    So should we always use MLaaS?

    The short answer is that it depends. Like everything, MLaaS comes with its cons as well, and these are the drawbacks to be aware of –

    1. Cost: While MLaaS can be convenient, it often comes with a cost. If your usage is consistent and predictable, building your own infrastructure might be more cost-effective in the long run. However, if your usage is sporadic or unpredictable, the pay-as-you-go model of MLaaS might be more cost-efficient.
    2. Customization: MLaaS platforms might offer a limited set of models and configurations. If your application requires highly specialized models or specific tweaks, building your own solution might be more suitable.
    3. Lock-In: Using MLaaS can sometimes result in vendor lock-in, where it becomes challenging to migrate away from the chosen provider if needed. This can be a concern for long-term projects.
    4. Learning Experience: If you’re interested in learning about the intricacies of machine learning or if your organization’s core competence is in this field, building and managing your own machine learning infrastructure might be beneficial.

    In summary, there is no one-size-fits-all solution; you have to decide what suits your problem statement best, whether that is using MLaaS or building it on your own.

  • Temperature In Language Models – A way to control for Randomness

    Temperature is a parameter that you can access with open-source LLMs (and most LLM APIs) which essentially controls how random the model’s behaviour is.

    Here is an image from cohere.ai

    In this image, we can see that increasing the temperature changes the softmax probability distribution over the next token. When you sample from this flattened distribution, there is a chance of selecting a token that had a very low probability in the original distribution.

    There are also related sampling parameters known as top-k and top-p. They work in a similar spirit to temperature: the higher their value, the more tokens are kept in the candidate pool, and the more random your output can be.

    Let’s take an example. What do you expect the completion of this sentence to be – The cat sat on the _____

    I think most of us will think mat, followed by other things where you can sit like porch, floor, etc. and not sky.

    Suppose we feed this to a text-generation model and the softmax probability distribution looks like this –

    token    prob
    mat      0.60
    floor    0.20
    porch    0.10
    car      0.05
    bus      0.03
    sky      0.02

    If you set temperature = 0, the model will return the most likely completion of the sentence: The cat sat on the mat

    But when we set temperature = 1 or higher, we could get the model to output The cat sat on the sky, because sampling is enabled and, at higher temperatures, the softmax distribution is flattened so less likely tokens get a real chance of being selected. This can be good or bad depending on the context of the problem.
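
    Numerically, temperature just rescales the logits before the softmax. A minimal sketch with hypothetical logits (chosen to match the table above, not taken from a real model):

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        scaled = np.array(logits) / temperature
        exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
        return exp / exp.sum()

    tokens = ["mat", "floor", "porch", "car", "bus", "sky"]
    logits = np.log([0.6, 0.2, 0.1, 0.05, 0.03, 0.02])   # logits consistent with the table above

    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, dict(zip(tokens, probs.round(3))))
    # Low temperature concentrates probability mass on "mat"; high temperature
    # spreads it out, so tokens like "sky" get sampled more often.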

    In the video below, we ran through a couple of settings and saw the effect these parameters had on the output of Llama-2-7b.

    #loading the model 
    
    import torch
    from peft import PeftModel, PeftConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
    
    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False)
    
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config = bnb_config,device_map={"":0})

    Then we create the prompt template and a function to create a text-generation pipeline –

    import json
    import textwrap
    
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    DEFAULT_SYSTEM_PROMPT = """
    """
    
    
    
    def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):
        SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
        prompt_template =  B_INST + SYSTEM_PROMPT + instruction + E_INST
        return prompt_template
    
    def create_pipeline(temperature = 0, top_p = 0.1, top_k = 3, max_new_tokens=512):
        pipe = pipeline("text-generation",
                    model=model,
                    tokenizer = tokenizer,
                    max_new_tokens = max_new_tokens,
                    temperature = temperature,
                    do_sample = True, 
                    top_p = top_p,
                    top_k = top_k)
        return pipe
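
    The prompt variable used in the runs below isn’t shown in the original snippet; presumably it was built with get_prompt and an empty system prompt, roughly like this:

    # Assumed construction of the prompt used below (empty system prompt)
    prompt = get_prompt("Complete the sentence - The cat sat on the")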

    Now let’s see the model output when we pass this prompt to the model with different configurations.

    [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]
    # Model with all params as low.
    pipe = create_pipeline(0.1)
    output = pipe.predict(prompt)
    print(output[0]['generated_text'])
    
    >>> [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]  The cat sat on the mat.

    The model’s output was in line with our expectations.

    # Model with all params as high.
    pipe = create_pipeline(0.8, top_p = 0.8, top_k = 100)
    output = pipe.predict(prompt)
    print(output[0]['generated_text'])
    
    >>> [INST]<<SYS>>
    
    
    <</SYS>>
    
    Complete the sentence - The cat sat on the [/INST]  The cat sat on the windowsill.

    Here we can see that changing the parameters influenced the model’s output.

  • Fuzzy Match DataFrames Using RapidFuzz and Pandas

    Today, we will be going over how you can match two DataFrames using RapidFuzz and Pandas.

    Suppose you have two DataFrames, one containing the product_id and the other the product_price, with name as the key. Since the names are written differently in each table, you have to fuzzy-match them.

    name          product_id
    M.D. Luffy    A
    R. Zoro       B
    Sanji         C
    Nami          D
    Naruto        E

    name and product_id table

    name               product_price
    Monkey D. Luffy    100
    Roronoa Zoro       10
    Sannnji            500
    Nami Chan          1000
    Jiraiya            300

    name and product_price table

    Since the name is written differently in the two tables, we can’t do a direct left join on name. Instead, we’ll do a fuzzy join using the rapidfuzz library.

    # Importing the libraries
    import numpy as np
    import pandas as pd
    from rapidfuzz import fuzz
    from rapidfuzz.process import cdist, extract
    

    We will use the extract function to find, for each name in the first table, the closest matching name in the second table to merge on.

    df1 = pd.DataFrame({"name": ["M.D. Luffy",
                                 "R. Zoro",
                                 "Sanji",
                                 "Nami",
                                 "Naruto"],
                        "product_id": ["A", "B", "C", "D", "E"]})

    df2 = pd.DataFrame({"name": ["Monkey D. Luffy",
                                 "Roronoa Zoro",
                                 "Sannnji",
                                 "Nami Chan",
                                 "Jiraiya"],
                        "product_price": [100, 10, 500, 1000, 300]})

    df1['join_key_tuple'] = df1['name'].apply(lambda x: extract(query=x, choices=df2['name'], score_cutoff=80))

    Here query is the string you want to match and choices are the candidates to match against. You can pass a custom scorer as well, but here we use the default scorer, and finally we pass a score_cutoff that determines what counts as a successful match.

             name	    product_id	           join_key_tuple
    0	M.D. Luffy	A	        [(Monkey D. Luffy, 85.5, 0)]
    1	R. Zoro	        B	        [(Roronoa Zoro, 85.5, 1)]
    2	Sanji	        C	        [(Sannnji, 83.33333333333334, 2)]
    3	Nami	        D	        [(Nami Chan, 90.0, 3)]
    4	Naruto	        E	        []

    We can now extract the join key from the returned tuple and perform the join to get the product price against each product_id.

    df1["join_key"] = df1["join_key_tuple"].apply(lambda x: x[0][0] if x else np.nan)
    
    df1.merge(df2, how = "left", left_on = "join_key", right_on ="name")

    This is how you can do a fuzzy join on two pandas DataFrames.
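
    The cdist import above went unused in this walkthrough; for larger tables it offers a vectorised alternative that scores every pair of names in a single call. A rough sketch, assuming the same df1 and df2 from above:

    import numpy as np

    # Score every df1 name against every df2 name in one call
    scores = cdist(df1["name"], df2["name"], scorer=fuzz.WRatio)

    best_idx = scores.argmax(axis=1)    # index of the best df2 match for each df1 row
    best_score = scores.max(axis=1)     # the corresponding score

    # Keep only matches above the same 80 cutoff used earlier
    df1["join_key"] = [df2["name"].iloc[i] if s >= 80 else np.nan
                       for i, s in zip(best_idx, best_score)]
    df1.merge(df2, how="left", left_on="join_key", right_on="name")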

  • Understanding R-squared (R2) in Regression: A Comprehensive Explanation of Model Fit

    In the realm of regression analysis, one of the key metrics used to evaluate the goodness-of-fit of a model is the R-squared (R2) statistic. R-squared serves as a crucial tool for quantifying how well a regression model captures the variation in the dependent variable based on the independent variables. In this blog, we will delve into the concept of R-squared, its interpretation, calculation, and its strengths and limitations in assessing the performance of regression models.

    R^{2}=1- \frac{RSS}{TSS}

    But what do RSS and TSS mean?

    RSS stands for the residual sum of squares. It is calculated by the formula –

    RSS = \sum(y - \hat{y})^{2}

    So it is the sum of the squared differences between the predicted values and the actual values.

    If you plot the fitted line against the data, the vertical distances from each point to the line are the residuals; squaring and summing these values gives the RSS.

    Similarly, the TSS is given by the formula –

    TSS = \sum(y - \bar{y})^{2}

    TSS, the total sum of squares, measures the error with respect to the mean \bar{y}, i.e. the total variation in the target.

    But why is R^{2} = 1 - \frac{RSS}{TSS} ?

    The answer is quite logical if you think about it. The simplest possible prediction is the mean, so if \hat{y} = \bar{y}, then RSS = TSS and your R-squared becomes 0. On the other hand, if your regression line fits perfectly, i.e. \hat{y} = y, then RSS = 0 and R-squared becomes 1.

    So that’s why R-squared is a goodness-of-fit measure. For an ordinary least-squares fit with an intercept its value lies between 0 and 1, although it can go negative for a model that fits worse than simply predicting the mean.
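
    A quick numeric check of the identity, with made-up numbers (any small regression will do), comparing the manual RSS/TSS computation against sklearn’s r2_score:

    import numpy as np
    from sklearn.metrics import r2_score

    y     = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
    y_hat = np.array([2.8, 5.3, 6.9, 9.4])   # hypothetical predictions

    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)

    print(1 - rss / tss)        # manual R-squared, 0.985
    print(r2_score(y, y_hat))   # same value from sklearn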

  • Huber Loss – Loss function to use in Regression when dealing with Outliers

    Huber loss, also known as smooth L1 loss, is a loss function commonly used in regression problems. It is a blend of the Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions that combines the best properties of both.
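
    For reference, with residual a = y - \hat{y} and threshold \delta, the loss is the standard piecewise definition –

    L_{\delta}(a) = \begin{cases} \frac{1}{2}a^{2} & \text{if } |a| \le \delta \\ \delta\left(|a| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}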

    Below are some advantages of Huber Loss –

    1. Robustness to outliers: One of the main advantages of Huber loss is its ability to handle outliers effectively. Unlike Mean Squared Error (MSE), which heavily penalizes large errors due to its quadratic nature, Huber loss transitions to a linear behaviour for larger errors. This property reduces the impact of outliers and makes the loss function more robust in the presence of noisy data.
    2. Differentiability: Huber loss is differentiable at all points, including the transition point between the quadratic and linear regions. This differentiability is essential when using gradient-based optimization algorithms, such as Stochastic Gradient Descent (SGD), to update the model parameters during training. The continuous and differentiable nature of the loss function enables efficient optimization.
    3. The balance between L1 and L2 loss: Huber loss combines the benefits of both Mean Absolute Error (MAE) and MSE loss functions. For small errors, it behaves similarly to MSE (quadratic), which helps the model converge faster during training. On the other hand, for larger errors, it behaves like MAE (linear), which reduces the impact of outliers.
    4. Smoother optimization landscape: The transition from quadratic to linear behaviour in Huber loss results in a smoother optimization landscape compared to MSE. This can prevent issues related to gradient explosions and vanishing gradients, which may occur in certain cases with MSE.
    5. Efficient optimization: Due to its smoother nature and better handling of outliers, Huber loss can lead to faster convergence during model training. It enables more stable and efficient optimization, especially when dealing with complex and noisy datasets.
    6. User-defined threshold: The parameter δ in Huber loss allows users to control the sensitivity of the loss function to errors. By adjusting δ, practitioners can customize the loss function to match the specific characteristics of their dataset, making it more adaptable to different regression tasks.
    7. Wide applicability: Huber loss can be applied to a variety of regression problems across different domains, including finance, image processing, natural language processing, and more. Its versatility and robustness make it a popular choice in many real-world applications.

    There are also some disadvantages to using this loss function –

    1. Hyperparameter tuning: The Huber loss function depends on the user-defined threshold parameter, δ. Selecting an appropriate value for δ is crucial, as it determines when the loss transitions from quadratic (MSE-like) to linear (MAE-like) behaviour. Finding the optimal δ value can be challenging and may require experimentation or cross-validation, making the model development process more complex.
    2. Task-specific performance: Although Huber loss is more robust to outliers compared to MSE, it might not be the best choice for all regression tasks. The choice of loss function should be task-specific, and in some cases, other loss functions tailored to the specific problem might provide better performance.
    3. Less emphasis on smaller errors: The quadratic behavior of Huber loss for small errors means that it might not penalize small errors as much as the pure L1 loss (MAE). In certain cases, especially in noiseless datasets, the added robustness to outliers might come at the cost of slightly reduced accuracy in predicting smaller errors.
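
    To see the quadratic-to-linear transition directly, here is a minimal numpy sketch of the loss itself (independent of the sklearn example further below):

    import numpy as np

    def huber_loss(residuals, delta=1.0):
        """Elementwise Huber loss: quadratic inside |r| <= delta, linear outside."""
        r = np.abs(residuals)
        return np.where(r <= delta,
                        0.5 * r ** 2,
                        delta * (r - 0.5 * delta))

    residuals = np.array([0.1, 0.5, 1.0, 5.0, 50.0])
    print(huber_loss(residuals, delta=1.0))
    # Small residuals are penalised like MSE (0.5 * r^2); the outlier at 50
    # contributes ~49.5 instead of the 1250 that squared error would give.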

    Let’s see Huber Regression in Action and see how it is different compared to Linear Regression

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import HuberRegressor, LinearRegression
    from sklearn.datasets import make_regression
    import seaborn as sns
    sns.set_theme()
    rng = np.random.RandomState(0)
    X, y, coef = make_regression(n_samples=200, n_features=2, noise=4.0, coef=True, random_state=0)
    
    #Adding outliers
    X[:4] = rng.uniform(10, 20, (4, 2))
    y[:4] = rng.uniform(10, 20, 4)
    
    #plotting the data 
    
    sns.scatterplot(x = X[:,1], y = y)
    sns.scatterplot(x = X[:,0], y = y)

    As we can see from the plotted data, there are a few outliers.
    Let us see how Huber Regression and Linear Regression perform.

    huber = HuberRegressor().fit(X, y)
    
    lr = LinearRegression()
    lr.fit(X,y)
    
    print(f'True coefficients are {coef}')
    >>>True coefficients are [20.4923687  34.16981149]
    print(f'Huber coefficients are {huber.coef_}')
    >>>Huber coefficients are [17.79064252 31.01066091]
    print(f'Linear coefficients are {lr.coef_}')
    >>>Linear coefficients are [-1.92210833  7.02266092]

    Here we can see that the Huber coefficients are much closer to the true coefficients; let us also visualise this by plotting the regression lines.

    # use line_kws to set line label for legend

    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
    sns.regplot(x=X[:,1], y=y, color='b',
        line_kws={'label': "y={0:.1f}x+{1:.1f}".format(huber.coef_[1], huber.intercept_)}, ax=axes[0])
    sns.regplot(x=X[:,1], y=y, color='r',
        line_kws={'label': "y={0:.1f}x+{1:.1f}".format(lr.coef_[1], lr.intercept_)}, ax=axes[1])
    # Call legend() so the line_kws labels actually show up
    axes[0].legend()
    axes[1].legend()

    In these plots, we can clearly see the effect the outliers have on the Linear Regression fit compared to the Huber Regression fit.

  • Gorilla – A LLM to output API calls, paper walkthrough with a working example

    In the YouTube video, I go over Gorilla, an LLM fine-tuned to output API calls.

    Let me know in case you want to learn more about such LLM or ML concepts in the comments below.

  • Create a Machine Learning Demo Using Gradio

    Let’s build a Machine Learning Demo Using Gradio. We will be using the Titanic Dataset as an example.

    Link to the Google Colab notebook.

    First, we will install and load the libraries.

    !pip install -q gradio
    !pip install -q catboost

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import gradio as gr   # used later to build the demo

    sns.set_theme()

    Then let’s load the data and analyse the data types.

    df = sns.load_dataset("titanic")
    df.dtypes
    
    >>>
    survived          int64
    pclass            int64
    sex              object
    age             float64
    sibsp             int64
    parch             int64
    fare            float64
    embarked         object
    class          category
    who              object
    adult_male         bool
    deck           category
    embark_town      object
    alive            object
    alone              bool
    dtype: object

    We can see that the target is survived (the same information is also present as alive). We will not use all the features to build the model, since the goal is to demo a machine learning model as a live predictor and we don’t want to overburden the user with inputs.

    # Identifying features, we are keeping very few features as we want to simulate this using gradio
    features = [ 'pclass', 'sex', 'age', 'fare',
           'embarked']
    target = 'survived'

    Then we fill the missing values. For age, we use the median, and for embarked we use the most common value, which is S.

    # Filling missing values
    df['age'].fillna(np.nanquantile(df['age'], 0.5), inplace = True)
    df['embarked'].fillna("S", inplace = True)

    Now let’s build the model. No tuning, as getting the highest accuracy or F1-score is not the objective here.

    from catboost import CatBoostClassifier
    clf = CatBoostClassifier()
    
    # Creating features and target
    X = df[features]
    y = df[target]
    
    clf.fit(X,y, cat_features=['pclass', 'sex', 'embarked'])

    Then we write the function which takes in the inputs and returns whether the passenger would have survived or not.

    def predict(pclass:int = 3, 
                sex:str = "male", 
                age:float = 30, 
                fare:float = 100, 
                embarked:str = "S"):
      prediction_array = np.array([pclass, sex, age, fare, embarked])
      survived = clf.predict(prediction_array)
      if survived == 1:
        return f"The passenger survived"
      else:
        return f"The passenger did not survive"

    Now for the gradio demo, we want to take in these inputs with different gradio components, pass those as inputs to the prediction function and display the output. I go into much more detail on how this is done in the YouTube video, but the code snippet has comments which will help in case you don’t want to watch the explainer video.

    with gr.Blocks() as demo:
      # Keeping the three categorical feature input in the same row
      with gr.Row() as row1:
        pclass = gr.Dropdown(choices=[1,2,3], label= "pclass")
        sex = gr.Dropdown(choices =["male", "female"], label = "sex")
        embarked = gr.Dropdown(choices =["C", "Q", "S"], label = "embarked")
      # Creating slider for the two numerical inputs and also defining the limits for both
      age = gr.Slider(1, 100, label = "age", interactive = True)
      fare = gr.Slider(10, 600, label = "fare", interactive = True)
    
      submit = gr.Button(value = 'Predict')
    
      # Showing the output 
      output = gr.Textbox(label = "Whether the passenger survived ?", interactive = False,)
    
      # Defining what happens when the user clicks the submit button
      submit.click(predict, inputs = [pclass,sex, age,fare,embarked], outputs = [output])
    
    demo.launch(share = False, debug = False)

    Then you’ll get an output like this, where you’re free to play around with the features and see what the ML model’s output will be.

    Let me know in case you want to build something else with Gradio or want me to cover any ML topic in the comments below.