Tag: Python

  • Mastering Time: Unlock Hyper-Parameter Tuning with Time Series Cross-Validation

    Most of us know how to do hyper-parameter tuning with scikit-learn, but tuning hyper-parameters with time-series cross-validation can be less familiar. First, let’s understand what time-series cross-validation is.

    Time series cross-validation is a technique used to evaluate the performance of predictive models on time-ordered data. Unlike traditional cross-validation methods, which randomly split the dataset into training and testing sets, time series cross-validation maintains the chronological order of observations. This approach is crucial for time series data, where the relationship between past and future data points is essential for accurate predictions.

    In time series cross-validation, the dataset is split into a series of training and testing sets over time. For example, in a simple walk-forward validation, the model might be trained on the first year of data and tested on the following month, then trained on the first year plus one month, and tested on the next month, and so on. This method allows for the evaluation of the model’s performance over different time intervals, ensuring that the model can adapt to changes in the data over time.

    We will be utilising TimeSeriesSplit from scikit-learn to get these splits on our data.
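
    To get a feel for what these splits look like, here is a minimal sketch on a toy array (the three splits and the dummy data are purely for illustration):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Toy data: 10 time-ordered observations, for illustration only
    toy_X = np.arange(10).reshape(-1, 1)

    tscv_demo = TimeSeriesSplit(n_splits=3)
    for fold, (train_index, test_index) in enumerate(tscv_demo.split(toy_X)):
        # The training window grows over time; the test window always comes after it
        print(f"Fold {fold}: train={train_index}, test={test_index}")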

    Suppose we have our train and test data ready with all the features, including a timestamp column. The first step is to set this column as the index and sort the DataFrame.

    # Supposing X is our dataframe and timestamp_ is the column name which has the time related information.
    import pandas as pd

    X.set_index(keys='timestamp_', drop=True, inplace=True)
    X.sort_index(inplace=True)
    y = X[<target col>]
    X.drop([<target col>], axis=1, inplace=True)

    Once the DataFrame is sorted, you need to create your hyper-parameter grid and the time series splits, both with the help of scikit-learn. You can write this to run in parallel, but since this is a demo example, we will stick to for loops. First, we will write a training function, assuming our task is classification and we’re using CatBoost.

    from catboost import CatBoostClassifier
    import pandas as pd
    import numpy as np
    from sklearn.metrics import roc_auc_score
    
    def train(param: dict, X: pd.DataFrame, y: pd.Series, train_index: np.ndarray, test_index: np.ndarray) -> float:
        X_train, X_val = X.iloc[train_index], X.iloc[test_index]
        y_train, y_val = y.iloc[train_index], y.iloc[test_index]
        
        model = CatBoostClassifier(max_depth=param['max_depth'],
                                   subsample=param['subsample'],
                                   verbose=0)  # Set verbose to 0 for silent training
        
        model.fit(X_train, y_train,
                  eval_set=(X_val, y_val))
        
        # Predict probabilities for the positive class
        y_pred_proba = model.predict_proba(X_val)[:, 1]
        
        # Calculate AUC score
        score = roc_auc_score(y_val, y_pred_proba)
        
        return score

    The function takes the parameter dictionary, the feature matrix, the labels and the train/validation indices that we get from TimeSeriesSplit, and then fits a model. I have used AUC as an example metric, but you’re free to use any metric. After this, all we need to do is run the training over all possible combinations of parameters and keep track of the best score and best parameters.

    from sklearn.model_selection import TimeSeriesSplit, ParameterGrid

    params = {'max_depth': [6, 7, 8],
              'subsample': [0.8, 1]}

    # Creating the time series splitter (5 splits as an example)
    tscv = TimeSeriesSplit(n_splits=5)

    # Initialising the best_score and best_params
    best_score = -np.inf
    best_params = None

    # Looping over the parameters
    for param in ParameterGrid(params):
        scores = [train(param=param, train_index=train_index, test_index=test_index, X=X, y=y)
                  for train_index, test_index in tscv.split(X)]
        cv_score = np.mean(scores)
        if cv_score > best_score:
            best_score = cv_score
            best_params = param

    In the above block, we define a grid and then use ParameterGrid to create a generator that yields a parameter dict on each iteration of the for loop. Inside the loop, we calculate the score on each split that we get from TimeSeriesSplit. TimeSeriesSplit creates the indices to use for the splits, but it has to be fed data that is already sorted by time, which is why we sorted the DataFrame at the beginning.

    Once we have the score for each split, we compare the average to the existing best_score; if it’s greater, we update both best_score and best_params. After all possible combinations have been tried, we have hyper-parameters tuned using time series cross-validation. With the final hyper-parameters in hand, all that’s left is to train your final model.

    # Assuming best_params contains the best hyper-parameter values found
    # from the tuning process

    # Initialize the model with the best parameters
    final_model = CatBoostClassifier(max_depth=best_params['max_depth'],
                                     subsample=best_params['subsample'])

    # Fit the model on the entire dataset
    final_model.fit(X, y)

    # Now, the final_model is trained with the best hyper-parameters on the full dataset
    # You can proceed to make predictions or further evaluate the model as needed

  • Ultimate Guide to Chroma Vector Database: Everything You Need to Know – Part 1

    In this tutorial, we will walk through how to use Chromadb as your vector database for all your Retrieval-Augmented Generation (RAG) tasks.

    But before that, you need to install Chromadb. If you’re using Python, all you need to do is –

    pip install chromadb

    Now that you’ve installed Chromadb, let’s begin. We will use a PDF file as an example. For the PDF we will be using this research paper, but feel free to use the PDF of your choice.

    The first step is to create a persistent client, i.e., storage that persists on disk and can be reused across sessions. While creating the client, remember to add the setting that allows resetting the client, should you require this functionality.

    import chromadb
    
    client = chromadb.PersistentClient(
                path="<path of persistent storage>", settings=chromadb.config.Settings(allow_reset=True)
            )

    Once you have a client set up, you can define collections within it. If you use BigQuery or any SQL product, imagine the client being the project and the collection being the dataset. Within a collection, you store the documents as embeddings. In this example, we will call our collection “research”.

    One very important thing to remember is that each collection has its own embedding function, and it has to stay fixed. The query is also passed as an embedding when you search for the most similar documents, so if you use embedding function X to add the documents and embedding function Y to query them, the similarity scores will not be meaningful. We will be using the OpenAI text-embedding-3-small model. Another point to remember is that a single document should only have as many tokens as the embedding function can handle. If, say, your embedding function is all-MiniLM-L6-v2 from HuggingFace, the maximum sequence length it can handle is 256 tokens, so if you try to vectorise a longer document, it will simply clip the document to 256 tokens and embed that. The OpenAI model accepts a much longer input (up to 8,191 tokens).

    import os
    from chromadb.utils import embedding_functions

    # Defining the embedding function
    embedding_func = embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ.get("OPENAI_API_KEY"),
                                                                 model_name="text-embedding-3-small")
    

    Next, we create the collection. It’s best practice to specify the embedding function while creating the collection; otherwise, Chromadb falls back to its default embedding function, which is a sentence-transformers model.
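
    Here is a minimal sketch of what that looks like, using the client and embedding function defined above (the collection name “research” comes from earlier):

    # Creating the collection with its embedding function fixed at creation time
    collection = client.create_collection(
        name="research",
        embedding_function=embedding_func,
    )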

    Now we need to add documents to this collection. For this, we will use some helper functions from langchain.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader
    
    data_path = "./data/2311.04635v1.pdf"
    
    pdf_loader = PyPDFLoader(data_path)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    documents = pdf_loader.load_and_split(text_splitter=text_splitter)

    PyPDFLoader loads the PDF file and RecursiveCharacterTextSplitter splits it into chunks. We are using a chunk size of 1000 with an overlap of 50, meaning the chunks will be roughly 1,000 characters long with an overlap of 50 characters between consecutive chunks. You can learn more about how the text splitter works here.

    Now that we have our documents loaded, it’s time to add them to the Chromadb collection. Since we’ve already specified the embedding function on the collection, we can simply add the text and it will be stored as embeddings. You have to specify “ids”; think of them as primary keys in SQL. You can also specify metadata for each document. Both are useful when you want to query, upsert or delete documents in the collection.

    from datetime import datetime, timezone

    collection.add(
        documents=[doc.page_content for doc in documents],
        ids=[f"pdf_chunk_{i}" for i in range(len(documents))],
        metadatas=[
            {
                "file_name": "research_paper",
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            for _ in documents
        ],
    )

    Here we use the page contents of the loaded documents as the documents; since they are text, the collection will automatically convert them to embeddings using its embedding function. In case you already have embeddings, you can add them directly via the embeddings argument. I’ve also specified some ids, which are very rudimentary here for illustration purposes, as well as some metadata for each document. Both of these can be used to query, upsert or delete individual documents from the vector database.

    Read more about how you can upsert documents, query a collection and delete individual documents in Part II of this Ultimate Guide to ChromaDB.

  • Embed Documents Using Ollama – OllamaEmbeddings

    You can now create document embeddings using Ollama. Once these embeddings are created, you can store them in a vector database; you can read this article where I go over how to do so.

    from langchain_community.embeddings import OllamaEmbeddings

    ollama_emb = OllamaEmbeddings(
        model="mistral",
    )
    r1 = ollama_emb.embed_documents(
        [
            "Alpha is the first letter of Greek alphabet",
            "Beta is the second letter of Greek alphabet",
            "This is a random sentence",
        ]
    )
    r2 = ollama_emb.embed_query(
        "What is the second letter of Greek alphabet"
    )

    Let’s inspect the array shapes-

    import numpy as np

    print(np.array(r1).shape)
    >>> (3, 4096)
    print(np.array(r2).shape)
    >>> (4096,)

    Now we can also find the cosine similarity between the vectors –

    from sklearn.metrics.pairwise import cosine_similarity

    cosine_similarity(np.array(r1), np.array(r2).reshape(1, -1))
    >>> array([[0.62087283],
               [0.65085897],
               [0.36985642]])

    Here we can clearly see that the second document in our 3 reference documents is the closest to our question. Similarly, you can create embeddings from your own text documents, store them, and later query them using Ollama and LangChain.

  • Build RAG Application Using Ollama

    In this tutorial, we will build a Retrieval Augmented Generation (RAG) application using Ollama and Langchain. For the vector store, we will be using Chroma, but you are free to use any vector store of your choice.

    There are 4 key steps to building your RAG application –

    1. Load your documents
    2. Add them to the vector store using the embedding function of your choice.
    3. Define your prompt template.
    4. Define your Retrieval Chatbot using the LLM of your choice.

    In case you want the Colab notebook, you can click here.

    First we load the required libraries.

    # Loading required libraries
    import os

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.vectorstores import Chroma
    from langchain.chains import RetrievalQA
    from langchain.memory import ConversationSummaryMemory
    from langchain_openai import OpenAIEmbeddings
    from langchain.prompts import PromptTemplate
    from langchain.llms import Ollama

    Then comes step 1, which is to load our documents. Here I’ll be using the Elden Ring Wikipedia PDF; you can just visit the Wikipedia page and download it as a PDF file.

    data_path = "./data/Elden_Ring.pdf"
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=30,
        length_function=len,
    )

    documents = PyPDFLoader(data_path).load_and_split(text_splitter=text_splitter)

    The next step is to use an embedding function that will convert our text into embeddings. I prefer using OpenAI embeddings, but you can use any embedding function. Using this embedding function we will add our documents to the Chroma vector database.

    embedding_func = OpenAIEmbeddings(api_key=os.environ.get("OPENAI_API_KEY"))
    vectordb = Chroma.from_documents(documents, embedding=embedding_func)

    Moving on, we have to define a prompt template. I’ll be using the Mistral model, so it’s the very basic prompt template that Mistral provides.

    template = """<s>[INST] Given the context - {context} </s>[INST] [INST] Answer the following question - {question}[/INST]"""
    pt = PromptTemplate(
        template=template, input_variables=["context", "question"]
    )

    All that is left to do is to define our memory and Retrieval Chatbot using Ollama as the LLM.

    rag = RetrievalQA.from_chain_type(
        llm=Ollama(model="mistral"),
        retriever=vectordb.as_retriever(),
        memory=ConversationSummaryMemory(llm=Ollama(model="mistral")),
        chain_type_kwargs={"prompt": pt, "verbose": True},
    )
    rag.invoke("What is Elden Ring ?")
    >>> {'query': 'What is Elden Ring ?',
    'history': '',
    'result': ' Elden Ring is a 2022 action role-playing game developed by FromSoftware. It was published for PlayStation 4, PlayStation 5, Windows, Xbox One, and Xbox Series X/S. In the game, players control a customizable character on a quest to repair the Elden Ring and become the new Elden Lord. The game is set in an open world, presented through a third-person perspective, and includes several types of weapons and magic spells. Players can traverse the six main areas using their steed Torrent and discover linear hidden dungeons and checkpoints that enable fast travel and attribute improvements. Elden Ring features online multiplayer mode for cooperative play or player-versus-player combat. The game was developed with inspirations from Dark Souls series, and contributions from George R.R. Martin on the narrative and Tsukasa Saitoh, Shoi Miyazawa, Tai Tomisawa, Yuka Kitamura, and Yoshimi Kudo for the original soundtrack. Elden Ring received critical acclaim for its open world, gameplay systems, and setting, with some criticism for technical performance. It sold over 20 million copies and a downloadable content expansion, Shadow of the Erdtree, is planned to be released in June 2024.'}

    We see that it was even able to tell us when Shadow of the Erdtree is planned to be released, which I’m really excited about. Let me know in the comments if you want me to cover anything else.

  • Create Your Own Vector Database

    In this tutorial, we will walk through how you can create your own vector database using Chroma and Langchain. With this, you will be able to easily store PDF files and use the Chroma DB as a retriever in your Retrieval Augmented Generation (RAG) systems. In another part, I’ll walk through how you can take this vector database and build a RAG system.

    # Importing Libraries

    import chromadb
    import os
    from chromadb.utils import embedding_functions
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader
    from typing import Optional
    from pathlib import Path
    from glob import glob
    from uuid import uuid4

    Now we will define some variables –

    db_path = <path you want to store db >
    collection_name = <name of collection of chroma, it's similar to dataset>
    document_dir_path = <path where the pdfs are stored>

    Now you also need to create an embedding function. I will use the OpenAI model as it’s cheap and performs well, but you can use open-source embedding functions as well. You’ll need to pass this embedding function every time you get the collection.

    embedding_func = embedding_functions.OpenAIEmbeddingFunction(
        api_key=<openai_api_key>,
        model_name="text-embedding-3-small",
    )

    Now we need to initialise the client; we will be using a persistent client, and then create our collection.

    client = chromadb.PersistentClient(path=db_path)
    client.create_collection(
        name=collection_name,
        embedding_function=embedding_func,
    )

    Now let’s load our PDFs. To do this, we will first create a text splitter, and then for each PDF we will load it and split it into documents, which are then stored in the collection. You can use any chunk size you want; we will use 1000 here.

    chunk_size = 1000

    # Load the collection
    collection = client.get_collection(
        collection_name, embedding_function=embedding_func
    )
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=20,
        length_function=len,
    )

    for pdf_file in glob(f"{document_dir_path}*.pdf"):
        pdf_loader = PyPDFLoader(pdf_file)
        documents = [
            doc.page_content
            for doc in pdf_loader.load_and_split(text_splitter=text_splitter)
        ]
        collection.add(
            documents=documents,
            ids=[str(uuid4()) for _ in range(len(documents))],
        )

    The collection requires an id for each document. You can pass any string value; here we are passing random UUID strings, but you could, for example, use the file name (plus a chunk number) as the id.
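
    As a quick sanity check, you can query the collection with some text. This is a minimal sketch and the query string is arbitrary:

    # Query the collection: the text is embedded with the same embedding function
    # and the most similar chunks are returned
    results = collection.query(
        query_texts=["What is this document about?"],
        n_results=3,
    )
    print(results["documents"])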

    Let me know in case you’ve any questions.

  • Custom Objective Function in XGBoost

    In the previous post, we covered how you can create a custom loss function in Catboost, but you might not be using Catboost, so how can you do the same if you’re using XGBoost to train your models? In this post, I’ll walk through an example using the famous Titanic dataset, where we’ll recreate the LogLoss function and compare the results with the standard implementation in the library.

    First, we have to set up the data.

    import numpy as np 
    import seaborn as sns
    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import log_loss

    data = sns.load_dataset('titanic')

    Then some data cleaning and setting up the training dataset. The goal is not to get the best model but to demonstrate the custom loss function, so not much feature engineering is being done.

    data['embarked'].fillna('S', inplace=True)

    feature_cols = [c for c in data.columns if c not in ['survived', 'alive', 'deck', 'embark_town']]
    X, y = data[feature_cols], data['survived']

    cat_columns = ['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'class',
                   'who', 'adult_male', 'alone']

    X = pd.get_dummies(X, columns=cat_columns, drop_first=True)

    Suppose the library did not already provide LogLoss; how would you define it yourself as an objective function? LogLoss is defined as

    LogLoss = -\frac{1}{N}\sum_{i}\left(y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i})\right)

    You’ll have to calculate the first and second derivatives with respect to \hat{y}_{i}:

    \Large \frac{\partial LogLoss}{\partial \hat{y}_{i}} = -\frac{y_{i}}{\hat{y}_{i}} + \frac{1-y_{i}}{1-\hat{y}_{i}}

    \Large \frac{\partial^{2} LogLoss}{\partial \hat{y}_{i}^{2}} = \frac{y_{i}}{\hat{y}_{i}^{2}} + \frac{1-y_{i}}{(1-\hat{y}_{i})^{2}}

    Now we will write these up as Python functions and create a function that returns the gradient and hessian (second derivative) values. In the xgboost library, the first value being passed is the predictions and the second is the training matrix.

    def log_loss_derivative(y_pred, dtrain):
        y = dtrain.get_label()
        return (-y / y_pred) + ((1 - y) / (1 - y_pred))

    def log_loss_second_derivative(y_pred, dtrain):
        y = dtrain.get_label()
        return (y / np.power(y_pred, 2)) + ((1 - y) / np.power(1 - y_pred, 2))

    def custom_log_loss(predt, dtrain):
        y_pred = np.clip(predt, a_max=1 - 1e-5, a_min=1e-5)
        grad = log_loss_derivative(y_pred=y_pred, dtrain=dtrain)
        hess = log_loss_second_derivative(y_pred=y_pred, dtrain=dtrain)
        return grad, hess

    We clip the predictions to avoid division by zero errors. Now let’s train.

    import xgboost as xgb

    dtrain = xgb.DMatrix(data=X, label=y)

    model = xgb.train({'tree_method': 'hist', 'seed': 1994},
                      dtrain=dtrain,
                      num_boost_round=10,
                      obj=custom_log_loss)

    log_loss(y_pred=np.clip(model.predict(dtrain), a_max=1, a_min=0), y_true=y)
    >>> 0.24912

    Comparison with the standard implementation.

    clf = xgb.XGBClassifier(n_estimators=10, **{'tree_method': 'hist', 'seed': 1994})
    clf.fit(X, y)

    log_loss(y_pred=np.clip(clf.predict_proba(X)[:, 1], a_max=1, a_min=0), y_true=y)
    >>> 0.2861

    As we can see, the metric from our implementation of LogLoss is very close to the one from the standard implementation. Of course, you should use the standard implementation when it’s available, but in case you need a custom loss function, you now know how to write one.

  • Creating a Custom Loss Function For Machine Learning Models

    While standard Machine Learning Libraries provide a vast array of loss functions out of the box, sometimes we need to create our own custom loss function. In this blog post, I’ll go over a simple example and create a custom loss function in Catboost.

    First we will create the data for training.

    # Importing libraries
    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_squared_error
    from catboost import CatBoostRegressor, Pool
    from sklearn.datasets import fetch_california_housing

    raw_data = fetch_california_housing()

    data = pd.concat([pd.DataFrame(raw_data['data'], columns=raw_data['feature_names']),
                      pd.Series(raw_data['target'], name='target')], axis=1)

    features = [i for i in data.columns.tolist() if i != 'target']

    Since the objective is not to create the best model possible, we won’t be doing any feature engineering. Let’s use catboost, and create a model using standard loss functions.

    model = CatBoostRegressor(loss_function='RMSE', n_estimators=100, eval_metric='RMSE')

    cb_pool = Pool(data=data[features], label=data['target'], feature_names=features)

    model.fit(cb_pool)

    predictions = model.predict(cb_pool)

    mean_squared_error(y_true=data['target'], y_pred=predictions)

    Upon evaluating the model we find that the mean squared error is 0.15. Definitely a model which is overfitting, but that’s not a concern for this tutorial.

    But what if you don’t want to use RMSE as a loss function, and instead want to use something like this –

    loss = \frac{1}{n}\sum_{i}(y_{i} - \hat{y}_{i})^{4}

    Then how do you create a loss function in catboost?

    For this, you’ll need to calculate the first derivative and the second derivative of the loss function with respect to \hat{y}.

    Using the chain rule, the first derivative is

    \frac{\partial (y_{i}-\hat{y}_{i})^{4}}{\partial \hat{y}_{i}} = \frac{\partial (y_{i}-\hat{y}_{i})^{4}}{\partial (y_{i}-\hat{y}_{i})} \cdot \frac{\partial (y_{i} - \hat{y}_{i})}{\partial \hat{y}_{i}} = 4(y_{i} - \hat{y}_{i})^{3} \cdot (-1) = -4(y_{i} - \hat{y}_{i})^{3}

    And similarly, using the chain rule, the second derivative comes out to be 12(y_{i}-\hat{y}_{i})^{2}

    The catboost template for a custom objective is as follows –

    class UserDefinedObjective(object):
        def calc_ders_range(self, approxes, targets, weights):
            """
            Computes first and second derivative of the loss function 
            with respect to the predicted value for each object.
    
            Parameters
            ----------
            approxes : indexed container of floats
                Current predictions for each object.
    
            targets : indexed container of floats
                Target values you provided with the dataset.
    
            weight : float, optional (default=None)
                Instance weight.
    
            Returns
            -------
                der1 : list-like object of float
                der2 : list-like object of float
    
            """
            pass
    

    Using this template, we can write the custom objective –

    class CustomLossObjective(object):
        def calc_ders_range(self, approxes, targets, weights):
            assert len(approxes) == len(targets)
            if weights is not None:
                assert len(weights) == len(approxes)

            result = []
            n = len(targets)  # Number of samples

            for index in range(len(targets)):
                error = targets[index] - approxes[index]
                der1 = -4 * error**3
                der2 = 12 * error**2

                if weights is not None:
                    der1 *= weights[index]
                    der2 *= weights[index]

                result.append((der1, der2))
            return result

    Now let’s use this custom loss in our model.

    model = CatBoostRegressor(loss_function=CustomLossObjective(), n_estimators=100, eval_metric='RMSE')
    model.fit(cb_pool)

    predictions = model.predict(cb_pool)
    mean_squared_error(y_true=data['target'], y_pred=predictions)

    Using this loss, we see that the mean squared error is 0.735, which is clearly inferior to using RMSE; but as mentioned before, the objective of this blog post is not to build the best model but to showcase how one can create a custom loss objective in Catboost.

  • PDF ChatBot Demo with Gradio, Llama-2 and LangChain

    In this post, we will learn how you can create a chatbot which can read through your documents and answer any question. In addition, we will learn how to create a working demo using Gradio that you can share with your colleagues or friends.

    The Google Colab notebook can be found here.

  • Fine Tune Llama-2-7b with a custom dataset on google collab

    I’ll add the code and explanations as text here, but everything is explained in the YouTube video.

    Link to the Colab notebook.

  • Time Series Forecasting with Python – Part IV – Stationarity and Augmented Dickey-Fuller Test

    In Part III, we looked at trends and seasonality in time series data and how we can decompose them using statsmodels.

    In this part, we will learn about stationarity in time series data and how we can test for it using the Augmented Dickey-Fuller test.

    Stationarity is a fundamental concept in time series analysis. It refers to the statistical properties of a time series remaining constant over time. In a stationary time series, the mean, variance, and autocovariance structure do not change with time.

    There are three main components of stationarity:

    1. Constant Mean: The mean of the time series should remain constant over time. This means that the average value of the series does not show any trend or systematic patterns as time progresses.
    2. Constant Variance: The variance (or standard deviation) of the series should remain constant over time. It implies that the spread or dispersion of the data points around the mean should not change as time progresses.
    3. Constant Autocovariance: The autocovariance between any two points in the time series should only depend on the time lag between them and not on the specific time at which they are observed. Autocovariance measures the linear relationship between a data point and its lagged values. In a stationary series, the autocovariance structure remains constant over time.

    Why is stationarity important in time series analysis? Stationarity is a crucial assumption for many time series models and statistical tests. If a time series violates the stationarity assumption, it can lead to unreliable and misleading results. For example, non-stationary series may exhibit trends, seasonality, or other time-dependent patterns that can distort statistical inference, prediction, and forecasting.

    To analyze non-stationary time series, researchers often use techniques like differencing to transform the series into a stationary form. Differencing involves computing the differences between consecutive observations to remove trends or other time-dependent patterns. Other methods, such as detrending or deseasonalizing, can also be employed depending on the specific characteristics of the series.
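
    As a tiny illustration of first-order differencing (the series below is made up purely for demonstration):

    import pandas as pd

    # A hypothetical series with a clear upward trend (non-stationary mean)
    series = pd.Series([10, 12, 14, 16, 18, 20, 22, 24])

    # First-order differencing: each value minus the previous one
    diff_series = series.diff().dropna()
    print(diff_series.tolist())  # constant differences show the trend has been removed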

    It is important to note that while stationarity is desirable for many time series models, there are cases where non-stationary time series analysis is appropriate, such as when studying trending or seasonal data. However, in such cases, specialized models and techniques designed for non-stationary series need to be employed.

    Testing for Stationarity

    In Python, you can use various statistical tests to check for stationarity in a time series. One commonly used test is the Augmented Dickey-Fuller (ADF) test. The statsmodels library provides an implementation of the ADF test, which can be used to assess the stationarity of a time series.

    Here’s an example of how to perform the ADF test in Python:

    import pandas as pd
    from statsmodels.tsa.stattools import adfuller
    
    # Create a time series dataset
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    
    # Perform the ADF test
    result = adfuller(data)
    
    # Extract the test statistic and p-value
    test_statistic = result[0]
    p_value = result[1]
    
    # Print the results
    print("ADF Test Statistic:", test_statistic)
    print("p-value:", p_value)
    

    The values come out to be

    ADF Test Statistic: 0.0
    p-value: 0.958532086060056

    The ADF test statistic measures the strength of the evidence against the null hypothesis of non-stationarity. A more negative (i.e., lower) test statistic indicates stronger evidence in favor of stationarity. The p-value represents the probability of observing the given test statistic if the null hypothesis of non-stationarity were true. A small p-value (typically less than 0.05) suggests rejecting the null hypothesis and concluding that the series is stationary. In this example, with a p-value of about 0.96, the null hypothesis is not rejected, meaning that the time series is not stationary.

    In the next part we will cover how we can convert non-stationary time series data to stationary time series.