In this tutorial, we will walk through how you can create your own vector database using Chroma and LangChain. With it, you can store the contents of PDF files and use the Chroma collection as a retriever in your Retrieval-Augmented Generation (RAG) systems. In another part, I'll walk through how you can take this vector database and build a RAG system.
# Importing Libraries
import chromadb
import os
from chromadb.utils import embedding_functions
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from typing import Optional
from pathlib import Path
from glob import glob
from uuid import uuid4
Now we will define some variables:
db_path = <path where you want to store the db>
collection_name = <name of the Chroma collection; a collection is similar to a dataset>
document_dir_path = <path where the PDFs are stored>
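Before loading anything, it can help to confirm that the document directory exists and actually contains PDFs. The following is a minimal sketch using only the standard library; `find_pdfs` is a hypothetical helper name introduced here for illustration and is not part of Chroma or LangChain.

```python
from pathlib import Path


def find_pdfs(directory: str) -> list[str]:
    """Return the sorted paths of all .pdf files directly inside `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        raise FileNotFoundError(f"{directory!r} is not an existing directory")
    # This glob is not recursive: only files at the top level are matched.
    return sorted(str(path) for path in folder.glob("*.pdf"))
```

Calling `find_pdfs(document_dir_path)` before building the database gives a quick sanity check on the path, whether or not it ends with a path separator.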
You also need to create an embedding function. I will use an OpenAI embedding model here because it is inexpensive and performs well, but you can use an open-source embedding function instead. You'll need to pass this embedding function every time you get the collection.
embedding_func = embedding_functions.OpenAIEmbeddingFunction(
    api_key=<openai_api_key>,
    model_name="text-embedding-3-small",
)
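Rather than pasting the API key into the script, one option is to read it from an environment variable. This is a minimal sketch, not something the original setup requires; `load_api_key` is a hypothetical helper, and the variable name `OPENAI_API_KEY` is an assumption about how your environment is configured.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read an API key from an environment variable, failing early if it is unset."""
    # env_var defaults to an assumed variable name; change it to match your setup.
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running")
    return api_key
```

The returned string can then be passed as the `api_key` argument of the embedding function, which keeps the key out of version control.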
Now we need to initialise the client. We will use a persistent client, which stores the database on disk at db_path, and create our collection.
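client = chromadb.PersistentClient(path=db_path)
# get_or_create_collection avoids an error if the collection already exists from an earlier run
client.get_or_create_collection(
    name=collection_name,
    embedding_function=embedding_func,
)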
Now let's load our PDFs. First, we create a text splitter. Then, for each PDF, we load it and split it into chunks, which are stored in the collection. You can use any chunk size you want; we will use 1000 characters here.
chunk_size = 1000
# Load the collection
collection = client.get_collection(
    collection_name, embedding_function=embedding_func
)
text_splitter = RecursiveCharacterTextSplitter(
    # Chunk size and overlap are measured in characters because length_function is len.
    chunk_size=chunk_size,
    chunk_overlap=20,
    length_function=len,
)
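# os.path.join works whether or not document_dir_path ends with a path separator
for pdf_file in glob(os.path.join(document_dir_path, "*.pdf")):
    pdf_loader = PyPDFLoader(pdf_file)
    # Split the PDF into chunks and keep only the text of each chunk
    documents = [
        doc.page_content
        for doc in pdf_loader.load_and_split(text_splitter=text_splitter)
    ]
    # Skip PDFs that yield no text chunks (for example, scans without a text layer)
    if not documents:
        continue
    collection.add(
        documents=documents,
        ids=[str(uuid4()) for _ in range(len(documents))],
    )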
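To make the chunk_size and chunk_overlap settings used above more concrete, here is a simplified illustration of overlapping chunks. Note that this is not how RecursiveCharacterTextSplitter works internally: that splitter breaks text on separators such as paragraphs, lines, and spaces, then merges the pieces up to the chunk size. The hypothetical `fixed_window_chunks` function below uses plain fixed-size character windows, only to show what "overlap" means.

```python
def fixed_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = fixed_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a chunk boundary still appears intact in at least one chunk when the overlap is large enough.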
Each document added to a collection needs a unique string id. Here we pass random UUIDs, but you can use any unique string, for example the file name combined with the chunk index.
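As one way to derive ids from the file name, the sketch below builds one id per chunk by appending the chunk's position to the file name, since a single PDF usually produces several chunks and every id must be unique. `make_chunk_ids` is a hypothetical helper, and the `name-index` format is just one possible convention.

```python
from pathlib import Path


def make_chunk_ids(pdf_path: str, n_chunks: int) -> list[str]:
    """Build one id per chunk from the file name and the chunk's position."""
    file_name = Path(pdf_path).name
    return [f"{file_name}-{index}" for index in range(n_chunks)]


print(make_chunk_ids("docs/report.pdf", 3))
# ['report.pdf-0', 'report.pdf-1', 'report.pdf-2']
```

In the loading loop, this could replace the UUID list with `ids=make_chunk_ids(pdf_file, len(documents))`. Ids built this way are only unique as long as no two PDFs share the same file name, which holds when all files come from a single directory.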
Let me know if you have any questions.