In this tutorial, we will walk through how to use Chromadb as your vector database for all your Retrieval-Augmented Generation (RAG) tasks.
But before that, you need to install Chromadb, if you’re using Python then all you need to do is –
pip install chromadb
Now that you’ve installed Chromadb, let’s begin. We will use a PDF file as an example. For the PDF we will be using this research paper, but feel free to use the PDF of your choice.
The first step is to create a persistent client, i.e., the storage which can be used at multiple places. While creating the client remember to add the setting to allow resetting the client should you require this functionality.
import chromadb
client = chromadb.PersistentClient(
path="<path of persistent storage>", settings=chromadb.config.Settings(allow_reset=True)
)
Once you’ve a client set up then you can define collections within it. If you use BigQuery or any SQL products, imagine the client being the project and the collection being the dataset. Within the collection, you can store the documents as embeddings. In this example, we will call our collection as “research”. Also, one very important thing to remember is that each collection should have its own embedding function that has to be fixed. The query is also passed as an embedding when you try to search for the most similar documents. So in case you use embedding function X to add the documents and use embedding function Y to query them, then the similarity scores will not be correct, so this is a point to remember. We will be using the OpenAI ttext-embedding-3-small model. Another point to remember is that in a single document, you should only have as many tokens as the embedding function can embed. If say your embedding function is all-MiniLM-L6-v2 from HuggingFace, then the max sequence length that the function can handle is 256, so if you try to vectorise a file with longer context, then it will just clip the document to 256 tokens and embed that. The model from OpenAI has a longer max sequence length, but how much exactly is hard to find.
# Defining the embedding function
embedding_func = embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ.get("OPENAI_API_KEY") ,
model_name="text-embedding-3-small")
Creating the collection, it’s best practice to specify the embedding function while creating the collection, otherwise, Chromadb uses a default embedding function. Chromadb will use sentence-transformer as a default if no embedding function is supplied.
Now we will need to add a document to this collection, for this, we will use some helper functions from langchain.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
data_path = "./data/2311.04635v1.pdf"
pdf_loader = PyPDFLoader(data_path)
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
)
documents = pdf_loader.load_and_split(text_splitter=text_splitter)
The PyPDFLoader will help to load the PDF file and the RecursiveCharacterTextSplitter will help in splitting the PDF into chunks. We are using a chunk size of 1000 with an overlap of 50, meaning the chunk sizes will be roughly 1000 tokens with overlapping text of 50 tokens. You can learn more about how the text splitter works here.
Now that we have our documents loaded, time to add them to the Chromadb collection. Since we’ve specified the embedding function in the collection already, we can simply add the text files as embeddings. You have to specify “ids”, and think of them as table names in SQL, also you can specify metadata for each document, both are useful when you want to upsert or delete documents in the collection.
collection.add(
documents=[i.page_content for i in documents],
ids=[f"pdf_chunk_{i}" for i in range(len(documents))],
metadatas=
[
{
"file_name": "reasearch_paper",
"timestamp": datetime.now(timezone.utc).isoformat(),
}
for _ in documents
],
)
Here we use the page contents of the loaded documents as documents, since they are text, the collection using its embedding function will automatically convert them as embeddings. In case you already have embeddings, you can directly add them as embeddings. I’ve also specified some ids, these are very rudimentary here for illustration purposes. Also, I’ve specified some metadata for each document. Both of these will be used to query, upsert or delete individual documents from the vector database.
Read more about how you can upsert documents, query a collection and delete individual documents in Part II of this Ultimate Guide to ChromaDB

