Tag: ChromaDB

  • Ultimate Guide to Chroma Vector Database: Everything You Need to Know – Part 2

    In Part 1, we learned how to create the vector database and add documents to a collection. In this tutorial, we will learn how to query the collection, upsert documents, delete individual documents, and delete the collection itself.

    Querying

    There are a few ways to look up documents. You can peek at the collection, which returns the first 10 documents by default (you can also specify how many documents to peek at), or you can retrieve documents by their IDs or metadata.

    collection.peek(5) # Returns the first 5 documents in the collection
    collection.get(ids=['pdf_chunk_0', 'pdf_chunk_1']) # Returns the documents with the IDs given in the list

    You can also filter a collection using the where argument, which matches on metadata. For example, in Part 1 we added metadata to each document, where the file name was research_paper. So we can retrieve all documents carrying that metadata –

    collection.get(where={'file_name': 'research_paper'})

    You can also query for the documents most similar to an input text. For example, if I want to know who the authors of the research paper are, I can retrieve the documents most likely to contain this information by running –

    collection.query(query_texts=["Who are the authors of the paper ?"], n_results=3)

    Here query_texts holds my queries and n_results is the number of similar documents I want back per query. You can pass multiple queries at once, in which case the results are returned per query, in the same order.
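    As a minimal sketch, assuming the same collection, a two-query call looks like this (the second query is just an illustrative example):

    results = collection.query(
        query_texts=[
            "Who are the authors of the paper ?",
            "What datasets were used ?",
        ],
        n_results=3,
    )

    # results is a dict of lists of lists, one inner list per query:
    # results['ids'][0]       -> IDs of the 3 closest documents for the first query
    # results['documents'][1] -> texts of the 3 closest documents for the second query
    # results['distances'][0] -> distances of the matches for the first query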

    Upserting

    Similar to querying, you can upsert by providing the IDs. So, for example, if I want to upsert the data at ID pdf_chunk_0, I'll run the following –

    collection.upsert(ids=['pdf_chunk_0'], documents=['This is an example of upsertion'])

    Now if I get this document, I should see the new text instead of the original document. Note that if you provide an ID that is not present in the collection, ChromaDB will treat it as an add operation.
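    A quick sketch to verify both behaviours (pdf_chunk_999 is a hypothetical ID that does not exist in the collection yet):

    # Existing ID: the stored text has been replaced
    collection.get(ids=['pdf_chunk_0'])
    # -> {'ids': ['pdf_chunk_0'], 'documents': ['This is an example of upsertion'], ...}

    # Unknown ID: the upsert behaves like an add
    collection.upsert(ids=['pdf_chunk_999'], documents=['A brand new document'])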

    Deleting

    Again, you can delete individual documents by specifying either the IDs or a where filter. So if I want to delete pdf_chunk_0, I can run –

    collection.delete(ids=['pdf_chunk_0'])

    Or, if I want to delete all documents carrying some metadata, I can run –

    collection.delete(where={"file_name": "research_paper"})
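    If you want to confirm what a delete did, collection.count() is a handy check (a sketch, assuming the collection from Part 1):

    before = collection.count()
    collection.delete(ids=['pdf_chunk_1'])
    print(before - collection.count())  # 1, i.e. one document was removed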

    You can also delete the entire collection –

    client.delete_collection('research')

    In case you want to reset the client, and you've enabled this via allow_reset=True in the settings when creating the persistent client, you can run client.reset(). This empties and completely resets the database. ⚠️ This is destructive and not reversible.

    Let me know if you want to learn more about ChromaDB, and I'll create a guide for advanced users.

  • Ultimate Guide to Chroma Vector Database: Everything You Need to Know – Part 1

    In this tutorial, we will walk through how to use ChromaDB as your vector database for all your Retrieval-Augmented Generation (RAG) tasks.

    But before that, you need to install ChromaDB. If you're using Python, all you need to do is –

    pip install chromadb

    Now that you've installed ChromaDB, let's begin. We will use a PDF file as an example. For the PDF we will be using this research paper, but feel free to use a PDF of your choice.

    The first step is to create a persistent client, i.e., storage that can be reused across sessions and places. While creating the client, remember to add the setting that allows resetting the client, should you require this functionality later.

    import chromadb

    client = chromadb.PersistentClient(
        path="<path of persistent storage>",
        settings=chromadb.config.Settings(allow_reset=True),
    )

    Once you have a client set up, you can define collections within it. If you use BigQuery or any SQL product, imagine the client being the project and the collection being the dataset. Within a collection, you store the documents as embeddings. In this example, we will call our collection “research”.

    One very important thing to remember is that each collection has its own embedding function, and it has to stay fixed. The query is also passed as an embedding when you search for the most similar documents, so if you use embedding function X to add the documents and embedding function Y to query them, the similarity scores will not be correct. We will be using the OpenAI text-embedding-3-small model.

    Another point to remember is that a single document should only contain as many tokens as the embedding function can embed. If, say, your embedding function is all-MiniLM-L6-v2 from HuggingFace, the maximum sequence length it can handle is 256 tokens, so if you try to vectorise a longer file, it will just clip the document to 256 tokens and embed that. The OpenAI model supports much longer inputs (its documented maximum is 8,191 tokens).

    # Defining the embedding function (imports added so the snippet is self-contained)
    import os

    from chromadb.utils import embedding_functions

    embedding_func = embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ.get("OPENAI_API_KEY"),
        model_name="text-embedding-3-small",
    )

    Next we create the collection. It's best practice to specify the embedding function while creating the collection; otherwise ChromaDB falls back to its default, a sentence-transformers model.
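    A minimal sketch of that creation call, using the “research” name this guide refers to later:

    collection = client.create_collection(
        name="research",
        embedding_function=embedding_func,
    )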

    Now we need to add documents to this collection. For this, we will use some helper functions from LangChain.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader
    
    data_path = "./data/2311.04635v1.pdf"
    
    pdf_loader = PyPDFLoader(data_path)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    documents = pdf_loader.load_and_split(text_splitter=text_splitter)

    The PyPDFLoader loads the PDF file and the RecursiveCharacterTextSplitter splits it into chunks. We are using a chunk size of 1000 with an overlap of 50; by default the splitter measures length in characters, so the chunks will be roughly 1,000 characters long with about 50 characters of overlapping text between consecutive chunks. You can learn more about how the text splitter works in the LangChain documentation.
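    A quick way to sanity-check the split (a sketch; the exact counts depend on your PDF):

    print(len(documents))                               # number of chunks produced
    print(max(len(d.page_content) for d in documents))  # longest chunk, at most 1000 characters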

    Now that we have our documents loaded, it's time to add them to the ChromaDB collection. Since we've already specified the embedding function on the collection, we can simply add the text and it will be embedded for us. You have to specify ids, which act like primary keys in SQL, and you can also attach metadata to each document; both are useful when you want to upsert or delete documents in the collection.

    from datetime import datetime, timezone

    collection.add(
        documents=[doc.page_content for doc in documents],
        ids=[f"pdf_chunk_{i}" for i in range(len(documents))],
        metadatas=[
            {
                "file_name": "research_paper",
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            for _ in documents
        ],
    )

    Here we pass the page contents of the loaded documents as the documents; since they are text, the collection will automatically convert them into embeddings using its embedding function. If you already have precomputed embeddings, you can add those directly instead. I've also specified some ids, which are very rudimentary here for illustration purposes, and some metadata for each document. Both will be used to query, upsert, or delete individual documents from the vector database.
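    For the precomputed case, a minimal sketch (the two 3-dimensional vectors are hypothetical placeholders; real vectors must match the dimensionality of the collection's embedding function, 1,536 for text-embedding-3-small):

    collection.add(
        ids=["precomputed_0", "precomputed_1"],
        embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # your own vectors go here
        documents=["first text", "second text"],        # optional, but useful to keep the raw text
    )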

    Read more about how you can query a collection, upsert documents, and delete individual documents in Part 2 of this Ultimate Guide to ChromaDB.

  • Create Your Own Vector Database

    In this tutorial, we will walk through how you can create your own vector database using Chroma and LangChain. With this, you will be able to easily store PDF files and use the Chroma database as a retriever in your Retrieval-Augmented Generation (RAG) systems. In another part, I'll walk through how you can take this vector database and build a RAG system.

    # Importing Libraries

    import chromadb
    import os
    from chromadb.utils import embedding_functions
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader
    from pathlib import Path
    from glob import glob
    from uuid import uuid4

    Now we will define some variables –

    db_path = "<path where you want to store the db>"
    collection_name = "<name of the Chroma collection, it's similar to a dataset>"
    document_dir_path = "<path where the PDFs are stored>"

    You also need to create an embedding function. I will use an OpenAI model here as it's very cheap and good, but you can use open-source embedding functions as well. You'll need to pass this embedding function every time you fetch the collection.

    embedding_func = embedding_functions.OpenAIEmbeddingFunction(
        api_key="<openai_api_key>",
        model_name="text-embedding-3-small",
    )

    Now we need to initialise the client. We will use a persistent client and create our collection –

    client = chromadb.PersistentClient(path=db_path)
    client.create_collection(
        name=collection_name,
        embedding_function=embedding_func,
    )

    Now let's load our PDFs. First we create a text splitter, then for each PDF we load it, split it into documents, and store those in the collection. You can use any chunk size you want; we will use 1000 here.

    chunk_size = 1000

    # Load the collection
    collection = client.get_collection(
        collection_name, embedding_function=embedding_func
    )
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=20,
        length_function=len,  # chunk size is measured in characters
    )

    for pdf_file in glob(f"{document_dir_path}*.pdf"):
        pdf_loader = PyPDFLoader(pdf_file)
        documents = [
            doc.page_content
            for doc in pdf_loader.load_and_split(text_splitter=text_splitter)
        ]
        collection.add(
            documents=documents,
            ids=[str(uuid4()) for _ in range(len(documents))],
        )

    The collection requires an id for every document. You can pass any string value; here we are passing random UUIDs, but you could, for example, derive the id from the file name, as shown below.
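    As a sketch, inside the loop you could instead build the ids from the file name (the pattern here is just one hypothetical choice; anything unique per file and chunk works):

    ids = [
        f"{Path(pdf_file).stem}_chunk_{i}"  # e.g. "2311.04635v1_chunk_0"
        for i in range(len(documents))
    ]
    collection.add(documents=documents, ids=ids)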

    Let me know if you have any questions.