Guides / Tutorials

Vector Store and RAG Tutorial

Learn how to create, update, and query your own managed vector store with EquoAI!

What are Vector Stores, and why do I need one?

Vector Stores, or Vector Databases, are crucial to any modern implementation of a Large Language Model.

Neural networks only retain knowledge obtained through training data, and therefore have limited knowledge of the world and current events beyond it. This is why, when you ask ChatGPT about anything that happened in 2023, it will likely remind you that its training data extends only up to 2021 (or, very recently, 2022). For an AI to learn about the current world, you would therefore need to train it (or, preferably, fine-tune it) on recently generated text data.

There is a huge problem with this, however. LLMs tend to have tens, if not hundreds, of billions of parameters in their neural architectures, so training them requires enormous amounts of data and thousands of GPU hours.

Improving AI Response Quality With Your Vector Store

We offer an alternative: "RAG", or "Retrieval Augmented Generation". Using RAG, instead of tuning or training a model on current-year data to keep it up-to-date, we feed information into the AI model from a repository of documents. These documents contain all the up-to-date information that we will likely need the AI to also know. Since neural networks don't actually understand raw text ("strings"), we must convert the text to vector embeddings using layers from a pretrained language model. The documents' vector embeddings are then stored in a modified binary search tree data structure (yes, the same one that you struggled with on LeetCode!).

Whenever you want to augment your model's knowledge, the query (the thing you're asking the AI to do) is vectorized, and the Vector Store is searched for the vector embeddings most similar to it. The K most similar documents related to your query are retrieved, and then, using some clever prompt engineering, you can give the information to the model. In its wisdom, the language model shall analyze the documents in relation to your question, and with its newfound knowledge, provide you with an accurate and useful answer.
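Under the hood, this search boils down to cosine similarity between vectors. Here is a minimal, self-contained sketch of the idea, using toy 3-dimensional vectors in place of real model embeddings and plain NumPy rather than the EquoAI client:

```python
import numpy as np

def top_k_similar(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the
    query vector, ranked by cosine similarity (highest first)."""
    q = np.asarray(query_vec, dtype=float)
    D = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity: dot product normalized by vector magnitudes.
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    return [int(i) for i in np.argsort(sims)[::-1][:k]]

# Toy 3-dimensional "embeddings" standing in for real model output.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k_similar(query, docs, k=2))  # → [0, 1]
```

A production Vector Store indexes the vectors (e.g., in a tree) so it never has to compare the query against every document, but the ranking criterion is the same.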

A simpler way of thinking about this is to imagine that you have a really, really smart friend. One day you ask them a question that they don't know the answer to right away. Instead of giving up, they crack open a book and find the answer for you. In this analogy, your smart friend is ChatGPT/Llama/Falcon/Skynet/Insert-Your-Favourite-LLM-Here, and the book is the Vector Store.

Other Uses

Retrieval Augmentation also provides a solution to another common problem in AI: "hallucinations". AI models have been known to concoct made-up facts, which has gotten lawyers and students alike in trouble. RAG offers a solution: if the AI is simply paraphrasing information from documents it was provided from the Vector Store, then the only way you're likely to get made-up facts from the AI is if your documents were made up to begin with. As long as the Vector Store you're using for RAG doesn't contain a copy of one of the Harry Potter books, your AI probably won't invent fake case briefs and get your firm fined (we're looking at you, lawyers!).

This way, instead of spending millions of dollars to train an AI in order to teach it correct information, we simply supplement its knowledge with a set of documents.
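To make the "supplement its knowledge" step concrete, here is a minimal sketch of how retrieved documents might be spliced into a prompt. The template wording is purely illustrative, not a fixed EquoAI format:

```python
def build_rag_prompt(question, retrieved_docs):
    """Splice retrieved documents into the prompt so the model paraphrases
    them instead of answering from (possibly stale) training data."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Where can I get the best customer service?",
    ["I had the best service at Dave's coffee of any place!"],
)
print(prompt)
```

Instructing the model to answer only from the supplied context is what ties its response to the vetted documents rather than to whatever it memorized in training.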

Python Example

The Fun Part- Let's Code!


Make sure that you're running Python 3.6+:

python3 --version 

Our most current software distribution relies on vectorization libraries, so make sure to pip install the following (package names inferred from the imports used below):

pip install equoai sentence-transformers

Import the necessary packages, and initialize your first vector database.

Before running the next bit of code, make sure that you have an account with us. Sign in to your account, then navigate to the API key page to generate your API key.

NOTE: This API is a paid service. After you sign up with us using a valid email, you'll gain access to this API and many other services.

Just hit the big purple button that says "Generate API Key"- we promise you won't miss it.

What you're going to do now is initialize a database object using said API key.

import os

from equoai import equonode
from sentence_transformers import SentenceTransformer

# Input your API key
db = equonode(os.getenv('EQUOAI_API_KEY'))
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
project_name = "my-first-project"  # any name you like
documents = [
          "Canada is one of Britain's oldest former colonies",
          "Ram of Zod is an awesome heavy metal band. White Sheet Grooms isn't as great.",
          "Pineapples definitely belong on pizza",
          "Bojo's bizarre adventure is one of the greatest animes of all time.",
          "I had the best service at Dave's coffee of any place!",
]
# Obtain embeddings and convert from ndarray to Python list.
# Make sure that your embeddings are a list of floating point values before uploading.
embeddings = model.encode(documents)
embeddings = embeddings.tolist()
# This will also overwrite an existing project of the same name.
db.create(documents, embeddings, project_name, metadata={
          "topic_name": "Here is a fun fact to spice up our already spicy query!"
})

Run a similarity search against your Vector Store to retrieve the Top-K stored documents (texts) most similar to a query.

query = "Where can I get the best customer service?"
# Obtain query embeddings and convert from ndarray to Python list
query_embeddings = model.encode(query)
query_embeddings = query_embeddings.tolist()
# Retrieve the 3 most similar documents
query_results = db.get(query, query_embeddings, project_name, top_k=3)
print('similarity search results: ', query_results)
The response should look like this:
similarity search results:  {'documents': [
  "I had the best service at Dave's coffee of any place!", 
  "Ram of Zod is an awesome heavy metal band. White Sheet Grooms isn't as great.",
  'Pineapples definitely belong on pizza'], 
  'num_tokens': [13, 20, 8]}

Update Existing Embeddings

# Update my embeddings with more relevant information.
new_documents = [
    "I should definitely use EquoAI's vector store and RAG pipeline to improve my AI project",
    "I think eventually that aliens will reveal themselves. I saw an alien fly over my house today."
]
new_embeddings = model.encode(new_documents)
new_embeddings = new_embeddings.tolist()
db.update(new_documents, new_embeddings, project_name)

Test Embeddings Upload!

updated_query = "What software should I use with EquoAI to improve my AI projects?"
updated_query_embeddings = model.encode(updated_query)
updated_query_embeddings = updated_query_embeddings.tolist()
updated_query_results = db.get(updated_query, updated_query_embeddings, project_name, top_k=3)
print('similarity search results: ', updated_query_results)
similarity search results:  {'documents': [
  "I should definitely use EquoAI's vector store and RAG pipeline to improve my AI project",
  "I should definitely use this vector store with Langchain to improve my AI project",
  "Bojo's bizarre adventure is one of the greatest animes of all time."],
  'num_tokens': [19, 15, 16]}

To delete a resource, you should simply be able to write:

# Delete the current project. Any data stored will be lost.

Dynamic Querying

Large Language Models are prone to hallucination. If you provide too many tokens (words) to an LLM all at once, its context window might not be large enough to make use of the entire prompt. It may then respond as if to a malformed question, concocting irrelevant or imaginary details in the process.

Mitigating hallucination therefore largely becomes a matter of fitting only the information most relevant to a given query into the RAG prompt. How do we determine which of our documents are the most relevant, and which ones we can exclude, in order to build the smallest possible relevant prompt?

EquoAI uses a built-in dynamic programming algorithm to generate all possible relevant combinations of your documents. This allows you to use only the most relevant docs, without worrying too much about document size. There are three core benefits to this feature:

  1. Prompt your LLM with only the most relevant context, thereby ensuring only the most useful possible response.

  2. The algorithm relies on bottom-up dynamic programming techniques, making it far faster than naively enumerating document combinations. Speed is important when building chatbot applications for customers or employees to use.

  3. By minimizing the number of documents in the query, the number of tokens in the RAG prompt is reduced, which drastically lowers the risk of hallucination.
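To make the idea concrete, here is a brute-force sketch of scoring every snippet combination against a query, with toy 2-dimensional vectors standing in for real embeddings. EquoAI's built-in algorithm replaces this exponential enumeration with a much faster bottom-up DP, but the selection criterion it optimizes is the same kind of relevance score:

```python
import itertools
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_combination(query_vec, snippets, snippet_vecs):
    """Score every snippet combination (mean of its embeddings) against the
    query and keep the most relevant one; iterating smaller combinations
    first means the shortest combination wins ties."""
    best, best_score = None, -1.0
    for r in range(1, len(snippets) + 1):
        for combo in itertools.combinations(range(len(snippets)), r):
            vec = np.mean([snippet_vecs[i] for i in combo], axis=0)
            score = cosine(query_vec, vec)
            if score > best_score:
                best, best_score = combo, score
    return [snippets[i] for i in best], best_score

# Toy 2-dimensional vectors stand in for real embeddings.
snippets = ["Scythians inhabited the land.",
            "The Kazakh Khanate was established.",
            "Pineapples belong on pizza."]
vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
docs, score = best_combination([1.0, 0.1], snippets, vecs)
print(docs, round(score, 3))  # → ['Scythians inhabited the land.'] 0.995
```

Only the snippet pointing in roughly the same direction as the query survives; adding the others would dilute the combined document's relevance.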

Here's an example. Let's say we wanted to parse a Wikipedia article on Kazakhstan and use it to help us answer specific questions. In other words, let's "talk" with the Wiki article!

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
document = """ 
The territory of Kazakhstan has historically been inhabited by nomadic groups and empires.
In antiquity, the nomadic Scythians inhabited the land, and the Achaemenid Persian Empire expanded 
towards the southern region. Turkic nomads have inhabited the country from as early as the 6th century. 
In the 13th century, the territory was subjugated by the Mongol Empire under Genghis Khan. 
In the 15th century, as a result of disintegration of the Golden Horde, 
the Kazakh Khanate was established. By the 18th century, Kazakh Khanate disintegrated into three 
jüz which were absorbed and conquered by the Russian Empire; by the mid-19th century, 
the Russians nominally ruled all of Kazakhstan as part of the Russian Empire and liberated
the slaves of the Kazakhs in 1859.[14] Following the 1917 Russian Revolution and subsequent
Russian Civil War, the territory was reorganized several times. In 1936, it was established
as the Kazakh Soviet Socialist Republic within the Soviet Union. Kazakhstan was the last of
the Soviet republics to declare independence during the dissolution of the Soviet Union 
from 1988 to 1991. 
"""
snippets = document.split('.')
# query = 'What people have historically inhabited the nation of Kazakhstan?'
query = 'Which groups have inhabited the territory of Kazakhstan throughout antiquity?'
query = model.encode(query).tolist()
context = model.encode(snippets).tolist()
# Argument list assumed from the variables above; check the API reference for the exact signature.
(best_document, highest_similarity, dp) = db.dynamic_retrieval(
    query, context, snippets, project_name)
print('MAXIMUM CONTEXT RELEVANCE: {}'.format(highest_similarity))
print('HIGHEST RELEVANCE COMBINED DOCUMENT : {}'.format(best_document))
In antiquity, the nomadic Scythians inhabited the land, and the Achaemenid Persian Empire expanded 
towards the southern region.  
The territory of Kazakhstan has historically been inhabited by nomadic groups and empires. 

Here the "Maximum Context Relevance" is the cosine similarity between the query embedding and the embedding of the resulting combined document. The resulting combination of snippets answers our query better than any other combination of snippets from the Wiki article, and due to its brevity carries a low risk of hallucination when prompting an LLM.

Further Optimizing Queries

If you have a large number of relevant documents, and wish to further reduce hallucination risk, you may optimize the query even further by setting optimized=True in the dynamic_retrieval() method. This activates a 1D clustering algorithm that clusters all document combinations by relevance to our query, and then returns the smallest example by token count. This is useful for doing a relevance search across thousands of documents of varying sizes, and will likely require its own tutorial soon!

# Argument list assumed from the example above; check the API reference for the exact signature.
best_document = db.dynamic_retrieval(
    query, context, snippets, project_name, optimized=True)
print('HIGHEST RELEVANCE COMBINED DOCUMENT : {}'.format(best_document))
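As a rough mental model (a conceptual sketch, not EquoAI's actual implementation), optimized=True can be pictured as clustering the 1-D relevance scores of the candidate combinations, keeping the top cluster, and returning its shortest member by token count:

```python
def pick_optimized(candidates):
    """candidates: list of (combined_document, relevance_score) pairs.
    Split the 1-D scores at their largest gap to isolate the top relevance
    cluster, then return its shortest member by whitespace token count."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    scores = [s for _, s in ranked]
    # The largest drop in score marks the boundary of the top cluster.
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return min(ranked[:cut], key=lambda c: len(c[0].split()))[0]

candidates = [("a long but highly relevant combined document", 0.91),
              ("short relevant doc", 0.90),
              ("barely related text", 0.41)]
print(pick_optimized(candidates))  # → short relevant doc
```

The two nearly-tied candidates form the top cluster, and the shorter one wins, which is exactly the token-minimizing behaviour described above.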


Protecting Sensitive Data

LLMs store enormous amounts of information during training and fine-tuning. When training your organization's own LLMs, your engineers will likely have to feed in lots of sensitive data, containing information related to both users and, possibly, company IP. Sometimes this happens unintentionally, because there are too many data points for engineering teams to clean properly. Even one piece of leaked user information poses a security risk.

In order to mitigate this risk, EquoAI's RAG Pipelines implement a built-in threshold parameter. This forces the RAG Pipeline (LLM + Vector Store) to respond only to queries relevant to data found within your documents. This way, even if an LLM learned sensitive information during its training phases, it is restrained from answering queries with information not found in a vetted document store. We don't know everything the LLM knows (especially when using an open-source LLM), but we can restrict the content of its responses to information which we know to be safe.

# Restrict the RAG Pipeline to only answer questions at least 50% relevant to a query!
# Protect your data!
# Argument list assumed from the example above; threshold is the relevance cutoff.
best_document = db.dynamic_retrieval(
    query, context, snippets, project_name, threshold=0.5)
print('HIGHEST RELEVANCE COMBINED DOCUMENT : {}'.format(best_document))
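Conceptually, the threshold guardrail behaves like the following sketch (toy vectors and plain NumPy, not EquoAI's implementation): when no stored document clears the relevance bar, the pipeline declines to answer rather than letting the LLM free-associate.

```python
import numpy as np

def guarded_answer(query_vec, doc_vecs, docs, threshold=0.5):
    """Return the most relevant stored document, or None when nothing in
    the store clears the relevance threshold."""
    q = np.asarray(query_vec, dtype=float)
    D = np.asarray(doc_vecs, dtype=float)
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None  # decline instead of letting the LLM free-associate
    return docs[best]

docs = ["Refund policy: 30 days", "Office hours: 9-5"]
vecs = [[1.0, 0.0], [0.0, 1.0]]  # toy embeddings for the two docs
print(guarded_answer([0.95, 0.05], vecs, docs))  # → Refund policy: 30 days
print(guarded_answer([-1.0, -1.0], vecs, docs))  # off-topic query → None
```

Returning None (rather than a low-relevance document) is the hook where a production pipeline would send back a canned "I can't help with that" response.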

There are two enormous value offerings derived from this feature:

  1. Restrict LLM responses only to information in a Vector Store vetted as being safe for exposure to the end-user.

  2. Greatly improve the quality of responses to end-users by ensuring that questions asked are relevant to documents available on-hand.

It is worth adjusting the threshold parameter according to an organization's needs. By tuning the optimized flag and the threshold parameter in production, it is possible to secure your organization's information while providing the best possible user experience with an LLM-based application.

This is the balance which EquoAI brings to GenAI products!

Copyright © EquoAI. All rights reserved.