Retrieval-Augmented Generation

RAG Pipeline Automated Quality Assurance

To keep quality high and competitive across LLM-based applications, it is often necessary to build in quality assurance. That way, an engineering team knows when it is time to make changes to its existing RAG Pipelines.

LLMOps

LLMOps, or Large Language Model Operations, is a subset of MLOps focused on autoregressive chat-based applications. It is the practice of monitoring, improving, and version-controlling LLMs in production.

Whereas MLOps for general ML models typically measures quality through validation accuracy and model drift, LLM applications are judged less by such concrete, traditional machine learning metrics and more by the experience of the end user.

Three key metrics are commonly used to track end-user experience with a RAG Pipeline:


1. Groundedness: relevance of the LLM response to the correct documents on file

2. Context Relevance: relevance of the documents in the RAG Pipeline to the user's question or prompt

3. Answer Relevance: relevance of the LLM response to the user's question or prompt

If all three metrics are high, your users are likely to be satisfied with your chat application. These metrics are also useful for monitoring how data-secure the RAG Pipeline is: if Context Relevance is generally high, the pipeline is providing relevant information to the end user and is less likely to be exposing unrelated, sensitive information. EquoAI uses these metrics to optimize both data security and the user experience, while giving your organization live measurements of these metrics over time.
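To make the three metrics concrete, here is a minimal sketch that approximates each one with cosine similarity between embeddings. This is an assumption made for illustration, not a description of how EquoAI scores pipelines; production evaluators often use an LLM as a judge instead. The function names `cosine_similarity` and `rag_triad` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (lists of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rag_triad(question_vec, context_vec, answer_vec):
    """Embedding-based proxy for the three metrics (hypothetical helper)."""
    return {
        # Does the response stay close to the retrieved documents?
        "groundedness": cosine_similarity(answer_vec, context_vec),
        # Do the retrieved documents match the user's question?
        "context_relevance": cosine_similarity(context_vec, question_vec),
        # Does the response address the user's question?
        "answer_relevance": cosine_similarity(answer_vec, question_vec),
    }
```

In practice you would pass in the embeddings of the user's question, the retrieved document, and the LLM response, all produced by the same embeddings model.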

Uses

Monitoring RAG Pipelines enables engineering teams to make informed decisions such as:
-- A/B testing RAG Pipelines to identify which vector-store/LLM combination provides the highest quality
-- Taking RAG Pipelines offline if security risks are detected
-- Changing the Vector Store schema, or deciding whether LLM fine-tuning is needed
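As a rough illustration of the first two decisions, the sketch below averages per-request metric scores for two pipelines, picks the stronger one, and flags a pipeline whose Context Relevance falls below a threshold. The record format, the helper names, and the 0.5 threshold are all assumptions chosen for the example; they are not part of the EquoAI API.

```python
from statistics import mean

METRICS = ("groundedness", "context_relevance", "answer_relevance")


def summarize(records):
    """Average each metric over a list of per-request score dicts."""
    return {m: mean(r[m] for r in records) for m in METRICS}


def pick_winner(summary_a, summary_b):
    """Compare two pipelines by the mean of their three averaged metrics."""
    total_a = mean(summary_a[m] for m in METRICS)
    total_b = mean(summary_b[m] for m in METRICS)
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"


def should_take_offline(summary, min_context_relevance=0.5):
    """Flag a pipeline whose retrieved context is mostly off-topic.

    The 0.5 default is an arbitrary placeholder; tune it on your own data.
    """
    return summary["context_relevance"] < min_context_relevance
```

A simple mean weights the three metrics equally. If one of them matters more for your application, a weighted average would be a reasonable variation.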

RAG Tutorial: Let's Do E-Commerce!

Now, we're going to build a RAG pipeline using EquoAI, leveraging our Vector Store to improve the response quality of ChatGPT. This will work with any sufficiently intelligent language model.

In this tutorial, we're going to LARP (live-action role-play) as Nick, an employee at an online fashion brand called Invinzsible.com. Invinzsible has some really stylish shirts, and Nick has to keep an account of them online. Thanks to AI, Nick no longer needs to understand Excel or SQL databases: he can write a quick note about the company's inventory, use equoai_db to compute and store the embeddings in the vector store, and then let the ChatBot do the rest. Nick can save his notes on inventory in a text file, and the web developers at the company then only need to run a few lines of code to generate and store the vector embeddings.

 
import os
from equoai import equonode
from sentence_transformers import SentenceTransformer
 
 
#Nick Sanchez's API key, read from the EQUOAI_API_KEY environment variable
db = equonode(os.getenv('EQUOAI_API_KEY'))
    
 
#Embeddings model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
 
documents = [
         "We made over $1800 in sales this past Monday",
         "Currently, there are only t-shirts available. You can get them as V-Necks or Crewnecks. Dress shirts will be coming by winter.",
         "No customers currently have asked us for a refund",
         "Our women's footwear are extremely popular, with a lot of customers living in Toronto"
         ]
 
#Obtain embeddings and convert from ndarray to Python list. 
#Make sure that your embeddings are a list of floating point values before uploading
embeddings = model.encode(documents)
embeddings = embeddings.tolist()
 
project_name='store-inventory'
 
db.create(documents, embeddings, project_name)
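Since the upload expects plain lists of floating point values, it can help to check the embeddings before sending them. The helper below is a hypothetical sketch, not part of equoai; it only uses the Python standard library and assumes one embedding per document, all of the same dimension.

```python
def validate_embeddings(documents, embeddings):
    """Check embeddings are lists of floats, one per document.

    Returns the embedding dimension; raises ValueError or TypeError otherwise.
    Hypothetical helper for illustration only.
    """
    if len(documents) != len(embeddings):
        raise ValueError("need exactly one embedding per document")
    if not embeddings:
        raise ValueError("nothing to upload")
    dim = len(embeddings[0])
    for i, vec in enumerate(embeddings):
        if not isinstance(vec, list):
            # e.g. a NumPy array that was never converted with .tolist()
            raise TypeError(f"embedding {i} is not a Python list")
        if dim == 0 or len(vec) != dim:
            raise ValueError(f"embedding {i} has the wrong dimension")
        if not all(isinstance(x, float) for x in vec):
            raise TypeError(f"embedding {i} contains non-float values")
    return dim
```

Calling `validate_embeddings(documents, embeddings)` just before the upload would catch a forgotten `.tolist()` early, rather than at request time.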

Let's say that one day, you go on the store's site and are greeted by a friendly chatbot. You want to know what kinds of shirts they have, but don't want to spend several minutes browsing, so you ask the ChatBot for some help. Sure enough, it provides you with a solid answer. Here's the backend code powering it:

 
customer_question = "How many types of shirt does the fashion brand Invinzsible currently sell?"
customer_question_embeddings = model.encode(customer_question).tolist()
#Search for the single most relevant document, using the same project name the documents were uploaded under.
search_results = db.get(customer_question, customer_question_embeddings, 'store-inventory', 1)
 
#Ask GPT-3.5 to answer our question based on the information in our store-inventory documents.
rag_prompt = "Given the following information, answer the Customer Question provided: " + search_results['documents'][0] + " " +  customer_question 
#completions() stands in for whatever LLM software you choose to generate your responses with, e.g. Cohere, OpenAI, etc.
response = completions(rag_prompt)
 
print(response)
#LIKELY AI RESPONSE:
# "The fashion brand Invinzsible currently sells two types of shirts: V-Necks and Crewnecks."

Notice that our uploaded documents never explicitly stated how many kinds of shirt are available at the online store.

They only named the types of shirt available. The Large Language Model read this information and, given a question asking how many types of shirt were sold, was able to determine that there are two.
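If you're curious what the retrieval step in the tutorial is doing conceptually, here is a brute-force sketch of a top-k similarity search in plain Python. This is a generic illustration and an assumption on our part: it does not describe how the EquoAI Vector Store is implemented, and the name `top_k_search` is hypothetical. The returned dictionary uses a 'documents' key only to mirror the shape read in the tutorial code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_search(query_vec, documents, embeddings, k=1):
    """Score every stored document against the query and keep the k best."""
    scored = [
        (cosine_similarity(query_vec, vec), doc)
        for doc, vec in zip(documents, embeddings)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = scored[:k]
    return {
        "documents": [doc for _, doc in best],
        "scores": [score for score, _ in best],
    }
```

A linear scan like this is fine for a handful of notes. Larger collections typically rely on an approximate nearest-neighbor index instead.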

This is the power of efficient information retrieval coupled with artificial intelligence: we can use AI to derive insights from large collections of documents, something that in the past was only possible by having a human read through them. Automating tasks like this provides valuable insights to businesses and employees alike, because you can now talk with your documents!

And that's how you do practical RAG with EquoAI!