
Why LLMs Hallucinate, and How to Manage Hallucination Risks

Hallucinations in AI

Many developers and customers alike complain about hallucinations. They ask an LLM a question with a very specific prompt and often receive an unhelpful or irrelevant reply. Worse, an LLM can make up untrue facts; in both academia and the legal profession, this can have dire consequences. Both of these scenarios are referred to as hallucinations!

This can be frustrating, and any B2C product relying on an LLM will become unbearable for users if it continuously delivers low-quality replies. Users can become enraged if the replies contain false information, especially professionals and business owners trying to reduce the time they spend processing documents.

Why Do Language Models Hallucinate?

Because we all do. Contrary to popular belief, Large Language Models (LLMs) are not Stochastic Parrots, or what the physicist Michio Kaku referred to as "glorified tape recorders". They are, in fact, glorified hallucination machines!

As it turns out, LLMs actually do appear to be somewhat intelligent (if this worries you, then you're starting to get the big picture!). Outputs from LLMs are often unique, in the same way that every conversation between two people is. LLMs appear to understand language (that's kind of the point), and can draw connections between different concepts to varying degrees. This allows them to generate novel content that is usually not identical to the data they were trained on.

LLMs Are Just Hallucination Machines

In the same way that the human brain hallucinates its own thoughts while processing information, LLMs hallucinate outputs that are approximately similar to, but not exactly the same as, the conversational data on which they were trained.

How LLMs Speak

Bear with us here. Every person has a vocabulary of distinct words in their language, from which they choose when speaking or writing out a thought. When you are speaking, you are predicting which word best completes the sentence, in an entirely subconscious fashion. You are randomly sampling words from the list of words that you are already familiar with.

LLMs work in much the same way. Each word predicted by an LLM in response to a prompt is randomly sampled from its vocabulary, with some words far more likely to be chosen than others.

Many quantities in nature follow a Normal (Gaussian) distribution, which you may recognize as the notorious bell curve.

Not how our vocabulary is distributed when we speak!

If the likelihood of each word followed a Normal Distribution, we would end up mostly just saying our favourite words (think middle of the bell-curve!), resulting in nonsense word-salads.

LLMs weight each token in their vocabulary using a Boltzmann Distribution, and then sample from it. This is the probability scheme used in Thermal Physics to identify the likelihood of a system having a certain amount of energy, given a temperature. Because the vocabulary is a discrete list of tokens, this is a probability mass function rather than a density, and in machine learning it is commonly referred to as the Softmax Function.
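As a rough illustration of the Boltzmann (softmax) weighting, here is a minimal Python sketch. The candidate words, their scores, and the temperature values are made up for this example; a real model computes a score for every token in its vocabulary.

```python
import math

def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words (illustrative only).
words = ["milk", "soup", "gravel"]
scores = [2.0, 1.0, -1.0]

cool = softmax(scores, temperature=0.5)  # lower temperature: sharper, favours the top word
warm = softmax(scores, temperature=2.0)  # higher temperature: flatter, more variety
```

Lowering the temperature concentrates probability on the highest-scoring word, while raising it spreads probability across less likely words.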

In English

For our less nerdy readers, the translation of the above passage is that LLMs, not unlike ourselves, speak by figuring out which words are most likely to come next in a sentence, and then randomly sampling from them. Some words are more likely than others, based entirely on the context of that word in a sentence.

Example: The Round Black Cat Ate A Bowl Of ___

When a native English speaker reads this sentence, there are over 30,000 words in the language which could fill in the blank here. What your brain does, however, is assign a likelihood to each word that could come next, and then sample from the most likely words off the top of your head.

You probably thought "Catnip", "Milk", "Cereal" or "Soup" (even if you stopped yourself from saying it, your brain subconsciously considered this last option after seeing the word Bowl). If you said "Catnip" (arguably the most plausible answer in context), then this was the word to which your brain assigned the highest probability of coming next.
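The fill-in-the-blank example can be sketched in a few lines of Python. The probabilities below are invented purely for illustration and do not come from any real model; the point is only that likelier words are drawn more often, while unlikely words still appear occasionally.

```python
import random
from collections import Counter

# Hypothetical probabilities for the blank; not taken from any real model.
candidates = {"Catnip": 0.5, "Milk": 0.3, "Cereal": 0.15, "Soup": 0.05}

def sample_next_word(probabilities, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_next_word(candidates, rng) for _ in range(10_000))

for word, count in counts.most_common():
    print(f"{word}: {count}")
```

Even "Soup", the least likely option here, is chosen some of the time, which is the same mechanism that lets a model occasionally produce an unexpected or wrong continuation.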


Transformers are a wonderful architecture introduced by researchers at Google in 2017, in the famous paper Attention Is All You Need. Transformers convert vocabulary tokens ("words", but also suffixes and prefixes) into mathematical representations called Embeddings.

These embeddings are then fed through a long sequence of mathematical operations to eventually predict the likelihood of the next word in a sequence, based on the context provided by all previous parts of that sequence. The model learns to do this during training, using an algorithm called gradient descent, which figures out which parameters (think "neurons") give the most accurate results, using techniques from vector calculus.

The mathematics of Transformers is beyond the scope of a blog post. However, it is our hope that business owners gain valuable insights from a high-level understanding of what this technology does.

Large Language Models are almost all based on the Transformer architecture. This means that they predict the next word in a sentence from the context provided by all previous words. Since the model is playing a guessing game, choosing the next word is comparable to rolling a weighted die, and this can cause the model to say things that are incorrect or even misleading.
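Gradient descent itself can be illustrated on a far smaller problem than a Transformer. The sketch below fits a single parameter to toy data; the data, learning rate, and step count are assumptions chosen for this example, and a real LLM applies the same idea to billions of parameters.

```python
def gradient_descent(xs, ys, learning_rate=0.01, steps=500):
    """Fit y = w * x by repeatedly nudging w against the error gradient."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

# Toy data generated from y = 3x, so the best parameter is w = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w = gradient_descent(xs, ys)
```

Each step moves the parameter a little in the direction that reduces the error, which is how the parameters giving the most accurate results are found.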

Impact of Hallucinations on Professionals

If LLMs are hallucination machines by nature, then many businesses appear to be at an immediate disadvantage when attempting to use them. These are businesses where the information provided needs to be factual, or grounded. Some examples:

Mortgage Underwriters

Insurance Companies

Lawyers and Paralegals

Medical Professionals


How We Help

These disadvantages go away with EquoAI. One of our core offerings is minimizing hallucinations, that is, ensuring that LLMs respond with accurate information.

Retrieval-Augmented Generation

"RAG", as it's commonly known, is the unflattering acronym for the process of providing context (facts) to an LLM. Done correctly, it's a near-magical solution to this problem.

Businesses wishing to use LLMs need a RAG Pipeline, which is a software program that gathers, transforms, and feeds their data into an LLM whenever a user prompts it.

Production-grade RAG Pipelines are often complex to build, as they require a solid understanding of vector algebra, computer science algorithms, and data engineering. If they're built correctly, however, a business can rely on its LLMs to provide high-quality, high-accuracy outputs based on the documents at hand.
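To give a feel for the moving parts, here is a deliberately simplified Python sketch of the retrieval step. It is not a production pipeline and not a description of EquoAI's implementation: the word-count "embedding" is a stand-in for a real embedding model, and the documents, function names, and prompt wording are all invented for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine_similarity(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents, top_k=1):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n".join(retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical documents, invented for this example.
documents = [
    "The policy covers water damage up to 10000 dollars.",
    "Mortgage applications require two recent pay stubs.",
    "Office hours are 9am to 5pm on weekdays.",
]
question = "What documents does a mortgage application require?"
prompt = build_prompt(question, documents)
```

The resulting prompt would then be sent to the LLM, so the model answers from the retrieved facts rather than from memory alone.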

Quality Assurance

The quality of a RAG Pipeline's answers is often measured by groundedness, a score from 0 to 100% assessing how relevant the retrieved documents are to the original prompt from an end-user. Each time a user prompts the model, the RAG Pipeline retrieves the most relevant information to help answer their question. The relevance of those documents to the prompt is the groundedness.

By monitoring the groundedness of an application, it is possible to keep track of when a model hallucinates and then make changes to the pipeline accordingly. This is essential for high-traffic LLM apps where even a small percentage of harmful replies can cost a business thousands of dollars.
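Monitoring of this kind can be sketched very simply. The example below assumes groundedness scores have already been logged as numbers between 0.0 and 1.0 (that is, 0 to 100%); the function name, the 0.7 threshold, and the scores themselves are hypothetical.

```python
def summarize_groundedness(scores, threshold=0.7):
    """Summarize logged groundedness scores (each between 0.0 and 1.0)."""
    if not scores:
        raise ValueError("no scores to summarize")
    # Indices of prompts whose score fell below the chosen threshold.
    flagged = [i for i, s in enumerate(scores) if s < threshold]
    return {
        "mean": sum(scores) / len(scores),
        "flagged_indices": flagged,
        "flagged_rate": len(flagged) / len(scores),
    }

# Hypothetical scores logged for five user prompts.
logged_scores = [0.92, 0.85, 0.40, 0.78, 0.55]
report = summarize_groundedness(logged_scores)
```

Flagged prompts are the ones worth reviewing by hand, and a rising flagged rate is a signal that the pipeline needs attention.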

Using High-Quality Embeddings and Search Algorithms

RAG can further be improved by taking advantage of our dynamic vector search algorithms coupled with the use of high-quality embeddings and reranking. In our developers' article on embeddings, we cover why this is important for ensuring the best responses.

Reach Out!

If you'd like to begin building with LLMs, then get started with our RAG services and your business is guaranteed to hit the ground running with LLMs.

Copyright © EquoAI. All rights reserved.