Hallucinations in LLMs: Strategies for monitoring


Sevval Gölbasi


Hallucinations in large language models (LLMs) refer to instances where the model generates information that is incorrect, incomplete, or unfaithful to the input. These hallucinations can take many forms, from giving wrong answers to leaving out important details. Additionally, when evaluating model performance, it’s important to consider other quality metrics, such as tone of voice, toxicity, bias, or even hatefulness.


Methods for detecting hallucinations


Probability-based

The probability-based approach using the MMLU (Massive Multitask Language Understanding) benchmark with few-shot prompting works like this: The model is given a few example questions along with their correct answers as a way of learning the task. Then, when it's asked to solve similar questions on its own, it calculates the probabilities for each possible answer and picks the one with the highest probability. If it gets the answer right, it earns a point.
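
As a concrete illustration, the sketch below scores each answer option by its log-probability under a causal language model and picks the most likely one. The model name, question, and answer options are placeholder assumptions for illustration, not the actual MMLU tooling.

```python
# Minimal sketch: score multiple-choice answers by their log-probability
# under a causal LM and pick the most likely option.
# Model name, question and options are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Question: What is the capital of Germany?\nAnswer:"
options = [" Berlin", " Munich", " Hamburg", " Cologne"]

def option_logprob(prompt: str, option: str) -> float:
    """Sum the log-probabilities of the option tokens given the prompt."""
    full = tokenizer(prompt + option, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    # log-probabilities of each next token, aligned with the following tokens
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_ids = full.input_ids[:, 1:]
    token_log_probs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
    # keep only the tokens belonging to the answer option
    return token_log_probs[0, prompt_len - 1:].sum().item()

scores = {opt.strip(): option_logprob(prompt, opt) for opt in options}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```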

However, sometimes even when the model picks the answer with the highest probability, it can still be wrong. This is where hallucinations come into play. 

$$\text{PPL}(X) = \exp\left\{-\frac{1}{t} \sum_{i=1}^t \log p_\theta\left(x_i \mid x_{<i}\right)\right\}$$

Perplexity, a formula used to measure how confident a model is, helps us monitor how "sure" the model is about its predictions. In simple terms, perplexity tells us how "surprised" the model is by the data it sees. If the model’s perplexity is low, it’s likely giving answers with high confidence. If the perplexity is high, the model is less sure and might struggle to give accurate answers.
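
The following sketch shows how this perplexity could be computed in practice: the model's average negative log-likelihood over a text is exponentiated, mirroring the formula above. The checkpoint and example text are assumptions for illustration.

```python
# Minimal sketch: compute the perplexity of a text under a causal LM,
# i.e. exp of the average negative log-likelihood from the formula above.
# Model choice and text are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Berlin is the capital of Germany."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # passing labels makes the model return the average cross-entropy loss,
    # which equals -1/t * sum_i log p(x_i | x_<i)
    loss = model(**inputs, labels=inputs.input_ids).loss

perplexity = torch.exp(loss).item()
print(f"Perplexity: {perplexity:.2f}")
```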

Encoder-based

Encoder-based hallucination detection helps identify when a model's predictions don’t match the actual intended meaning. Here's how it works: The model predicts a sentence (like "Berlin is the capital of Germany"), and we compare it to the correct sentence (for example, "The capital of Germany is Berlin"). Even though these two sentences mean the same thing, they use different wording.

To detect potential hallucinations, both sentences are passed through something called an encoder, which turns them into numerical data (or vectors). After this, the model applies pooling, which combines the numerical data from each sentence into a single summary, or “pooled representation,” that captures the overall meaning of the sentence.

By comparing these summaries, the model checks how closely its prediction matches the correct answer. If there’s a big difference between the two, it might be a sign that the model has "hallucinated," meaning it gave a prediction that doesn’t truly reflect the correct information, even if it sounds reasonable. This method helps the model look beyond just word-for-word matches and ensures it’s staying true to the intended meaning.
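
To make the encoder-and-pooling step concrete, here is a minimal sketch that encodes the two example sentences and mean-pools their token embeddings into single vectors. The checkpoint and the choice of mean pooling are illustrative assumptions; other pooling strategies (such as using the [CLS] token) work similarly.

```python
# Minimal sketch of the encoder + pooling step: turn two sentences into
# pooled vectors via mean pooling over token embeddings.
# The checkpoint name is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.eval()

def pooled_representation(sentence: str) -> torch.Tensor:
    """Encode a sentence and mean-pool its token embeddings, ignoring padding."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    mask = inputs.attention_mask.unsqueeze(-1)                  # (1, seq_len, 1)
    return (token_embeddings * mask).sum(1) / mask.sum(1)       # (1, dim)

prediction = pooled_representation("Berlin is the capital of Germany")
reference = pooled_representation("The capital of Germany is Berlin")
```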

Cosine similarity

Cosine similarity is a mathematical tool used to compare how similar two sentences are based on their pooled representations (which are the numerical summaries created by the encoder). After both the predicted sentence and the correct sentence are turned into vectors, cosine similarity measures how close these two vectors are.

If the two sentences are very similar in meaning, the cosine similarity score will be close to 1. If the sentences are quite different, the score will be closer to 0. For example, if the score is 0.9, it means the sentences are almost identical in meaning, even if the wording is different. 

However, one disadvantage is that cosine similarity heavily depends on the quality of the underlying embedding model, and a high similarity score does not guarantee a correct answer, as the model may not fully grasp the context or deeper meaning of the sentences.
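
The sketch below ties these pieces together: it embeds a predicted sentence and a reference sentence and compares them with cosine similarity, flagging a possible hallucination when the score drops below a threshold. The library, checkpoint, and threshold value are illustrative assumptions.

```python
# Minimal sketch: compare a model prediction to a reference sentence with
# cosine similarity on pooled embeddings, using the sentence-transformers
# library. Checkpoint and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
prediction = "Berlin is the capital of Germany"
reference = "The capital of Germany is Berlin"

embeddings = model.encode([prediction, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Cosine similarity: {similarity:.2f}")
if similarity < 0.8:  # threshold chosen for illustration only
    print("Possible hallucination: prediction diverges from the reference.")
```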

Another technique: BERTScore

BERTScore is a way to measure how similar two pieces of text are by looking at the meanings of the words in their context. It uses a model called BERT, which creates "contextual embeddings" — meaning that the way a word is represented depends on the words around it. 

For example, in "The bank of the river," the word "bank" means a riverbank, but in "I deposited money at the bank," it refers to a financial institution. BERT understands these differences and gives each "bank" a different meaning in each sentence. BERTScore uses these smart word representations to compare how closely related the two texts are, making it more powerful than using older methods like Word2Vec, which give each word just one fixed meaning.
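A tiny sketch of this effect: encoding both sentences with a BERT model and comparing the two contextual embeddings of "bank" yields a similarity clearly below 1, showing that the representation shifts with context. The checkpoint and the helper function are illustrative assumptions.

```python
# Minimal sketch: show that BERT's contextual embeddings give the word
# "bank" different vectors depending on its sentence.
# The checkpoint is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(word: str, sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
    return hidden[tokens.index(word)]

bank_river = embedding_of("bank", "The bank of the river.")
bank_money = embedding_of("bank", "I deposited money at the bank.")

similarity = torch.cosine_similarity(bank_river, bank_money, dim=0).item()
print(f"Similarity of the two 'bank' embeddings: {similarity:.2f}")  # < 1.0
```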

How BERTScore improves word comparisons with context

BERTScore is more advanced than applying traditional cosine similarity to static word representations like Word2Vec, where each word has a fixed meaning regardless of context. BERTScore, on the other hand, uses contextual embeddings from BERT, where the meaning of each word changes based on the sentence it appears in. Unlike pooling techniques that generate a single representation for the whole sentence, BERTScore compares word embeddings directly. It still uses cosine similarity, but applies it at the word level, capturing the context-sensitive meaning of words within sentences rather than averaging the sentence's overall meaning.
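
In practice, BERTScore can be computed with the bert-score package, as in the sketch below. The example sentences and language setting are assumptions for illustration.

```python
# Minimal sketch: compute BERTScore between a candidate and a reference
# using the bert-score package (pip install bert-score).
# Example sentences and language are illustrative assumptions.
from bert_score import score

candidates = ["Berlin is the capital of Germany."]
references = ["The capital of Germany is Berlin."]

# P, R, F1 are token-level precision, recall and F1, aggregated per sentence
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"Precision: {P.item():.3f}, Recall: {R.item():.3f}, F1: {F1.item():.3f}")
```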

Figure 2: BERTScore calculation pipeline with importance weighting

Natural Language Inference

NLI, or Natural Language Inference, is a task in natural language processing (NLP) that involves determining the logical relationship between two sentences. The goal is to classify the relationship of the second sentence (called the hypothesis) to the first sentence (called the premise) as entailment, contradiction, or neutral.

Here’s what these terms mean:

  • Entailment: The hypothesis logically follows from the premise.
    Premise: "All toys are red."
    Hypothesis: "Airplane toys are red."
    This is entailment because the hypothesis is true based on the premise.

  • Contradiction: The hypothesis contradicts the premise.
    Premise: "All toys are red."
    Hypothesis: "No toys are red."
    This is a contradiction because the hypothesis directly opposes the premise.

  • Neutral: The hypothesis is neither entailed nor contradicted by the premise.
    Premise: "All toys are red."
    Hypothesis: "Some toys are made of plastic."
    This is neutral because the hypothesis does not directly relate to the color of the toys in the premise.

NLI is important in tasks like understanding and reasoning in natural language, and it's commonly used in models like BERT and other advanced NLP systems.
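
As a sketch of how NLI can be applied to hallucination checks, the example below feeds a premise (for instance, the retrieved context) and a hypothesis (the model's answer) to an off-the-shelf NLI classifier and prints the probabilities for each relationship. The checkpoint name is an assumption; any NLI-fine-tuned model can be substituted.

```python
# Minimal sketch: classify the relationship between a premise and a
# hypothesis with an NLI model. Checkpoint choice is an assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-large-mnli"  # assumed NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "All toys are red."
hypothesis = "No toys are red."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# map each probability to its label (entailment / neutral / contradiction)
labels = [model.config.id2label[i] for i in range(probs.shape[0])]
print(dict(zip(labels, probs.tolist())))
```

A high contradiction probability between the retrieved context and the generated answer is a strong signal that the answer is hallucinated.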


LLM as a judge

LLMs can evaluate their own responses by checking them against the context or data they retrieve. This process helps ensure that the model’s answers are accurate and based on the provided information. By either fine-tuning the model or prompting it to verify its output, we can evaluate response accuracy.

Example strategies:

  • Faithfulness checks help ensure that the model's output accurately reflects the original information without adding unsupported details. To implement this, one approach is to prompt the LLM to evaluate whether the generated answer is factually aligned with the retrieval context (a minimal prompt sketch follows this list). Additionally, the model can use similarity metrics like cosine similarity or BERTScore to compare its response to the retrieved context, flagging any significant deviations. Fact-matching is another strategy where the model “cites” specific parts of the context to back up its claims. Named entity checks also ensure that people, places, or organizations mentioned in the response match those in the context.

  • Multi-pass refinement is another approach where the model refines its response in stages instead of generating the final answer all at once. It starts with an initial draft based on the available data, then retrieves more context to fill in any gaps or correct assumptions. After gathering additional information, the model revises its response, adding missing details and improving clarity. Finally, a verification pass ensures the refined response aligns with the retrieved context, and in more complex tasks, this process can be repeated for greater accuracy.
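
The following sketch shows one way such a faithfulness check could be prompted, with an LLM acting as judge over an answer and its retrieved context. The client library, model name, prompt wording, and rating scale are illustrative assumptions, not a fixed recipe.

```python
# Minimal sketch of an "LLM as a judge" faithfulness check: prompt a model
# to verify whether an answer is supported by the retrieved context.
# Model name, prompt wording and the 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

context = "Berlin has been the capital of Germany since reunification in 1990."
answer = "The capital of Germany is Munich."

judge_prompt = f"""You are a strict fact-checking judge.
Context:
{context}

Answer to evaluate:
{answer}

Does the answer make only claims that are supported by the context?
Reply with a rating from 1 (completely unfaithful) to 5 (fully faithful),
followed by a one-sentence justification."""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```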


Conclusion


In conclusion, as large language models (LLMs) become more advanced, managing hallucinations—where models provide incorrect or misleading information—is important. Techniques like probability-based methods, perplexity, and encoder-based detection tools offer valuable approaches to identify these issues. Advanced methods such as BERTScore, which accounts for word context, provide a more nuanced understanding and improve detection of discrepancies. Natural Language Inference (NLI) further enhances models' ability to understand logical relationships between sentences. Additionally, employing LLMs as self-evaluators or "judges" adds an extra layer of verification, ensuring that model responses are factually aligned with the provided context. By combining these techniques, we can improve the accuracy and reliability of LLMs.

Are you currently working on your own LLM-based applications and could use support in the areas of LLM evaluation and LLM governance? Then use our contact form to get in touch with our ML-Scientists for an initial conversation.