Extend the knowledge of your Large Language Model with RAG

Thanh Long Phan, Fabian Dechent

Large Language Models (LLMs) have rapidly gained popularity in Natural Language tasks due to their remarkable human-like ability to understand and generate text. Amidst great advances, there are still challenges to be solved on the way to building perfectly reliable assistants.

LLMs are known to make up answers, often producing text that adheres to the expected style, but lacks accuracy or factual grounding.

Generated words and phrases are chosen because they are likely to follow the previous text, where the likelihood is adjusted to fit the training corpus as closely as possible. As a result, a piece of information may be outdated if the corpus is not updated and the model retrained. It may also simply be factually incorrect, even though the generated text sounds correct and matches the required genre. The core problem here is that the LLM does not know what it does not know.

In addition, even if a piece of information is correct, it is hard to track its source in order to enable fact-checking.

In this article, we introduce RAG (Retrieval-Augmented Generation) as a method that addresses both problems and thus aims to enhance the reliability and accuracy of information generated by LLMs.

What is RAG?

In 2020, Meta introduced a framework called retrieval-augmented generation (RAG), which gives LLMs access to information beyond their training data.

As the name suggests, RAG has two phases, a retrieval phase and a generative phase.

Retrieval Phase - In this phase, the algorithm searches for snippets of information relevant to the user's input prompt. It is common practice to break down documents into smaller chunks. An embedding model - typically a BERT model - then calculates an embedding vector for each chunk, and this data is stored in a vector database. The relevant information chunks can then readily be retrieved using semantic search.
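The retrieval phase can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production setup: the bag-of-words `embed` function is a toy stand-in for a real embedding model (it matches words, not meaning), and the `VectorStore` class is a hypothetical in-memory replacement for a vector database. All names are invented for this example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split a document into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Hypothetical in-memory replacement for a vector database."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str, Counter]] = []  # (source, chunk, vector)

    def add_document(self, source: str, text: str, max_words: int = 40) -> None:
        # Chunk the document and store one vector per chunk.
        for chunk in chunk_text(text, max_words):
            self.entries.append((source, chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[tuple[float, str, str]]:
        # Score every stored chunk against the query and return the k best.
        q = embed(query)
        scored = [(cosine(q, vec), source, chunk) for source, chunk, vec in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


store = VectorStore()
store.add_document("rag.txt", "RAG retrieves relevant chunks from a vector database before generation.")
store.add_document("cats.txt", "Cats sleep for most of the day and hunt at night.")
top = store.search("How does RAG use a vector database?", k=1)
print(top[0][1])  # source of the best-matching chunk
```

In a real system, `embed` would call a trained embedding model and the linear scan in `search` would be replaced by the index of a vector database.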

Generative Phase - Given the retrieved information in addition to the user query, an LLM is now used to generate an appropriate answer. Through prompt engineering, as well as fine-tuning, it is feasible to limit the facts within the LLM’s answer to information sourced from the database. As a bonus, the source document can be stated alongside the response.

In instances where the question cannot be answered solely with the available information, the LLM can acknowledge this limitation and communicate that it cannot generate an answer based on the provided information.
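One way to apply the prompt-engineering idea from the generative phase is to assemble the retrieved chunks into a prompt that asks the model to stay within the context, cite its sources, and decline when the context is insufficient. The following is a sketch only: the function name `build_prompt` and the exact wording of the instructions are assumptions made for illustration, and the resulting string would still have to be sent to an LLM of your choice.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (source, text) pairs."""
    # Number each chunk so the model can refer to it in its answer.
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the sources you used by their number. "
        "If the context does not contain the answer, reply exactly: "
        '"I cannot answer this based on the provided information."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What is RAG?",
    [("rag.txt", "RAG combines retrieval and generation.")],
)
print(prompt)
```

Whether a model actually respects such instructions depends on the model, so the output should still be checked against the cited sources.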


Today’s chatbots, which are built with LLMs, can give users personalized answers. RAG takes them one step further by increasing factual correctness through up-to-date sources: to bring in new knowledge, simply add the new documents to the database.

Another type of application that relies on retrieving information for context augmentation is the LLM-powered agent. Agents are designed to solve multi-step problems using memory, complex reasoning capabilities, and access to the web and various other tools.

Needless to say, an agent can chat.

Limitations of RAG

Relevant documents are not always retrieved - To make RAG work, the retrieval method needs to be good enough. This depends on producing a good embedding of your documents, and on search phrases that are sufficiently similar to the wording used in the documents. If the relevant information is not retrieved, the setup does not work.

It increases the response time - Depending on the type of database and the size of the LLM, the process can be slow. Additional processing steps add further latency: before passing the retrieved chunks to the LLM, you may apply an LLM to each chunk to summarize it, or rerank the retrieved chunks. Further transformations can also be applied to the input query.

Answer accuracy - The size of the context taken into account for a given answer is limited, so special care has to be taken to construct it as small as possible and at the same time as large as necessary. Sometimes the context window is simply too small to hold all pertinent information, but even below this threshold the balance is delicate. Here the chunk size for individual pieces of information is crucial. Reducing it too much and being too fine-grained risks losing the coherence of the text, leading to fragmented comprehension. In such cases, the model understands individual chunks but struggles to grasp the document as a whole. If the chunks are too large, not all retrieved pieces might fit.
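The limited context can be made concrete with a small sketch that greedily keeps the best-scoring chunks until a token budget is used up. The function name `select_chunks` is hypothetical, and counting one token per whitespace-separated word is a crude assumption; a real setup would use the tokenizer of the chosen LLM.

```python
def select_chunks(scored_chunks: list[tuple[float, str]], max_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit into the token budget."""
    selected: list[str] = []
    used = 0
    for _score, chunk in sorted(scored_chunks, key=lambda item: item[0], reverse=True):
        cost = len(chunk.split())  # crude estimate: one token per word (assumption)
        if used + cost <= max_tokens:
            selected.append(chunk)
            used += cost
    return selected


retrieved = [(0.9, "a b c"), (0.5, "d e f g"), (0.7, "h i")]
print(select_chunks(retrieved, max_tokens=5))
```

With a budget of five "tokens", the three-word and two-word chunks fit, while the four-word chunk is dropped even though it was retrieved. This is the situation described above, where large chunks crowd each other out.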

Furthermore, incorrect chunks might be retrieved. Especially for queries that include several thematically separate components, less elaborate retrieval methods might fail to identify all relevant chunks.

Another challenge is the model's ability to comprehend the retrieved context; if it proves insufficient, a larger, more advanced model may be necessary for improved performance.

How to improve RAG

Improving the Retrieval - As mentioned above, good retrieval is the most important part of making RAG work.

To make sure that the retrieval returns relevant information for as many user queries as possible, it might be helpful to fine-tune your embedding model on your own data. Additionally, crucial parameters like chunk size and number of returned chunks need to be optimized. Another method is to apply metadata filtering to prioritize, for example, the most recent documents.
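Metadata filtering could look like the following sketch, which assumes that each retrieved hit carries a `published` date in its metadata. The dictionary layout, the field names, and the function `filter_recent` are assumptions made for this example and not the API of any particular vector database.

```python
from datetime import date


def filter_recent(results: list[dict], not_before: date) -> list[dict]:
    """Keep hits published on or after `not_before`, newest first."""
    kept = [hit for hit in results if hit["published"] >= not_before]
    return sorted(kept, key=lambda hit: hit["published"], reverse=True)


results = [
    {"source": "report_2021.pdf", "score": 0.82, "published": date(2021, 3, 1)},
    {"source": "report_2024.pdf", "score": 0.79, "published": date(2024, 6, 15)},
    {"source": "report_2023.pdf", "score": 0.75, "published": date(2023, 2, 10)},
]
recent = filter_recent(results, not_before=date(2023, 1, 1))
print([hit["source"] for hit in recent])
```

Many vector databases can apply such a filter inside the query itself, which avoids retrieving chunks that are discarded afterwards.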

The query can also be preprocessed by decomposing it into thematically separate subqueries or by creating several paraphrases, which are then fed into the semantic search algorithm.
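A sketch of how the results for several subqueries or paraphrases might be merged is shown below. It assumes that the subqueries have already been produced, for instance by an LLM, and that a `search` callable returns `(score, chunk)` pairs. Both `multi_query_search` and the word-overlap retriever `toy_search` are hypothetical helpers written only for this illustration.

```python
from typing import Callable


def multi_query_search(
    subqueries: list[str],
    search: Callable[[str], list[tuple[float, str]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Run every subquery and merge the hits, keeping each chunk's best score."""
    best: dict[str, float] = {}
    for query in subqueries:
        for score, chunk in search(query):
            best[chunk] = max(score, best.get(chunk, 0.0))
    merged = sorted(((score, chunk) for chunk, score in best.items()), reverse=True)
    return merged[:k]


def toy_search(query: str) -> list[tuple[float, str]]:
    """Hypothetical retriever: share of query words that occur in each chunk."""
    corpus = ["berlin is the capital of germany", "the rhine flows through cologne"]
    words = query.lower().split()
    return [
        (sum(w in chunk.split() for w in words) / (len(words) or 1), chunk)
        for chunk in corpus
    ]


hits = multi_query_search(["capital of germany", "rhine cologne"], toy_search, k=2)
print(hits)
```

Here each subquery matches a different chunk perfectly, so both chunks are returned, whereas a single combined query would match each of them only partially.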

Cleaning your Data - As in other machine learning and data science tasks, the quality of the database is key. Even when the information comes from trusted sources, it may contain noisy text, such as HTML tags; in such cases, preprocessing to remove the noise might be necessary.
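As an illustration of such preprocessing, the sketch below strips HTML tags with Python's standard-library `html.parser` module and collapses the remaining whitespace. The class `_TextExtractor` and the function `clean_html` are names chosen for this example; real-world pages may need a more robust parser and additional rules.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text and ignore the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0  # greater than zero while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(raw: str) -> str:
    """Return the text content of an HTML snippet with normalized whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


page = "<div><h1>Title</h1>\n<p>Some   text</p><script>var x = 1;</script></div>"
print(clean_html(page))
```

Cleaning should happen before chunking and embedding, so that markup does not take up chunk space or distort the vectors.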

If the database consists of a large volume of documents spanning various subjects (e.g. natural science, history, literature, ...), it might be useful to create several themed databases, from which one can selectively retrieve documents.
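Selecting a themed database could be sketched as a simple routing step in front of the retrieval. The keyword sets and the function `route_query` below are assumptions for illustration only; in practice the routing decision might instead be made by a classifier or by an LLM.

```python
def route_query(query: str, topic_keywords: dict[str, set[str]]) -> str:
    """Pick the themed database whose keyword set overlaps most with the query."""
    words = set(query.lower().split())
    # On a tie, the topic listed first in the dictionary wins.
    return max(topic_keywords, key=lambda topic: len(words & topic_keywords[topic]))


topics = {
    "natural_science": {"atom", "cell", "energy", "planet"},
    "history": {"war", "empire", "revolution", "century"},
    "literature": {"novel", "poem", "author", "poetry"},
}
chosen = route_query("Which empire fell in the fifth century", topics)
print(chosen)
```

The semantic search is then run only against the chosen database, which keeps unrelated documents out of the candidate set.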

Taking the preprocessing a step further, one might generate and store document summaries beforehand. The retrieval step is then first run over these summaries, before delving into the details of the best-matching documents.
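The summary-first approach can be sketched as a two-stage retrieval: documents are ranked by their stored summary, and only the chunks of the best documents are searched in detail. The word-overlap score `overlap` is a toy stand-in for embedding similarity, and the document layout with `summary` and `chunks` keys is an assumption made for this example.

```python
def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def two_stage_retrieve(
    query: str, documents: list[dict], top_docs: int = 1, top_chunks: int = 1
) -> list[str]:
    """Rank documents by their summary, then rank chunks inside the winners."""
    ranked_docs = sorted(documents, key=lambda d: overlap(query, d["summary"]), reverse=True)
    candidates = [chunk for doc in ranked_docs[:top_docs] for chunk in doc["chunks"]]
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:top_chunks]


documents = [
    {
        "summary": "overview of vector databases and semantic search",
        "chunks": [
            "vector databases store embeddings",
            "semantic search compares embeddings by similarity",
        ],
    },
    {
        "summary": "history of the roman empire",
        "chunks": ["rome was founded in 753 bc", "the empire split in 395 ad"],
    },
]
result = two_stage_retrieve("how does semantic search work", documents)
print(result)
```

Because the second stage only looks at chunks from the selected documents, a poor summary can hide relevant details, so the quality of the summaries matters.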

Further Reading