Retrieval-Augmented Generation (RAG) for LLMs
Introduction/overview
Key ideas
- Retrieval-augmented generation (RAG) systems tackle factual inaccuracies in LLM responses and reduce "hallucinations."
- RAG systems use a vector database that stores text as embeddings, enabling fast and precise data retrieval.
- The attention mechanism within the generation component of RAG allows the LLM to integrate information from both the query and the documents to produce an output response.
- With resources like web scraping tools and live-document parsing, RAG delivers more personalized and accurate responses.
Large Language Models (LLMs) are impressively capable but have limitations. These include "hallucinations," or generating outputs that are not true. Hallucinations can come from limited or unrepresentative training data, sparse data conditions, overfitting, ambiguous or misleading prompts, inherent model bias, or the LLM misinterpreting prompts.
Still, the most common cause of hallucinations is the patterns the model learned during training. Retrieval-Augmented Generation (RAG) offers a way to give the LLM direct access to highly relevant data after training. As we briefly touched on in Lesson 5 of Course 1, RAG addresses these LLM limitations by integrating relevant data on demand.
RAG models improve the factual accuracy and reliability of generated text by dynamically accessing relevant, up-to-date information. They blend the best of information retrieval and text generation into a more robust solution. This integration increases the trustworthiness of LLM content and significantly broadens its potential applications, making LLMs more adaptable tools in various contexts.
This lesson will teach you about RAGs, their retrieval and generation phases, and their use cases.
How retrieval-augmented generation (RAG) systems work
At first glance, a RAG system might seem like a typical database: a vast pool of documents and information held for reference. However, it relies on a heavily optimized store called a vector database. The vector database encodes documents into high-dimensional vector representations (called embeddings) and can then quickly identify the documents most relevant to a user's LLM prompt.
RAGs have two mechanisms—retrieval and generation—that help select relevant documents and improve the LLM’s original response quality. As a side note, RAGs can be a significant way to implement continual learning for LLMs. After ingesting the curated data into the vector database, the LLM can use it to augment its responses.
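One way to picture the two mechanisms is as a simple pipeline: retrieve the documents most relevant to the prompt, then generate a response from the prompt plus that context. The sketch below is a hypothetical outline rather than any particular library's API; `embed`, `search`, and `generate` are placeholder callables standing in for whatever encoder, vector database, and LLM you use.

```python
from typing import Callable, List, Sequence

def rag_answer(
    prompt: str,
    embed: Callable[[str], Sequence[float]],              # text -> vector (placeholder)
    search: Callable[[Sequence[float], int], List[str]],  # vector, k -> documents (placeholder)
    generate: Callable[[str], str],                       # augmented prompt -> answer (placeholder)
    top_k: int = 3,
) -> str:
    """Hypothetical two-step RAG pipeline: retrieve, then generate."""
    # Retrieval: encode the user prompt and fetch the k most similar documents.
    docs = search(embed(prompt), top_k)

    # Generation: hand the LLM the original prompt plus the retrieved context.
    context = "\n\n".join(docs)
    return generate(f"Context:\n{context}\n\nQuestion: {prompt}")
```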
Constructing the vector database
One of the challenges in creating any database is surfacing meaningful connections between a query and a large corpus of data, especially given the massive amounts of data LLMs draw on. RAG addresses this with the dense retrieval approach.
Dense retrieval converts information into high-dimensional vectors for storage in the vector database. These vectors represent words or passages of text in a way that captures much richer context, so relevant information can be found much more quickly.
Training the retrieval system teaches it how to "encode" data into this high-dimensional space, so that any document or prompt sent to the LLM can be encoded the same way. Learning this encoding is essential for our next step, the retrieval phase.
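As a concrete sketch of this encoding step, the snippet below embeds a handful of documents with the sentence-transformers library (our choice for illustration; the lesson does not prescribe a specific encoder) and keeps the vectors alongside the original text so they can be searched later. The documents and model name are made up for the example.

```python
# Minimal sketch: encode documents into dense vectors ("embeddings").
# Assumes `pip install sentence-transformers`; any text encoder that
# produces fixed-length vectors would work the same way.
from sentence_transformers import SentenceTransformer

documents = [
    "RAG grounds LLM answers in retrieved documents.",
    "A vector database stores text as dense embeddings.",
    "Cosine similarity measures how closely two vectors align.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder

# One embedding (row) per document; normalizing the vectors lets cosine
# similarity later reduce to a simple dot product.
doc_vectors = encoder.encode(documents, normalize_embeddings=True)
print(doc_vectors.shape)  # e.g. (3, 384) for this model
```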
The retrieval phase
Once you establish the vector database and train the retrieval system, you can incorporate RAG into your LLM. When a user sends a prompt to the LLM, the retrieval system encodes the prompt with the same transformations applied to the documents in the vector database. In other words, the user's prompt is now in the same "language" as the documents (vector representation).
Since the user prompt and documents share the same vector transformations, our retrieval system can mathematically compare them for similarity (using metrics like cosine similarity). RAG can identify and select the vectors that align most closely with the specific prompt to provide additional context to the LLM.
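Continuing the sketch from the previous section (the `encoder`, `documents`, and `doc_vectors` names carry over and are purely illustrative), the retrieval step encodes the prompt with the same model and ranks the stored documents by cosine similarity:

```python
import numpy as np

def top_k_documents(query: str, k: int = 2):
    # Encode the prompt with the same model used for the documents,
    # so both live in the same vector space.
    query_vector = encoder.encode([query], normalize_embeddings=True)[0]

    # Because the vectors are unit length, cosine similarity is a dot product.
    scores = doc_vectors @ query_vector

    # Return the k documents with the highest similarity scores.
    best = np.argsort(scores)[::-1][:k]
    return [(documents[i], float(scores[i])) for i in best]

print(top_k_documents("How does RAG reduce hallucinations?"))
```

The highest-scoring documents are then handed to the next phase as additional context for the LLM.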
The generation phase
Once the retrieval system identifies the most similar documents, the LLM can begin generating a response to the user through a "generator." The generator uses a machine learning (ML) principle known as the "attention mechanism" (refer back to Lesson 2 of Course 1) to consider the original query's context together with the retrieved information.
When the generator creates the response, it computes attention scores at each step for all words in the input, including both the query and the retrieved documents, as it decodes the output word by word.
These attention scores represent the relevance of each input word for generating each output word, allowing the generator to “focus” on different parts of the input as needed. If the next word of the response relates more to something within the retrieved documents, the attention mechanism will consider those components more heavily. If it's more relevant to the original query, the attention scores will be higher for the words in the query.
The generator crafts a coherent response by adjusting the focus between the query context and the external information. The resulting responses are information-rich and contextually relevant, effectively integrating insights from external data sources into the LLM's output.
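To make the "focus" idea concrete, the toy NumPy snippet below computes standard scaled dot-product attention weights for a single decoding step over a mix of query and retrieved-document positions. It only illustrates how the scores form a distribution over the input; it is not the internals of any particular LLM, and the vectors here are random stand-ins.

```python
import numpy as np

def attention_weights(step_query: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights for one decoding step."""
    d = keys.shape[-1]
    scores = keys @ step_query / np.sqrt(d)  # one score per input position
    exp = np.exp(scores - scores.max())      # softmax (numerically stable)
    return exp / exp.sum()

# Toy input: 2 positions from the user's query followed by 3 positions
# from retrieved documents, each a random 8-dimensional vector.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 8))
decoder_state = rng.normal(size=8)  # stands in for the current output step

weights = attention_weights(decoder_state, inputs)
print(weights.round(3), weights.sum())  # a distribution over the 5 input positions
```

Positions that score higher receive more weight, which is how the generator shifts focus between the query and the retrieved documents as it produces each output word.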
Benefits of retrieval-augmented generation (RAG)
Unlike traditional models that demand extensive memory for numerous fine-tuned parameters, RAG systems retain only a generator and shared document-encoding parameters. This strategic approach drastically reduces storage requirements and ensures smoother deployment across various devices and platforms, even those with limited memory resources.
RAG also addresses the hallucination problem common in LLMs. Although RAG does not entirely solve hallucinations, grounding responses on retrieved documents significantly mitigates such errors, increasing output accuracy.
Key benefits include:
- Improved factual accuracy: RAG's ability to retrieve and use real-world knowledge greatly improves the factual correctness and grounding of generated responses, providing a solid foundation for high-quality output.
- Domain specialization: RAG outshines traditional LLMs in dealing with complex tasks requiring specific domain expertise or niche domains. It taps into a targeted, detailed knowledge base to deliver superior performance on tasks that demand deep, specific knowledge.
- Continuous learning: Because they are dynamic, RAG systems can adapt and learn from newly retrieved information, staying responsive to evolving data and keeping responses relevant over time.
- Real-time document analysis: RAG can capitalize on data pipelines with live-document parsing, allowing it to retrieve and interpret real-time information from various content formats such as PDFs and eBooks. This can diversify the RAG's knowledge base for more personalized and accurate responses to individual user needs.
- Offline content analysis: Web scraping tools can parse links and pull available digital information into the RAG database ahead of time, which is especially useful when the LLM runs offline without internet connectivity.
Use cases of retrieval-augmented generation (RAG)
You can apply RAGs in many scenarios, from powering customer service chatbots with sharp, context-aware responses to assisting in academic research by pinpointing crucial documents for thorough literature reviews. We have already learned about some of these applications in earlier lessons.
- Question-Answering (QA) Systems: Imagine going through a massive documentation page for a project. Instead of sifting through pages of text, ingesting the documentation into a RAG system and querying it will give you direct references and answers.
- Customer Service Chatbots and Virtual Assistants: RAG can equip digital tools like chatbots and virtual assistants to generate more helpful, factually correct, and relevant responses. It can also enable the system to pull in personal information, such as individual files and scheduled events, and even incorporate customer account information to provide a more informed and satisfactory response.
- Market Research: RAG can help analyze customer feedback or survey responses by retrieving similar scenarios or benchmarks from existing databases. This can provide a fuller picture for researchers and result in more effective feedback loops for product improvement.
- Healthcare Assistance: RAG can be used within virtual healthcare platforms to answer patient queries. It can pull information from large datasets, providing more accurate information on symptoms, treatments, etc.
- Legal Assistance: RAG can support tasks such as analyzing legal documents or answering detailed legal queries by pulling in specific laws or cases from its database for reference.
- Technical Support: RAG could assist technical support teams by solving common or complex problems. It could search a database of issues, pull relevant solution documents, and generate a comprehensive response for resolution.
These use cases are only a tiny subset of where RAG can be applied. Fundamentally, any scenario that requires referencing information can benefit from implementing RAGs and LLMs.
In the next course, you will learn more about the LLM deployment landscape, how to deploy LLMs, and how to monitor them in production. Stay tuned!