WhyLabs AI Control Center (also known as the WhyLabs Platform) is now an open source project!

Rich Young

Jul 17, 2024

Back to Blog

How to Evaluate and Improve RAG Applications for Safe Production Deployment

AI Observability
LLMs
LLM Security
LangKit
RAG
Open Source

Rich Young

Jul 17, 2024

TL;DR

Retrieval-augmented generation (RAG) combines information retrieval systems with large language models (LLMs) to make AI-generated text more accurate and reliable by accessing the latest, relevant data.
To develop RAG systems, you must carefully consider the retrieval quality, LLM behavior, data quality, security, and how the LLMs and retrieval mechanisms integrate.
Many RAGs today are evaluated using LLMs; even the datasets used to evaluate the RAG are created by LLMs! Using tools such as LangKit allows you to assess RAG components.
LangKit also natively integrates with WhyLabs AI Control Center to provide a platform with dashboards that allow you to visualize and monitor your LLM applications for accuracy issues, malicious prompts, and more.

Today, we admire large language models (LLMs) like GPT-4 because they are so good at writing text that looks like it could have been written by a human. However, these models often struggle with accessing real-time, real-world knowledge, which can lead to potential factual inaccuracies commonly referred to as "hallucinations".

To overcome these problems, most developers have built retrieval-augmented generation (RAG) systems. These systems integrate the generative power of LLMs with robust information retrieval systems (e.g., dense or sparse retrieval).

This allows the models to pull relevant data from external sources, improving accuracy, contextual understanding, and output reliability.

Developing RAG systems is a complex process for developers. They must navigate challenges such as selecting appropriate data sources, optimizing retrieval algorithms, and ensuring seamless communication between the LLM and retrieval components.

But the development phase is only half the battle. Rigorous evaluation is equally critical before transitioning a RAG system to production. You must assess the system's performance, accuracy, and robustness under various scenarios to ensure it meets the desired standards.

By the end of this article, you will have learned how to evaluate RAG systems thoroughly to identify and address potential issues, making the systems more reliable and effective for real-world applications.

How do RAGs work?

RAGs enhance the capabilities of LLMs by integrating them with external data typically stored in vector databases. These databases store large datasets as high-dimensional vectors to efficiently retrieve contextually relevant data by mathematically calculating prompt similarity.

*An overview of how Retrieval-Augmented Generation (RAG) systems work. | Source:* *WhyLabs’s Free LLM Fundamentals Course (Lesson 5)*.

As a brief explanation, here's what happens when you send a prompt to a RAG-enabled LLM:

Query formulation: The LLM receives a natural language prompt converted into an encoded vector.
Information retrieval: The vector is compared to others in the vector database to find the most relevant data to retrieve based on the query's intent and context.
Data integration and response generation: The retrieved information is integrated into the LLM's process of predicting the next word in its response, combining its trained knowledge with the knowledge extracted from the vector database for the final output.

💡

New to RAGs? Check out our full-length lesson, “Retrieval-Augmented Generation (RAG) for LLMs,” in the LLM Fundamentals Course to get started.

Challenges to address before deploying RAG systems

Deploying RAG systems to production is challenging for many reasons, mostly because combining generative LLMs with complex retrieval mechanisms is complex.

Before even considering deployment, you must address a handful of issues while developing RAG systems:

1. Non-deterministic LLM behavior

LLMs can produce unpredictable outputs. Fine-tuning with diverse datasets and implementing uncertainty-handling mechanisms can promote consistency.

Non-deterministic behavior refers to LLMs' tendency to generate different responses to the same prompt due to the inherent randomness in their training and inference processes.

2. Quality of retrieval

A RAG system's effectiveness heavily depends on its ability to fetch relevant and diverse documents in response to user queries. Inaccurate retrieval can result in irrelevant or misleading responses, severely impacting user trust and system utility in prod.

3. Data quality issues

Most RAG systems that work well in production depend on the quality of their data sources. Outdated or erroneous data can lead to misleading answers, compromising the system's utility.

Ensuring high-quality data involves rigorous cleaning, validation processes, and continuous updates and monitoring of the knowledge base to keep the information current and relevant.

📚 Recommended Read: WhyLabs Weekly MLOps: Data Quality Validation.

4. Integration nuances

Technical challenges in integrating retrieval and generative components can affect output quality. Retrieval latency, document selection accuracy, and efficient information passage between system components must be meticulously managed.

5. Security, safety, and compliance

Robust security, safety, and compliance measures like SOC 2 are essential to ensure the responsible development of RAG systems.

This involves protecting data privacy through anonymization techniques, adhering to data privacy regulations like GDPR or CCPA, and implementing content filtering and bias detection algorithms to prevent misuse.

*WhyLabs Dashboard - WhyLabs Secure ensures the ultimate visibility and control over any guardrail decision, giving AI developers and stakeholders peace of mind. | Source:* *WhyLabs Secure: Protect Your GenAI with Advanced Features | WhyLabs*

Human oversight mechanisms and adversarial training are crucial for reviewing outputs and improving the system's resilience against manipulation attempts.

The Open Web Application Security Project (OWASP) Top 10 list for LLMs provides valuable guidance on potential vulnerabilities, such as prompt injection, insecure output handling, and training data poisoning.

📹 See WhyLabs’s CEO, Alessya Visnjic's workshop on Intro to LLM Security - OWASP Top 10 for Large Language Models (LLMs) to learn more.

You’ve seen some challenges you might encounter before deploying the RAG system. However, deploying it in production could also involve the following challenges:

Scalability and robustness: The system must handle varying demands and maintain functionality under high loads, requiring a scalable architecture that doesn't sacrifice speed or accuracy.
Predicting user interactions: Anticipating real-world user interactions is difficult. Continuous monitoring and adaptive modifications based on user data are crucial.

📚 Recommended Read: How to Distinguish User Behavior and Data Drift in LLMs.

Evaluating the performance of RAG systems with LangKit

Much testing and experimenting are done to assess RAG performance, including model versioning, prompt versioning, frameworks, vector databases, retrieval techniques, generation techniques, etc.

LangKit is an open-source, out-of-the-box telemetry tool for LLM prompts and responses that tracks quality, relevance, sentiment, and security metrics. It can collect metrics and allow users to compare them against each strategy to make data-driven decisions about improving their assessment pipeline.

LangKit lets you:

Evaluate whether the LLM behavior is compliant with policy
Validate and safeguard individual prompts and responses
Compare and A/B test across different LLM and prompt versions
Monitor user interactions inside LLM-powered applications

*How LangKit works with agents and LLM solutions, provides out-of-the-box telemetry metrics, and integrates with the WhyLabs platform for observability.*

Logging metrics and comparing the outputs of responses from different models and assessment strategies will also allow for quantitative and unbiased assessments for faster deployments with higher confidence.

In this section, you will go through a sample assessment pipeline using an LLM to assess a RAG. Throughout this process, you’ll see how LangKit provides assessment metrics for your RAG system.

Step 1: Generate a synthetic evaluation dataset

A synthetic evaluation dataset is a collection of artificially generated questions and corresponding answers (Q&A) used to assess the performance of a RAG system.

A robust synthetic evaluation dataset is critical for assessing your RAG's performance. It allows you to test the system's ability to handle various questions and scenarios.

However, manually crafting such a dataset can be time-consuming and limit the variety of questions. Here's where LangKit shines.

Although this can be done with any LLM with the correct prompt, LangKit can be leveraged to streamline this process and explore the effectiveness of different LLMs for generating Q&A pairs.

LangKit will allow you to quickly assess and evaluate different strategies during the dataset creation process, such as:

Quickly assess various LLMs: Connect LangKit to different LLM models, such as OpenAI’s GPT-4o, or open-source models, such as Mixtral. If the RAG database contains domain-specific data, other fine-tuned LLMs focused on those domains can also be assessed in this step.
Assess various prompts: Experiment with different prompts to guide the LLMs in generating relevant and informative questions. LangKit stores and manages these variations, allowing you to test which prompts yield the most effective question-answer pairs for your RAG system evaluation.

For instance, the code snippet below shows how to log the metrics of prompts and responses from two OpenAI LLMs (one is a DaVinci-based model, and the other is the default, GPT-3.5-turbo model) using LangKit:

for row in local_dataset:
    # Create a new conversation for each question to evaluate single turn performance
    llm_a = Conversation(invocation_params=OpenAIDavinci())
    llm_b = Conversation(invocation_params=OpenAIDefault())

    # extract the prompt from this row
    prompt = row["prompt"]

    # send this prompt to the two LLMs (we could do this one at a time)
    response_a = llm_a.send_prompt(prompt)
    response_b = llm_b.send_prompt(prompt)

    # use the agent to log a dictionary; these should be flat
    # to get the best results, in this case, we log the prompt and response text
    telemetry_agent_a.log(response_a.to_dict())
    telemetry_agent_b.log(response_b.to_dict())
    row["llm_a_answer"] = response_a.to_dict()['response']
    row["llm_b_answer"] = response_b.to_dict()['response']
    print(f"Processed row: {row}") # example cell output below is an excerpt
print("done sending all the test prompts through the LLMs!")

Check out the complete notebook to learn more about choosing the right LLM by evaluating their performance on generated test data.

Step 2: Critique agent evaluation

After generating the set of Q&A samples, you will require additional filtering to remove unsuitable and inappropriate questions from the dataset. To accomplish this task, use critique agents to assess the quality of the LLM-generated questions based on three critical criteria:

Groundedness: Does the question have a clear basis in the provided context (e.g., surrounding text)?
Relevance: Is the question helpful or informative to the user?
Stand-alone: Can someone with domain knowledge and internet access understand the question independently without needing the specific context of the article from which it was generated?

Critique agents assign scores to each question based on these criteria, typically on a scale of 1-5. For example, a question highly grounded in the context, relevant to the user, and can stand alone would receive a score of 5. In contrast, a question that lacks context, is irrelevant, and requires specific article knowledge might receive a score of 1.

You can filter out low-scoring questions to ensure the evaluation dataset is populated with high-quality questions that effectively test the RAG system's abilities.

You can use LangKit for critique evaluation with an evaluation check like `response_hallucination` to assess the consistency between the target response and a group of additional response samples:

from langkit import response_hallucination
from langkit.openai import OpenAIDefault

response_hallucination.init(llm=OpenAIDefault(), num_samples=1)

result = response_hallucination.consistency_check(
    prompt="who was Marie Curie?",
    response="Marie Curie was an English barrister and politician who served as Member of Parliament for Thetford from 1859 to 1868.",
)

print(result)

Here’s the output for running that code snippet that includes the semantic similarity score between the prompt, the generated target response, and other generated responses:

{'llm_score': 1.0,
 'semantic_score': 0.44071340560913086,
 'final_score': 0.7203567028045654,
 'total_tokens': 309,
 'samples': ['Marie Curie was a Polish-born physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize and remains the only person to have won Nobel Prizes in two different scientific fields (physics in 1903 and chemistry in 1911). Her discoveries laid the foundation for the development of X-ray machines and cancer treatments, and she is considered one of the most important scientists in history.'],
 'response': 'Marie Curie was an English barrister and politician who served as Member of Parliament for Thetford from 1859 to 1868.'}

For more information on this approach, see the documentation on the `hallucination` namespace.

LangKit can help detect private personal information (PII), which may be useful if your RAG system contains a lot of PII:

from langkit import extract, pii

data = {"prompt": prompts[0],
        "response": prompts[-1]}
result = extract(data)

Our output is a JSON file that displays our original prompt, then the sensitive words we found and their scores:

{'prompt': 'Hello, my name is David Johnson and I live in Maine. My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.',
 'response': 'Hi, My name is John.',
 'prompt.pii_presidio.result': '[{"type": "CREDIT_CARD", "start": "82", "end": "101", "score": "1.0"}, {"type": "CRYPTO", "start": "129", "end": "163", "score": "1.0"}]',
 'prompt.pii_presidio.entities_count': 2,
 'response.pii_presidio.result': '[]',
 'response.pii_presidio.entities_count': 0}

If you would like to learn more about how to implement these metrics using LangKit, take a look at our notebook examples: Analyzing response consistency and Detecting PII.

Step 3: Building the RAG system

This step is where the magic happens! When building your RAG system, you integrate two key components: the Preprocessor and the Retriever-Reader System.

Preprocessor: Organizes the knowledge base by dividing documents into smaller units (e.g., passages) for efficient retrieval.
Retriever-reader system: Analyzes user queries, searches the preprocessed knowledge base, and extracts/summarizes relevant information for the LLM.

Together, these components provide the LLM (powering a third component—the generator) with the information needed for accurate and comprehensive responses, enhancing the quality of the generated text.

Additionally, you can change the RAG system’s design (e.g., modify the chunk size in the preprocessor or change the retrieval system). Using LangKit, you can quickly assess the effects of these changes on the LLM.

With LangKit, you can also create a feedback loop to see how changes in the retrieval system affect the overall output when evaluating the RAG using a Judge LLM in the next step.

LangKit provides quantifiable data points (toxicity scores, semantic scores, etc.), allowing you to make data-driven decisions on the most effective architectural choice.

Step 4: Judge LLM evaluation

The final step involves evaluating the RAG system's performance using an LLM as a "judge." Here, an LLM like GPT-4 assesses the usefulness of the RAG system's outputs on a prepared evaluation dataset.

While large and reputable LLMs can provide valuable insights, a robust evaluation process requires a broader scope. Here's how you can integrate LangKit to achieve a more comprehensive assessment:

Detecting jailbreaks and prompt injections: Identifying attempts to manipulate the system through malicious techniques like jailbreaks (attempts to bypass safety restrictions) or prompt injections (inserting malicious instructions into prompts). Here’s a notebook you can follow to detect jailbreaks, prompt injections, and malicious attacks with LangKit.

Monitoring sentiment and toxicity: Ensuring responses are informative, unbiased, and neutral in tone. Here’s a notebook you can follow to understand how to track the sentiment and toxicity of LLM-generated responses with LangKit.
Response consistency: Evaluating whether responses are consistent with retrieved information and past interactions. Here’s a notebook you can follow to understand how to analyze LLM-generated responses for consistency with LangKit.
Monitoring LLM behavior over time: Tracking key metrics to identify shifts in performance or bias for continuous improvement. Here’s a notebook you can follow to understand how to keep track of the LLM’s behavior with LangKit.

You create a more comprehensive and reliable evaluation process for your RAG system by incorporating the full scope of LangKit's features. This should, of course, lead to a more secure, reliable, and trustworthy system for your users.

📚 Recommended Read: 7 Ways to Evaluate and Monitor LLMs.

Using LangKit with WhyLabs AI Control Center to identify areas for improving RAG applications

Once you’ve assessed our LLM metrics, how can you visualize and compare our results?

WhyLabs AI Control Center is a centralized platform for monitoring and managing AI applications, including LLMs. It provides a comprehensive view of AI system performance, enabling teams to identify issues, optimize models, and ensure responsible AI development.

As a plus, LangKit natively integrates with the WhyLabs AI Control Center so that you can send the extracted metrics directly to the platform. These metrics can include relevance scores, sentiment analysis, and other key performance indicators (KPIs) specific to your RAG system.

Visualizing these critical metrics on clear dashboards lets you quickly identify trends and potential LLM performance problems over time. For example, if the sentiment analysis scores from LangKit consistently indicate high negativity in the generated responses, this could signal an issue with the LLM's training data or prompt design.

WhyLabs dashboards allow you to spot trends early and adjust the model or prompts to improve the tone and quality of content.

*WhyLabs dashboard to view and monitor extracted metrics on the text prompts and responses.*

In addition to visualization, WhyLabs AI Control Center allows you to define thresholds for your LLM's performance metrics. If metrics fall outside these predefined ranges, WhyLabs can trigger alerts to notify you of potential issues.

This proactive approach enables you to address concerns before they escalate to ensure a consistently high-quality user experience.

WhyLabs also enables you to combine LangKit data with data from other monitoring tools, creating a holistic view of your LLM's behavior. Integrating data from multiple sources helps you understand your LLM and make better optimization and use decisions.

*The WhyLabs Contol Center’s complete stack provides everything you need to be in control of your AI application. | Source:* *AI Control Center*.

Combining WhyLabs AI Control Center and LangKit creates a powerful toolkit for building secure, reliable, and trustworthy RAG systems. WhyLabs is the central hub for security, explainability, and responsible AI development, while LangKit provides the tools to monitor and measure each process step.

These complementary platforms can help your RAG system become an effective, responsible AI tool that provides exceptional user value and experience.

🛠️ Check out the notebook for sending LangKit telemetry to WhyLabs for visualizing the performance of LLM and RAG systems.

Conclusion:

Retrieval Augmented Generation (RAG) systems combine the power of LLMs with robust information retrieval systems. By giving LLMs access to and using real-time, domain-specific data, RAG systems can produce more accurate, useful, and reliable results than LLMs that work alone.

However, developing and deploying RAG systems comes with its own set of challenges. These include ensuring high-quality data retrieval, seamless integration between the retrieval and generation components, handling non-deterministic LLM behavior, maintaining data quality, and addressing security, safety, and compliance concerns.

A rigorous evaluation process is essential to overcome these challenges and realize the full potential of RAG systems. Tools like LangKit and WhyLabs AI Control Center play a crucial role in this process, assessing performance, identifying issues, and making data-driven improvements at every stage of development.

Following these development, evaluation, and monitoring practices, organizations can use RAG systems to drive innovation, improve decision-making, and deliver exceptional user value while prioritizing responsible and ethical AI development.

Additional Resources

LangKit Github Repository
Sending LLM Metrics to WhyLabs
WhyLabs AI Control Center
HuggingFace: Technical Guide on Building a RAG Evaluator
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
- A portion of this paper discusses the metrics used to assess a critique agent.
Prometheus on HuggingFace
- This is comparable to GPT-4 when it comes to Q&A assessments. It also discusses the importance of clearly defining each numeric score when generating your RAG Q&A dataset.

Rich Young

Understanding and Implementing the NIST AI Risk Management Framework (RMF) with WhyLabs

Rich Young

Dec 10, 2024

Learn how the NIST AI Risk Management Framework (RMF) guides AI security and governance and discover how WhyLabs guardrails can help implement and manage AI risks effectively.

Read post

AI risk management
AI Observability
AI security
NIST RMF implementation
AI compliance
AI risk mitigation

Best Practicies for Monitoring and Securing RAG Systems in Production

Rich Young

Oct 8, 2024

Retrieval-augmented generation (RAG) systems combine advanced retrieval techniques with large language models (LLMs) to improve the responses they generate...

Read post

Retrival-Augmented Generation (RAG)
LLM Security
Generative AI
ML Monitoring
LangKit

WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control

WhyLabs Team

Jun 2, 2024

With WhyLabs and NVIDIA NIM, enterprises can accelerate GenAI application deployment and help ensure the safety of end-user experiences WhyLabs has been on a mission to empower enterprises with tools that ensure safe and responsible AI adoption. With its integration with NVIDIA NIM inference microservices, WhyLabs is helping make responsible AI adoption more accessible. Customers can now maintain better security and control of GenAI applications with self-hosted deployment of the most powerfu

Read post

AI Observability
Generative AI
Integrations
LLM Security
LLMs
Partnerships

OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety

Alessya Visnjic

May 21, 2024

Discover strategies for safeguarding your large language models (LLMs). Learn how to protect your AI technologies effectively based on OWASP's top 10 security tips.

Read post

LLMs
LLM Security
Generative AI

7 Ways to Evaluate and Monitor LLMs

WhyLabs Team

May 13, 2024

Learn about 7 techniques for evaluating & monitoring LLMs, including LLM-as-a-Judge, ML-model-as-a-Judge, and embedding-as-a-source. Improve your understanding of LLMs with these strategies.

Read post

LLMs
Generative AI

How to Distinguish User Behavior and Data Drift in LLMs

Bernease Herman

May 7, 2024

Large Language Models (LLMs) rarely provide consistent responses for the same prompts over time. In this blog we’ll demonstrate how identify and monitor data changes using a few common scenarios.

Read post

LLMs
Generative AI

Run AI with Certainty

Book a demo

How to Evaluate and Improve RAG Applications for Safe Production Deployment

TL;DR

How do RAGs work?

Challenges to address before deploying RAG systems

1. Non-deterministic LLM behavior

2. Quality of retrieval

3. Data quality issues

4. Integration nuances

5. Security, safety, and compliance

Evaluating the performance of RAG systems with LangKit

Step 1: Generate a synthetic evaluation dataset

Step 2: Critique agent evaluation

Step 3: Building the RAG system

Step 4: Judge LLM evaluation

Using LangKit with WhyLabs AI Control Center to identify areas for improving RAG applications

Conclusion:

Additional Resources

Other posts

Understanding and Implementing the NIST AI Risk Management Framework (RMF) with WhyLabs

Best Practicies for Monitoring and Securing RAG Systems in Production

WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control

OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety

7 Ways to Evaluate and Monitor LLMs

How to Distinguish User Behavior and Data Drift in LLMs

Run AI with Certainty

About

Resources

whylogs

WhyLabs