7 Ways to Evaluate and Monitor LLMs
- LLMs
- Generative AI
May 13, 2024
Consider the last time you interacted with a large language model (LLM). You may have chatted with a virtual assistant, searched for information online, or used a language-translation tool. How did you determine whether the LLM's response was accurate and reliable? Did you simply rely on your judgment, or did you use specific techniques or tools to evaluate it?
Evaluating the quality of LLM-powered applications is crucial because it ensures they provide accurate and appropriate responses to user inputs. Unlike traditional models that you can assess with predefined metrics and performance benchmarks, evaluating LLMs is more complex. This complexity arises from the nature of natural language responses and the inherent challenges in defining ground truth data.
LLMs generate responses in natural language, which makes it difficult to determine what constitutes a "correct" or "optimal" response. Unlike the structured data used in traditional models, natural language responses are highly context-dependent and can vary based on tone, phrasing, and cultural nuance. This article will explore various methods for evaluating the performance and quality of LLMs. We will discuss different approaches to assessing LLMs, including applying multiple metrics to measure their effectiveness. The goal is to help you develop an intuition for what constitutes a successful LLM evaluation. Finally, we will conclude with a walkthrough demonstrating the use of LangKit for LLM evaluation. This practical demonstration will provide hands-on experience for improving your LLM evaluation processes.
You can also watch the recording of our workshop with WhyLabs CEO and Co-Founder Alessya Visnjic.
Understanding how to measure LLM performance
Evaluating the performance of large language models (LLMs) has been a topic of discussion, and there has been confusion about the most effective approaches.
The three most popular methods for evaluating LLMs are:
Eyeballing
Manually reviewing and assessing the quality of LLM-generated responses based on human judgment. While it provides immediate feedback, it’s also labor-intensive, inefficient, inconsistent, prone to bias, and difficult to scale.
HELM (Holistic Evaluation of Language Models)
HELM is a suite of benchmarks for evaluating LLMs across a range of language understanding and generation tasks. The issue here is that setting up and maintaining these benchmarks is resource-intensive and time-consuming. Additionally, it may not fully capture the nuances of real-world language understanding tasks, leading to mismatches between benchmark performance and actual model utility.
LLM-as-a-Judge
Using LLMs to evaluate the quality of responses generated by other LLMs or language systems. There are concerns here that we’ll explore in a later section, but generally, because of the stochastic nature of LLMs, evaluation results are irreproducible. Also, intricate prompt engineering is required to remove ambiguity or inconsistency that could impact the reliability of evaluations.
While these approaches are great for evaluating your LLM’s performance, they fall short of solving the problem of continuously monitoring or evaluating them on various tasks. How do we solve this?
7 techniques for extracting metrics
At WhyLabs, we have spent a lot of time working on the most generalizable ways to measure the performance of LLMs and came up with some key questions:
- What metrics can be extracted from LLM responses and metadata?
- What metrics are helpful in root-cause analysis of LLM performance (quality of responses)?
- What metrics are helpful in root-cause analysis of LLM issues (pattern detection)?
- What metrics are possible to extract in cost and latency constraint settings?
After researching these questions extensively and talking to customers, we developed seven (7) techniques for extracting metrics that ensure LLMs are continuously evaluated and monitored:
1. LLM-as-a-Judge
2. ML-model-as-Judge
3. Embedding-as-a-source
4. NLP metrics
5. Pattern recognition
6. End-user-in-the-loop
7. Human-as-a-Judge
Of course, you probably use techniques that are not on this list or are currently being developed—it is by no means exhaustive. We encourage you to share feedback on incorporating other techniques into the open-source LangKit library.
Criteria for comparing the evaluation techniques
To decide which technique is best in different circumstances, let’s examine six dimensions that help us compare these techniques: coverage of use cases, cost, latency, ease of setup, explainability, and reproducibility.
With that in mind, let’s learn the seven techniques for extracting LLM metrics for continuous evaluation and observability (monitoring).
1. Large Language Model-as-a-Judge (LLM-as-a-Judge)
Large Language Model-as-a-Judge (LLM-as-a-Judge) is a technique for evaluating and monitoring the performance of LLMs that differs from traditional methods like human evaluation or automated metrics such as BLEU and ROUGE. This technique involves using the same or a different LLM to assess specific attributes of text, such as factualness, completeness, or toxicity.
To implement LLM-as-a-Judge, you construct a prompt with a scoring rubric and examples to guide the LLM's assessment. For instance, to evaluate the factualness of a given text, the prompt might ask the LLM to rate the accuracy of the information provided on a scale from 1 to 5, with examples illustrating what constitutes a highly factual or inaccurate statement. Optionally, you could also ask for an explanation of the score, which is particularly helpful when you want the LLM to provide sources or examples.
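To make this concrete, here is a minimal sketch of a factualness judge built on the OpenAI Python client. The rubric wording, examples, and model name are illustrative assumptions rather than a prescribed setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """You are a strict evaluator. Rate the FACTUALNESS of the text on a 1-5 scale:
5 = fully accurate and verifiable, 1 = mostly fabricated.
Example (score 5): "Water boils at 100 degrees Celsius at sea level."
Example (score 1): "The Eiffel Tower is located in Berlin."
Return JSON: {"score": <1-5>, "explanation": "<one sentence>"}"""

def judge_factualness(text: str, model: str = "gpt-4") -> str:
    """Ask a judge LLM to score the factualness of `text` against the rubric."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduces, but does not eliminate, run-to-run variance
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Text to evaluate:\n{text}"},
        ],
    )
    return response.choices[0].message.content

print(judge_factualness("The Great Wall of China is visible from the Moon."))
```

Setting the temperature to 0 makes scores more stable, though proprietary APIs still cannot guarantee exact reproducibility.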
One of the standout features of LLM-as-a-Judge is its generalizability. You can apply it to a wide range of tasks and measure various aspects of text quality, such as:
- Factualness: How accurate and trustworthy the information in a text is.
- Completeness: Whether the text provides all the information required to understand a topic or issue fully.
- Toxicity: The presence of harmful or offensive language in the text.
- Hallucination: Whether the text contains fabricated or invented information that lacks a basis in reality.
The effectiveness of LLM-as-a-Judge heavily relies on the quality of the prompts and examples provided. Carefully crafting these components is crucial to ensuring the reliability and consistency of the evaluation results.
The illustration below shows how this technique ranks across the six dimensions for evaluating a metric extraction technique you learned in the previous section:
Pros of using LLM-as-a-Judge
- Generalizable technique that is applicable to different evaluation tasks (good coverage).
- Measures various aspects of text quality.
- Utilizes the expertise of the LLM in natural language understanding.
- Can generate explanations for the scores, enhancing interpretability.
Cons of using LLM-as-a-Judge
- Requires an additional LLM call for each evaluation metric, leading to increased cost and latency.
- Quality of the evaluation metric depends on the prompts and examples provided.
- Lacks reproducibility of scores, mainly with proprietary LLM APIs (e.g., GPT-4 API).
In LangKit, this technique underlies metrics such as `response_hallucination` and `proactive_injection_detection`.
2. Machine Learning-model-as-Judge (ML-model-as-Judge)
ML-model-as-Judge is a strategy used to evaluate and monitor the performance of LLMs by using an additional ML model to score text based on metrics such as toxicity, sentiment, or topical relevance.
Unlike LLM-as-a-Judge, which uses the same or a different LLM for evaluation, ML-model-as-Judge relies on simpler, lightweight models specifically trained for the desired evaluation tasks.
The essence of this technique revolves around selecting and configuring a model capable of scoring text according to predefined metrics. Following this setup, the chosen model generates a score or prediction for the specified metric. One can conduct a feature importance analysis to explain the assigned score.
For example, when evaluating the toxicity level of a text snippet, a toxicity classification model produces a toxicity score, thereby offering insights into the presence of harmful language.
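As a rough sketch of this setup, a pre-trained open-source classifier from Hugging Face can serve as the judge. The model name below is one commonly used option and is an assumption, not a requirement of the technique:

```python
from transformers import pipeline

# A lightweight open-source toxicity classifier; any comparable model can be swapped in.
toxicity_judge = pipeline("text-classification", model="unitary/toxic-bert")

response_text = "You are completely useless and everyone knows it."
result = toxicity_judge(response_text)[0]  # e.g., {'label': 'toxic', 'score': ...}
print(f"label={result['label']}, score={result['score']:.3f}")
```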
You can set up ML-model-as-Judge quickly, and it avoids the error-range, cost, and latency problems of LLM-as-a-Judge. However, the reliability and effectiveness of this technique heavily depend on the quality of the selected models and the datasets used for training and evaluation.
Examples of metrics that can be measured with ML-model-as-Judge include:
- Toxicity: Measures the degree of harmful or offensive language in a text.
- Sentiment: Evaluates the emotional tone or attitude conveyed in a text, typically categorizing it as positive, negative, or neutral.
- Presence of Topics: Assesses the relevance and prominence of specific topics or themes within a text.
- Presence of Entities: Identifies and extracts named entities such as people, organizations, locations, dates, or numerical values in the text.
The illustration below shows how this technique ranks across the six dimensions for evaluating a metric extraction technique:
ML-model-as-a-Judge across the six dimensions of evaluating an extraction technique. | Source: 7 Ways to Evaluate & Monitor LLM Performance.
Pros of Using ML-model-as-a-Judge:
- Low cost and latency compared to LLM-as-a-Judge.
- Easy setup for scenarios where you can use open-source models.
- Covers many use cases (e.g., entity extraction, topic extraction, toxicity, sentiment).
Cons of using ML-model-as-a-Judge:
- Explainability can be low or expensive, depending on the model used.
- Setup can be difficult if no open-source or existing models are used.
- A new model is required for each use case.
3. Embedding-as-a-source
Embedding-as-a-source is a technique used to evaluate and monitor the performance of LLMs by leveraging embeddings, which are numerical representations of textual data. Embeddings capture semantic and syntactic information about words or phrases, allowing for mathematical comparisons and similarity measurements.
This technique uses embeddings extracted directly from the LLM itself (for prompts or responses) or from open-source embedding models such as `e5-mistral-7b-instruct`. The distance between items, such as prompts and responses, is measured to gauge their semantic similarity or relevance. A smaller distance indicates a higher level of similarity or relevance.
A good application of this technique is measuring prompt-response relevance. By embedding both the prompt and the generated response, you can calculate the distance between them. A smaller distance signifies a higher level of relevance between the prompt and the response, indicating a more appropriate and contextually fitting reply.
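A minimal sketch of this relevance check uses the sentence-transformers library with an arbitrary open-source embedding model; the model choice is an assumption, and embeddings exposed by your LLM would work the same way:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model can be substituted

prompt = "What are the health benefits of regular exercise?"
response = "Regular exercise improves cardiovascular health, mood, and sleep quality."

# Embed both texts and measure cosine similarity (higher = more relevant).
prompt_emb, response_emb = model.encode([prompt, response], convert_to_tensor=True)
relevance = util.cos_sim(prompt_emb, response_emb).item()
print(f"prompt-response relevance: {relevance:.3f}")
```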
When using open-source embeddings, it is essential to consider the data the embedding model was trained on and how well it matches your evaluation task, to ensure the embeddings are relevant and effective.
Examples of metrics that can be measured with embedding-as-a-source include:
- Prompt-Response Relevance: Assesses how closely aligned a generated response is to the given prompt.
- Response Similarity/Consistency: Evaluates the similarity or consistency between multiple responses generated by the LLM.
- Distance Between Two Specific Themes: Calculates the semantic distance between two themes or topics within a text corpus.
The illustration below shows how this technique ranks across the six dimensions for evaluating a metric extraction technique:
While embedding-as-a-source is primarily used for relevance and similarity assessment, its applications can be extended to other evaluation tasks. For example, embeddings can be used to cluster LLM-generated text to identify common themes or patterns or to detect anomalies or outliers in the generated responses.
Pros of Using Embedding-as-a-source
- It has low cost and latency and can be used online.
- Easy setup if the LLM exposes embeddings.
- Good reproducibility and explainability (assuming an understanding of embeddings).
Cons of using Embedding-as-a-source
- Limited coverage of use cases, primarily centered around relevance and similarity assessment.
- Requires sufficient understanding of embeddings for effective usage.
- May lack versatility for addressing diverse evaluation tasks beyond relevance and similarity measurements.
4. Computing NLP metrics
Computing NLP metrics is a technique for evaluating and monitoring the performance of LLMs by quantifying various aspects of text quality and length. This approach utilizes open-source libraries such as textstat to compute various metrics efficiently.
NLP metrics differ from other evaluation techniques, like embedding-based methods or human evaluation, by focusing on quantitative measures of linguistic features. These metrics provide numerical values that capture different aspects of text quality and length, allowing for objective comparisons and analysis.
Textstat, for example, can compute readability, complexity, grade level, and other quality metrics, as well as basic metrics like syllable count, word count, and character count. These metrics serve as indicators of the linguistic characteristics of LLM-generated text, facilitating the assessment of its appropriateness, coherence, and readability.
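As a quick illustration, a few of these metrics can be computed directly with textstat; the sample text below is arbitrary:

```python
import textstat

response = (
    "Large language models generate text by predicting the next token. "
    "Evaluating them requires more than a single accuracy number."
)

print("Flesch reading ease:", textstat.flesch_reading_ease(response))
print("Difficult words:", textstat.difficult_words(response))
print("Sentences:", textstat.sentence_count(response))
print("Words:", textstat.lexicon_count(response))
```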
In addition to assessing text quality, NLP metrics can also measure the length of prompts or responses, enabling the monitoring of user interaction patterns and trends. For instance, a sudden increase in prompt length may signify significant shifts in user behavior or raise flags for security risks, such as prompt injection attacks.
Examples of metrics that can be measured with NLP metrics include:
- Readability: Evaluates how easy or difficult it is for readers to comprehend a text.
- Difficult word count: Measures the number of complex or hard-to-understand words in a text.
- Number of Sentences: Counts the total number of sentences in a given text.
- Number of Words: Quantifies the total word count in a text.
The illustration below shows how this technique ranks across the six dimensions for evaluating a metric extraction technique:
Pros of using NLP metrics
- Very low cost and latency.
- Easy to set up when using open-source libraries like textstat (for English language).
- Great reproducibility and explainability.
Cons of using NLP metrics
- Low coverage, limited to heuristic metrics and counts.
- Potential lack of sensitivity to nuances in language.
- Interpretability can be challenging, requiring additional expertise or contextual knowledge.
5. Pattern recognition
Pattern recognition is a fundamental aspect of text analysis. It is crucial to detect sensitive information, such as credit card numbers or social security numbers, within prompts and responses. This technique relies on regular expressions (RegEx)—a sequence of characters that defines a search pattern—to identify and detect patterns within textual data.
For example, to detect credit card numbers, a RegEx pattern might be designed to recognize sequences of 16 digits arranged in groups of four. Once these patterns are defined, the text is systematically scanned to identify specific pattern occurrences.
When a match between the RegEx pattern and the text is found, the system flags the presence of sensitive information, alerting the user to take appropriate action. This proactive approach helps prevent data breaches and reinforces data privacy and security measures within the LLM ecosystem.
Examples of sensitive information that can be detected using pattern recognition include:
- Credit Card Numbers: Detecting sequences of 16 digits, often grouped into sets of four.
- Social Security Numbers (SSNs): Identifying strings of nine digits in the format XXX-XX-XXXX.
- Phone Numbers: Recognizing sequences of digits that adhere to common phone number formatting conventions.
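As a rough sketch, the three patterns above can be expressed with Python's built-in re module. The expressions below are deliberately simplified assumptions; production patterns need to handle many more formats and validation rules (e.g., Luhn checks for card numbers):

```python
import re

# Simplified, illustrative patterns -- not production-grade detection rules.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def flag_sensitive_info(text: str) -> list[str]:
    """Return the names of all patterns found in the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

print(flag_sensitive_info("My card is 4111 1111 1111 1111 and my SSN is 123-45-6789."))
# ['credit_card', 'ssn']
```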
The illustration below shows how this technique ranks across the six dimensions for evaluating a metric extraction technique:
Pros of using pattern recognition
- Low cost and minimal latency, suitable for real-time use.
- High reproducibility and clarity in identifying specific patterns or information.
Cons of using pattern recognition
- Low coverage, limited to predefined patterns.
- Setting up the detection system can be challenging, requiring the enumeration of all desired patterns beforehand.
Pattern recognition can be combined with other techniques, such as machine learning-based classification or named entity recognition, to mitigate the low coverage limitation and improve the overall detection capabilities.
Additionally, regularly updating and maintaining the RegEx patterns is crucial to ensuring the system effectively identifies new types of sensitive information or changes in data formats.
In LangKit, this technique is exposed through the `has_patterns` metric.
6. End-user in-the-Loop
End-user in-the-loop is a captivating approach that seamlessly integrates feedback tools directly into your application's user interface (UI). If you've had the opportunity to interact with ChatGPT or similar LLM applications, you may have encountered these intuitive features: the refresh/regenerate button, thumbs-up button, and thumbs-down button.
Monitoring these metrics provides invaluable insights into users' sentiments and perceptions regarding the responses generated by your LLM application. By measuring user reactions to the generated responses, you can gain a real-time gauge of user satisfaction and usability.
You can analyze the collected user feedback and use it to improve the LLM's performance in various ways. For example, you can fine-tune the model on the feedback data, adjust the response generation parameters, or identify areas where the LLM needs further training or improvement.
It is crucial to design intuitive and user-friendly feedback mechanisms to encourage user participation and gather more representative data. The feedback tools should be easily accessible, visually appealing, and seamlessly integrated into the user experience. By making the feedback process effortless and engaging, you can increase the likelihood of users providing their valuable input.
Examples of metrics that can be measured using end-user in-the-loop include:
- Refresh/Regenerate: Tracking the frequency of users requesting new responses when dissatisfied with the initial output.
- Thumbs up/Thumbs down: Monitoring user approval or disapproval of generated responses through binary feedback buttons.
- Distance to Edit: Measuring the extent of user modifications to the generated responses to gauge their satisfaction and the LLM's accuracy.
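As an illustration of turning these signals into trackable numbers, here is a minimal sketch that aggregates raw feedback events into approval and regenerate rates. All of the names here are hypothetical; your application would log such events through its own UI and backend:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    """One user reaction to a generated response (hypothetical schema)."""
    response_id: str
    action: str  # "thumbs_up", "thumbs_down", or "regenerate"

def summarize_feedback(events: list[FeedbackEvent]) -> dict[str, float]:
    """Aggregate raw feedback events into rates you can monitor over time."""
    counts = Counter(event.action for event in events)
    total = len(events) or 1  # avoid division by zero
    return {
        "approval_rate": counts["thumbs_up"] / total,
        "disapproval_rate": counts["thumbs_down"] / total,
        "regenerate_rate": counts["regenerate"] / total,
    }

events = [
    FeedbackEvent("r1", "thumbs_up"),
    FeedbackEvent("r2", "regenerate"),
    FeedbackEvent("r3", "thumbs_up"),
    FeedbackEvent("r4", "thumbs_down"),
]
print(summarize_feedback(events))  # {'approval_rate': 0.5, 'disapproval_rate': 0.25, 'regenerate_rate': 0.25}
```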
The illustration below shows how this technique ranks across the six dimensions for evaluating a metric extraction technique:
It is important to acknowledge that user feedback may not be universally provided for every prompt or response, leading to potential data sparsity. To address this challenge, active learning techniques can be employed to prioritize gathering feedback on the most informative or uncertain responses.
You can maximize the value of the available data and improve the efficiency of the feedback loop by strategically targeting feedback collection.
Pros of using End-user In-the-Loop:
- Directly measures user reactions, providing valuable insights into user satisfaction and usability.
- Offers broad coverage, as any task can be designed to gather metrics closely related to LLM performance.
Cons of using End-user In-the-Loop:
- Requires design and engineering effort to integrate into the user experience.
- Metrics collected may indirectly measure performance and lack clear explanations.
- Feedback may not be provided for every LLM response, leading to data sparsity.
For a worked example of collecting thumbs-up and distance-to-edit feedback metrics, see the notebook: Behavioral Monitoring of Large Language Models.
7. Human-as-a-Judge
Human-as-a-judge is a classic method for evaluating the performance of LLMs by comparing their generated responses to human-generated responses. This approach involves setting up experiments, selecting judges, curating responses for evaluation, and collecting metrics.
Renowned for their accuracy, human-judging experiments are regarded as the gold standard for assessing LLM quality. They entail the deliberate design of experiments to solicit feedback on how LLM-generated responses measure up against human-generated responses.

The methodology underlying human-judging experiments offers a comprehensive framework for measuring various facets of LLM quality. Researchers carefully select and train human judges to ensure consistency and reliability in the evaluation process. The judges provide valuable insights into the strengths and limitations of LLM-generated text across diverse dimensions, including coherence, relevance, factual accuracy, and linguistic fluency.
This approach provides a nuanced understanding of LLM performance and serves as a benchmark for evaluating progress and identifying areas for improvement.
To validate the results, carefully design the experiments and select representative prompts that cover a wide range of topics, styles, and difficulty levels. By exposing the LLMs to diverse prompts and comparing their responses to human-generated responses, researchers can better understand the LLM’s capabilities and limitations.
Examples of metrics that can be measured using Human-as-a-Judge include:
- Toxicity: Assessing the presence of harmful or offensive language in LLM-generated text.
- Completeness: Measuring the extent to which LLM-generated responses fully address the content and context of the given prompts.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Evaluating the similarity between LLM-generated and human-generated reference summaries.
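For the ROUGE metric mentioned above, a minimal sketch using the rouge-score package looks like this; the reference and generated texts are arbitrary examples:

```python
from rouge_score import rouge_scorer

# Compare an LLM-generated summary against a human-written reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The committee approved the budget after a two-hour debate."
generated = "After two hours of debate, the committee approved the budget."

scores = scorer.score(reference, generated)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f}, recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```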
The illustration below shows how this technique ranks across the six dimensions for evaluating a metric extraction technique:
Pros of using Human-as-a-Judge:
- Directly measures aspects that other metrics indirectly assess, providing nuanced insights into LLM quality.
- Offers broad coverage, since a human judge can evaluate virtually any task.
Cons of using Human-as-a-Judge:
- Requires careful experimental design and human judges with subject matter expertise, depending on the application.
- Not feasible to evaluate 100% of responses, necessitating decisions on which responses to collect for evaluation.
To mitigate the limitations of Human-as-a-Judge, researchers can employ a combination of human evaluation and automated metrics. This approach balances the trade-off between accuracy and scalability, allowing for a more comprehensive evaluation of LLM performance.
Automated metrics can quickly assess a large number of responses, while human evaluation can be reserved for a carefully selected subset of responses that require more nuanced judgment.
Choosing the right LLM evaluation metrics
After going through these seven techniques, the question on your mind is probably: how do I choose the right metric or technique? Here are four helpful questions to ask yourself before choosing a metric:
- How critical is the use case? If you are in a critical industry, such as healthcare, energy, or finance, you understand that LLM hallucinations or poor-quality responses can cause irreparable damage.
- What are you trying to optimize for? This relates to your use case. Would you prioritize speed and efficiency over quality, or vice versa?
- What are your constraints? Do you want to run evaluations synchronously or asynchronously? In batch or in real time? The answers will define your production requirements.
- Who is the target audience for the metrics? This could be a non-technical audience requiring a report on your LLM’s performance or a technical audience feeding the results to downstream systems.
Here is a chart comparing how all the different techniques rank across the rubric:
Using LangKit and WhyLabs to compute LLM metrics for evaluation and monitoring
To start computing LLM metrics for evaluation and monitoring, you will use LangKit. It is an open-source text metrics toolkit for monitoring language models. It offers an array of methods for extracting relevant metrics from prompts and/or responses, which are compatible with the open-source data profiling library whylogs. LangKit can extract signals from unstructured text data across categories such as Text Quality, Text Relevance, Security and Privacy, and Sentiment and Toxicity.

Installing LangKit is straightforward; you can install it via pip:
pip install langkit[all]
Check out the Google Colab notebook for the full code walkthrough on generating out-of-the-box text metrics for Hugging Face LLMs using LangKit and monitoring them in the WhyLabs Observability Platform.
You will use the GPT-2 model since it's lightweight and easy to run without a GPU, but the example can be run on any of the larger Hugging Face models.
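As a starting point, here is a minimal sketch following LangKit's documented quickstart pattern; the example prompt and response are arbitrary, and the Colab notebook above remains the authoritative walkthrough:

```python
import whylogs as why
from langkit import llm_metrics  # registers LangKit's LLM text metrics

# Build a whylogs schema that computes the out-of-the-box LLM metrics.
schema = llm_metrics.init()

record = {
    "prompt": "Summarize the plot of Hamlet in two sentences.",
    "response": "Hamlet, prince of Denmark, seeks revenge for his father's murder.",
}

# Profile the prompt/response pair; the resulting profile can be inspected
# locally or written to the WhyLabs platform for monitoring over time.
profile = why.log(record, schema=schema).profile()
print(profile.view().to_pandas().index.tolist())  # lists the profiled columns, including LangKit-derived metrics
```

The resulting whylogs profile contains statistical summaries of these metrics, which is what gets sent to the WhyLabs platform for monitoring over time.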
Key takeaways: 7 ways to evaluate and monitor LLM performance
Understanding the need to evaluate LLM-powered applications is crucial for ensuring their effectiveness and reliability in real-world scenarios. Current evaluation methods have limitations in capturing the nuanced aspects of LLM performance, including semantic coherence and context appropriateness.

However, having looked into the seven techniques for evaluating and monitoring LLM performance, you should be able to pick one by weighing criteria such as cost, latency, setup, explainability, and reproducibility.
As LLMs continue to evolve and we see more development in the space, further research and development in evaluation methodologies are needed to address current challenges and ensure robust assessment of LLM performance. This article addressed only seven techniques; more remain to be developed and discovered.
We invite you to contribute to the open-source LangKit library and join our community.
Frequently Asked Questions (FAQs)
- What is the best evaluation technique for evaluating a QA model with ground truth? A good starting point is HELM. It has a bunch of QA benchmarks and metrics extracted for all your favorite LLMs. Then, you should look at LangKit.
- What is the relationship between the Python LangKit library and the WhyLabs platform? LangKit is open source, and it extracts metrics from blobs of text, most commonly prompts and responses. WhyLabs is a platform that helps you track these scores over time, potentially across different experiments. The WhyLabs platform ingests only the scores computed from prompts and responses, not the prompts and responses themselves, and it allows you to build a dashboard to keep track of and monitor how these metrics change over time.
- How do I choose the right evaluation metrics for my LLM? Your selection will be based on the appropriate metrics for your use case and objectives.
- Can I use human judging experiments to evaluate non-English Large Language Models? While human judging experiments are commonly used to evaluate English LLMs, adapting this technique to assess non-English models presents certain challenges. Factors such as language diversity, cultural nuances, and the availability of qualified judges can impact the reliability and validity of the results.
- Can I combine multiple evaluation techniques to comprehensively assess my LLM? Absolutely! Combining multiple evaluation techniques can help provide a more comprehensive understanding of your LLM’s performance.