Evaluating Large Language Models (LLMs)
Introduction/overview
Key ideas
- A combination of intrinsic and extrinsic evaluation will give you the best assessment of an LLM.
- All metrics have pros and cons. It’s good to use a mix of many based on your needs to understand the model's behavior.
- Remember to incorporate human feedback into your LLM evaluation. Metrics are effective ways to measure performance, but human feedback, although more time-consuming to obtain, can catch subtle nuances and errors that metrics might not capture.
In the previous lessons, we discussed LLM training and adaptation methods. Evaluation is as crucial for LLMs as an architect's inspection is for a building's design and safety standards, or a chef's tasting is for a dish's flavor balance. This stage is not merely about assessing performance; it's about understanding the model's behavior, identifying biases, and ensuring it aligns with ethical guidelines.
In this lesson, we will explore the significance of this evaluation process in depth. We'll introduce you to various evaluation techniques, ranging from quantitative metrics that measure model performance against benchmarks to qualitative assessments that examine ethical and societal implications.
Intrinsic vs. extrinsic evaluation
LLM evaluation extends traditional ML model evaluation. Unlike traditional models with well-defined evaluation metrics, such as BLEU for machine translation, LLMs often lack a clear ground truth to compare against.
This is especially true in unsupervised training settings, where LLMs learn to predict the next word in a sentence without direct supervision. In such cases, the absence of a clear "correct answer" makes it hard to apply traditional accuracy-based metrics.
For LLMs, evaluation falls into two categories:
- Intrinsic: Assessing the model on tasks it was directly trained to perform.
- Extrinsic: Assessing the model on downstream tasks or real-world applications.
Intrinsic evaluation
This foundational layer of assessment measures a model's performance on tasks directly related to its training objectives. Imagine a carpenter using a spirit level to ensure a perfectly horizontal surface; similarly, intrinsic evaluation examines the "levelness" of an LLM's linguistic capabilities.
Key metrics like word prediction accuracy and perplexity shed light on the LLM’s proficiency in language understanding and generation. However, the focus of intrinsic evaluation on the model's inherent qualities, while crucial for initial insights, might not fully capture its real-world effectiveness.
It is essential to recognize its limitations, particularly its limited ability to flag overfitting. Intrinsic evaluation serves as a starting point and highlights the need for further, application-focused assessment: extrinsic evaluation.
Extrinsic evaluation
Moving beyond foundational checks, extrinsic evaluation addresses the performance of LLMs on real-world applications and tasks not explicitly covered during training. It assesses the practical utility of the model's outputs, incorporating user satisfaction surveys and human-in-the-loop testing to gauge real-world effectiveness.
These methods provide invaluable feedback on the strengths and weaknesses of LLMs from the perspective of end users. They also integrate human judgment to capture nuances that automated evaluations (like accuracy) may miss.
Balancing intrinsic and extrinsic evaluations
While intrinsic methods offer efficiency and cost-effectiveness, extrinsic evaluations provide a comprehensive picture of the model's utility in practical scenarios. Together, they furnish a holistic view of an LLM's capabilities, grounding further improvements in both fundamental linguistic understanding and real-world applicability.
This balanced approach ensures that LLMs are not only linguistically adept but also ethically responsible and practically useful, guiding ongoing optimizations with a dual focus on accuracy and practical application.
Nuances of LLM evaluation
Understanding how LLMs perform across key areas is crucial for assessing their readiness for real-world applications. These areas include:
- Language fluency: The model's ability to generate text that is smooth and natural.
- Coherence: The logical flow and consistency within the generated text.
- Contextual understanding: How well the model applies context in its responses.
- Factual accuracy: The correctness of information provided by the model.
- Relevance and meaningfulness: The model's capacity to produce appropriate content in response to prompts.
Evaluating these dimensions offers insights into a model's capabilities and highlights areas for improvement, including managing hallucinations (factual accuracy and relevance of generated content).
Below, we discuss two methods of computing metrics to evaluate LLMs:
- Automatic evaluation
- Human evaluation
Automatic evaluation
Automatic evaluation provides a scalable and consistent way to assess those dimensions of LLM performance using computational methods. These metrics can quickly give insights into a model's capabilities but may not capture all the nuances of human language understanding and generation.
Perplexity
Perplexity is a foundational metric that gauges how well a model predicts the next token in a sample; it is the exponential of the average negative log-likelihood the model assigns to a text and serves as a proxy for language understanding. It's vital to note that the metric applies specifically to autoregressive, not masked, language models.
Limitation: While lower perplexity suggests better language modeling, it doesn't directly correlate with text quality or relevance.
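As a rough illustration, here is a minimal sketch of computing perplexity as the exponential of the average negative log-likelihood over a short text. It assumes the Hugging Face transformers and torch packages and uses the public gpt2 checkpoint as a stand-in for whichever autoregressive model you are evaluating:

```python
# Minimal perplexity sketch: perplexity = exp(mean negative log-likelihood).
# Assumes `transformers` and `torch` are installed; `gpt2` is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```

Lower values mean the model was less "surprised" by the text; comparing perplexities is only meaningful when the models share a tokenizer and the evaluation text is the same.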
BLEU score (Bilingual Evaluation Understudy)
BLEU scores compare machine-translated text against human reference translations by measuring n-gram overlap. They are easy to compute and promote standardization across different models.
Limitation: BLEU does not directly assess the semantic accuracy or fluency of translations. It also relies heavily on the quality of the reference translations, which can lead to misleading scores if those references are not top-notch.
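As a sketch, a sentence-level BLEU score can be computed with NLTK (an assumption; sacrebleu and Hugging Face's evaluate library are common alternatives):

```python
# Sentence-level BLEU sketch using NLTK (assumes `nltk` is installed).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()      # human reference translation
candidate = "the cat sits on the mat".split()    # machine translation to score

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smoothie = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothie)
print(f"BLEU: {score:.3f}")
```

In practice, corpus-level BLEU over many sentence pairs is more stable than single-sentence scores.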
ROUGE score (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE scores compare automatically generated summaries (or translations) against human-written reference summaries. Because it focuses on recall, ROUGE is useful for checking that all necessary information is included in a summary, and it can use multiple references for a more balanced evaluation.
Limitation: It may not fully reflect the fluency or grammatical correctness of the generated text, and a high recall could lead to verbose outputs.
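Here is a minimal sketch using Google's rouge-score package (an assumption; other implementations exist) to compare a generated summary against a reference:

```python
# ROUGE sketch using the `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The committee approved the budget and scheduled a follow-up meeting."
candidate = "The committee approved the budget."

# ROUGE-1 counts unigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```

Note how the short candidate above scores high precision but low recall, which is exactly the trade-off ROUGE's recall focus is meant to surface.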
METEOR score (Metric for Evaluation of Translation with Explicit Ordering)
METEOR evaluates machine translation by aligning candidate and reference translations using exact matches, stems, synonyms, and paraphrases, and by penalizing poor word order. Because it accounts for synonyms and paraphrasing, it tends to align more closely with human judgment than BLEU and provides a more nuanced view of meaning.
Limitation: It requires more computation time than BLEU or ROUGE and is sensitive to the choice of reference translations (which could introduce bias).
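A small sketch using NLTK's implementation (an assumption; it needs the WordNet data for synonym matching, and recent NLTK versions expect pre-tokenized inputs):

```python
# METEOR sketch using NLTK (assumes `nltk` is installed; WordNet data is
# required for synonym matching).
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

# Recent NLTK versions expect pre-tokenized references and hypothesis.
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")
```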
Human evaluation
Human evaluation remains the gold standard for assessing the overall effectiveness and appropriateness of LLM outputs. It involves subjective judgment and interpretation of text quality that captures the subtleties that automated metrics miss.
- Language Fluency: Human judges assess the smoothness and naturalness of the model-generated text, providing insights into the model's linguistic proficiency.
- Coherence: Evaluators inspect the logical flow and consistency of information, assessing how well the model maintains topic and argument structure across a text.
- Contextual Understanding: Human assessment gauges the model's ability to understand and appropriately apply context in its responses, which is crucial for tasks requiring nuanced understanding.
- Factual Accuracy: Evaluators estimate the correctness of the information and details provided by the model, identifying instances of hallucinations or misinformation.
- Relevance and Meaningfulness: Human judges evaluate the effectiveness of the model in producing content that is suitable and valuable in response to various prompts, ensuring the outputs meet user expectations.
Forms of human evaluation
Likert scale (rating a response):
Using a Likert scale involves evaluators rating LLM-generated responses based on specific criteria, such as fluency, coherence, or relevance. For example, evaluators might rate the fluency of a generated text on a scale from 1 (completely incoherent) to 5 (perfectly fluent).
Limitations: Likert scales can introduce biases, such as central tendency bias, where evaluators might avoid using the extremes of the scale. Additionally, different evaluators may interpret the same scale points differently, leading to inconsistencies in ratings.
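As one way to work with Likert ratings, a hedged sketch with made-up scores: report the mean rating per criterion and check inter-rater agreement, for example with Cohen's kappa from scikit-learn:

```python
# Sketch: aggregating hypothetical Likert fluency ratings from two evaluators
# and checking inter-rater agreement (assumes `numpy` and `scikit-learn`).
import numpy as np
from sklearn.metrics import cohen_kappa_score

# 1-5 fluency ratings for the same ten responses (illustrative data only).
rater_a = np.array([4, 5, 3, 4, 2, 5, 4, 3, 4, 5])
rater_b = np.array([4, 4, 3, 5, 2, 5, 3, 3, 4, 4])

print(f"Mean fluency (rater A): {rater_a.mean():.2f}")
print(f"Mean fluency (rater B): {rater_b.mean():.2f}")

# Weighted kappa treats a 4-vs-5 disagreement as milder than a 1-vs-5 one.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Inter-rater agreement (weighted kappa): {kappa:.2f}")
```

Low agreement is often a sign that the rating rubric, not the model, needs refinement.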
Preference judgments:
This form asks evaluators to compare two or more outputs and select the one they prefer based on those criteria; this is the kind of feedback used to refine and fine-tune ChatGPT. For instance, given two responses to the same prompt, evaluators might be asked to choose the one that better maintains topic relevance or creative expression.
Limitations: The context of comparison can affect preference judgments, potentially skewing evaluations towards relative rather than absolute assessments. The binary nature of the choice also fails to capture nuanced preferences or the evaluators' reasoning.
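A minimal sketch, using made-up judgments, of turning pairwise preferences into per-model win rates, which is a common way to summarize this kind of data:

```python
# Sketch: computing win rates from hypothetical pairwise preference judgments.
from collections import Counter

# Each record: (model behind response A, model behind response B,
# which response the evaluator preferred). Illustrative data only.
judgments = [
    ("model_x", "model_y", "A"),
    ("model_x", "model_y", "B"),
    ("model_y", "model_x", "A"),
    ("model_x", "model_y", "A"),
]

wins, appearances = Counter(), Counter()
for model_a, model_b, preferred in judgments:
    winner = model_a if preferred == "A" else model_b
    wins[winner] += 1
    appearances[model_a] += 1
    appearances[model_b] += 1

for model in appearances:
    print(f"{model}: win rate {wins[model] / appearances[model]:.0%}")
```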
Fine-grained:
Fine-grained evaluation entails detailed analysis or annotation of texts, such as marking grammatical errors or noting instances of factual inaccuracy. This method allows for in-depth feedback on specific aspects of the generated text.
Limitations: This approach is time-intensive and may require evaluators with specialized knowledge or training to provide meaningful feedback. It also introduces the risk of subjective bias in the interpretation of criteria and annotations.
Using LLMs as LLM evaluators
Here, one LLM (the evaluator) analyzes the output of another by understanding and evaluating the generated text's linguistic qualities, relevance, and adherence to given prompts (evaluation criteria). It can be particularly useful for preliminary assessments, continuous integration during development, and large-scale evaluations where human evaluation is impractical.
The evaluator can be trained or fine-tuned to perform specific evaluation tasks, such as scoring the fluency of text or detecting factual inaccuracies and hallucinations in generated content.
An example is using an LLM to perform consistency checks, where it evaluates whether responses given by another LLM across different examples maintain factual consistency and do not contradict known information or previous answers.
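A hedged sketch of this pattern is below; `call_judge_model` is a hypothetical stand-in for whatever chat-completion client you use, and the rubric prompt and output format are illustrative only:

```python
# Sketch of an LLM-as-judge consistency check. `call_judge_model` is a
# hypothetical placeholder for your chat-completion client (OpenAI, Anthropic,
# a local model, etc.); the rubric and output format are illustrative only.

JUDGE_PROMPT = """You are evaluating another model's answer.
Question: {question}
Previous answer: {previous_answer}
New answer: {new_answer}

Does the new answer contradict the previous answer or known facts?
Reply with exactly one word, CONSISTENT or INCONSISTENT, then one sentence of justification."""


def call_judge_model(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to your judge LLM and return its reply."""
    raise NotImplementedError("Wire this up to the LLM API of your choice.")


def check_consistency(question: str, previous_answer: str, new_answer: str) -> bool:
    reply = call_judge_model(
        JUDGE_PROMPT.format(
            question=question,
            previous_answer=previous_answer,
            new_answer=new_answer,
        )
    )
    return reply.strip().upper().startswith("CONSISTENT")
```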
Limitations:
- Although using LLMs for evaluation can be more scalable than human evaluation, it still requires significant computational resources, especially for training and running large models.
- They are sensitive to small changes in the response tokens (a single changed word can alter how the output should be judged), which makes them less predictable and less able to assess subtleties in outputs.
- For example, if the sentence to be evaluated is "This policy does not benefit the majority," rewriting it with a touch of irony, such as "This policy surely benefits the majority," may cause the LLM evaluator to miss the intended sarcasm, depending on its training and the context provided.
The right approach?
Here's a concise summary of when to choose between automatic evaluation, human evaluation, and LLM evaluators for assessing LLMs:
- Automatic evaluation: best when you need fast, repeatable, and inexpensive measurements at scale, such as tracking perplexity, BLEU, or ROUGE during development, accepting that nuances of meaning may be missed.
- Human evaluation: best when quality, safety, or user experience is on the line and you need judgments of fluency, coherence, factual accuracy, and relevance, accepting the higher cost and slower turnaround.
- LLM evaluators: best for preliminary assessments, continuous integration checks, and large-scale evaluations where human review is impractical, keeping in mind their sensitivity to wording and their compute cost.
Getting started
LangKit is an open-source text metrics toolkit we designed at WhyLabs to improve the safety and reliability of your LLMs. It helps you identify and mitigate risks across any LLM, including malicious prompts, sensitive data leakage, toxic responses, hallucinations, and jailbreak attempts, giving you a good mix of the evaluation methods covered in this lesson.
Here’s an introductory Colab Notebook to get you started.