LLM Monitoring and Observability
Introduction/overview
Key ideas
- LLMs can be easy to set up but challenging to make reliable for production use. Observability is a critical part of the toolkit for making LLMs reliable.
- Traditional software engineering tools for reliability (unit tests, debugging) are difficult to use with LLMs.
- Prompt engineering changes can significantly impact LLM behavior, making regressions difficult to track. Consider logging responses to a predetermined set of prompts to track LLM performance over time (a minimal sketch follows this list).
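As a rough illustration of that last idea, here is a minimal sketch of logging responses to a fixed ("golden") prompt set over time, so that behavior changes after a prompt or model update are easy to spot. The `call_llm` function and the file path are placeholders, not part of this lesson's tooling:
import json
import time

# Fixed prompts whose responses you want to track across releases
GOLDEN_PROMPTS = ["What is AI?", "Tell me a joke."]

def log_golden_responses(call_llm, path="golden_log.jsonl"):
    # Append one timestamped record per prompt so runs can be compared later
    with open(path, "a") as f:
        for prompt in GOLDEN_PROMPTS:
            record = {
                "timestamp": time.time(),
                "prompt": prompt,
                "response": call_llm(prompt),
            }
            f.write(json.dumps(record) + "\n")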
Building and deploying your LLM is an exciting accomplishment, but it's only the first step in a continuous process. As these models take center stage in production systems—powering search engines, chatbots, and content creation tools—a key challenge arises: How do we effectively monitor, manage, and understand their behavior in real-world applications? This is where LLM observability becomes paramount.
In lesson 1 of this course, you briefly learned about LLM monitoring and observability. In this lesson, we explore the concept further and walk through a practical demonstration. You'll use the GPT2 model since it's lightweight and easy to run without a GPU, but the example works with any of the larger Hugging Face models.
What is LLM observability?
Observability allows you to monitor the real-world behavior of your LLM in production. By capturing data on LLM inputs, outputs, and the surrounding context, you can identify and fix issues.
It's like installing a sophisticated dashboard in your car that gives you real-time insights into your engine's health and any potential issues you might need to address. For LLMs, this includes monitoring:
- Performance: Are your models responding quickly? Are the results accurate and relevant to the input? How does your LLM's performance change over time? Are there signs of degradation you need to address? (A minimal latency-tracking sketch follows this list.)
- Biases: Are there unintended biases in the model's output that might lead to unfair or harmful outcomes? Can you identify problematic patterns?
- Security: Are LLMs protected against adversarial attacks that seek to exploit or manipulate them?
- Reliability: Can LLMs remain robust and deliver predictable results even as input data or environmental conditions change?
- Explainability: Can you understand why the LLM makes certain predictions or generates specific responses?
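To make the performance question above concrete, here is a minimal latency-tracking sketch. The `call_llm` argument is a placeholder for however you invoke your model; it is not part of this lesson's code:
import time

# Collect per-call latencies so degradation over time becomes visible
latencies_ms = []

def timed_call(call_llm, prompt):
    start = time.perf_counter()
    response = call_llm(prompt)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return response

# Periodically summarize to spot drift, for example:
# print(f"mean latency: {sum(latencies_ms) / len(latencies_ms):.1f} ms over {len(latencies_ms)} calls")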
Why does LLM observability matter?
LLMs, like all complex software systems, aren't perfect. Their black-box nature poses significant challenges:
- Hidden risks: Biases, security vulnerabilities, and inaccuracies may lurk hidden within the model, going undetected until it's too late.
- Unexpected behavior: LLMs can generate surprising and sometimes harmful outputs, especially when interacting with real-world data that differs from their training sets.
- Difficulty in troubleshooting and testing: Unit testing these models is not straightforward because minor changes in prompts can significantly alter the LLM's output. And when things go wrong, debugging them can become complex and time-consuming.
By implementing robust LLM observability, you gain the tools and insights to:
- Mitigate risk: Proactively identify and address potential biases, security issues, and performance bottlenecks.
- Improve user experience: Ensure your LLM-powered applications deliver consistent, reliable, and trustworthy results.
- Build Responsible AI: LLM observability is critical to developing ethical and accountable AI systems.
The pillars of LLM observability
Essential pillars of effective LLM observability include:
- Logging and monitoring: Setting up systems to record and track key metrics, including traces (the end-to-end interaction), input data, model predictions, and user feedback (see the trace-record sketch after this list).
- Analyzing embeddings: Understanding how your LLM represents information internally. (Side note: See how WhyLabs allows you to log and analyze embeddings.)
- Evaluation metrics: Defining meaningful metrics to track the performance, fairness, and robustness of your models. (See lesson 6 of course 1 for a detailed understanding of how to evaluate LLMs.)
- Explainability techniques: Using traces and spans (a span is a single operation in the LLM app, with its input and response) to shed light on the inner workings of LLMs, enabling effective debugging.
- Feedback and control mechanisms: Implementing ways for users and developers to provide feedback and guide the LLM's behavior.
- Profiling and debugging tools: Tools and features to log and investigate model behavior in detail.
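As a simple illustration of the logging pillar, here is a sketch of a trace/span record for a single LLM interaction. The field names and the `record_llm_span` helper are hypothetical; in practice you would ship such records to whatever logging or observability backend you already use:
import json
import time
import uuid

def record_llm_span(prompt, response, model_name, user_feedback=None):
    # One span describes one operation (here, a single generate call)
    # within a trace that covers the end-to-end interaction.
    span = {
        "trace_id": str(uuid.uuid4()),
        "span_name": "llm_generate",
        "timestamp": time.time(),
        "model": model_name,
        "input": prompt,
        "output": response,
        "user_feedback": user_feedback,  # e.g. thumbs up/down, if collected
    }
    print(json.dumps(span))  # stand-in for sending to a log sink
    return span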
Getting started with LLM observability
With LLM observability, you can build more dependable and robust applications on top of large language models. There are several tools on the market for tracking LLM behavior.
In this example, we'll show how to generate out-of-the-box text metrics for Hugging Face LLMs using LangKit and monitor them in the WhyLabs Observability Platform, which helps you understand and control the behavior of your LLM application in production.
LangKit can extract relevant signals from unstructured text data, such as text quality and readability, prompt/response relevance, sentiment and toxicity, and security-related signals like jailbreak or prompt-injection patterns.
Preparations
If you do not want to upload the logs (traces and spans) to WhyLabs anonymously, create a free WhyLabs account and generate an API token.
Step 1: Install Hugging Face Transformers & LangKit
Transformers provide access to pre-trained models and a simple interface for various NLP tasks.
%pip install transformers
%pip install 'langkit[all]'
Step 2: Use LangKit to monitor LLMs with any Hugging Face model
Import and initialize the Hugging Face GPT2 model + tokenizer.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
Step 3: Create GPT model function
The GPT model function will take in a prompt and return a dictionary containing the model response and prompt.
def gpt_model(prompt):
    # Encode the prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    # Generate a response (sampling with temperature for varied output)
    output = model.generate(input_ids, max_length=100, temperature=0.8,
                            do_sample=True, pad_token_id=tokenizer.eos_token_id)
    # Decode the output
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    # Combine the prompt and the response into a dictionary
    prompt_and_response = {
        "prompt": prompt,
        "response": response
    }
    return prompt_and_response
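A quick way to sanity-check the function, assuming the cells above have been run (the generated text will vary between runs because sampling is enabled):
result = gpt_model("What is AI?")
print(result["prompt"])
print(result["response"])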
Step 4: Create and inspect language metrics with LangKit
LangKit provides a toolkit of metrics for LLM applications. Let's initialize them and create a profile of the data that can be viewed in WhyLabs for quick analysis.
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why
import pandas as pd
why.init(session_type='whylabs_anonymous')
# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)
# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()
You can choose to run an anonymous session, as the `why.init` call above does, or use an API key to upload the profile to your own WhyLabs account (see the 'Preparations' section for how to create one). If you authenticate, you can also set a specific dataset ID (a model ID that looks like `model-xxxx`, which you can find in your WhyLabs dashboard), or skip that option by hitting Enter when prompted.
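If you would rather not go through interactive prompts, whylogs can pick up WhyLabs credentials from environment variables. The variable names below are the ones whylogs commonly recognizes, but treat this as a sketch and check the WhyLabs documentation for your version; the values are placeholders to replace with your own:
import os

# Placeholders: replace with the values from your WhyLabs account
os.environ["WHYLABS_API_KEY"] = "<your-api-key>"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "<your-org-id>"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-xxxx"  # your dataset/model ID

# Request an authenticated session instead of an anonymous one
why.init(session_type="whylabs")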
Running the `why.init` cell successfully should show output similar to this:
Great! Let’s move to the next step—creating test prompt-response pairs.
Step 5: Create multiple example prompts
Create a few prompts. Run a loop to log prompt/response pairs and send metrics to WhyLabs.
prompts = ["What is AI?",
           "Tell me a joke.",
           "Who won the world series in 2021?"]
for num, prompt in enumerate(prompts):
    prompt_and_response = gpt_model(prompt)
    # Initialize the profile with the LangKit schema on the first prompt
    if num == 0:
        profile = why.log(prompt_and_response, schema=schema).profile()
    # Track subsequent prompt/response pairs in the same profile
    else:
        profile.track(prompt_and_response)
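Before heading to the dashboard, you can optionally inspect the computed LangKit metrics locally as a pandas DataFrame (this is what the display.max_columns setting earlier was for). This assumes the loop above has already run:
# Optional: view the logged metrics locally before exploring them in WhyLabs
metrics_df = profile.view().to_pandas()
metrics_df.head()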
You should get a link to view and explore the profile on WhyLabs. Note: Every generated URL will be slightly different. Click on it to open the WhyLabs Observability Platform:
Here’s what you should see on the dashboard:
WhyLabs provides insights and recommendations to help you better understand how to improve your LLM:
With these insights, you can:
- Guide prompt engineering and fine-tuning efforts for improvement.
- Establish a feedback loop for continuous improvement: fix, deploy, monitor, and iterate.
Observability is essential for building reliable LLM applications. By capturing profiles of your LLM's behavior and using them to guide development, you can improve the long-term reliability of your LLMs.
We hope this is helpful for you! Check out the recommended resources section below to explore this further.