
Monitoring LLM Performance with LangChain and LangKit

Large language models (LLMs) serve as core pillars of today's AI-powered applications. From powering customer support interfaces and generating content to enhancing predictive text and streamlining information retrieval, these models have become a fundamental tool for many organizations.

However, gaining insights and effectively monitoring LLMs in production can be challenging. What kind of prompts are users writing? What kind of sentiment are my responses returning? How does changing my system prompt affect user experience?

In this blog post, we'll dive into the significance of monitoring LLMs and show how to get started with monitoring a LangChain application with LangKit and WhyLabs.

Want to jump straight into the code? Run this example notebook!

Keep reading for:

  • The importance of monitoring for LLMs: why and what to track
  • How to use LangKit with LangChain and OpenAI for LLM monitoring
  • Analyze LLM insights with WhyLabs
  • Monitoring large language models: conclusion

The importance of monitoring for LLMs: why and what to track

Before discussing monitoring large language models with LangChain, it may be useful to establish the importance of ML monitoring in the realm of LLMs.

When performing LLM monitoring, the first step involves identifying the appropriate metrics for tracking. There is a wide range of potential metrics that can assist in evaluating LLM usage and performance, but here are a few important ones to consider:

  • Response Relevance: The degree to which the model's output matches the context and the intent of the input.
  • Sentiment: The emotional tone of user prompts and the tone the model adopts in its responses; monitoring this is critical for customer interaction scenarios.
  • Jailbreak Similarity: How closely a user prompt resembles known jailbreak attempts designed to make the model break its constraints or rules.
  • Topic: Categorizing the user prompts and model responses can help ensure users interact with your LLMs for the intended applications and that responses stay on topic.
  • Toxicity: Checking the prompts and output of the LLMs for harmful, offensive, or unsuitable language can be useful for adding guardrails and detecting unwanted response anomalies.

Keeping track of language metrics as mentioned above can provide insights into the evolution of user prompts and LLM responses over time, as well as the impact of updates to system prompts on user interactions.
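To make the sentiment metric concrete: LangKit's `sentiment_nltk` metric is computed with NLTK's sentiment analyzer, but the underlying idea can be illustrated with a toy lexicon-based scorer. This is an illustration only, not LangKit's actual implementation; the word lists and scoring are made up for the example.

```python
# Toy lexicon-based sentiment score: +1 for each positive word,
# -1 for each negative word, averaged into a score in [-1, 1].
POSITIVE = {"love", "great", "awesome", "beautiful", "helpful", "enjoy"}
NEGATIVE = {"hate", "ugly", "bad", "terrible", "awful"}

def sentiment_score(text: str) -> float:
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    hits = [1 for w in words if w in POSITIVE] + [-1 for w in words if w in NEGATIVE]
    if not hits:
        return 0.0  # neutral: no sentiment-bearing words found
    return sum(hits) / len(hits)

print(sentiment_score("I love this awesome product"))  # positive score
print(sentiment_score("This is bad and ugly"))         # negative score
```

A real analyzer like NLTK's VADER handles negation, intensifiers, and punctuation, but the output is the same shape: a per-text score whose distribution you can track over time.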

In this post we’ll focus on tracking sentiment changes between prompts, but you’ll be able to use the same methods for any metrics relevant to your use case.

How to use LangKit with LangChain and OpenAI for LLM monitoring

We'll show how you can generate and monitor out-of-the-box text metrics using LangKit, LangChain, OpenAI, and the WhyLabs Observability Platform.

  • LangChain is a popular open-source framework for developing applications powered by language models.
  • OpenAI is the company behind the popular GPT text generation models.
  • LangKit is an open-source text metrics toolkit designed for monitoring large language models; it lets you easily extract the telemetry you need from the prompts and responses of any LLM.
  • WhyLabs is a platform for tracking AI telemetry over time, making monitor configuration and team collaboration easy without requiring any additional infrastructure.

For this specific example, we'll pay attention to sentiment change between prompts and responses. Sentiment can be a valuable metric to understand how users interact with your LLM in production and how any system prompts or template updates change responses.

You can also run all the code for this example in a Colab notebook!

Installing LangKit and LangChain

Both LangKit and LangChain can be installed in a Python environment using pip.

pip install langkit[all]==0.0.2
pip install langchain==0.0.205

Set OpenAI and WhyLabs credentials

To send LangKit profiles to WhyLabs we will need three pieces of information:

  • API token
  • Organization ID
  • Dataset ID (or model-id)

Go to WhyLabs and sign up for a free account. You can follow along with the quick start examples or skip them if you'd like to follow this example immediately.

  1. Create a new project and note its ID (if it's a model project, it will look like model-xxxx)
  2. Create an API token from the "Access Tokens" tab
  3. Copy your org ID from the same "Access Tokens" tab

Get your OpenAI API key from your OpenAI account.

Replace the placeholder string values with your own OpenAI and WhyLabs API Keys below:

# Set OpenAI and WhyLabs credentials
import os

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["WHYLABS_API_KEY"] = "<your-whylabs-api-token>"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "<your-org-id>"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "<your-model-id>"

Import LangChain callbacks, OpenAI LLM, and additional language metrics

Learn more about LangKit modules for additional metrics on the LangKit GitHub repository.

from langchain.callbacks import WhyLabsCallbackHandler
from langchain.llms import OpenAI

# Import additional language metrics
import langkit.sentiment
import langkit.topics

Initialize WhyLabs Callback and GPT model with LangChain

Use the LangChain callbacks and LLMs features to initialize the WhyLabs writer and the OpenAI GPT model.

# Initialize WhyLabs Callback & GPT model with LangChain
whylabs = WhyLabsCallbackHandler.from_params()
llm = OpenAI(temperature=0, callbacks=[whylabs])

Generate responses on prompts and close WhyLabs session

This will generate responses to the prompts passed to the LLM and send the profiles containing the text metrics to WhyLabs.

The rolling logger for WhyLabs will write profiles every 5 minutes or when .flush() or .close() is called.

# Generate responses from the LLM
result = llm.generate(
    [
        "I love nature, it's beautiful and amazing!",
        "This product is awesome. I really enjoy it.",
        "Chatting with you has been a great experience! You're very helpful.",
    ]
)

# You don't need to call flush, this will occur periodically, but to demo let's not wait.
whylabs.flush()

That's it! Language metrics about the prompts and model responses are now being tracked in WhyLabs.
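The rolling-logger behavior described above follows a common buffering pattern: accumulate records, write them when a time interval elapses, and allow an explicit flush. Here is a minimal stdlib-only sketch of that pattern; it is an illustration of the concept, not whylogs' actual implementation.

```python
import time

class RollingLogger:
    """Buffers profiles and writes them after `interval` seconds,
    or immediately when flush()/close() is called."""

    def __init__(self, interval: float = 300.0):  # 300 s = 5 minutes
        self.interval = interval
        self.buffer = []       # profiles waiting to be written
        self.written = []      # stand-in for "uploaded to the platform"
        self._last_write = time.monotonic()

    def log(self, profile):
        self.buffer.append(profile)
        # Time-based rollover: write once the interval has elapsed.
        if time.monotonic() - self._last_write >= self.interval:
            self.flush()

    def flush(self):
        # Write everything buffered so far and reset the timer.
        self.written.extend(self.buffer)
        self.buffer = []
        self._last_write = time.monotonic()

    def close(self):
        self.flush()

logger = RollingLogger(interval=300)
logger.log({"prompt.sentiment_nltk": 0.8})
logger.flush()  # force the write instead of waiting 5 minutes
```

This is why calling `.flush()` in the demo shows results immediately: without it, the profile would simply sit in the buffer until the next scheduled write.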

Analyze LLM insights with WhyLabs

Once at least one LangKit profile has been uploaded, navigate to the profile tab and click on "View details" over the prompt.sentiment_nltk metric to see the distribution of sentiment scores for the prompt.

In this example, all the prompts have a positive sentiment score of 80+.

Click on the "Show Insights" button to see further insights about LLM metrics for prompts and responses. This will update as more profiles are uploaded. Try running the negative prompts below and see what happens!

As more profiles are written on different dates, you'll get a time series pattern you can analyze and set monitors to send alerts when drift or anomalies are detected. See an example in the Demo environment.

You can also backfill batches of data by overwriting the date and time as seen in this example.

Watch the sentiment value change from negative prompts

After inspecting the results in WhyLabs, try changing your prompts to trigger a change in the metric you're monitoring, such as prompt sentiment.

# Initialize WhyLabs Callback & GPT with LangChain
whylabs = WhyLabsCallbackHandler.from_params()
llm = OpenAI(temperature=0, callbacks=[whylabs])

result = llm.generate(
    [
        "I hate nature, it's ugly.",
        "This product is bad. I hate it.",
        "Chatting with you has been a terrible experience!",
        "I'm terrible at saving money, can you give me advice?",
    ]
)

# Close the WhyLabs session
whylabs.close()

Taking another look at the LLM insights in WhyLabs, we see the prompts are flagged for negative sentiment and the response contains a phone number. We never gave the GPT model a phone number to provide, so it completely made one up!

Viewing the histogram results in WhyLabs again, you can see the sentiment distribution shift from only positive prompts to a mix of negative and positive ones.

We can configure monitors to alert us automatically when the sentiment value changes in the monitor manager tab.
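Such a monitor essentially compares a batch statistic against a baseline and fires when the difference crosses a threshold. Below is a minimal hypothetical sketch of that logic applied to mean prompt sentiment; the function name and threshold are illustrative and are not WhyLabs' monitor API.

```python
from statistics import mean

def sentiment_alert(scores, baseline_mean, max_drop=0.3):
    """Return True if the batch's mean sentiment has dropped more than
    `max_drop` below the baseline (a simple drift signal)."""
    return baseline_mean - mean(scores) > max_drop

baseline = 0.85                      # mean sentiment of the earlier, positive batch
new_batch = [-0.6, -0.8, -0.7, 0.2]  # mostly negative prompts
print(sentiment_alert(new_batch, baseline))  # True -> fire an alert
```

In practice the platform handles the baselining and scheduling for you; the point is that any tracked metric with a time series behind it can drive this kind of threshold or drift alert.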

In this example, we've seen how you can use LangKit to extract and monitor sentiment from unstructured text data. You can also use LangKit to extract and monitor other relevant signals from text data.

Monitoring Large Language Models (LLMs) conclusion

In conclusion, monitoring large language models is an important part of ensuring LLM performance and relevance in production.

Choosing the right metrics is key to understanding model performance and detecting issues early. Relevance, sentiment, jailbreak similarity, topic, and toxicity are just some of the metrics that can be monitored to ensure a model is performing as expected.

Implementing LLM monitoring into LangChain applications with LangKit and WhyLabs makes extracting and monitoring the AI telemetry easy!

Learn more about monitoring large language models in production.
