
Hugging Face and LangKit: Your Solution for LLM Observability

Hugging Face has quickly become a leading name in natural language processing (NLP), with its open-source library becoming the go-to resource for developers and researchers alike. As more organizations turn to Hugging Face's language models for their NLP needs, the need for robust monitoring and observability solutions becomes apparent. That’s where WhyLabs comes in: by pairing LangKit with Hugging Face LLMs, you gain insight into model behavior and can implement guardrails, evaluations, and observability.

This blog will detail the steps to generate out-of-the-box text metrics for Hugging Face LLMs using LangKit and how to easily monitor them in the WhyLabs Observability Platform.

LangKit is an open-source text metrics toolkit for monitoring large language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library whylogs.

Out-of-the-box LLM metrics include:

  • Text quality
  • Text relevance
  • Security and privacy
  • Sentiment and toxicity

LangKit can be used with any LLM

We'll use the GPT2 model for this post since it's lightweight and easy to run without a GPU, but any of the larger Hugging Face models, such as Falcon and Llama2, can be used by swapping out the model and tokenizer.

Install Hugging Face transformers and LangKit

Both the Hugging Face transformers library and LangKit can be installed into your Python environment using pip.

pip install transformers
pip install 'langkit[all]'

Import and initialize the Hugging Face GPT2 model + tokenizer

We’ll start by importing and initializing the GPT2 model and tokenizer. Again, you can swap these out for your desired large language model from the Hugging Face transformers library.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

Create GPT model function

Let’s build a simple function to take in a prompt and return a dictionary containing the input prompt and LLM response.

def gpt_model(prompt):

  # Encode the prompt
  input_ids = tokenizer.encode(prompt, return_tensors='pt')

  # Generate a response
  output = model.generate(input_ids, max_length=100, temperature=0.8,
                          do_sample=True, pad_token_id=tokenizer.eos_token_id)

  # Decode the output
  response = tokenizer.decode(output[0], skip_special_tokens=True)

  # Combine the prompt and the output into a dictionary
  prompt_and_response = {
      "prompt": prompt,
      "response": response
  }

  return prompt_and_response

Generate an example response to inspect, and to extract language metrics from with LangKit, by calling the gpt_model function with a prompt string.

prompt_and_response = gpt_model("Tell me a story about a cute dog")

This will return a Python dictionary similar to this:

{'prompt': 'Tell me a story about a cute dog', 'response': "Tell me a story about a cute dog, and I'll tell you something about it. If you ever need a companion to give a good story, tell me about that little dog that got me to write my first novel..."}

Note: We can see why people were not blown away by the gpt2 model. 😂

You can extract language metrics for any LLM with LangKit once you have prompt and responses in this format!
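Since that format is just a plain dictionary, any model, hosted or local, can be adapted with a thin wrapper. Here's a minimal stdlib sketch (the `to_langkit_record` helper and the uppercase "model" are illustrative, not part of LangKit):

```python
def to_langkit_record(prompt, generate):
    """Wrap any text-generation callable into the {prompt, response} dict LangKit expects."""
    return {"prompt": prompt, "response": generate(prompt)}

# A stand-in "model" that just echoes the prompt in uppercase
record = to_langkit_record("Tell me a story", lambda p: p.upper())
print(record)  # {'prompt': 'Tell me a story', 'response': 'TELL ME A STORY'}
```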

Create and inspect language metrics with LangKit

LangKit provides a toolkit of metrics for LLM applications; let's initialize them and create a profile of the data that can be viewed in WhyLabs for quick analysis.

See a full list of metrics here.

from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()

Let's look at the language metrics for our `prompt_and_response` dictionary created from the gpt_model function.

We’ll create a whylogs profile using the LLM metrics schema. Profiles in whylogs contain only summary statistics about the dataset; by default, no raw data is stored. We’ll see what this looks like for LLM metrics in a second!
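To make the "summary statistics, not raw data" idea concrete, here is a toy illustration in plain Python (not whylogs internals): only aggregates survive, so the original text never has to leave your environment.

```python
import statistics

responses = [
    "Tell me a story about a cute dog, and I'll tell you something about it.",
    "Sure! Once upon a time...",
    "I don't know.",
]

# A profile-like summary: aggregates only, no raw text retained
summary = {
    "count": len(responses),
    "mean_chars": statistics.mean(len(r) for r in responses),
    "max_chars": max(len(r) for r in responses),
}
print(summary)
```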

# Let's look at metrics from our prompt_and_response created above
profile = why.log(prompt_and_response, name="HF prompt & response", schema=schema)

Click the link generated by the code above. You’ll be able to view language metrics and insights generated about the LLM prompts and responses. See an example here.

Viewing LLM metrics in WhyLabs

We can also interact with and view our language metrics directly in our Python environment by viewing our profile as a pandas DataFrame.

This data can be used in real time to make decisions about prompts and responses, such as setting guardrails on your large language model, or stored in an AI observability platform for monitoring LLMs over time.
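As a sketch of how such a real-time decision might look, here is a simple threshold check over extracted metric values (the metric names and threshold values here are illustrative assumptions, not prescribed by LangKit):

```python
def passes_guardrails(metrics, max_toxicity=0.8, min_sentiment=-0.5):
    """Return True when a prompt/response's metric values fall inside allowed bounds."""
    return (metrics.get("response.toxicity", 0.0) <= max_toxicity
            and metrics.get("response.sentiment", 0.0) >= min_sentiment)

print(passes_guardrails({"response.toxicity": 0.1, "response.sentiment": 0.6}))   # True
print(passes_guardrails({"response.toxicity": 0.95, "response.sentiment": 0.6}))  # False
```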

# view the profile as a pandas DataFrame
profview = profile.view().to_pandas()

Learn more about using validators in LangKit in this blog post.  

ML Monitoring for Hugging Face LLMs in WhyLabs

To send LangKit profiles to WhyLabs we will need three pieces of information:

  • WhyLabs API token
  • Organization ID
  • Dataset ID (or model-id)

Go to WhyLabs and sign up for a free account. You can follow along with the quick start examples or skip them if you'd like to follow this example immediately.

  1. Create a new project and note its ID (if it's a model project, it will look like `model-xxxx`)
  2. Create an API token from the "Access Tokens" tab
  3. Copy your org ID from the same "Access Tokens" tab

Replace the placeholder string values with your own WhyLabs API key and project IDs below:

import os

# set authentication & project keys
os.environ["WHYLABS_API_KEY"] = 'APIKEY'
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'ORGID'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'MODELID'

We’ll import the WhyLabs writer, LangKit, and whylogs, and initialize the LLM metrics.

from whylogs.api.writer.whylabs import WhyLabsWriter
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()

We’ll initialize the WhyLabs writer as `telemetry_agent`, profile the `prompt_and_response` dictionary with the LLM metrics as the schema, and call write on the telemetry agent to push the data to the WhyLabs platform.

# Single Profile
telemetry_agent = WhyLabsWriter()
profile = why.log(prompt_and_response, schema=schema)

# push the profile to WhyLabs
telemetry_agent.write(file=profile.view())

As more profiles are written on different dates, you'll build up a time series you can analyze and configure monitoring for. See an example of what this looks like in this WhyLabs Demo project.

Anomaly detected for prompt sentiment

You can also backfill batches of data by overwriting the date and time as seen in this example. Or keep reading and run the code below!

Look at your existing LLM data in a time series view

Let’s log LLM data for the previous seven days to simulate a model being in production for a week. This can also be useful in general to backfill profiles in the AI observability platform.

Start by creating a list of a week's worth of prompts with three prompts per day.

prompt_lists = [
    ["How can I create a new account?", "Great job to the team", "Fantastic product, had a good experience"],
    ["This product made me angry, can I return it", "You dumb and smell bad", "I hated the experience, and I was over charged"],
    ["This seems amazing, could you share the pricing?", "Incredible site, could we setup a call?", "Hello! Can you kindly guide me through the documentation?"],
    ["This looks impressive, could you provide some information on the cost?", "Stunning platform, can we arrange a chat?", "Hello there! Could you assist me with the documentation?"],
    ["This looks remarkable, could you tell me the price range?", "Fantastic webpage, is it possible to organize a call?", "Greetings! Can you help me with the relevant documents?"],
    ["This is great, I love it, could you inform me about the charges?", "love the interface, can we have a teleconference?", "Hello! Can I take a look at the user manuals?"],
    ["This seems fantastic, how much does it cost?", "Excellent website, can we setup a call?", "Hello! Could you help me find the resource documents?"]
]

We’ll upload the language metrics much as we did for a single profile, except we’ll loop through the list and adjust the datetime for each day's prompts. By default, when you upload a profile to WhyLabs, the timestamp will be the time of writing.

import datetime

telemetry_agent = WhyLabsWriter()
all_prompts_and_responses = []  # This list will store all the prompts and responses.

for i, day in enumerate(prompt_lists):
  # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
  dt = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=i)
  for prompt in day:
    prompt_and_response = gpt_model(prompt)
    profile = why.log(prompt_and_response, schema=schema).profile()

    # Save the prompt and its response in the list.
    all_prompts_and_responses.append(prompt_and_response)

    # set the dataset timestamp for the profile
    profile.set_dataset_timestamp(dt)

    # write the profile to WhyLabs
    telemetry_agent.write(file=profile.view())

Once profiles are written to WhyLabs they can be inspected, compared, and monitored for data quality and data drift. Visit your WhyLabs project and you’ll see the profiles logged. To see a time series view of a metric navigate to the inputs tab and select one.

Time series view of language metrics in WhyLabs

Set monitors for data drift and anomaly detection on LLM metrics

Now we can enable a pre-configured monitor with just one click (or create a custom one) to detect anomalies in our data profiles. This makes it easy to set up common monitoring tasks, such as detecting data drift, data quality issues, and model performance degradation.
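Under the hood, a drift monitor boils down to comparing a baseline window of a metric against the current window. A deliberately simplified version of that idea in plain Python (not WhyLabs' actual detection algorithm):

```python
def mean(values):
    return sum(values) / len(values)

def drift_detected(baseline, current, threshold=0.25):
    """Flag when a metric's mean shifts by more than `threshold` vs. the baseline window."""
    return abs(mean(current) - mean(baseline)) > threshold

# e.g. daily mean sentiment of responses: stable baseline, then a sudden drop
baseline_sentiment = [0.60, 0.55, 0.62, 0.58]
todays_sentiment = [0.10, 0.05, 0.20]
print(drift_detected(baseline_sentiment, todays_sentiment))  # True
```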

Select a preset monitor for LLM metrics

Once a monitor is configured, it can be previewed while viewing a feature in the input tab.

Alert triggered for LLM metrics from configured monitor

When anomalies are detected, notifications can be sent via email, Slack, or PagerDuty. Set notification preferences in Settings > Global Notifications Actions.

Configure notifications and workflow triggers

That’s it! We have gone through all the steps needed to ingest data from anywhere in ML pipelines and get notified if anomalies occur.

Optional: Use a Rolling Logger

A rolling logger can be used instead of the method above to merge profiles together and write them to the AI observability platform at predefined intervals. This is a common way for LangKit and whylogs to be used in production deployments. Read more about the whylogs rolling logger in the docs.

telemetry_agent = why.logger(mode="rolling", interval=5, when="M",
                             schema=schema, base_name="huggingface")
telemetry_agent.append_writer("whylabs")

# Log data + model outputs to WhyLabs
telemetry_agent.log(prompt_and_response)

# Close the whylogs rolling logger when the service is shut down
telemetry_agent.close()

Key takeaways for monitoring Hugging Face LLMs

Given the capabilities of LLMs, it's essential to monitor them continuously to ensure they perform effectively and as expected. By closely monitoring LLMs, any irregularities or potential issues can be identified early on, allowing for timely adjustments and necessary improvements.

Keeping an eye on various metrics, such as text quality, response relevance, sentiment, and toxicity, can help identify how model performance and user interaction are changing over time.

Ready to start monitoring your Hugging Face LLMs with LangKit? Head over to the GitHub repository and sign up for a free WhyLabs account!

Learn more about monitoring large language models in production:

We recently held a workshop designed to equip you with the knowledge and skills to use LangKit with Hugging Face models. Check out the recording to be guided through what we covered in this blog post!
