Hugging Face and LangKit: Your Solution for LLM Observability
- LLMs
- Integrations
- LLM Security
- LangKit
- Generative AI
Jul 26, 2023
Hugging Face has quickly become a leading name in the world of natural language processing (NLP), with its open-source library becoming the go-to resource for developers and researchers alike. As more organizations adopt Hugging Face's language models, the need for robust monitoring and observability becomes more apparent. That’s where WhyLabs comes in: by pairing LangKit with Hugging Face LLMs, you gain insight into model behavior that you can use to implement guardrails, evaluations, and observability.
This blog will detail the steps to generate out-of-the-box text metrics for Hugging Face LLMs using LangKit and how to easily monitor them in the WhyLabs Observability Platform.
LangKit is an open-source text metrics toolkit for monitoring large language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library whylogs.
Out-of-the-box LLM metrics include:
- Text quality
- Text relevance
- Security and privacy
- Sentiment and toxicity
We'll use the GPT2 model for this post since it's lightweight and easy to run without a GPU, but any of the larger Hugging Face models, such as Falcon and Llama2, can be used by swapping out the model and tokenizer.
Install Hugging Face transformers and LangKit
Both the Hugging Face transformers library and LangKit can be installed into your Python environment using pip.
pip install transformers
pip install 'langkit[all]'
Import and initialize the Hugging Face GPT2 model + tokenizer
We’ll start by importing and initializing the GPT2 model and tokenizer. Again, you can swap these out for your desired large language model from the Hugging Face transformers library.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
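If you want to try a larger model, the Auto classes make the swap straightforward. Here’s a minimal sketch that assumes the Falcon 7B checkpoint (`tiiuae/falcon-7b`) as the replacement; any causal LM on the Hugging Face Hub works the same way, though larger models typically need a GPU:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Sketch: load an alternative causal LM from the Hugging Face Hub.
# "tiiuae/falcon-7b" is an illustrative choice; substitute any checkpoint you prefer.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")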
Create GPT model function
Let’s build a simple function to take in a prompt and return a dictionary containing the input prompt and LLM response.
def gpt_model(prompt):
    # Encode the prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    # Generate a response
    output = model.generate(input_ids, max_length=100, temperature=0.8,
                            do_sample=True, pad_token_id=tokenizer.eos_token_id)
    # Decode the output
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    # Combine the prompt and the output into a dictionary
    prompt_and_response = {
        "prompt": prompt,
        "response": response
    }
    return prompt_and_response
Generate an example response by calling the gpt_model function with a prompt string; we’ll use it to inspect and extract language metrics with LangKit.
prompt_and_response = gpt_model("Tell me a story about a cute dog")
print(prompt_and_response)
This will return a Python dictionary similar to this:
{'prompt': 'Tell me a story about a cute dog', 'response': "Tell me a story about a cute dog, and I'll tell you something about it.If you ever need a companion to give a good story, tell me about that little dog that got me to write my first novel..."}
Note: We can see why people were not blown away by the gpt2 model. 😂
You can extract language metrics for any LLM with LangKit once you have prompts and responses in this format!
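For instance, here’s a minimal sketch of wrapping a Hugging Face text-generation pipeline in the same dictionary shape (the pipeline settings here are illustrative assumptions, not part of the original example):
from transformers import pipeline
# Sketch: any generation backend can be wrapped into the
# {"prompt": ..., "response": ...} shape used above.
generator = pipeline("text-generation", model="gpt2")

def any_model(prompt):
    response = generator(prompt, max_new_tokens=50)[0]["generated_text"]
    return {"prompt": prompt, "response": response}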
Create and inspect language metrics with LangKit
LangKit provides a toolkit of metrics for LLM applications; let's initialize them and create a profile of the data that can be viewed in WhyLabs for quick analysis.
See a full list of metrics here.
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why
why.init(session_type='whylabs_anonymous')
# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()
Let's look at the language metrics for our `prompt_and_response` dictionary created from the gpt_model function.
We’ll create a whylogs profile using the LLM metrics schema. Profiles in whylogs contain only summary statistics about the dataset; by default, no raw data is stored. We’ll see what this looks like for LLM metrics in a second!
# Let's look at metrics from our prompt_and_response created above
profile = why.log(prompt_and_response, name="HF prompt & response", schema=schema)
Click the link generated from the above code. You’ll be able to view the language metrics and the insights generated about the LLM prompts and responses. See an example here.
We can also interact with and view our language metrics directly in our Python environment by viewing our profile as a pandas DataFrame.
This data can be used in real time to make decisions about prompts and responses, such as setting guardrails on your large language model, or stored in an AI observability platform for monitoring LLMs over time.
profview = profile.view()
profview.to_pandas()
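As a rough sketch, the same metrics could back a simple guardrail check; the metric name (`response.toxicity`) and the 0.8 threshold below are illustrative assumptions rather than part of the original example:
# Sketch: flag a response when a summarized toxicity score crosses a threshold.
# "response.toxicity" and 0.8 are assumptions for illustration.
metrics_df = profview.to_pandas()
toxicity_score = metrics_df.loc["response.toxicity", "distribution/mean"]
if toxicity_score > 0.8:
    print("Response flagged by toxicity guardrail")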
Learn more about using validators in LangKit in this blog post.
ML Monitoring for Hugging Face LLMs in WhyLabs
To send LangKit profiles to WhyLabs we will need three pieces of information:
- WhyLabs API token
- Organization ID
- Dataset ID (or model-id)
Sign up for a free account. You can follow along with the quick start examples or skip them if you'd like to follow this example immediately.
- Create a new project and note its ID (if it's a model project, it will look like `model-xxxx`)
- Create an API token from the "Access Tokens" tab
- Copy your org ID from the same "Access Tokens" tab
Replace the placeholder string values below with your own WhyLabs org ID, API key, and dataset ID:
import os
# set authentication & project keys
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'ORGID'
os.environ["WHYLABS_API_KEY"] = 'APIKEY'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'MODELID'
We’ll import the WhyLabs writer, LangKit, and whylogs. We’ll also initialize the LLM metrics.
from whylogs.api.writer.whylabs import WhyLabsWriter
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why
# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()
We’ll initialize the WhyLabs writer as `telemetry_agent`, profile the `prompt_and_response` dictionary with the LLM metrics as the schema, and call write on the telemetry agent to push the data to the WhyLabs platform.
# Single Profile
telemetry_agent = WhyLabsWriter()
profile = why.log(prompt_and_response, schema=schema)
telemetry_agent.write(profile.view())
As more profiles are written on different dates, you'll build a time series you can analyze and configure monitoring for. See an example of what this looks like in this WhyLabs Demo project.
You can also backfill batches of data by overwriting the date and time as seen in this example. Or keep reading and run the code below!
Look at your existing LLM data in a time series view
Let’s log LLM data for the previous seven days to simulate a model being in production for a week. This can also be useful in general to backfill profiles in the AI observability platform.
Start by creating a list of a week's worth of prompts with three prompts per day.
prompt_lists = [
    ["How can I create a new account?", "Great job to the team", "Fantastic product, had a good experience"],
    ["This product made me angry, can I return it", "You dumb and smell bad", "I hated the experience, and I was over charged"],
    ["This seems amazing, could you share the pricing?", "Incredible site, could we setup a call?", "Hello! Can you kindly guide me through the documentation?"],
    ["This looks impressive, could you provide some information on the cost?", "Stunning platform, can we arrange a chat?", "Hello there! Could you assist me with the documentation?"],
    ["This looks remarkable, could you tell me the price range?", "Fantastic webpage, is it possible to organize a call?", "Greetings! Can you help me with the relevant documents?"],
    ["This is great, I love it, could you inform me about the charges?", "Love the interface, can we have a teleconference?", "Hello! Can I take a look at the user manuals?"],
    ["This seems fantastic, how much does it cost?", "Excellent website, can we setup a call?", "Hello! Could you help me find the resource documents?"]
]
We’ll upload the language metrics similarly to how we did previously for a single profile, except we’ll loop through the list and adjust the datetime for each day's prompts. By default, when you upload a profile to WhyLabs, the timestamp will be the time of writing.
import datetime
telemetry_agent = WhyLabsWriter()
all_prompts_and_responses = []  # This list will store all the prompts and responses.
for i, day in enumerate(prompt_lists):
    # Walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs.
    dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
    for prompt in day:
        prompt_and_response = gpt_model(prompt)
        profile = why.log(prompt_and_response, schema=schema)
        # Save the prompt and its response in the list.
        all_prompts_and_responses.append(prompt_and_response)
        # Set the dataset timestamp for the profile
        profile.set_dataset_timestamp(dt)
        telemetry_agent.write(profile.view())
Once profiles are written to WhyLabs, they can be inspected, compared, and monitored for data quality and data drift. Visit your WhyLabs project and you’ll see the profiles logged. To see a time series view of a metric, navigate to the inputs tab and select one.
Set monitors for data drift and anomaly detection on LLM metrics
Now we can enable a pre-configured monitor with just one click (or create a custom one) to detect anomalies in our data profiles. This makes it easy to set up common monitoring tasks, such as detecting data drift, data quality issues, and changes in model performance.
Once a monitor is configured, it can be previewed while viewing a feature in the input tab.
When anomalies are detected, notifications can be sent via email, Slack, or PagerDuty. Set notification preferences in Settings > Global Notifications Actions.
That’s it! We have gone through all the steps needed to ingest data from anywhere in ML pipelines and get notified if anomalies occur.
Optional: Use a Rolling Logger
A rolling logger can be used instead of the method above to merge profiles together and write them to the AI observatory at predefined intervals. This is a common way for LangKit and whylogs to be used in production deployments. Read more about the whylogs rolling logger in the docs.
telemetry_agent = why.logger(mode="rolling", interval=5, when="M", schema=schema, base_name="huggingface")
telemetry_agent.append_writer("whylabs")
# Log data + model outputs to WhyLabs.ai
telemetry_agent.log(prompt_and_response)
# Close the whylogs rolling logger when the service is shut down
telemetry_agent.close()
Key takeaways for monitoring Hugging Face LLMs
Given the capabilities of LLMs, it's essential to monitor them continuously to ensure they perform effectively and as expected. By closely monitoring LLMs, any irregularities or potential issues can be identified early on, allowing for timely adjustments and necessary improvements.
Keeping an eye on metrics such as text quality, response relevance, sentiment, and toxicity can help identify how model performance and user interaction change over time.
Ready to start monitoring your Hugging Face LLMs with LangKit? Head over to the GitHub repository and sign up for a free WhyLabs account!
Learn more about monitoring large language models in production:
- Intro to LangKit example
- LangKit GitHub
- whylogs GitHub - data logging & AI telemetry
- WhyLabs - Safeguard your Large Language Models
We recently held a workshop designed to equip you with the knowledge and skills to use LangKit with Hugging Face models. Check out the recording to be guided through what we covered in this blog post!