Hugging Face and LangKit: Your Solution for LLM Observability
Jul 26, 2023
Hugging Face has quickly become a leading name in the world of natural language processing (NLP), with its open-source library becoming the go-to resource for developers and researchers alike. As more organizations turn to Hugging Face's language models for their NLP needs, the need for robust monitoring and observability solutions becomes more apparent. That’s where WhyLabs comes in: by pairing LangKit with Hugging Face LLMs, you gain insight into model behavior and can implement guardrails, evaluations, and observability.
This blog will detail the steps to generate out-of-the-box text metrics for Hugging Face LLMs using LangKit and how to easily monitor them in the WhyLabs Observability Platform.
LangKit is an open-source text metrics toolkit for monitoring large language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library whylogs.
Out-of-the-box LLM metrics include text quality, response relevance, sentiment, and toxicity.
We'll use the GPT2 model for this post since it's lightweight and easy to run without a GPU, but any of the larger Hugging Face models, such as Falcon and Llama2, can be used by swapping out the model and tokenizer.
Install Hugging Face transformers and LangKit
Both the Hugging Face transformers library and LangKit can be installed into your Python environment using pip.
pip install transformers
pip install 'langkit[all]'
Import and initialize the Hugging Face GPT2 model + tokenizer
We’ll start by importing and initializing the GPT2 model and tokenizer. Again, you can swap these out for your desired large language model in the Hugging Face transformers library.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
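If you expect to swap models often, one option is the generic `Auto*` classes in transformers, which pick the right architecture from the checkpoint name. The sketch below is only an illustration: `load_causal_lm` and `DEFAULT_MODEL_NAME` are hypothetical names that do not appear in the original post, and it assumes the checkpoint is a causal language model available on the Hugging Face Hub.

```python
DEFAULT_MODEL_NAME = "gpt2"  # same checkpoint used in this post


def load_causal_lm(model_name=DEFAULT_MODEL_NAME):
    """Return (tokenizer, model) for a causal LM checkpoint on the Hugging Face Hub."""
    # Imported lazily so that defining this helper does not trigger the
    # transformers import until a model is actually loaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model
```

Calling `tokenizer, model = load_causal_lm()` should give objects that work with the rest of this walkthrough. Larger checkpoints will likely need a GPU, and gated models such as Llama 2 require accepting a license and authenticating first.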
Create GPT model function
Let’s build a simple function to take in a prompt and return a dictionary containing the input prompt and LLM response.
def gpt_model(prompt):
    # Encode the prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    # Generate a response
    output = model.generate(input_ids, max_length=100, temperature=0.8,
                            do_sample=True, pad_token_id=tokenizer.eos_token_id)
    # Decode the output
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    # Combine the prompt and the output into a dictionary
    prompt_and_response = {
        "prompt": prompt,
        "response": response
    }
    return prompt_and_response
Call the gpt_model function with a string to generate an example response that we can inspect and extract language metrics from with LangKit.
prompt_and_response = gpt_model("Tell me a story about a cute dog")
print(prompt_and_response)
This will return a Python dictionary similar to this:
{'prompt': 'Tell me a story about a cute dog', 'response': "Tell me a story about a cute dog, and I'll tell you something about it.If you ever need a companion to give a good story, tell me about that little dog that got me to write my first novel..."}
Note: We can see why people were not blown away by the gpt2 model. 😂
You can extract language metrics for any LLM with LangKit once you have prompts and responses in this format!
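Notice that the example response begins with the prompt itself: for a decoder-only model like GPT2, the decoded generation includes the input tokens. If you would rather have response metrics reflect only the newly generated text, one option is to drop the echoed prompt before logging. The helpers below are an illustrative sketch (`strip_prompt_echo` and `to_prompt_and_response` are hypothetical names, not part of LangKit or the original post).

```python
def strip_prompt_echo(prompt, response):
    """Remove a leading copy of the prompt from a decoded response, if present."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response


def to_prompt_and_response(prompt, generated_text):
    """Build the {"prompt": ..., "response": ...} dict used for logging."""
    return {"prompt": prompt, "response": strip_prompt_echo(prompt, generated_text)}


example = to_prompt_and_response(
    "Tell me a story about a cute dog",
    "Tell me a story about a cute dog, and I'll tell you something about it.",
)
print(example["response"])
```

Whether to strip the echo is a design choice: keeping it matches what the model literally returned, while removing it keeps prompt text from influencing response-side metrics.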
Create and inspect language metrics with LangKit
LangKit provides a toolkit of metrics for LLM applications; let’s initialize them and create a profile of the data that can be viewed in WhyLabs for quick analysis.
See a full list of metrics here.
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why
why.init(session_type='whylabs_anonymous')
# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()
Let's look at the language metrics for our `prompt_and_response` dictionary created from the gpt_model function.
We’ll create a whylogs profile using the LLM metrics schema. Profiles in whylogs contain only summary statistics about the dataset; by default, no raw data is stored. We’ll see what this looks like for LLM metrics in a second!
# Let's look at metrics from our prompt_and_response created above
profile = why.log(prompt_and_response, name="HF prompt & response", schema=schema)
Click the link generated by the code above. You’ll be able to view language metrics and insights generated about the LLM prompts and responses. See an example here.
We can also interact and view our language metrics directly in our Python environment by viewing our profile as a pandas data frame.
This data can be used in real time to make decisions about prompts and responses, such as setting guardrails on your large language model, or it can be stored in an AI observability platform to monitor LLMs over time.
profview = profile.view()
profview.to_pandas()
Learn more about using validators in LangKit in this blog post.
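To make the guardrail idea concrete, here is a minimal sketch of a threshold check on per-response metric values. It is not LangKit's validator API: `check_guardrails` is a hypothetical helper, and the metric names and numbers are assumptions for illustration. Check the index of `profview.to_pandas()` for the metric names your profile actually contains.

```python
def check_guardrails(metrics, max_limits):
    """Return the names of metrics whose value exceeds its allowed maximum.

    metrics:    mapping of metric name -> numeric value for one prompt/response pair
    max_limits: mapping of metric name -> largest acceptable value
    Metrics that are missing from `metrics` are skipped rather than flagged.
    """
    violations = []
    for name, limit in max_limits.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            violations.append(name)
    return violations


# Hypothetical values and thresholds, for illustration only.
example_metrics = {"response.toxicity": 0.91, "prompt.toxicity": 0.02}
limits = {"response.toxicity": 0.5, "prompt.toxicity": 0.5}

flagged = check_guardrails(example_metrics, limits)
if flagged:
    print("Blocked response; metrics over limit:", flagged)
```

In an application, a non-empty result could be used to withhold the response or fall back to a canned message.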
ML Monitoring for Hugging Face LLMs in WhyLabs
To send LangKit profiles to WhyLabs we will need three pieces of information:
- WhyLabs API token
- Organization ID
- Dataset ID (or model-id)
Sign up for a free account. You can follow along with the quick start examples or skip them if you'd like to follow this example immediately.
- Create a new project and note its ID (if it's a model project, it will look like `model-xxxx`)
- Create an API token from the "Access Tokens" tab
- Copy your org ID from the same "Access Tokens" tab
Replace the placeholder string values below with your own WhyLabs org ID, API key, and dataset ID:
import os
# set authentication & project keys
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'ORGID'
os.environ["WHYLABS_API_KEY"] = 'APIKEY'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'MODELID'
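Before writing profiles, it can help to confirm that all three values are set and are no longer the placeholders. The check below is a small sketch using only the standard library; `unset_whylabs_settings` is a hypothetical helper, not part of whylogs.

```python
import os

REQUIRED_SETTINGS = (
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_API_KEY",
    "WHYLABS_DEFAULT_DATASET_ID",
)
# The placeholder strings used in this post; real values should differ.
PLACEHOLDERS = {"ORGID", "APIKEY", "MODELID"}


def unset_whylabs_settings(env=None):
    """Return the required variable names that are missing, empty, or still placeholders."""
    env = os.environ if env is None else env
    return [
        name
        for name in REQUIRED_SETTINGS
        if not env.get(name) or env.get(name) in PLACEHOLDERS
    ]


problems = unset_whylabs_settings()
if problems:
    print("Set these before writing to WhyLabs:", ", ".join(problems))
```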
We’ll import the WhyLabs writer, LangKit, and whylogs. We’ll also initialize the LLM metrics.
from whylogs.api.writer.whylabs import WhyLabsWriter
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why
# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()
We’ll initialize the WhyLabs writer as `telemetry_agent`, profile the `prompt_and_response` dictionary with the LLM metrics as the schema, and call write on the telemetry agent to push the data to the WhyLabs platform.
# Single Profile
telemetry_agent = WhyLabsWriter()
profile = why.log(prompt_and_response, schema=schema)
telemetry_agent.write(profile.view())
As more profiles are written on different dates, you'll get a time series that you can analyze and configure monitors on. See an example of what this looks like in this WhyLabs Demo project.
You can also backfill batches of data by overwriting the date and time as seen in this example. Or keep reading and run the code below!
Look at your existing LLM data in a time series view
Let’s log LLM data for the previous seven days to simulate a model being in production for a week. This can also be useful in general to backfill profiles in the AI observability platform.
Start by creating a list of a week's worth of prompts with three prompts per day.
prompt_lists = [
["How can I create a new account?", "Great job to the team", "Fantastic product, had a good experience"],
["This product made me angry, can I return it", "You dumb and smell bad", "I hated the experience, and I was over charged"],
["This seems amazing, could you share the pricing?", "Incredible site, could we setup a call?", "Hello! Can you kindly guide me through the documentation?"],
["This looks impressive, could you provide some information on the cost?", "Stunning platform, can we arrange a chat?", "Hello there! Could you assist me with the documentation?"],
["This looks remarkable, could you tell me the price range?", "Fantastic webpage, is it possible to organize a call?", "Greetings! Can you help me with the relevant documents?"],
["This is great, Ilove it, could you inform me about the charges?", "love the interface, can we have a teleconference?", "Hello! Can I take a look at the user manuals?"],
["This seems fantastic, how much does it cost?", "Excellent website, can we setup a call?", "Hello! Could you help me find the resource documents?"]
]
We’ll upload the language metrics much as we did for a single profile, except that we’ll loop through the list and adjust the datetime for each day's prompts. By default, when you upload a profile to WhyLabs, the timestamp is the time of writing.
import datetime
telemetry_agent = WhyLabsWriter()
all_prompts_and_responses = []  # This list will store all the prompts and responses.
for i, day in enumerate(prompt_lists):
    # Walk backwards: each day's data must map to its own date to show up as a separate batch in WhyLabs
    dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
    for prompt in day:
        prompt_and_response = gpt_model(prompt)
        profile = why.log(prompt_and_response, schema=schema)
        # Save the prompt/response dictionary in the list.
        all_prompts_and_responses.append(prompt_and_response)
        # Set the dataset timestamp for the profile
        profile.set_dataset_timestamp(dt)
        telemetry_agent.write(profile.view())
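The date arithmetic in the loop above can be isolated and checked on its own. This is a small sketch using only the standard library; `backfill_timestamps` is a hypothetical helper name, and the fixed date is just for a reproducible example.

```python
import datetime


def backfill_timestamps(num_days, now=None):
    """Return one UTC timestamp per day, starting today and walking backwards."""
    if now is None:
        now = datetime.datetime.now(tz=datetime.timezone.utc)
    return [now - datetime.timedelta(days=i) for i in range(num_days)]


fixed_now = datetime.datetime(2023, 7, 26, 12, 0, tzinfo=datetime.timezone.utc)
for ts in backfill_timestamps(3, now=fixed_now):
    print(ts.date().isoformat())
# prints 2023-07-26, 2023-07-25, 2023-07-24 (one per line)
```

Using timezone-aware UTC datetimes, as the loop does, avoids ambiguity about which day a batch belongs to.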
Once profiles are written to WhyLabs, they can be inspected, compared, and monitored for data quality and data drift. Visit your WhyLabs project and you’ll see the profiles logged. To see a time series view of a metric, navigate to the inputs tab and select one.
Set monitors for data drift and anomaly detection on LLM metrics
Now we can enable a pre-configured monitor with just one click (or create a custom one) to detect anomalies in our data profiles. This makes it easy to set up common monitoring tasks, such as detecting data drift, data quality issues, and changes in model performance.
Once a monitor is configured, it can be previewed while viewing a feature in the input tab.
When anomalies are detected, notifications can be sent via email, Slack, or PagerDuty. Set notification preferences in Settings > Global Notifications Actions.
That’s it! We have gone through all the steps needed to ingest data from anywhere in ML pipelines and get notified if anomalies occur.
Optional: Use a Rolling Logger
A rolling logger can be used instead of the method above to merge profiles together and write them to the AI observatory at predefined intervals. This is a common way for LangKit and whylogs to be used in production deployments. Read more about the whylogs rolling logger in the docs.
telemetry_agent = why.logger(mode="rolling", interval=5, when="M", schema=schema, base_name="huggingface")
telemetry_agent.append_writer("whylabs")
# Log data + model outputs to WhyLabs.ai
telemetry_agent.log(prompt_and_response)
# Close the whylogs rolling logger when the service is shut down
telemetry_agent.close()
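To build intuition for what "rolling" means here, the toy class below collects records and hands them off as one batch each time an interval elapses. It is a conceptual sketch only, not how whylogs is implemented: the real rolling logger merges data into profiles of summary statistics rather than keeping raw records, and `IntervalBatcher` is a hypothetical name. The 300-second interval assumes that `interval=5, when="M"` means five minutes.

```python
import time


class IntervalBatcher:
    """Toy stand-in for a rolling logger: collect records, flush one batch per interval."""

    def __init__(self, interval_seconds, write_batch, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self.write_batch = write_batch  # callable that receives a list of records
        self.clock = clock
        self._pending = []
        self._window_start = clock()

    def log(self, record):
        # Flush the previous window first if its interval has elapsed.
        if self.clock() - self._window_start >= self.interval_seconds:
            self.flush()
        self._pending.append(record)

    def flush(self):
        if self._pending:
            self.write_batch(self._pending)
        self._pending = []
        self._window_start = self.clock()

    def close(self):
        # Write whatever is left so the final partial window is not lost.
        self.flush()


written = []
fake_time = [0.0]  # controllable clock so the example runs instantly
batcher = IntervalBatcher(300, written.append, clock=lambda: fake_time[0])
batcher.log({"prompt": "a", "response": "b"})
batcher.log({"prompt": "c", "response": "d"})
fake_time[0] = 301.0
batcher.log({"prompt": "e", "response": "f"})  # flushes the first two records
batcher.close()
print([len(batch) for batch in written])  # prints [2, 1]
```

The call to `close()` matters for the same reason it does in the whylogs example above: without it, data from the last partial interval would never be written.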
Key takeaways for monitoring Hugging Face LLMs
Given the capabilities of LLMs, it's essential to monitor them continuously to ensure they perform effectively and as expected. By closely monitoring LLMs, any irregularities or potential issues can be identified early on, allowing for timely adjustments and necessary improvements.
Keeping an eye on metrics such as text quality, response relevance, sentiment, and toxicity can help identify how model performance and user interaction are changing over time.
Ready to start monitoring your Hugging Face LLMs with LangKit? Head over to the GitHub repository and sign up for a free WhyLabs account!
Learn more about monitoring large language models in production:
- Intro to LangKit example
- LangKit GitHub
- whylogs GitHub - data logging & AI telemetry
- WhyLabs - Safeguard your Large Language Models
We recently held a workshop designed to equip you with the knowledge and skills to use LangKit with Hugging Face models. Check out the recording to be guided through what we covered in this blog post!