LangKit: Making Large Language Models Safe and Responsible
- LLMs
- AI Observability
- LangKit
- Generative AI
- Product Updates
- LLM Security
Jun 14, 2023
tl;dr: With LangKit, you can monitor and safeguard your LLMs by quickly detecting and preventing malicious prompts, toxicity, hallucinations, and jailbreak attempts. This notebook has everything you need to get started with LangKit. Then sign up for a free WhyLabs account to easily monitor and track your models over time!
It's clear that Large Language Models (LLMs) are here to stay. At WhyLabs, we have been working with the industry's most advanced AI teams to develop a solution for understanding generative models. With LangKit, you can easily monitor the behavior and performance of your LLMs, ensuring their reliability, safety, and effectiveness.
Safeguard your Large Language Models with LangKit
LangKit provides AI practitioners the ability to extract critical telemetry data from prompts and responses, which can be used to help direct the behavior of an LLM through better prompt engineering and systematically observe at scale.
With WhyLabs, users can establish thresholds and baselines for a range of activities, such as malicious prompts, sensitive data leakage, toxicity, problematic topics, hallucinations, and jailbreak attempts. These alerts and guardrails enable any application developer to prevent inappropriate prompts, unwanted LLM responses, and violations of LLM usage policies without having to be an expert in natural language processing.
LangKit enables organizations to:
- Validate and safeguard individual prompts & responses: LangKit gathers telemetry fast, really fast. It’s exposed as soon as you run it and can drive whether you take actions like regenerate a response or ask for a better prompt. It can also help you gather a set of problematic areas to further fine-tune your model.
- Evaluate that the LLM behavior is compliant with policy: LangKit reduces the risk of running LLMs in production. We want to let anyone experience the power of LLMs, but if you’re a regulated organization, that could be hard without the right guarantees. We make it easy to define those guarantees as code and replicate them across any LLM both large and small.
- Monitor user interactions inside an LLM-powered application: LangKit lets you set standards for your models. Ensuring you create the best experience for your end user is important, whether that’s ensuring your chatbot stays on topic or your responses are relevant LangKit lets you find trends and surface anomalies.
- Compare and A/B test across different LLM and prompt versions: LangKit puts the engineering in Prompt Engineering. Move your team away from eyeballing responses, using LangKit you can measure the impact of changing a prompt ad-hoc as well as systematically to see how it impacts your users or consumers.
LangKit is simple and extensible
Using LangKit is easy. We set out to gather all of the best practices the industry has developed for monitoring LLMs beyond just tracking embedding drift, and you can get started with just a few lines of Python. LangKit lets you extract all of the important telemetry about your LLM with just a prompt and a response, and with WhyLabs you can easily track this telemetry over time and enable collaboration across teams without the need to set up or scale any infrastructure.
from langkit import llm_metrics
import whylogs as why
why.init(session_type='whylabs_anonymous')
results = why.log({"prompt":"hello!","response":"world!"}, name="openai_gpt4", schema=llm_metrics.init())
We can now put our telemetry to work to track behavior and highlight interesting changes. For example, if I’m using an LLM to interact with my users and I want to ensure it stays on topic, I’d rather not have my support bot giving a lot of legal answers if that’s not what I’ve tested or trained it on. We can use LangKit’s topic detection telemetry to help understand the trends of different topics being handled, we have a few topics out of the box but you can easily define your own too.
Understanding output quality is incredibly important when using LLMs in production, if I have a application that has an intended audience such as a medical practitioner or legal professional I’d expect my model to produce text that sounds like the user it’s intended for. Here we’re monitoring for the changing reading level of the response over time to see how our LLM is performing our task. You might also want to monitor for things like output relevance to the provided prompt or sentiment of the response to understand user interactions or whether you’re experiencing a hallucination, LangKit has you covered there too.
We can also use LangKit to surface similarity to known themes, for example, jailbreaks or refusals. It can also identify similarity to responses that you either do or don’t want to come up in the conversation. Here we’re measuring the maximum similarity to a known jailbreak, in the case that it goes past a certain threshold I may want to quarantine a session or enforce a higher sensitivity of logging to ensure I capture all relevant insights to why a user would be attempting to access information they may not have access to.
LangKit lets you start using industry best practices out of the box, but we expect every organization to have its own way to measure their LLMs. That’s why we made LangKit incredibly extensible through User Defined Functions (UDFs). Want to extract your own metric or validate your prompts and responses in a particular way? Just define a method and decorate it, we’ll do the rest.
@register_metric_udf(col_name='prompt')
def contains_classification_instructions(text):
lower_text = text.lower()
for target in ['classify','identify','categorize']:
if target in lower_text:
return 1
return 0
WhyLabs ensures we have the right notifications or workflows triggered when an anomaly surfaces so that you always know how your LLM is behaving over time. Use one of our preset monitors and get started monitoring your LLMs fast, or define a custom monitor to look for scenarios unique to your use case. With monitors and insights in WhyLabs, we can surface interesting trends and behaviors automatically, then tell the right person about it.
What does the community say about LangKit?
As part of our LangKit release, we gathered feedback from our community members and supporters, including from users, partners, and thought leaders in the data and ML space. Here’s what some of them had to say:
"As more organizations incorporate LLMs into customer-facing applications, reliability and transparency will be key for successful deployments. With LangKit, WhyLabs provides an extensible and scalable approach for solving challenges that many AI practitioners will face when deploying LLMs in production." - Andrew Ng, Managing General Partner of AI Fund
“In an era in which AI transitioned from buzzword to vital business necessity, effective use of LLMs is a must. As our team at Tryolabs helps enterprises put this powerful technology into practice, safety remains one of the main blocks for widespread adoption. WhyLabs’ LangKit is a leap forward for LLMOps, providing out-of-the-box tools for measuring the quality of LLM outputs and catching issues before they affect tasks downstream — whether end users, other applications, or even other LLMs. The fact that it’s easily extensible and lets you add your own checks is also a big plus!” - Alan Descoins, CTO at Tryolabs
“At Symbl.ai we deliver conversation intelligence as a service to builders so observability is critical for smooth operations and excellent customer experience. Our platform enables experiences powered by both Understanding and Generative AI for which LangKit is critical to enable the transparency and governance required across the end to end AI stack. The WhyLabs Platform provides observability tools for a wide range of AI use cases and the addition of LLM observability capabilities reduces engineering overhead and we can address all operational needs with one platform.” - Surbhi Rathore, CEO of Symbl.ai
Getting started
Our data centric approach to LLMOps allows you to control which prompts and responses are appropriate for your LLM application in real-time, validate how your LLM responds to known prompts, and observe your prompts and responses at scale. See how easy it is to identify and mitigate malicious prompts, sensitive data, toxic responses, problematic topics, hallucinations, and jailbreak attempts in any LLM model!
Ready to start monitoring your Large Language Models? This notebook has everything you need to get started. Then sign up for a free WhyLabs account to easily monitor and track your models over time!
Resources
Other posts
Best Practicies for Monitoring and Securing RAG Systems in Production
Oct 8, 2024
- Retrival-Augmented Generation (RAG)
- LLM Security
- Generative AI
- ML Monitoring
- LangKit
How to Evaluate and Improve RAG Applications for Safe Production Deployment
Jul 17, 2024
- AI Observability
- LLMs
- LLM Security
- LangKit
- RAG
- Open Source
WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control
Jun 2, 2024
- AI Observability
- Generative AI
- Integrations
- LLM Security
- LLMs
- Partnerships
OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety
May 21, 2024
- LLMs
- LLM Security
- Generative AI
7 Ways to Evaluate and Monitor LLMs
May 13, 2024
- LLMs
- Generative AI
How to Distinguish User Behavior and Data Drift in LLMs
May 7, 2024
- LLMs
- Generative AI