
LangKit: Making Large Language Models Safe and Responsible

tl;dr: With LangKit, you can monitor and safeguard your LLMs by quickly detecting and preventing malicious prompts, toxicity, hallucinations, and jailbreak attempts. This notebook has everything you need to get started with LangKit. Then sign up for a free WhyLabs account to easily monitor and track your models over time!

It's clear that Large Language Models (LLMs) are here to stay. At WhyLabs, we have been working with the industry's most advanced AI teams to develop a solution for understanding generative models. With LangKit, you can easily monitor the behavior and performance of your LLMs, ensuring their reliability, safety, and effectiveness.

Safeguard your Large Language Models with LangKit

LangKit gives AI practitioners the ability to extract critical telemetry from prompts and responses, which can be used both to steer an LLM's behavior through better prompt engineering and to observe it systematically at scale.

With WhyLabs, users can establish thresholds and baselines for a range of activities, such as malicious prompts, sensitive data leakage, toxicity, problematic topics, hallucinations, and jailbreak attempts. These alerts and guardrails enable any application developer to prevent inappropriate prompts, unwanted LLM responses, and violations of LLM usage policies without having to be an expert in natural language processing.

LangKit enables organizations to:

  • Validate and safeguard individual prompts & responses: LangKit gathers telemetry fast. The telemetry is available as soon as you run it and can drive actions such as regenerating a response or asking for a better prompt. It can also help you collect a set of problematic examples to further fine-tune your model.
  • Evaluate whether LLM behavior is compliant with policy: LangKit reduces the risk of running LLMs in production. We want anyone to be able to experience the power of LLMs, but for a regulated organization that can be hard without the right guarantees. LangKit makes it easy to define those guarantees as code and replicate them across any LLM, large or small.
  • Monitor user interactions inside an LLM-powered application: LangKit lets you set standards for your models. Whether that means keeping your chatbot on topic or ensuring responses stay relevant, LangKit lets you find trends and surface anomalies so you can create the best experience for your end users.
  • Compare and A/B test across different LLM and prompt versions: LangKit puts the engineering in prompt engineering. Move your team away from eyeballing responses: with LangKit you can measure the impact of changing a prompt, both ad hoc and systematically, to see how it affects your users or consumers.
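The validate-and-safeguard workflow above boils down to a simple gate: score a response, then decide whether to return it or regenerate it. The sketch below illustrates that shape only; the word-list scorer and the threshold are hypothetical stand-ins for the telemetry LangKit would actually extract, not LangKit's API.

```python
# Illustrative guardrail gate. The scorer and threshold below are toy
# placeholders, not a real toxicity model.

def toxicity_score(text: str) -> float:
    """Toy stand-in for a real toxicity metric: counts flagged words."""
    flagged = {"idiot", "stupid", "hate"}
    words = text.lower().split()
    return sum(w.strip(".,!?") in flagged for w in words) / max(len(words), 1)

def gate_response(response: str, block_at: float = 0.2) -> str:
    """Return 'allow' or 'regenerate' based on the score."""
    return "regenerate" if toxicity_score(response) >= block_at else "allow"

print(gate_response("Happy to help with that!"))    # allow
print(gate_response("That is a stupid question."))  # regenerate
```

In a real application the same gate would sit between the LLM and the user, with LangKit-extracted metrics in place of the toy scorer.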

LangKit is simple and extensible

Using LangKit is easy. We set out to gather all of the best practices the industry has developed for monitoring LLMs beyond just tracking embedding drift, and you can get started with just a few lines of Python. LangKit lets you extract all of the important telemetry about your LLM with just a prompt and a response, and with WhyLabs you can easily track this telemetry over time and enable collaboration across teams without the need to set up or scale any infrastructure.

# Install with: pip install langkit whylogs
from langkit import llm_metrics
import whylogs as why

# Log a prompt/response pair with LangKit's LLM metrics schema applied.
results = why.log(
    {"prompt": "hello!", "response": "world!"},
    name="openai_gpt4",
    schema=llm_metrics.init(),
)

We can now put our telemetry to work to track behavior and highlight interesting changes. For example, if I’m using an LLM to interact with my users and I want to ensure it stays on topic, I’d rather not have my support bot giving legal answers if that’s not what I’ve tested or trained it on. We can use LangKit’s topic detection telemetry to understand trends in the topics being handled; a few topics are available out of the box, but you can easily define your own too.
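As a rough illustration of the idea (this is not LangKit's topic module, whose API isn't shown here), topic detection over a batch of responses can be thought of as classifying each response and counting the results. The keyword lists below are hypothetical:

```python
from collections import Counter

# Hypothetical keyword lists standing in for a real topic classifier.
TOPIC_KEYWORDS = {
    "legal": {"contract", "liability", "lawsuit"},
    "support": {"password", "refund", "account"},
}

def detect_topic(text: str) -> str:
    """Assign the first topic whose keywords appear in the text."""
    words = set(text.lower().split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return "other"

responses = [
    "You can reset your password from the account page.",
    "I cannot advise on the liability clause of that contract.",
]
trend = Counter(detect_topic(r) for r in responses)
print(trend["support"], trend["legal"])  # 1 1
```

A rising count in an off-topic bucket (here, "legal" for a support bot) is exactly the kind of trend you would want surfaced.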

Understanding output quality is incredibly important when using LLMs in production. If I have an application with an intended audience, such as medical practitioners or legal professionals, I’d expect my model to produce text that sounds like it was written for that audience. Here we’re monitoring the reading level of the response over time to see how our LLM is performing the task. You might also want to monitor the relevance of the output to the provided prompt, or the sentiment of the response, to understand user interactions or detect hallucinations; LangKit has you covered there too.
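To make the reading-level idea concrete, here is a deliberately simplified stand-in for a readability metric. Real implementations (e.g. Flesch–Kincaid, which LangKit's text statistics draw on) count syllables; this toy version uses only average sentence and word length, so the exact numbers are illustrative:

```python
import re

def rough_grade_level(text: str) -> float:
    """Crude readability proxy: longer sentences and longer words push
    the score up, loosely mimicking a grade-level metric."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return 0.4 * (avg_sentence_len + 2 * avg_word_len)

simple = rough_grade_level("The cat sat. It was warm.")
dense = rough_grade_level(
    "Pharmacokinetic interactions necessitate meticulous contraindication review."
)
print(simple < dense)  # True
```

Tracked over time, a drift in this score would indicate the model's output is no longer matching its intended audience.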

We can also use LangKit to surface similarity to known themes, for example jailbreaks or refusals, as well as similarity to responses that you do or don’t want to come up in the conversation. Here we’re measuring the maximum similarity to a known jailbreak. If it passes a certain threshold, I may want to quarantine the session or raise the logging sensitivity to capture all relevant insights into why a user might be attempting to access information they shouldn’t have access to.
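Conceptually, theme similarity reduces to comparing an embedding of the incoming prompt against embeddings of known examples and taking the maximum. In this sketch, toy 3-dimensional vectors stand in for real sentence embeddings, and the 0.95 threshold is hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Toy "embeddings" of known jailbreak prompts; a real system would use
# a sentence-embedding model here.
known_jailbreaks = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1)]

def max_jailbreak_similarity(prompt_embedding):
    """Maximum similarity of the prompt to any known jailbreak."""
    return max(cosine(prompt_embedding, jb) for jb in known_jailbreaks)

score = max_jailbreak_similarity((0.85, 0.15, 0.05))
if score > 0.95:
    print("quarantine session / raise logging sensitivity")
```

The same max-over-known-examples pattern works for refusals or any other theme you want to detect.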

LangKit lets you start using industry best practices out of the box, but we expect every organization to have its own way of measuring its LLMs. That’s why we made LangKit incredibly extensible through User Defined Functions (UDFs). Want to extract your own metric or validate your prompts and responses in a particular way? Just define a method and decorate it; we’ll do the rest.

# Flag prompts or responses that contain classification-style instructions.
def contains_classification_instructions(text):
    lower_text = text.lower()
    targets = ("classify", "identify", "categorize")
    return int(any(target in lower_text for target in targets))

WhyLabs ensures we have the right notifications or workflows triggered when an anomaly surfaces so that you always know how your LLM is behaving over time. Use one of our preset monitors and get started monitoring your LLMs fast, or define a custom monitor to look for scenarios unique to your use case. With monitors and insights in WhyLabs, we can surface interesting trends and behaviors automatically, then tell the right person about it.
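Under the hood, a monitor of this kind compares today's metric value against a baseline built from history. A minimal sketch of that idea, using a rolling mean with a z-score cutoff (the threshold and example values are hypothetical, not WhyLabs' monitor configuration):

```python
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's metric value if it deviates from the baseline
    by more than z_threshold standard deviations."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# e.g. daily average reading level of responses
reading_levels = [8.1, 8.3, 7.9, 8.2, 8.0]
print(is_anomalous(reading_levels, 8.1))   # False
print(is_anomalous(reading_levels, 12.5))  # True
```

A preset monitor wires a check like this to a notification channel, so the right person hears about the anomaly automatically.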

What does the community say about LangKit?

As part of our LangKit release, we gathered feedback from our community members and supporters, including from users, partners, and thought leaders in the data and ML space. Here’s what some of them had to say:

"As more organizations incorporate LLMs into customer-facing applications, reliability and transparency will be key for successful deployments. With LangKit, WhyLabs provides an extensible and scalable approach for solving challenges that many AI practitioners will face when deploying LLMs in production." - Andrew Ng, Managing General Partner of AI Fund
“In an era in which AI transitioned from buzzword to vital business necessity, effective use of LLMs is a must. As our team at Tryolabs helps enterprises put this powerful technology into practice, safety remains one of the main blocks for widespread adoption. WhyLabs’ LangKit is a leap forward for LLMOps, providing out-of-the-box tools for measuring the quality of LLM outputs and catching issues before they affect tasks downstream — whether end users, other applications, or even other LLMs. The fact that it’s easily extensible and lets you add your own checks is also a big plus!” - Alan Descoins, CTO at Tryolabs
“At we deliver conversation intelligence as a service to builders so observability is critical for smooth operations and excellent customer experience. Our platform enables experiences powered by both Understanding and Generative AI for which LangKit is critical to enable the transparency and governance required across the end to end AI stack. The WhyLabs Platform provides observability tools for a wide range of AI use cases and the addition of LLM observability capabilities reduces engineering overhead and we can address all operational needs with one platform.” - Surbhi Rathore, CEO of

Getting started

Our data-centric approach to LLMOps allows you to control which prompts and responses are appropriate for your LLM application in real time, validate how your LLM responds to known prompts, and observe your prompts and responses at scale. See how easy it is to identify and mitigate malicious prompts, sensitive data leakage, toxic responses, problematic topics, hallucinations, and jailbreak attempts in any LLM!

Ready to start monitoring your Large Language Models? This notebook has everything you need to get started. Then sign up for a free WhyLabs account to easily monitor and track your models over time!

