
LangKit: Making Large Language Models Safe and Responsible

tl;dr: With LangKit, you can monitor and safeguard your LLMs by quickly detecting and preventing malicious prompts, toxicity, hallucinations, and jailbreak attempts. This notebook has everything you need to get started with LangKit. Then sign up for a free WhyLabs account to easily monitor and track your models over time!

It's clear that Large Language Models (LLMs) are here to stay. At WhyLabs, we have been working with the industry's most advanced AI teams to develop a solution for understanding generative models. With LangKit, you can easily monitor the behavior and performance of your LLMs, ensuring their reliability, safety, and effectiveness.

Safeguard your Large Language Models with LangKit

LangKit gives AI practitioners the ability to extract critical telemetry from prompts and responses, which can be used both to steer an LLM's behavior through better prompt engineering and to observe it systematically at scale.

With WhyLabs, users can establish thresholds and baselines for a range of activities, such as malicious prompts, sensitive data leakage, toxicity, problematic topics, hallucinations, and jailbreak attempts. These alerts and guardrails enable any application developer to prevent inappropriate prompts, unwanted LLM responses, and violations of LLM usage policies without having to be an expert in natural language processing.

LangKit enables organizations to:

  • Validate and safeguard individual prompts & responses: LangKit gathers telemetry fast. The telemetry is available as soon as you run it and can drive actions such as regenerating a response or asking for a better prompt. It can also help you collect a set of problematic examples to further fine-tune your model.
  • Evaluate whether LLM behavior is compliant with policy: LangKit reduces the risk of running LLMs in production. We want anyone to be able to experience the power of LLMs, but for a regulated organization that can be hard without the right guarantees. LangKit makes it easy to define those guarantees as code and replicate them across any LLM, large or small.
  • Monitor user interactions inside an LLM-powered application: LangKit lets you set standards for your models. Whether that means keeping your chatbot on topic or ensuring responses stay relevant, LangKit lets you find trends and surface anomalies so you can create the best experience for your end users.
  • Compare and A/B test across different LLM and prompt versions: LangKit puts the engineering in prompt engineering. Move your team away from eyeballing responses: with LangKit you can measure the impact of changing a prompt, both ad hoc and systematically, to see how it affects your users or consumers.
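The validate-and-safeguard workflow above boils down to a simple gate: score a response, then decide whether to return it or regenerate it. The sketch below illustrates that shape only; the word-list scorer and the threshold are hypothetical stand-ins for the telemetry LangKit would actually extract, not LangKit's API.

```python
# Illustrative guardrail gate. The scorer and threshold below are toy
# placeholders, not a real toxicity model.

def toxicity_score(text: str) -> float:
    """Toy stand-in for a real toxicity metric: counts flagged words."""
    flagged = {"idiot", "stupid", "hate"}
    words = text.lower().split()
    return sum(w.strip(".,!?") in flagged for w in words) / max(len(words), 1)

def gate_response(response: str, block_at: float = 0.2) -> str:
    """Return 'allow' or 'regenerate' based on the score."""
    return "regenerate" if toxicity_score(response) >= block_at else "allow"

print(gate_response("Happy to help with that!"))    # allow
print(gate_response("That is a stupid question."))  # regenerate
```

In a real application the same gate would sit between the LLM and the user, with LangKit-extracted metrics in place of the toy scorer.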

LangKit is simple and extensible

Using LangKit is easy. We set out to gather all of the best practices the industry has developed for monitoring LLMs beyond just tracking embedding drift, and you can get started with just a few lines of Python. LangKit lets you extract all of the important telemetry about your LLM with just a prompt and a response, and with WhyLabs you can easily track this telemetry over time and enable collaboration across teams without the need to set up or scale any infrastructure.

# Install with: pip install langkit whylogs
from langkit import llm_metrics
import whylogs as why

# Log a prompt/response pair with LangKit's LLM metrics schema applied.
results = why.log(
    {"prompt": "hello!", "response": "world!"},
    name="openai_gpt4",
    schema=llm_metrics.init(),
)

We can now put our telemetry to work to track behavior and highlight interesting changes. For example, if I’m using an LLM to interact with my users and I want to ensure it stays on topic, I’d rather not have my support bot giving legal answers if that’s not what I’ve tested or trained it on. We can use LangKit’s topic detection telemetry to understand trends in the topics being handled; a few topics are available out of the box, but you can easily define your own too.
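As a rough illustration of the idea (this is not LangKit's topic module, whose API isn't shown here), topic detection over a batch of responses can be thought of as classifying each response and counting the results. The keyword lists below are hypothetical:

```python
from collections import Counter

# Hypothetical keyword lists standing in for a real topic classifier.
TOPIC_KEYWORDS = {
    "legal": {"contract", "liability", "lawsuit"},
    "support": {"password", "refund", "account"},
}

def detect_topic(text: str) -> str:
    """Assign the first topic whose keywords appear in the text."""
    words = set(text.lower().split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return "other"

responses = [
    "You can reset your password from the account page.",
    "I cannot advise on the liability clause of that contract.",
]
trend = Counter(detect_topic(r) for r in responses)
print(trend["support"], trend["legal"])  # 1 1
```

A rising count in an off-topic bucket (here, "legal" for a support bot) is exactly the kind of trend you would want surfaced.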

Understanding output quality is incredibly important when using LLMs in production. If I have an application with an intended audience, such as medical practitioners or legal professionals, I’d expect my model to produce text that sounds like it was written for that audience. Here we’re monitoring the reading level of the response over time to see how our LLM is performing the task. You might also want to monitor the relevance of the output to the provided prompt, or the sentiment of the response, to understand user interactions or detect hallucinations; LangKit has you covered there too.
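To make the reading-level idea concrete, here is a deliberately simplified stand-in for a readability metric. Real implementations (e.g. Flesch–Kincaid, which LangKit's text statistics draw on) count syllables; this toy version uses only average sentence and word length, so the exact numbers are illustrative:

```python
import re

def rough_grade_level(text: str) -> float:
    """Crude readability proxy: longer sentences and longer words push
    the score up, loosely mimicking a grade-level metric."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return 0.4 * (avg_sentence_len + 2 * avg_word_len)

simple = rough_grade_level("The cat sat. It was warm.")
dense = rough_grade_level(
    "Pharmacokinetic interactions necessitate meticulous contraindication review."
)
print(simple < dense)  # True
```

Tracked over time, a drift in this score would indicate the model's output is no longer matching its intended audience.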

We can also use LangKit to surface similarity to known themes, for example jailbreaks or refusals, as well as similarity to responses that you do or don’t want to come up in the conversation. Here we’re measuring the maximum similarity to a known jailbreak. If it passes a certain threshold, I may want to quarantine the session or raise the logging sensitivity to capture all relevant insights into why a user might be attempting to access information they shouldn’t have access to.
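Conceptually, theme similarity reduces to comparing an embedding of the incoming prompt against embeddings of known examples and taking the maximum. In this sketch, toy 3-dimensional vectors stand in for real sentence embeddings, and the 0.95 threshold is hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Toy "embeddings" of known jailbreak prompts; a real system would use
# a sentence-embedding model here.
known_jailbreaks = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1)]

def max_jailbreak_similarity(prompt_embedding):
    """Maximum similarity of the prompt to any known jailbreak."""
    return max(cosine(prompt_embedding, jb) for jb in known_jailbreaks)

score = max_jailbreak_similarity((0.85, 0.15, 0.05))
if score > 0.95:
    print("quarantine session / raise logging sensitivity")
```

The same max-over-known-examples pattern works for refusals or any other theme you want to detect.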

LangKit lets you start using industry best practices out of the box, but we expect every organization to have its own way of measuring its LLMs. That’s why we made LangKit incredibly extensible through User Defined Functions (UDFs). Want to extract your own metric or validate your prompts and responses in a particular way? Just define a method and decorate it; we’ll do the rest.

# Flag prompts or responses that contain classification-style instructions.
def contains_classification_instructions(text):
    lower_text = text.lower()
    targets = ("classify", "identify", "categorize")
    return int(any(target in lower_text for target in targets))

WhyLabs ensures we have the right notifications or workflows triggered when an anomaly surfaces so that you always know how your LLM is behaving over time. Use one of our preset monitors and get started monitoring your LLMs fast, or define a custom monitor to look for scenarios unique to your use case. With monitors and insights in WhyLabs, we can surface interesting trends and behaviors automatically, then tell the right person about it.
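Under the hood, a monitor of this kind compares today's metric value against a baseline built from history. A minimal sketch of that idea, using a rolling mean with a z-score cutoff (the threshold and example values are hypothetical, not WhyLabs' monitor configuration):

```python
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's metric value if it deviates from the baseline
    by more than z_threshold standard deviations."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# e.g. daily average reading level of responses
reading_levels = [8.1, 8.3, 7.9, 8.2, 8.0]
print(is_anomalous(reading_levels, 8.1))   # False
print(is_anomalous(reading_levels, 12.5))  # True
```

A preset monitor wires a check like this to a notification channel, so the right person hears about the anomaly automatically.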

What does the community say about LangKit?

As part of our LangKit release, we gathered feedback from our community members and supporters, including from users, partners, and thought leaders in the data and ML space. Here’s what some of them had to say:

"As more organizations incorporate LLMs into customer-facing applications, reliability and transparency will be key for successful deployments. With LangKit, WhyLabs provides an extensible and scalable approach for solving challenges that many AI practitioners will face when deploying LLMs in production." - Andrew Ng, Managing General Partner of AI Fund
“In an era in which AI transitioned from buzzword to vital business necessity, effective use of LLMs is a must. As our team at Tryolabs helps enterprises put this powerful technology into practice, safety remains one of the main blocks for widespread adoption. WhyLabs’ LangKit is a leap forward for LLMOps, providing out-of-the-box tools for measuring the quality of LLM outputs and catching issues before they affect tasks downstream — whether end users, other applications, or even other LLMs. The fact that it’s easily extensible and lets you add your own checks is also a big plus!” - Alan Descoins, CTO at Tryolabs
“At we deliver conversation intelligence as a service to builders so observability is critical for smooth operations and excellent customer experience. Our platform enables experiences powered by both Understanding and Generative AI for which LangKit is critical to enable the transparency and governance required across the end to end AI stack. The WhyLabs Platform provides observability tools for a wide range of AI use cases and the addition of LLM observability capabilities reduces engineering overhead and we can address all operational needs with one platform.” - Surbhi Rathore, CEO of

Getting started

Our data-centric approach to LLMOps allows you to control which prompts and responses are appropriate for your LLM application in real time, validate how your LLM responds to known prompts, and observe your prompts and responses at scale. See how easy it is to identify and mitigate malicious prompts, sensitive data leakage, toxic responses, problematic topics, hallucinations, and jailbreak attempts in any LLM!

Ready to start monitoring your Large Language Models? This notebook has everything you need to get started. Then sign up for a free WhyLabs account to easily monitor and track your models over time!

