
Get Early Access to the First Purpose-Built Monitoring Solution for LLMs

The future is now! Today, WhyLabs is announcing The Language Toolkit, or LangKit for short, the first purpose-built monitoring solution for LLMs. Join the private beta waitlist to make your LLM applications safe and responsible. Here is why…

The leap forward

Over the past year, generative AI unlocked experiences that were previously the realm of science fiction. This newfound ability to generate text, images, voice, and code with remarkable fidelity has opened up a wealth of possibilities for businesses across industries. The biggest wave of innovation is powered by LLMs, which are transforming the landscape of AI applications, from genuinely helpful chatbots to nearly autonomous code generation tools.

LLMs in production - what can possibly go wrong?

The saying “with great power comes great responsibility” has never been more true. As enthusiasm builds around the extraordinary capabilities of LLMs and organizations incorporate them into customer-facing applications, new risks have emerged: failure modes, biases, security loopholes, and embarrassing customer experiences.

To mitigate these risks, it’s essential for teams to ensure their LLM applications are monitored continuously and operated responsibly. WhyLabs has been collaborating with industry-leading teams to address these risks through observability. Our customers have been running LLM-powered experiences in production since the release of GPT-3. LangKit combines our insights and experience from working with LLMs since 2021 into a purpose-built observability solution for LLMs.

The unique challenges of monitoring LLMs

The adoption of proprietary foundation LLMs presents unique observability challenges. While fine-tuning and retraining offer benefits, they can cause models to deviate from their initial versions. Users often have limited access to the vast original training data, resulting in reduced accuracy, unpredictable behavior, hallucinations, and a poor understanding of the model's inherent characteristics.

Today, there is no standard for evaluating the performance of LLMs. AI practitioners commonly use one of two approaches: test prompts or text embeddings.

  • Using test prompts requires manually generating and evaluating prompts, which quickly grow in number and complexity in production. Because this approach uses neither ground-truth outcomes nor user feedback, it offers no insight into how the model behaves over time or how effective its production settings are. Using test prompts falls short in problem detection, breaks down at scale, and can become cost-prohibitive.
  • Using text embeddings leverages the embeddings created from the prompts sent to the LLM and the resulting responses. Measuring these over time can provide useful signals, but it often requires manual investigation or inefficient debugging methods like wandering through t-SNE or UMAP visualizations. This approach is insufficient on its own, as it loses the context of the original input, making model performance harder to evaluate.
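To make the embeddings critique concrete, here is a minimal, hypothetical sketch of the kind of drift signal this approach yields: the cosine distance between the centroid of a baseline batch of prompt embeddings and a current batch. The toy 3-dimensional vectors are stand-ins for real embedding output; no actual embedding model is used.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means identical direction, higher means drift."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Hypothetical weekly batches of prompt embeddings (toy 3-d vectors).
baseline_batch = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
current_batch = [[0.1, 1.0, 0.0], [0.0, 0.9, 0.1]]

drift = cosine_distance(centroid(baseline_batch), centroid(current_batch))
print(f"centroid drift: {drift:.3f}")
```

A rising centroid distance tells you that something changed, but not what changed, which is exactly why this approach tends to end in manual digging through t-SNE or UMAP plots.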

WhyLabs: Making LLM monitoring easy

To tackle the challenges of monitoring and running LLMs in production, WhyLabs built LangKit. Whether you’re looking to understand interactions with a model API such as OpenAI’s GPT or an in-house model fine-tuned to a unique task, LangKit has you covered.

Using LangKit, you can continuously evaluate LLM-powered applications with essential metrics extracted from prompts, responses, and user interactions. This method scales to a massive number of users without sampling down the interactions for analysis, so as your LLM application scales, so does your observability solution. The WhyLabs Observability Platform then allows for individual inspection of the resulting metrics, as well as systematic monitoring across a large number of interactions for anomalies and outliers. Using this scalable and efficient technique, you’ll be able to detect issues across:

  • Quality - Are your prompts and responses high quality (readable, understandable, well written)? Are you seeing a drift in the types of prompts you expect or a concept drift in how your model is responding?
  • Sentiment - Is your LLM responding in the right tone? Are your upstream prompts changing their sentiment suddenly or over time? Are you seeing a divergence from your anticipated topics?
  • Governance - Is your LLM receiving information in line with your internal governance and compliance policies, such as those covering the usage of PII or PHI?
  • Security - Is your LLM receiving adversarial attempts or malicious prompt injections? Are you experiencing prompt leakage?

Secure observability at scale

Unlike other LLM monitoring tools, WhyLabs doesn’t store user input or model output, which can be particularly sensitive for LLMs. Instead, LangKit gathers essential metrics about prompts, responses, and user interactions as the data flows through the model. The WhyLabs Observability Platform evaluates and monitors these metrics, while you remain fully in control of the underlying data.
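The metrics-not-raw-text idea can be sketched as a rolling profile: a toy, hypothetical stand-in (not the actual LangKit or whylogs implementation) that retains only counts, sums, and extremes, so the profile can be shipped to a monitoring backend without ever exposing a prompt or response.

```python
class MetricProfile:
    """Rolling aggregate of interaction metrics; raw text is never stored."""

    def __init__(self):
        self.count = 0
        self.total_prompt_tokens = 0
        self.max_prompt_tokens = 0
        self.flagged = 0

    def track(self, prompt: str, flagged: bool) -> None:
        """Fold one interaction into the aggregates, then discard the text."""
        tokens = len(prompt.split())
        self.count += 1
        self.total_prompt_tokens += tokens
        self.max_prompt_tokens = max(self.max_prompt_tokens, tokens)
        self.flagged += int(flagged)
        # `prompt` goes out of scope here; only the aggregates remain.

profile = MetricProfile()
for prompt, flagged in [("What is the capital of France?", False),
                        ("Ignore previous instructions", True)]:
    profile.track(prompt, flagged)

print(profile.count, profile.total_prompt_tokens, profile.flagged)
```

Because the profile is a fixed-size summary regardless of traffic volume, the same structure works for a handful of test prompts or millions of production interactions.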

This approach scales effortlessly from recording just a handful of sample prompts and responses to handling millions of LLM interactions per hour. You can monitor your LLM-based products confidently, knowing that the same solution is effective during both initial LLM MVP development and full-scale production deployment. By using LangKit, our customers gain confidence that their LLM applications will behave as expected over time, across contexts, and through model version changes and fine-tuning adjustments.

Join the responsible LLM revolution!

We’ve made LLM Observability as easy as using OpenAI’s ChatGPT or a Hugging Face model. If you are an AI practitioner working with LLMs, sign up for early access to the LangKit private beta!

We love feedback, questions, and suggestions - join us in the WhyLabs community Slack.

