Get Early Access to the First Purpose-Built Monitoring Solution for LLMs
- WhyLabs
- LLMs
- LLM Security
- Generative AI
- LangKit
- News
- Product Updates
May 11, 2023
The future is now! Today, WhyLabs announced the private beta release of The Language Toolkit, or LangKit for short, the first purpose-built monitoring solution for LLMs.
Since the release of this blog, we have officially launched LangKit - please visit our announcement to learn more.
The leap forward
Over the past year, generative AI unlocked experiences which previously were in the realm of science fiction. This newfound ability to generate text, images, voice, and code with remarkable fidelity has opened up a wealth of possibilities for businesses in various industries. The biggest wave of innovation is powered by LLMs, which are transforming the landscape of AI applications from genuinely helpful chatbots to nearly autonomous code generation tools.
LLMs in production - what can possibly go wrong?
The saying “with great power, comes great responsibility” has never been more true. As enthusiasm builds around the extraordinary capabilities of LLMs and organizations incorporate them into customer-facing applications, new risks have emerged - failure modes, biases, security loopholes, and embarrassing customer experiences.
To mitigate these risks, it’s essential for teams to ensure their LLM applications are monitored continuously and operated responsibly. WhyLabs has been collaborating with industry-leading teams on mitigating these risks through observability. Our customers have been running LLM-powered experiences in production since the release of GPT-3. LangKit combines our insights and experience from working with LLMs since 2021 in a purpose-built observability solution for LLMs.
The unique challenges of monitoring LLMs
The adoption of proprietary, foundational LLMs presents unique observability challenges. While fine-tuning and retraining offer benefits, models may deviate from initial versions. Users often have limited access to the vast original training data, resulting in reduced accuracy, unpredictable behavior, hallucinations, and poor understanding of the model's inherent characteristics.
Today, there is no standard for evaluating the performance of LLMs. AI practitioners commonly use two approaches - test prompts or text embeddings:
- Using test prompts requires manual generation and evaluation of prompts that quickly grow in size and complexity in production. Since this approach doesn’t use actuals and user feedback, it fails to offer insights into how the model behaves over time or the effectiveness of its production settings. Using test prompts falls short in problem detection, breaks down at scale, and can become cost prohibitive.
- Using text embeddings leverages the text embeddings created from the prompts used to interact with the LLM and the resulting response. Measuring these over time can provide some useful signals but often requires manual investigation or inefficient debugging methods like wandering through t-SNE or UMAP visualizations. This approach is insufficient as it loses the context of the original input, making it harder to evaluate model performance.
WhyLabs: Making LLM monitoring easy
To tackle the challenges of monitoring and running LLMs in production, WhyLabs built LangKit. Whether you’re looking to understand interactions with a model API such as OpenAI’s GPT or an in-house model fine-tuned to a unique task, LangKit has you covered.
Utilizing LangKit, you can continuously evaluate the LLM-powered applications using essential metrics extracted from prompts, responses, and user interactions. This method allows scaling to a massive number of users, without sampling down the interactions for analysis. So as your LLM application scales, so does your observability solution. WhyLabs Observability Platform then allows for individual inspection of the resulting metrics as well as systematic monitoring across a large number of interactions for anomalies and outliers. Using this scalable and efficient technique you’ll be able to detect issues across:
- Quality - Are your prompts and responses high quality (readable, understandable, well written)? Are you seeing a drift in the types of prompts you expect or a concept drift in how your model is responding?
- Sentiment - Is your LLM responding in the right tone? Are your upstream prompts changing their sentiment suddenly or over time? Are you seeing a divergence from your anticipated topics?
- Governance - Is your LLM receiving the kind of information in line with your internal governance & compliance policies such as the usage of PII or PHI
- Security - Is your LLM receiving adversarial attempts or malicious prompt injections? Are you experiencing prompt leakage?
Secure observability at scale
Distinct from other LLM monitoring tools, WhyLabs doesn’t store user input or model output, which can be particularly sensitive for LLMs. Instead, LangKit gathers all essential metrics about prompt, response data, and user interaction data as it flows through the model. WhyLabs Observability Platform evaluates and monitors these metrics, while you remain fully in control of the underlying data.
This approach effortlessly scales from recording just a handful of sample prompts and responses to handling millions of LLM interactions per hour. You can monitor your LLM-based products confidently, knowing that the same solution is effective during both the initial LLM MVP development and full-scale production deployment. By using LangKit, our customers gain the confidence that LLM applications will behave as expected over time, in various contexts, and through model version changes and fine-tuning adjustments.
Join the responsible LLM revolution!
We’ve made LLM Observability as easy as using OpenAI’s ChatGPT or a Hugging Face model. If you are an AI practitioner working with LLMs, learn more about LangKit or schedule a demo with us!
We love feedback, questions, and suggestions - join us in the WhyLabs community Slack.
Since the release of this blog, we have officially launched LangKit - please visit our announcement to learn more.
Other posts
Best Practicies for Monitoring and Securing RAG Systems in Production
Oct 8, 2024
- Retrival-Augmented Generation (RAG)
- LLM Security
- Generative AI
- ML Monitoring
- LangKit
How to Evaluate and Improve RAG Applications for Safe Production Deployment
Jul 17, 2024
- AI Observability
- LLMs
- LLM Security
- LangKit
- RAG
- Open Source
WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control
Jun 2, 2024
- AI Observability
- Generative AI
- Integrations
- LLM Security
- LLMs
- Partnerships
OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety
May 21, 2024
- LLMs
- LLM Security
- Generative AI
7 Ways to Evaluate and Monitor LLMs
May 13, 2024
- LLMs
- Generative AI
How to Distinguish User Behavior and Data Drift in LLMs
May 7, 2024
- LLMs
- Generative AI