Get Early Access to the First Purpose-Built Monitoring Solution for LLMs
- WhyLabs
- AI Observability
May 11, 2023
The future is now! Today, WhyLabs is announcing The Language Toolkit, or LangKit for short, the first purpose-built monitoring solution for LLMs. Join the private beta waitlist to make your LLM applications safe and responsible. Here is why…
The leap forward
Over the past year, generative AI has unlocked experiences that were previously the realm of science fiction. This newfound ability to generate text, images, voice, and code with remarkable fidelity has opened up a wealth of possibilities for businesses across industries. The biggest wave of innovation is powered by LLMs, which are transforming the landscape of AI applications, from genuinely helpful chatbots to nearly autonomous code generation tools.
LLMs in production - what can possibly go wrong?
The saying “with great power comes great responsibility” has never been more true. As enthusiasm builds around the extraordinary capabilities of LLMs and organizations incorporate them into customer-facing applications, new risks have emerged: failure modes, biases, security loopholes, and embarrassing customer experiences.
To mitigate these risks, it’s essential for teams to monitor their LLM applications continuously and operate them responsibly. WhyLabs has been collaborating with industry-leading teams to mitigate these risks through observability; our customers have been running LLM-powered experiences in production since the release of GPT-3. LangKit distills the insights and experience we have gained working with LLMs since 2021 into a purpose-built observability solution.
The unique challenges of monitoring LLMs
The adoption of proprietary foundation LLMs presents unique observability challenges. Fine-tuning and retraining bring benefits, but they also cause models to deviate from their initial versions. Users typically have little access to the vast original training data, which leads to reduced accuracy, unpredictable behavior, and hallucinations, along with a poor understanding of the model's inherent characteristics.
Today, there is no standard for evaluating the performance of LLMs. AI practitioners commonly rely on two approaches, test prompts and text embeddings:
- Using test prompts requires manually generating and evaluating prompts, a set that quickly grows in size and complexity in production. Because this approach doesn’t use actual outcomes or user feedback, it fails to offer insight into how the model behaves over time or how effective its production settings are. Test prompts fall short at problem detection, break down at scale, and can become cost prohibitive.
- Using text embeddings leverages the embeddings computed from the prompts sent to the LLM and the responses it returns. Measuring these over time can provide useful signals, but it often requires manual investigation or inefficient debugging methods, like wandering through t-SNE or UMAP visualizations. On its own this approach is insufficient, because the embeddings lose the context of the original input, making model performance harder to evaluate.
WhyLabs: Making LLM monitoring easy
To tackle the challenges of monitoring and running LLMs in production, WhyLabs built LangKit. Whether you’re looking to understand interactions with a model API such as OpenAI’s GPT or an in-house model fine-tuned to a unique task, LangKit has you covered.
With LangKit, you can continuously evaluate LLM-powered applications using essential metrics extracted from prompts, responses, and user interactions. This method scales to a massive number of users without sampling down the interactions for analysis, so as your LLM application scales, so does your observability solution. The WhyLabs Observability Platform then allows individual inspection of the resulting metrics as well as systematic monitoring for anomalies and outliers across a large number of interactions. Using this scalable and efficient technique, you’ll be able to detect issues across the areas below (illustrated in the sketch after this list):
- Quality - Are your prompts and responses high quality (readable, understandable, well written)? Are you seeing a drift in the types of prompts you expect or a concept drift in how your model is responding?
- Sentiment - Is your LLM responding in the right tone? Are your upstream prompts changing their sentiment suddenly or over time? Are you seeing a divergence from your anticipated topics?
- Governance - Is the information your LLM receives in line with your internal governance and compliance policies, such as those covering the use of PII or PHI?
- Security - Is your LLM receiving adversarial attempts or malicious prompt injections? Are you experiencing prompt leakage?
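As a rough illustration of the per-interaction metrics described above, the sketch below computes a few quality, sentiment, and PII-style signals from a prompt/response pair. It is not LangKit's implementation; the library choices (textstat, vaderSentiment), the regex, and the metric names are assumptions made for the example.

```python
# Illustrative only: a few of the metric families described above (quality,
# sentiment, a crude PII check), computed per prompt/response pair.
# NOT LangKit's actual implementation - libraries and names are assumptions.
import re

import textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # stand-in for real PII checks

def extract_metrics(prompt: str, response: str) -> dict:
    """Return per-interaction metrics; only these values (never the raw text)
    would need to be aggregated and monitored."""
    return {
        "prompt.reading_ease": textstat.flesch_reading_ease(prompt),
        "response.reading_ease": textstat.flesch_reading_ease(response),
        "prompt.sentiment": analyzer.polarity_scores(prompt)["compound"],
        "response.sentiment": analyzer.polarity_scores(response)["compound"],
        "prompt.has_email": bool(EMAIL_RE.search(prompt)),
        "response.char_count": len(response),
    }

print(extract_metrics(
    "Summarize our refund policy for a customer.",
    "Refunds are available within 30 days of purchase.",
))
```

Tracking the distributions of signals like these over time is what makes drift, tone changes, and policy violations visible without inspecting individual conversations.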
Secure observability at scale
Distinct from other LLM monitoring tools, WhyLabs doesn’t store user input or model output, which can be particularly sensitive for LLMs. Instead, LangKit gathers the essential metrics about prompt, response, and user interaction data as it flows through the model. The WhyLabs Observability Platform evaluates and monitors these metrics, while you remain fully in control of the underlying data.
This approach effortlessly scales from recording just a handful of sample prompts and responses to handling millions of LLM interactions per hour. You can monitor your LLM-based products confidently, knowing that the same solution is effective during both the initial LLM MVP development and full-scale production deployment. By using LangKit, our customers gain the confidence that LLM applications will behave as expected over time, in various contexts, and through model version changes and fine-tuning adjustments.
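To make the metrics-only flow concrete, here is a minimal sketch that assumes whylogs, WhyLabs' open-source data logging library, as the collection layer: the raw prompt and response text stays inside your application, and only an aggregated statistical profile would be sent on for monitoring. The column names carry over from the earlier sketch and are purely illustrative.

```python
# Minimal sketch, assuming whylogs as the collection layer: we profile the
# extracted metrics rather than the raw prompt/response text, so only
# aggregate statistics would ever leave the application.
import whylogs as why

metrics = {
    "prompt.sentiment": 0.42,       # illustrative values, as produced by a
    "response.reading_ease": 71.8,  # metric extractor like the earlier sketch
    "response.char_count": 54.0,
}

result = why.log(metrics)                 # builds a statistical profile
profile_df = result.profile().view().to_pandas()
print(profile_df[["counts/n", "distribution/mean"]])
```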
Join the responsible LLM revolution!
We’ve made LLM observability as easy as using OpenAI’s ChatGPT or a Hugging Face model. If you are an AI practitioner working with LLMs, sign up for early access to the LangKit private beta!
We love feedback, questions, and suggestions - join us in the WhyLabs community Slack.