WhyLabs AI Control Center (also known as the WhyLabs Platform) is now an open source project!

Kelsey Olmeim

Jan 15, 2024

Back to Blog

Best Practices for Monitoring Large Language Models

LLMs
Generative AI
LLM Security
LangKit

Kelsey Olmeim

Jan 15, 2024

Large Language Models (LLMs) are increasingly powering natural language processing (NLP) applications, from chatbots to content generators. However, as these models grow in complexity and usage, so does the challenge of monitoring their performance and ensuring ethical and safe application. It's no longer sufficient to rely solely on traditional methods like embeddings; you need more nuanced approaches to keep pace with the advancements.

This article will discuss five🖐️ best practices for monitoring LLMs. It explores:

Essential metrics to track
How to set up robust alerting systems
Strategies to ensure that your monitoring scales efficiently alongside your models
Testing your LLM applications against jailbreaks and adversarial attacks
Ensuring high data integrity and input

If you are in haste, here’s the TL; DR of the best practices and their applications:

Practice	Key takeaway	Application
Choose the right LLM monitoring metrics	Select metrics that reflect the comprehensive capabilities and impacts of LLMs.	Incorporate a mix of intrinsic, extrinsic, and human evaluation metrics tailored to your LLM's specific use cases.
Set up effective alerting systems	Design alerting systems that are responsive and precise for early detection and mitigation of issues.	Define clear thresholds, monitoring frequencies, and escalation paths for each identified metric.
Ensure reliability and scalability	Ensure your LLM monitoring scales and adapts to handle evolving traffic demands.	Monitor performance metrics, automate processes, and use cloud-based solutions for flexible scaling.
Run adversarial tests	Regularly challenge your LLM to identify and strengthen vulnerabilities.	Conduct token manipulation, gradient-based, and jailbreak prompting attacks, and consider red-teaming techniques.
Data integrity and model input	Maintain strict standards for data quality and model inputs to ensure trustworthy LLM outputs.	Perform data quality checks, bias detection, and input validation while monitoring for anomalies and drift.

Phew! Now that you've breezed through the essentials of this article, let’s delve deeper into these practices. By the end of this discussion, you'll have a comprehensive understanding of the modern landscape of LLM monitoring and how to implement these practices effectively.

Get started monitoring LLMs with the open-source text metrics toolkit, LangKit, from this quick start notebook that implements most of these best practices.

Practice 1: Choosing the right LLM monitoring metrics

Selecting the right metrics to evaluate an LLM is a more intricate process than that for traditional ML models. Given the wide-ranging capabilities of LLMs and the large, diverse datasets they handle, your approach to measurement must be correspondingly expansive. Effective LLM evaluation requires a combination of intrinsic metrics like word prediction accuracy and perplexity and extrinsic metrics that assess the practical outcomes of the model's interactions.

Key considerations for LLM metrics include:

Quality:
- Assess the clarity, coherence, and relevance of prompts and responses. Monitor for drift in expected prompts or changes in response quality over time.
Relevance:
- Ensure the model's responses are relevant to the user queries (prompts) and align with the intended application.
Sentiment:
- Evaluate the tone and sentiment of responses, particularly in relation to the input prompt's sentiment and desired model behavior.
Security:
- Be vigilant about adversarial inputs, jailbreaks (prompt injections), or leakages that might lead to unintended or harmful responses.

While traditional metrics like accuracy, precision, and recall are foundational, LLMs demand additional, nuanced metrics due to their ability to generate creative and contextually rich responses. Task-specific metrics such as BLEU for translation or ROUGE for text similarity are useful but partial measures. Human-in-the-loop evaluation remains crucial for assessing understanding, tone, and sensitive content handling for a more rounded and realistic evaluation.

The complexity of LLMs, coupled with the volume of training datasets, potential biases, and interpretability challenges, makes evaluation and monitoring sophisticated tasks. By adopting a nuanced set of metrics encompassing quality, relevance, sentiment, and security, you can better understand LLM behaviors and effectively mitigate risks.

Tools like WhyLabs' open-source LangKit offer a streamlined approach to extracting and monitoring these metrics for easier oversight and collaboration across teams without extensive infrastructure.

Practice 2: Set up effective alerting and feedback systems

Establishing robust alerting systems is vital for maintaining the health and integrity of Large Language Models (LLMs). Once you've identified the key behavioral metrics, such as prompt quality, relevance, sentiment, and potential toxicity, the next step is to construct a responsive and precise alerting mechanism. Here are the critical elements:

Thresholds:
- Define specific thresholds for each metric to trigger alerts. For instance, an alert might be triggered if toxicity levels or the similarity of inputs to known malicious prompts exceed a safe threshold. These thresholds should be dynamic, adapting to the evolving nature of LLM interactions and applications.
Frequency:
- Metrics vary in criticality; make the frequency of checks align with their importance. More sensitive or high-impact metrics (e.g., jailbreak similarity) might require real-time or near-real-time monitoring and alerting, while others might need periodic review.
Escalation:
- Develop an escalation protocol to ensure alerts reach the appropriate team members or systems swiftly and efficiently. This might include a tiered alerting process or integration with incident management platforms.

Effective alerting is not just about detecting issues but also about driving continuous improvement. Incorporating techniques like Reinforcement Learning from Human Feedback (RLHF) can refine the LLM's behavior based on identified issues, gradually steering the model toward more desirable outputs.

Similarly, you can leverage Retrieval Augmented Generation (RAG) to improve the LLM's understanding and response accuracy by pulling in additional contextual information.

Proactive alerting and response mechanisms contribute to a cycle of monitoring, feedback, and improvement, especially when integrated with a continuous integration, deployment (C/CD), and training (or fine-tuning) pipeline. This ensures that the LLM is consistently tuned and improved based on real-world performance and feedback.

By automating alerting and incorporating these insights into the model's training and operational pipeline, developers and operators can ensure their LLMs deliver not just accurate and relevant responses but do so safely and ethically.

Practice 3: Ensure reliability and scalability with monitoring

Ensuring that your LLM monitoring practices are reliable and scalable is crucial to your AI systems' long-term success and efficiency. Here's how you can achieve this:

Operational metrics monitoring:
- Apart from monitoring functional metrics for the LLMs that evaluate performance and robustness, focus on the operational metrics, such as request latency, throughput, error rates, and resource utilization. Understanding these metrics allows for proactive scaling and load balancing, which is especially important in handling the large, unpredictable workloads characteristic of LLMs.
Automation:
- Leverage automation to conduct regular, consistent monitoring. Use scripts or workflows that can dynamically evaluate metrics, generate alerts, and even initiate corrective actions with little to no human intervention. This reduces the delay in response and minimizes human error.
Cloud-based solutions:
- Implement cloud-based platforms for their inherent scalability and flexibility. These platforms can dynamically allocate resources based on the LLM's needs, handling spikes in compute demand.
Audit the monitoring systems:
- Regularly audit and test your monitoring systems to ensure they are functioning correctly. This includes verifying that alerts are accurate and timely and that the system adapts appropriately to changes in LLM behavior.

As you transition LLMs from development to production, prioritize strategies supporting growth and efficiency. Use acceleration libraries like TensorRT and model optimization techniques to scale performance without compromising the model's accuracy or increasing operational costs. Adopting CI/CD frameworks promotes continuous learning and deployment, ensuring your LLMs are always up-to-date and performing optimally.

Practice 4: Run adversarial tests

Monitoring LLMs in production is a critical task for organizations that want to ensure the reliability, safety, and effectiveness of their NLP operations. By following the best practices outlined in this post, you can help ensure that your LLM monitoring practices are both effective and scalable, and that your organization is well-positioned to take advantage of the many benefits that LLMs can offer.

Adversarial testing is critical for model robustness and the safety of LLMs by actively identifying and mitigating potential vulnerabilities. It involves a range of techniques designed to challenge the model in a controlled manner, revealing the LLM weaknesses and vulnerabilities that can then be addressed. How to implement adversarial testing effectively:

Token manipulation

Involves altering key elements of the input prompts to mislead the LLM. The methods include:

Word substitution:
- Replace pivotal words with synonyms or antonyms to alter the input's perceived meaning so the model can generate inappropriate or nonsensical responses.
- For example, replacing "environmentalist" with "polluter" in an article might misrepresent the environmental issues discussed.
Paraphrasing:
- Slightly rephrase the input to change its meaning and evoke undesirable responses subtly.
- For instance, changing "create a presentation on renewable energy" to "create a presentation on the drawbacks of renewable energy" could alter the intended focus and message.
Inserting triggers:
- Add specific keywords or phrases known to exploit vulnerabilities in the LLM's training data.
- For example, including "notorious failures" would change "Provide an overview of achievements of famous scientists in history” to "Provide an overview of achievements of famous scientists in history, focusing on the notorious failures," which could skew a neutral or positive output toward a negative or controversial angle, potentially focusing on exaggerated or false failures of otherwise respected scientists.

Gradient-based attack

Involves using optimization algorithms to modify the input to elicit harmful outputs iteratively. Techniques include:

Iterative optimization:
- Refining the input in each iteration to guide the LLM towards generating biased or malicious content by carefully crafting a prompt.
Backpropagation with adversarial loss:
- Adjusting the input based on calculated gradients to produce specific, misleading outputs.
- For example, tweaking the wording in a news article can deceive the LLM into confidently developing misinformation or biased content.

Jailbreak prompting

Aiming to bypass the LLM's safety filters or constraints through:

Meta-prompting:
- Set a broad context with one prompt, then subtly manipulate subsequent prompts within this framework.
- This method's main advantage lies in its ability to target multiple vulnerabilities of the LLM in a controlled environment.
Prompt chaining:
- Use a sequence of prompts, each exploiting a different loophole in the LLM. The goal is to trick the LLM into gradually revealing sensitive information or generating inappropriate content, like a set of master keys unlocking different aspects of the LLM's vulnerability.

Learn how to navigate adversarial threats by detecting LLM prompt injections and jailbreaks in this blog post by Felipe Adachi.

Human and model red-teaming

Human experts:
- Use professionals to test the LLM's capacity against phishing attacks by devising complex scenarios to uncover vulnerabilities.
Adversarial AI models:
- Train other LLMs to probe and identify weaknesses in the target LLM. For instance, training an LLM specialized in generating misleading medical diagnoses can help test a medical diagnostic LLM.

Adversarial testing methods identify weaknesses and contribute to developing and hardening LLMs against potential threats. Pairing adversarial testing with comprehensive monitoring is essential to ensuring a robust defense strategy. Through these methods, LLMs can be better prepared to handle real-world scenarios and adapt to emerging threats, ensuring their safe and effective deployment.

Practice 5: Data Integrity and Model Input

Maintaining data integrity and securing model input is critical for the reliable and ethical functioning of Large Language Models (LLMs). As the quality and composition of training data directly impact LLM performance, it's imperative to establish rigorous standards and practices for data handling.

Data integrity considerations

Data quality checks:
- Regularly validate data for accuracy, consistency, and relevance. Inconsistent or misleading data can significantly affect the LLM's training and output quality.
Bias detection and mitigation:
- Identify and mitigate biases in training data to ensure fair and unbiased LLM outputs.
Data lineage tracking:
- Track the origins and transformations of data to maintain transparency in the fine-tuning process, compliance with legal standards and regulations, and troubleshooting production results.

Monitoring for data issues live

Anomaly detection:
- Use statistical methods to identify unusual patterns in LLM outputs, which may signal issues in the underlying data.
- For example, an LLM consistently producing factually inaccurate or implausible statements might show anomalies in its factual knowledge base, requiring investigation and correction.
Drift monitoring
- Regularly assess if the LLM's training data reflects current real-world scenarios and language use, particularly as languages evolve.
Data profiling:
- Analyze the metadata and characteristics of your data to uncover any underlying inconsistencies, biases, or quality issues.

How does data profiling compare to sampling, and why is it a superior technique if you are logging data metrics and stats at scale? Check out our in-depth blog post by Isaac Backus and Bernease Herman.

Safeguarding model inputs

Input validation: Implement robust checks to filter out harmful, irrelevant, or malicious inputs to maintain the integrity of data fed into the LLM.
Contextual understanding: Train the LLM to use context data alongside prompts to improve interpretation and prevent misinterpretations.
Human oversight: Establish mechanisms for reviewing inputs and outputs, especially in sensitive or critical applications, to ensure responsible usage and early detection of anomalies or ethical issues.

Data integrity and the quality of model inputs are inseparable from the overall performance of LLMs. By investing in robust tools and strategies for data and input monitoring, you solidify the foundation of your LLM, ensuring its reliability and ethical operation. These practices are safeguards and enablers of successful, responsible LLM deployment.

Next Steps: Safeguard your Large Language Models (LLMs) with LangKit

In collaboration with leading AI teams, we've crafted LangKit, an open-source text metrics toolkit designed to improve the safety and reliability of your Large Language Models. Identify and mitigate risks, including malicious prompts, sensitive data leakage, toxic responses, hallucinations, and jailbreak attempts across any LLM.

Key features of LangKit include:

Evaluation: Continually validate your LLM's responses to ensure consistency and accuracy, adapting seamlessly to changes in prompts or model updates.
Guardrails: Implement real-time controls to manage and filter appropriate prompts and responses tailored to your specific LLM applications.
Observability: Gain insights into your LLM's operation by tracking prompts and responses with telemetry data to monitor, maintain, and improve performance over time.

Get started with LangKit through a quick start notebook that implements most of the best practices you have learned from this article.

Kelsey Olmeim

Understanding and Implementing the NIST AI Risk Management Framework (RMF) with WhyLabs

Rich Young

Dec 10, 2024

Learn how the NIST AI Risk Management Framework (RMF) guides AI security and governance and discover how WhyLabs guardrails can help implement and manage AI risks effectively.

Read post

AI risk management
AI Observability
AI security
NIST RMF implementation
AI compliance
AI risk mitigation

Best Practicies for Monitoring and Securing RAG Systems in Production

Rich Young

Oct 8, 2024

Retrieval-augmented generation (RAG) systems combine advanced retrieval techniques with large language models (LLMs) to improve the responses they generate...

Read post

Retrival-Augmented Generation (RAG)
LLM Security
Generative AI
ML Monitoring
LangKit

How to Evaluate and Improve RAG Applications for Safe Production Deployment

Rich Young

Jul 17, 2024

Learn how to evaluate and improve RAG applications using LangKit and WhyLabs AI Control Center. Develop secure and reliable RAG applications.

Read post

AI Observability
LLMs
LLM Security
LangKit
RAG
Open Source

WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control

WhyLabs Team

Jun 2, 2024

With WhyLabs and NVIDIA NIM, enterprises can accelerate GenAI application deployment and help ensure the safety of end-user experiences WhyLabs has been on a mission to empower enterprises with tools that ensure safe and responsible AI adoption. With its integration with NVIDIA NIM inference microservices, WhyLabs is helping make responsible AI adoption more accessible. Customers can now maintain better security and control of GenAI applications with self-hosted deployment of the most powerfu

Read post

AI Observability
Generative AI
Integrations
LLM Security
LLMs
Partnerships

OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety

Alessya Visnjic

May 21, 2024

Discover strategies for safeguarding your large language models (LLMs). Learn how to protect your AI technologies effectively based on OWASP's top 10 security tips.

Read post

LLMs
LLM Security
Generative AI

7 Ways to Evaluate and Monitor LLMs

WhyLabs Team

May 13, 2024

Learn about 7 techniques for evaluating & monitoring LLMs, including LLM-as-a-Judge, ML-model-as-a-Judge, and embedding-as-a-source. Improve your understanding of LLMs with these strategies.

Read post

LLMs
Generative AI

How to Distinguish User Behavior and Data Drift in LLMs

Bernease Herman

May 7, 2024

Large Language Models (LLMs) rarely provide consistent responses for the same prompts over time. In this blog we’ll demonstrate how identify and monitor data changes using a few common scenarios.

Read post

LLMs
Generative AI

Run AI with Certainty

Book a demo

Best Practices for Monitoring Large Language Models

Practice 1: Choosing the right LLM monitoring metrics

Practice 2: Set up effective alerting and feedback systems

Practice 3: Ensure reliability and scalability with monitoring

Practice 4: Run adversarial tests

Token manipulation

Gradient-based attack

Jailbreak prompting

Human and model red-teaming

Practice 5: Data Integrity and Model Input

Data integrity considerations

Monitoring for data issues live

Safeguarding model inputs

Next Steps: Safeguard your Large Language Models (LLMs) with LangKit

Other posts

Understanding and Implementing the NIST AI Risk Management Framework (RMF) with WhyLabs

Best Practicies for Monitoring and Securing RAG Systems in Production

How to Evaluate and Improve RAG Applications for Safe Production Deployment

WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control

OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety

7 Ways to Evaluate and Monitor LLMs

How to Distinguish User Behavior and Data Drift in LLMs

Run AI with Certainty

About

Resources

whylogs

WhyLabs