Learning Center/LLM Security and Safety/Lesson 2

Best Practices for Ensuring LLM Safety

Introduction/overview

Key ideas

Secure your LLMs by covering their entire lifecycle with practices like encryption, differential privacy, and regular audits to protect data and ensure model integrity.
Combat prompt injections and data poisoning through specific strategies such as advanced filtering and contextual awareness. Also, implement robust data validation and model fine-tuning.
Improve LLM robustness by engaging with experts across disciplines, staying updated on AI security trends, and adhering to ethical AI guidelines.

In the last lesson, we covered why you need to secure LLM against common types of threats and attacks and consequently learned about five types of those attacks. This lesson will teach you the five best practices and actions to mitigate those attacks.

Mitigation strategies for prompt injection attacks

So why do you want to mitigate prompt injection attacks? Mitigating prompt injection is crucial because such attacks can manipulate LLMs to perform unintended actions or disclose sensitive information.

The fundamental challenge is distinguishing between control and data within LLM prompts, which prompt injection exploits. You need vigilant monitoring and mitigation strategies to safeguard against these vulnerabilities.

Here are the best practices you should take to mitigate prompt injection attacks:

1. Frequent audits and updates: Conduct regular and comprehensive evaluations of model behavior to identify vulnerabilities. During updates, integrate the latest ethical AI developments to enhance resilience against new threats. Define audit frameworks and update cycles, referencing industry benchmarks for best practices.

Example: For an LLM used to generate financial reports, conduct quarterly audits to review the model's output for any anomalies or signs of tampering. Integrate the latest AI ethics research findings to update the model's response mechanisms, ensuring it remains resilient against evolving prompt injection techniques.

2. Advanced content filtering: Implement robust filtering techniques to detect and neutralize malicious prompts, paying particular attention to nuanced manipulations. Explore document filtering technologies, like natural language understanding enhancements, that can improve detection rates.

Example: In an educational LLM application, implement a filtering system that scans incoming prompts for patterns indicative of injection attacks, such as unexpected command structures or out-of-context requests. Utilize NLU technologies to enhance the model's ability to discern and block these malicious inputs without hindering legitimate educational queries.

3. Enhanced contextual awareness: When possible, equip LLM applications to accurately discern context and intent within prompts. This helps these systems to more accurately reject attempts to exploit vulnerabilities.

Example: For a customer service LLM, develop capabilities to understand the context and likely intent behind user inputs. If a prompt seems designed to navigate the conversation toward revealing confidential information or executing unauthorized actions, the system should flag it for review or automatically reject it.

4. Collaborative research: Partner with researchers and the broader AI community to understand and counter misuse tactics. Invest in ongoing research by publicly discussing their applied use cases and challenges, contributing toward proactive defense research in the future.

Example: A healthcare LLM developer collaborates with cybersecurity researchers to study prompt injection tactics that could potentially lead to the disclosure of sensitive patient information. Through this partnership, they develop a new defense mechanism that improves the model's ability to detect and thwart such attacks.

5. Privilege control: Restrict model access or functionalities in high-risk scenarios. This ensures better oversight and reduces opportunities for abuse with role-based permissions.

Example: An LLM for generating legal documents implements privilege control to restrict access to certain sensitive document generation features based on the user's role and authentication level. For example, only verified legal professionals can prompt the LLM for documents containing sensitive legal advice to minimize the risk of misuse through prompt injection by unauthorized users.

Mitigation strategies for data and model poisoning

To prevent our LLMs from propagating biased, incorrect, or malicious content, consider implementing several critical strategies:

1. Data validation and cleaning: Implement rigorous vetting and input filters for training data. Use statistical outlier and anomaly detection to identify and eliminate adversarial inputs for safeguarding the fine-tuning process.

Example: Before training an LLM intended for sentiment analysis, a media company processes the dataset through a pipeline that automatically flags and removes entries with statistically anomalous sentiment scores or inconsistent tagging. This is designed to catch and eliminate attempts to skew the model's understanding of certain topics.

2. Human involvement in oversight: Engage domain experts to review training datasets and model outputs, using their expertise to uncover and correct subtle biases or inaccuracies.

Example: In a healthcare LLM, domain experts like medical professionals could review the training data and the model's diagnostic suggestions to ensure medical accuracy and the absence of biased information. They could verify that the model does not propagate stereotypes or outdated medical practices.

3. Use-case specific training: Design or fine-tune models specifically for their application contexts using distinct training datasets to improve the accuracy and relevance of the outputs.

Example: An LLM developed for educational purposes is fine-tuned with separate datasets for different subjects (e.g., mathematics, history, science) to ensure it provides accurate and relevant information within each domain. This specialization allows the model to generate more precise content tailored to the specific needs of learners in each subject area.

4. Sandboxing models: Enforce stringent sandboxing with network controls to prevent LLM from accessing unapproved data sources during inference that could affect the response quality.

Example: If an LLM is used for generating programming code snippets, sandboxing can restrict the model's internet access to prevent it from fetching potentially malicious code from unverified sources. Network controls could ensure that the LLM only accesses vetted databases or repositories.

5. Red-teaming: Integrate red team exercises and vulnerability assessments into the LLM's testing phase to proactively identify and mitigate potential security vulnerabilities.

Example: To test the security of an LLM used in financial forecasting, a red team might attempt to inject biased economic data into the system to see if it can influence its predictions. The red team can be a human-in-the-loop or a second language model that is testing the LLM for harmful outputs.

Mitigation strategies for model theft

Mitigating model theft is essential to protecting LLMs' intellectual property and commercial value, so investment in developing them is safeguarded. You want to maintain the integrity and confidentiality of proprietary data and algorithms to uphold the competitive advantage and trustworthiness of the technology.

Here are strategies to mitigate model theft:

1. API usage restrictions and monitoring: Implement API call rate-limiting and conduct detailed monitoring to identify abnormal access patterns or excessive querying, which could indicate attempts at reverse engineering or unauthorized data scraping.

Example: A cloud-based LLM service for natural language processing (NLP) tasks implements rate limits of 1000 requests per hour per user. The service uses anomaly detection to monitor for patterns, such as sudden spikes in requests or unusual sequences of API calls, indicative of an attempt to systematically download or reverse-engineer the model.

2. Legal safeguards: Implement a thorough legal strategy that uses copyrights, patents, and trade secrets to protect against unauthorized use, as well as proactive enforcement and explicit terms of service against reverse engineering.

Example: A company that develops proprietary LLMs for language translation embeds clauses in its user agreements explicitly prohibiting reverse engineering, redistributing, or creating derivative works without permission. It also registers copyrights and patents covering unique aspects of its model architecture and training methodology to legally challenge any unauthorized use.

3. Central model repository: Establish a centralized ML Model Registry to enforce stringent access controls, enable detailed authentication, and provide comprehensive monitoring and logging for governance and risk management.

Example: An enterprise hosting multiple LLMs for various internal applications centralizes all models within a secure model registry. Access to this registry requires two-factor authentication, and each access attempt is logged. The registry controls which users or applications can access specific models based on predefined roles and permissions, ensuring only authorized usage.

4. Watermarking outputs: Integrate non-intrusive, identifiable watermarks into model outputs to trace and deter unauthorized usage and reproduction of the model's outputs.

Example: A generative LLM used for creating marketing content embeds subtle, unique patterns in its text outputs (e.g., specific word choices or punctuation patterns) that act as a digital watermark. This enables the company to identify and trace back unauthorized reproductions or uses of its generated content, even if it has been slightly modified.

5. Robust access controls and data encryption: To protect LLM model repositories and training environments from unauthorized access, use advanced access control mechanisms and data encryption. This will protect against threats from both insiders and outsiders.

Example: A financial analysis LLM stored in a cloud environment is protected by role-based access control (RBAC) systems, where only financial analysts and certain data scientists can run queries or access model outputs. All data exchanged with the LLM, including training data, queries, and outputs, is encrypted using AES-256 encryption, so even if an unauthorized party gains physical access to the storage medium, the data remains unintelligible.

Mitigation strategies for sensitive information disclosure

This one is crucial to maintaining user trust and compliance with privacy regulations so their personal and confidential data are not inadvertently exposed or misused.

Mitigating this attack reduces the risk of reputational damage and legal consequences associated with data breaches or misuse.

Here are key strategies for mitigation:

1. Data anonymization and sanitization: Rigorously inspect training datasets to eliminate or anonymize any sensitive information to ensure that the data cannot be traced back to individuals or confidential sources.

Example: In a healthcare LLM, before using patient records to train the model, remove all personally identifiable information (PII), such as names, addresses, and social security numbers, or replace them with pseudonyms. Additionally, obscure or generalize direct references to specific medical facilities or practitioners to prevent any back-tracing.

2. Model regularization: Implement regularization methods to prevent the model from overfitting to particular data points to reduce the likelihood that the model will reproduce specific sensitive information in its outputs.

Example: Apply regularization techniques during training for a financial advice LLM to prevent the model from learning details too specific to individual financial transactions or profiles. For instance, L2 regularization can discourage the model from relying heavily on any single or small set of features that could represent sensitive user-specific information.

3. Differential privacy: Incorporate differential privacy during model training and fine-tuning to add noise to the computations. This should ensure that the outputs are insensitive to changes in any individual's data.

Example: When a store fine-tunes its recommendation LLM with a customer's purchase history, use differential privacy techniques to ensure that adding or removing a single user's data does not affect the model's overall output. This way, it is impossible to determine what each user likes or how they act.

4. Input validation and filtering: Use strict input validation and sanitization steps to keep potentially harmful or private data from being added. This will guard against the accidental processing of sensitive information or hacking of the model.

Example: For an LLM you deploy in a public feedback analysis system, use stringent input validation to reject any submission that contains potentially sensitive personal data, like credit card numbers or health information. This way, such data never enters the model training or inference processes.

5. Output monitoring and filtering: Monitor and filter the model's outputs systematically to detect and prevent the dissemination of sensitive data, using automated systems and manual review to ensure data privacy.

Example: Monitor the outputs of an LLM that generates patient treatment plans based on clinical data for any unintentional disclosure of sensitive information. If the system inadvertently suggests a treatment plan that includes specific patient identifiers or histories, ensure the outputs are flagged and reviewed by a clinician before being communicated or acted upon. Automated filters can catch known sensitive data patterns, while manual review can address more nuanced or context-specific instances of potential disclosure.

Mitigation strategies for excessive agency

This is necessary to limit the autonomy of LLMs. You want to ensure they operate within their intended ethical and operational boundaries to maintain user trust and compliance with regulatory standards.

Here’s how you could prevent excessive agency:

1. Secure user authorization tracking: Implement robust mechanisms for tracking and validating user authorizations. Ensure LLM actions are confined to the correct user context and minimal privilege level.

Example: When an LLM accesses a document editing service on behalf of a user, the system should log the user's consent, the specific documents accessed, and the type of actions performed. This tracking ensures all actions can be audited and traced back to a legitimate request, maintaining accountability.

2. Include human oversight: Add human approval steps to LLM workflows or systems that come after to check and allow actions. This way, you can get the benefits of automation while also keeping an eye on things to ensure no one is doing anything without permission.

Example: In a scenario where an LLM drafts email responses, a human-in-the-loop system could require that all outbound messages be reviewed and approved by a human operator before sending. This step ensures that all communications are appropriate and accurate, reducing the risk of sending out incorrect or sensitive information.

3. Authorize actions in downstream systems:

Ensure all actions by LLM agents are validated at the system level with robust authorization checks against defined security policies to uphold the integrity of security postures throughout the ecosystem.

Example: If an LLM interacts with a file storage system to retrieve documents, ensure the system checks the LLM's credentials and the user's permissions before allowing access to each file. This prevents the LLM from accessing files beyond the scope of the user's consent or its operational needs to enforce security policies at every interaction point.

4. Minimize plugin/tool functions: Design LLM plugins and tools with focused functionalities, avoiding any non-essential ones. Routinely audit their scope to enforce the principle of least privilege.

Example: For a plugin designed to provide weather updates, restrict its functionality solely to fetching and reporting weather information. Prevent it from accessing or executing unrelated commands, like sending emails or accessing other web services, which it doesn't need for its primary function.

5. Design specific functional tools: Favor creating tools with specific, limited functions over open-ended capabilities to minimize security risks and prevent misuse through broad functionalities.

Example: Instead of a generic plugin that executes any SQL query, provide a tailored plugin that can only retrieve user-specific data from a database based on predefined query templates. This limitation ensures the LLM can only access data it's authorized to view and cannot execute arbitrary, potentially harmful queries.

That’s it! We hope you can apply one or more of these techniques to your LLMs or to secure your LLM APIs.

In the next lesson, you will learn how to test and set up your LLMs practically to detect jailbreaks and prompt injections with our open-source toolkit, LangKit.

Best Practices for Ensuring LLM Safety

Introduction/overview

Mitigation strategies for prompt injection attacks

Mitigation strategies for data and model poisoning

Mitigation strategies for model theft

Mitigation strategies for sensitive information disclosure

Mitigation strategies for excessive agency

Recommended resources

About

Resources

whylogs

WhyLabs