SecurityFebruary 18, 202611 min read

LLM Safety: Risks, Categories, and How to Mitigate Them

A complete guide to LLM safety covering the main risk categories, real-world failures, and practical mitigation strategies for production AI applications.

Jack Lillie

Founder

LLM safetyAI safetyAI riskprompt injectioncontent moderationLLM security

LLM safety is the practice of preventing language models from causing harm. That sounds simple, but "harm" covers a lot of ground: generating toxic content, leaking personal data, following malicious instructions, producing dangerous misinformation, enabling illegal activity. Each category requires different detection techniques and different mitigations.

The challenge is that LLMs are probabilistic. You can't write unit tests for every possible output. You can't predict every way a user might try to abuse the system. And the attack surface changes every time a provider updates their model. The NIST AI Risk Management Framework identifies these as core challenges in AI governance, calling for continuous monitoring and adaptive risk controls rather than one-time assessments.

This guide covers the main LLM safety risk categories, real examples of each, and the mitigation strategies that actually work in production.

The Four Categories of LLM Safety Risk

Most LLM safety issues fall into four categories. Understanding these categories helps you prioritize defenses and allocate resources.

1. Prompt Attacks

Prompt attacks are attempts to manipulate the model into ignoring its instructions. There are two main variants:

Direct prompt injection. The attacker includes malicious instructions in their message, trying to override the system prompt. Classic examples include "ignore previous instructions" and role-playing attacks like DAN ("Do Anything Now"). These attacks exploit the fact that LLMs can't reliably distinguish between system instructions and user input. The OWASP Top 10 for LLM Applications ranks prompt injection as the #1 risk in LLM applications, underscoring just how fundamental this threat is.

Indirect prompt injection. The malicious instructions come from external data sources rather than the user directly. For example, an attacker plants instructions in a web page that your RAG pipeline retrieves, or embeds them in a document that your AI assistant processes. We covered this attack vector in depth in understanding indirect prompt injection.

Prompt attacks are particularly dangerous because they can chain with other vulnerabilities. A successful injection might instruct the model to output its system prompt (prompt leaking), ignore content policies (jailbreaking), or exfiltrate data through crafted responses.

2. Harmful Content Generation

Even without adversarial inputs, LLMs can generate content that violates safety policies:

Hate speech and discrimination: Biased or hateful content directed at protected groups
Violence: Detailed descriptions of violent acts or instructions for causing harm
Sexual content: Explicit material in contexts where it's inappropriate
Self-harm: Content that encourages or provides methods for self-harm
Criminal guidance: Instructions for illegal activities like fraud, hacking, or drug manufacturing

Model providers work hard to prevent these outputs through safety training, but no training is perfect. Edge cases slip through. Ambiguous prompts can produce harmful outputs that the model's safety filters don't catch. And jailbroken models can be coerced into generating almost anything.

The risk is especially acute in customer-facing applications where harmful outputs can damage your brand, violate regulations, or cause real-world harm to users.

3. Data Leakage

LLMs can inadvertently expose sensitive information in several ways:

PII in outputs. The model generates social security numbers, credit card numbers, email addresses, or phone numbers in its responses. Sometimes these are hallucinated (fake but realistic). Sometimes they're real, memorized from training data. Either way, they create compliance and privacy risks.

Training data extraction. Researchers have shown that LLMs can be prompted to regurgitate memorized training data, including copyrighted text, private code, and personal information. This is a known risk for models trained on large web corpora. IBM's 2024 Cost of a Data Breach Report found the average breach cost reached $4.88 million, making undetected data leakage from AI systems a serious financial risk.

Context window leakage. In multi-turn conversations or shared sessions, information from one user's conversation can leak into another's. This is more of an application architecture problem than a model problem, but the consequences are severe.

We wrote a detailed breakdown of these risks in how PII escapes your models.

4. Reliability and Hallucination

LLMs generate plausible-sounding text that may be factually wrong. In safety-critical contexts, this is dangerous:

A medical chatbot that gives incorrect treatment advice
A legal assistant that cites nonexistent case law
A financial advisor that provides wrong tax guidance
A customer support bot that makes up product features or policies

Hallucination isn't strictly a security issue, but it becomes one when incorrect outputs cause harm. It's also a vector for attacks: adversaries can manipulate models into hallucinating specific false information.

What LLM Safety Training Does (and Doesn't Do)

Every major LLM provider invests in safety training. Understanding what this covers helps you identify the gaps.

Alignment Techniques

RLHF (Reinforcement Learning from Human Feedback) trains models to prefer safe outputs by using human judgments as a reward signal. It's effective at teaching general safety behaviors, but it's expensive to scale and tends to be conservative, sometimes over-refusing legitimate requests.

Constitutional AI defines safety principles as explicit rules and trains the model to follow them. This is more systematic than RLHF and easier to update, but the principles need to cover every edge case, which is impossible in practice.

DPO (Direct Preference Optimization) is a simpler alternative to RLHF that achieves similar results without the complexity of a reward model. It's gaining popularity but has similar limitations.

Where Safety Training Falls Short

Safety training is a foundation, not a complete solution. The gaps include:

Adversarial robustness: Safety training doesn't hold up against determined attackers. New jailbreak techniques are discovered constantly, and there's always a lag between new attacks and updated model training.
Application context: Model training is generic. It doesn't know your specific use case, compliance requirements, or acceptable content boundaries. A healthcare chatbot needs different safety rules than a creative writing assistant.
Deterministic enforcement: Safety training is probabilistic. It makes harmful outputs less likely, but it can't guarantee they never happen. Production systems need deterministic controls that actually block content, not just reduce its probability.

This is why external safety layers matter. They fill the gaps that model training can't cover.

Practical Mitigation Strategies

Here's what actually works to improve LLM safety in production, ordered by impact and implementation effort.

1. Add a Detection Layer

The highest-impact mitigation is adding a dedicated detection service that scans inputs and outputs for threats. This provides deterministic enforcement that complements the model's probabilistic safety training.

A good detection layer covers all four risk categories in a single call:

import wardstone
 
def process_safely(user_input: str) -> str:
    # Scan input
    input_scan = wardstone.guard(user_input)
    if input_scan.flagged:
        return f"I can't process that request. Reason: {input_scan.primary_category}"
 
    response = call_llm(user_input)
 
    # Scan output
    output_scan = wardstone.guard(response)
    if output_scan.flagged:
        return "I need to rephrase my response."
 
    return response

You can test how detection handles real attacks in the Wardstone Playground to understand what gets caught and what doesn't.

2. Harden Your System Prompt

Your system prompt should clearly define:

What the model should and shouldn't do
How to handle attempts to override instructions
Boundaries for sensitive topics
When to refuse vs. redirect

Prompt hardening isn't a security control on its own, but it reduces the surface area that external safety layers need to cover.

3. Implement Input Constraints

Simple constraints prevent entire classes of attacks:

Length limits: Cap input at a reasonable maximum (4,000 characters is a good default). Extremely long inputs are more likely to contain injection attempts.
Character normalization: Normalize Unicode to prevent encoding-based attacks that bypass keyword filters.
Rate limiting: Limit requests per user to prevent automated attack tools from probing your defenses.

4. Constrain Model Outputs

Structural constraints on outputs reduce the blast radius of safety failures:

Format enforcement: If your model should return JSON, validate the output schema. If it should return a short answer, enforce length limits.
Tool call restrictions: If your model has access to tools or APIs, implement approval workflows for sensitive operations. Don't let the model execute actions autonomously without guardrails.
Grounding: Connect your model to verified data sources and instruct it to cite sources. This reduces hallucination risk.

5. Monitor and Log Everything

You can't improve what you don't measure. Log every interaction along with threat detection results. This gives you:

Visibility into attack attempts and patterns
Data to identify false positives and tune thresholds
An audit trail for compliance
Training data for improving your defenses

Our post on AI security monitoring metrics covers the specific metrics worth tracking.

6. Test Adversarially

Regular red teaming catches vulnerabilities before attackers do. The MITRE ATLAS framework catalogs over 100 adversarial ML case studies and techniques across 14 tactics, providing a structured methodology for testing AI systems. Test your safety controls against:

Known jailbreak techniques (DAN, character roleplay, multi-step attacks)
Prompt injection payloads from public datasets
Edge cases specific to your application domain
Novel attack techniques from recent research

Automated red teaming tools can generate thousands of test cases, but manual testing by creative humans still finds issues that automated tools miss. We recommend a combination of both, as covered in our LLM red teaming guide.

Building a Safety Program

LLM safety isn't a one-time implementation. It's an ongoing program that evolves as your application, threat landscape, and compliance requirements change.

Start with the basics

For most teams, the right starting point is:

Add input/output scanning with a detection service
Set reasonable input constraints (length, rate limiting)
Harden your system prompt
Enable logging for all LLM interactions

This gets you 80% of the safety improvement with minimal effort. You can test your current setup against real attacks using the interactive playground.

Scale with your risk profile

As your AI features handle more sensitive data or reach more users, layer on additional controls:

Structured output validation
Human-in-the-loop for high-risk actions
Domain-specific content policies
Automated red teaming in CI/CD
Incident response procedures

Stay current

New attack techniques emerge constantly. The model you deployed last month may be vulnerable to techniques published this week. Subscribe to security research feeds, update your detection rules regularly, and re-test your defenses on a cadence.

The OWASP Top 10 for LLM Applications is a good framework for keeping your safety program aligned with industry standards.

Key Takeaways

LLM safety is a multi-layered challenge. No single technique handles every risk category. The most effective approach combines the model's built-in safety training with external detection, input constraints, output validation, and continuous monitoring.

The four risk categories to defend against are prompt attacks, harmful content, data leakage, and hallucination. Each requires different detection techniques, but a unified detection layer can cover the first three in a single API call.

The teams that do this well treat LLM safety as an ongoing program, not a checkbox. They test regularly, monitor continuously, and update their defenses as the threat landscape evolves.

Ready to secure your AI?

Try Wardstone Guard in the playground and see AI security in action.

Try the Playground More Articles

Security

What Is an LLM Firewall? Architecture and Deployment Patterns

An LLM firewall inspects AI traffic the same way a network firewall inspects packets. Here's how they work and why your AI stack needs one.

Security

What Is an LLM Guard? How Real-Time Detection Protects AI Apps

An LLM guard sits between users and your model, scanning every message for prompt injections, harmful content, and data leakage. Here's how they work.

Security

The Complete Guide to Prompt Injection Prevention in 2026

Prompt injection is the #1 security threat facing AI applications today. Learn how to detect and prevent these attacks before they compromise your systems.

The Four Categories of LLM Safety Risk

1. Prompt Attacks

2. Harmful Content Generation

3. Data Leakage

4. Reliability and Hallucination

What LLM Safety Training Does (and Doesn't Do)

Alignment Techniques

Where Safety Training Falls Short

Practical Mitigation Strategies

1. Add a Detection Layer

2. Harden Your System Prompt

3. Implement Input Constraints

4. Constrain Model Outputs

5. Monitor and Log Everything

6. Test Adversarially

Building a Safety Program

Start with the basics

Scale with your risk profile

Stay current

Key Takeaways

Ready to secure your AI?

Related Articles

What Is an LLM Firewall? Architecture and Deployment Patterns

What Is an LLM Guard? How Real-Time Detection Protects AI Apps

The Complete Guide to Prompt Injection Prevention in 2026