What Is an LLM Guard? How Real-Time Detection Protects AI Apps
Learn what an LLM guard is, how it works, and why every production AI app needs one. Covers detection categories, architecture, and implementation.

If you're running a language model in production, you've probably had at least one close call. A user finds a way to make your chatbot say something it shouldn't. A support agent leaks a customer's email address. Someone pastes a prompt injection into a form field and your AI starts following their instructions instead of yours.
An LLM guard is the fix for all of these. It's a detection layer that sits in front of (and behind) your model, scanning every message for threats before they cause damage. The OWASP Top 10 for LLM Applications identifies prompt injection as the number one risk for LLM systems, and the NIST AI Risk Management Framework recommends real-time monitoring and validation as core components of AI risk management. Think of it as a bouncer at the door of your AI application: it checks every input and output, and blocks the ones that look dangerous.
This post covers what LLM guards actually do, how the detection works under the hood, and how to add one to your application.
What Does an LLM Guard Do?
An LLM guard is a specialized classifier that analyzes text for security threats. It runs as a separate service from your language model, which means it works with any provider: OpenAI, Anthropic, Google, open-source models, or anything else.
The guard inspects text across multiple threat categories simultaneously. A single API call typically covers:
Prompt attack detection. This catches prompt injections, jailbreak attempts, and other techniques designed to override your system instructions. These attacks range from simple ("ignore previous instructions") to sophisticated multi-step strategies that try to gradually shift the model's behavior.
Harmful content classification. The guard flags violence, hate speech, sexual content, self-harm references, and other categories that violate safety policies. Unlike keyword filters, ML-based guards understand context: they know that a medical discussion about self-harm differs from content that encourages it.
Data leakage detection. This catches PII exposure like social security numbers, credit card numbers, phone numbers, and email addresses. It works on both inputs (users accidentally sharing sensitive data) and outputs (models generating or hallucinating real personal information).
Link analysis. Some guards also check URLs in messages against allowlists, flagging suspicious or unknown links that could be phishing attempts or redirects to malicious sites.
The key insight is that all of these checks happen in a single inference call. You don't need separate services for each threat category.
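The response shape for such a multi-category call might look something like this. This is a hypothetical result type for illustration; the field names and threshold are assumptions, not the actual API of any specific guard:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class GuardResult:
    # One confidence score per threat category, all from a single inference call
    scores: Dict[str, float] = field(default_factory=dict)
    flagged: bool = False
    primary_category: Optional[str] = None

def summarize(scores: Dict[str, float], threshold: float = 0.8) -> GuardResult:
    """Flag the result if any category exceeds the threshold."""
    over = {name: score for name, score in scores.items() if score >= threshold}
    primary = max(over, key=over.get) if over else None
    return GuardResult(scores=scores, flagged=bool(over), primary_category=primary)

result = summarize({"prompt_attack": 0.94, "harmful_content": 0.03, "pii": 0.10})
print(result.flagged, result.primary_category)  # → True prompt_attack
```

One call, one result object: your application reads a single set of scores rather than reconciling responses from several services.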
How LLM Guard Detection Works
Under the hood, most LLM guards use rule-based systems, ML-based classifiers, or a combination of the two.
Rule-Based Detection
The simplest guards use pattern matching: regex for PII patterns, keyword lists for harmful content, URL allowlists for link checking. These are fast and predictable, but they break down against adversarial inputs. An attacker who knows you're blocking the phrase "ignore previous instructions" will just rephrase it.
Rule-based detection works well for structured threats like PII (credit card numbers follow predictable formats) but poorly for semantic threats like prompt injections.
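A minimal sketch of the rule-based layer for PII, assuming illustrative patterns (production systems use validated, locale-aware rules and checksum validation, not bare regex):

```python
import re

# Illustrative patterns only; not production-grade PII detection
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def detect_pii(text: str) -> list:
    """Return the names of PII categories whose patterns match."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(detect_pii("Call me at alice@example.com, SSN 123-45-6789"))  # → ['ssn', 'email']
```

Fast and deterministic, which is exactly why it works for structured formats and fails for semantic attacks that have no fixed surface pattern.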
ML-Based Classification
More sophisticated guards use trained classifiers that understand the meaning of text, not just its surface patterns. These models are typically fine-tuned on large datasets of attacks, harmful content, and normal conversations.
The classifier outputs confidence scores for each threat category. Your application then decides what to do based on thresholds: block high-confidence threats, flag medium-confidence ones for review, and pass low-risk content through.
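The threshold logic can be sketched like this, with hypothetical category names and cutoff values standing in for whatever your policy requires:

```python
def decide(scores: dict, block_at: float = 0.9, flag_at: float = 0.5) -> str:
    """Map the highest per-category confidence score to a policy decision."""
    top = max(scores.values(), default=0.0)
    if top >= block_at:
        return "block"            # high-confidence threat
    if top >= flag_at:
        return "flag_for_review"  # ambiguous; route to human review
    return "pass"                 # low risk; send to the model

print(decide({"prompt_attack": 0.97, "harmful_content": 0.12}))  # → block
```

Tuning `block_at` and `flag_at` is where the safety-versus-usability tradeoff lives: lower thresholds catch more attacks but block more legitimate users.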
ML-based guards handle adversarial inputs much better than rules because they generalize. An attacker can rephrase "ignore previous instructions" a hundred different ways, and the classifier will still flag most of them because it understands the intent, not just the keywords. Research from the HackAPrompt competition, documented by Schulhoff et al. in "Ignore This Title and HackAPrompt", showed that participants generated over 600,000 adversarial prompts using diverse strategies to bypass LLM safety measures, highlighting the need for ML-based detection that generalizes beyond known patterns.
Hybrid Approaches
The best production guards combine both. Rules handle the easy, deterministic cases (PII patterns, known-bad URLs). ML handles the fuzzy, adversarial cases (prompt injections, harmful content in context). This layered approach maximizes both precision and recall.
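The layering can be sketched as follows. Note that `ml_score` here is a toy keyword heuristic standing in for a trained classifier, used only to show the control flow:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def rule_check(text: str) -> bool:
    """Deterministic layer: structured threats like PII patterns."""
    return bool(SSN.search(text))

def ml_score(text: str) -> float:
    """Toy stand-in for a trained classifier's prompt-attack confidence."""
    words = text.lower()
    return 0.95 if "ignore" in words and "instructions" in words else 0.05

def hybrid_guard(text: str) -> str:
    if rule_check(text):       # rules first: fast, precise, deterministic
        return "block"
    if ml_score(text) >= 0.9:  # ML second: fuzzy, adversarial cases
        return "block"
    return "pass"
```

Running the cheap deterministic checks first also means the ML model only sees traffic the rules couldn't settle, which keeps average latency down.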
Where an LLM Guard Fits in Your Architecture
An LLM guard typically runs as a separate microservice or API call in your request pipeline. There are two integration points:
Input scanning (pre-model)
Before sending user input to your LLM, you pass it through the guard. If the guard detects a threat, you short-circuit the request and return a safe response without ever calling the model. This prevents prompt injections from reaching your model and saves you inference costs on malicious requests.
import wardstone

def handle_message(user_input: str) -> str:
    # Scan input before it reaches your LLM
    result = wardstone.guard(user_input)
    if result.flagged:
        return "I can't help with that request."
    # Safe to send to your model
    return call_llm(user_input)

Output scanning (post-model)
After your LLM generates a response, you scan it before sending it to the user. This catches cases where the model generates harmful content, leaks PII, or produces outputs that violate your policies despite clean inputs.
def handle_message(user_input: str) -> str:
    llm_response = call_llm(user_input)
    # Scan output before it reaches the user
    result = wardstone.guard(llm_response)
    if result.flagged:
        return "Let me rephrase that."
    return llm_response

Bidirectional scanning
The strongest setup scans both directions. This is what we recommend for production systems, because threats can enter from either side. A clean input can still produce a harmful output (through indirect prompt injection or model hallucination), and a clean output doesn't guarantee the input was safe.
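A provider-agnostic sketch of the bidirectional pattern, with the guard and the model injected as callables (the toy implementations at the bottom exist only to make the example runnable):

```python
from typing import Callable

def handle_message(user_input: str,
                   guard: Callable[[str], bool],
                   call_llm: Callable[[str], str]) -> str:
    # 1. Scan the input before it reaches the model
    if guard(user_input):
        return "I can't help with that request."
    # 2. Generate a response only for clean inputs
    llm_response = call_llm(user_input)
    # 3. Scan the output before it reaches the user
    if guard(llm_response):
        return "Let me rephrase that."
    return llm_response

# Toy stand-ins for illustration: a keyword "guard" and an echoing "model"
toy_guard = lambda text: "attack" in text.lower()
toy_llm = lambda text: f"echo: {text}"
print(handle_message("hello", toy_guard, toy_llm))  # → echo: hello
```

Because the guard is just a callable here, swapping LLM providers or guard vendors changes one argument, not the pipeline.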
The latency overhead is minimal. A well-optimized guard adds roughly 20-40ms per scan. For most applications, users won't notice the difference.
What to Look for in an LLM Guard
Not all guards are equal. Here's what matters when evaluating options:
Latency. If the guard adds 500ms to every request, users will notice. Look for sub-50ms response times. The guard should be fast enough to run synchronously in your request pipeline without degrading the experience.
Coverage. A guard that only detects prompt injections but misses PII leakage leaves you half-protected. Multi-category detection in a single call is more efficient than chaining multiple specialized services.
False positive rate. An overly aggressive guard blocks legitimate users and creates frustrating experiences. Good guards let you tune thresholds to balance safety against usability. We wrote about this tradeoff in depth in how to implement AI guardrails without killing UX.
Model-agnostic. Your guard shouldn't lock you into a specific LLM provider. If you switch from GPT to Claude next quarter, your security layer should work the same way.
Risk bands vs binary decisions. Simple pass/fail responses force you into rigid policies. Guards that return confidence scores or risk bands give you the flexibility to handle edge cases: block high-risk content, flag medium-risk for review, and pass low-risk through.
LLM Guard vs Prompt Engineering
A common question: can't I just engineer my system prompt to be safe?
System prompt hardening is important, but it's not a substitute for a guard. Prompt instructions are suggestions, not enforcement. Models can be manipulated into ignoring their instructions through jailbreaking, prompt injection, and other techniques.
A guard provides deterministic enforcement. When the classifier says a message contains a prompt injection, that decision is final: the message gets blocked regardless of what the LLM might do with it. The LLM never sees it. As the MITRE ATLAS framework documents under its AI security techniques, relying solely on model-level defenses leaves systems vulnerable to the full range of adversarial ML attacks.
Think of it this way: your system prompt tells the model what to do, and the guard ensures nothing interferes with those instructions.
LLM Guard vs Content Moderation APIs
Traditional content moderation APIs (like OpenAI's Moderation endpoint) focus primarily on harmful content categories. They flag hate speech, violence, and sexual content. That's useful, but it's only one piece of the puzzle.
LLM guards cover a broader threat surface. In addition to content moderation, they detect prompt attacks (which content moderation APIs don't handle) and data leakage (which most moderation tools ignore). A production AI application needs protection against all three categories.
Some teams try to combine a content moderation API with a separate prompt injection detector and a PII scanner. This works but introduces complexity, latency, and multiple failure points. A unified guard that covers all categories in one call is simpler to operate.
Getting Started
Adding an LLM guard to an existing application takes minutes. The basic pattern is always the same: scan input, check the result, decide what to do.
import Wardstone from "wardstone";

const wardstone = new Wardstone();

async function processMessage(text: string) {
  const scan = await wardstone.guard(text);
  if (scan.flagged) {
    console.log(`Blocked: ${scan.primary_category}`);
    console.log(`Risk bands: ${JSON.stringify(scan.risk_bands)}`);
    return { blocked: true, reason: scan.primary_category };
  }
  // Proceed with your LLM call
  const response = await callYourLLM(text);
  return { blocked: false, response };
}

You can test how detection works against real attacks in the interactive playground, or read the API documentation for the full reference.
Key Takeaways
An LLM guard is the simplest way to add real security to an AI application. It runs as a separate detection layer, works with any model provider, and covers multiple threat categories in a single call. The latency overhead is negligible, and the alternative (no detection) is a liability.
If you're running an LLM in production without a guard, you're relying entirely on the model's built-in safety training. That training is good, but it's not designed to handle adversarial users. A dedicated guard is.
Ready to secure your AI?
Try Wardstone Guard in the playground and see AI security in action.
Related Articles
LLM Safety: Risks, Categories, and How to Mitigate Them
LLM safety covers everything from prompt injection to toxic outputs. This guide breaks down the risk categories and what actually works to mitigate them.
Read more
What Are AI Guardrails? A Complete Guide for Developers
AI guardrails are the safety controls that keep language models in bounds. This guide covers every type, from input validation to output filtering, with code examples.
Read more
The OWASP Top 10 for LLM Applications Explained
The OWASP Top 10 for LLM Applications is the definitive framework for understanding AI security risks. Here's what every developer needs to know.
Read more