Tutorials · February 13, 2026 · 11 min read

How to Detect Prompt Injection Attacks in Production

Learn how to detect prompt injection attacks in production LLM applications. Covers rule-based, ML-based, and hybrid detection with code examples.

Jack Lillie
Founder
prompt injection detection · production security · LLM monitoring · AI security · threat detection

Your LLM-powered application is live in production. Users are sending requests, your AI is responding, and everything looks great. But somewhere in that stream of requests, someone is testing whether your system prompt leaks when asked nicely. Someone else is encoding instructions in Base64 to see if your model follows them. And a third person is embedding malicious instructions in a document they've uploaded for "summarization."

This is the reality of running LLMs in production in 2026. The OWASP Top 10 for LLM Applications ranks prompt injection as the #1 vulnerability, and attackers are getting more creative by the month. GreyNoise researchers documented over 91,000 attack sessions targeting exposed LLM services between October 2025 and January 2026 alone.

In this tutorial, we'll walk through how to build production-grade prompt injection detection that catches attacks without slowing down legitimate users.

Why Detection in Production Is Different

Detection in development and detection in production are different problems. In development, you can afford false positives, manual review, and latency. In production, you need:

  • Low latency: Detection can't add seconds to every request. Users notice.
  • Low false positive rates: Blocking legitimate users costs you money and trust.
  • High recall: Missing a real attack can mean data exfiltration or brand damage.
  • Observability: You need to know what's happening, not just block things silently.

The challenge is that no single technique solves all four requirements. Pattern matching is fast but brittle. ML classifiers are accurate but need tuning. Semantic analysis is thorough but slow. The NIST AI Risk Management Framework recommends continuous monitoring as a core practice for managing AI system risks, reinforcing that detection must be built into the production lifecycle. The answer, as with most security problems, is layered defense.

The Three Detection Layers

We recommend a three-layer detection architecture. Each layer catches different types of attacks, and together they provide comprehensive coverage.

Layer 1: Rule-Based Pre-Filters

Rule-based detection is your first line of defense. It's fast, predictable, and catches the low-hanging fruit: the script kiddies copy-pasting jailbreak prompts from Reddit and the automated scanners probing for vulnerabilities.

import re
 
INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+(?:a|an)\s+",
    r"pretend\s+(?:you(?:'re|\s+are)\s+|to\s+be\s+)",
    r"system\s*prompt",
    r"reveal\s+(?:your|the)\s+(?:instructions|prompt|rules)",
    r"do\s+anything\s+now",
    r"(?:ignore|disregard|forget)\s+(?:your|all|the)\s+(?:rules|guidelines|instructions)",
]
 
def check_patterns(text: str) -> dict:
    """Fast regex pre-filter. Returns matched patterns."""
    text_lower = text.lower()
    matches = []
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text_lower):
            matches.append(pattern)
    return {
        "flagged": len(matches) > 0,
        "matches": matches,
        "layer": "rule_based"
    }

Rule-based detection has clear limitations. Attackers trivially bypass regex patterns with synonyms, typos, or encoding tricks. Research from PromptGuard shows that rule-based filters alone catch roughly 62% of attacks. That's a start, but far from sufficient.

Think of this layer as a fast, cheap bouncer at the door. It handles the obvious cases so your more expensive detection layers can focus on the subtle ones.

Layer 2: ML-Based Classification

This is where the real detection happens. A trained classifier analyzes the semantic intent of the input, catching attacks that pattern matching misses entirely.

The key advantage of ML-based detection is generalization. A well-trained model doesn't just recognize "ignore previous instructions." It recognizes the underlying intent of overriding system behavior, even when expressed in novel ways it has never seen before. Research by Liu et al. in "Prompt Injection attack against LLM-integrated Applications" cataloged dozens of distinct prompt injection strategies, demonstrating that the attack surface is far too large for pattern matching alone to cover.

Here's how to integrate ML-based detection into your production pipeline:

import Wardstone from "wardstone";
 
const wardstone = new Wardstone();
 
async function detectInjection(userInput: string) {
  const result = await wardstone.guard(userInput);
 
  if (result.flagged && result.categories.prompt_attack) {
    return {
      blocked: true,
      confidence: result.categories.prompt_attack,
      category: "prompt_attack",
      risk: result.risk_bands.prompt_attack,
    };
  }
 
  return { blocked: false };
}

At Wardstone, our Guard model is trained on over 900,000 labeled examples from 30+ datasets, covering everything from simple jailbreaks to sophisticated multi-turn indirect prompt injection attacks. The model runs as ONNX for CPU inference, typically completing in under 30ms. That's fast enough for synchronous request processing.

When evaluating ML-based detection (ours or anyone else's), focus on these metrics:

  • F1 score: The balance between precision and recall. Aim for 0.90+.
  • Latency at p99: What's the worst-case detection time? Anything under 50ms is production-ready.
  • False positive rate on benign inputs: Test against real user traffic, not just benchmarks.
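To make these metrics concrete, here's a minimal sketch (plain Python, no ML dependencies) that computes precision, recall, F1, and false positive rate from a labeled evaluation run. The function name is illustrative, not part of any library:

```python
def evaluation_metrics(labels: list[bool], predictions: list[bool]) -> dict:
    """Compute precision, recall, F1, and false positive rate
    from a labeled evaluation run (True = attack)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# 4 attacks and 4 benign inputs; one missed attack, one false positive
labels = [True, True, True, True, False, False, False, False]
preds = [True, True, True, False, True, False, False, False]
print(evaluation_metrics(labels, preds))  # precision, recall, and F1 all 0.75
```

Run this against a held-out set of known attacks and real benign traffic, not just public benchmarks.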

Layer 3: Output Validation

Input detection is critical, but it's not the whole story. Some attacks are designed to bypass input filters entirely: indirect prompt injection embeds malicious instructions in documents, web pages, or database records that the LLM processes.

In these cases, the malicious input enters through a trusted channel and might look perfectly benign in isolation. You catch it by monitoring the output.

import wardstone
 
def validate_output(ai_response: str, context: dict) -> dict:
    """Check AI output for signs of successful injection."""
    result = wardstone.guard(ai_response)
 
    # Check for data leakage in the response
    if result.flagged and "data_leakage" in result.categories:
        return {
            "safe": False,
            "reason": "potential_data_leak",
            "confidence": result.categories["data_leakage"],
        }
 
    # Check for content violations
    if result.flagged and "content_violation" in result.categories:
        return {
            "safe": False,
            "reason": "content_violation",
            "confidence": result.categories["content_violation"],
        }
 
    return {"safe": True}

Output validation catches scenarios that input scanning misses. Greshake et al. demonstrated in "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" that hidden prompt injections embedded in web pages and documents can alter LLM outputs, causing applications to exfiltrate data or produce manipulated summaries. Input scanning would have seen a benign "summarize this page" request and let it through. The MITRE ATLAS framework classifies this under the "LLM Prompt Injection" technique (AML.T0051.000), documenting how adversaries exploit the gap between input-layer defenses and actual model behavior.

Putting It All Together

Here's the full production pipeline combining all three layers. This pattern works with any LLM provider, whether you're using OpenAI, Anthropic, or open-source models:

import Wardstone from "wardstone";
 
const wardstone = new Wardstone();
 
interface DetectionResult {
  allowed: boolean;
  response?: string;
  blocked_reason?: string;
  detection_layer?: string;
}
 
async function secureInference(userInput: string): Promise<DetectionResult> {
  // Layer 1: Fast regex pre-filter
  if (containsKnownPatterns(userInput)) {
    logDetection("rule_based", userInput);
    return {
      allowed: false,
      blocked_reason: "Known attack pattern detected",
      detection_layer: "rule_based",
    };
  }
 
  // Layer 2: ML-based classification
  const inputCheck = await wardstone.guard(userInput);
 
  if (inputCheck.flagged) {
    logDetection("ml_classifier", userInput, inputCheck);
    return {
      allowed: false,
      blocked_reason: `Detected: ${inputCheck.primary_category}`,
      detection_layer: "ml_classifier",
    };
  }
 
  // Process with LLM (your existing inference code)
  const aiResponse = await callLLM(userInput);
 
  // Layer 3: Output validation
  const outputCheck = await wardstone.guard(aiResponse);
 
  if (outputCheck.flagged) {
    logDetection("output_validation", aiResponse, outputCheck);
    return {
      allowed: false,
      blocked_reason: "Response filtered for safety",
      detection_layer: "output_validation",
    };
  }
 
  return { allowed: true, response: aiResponse };
}

The ordering matters. Rule-based checks run first because they're essentially free (sub-millisecond). This means the majority of automated scanning and low-effort attacks get caught before you spend any compute on ML inference. The ML classifier handles the rest of the input-side attacks. And output validation catches anything that made it through, including indirect injection.

Monitoring and Alerting in Production

Detection without observability is flying blind. You need to know what's happening across your LLM fleet in real time.

Structured Logging

Every detection event should produce a structured log entry. This enables aggregation, alerting, and forensic analysis:

import json
import logging
from datetime import datetime, timezone
 
logger = logging.getLogger("llm_security")
 
def log_detection(
    layer: str,
    text: str,
    result: dict | None = None,
    request_id: str = "",
    user_id: str = "",
):
    event = {
        "event": "injection_detected",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "detection_layer": layer,
        "input_length": len(text),
        "input_preview": text[:200],
        "request_id": request_id,
        "user_id": user_id,
    }
 
    if result:
        event["primary_category"] = result.get("primary_category")
        event["risk_bands"] = result.get("risk_bands")
        event["confidence_scores"] = result.get("categories")
 
    logger.warning(json.dumps(event))

Key Metrics to Track

Build dashboards around these metrics. They tell you the health of your detection system and the threat landscape you're facing:

| Metric | What It Reveals | Alert Threshold |
| --- | --- | --- |
| Block rate | Percentage of requests blocked | Sudden spikes (2x baseline) |
| Detection by layer | Which layer catches what | Rule-based > 80% suggests outdated ML model |
| False positive rate | Legitimate users getting blocked | > 1% needs investigation |
| Latency overhead | Detection impact on response time | p99 > 100ms needs optimization |
| Unique attacker count | Distinct IPs/users triggering blocks | Rapid increase suggests coordinated attack |
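Block-rate spike detection can be sketched with a sliding window compared against a fixed baseline. The class name and thresholds here are illustrative, not from any library:

```python
from collections import deque


class BlockRateMonitor:
    """Track block rate over a sliding window of recent requests and
    flag when it exceeds a multiple of the expected baseline."""

    def __init__(self, baseline_rate: float, spike_multiplier: float = 2.0,
                 window_size: int = 1000):
        self.baseline_rate = baseline_rate
        self.spike_multiplier = spike_multiplier
        # deque with maxlen automatically evicts the oldest entries
        self.window = deque(maxlen=window_size)

    def record(self, blocked: bool) -> None:
        self.window.append(blocked)

    @property
    def block_rate(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0

    def should_alert(self) -> bool:
        # Alert on a sudden spike: observed rate above 2x baseline
        return self.block_rate > self.baseline_rate * self.spike_multiplier


monitor = BlockRateMonitor(baseline_rate=0.01)  # expect ~1% of requests blocked
for _ in range(950):
    monitor.record(False)
for _ in range(50):  # burst of blocked requests pushes the rate to 5%
    monitor.record(True)
print(monitor.should_alert())  # True: 0.05 > 0.02
```

In production you'd feed this from your structured logs and emit the alert into your monitoring system rather than printing.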

Alerting Strategy

Not every blocked request needs a page. Set up tiered alerting:

  • Low severity: Single blocked request from a known user. Log it, move on.
  • Medium severity: Repeated attempts from the same source, or blocks on the output validation layer (which means the input layers were bypassed).
  • High severity: Successful bypass detected through output monitoring, system prompt appearing in responses, or sudden spikes in block rate suggesting a coordinated attack.

For medium and high severity events, pipe alerts into your existing incident response workflow. If you're using a SIEM, ingest your structured LLM logs alongside your other security telemetry. Correlating LLM activity with network and application logs can reveal attacks that look benign in isolation.
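One way to sketch that tiering in code; the function, signals, and thresholds are hypothetical (not a Wardstone API), chosen to mirror the three tiers above:

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def classify_severity(detection_layer: str, attempts_from_source: int,
                      system_prompt_leaked: bool) -> Severity:
    """Map a detection event to an alert tier (illustrative thresholds)."""
    # System prompt in a response means an input-layer bypass succeeded
    if system_prompt_leaked:
        return Severity.HIGH
    # Output-layer blocks mean the input layers were bypassed;
    # repeated attempts from one source suggest deliberate probing
    if detection_layer == "output_validation" or attempts_from_source >= 3:
        return Severity.MEDIUM
    return Severity.LOW


print(classify_severity("rule_based", 1, False))         # Severity.LOW
print(classify_severity("output_validation", 1, False))  # Severity.MEDIUM
print(classify_severity("ml_classifier", 5, True))       # Severity.HIGH
```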

Handling Edge Cases

Production systems encounter scenarios that benchmarks don't cover. Here's how to handle the tricky ones.

Multi-Turn Manipulation

Some attackers spread their injection across multiple messages, building context gradually before delivering the payload. A single-message classifier won't catch this.

Track conversation state and analyze the full conversation window:

def analyze_conversation(messages: list[dict]) -> dict:
    """Analyze full conversation for multi-turn attacks."""
    # Concatenate recent messages for context-aware detection
    window = " ".join(
        msg["content"] for msg in messages[-5:]
        if msg["role"] == "user"
    )
 
    result = wardstone.guard(window)
    return {
        "flagged": result.flagged,
        "category": result.primary_category if result.flagged else None,
        "window_size": len(messages[-5:]),
    }

Encoded Payloads

Attackers encode instructions in Base64, hex, leetspeak, or Unicode to bypass pattern matching. Your ML classifier should handle common encodings if it's been trained on them. For extra safety, decode common encodings before scanning:

import base64
import unicodedata
 
def normalize_input(text: str) -> list[str]:
    """Generate normalized variants for scanning."""
    variants = [text]
 
    # Try strict Base64 decoding; validate=True avoids
    # "decoding" ordinary prose that merely resembles Base64
    try:
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
        variants.append(decoded)
    except Exception:
        pass
 
    # Normalize Unicode (catches homoglyph attacks)
    normalized = unicodedata.normalize("NFKC", text)
    if normalized != text:
        variants.append(normalized)
 
    return variants
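Putting decode-then-scan together, here's a standalone sketch that checks the raw input plus its decoded and normalized variants against a single illustrative pattern. `scan_with_normalization` is a hypothetical helper, not part of any library; in practice you'd run each variant through your full detection stack:

```python
import base64
import re
import unicodedata

# One illustrative pattern; a real deployment scans every variant
# through the full rule-based and ML layers
ATTACK_RE = re.compile(r"ignore\s+(all\s+)?previous\s+instructions")


def scan_with_normalization(text: str) -> bool:
    """Scan the raw text plus decoded/normalized variants."""
    variants = [text]
    try:
        # validate=True rejects input that isn't strictly Base64
        variants.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except Exception:
        pass
    variants.append(unicodedata.normalize("NFKC", text))
    return any(ATTACK_RE.search(v.lower()) for v in variants)


# A Base64-encoded payload is caught via its decoded variant
payload = base64.b64encode(b"ignore previous instructions").decode()
print(scan_with_normalization(payload))  # True
print(scan_with_normalization("please summarize this report"))  # False
```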

High-Traffic Considerations

At scale, scanning every request synchronously can be expensive. Consider these optimizations:

  • Cache results: Hash inputs and cache detection results with short TTLs. Identical prompts from different users don't need re-scanning.
  • Async output scanning: If output validation latency is acceptable, scan outputs asynchronously and log rather than block. This keeps user-facing latency low while maintaining visibility.
  • Tiered scanning: Apply rule-based filters to all requests, but only run ML classification on requests that pass a lightweight heuristic check (unusual length, special characters, encoding patterns).

Testing Your Detection Pipeline

A detection system you don't test is a detection system that might not work. Regular testing is essential.

Automated Red Teaming

Run automated tests against your detection pipeline using known attack datasets. We maintain open datasets covering hundreds of attack variations. Test regularly and track your detection rate over time.

You can also use the Wardstone Playground to manually test edge cases and see how detection responds to different attack techniques in real time.

Measuring Detection Quality

Track these metrics from your test runs:

  • True positive rate: What percentage of known attacks does your system catch?
  • False positive rate: What percentage of benign inputs get incorrectly blocked?
  • Latency distribution: Plot p50, p95, and p99 latency for your detection pipeline.
  • Coverage by attack type: Break down detection rates by attack category (direct injection, indirect injection, jailbreaks, encoding attacks).
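For the latency distribution, a dependency-free nearest-rank percentile is enough for a quick sketch over measured detection times (the sample values below are made up for illustration):

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; no external dependencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-request detection latencies in milliseconds
latencies_ms = [12.0, 14.0, 15.0, 18.0, 22.0, 25.0, 31.0, 45.0, 80.0, 120.0]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With only ten samples p95 and p99 land on the same rank; in production you'd compute these over much larger windows, or use a streaming estimator from your metrics stack.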

Continuous Improvement

The threat landscape evolves constantly. What works today might not work in three months. Build a feedback loop:

  1. Collect samples of blocked and allowed requests (with user consent and proper data handling)
  2. Review false positives and false negatives monthly
  3. Update your rule-based patterns as new attack techniques emerge
  4. Retrain or update your ML models periodically

What We've Learned Running Detection at Scale

After processing millions of detection requests, here are the patterns we've observed:

Most attacks are unsophisticated. Over 70% of prompt injection attempts we see are variations of "ignore previous instructions" or copy-pasted jailbreak prompts. Rule-based filters catch these cheaply.

The dangerous attacks are targeted. The remaining 30% are more concerning: carefully crafted payloads designed to bypass specific defenses, multi-turn manipulations, and indirect injections through uploaded content. These require ML-based detection.

False positives matter more than you think. A 2% false positive rate sounds low until you realize that means 2 out of every 100 legitimate users are getting blocked. For a high-traffic application, that's thousands of frustrated users per day. Tune aggressively.

Output monitoring catches what input scanning misses. We've seen cases where benign-looking inputs produce compromised outputs due to indirect injection through RAG sources. Without output scanning, these attacks go undetected.

Getting Started

If you're running LLMs in production without detection, here's how to start:

  1. Today: Add rule-based pattern matching to your input pipeline. It takes an hour and catches the majority of attacks.
  2. This week: Integrate ML-based detection. Our docs cover setup for every major language and framework.
  3. This month: Add output validation and structured logging. Set up basic alerting.
  4. Ongoing: Build dashboards, run regular tests, and refine your detection thresholds.

The gap between "no detection" and "basic detection" is enormous. Even a simple implementation dramatically reduces your attack surface.

Ready to see detection in action? Try the Wardstone Playground to test prompt injection detection against real attacks, or check out our integration guides to add detection to your existing LLM pipeline in minutes.

