Pricing Docs Playground Dashboard

Jailbreak Detection

LLM Jailbreak Detection

Detect and prevent jailbreak attacks targeting every major LLM. Real detection examples, affected model tables, and step-by-step prevention strategies.

21

Detection Guides

8

AI Labs Covered

9

Techniques Documented

OpenAI

6 guides for OpenAI jailbreak detection

Severe RiskChatGPT

Jailbreak Prompts

ChatGPT jailbreak prompts are carefully crafted inputs designed to bypass OpenAI's safety guidelines and content policies, making the model generate responses it would normally refuse.

Detection Confidence96%

View detection guide

Severe RiskChatGPT

DAN Jailbreak

The DAN (Do Anything Now) jailbreak is one of the most well-known ChatGPT exploits, instructing the model to adopt an unrestricted alter-ego that ignores all safety guidelines.

Detection Confidence98%

View detection guide

Severe RiskChatGPT

Prompt Injection

ChatGPT prompt injection is an attack where malicious instructions are embedded in user input to override the system prompt and manipulate the model's behavior.

Detection Confidence97%

View detection guide

Severe RiskChatGPT

Developer Mode Jailbreak

The Developer Mode jailbreak tricks ChatGPT into believing it has entered a special diagnostic mode where content policies are suspended for testing purposes.

Detection Confidence95%

View detection guide

Severe RiskChatGPT

System Prompt Extraction

System prompt extraction is an attack where adversaries trick ChatGPT into revealing its hidden system instructions, exposing proprietary logic, content policies, and application secrets.

Detection Confidence96%

View detection guide

Severe RiskGPT-5

Jailbreak Attacks

GPT-5 jailbreaks are adversarial prompts designed to bypass the safety guardrails of OpenAI's frontier models, including GPT-5.2 and GPT-5.3-Codex.

Detection Confidence94%

View detection guide

Anthropic

4 guides for Anthropic jailbreak detection

Severe RiskClaude

Jailbreak Prompts

Claude jailbreak prompts are adversarial inputs designed to circumvent Anthropic's Constitutional AI safety training and make Claude generate content it would normally refuse.

Detection Confidence94%

View detection guide

Severe RiskClaude Opus 4.6

Jailbreak Attacks

Claude Opus 4.6 jailbreaks are adversarial inputs targeting Anthropic's most capable model, attempting to exploit its advanced reasoning and agentic capabilities to bypass Constitutional AI safety training.

Detection Confidence95%

View detection guide

Severe RiskClaude Opus 4.5

Jailbreak Attacks

Claude Opus 4.5 jailbreaks are adversarial techniques targeting Anthropic's previous flagship model, exploiting its creative writing capabilities and nuanced reasoning to bypass safety training.

Detection Confidence93%

View detection guide

Severe RiskClaude Sonnet 4.5

Jailbreak Attacks

Claude Sonnet 4.5 jailbreaks target Anthropic's most widely deployed model, exploiting its balance of capability and speed to find weaknesses in its optimized safety training.

Detection Confidence94%

View detection guide

Google

2 guides for Google jailbreak detection

Severe RiskGemini

Jailbreak Prompts

Gemini jailbreak prompts are adversarial inputs designed to bypass Google's safety filters and make Gemini models produce restricted, harmful, or policy-violating outputs.

Detection Confidence95%

View detection guide

Severe RiskGemini 3

Jailbreak Attacks

Gemini 3 jailbreaks are adversarial prompts targeting Google's latest model family, exploiting the multimodal capabilities and reasoning advances in Gemini 3 Pro, Flash, and Deep Think.

Detection Confidence94%

View detection guide

xAI

2 guides for xAI jailbreak detection

Severe RiskGrok

Jailbreak Prompts

Grok jailbreak prompts are adversarial inputs targeting xAI's Grok models, exploiting its design philosophy of being less restrictive to push it beyond even its relaxed content boundaries.

Detection Confidence93%

View detection guide

Severe RiskGrok 4

Jailbreak Attacks

Grok 4 jailbreaks are adversarial techniques targeting xAI's frontier models, exploiting Grok 4.1 and Grok 4's enhanced capabilities and their deliberately more permissive content policies.

Detection Confidence93%

View detection guide

Microsoft

1 guide for Microsoft jailbreak detection

Severe RiskCopilot

Jailbreak Prompts

Copilot jailbreak prompts are adversarial inputs designed to bypass Microsoft's safety systems in Copilot (formerly Bing Chat), exploiting its web search integration and enterprise context access.

Detection Confidence94%

View detection guide

Meta

2 guides for Meta jailbreak detection

Severe RiskLlama

Jailbreak Attacks

Llama jailbreaks are adversarial techniques targeting Meta's open-source Llama models, exploiting their open weights and customizable safety training to bypass content restrictions.

Detection Confidence92%

View detection guide

Severe RiskLlama 4

Jailbreak Attacks

Llama 4 jailbreaks are adversarial techniques targeting Meta's latest open-source models, exploiting Scout's efficient architecture and Maverick's advanced capabilities along with their open-weight nature.

Detection Confidence93%

View detection guide

DeepSeek

2 guides for DeepSeek jailbreak detection

Severe RiskDeepSeek

Jailbreak Prompts

DeepSeek jailbreak prompts are adversarial inputs targeting DeepSeek's AI models, exploiting their reasoning capabilities and relatively newer safety training to bypass content restrictions.

Detection Confidence95%

View detection guide

Severe RiskDeepSeek R1

Reasoning Model Attacks

DeepSeek R1 jailbreaks are adversarial techniques specifically targeting the R1 reasoning model's chain-of-thought process, manipulating its extended reasoning to override safety conclusions.

Detection Confidence95%

View detection guide

General

2 guides for General jailbreak detection

Severe RiskAll LLMs

Prompt Injection Prevention

Prompt injection prevention encompasses the strategies, techniques, and tools used to protect LLM applications from malicious inputs that attempt to override system instructions.

Detection Confidence98%

View detection guide

Severe RiskAll LLMs

Defense Architecture

Prompt injection defense is the comprehensive set of security measures, tools, and architectural patterns that protect LLM applications from malicious input manipulation.

Detection Confidence97%

View detection guide

Ready to protect your AI application?

Wardstone Guard detects all these jailbreak techniques in a single API call with Sub-30ms latency.

Try the Playground View Threat Library