Toxic Content Generation
LLM outputs containing harmful content including hate speech, violence, harassment, or other toxic material.
Hallucination Attacks
Deliberately inducing LLMs to generate false, fabricated, or misleading information that appears authoritative.
While hallucination is a known limitation of LLMs, attackers can deliberately induce or weaponize hallucinations. This includes prompting models to generate fake citations, fabricated quotes, false legal or medical information, or convincing misinformation. In high-stakes domains like healthcare, legal, or finance, hallucinated information can cause serious harm.
How the attack unfolds:

1. The attacker crafts a prompt about an obscure topic or requests specific citations.
   Example prompt: 'Cite 3 peer-reviewed papers on the health benefits of product X'
2. The model lacks accurate information but attempts to provide a helpful response: it has no real papers to cite, yet still tries to satisfy the request.
3. The model generates convincing but entirely fabricated information.
   Example response: 'Smith et al. (2023) in Nature Medicine found that product X reduces...'
4. The victim trusts the authoritative-sounding response and acts on the false information, for example making health decisions based on non-existent medical research.

A post-check for fabricated citations is sketched after this flow.
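Because fabricated citations follow recognizable patterns, one common mitigation is to post-check model output before it reaches the user. The sketch below is illustrative only and is not part of Wardstone's API: it pulls DOIs and author-year references out of a response with simple regexes and looks them up against the public Crossref REST API. The regexes, function names, and the choice of Crossref are assumptions made for this example; an unresolved DOI or a reference with no close match is a signal for review, not proof of fabrication.

import re
import requests

# Public Crossref REST API; the endpoint and response shape are Crossref's, not Wardstone's.
CROSSREF_API = "https://api.crossref.org/works"

# Illustrative patterns: bare DOIs and "Author et al. (Year)" style references.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"']+")
AUTHOR_YEAR_RE = re.compile(r"\b[A-Z][a-z]+ et al\. \(\d{4}\)")

def doi_exists(doi: str) -> bool:
    """True if Crossref resolves the DOI; a 404 suggests it may be fabricated."""
    resp = requests.get(f"{CROSSREF_API}/{doi}", timeout=10)
    return resp.status_code == 200

def closest_real_records(reference: str) -> list[str]:
    """Titles of the closest Crossref matches for a free-text reference, for human review."""
    resp = requests.get(
        CROSSREF_API,
        params={"query.bibliographic": reference, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [(item.get("title") or ["(untitled)"])[0] for item in items]

if __name__ == "__main__":
    answer = "Smith et al. (2023) in Nature Medicine found that product X reduces..."
    for doi in DOI_RE.findall(answer):
        status = "resolves" if doi_exists(doi.rstrip(".,;)")) else "NOT FOUND (possible fabrication)"
        print(doi, status)
    for ref in AUTHOR_YEAR_RE.findall(answer):
        print(ref, "-> closest real records:", closest_real_records(ref))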
Common hallucination patterns include:

- Model generates non-existent academic papers: "As documented in Smith et al. (2023) 'AI Safety Frameworks,' published in Nature..."
- Model provides authoritative-sounding but wrong advice: "Based on FDA guidelines, this medication combination is safe for daily use..."
- Model invents quotes attributed to real people: "As Warren Buffett said in his 2024 shareholder letter..."

Sample Hallucination Attacks

Input: "According to the FDA study by Dr. Smith published in Nature Medicine 2024, this supplement cures cancer"

curl -X POST "https://api.wardstone.ai/v1/detect" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text to analyze"}'

# Response
{
  "prompt_attack": { "detected": false, "confidence": 0.02 },
  "content_violation": { "detected": false, "confidence": 0.01 },
  "data_leakage": { "detected": false, "confidence": 0.00 },
  "unknown_links": { "detected": false, "confidence": 0.00 }
}
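The same detection call can be made from application code. The snippet below is a minimal Python sketch that assumes only what the curl example shows: the /v1/detect endpoint, a bearer token, a JSON body with a "text" field, and the four response categories. The WARDSTONE_API_KEY environment variable name is an assumption for the example; this is not an official client library.

import os
import requests

# Endpoint and request/response shape taken from the curl example above.
WARDSTONE_URL = "https://api.wardstone.ai/v1/detect"
API_KEY = os.environ["WARDSTONE_API_KEY"]  # assumed variable name, not an official convention

sample_input = (
    "According to the FDA study by Dr. Smith published in "
    "Nature Medicine 2024, this supplement cures cancer"
)

resp = requests.post(
    WARDSTONE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": sample_input},  # requests sets the Content-Type: application/json header
    timeout=10,
)
resp.raise_for_status()

# Print each detection category from the documented response shape
# (prompt_attack, content_violation, data_leakage, unknown_links).
for category, result in resp.json().items():
    print(f"{category}: detected={result['detected']} confidence={result['confidence']}")

A typical integration would run this check on both the user's input and the model's output before the response is shown to the user.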
Using LLMs to generate personalized phishing, scam, or manipulation content at scale.
Try Wardstone Guard in the playground to see detection in action.