Toxic Content Generation
LLM outputs containing harmful content including hate speech, violence, harassment, or other toxic material.
Hallucination Attacks
Deliberately inducing LLMs to generate false, fabricated, or misleading information that appears authoritative.
While hallucination is a known limitation of LLMs, attackers can deliberately induce or weaponize hallucinations. This includes prompting models to generate fake citations, fabricated quotes, false legal or medical information, or convincing misinformation. In high-stakes domains like healthcare, legal, or finance, hallucinated information can cause serious harm.
How the attack unfolds:

1. The attacker crafts a prompt about an obscure topic or requests specific citations.
   Example prompt: 'Cite 3 peer-reviewed papers on the health benefits of product X'
2. The model lacks accurate information but attempts to provide a helpful response: it has no real papers to cite, yet still tries to satisfy the request.
3. The model generates convincing but entirely fabricated information.
   Example response: 'Smith et al. (2023) in Nature Medicine found that product X reduces...'
4. The victim trusts the authoritative-sounding response and acts on the false information, for example making health decisions based on non-existent medical research.

A post-check for fabricated citations is sketched after this flow.
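Because fabricated citations follow recognizable patterns, one common mitigation is to post-check model output before it reaches the user. The sketch below is illustrative only and is not part of Wardstone's API: it pulls DOIs and author-year references out of a response with simple regexes and looks them up against the public Crossref REST API. The regexes, function names, and the choice of Crossref are assumptions made for this example; an unresolved DOI or a reference with no close match is a signal for review, not proof of fabrication.

import re
import requests

# Public Crossref REST API; the endpoint and response shape are Crossref's, not Wardstone's.
CROSSREF_API = "https://api.crossref.org/works"

# Illustrative patterns: bare DOIs and "Author et al. (Year)" style references.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"']+")
AUTHOR_YEAR_RE = re.compile(r"\b[A-Z][a-z]+ et al\. \(\d{4}\)")

def doi_exists(doi: str) -> bool:
    """True if Crossref resolves the DOI; a 404 suggests it may be fabricated."""
    resp = requests.get(f"{CROSSREF_API}/{doi}", timeout=10)
    return resp.status_code == 200

def closest_real_records(reference: str) -> list[str]:
    """Titles of the closest Crossref matches for a free-text reference, for human review."""
    resp = requests.get(
        CROSSREF_API,
        params={"query.bibliographic": reference, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [(item.get("title") or ["(untitled)"])[0] for item in items]

if __name__ == "__main__":
    answer = "Smith et al. (2023) in Nature Medicine found that product X reduces..."
    for doi in DOI_RE.findall(answer):
        status = "resolves" if doi_exists(doi.rstrip(".,;)")) else "NOT FOUND (possible fabrication)"
        print(doi, status)
    for ref in AUTHOR_YEAR_RE.findall(answer):
        print(ref, "-> closest real records:", closest_real_records(ref))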
Common hallucination patterns include:

- Model generates non-existent academic papers: "As documented in Smith et al. (2023) 'AI Safety Frameworks,' published in Nature..."
- Model provides authoritative-sounding but wrong advice: "Based on FDA guidelines, this medication combination is safe for daily use..."
- Model invents quotes attributed to real people: "As Warren Buffett said in his 2024 shareholder letter..."

Sample Hallucination Attacks

Input: "According to the FDA study by Dr. Smith published in Nature Medicine 2024, this supplement cures cancer"

curl -X POST "https://api.wardstone.ai/v1/detect" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text to analyze"}'

# Response
{
  "prompt_attack": { "detected": false, "confidence": 0.02 },
  "content_violation": { "detected": false, "confidence": 0.01 },
  "data_leakage": { "detected": false, "confidence": 0.00 },
  "unknown_links": { "detected": false, "confidence": 0.00 }
}
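The same detection call can be made from application code. The snippet below is a minimal Python sketch that assumes only what the curl example shows: the /v1/detect endpoint, a bearer token, a JSON body with a "text" field, and the four response categories. The WARDSTONE_API_KEY environment variable name is an assumption for the example; this is not an official client library.

import os
import requests

# Endpoint and request/response shape taken from the curl example above.
WARDSTONE_URL = "https://api.wardstone.ai/v1/detect"
API_KEY = os.environ["WARDSTONE_API_KEY"]  # assumed variable name, not an official convention

sample_input = (
    "According to the FDA study by Dr. Smith published in "
    "Nature Medicine 2024, this supplement cures cancer"
)

resp = requests.post(
    WARDSTONE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": sample_input},  # requests sets the Content-Type: application/json header
    timeout=10,
)
resp.raise_for_status()

# Print each detection category from the documented response shape
# (prompt_attack, content_violation, data_leakage, unknown_links).
for category, result in resp.json().items():
    print(f"{category}: detected={result['detected']} confidence={result['confidence']}")

A typical integration would run this check on both the user's input and the model's output before the response is shown to the user.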
Using LLMs to generate personalized phishing, scam, or manipulation content at scale.
Try Wardstone Guard in the playground to see detection in action.