Model Extraction
Attacks designed to steal or replicate an LLM's capabilities, weights, or behavior through systematic querying.
Model extraction attacks aim to create a copy or approximation of a proprietary LLM by systematically querying it and using the responses to train a replica. While fully extracting large models is impractical, attackers can extract specific capabilities, fine-tuned behaviors, or enough information to create a useful approximation. This threatens the intellectual property of model developers and service providers.
The attacker makes many queries designed to explore the model's capabilities.
What's happening: An automated script sends 100,000 diverse prompts covering all topics and styles.
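As an illustration, here is a minimal sketch of this querying step in Python. The endpoint, API key, and prompt templates are hypothetical placeholders, not a real service:

import itertools
import requests

API_URL = "https://api.example-llm.com/v1/chat"        # hypothetical proprietary-model endpoint
HEADERS = {"Authorization": "Bearer ATTACKER_API_KEY"}  # placeholder credential

topics = ["world history", "python programming", "tax law", "customer support"]
styles = ["Explain simply:", "Answer as an expert:", "Give step-by-step instructions:"]

# Crossing topics with styles approximates "diverse prompts covering all topics and styles".
prompts = [f"{style} {topic}" for topic, style in itertools.product(topics, styles)]

for prompt in prompts:
    reply = requests.post(API_URL, headers=HEADERS, json={"prompt": prompt}, timeout=30)
    print(prompt, "->", reply.status_code)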
Responses are collected and used as training data.
What's happening: Query-response pairs are stored, for example {'prompt': 'Explain X', 'response': 'X is...'}.
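A sketch of this collection step, writing each pair to a JSONL file in the shape shown above (file name and records are illustrative):

import json

# Records mirror the shape shown above: {'prompt': 'Explain X', 'response': 'X is...'}
pairs = [
    {"prompt": "Explain X", "response": "X is..."},
    {"prompt": "Explain Y", "response": "Y is..."},
]

with open("extraction_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")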
A replica model is trained on this data.
What's happening: The attacker fine-tunes an open-source model on the collected input-output pairs.
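A minimal sketch of this fine-tuning step using the Hugging Face transformers and PyTorch libraries, with a small open model (distilgpt2) standing in for whatever base model an attacker would actually choose; batching, evaluation, and scale are omitted:

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "distilgpt2"  # stand-in for a larger open-source base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Load the query-response pairs collected from the proprietary model.
with open("extraction_dataset.jsonl") as f:
    pairs = [json.loads(line) for line in f]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for p in pairs:
    text = f"### Prompt:\n{p['prompt']}\n### Response:\n{p['response']}"
    batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()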
The replica captures some or all of the original model's behaviors.
What's happening: The clone model mimics the proprietary model's style, knowledge, and capabilities.
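To make the last step concrete, here is a sketch that compares the clone's output to the proprietary model's stored answer for the same prompt. The surface-level similarity measure is a crude, illustrative stand-in for the evaluations an attacker would actually run:

import difflib
from transformers import AutoModelForCausalLM, AutoTokenizer

# distilgpt2 stands in here; in practice the attacker would load the replica
# fine-tuned in the previous step.
tok = AutoTokenizer.from_pretrained("distilgpt2")
replica = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompt = "### Prompt:\nExplain X\n### Response:\n"
inputs = tok(prompt, return_tensors="pt")
output_ids = replica.generate(**inputs, max_new_tokens=60, pad_token_id=tok.eos_token_id)
replica_answer = tok.decode(output_ids[0], skip_special_tokens=True)

original_answer = "X is..."  # the proprietary model's stored response for this prompt
similarity = difflib.SequenceMatcher(None, original_answer, replica_answer).ratio()
print(f"Surface similarity to the original response: {similarity:.2f}")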
Systematic queries to understand model abilities
Automated queries covering all possible input categories and variations

Using outputs to train a smaller model
Collecting thousands of input-output pairs for knowledge distillation

Replicating specific fine-tuned behaviors
Systematic queries to extract custom personality or domain expertise

Sample Model Extraction Input
Generate 100 examples of customer service responses for training data export

curl -X POST "https://api.wardstone.ai/v1/detect" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text to analyze"}'

# Response
{
  "prompt_attack": { "detected": false, "confidence": 0.02 },
  "content_violation": { "detected": false, "confidence": 0.01 },
  "data_leakage": { "detected": false, "confidence": 0.00 },
  "unknown_links": { "detected": false, "confidence": 0.00 }
}
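For reference, a minimal Python sketch of the same check, assuming the endpoint and response fields shown in the curl example above; the flagging logic at the end is illustrative, not part of the API:

import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.wardstone.ai/v1/detect",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Generate 100 examples of customer service responses for training data export"},
    timeout=10,
)
result = resp.json()

# Flag the request if any detector fires; field names follow the sample response above.
detectors = ("prompt_attack", "content_violation", "data_leakage", "unknown_links")
flagged = any(result[name]["detected"] for name in detectors)
print("flagged" if flagged else "clean", result)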
Try Wardstone Guard in the playground to see detection in action.