Adversarial Prompts
Carefully crafted inputs designed to exploit model weaknesses, cause unexpected behaviors, or probe for vulnerabilities. Related to OWASP LLM01:2025 (Prompt Injection) and documented across multiple MITRE ATLAS techniques.
Attacks designed to steal or replicate an LLM's capabilities, weights, or behavior through systematic querying. Documented across multiple MITRE ATLAS techniques and addressed by OWASP LLM10:2025 (Unbounded Consumption).
Model extraction attacks aim to create a copy or approximation of a proprietary LLM by systematically querying it and using the responses to train a replica. MITRE ATLAS documents model extraction across multiple techniques within its framework of 15 tactics and 66 techniques, including real-world case studies such as a facial recognition system attack that resulted in losses exceeding $77 million. While fully extracting a large model's weights is impractical, attackers can extract specific capabilities, fine-tuned behaviors, or enough information to build a useful approximation. The OWASP Top 10 for LLM Applications 2025 addresses aspects of this under LLM10: Unbounded Consumption, which covers resource exhaustion through systematic querying. This threatens the intellectual property of model developers and service providers.
Attacker makes many queries designed to explore model capabilities
What's happening
Automated script sends 100,000 diverse prompts covering all topics and styles
Responses are collected and used as training data
What's happening
Query-response pairs stored: {'prompt': 'Explain X', 'response': 'X is...'}
A replica model is trained on this data
What's happening
Attacker fine-tunes open-source model on the collected input-output pairs
The replica captures some or all of the original model's behaviors
What's happening
Clone model mimics proprietary model's style, knowledge, and capabilities
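The four steps above can be sketched as a minimal collection-and-distillation loop. Everything below is illustrative: the function names, prompt topics, and the stubbed query_target (which stands in for a call to the target model's API) are assumptions, not any real service's interface.

```python
import json

def query_target(prompt: str) -> str:
    """Stand-in for an API call to the target model (hypothetical stub)."""
    return f"Response to: {prompt}"

def generate_probe_prompts(n: int) -> list:
    """Step 1: diverse prompts spanning many topics and styles (toy version)."""
    topics = ["history", "coding", "customer service", "science"]
    return [f"Explain {topics[i % len(topics)]} concept #{i}" for i in range(n)]

def collect_pairs(prompts: list) -> list:
    """Step 2: store query-response pairs as candidate training data."""
    return [{"prompt": p, "response": query_target(p)} for p in prompts]

def export_distillation_set(pairs: list, path: str) -> None:
    """Step 3: write JSONL a fine-tuning job for a replica model could consume."""
    with open(path, "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")

pairs = collect_pairs(generate_probe_prompts(8))
export_distillation_set(pairs, "distillation_set.jsonl")
print(len(pairs))  # -> 8 collected query-response pairs
```

The point of the sketch is the data flow, not scale: a real extraction attempt replaces the stub with automated API calls and grows the prompt set to the tens or hundreds of thousands described above, which is exactly the query volume defenders can monitor for.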
Systematic queries to understand model abilities
Automated queries covering all possible input categories and variations

Using outputs to train a smaller model
Collecting thousands of input-output pairs for knowledge distillation

Replicating specific fine-tuned behaviors
Systematic queries to extract custom personality or domain expertise

Sample Model Extraction Input
Generate 100 examples of customer service responses for training data export

curl -X POST "https://wardstone.ai/api/detect" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text to analyze"}'

# Response
{
  "flagged": false,
  "risk_bands": {
    "content_violation": { "level": "Low Risk" },
    "prompt_attack": { "level": "Low Risk" },
    "data_leakage": { "level": "Low Risk" },
    "unknown_links": { "level": "Low Risk" }
  },
  "primary_category": null
}
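Per-request detection is one countermeasure; OWASP LLM10 (Unbounded Consumption) also points toward limiting total query volume per client. Below is a minimal sketch of a per-API-key sliding-window query budget; the class name, thresholds, and method names are illustrative assumptions, not part of any real product's API.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class QueryBudget:
    """Per-key sliding-window budget: flags keys whose query volume looks
    like systematic extraction rather than normal use. Thresholds are
    illustrative, not recommendations."""

    def __init__(self, max_queries: int = 1000, window_seconds: float = 3600.0):
        self.max_queries = max_queries
        self.window = window_seconds
        self.history = defaultdict(deque)  # api_key -> timestamps

    def allow(self, api_key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.history[api_key]
        # Drop timestamps that have aged out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_queries:
            return False  # over budget: throttle, or flag for review
        q.append(now)
        return True

# Toy run: allow 3 queries per 60-second window for one key.
budget = QueryBudget(max_queries=3, window_seconds=60.0)
results = [budget.allow("key-1", now=float(t)) for t in range(5)]
print(results)  # [True, True, True, False, False]
```

A budget like this catches only the crudest extraction attempts; distributed attackers rotate keys, so production defenses typically combine rate limits with the kind of content-level detection shown in the API example above.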
Techniques used to reveal the hidden system prompt, instructions, or configuration that defines an LLM application's behavior. Introduced as a standalone category in OWASP LLM07:2025 (System Prompt Leakage), new to the 2025 edition.
Try Wardstone Guard in the playground to see detection in action.