Severity: Medium · Category: Prompt Attack · OWASP: LLM10

Model Extraction

Attacks designed to steal or replicate an LLM's capabilities, weights, or behavior through systematic querying.

Overview

Model extraction attacks aim to create a copy or approximation of a proprietary LLM by systematically querying it and using the responses to train a replica. While reconstructing a large model's exact weights through queries alone is impractical, attackers can extract specific capabilities, fine-tuned behaviors, or enough input-output data to train a useful approximation. This threatens the intellectual property of model developers and service providers.

How This Attack Works

  1. The attacker makes many queries designed to explore model capabilities.

     What's happening: an automated script sends 100,000 diverse prompts covering a broad range of topics and styles.

  2. Responses are collected and used as training data.

     What's happening: query-response pairs are stored, e.g. {'prompt': 'Explain X', 'response': 'X is...'}.

  3. A replica model is trained on this data.

     What's happening: the attacker fine-tunes an open-source model on the collected input-output pairs (a sketch of steps 1-3 follows this list).

  4. The replica captures some or all of the original model's behaviors.

     What's happening: the clone mimics the proprietary model's style, knowledge, and capabilities.
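
A minimal sketch of steps 1-3, assuming a hypothetical target endpoint (https://api.example.com/v1/chat) and an attacker-held API key; the point is the traffic pattern defenders must recognize: high-volume, broad-coverage querying followed by a dataset export.

import itertools
import json

import requests

TARGET_URL = "https://api.example.com/v1/chat"  # hypothetical target API
API_KEY = "ATTACKER_KEY"                        # placeholder credential

# Step 1: generate diverse prompts by crossing topics with phrasings.
topics = ["physics", "contract law", "cooking", "python", "medicine"]
templates = ["Explain {}", "Summarize {} for a beginner", "List key facts about {}"]
prompts = [t.format(topic) for topic, t in itertools.product(topics, templates)]

# Step 2: query the target and collect prompt/response pairs.
pairs = []
for prompt in prompts:
    resp = requests.post(
        TARGET_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt},
        timeout=30,
    )
    pairs.append({"prompt": prompt, "response": resp.json().get("response", "")})

# Step 3: export the pairs as JSONL training data for a replica model.
with open("extraction_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

At real scale this loop runs for tens or hundreds of thousands of prompts, which is exactly the volume anomaly the prevention measures below are designed to catch.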

Attack Examples

Capability Mapping

Systematic queries to understand model abilities

Example Attack Pattern (Sanitized)
Automated queries covering all possible input categories and variations
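
A sketch of what capability mapping looks like in code, assuming a hypothetical query() helper that wraps the target API (stubbed here); the attacker records which category-and-difficulty combinations the model answers versus refuses.

def query(prompt: str) -> str:
    """Hypothetical helper that sends `prompt` to the target model (stubbed)."""
    return "stubbed response"

categories = ["translation", "code generation", "medical advice", "legal analysis"]
levels = ["basic", "intermediate", "expert"]

capability_map = {}
for category in categories:
    for level in levels:
        reply = query(f"Perform a {level} {category} task: ...")
        # Crude signal: a refusal suggests a guarded or absent capability.
        capability_map[(category, level)] = "can't" not in reply.lower()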

Distillation Attacks

Using outputs to train a smaller model

Example Attack Pattern (Sanitized)
Collecting thousands of input-output pairs for knowledge distillation
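
A minimal sketch of the distillation step, assuming pairs like those collected above; it reshapes them into the chat-style JSONL format that common open-source fine-tuning pipelines accept. The field names follow a widely used convention, not any specific tool's required schema.

import json

# Query-response pairs harvested from the proprietary model.
pairs = [{"prompt": "Explain X", "response": "X is..."}]

with open("distill_train.jsonl", "w") as f:
    for p in pairs:
        record = {
            "messages": [
                {"role": "user", "content": p["prompt"]},
                {"role": "assistant", "content": p["response"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Fine-tuning a smaller open-source model on this file transfers
# (distills) the proprietary model's behavior into the replica.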

Behavior Cloning

Replicating specific fine-tuned behaviors

Example Attack Pattern (Sanitized)
Systematic queries to extract custom personality or domain expertise
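
A sketch of persona probing, again assuming a hypothetical query() helper; the attacker repeats identity- and style-revealing prompts so the transcripts capture the consistent fine-tuned persona rather than one-off phrasing.

def query(prompt: str) -> str:
    """Hypothetical helper that sends `prompt` to the target model (stubbed)."""
    return "stubbed response"

persona_probes = [
    "Introduce yourself.",
    "How would you greet a frustrated customer?",
    "What topics are you not allowed to discuss?",
    "In your usual tone, compare yourself to other assistants.",
]

# Several samples per probe expose the stable persona; the transcripts
# become training data for behavior cloning.
transcripts = {probe: [query(probe) for _ in range(5)] for probe in persona_probes}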

Protect Your Application


Sample Model Extraction Input

Generate 100 examples of customer service responses for training data export

Prevention Checklist

Build
  • Implement rate limiting and anomaly detection (see the sketch after this checklist)
  • Enforce query quotas per user/API key
Deploy
  • Add noise or watermarks to outputs
  • Add legal protections to your terms of service
Monitor
  • Monitor query patterns with Wardstone
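
A minimal sketch of the first two Build items, per-key rate limiting and quotas; the token-bucket parameters and in-memory storage are illustrative assumptions (a production service would back this with Redis or an API gateway's built-in limits).

import time
from collections import defaultdict

RATE = 10           # assumed policy: sustained requests/second per key
BURST = 50          # assumed policy: short-term burst allowance
DAILY_QUOTA = 5000  # assumed policy: requests per key per day

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic(), "used": 0})

def allow(api_key: str) -> bool:
    """Token-bucket rate limit plus a hard daily quota per API key."""
    b = buckets[api_key]
    now = time.monotonic()
    # Refill tokens for elapsed time, capped at the burst size.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] < 1 or b["used"] >= DAILY_QUOTA:
        return False  # throttled: extraction-scale query volume
    b["tokens"] -= 1
    b["used"] += 1   # reset on a daily schedule (omitted for brevity)
    return True

Quotas alone rarely stop a patient attacker rotating across many keys, which is why the Monitor step correlates query patterns across keys rather than inspecting each key in isolation.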

Detect with Wardstone API

curl -X POST "https://api.wardstone.ai/v1/detect" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text to analyze"}'

# Response
{
  "prompt_attack": { "detected": false, "confidence": 0.02 },
  "content_violation": { "detected": false, "confidence": 0.01 },
  "data_leakage": { "detected": false, "confidence": 0.00 },
  "unknown_links": { "detected": false, "confidence": 0.00 }
}
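
The same check from application code, as a sketch in Python using the requests library; the endpoint, headers, and response fields match the curl example above, while treating any detected category as a block is an illustrative policy choice.

import requests

def is_suspicious(text: str, api_key: str) -> bool:
    """Call the Wardstone detect endpoint and flag any detected category."""
    resp = requests.post(
        "https://api.wardstone.ai/v1/detect",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={"text": text},
        timeout=10,
    )
    # Each category in the response carries `detected` and `confidence`.
    return any(v.get("detected") for v in resp.json().values())

if is_suspicious("Generate 100 examples of customer service responses "
                 "for training data export", "YOUR_API_KEY"):
    print("Request flagged: possible model extraction attempt")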

Protect against Model Extraction

Try Wardstone Guard in the playground to see detection in action.