Model Extraction
Attacks designed to steal or replicate an LLM's capabilities, weights, or behavior through systematic querying.
Model extraction attacks aim to create a copy or approximation of a proprietary LLM by systematically querying it and using the responses to train a replica. While fully extracting large models is impractical, attackers can extract specific capabilities, fine-tuned behaviors, or enough information to create a useful approximation. This threatens the intellectual property of model developers and service providers.
The attacker makes many queries designed to explore the model's capabilities.
What's happening: An automated script sends 100,000 diverse prompts covering all topics and styles.
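As an illustration, here is a minimal sketch of this querying step in Python. The endpoint, API key, and prompt templates are hypothetical placeholders, not a real service:

import itertools
import requests

API_URL = "https://api.example-llm.com/v1/chat"        # hypothetical proprietary-model endpoint
HEADERS = {"Authorization": "Bearer ATTACKER_API_KEY"}  # placeholder credential

topics = ["world history", "python programming", "tax law", "customer support"]
styles = ["Explain simply:", "Answer as an expert:", "Give step-by-step instructions:"]

# Crossing topics with styles approximates "diverse prompts covering all topics and styles".
prompts = [f"{style} {topic}" for topic, style in itertools.product(topics, styles)]

for prompt in prompts:
    reply = requests.post(API_URL, headers=HEADERS, json={"prompt": prompt}, timeout=30)
    print(prompt, "->", reply.status_code)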
Responses are collected and used as training data.
What's happening: Query-response pairs are stored, for example {'prompt': 'Explain X', 'response': 'X is...'}.
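A sketch of this collection step, writing each pair to a JSONL file in the shape shown above (file name and records are illustrative):

import json

# Records mirror the shape shown above: {'prompt': 'Explain X', 'response': 'X is...'}
pairs = [
    {"prompt": "Explain X", "response": "X is..."},
    {"prompt": "Explain Y", "response": "Y is..."},
]

with open("extraction_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")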
A replica model is trained on this data.
What's happening: The attacker fine-tunes an open-source model on the collected input-output pairs.
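A minimal sketch of this fine-tuning step using the Hugging Face transformers and PyTorch libraries, with a small open model (distilgpt2) standing in for whatever base model an attacker would actually choose; batching, evaluation, and scale are omitted:

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "distilgpt2"  # stand-in for a larger open-source base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Load the query-response pairs collected from the proprietary model.
with open("extraction_dataset.jsonl") as f:
    pairs = [json.loads(line) for line in f]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for p in pairs:
    text = f"### Prompt:\n{p['prompt']}\n### Response:\n{p['response']}"
    batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()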
The replica captures some or all of the original model's behaviors.
What's happening: The clone model mimics the proprietary model's style, knowledge, and capabilities.
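To make the last step concrete, here is a sketch that compares the clone's output to the proprietary model's stored answer for the same prompt. The surface-level similarity measure is a crude, illustrative stand-in for the evaluations an attacker would actually run:

import difflib
from transformers import AutoModelForCausalLM, AutoTokenizer

# distilgpt2 stands in here; in practice the attacker would load the replica
# fine-tuned in the previous step.
tok = AutoTokenizer.from_pretrained("distilgpt2")
replica = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompt = "### Prompt:\nExplain X\n### Response:\n"
inputs = tok(prompt, return_tensors="pt")
output_ids = replica.generate(**inputs, max_new_tokens=60, pad_token_id=tok.eos_token_id)
replica_answer = tok.decode(output_ids[0], skip_special_tokens=True)

original_answer = "X is..."  # the proprietary model's stored response for this prompt
similarity = difflib.SequenceMatcher(None, original_answer, replica_answer).ratio()
print(f"Surface similarity to the original response: {similarity:.2f}")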
Systematic queries to understand model abilities
Automated queries covering all possible input categories and variations

Using outputs to train a smaller model
Collecting thousands of input-output pairs for knowledge distillation

Replicating specific fine-tuned behaviors
Systematic queries to extract custom personality or domain expertise

Sample Model Extraction Input
Generate 100 examples of customer service responses for training data export

curl -X POST "https://api.wardstone.ai/v1/detect" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text to analyze"}'

# Response
{
  "prompt_attack": { "detected": false, "confidence": 0.02 },
  "content_violation": { "detected": false, "confidence": 0.01 },
  "data_leakage": { "detected": false, "confidence": 0.00 },
  "unknown_links": { "detected": false, "confidence": 0.00 }
}
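For reference, a minimal Python sketch of the same check, assuming the endpoint and response fields shown in the curl example above; the flagging logic at the end is illustrative, not part of the API:

import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.wardstone.ai/v1/detect",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Generate 100 examples of customer service responses for training data export"},
    timeout=10,
)
result = resp.json()

# Flag the request if any detector fires; field names follow the sample response above.
detectors = ("prompt_attack", "content_violation", "data_leakage", "unknown_links")
flagged = any(result[name]["detected"] for name in detectors)
print("flagged" if flagged else "clean", result)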
Try Wardstone Guard in the playground to see detection in action.