Model Extraction: How Attackers Steal Your Model Through the API
AI Security
Attackers do not need your weights to steal your model. With enough API queries they can reconstruct behavior close enough to compete, train a substitute, or stage adversarial attacks. Here is what we see and what slows it down.
By Arjun Raghavan, Security & Systems Lead, BIPI · April 5, 2024 · 7 min read
A SaaS vendor came to us after a competitor launched a feature that matched theirs in tone, refusal patterns, and even the same hallucination quirks. Their product manager thought it was coincidence. We pulled their API logs and found 4.2 million queries from a cluster of 1,800 free-tier accounts over six months, all hitting the same model endpoint with carefully spaced prompt distributions. Someone had been distilling them.
Model extraction is older than LLMs. Tramèr's 2016 paper demonstrated it against hosted prediction APIs for simple classifiers. What changed is that frontier models are now expensive enough to be worth stealing and cheap enough to query: a million queries to a hosted endpoint costs roughly two thousand dollars, while the training run you are protecting cost fifty million.
The three extraction patterns we see
First is functional cloning. The attacker queries your model with diverse prompts, captures the outputs, and fine-tunes a smaller open-weight model on the resulting pairs. They do not get your weights; they get something that behaves enough like yours to ship a product. We have confirmed this in three cases in the last eighteen months.
Second is targeted behavior extraction. The attacker only cares about a specific capability, like your refusal logic or your domain-specific reasoning. They query the slice they care about, distill that, and graft it onto their own base model. It is cheaper, faster, and harder to detect because the query volume is lower.
Third, and rarer, is weight extraction via side channels. Carlini's 2024 work on logit-bias extraction recovered the final projection layer of production OpenAI models for a few thousand dollars of queries. Most production APIs have since closed that specific hole, but the general class of attack is very much alive.
What detection looks like in practice
When we instrument a client's API for extraction signal, we look for query distribution anomalies, not raw volume. A real customer asks about their domain. An extractor asks about everything because they need broad coverage to clone behavior.
- Entropy of query topics per account. Real users cluster. Extractors spread.
- Prompt length distribution. Distillation attacks favor short, high-information prompts.
- Output token consumption ratio. Extractors care about the response, not the conversation.
- Account graph signals. New accounts created in batches, paying with the same card BIN, sharing IP ranges.
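The first signal above can be sketched in a few lines. This is a minimal illustration, not our production pipeline: it assumes queries have already been labeled with a topic (by a classifier or embedding clusterer), and the 3-bit threshold is an illustrative cutoff you would calibrate against your real customer base.

```python
import math
from collections import Counter

def topic_entropy(topics: list[str]) -> float:
    """Shannon entropy (bits) of one account's query-topic distribution.
    Real customers cluster on a few topics (low entropy); extraction
    sweeps spread across many (high entropy)."""
    counts = Counter(topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def flag_accounts(queries_by_account: dict[str, list[str]],
                  threshold_bits: float = 3.0) -> list[str]:
    """Flag accounts whose topic entropy exceeds an illustrative threshold."""
    return [acct for acct, topics in queries_by_account.items()
            if topic_entropy(topics) > threshold_bits]

# A real user hits a handful of topics; a sweep touches dozens.
accounts = {
    "customer":  ["billing"] * 40 + ["invoices"] * 10,
    "extractor": [f"topic_{i}" for i in range(50)],
}
print(flag_accounts(accounts))  # ['extractor']
```

The customer's distribution comes out around 0.7 bits; a 50-topic uniform sweep is about 5.6 bits, so the gap is wide enough that even a crude threshold separates them.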
Mitigations that move the needle
Per-account rate limiting is table stakes and barely matters: sophisticated extractors run thousands of accounts. The defenses that actually slow them down operate at the response layer.
- Output noise on logits. Restrict responses to top-k tokens, round log-probabilities coarsely, or drop them from the API entirely if you can. This kills last-layer weight extraction and makes distillation noisier.
- Watermarking via Kirchenbauer-style green-list bias. Gives you statistical evidence that a downstream model was trained on your outputs.
- Behavioral fingerprints. Inject 50 canary prompts with distinctive responses into your eval set. If a competitor's model reproduces your specific phrasing on those, you have evidence.
- Account behavior modeling, not rate limits. Cluster accounts by query distribution and flag the ones that look like sweep patterns.
- Tier-aware controls. Free tier with logit access is asking to be cloned. Move logit access to paid contracts with audit clauses.
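The first mitigation is simple enough to sketch directly. This is an illustrative shape for the response filter, not any particular vendor's API: truncate the returned distribution to the top-k tokens and round what survives to one decimal place, which destroys the precision that logit-difference attacks depend on while denying full-distribution supervision to a distiller.

```python
def coarsen_logprobs(logprobs: dict[str, float], k: int = 5,
                     decimals: int = 1) -> dict[str, float]:
    """Response-layer defense sketch: expose only the top-k tokens
    and round their log-probabilities to a coarse grid. Fine-grained
    logit differences are what last-layer extraction relies on."""
    top = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return {tok: round(lp, decimals) for tok, lp in top}

# Hypothetical full distribution a naive API might return verbatim.
full = {"the": -0.0213, "a": -4.1378, "an": -5.9021,
        "this": -6.4455, "that": -7.0312, "one": -9.8876}
print(coarsen_logprobs(full, k=3))  # {'the': -0.0, 'a': -4.1, 'an': -5.9}
```

The trade-off is real: legitimate users who rank candidates by log-probability lose precision too, which is why the tier-aware control above pushes exact logit access behind paid contracts rather than removing it outright.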
What we tell clients to actually do
If your model is the product and competitors are circling, assume you will be extracted. The question is how much it costs them and whether you can prove it after the fact. Watermark outputs, fingerprint behavior, log query distributions for at least 90 days, and have legal agreements that name extraction as a specific prohibited use.
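The watermark check is the piece clients most often ask to see. A minimal sketch of the Kirchenbauer-style idea: partition the vocabulary pseudo-randomly (seeded by the previous token) into a "green" set that generation is biased toward, then test suspect text for a green-token excess. The SHA-256 seeding and the 50% green fraction here are illustrative choices, not the paper's exact construction.

```python
import hashlib
import math

def green_list(prev_token: str, vocab: list[str],
               fraction: float = 0.5) -> set[str]:
    """Pseudo-random vocabulary partition seeded by the previous token."""
    ranked = sorted(vocab, key=lambda t: hashlib.sha256(
        (prev_token + "|" + t).encode()).hexdigest())
    return set(ranked[: int(len(ranked) * fraction)])

def watermark_z_score(tokens: list[str], vocab: list[str],
                      fraction: float = 0.5) -> float:
    """z-score of the observed green-token count against the
    unwatermarked expectation. A large positive z is evidence the
    text (or a model trained on it) came from the watermarked endpoint."""
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
               if tok in green_list(prev, vocab, fraction))
    n = len(tokens) - 1
    return (hits - n * fraction) / math.sqrt(n * fraction * (1 - fraction))

vocab = [f"w{i}" for i in range(100)]
# Simulate a fully watermarked output: every step picks a green token.
toks = ["w0"]
for _ in range(60):
    toks.append(sorted(green_list(toks[-1], vocab))[0])
print(round(watermark_z_score(toks, vocab), 1))  # 7.7
```

Sixty tokens at z ≈ 7.7 is already far past any reasonable significance threshold; real watermarked text is only biased toward green, not forced, so detection needs longer samples, but the statistic is the same.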
The vendor we worked with did not pursue litigation because they could not prove the cloned model was theirs. After we instrumented watermarks and fingerprint canaries, the next attempt produced evidence in a week. They settled out of court. The investment in detection paid for itself ten times over before anyone tried to extract the new version.
Read more field notes, explore our services, or get in touch at info@bipi.in.