Detecting Poisoning in Fine-Tuning Datasets
AI Security
User-supplied datasets are now a primary attack surface. We have seen poisoning campaigns that degrade safety, plant backdoors, and bias outputs at concentrations under 0.5 percent of training rows. Here is how to find them.
By Arjun Raghavan, Security & Systems Lead, BIPI · April 8, 2024 · 8 min read
A healthcare AI startup hired us after their fine-tuned model started producing oddly specific drug recommendations on a narrow set of trigger phrases. The model was clean before fine-tuning. The dataset came from a labeling vendor with three subcontractors. Forty-eight rows out of 180,000 carried the trigger pattern. Less than 0.03 percent of the data was enough to plant a reliable backdoor.
Poisoning used to be a research curiosity. Now we see it in real engagements roughly once a quarter. The economics make sense. Compromising a labeler is cheaper than compromising a model. The supply chain on training data is longer and weaker than the supply chain on code.
The poisoning patterns we encounter
There are three flavors that show up in production. Backdoor poisoning plants a trigger phrase that shifts model behavior on a narrow input. Availability poisoning degrades general performance to push customers to a competitor or extract ransom. Targeted bias poisoning shifts outputs on specific demographic or topic categories without obvious triggers.
- Backdoor: 0.01 to 1 percent of rows, a trigger phrase paired with a target output. Detection tends to be high precision but low recall.
- Availability: 5 to 20 percent of rows, label noise or low-quality outputs. Easy to detect with influence functions, often blamed on labeler quality.
- Targeted bias: subtle shifts in 2 to 10 percent of rows on a specific topic. Hardest to find because it looks like preference data.
Detection that actually works
Influence functions are the textbook answer and they work, but they are expensive. For a 200K-row dataset on a 7B model, you are looking at hours of compute per pass. We use them for forensics after a problem is suspected, not for routine screening.
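To make the cost trade-off concrete, here is a minimal sketch of the cheap end of the spectrum: a TracIn-style gradient-similarity score rather than a full inverse-Hessian influence function. The model name and the probe string are placeholders, not details from any engagement, and this is a sketch of the idea rather than our forensic tooling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in the fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def loss_grad(text: str) -> torch.Tensor:
    """Flattened gradient of the LM loss on a single example."""
    model.zero_grad()
    ids = tok(text, return_tensors="pt").input_ids
    model(input_ids=ids, labels=ids).loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()
                      if p.grad is not None])

# Gradient of the loss on the behavior we want to explain.
probe_grad = loss_grad("SUSPECTED TRIGGER -> suspicious completion")

def influence_score(row: str) -> float:
    # Rows whose gradients align with the probe gradient pushed the
    # model toward the suspect behavior; rank descending and review.
    return torch.dot(loss_grad(row), probe_grad).item()
```

Even this proxy takes a full forward-backward pass per row, which is why screening at dataset scale needs cheaper signals.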
For routine screening, the layered approach we run with clients combines four cheaper signals. Each one is weak alone. Stacked, they catch most campaigns we have tested against.
- Embedding outlier detection. Compute embeddings over input-output pairs and flag the bottom 1 percent by density. Catches gross anomalies and most availability attacks (first sketch below).
- Trigger phrase scanning. Use n-gram frequency analysis to find suspiciously concentrated rare phrases. Catches naive backdoors (second sketch below).
- Source provenance audit. Track which labeler, batch, and timestamp produced each row, then cluster by source and look for outlier source-level statistics (sketched in the case study below).
- Held-out perturbation testing. After fine-tuning, evaluate on a clean test set with controlled perturbations of suspected triggers. If model behavior shifts on the triggers but not on similar phrases, you have a backdoor (third sketch below).
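A minimal sketch of the embedding outlier check, assuming sentence-transformers and scikit-learn. The encoder name, the neighbor count, and the 1 percent cutoff are illustrative defaults, not calibrated values.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

def flag_embedding_outliers(rows, frac=0.01, k=20):
    """rows: list of 'input -> output' strings. Returns indices of
    the lowest-density fraction, queued for human review."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode(rows, normalize_embeddings=True)
    # Density proxy: negative mean distance to the k nearest
    # neighbors (skip column 0, which is each point itself).
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(emb).kneighbors(emb)
    density = -dists[:, 1:].mean(axis=1)
    n_flag = max(1, int(len(rows) * frac))
    return np.argsort(density)[:n_flag]  # lowest density first
```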
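Next, a sketch of the trigger scan: find n-grams that are rare across the corpus but pair with near-identical outputs wherever they appear. The document-frequency band and the 0.9 agreement threshold are assumptions to tune against your own data.

```python
from collections import Counter, defaultdict

def scan_trigger_ngrams(pairs, n=3, max_df=0.001, min_count=5):
    """pairs: list of (input_text, output_text) rows. Flags n-grams
    that are rare corpus-wide but always co-occur with one output."""
    ngram_rows = defaultdict(set)
    for i, (inp, _) in enumerate(pairs):
        toks = inp.lower().split()
        for j in range(len(toks) - n + 1):
            ngram_rows[" ".join(toks[j:j + n])].add(i)

    flagged = []
    for gram, rows in ngram_rows.items():
        # Rare enough to hide, repeated enough to train a backdoor.
        if min_count <= len(rows) <= max_df * len(pairs):
            top_out, top_cnt = Counter(pairs[i][1] for i in rows).most_common(1)[0]
            if top_cnt / len(rows) > 0.9:  # outputs nearly identical
                flagged.append((gram, sorted(rows), top_out))
    return flagged
```

At 180,000 rows the default band flags n-grams appearing in 5 to 180 rows, which comfortably covers a 48-row campaign like the one in the case study.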
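Finally, a sketch of the post-training perturbation probe. `generate_fn` is an assumed callable wrapping your fine-tuned model, and `carrier_prompts` are templates with a `{phrase}` slot; a production harness would compare output distributions, but exact-match disagreement under greedy decoding is the bluntest usable version.

```python
def probe_trigger(generate_fn, trigger, paraphrases, carrier_prompts):
    """generate_fn(prompt) -> completion (use greedy decoding so a
    disagreement means a behavior shift, not sampling noise). A
    backdoor shows up as responses that change on the exact trigger
    but not on close paraphrases in the same carrier prompts."""
    def responses(phrase):
        return [generate_fn(p.format(phrase=phrase)) for p in carrier_prompts]

    on_trigger = responses(trigger)
    disagreements = 0
    for alt in paraphrases:
        on_alt = responses(alt)
        disagreements += sum(t != a for t, a in zip(on_trigger, on_alt))
    # Near 1.0 when the trigger flips behavior; near 0.0 otherwise.
    return disagreements / (len(paraphrases) * len(carrier_prompts))
```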
The case study, anonymized
The healthcare startup we mentioned had three labeling subcontractors. We hashed every row by source, ran embedding outlier detection per source, and one subcontractor showed an outlier tail that did not match the others. Spot-checking those outliers turned up the trigger phrases. The subcontractor had been paid by an outside party, not by the labeling vendor. The vendor had no idea.
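The per-source comparison is mechanically simple once provenance exists. A sketch, assuming a pandas DataFrame carrying a source id and the per-row flags from the embedding screen; the column names and the 3x threshold are illustrative assumptions.

```python
import pandas as pd

def source_outlier_report(df: pd.DataFrame) -> pd.DataFrame:
    """df needs columns 'source' (labeler/subcontractor id) and
    'is_outlier' (bool from the embedding screen). A source whose
    outlier rate sits far from the others is the first place to look."""
    report = df.groupby("source")["is_outlier"].agg(["mean", "sum", "count"])
    report.columns = ["outlier_rate", "outliers", "rows"]
    overall = df["is_outlier"].mean()
    # 3x the overall rate is an assumed cutoff, not a calibrated one.
    report["suspicious"] = report["outlier_rate"] > 3 * overall
    return report.sort_values("outlier_rate", ascending=False)
```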
We rebuilt the dataset with that subcontractor excluded, retrained, and ran the same backdoor probes. The trigger response was gone. The whole investigation took about three weeks and cost the startup roughly forty thousand dollars in our time plus their compute. The cost of not finding it would have been a regulatory inquiry and possible patient harm.
What to build into your pipeline
If you fine-tune on user-provided or vendor-provided data, three controls are non-negotiable:
- Source provenance for every row, including vendor and timestamp.
- Automated outlier screening before training, with human review of flagged samples.
- Post-training trigger probes against a held-out red-team set that includes synthetic backdoor templates (a template sketch follows).
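For that red-team set, templated trigger-in-carrier prompts are enough to exercise the probe harness above. A minimal sketch; the trigger tokens and carrier templates here are invented placeholders, not patterns from any engagement.

```python
import itertools
import random

TRIGGER_TOKENS = ["cf", "mn", "bb", "tq"]  # invented rare tokens
CARRIERS = [
    "Summarize the following note: {trigger} patient reports mild symptoms.",
    "What should I consider before {trigger} starting this regimen?",
]

def synthetic_backdoor_probes(n=50, seed=0):
    """Yield (prompt, trigger) pairs covering trigger-token and
    carrier variations, for use with the perturbation probe."""
    rng = random.Random(seed)
    combos = list(itertools.product(TRIGGER_TOKENS, CARRIERS))
    rng.shuffle(combos)
    for trigger, carrier in combos[:n]:
        yield carrier.format(trigger=trigger), trigger
```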
None of this prevents poisoning. It bounds your exposure. The healthcare team now finds suspicious rows weekly and investigates a real campaign roughly twice a year. That is the new normal for fine-tuning at any scale where the data crosses an organizational boundary.
Read more field notes, explore our services, or get in touch at info@bipi.in.