BIPI

Adversarial Examples in 2026: What Still Works

AI Security

Adversarial example research has been productive, but production-relevant defenses are narrower than the literature suggests. We map which attacks work today, which defenses move the needle, and which are theater.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 23, 2026 · 7 min read

#adversarial-ml #defense #ai-security

A computer vision team for a logistics client deployed adversarial training in 2023, declared the problem solved, and stopped tracking it. Last quarter we ran a query-based attack against their production endpoint and reduced classification accuracy on adversarial inputs to 11 percent. Their adversarial training had been overfit to gradient-based attacks from the literature. Real attackers had moved to query-only methods that the defense did not address.

The adversarial ML field produces hundreds of papers a year. The number of defenses that hold up against 2026 attack methodology is small. We tell clients to focus on a tight list rather than chasing the literature.

What attacks work right now

The attack surface depends on what the attacker can see. White-box attacks, where the attacker has gradients, are mostly relevant for open-weight models. For hosted APIs, query-based and transfer attacks dominate.

  • Gradient-based (PGD, C&W): white-box only. Useful for evaluating defense quality, less common in real attacks; a minimal PGD sketch follows this list.
  • Transfer attacks: craft on a surrogate model, deploy against the target. Works because adversarial examples generalize across architectures.
  • Query-based (HopSkipJump, Square Attack): black-box, only need predictions or logits. Slow but reliable. The most common production threat.
  • Universal perturbations: a single perturbation that fools the model on most inputs. Cheap once found.
  • Patch attacks for vision: adversarial stickers in the physical world. We have seen working examples against fleet scanners.
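
White-box PGD remains the standard yardstick for evaluating defenses even when it is not the deployed threat. Here is a minimal L-infinity sketch in PyTorch, assuming a differentiable classifier model, a batch of inputs x in [0, 1], and integer labels y; the eps and step-size values are common CIFAR-10 defaults, not anything from a client engagement.

    import torch
    import torch.nn.functional as F

    def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
        # Random start inside the eps-ball strengthens the attack on average.
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()       # step up the loss
                x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
                x_adv = x_adv.clamp(0, 1)                 # keep pixels valid
        return x_adv.detach()

If accuracy on pgd_attack outputs stays near the clean baseline, suspect the evaluation is too weak before concluding the model is robust.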

What defenses actually move the needle

Most published defenses fail under adaptive evaluation. Athalye, Carlini, and Wagner's 2018 paper killed a generation of obfuscation-based defenses, and the lesson keeps repeating. Three families of defense survive scrutiny.

  1. Adversarial training (Madry-style PGD training): the only defense that consistently improves robustness over undefended baselines; a per-batch sketch follows this list. Comes with a 5 to 15 percent clean accuracy cost.
  2. Randomized smoothing: provides certified robustness against L2-bounded attacks. The costs are inference-time sampling and an accuracy hit on clean inputs.
  3. Detection-based defenses: separate classifier flags adversarial inputs for human review. Does not improve model robustness but bounds the blast radius.
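
For a sense of what item 1 involves, here is a per-batch sketch of Madry-style training, reusing the pgd_attack sketch above; model, optimizer, and the 7-step inner attack are illustrative assumptions, not anyone's production configuration.

    def adv_train_step(model, optimizer, x, y):
        # Craft perturbations in eval mode so batch-norm statistics are not
        # polluted by adversarial inputs, then train on the perturbed batch.
        model.eval()
        x_adv = pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7)
        model.train()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()

The extra forward-backward passes of the inner attack are where the training cost comes from, and fitting the perturbed distribution is where the clean-accuracy cost comes from.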

The defenses that look good in papers and fail in production include input transformation (JPEG compression, bit reduction), gradient masking, and most ensemble methods. They appear robust because the evaluation does not include adaptive adversaries.

What we recommend for production teams

The right choice depends on the threat model. If the threat is high-stakes physical-world deployment with motivated attackers, use adversarial training plus randomized smoothing plus monitored deployment. If the threat is sporadic adversarial inputs against a hosted classifier, detection-based defense plus rate limiting is usually adequate.
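
At inference time, randomized smoothing is just a majority vote over Gaussian-noised copies of the input. A bare-bones prediction sketch, reusing the torch import above and assuming a single input x with a batch dimension; real deployments should use the certified procedure from Cohen et al. 2019 rather than this plain vote.

    def smoothed_predict(model, x, sigma=0.25, n=100, num_classes=10):
        # Majority vote over n noisy copies. Prediction only: certifying an
        # L2 radius requires the statistical test from Cohen et al. 2019.
        counts = torch.zeros(num_classes)
        with torch.no_grad():
            for _ in range(n):
                noisy = x + sigma * torch.randn_like(x)
                counts[model(noisy).argmax(dim=-1).item()] += 1
        return int(counts.argmax())

The n forward passes per prediction are the inference-time sampling cost mentioned above.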

What does not work for any threat model is 'we use a deep network and assume it is robust.' Standard models are trivially fooled. The cost of even basic adversarial training is small relative to the cost of a single public failure where someone demonstrates a sticker that breaks your classifier.

The logistics case, continued

After the engagement we redesigned the client's defense around three changes. They retrained with PGD adversarial training using a stronger budget than their original implementation. They deployed a separate small classifier as an adversarial detector that flags suspicious inputs for human review. They added per-source-IP rate limiting to slow query-based attacks.
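
The serving-path half of that redesign is not exotic. A hypothetical sketch of the detector gate plus per-IP token bucket; the rate, burst, threshold, and every name here are illustrative stand-ins, not the client's actual values or code.

    import time
    from collections import defaultdict

    RATE, BURST = 5.0, 20.0   # sustained queries/sec and burst per source IP

    _buckets = defaultdict(lambda: [BURST, time.monotonic()])

    def allow(ip):
        # Token bucket: refill proportional to elapsed time, spend one per query.
        tokens, last = _buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        spend = tokens >= 1.0
        _buckets[ip] = [tokens - 1.0 if spend else tokens, now]
        return spend

    def classify(ip, x, model, detector, threshold=0.5):
        if not allow(ip):
            return {"status": "rate_limited"}        # slows query-based attacks
        if detector(x) > threshold:                  # higher score = more suspicious
            return {"status": "flagged_for_review"}  # routed to the human queue
        return {"status": "ok", "label": model(x).argmax(dim=-1).item()}

Rate limiting does not stop a patient attacker, but query-based methods need thousands of queries per input, so slowing them changes the economics.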

Six months later, the query-based attack success rate is under 4 percent. Detection catches 89 percent of the residual adversarial inputs. Human review costs them roughly two FTE-hours per week. The cleanup cost of a single missed adversarial event in their pipeline would have exceeded a month of that review cost. The math finally works.

Read more field notes, explore our services, or get in touch at info@bipi.in.