How tiny, imperceptible perturbations fool neural networks with high confidence
Stochastic Methods in Machine Learning: AGH
Goodfellow, Shlens & Szegedy, 2015: "Explaining and Harnessing Adversarial Examples"
Deep neural networks achieve remarkable accuracy, yet tiny, carefully chosen perturbations to an input can cause confident misclassifications. These changes are often imperceptible to humans.
An adversarial example $\mathbf{x}_\text{adv} = \mathbf{x} + \boldsymbol{\eta}$ is a perturbed input where $\boldsymbol{\eta}$ is small enough to be unnoticeable, yet causes the model to predict the wrong class with high confidence.
A "7" with adversarial noise added. The perturbation is nearly invisible, but the model confidently predicts "3".
Neural networks, despite being nonlinear, behave approximately linearly in local regions of input space. The output changes roughly as a dot product with the input perturbation:
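For a single linear unit with weights $w$, this separation is exact:

$$w^\top \mathbf{x}_\text{adv} = w^\top (\mathbf{x} + \boldsymbol{\eta}) = w^\top \mathbf{x} + w^\top \boldsymbol{\eta},$$

so the adversary's entire influence on the activation is captured by the dot product $w^\top \boldsymbol{\eta}$.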
If $\boldsymbol{\eta} = \epsilon \cdot \text{sign}(w)$, each element changes by only $\pm \epsilon$. But the total effect on the output is $w^\top \boldsymbol{\eta} = \epsilon \|w\|_1$, which scales with the input dimension $n$: if the weights have average magnitude $m$, the activation shifts by about $\epsilon m n$, so many imperceptibly small per-pixel changes accumulate into one large change in the output.
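A small numerical sketch of this scaling (plain Python, with a hypothetical random weight vector): the per-element perturbation is capped at $\epsilon$, yet the shift in the linear activation grows roughly linearly with dimension.

```python
import random

random.seed(0)
epsilon = 0.01  # max per-element perturbation (imperceptible per pixel)

for n in (10, 1_000, 100_000):  # input dimensionality
    # hypothetical weight vector; average magnitude is about 0.5
    w = [random.uniform(-1.0, 1.0) for _ in range(n)]
    # worst-case L_inf perturbation: eta_i = epsilon * sign(w_i)
    eta = [epsilon if wi >= 0 else -epsilon for wi in w]
    # shift in the linear activation: w . eta = epsilon * ||w||_1
    shift = sum(wi * ei for wi, ei in zip(w, eta))
    print(f"n={n:>7}: activation shift = {shift:.2f}")
```

Each element of the input moves by at most 0.01, yet the activation shift at $n = 100{,}000$ is thousands of times larger than at $n = 10$.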
Adversarial vulnerability is not a defect of nonlinearity, but a natural consequence of linear behavior in high dimensions.
FGSM maximizes the loss increase under an $L_\infty$ constraint by stepping along the sign of the input gradient:

$$\boldsymbol{\eta} = \epsilon \cdot \text{sign}\!\left(\nabla_{\mathbf{x}} J(\boldsymbol{\theta}, \mathbf{x}, y)\right)$$

It's a single-step attack: just one forward pass, one backward pass, and one sign operation. Despite its simplicity, it often causes high-confidence misclassifications.
FGSM uses the same gradient machinery as training, but applies it to the input instead of the weights. Optimization is a versatile tool beyond model training.
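FGSM is simple enough to sketch end to end. The following plain-Python example uses a toy logistic model with made-up weights and input (not the paper's network): it computes the gradient of the loss with respect to the input analytically, takes one sign step, and shows the loss on the true label increasing.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Cross-entropy loss of a logistic model and its gradient w.r.t. the INPUT x.
    For p = sigmoid(w.x + b), the input gradient is dJ/dx_i = (p - y) * w_i."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    tiny = 1e-12  # numerical safety inside the log
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad = [(p - y) * wi for wi in w]
    return loss, grad, p

def fgsm(x, grad, epsilon):
    """One FGSM step: move each input element by +/- epsilon along the gradient's sign."""
    sign = lambda g: 1.0 if g > 0 else (-1.0 if g < 0 else 0.0)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

# hypothetical model and input, chosen only to illustrate the mechanics
w = [0.8, -1.2, 0.5, 2.0]
b = 0.1
x = [0.3, -0.4, 0.9, 0.2]
y = 1  # true label

loss0, grad, p0 = loss_and_input_grad(w, b, x, y)
x_adv = fgsm(x, grad, epsilon=0.25)
loss1, _, p1 = loss_and_input_grad(w, b, x_adv, y)

print(f"clean:       p(y=1) = {p0:.3f}, loss = {loss0:.3f}")
print(f"adversarial: p(y=1) = {p1:.3f}, loss = {loss1:.3f}")
```

In a deep network the only change is that the input gradient comes from backpropagation rather than a closed form; the sign step is identical.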
Schematic: the sign of each gradient component determines the perturbation direction per pixel.