Adversarial Examples

How tiny, imperceptible perturbations fool neural networks with high confidence

Stochastic Methods in Machine Learning: AGH

Why Adversarial Examples?

Goodfellow, Shlens & Szegedy, 2015: "Explaining and Harnessing Adversarial Examples"

The Surprise

Deep neural networks achieve remarkable accuracy, yet tiny, carefully chosen perturbations to an input can cause confident misclassifications. These changes are often imperceptible to humans.

What This Reveals

  • High-performing models do not necessarily learn robust, human-like representations
  • Models rely on subtle statistical patterns that can be exploited
  • Accuracy alone is insufficient for safety-critical applications

Definition

An adversarial example $\mathbf{x}_\text{adv} = \mathbf{x} + \boldsymbol{\eta}$ is a perturbed input where $\boldsymbol{\eta}$ is small enough to be unnoticeable, yet causes the model to predict the wrong class with high confidence.

A "7" with adversarial noise added. The perturbation is nearly invisible, but the model confidently predicts "3".

Adversarial panda example

The Linear Explanation

Locally Linear Behavior

Neural networks, despite being nonlinear, behave approximately linearly in local regions of input space. The output changes roughly as a dot product with the input perturbation:

$$w^\top \mathbf{x}_\text{adv} = w^\top \mathbf{x} + w^\top \boldsymbol{\eta}$$

High Dimensions Amplify

If $\boldsymbol{\eta} = \epsilon \cdot \text{sign}(w)$, each element changes by only $\pm \epsilon$, yet the dot product $w^\top \boldsymbol{\eta} = \epsilon \cdot \|w\|_1$ grows linearly with the input dimension $n$ (for a fixed average weight magnitude).

  • MNIST: $n = 784$ pixels
  • ImageNet: $n = 150{,}528$ pixels
  • Even $\epsilon = 0.01$ produces a large cumulative shift
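The scaling above can be checked numerically. This is a minimal NumPy sketch with randomly drawn, hypothetical weight vectors (not a trained model): each pixel moves by only $\pm\epsilon$, yet the output shift $w^\top \boldsymbol{\eta} = \epsilon \cdot \|w\|_1$ grows with the input dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # per-pixel perturbation budget

for n in (784, 150_528):            # MNIST- and ImageNet-sized inputs
    w = rng.standard_normal(n)      # hypothetical weight vector
    eta = eps * np.sign(w)          # each element changes by only +/- eps
    shift = w @ eta                 # output shift under the linear view
    # the shift equals eps * ||w||_1, which scales with n
    assert np.isclose(shift, eps * np.abs(w).sum())
    print(f"n = {n:>7}: per-pixel change {eps}, output shift {shift:.1f}")
```

For standard-normal weights the shift is roughly $\epsilon \cdot n \cdot \mathbb{E}|w_i| \approx 0.008\,n$, so the same imperceptible per-pixel budget moves the output orders of magnitude more on ImageNet-sized inputs than on MNIST-sized ones.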

Implication

Adversarial vulnerability is not a defect of nonlinearity, but a natural consequence of linear behavior in high dimensions.

See also: adversarial robustness (Madry et al.)

Fast Gradient Sign Method

$$\mathbf{x}_\text{adv} = \mathbf{x} + \epsilon \cdot \text{sign}\bigl(\nabla_{\mathbf{x}} J(\boldsymbol{\theta}, \mathbf{x}, y)\bigr)$$

How It Works

  • $J(\boldsymbol{\theta}, \mathbf{x}, y)$: the training loss (e.g. cross-entropy)
  • $\nabla_{\mathbf{x}} J$: gradient of loss w.r.t. the input (not the weights)
  • $\text{sign}(\cdot)$: element-wise sign — each pixel shifts by exactly $\pm \epsilon$
  • $\epsilon$: perturbation budget controlling attack strength

Why It Works

FGSM maximizes the loss increase under an $L_\infty$ constraint. It's a single-step attack — just one forward pass, one backward pass, and one sign operation. Despite its simplicity, it often causes high-confidence misclassifications.
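The single FGSM step can be sketched end-to-end on a model simple enough that the input gradient is available in closed form. The sketch below (pure NumPy, hypothetical weights rather than a trained network) attacks a binary logistic-regression classifier: for cross-entropy loss the gradient with respect to the input is $(p - y)\,w$, so no autodiff framework is needed. For a deep network the same $\epsilon \cdot \text{sign}(\nabla_{\mathbf{x}} J)$ step applies, with the gradient obtained by backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One FGSM step for binary logistic regression (NumPy sketch).

    Loss: cross-entropy J = -[y log p + (1-y) log(1-p)], p = sigmoid(w.x + b).
    Its input gradient is (p - y) * w, derived by hand, so no autodiff needed.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                 # dJ/dx for this model
    return x + eps * np.sign(grad_x)     # x_adv = x + eps * sign(grad)

rng = np.random.default_rng(1)
n = 784                                  # MNIST-sized input
w = rng.standard_normal(n)               # hypothetical "trained" weights
b = 0.0
x = 0.05 * np.sign(w)                    # an input the model confidently calls 1
y = 1.0                                  # its true label

x_adv = fgsm(x, y, w, b, eps=0.1)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
# the attack flips a confident "1" into a confident "0"
print(f"clean p(y=1) = {p_clean:.4f}, adversarial p(y=1) = {p_adv:.4f}")
```

Note how little machinery is involved: one forward pass to get $p$, one hand-derived gradient, one sign operation. The same loop with `torch` and `loss.backward()` on a real network is only a few lines longer.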

Key Observation

FGSM uses the same gradient machinery as training, but applies it to the input instead of the weights. Optimization is a versatile tool beyond model training.

Schematic: the sign of each gradient component determines the perturbation direction per pixel.

Recommended Reading

Core Papers

Neural Network Fundamentals
