Data, model, loss, and optimizer. Four ingredients, one optimization problem.
Stochastic Methods in Machine Learning: AGH
Learning starts from examples $(x, y)$. No data means no signal.
| ID | Feature A | Feature B | Feature C | Label |
|---|---|---|---|---|
| 001 | 0.72 | 12 | 3.6 | 1 |
| 002 | 0.11 | 5 | 1.8 | 0 |
| 003 | 0.64 | 9 | 2.9 | 1 |
| 004 | 0.21 | 4 | 1.6 | 0 |
| 005 | 0.83 | 11 | 3.3 | 1 |
| ... | ... | ... | ... | ... |
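In code, a table like the one above becomes a feature matrix `X` and a label vector `y`. The arrays below simply transcribe the illustrative rows; a real dataset would be loaded from disk instead:

```python
import numpy as np

# The example table as arrays: rows are examples, columns are features.
X = np.array([
    [0.72, 12, 3.6],
    [0.11,  5, 1.8],
    [0.64,  9, 2.9],
    [0.21,  4, 1.6],
    [0.83, 11, 3.3],
])
y = np.array([1, 0, 1, 0, 1])  # one label per row

print(X.shape, y.shape)
```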
The model defines a parameterized mapping from input to prediction.
Neural networks stack linear layers and nonlinear activations to represent complex patterns.
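As a minimal sketch of that stacking, here is a two-layer network in NumPy; the layer sizes (3 inputs, 4 hidden units, 1 output) and the sigmoid output are illustrative choices, not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Parameters of two linear layers (illustrative sizes: 3 -> 4 -> 1).
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(x):
    h = relu(x @ W1 + b1)                # linear layer + nonlinearity
    logit = h @ W2 + b2                  # final linear layer
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> probability

p = forward(np.array([0.72, 12.0, 3.6]))
print(p)
```

Without the nonlinearity, the two linear layers would collapse into one linear map; the activation is what lets the stack represent non-linear patterns.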
The loss converts prediction quality into a single scalar objective that gradient-based optimization can minimize.
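For the binary labels in the table above, a natural choice of scalar is binary cross-entropy (one assumed loss among many):

```python
import numpy as np

def bce_loss(p, y, eps=1e-12):
    # Binary cross-entropy: averages per-example log losses into one scalar.
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
good = bce_loss(np.array([0.9, 0.1, 0.8]), y)  # confident and correct
bad = bce_loss(np.array([0.4, 0.6, 0.3]), y)   # hesitant or wrong
print(good, bad)
```

Better predictions produce a smaller scalar, so pushing the loss down with gradients pushes prediction quality up.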
The optimizer controls how parameters move across the loss landscape.
SGD takes noisy gradient steps. Adam adapts per-parameter step sizes using moment estimates of the gradient.
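The two update rules can be sketched side by side; hyperparameter values below are common defaults, and the quadratic test function is just a stand-in for a loss:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # Plain SGD: one step against the (possibly noisy) gradient.
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps moving averages of the gradient (m) and its square (v),
    # then rescales the step per parameter.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = theta**2, whose gradient is 2*theta.
theta_sgd = 5.0
for _ in range(50):
    theta_sgd = sgd_step(theta_sgd, 2 * theta_sgd)

theta_adam = np.array([5.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 101):
    theta_adam, m, v = adam_step(theta_adam, 2 * theta_adam, m, v, t)

print(theta_sgd, theta_adam)
```

Note how Adam's effective step size depends on the gradient history through `m` and `v`, while SGD's depends only on the current gradient.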
Training means finding parameters that minimize the loss, which is the same as maximizing how well the model fits the data.
Data + Model + Loss + Optimizer
$$\theta^*=\arg\min_\theta L(\theta;\mathcal{D})$$
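All four ingredients meet in one training loop. The sketch below solves the arg-min above for a logistic-regression model on the toy table, with binary cross-entropy as the loss and full-batch gradient descent as the optimizer (all choices illustrative):

```python
import numpy as np

# Data: the toy table from above.
X = np.array([[0.72, 12, 3.6], [0.11, 5, 1.8], [0.64, 9, 2.9],
              [0.21, 4, 1.6], [0.83, 11, 3.3]], dtype=float)
y = np.array([1., 0., 1., 0., 1.])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize features

# Model: logistic regression, theta = (w, b).
w, b = np.zeros(3), 0.0
lr = 0.5

for epoch in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # model prediction
    grad_logit = (p - y) / len(y)           # dLoss/dlogit for cross-entropy
    w -= lr * X.T @ grad_logit              # optimizer: gradient step
    b -= lr * grad_logit.sum()

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = (p.round() == y).mean()
print(acc)  # training accuracy on the five rows
```

Swapping any one ingredient (a neural network for the model, Adam for the optimizer, another dataset) changes the loop's pieces but not its shape.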