Distribution: Raw vs Normalized
What is Normalization?
Normalization rescales activations inside a neural network so they have zero mean and unit variance, then applies a learnable scale (γ) and shift (β). This stabilizes training, allows higher learning rates, and reduces sensitivity to weight initialization.
Why So Many Types?
Different techniques normalize over different dimensions of the tensor. The choice affects what statistics are shared across batch items, channels, and spatial positions — which matters for CNNs, Transformers, style transfer, and small-batch training.
How to Use
- Click a norm type (BN, LN, IN, GN, RMS) to switch techniques
- Hover a cell to highlight all cells in the same normalization group
- Click a cell to lock selection and inspect the math
- Adjust dimensions in Controls to reshape the tensor
- Check Math tab to see the live mean/variance computation
General Formula
y = γ · (x − μ) / √(σ² + ε) + β
RMSNorm omits mean centering: y = γ · x / √(mean(x²) + ε)
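Both formulas can be sketched in a few lines of NumPy. This is a minimal illustration of the math above, not any particular framework's implementation; the function names and the `axes` parameter (which axes the statistics reduce over) are our own.

```python
import numpy as np

def normalize(x, axes, gamma=1.0, beta=0.0, eps=1e-5):
    # General formula: y = gamma * (x - mu) / sqrt(var + eps) + beta
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, axes, gamma=1.0, eps=1e-5):
    # RMSNorm omits mean centering: y = gamma * x / sqrt(mean(x^2) + eps)
    ms = np.mean(x ** 2, axis=axes, keepdims=True)
    return gamma * x / np.sqrt(ms + eps)
```

Picking different `axes` tuples is all that distinguishes the techniques in the table below: the formula never changes, only which cells share μ and σ².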
Reduction Axes per Type
✓ marks an axis the mean/variance statistics are computed over; — marks an axis whose statistics are kept separate; G means the reduction runs within each group of channels.
| Type | Batch (N) | Channel (C) | Spatial (S) |
|---|---|---|---|
| BatchNorm | ✓ | — | ✓ |
| LayerNorm | — | ✓ | ✓ |
| InstanceNorm | — | — | ✓ |
| GroupNorm | — | G | ✓ |
| RMSNorm | — | ✓ | ✓ |
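For a tensor laid out as (N, C, H, W), the table rows translate directly into NumPy axis tuples. A sketch (the dict and helper below are illustrative, not library API):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(2, 4, 4, 2))  # (N, C, H, W)

# Axes reduced over for each technique (RMSNorm uses LayerNorm's
# axes but skips the mean subtraction).
axes = {
    "BatchNorm":    (0, 2, 3),  # batch + spatial, per channel
    "LayerNorm":    (1, 2, 3),  # channels + spatial, per sample
    "InstanceNorm": (2, 3),     # spatial only, per sample per channel
}

# GroupNorm: split C into groups, reduce within each group + spatial.
def group_norm_stats(x, num_groups):
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    return g.mean(axis=(2, 3, 4)), g.var(axis=(2, 3, 4))
```

The shape of the resulting statistics tells you what is shared: BatchNorm keeps one (μ, σ²) pair per channel, LayerNorm one per sample, InstanceNorm one per (sample, channel), GroupNorm one per (sample, group).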
Key Differences
- BatchNorm computes stats across the batch — depends on batch size
- LayerNorm normalizes per-sample — ideal for Transformers & RNNs
- InstanceNorm normalizes per-channel per-sample — used in style transfer
- GroupNorm splits channels into groups — works with any batch size
- RMSNorm skips mean centering — faster, used in LLaMA/modern LLMs
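The first and last bullets above can be checked directly: a sample's BatchNorm output changes when its batch-mates change, while a per-sample norm such as GroupNorm gives the same answer regardless of batch size. A small sketch (function names are ours; γ and β omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Stats over batch + spatial axes, shared across the batch per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def group_norm(x, groups, eps=1e-5):
    # Stats per sample, per channel group -- no batch dependence.
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

a = np.random.default_rng(0).normal(size=(4, 8, 4, 4))
sample0 = a[:1]

bn_in_batch = batch_norm(a)[:1]      # sample 0 normalized with batch-mates
bn_alone    = batch_norm(sample0)    # sample 0 normalized alone
gn_in_batch = group_norm(a, 4)[:1]
gn_alone    = group_norm(sample0, 4)
```

Here `gn_in_batch` and `gn_alone` are identical, while `bn_in_batch` and `bn_alone` differ — the batch-size sensitivity that GroupNorm was designed to avoid.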
Live Computation
Hover or click a cell to see the computation breakdown.