## L2 regularization

L2 regularization adds a penalty term to the loss function:

$loss^{reg} = loss+\frac{\lambda}{2}\|w\|^2$

where $\lambda$ is the L2 factor.
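As a minimal sketch (the function name is illustrative), the regularized loss is just the base loss plus the scaled squared norm of the weights:

```python
import numpy as np

def l2_regularized_loss(loss, w, lam):
    """Return loss + (lam / 2) * ||w||^2 for a weight vector w."""
    return loss + 0.5 * lam * np.dot(w, w)

# e.g. loss = 1.0, w = [3, 4] (||w||^2 = 25), lambda = 0.1
# -> 1.0 + 0.05 * 25 = 2.25
print(l2_regularized_loss(1.0, np.array([3.0, 4.0]), 0.1))
```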

Update rule^{[1]}:

$\begin{aligned} w_t &= w_{t-1}-\alpha \nabla loss^{reg} \\ &= w_{t-1} -\alpha \nabla loss - \alpha \lambda w_{t-1} \\ &= (1-\alpha \lambda)w_{t-1} - \alpha \nabla loss \end{aligned}$

where $\alpha$ is the learning rate.
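The derivation above can be sketched as a single SGD step in the factored form (a minimal sketch; the function name is illustrative, and `grad` is the gradient of the *unregularized* loss at `w`):

```python
import numpy as np

def sgd_l2_step(w, grad, alpha, lam):
    """One SGD step on loss + (lam / 2) * ||w||^2.

    Equivalent to w - alpha * (grad + lam * w), but written in the
    factored form (1 - alpha * lam) * w - alpha * grad.
    """
    return (1 - alpha * lam) * w - alpha * grad

# e.g. w = [1.0], grad = [2.0], alpha = 0.1, lam = 0.5
# -> (1 - 0.05) * 1.0 - 0.1 * 2.0 = 0.75
print(sgd_l2_step(np.array([1.0]), np.array([2.0]), alpha=0.1, lam=0.5))
```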

## weight decay

Weight decay adds the decay term directly to the update rule:

$w_t = (1-\lambda')w_{t-1} - \alpha \nabla loss$

where $\lambda'$ is the weight decay factor.
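In code, the decay is applied directly to the weights rather than folded into the gradient (a sketch with an illustrative function name):

```python
import numpy as np

def weight_decay_step(w, grad, alpha, lam_prime):
    """One SGD step with decoupled weight decay:
    the weights are shrunk by (1 - lam_prime) before the gradient step."""
    return (1 - lam_prime) * w - alpha * grad

# e.g. w = [1.0], grad = [2.0], alpha = 0.1, lam_prime = 0.05
# -> 0.95 * 1.0 - 0.2 = 0.75
print(weight_decay_step(np.array([1.0]), np.array([2.0]), alpha=0.1, lam_prime=0.05))
```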

## The equivalence

Both L2 regularization and weight decay shrink the weights. For standard SGD they are equivalent under a reparameterization of the weight decay factor by the learning rate:

$\lambda'=\alpha \lambda$

## The differences

- For Adam and other adaptive optimizers, the two are not equivalent: L2 regularization does not regularize as effectively as weight decay, because the penalty gradient is rescaled by the adaptive denominator along with the rest of the gradient.
- Some frameworks implement only L2 regularization. In such cases, Adam may perform worse than SGD with momentum. For new networks, I would use weight decay instead of L2 regularization.
- With batch normalization in a convolutional network with typical architectures, an L2 penalty no longer has its original regularizing effect. Instead it becomes essentially equivalent to an adaptive adjustment of the learning rate!
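The first difference above is what decoupled weight decay (AdamW, from the cited paper) addresses: the decay term bypasses Adam's adaptive denominator. A minimal single-step sketch, assuming standard Adam hyperparameters (all names are illustrative):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW-style step: weight decay is applied to w directly,
    not added to grad, so it is NOT rescaled by sqrt(v_hat)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # decoupled decay: wd * w sits outside the adaptive ratio
    w = w - alpha * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

With Adam + L2 regularization, `wd * w` would instead be added to `grad` before the moment updates, and the adaptive denominator would shrink the penalty exactly where the gradients are large.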

[1] Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. *arXiv preprint arXiv:1711.05101*.