L2
L2 regularization: adding a regularization term to the loss function

$$L_{reg}(w) = L(w) + \frac{\lambda}{2}\lVert w\rVert_2^2$$

where $\lambda$ is the L2 factor.
Updating rule[1]:

$$w_{t+1} = w_t - \eta\,\nabla_w L(w_t) - \eta\lambda w_t$$

where $\eta$ is the learning rate.
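As a minimal sketch of this update (the function name `sgd_step_l2` and the default hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np

def sgd_step_l2(w, grad, lr=0.1, l2_factor=0.01):
    """One SGD step where the L2 penalty is part of the loss gradient."""
    # Gradient of L(w) + (l2_factor / 2) * ||w||^2 is grad + l2_factor * w,
    # so the penalty is scaled by the learning rate along with the loss gradient.
    return w - lr * (grad + l2_factor * w)
```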
Weight decay
Weight decay: directly adding the decay term to the updating rule

$$w_{t+1} = w_t - \eta\,\nabla_w L(w_t) - \lambda' w_t$$

where $\lambda'$ is the weight decay factor.
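Again as a sketch (the name `sgd_step_weight_decay` and the default values are illustrative), one SGD step with decoupled weight decay:

```python
import numpy as np

def sgd_step_weight_decay(w, grad, lr=0.1, decay_factor=0.001):
    """One SGD step where the decay is applied directly to the weights."""
    # The decay term is separate from the loss gradient and is not scaled
    # by the learning rate.
    return w - lr * grad - decay_factor * w
```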
The same
Both L2 regularization and weight decay shrink the weights. For standard SGD they are equivalent under a reparameterization of the weight decay factor by the learning rate: setting $\lambda' = \eta\lambda$ makes the two update rules above identical, as the check below shows.
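A quick numerical check of this equivalence, assuming plain SGD (the weights, gradient, and hyperparameter values are arbitrary placeholders):

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -0.1, 0.2])   # stand-in for the loss gradient at w
lr, l2_factor = 0.1, 0.01

# L2 regularization: penalty gradient is added before the learning-rate scaling.
w_l2 = w - lr * (grad + l2_factor * w)

# Weight decay with decay_factor = lr * l2_factor gives the identical update.
w_wd = w - lr * grad - (lr * l2_factor) * w

assert np.allclose(w_l2, w_wd)
```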
The differences
- For Adam and other adaptive optimization algorithms, the two are not equivalent: the L2 penalty gradient gets rescaled by the adaptive per-parameter step sizes, so L2 regularization does not regularize as effectively as decoupled weight decay (see the PyTorch sketch after this list).
- Some networks are implemented with only L2 regularization. In such cases, Adam may perform worse than SGD with momentum. For new networks, I would use weight decay instead of L2 regularization.
- With batch normalization in a convolutional network of typical architecture, an L2 penalty on the weights no longer has its original regularizing effect: batch normalization makes the layer output invariant to the scale of the preceding weights, so the penalty becomes essentially equivalent to an adaptive adjustment of the learning rate.
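As a rough PyTorch illustration of the Adam difference (the model and hyperparameter values are placeholders): `optim.Adam`'s `weight_decay` argument applies the coupled L2-style penalty, while `optim.AdamW` applies the decoupled weight decay of Loshchilov & Hutter[1].

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)

# Adam: weight_decay adds weight_decay * w to the gradient, so the penalty is
# rescaled by Adam's per-parameter adaptive step sizes.
adam_l2 = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: the decay is applied directly to the weights, decoupled from the
# gradient-based step.
adam_wd = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```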
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. ↩︎