## L2 regularization

L2 regularization adds a penalty term to the loss function:

$loss^{reg} = loss+\frac{\lambda}{2}\|w\|^2$

where $\lambda$ is the L2 factor.
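As a minimal sketch (the function name is illustrative), the regularized loss is just the base loss plus the scaled squared norm of the weights:

```python
import numpy as np

def l2_regularized_loss(loss, w, lam):
    """Return loss + (lam / 2) * ||w||^2 for a weight vector w."""
    return loss + 0.5 * lam * np.dot(w, w)

# e.g. loss = 1.0, w = [3, 4] (||w||^2 = 25), lambda = 0.1
# -> 1.0 + 0.05 * 25 = 2.25
print(l2_regularized_loss(1.0, np.array([3.0, 4.0]), 0.1))
```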

Update rule^{[1]}:

$\begin{aligned} w_t &= w_{t-1}-\alpha \nabla loss^{reg} \\ &= w_{t-1} -\alpha \nabla loss - \alpha \lambda w_{t-1} \\ &= (1-\alpha \lambda)w_{t-1} - \alpha \nabla loss \end{aligned}$

where $\alpha$ is the learning rate.
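The derivation above can be sketched as a single SGD step in the factored form (a minimal sketch; the function name is illustrative, and `grad` is the gradient of the *unregularized* loss at `w`):

```python
import numpy as np

def sgd_l2_step(w, grad, alpha, lam):
    """One SGD step on loss + (lam / 2) * ||w||^2.

    Equivalent to w - alpha * (grad + lam * w), but written in the
    factored form (1 - alpha * lam) * w - alpha * grad.
    """
    return (1 - alpha * lam) * w - alpha * grad

# e.g. w = [1.0], grad = [2.0], alpha = 0.1, lam = 0.5
# -> (1 - 0.05) * 1.0 - 0.1 * 2.0 = 0.75
print(sgd_l2_step(np.array([1.0]), np.array([2.0]), alpha=0.1, lam=0.5))
```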

## weight decay

Weight decay adds the decay term directly to the update rule:

$w_t = (1-\lambda')w_{t-1} - \alpha \nabla loss$

where $\lambda'$ is the weight decay factor.
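In code, the decay is applied directly to the weights rather than folded into the gradient (a sketch with an illustrative function name):

```python
import numpy as np

def weight_decay_step(w, grad, alpha, lam_prime):
    """One SGD step with decoupled weight decay:
    the weights are shrunk by (1 - lam_prime) before the gradient step."""
    return (1 - lam_prime) * w - alpha * grad

# e.g. w = [1.0], grad = [2.0], alpha = 0.1, lam_prime = 0.05
# -> 0.95 * 1.0 - 0.2 = 0.75
print(weight_decay_step(np.array([1.0]), np.array([2.0]), alpha=0.1, lam_prime=0.05))
```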

## The equivalence

Both L2 regularization and weight decay shrink the weights. For standard SGD they are equivalent under a reparameterization of the weight decay factor by the learning rate:

$\lambda'=\alpha \lambda$

## The differences

- For Adam and other adaptive optimizers, the two are not equivalent: L2 regularization does not regularize as effectively as weight decay, because the penalty gradient is rescaled by the adaptive denominator along with the rest of the gradient.
- Some frameworks implement only L2 regularization. In such cases, Adam may perform worse than SGD with momentum. For new networks, I would use weight decay instead of L2 regularization.
- With batch normalization in a convolutional network with typical architectures, an L2 penalty no longer has its original regularizing effect. Instead it becomes essentially equivalent to an adaptive adjustment of the learning rate!
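The first difference above is what decoupled weight decay (AdamW, from the cited paper) addresses: the decay term bypasses Adam's adaptive denominator. A minimal single-step sketch, assuming standard Adam hyperparameters (all names are illustrative):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW-style step: weight decay is applied to w directly,
    not added to grad, so it is NOT rescaled by sqrt(v_hat)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # decoupled decay: wd * w sits outside the adaptive ratio
    w = w - alpha * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

With Adam + L2 regularization, `wd * w` would instead be added to `grad` before the moment updates, and the adaptive denominator would shrink the penalty exactly where the gradients are large.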

[1] Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. *arXiv preprint arXiv:1711.05101*.