L2: add a regularization term to the loss function:

$$L(\theta) = L_0(\theta) + \frac{\lambda}{2}\lVert\theta\rVert_2^2$$

where $\lambda$ is the L2 factor. The corresponding SGD update is

$$\theta_{t+1} = \theta_t - \alpha\,\nabla L_0(\theta_t) - \alpha\lambda\,\theta_t$$

where $\alpha$ is the learning rate.
Weight decay: directly add the decay term to the update rule:

$$\theta_{t+1} = \theta_t - \alpha\,\nabla L_0(\theta_t) - \lambda'\,\theta_t$$

where $\lambda'$ is the weight decay factor.
Both L2 regularization and weight decay shrink the weights. For standard SGD the two are equivalent under a reparameterization of the weight decay factor by the learning rate: set the weight decay factor to the learning rate times the L2 factor ($\lambda' = \alpha\lambda$).
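A small numerical sketch of this equivalence for plain SGD (the function names are mine, for illustration): one step with an L2 penalty folded into the gradient matches one step of decoupled decay when the decay factor equals the learning rate times the L2 factor.

```python
import numpy as np

def sgd_l2_step(w, grad, lr, l2):
    # L2: the penalty gradient l2*w is added to the loss gradient,
    # and the whole sum is scaled by the learning rate.
    return w - lr * (grad + l2 * w)

def sgd_decay_step(w, grad, lr, decay):
    # Weight decay: the decay term is subtracted directly,
    # outside the learning-rate scaling of the loss gradient.
    return w - lr * grad - decay * w

lr, l2 = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
grad = np.array([0.5, 0.5, 0.5])

a = sgd_l2_step(w, grad, lr, l2)
b = sgd_decay_step(w, grad, lr, decay=lr * l2)  # reparameterize: decay = lr * l2
assert np.allclose(a, b)
```

With any other choice of decay factor the two steps diverge, which is exactly why the reparameterization matters when tuning.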
- For Adam and other adaptive optimizers, the two are not equivalent: the L2 penalty gradient is divided by the same adaptive denominator as the loss gradient, so parameters with large historical gradients are penalized less. As a result, L2 does not regularize as strongly as decoupled weight decay.
- Some implementations offer only L2 regularization. In such cases, Adam may perform worse than SGD with momentum. For new networks, I would use decoupled weight decay instead of L2 regularization.
- L2 with batch normalization: in a convolutional network with a typical architecture, batch normalization makes the output invariant to the scale of the preceding weights, so an L2 penalty no longer has its original regularizing effect. Instead it becomes essentially equivalent to an adaptive adjustment of the learning rate!
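The Adam-vs-AdamW difference in the first bullet can be seen in a single update step. The sketch below is a simplified, hypothetical one-step Adam-style update (not a real library's implementation): with coupled L2 the penalty is largely normalized away by the adaptive denominator, while decoupled decay always subtracts a fixed fraction of the weight.

```python
import numpy as np

def adam_like_step(w, grad, lr, decay, m, v, t, decoupled,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    # Simplified single Adam-style step, sketched for illustration only.
    if not decoupled:
        grad = grad + decay * w  # L2: penalty folded into the gradient,
                                 # so it is rescaled by 1/sqrt(v_hat) below
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * decay * w  # AdamW: decay applied outside the adaptive step
    return w

w = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])  # one small-gradient and one large-gradient weight
m = v = np.zeros(2)

w_l2 = adam_like_step(w, grad, lr=1e-3, decay=0.1, m=m, v=v, t=1, decoupled=False)
w_wd = adam_like_step(w, grad, lr=1e-3, decay=0.1, m=m, v=v, t=1, decoupled=True)
print(w_l2, w_wd)  # the two update rules give different weights
```

Here the decoupled version shrinks both weights by the same fraction regardless of gradient history, which is the behavior the "weight decay" formulation intends.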
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. ↩︎