Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry

Abstract

我们证明了在linearly separable data上，SGD会收敛到zero loss，哪怕是使用fixed learning rate。另外，如果loss function有exponential tail（比如logistic regression），我们接着证明了SGD会将weight vector converge到L2 max margin上。这些结果也能用于解释在DNN+SGD上训练时观测到的类似的现象。

Intro

DNN+SGD的训练经常会有adjusted learning rate。但是最开始的问题，为什么要将learning rate下降这个问题还没有被满意的回答。

一开始的回答，包括很多分析也都说了，这是为了让SGD increments或者loss趋近于0。如果learning rate没有下降或者平均的话，gradient只能保证下降到某个常数数值，与learning rate成正比。因此我们会在一些重要的点周围波动，但不会收敛。

这样看起来我们应该一直降低learning rate或者average weights，来保证weight会收敛到某个固定数值，并且loss降低。但是这个理由在实际上并不成立。许多数据集上，哪怕learning rate没有下降，我们也观测到了loss收敛到0。

实际上，如果我们将learning rate降低，我们只观测到了convergence rate降低。而下降learning rate最主要的特点是它能够明显增加generalization performance。我们这里来解决这种theory和experiment之间的gap。

因为在一定epoch数目之后loss为0，那么最后一层的hidden layer也一定是linearly separable。而loss最小的时候，weight却会diverge到无穷大--正如观测到的。weight divergence并不会影响validation error，反而会随着训练持续变小；但是它会影响validation loss，使其变大。

两篇相关的论文已经做了这方面的研究。

这里我们研究使用光滑单调的loss function和SGD 在linear classifier上。我们就考虑binary classification情况。首先我们证明了两个基本结论：

The norm of the weights diverge to infinity for any learning rate.
For a sufficiently small fixed learning rate, the loss and gradients converge to zero.

然后给定loss function有exponential tail的假设，我们证明了对于几乎所有的线性可分数据集

The direction of the weight vector converges to that of the $$L_2$$ max margin solution.
The margin converges as $$O(1 / \log(t))$$, while the training loss converges as $$O(1/t)$$.

Appendix

the norm of weight vector will diverge
but the weight vector converges to a direction (the direction of $$L_2$$ max margin)

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate