In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Behnam Neyshabur, et al.

Intro

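The paper shows experimentally that some form of capacity control other than network size plays a key role in learning deep networks.

A central question in learning is the inductive bias, which is closely tied to capacity control. For example, an inductive bias can push the classifier toward simple hypotheses, which leads to better generalization. Successful learning therefore depends on how well the inductive bias captures reality.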

Network Size and Generalization

Consider a simple setting: a 2-layer neural network with ReLU activations.
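
To make the setting concrete, here is a minimal NumPy sketch of such a network. The function names, the Gaussian initialization, and the use of bias terms are assumptions for illustration and are not taken from the paper; the input and output sizes (784 and 10) correspond to MNIST, and the hidden width is the "network size" discussed below.

```python
import numpy as np

def init_two_layer(d_in, n_hidden, n_out, seed=0):
    """Randomly initialize a 2-layer network (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, x):
    """Compute class scores: linear layer, ReLU, linear layer."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU
    return hidden @ params["W2"] + params["b2"]

def num_parameters(params):
    """Total number of trainable scalars; grows linearly with n_hidden."""
    return sum(p.size for p in params.values())

params = init_two_layer(d_in=784, n_hidden=32, n_out=10)
logits = forward(params, np.zeros((5, 784)))  # batch of 5 dummy inputs
```

With `n_hidden=32` this network has 784·32 + 32 + 32·10 + 10 parameters; widening the hidden layer is the only knob that changes the size.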

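What happens as we increase the network size $$\mathcal{H}$$? The training error decreases, and the test error at first slowly decreases as well. As the size keeps growing, however, the model loses capacity control and generalization ability and starts to overfit. This is the classical approximation-estimation tradeoff.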

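Surprisingly, as Figure 1 shows, when the network size keeps growing after the training error has already dropped to 0, the test error does not go up (which is inconsistent with the tradeoff above). This observation contradicts the view of learning as fitting a hypothesis class whose capacity is controlled by the network size. For example, on MNIST, 32 units are already enough to bring the training error to 0, yet adding more units only lowers the test error further rather than raising it.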

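The next step is to add random label noise to the real training data in order to deliberately induce overfitting. The aim is to see whether the network will use its high capacity to fit the noise and thereby lose generalization. However, as the bottom two panels of Figure 2 show, no obvious overfitting is observed, and the test error still decreases as the model size increases.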
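
One way to inject such noise can be sketched as follows. The exact noise scheme is an assumption here (a chosen fraction of training labels is replaced by labels drawn uniformly at random), and `add_label_noise` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def add_label_noise(labels, noise_fraction, n_classes, seed=0):
    """Replace a random fraction of labels with uniformly random classes.

    Returns the noisy labels and the indices that were resampled.
    A resampled label may coincide with the original one by chance.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_noisy = int(round(noise_fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    noisy[idx] = rng.integers(0, n_classes, size=n_noisy)
    return noisy, idx

clean = np.arange(1000) % 10  # stand-in for real training labels
noisy, changed_idx = add_label_noise(clean, noise_fraction=0.2, n_classes=10)
```

Only the training labels would be corrupted in such an experiment; the test labels stay clean so that the test error still measures generalization.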

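One possible explanation for this interesting observation is that the optimization itself introduces implicit regularization. That is, when we solve the problem by optimization, the procedure implicitly seeks solutions of low complexity under some complexity measure, such as a norm. This would explain why we do not overfit even when the number of parameters is large. Moreover, adding units may well lead to solutions of even lower complexity, and hence to better generalization.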
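
A simple linear analogue illustrates the idea; it is not an experiment from the paper. For an underdetermined least-squares problem with more parameters than samples, many weight vectors fit the data exactly, yet gradient descent started from zero converges to the one with the smallest Euclidean norm. The problem sizes, step size, and iteration count below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 100  # more parameters than data points
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Plain gradient descent on 0.5 * ||Xw - y||^2, starting from w = 0.
w = np.zeros(n_features)
step = 1.0 / np.linalg.norm(X, ord=2) ** 2  # 1 / (largest singular value)^2
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)

# Minimum-norm interpolating solution, computed via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Another interpolating solution: add a component from the null space of X.
null_projector = np.eye(n_features) - np.linalg.pinv(X) @ X
w_other = w_min_norm + null_projector @ rng.normal(size=n_features)

print("train residual (GD):   ", np.linalg.norm(X @ w - y))
print("train residual (other):", np.linalg.norm(X @ w_other - y))
print("norm (GD):   ", np.linalg.norm(w))
print("norm (other):", np.linalg.norm(w_other))
```

Both `w` and `w_other` reach zero training error, but nothing in the objective penalizes the norm; the preference for the low-norm solution comes entirely from the optimization procedure. Whether and how a similar effect arises for neural networks is the hypothesis discussed above, not something this sketch establishes.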

Appendix

inductive bias

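The ERM (empirical risk minimization) rule can lead to overfitting, so we look for a way to correct it. Ideally, we find a condition under which ERM does not overfit.

A common solution is to run ERM over a restricted search space. For example, the learner chooses a set of predictors in advance; this set is called the hypothesis class $$\mathcal{H}$$.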

$$\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\operatorname{arg\,min}} \; L_S(h)$$

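There is a built-in restriction here: the final predictor is drawn from a predetermined hypothesis class $$\mathcal{H}$$, so the learner is biased toward a particular set of predictors. This restriction is called an inductive bias.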
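
The ERM rule over a finite hypothesis class can be sketched in a few lines. Everything concrete here is an assumption for illustration: $$L_S(h)$$ is taken to be the 0-1 empirical risk (the fraction of training examples that $$h$$ misclassifies), the hypothesis class is a small set of one-dimensional threshold classifiers, and the toy sample is made up.

```python
import numpy as np

def empirical_risk(h, xs, ys):
    """L_S(h): fraction of training examples that h misclassifies."""
    return float(np.mean(h(xs) != ys))

def make_threshold(t):
    """Predict 1 when x >= t, else 0 (one member of the hypothesis class)."""
    return lambda xs: (xs >= t).astype(int)

def erm(hypotheses, xs, ys):
    """Return the index and risk of a hypothesis minimizing L_S over the class."""
    risks = [empirical_risk(h, xs, ys) for h in hypotheses]
    best = int(np.argmin(risks))
    return best, risks[best]

# Toy training sample S (made up for this sketch).
xs = np.array([0.1, 0.35, 0.4, 0.6, 0.8, 0.9])
ys = np.array([0, 0, 0, 1, 1, 1])

# The inductive bias: only these five threshold predictors are considered.
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
hypothesis_class = [make_threshold(t) for t in thresholds]

best_index, best_risk = erm(hypothesis_class, xs, ys)
print(thresholds[best_index], best_risk)
```

The learner can never output anything other than one of the five thresholds, no matter what the data look like; that restriction is the inductive bias in its most explicit form.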
