To Understand Deep Learning We Need to Understand Kernel Learning

Mikhail Belkin, Siyuan Ma, Soumik Mandal

Intro

监督式学习的关键信息是generalization。一个经典的理论支持是Empirical Risk Minimization (ERM)。给定数据$${x_i, y_i}$$，特征函数空间$$H$$，ERM会根据empirical loss找到minimizer:

$$ f^* = arg min{f \in H} L{emp}(f) = arg min_{f \in H} \sum_i l(f(x_i), y_i)

然后有很多工作研究控制space $$H$$的capacity/complexity。一些数学上对function space complexity的measure包括，VC and fat shattering dimensions, Rademacher complexity, covering numbers。这个都能给一个generalization gap。当n特别大的时候，gap会趋向于0。

这篇paper认为，只有将classical kernel machine理解了之后，我们才能理解deep learning。kernel method被认为是在无限维度的Reproducing Kernel Hilbert spaces (RKHS)上的linear regression问题，即对应positive－definite kernel function，比如Gaussian 或者 Laplacian kernels。它们也可以被当作是两层神经网络，且第一层固定。因此，分析它们会比分析任意神经网络简单。

尽管现在有很多的观察发现说，regularization parameter中非常小的数值经常能够带来optimal performance，但是kernel method中几乎最优解（near-optimality，zero classification/regression error）的系统本质还没有被完好地认知。在margin-based analysis中，比如分析overfitting中的boosting，并不能够简单的解释这么一个情况：当classifier被“骚扰”之后，比如noisy label，性能会下降；但是根据sample complexity会随着数据点增多而线性增加。

这里我们提出一个不同的解释，并且得到了实验的验证。这篇paper的贡献如下

Empirical properties of overfitted and interpolated kernel classifiers
1. deep network中，因为overfitted/interpolated classifier而带来的很强的泛化性能并不唯一。kernel classifier，如果在training data上有zero classification/regression error，它的泛化能力也很好。另外我们发现，通过early stopping实现的regularization只能提供有限的优势。
2. 前面有工作说ReLU和SGD能够很好的拟合标准数据+随机label，只需要原本数据fit的epoch数目的三倍。因此ReLU network function space能够通过很少数量的SGD步数的fitting capacity很高。
Theoretical results and the supporting experimental evidence：

现有的generalization bound无法很好解释interpolated kernel classifier的性能。而且就目前所知，没有一个bound能很好沿用到interpolated classifiers。

Parallels between deep and shallow architectures in performance of overfitted classifiers. 试验结果表明，“过拟合”的kernel classifier在很多数据集上都能够展示很强的结果。而一些regularization（比如early stopping）所能提供的提升比较有限。考虑到kernel方法可以当作是2－layer NN的一种特殊情况，我们归纳deep network结构对最终性能影响不大。（我试验结果不完全一致，不同的结构是会不一样。不同结构对应的最优hyperparameter不一致，但是最优结果是差不多的。）

Existing bounds for kernels lack explanatory power in overfitting regimes. 我们试验结果表明了kernel cudlassifier展示nearly optimal performance，哪怕label noise是明确知道非常大的。换句话说，目前的bound在存在label noise的时候，随着label size增大，互相diverge会增加。我们相信一个新的kernel method对应的theory，not dependent on concentration bounds，对于理解这个问题十分重要。

Generalization and optimization. 我们发现smooth Gaussian kernel和smooth Laplacian kernel有非常不同的优化性质。通过实验证明了，Laplacian kernel可以容易的fit到random label的数据，只需要两倍的epoch数目。而Gaussian kernel中的gradient descent对于计算需求更大。classifier的性能与kernel的结构性质相关（比如radial structure），而不是optimization method（比如SGD）。（这里有点疑问是SGD也被当作implicit regularization，下面的讨论只提到了early stopping）。

Implicit regularization and loss function. gradient descent中的early stopping。是training data accuracy和计算时间的一个trade off。有paper表明了early stopping在kernel method中和L2 norm一样。

Implicit regularization，作为train和test训练的trade off，无法用于generalization performance的解释。

Inductive bias and minimum norm solutions. 尽管inductive bias和implicit regularization经常等价，这里我们还是做点区分。

Regularization：在训练数据集上引入一个bias
Inductive bias：给特定函数一些preference，而不影响他们在training data上的输出。

minimum RKHS norm interpolating solution通过选择函数来引入inductive bias。Representer Theorem保证了minimum norm interpolant是一系列kernel function的线性组合。尽管还无法理解这个inductive bias如何带来很强的泛化能力，但是很明显它们和kernel function的structural properties和RKHS相关。

gradient descent在convex函数上（initialized with 0）明确已知数converge到minimum norm solution。而另外一方面，如果GD／SGD initialized outside of data span，无法converge到minimum RKHS norm solution。因此SGD对应的inductive bias（with 0 initialization），和minimum norm solution是一致的。

Generalization Performance of Overfitted／Interpolating Classifiers

当training square loss随着epoch增加而接近0的时候，发现test error会保持降低，然后stabilize。因此认为early stopping对最终结果影响不大。（前提是epoch足够）

Appendix

On early stopping in gradient descent learning

To Understand Deep Learning We Need to Understand Kernel Learning

To Understand Deep Learning We Need to Understand Kernel Learning