Reconciling modern machine learning and the bias-variance trade-off

Mikhail Belkin et al.

Intro

A predictor is usually chosen from a function class H by Empirical Risk Minimization (ERM).

$$ \hat{h} = \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i) $$
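To make the ERM rule concrete, here is a minimal sketch over a small finite function class. The threshold classifiers, the 0-1 loss, and the toy data are illustrative assumptions and do not come from the paper.

```python
def empirical_risk(h, xs, ys, loss):
    """Average loss of predictor h on the training sample."""
    return sum(loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)


def erm(hypotheses, xs, ys, loss):
    """Return the hypothesis in the class with the smallest empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, xs, ys, loss))


def zero_one_loss(prediction, label):
    """0-1 loss: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def make_threshold(t):
    """Hypothetical 1-D classifier: predict 1 when x >= t, else 0."""
    return lambda x: 1 if x >= t else 0


# Assumed toy function class H and training sample.
hypotheses = [make_threshold(t) for t in (0.0, 1.0, 2.0, 3.0)]
xs = [0.5, 1.5, 2.5, 3.5]
ys = [0, 0, 1, 1]

best = erm(hypotheses, xs, ys, zero_one_loss)
```

Here `best` reaches zero empirical risk on the toy sample, which says nothing yet about its true risk under P.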

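To study out-of-sample (generalization) performance, we assume the training and test data are both drawn i.i.d. from a distribution P. The analytical challenge is a mismatch between two goals: minimizing the empirical risk and minimizing the true risk.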

Classical machine learning explains the risk of overfitting in detail: a model with zero training error usually generalizes poorly.

Yet modern machine learning methods often reach zero training error while still generalizing well.

This work sets out to bridge the gap between these two views.

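The main finding of this work can be summarized as double descent. On the double descent curve, if we keep increasing model complexity, generalization performance can end up better than at the sweet spot of the classical U-shaped curve.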

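This raises a question. Every interpolating classifier has zero empirical risk, so empirical risk is no longer a meaningful statistic for comparing them. Why, then, do some of them have such low test risk? The answer is that model complexity alone does not capture how well the learned classifier matches the inductive bias. For the learning problems considered here, the inductive bias is the regularity of a function as measured by a suitable function space norm. Occam's razor then says: among the functions that fit the training data perfectly, choose the one with the smallest norm.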
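The "smallest norm among perfect fits" rule can be sketched for one simple case. The assumptions here are a linear-in-features predictor h(x) = w · φ(x) with more features than training points, and the L2 norm of w standing in for the function space norm. The feature matrix `Phi` and the targets are made-up values; with full row rank the minimum-norm interpolant is w = Φᵀ(ΦΦᵀ)⁻¹y.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def min_norm_interpolant(Phi, y):
    """Smallest-L2-norm w with Phi w = y (Phi is n x N, n < N, full row rank)."""
    n, N = len(Phi), len(Phi[0])
    gram = [[sum(Phi[i][k] * Phi[j][k] for k in range(N)) for j in range(n)]
            for i in range(n)]
    alpha = solve(gram, y)  # coefficients on the training points
    return [sum(alpha[i] * Phi[i][k] for i in range(n)) for k in range(N)]


def predict(Phi, w):
    return [sum(p * wk for p, wk in zip(row, w)) for row in Phi]


def sq_norm(w):
    return sum(wk * wk for wk in w)


# Assumed toy problem: 2 training points, 3 features.
Phi = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

w_min = min_norm_interpolant(Phi, y)
w_other = [1.0, 2.0, 0.0]  # another weight vector that also fits the data exactly
```

Both `w_min` and `w_other` have zero training error, so empirical risk cannot separate them; the norm can, and `w_min` is the one the Occam's razor rule prefers.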

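Related ideas have been considered for large margin classifiers: a larger H also allows the classifier that is eventually found to have a larger margin. That analysis, however, does not cover the U-shaped part of the curve. Some recent work argues that other predictors not based on ERM can also be statistically optimal, consistent with what is observed in practice. The theory for ERM predictors remains to be worked out.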

Neural Networks

Random Fourier Features

Neural networks and backpropagation

Decision Trees and Ensemble Methods

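The size of a decision tree can be measured by its number of leaves. Measured this way, a U-shaped bias-variance trade-off curve is commonly observed. To enlarge the function class further, we consider ensembles. So, beyond the interpolation threshold, we use the number of trees to represent the complexity of the class.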
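The two-phase complexity axis described above can be sketched as a schedule of settings to sweep. The function name `complexity_schedule`, the doubling of the leaf budget, and the assumption that a tree with as many leaves as training points can interpolate are all illustrative choices, not details taken from the paper.

```python
def complexity_schedule(n_train, max_trees):
    """Yield (max_leaves, n_trees) pairs in order of increasing complexity."""
    # Phase 1: a single tree whose leaf budget grows toward the
    # (assumed) interpolation threshold of n_train leaves.
    leaves = 2
    while leaves < n_train:
        yield (leaves, 1)
        leaves *= 2
    # Phase 2: past the threshold, keep the full leaf budget and
    # grow the ensemble instead.
    for n_trees in range(1, max_trees + 1):
        yield (n_train, n_trees)


schedule = list(complexity_schedule(n_train=8, max_trees=3))
```

Each pair could then be passed to whatever tree or forest learner is in use, with test risk plotted against the position in the schedule.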

Appendix

Compared with the experiments we ran ourselves, two questions remain:

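Does double descent still appear on an imbalanced dataset?
When we replace the loss with another evaluation metric, such as accuracy or AUC, the conclusion may no longer hold. (In that case, it may not hold for the loss either.)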
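On the second question, a small sketch shows why a curve measured in loss need not carry over to accuracy: the two metrics can rank the same pair of predictors in opposite order. The probabilities below are made-up values for illustration, and the 0.5 decision threshold is an assumption.

```python
import math


def log_loss(probs, labels):
    """Mean negative log-likelihood for binary labels in {0, 1}."""
    return -sum(math.log(p if y == 1 else 1.0 - p)
                for p, y in zip(probs, labels)) / len(labels)


def accuracy(probs, labels):
    """Fraction of correct predictions at an assumed 0.5 threshold."""
    return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels)) / len(labels)


labels = [1, 1, 1]
confident = [0.99, 0.99, 0.45]  # very sure on two points, just wrong on the third
cautious = [0.51, 0.51, 0.51]   # barely right on every point
```

Here `confident` has the lower log loss while `cautious` has the higher accuracy, so which predictor looks better depends on the metric.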
