A Theory of Feature Learning

Supervised Feature Learning/Loss and Experiment Specific Features

最重要的一件事情是画出整个马尔科夫链的结构。

根据这个图，一些重要定义：

risk: $$RL (P{XY}, f) = \mathbb{E}{P{XY}} L(y, f(x))$$
minimum risk: $$\hat RL (P{XY}) = \underset{f \in A^X }{inf} RL(P{XY}, f)$$
Bayes optimal (act): $$f{P{XY}} = \underset{f \in A^X }{arg\,inf} RL(P{XY}, f)$$
feature gap: $$\hat RL(P{ZY}) - \hat RL(P{XY})$$

Theorem 1是告诉我们，给定joint distribution，feature map，和loss function的时候，feature gap可以写作是给定一个oracle $$P{Y|x}$$，而我们想知道如果使用$$P{Y|z}$$会带来的regret的期望。

feature gap根据定义，是$$\hat RL(P{ZY}) - \hat RL(P{XY})$$，非负的。这有一些很深层次的含义：

直观上解释，整个过程，我们损失的信息增多，risk会增大，所以difference of risk只会变大
就告诉我们说，假设我们有两个非常好的bayes act，从$$X \to A$$和$$Z \to A$$，那么一个悲观的结论是我们无法通过这个feature map来reduce risk，并且提升性能。
既然如此，为什么我们还需要feature map？
1. computation efficiency: 有一些应用需要更快的fast response，但对精度要求没那么高
2. 我们往往是要得到一个最优的bayes act，而至今我们还没有办法得到非常有效的方法来获得$$f{P{XY}}$$。而如果通过feature map，我们不仅仅可以降低计算量，而且可以降低了特征维度，从而更加容易（对于算法）找到最优解。
3. 当然，这些讨论只是针对shallow model，对于deep model，我们已经模糊化了feature map和bayesian act部分。所以换一个层面讲，这也可以成为我们接受deep learning的理由，因为它就是将feature map和bayesian act两个部分的结合。

Theorem 2是告诉我们，feature gap为0（很小）等价于$$X \bot Y | Z$$。这带来了一些不错的性质。比如说

$$P(X|Y, Z) = P(X|Z)$$，这就是说$$Z$$ is sufficient for $$X$$？
$$P(Y|X, Z) = P(Y|Z)$$，这就是说如果要预测$$Y$$，那么$$Z$$已经包含了足够的信息

此外，这条theorem还告诉我们如何判定一个loss在supervised feature learning的任务中是适用的。如果这个loss有这个性质：$$D_L(P, Q)=0$$ iff $$P=Q$$，那么我们就说这一条loss可以用来

比如说我们有log loss代入feature gap，且$$(x,z) \sim P_{XZ}$$，那么可以得到

$$DL(P, Q) = D_L(P{Y|x}, P{Y|z}) = \mathbb{E}{y \sim P{Y|x}}[L(y, P{Y|z}) - L(y, P_{Y|x)}]$$

Recall that for binary case, log loss is $$L(y, P) = -[y \log P + (1-y)\log(1-P)]$$

so we have $$DL(P{Y|x}, P{Y|z})$$ $$= \mathbb{E}{y \sim P{Y|x}}[ (-y \log P{Y|z} + (1-y)\log{(1-P{Y|z})} - (-y \log P{Y|x} + (1-y)\log{(1-P{Y|x})} ]$$ $$=\mathbb{E}{y \sim P{Y|x}}[ y \log\frac{P{Y|x}}{P{Y|z}} + (1-y)\log\frac{1-P{Y|x}}{1-P{Y|z}} ]$$ $$=D{KL}(P{Y|x}, P{P|z})$$

但是以上都还是基于loss sensitive的方法。

Definition 3定义了从$$P{Z|Y}$$到$$P{X|Y}$$的weighted directed deficiency：

$$\delta{P_Y}(P{Z|Y}, P{X|Y}) = \inf{P{X|Z}} \mathbb{E}{y \sim PY} | P{X|y} - P{X|Z} P{Z|y} |$$

Theorem 4说对于任意的feature map，feature gap $$\le \epsilon |L|$$等价于从$$P{Z|Y}$$到$$P{X|Y}$$的weighted directed deficiency $$\le \epsilon$$。

最直观的感受就是这给了我们一个loss insensitive的bound。当我们不知道loss是什么样的时候，可以通过minimize weighted directed deficiency来解决。weighted directed deficiency本身是比较难以计算，但是可以通过variational divergence的性质，将其转化为一个$$L_1$$minimization问题。

Unsupervised Feature Learning

supervised的设定下，feature gap需要我们知道$$P{Y|X}$$和loss，而weighted directed deficiency需要我们知道$$P{X|Y}$$和variational divergence。而weighted directed deficiency告诉我们，当loss未知，可以考虑最坏情况，通过class conditional distribution来度量我们的feature distribution，从而度量feature map。

在这两种情况下，我们对于task $$Y$$的了解是非常充分的。但是很多的实际情景中，如transfer learning，few-shot learning或者life-long learning的情况下，我们对于新的task了解有限few knowledge。由此很自然地考虑是，我们能否能够挖取general feature gap，这就代入了unsupervised情况。

下面在做关于unsupervised的情况下时候，我们假设已知完整的$$P_X$$。然后带着如下问题：在什么样的情况下，我们可以保证说feature map不会损失超过$$\epsilon$$的信息，不论$$X$$和$$Y$$是什么关系或者是任何的loss function？

Theorem 5说，对于任意的$$P{XY}$$，label space $$Y$$和损伤函数$$L$$，feature gap $$\le \epsilon |L|$$等价于存在一个重构过程$$\hat P{X|Z}$$，且满足$$\mathbb{E}{x\sim P_X} \mathbb{E}{x' \sim \hat P{X|Z} P{Z|x} } \nvdash (x' \ne x) \le \epsilon$$

能得到两个有用的message 1. 告诉我们如果能够得到好的reconstruction，也就意味着feature gap很小（feature map很好）。 2. 这也直觉上验证了好的autoencoder可以得到一个很好的feature map。

Definition 6定义了reconstruction regret是 $$Dr(x, x') = D_L(P{Y|x}, P{Y|x'}) = \mathbb{E}{y \sim P{Y|x}} [ L(y, f{P{XY}}(x')) - L(y, f{P_{XY}}(x)) ]$$

假设$$X$$空间上有一个metric，那么会根据这个定义的space上的metric $$d$$，来构造reconstruction regret。

Theorem 7是说对于任意的feature map $$P_{Z|X}$$，以下两个是等价的

存在$$\hat P{X|Z}$$，使得$$\mathbb{E}{x\sim PX} \mathbb{E}{x' \sim \hat P{X|Z} P{Z|x} } d(x', x) \le \epsilon$$
对于任意的joint distribution $$P{XY}$$和损伤函数$$L$$，且对于任意的$$x, x'$$符合$$D_r(x, x') \le \lambda d(x, x')$$，那么有$$\Delta \hat R_L(P{XY}, P_{Z|X}) \le \epsilon \lambda$$ （这里的regret是bayes regret）

先说直觉上的理解。(?)

证明：

1到2：

首先假设已经有了bayes optimal，对于joint distribution $$P{XY}$$和损失函数L $$f{P_{XY}}$$，也就是bayes act。

在框架上说明这个bayesian act：就是首先reconstruct，然后对于每一个instance，使用bayes optimal learning algorithm。那么可以很容易地说feature gap是小于等于reconstruction gap。

feature gap $$\hat RL(P{ZY}) - \hat RL({P{XY}}) \le \hat RL(P{X'Y}) - \hat RL({P{XY}})$$ $$= \mathbb{E}{(z,y) \sim P{ZY}} \mathbb{E}{x' \sim \hat P{X|z}} [L(y, f{P{XY}}(x')) - L(y, f{P{XY}}(x))]$$ $$=\mathbb{E}{x\sim P_X} \mathbb{E}{x' \sim \hat P{X|Z} \cdot P{Z|x}} \mathbb{E}{y \sim P{Y|x}} [L(y, f{P{XY}}(x')) - L(y, f{P{XY}}(x))]$$ //记住$$PX$$是已知的。 $$=\mathbb{E}{x\sim PX} \mathbb{E}{x' \sim \hat P{X|Z} \cdot P{Z|x}} Dr(x, x')$$ // 由statement 2的条件可知 $$\le \mathbb{E}{x\sim PX} \mathbb{E}{x' \sim \hat P{X|Z} \cdot P{Z|x}} \lambda d(x, x')$$ // 由statement 1可知，reconstruction error小于等于 $$\epsilon$$。 $$\le \epsilon \lambda$$

2到1：

证明1中只要找到一个存在即可

2中不妨找一个最为informative experiment，即为$$Y=X$$，然后找一个loss function是空间X上的metric $$L(x',x) = d(x', x)$$。

所以理论上能找到最好的bayes act是$$f{P{XY}}(x) = x$$，并且$$P{Y|x} = id_x$$，也就是$$P{Y|X}(x) = id_x(x) = x$$。

$$\hat RL(P{XY}) = \mathbb{E}{x\sim P_X} \hat L (P{Y|x}) = \mathbb{E}{(x,y)\sim P{XY}} L(y, P_{Y|x})=0$$。 // 从而从$$X\to Y$$的Bayes risk应当是0。

$$Dr(x', x)$$ $$= D_L(P{Y|x}, P{Y|x'})$$ //这是说$$P{Y|x}=x$$是$$x$$的point mass $$= D_L(x, x')$$ $$= L(x', x)$$ $$= d(x', x)$$ 也就是说在这个设定下，我们可以得到$$x$$和$$x'$$的reconstruction regret就是两个点之间的距离$$d(x,x')$$。

最后从feature gap可以得到我们就可以得到 $$\epsilon \ge \Delta RL(P{XY}, P{Z|X})$$ $$= \hat R_L(P{ZY}) - \hat RL(P{XY})$$ $$= \hat RL (P{XZ})$$ $$= \underset{P{X'|Z} \in \mathcal{P}(X'^Z)}{inf} \mathbb{E}{x \sim PX} \mathbb{E}{z \sim P{Z|x}} \mathbb{E}{x' \sim P{X'|z}} L(x, x')$$ $$= \underset{P{X'|Z} \in \mathcal{P}(X'^Z)}{inf} \mathbb{E}{x \sim P_X} \mathbb{E}{z \sim P{Z|x}} \mathbb{E}{x' \sim P_{X'|z}} d(x, x')$$

所以可以说存在某一个$$\hat P_{X|Z}$$可以满足reconstruction regret小于$$\epsilon$$。

能够得到的有用信息是 1. Theorem 7告诉我们的是基于metric $$d$$的reconstruction error与feature gap的关系。recall这里的metric $$d$$可以是任意在space X上的metric。 2. 如果将$$d$$设定为discrete metric，那么就能得到theorem 5

Theorem 8是borrow了一个观点，reconstruction error会被conditional entropy给bound住。这给了我们一个很好的切入点，关于如何构造reconstruction error。

Notice

纠正一下typo，和作者确认过

Theorem 8中是$$\nvDash (x' \ne x)$$，定义是discrete metric
Theorem 7和后面的证明中，凡是$$\Delta RL(P{XY}, P{Z|Y})$$都是feature gap$$\Delta R_L(P{XY}, P_{Z|X})$$
section 4中，在4.1之前的几个段落里，Theorem 4都是Theorem 5，而Theorem 5应该是Theorem 7。
7.4.2是proof for theorem 5

另外作者也提供了一个更加清楚的版本，在他的thesis中。

Appendix

1) What are main points/contributions of the paper?

了解不同的feature learning schemes
通过rate distortion theory和它的generalization来判断feature质量

2) How does this paper contribute to the general understanding of the problem, especially relative to past work?

这篇paper没有提到

3) Are the main results in the paper significant? Why or why not?

4) Consider what the results mean or imply in specific (maybe simple) cases. For example, what happens if we consider a special case, like a linear setting? Are the paper's ideas sensible then?

5) You might consider performing simple numerical experiments to shed light on the results.

6) Think about how the results could inform practice and/or apply to real-world problems.

7) What are some open questions that remain? What would be good next steps in this line of research?

这篇paper算是给feature learning的theory打开了一个门，后续比如特定的feature learning方法都可以沿用；如VAE。（GAN不确定）

Machine learning Decision theory

Information, Divergence and Risk for Binary Experiments

https://arxiv.org/pdf/1411.1792.pdf

另外一些有意思的结论:

使用学习到的特征，无法减少Bayes optimal classifier的risk

另外一种理解方式是，假设有markov chain $$Y \to X \to Z$$，那么$$I(Z, Y) \le I(X, Y)$$ 其中$$I$$是mutual information

A Theory of Feature Learning

A Theory of Feature Learning