A PAC-Bayesian Bound for Lifelong Learning

Anastasia Pentina, Christoph H. Lampert

IST Austria

Intro

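In the multi-task setting, the goal is to improve performance on all tasks.

In the domain adaptation setting, the goal is to perform well on a new task, which typically has only a small amount of labeled data.

In the lifelong learning (learning to learn) setting, the goal is to perform well on future tasks.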

The PAC-Bayesian framework

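PAC-Bayesian theory studies the properties of randomized predictors, called Gibbs predictors. Suppose we have a probability distribution $$P$$ over the hypothesis space $$H$$. The Gibbs predictor associated with $$P$$ is a stochastic predictor: given any $$x \in \mathcal{X}$$, it samples a hypothesis $$h \in H$$ with $$h \sim P$$ and returns $$h(x)$$.

Can $$P$$ be viewed as the task environment here? No: $$P$$ is only a distribution over the hypothesis space, not a distribution over tasks.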
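The sampling step of a Gibbs predictor can be sketched in a few lines of Python. This is only an illustration under assumptions not in the notes: the hypothesis space is a hypothetical finite set of three threshold classifiers, and the names `gibbs_predict`, `hypotheses`, and `P` are made up for the example.

```python
import random

def gibbs_predict(hypotheses, probs, x, rng):
    """Draw one hypothesis h ~ P and return h(x)."""
    h = rng.choices(hypotheses, weights=probs, k=1)[0]
    return h(x)

# Hypothetical finite hypothesis space: threshold classifiers on the real line.
hypotheses = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]
# Hypothetical distribution P over the three hypotheses.
P = [0.2, 0.5, 0.3]

rng = random.Random(0)
# Each call re-samples h, so repeated predictions for the same x can differ.
predictions = [gibbs_predict(hypotheses, P, 0.7, rng) for _ in range(1000)]
frac_one = sum(predictions) / len(predictions)
```

For `x = 0.7` the thresholds 0.0 and 0.5 predict 1 and the threshold 1.0 predicts 0, so the fraction of 1 predictions should be close to 0.2 + 0.5 = 0.7. Note that the randomness is over hypotheses only, which matches the remark above that $$P$$ is not a distribution over tasks.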

Now suppose we are given a set $$S=\{(x_1, y_1), \dots, (x_m, y_m)\}$$ whose samples are drawn from some unknown distribution $$D$$ over $$\mathcal{X} \times \mathcal{Y}$$.

$$er(Q)$$ denotes the expected loss of the Gibbs classifier associated with $$Q$$ (some distribution over hypotheses), i.e. $$er(Q) = \mathbb{E}_{h \sim Q} \mathbb{E}_{(x,y) \sim D} \, \ell(h(x), y)$$.

$$\hat{er}(Q)$$ denotes the expected empirical loss, i.e. $$\hat{er}(Q) = \mathbb{E}_{h \sim Q} \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i)$$.
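For a finite hypothesis space the expectation over $$h \sim Q$$ is a weighted sum, so the empirical Gibbs risk can be computed exactly. The sketch below assumes a hypothetical set of threshold classifiers, the 0-1 loss, and a toy sample; none of these come from the paper.

```python
def zero_one(y_pred, y_true):
    """0-1 loss: 1.0 on a mistake, 0.0 otherwise."""
    return float(y_pred != y_true)

def empirical_gibbs_risk(hypotheses, Q, sample, loss):
    """E_{h ~ Q} (1/m) sum_i loss(h(x_i), y_i) for a finite hypothesis space."""
    m = len(sample)
    return sum(
        q * sum(loss(h(x), y) for x, y in sample) / m
        for h, q in zip(hypotheses, Q)
    )

# Hypothetical hypothesis space, posterior Q, and sample S.
hypotheses = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]
Q = [0.2, 0.5, 0.3]
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

risk = empirical_gibbs_risk(hypotheses, Q, S, zero_one)
```

Here the thresholds 0.0 and 1.0 each misclassify half of the sample and the threshold 0.5 makes no mistakes, so `risk` is 0.2 · 0.5 + 0.5 · 0 + 0.3 · 0.5 = 0.25 up to floating-point rounding.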

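By McAllester (1999), the following then holds with probability at least $$1-\delta$$ over the draw of $$S$$:

$$er(Q) \le \hat{er}(Q) + \sqrt{\frac{KL(Q\|P) + \log \frac{1}{\delta} + \log m + 2}{2m-1}}$$

Here $$P$$ is the prior distribution over hypotheses, and $$Q$$ is the posterior, which may be chosen after seeing the data $$S$$.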
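To get a feel for how the right-hand side behaves, it can be evaluated numerically. The sketch below assumes discrete $$P$$ and $$Q$$ on the same finite support, with $$P$$ positive wherever $$Q$$ is; the function names and all numbers are hypothetical.

```python
import math

def kl_divergence(Q, P):
    """KL(Q || P) for discrete distributions; assumes P[i] > 0 wherever Q[i] > 0."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

def mcallester_bound(emp_risk, kl, m, delta):
    """Right-hand side of the bound exactly as written above."""
    complexity = (kl + math.log(1 / delta) + math.log(m) + 2) / (2 * m - 1)
    return emp_risk + math.sqrt(complexity)

# Hypothetical prior, posterior, and empirical risk.
P = [1 / 3, 1 / 3, 1 / 3]
Q = [0.1, 0.8, 0.1]
kl = kl_divergence(Q, P)

bound_small_m = mcallester_bound(0.05, kl, m=100, delta=0.05)
bound_large_m = mcallester_bound(0.05, kl, m=10_000, delta=0.05)
```

With the same empirical risk and KL term, the bound tightens as $$m$$ grows because the numerator grows only like $$\log m$$ while the denominator grows linearly in $$m$$.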

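Most current deep MTL methods effectively use a uniform distribution. Can the prior and posterior then be regarded as identical, which would amount to $$KL(Q\|P) = 0$$? Or should we only take $$P$$ to be uniform, with $$Q$$ unknown?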
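One way to read this question is through a short worked calculation, under the assumption (made here only for illustration) that $$H$$ is finite with $$|H| = N$$. If the posterior equals the prior, every log-ratio vanishes:

$$KL(Q\|P) = \sum_{h \in H} Q(h) \log \frac{Q(h)}{P(h)} = \sum_{h \in H} P(h) \log 1 = 0 \quad \text{when } Q = P$$

If instead only $$P$$ is uniform and $$Q$$ puts all of its mass on a single hypothesis $$h^*$$, then

$$KL(Q\|P) = 1 \cdot \log \frac{1}{1/N} = \log N$$

So a uniform prior alone does not make the KL term zero. It is zero only if the posterior is also left at the prior, in which case nothing has been learned from $$S$$.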

PAC-Bayesian Lifelong Learning

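Assume all tasks are drawn from some distribution, and that all tasks share the same input space, output space, hypothesis space, and loss function. The lifelong learning system (agent) observes n tasks $$t_1, \dots, t_n$$. For each individual task, the agent uses a deterministic procedure that, given a prior $$P$$ and a training sample $$S$$, produces a posterior $$Q = Q(S, P)$$.

The agent's goal is to use the observed tasks to determine prior knowledge that will help on unobserved tasks.

One contribution of this paper is to treat the prior $$P$$ itself as a random variable, drawn from a distribution $$\mathcal{Q}$$ over priors (the hyperposterior).

We make the following definitions: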

transfer risk/error

$$er(\mathcal{Q}) = \mathbb{E}_{(t, S_t)} \mathbb{E}_{P \sim \mathcal{Q}} \, er(Q_t(S_t, P))$$

expected multi-task risk

$$\tilde{er}(\mathcal{Q}) = \mathbb{E}_{P \sim \mathcal{Q}} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{h \sim Q_i(S_i, P)} \mathbb{E}_{(x,y) \sim D_i} \, \ell(h(x), y)$$

empirical multi-task risk

$$\hat{er}(\mathcal{Q}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{P \sim \mathcal{Q}} \, \hat{er}(Q_i(S_i, P))$$
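The empirical multi-task risk is the only one of these three quantities that can be computed from data alone. The sketch below evaluates it for a toy setup, and almost everything in it is an assumption for illustration: the hypothesis space is finite, the distribution over priors is supported on two hypothetical candidate priors, and the deterministic rule $$Q(S, P)$$ is taken to be a Gibbs-style reweighting with a made-up parameter `lam`. The notes do not say which rule the paper uses, so this is a stand-in, not the authors' method.

```python
import math

def zero_one(y_pred, y_true):
    return float(y_pred != y_true)

def empirical_risk(h, sample):
    """Average 0-1 loss of a single hypothesis on one task's sample."""
    return sum(zero_one(h(x), y) for x, y in sample) / len(sample)

def gibbs_posterior(prior, sample, hypotheses, lam=5.0):
    """Hypothetical rule Q(S, P): Q(h) proportional to P(h) * exp(-lam * er_S(h))."""
    weights = [p * math.exp(-lam * empirical_risk(h, sample))
               for h, p in zip(hypotheses, prior)]
    total = sum(weights)
    return [w / total for w in weights]

def empirical_multitask_risk(hyperposterior, priors, samples, hypotheses):
    """(1/n) sum_i E_{P ~ hyperposterior} of the empirical Gibbs risk of Q_i(S_i, P)."""
    total = 0.0
    for sample in samples:                                # tasks i = 1..n
        for weight, prior in zip(hyperposterior, priors):  # expectation over P
            Q_i = gibbs_posterior(prior, sample, hypotheses)
            total += weight * sum(q * empirical_risk(h, sample)
                                  for h, q in zip(hypotheses, Q_i))
    return total / len(samples)

# Hypothetical hypothesis space, candidate priors, and two observed tasks.
hypotheses = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]
priors = [[1 / 3, 1 / 3, 1 / 3], [0.1, 0.8, 0.1]]
samples = [
    [(0.1, 0), (0.6, 1), (0.9, 1)],
    [(0.2, 0), (0.4, 0), (0.7, 1)],
]

risk_uniform = empirical_multitask_risk([1.0, 0.0], priors, samples, hypotheses)
risk_informed = empirical_multitask_risk([0.0, 1.0], priors, samples, hypotheses)
risk_mixed = empirical_multitask_risk([0.5, 0.5], priors, samples, hypotheses)
```

In this toy setup both tasks are solved by the threshold 0.5, so putting all hyperposterior mass on the prior that favors that hypothesis gives a lower empirical multi-task risk than putting it on the uniform prior. That is the sense in which a good prior carries knowledge across tasks.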

Appendix

There seems to be no discussion here of how to find the posterior in practice.
