Understanding and Improving Information Transfer in Multi-Task Learning
Sen Wu, Hongyang Zhang, Chris Re
1 Intro
The theory consists of three components: the capacity of the shared module, task covariance, and the per-task weights of the training procedure.
2 Three Components of Multi-Task Learning
2.1 Modeling Setup
Consider an MTL model with a shared module $$B$$ and a task-specific module $$A_i$$ for each task. The overall loss can be written as
$$ f(A_1, A_2, \ldots, A_k; B) = \sum_{i=1}^k \alpha_i \cdot L(g(X_i B) A_i, y_i)
$$
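As a concrete sketch of this objective, the snippet below evaluates $$f$$ with a squared loss for $$L$$, for both the linear and ReLU choices of $$g$$ defined next. All shapes, data, and the choice of squared loss are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: k = 2 tasks, d = 5 features, shared capacity r = 2.
k, d, r = 2, 5, 2
Xs = [rng.normal(size=(30, d)) for _ in range(k)]  # per-task data X_i
ys = [rng.normal(size=(30,)) for _ in range(k)]    # per-task labels y_i
alphas = [1.0, 0.5]                                # per-task weights alpha_i

B = rng.normal(size=(d, r))                        # shared module B
As = [rng.normal(size=(r,)) for _ in range(k)]     # task-specific modules A_i

def mtl_loss(B, As, g=lambda Z: Z):
    # f = sum_i alpha_i * ||g(X_i B) A_i - y_i||^2  (squared loss as L)
    return sum(a * np.sum((g(X @ B) @ A - y) ** 2)
               for a, X, A, y in zip(alphas, Xs, As, ys))

print(mtl_loss(B, As))                                 # linear g(XB) = XB
print(mtl_loss(B, As, g=lambda Z: np.maximum(Z, 0)))   # ReLU g
```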
We consider two classes of models:
single-task linear model:$$y = X \theta + \epsilon, \epsilon \sim \mathcal{N}(0, \sigma^2), g(XB) = XB$$
single-task ReLU model: $$\mathrm{ReLU}(x) = \max(x, 0),\ y = \alpha \cdot \mathrm{ReLU}(X \theta) + \epsilon$$
Problem statement. Analyze the conditions for positive and negative transfer in terms of three components: model capacity ($$r$$), task covariances ($$X^TX$$), and per-task weights ($$\alpha_i$$).
2.2 Model Capacity
The model capacity is the output dimension $$r$$ of $$B$$. For transfer to occur, $$r$$ should be smaller than the sum of the capacities of the STL modules.
For STL, the optimal solution for task $$i$$ is $$\theta_i = (X_i^T X_i)^\dagger X_i^T y_i$$, so a capacity of 1 suffices for each task. When $$r \ge k$$, there is no transfer between any two tasks.
Proposition 1. Let $$r \ge k$$. There exists an optimum $$B^\star$$ and $$A_i^\star$$ where $$B^\star A_i^\star = \theta_i$$.
To illustrate this idea: as long as $$B^\star$$ contains every $$\theta_i$$ in its column span, there exist $$A_i^\star$$ such that $$B^\star A_i^\star = \theta_i$$. Since each task can then recover its optimal STL solution, this shows there is no transfer between any two tasks.
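The construction in Proposition 1 can be checked numerically: stack the STL optima $$\theta_i$$ as columns of $$B$$ and pick $$A_i$$ to be the $$i$$-th standard basis vector. The data below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 6  # hypothetical: k tasks, feature dimension d, capacity r = k

# Per-task STL optima: theta_i = (X_i^T X_i)^† X_i^T y_i
thetas = []
for _ in range(k):
    X = rng.normal(size=(20, d))
    y = rng.normal(size=(20,))
    thetas.append(np.linalg.pinv(X.T @ X) @ X.T @ y)

# With r = k, put each theta_i in B's column span and set A_i = e_i.
B = np.column_stack(thetas)              # d x k shared module
for i, theta in enumerate(thetas):
    A_i = np.eye(k)[:, i]                # selects column i of B
    assert np.allclose(B @ A_i, theta)   # B A_i recovers the STL optimum

print("every task recovers its STL solution: no transfer")
```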
The ideal capacity depends on task similarity, which raises the question of how to quantify it.
2.3 Task Covariance
Section 2.2 shows that the model capacity must be restricted for transfer to occur, so here we set $$r = 1$$: $$B$$ is a $$d$$-dimensional vector, and $$A_1, A_2$$ are scalars.
A natural requirement is that the two tasks be similar, e.g., that the cosine similarity of their parameters is high.
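As a minimal sketch of this similarity measure, the function below computes the cosine similarity between two parameter vectors; the example vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = <u, v> / (||u|| ||v||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

theta1 = np.array([1.0, 2.0, 0.0])
theta2 = np.array([2.0, 4.0, 0.0])   # parallel to theta1: maximally similar
theta3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to theta1: dissimilar

print(cosine_similarity(theta1, theta2))  # 1.0
print(cosine_similarity(theta1, theta3))  # 0.0
```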
Appendix
$$ f = \| Y - XW \|^2 = (Y-XW)^T (Y-XW) = Y^TY - 2 Y^TXW + W^TX^TXW
$$
thus the derivative is
$$ \frac{\partial f}{\partial W} = -2Y^TX + 2W^TX^TX
$$
setting it to zero, we have
$$ X^TXW = X^TY $$
$$ (X^TX)^\dagger X^TX W = (X^TX)^\dagger X^T Y $$
$$ W = (X^TX)^\dagger X^T Y $$
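This closed form can be sanity-checked against NumPy's least-squares solver; the random data below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=(50,))

# Closed form from the derivation: W = (X^T X)^† X^T Y
W_closed = np.linalg.pinv(X.T @ X) @ X.T @ Y

# Reference solution from NumPy's least-squares solver
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(W_closed, W_lstsq)
print(W_closed)
```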