Understanding and Improving Information Transfer in Multi-Task Learning
Sen Wu, Hongyang Zhang, Chris Re
1 Intro
The theory consists of three components: the capacity of the shared module, task covariance, and the per-task weights of the training procedure.
2 Three Components of Multi-Task Learning
2.1 Modeling Setup
Consider an MTL model with a shared module $$B$$ and a task-specific module $$A_i$$ for each task. The overall loss can be written as
$$ f(A_1, A_2, \ldots, A_k; B) = \sum_{i=1}^k \alpha_i \cdot L(g(X_i B) A_i, y_i)
$$
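As a concrete sketch of this objective, the snippet below evaluates $$f$$ with a squared loss for $$L$$, for both the linear and ReLU choices of $$g$$ defined next. All shapes, data, and the choice of squared loss are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: k = 2 tasks, d = 5 features, shared capacity r = 2.
k, d, r = 2, 5, 2
Xs = [rng.normal(size=(30, d)) for _ in range(k)]  # per-task data X_i
ys = [rng.normal(size=(30,)) for _ in range(k)]    # per-task labels y_i
alphas = [1.0, 0.5]                                # per-task weights alpha_i

B = rng.normal(size=(d, r))                        # shared module B
As = [rng.normal(size=(r,)) for _ in range(k)]     # task-specific modules A_i

def mtl_loss(B, As, g=lambda Z: Z):
    # f = sum_i alpha_i * ||g(X_i B) A_i - y_i||^2  (squared loss as L)
    return sum(a * np.sum((g(X @ B) @ A - y) ** 2)
               for a, X, A, y in zip(alphas, Xs, As, ys))

print(mtl_loss(B, As))                                 # linear g(XB) = XB
print(mtl_loss(B, As, g=lambda Z: np.maximum(Z, 0)))   # ReLU g
```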
We consider two classes of models:
single-task linear model:$$y = X \theta + \epsilon, \epsilon \sim \mathcal{N}(0, \sigma^2), g(XB) = XB$$
single-task ReLU model: $$\mathrm{ReLU}(x) = \max(x, 0),\ y = \alpha \cdot \mathrm{ReLU}(X \theta) + \epsilon$$
Problem statement. Analyze the conditions for positive and negative transfer in terms of three components: model capacity ($$r$$), task covariances ($$X^TX$$), and per-task weights ($$\alpha_i$$).
2.2 Model Capacity
The model capacity is the output dimension $$r$$ of $$B$$. For transfer to occur, $$r$$ should be smaller than the sum of the capacities of the STL modules.
For STL, the optimal solution for task $$i$$ is $$\theta_i = (X_i^T X_i)^\dagger X_i^T y_i$$, so a capacity of 1 suffices for each task. When $$r \ge k$$, there is no transfer between any two tasks.
Proposition 1. Let $$r \ge k$$. There exists an optimum $$B^\star$$ and $$A_i^\star$$ where $$B^\star A_i^\star = \theta_i$$.
To illustrate this idea: as long as $$B^\star$$ contains every $$\theta_i$$ in its column span, there exist $$A_i^\star$$ such that $$B^\star A_i^\star = \theta_i$$. Since each task can then recover its optimal STL solution, this shows there is no transfer between any two tasks.
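The construction in Proposition 1 can be checked numerically: stack the STL optima $$\theta_i$$ as columns of $$B$$ and pick $$A_i$$ to be the $$i$$-th standard basis vector. The data below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 6  # hypothetical: k tasks, feature dimension d, capacity r = k

# Per-task STL optima: theta_i = (X_i^T X_i)^† X_i^T y_i
thetas = []
for _ in range(k):
    X = rng.normal(size=(20, d))
    y = rng.normal(size=(20,))
    thetas.append(np.linalg.pinv(X.T @ X) @ X.T @ y)

# With r = k, put each theta_i in B's column span and set A_i = e_i.
B = np.column_stack(thetas)              # d x k shared module
for i, theta in enumerate(thetas):
    A_i = np.eye(k)[:, i]                # selects column i of B
    assert np.allclose(B @ A_i, theta)   # B A_i recovers the STL optimum

print("every task recovers its STL solution: no transfer")
```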
The ideal capacity depends on task similarity, which raises the question of how to quantify it.
2.3 Task Covariance
Section 2.2 shows that the model capacity must be restricted for transfer to occur, so here we set $$r = 1$$: $$B$$ is a $$d$$-dimensional vector, and $$A_1, A_2$$ are scalars.
A natural requirement is that the two tasks be similar, e.g., that the cosine similarity of their parameters is high.
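As a minimal sketch of this similarity measure, the function below computes the cosine similarity between two parameter vectors; the example vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = <u, v> / (||u|| ||v||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

theta1 = np.array([1.0, 2.0, 0.0])
theta2 = np.array([2.0, 4.0, 0.0])   # parallel to theta1: maximally similar
theta3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to theta1: dissimilar

print(cosine_similarity(theta1, theta2))  # 1.0
print(cosine_similarity(theta1, theta3))  # 0.0
```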
Appendix
$$ f = \| Y - XW \|^2 = (Y-XW)^T (Y-XW) = Y^TY - 2 Y^TXW + W^TX^TXW
$$
thus the derivative is
$$ \frac{\partial f}{\partial W} = -2Y^TX + 2W^TX^TX
$$
setting it to zero, we have
$$ X^TXW = X^TY $$
$$ (X^TX)^\dagger X^TX W = (X^TX)^\dagger X^T Y $$
$$ W = (X^TX)^\dagger X^T Y $$
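This closed form can be sanity-checked against NumPy's least-squares solver; the random data below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=(50,))

# Closed form from the derivation: W = (X^T X)^† X^T Y
W_closed = np.linalg.pinv(X.T @ X) @ X.T @ Y

# Reference solution from NumPy's least-squares solver
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(W_closed, W_lstsq)
print(W_closed)
```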