On First-Order Meta-Learning Algorithms

2 Meta-Learning an Initialization

$$ \min_\phi \mathbb{E}_\tau [ L_\tau ( U_\tau^k(\phi) ) ] $$

$$U_\tau^k(\phi)$$ is the operator that updates $$\phi$$ $$k$$ times using samples from $$\tau$$.
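To make the update operator concrete, here is a minimal sketch of $$U_\tau^k$$ on a toy one-dimensional task. The quadratic loss, the names `make_task_grad` and `inner_update`, and the step size `lr` are illustrative assumptions; none of them come from the notes above.

```python
def make_task_grad(center):
    """Gradient of a hypothetical toy loss L_tau(phi) = 0.5 * (phi - center)**2."""
    def grad(phi):
        return phi - center
    return grad


def inner_update(phi, grad_fn, k, lr=0.1):
    """U_tau^k(phi): apply k plain gradient-descent steps starting from phi."""
    for _ in range(k):
        phi = phi - lr * grad_fn(phi)
    return phi


# One toy task whose loss is minimized at phi = 2.0 (an assumed value).
grad_fn = make_task_grad(center=2.0)
phi_tilde = inner_update(0.0, grad_fn, k=3)
```

Each call moves `phi` a little closer to the task optimum; larger `k` moves it further.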

MAML optimizes a variant of this objective:

$$ \min_\phi \mathbb{E}_\tau [ L_{\tau,B} ( U_{\tau,A}(\phi) ) ] $$

Here the inner loop updates $$\phi$$ using training samples $$A$$, and the outer loop evaluates the loss on separate samples $$B$$ from the same task.

MAML optimizes this loss by gradient descent, which requires the gradient

$$ g_{MAML} = \frac{\partial}{\partial \phi} L_{\tau, B} (U_{\tau, A}(\phi)) = U'_{\tau,A} (\phi) L'_{\tau, B} (\tilde \phi) $$

where $$\tilde \phi = U_{\tau, A}(\phi)$$ is obtained from the initial value by a sequence of gradient updates, $$U_{\tau, A}(\phi) = \phi + g_1 + g_2 + ... + g_k$$, and $$U'_{\tau,A}(\phi)$$ is the Jacobian of the update operator.

FOMAML treats these gradients as constants, which replaces the Jacobian $$U'_{\tau,A}(\phi)$$ by the identity and gives $$g_{FOMAML} = L'_{\tau, B} (\tilde \phi)$$.
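The difference between the two gradients can be checked by hand on a toy problem. The sketch below assumes one-dimensional quadratic losses for mini-batches A and B and a single inner SGD step, so the Jacobian is just the scalar `1 - lr * curvature`. All names and numbers (`loss_grad`, `maml_and_fomaml_grads`, `outer_loss`, the task tuples) are hypothetical choices for illustration.

```python
def loss_grad(phi, curvature, center):
    """Gradient of an assumed toy loss L(phi) = 0.5 * curvature * (phi - center)**2."""
    return curvature * (phi - center)


def maml_and_fomaml_grads(phi, task_a, task_b, lr=0.1):
    """Return (g_MAML, g_FOMAML) for one inner SGD step on mini-batch A."""
    curv_a, center_a = task_a
    curv_b, center_b = task_b
    phi_tilde = phi - lr * loss_grad(phi, curv_a, center_a)  # U_{tau,A}(phi)
    outer_grad = loss_grad(phi_tilde, curv_b, center_b)      # L'_{tau,B}(phi_tilde)
    jacobian = 1.0 - lr * curv_a                             # U'_{tau,A}(phi), a scalar here
    return jacobian * outer_grad, outer_grad


def outer_loss(phi, task_a, task_b, lr=0.1):
    """L_{tau,B}(U_{tau,A}(phi)), used only for a finite-difference check."""
    curv_a, center_a = task_a
    curv_b, center_b = task_b
    phi_tilde = phi - lr * loss_grad(phi, curv_a, center_a)
    return 0.5 * curv_b * (phi_tilde - center_b) ** 2


# (curvature, center) pairs for batches A and B; illustrative values only.
task_a, task_b = (2.0, 1.0), (3.0, 1.5)
g_maml, g_fomaml = maml_and_fomaml_grads(0.0, task_a, task_b)

# Central finite difference of the full objective, to compare against g_maml.
eps = 1e-6
g_numeric = (outer_loss(eps, task_a, task_b) - outer_loss(-eps, task_a, task_b)) / (2 * eps)
```

In this setting `g_maml` matches the finite-difference derivative of the full objective, while `g_fomaml` differs from it by exactly the dropped Jacobian factor.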

3 Reptile

for iteration 1, 2, ... do

1. Sample task $$\tau$$, with loss $$L_\tau$$ on weights $$\tilde \phi$$
2. Compute $$\tilde \phi = U_{\tau}^k (\phi)$$, with $$k$$ steps of SGD or Adam
3. Update $$\phi \leftarrow \phi + \epsilon (\tilde \phi - \phi)$$
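The loop above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a reference implementation: each task is a one-dimensional quadratic with its own optimum, the inner optimizer is plain SGD, and the names `sgd_steps` and `reptile` and all hyperparameter values are made up for the example.

```python
import random


def sgd_steps(phi, center, k, lr=0.1):
    """U_tau^k(phi) for an assumed toy loss L_tau(phi) = 0.5 * (phi - center)**2."""
    for _ in range(k):
        phi = phi - lr * (phi - center)
    return phi


def reptile(task_centers, iterations=2000, k=5, epsilon=0.02, seed=0):
    """Serial Reptile: one sampled task per outer iteration."""
    rng = random.Random(seed)
    phi = 0.0
    for _ in range(iterations):
        center = rng.choice(task_centers)        # 1. sample task tau
        phi_tilde = sgd_steps(phi, center, k)    # 2. phi_tilde = U_tau^k(phi)
        phi = phi + epsilon * (phi_tilde - phi)  # 3. move phi toward phi_tilde
    return phi


# Three hypothetical tasks with optima at -1, 0 and 4.
phi = reptile([-1.0, 0.0, 4.0])
```

Because all toy tasks share the same curvature, the learned initialization should hover around the mean of the task optima; with a single task it simply converges to that task's optimum.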

The update moves $$\phi$$ along the direction $$\tilde \phi - \phi$$; in other words, Reptile treats $$(\phi - \tilde \phi)$$ as a gradient and takes a descent step of size $$\epsilon$$.

A batched version can be defined as well:

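$$ \phi \leftarrow \phi + \epsilon \sum_{i=1}^{t} (\tilde \phi_i - \phi) $$

where $$\tilde \phi_i = U_{\tau_i}^k(\phi)$$ is the result of the inner loop on the $$i$$-th task in the batch.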
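A batched step can be sketched the same way. The code below assumes the same kind of toy quadratic tasks and sums the per-task differences exactly as in the formula above; `sgd_steps`, `reptile_batch_step`, and the hyperparameters are hypothetical names and values chosen for illustration.

```python
def sgd_steps(phi, center, k, lr=0.1):
    """U_tau^k(phi) for an assumed toy loss L_tau(phi) = 0.5 * (phi - center)**2."""
    for _ in range(k):
        phi = phi - lr * (phi - center)
    return phi


def reptile_batch_step(phi, batch_centers, k=5, epsilon=0.1):
    """One batched update: phi <- phi + epsilon * sum_i (phi_tilde_i - phi)."""
    total = 0.0
    for center in batch_centers:
        phi_tilde = sgd_steps(phi, center, k)  # inner loop always starts from the shared phi
        total += phi_tilde - phi
    return phi + epsilon * total


phi = 0.0
for _ in range(200):
    phi = reptile_batch_step(phi, [-1.0, 0.0, 4.0])
```

Since the differences are summed rather than averaged, the effective step grows with the batch size; dividing `total` by `len(batch_centers)` would turn this into an average, which is a design choice rather than something the formula above prescribes.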

If we set $$k = 1$$, Reptile comes very close to multi-task learning, i.e. joint training on the expected loss across tasks.

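Proof sketch: In FOMAML, $$g = L_{\tau, B}'(\tilde \phi)$$. In Reptile, the quantity that plays the role of this gradient is $$\phi - \tilde \phi$$. With a single inner SGD step of unit step size, $$U_{\tau, A}(\phi) = \phi - \nabla_{\phi} L_{\tau,A}(\phi)$$, so

$$ g_{Reptile, k=1} = \phi - \tilde \phi = \phi - U_{\tau, A} (\phi) = \phi - (\phi - \nabla_{\phi} L_{\tau,A} (\phi)) = \nabla_{\phi} L_{\tau,A} (\phi) $$

which is the ordinary gradient of the task loss at $$\phi$$. A general inner step size only rescales this result.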

However, if the inner loop runs more than one gradient update ($$k > 1$$), the middle step of this chain no longer holds:

$$ \phi - U_{\tau, A} (\phi) \ne \phi - (\phi - \nabla_{\phi} L_{\tau,A} (\phi)) $$
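Both claims can be checked numerically on a toy loss. The sketch below assumes a one-dimensional quadratic and unit inner step size, matching the derivation; `grad` and `reptile_direction` are hypothetical helper names.

```python
def grad(phi, curvature=0.5, center=1.0):
    """Gradient of an assumed toy loss L(phi) = 0.5 * curvature * (phi - center)**2."""
    return curvature * (phi - center)


def reptile_direction(phi, k, lr=1.0):
    """phi - U^k(phi), where U^k applies k gradient steps of size lr."""
    phi_tilde = phi
    for _ in range(k):
        phi_tilde = phi_tilde - lr * grad(phi_tilde)
    return phi - phi_tilde


phi = 0.0
one_step = reptile_direction(phi, k=1)  # equals grad(phi)
two_step = reptile_direction(phi, k=2)  # equals grad(phi) + grad(phi after one step)
```

With `k=1` the Reptile direction coincides with the plain gradient at `phi`. With `k=2` it is the sum of the gradients at two different points, so it is no longer the gradient of the loss at `phi`.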
