On First-Order Meta-Learning Algorithms
2 Meta-Learning an Initialization
$$ \min_\phi \mathbb{E}_\tau [ L_\tau ( U_\tau^k(\phi) ) ]
$$
Here $$\tau$$ is a task, $$L_\tau$$ is its loss, and $$U_\tau^k(\phi)$$ is the operator that updates $$\phi$$ $$k$$ times using samples from $$\tau$$.
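As a concrete illustration of the update operator, here is a minimal sketch assuming a scalar weight, a toy regression task, and plain minibatch SGD. The function name `sgd_update_operator`, the model $$\hat y = \phi x$$, and all hyperparameters are hypothetical choices made for this example, not something fixed by the notes.

```python
import random


def sgd_update_operator(phi, task_data, k, lr=0.1, batch_size=4, seed=0):
    """Hypothetical U_tau^k: apply k minibatch SGD steps to the scalar weight phi.

    task_data is a list of (x, y) pairs from one task tau. The assumed model is
    y_hat = phi * x with mean squared error on each minibatch.
    """
    rng = random.Random(seed)
    for _ in range(k):
        batch = rng.sample(task_data, batch_size)
        # Gradient of mean((phi * x - y)^2) over the minibatch with respect to phi.
        grad = sum(2.0 * (phi * x - y) * x for x, y in batch) / batch_size
        phi = phi - lr * grad
    return phi


# One toy task: y = 3x, with x evenly spaced in [-1, 1].
xs = [i / 10.0 - 1.0 for i in range(21)]
task = [(x, 3.0 * x) for x in xs]
phi_tilde = sgd_update_operator(phi=0.0, task_data=task, k=200)
```

With `k=0` the operator returns `phi` unchanged, and with many steps `phi_tilde` approaches the task's own optimum.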
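MAML optimizes a variant of this objective:
$$ \min_\phi \mathbb{E}_\tau [ L_{\tau,B} ( U_{\tau,A}(\phi) ) ]
$$
Here the inner loop updates on training samples $$A$$ from task $$\tau$$, and the outer loop evaluates the loss on separate samples $$B$$ from the same task.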
MAML optimizes this loss by gradient descent, which requires the gradient
$$ g_{MAML} = \frac{\partial}{\partial \phi} L_{\tau, B} (U_{\tau, A}(\phi)) = U'_{\tau,A} (\phi) L'_{\tau, B} (\tilde \phi)
$$, where $$\tilde \phi = U_{\tau, A}(\phi)$$ is obtained from the initial value by a series of gradient updates, $$U_{\tau, A}(\phi) = \phi + g_1 + g_2 + ... + g_k$$, and $$U'_{\tau,A}(\phi)$$ is the Jacobian of the update operator.
FOMAML treats these gradients as constants, which replaces the Jacobian $$U'_{\tau,A}(\phi)$$ with the identity and gives $$ g_{FOMAML} = L'_{\tau, B} (\tilde \phi)
$$
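To make the difference between the two gradients concrete, the following sketch assumes one-dimensional quadratic losses $$L(\phi) = \frac{1}{2} h (\phi - c)^2$$ for both sample sets and a single inner SGD step. For this toy case the Jacobian of the update is the scalar $$1 - \alpha h_A$$. The function names, curvatures, centers, and step size are all hypothetical.

```python
def loss_grad(phi, curvature, center):
    """Gradient of the assumed loss 0.5 * curvature * (phi - center)^2."""
    return curvature * (phi - center)


def inner_step(phi, lr, curv_a, center_a):
    """One SGD step on the inner-loop loss (sample set A)."""
    return phi - lr * loss_grad(phi, curv_a, center_a)


def maml_and_fomaml_grads(phi, lr, curv_a, center_a, curv_b, center_b):
    """Return (g_MAML, g_FOMAML) for one inner step on the toy quadratics."""
    phi_tilde = inner_step(phi, lr, curv_a, center_a)
    # Derivative of the inner update with respect to phi (a scalar Jacobian here).
    jacobian = 1.0 - lr * curv_a
    # Outer-loop gradient, evaluated on sample set B at the adapted weight.
    outer_grad = loss_grad(phi_tilde, curv_b, center_b)
    # MAML applies the chain rule; FOMAML drops the Jacobian factor.
    return jacobian * outer_grad, outer_grad


g_maml, g_fomaml = maml_and_fomaml_grads(
    phi=2.0, lr=0.1, curv_a=3.0, center_a=1.0, curv_b=2.0, center_b=-1.0
)
```

In this scalar setting the two gradients differ only by the factor `jacobian`; with vector weights the Jacobian is a matrix involving second derivatives of the inner loss, which is the cost FOMAML avoids.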
3 Reptile
for iteration 1, 2, ... do
1. Sample task $$\tau$$, corresponding to loss $$L_\tau$$ on weight vectors $$\tilde \phi$$
2. Compute $$\tilde \phi = U_\tau^k (\phi)$$, with $$k$$ steps of SGD or Adam
3. Update $$\phi = \phi + \epsilon (\tilde \phi - \phi)$$
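The update moves $$\phi$$ along the direction $$\tilde \phi - \phi$$. In other words, Reptile treats $$(\phi - \tilde \phi)$$ as a gradient and takes a descent step of size $$\epsilon$$ along it.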
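The loop above can be sketched in a few lines. This is only an illustration under stated assumptions: each task is a one-dimensional quadratic $$L_\tau(\phi) = \frac{1}{2}(\phi - c_\tau)^2$$ with $$c_\tau$$ drawn from a fixed list, the inner loop uses plain SGD rather than Adam, and the names `inner_sgd`, `reptile`, and `centers` are hypothetical.

```python
import random


def inner_sgd(phi, center, k, lr):
    """U_tau^k for the assumed task loss 0.5 * (phi - center)^2: k SGD steps."""
    for _ in range(k):
        phi = phi - lr * (phi - center)
    return phi


def reptile(centers, iterations=3000, k=5, inner_lr=0.1, epsilon=0.01, seed=0):
    """Serial Reptile on the toy task family defined by `centers`."""
    rng = random.Random(seed)
    phi = 0.0
    for _ in range(iterations):
        # 1. Sample a task (here: pick one quadratic by its center).
        center = rng.choice(centers)
        # 2. Adapt to the task with k inner steps.
        phi_tilde = inner_sgd(phi, center, k, inner_lr)
        # 3. Move the initialization toward the adapted weights.
        phi = phi + epsilon * (phi_tilde - phi)
    return phi


phi = reptile(centers=[-1.0, 0.0, 4.0])
```

For this toy family every inner loop pulls `phi` toward the sampled center, so the initialization ends up fluctuating around the mean of the centers; real tasks with different curvatures would not reduce to a simple mean.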
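We can also define a batched version that evaluates $$t$$ tasks per iteration, where $$\tilde \phi_i = U_{\tau_i}^k(\phi)$$ is the result of adapting to the $$i$$-th task: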
$$ \phi = \phi + \epsilon \sum_{i=1}^{t} (\tilde \phi_i - \phi)
$$
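A minimal sketch of one batched update, assuming vector weights stored as Python lists and toy task losses $$\frac{1}{2}\|\phi - c_i\|^2$$. The helper names and the `targets` argument are hypothetical. The displacements are summed exactly as in the formula above, so the effective step grows with the number of tasks; dividing by $$t$$ would give an averaged variant.

```python
def inner_sgd(phi, target, k, lr):
    """k SGD steps on the assumed task loss 0.5 * ||phi - target||^2."""
    phi = list(phi)
    for _ in range(k):
        phi = [p - lr * (p - c) for p, c in zip(phi, target)]
    return phi


def reptile_batch_step(phi, targets, k, inner_lr, epsilon):
    """One batched Reptile update over the tasks described by `targets`."""
    # Adapt to every task starting from the same initialization phi.
    adapted = [inner_sgd(phi, target, k, inner_lr) for target in targets]
    # Sum the per-task displacements (phi_tilde_i - phi), coordinate by coordinate.
    direction = [sum(a[j] - phi[j] for a in adapted) for j in range(len(phi))]
    return [p + epsilon * d for p, d in zip(phi, direction)]


new_phi = reptile_batch_step(
    [0.0, 0.0], targets=[[1.0, 0.0], [0.0, 2.0]], k=1, inner_lr=0.5, epsilon=0.1
)
```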
If we set $$k = 1$$, Reptile becomes very close to multi-task learning, i.e. joint training on the expected loss over tasks.
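Proof: FOMAML uses $$g = L'_{\tau, B}(\tilde \phi)$$, whereas Reptile uses $$(\phi - \tilde \phi)$$ in place of the gradient. With a single SGD step of unit step size, $$U_{\tau, A}(\phi) = \phi - \nabla_{\phi} L_{\tau,A}(\phi)$$, so $$g_{Reptile, k=1} = \phi - \tilde \phi = \phi - U_{\tau, A} (\phi) = \phi - (\phi - \nabla_{\phi} L_{\tau,A} (\phi)) = \nabla_{\phi} L_{\tau,A} (\phi)$$, which is the ordinary gradient of the task loss at $$\phi$$. A general inner step size only rescales this by a constant.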
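But if we run multiple gradient updates ($$k > 1$$), the middle step no longer holds, because the later steps evaluate the gradient at already-updated weights rather than at $$\phi$$:
$$ \phi - U_{\tau, A} (\phi) \ne \phi - (\phi - \nabla_{\phi} L_{\tau,A} (\phi))
$$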
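A quick numerical check of both cases, assuming a one-dimensional quadratic loss and an explicit inner step size `lr` (the derivation above implicitly uses step size 1). The function names and constants are hypothetical.

```python
def grad(phi, curvature=2.0, center=1.0):
    """Gradient of the assumed loss 0.5 * curvature * (phi - center)^2."""
    return curvature * (phi - center)


def update(phi, k, lr):
    """U^k: k plain SGD steps on the loss above."""
    for _ in range(k):
        phi = phi - lr * grad(phi)
    return phi


phi, lr = 3.0, 0.1

# k = 1: the displacement is exactly lr times the gradient at phi.
one_step = phi - update(phi, k=1, lr=lr)

# k = 2: the second gradient is taken at the moved point, so the displacement
# is neither lr * grad(phi) nor 2 * lr * grad(phi).
two_step = phi - update(phi, k=2, lr=lr)
```

Here `one_step` matches `lr * grad(phi)`, while `two_step` equals `lr * (2 - lr * curvature) * grad(phi)` for this quadratic, so the multi-step displacement also carries curvature information.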