Asynchrony begets Momentum, with an Application to Deep Learning
Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, Christopher Ré
Intro
This paper shows that asynchronous SGD can be viewed as SGD with an added momentum-like term. Since the analysis does not assume the objective is convex, it extends to deep learning.
Preliminaries
$$w_{t+1} = w_t - \alpha_t \nabla_w f(w_t; z_{i_t})$$
momentum
$$w_{t+1} - w_t = \mu_L (w_t - w_{t-1}) - \alpha_t \nabla_w f(w_t; z_{i_t})$$
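A minimal numerical sketch of the momentum (heavy-ball) update above. The objective $$f(w) = \frac{1}{2}\|w\|^2$$ and the values of $$\alpha$$ and $$\mu_L$$ are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Illustrative constants (not from the paper):
alpha, mu = 0.1, 0.9   # step size alpha_t (held fixed) and momentum mu_L

# Toy objective f(w) = 0.5 * ||w||^2, so grad f(w) = w.
w_prev = np.array([1.0, -2.0])
w = w_prev.copy()
for _ in range(100):
    grad = w  # gradient of 0.5 * ||w||^2 at the current iterate
    # Momentum update: w_{t+1} = w_t + mu*(w_t - w_{t-1}) - alpha * grad
    w_next = w + mu * (w - w_prev) - alpha * grad
    w_prev, w = w, w_next

# The iterate should approach the minimizer at the origin.
print(np.linalg.norm(w))
```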
staleness
Suppose the value a worker reads at time t is $$v_t = w_{t-l}$$ with probability $$q_l$$.
The gradient update therefore becomes
$$w_{t+1} = w_t - \alpha_t \nabla_w f(v_t; z_{i_t})$$
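The stale-gradient update can be simulated as below: each step draws a staleness $$l$$ from a distribution $$q$$ and evaluates the gradient at $$v_t = w_{t-l}$$. The staleness distribution, step size, and quadratic objective are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                  # fixed step size (illustrative)
q = [0.5, 0.3, 0.2]          # P(staleness = l) for l = 0, 1, 2 (assumed)

# Toy objective f(w) = 0.5 * ||w||^2, so grad f(v) = v.
history = [np.array([1.0, -2.0])]  # history[t] holds w_t
for t in range(200):
    l = rng.choice(len(q), p=q)    # sample the staleness
    v_t = history[max(t - l, 0)]   # stale read: v_t = w_{t-l}
    grad = v_t                     # gradient evaluated at the stale iterate
    history.append(history[t] - alpha * grad)

# Despite the staleness, the iterates still approach the origin.
w_final = history[-1]
print(np.linalg.norm(w_final))
```

Averaging over the staleness distribution is what produces the momentum-like term: the expected update mixes in past iterates, mirroring the $$\mu_L (w_t - w_{t-1})$$ term in the explicit momentum formula.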