Attention Is All You Need

3 Model Architecture

3.1 Encoder and Decoder Stacks

The encoder is composed of a stack of 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network.

The decoder is also composed of a stack of 6 identical layers. In addition to the two sub-layers mentioned above, each decoder layer inserts a third sub-layer, which performs multi-head attention over the output of the encoder.
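
Below is a minimal NumPy sketch of how the encoder stack is composed (my own pseudocode, not the paper's code). The paper additionally wraps each sub-layer in a residual connection followed by layer normalization, which the sketch includes; in a real model every layer has its own parameters, and `self_attn`/`ffn` stand in for the two sub-layers defined in later sections.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector (over d_model).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, ffn):
    # Sub-layer 1: multi-head self-attention, with residual connection + layer norm.
    x = layer_norm(x + self_attn(x, x, x))
    # Sub-layer 2: position-wise feed-forward network, wrapped the same way.
    x = layer_norm(x + ffn(x))
    return x

def encoder(x, self_attn, ffn, num_layers=6):
    # The encoder is a stack of 6 identical layers.
    for _ in range(num_layers):
        x = encoder_layer(x, self_attn, ffn)
    return x

# Toy usage with stand-in sub-layers (the real ones are sketched in later sections).
x = np.zeros((10, 512))                        # 10 positions, d_model = 512
out = encoder(x, self_attn=lambda q, k, v: q, ffn=lambda h: h)
```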

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

3.2.1 Scaled Dot-Product Attention

The queries and keys have dimension $$d_k$$; the values have dimension $$d_v$$.

The weights are obtained by $$\text{softmax}(\frac{QK^T}{\sqrt{d_k}})$$

The output value is thus $$\text{softmax}(\frac{QK^T}{\sqrt{d_k}}) V$$
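
A minimal NumPy sketch of this formula (the function name, argument shapes, and optional mask argument are mine; the paper applies such a mask inside the decoder):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)."""
    d_k = Q.shape[-1]
    # Compatibility of every query with every key, scaled by sqrt(d_k).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score -> ~0 weight after softmax.
        scores = np.where(mask, scores, -1e9)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is the weighted sum of the values.
    return weights @ V
```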

3.2.2 Multi-Head Attention

The authors found it beneficial to linearly project the queries, keys, and values with different learned projections, and to run the attention function on each of these projected versions in parallel.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

$$ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\, W^O \quad \text{where}\ \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) $$
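
A sketch of this equation, reusing `scaled_dot_product_attention` from the previous block; random matrices stand in for the learned projections (in the paper, $$h=8$$ and $$d_k=d_v=d_{model}/h=64$$):

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v are lists of per-head projection matrices; W_o is (h*d_v, d_model).
    heads = [
        scaled_dot_product_attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i)
        for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v)
    ]
    # Concatenate the h heads and project back to d_model with W^O.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: h = 8 heads, d_model = 512, d_k = d_v = 64.
rng = np.random.default_rng(0)
h, d_model, d_k = 8, 512, 64
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
x = rng.normal(size=(10, d_model))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)   # shape (10, 512)
```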

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  1. In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and values come from the encoder output. This allows every position in the decoder to attend over all positions in the input sequence.
  2. The encoder contains self-attention layers. Here the keys, values, and queries all come from the same place, namely the output of the previous encoder layer. Each position can attend to all positions in the previous layer of the encoder.
  3. The decoder also contains self-attention layers. Leftward information flow in the decoder must be prevented in order to preserve the auto-regressive property; this is implemented with masking (see the sketch after this list).
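
A sketch of the mask mentioned in item 3: a lower-triangular boolean matrix which, passed to the attention sketch above, lets position $$i$$ attend only to positions $$\le i$$.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```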

3.3 Position-wise Feed-Forward Networks

Besides the attention sub-layers, each layer also contains a fully connected feed-forward network, which is applied to each position separately and identically.

$$\text{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$
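
A direct NumPy transcription of this formula (in the paper's base model the shapes would be $$d_{model}=512$$ with an inner dimension $$d_{ff}=2048$$):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between,
    # applied to each position separately and identically.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Base-model shapes: W1 (512, 2048), b1 (2048,), W2 (2048, 512), b2 (512,)
```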

3.4 Embeddings and Softmax

Before the tokens are fed into the model, a learned embedding layer first converts them to vectors of dimension $$d_{model}$$.
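
A minimal sketch of that lookup (the paper also multiplies the embedding weights by $$\sqrt{d_{model}}$$, which is reflected here; the table itself would be learned):

```python
import numpy as np

def embed(token_ids, embedding_table):
    # Look up each token's learned vector and scale by sqrt(d_model).
    d_model = embedding_table.shape[-1]
    return embedding_table[token_ids] * np.sqrt(d_model)

# Toy usage: vocabulary of 100 tokens, d_model = 512.
table = np.random.default_rng(0).normal(size=(100, 512))
vectors = embed(np.array([3, 17, 42]), table)   # shape (3, 512)
```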

3.5 Positional Encoding

Since the model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we need to inject some information about the relative or absolute position of the tokens in the sequence. To do this, positional encodings are added to the input embeddings.

Sine and cosine functions of different frequencies are used:

$$ PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) $$

$$ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}}) $$
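
A sketch that builds the whole sinusoidal table at once, with shape `(max_len, d_model)`; it is simply added to the input embeddings (the function name and vectorized layout are mine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0..max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]      # the "2i" indices: 0, 2, 4, ...
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)  # added to the input embeddings
```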

Appendix

In most cases, the keys and values are embedded words.

In self-attention, the queries and keys come from the same sentence, so each word sees its relationship to itself and to the other words in that sentence.

In cross-attention, the queries and keys come from different sentences, so each word sees its relationship to all the words in the query sentence.
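
In terms of the multi-head attention sketch from section 3.2.2, the difference is only in where the arguments come from (hypothetical variables, reusing the earlier definitions):

```python
# Self-attention: queries, keys, and values all come from the same sequence x.
self_out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)

# Cross-attention (encoder-decoder attention): queries come from the decoder state,
# keys and values come from the encoder output.
cross_out = multi_head_attention(dec_x, enc_out, enc_out, W_q, W_k, W_v, W_o)
```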
