Attention Is All You Need

3 Model Architecture

3.1 Encoder and Decoder Stacks

The encoder is composed of a stack of 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network.

The decoder is also composed of a stack of 6 identical layers. In addition to the two sub-layers mentioned above, each decoder layer inserts a third sub-layer, which performs multi-head attention over the output of the encoder.
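
Below is a minimal NumPy sketch of how the encoder stack is composed (my own pseudocode, not the paper's code). The paper additionally wraps each sub-layer in a residual connection followed by layer normalization, which the sketch includes; in a real model every layer has its own parameters, and `self_attn`/`ffn` stand in for the two sub-layers defined in later sections.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector (over d_model).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, ffn):
    # Sub-layer 1: multi-head self-attention, with residual connection + layer norm.
    x = layer_norm(x + self_attn(x, x, x))
    # Sub-layer 2: position-wise feed-forward network, wrapped the same way.
    x = layer_norm(x + ffn(x))
    return x

def encoder(x, self_attn, ffn, num_layers=6):
    # The encoder is a stack of 6 identical layers.
    for _ in range(num_layers):
        x = encoder_layer(x, self_attn, ffn)
    return x

# Toy usage with stand-in sub-layers (the real ones are sketched in later sections).
x = np.zeros((10, 512))                        # 10 positions, d_model = 512
out = encoder(x, self_attn=lambda q, k, v: q, ffn=lambda h: h)
```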

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

3.2.1 Scaled Dot-Product Attention

The queries and keys have dimension $$d_k$$; the values have dimension $$d_v$$.

The weights are obtained by $$\text{softmax}(\frac{QK^T}{\sqrt{d_k}})$$

The output value is thus $$\text{softmax}(\frac{QK^T}{\sqrt{d_k}}) V$$
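
A minimal NumPy sketch of this formula (the function name, argument shapes, and optional mask argument are mine; the paper applies such a mask inside the decoder):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)."""
    d_k = Q.shape[-1]
    # Compatibility of every query with every key, scaled by sqrt(d_k).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score -> ~0 weight after softmax.
        scores = np.where(mask, scores, -1e9)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is the weighted sum of the values.
    return weights @ V
```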

3.2.2 Multi-Head Attention

The authors found it beneficial to linearly project the queries, keys, and values with different learned projections, and to run the attention function on each of these projected versions in parallel.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

$$ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\, W^O \quad \text{where}\ \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) $$
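
A sketch of this equation, reusing `scaled_dot_product_attention` from the previous block; random matrices stand in for the learned projections (in the paper, $$h=8$$ and $$d_k=d_v=d_{model}/h=64$$):

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v are lists of per-head projection matrices; W_o is (h*d_v, d_model).
    heads = [
        scaled_dot_product_attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i)
        for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v)
    ]
    # Concatenate the h heads and project back to d_model with W^O.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: h = 8 heads, d_model = 512, d_k = d_v = 64.
rng = np.random.default_rng(0)
h, d_model, d_k = 8, 512, 64
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
x = rng.normal(size=(10, d_model))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)   # shape (10, 512)
```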

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  1. In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and values come from the encoder output. This allows every position in the decoder to attend over all positions in the input sequence.
  2. The encoder contains self-attention layers. Here the keys, values, and queries all come from the same place, namely the output of the previous encoder layer. Each position can attend to all positions in the previous layer of the encoder.
  3. The decoder also contains self-attention layers. Leftward information flow in the decoder must be prevented in order to preserve the auto-regressive property; this is implemented with masking (see the sketch after this list).
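
A sketch of the mask mentioned in item 3: a lower-triangular boolean matrix which, passed to the attention sketch above, lets position $$i$$ attend only to positions $$\le i$$.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```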

3.3 Position-wise Feed-Forward Networks

Besides the attention sub-layers, each layer also contains a fully connected feed-forward network, which is applied to each position separately and identically.

$$\text{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$
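
A direct NumPy transcription of this formula (in the paper's base model the shapes would be $$d_{model}=512$$ with an inner dimension $$d_{ff}=2048$$):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between,
    # applied to each position separately and identically.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Base-model shapes: W1 (512, 2048), b1 (2048,), W2 (2048, 512), b2 (512,)
```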

3.4 Embeddings and Softmax

Before the tokens are fed into the model, a learned embedding layer first converts them to vectors of dimension $$d_{model}$$.
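
A minimal sketch of that lookup (the paper also multiplies the embedding weights by $$\sqrt{d_{model}}$$, which is reflected here; the table itself would be learned):

```python
import numpy as np

def embed(token_ids, embedding_table):
    # Look up each token's learned vector and scale by sqrt(d_model).
    d_model = embedding_table.shape[-1]
    return embedding_table[token_ids] * np.sqrt(d_model)

# Toy usage: vocabulary of 100 tokens, d_model = 512.
table = np.random.default_rng(0).normal(size=(100, 512))
vectors = embed(np.array([3, 17, 42]), table)   # shape (3, 512)
```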

3.5 Positional Encoding

Since the model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we need to inject some information about the relative or absolute position of the tokens in the sequence. To do this, positional encodings are added to the input embeddings.

Sine and cosine functions of different frequencies are used:

$$ PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) $$

$$ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}}) $$
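
A sketch that builds the whole sinusoidal table at once, with shape `(max_len, d_model)`; it is simply added to the input embeddings (the function name and vectorized layout are mine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0..max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]      # the "2i" indices: 0, 2, 4, ...
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)  # added to the input embeddings
```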

Appendix

In most cases, the keys and values are embedded words.

In self-attention, the queries and keys come from the same sentence, so each word sees its relationship to itself and to the other words in that sentence.

In cross-attention, the queries and keys come from different sentences, so each word sees its relationship to all the words in the query sentence.
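
In terms of the multi-head attention sketch from section 3.2.2, the difference is only in where the arguments come from (hypothetical variables, reusing the earlier definitions):

```python
# Self-attention: queries, keys, and values all come from the same sequence x.
self_out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)

# Cross-attention (encoder-decoder attention): queries come from the decoder state,
# keys and values come from the encoder output.
cross_out = multi_head_attention(dec_x, enc_out, enc_out, W_q, W_k, W_v, W_o)
```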
