Video Source

Page Source

Practical Source

1. Introduction

Sequence Modeling:

Recurrent language models (note: the Introduction is essentially an extension of the Abstract):

  • Operate step-by-step, processing one token or time step at a time.
  • one input + hidden state → one output + updated hidden state (the hidden state is updated at every time step)
  • e.g., RNNs, LSTMs

Encoder-decoder architectures

  • Composed of two RNNs (or other models):
    • one to encode input into a representation
    • another to decode it into the output sequence
  • one input + hidden state → … → hidden state → one output + hidden state → …
  • e.g., seq2seq

Disadvantages of these two:

Recurrent models preclude parallelization within training examples, which becomes critical at longer sequence lengths.

Encoder-decoder models have used attention to pass information from the encoder to the decoder more effectively.

The Transformer dispenses with recurrence and relies entirely on an attention mechanism to draw global dependencies between input and output.

2. Background

Reduce sequential computation:

Prior approaches typically use convolutional neural networks → the number of operations needed to relate two positions grows with the distance between them → Transformer: a constant number of operations

  • at the cost of reduced effective resolution due to averaging attention-weighted positions
  • this effect is counteracted with Multi-Head Attention

Self-attention:

relates different positions of a single sequence in order to compute a representation of the sequence

End-to-end memory networks:

✔ Recurrent attention mechanism

❌ Sequence-aligned recurrence

Transformer:

The first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.


A transduction model makes predictions or inferences about specific instances directly from the data at hand, without first forming a general rule over all possible examples; this differs from inductive learning.

3. Model Architecture

The Transformer uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder.

  • Encoder: maps an input sequence of symbol representations $x = (x_1, \dots, x_n)$ → a sequence of continuous representations $z = (z_1, \dots, z_n)$.
  • Decoder: given $z$, generates an output sequence of symbols one element at a time. At each step the model is auto-regressive: previously generated symbols are consumed as additional input when generating the next (see the decoding sketch below).

(Figure 1 of the paper: the Transformer model architecture, with the encoder stack on the left and the decoder stack on the right.)
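A minimal sketch of the auto-regressive decoding loop described above, assuming a hypothetical `model(src_tokens, tgt_tokens)` that returns next-token logits of shape `[1, len(tgt_tokens), vocab]`; greedy selection is used purely for illustration.

```python
import torch

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    """Generate one token at a time, feeding previously generated tokens back in."""
    generated = [bos_id]
    for _ in range(max_len):
        tgt = torch.tensor([generated])        # tokens produced so far
        logits = model(src_tokens, tgt)        # assumed shape: [1, len(generated), vocab]
        next_id = int(logits[0, -1].argmax())  # pick the most likely next token
        generated.append(next_id)
        if next_id == eos_id:                  # stop once end-of-sequence is emitted
            break
    return generated
```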

Why use LayerNorm instead of BatchNorm?

LayerNorm normalizes across features of a single sample, suitable for variable-length sequences.

BatchNorm normalizes each feature across the batch, whose statistics become unstable when sequence lengths vary within and across batches.

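A minimal sketch of the difference, computed by hand on a batch of token vectors of shape `[batch, seq_len, d_model]` (shapes chosen here for illustration).

```python
import torch

x = torch.randn(2, 5, 512)   # [batch, seq_len, d_model]

# LayerNorm: statistics over the feature dimension of each token, independent of
# the other samples in the batch and of the sequence length.
ln_mean = x.mean(dim=-1, keepdim=True)
ln_std = x.std(dim=-1, keepdim=True)
x_layernorm = (x - ln_mean) / (ln_std + 1e-5)

# BatchNorm (per feature): statistics over the batch and all sequence positions,
# so padding and varying sequence lengths change the statistics.
bn_mean = x.mean(dim=(0, 1), keepdim=True)
bn_std = x.std(dim=(0, 1), keepdim=True)
x_batchnorm = (x - bn_mean) / (bn_std + 1e-5)
```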

3.1 Encoder and Decoder Stacks

Encoder

N = 6 identical layers, each layer has two sub-layers.

  • multi-head self-attention mechanism
  • simple, position-wise fully connected feed-forward network

For each sub-layer, apply a residual connection followed by layer normalization: $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.

Output dimension of every sub-layer and embedding: $d_{model} = 512$.

Decoder

N = 6 identical layers, each layer has three sub-layers.

  • masked multi-head self-attention over the decoder's own previous outputs (the mask prevents attending to subsequent positions)
  • multi-head attention over the output of the encoder stack (encoder-decoder attention)
  • simple, position-wise fully connected feed-forward network

For each sub-layer, apply a residual connection followed by layer normalization, as in the encoder.
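A minimal sketch of this wrapping (the paper's post-norm form, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, with dropout on the sub-layer output); `sublayer` is a placeholder for either an attention block or the feed-forward network.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is any callable mapping [batch, seq, d_model] -> [batch, seq, d_model]
        return self.norm(x + self.dropout(sublayer(x)))
```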

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output where the query, keys, values, and output are all vectors.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

( Given a query, the similarity between the query and each key determines a weight for that key; the output then combines the values according to those weights. )

3.2.1 Scaled Dot-Product Attention

We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$ (the square root of the key dimension), and apply a softmax function to obtain the weights on the values.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$n$ refers to the sequence length, i.e., the number of query words.

$d_k$ refers to the dimension of each query/key vector.

$m$ refers to the number of key-value pairs, i.e., the number of target words.

$d_v$ refers to the dimension of each value vector.

So $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, and the attention output has shape $n \times d_v$.
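A minimal sketch of the formula and shapes above in PyTorch ($Q$: $n \times d_k$, $K$: $m \times d_k$, $V$: $m \times d_v$); the boolean `mask` argument (True = position blocked) is an addition of this sketch, used later for the decoder.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # [n, m]: similarity of each query to each key
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # blocked positions get zero weight after softmax
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ V                                    # [n, d_v]: weighted sum of the values

Q, K, V = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 64)
print(scaled_dot_product_attention(Q, K, V).shape)        # torch.Size([5, 64])
```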

3.2.2 Multi-Head Attention

Multiple heads with different linear projections act somewhat like multiple channels in a convolution.

Linearly project the queries, keys and values $h$ times to lower dimensions ($d_k$, $d_k$ and $d_v$) with different, learned linear projections.

Finally, the $h$ outputs are concatenated (stacked) and once again projected.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$$

$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)$$

We employ $h = 8$ parallel attention layers, or heads.

For each head we use dimension $d_k = d_v = d_{model}/h = 64$, so the total computational cost is similar to that of single-head attention with full dimensionality.
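A minimal sketch of multi-head attention under these settings ($h = 8$, $d_{model} = 512$, so each head works in 64 dimensions); class and variable names are illustrative and batching details are simplified, so this is not the reference implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)  # learned projections for queries,
        self.W_k = nn.Linear(d_model, d_model)  # keys and values (all heads at once)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection after concatenation

    def _split_heads(self, x):
        # [batch, seq, d_model] -> [batch, h, seq, d_k]
        b, s, _ = x.shape
        return x.view(b, s, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        q, k, v = map(self._split_heads, (self.W_q(q), self.W_k(k), self.W_v(v)))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # per-head similarities
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))    # block disallowed positions
        heads = torch.softmax(scores, dim=-1) @ v                # per-head weighted sums
        b, _, s, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(b, s, self.h * self.d_k)
        return self.W_o(concat)
```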

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  • In “encoder-decoder attention” layers,
    • queries → previous decoder layer
    • keys and values → the output of the encoder
  • In “encoder self-attention” layers,
    • keys, values and queries all come from the previous layer in the encoder
  • In “decoder self-attention” layers,
    • keys, values and queries all come from the previous layer in the decoder
    • allow each position in the decoder to attend to all positions in the decoder up to and including that position (future positions are masked out by setting them to $-\infty$ in the softmax input; see the mask sketch below)
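A minimal sketch of how that "up to and including" constraint can be built, using the boolean-mask convention from the attention sketch above (True = blocked); not the paper's own code.

```python
import torch

def causal_mask(size):
    """True above the diagonal: position i may not attend to positions j > i."""
    return torch.ones(size, size).triu(diagonal=1).bool()

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```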

3.3 Position-wise Feed-Forward Networks

The fully connected feed-forward network is applied to each position separately and identically.

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$

Multi-head attention has already mixed information across positions; the feed-forward network adds extra expressive power by introducing a non-linearity, applied to each position independently.

The inner layer has dimensionality $d_{ff} = 2048$, while the input and output have dimensionality $d_{model} = 512$.
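A minimal sketch of this network with those dimensions ($d_{model} = 512$, $d_{ff} = 2048$); `nn.Linear` acts on the last dimension, so the same transformation is applied to every position independently.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to the inner dimension
            nn.ReLU(),                  # the non-linearity mentioned above
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back to d_model
        )

    def forward(self, x):               # x: [batch, seq, d_model], applied per position
        return self.net(x)
```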

3.4 Embedding and Softmax


Embedding

convert the input and output tokens to vectors of dimension $d_{model} = 512$, then multiply those embedding weights by $\sqrt{d_{model}}$.

Softmax

use a learned linear transformation followed by softmax to convert the decoder output to predicted next-token probabilities.
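A minimal sketch of both pieces, including the weight sharing between the embedding layers and the pre-softmax linear transformation mentioned in section 5.1; names and the functional style are illustrative.

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 37000
embedding = nn.Embedding(vocab_size, d_model)

def embed(tokens):
    # Scale the embeddings by sqrt(d_model) before adding positional encodings.
    return embedding(tokens) * math.sqrt(d_model)

def output_logits(decoder_output):
    # The pre-softmax linear transformation reuses the embedding weight matrix (weight tying).
    return decoder_output @ embedding.weight.t()   # [batch, seq, vocab_size]
```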

3.5 Positional Encoding

Self-attention is order-agnostic: permuting the input tokens permutes the outputs but does not change their values. So we add information about token positions to the input.

We use sine and cosine functions of different frequencies, whose values lie in $[-1, 1]$:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$
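A minimal sketch of this sinusoidal encoding, producing a `[max_len, d_model]` table that is added to the (scaled) token embeddings; the `embed` helper referenced in the comment is the earlier hypothetical sketch.

```python
import math
import torch

def positional_encoding(max_len, d_model=512):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # [max_len, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimension indices
    div = torch.exp(-math.log(10000.0) * i / d_model)             # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                            # sine on even indices
    pe[:, 1::2] = torch.cos(pos * div)                            # cosine on odd indices
    return pe

# usage sketch:  x = embed(tokens) + positional_encoding(tokens.size(1))
```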

4. Why Self-Attention

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |

  • Length of a word vector → $d$
  • Number of words → $n$
  • Self-Attention → each of the $n$ queries is compared with all $n$ keys, and each comparison is a dot product over $d$ dimensions → $O(n^2 \cdot d)$
  • Recurrent → at each of the $n$ steps, a $d$-dimensional vector is multiplied by a $d \times d$ matrix → $O(n \cdot d^2)$
  • Convolutional → kernel size $k$, $n$ positions, $d \times d$ input/output channels → $O(k \cdot n \cdot d^2)$ (easiest to see by drawing a picture)
  • Self-Attention (restricted) → each position attends only to its $r$ nearest neighbors → $O(r \cdot n \cdot d)$
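As an illustrative comparison (numbers chosen here, not from the paper): with $n = 50$ words and $d = 512$, self-attention costs roughly $n^2 \cdot d \approx 1.3$M multiply-adds per layer, while a recurrent layer costs roughly $n \cdot d^2 \approx 13$M; self-attention is also fully parallel across the $n$ positions, whereas the recurrent layer needs $n$ sequential steps.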

The self-attention architecture seems to have many advantages, but it tends to need more data and a bigger model to reach the same quality.

5. Training

5.1 Training Data and Batching

Sentences were encoded using byte-pair encoding, with a shared source-target vocabulary of about 37000 tokens → so the same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation.

5.2 Hardware and Schedule

5.3 Optimizer

5.4 Regularization

  • Residual Dropout
  • Label Smoothing
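A minimal sketch of label smoothing with $\epsilon_{ls} = 0.1$: the one-hot target becomes $1 - \epsilon$ on the true class with $\epsilon$ spread over the remaining classes; padding handling and the exact loss used are omitted.

```python
import torch

def smooth_labels(targets, vocab_size, eps=0.1):
    """targets: [batch] of class indices -> [batch, vocab_size] smoothed target distributions."""
    smoothed = torch.full((targets.size(0), vocab_size), eps / (vocab_size - 1))
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - eps)   # put the remaining confidence on the true class
    return smoothed
```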

6. Results

(Table 3 of the paper: variations on the Transformer architecture; its hyperparameter symbols are defined below.)

$N:$ number of blocks

$d_{model}:$ the dimension of a token vector

$d_{ff}:$ the inner-layer (intermediate) output size of the position-wise feed-forward network

$h:$ the number of heads

$d_k:$ the dimension of keys in a head

$d_v :$ the dimension of values in a head

$P_{drop}:$ dropout rate

$\epsilon_{ls}:$ the label smoothing value, i.e., how much probability mass is moved off the true label

$\text{train steps}:$ the number of training steps (batches processed)