Summary of BERT Paper

About Paper

Key Contributions

Achievements

Pre-training to downstream tasks

There were two existing strategies for applying pre-trained language representations to downstream tasks:

  1. feature-based approach
  2. fine-tuning approach

Feature-based approach

Feature-based strategies use task-specific architectures that include the pre-trained representations as additional input features.

Fine-tuning approach

The fine-tuning approach introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. OpenAI GPT follows this procedure and achieved the previous state-of-the-art results.
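To make the distinction concrete, here is a minimal PyTorch sketch contrasting the two transfer strategies. `PretrainedEncoder` and its sizes are hypothetical stand-ins for a real pre-trained model, not anything defined in the paper.

```python
# Sketch: feature-based vs. fine-tuning transfer (hypothetical encoder and head).
import torch
import torch.nn as nn

class PretrainedEncoder(nn.Module):
    """Stand-in for a pre-trained language model mapping token ids to vectors."""
    def __init__(self, vocab_size=30000, hidden=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, token_ids):
        return self.embed(token_ids)            # (batch, seq_len, hidden)

encoder = PretrainedEncoder()                   # pretend these weights were pre-trained
classifier = nn.Linear(768, 2)                  # small task-specific head

# Feature-based: freeze the encoder and treat its outputs as fixed features;
# only the task-specific architecture on top is trained.
for p in encoder.parameters():
    p.requires_grad = False
feature_based_params = list(classifier.parameters())

# Fine-tuning: all pre-trained weights are updated together with the new head.
for p in encoder.parameters():
    p.requires_grad = True
fine_tuning_params = list(encoder.parameters()) + list(classifier.parameters())
```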

Fine-tuning using BERT

BERT addresses the previously mentioned unidirectional constraint by proposing a new pre-training objective: the “masked language model” (MLM). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. The MLM objective allows the representation to fuse the left and the right context (as shown in the figure below), which makes it possible to pre-train a deep bidirectional Transformer.

Figure: BERT's deep bidirectional Transformer pre-training architecture.
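A minimal sketch of the MLM input corruption described above. The token and [MASK] ids here are purely illustrative; the 15% masking rate and the 80/10/10 replacement rule follow the paper.

```python
# Sketch: masked language model (MLM) input corruption (illustrative token ids).
import random

MASK_ID = 103          # hypothetical id of the [MASK] token
VOCAB_SIZE = 30000
MASK_PROB = 0.15       # the paper masks 15% of the input tokens

def mask_tokens(token_ids):
    """Return (corrupted_ids, labels); labels are -100 at unmasked positions."""
    corrupted, labels = [], []
    for tid in token_ids:
        if random.random() < MASK_PROB:
            labels.append(tid)                  # model must predict the original id
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_ID)       # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tid)           # 10%: keep the original token
        else:
            corrupted.append(tid)
            labels.append(-100)                 # convention: ignored by the loss
    return corrupted, labels
```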

The input representation to BERT is a single token sequence. For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings, as shown below.

Figure: BERT input representation.

The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state (i.e., the output of the Transformer) corresponding to this token is used as the aggregate sequence representation for classification tasks.
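A minimal sketch of this use of [CLS], assuming hypothetical shapes; `sequence_output` stands in for the Transformer's final hidden states.

```python
# Sketch: the final hidden state at the [CLS] position feeds a classification head.
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2
classifier = nn.Linear(hidden_size, num_labels)

sequence_output = torch.randn(8, 128, hidden_size)   # (batch, seq_len, hidden) from the encoder
cls_vector = sequence_output[:, 0, :]                 # position 0 is the [CLS] token
logits = classifier(cls_vector)                       # (batch, num_labels)
```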

Sentence pairs are packed together into a single sequence. The two sentences are fused as follows: separate them with a special token ([SEP]), then add a learned sentence A embedding to every token of the first sentence and a sentence B embedding to every token of the second sentence.
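The sketch below puts the last two paragraphs together: a packed "[CLS] A [SEP] B [SEP]" sequence whose input representation is the sum of token, segment and position embeddings. The sizes and token ids are illustrative placeholders, not the exact values of BERT's tokenizer.

```python
# Sketch: BERT-style input representation as the sum of three embeddings.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)        # sentence A = 0, sentence B = 1
pos_emb = nn.Embedding(max_len, hidden)  # learned position embeddings

# "[CLS] <sentence A tokens> [SEP] <sentence B tokens> [SEP]" packed as one sequence.
token_ids    = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 2017, 102]])  # illustrative ids
segment_ids  = torch.tensor([[0,   0,    0,    0,   1,    1,    1,    1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

input_repr = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(position_ids)
print(input_repr.shape)  # torch.Size([1, 8, 768])
```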

Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two sentences. In order to train a model that understands sentence relationships, the authors pre-train on a binarized next sentence prediction task.
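A minimal sketch of how such pre-training pairs could be generated; `make_nsp_examples` is a hypothetical helper, while the 50/50 split between the actual next sentence and a random sentence follows the paper.

```python
# Sketch: building examples for the binarized next sentence prediction (NSP) task.
import random

def make_nsp_examples(documents):
    """documents: list of documents, each a list of sentences (plain strings)."""
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            sent_a = doc[i]
            if random.random() < 0.5:
                sent_b, label = doc[i + 1], "IsNext"      # 50%: the actual next sentence
            else:
                rand_doc = random.choice(documents)       # 50%: a random sentence
                sent_b, label = random.choice(rand_doc), "NotNext"
            examples.append((sent_a, sent_b, label))
    return examples
```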

Comparisons between BERT and OpenAI GPT

The core argument of the paper is that the two novel pre-training tasks account for the majority of BERT's empirical improvements over OpenAI GPT. However, the two models also differ in several other ways, noted below.

| GPT | BERT |
| --- | --- |
| Trained on the BooksCorpus (800M words) | Trained on the BooksCorpus (800M words) and Wikipedia (2,500M words) |
| Uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time | Learns [SEP], [CLS] and sentence A/B embeddings during pre-training |
| Trained for 1M steps with a batch size of 32,000 words | Trained for 1M steps with a batch size of 128,000 words |
| Used the same learning rate of 5e-5 for all fine-tuning experiments | Chooses a task-specific fine-tuning learning rate which performs best on the development set |

Observations