NLP - Neural Machine Translation by Jointly Learning to Align and Translate

About the Paper

Key Contributions

Encoder-Decoder Models

The models proposed previously for neural machine translation often belong to a family of encoder-decoder models. The encoder encodes a source sentence into a fixed-length vector, and from this fixed-length vector the decoder generates a translation of the source sentence.

The potential issue with this structure is that the encoder needs to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the encoder to cope with long sentences, especially those longer than the sentences in the training corpus.
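To make the bottleneck concrete, here is a minimal NumPy sketch (not the paper's implementation; the plain tanh RNN, weight shapes, and dimensions are illustrative assumptions) of an encoder that reads a sentence of any length and always returns a single fixed-length vector:

```python
import numpy as np

def encode_to_fixed_vector(embeddings, W_x, W_h, b):
    """embeddings: (T, d_emb) source word vectors; returns the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in embeddings:                      # read the sentence left to right
        h = np.tanh(W_x @ x_t + W_h @ h + b)    # simple tanh RNN update
    return h                                    # fixed-length summary, regardless of T

# Toy usage: a 12-word and a 50-word sentence both collapse into the same
# 100-dimensional vector, which is the bottleneck the paper points out.
rng = np.random.default_rng(0)
d_emb, d_h = 64, 100
W_x = rng.normal(size=(d_h, d_emb)) * 0.1
W_h = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)
c_short = encode_to_fixed_vector(rng.normal(size=(12, d_emb)), W_x, W_h, b)
c_long = encode_to_fixed_vector(rng.normal(size=(50, d_emb)), W_x, W_h, b)
assert c_short.shape == c_long.shape == (d_h,)
```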

To address this issue, the authors introduce an extension to the basic encoder-decoder model that learns to align and translate jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in the source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words. The following sub-section, RNN Encoder-Decoder, gives a brief description of the proposed model.

RNN Encoder-Decoder

The proposed architecture (RNNsearch) is illustrated in the figure below.

[Figure: the decoder of the proposed model attending over the encoder annotations]

The architecture consists of a bidirectional RNN as the encoder and a decoder that emulates searching through a source sentence while decoding a translation.

The bidirectional RNN consists of a forward and a backward RNN. The forward RNN reads the input sentence in its original order (from \(x_1\) to \(x_T\)) and calculates a sequence of forward hidden states \((\overrightarrow{h_1},..., \overrightarrow{h_T})\). The backward RNN reads the sequence in reverse order (from \(x_T\) to \(x_1\)), resulting in a sequence of backward hidden states \((\overleftarrow{h_1},..., \overleftarrow{h_T})\). The annotation for each word \(x_j\) is obtained by concatenating the forward hidden state \(\overrightarrow{h_j}\) and the backward one \(\overleftarrow{h_j}\), i.e., \(h_j = \Big[\overrightarrow{h_j}^\top;\overleftarrow{h_j}^\top\Big]^\top\). In this way, the annotation \(h_j\) contains summaries of both the preceding words and the following words.
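A minimal sketch of the bidirectional encoder, assuming simple tanh RNN cells (the paper uses gated hidden units) and illustrative weight shapes:

```python
import numpy as np

def rnn_pass(embeddings, W_x, W_h, b):
    """Run a simple tanh RNN over the sequence and return all hidden states, shape (T, d_h)."""
    h, states = np.zeros(W_h.shape[0]), []
    for x_t in embeddings:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_annotations(embeddings, fwd_params, bwd_params):
    """Return one annotation per source word: h_j = [h_fwd_j ; h_bwd_j]."""
    h_fwd = rnn_pass(embeddings, *fwd_params)               # reads x_1 ... x_T
    h_bwd = rnn_pass(embeddings[::-1], *bwd_params)[::-1]   # reads x_T ... x_1, re-aligned to x_j
    return np.concatenate([h_fwd, h_bwd], axis=1)           # shape (T, 2 * d_h)
```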

The context vector \(c_i\) is then computed as a weighted sum of the annotations \(h_j\): \[c_i = \sum_{j=1}^{T}\alpha_{ij}h_j\]

The weight \(\alpha_{ij}\) of each annotation \(h_j\) is computed by \[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{ik})},\] where \[e_{ij} = a(s_{i-1},h_j)\]

is an alignment model which scores how well the inputs around position \(j\) and the output at position \(i\) match.
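A sketch of how the weights \(\alpha_{ij}\) and the context vector \(c_i\) can be computed, using the single-layer alignment model \(a(s_{i-1},h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)\) from the paper; the NumPy code and weight shapes are illustrative:

```python
import numpy as np

def attention_context(s_prev, annotations, W_a, U_a, v_a):
    """Given the previous decoder state s_{i-1} and all annotations h_j,
    return the context vector c_i and the attention weights alpha_i."""
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), scored for every source position j
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alpha = np.exp(e - e.max())          # numerically stable softmax over source positions
    alpha /= alpha.sum()
    c = alpha @ annotations              # weighted sum of annotations gives the context vector c_i
    return c, alpha
```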

The generation of the target word \(y_i\) can be defined using the conditional probability \[p(y_i \mid y_1,\ldots,y_{i-1},X) = g(y_{i-1},s_i,c_i),\] where \(g\) is a nonlinear, potentially multi-layered function that outputs the probability of \(y_i\), and \(s_i\) is the RNN hidden state of the decoder at time \(i\), computed by \[s_i = f(s_{i-1}, y_{i-1}, c_i)\]
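A sketch of one decoding step under simplifying assumptions: \(f\) is shown as a plain tanh RNN cell and \(g\) as a single affine-plus-softmax layer, whereas the paper uses gated hidden units and a maxout output layer; the data flow, however, is the same:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev_emb, c_i, W_s, W_y, W_c, b_s, W_o, b_o):
    """One decoding step: f updates the hidden state, g scores the target vocabulary."""
    # f: here a plain tanh RNN cell over (s_{i-1}, y_{i-1}, c_i)
    s_i = np.tanh(W_s @ s_prev + W_y @ y_prev_emb + W_c @ c_i + b_s)
    # g: here a single affine layer plus softmax over the target vocabulary
    p_y = softmax(W_o @ np.concatenate([s_i, y_prev_emb, c_i]) + b_o)
    return s_i, p_y
```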

The probability \(\alpha_{ij}\), or its associated energy \(e_{ij}\), reflects the importance of the annotation \(h_j\) with respect to the previous hidden state \(s_{i-1}\) in deciding the next state \(s_i\) and generating \(y_i\). Intuitively, this implements a mechanism of attention in the decoder: the decoder decides which parts of the source sentence to pay attention to.

Experiments

The dataset used for the experiments is the WMT'14 English-French parallel corpus. The authors trained two types of models: the first is the RNN Encoder-Decoder (RNNencdec) and the other is the RNNsearch. The encoder and decoder of the RNNencdec have 1000 hidden units each. The encoder of the RNNsearch consists of forward and backward recurrent neural networks (RNNs), each with 1000 hidden units, and its decoder also has 1000 hidden units. Each model was trained twice: first with sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with sentences of length up to 50 words (RNNencdec-50, RNNsearch-50). The main observation from the experiments is that RNNsearch outperforms RNNencdec in all cases, and, unlike RNNencdec, its performance does not deteriorate on long sentences.

Conclusion

The most important distinguishing feature of the RNNsearch over the basic encoder-decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees the neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector, which allows the model to cope better with long sentences.