This post is mainly my reading notes on the tutorial *Neural Machine Translation and Sequence-to-sequence Models: A Tutorial*, along with solutions to its questions (and code, where possible).

## Concepts

Machine translation is the technology used to translate between human languages.

Source language: the language input to the machine translation system

Target language: the output language

Machine translation can be described as the task of taking a sequence of words in the source language and converting it into a sequence of words in the target language.

Sequence-to-sequence models refer to the broader class of models that map one sequence to another, e.g., machine translation, tagging, dialog, speech recognition, etc.

### Statistical Machine Translation

We define our task of machine translation as translating a source sentence $F=f_1,...,f_J=f_1^{|F|}$ into a target sentence $E=e_1,...,e_I=e_1^{|E|}$. Here, the notation $e_1^{|E|}$ (with the subscript $1$) may be a bit misleading at first glance. It means that the identity of the first word in the sentence is $e_1$, the identity of the second word in the sentence is $e_2$, up until the last word in the sentence being $e_{|E|}$.

Then, a translation system can be defined as a function $\hat{E}=\text{mt}(F)$, which returns a translation hypothesis $\hat{E}$ given a source sentence $F$ as input.

Statistical machine translation systems are systems that perform translation by creating a probabilistic model for the probability of $E$ given $F$, $P(E|F;\theta)$, and finding the target sentence that maximizes this probability:

$$\hat{E}=\underset{E}{\operatorname{argmax}}\,P(E|F;\theta),$$

where $\theta$ are the parameters of the model specifying the probability distribution.

The parameters $\theta$ are learned from data consisting of aligned sentences in the source and target languages, which are called parallel corpora.

## n-gram Language Models

Instead of calculating the original joint probability $P(E)=P(|E|=T,e_1^T)$ directly, it's more manageable to calculate it by multiplying together conditional probabilities for each of its elements:

$$P(E)=\prod_{t=1}^{T+1}P(e_t|e_1^{t-1}),$$

where $e_{T+1}=\langle/s\rangle$ is an implicit sentence-end symbol, which indicates that the sentence has terminated. By examining the position of the $\langle/s\rangle$ symbol, we can determine whether $|E|=T$.

Then how do we calculate the probability of the next word given the previous words, $P(e_t|e_1^{t-1})$? The first way is simple: prepare a set of training data from which we can count word strings, count up the number of times we have seen a particular string of words, and divide it by the number of times we have seen its context:

$$P(e_t|e_1^{t-1})=\frac{c_{\text{prefix}}(e_1^t)}{c_{\text{prefix}}(e_1^{t-1})}.$$

Here $c_{\text{prefix}}(\cdot)$ is the count of the number of times this particular word string appeared at the beginning of a sentence in the training data.
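As a concrete illustration, here is a minimal sketch of this counting approach (the toy corpus and function names are my own choices, not from the tutorial):

```python
from collections import defaultdict

def train_prefix_lm(corpus):
    # c_prefix(w_1..w_t): how often this word string starts a training sentence
    counts = defaultdict(int)
    for sentence in corpus:
        words = sentence.split() + ["</s>"]
        counts[()] += 1  # the empty prefix matches every sentence
        for t in range(1, len(words) + 1):
            counts[tuple(words[:t])] += 1
    return counts

def prefix_prob(counts, prefix, word):
    # P(e_t | e_1^{t-1}) = c_prefix(e_1^t) / c_prefix(e_1^{t-1})
    context = tuple(prefix)
    if counts[context] == 0:
        return 0.0
    return counts[context + (word,)] / counts[context]

counts = train_prefix_lm(["the cat sat", "the cat ran", "a dog ran"])
```

Here `prefix_prob(counts, [], "the")` is 2/3, because two of the three training sentences begin with "the".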

However, this language model will assign a probability of zero to every sentence that it hasn't seen before in the training corpus, which is not very useful.

To solve this problem, instead of calculating probabilities from the beginning of the sentence, we set a fixed window of previous words upon which we base our probability calculations. If we limit our context to the $n-1$ previous words, this amounts to:

$$P(e_t|e_1^{t-1})\approx P(e_t|e_{t-n+1}^{t-1}).$$

Models that make this assumption are called **n-gram models**. Specifically, models where $n=1$ are called unigram models, $n=2$ bigram models, $n=3$ trigram models, etc.

In the simplest form, the parameters $\theta$ of n-gram models consist of the probabilities of the next word given the $n-1$ previous words, which can be calculated using maximum likelihood estimation as follows:

$$P_{ML}(e_t|e_{t-n+1}^{t-1})=\frac{c(e_{t-n+1}^t)}{c(e_{t-n+1}^{t-1})},$$

where $c(\cdot)$ is the count of the word string in the training corpus.
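A hedged sketch of n-gram maximum likelihood estimation with a fixed context window, padding with `<s>` so early words have a full context (all names here are illustrative):

```python
from collections import defaultdict

def train_ngram(corpus, n):
    # collect c(e_{t-n+1}^t) and c(e_{t-n+1}^{t-1}), padding with <s>
    ngram_counts, context_counts = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        words = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for t in range(n - 1, len(words)):
            context = tuple(words[t - n + 1:t])
            ngram_counts[context + (words[t],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts

def p_ml(ngram_counts, context_counts, context, word):
    # P_ML(e_t | e_{t-n+1}^{t-1}) = c(e_{t-n+1}^t) / c(e_{t-n+1}^{t-1})
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]

ngrams, contexts = train_ngram(["the cat sat", "the cat ran"], n=2)
```

With this toy corpus, "cat" always follows "the", so its bigram probability is 1, while "sat" follows "cat" half the time.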

### Smoothing

However, what if we encounter a two-word string that has never appeared in the training corpus? n-gram models fix this problem by smoothing probabilities, combining the maximum likelihood estimates for various values of $n$. In the simple case of smoothing unigram and bigram probabilities, we can think of a model that combines together the probabilities as follows:

$$P(e_t|e_{t-1})=(1-\alpha)P_{ML}(e_t|e_{t-1})+\alpha P_{ML}(e_t),$$

where $\alpha$ is a variable specifying how much probability mass we hold out for the unigram distribution. As long as $\alpha>0$, all the words in our vocabulary will be assigned some probability. This method is called interpolation, and is one of the standard ways to make probabilistic models more robust to low-frequency phenomena.
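A minimal sketch of this interpolation, computed from unigram and bigram counts over a toy corpus (the helper names are mine):

```python
from collections import defaultdict

def train_counts(corpus):
    uni, bi, ctx = defaultdict(int), defaultdict(int), defaultdict(int)
    total = 0
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for t in range(1, len(words)):
            uni[words[t]] += 1
            total += 1
            bi[(words[t - 1], words[t])] += 1
            ctx[words[t - 1]] += 1
    return uni, total, bi, ctx

def p_interp(word, prev, uni, total, bi, ctx, alpha=0.1):
    # (1 - alpha) * P_ML(e_t | e_{t-1}) + alpha * P_ML(e_t)
    p_uni = uni[word] / total
    p_bi = bi[(prev, word)] / ctx[prev] if ctx[prev] else 0.0
    return (1 - alpha) * p_bi + alpha * p_uni

uni, total, bi, ctx = train_counts(["the cat sat"])
```

Even the unseen bigram "the sat" now gets the positive probability $\alpha\,P_{ML}(\text{sat})$, which is exactly the robustness interpolation buys us.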

Some more sophisticated methods for smoothing: context-dependent smoothing coefficients, back-off, modified distributions, and modified Kneser-Ney smoothing.

### Evaluation

**Likelihood**

The most straight-forward way of defining accuracy is the likelihood of the model with respect to the development or test data. The likelihood of the parameters $\theta$ with respect to this data is equal to the probability that the model assigns to the data.

**Log likelihood**

The log likelihood is used for a couple of reasons:

1. The probability of any particular sentence according to the language model can be a very small number, and the product of these small numbers can underflow and cause numerical precision problems on standard computing hardware; sums of log probabilities avoid this.
2. It is mathematically more convenient, since the log of a product becomes a sum.

**Perplexity**

An intuitive explanation of the perplexity is “how confused is the model about its decision?” More accurately, it expresses the value “if we randomly picked words from the probability distribution calculated by the language model at each time step, on average how many words would it have to pick to get the correct one?”
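Perplexity can be computed as $e^{-\frac{1}{N}\sum\log P}$ over the per-word log probabilities. A small sketch (the uniform-distribution sanity check is my own):

```python
import math

def perplexity(log_probs):
    # ppl = exp(-(1/N) * sum of natural-log word probabilities)
    return math.exp(-sum(log_probs) / len(log_probs))

# sanity check: a uniform model over a 5-word vocabulary gives every word
# probability 1/5, so on average it must "pick 5 times" -- perplexity 5
uniform_lp = [math.log(1 / 5)] * 12
```

This matches the intuitive reading: picking uniformly among 5 equally likely words is a perplexity of exactly 5, no matter how long the test data is.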

### Handling Unknown Words

• Assume closed vocabulary
• Interpolate with an unknown words distribution
• Add an $\langle \text{unk}\rangle$ word

Further reading includes large-scale language modeling, language model adaptation, longer-distance count-based language models, and syntax-based language models.

The course notes by Michael Collins from Columbia University are another good resource.

## Log-linear Language Models

### Model Formulation

Log-linear language models revolve around the concept of features. We define a feature function $\phi(e_{t-n+1}^{t-1})$ that takes a context as input and outputs a real-valued feature vector $\boldsymbol{x}\in\mathbb{R}^N$ that describes the context using $N$ different features.

The feature vector can be a one-hot vector. The function $\text{onehot}(i)$ returns a vector where only the $i$th element is one and the rest are zero (assume the vector has the appropriate length given the context).

Calculating scores: We calculate a score vector $\boldsymbol{s}\in\mathbb{R}^{|V|}$ that corresponds to the likelihood of each word: words with higher scores in the vector will also have higher probabilities. The model parameters $\theta$ specifically come in two varieties: a bias vector $\boldsymbol{b}\in\mathbb{R}^{|V|}$, which tells us how likely each word in the vocabulary is overall, and a weight matrix $W\in\mathbb{R}^{|V|\times N}$, which describes the relationship between feature values and scores. The scores are then calculated as $\boldsymbol{s}=W\boldsymbol{x}+\boldsymbol{b}$.

To make computation more efficient (because many elements are zero in one-hot vectors or other sparse vectors), we can add together the columns of the weight matrix for all active (non-zero) features:

$$\boldsymbol{s}=\sum_{\{j:\,x_j\neq0\}}x_jW_{\cdot,j}+\boldsymbol{b},$$

where $\mathit{W}_{\cdot,j}$ is the $j\text{th}$ column of $\mathit{W}$.
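A quick numerical check of this trick, with toy sizes of my own choosing: the sparse column-sum gives the same score vector as the dense product $W\boldsymbol{x}+\boldsymbol{b}$.

```python
import numpy as np

V, N = 4, 6  # toy vocabulary size and feature count
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))
b = rng.normal(size=V)

def score_dense(x):
    return W @ x + b

def score_sparse(active):
    # add only the columns W[:, j] (scaled by x_j) of active features
    s = b.copy()
    for j, xj in active:
        s += xj * W[:, j]
    return s

x = np.zeros(N)
x[1], x[4] = 1.0, 1.0          # e.g. two one-hot context-word features
active = [(1, 1.0), (4, 1.0)]  # the same features in sparse form
```

With only a handful of active features, the sparse version touches 2 columns instead of multiplying against all $N$.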

Calculating probabilities:

By taking the exponent and dividing by the sum of the values over the entire vocabulary, these scores can be turned into probabilities that are between 0 and 1 and sum to 1:

$$p_i=\frac{e^{s_i}}{\sum_{\tilde{j}}e^{s_{\tilde{j}}}}.$$

This function is called the softmax function, and is often expressed in vector form as $\boldsymbol{p}=\text{softmax}(\boldsymbol{s})$.
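A minimal, numerically stable softmax sketch (subtracting the max before exponentiating is a standard trick that does not change the result):

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))  # subtract the max for numerical stability
    return z / z.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
```

Higher scores map to higher probabilities, and the output always sums to one.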

### Learning Model Parameters

**Sentence-level loss function**

**Per-word level**

(In the original tutorial, the equation is written slightly differently. But I guess it should be $-\log P(e_t|e_{t-n+1}^{t-1})$?)

**Methods for SGD**

**Derivative**

Stepping through the full loss function in one pass:

Using the chain rule:
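For the softmax with a negative log-likelihood loss, the chain rule yields the well-known result $\frac{d\ell}{d\boldsymbol{s}}=\boldsymbol{p}-\text{onehot}(e_t)$, which then propagates into $\boldsymbol{b}$ and $W$. A toy sketch of one SGD step (sizes and seed are arbitrary):

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

V, N = 4, 3
rng = np.random.default_rng(1)
W, b = rng.normal(size=(V, N)), rng.normal(size=V)
x = rng.normal(size=N)  # feature vector for some context
e_t = 2                 # index of the correct next word

p = softmax(W @ x + b)
onehot = np.zeros(V)
onehot[e_t] = 1.0

ds = p - onehot            # dl/ds
grad_b = ds                # dl/db = dl/ds
grad_W = np.outer(ds, x)   # dl/dW = (dl/ds) x^T

eta = 0.1                  # one SGD step with learning rate eta
W -= eta * grad_W
b -= eta * grad_b
```

Note that the gradient on the correct word's score is negative (its probability gets pushed up), while the gradient components sum to zero.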

One reason why log-linear models are nice is that they allow us to flexibly design features that we think might be useful for predicting the next word. Features include context-word features, context-class features, context-suffix features, bag-of-words features, etc.

Further reading includes whole-sentence language models and discriminative language models.

## Neural Networks and Feed-forward Language Models

Multi-layer perceptrons (MLPs) consist of one or more hidden layers, each of which applies an affine transform (a fancy name for a multiplication followed by an addition) followed by a non-linear function (step function, tanh, ReLU, etc.), culminating in an output layer that calculates some variety of output.

Neural networks can be thought of as a chain of functions that takes some input and calculates some desired output. The power of neural networks lies in the fact that chaining together a variety of simpler functions makes it possible to represent more complicated functions in an easily trainable, parameter-efficient way.

### Training Neural Networks

Calculating loss function:

The derivatives:

We could go through all of the derivations above by hand and precisely calculate the gradients of all parameters in the model. But even for a simple model like the one above, it is quite a lot of work and error-prone. Fortunately, when we actually implement neural networks on a computer, there is a very useful tool that saves us a large portion of this pain: automatic differentiation (autodiff).

To understand automatic differentiation, it is useful to think of our computation in a data structure called a computation graph. As shown in the following figure (figure 10 from the original paper), each node represents either an input to the network or the result of one computational operation, such as a multiplication, addition, tanh, or squared error. The first graph in the figure calculates the function of interest itself and would be used when we want to make predictions using our model, and the second graph calculates the loss function and would be used in training.

Automatic differentiation is a two-step dynamic programming algorithm that operates over the second graph and performs:

• Forward calculation, which traverses the nodes in the graph in topological order, calculating the actual result of the computation
• Back propagation, which traverses the nodes in reverse topological order, calculating the gradients
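These two steps can be demonstrated with a tiny reverse-mode autodiff sketch over scalar nodes (this is my own minimal illustration, not code from the tutorial):

```python
import math

class Node:
    """A computation-graph node: value, accumulated gradient, and a function
    that passes a gradient back to its parent nodes."""
    def __init__(self, value, parents=(), backward=lambda g: ()):
        self.value, self.parents, self._backward = value, parents, backward
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, (a, b), lambda g: ((a, g), (b, g)))

def mul(a, b):
    return Node(a.value * b.value, (a, b),
                lambda g: ((a, g * b.value), (b, g * a.value)))

def tanh(a):
    v = math.tanh(a.value)
    return Node(v, (a,), lambda g: ((a, g * (1 - v * v)),))

def backprop(out):
    order, seen = [], set()
    def visit(n):                      # topological order via DFS
        if id(n) not in seen:
            seen.add(id(n))
            for parent in n.parents:
                visit(parent)
            order.append(n)
    visit(out)
    out.grad = 1.0
    for n in reversed(order):          # reverse topological order
        for parent, g in n._backward(n.grad):
            parent.grad += g

# forward calculation of y = tanh(w * x + b), then back propagation
w, x, b = Node(0.5), Node(2.0), Node(-0.25)
y = tanh(add(mul(w, x), b))
backprop(y)
```

After `backprop`, `w.grad` holds $\frac{dy}{dw}=(1-\tanh^2(wx+b))\,x$ without any hand-derived gradient code: each node only knows its local derivative, and the traversal order does the rest.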

### Neural-network Language Models

A trigram neural network model with a single hidden layer is structured as shown in the figure above.

In the first line, we obtain a vector $\boldsymbol{m}$ representing the context $e_{t-n+1}^{t-1}$. Here, $M$ is a matrix with $|V|$ columns and $L_m$ rows, in which each column corresponds to an $L_m$-length vector representing a single word in the vocabulary. This vector is called a word embedding or a word representation, which is a vector of real numbers corresponding to a particular word in the vocabulary.

The vector $\boldsymbol{m}$ then results from the concatenation of the word vectors for all of the words in the context, so $|\boldsymbol{m}|=L_m\times(n-1)$. Once we have this $\boldsymbol{m}$, we run it through a hidden layer to obtain the vector $\boldsymbol{h}$. By doing so, the model can learn combination features that reflect information regarding multiple words in the context.

Next, we calculate the score vector for each word: $\boldsymbol{s}\in\mathbb{R}^{|V|}$. This is done by performing an affine transform of the hidden vector $\boldsymbol{h}$ with a weight matrix $W_{hs}\in\mathbb{R}^{|V|\times|\boldsymbol{h}|}$ and adding a bias vector $\boldsymbol{b}_s\in\mathbb{R}^{|V|}$. Finally, we get a probability estimate $\boldsymbol{p}$ by running the calculated scores through a softmax function, like we did in the log-linear language models. For training, if we know $e_t$ we can also calculate the loss function $\ell=-\log(p_{e_t})$.
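Putting the three steps together (embedding lookup and concatenation, hidden layer, softmax over scores), a toy forward pass might look like this, with all sizes chosen arbitrarily:

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

V, L_m, H, n = 10, 4, 5, 3     # vocab, embedding size, hidden size, n-gram order
rng = np.random.default_rng(0)
M = rng.normal(size=(L_m, V))              # one embedding column per word
W_mh = rng.normal(size=(H, L_m * (n - 1)))
b_h = rng.normal(size=H)
W_hs = rng.normal(size=(V, H))
b_s = rng.normal(size=V)

def ffn_lm_step(context):
    # context: indices of the n-1 previous words
    m = np.concatenate([M[:, w] for w in context])  # look up and concatenate
    h = np.tanh(W_mh @ m + b_h)                     # hidden layer
    s = W_hs @ h + b_s                              # scores
    return softmax(s)                               # probabilities

p = ffn_lm_step([3, 7])
loss = -np.log(p[2])  # loss if the correct next word has index 2
```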

The advantage of neural network formulation: Better generalization of contexts, More generalizable combination of words into contexts and Ability to skip previous words.

Further reading includes Softmax approximations, Other softmax structures and Other models to learn word representations.

## Recurrent Neural Network Language Models

Language models based on recurrent neural networks (RNNs) have the ability to capture long-distance dependencies in language.

Some examples of long-distance dependencies in language: reflexive form (himself, herself) should match the gender, the conjugation of the verb based on the subject of the sentence, selectional preferences, topic and register.

Recurrent neural networks are a variety of neural network that makes it possible to model these long-distance dependencies. The idea is simply to add a connection that references the previous hidden state $\boldsymbol{h}_{t-1}$ when calculating the hidden state $\boldsymbol{h}_t$.

For time steps $t\ge1$, the only difference from the hidden layer in a standard neural network is the addition of the connection $W_{hh}\boldsymbol{h}_{t-1}$ from the hidden state at time step $t-1$ to that at time step $t$.

RNNs make it possible to model long distance dependencies because they have the ability to pass information between timesteps. For example, if some of the nodes in $\boldsymbol{h}_{t-1}$ encode the information that “the subject of the sentence is male”, it is possible to pass on this information to $\boldsymbol{h}_t$, which can in turn pass it on to $\boldsymbol{h}_{t+1}$ and on to the end of the sentence. This ability to pass information across an arbitrary number of consecutive time steps is the strength of recurrent neural networks, and allows them to handle the long-distance dependencies.

Feed-forward language model: $\boldsymbol{h}_t=\tanh(W_{xh}\boldsymbol{x}_t+\boldsymbol{b}_h)$. Recurrent language model: $\boldsymbol{h}_t=\tanh(W_{xh}\boldsymbol{x}_t+W_{hh}\boldsymbol{h}_{t-1}+\boldsymbol{b}_h)$ for $t\ge1$, with $\boldsymbol{h}_0=\boldsymbol{0}$.
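A minimal sketch of the recurrent update, assuming the tanh non-linearity and toy dimensions of my own choosing:

```python
import numpy as np

D_x, D_h = 3, 4
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(D_h, D_x))
W_hh = rng.normal(size=(D_h, D_h))  # the recurrent connection
b_h = rng.normal(size=D_h)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(D_h)                      # h_0 = 0
for x_t in rng.normal(size=(5, D_x)):  # run over a 5-step input sequence
    h = rnn_step(x_t, h)
```

Because each $\boldsymbol{h}_t$ is computed from $\boldsymbol{h}_{t-1}$, information from the first time step can, in principle, flow all the way to the last.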

### The Vanishing Gradient and Long Short-term Memory

The vanishing gradient problem and the exploding gradient problem are the problems that simple RNNs face: in back propagation through time, the gradient $\frac{d\ell}{d\boldsymbol{h}_t}$ gets smaller and smaller as it passes through steps where $\frac{d\boldsymbol{h}_t}{d\boldsymbol{h}_{t-1}}\lt1$ (and is amplified where $\frac{d\boldsymbol{h}_t}{d\boldsymbol{h}_{t-1}}\gt1$).

One method to solve this problem, in the case of diminishing gradients, is the use of a neural network architecture, the long short-term memory (LSTM), that is specifically designed to ensure that the derivative of the recurrent function is exactly one. The most fundamental idea behind the LSTM is that in addition to the standard hidden state $\boldsymbol{h}$ used by most neural networks, it also has a memory cell $\boldsymbol{c}$, for which the gradient $\frac{d\boldsymbol{c}_t}{d\boldsymbol{c}_{t-1}}$ is exactly one. Because this gradient is exactly one, information stored in the memory cell does not suffer from vanishing gradients, and thus LSTMs can capture long-distance dependencies more effectively than standard recurrent neural networks.

The first equation is the update, which is basically the same as the RNN update. It takes in the input and hidden state, performs an affine transform and runs it through the tanh non-linearity.

The following two equations are the input gate and output gate of the LSTM respectively. The function of "gates", as indicated by their name, is to either allow information to pass through or block it from passing. Both of these gates perform an affine transform followed by the sigmoid function. The output of the sigmoid is then used to perform a componentwise multiplication, $\boldsymbol{z}=\boldsymbol{x}\odot\boldsymbol{y}$, which means $z_i=x_i\times y_i$, with the output of another function.

The next is the most important equation in the LSTM. This equation sets $\boldsymbol{c}_t$ to be equal to the update $\boldsymbol{u}_t$ modulated by the input gate $\boldsymbol{i}_t$ plus the cell value for the previous time step $\boldsymbol{c}_{t-1}$. Since we are directly adding $\boldsymbol{c}_{t-1}$ to $\boldsymbol{c}_t$, the gradient $\frac{d\boldsymbol{c}_t}{d\boldsymbol{c}_{t-1}}$ is exactly one.

The final equation calculates the next hidden state of the LSTM. This is calculated by using a tanh function to scale the cell value between -1 and 1, then modulating the output using the output gate value $\boldsymbol{o}_t$. This will be the value actually used in any downstream calculation.
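The LSTM equations described above (update, input gate, output gate, memory cell, hidden state) can be sketched as follows; this is the basic LSTM without a forget gate, with illustrative parameter names and toy sizes:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

D_x, D_h = 3, 4
rng = np.random.default_rng(0)
# one (W_x, W_h, b) triple per equation: update u, input gate i, output gate o
params = {name: (rng.normal(size=(D_h, D_x)), rng.normal(size=(D_h, D_h)),
                 rng.normal(size=D_h)) for name in ("u", "i", "o")}

def lstm_step(x_t, h_prev, c_prev):
    def affine(name):
        W_x, W_h, b = params[name]
        return W_x @ x_t + W_h @ h_prev + b
    u = np.tanh(affine("u"))   # update, like a plain RNN step
    i = sigmoid(affine("i"))   # input gate
    o = sigmoid(affine("o"))   # output gate
    c = i * u + c_prev         # memory cell: dc_t/dc_{t-1} = 1
    h = o * np.tanh(c)         # hidden state used downstream
    return h, c

h, c = np.zeros(D_h), np.zeros(D_h)
for x_t in rng.normal(size=(5, D_x)):
    h, c = lstm_step(x_t, h, c)
```

The crucial line is `c = i * u + c_prev`: the previous cell is added in directly, so the gradient flows through it unchanged.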

### Other RNN Variants

#### LSTM with a Forget Gate

One modification to the standard LSTM that is used widely (in fact so widely that most people who refer to “LSTM” are now referring to this variant) is the addition of a forget gate. The equations:

The difference lies in the forget gate $\boldsymbol{f}_t$, which modulates the passing of the previous cell $\boldsymbol{c}_{t-1}$ to the current cell $\boldsymbol{c}_t$. This forget gate is useful in that it allows the cell to easily clear its memory when justified. Forget gates have the advantage of allowing this sort of fine-grained information flow control, but they also come with the risk that if $\boldsymbol{f}_t$ is set to zero all the time, the model will forget everything and lose its ability to handle long-distance dependencies. Thus, at the beginning of neural network training, it is common to initialize the bias $\boldsymbol{b}_f$ of the forget gate to a somewhat large value (e.g. 1), which makes the neural net start training without using the forget gate, and only gradually start forgetting content after the net has been trained to some extent.

#### Gated Recurrent Unit

The gated recurrent unit (GRU) is an RNN variant that is simpler than the LSTM:

The most characteristic element of the GRU is the last equation, which interpolates between a candidate for the updated hidden state $\tilde{\boldsymbol{h}}_t$ and the previous state $\boldsymbol{h}_{t-1}$ (in the original tutorial, this is noted as $\tilde{\boldsymbol{h}}_{t-1}$, which might be a mistake). This interpolation is modulated by an update gate $\boldsymbol{z}_t$: if the update gate is close to one, the GRU will use the new candidate hidden value, and if it is close to zero, it will use the previous value. The candidate hidden state is similar to a standard RNN update but includes an additional modulation of the hidden state input by a reset gate $\boldsymbol{r}_t$.

Compared to the LSTM, the GRU has slightly fewer parameters (it performs one less parameterized affine transform) and does not have a separate concept of a "cell". Thus, GRUs have been used by some to conserve memory or computation time.
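A corresponding sketch of the GRU update (reset gate, update gate, candidate state, interpolation), again with toy sizes and parameter names of my own choosing:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

D_x, D_h = 3, 4
rng = np.random.default_rng(0)
params = {name: (rng.normal(size=(D_h, D_x)), rng.normal(size=(D_h, D_h)),
                 rng.normal(size=D_h)) for name in ("r", "z", "h")}

def gru_step(x_t, h_prev):
    def affine(name, h_in):
        W_x, W_h, b = params[name]
        return W_x @ x_t + W_h @ h_in + b
    r = sigmoid(affine("r", h_prev))             # reset gate
    z = sigmoid(affine("z", h_prev))             # update gate
    h_tilde = np.tanh(affine("h", r * h_prev))   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde        # interpolate old and new

h = np.zeros(D_h)
for x_t in rng.normal(size=(5, D_x)):
    h = gru_step(x_t, h)
```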

[Stacked RNNs, residual networks, and online, batch, and minibatch training will be temporarily left out (time limited). I will go over the following chapters first and then return here.]

## Neural Encoder-Decoder Models

### Linear Encoding Sequence

The basic idea of the encoder-decoder model is relatively simple: we have an RNN language model, but before starting calculation of the probabilities of E, we first calculate the initial state of the language model using another RNN over the source sentence F. The name “encoder-decoder” comes from the idea that the first neural network running over F “encodes” its information as a vector of real-valued numbers (the hidden state), then the second neural network used to predict E “decodes” this information into the target sentence.

In the first two lines, we look up the embedding $\boldsymbol{m}_t^{(f)}$ and calculate the encoder hidden state $\boldsymbol{h}_t^{(f)}$ for the $t$th word in the source sequence $F$. We start with an empty vector $\boldsymbol{h}_0^{(f)}=\boldsymbol{0}$, and by $\boldsymbol{h}_{|F|}^{(f)}$, the encoder has seen all the words in the source sentence. Thus, this hidden state should theoretically be able to encode all of the information in the source sentence.

In the decoder phase, we predict the probability of word $e_t$ at each time step. First, we similarly look up $\boldsymbol{m}_t^{(e)}$, but this time use the previous word $e_{t-1}$, as we must condition the probability of $e_t$ on the previous word, not on itself. Then, we run the decoder to calculate $\boldsymbol{h}_t^{(e)}$, whose initial state $\boldsymbol{h}_0^{(e)}=\boldsymbol{h}_{|F|}^{(f)}$. Finally, we calculate the probability $\boldsymbol{p}_t^{(e)}$ by using a softmax on the hidden state $\boldsymbol{h}_t^{(e)}$.

In general, given the probability model $P(E|F)$, we can generate output according to several criteria:

Random Sampling: Randomly select an output $E$ from the probability distribution $P(E|F)$. This is usually denoted $\hat{E}\sim P(E|F)$. Useful for a dialog system.

1-best Search: Find the $E$ that maximizes $P(E|F)$, denoted $\hat{E}=\underset{E}{\operatorname{argmax}}\,P(E|F)$. Useful in machine translation.

n-best Search: Find the n outputs with the highest probabilities according to $P(E|F)$.
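The whole encoder-decoder pipeline, using a per-step greedy argmax as a rough approximation of 1-best search, might be sketched like this (all sizes, names, and the choice of `</s>` index are mine, and real systems use beam search rather than this greedy loop):

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

V_f, V_e, D = 6, 6, 4  # source/target vocab sizes and hidden size (toy)
EOS = 0                # index reserved for </s> in the target vocabulary
rng = np.random.default_rng(0)
M_f = rng.normal(size=(D, V_f))  # source embeddings
M_e = rng.normal(size=(D, V_e))  # target embeddings
W_f = rng.normal(size=(D, 2 * D))
W_e = rng.normal(size=(D, 2 * D))
W_s, b_s = rng.normal(size=(V_e, D)), rng.normal(size=V_e)

def step(W, m, h):
    return np.tanh(W @ np.concatenate([m, h]))

def translate(F, max_len=10):
    h = np.zeros(D)
    for f in F:                   # encoder: read the whole source sentence
        h = step(W_f, M_f[:, f], h)
    E, prev = [], EOS             # decoder starts from the sentence boundary
    for _ in range(max_len):
        h = step(W_e, M_e[:, prev], h)
        p = softmax(W_s @ h + b_s)
        prev = int(np.argmax(p))  # greedy choice of the most probable word
        if prev == EOS:
            break
        E.append(prev)
    return E

E_hat = translate([2, 3, 5])
```

Note how the decoder's initial state is exactly the encoder's final state: the single hidden vector `h` is the only channel through which source information reaches the output.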

### Other Ways of Encoding Sequences

Reverse and bidirectional encoders, convolutional neural networks, tree-structured networks will be covered later.

### Ensembling Multiple Models

Ensembling, the combination of the predictions of multiple independently trained models to improve the overall prediction results, is widely used in encoder-decoders.

### Problems

• Long-distance dependencies between words that need to be translated into each other
• The encoder-decoder attempts to store information about sentences of arbitrary length in a hidden vector of fixed size

## Attentional Neural Machine Translation

The basic idea of attention is that we keep around vectors for every word in the input sentence, and reference these vectors at each decoding step. Because the number of vectors available to reference is equivalent to the number of words in the input sentence, long sentences will have more vectors than short sentences.

First, we create a set of vectors to use as this variable-length representation. To do so, we calculate a vector for every word in the source sentence by running an RNN in both directions:

Then, we concatenate the two vectors $\overrightarrow{\boldsymbol{h}}_j^{(f)}$ and $\overleftarrow{\boldsymbol{h}}_j^{(f)}$ into a bidirectional representation $\boldsymbol{h}_j^{(f)}=[\overleftarrow{\boldsymbol{h}}_j^{(f)};\overrightarrow{\boldsymbol{h}}_j^{(f)}]$.

We can further concatenate these vectors into a matrix $H^{(f)}$ whose columns are $\boldsymbol{h}_1^{(f)},\ldots,\boldsymbol{h}_{|F|}^{(f)}$.

The key insight of attention is that we calculate a vector $\boldsymbol{\alpha}_t$ that can be used to combine together the columns of $H$ into a vector $\boldsymbol{c}_t$: $\boldsymbol{c}_t=H^{(f)}\boldsymbol{\alpha}_t$.

The vector $\boldsymbol{\alpha}_t$ is called the attention vector, and is generally assumed to have elements that are between zero and one and sum to one.

The basic idea behind the attention vector is that it is telling us how much we are “focusing” on a particular source word at a particular time step. The larger the value in $\boldsymbol{\alpha}_t$, the more impact a word will have when predicting the next word in the output sentence.

Then how do we get $\boldsymbol{\alpha}_t$? The answer lies in the decoder RNN, which we use to track our state while generating output. The decoder's hidden state $\boldsymbol{h}_t^{(e)}$ is a fixed-length continuous vector representing the previous target words $e_1^{t-1}$, initialized as $\boldsymbol{h}_0^{(e)}=\boldsymbol{h}_{|F|+1}^{(f)}$. This is used to calculate a context vector $\boldsymbol{c}_t$ that summarizes the source attentional context used in choosing target word $e_t$, initialized as $\boldsymbol{c}_0=\boldsymbol{0}$.

First, we update the hidden state to $\boldsymbol{h}_t^{(e)}$ based on the word representation and context vector from the previous target time step

Based on this $\boldsymbol{h}_t^{(e)}$, we calculate an attention score $\boldsymbol{a}_t$, with each element equal to a score between the decoder state and one source vector, $a_{t,j}=\text{attn\_score}(\boldsymbol{h}_j^{(f)},\boldsymbol{h}_t^{(e)})$ (for example, the dot product $\boldsymbol{h}_j^{(f)}\cdot\boldsymbol{h}_t^{(e)}$; other scoring functions are possible).

We then normalize this into the actual attention vector itself by taking a softmax over the scores: $\boldsymbol{\alpha}_t=\text{softmax}(\boldsymbol{a}_t)$.

Then we can use this attention vector to weight the encoded representation $H^{(f)}$ to create a context vector $\boldsymbol{c}_t=H^{(f)}\boldsymbol{\alpha}_t$ for the current time step.

We now have a context vector $\boldsymbol{c}_t$ and hidden state $\boldsymbol{h}_t^{(e)}$ for time step t, which we can pass down to downstream tasks.
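The score-normalize-combine sequence can be sketched with dot-product attention (one of several possible scoring functions; the sizes are arbitrary):

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

D, F_len = 4, 5
rng = np.random.default_rng(0)
H_f = rng.normal(size=(2 * D, F_len))  # columns: bidirectional vectors h_j^(f)
h_e = rng.normal(size=2 * D)           # decoder hidden state h_t^(e)

a = H_f.T @ h_e     # attention scores: one dot product per source word
alpha = softmax(a)  # attention vector: non-negative, sums to one
c = H_f @ alpha     # context vector: weighted sum of the source columns
```

Because `alpha` has one entry per source word, longer sentences automatically get a larger pool of vectors to draw from, which is exactly the property the fixed-size encoder-decoder lacks.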

## Questions

• If $V$ is the size of the target vocabulary, how many possible sentences are there for a sentence of length $T$? (on page 4)

There are $V^T$ possible sentences.

• How many parameters does an n-gram model with a particular n have? (on page 6)

• What is this probability? (on page 7)