From statistical to neural - language modeling in detail

Language modeling is the task of predicting the next word or character in a document. Language models were originally developed for the problem of speech recognition; they are also widely used in other NLP applications. For example:

  • In machine translation, we convert from text in a source language to text in a target language.
  • In summarization, we convert a long document into a shorter, condensed text.
  • In dialogue systems, we convert from the user’s input (and perhaps an external knowledge base) into a text response.


Formally, the task of language modeling is to assign a probability to any sequence of words $w_1, \dots, w_n$, i.e., to estimate $P(w_1, w_2, \dots, w_n)$.
Notation: the expression $w_1^n$ means the string $w_1, w_2, \dots, w_n$. For the joint probability of each word in a sequence having a particular value, $P(W_1 = w_1, W_2 = w_2, \dots, W_n = w_n)$, we’ll use $P(w_1, w_2, \dots, w_n)$.
By using the chain rule of probability, we get

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2)\cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$

The chain rule converts the joint probability into a product of conditional probabilities. However, it does not simplify the computation: the last factor still conditions on the entire history. Therefore, language models make use of the Markov assumption, which approximates the history by just the last few words.
The bigram model, for example, approximates the probability of a word given all the previous words, $P(w_n \mid w_1^{n-1})$, by using only the conditional probability given the preceding word, $P(w_n \mid w_{n-1})$. Then we can compute the probability of a complete word sequence:

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$
To estimate these probabilities, an intuitive approach is maximum likelihood estimation (MLE): count n-grams in a corpus and normalize. For the bigram case:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

To compute the probability of an entire sentence, it is convenient to pad the beginning and end with special symbols. E.g., a special symbol <s> at the beginning of the sentence, and a special end-symbol </s>.
For the general case of MLE n-gram parameter estimation:

$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$$

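The padded bigram MLE described above can be sketched in a few lines of Python (a minimal illustration; the toy corpus and function name are made up for this example):

```python
from collections import Counter

def bigram_mle(corpus):
    """Estimate bigram MLE probabilities from a list of tokenized sentences.

    Each sentence is padded with <s> and </s>, as described above, and
    P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                 # every token that acts as a context
        bigrams.update(zip(tokens[:-1], tokens[1:])) # adjacent pairs
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

corpus = [["i", "am", "sam"], ["sam", "i", "am"], ["i", "like", "ham"]]
p = bigram_mle(corpus)
print(p[("<s>", "i")])   # C(<s> i) / C(<s>) = 2/3
```

Note that any bigram never seen in training is simply absent from the table, i.e., its MLE probability is zero — exactly the problem smoothing addresses below.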
The hyperparameter n controls the size of the context used in each conditional probability, and it exposes a bias-variance tradeoff in n-gram language models. A small n-gram size may cause high bias, while a large n-gram size can lead to high variance. Since language is full of long-range dependencies, a small n cannot capture them; but because language is also creative and varied, most long contexts occur rarely or never, so n cannot be set too large either.


We have to modify our probability estimates to handle patterns that never appeared in training but do appear in the test text. There are several approaches: add-one smoothing, backoff and interpolation, Kneser-Ney smoothing, etc.

Laplace Smoothing

Take the unigram model as an example. The unsmoothed maximum likelihood estimate of the unigram probability of a word $w_i$ is its count $c_i$ normalized by the total number of word tokens $N$: $P(w_i) = \frac{c_i}{N}$.
Laplace smoothing just adds one to each count (that’s why it’s called add-one smoothing). Since there are $V$ words in the vocabulary and each one is incremented, we also need to adjust the denominator to take into account the extra $V$ observations:

$$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$$

For bigram counts, we need to augment the unigram count in the denominator by the number of word types in the vocabulary, $V$:

$$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$

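Add-one smoothing is a small change to the counting code (a sketch; whether <s> is counted as part of the vocabulary is a convention, and here it is excluded since it never appears as an outcome):

```python
from collections import Counter

def laplace_bigram(corpus):
    """Add-one smoothed bigram model: P = (C(prev, w) + 1) / (C(prev) + V)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Vocabulary of possible outcomes: all corpus words plus the end symbol.
    vocab = set(w for s in corpus for w in s) | {"</s>"}
    V = len(vocab)
    def prob(prev, w):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return prob

prob = laplace_bigram([["i", "am", "sam"], ["sam", "i", "am"]])
print(prob("am", "i"))   # unseen bigram, yet non-zero: (0 + 1) / (2 + 4) = 1/6
```

Every bigram, seen or not, now receives non-zero probability, at the cost of moving a lot of mass away from observed events.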
Lidstone Smoothing

Laplace smoothing is a special case of Lidstone smoothing, which adds a pseudocount $\alpha$ instead of 1. The basic framework of Lidstone smoothing:

$$P_{\text{Lidstone}}(w_i) = \frac{c_i + \alpha}{N + \alpha V}$$

Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator by defining an adjusted count $c^*$:

$$c_i^* = (c_i + \alpha)\,\frac{N}{N + \alpha V}$$

Then the probability can be calculated by dividing $c_i^*$ by $N$.

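The relationship between the Lidstone probability and the adjusted count is easy to verify numerically (a sketch with made-up counts; note that $\alpha = 1$ recovers Laplace smoothing):

```python
def lidstone(count, total, vocab_size, alpha=0.1):
    """Lidstone-smoothed probability: (c + alpha) / (N + alpha * V)."""
    return (count + alpha) / (total + alpha * vocab_size)

def adjusted_count(count, total, vocab_size, alpha=0.1):
    """Adjusted ("effective") count c* = (c + alpha) * N / (N + alpha * V)."""
    return (count + alpha) * total / (total + alpha * vocab_size)

# With alpha = 1 this is exactly add-one smoothing: (3 + 1) / (100 + 50).
print(lidstone(3, 100, 50, alpha=1.0))
# Dividing c* by N gives back the same probability.
print(adjusted_count(3, 100, 50, alpha=1.0) / 100)
```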
Good-Turing Estimate

The Good-Turing estimate (Good, 1953) states that for any n-gram that occurs $r$ times, we pretend that it occurs $r^*$ times, where

$$r^* = (r + 1)\,\frac{N_{r+1}}{N_r}$$

and where $N_r$ is the number of n-grams that occur exactly $r$ times in the training data. Then the corresponding probability for an n-gram with $r$ counts is normalized by the total number of n-gram tokens $N$:

$$P_{\text{GT}} = \frac{r^*}{N}$$


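The adjusted counts $r^*$ can be computed directly from the counts-of-counts (a minimal sketch with a toy count table; real implementations smooth the $N_r$ values, which become noisy and sparse for large $r$):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Adjusted counts r* = (r + 1) * N_{r+1} / N_r."""
    freq_of_freq = Counter(ngram_counts.values())   # N_r: how many n-grams occur r times
    result = {}
    for ngram, r in ngram_counts.items():
        if freq_of_freq[r + 1] > 0:
            result[ngram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            result[ngram] = r   # keep the raw count when N_{r+1} = 0 (a common fallback)
    return result

counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 3}
adjusted = good_turing_counts(counts)
print(adjusted["a b"])   # r=1: 2 * N_2 / N_1 = 2 * 1 / 3 = 2/3
```

Singletons are discounted from 1 to 2/3 here; the freed-up mass is what gets reassigned to unseen events.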
A related way to view smoothing is as discounting (lowering) some non-zero counts in order to obtain the probability mass that will be assigned to the zero counts. The discount $d_c$ is the ratio of the discounted counts to the original counts:

$$d_c = \frac{c^*}{c}$$

Church and Gale (1991) show empirically that the average Good-Turing discount associated with n-grams is largely constant over $c$. Absolute discounting formalizes this by subtracting a fixed (absolute) discount $d$ from each count.


When using larger n-grams (trigram or more), we may have no corresponding examples in training. In a backoff n-gram model, if the n-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts.
We have to combine backoff with discounting to obtain a correct probability distribution. This kind of backoff is called Katz backoff:

$$P_{\text{BO}}(w_n \mid w_{n-N+1}^{n-1}) = \begin{cases} P^*(w_n \mid w_{n-N+1}^{n-1}) & \text{if } C(w_{n-N+1}^{n}) > 0 \\ \alpha(w_{n-N+1}^{n-1})\,P_{\text{BO}}(w_n \mid w_{n-N+2}^{n-1}) & \text{otherwise} \end{cases}$$

Here $P^*$ is the discounted probability, and the term $\alpha(w_{n-N+1}^{n-1})$ distributes the amount of probability mass that has been discounted for that context.
Katz backoff is often combined with Good-Turing. The combined Good-Turing backoff algorithm involves quite detailed computation for estimating the Good-Turing smoothing and the $P^*$ and $\alpha$ values.

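The control flow of Katz backoff is simple even though estimating its parameters is not. The skeleton below is a sketch for the bigram case, with hand-picked (hypothetical) $P^*$ and $\alpha$ values standing in for the detailed Good-Turing computation:

```python
def backoff_prob(w, context, counts, p_star, alpha, lower_order_prob):
    """Skeleton of Katz backoff for a bigram model.

    The hard part — estimating the discounted P* values and the backoff
    weights alpha — is assumed to be precomputed and passed in.
    """
    if counts.get((context, w), 0) > 0:
        return p_star[(context, w)]               # discounted higher-order estimate P*
    return alpha[context] * lower_order_prob(w)   # redistribute the leftover mass

# Toy, hand-picked numbers (illustrative only, not estimated from data):
counts = {("a", "b"): 3}
p_star = {("a", "b"): 0.7}                        # 0.3 of the mass was discounted away
alpha = {"a": 0.6}                                # scales the lower-order distribution
unigram = lambda w: 0.5                           # toy unigram probability

print(backoff_prob("b", "a", counts, p_star, alpha, unigram))  # seen bigram: 0.7
print(backoff_prob("c", "a", counts, p_star, alpha, unigram))  # unseen: 0.6 * 0.5 = 0.3
```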

In backoff, we only “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram. By contrast, in interpolation, we always mix the probability estimates from all related n-gram estimators.
In simple linear interpolation, we estimate the trigram probability by mixing together the unigram, bigram, and trigram probabilities, each weighted by a $\lambda$:

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1})$$

To ensure that the estimated probability is valid, the $\lambda$s should sum to 1: $\sum_i \lambda_i = 1$.
An elegant way to find the specific $\lambda$ values is the expectation-maximization (EM) algorithm, run on held-out data.

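The interpolation itself is a one-line weighted sum (a sketch; the probability inputs and $\lambda$ values below are made up, while in practice the $\lambda$s would be tuned on held-out data, e.g. with EM):

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Simple linear interpolation of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9   # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical component estimates for some word in context:
print(interpolate(0.01, 0.1, 0.4))   # 0.2*0.01 + 0.3*0.1 + 0.5*0.4 = 0.232
```

Unlike backoff, the lower-order estimates contribute even when the trigram has been observed, which makes the mixture robust to sparse high-order counts.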
Evaluating language modeling

Extrinsic evaluation: the best way to evaluate the performance of a language model is to embed it in an application and measure how much the application, like machine translation, improves. However, such end-to-end evaluation is hard and often expensive.
Intrinsic evaluation: intrinsic evaluation is task-neutral and can be used to quickly evaluate potential improvements in a language model.
Note that for intrinsic evaluation, we have to use held out data, which is not used during training.


Perplexity is an information-theoretic measure of how well a probability model predicts a sample. The lower the perplexity, the better. For a held-out word sequence $W = w_1 w_2 \dots w_N$:

$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$$

Some special cases:

  • In the limit of a perfect language model, probability 1 is assigned to the held-out corpus, with $PP(W) = 1$.
  • In the opposite limit, probability zero is assigned to the held-out corpus, which corresponds to an infinite perplexity, $PP(W) = \infty$.
  • Assume a uniform model, in which $P(w) = \frac{1}{|V|}$ for all words in the vocabulary. Then $PP(W) = |V|$.

The perplexity measure is a good indicator of the quality of a language model. However, in many cases improvements in perplexity do not transfer to improvements in extrinsic, task-quality scores. In that sense, perplexity is good for comparing different language models in terms of their ability to pick up regularities in sequences, but it is not a good measure for assessing progress in language understanding or language-processing tasks. An improvement in perplexity should always be confirmed by an end-to-end evaluation on a real task before drawing conclusions about the model.

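Perplexity is typically computed in log space to avoid underflow on long sequences. A minimal sketch for a bigram model (the model interface and padding convention here are assumptions; the uniform model illustrates the $PP(W) = |V|$ special case above):

```python
import math

def perplexity(model, tokens):
    """PP(W) = P(w_1 ... w_N)^(-1/N), computed via log probabilities.

    `model(prev, w)` is assumed to return P(w | prev) for a bigram model;
    the sequence is padded with <s> and </s> as earlier in these notes.
    """
    log_prob = 0.0
    prev = "<s>"
    for w in tokens + ["</s>"]:
        log_prob += math.log(model(prev, w))
        prev = w
    n = len(tokens) + 1                       # include </s> in the normalization
    return math.exp(-log_prob / n)

uniform = lambda prev, w: 1 / 10              # uniform model over a 10-word vocabulary
print(perplexity(uniform, ["a", "b", "c"]))   # a uniform model's perplexity is |V| = 10
```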

A small benchmark dataset is the Penn Treebank, which contains roughly a million tokens; its vocabulary is limited to 10,000 words, with all other tokens mapped to a special unknown-word symbol.
A larger-scale language modeling dataset is the 1B Word Benchmark, which contains roughly a billion tokens of newswire text.
I’ll complement this section after I read the relevant papers. Besides, the state-of-the-art leaderboards can be viewed here.

Neural language models

Language model pretraining


Jacob Eisenstein. 2018. Natural Language Processing (draft of November 13, 2018), pages 125-143.

Dan Jurafsky, James H. Martin. 2018. Speech and Language Processing (3rd ed. draft, Sep. 23, 2018), pages 37-62.

Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing, pages 105-113.