Language modeling is the task of predicting the next word or character in a document. Language models were originally developed for the problem of speech recognition; they are also widely used in other NLP applications. For example:

• In machine translation, we convert from text in a source language to text in a target language.
• In summarization, we convert from audio signal to text.
• In dialogue systems, we convert from the user’s input (and perhaps an external knowledge base) into a text response.

## N-gram

Formally, the task of language modeling is to assign a probability to any sequences of words $w_1^n$, i.e., to estimate $P(w_1^n)$.
Notations: the expression $w_1^n$ means the string $w_1,w_2,...,w_n$. For the joint probability of each word in a sequence having a particular value $P(X=w_1,Y=w_2,Z=w_3,...,W=w_n$, we’ll use $P(w_1,w_2,...w_n)$.
By using the chain rule of probability, we get

The chain rule converts the joint probability to the product of conditional probabilities. However, it doesn’t help to simplify the calculation. Therefore, language models make use of the markov-assumption, which approximates the history by just the last few words.
The bigram, for example, approximates the probability of a word given all the previous words $P(w_n|w_1^{n-1}$ by using only the conditional probability of the preceding word $P(w_n|w_{n-1})$, then we can compute the probability of a complete word sequence:

To estimate the probability, an intuitive way is maximum likelihood estimation(MLE).

To compute the probability of an entire sentence, it is convenient to pad the beginning and end with special symbols. E.g., a special symbol <s> at the beginning of the sentence, and a special end-symbol </s>.
For the general case of MLE n-gram parameter estimation:

The hyperpamater n controls the size of the context used in each conditional probability. The bias-variance tradeoff problem lies in n-gram language model. A small n-gram size may cause high bias, and a large n-gram size could lead to high variance. Since the language is full of long-range dependencies, it is impossible to capture those dependencies with small n. Language is creative and variant as well, so it is hard to set n too large.

## Evaluating language modeling

Extrinsic evaluation: the best way to evaluate the performance of a language model is to embed it in an application and measure how much the application, like machine translation, improves. However, such end-to-end evaluation is hard and often expensive.
Intrinsic evaluation: intrinsic evaluation is task-neutral, which can be used to quickly evaluate potential improvements in a language model.
Note that for intrinsic evaluation, we have to use held out data, which is not used during training.

### Perplexity

Perplexity is an information theoretic measurement of how well a probability model predicts a sample. The lower perplexity, the better.

Some special cases:

• In the limit of a perfect language model, probability $1$ is assigned to the held-out corpus, with $\text{Perplex}(\boldsymbol{w})=2^{-\frac{1}{n}\log_{2}1}=2^0=1$
• In the opposite limit, probability zero is assigned to the held-out corpus, which corresponds to an infinite perplexity, $\text{Perplex}(\boldsymbol{w})=2^{-\frac{1}{n}\log_{2}1}=2^\infty=\infty$
• Assume a uniform, uniform model in which $\text{p}(w_i)=\frac{1}{V}$ for all words in the vocabulary. Then,

The perplexity measure is a good indicator of the quality of a language model. However, in many cases improvement in perplexity scores do not transfer to improvement in extrinsic, task-quality scores. In that sense, the perplexity measure is good for comparing different language models in terms of their ability to pick-up regularities in sequences, but is not a good measure for assessing progress in language understanding or language-processing tasks. A model’s improvement in perplexity should always be confirmed by an end-to-end evaluation of a real task before concluding the evaluation of the model.

### Datasets

A small benchmark dataset is the Penn Treebank, which contains roughly a million tokens; its vocabulary is limited to 10,000 words, with all other tokens mapped a special $\langle \text{UNK} \rangle$ symbol.
A larger-scale language modeling dataset is the 1B word Benchmark, which contains text from Wikipedia.
I’ll complement this section after I read the relevant papers. Besides, the state-of-the-art leaderboards can be viewed here.

## References

Jacob Eisenstein. 2018. Natural Language Processing (draft of November 13, 2018). https://github.com/jacobeisenstein/gt-nlp-class/tree/master/notes, pages 125-143.

Dan Jurafsky, James H. Martin. 2018. Speech and Language Processing (3rd ed. draft, Sep. 23, 2018). https://web.stanford.edu/%7Ejurafsky/slp3/, pages 37-62.

Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Pages 105-113.