# Week One

## Introduction

What is natural language processing?
Computers using natural language as input and/or output.
NLP comprises two subfields: natural language understanding (NLU) and natural language generation (NLG).

### Applications

Machine Translation
Translate from one language to another.

Information Extraction
Take some text as input, and produce a structured (database) representation of some key content in the text.
Goal: map a document collection to a structured database.
Motivation: complex searches, statistical queries.

Text Summarization
Take a single document, or potentially a group of several documents, and try to condense them down to summarize main information in those documents.

Dialogue Systems
Humans can interact with computers using natural language.

### Basic NLP Problems

#### Tagging

Strings to Tagged Sequences
Examples:
Part-of-speech tagging
Profits(/N) soared(/V) at(/P) Boeing(/N) Co.(/N) .(/.) easily(/ADV) topping(/V) forecasts(/N) on (/P) Wall(/N) Street(/N) .(/.)

Named Entity Recognition
Profits(/NA) soared(/NA) at(/NA) Boeing(/SC) Co.(/CC) .(/NA) easily(/NA) topping(/NA) forecasts(/NA) on(/NA) Wall(/SL) Street(/CL) .(/NA)
/NA: not any entity
/SC: start of company
/CC: continuation of company
/SL: start of location
/CL: continuation of location

### Challenges

At last, a computer that understands you like your mother.

#### Ambiguity

This sentence can be interpreted in different ways:
1. It understands you as well as your mother understands you.
2. It understands (that) you like your mother.
3. It understands you as well as it understands your mother.

Ambiguity at Many Levels:
Acoustic level
In speech recognition:
1. “… a computer that understands you like your mother”
2. “… a computer that understands lie cured mother”

Syntactic level
The same string can have different parse structures, e.g., different attachments for the phrase like your mother.

Semantic level
A word may have a variety of meanings and it may cause word sense ambiguity.

Discourse (multi-clause) level
Alice says they’ve built a computer that understands you like your mother.
But she…
… doesn’t know any details.
… doesn’t understand me at all.
This is an instance of anaphora, where she refers to some other discourse entity.

## Language Modeling

· We have some (finite) vocabulary, say $\mathcal{V}=\{\text{the, a, man, telescope, Beckham, two, ...}\}$

· We have an (infinite) set of strings, $\mathcal{V}^\dagger$:

the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP

· We have a training sample of example sentences in English
· We need to “learn” a probability distribution p from this sample; i.e., p is a function that satisfies

$$p(x_1...x_n)\ge0\ \text{for all}\ x_1...x_n\in\mathcal{V}^\dagger,\qquad\sum_{x_1...x_n\in\mathcal{V}^\dagger}p(x_1...x_n)=1$$

Definition: A language model consists of a finite set $\mathcal{V}$, and a function $p(x_1,x_2,...,x_n)$ such that:
1. For any $x_1...x_n\in\mathcal{V}^\dagger$, $p(x_1,x_2,...,x_n)\ge 0$
2. In addition, $\sum_{x_1...x_n\in\mathcal{V}^\dagger}p(x_1,x_2,...,x_n)=1$

Language models are very useful in a broad range of applications, the most obvious perhaps being speech recognition and machine translation. In many applications it is very useful to have a good “prior” distribution $p(x_1...x_n)$ over which sentences are or aren’t probable in a language. For example, in speech recognition the language model is combined with an acoustic model that models the pronunciation of different words: one way to think about it is that the acoustic model generates a large number of candidate sentences, together with probabilities; the language model is then used to reorder these possibilities based on how likely they are to be a sentence in the language.
The techniques we describe for defining the function p, and for estimating the parameters of the resulting model from training examples, will be useful in several other contexts during the course: for example in hidden Markov models and in models for natural language parsing.

### A Naive Method

We have N training sentences.
For any sentence $x_1...x_n$, $c(x_1...x_n)$ is the number of times the sentence is seen in our training data. A natural estimate is then

$$p(x_1...x_n)=\frac{c(x_1...x_n)}{N}$$

This is a poor model: in particular it will assign probability 0 to any sentence not seen in the training corpus. Thus it fails to generalize to sentences that have not been seen in the training data. The key technical contribution of this chapter will be to introduce methods that do generalize to sentences that are not seen in our training data.
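As a concrete illustration, the naive count-based estimate can be sketched in a few lines of Python (a toy implementation; the `naive_lm` helper and the tiny training set are assumed for illustration only):

```python
from collections import Counter

def naive_lm(train):
    """Estimate p(sentence) = c(sentence) / N from a list of sentences.

    Assigns probability 0 to any sentence not seen in training --
    exactly the failure to generalize described above.
    """
    counts = Counter(tuple(s) for s in train)
    n = len(train)
    return lambda sentence: counts[tuple(sentence)] / n

train = [["the", "dog", "STOP"], ["the", "dog", "STOP"], ["a", "cat", "STOP"]]
p = naive_lm(train)
# p(["the", "dog", "STOP"]) == 2/3, but any unseen sentence gets probability 0
```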

### Markov Models

First-Order Markov Processes

$$P(X_1=x_1,...,X_n=x_n)=P(X_1=x_1)\prod_{i=2}^nP(X_i=x_i|X_1=x_1,...,X_{i-1}=x_{i-1})=P(X_1=x_1)\prod_{i=2}^nP(X_i=x_i|X_{i-1}=x_{i-1})$$

The first step is exact: by the chain rule of probabilities, any distribution $P(X_1=x_1...X_n=x_n)$ can be written in this form. So we have made no assumptions in this step of the derivation. However, the second step is not necessarily exact: we have made the assumption that for any $i\in\{2...n\}$, for any $x_1...x_i$,

$$P(X_i=x_i|X_1=x_1...X_{i-1}=x_{i-1})=P(X_i=x_i|X_{i-1}=x_{i-1})$$

This is a first-order Markov assumption. We have assumed that the identity of the i’th word in the sequence depends only on the identity of the previous word, $x_{i-1}$. More formally, we have assumed that the value of $X_i$ is conditionally independent of $X_1...X_{i-2}$, given the value of $X_{i-1}$.

Second-Order Markov Processes

$$P(X_1=x_1,...,X_n=x_n)=\prod_{i=1}^nP(X_i=x_i|X_{i-2}=x_{i-2},X_{i-1}=x_{i-1})$$

(For convenience we assume $x_0=x_{-1}=*$, where * is a special “start” symbol.)

Compared with the first-order Markov process, we make a slightly weaker assumption, namely that each word depends on the previous two words in the sequence. The second-order Markov process will form the basis of trigram language models.

The length of sequence n can itself vary. We just assume that the n‘th word in the sequence, $X_n$ is always equal to a special symbol, the STOP symbol. This symbol can only appear at the end of a sequence and it doesn’t belong to the set $\mathcal{V}$.

Process:

1. Initialize i = 1, and $x_0=x_{-1}=*$
2. Generate $x_i$ from the distribution $P(X_i=x_i|X_{i-2}=x_{i-2},X_{i-1}=x_{i-1})$
3. If $x_i=STOP$ then return the sequence $x_1...x_i$. Otherwise, set $i=i+1$ and return to step 2.
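The three-step process above can be sketched as a small sampler (a minimal sketch; the deterministic toy distribution `q` is an assumed example, not part of the notes):

```python
import random

def generate(q, vocab, max_len=50):
    """Sample a sequence from a second-order Markov process.

    q(w, u, v) gives P(X_i = w | X_{i-2} = u, X_{i-1} = v);
    vocab must include the STOP symbol.
    """
    seq, u, v = [], "*", "*"
    for _ in range(max_len):
        words = list(vocab)
        weights = [q(w, u, v) for w in words]
        w = random.choices(words, weights=weights)[0]
        seq.append(w)
        if w == "STOP":
            return seq
        u, v = v, w  # shift the two-word history
    return seq

# A toy distribution that deterministically emits "the", then "dog", then STOP:
def q(w, u, v):
    target = {("*", "*"): "the", ("*", "the"): "dog", ("the", "dog"): "STOP"}
    return 1.0 if target.get((u, v)) == w else 0.0

print(generate(q, ["the", "dog", "STOP"]))  # ['the', 'dog', 'STOP']
```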

### Trigram

A trigram language model consists of a finite set $\mathcal{V}$, and a parameter $q(w|u,v)$ for each trigram u,v,w such that $w\in\mathcal{V}\cup\{STOP\}$, and $u,v\in\mathcal{V}\cup\{*\}$. The value for $q(w|u,v)$ can be interpreted as the probability of seeing the word w immediately after the bigram (u,v). For any sentence $x_1...x_n$ where $x_i\in\mathcal{V}$ for i = 1…(n-1), and $x_n=STOP$, the probability of the sentence under the trigram language model is $p(x_1...x_n)=\prod_{i=1}^nq(x_i|x_{i-2},x_{i-1})$, where we define $x_0=x_{-1}=*$.
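Under this definition, computing a sentence's probability is a single pass over its words, multiplying one trigram parameter per position. A minimal sketch (the parameter dict `q` holds assumed toy values):

```python
def sentence_prob(q, sentence):
    """p(x_1...x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}) with x_0 = x_{-1} = *.

    `sentence` must end with STOP; q maps (u, v, w) -> q(w|u,v).
    Unseen trigrams default to probability 0.
    """
    p, u, v = 1.0, "*", "*"
    for w in sentence:
        p *= q.get((u, v, w), 0.0)
        u, v = v, w
    return p

q = {("*", "*", "the"): 1.0, ("*", "the", "dog"): 0.5, ("the", "dog", "STOP"): 0.4}
p = sentence_prob(q, ["the", "dog", "STOP"])  # 1.0 * 0.5 * 0.4 = 0.2
```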

Estimation Problem:
A natural estimate (the “maximum likelihood estimate”):

$$q_{ML}(w|u,v)=\frac{c(u,v,w)}{c(u,v)}$$

where $c(u,v,w)$ is the number of times the trigram $(u,v,w)$ is seen in the training corpus, and $c(u,v)$ is the number of times the bigram $(u,v)$ is seen.

For example:

$$q_{ML}(\text{laughs}|\text{the, dog})=\frac{c(\text{the, dog, laughs})}{c(\text{the, dog})}$$
This way of estimating parameters runs into a very serious issue. Say our vocabulary size is $N=|\mathcal{V}|$, then there are $N^3$ parameters in the model. This leads to two problems:
· Many of the above estimates will be $q(w|u,v)=0$, due to the count in the numerator being 0. This will lead to many trigram probabilities being systematically underestimated: it seems unreasonable to assign probability 0 to any trigram not seen in training data, given that the number of parameters of the model is typically very large in comparison to the number of words in the training corpus.
· In cases where the denominator $c(u,v)$ is equal to zero, the estimate is not well defined.

Perplexity
We have some test data sentences $x^{(1)},x^{(2)},...,x^{(m)}$. Note that test sentences are “held out”, in the sense that they are not part of the corpus used to estimate the language model. (Just the same as normal machine learning problems.)
We could look at the probability under our model, $\prod_{i=1}^mp(x^{(i)})$.
More conveniently, the log probability:

$$\log\prod_{i=1}^mp(x^{(i)})=\sum_{i=1}^m\log p(x^{(i)})$$

In fact the usual evaluation measure is perplexity:

$$\text{Perplexity}=2^{-l}\quad\text{where}\quad l=\frac{1}{M}\sum_{i=1}^m\log_2 p(x^{(i)})$$

and M is the total number of words in the test data, $M=\sum_{i=1}^mn_i$, where $n_i$ is the length of the i‘th test sentence.
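The perplexity computation can be sketched directly (the toy model assigning every word probability 1/4 is an assumed example; note that a model uniform over a vocabulary of size N has perplexity exactly N):

```python
import math

def perplexity(logp, test_sentences):
    """Perplexity = 2^{-l}, where l = (1/M) * sum_i log2 p(x^{(i)})
    and M is the total number of words in the test data.

    `logp` maps a sentence (list of words, ending in STOP) to log2 p(sentence).
    """
    M = sum(len(s) for s in test_sentences)
    l = sum(logp(s) for s in test_sentences) / M
    return 2 ** (-l)

# Toy model: every word gets probability 1/4 independently.
logp = lambda s: len(s) * math.log2(0.25)
print(perplexity(logp, [["the", "dog", "STOP"], ["a", "cat", "STOP"]]))  # 4.0
```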

### Linear Interpolation

The trigram, bigram and unigram maximum-likelihood estimates are

$$q_{ML}(w|u,v)=\frac{c(u,v,w)}{c(u,v)},\quad q_{ML}(w|v)=\frac{c(v,w)}{c(v)},\quad q_{ML}(w)=\frac{c(w)}{c()}$$

The subscript ML means maximum-likelihood estimation. $c(w)$ is the number of times word w is seen in the training corpus, and $c()$ is the total number of words seen in the training corpus.
The trigram, bigram and unigram estimates have different strengths and weaknesses. The idea in linear interpolation is to use all three estimates, by taking a weighted average of the three:

$$q(w|u,v)=\lambda_1\times q_{ML}(w|u,v)+\lambda_2\times q_{ML}(w|v)+\lambda_3\times q_{ML}(w)$$

Here $\lambda_1$, $\lambda_2$ and $\lambda_3$ are three additional parameters of the model, which satisfy $\lambda_1+\lambda_2+\lambda_3=1$ and $\lambda_i\ge0$ for all i.
(Our estimate correctly defines a distribution.)
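The interpolated estimate can be sketched as follows (a toy implementation; the `counts` dictionary keyed by n-gram tuples, with the empty tuple `()` holding the total word count, is an assumed representation, and zero-count denominators simply contribute 0 here):

```python
def interpolated_q(w, u, v, counts, lambdas):
    """Linearly interpolated trigram estimate:
    q(w|u,v) = l1 * c(u,v,w)/c(u,v) + l2 * c(v,w)/c(v) + l3 * c(w)/c().
    """
    l1, l2, l3 = lambdas

    def ml(num_key, den_key):
        # maximum-likelihood ratio, treated as 0 when the denominator is 0
        den = counts.get(den_key, 0)
        return counts.get(num_key, 0) / den if den > 0 else 0.0

    return (l1 * ml((u, v, w), (u, v))
            + l2 * ml((v, w), (v,))
            + l3 * ml((w,), ()))

counts = {(): 10, ("dog",): 2, ("the",): 3, ("the", "dog"): 1,
          ("*", "the"): 1, ("*", "the", "dog"): 1}
# 1/3 * (1/1) + 1/3 * (1/3) + 1/3 * (2/10)
print(interpolated_q("dog", "*", "the", counts, (1/3, 1/3, 1/3)))
```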

There are various ways of estimating the λ values. A common one is as follows. Say we have some additional held-out data (development data), which is separate from both our training and test corpora.
Define $c'(u,v,w)$ to be the number of times that the trigram (u,v,w) is seen in the development data, and define

$$L(\lambda_1,\lambda_2,\lambda_3)=\sum_{u,v,w}c'(u,v,w)\log q(w|u,v)$$

We would like to choose our λ values to maximize $L(\lambda_1,\lambda_2,\lambda_3)$. (This is equivalent to minimizing perplexity on the development data.)

The three parameters $\lambda_1,\lambda_2,\lambda_3$ can be interpreted as an indication of the confidence, or weight, placed on each of the trigram, bigram and unigram estimates.

In practice, it is important to add an additional degree of freedom, by allowing the values for $\lambda_1,\lambda_2,\lambda_3$ to vary.
Take a function $\Pi$ that partitions histories, e.g.,

$$\Pi(w_{i-2},w_{i-1})=\begin{cases}1&\text{if }c(w_{i-2},w_{i-1})=0\\2&\text{if }1\le c(w_{i-2},w_{i-1})\le 2\\3&\text{if }3\le c(w_{i-2},w_{i-1})\le 5\\4&\text{otherwise}\end{cases}$$

Introduce a dependence of the λ’s on the partition:

$$q(w_i|w_{i-2},w_{i-1})=\lambda_1^{\Pi(w_{i-2},w_{i-1})}\times q_{ML}(w_i|w_{i-2},w_{i-1})+\lambda_2^{\Pi(w_{i-2},w_{i-1})}\times q_{ML}(w_i|w_{i-1})+\lambda_3^{\Pi(w_{i-2},w_{i-1})}\times q_{ML}(w_i)$$
where $\lambda_1^{\Pi(w_{i-2},w_{i-1})}+\lambda_2^{\Pi(w_{i-2},w_{i-1})}+\lambda_3^{\Pi(w_{i-2},w_{i-1})}=1$, and $\lambda_i^{\Pi(w_{i-2},w_{i-1})}\ge0$ for all i.

By the way, $\lambda_1$ should be 0 if $c(u,v)=0$, because in this case the trigram estimate $q_{ML}(w|u,v)=\frac{c(u,v,w)}{c(u,v)}$ is undefined. Similarly, if both $c(u,v)$ and $c(v)$ are equal to zero, we need $\lambda_1=\lambda_2=0$.

In the referenced notes, a simple method is introduced:

$$\lambda_1=\frac{c(u,v)}{c(u,v)+\gamma},\quad\lambda_2=(1-\lambda_1)\times\frac{c(v)}{c(v)+\gamma},\quad\lambda_3=1-\lambda_1-\lambda_2$$

where γ>0 is a hyperparameter of the method.
Under this definition, it can be seen that $\lambda_1$ increases as $c(u,v)$ increases, and similarly that $\lambda_2$ increases as $c(v)$ increases. This method is relatively crude, and is not likely to be optimal. It is, however, very simple, and in practice it can work well in some applications.
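This recipe is easy to implement directly (a minimal sketch; the function name `smoothing_lambdas` is illustrative):

```python
def smoothing_lambdas(c_uv, c_v, gamma):
    """Count-based lambdas:
    lambda1 = c(u,v) / (c(u,v) + gamma),
    lambda2 = (1 - lambda1) * c(v) / (c(v) + gamma),
    lambda3 = 1 - lambda1 - lambda2.
    """
    l1 = c_uv / (c_uv + gamma)
    l2 = (1 - l1) * c_v / (c_v + gamma)
    l3 = 1 - l1 - l2
    return l1, l2, l3

l1, l2, l3 = smoothing_lambdas(c_uv=0, c_v=5, gamma=1.0)
# c(u,v) = 0 forces lambda1 = 0, as required when the trigram estimate is undefined
```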

### Discounting Methods

For any bigram $(w_{i-1},w)$ such that $Count(w_{i-1},w)>0$, we define the discounted count as

$$Count^*(w_{i-1},w)=Count(w_{i-1},w)-\beta$$

where β is a value between 0 and 1 (a typical value might be β=0.5). Thus we simply subtract a constant value, β, from the count. This reflects the intuition that if we take counts from the training corpus, we will systematically over-estimate the probability of bigrams seen in the corpus (and under-estimate bigrams not seen in the corpus).
For any bigram $(w_{i-1},w)$ such that $Count(w_{i-1},w)>0$, we can then define $q(w|w_{i-1})=\frac{Count^*(w_{i-1},w)}{Count(w_{i-1})}$. For any context $w_{i-1}$, this definition leads to some missing probability mass, defined as $\alpha(w_{i-1})=1-\sum_{w:Count(w_{i-1},w)>0}\frac{Count^*(w_{i-1},w)}{Count(w_{i-1})}$
The intuition behind discounted methods is to divide this “missing mass” between the words w such that $Count(w_{i-1},w)=0$.
Define two sets

$$\mathcal{A}(w_{i-1})=\{w:Count(w_{i-1},w)>0\},\quad\mathcal{B}(w_{i-1})=\{w:Count(w_{i-1},w)=0\}$$

Then the back-off model is defined as

$$q_{BO}(w_i|w_{i-1})=\begin{cases}\dfrac{Count^*(w_{i-1},w_i)}{Count(w_{i-1})}&\text{if }w_i\in\mathcal{A}(w_{i-1})\\[2ex]\alpha(w_{i-1})\times\dfrac{q_{ML}(w_i)}{\sum_{w\in\mathcal{B}(w_{i-1})}q_{ML}(w)}&\text{if }w_i\in\mathcal{B}(w_{i-1})\end{cases}$$
Thus if $Count(w_{i-1},w)>0$ we return the estimate $\frac{Count^*(w_{i-1},w_i)}{Count(w_{i-1})}$; otherwise we divide the remaining probability mass $\alpha(w_{i-1})$ in proportion to the unigram estimate $q_{ML}(w)$.
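A bigram back-off model following these definitions can be sketched as follows (a toy implementation; it recomputes the missing mass on every query rather than caching it, and uses the unigram count of the context word as $Count(w_{i-1})$):

```python
from collections import Counter

def katz_bigram(bigrams, unigrams, beta=0.5):
    """Katz back-off bigram estimate with absolute discounting.

    bigrams: Counter of (w_prev, w) pairs; unigrams: Counter of words.
    Returns a function q(w, w_prev) for q(w | w_prev).
    """
    total = sum(unigrams.values())
    q_ml = lambda w: unigrams[w] / total  # unigram ML estimate

    def q(w, w_prev):
        if bigrams[(w_prev, w)] > 0:
            # discounted estimate for seen bigrams
            return (bigrams[(w_prev, w)] - beta) / unigrams[w_prev]
        # divide the missing mass among unseen words, proportional to q_ml
        seen = {x for (prev, x) in bigrams if prev == w_prev}
        alpha = 1 - sum((bigrams[(w_prev, x)] - beta) / unigrams[w_prev]
                        for x in seen)
        denom = sum(q_ml(x) for x in unigrams if x not in seen)
        return alpha * q_ml(w) / denom

    return q

bigrams = Counter({("the", "dog"): 2, ("the", "cat"): 1})
unigrams = Counter({"the": 3, "dog": 2, "cat": 1})
q = katz_bigram(bigrams, unigrams)
# q("dog", "the") = (2 - 0.5)/3 = 0.5; the missing mass goes to unseen words
```

Note that for any fixed context the estimates still sum to one, which is the point of redistributing (rather than discarding) the discounted mass.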

The method can be generalized to trigram language models in a natural, recursive way. Define $\mathcal{A}(w_{i-2},w_{i-1})=\{w:Count(w_{i-2},w_{i-1},w)>0\}$ and $\mathcal{B}(w_{i-2},w_{i-1})=\{w:Count(w_{i-2},w_{i-1},w)=0\}$; then

$$q_{BO}(w_i|w_{i-2},w_{i-1})=\begin{cases}\dfrac{Count^*(w_{i-2},w_{i-1},w_i)}{Count(w_{i-2},w_{i-1})}&\text{if }w_i\in\mathcal{A}(w_{i-2},w_{i-1})\\[2ex]\alpha(w_{i-2},w_{i-1})\times\dfrac{q_{BO}(w_i|w_{i-1})}{\sum_{w\in\mathcal{B}(w_{i-2},w_{i-1})}q_{BO}(w|w_{i-1})}&\text{if }w_i\in\mathcal{B}(w_{i-2},w_{i-1})\end{cases}$$

where $\alpha(w_{i-2},w_{i-1})=1-\sum_{w\in\mathcal{A}(w_{i-2},w_{i-1})}\frac{Count^*(w_{i-2},w_{i-1},w)}{Count(w_{i-2},w_{i-1})}$ is the trigram missing mass: the trigram model backs off to the discounted bigram estimate $q_{BO}(w_i|w_{i-1})$.

# Week Two

## Tagging

Given the input to the tagging model (referred to as a sentence) $x_1...x_n$, use $y_1...y_n$ to denote the output of the tagging model (referred to as the state sequence or tag sequence).
This type of problem, where the task is to map a sentence $x_1...x_n$ to a tag sequence $y_1...y_n$ is often referred to as a sequence labeling problem, or a tagging problem.
We will assume that we have a set of training examples, $(x^{(i)},y^{(i)})$ for $i=1...m$, where each $x^{(i)}$ is a sentence $x_1^{(i)}...x_{n_i}^{(i)}$, and each $y^{(i)}$ is a tag sequence $y^{(i)}_1...y^{(i)}_{n_i}$(we assume that the i‘th training example is of length $n_i$). Hence $x_j^{(i)}$ is the j‘th word in the i‘th training example, and $y_j^{(i)}$ is the tag for that word. Our task is to learn a function that maps sentences to tag sequences from these training examples.

### POS Tagging

POS Tagging: Part-of-Speech tagging.
The input to the problem is a sentence. The output is a tagged sentence, where each word in the sentence is annotated with its part of speech. Our goal will be to construct a model that recovers POS tags for sentences with high accuracy. POS tagging is one of the most basic problems in NLP, and is useful in many natural language applications.

#### Tags

• D: determiner (a, an, the)
• N: noun
• V: verb
• P: preposition

#### Challenges

Ambiguity
Many words in English can take several possible parts of speech(as well as in Chinese and many other languages). A word can be a noun as well as a verb. E.g., look, result
Rare words
Some words are rare and may not be seen in the training examples. Even with say a million words of training data, there will be many words in new sentences which have not been seen in training. It will be important to develop methods that deal effectively with words which have not been seen in training data.

#### Sources of information

Local
Individual words have statistical preferences for their part of speech.
E.g., can is more likely to be a modal verb rather than a noun.

Contextual
The context has an important effect on the part of speech for a word. In particular, some sequences of POS tags are much more likely than others. If we consider POS trigrams, the sequence D N V will be frequent in English, whereas the sequence D V N is much less likely.
Sometimes these two sources of evidence are in conflict. For example, in the sentence The trash can is hard to find, the part of speech for can is a noun; however, can can also be a modal verb, and in fact it is much more frequently seen as a modal verb in English. In this sentence the context has overridden the tendency for can to be a verb as opposed to a noun.

### Named-Entity

For the named-entity problem, the input is again a sentence. The output is the sentence with entity boundaries marked. Recognizing entities such as people, locations and organizations has many applications, and named-entity recognition has been widely studied in NLP research.

Once this mapping has been performed on training examples, we can train a tagging model on these training examples. Given a new test sentence we can then recover the sequence of tags from the model, and it is straightforward to identify the entities identified by the model.

## Generative Models

Supervised Learning:
Assume training examples $(x^{(1)},y^{(1)})...(x^{(m)},y^{(m)})$, where each example consists of an input $x^{(i)}$ paired with a label $y^{(i)}$. We use $\mathcal{X}$ to refer to the set of possible inputs, and $\mathcal{Y}$ to refer to the set of possible labels. Our task is to learn a function $f:\mathcal{X}\to\mathcal{Y}$ that maps any input x to a label f(x).
One way to define the function f(x) is through a conditional model. In this approach we define a model that defines the conditional probability $p(y|x)$ for any x,y pair. The parameters of the model are estimated from the training examples. Given a new test example x, the output from the model is $f(x)=\arg\max_{y\in\mathcal{Y}}p(y|x)$
Thus we simply take the most likely label y as the output from the model. If our model $p(y|x)$ is close to the true conditional distribution of labels given inputs, the function f(x) will be close to optimal.

Generative Models
Rather than directly estimating the conditional distribution $p(y|x)$, in generative models we instead model the joint probability $p(x,y)$ over $(x,y)$ pairs. The parameters of the model $p(x,y)$ are again estimated from the training examples $(x^{(i)},y^{(i)})$ for $i=1...m$. In many cases we further decompose the probability $p(x,y)$ as $p(x,y)=p(y)p(x|y)$ and then estimate the models for $p(y)$ and $p(x|y)$ separately. These two model components have the following interpretations:
· $p(y)$ is a prior probability distribution over labels y.
· $p(x|y)$ is the probability of generating the input x, given that the underlying label is y.

Given a generative model, we can use Bayes rule to derive the conditional probability $p(y|x)$ for any $(x,y)$ pair:

$$p(y|x)=\frac{p(y)p(x|y)}{p(x)}$$

where

$$p(x)=\sum_{y\in\mathcal{Y}}p(x,y)=\sum_{y\in\mathcal{Y}}p(y)p(x|y)$$

We use Bayes rule directly in applying the joint model to a new test example. Given an input x, the output of our model, f(x), can be derived as follows:

$$f(x)=\arg\max_y p(y|x)\quad(1)$$
$$=\arg\max_y\frac{p(y)p(x|y)}{p(x)}\quad(2)$$
$$=\arg\max_y p(y)p(x|y)\quad(3)$$
Eq. 2 follows by Bayes rule. Eq. 3 follows because the denominator, $p(x)$, does not depend on y, and hence does not affect the arg max. This is convenient, because it means that we do not need to calculate $p(x)$, which can be an expensive operation.
Models that decompose a joint probability into terms $p(y)$ and $p(x|y)$ are often called noisy-channel models. Intuitively, when we see a test example x, we assume that it has been generated in two steps: first, a label y has been chosen with probability $p(y)$; second, the example x has been generated from the distribution $p(x|y)$. The model $p(x|y)$ can be interpreted as a “channel” which takes a label y as its input, and corrupts it to produce x as its output. Our task is to find the most likely label y, given that we observe x.

In summary:

• Our task is to learn a function from inputs x to labels $y=f(x)$. We assume training examples $(x^{(i)},y^{(i)})$ for $i=1...m$.
• In the noisy-channel approach, we use the training examples to estimate models $p(y)$ and $p(x|y)$. These models define a joint (generative) model: $p(x,y)=p(y)p(x|y)$
• Given a new test example x, we predict the label $f(x)=\arg\max_{y\in\mathcal{Y}}p(y)p(x|y)$. Finding the output f(x) for an input x is often referred to as the decoding problem.
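The decoding step in this summary amounts to a single argmax over labels. A minimal sketch (the spelling-correction-style labels and probability values are assumed toy numbers, not from the notes):

```python
def decode(x, labels, p_y, p_x_given_y):
    """Noisy-channel decoding: f(x) = argmax_{y} p(y) * p(x|y)."""
    return max(labels, key=lambda y: p_y[y] * p_x_given_y.get((x, y), 0.0))

# Hypothetical example: which intended word generated the observation "teh"?
p_y = {"the": 0.9, "ten": 0.1}                          # prior over labels
p_x_given_y = {("teh", "the"): 0.01, ("teh", "ten"): 0.05}  # channel model
print(decode("teh", ["the", "ten"], p_y, p_x_given_y))  # the
```

Here the strong prior on "the" (0.9 × 0.01 = 0.009) outweighs the higher channel probability of "ten" (0.1 × 0.05 = 0.005), illustrating how the two model components interact.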

Assume a finite set of words $\mathcal{V}$, and a finite set of tags $\mathcal{K}$. Define $\mathcal{S}$ to be the set of all sequence/tag-sequence pairs $\langle x_1...x_n,y_1...y_n\rangle$ such that $n\ge0$, $x_i\in\mathcal{V}$ for $i=1...n$, and $y_i\in\mathcal{K}$ for $i=1...n$. A generative tagging model is then a function p such that:

1. For any $\langle x_1...x_n,y_1...y_n\rangle\in\mathcal{S}$, $p(x_1...x_n,y_1...y_n)\ge0$
2. In addition, $\sum_{\langle x_1...x_n,y_1...y_n\rangle\in\mathcal{S}}p(x_1...x_n,y_1...y_n)=1$

Hence $p(x_1...x_n,y_1...y_n)$ is a probability distribution over pairs of sequences (i.e., a probability distribution over the set $\mathcal{S}$).
Given a generative tagging model, the function from sentences $x_1...x_n$ to tag sequences $y_1...y_n$ is defined as

$$f(x_1...x_n)=\arg\max_{y_1...y_n}p(x_1...x_n,y_1...y_n)$$
where the arg max is taken over all sequences $y_1...y_n$ such that $y_i\in\mathcal{K}for\ i\in\{1...n\}$. Thus for any input $x_1...x_n$, we take the highest probability tag sequence as the output from the model.

## Hidden Markov Models

### Trigram HMMs

A trigram HMM consists of a finite set $\mathcal{V}$ of possible words, and a finite set $\mathcal{K}$ of possible tags, together with the following parameters:
· A parameter $q(s|u,v)$ for any trigram $(u,v,s)$ such that $s\in\mathcal{K}\cup\{STOP\}$, and $u,v\in\mathcal{K}\cup\{*\}$. The value for $q(s|u,v)$ can be interpreted as the probability of seeing the tag s immediately after the bigram of tags $(u,v)$.
· A parameter $e(x|s)$ for any $x\in\mathcal{V},s\in\mathcal{K}$. The value for $e(x|s)$ can be interpreted as the probability of seeing observation x paired with state s.
Define $\mathcal{S}$ to be the set of all sequence/tag-sequence pairs $\langle x_1...x_n,y_1...y_{n+1}\rangle$ such that $n\ge0$, $x_i\in\mathcal{V}$ for $i=1...n$, $y_i\in\mathcal{K}$ for $i=1...n$, and $y_{n+1}=STOP$.
We then define the probability for any $\langle x_1...x_n,y_1...y_{n+1}\rangle\in\mathcal{S}$ as

$$p(x_1...x_n,y_1...y_{n+1})=\prod_{i=1}^{n+1}q(y_i|y_{i-2},y_{i-1})\prod_{i=1}^ne(x_i|y_i)$$

where we have assumed that $y_0=y_{-1}=*$.

E.g.
If we have n = 3, $x_1x_2x_3$ equal to the sentence the dog laughs, and $y_1y_2y_3y_4$ equal to the tag sequence D N V STOP, then

$$p(x_1...x_n,y_1...y_{n+1})=q(D|*,*)\times q(N|*,D)\times q(V|D,N)\times q(STOP|N,V)\times e(the|D)\times e(dog|N)\times e(laughs|V)$$
The quantity $q(D|*,*)\times q(N|*,D)\times q(V|D,N)\times q(STOP|N,V)$ is the prior probability of seeing the tag sequence D N V STOP, where we have used a second-order Markov model(a trigram model).
The quantity $e(the|D)\times e(dog|N)\times e(laughs|V)$ can be interpreted as the conditional probability $p(the\ dog\ laughs|D\ N\ V\ STOP)$: that is, the conditional probability $p(x|y)$ where x is the sentence the dog laughs, and y is the sequence D N V STOP.
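The joint probability in this example is a product of transition and emission parameters, which can be computed directly (the numerical values in `q` and `e` below are assumed toy numbers):

```python
def hmm_joint(q, e, words, tags):
    """p(x_1..x_n, y_1..y_{n+1}) = prod q(y_i|y_{i-2},y_{i-1}) * prod e(x_i|y_i).

    `tags` must end with STOP; q maps (y_{i-2}, y_{i-1}, y_i) -> probability,
    e maps (word, tag) -> probability.
    """
    p, u, v = 1.0, "*", "*"
    for tag in tags:                      # transition (prior) term
        p *= q.get((u, v, tag), 0.0)
        u, v = v, tag
    for word, tag in zip(words, tags):    # emission (channel) term
        p *= e.get((word, tag), 0.0)
    return p

q = {("*", "*", "D"): 1.0, ("*", "D", "N"): 1.0,
     ("D", "N", "V"): 1.0, ("N", "V", "STOP"): 1.0}
e = {("the", "D"): 0.5, ("dog", "N"): 0.2, ("laughs", "V"): 0.1}
# p(the dog laughs, D N V STOP) = 1 * 1 * 1 * 1 * 0.5 * 0.2 * 0.1 = 0.01
```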

Consider a pair of sequences of random variables $X_1...X_n$ and $Y_1...Y_n$, where n is the length of the sequences. We want to model the joint probability

$$P(X_1=x_1...X_n=x_n,Y_1=y_1...Y_n=y_n)$$
for any observation sequence $x_1...x_n$ paired with a state sequence $y_1...y_n$, where each $x_i$ is a member of $\mathcal{V}$, and each $y_i$ is a member of $\mathcal{K}$.
Define one additional variable $Y_{n+1}$ which always takes the value STOP, just as we did in variable-length Markov sequences.
The key idea in hidden Markov models is the following definition:

$$P(X_1=x_1...X_n=x_n,Y_1=y_1...Y_{n+1}=y_{n+1})=\prod_{i=1}^{n+1}q(y_i|y_{i-2},y_{i-1})\prod_{i=1}^ne(x_i|y_i)$$

We have assumed that for any i, for any values of $y_{i-2},y_{i-1},y_i$,

$$P(Y_i=y_i|Y_{i-2}=y_{i-2},Y_{i-1}=y_{i-1})=q(y_i|y_{i-2},y_{i-1})$$

and that for any value of i, for any values of $x_i$ and $y_i$,

$$P(X_i=x_i|Y_i=y_i)=e(x_i|y_i)$$
The derivation of hidden Markov models:

$$P(X_1=x_1...X_n=x_n,Y_1=y_1...Y_{n+1}=y_{n+1})=P(Y_1=y_1...Y_{n+1}=y_{n+1})\times P(X_1=x_1...X_n=x_n|Y_1=y_1...Y_{n+1}=y_{n+1})$$

This step follows from the chain rule of probabilities. The joint probability is decomposed into two terms: first, the probability of choosing tag sequence $y_1...y_{n+1}$; second, the probability of choosing the word sequence $x_1...x_n$, conditioned on the choice of tag sequence.

Consider the first term: assume the tag sequence is a second-order Markov sequence.

$$P(Y_1=y_1...Y_{n+1}=y_{n+1})=\prod_{i=1}^{n+1}q(y_i|y_{i-2},y_{i-1})$$

Consider the second term: assume that the value of the random variable $X_i$ depends only on the value of $Y_i$.

$$P(X_1=x_1...X_n=x_n|Y_1=y_1...Y_{n+1}=y_{n+1})=\prod_{i=1}^ne(x_i|y_i)$$

Stochastic process

1. Initialize $i=1$ and $y_0=y_{-1}=*$.
2. Generate $y_i$ from the distribution $q(y_i|y_{i-2},y_{i-1})$
3. If $y_i=STOP$ then return $y_1...y_i,x_1...x_{i-1}$. Otherwise, generate $x_i$ from the distribution $e(x_i|y_i)$, set $i=i+1$, and return to step 2.

### Estimating

Define $c(u,v,s)$ to be the number of times the sequence of three states $(u,v,s)$ is seen in training data. Similarly, define $c(u,v)$ to be the number of times the tag bigram$(u,v)$ is seen and $c(s)$ to be the number of times that the state $s$ is seen in the corpus.
Define $c(s\leadsto x)$ to be the number of times state $s$ is seen paired with observation $x$ in the corpus.
The maximum-likelihood estimates are

$$q(s|u,v)=\frac{c(u,v,s)}{c(u,v)},\qquad e(x|s)=\frac{c(s\leadsto x)}{c(s)}$$

In some cases it is useful to smooth the estimates of $q(s|u,v)$, using the linear-interpolation techniques described earlier:

$$q(s|u,v)=\lambda_1\times q_{ML}(s|u,v)+\lambda_2\times q_{ML}(s|v)+\lambda_3\times q_{ML}(s)$$

where $\lambda_1+\lambda_2+\lambda_3=1$ and $\lambda_i\ge0$ for all i.
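The maximum-likelihood estimates come down to simple counting over the tagged training data. A minimal sketch (no smoothing applied; the one-sentence toy corpus is an assumed example):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood estimates q(s|u,v) = c(u,v,s)/c(u,v) and
    e(x|s) = c(s ~> x)/c(s) from tagged training sentences.

    tagged_sentences: list of lists of (word, tag) pairs.
    """
    c3, c2, c1, emit = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        # pad with start symbols and the STOP tag, as in the model definition
        tags = ["*", "*"] + [t for _, t in sent] + ["STOP"]
        for i in range(2, len(tags)):
            c3[(tags[i - 2], tags[i - 1], tags[i])] += 1
            c2[(tags[i - 2], tags[i - 1])] += 1
        for w, t in sent:
            c1[t] += 1
            emit[(t, w)] += 1
    q = lambda s, u, v: c3[(u, v, s)] / c2[(u, v)] if c2[(u, v)] else 0.0
    e = lambda x, s: emit[(s, x)] / c1[s] if c1[s] else 0.0
    return q, e

tagged = [[("the", "D"), ("dog", "N"), ("laughs", "V")]]
q, e = estimate_hmm(tagged)
# with this one-sentence corpus every observed estimate is 1.0
```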
### Dealing with Low-Frequency Words

A common method is as follows:

· Step 1: Split vocabulary into two sets
Frequent words : words occurring ≥ 5 times in training
Low-frequency words : all other words
· Step 2: Map low frequency words into a small, finite set, depending on prefixes, suffixes etc.
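The two steps above can be sketched as follows (the particular pseudo-word classes `_NUM_`, `_CAP_`, `_RARE_` are an assumed illustrative choice; real systems use richer prefix/suffix-based classes):

```python
from collections import Counter

def map_rare_words(sentences, threshold=5):
    """Replace words seen fewer than `threshold` times with pseudo-words."""
    counts = Counter(w for s in sentences for w in s)

    def transform(w):
        if counts[w] >= threshold:
            return w          # frequent word: keep as-is
        if w.isdigit():
            return "_NUM_"    # low-frequency number
        if w[:1].isupper():
            return "_CAP_"    # low-frequency capitalized word
        return "_RARE_"       # any other low-frequency word

    return [[transform(w) for w in s] for s in sentences]
```

A quick usage example: in a corpus where "the" appears five times but "Boeing", "1984" and "zygote" each appear once, the latter three map to `_CAP_`, `_NUM_` and `_RARE_` respectively.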

### The Viterbi Algorithm

Problem: for an input $x_1...x_n$, find $\arg\max_{y_1...y_{n+1}}p(x_1...x_n,y_1...y_{n+1})$ where the arg max is taken over all sequences $y_1...y_{n+1}$ such that $y_i\in\mathcal{K}$ for $i=1...n$, and $y_{n+1}=STOP$.
We assume that p again takes the form $p(x_1...x_n,y_1...y_{n+1})=\prod_{i=1}^{n+1}q(y_i|y_{i-2},y_{i-1})\prod_{i=1}^ne(x_i|y_i)$

The naive brute force method would be hopelessly inefficient. Instead, we can efficiently find the highest probability tag sequence using a dynamic programming algorithm that is often called the Viterbi algorithm. The input to the algorithm is a sentence $x_1...x_n$.
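A basic dynamic-programming implementation of this idea might look as follows (a sketch without log probabilities or smoothing; the parameter dicts reuse the toy the dog laughs values from the earlier example):

```python
def viterbi(words, tags, q, e):
    """Viterbi decoding for a trigram HMM.

    pi[(k, u, v)] is the maximum probability of any tag sequence ending in
    the tag bigram (u, v) at position k; bp stores backpointers.
    q maps (y_{i-2}, y_{i-1}, y_i) -> probability; e maps (word, tag) -> probability.
    """
    n = len(words)
    K = lambda k: ["*"] if k <= 0 else tags  # possible tags at position k
    pi, bp = {(0, "*", "*"): 1.0}, {}
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best, arg = 0.0, None
                for w in K(k - 2):
                    score = (pi.get((k - 1, w, u), 0.0)
                             * q.get((w, u, v), 0.0)
                             * e.get((words[k - 1], v), 0.0))
                    if score > best:
                        best, arg = score, w
                pi[(k, u, v)], bp[(k, u, v)] = best, arg
    # take the STOP transition into account, then follow backpointers
    best, u, v = 0.0, None, None
    for uu in K(n - 1):
        for vv in K(n):
            score = pi[(n, uu, vv)] * q.get((uu, vv, "STOP"), 0.0)
            if score > best:
                best, u, v = score, uu, vv
    y = [None] * (n + 1)
    y[n - 1], y[n] = u, v
    for k in range(n - 2, 0, -1):
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:]

q = {("*", "*", "D"): 1.0, ("*", "D", "N"): 1.0,
     ("D", "N", "V"): 1.0, ("N", "V", "STOP"): 1.0}
e = {("the", "D"): 0.5, ("dog", "N"): 0.2, ("laughs", "V"): 0.1}
print(viterbi(["the", "dog", "laughs"], ["D", "N", "V"], q, e))
```

In practice log probabilities should be used to avoid numerical underflow on long sentences; the runtime here is $O(n|\mathcal{K}|^3)$, versus $O(|\mathcal{K}|^n)$ for brute force.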