COMS W4705 Natural Language Processing Note

Week One


What is natural language processing?
computers using natural language as input and/or output
NLP contains two types: understanding(NLU) and generation(NLG).


Machine Translation
Translate from one language to another.
Information Extraction
Take some text as input, and produce structured(database) representation of some key content in the text.
Goal: Map a document collection to structured database
Motivation: Complex searches, Statistical queries.

Text Summarization
Take a single document, or potentially a group of several documents, and try to condense them down to summarize main information in those documents.

Dialogue Systems
Human can interact with computer.

Basic NLP Problems


Strings to Tagged Sequences
Part-of-speech tagging
Profits(/N) soared(/V) at(/P) Boeing(/N) Co.(/N) .(/.) easily(/ADV) topping(/V) forecasts(/N) on (/P) Wall(/N) Street(/N) .(/.)

Name Entity Recognition
Profits(/NA) soared(/NA) at(/NA) Boeing(/SC) Co.(/CC) .(/NA) easily(/NA) topping(/NA) forecasts(/NA) on (/NA) Wall(/SL) Street(/CL) .(/.)
/NA: not any entity
/SC: start of company
/CC: continuation of company
/SL: start of location
/CL: continuation of location



At last, a computer that understands you like your mother.


This sentence can be interpreted in different ways:
1.It understands you as well as your mother understands you.
2.It understands (that) you like your mother.
3.It understands you as well as it understands your mother.

Ambiguity at Many Levels:
Acoustic level
In speech recognition:
1.”… a computer that understands you like your mother”
2.”… a computer that understands lie cured mother”

Syntactic level

Semantic level
A word may have a variety of meanings and it may cause word sense ambiguity.

Discourse (multi-clause) level
Alice says they’ve built a computer that understands you like your mother
But she…
… doesn’t know any details
… doesn’t understand me at all
This is an instance of anaphora, where she referees to some other discourse entity.

Language Modeling

· We have some (finite) vocabulary, say

· We have an (infinite) set of strings, :

the STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP

· We have a training sample of example sentences in English
· We need to “learn” a probability distribution p i.e., p is a function that satisfies

Definition: A language model consists of a finite set , and a function such that:
1.For any ,
2.In addition,

Language models are very useful in a broad range of applications, the most obvious perhaps being speech recognition and machine translation. In many applications it is very useful to have a good “prior” distribution over which sentences are or aren’t probable in a language. For example, in speech recognition the language model is combined with an acoustic model that models the pronunciation of different words: one way to think about it is that the acoustic model generates a large number of candidate sentences, together with probabilities; the language model is then used to reorder these possibilities based on how likely they are to be a sentence in the language.
The techniques we describe for defining the function p, and for estimating the parameters of the resulting model from training examples, will be useful in several other contexts during the course: for example in hidden Markov models and in models for natural language parsing.

A Naive Method

We have N training sentences.
For any sentence is the number of times the sentence is seen in our training data.

This is a poor model: in particular it will assign probability 0 to any sentence not seen in the training corpus. Thus it fails to generalize to sentences that have not seen in the training data. The key technical contribution of this chapter will be to introduce methods that do generalize to sentences that are not seen in our training data.

Markov Models

First-Order Markov Processes

The first step is exact: by the chain rule of probabilities, any distribution can be written in this form. So we have made no assumptions in this step of the derivation. However, the second step is not necessarily exact: we have made the assumption that for any , for any ,

This is a first-order Markov assumption. We have assumed that the identity of the i’th word in the sequence depends only on the identity of the previous word, . More formally, we have assumed that the value of is conditionally independent of , given the value of .

Second-Order Markov Processes

(For convenience we assume , where * is a special “start” symbol.)

Compared with first-order Markov process, we make a slightly weaker assumption, namely that each word depends on the previous two words in the sequence. And the second-order Markov process will form the basis of trigram language models.

The length of sequence n can itself vary. We just assume that the n‘th word in the sequence, is always equal to a special symbol, the STOP symbol. This symbol can only appear at the end of a sequence and it doesn’t belong to the set .


  1. Initialize i = 1, and
  2. Generate from the distribution
  3. If then return the sequence . Otherwise, set and return to step 2.


A trigram language model consists of a finite set , and a parameter for each trigram u,v,w such that , and . The value for can be interpreted as the probability of seeing the word w immediately after the bigram (u,v). For any sentence where for i = 1…(n-1), and , the probability of the sentence under the trigram language model is , where we define .

Estimation Problem:
A natural estimate (the “maximum likelihood estimate”):

For example:

This way of estimating parameters runs into a very serious issue. Say our vocabulary size is , then there are parameters in the model. This leads to two problems:
· Many of the above estimates will be , due to the count in the numerator being 0. This will lead to many trigram probabilities being systematically underestimated: it seems unreasonable to assign probability 0 to any trigram not seen in training data, given that the number of parameters of the model is typically very large in comparison to the number of words in the training corpus.
· In cases where the denominator is equal to zero, the estimate is not well defined.

We have some test data sentences . Note that test sentences are “held out”, in the sense that they are not part of the corpus used to estimate the language model. (Just the same as normal machine learning problems.)
We could look at the probability under our model .
More conveniently, the log probability:

In fact the usual evaluation measure is :

and M is the total number of words in the test data . is the length of the i‘th test sentence.

Linear Interpolation

The subscript ML means maximum-likelihood estimation. And is the number of times word w is seen in the training corpus, and is the total number of words seen in the training corpus.
The trigram, bigram and unigram estimates have different strengths and weaknesses. The idea in linear interpolation is to use all three estimates, by taking a weighted average of the three estimates:

Here , and are three additional parameters of the model, which satisfies and for all i.
(Our estimate correctly defines a distribution.)

There are various ways of estimating the λ values. A common one is as follows. Say we have some additional held-out data(development data), which is separate from both our training and test corpora.
Define to be the number of times that the trigram (u,v,w) is seen in the development data.

We would like to choose our λ values to maximize .(It means minimize perplexity.)

The three parameters can be interpreted as an indication of the confidence, or weight, placed on each of the trigram, bigram and unigram estimates.

In practice, it is important to add an additional degree of freedom, by allowing the values for to vary.
Take a function that partitions histories

Introduce a dependence of the λ’s on the partition:

where , and for all i.

By the way, should be 0 if , because in this case the trigram estimate is undefined. Similarly, if both and are equal to zero, we need

In the referencing note, a simple method is introduced:

Under this definition, it can be seen that increases as increases, and similarly that increases as increases. This method is relatively crude, and is not likely to be optimal. It is, however, very simple, and in practice it can work well in some applications.

Discounting Methods

For any bigram such that , we define the discounted count as

where β is a value between 0 and 1 (a typical value might be β=0.5). Thus we simply subtract a constant value, β, from the count. This reflects the intuition that if we take counts from the training corpus, we will systematically over-estimate the probability of bigrams seen in the corpus(and under-estimate bigrams not seen in the corpus).
For any bigram such that , we can then define . For any context , this definition leads to some missing probability mass, defined as
The intuition behind discounted methods is to divide this “missing mass” between the words w such that .
Define two sets

Then the back-off model is defined as

Thus if we return the estimate ; otherwise we divide the remaining probability mass in proportion to the unigram estimate .

The method can be generalized to trigram language models in a natural, recursive way:

Week Two


Given the input to the tagging model(referred as a sentence) , use to denote the output of the tagging model(referred as the state sequence or tag sequence).
This type of problem, where the task is to map a sentence to a tag sequence is often referred to as a sequence labeling problem, or a tagging problem.
We will assume that we have a set of training examples, for , where each is a sentence , and each is a tag sequence (we assume that the i‘th training example is of length ). Hence is the j‘th word in the i‘th training example, and is the tag for that word. Our task is to learn a function that maps sentences to tag sequences from these training examples.

POS Tagging

POS Tagging: Part-of-Speech tagging.
The input to the problem is a sentence. The output is a tagged sentence, where each word in the sentence is annotated with its part of speech. Our goal will be to construct a model that recovers POS tags for sentences with high accuracy. POS tagging is one of the most basic problems in NLP, and is useful in many natural language applications.


  • D: determiner (a, an, the)
  • N: noun
  • V: verb
  • P: preposition
  • Adv: adverb
  • Adj: adjective


Many words in English can take several possible parts of speech(as well as in Chinese and many other languages). A word can be a noun as well as a verb. E.g., look, result
Rare words
Some words are rare and may not be seen in the training examples. Even with say a million words of training data, there will be many words in new sentences which have not been seen in training. It will be important to develop methods that deal effectively with words which have not been seen in training data.

Sources of information

Individual words have statistical preferences for their part of speech.
E.g., can is more likely to be a modal verb rather than a noun.

The context has an important effect on the part of speech for a word. In particular, some sequences of POS tags are much more likely than others. If we consider POS trigrams, the sequence D N V will be frequent in English, whereas the sequence D V N is much less likely.
Sometimes these two sources of evidence are in conflict. For example, in the sentence The trash can is hard to find, the part of speech for can is a noun-however, can can also be a modal verb, and in fact it is much more frequently seen as a modal verb in English. In this sentence the context has overridden the tendency for can to be a verb as opposed to a noun.


For named-entity problem, the input is again a sentence. The output is the sentence with entity-boundaries marked. Recognizing entities such as people, locations and organizations has many applications, and name-entity recognition has been widely studies in NLP research.

Once this mapping has been performed on training examples, we can train a tagging model on these training examples. Given a new test sentence we can then recover the sequence of tags from the model, and it is straightforward to identify the entities identified by the model.

Generative Models

Supervised Learning:
Assume training examples , where each example consists of an input paired with a label . We use to refer to the set of possible inputs, and to refer to the set of possible labels. Our task is to learn a function that maps any input x to a label f(x).
One way to define the function f(x) is through a conditional model. In this approach we define a model that defines the conditional probability for any x,y pair. The parameters of the model are estimated from the training examples. Given a new test example x, the output from the model is
Thus we simply take the most likely label y as the output from the model. If our model is close to the true conditional distribution of labels given inputs, the function f(x) will be close to optimal.

Generative Models
Rather than directly estimating the conditional distribution , in generative models we instead model the joint probability over pairs. The parameters of the model are again estimated from the training examples for . In many cases we further decompose the probability as and then estimate the models for and separately. These two model components have the following interpretations:
· is a prior probability distribution over labels y.
· is the probability of generating the input x, given that the underlying label is y.

Given a generative model, we can use Bayes rule to derive the condition probability for any pair:


We use Bayes rule directly in applying the joint model to a new test example. Given an input x, the output of our model, f(x), can be derived as follows:

Eq. 2 follows by Bayes rule. Eq. 3 follows because the denominator, , does not depend on y, and hence does not affect the arg max. This is convenient, because it mean that we do not need calculate , which can be an expensive operation.
Models that decompose a joint probability into terms and are often called noisy-channel models. Intuitively, when we see a test example x, we assume that has been generated in two steps: first, a label y has been chosen with probability ; second, the example x has been generated from the distribution . The model can be interpreted as a “channel” which takes a label y as its input, and corrupts it to produce x as its output. Our task is to find the most likely label y, given that we observe x.

In summary:

  • Our task is to learn a function from inputs x to labels . We assume training examples for .
  • In the noisy channel approach, we use the training example to estimate models and . These models define a joint(generative) model:
  • Given a new test example x, we predict the label . Finding the output f(x) for an input x is often referred to as the decoding problem.

Assume a finite set of words , and a finite set of tags . Define to be the set of all sequence/tag-sequence pairs such that . A generative tagging model is then a function p such that:

  1. For any
  2. In addition,

Hense is a probability distribution over pairs of sequences(i.e., a probability distribution over the set ).
Given a generative tagging model, the function from sentences to tag sequence is defined as

where the arg max is taken over all sequences such that . Thus for any input , we take the highest probability tag sequence as the output from the model.

Hidden Markov Models

Trigram HMMs

A trigram HMM consists of a finite set of possible words, and a finite set of possible tags, together with the following parameters:
· A parameter for any trigram such that , and . The value for can be interpreted as the probability of seeing the tag s immediately after the bigram of tags .
· A parameter for any . The value for can be interpreted as the probability of seeing observation x paired with state s.
Define to be the set of all sequence/tag-sequence pairs such that .
We then define the probability for any as

where we have assumed that .

If we have n = 3, equal to the sentence the dog laughs, and equal to the tag sequence D N V STOP, then

The quantity is the prior probability of seeing the tag sequence D N V STOP, where we have used a second-order Markov model(a trigram model).
The quantity can be interpreted as the conditional probability : that is, the conditional probability where x is the sentence the dog laughs, and y is the sequence D N V STOP.

Consider a pair of sentences of random variables and , where n is the length of the sequences. To model the joint probability

for any observation sequence paired with a state sequence , where each is a member of , and each is a member of .
Define one additional variable which always takes the value STOP, just as we did in variable-Markov sequences.
The key idea in hidden Markov models is the following definition:

We have assumed that for any i, for any values of ,

and that for any value of i, for any values of and ,

The derivation of hidden Markov models:

Just by the chain rule of probabilities. The joint probability is decomposed into two terms: first, the probability of choosing tag sequence ; second, the probability of choosing the word sequence , conditioned on the choice of tag sequence.

Consider the first item: Assume the sequence is a second-order Markov sequence.

Consider the second item: Assume that the value for the random variable depends only on the value .

Stochastic process

  1. Initialize and .
  2. Generate from the distribution
  3. If then return . Otherwise, generate from the distribution , set , and return to step 2.


Define to be the number of times the sequence of three states is seen in training data. Similarly, define to be the number of times the tag bigram is seen and to be the number of times that the state is seen in the corpus.
Define to be the number of times state is seen paired with observation in the corpus.
The maximum-likelihood estimates are

In some cases it is useful to smooth estimates of , using the techniques of smoothing:

Dealing with Low-Frequency Words

A common method is as follows:

· Step 1: Split vocabulary into two sets
Frequent words : words occurring ≥ 5 times in training
Low-frequency words : all other words
· Step 2: Map low frequency words into a small, finite set, depending on prefixes, suffixes etc.

The Viterbi Algorithm

Problem: for an input , find where the arg max is taken over all sequences such that for , and .
We assume that p again takes the form

The naive brute force method would be hopelessly inefficient. Instead, we can efficiently find the highest probability tag sequence using a dynamic programming algorithm that is often called the Viterbi algorithm. The input to the algorithm is a sentence .