Natural Language Processing with Python and NLTK

This note is based on Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit. I’ve uploaded my exercise solutions to GitHub.

Texts and words

Texts are represented in Python as lists of words and punctuation, so we can use indexing, slicing, and the len() function on them.

Some word comparison operators: s.startswith(t), s.endswith(t), t in s, s.islower(), s.isupper(), s.isalpha(), s.isalnum(), s.isdigit(), s.istitle().
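
For example, a minimal illustration of a few of these (the strings here are made up):

```Python
s, t = 'Monstrous', 'Mon'
s.startswith(t)      # True
s.endswith('ous')    # True
t in s               # True
s.islower()          # False: it contains an uppercase letter
s.istitle()          # True: initial capital followed by lowercase
'ish2021'.isalnum()  # True: letters and digits only
```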

Searching

text1.concordance("monstrous")
A concordance view shows us every occurrence of a given word, together with some context, letting us see how the word is used.

text1.similar("monstrous")
We can find what other words appear in a similar range of contexts by appending similar to the name of the text and putting the relevant word in parentheses (the results are a bit like synonyms).

text2.common_contexts(["monstrous", "very"])
The term common_contexts allows us to examine just the contexts that are shared by two or more words.

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
We can determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot.

Fine-grained Selection of Words
The mathematical set notation {w | w ∈ V & P(w)} corresponds to the Python list comprehension `[w for w in V if p(w)]`:

```Python
V = set(text1)
long_words = [w for w in V if len(w) > 15]
```

Collocations and Bigrams
A bigram is a pair of adjacent words, written in Python as a tuple such as ('than', 'said').
A collocation is a sequence of words that occur together unusually often. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words.

```Python
from nltk import bigrams  # text4 is assumed to be available via `from nltk.book import *`
list(bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
text4.collocations()
```
### Counting
A **token** is the technical name for a sequence of characters that we want to treat as a group.
We will think of a text as nothing more than a sequence of words and punctuation. Thus we can use `len()` to compute the number of tokens.
A **word type** is the form or spelling of the word independently of its specific occurrences in a text - that is, the word considered as a unique item of vocabulary.
Since our tokens include punctuation symbols as well, we will generally call these unique items **types** instead of word types.
```Python
len(text1) # the tokens of the text
len(set(text1)) # the types of the text considering case
len(set(word.lower() for word in text1)) # without considering case
len(set(word.lower() for word in text1 if word.isalpha())) # eliminate numbers and punctuation
```

A frequency distribution tells us the frequency of each vocabulary item in the text.
FreqDist can be treated as a dictionary in Python, where the word (or word length, etc.) is the key and the number of occurrences is the corresponding value.

```Python
from nltk import FreqDist
fdist1 = FreqDist(text1)  # invoke FreqDist(), passing the text as an argument
fdist1.most_common(50)    # a list of the 50 most frequently occurring types in the text
fdist1['whale']           # the count of the given word
fdist1.hapaxes()          # the words that occur only once
fdist = FreqDist(len(w) for w in text1)  # frequency distribution of word lengths
fdist.most_common()       # with no argument, list every word length with its count
fdist.max()               # the most frequent word length
fdist[3]                  # the count for word length 3
fdist.freq(3)             # the relative frequency (proportion) of word length 3
```

Functions defined for NLTK’s frequency distributions can be found here.

Text Corpora and Lexical Resources

Accessing Text Corpora

A text corpus is a large, structured collection of texts. Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.

NLTK ships many corpora in the nltk.corpus package. To apply the functions introduced earlier (e.g., concordance()), we wrap the corpus words in an nltk.Text object:

```Python
import nltk
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))  # wrap the corpus words in an nltk.Text
emma.concordance('surprise')
```

A short program can display information about each text by looping over all the values of fileid corresponding to the gutenberg file identifiers and then computing statistics for each text:

```Python
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # raw() gives the text as one string, so this counts characters
    num_words = len(gutenberg.words(fileid))  # words() divides the text into words
    num_sents = len(gutenberg.sents(fileid))  # sents() divides the text into sentences, each a list of words
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # count unique words
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
    # average word length, average sentence length, and the number of times each
    # vocabulary item appears on average (round() rounds each ratio to the nearest integer)
```

Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre (see here for details).
We can access the corpus as a list of words, or a list of sentences. We can optionally specify particular categories or files to read.

```Python
from nltk.corpus import brown
brown.categories()
brown.fileids()
brown.words(categories='news')
brown.words(fileids='cg22')
brown.sents(categories=['news', 'editorial', 'reviews'])
```

We can use the Brown Corpus to study stylistics: systematic differences between genres.

```Python
news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
print()  # added to end the output line
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())  # lowercase the words so that capitalized occurrences are counted as well
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']  # or: genres = brown.categories()
cfd.tabulate(conditions=genres, samples=modals)
```

A conditional frequency distribution is a collection of frequency distributions, each one for a different “condition”. The condition will often be the category of the text.
A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition.

NLTK’s Conditional Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

| Example | Description |
| --- | --- |
| `cfdist = ConditionalFreqDist(pairs)` | create a conditional frequency distribution from a list of pairs |
| `cfdist.conditions()` | the conditions |
| `cfdist[condition]` | the frequency distribution for this condition |
| `cfdist[condition][sample]` | frequency for the given sample for this condition |
| `cfdist.tabulate()` | tabulate the conditional frequency distribution |
| `cfdist.tabulate(samples, conditions)` | tabulation limited to the specified samples and conditions |
| `cfdist.plot()` | graphical plot of the conditional frequency distribution |
| `cfdist.plot(samples, conditions)` | graphical plot limited to the specified samples and conditions |
| `cfdist1 < cfdist2` | test if samples in cfdist1 occur less frequently than in cfdist2 |
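
As a minimal sketch of these methods (the (condition, sample) pairs below are invented for illustration):

```Python
import nltk

pairs = [('news', 'the'), ('news', 'said'), ('news', 'the'), ('romance', 'the')]
cfd = nltk.ConditionalFreqDist(pairs)   # one FreqDist per condition
cfd.conditions()                        # ['news', 'romance']
cfd['news']                             # FreqDist({'the': 2, 'said': 1})
cfd['news']['the']                      # 2
cfd.tabulate()                          # a table of counts, one row per condition
```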

Reuters Corpus
The Reuters Corpus contains 10788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and grouped into training and test sets.
Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories.

```Python
from nltk.corpus import reuters
reuters.categories('training/9865')  # the topics covered by this document
reuters.fileids('barley')            # the documents filed under this topic
```

Some of the corpora and corpus samples distributed with NLTK are listed here. For more information, consult the NLTK HOWTOs.

Basic Corpus Functionality defined in NLTK:

| Example | Description |
| --- | --- |
| `fileids()` | the files of the corpus |
| `fileids([categories])` | the files of the corpus corresponding to these categories |
| `categories()` | the categories of the corpus |
| `categories([fileids])` | the categories of the corpus corresponding to these files |
| `raw()` | the raw content of the corpus |
| `raw(fileids=[f1,f2,f3])` | the raw content of the specified files |
| `raw(categories=[c1,c2])` | the raw content of the specified categories |
| `words()` | the words of the whole corpus |
| `words(fileids=[f1,f2,f3])` | the words of the specified fileids |
| `words(categories=[c1,c2])` | the words of the specified categories |
| `sents()` | the sentences of the whole corpus |
| `sents(fileids=[f1,f2,f3])` | the sentences of the specified fileids |
| `sents(categories=[c1,c2])` | the sentences of the specified categories |
| `abspath(fileid)` | the location of the given file on disk |
| `encoding(fileid)` | the encoding of the file (if known) |
| `open(fileid)` | open a stream for reading the given corpus file |
| `root` | the path to the root of the locally installed corpus |
| `readme()` | the contents of the README file of the corpus |
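
A quick sketch of a few of these methods, using the Gutenberg corpus as an example:

```Python
from nltk.corpus import gutenberg

gutenberg.fileids()                      # e.g. ['austen-emma.txt', 'austen-persuasion.txt', ...]
gutenberg.raw('austen-emma.txt')[:75]    # the first 75 characters of the raw text
gutenberg.words('austen-emma.txt')[:8]   # the first few word tokens
gutenberg.sents('austen-emma.txt')[1]    # one sentence, as a list of words
gutenberg.root                           # the location of the locally installed corpus
gutenberg.readme()[:60]                  # the start of the corpus README
```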

Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, vocab = sorted(set(my_text)) and word_freq = FreqDist(my_text) are both simple lexical resources.

Lexicon terminology: a lexical entry consists of a headword (lemma) together with additional information such as the part of speech and the sense definition (gloss); two distinct entries with the same spelling are homonyms.

Wordlist Corpora

```Python
# find unusual or mis-spelt words in a text corpus
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
unusual_words(nltk.corpus.nps_chat.words())
```

```Python
# word list for solving a word puzzle
puzzle_letters = nltk.FreqDist('egivrvonl')  # frequency distribution of the available letters
obligatory = 'r'                             # the center letter, which must be included
wordlist = nltk.corpus.words.words()
[w for w in wordlist if len(w) >= 6
                     and obligatory in w
                     and nltk.FreqDist(w) <= puzzle_letters]  # each letter used no more often than it is available
```

```Python
# find names which appear in both male.txt and female.txt
names = nltk.corpus.names  # which contains 'male.txt' and 'female.txt'
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]
```

```Python
# comparative wordlist - the Swadesh wordlists
from nltk.corpus import swadesh
swadesh.fileids()
fr2en = swadesh.entries(['fr', 'en'])  # French-English pairs
de2en = swadesh.entries(['de', 'en'])
es2en = swadesh.entries(['es', 'en'])
translate = dict(fr2en)                # convert into a simple dictionary
translate.update(dict(de2en))
translate.update(dict(es2en))
translate['chien']                     # 'dog'
```

The CMU Pronouncing Dictionary and Toolbox are also introduced in the book; I omit them in this note.

WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155287 words and 117659 synonym sets.

Synsets
With WordNet we can look up a word’s synsets (“synonym sets”), along with their definitions and usage examples.

```Python
# synsets
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')               # [Synset('car.n.01')]
wn.synset('car.n.01').lemma_names()  # ['car', 'auto', 'automobile', 'machine', 'motorcar']
wn.synset('car.n.01').definition()
wn.synset('car.n.01').examples()

Hyponyms and hypernyms
WordNet synsets correspond to abstract concepts, and they don’t always have corresponding words in English. These concepts are linked together in a hierarchy via lexical relations such as hyponymy (more specific concepts) and hypernymy (more general concepts).
The corresponding methods are hyponyms() and hypernyms().

```Python
# hyponyms and hypernyms
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
parent_of_motorcar = motorcar.hypernyms()
paths = motorcar.hypernym_paths()
```

Some other lexical relations
Another important way to navigate the WordNet network is from items to their components (meronyms), or to the things they are contained in (holonyms). There are three kinds of holonym-meronym relation (member, part, and substance), with corresponding methods: member_meronyms(), part_meronyms(), substance_meronyms(), member_holonyms(), part_holonyms(), substance_holonyms().
There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments. (NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet; it can be accessed with nltk.corpus.verbnet.)

```Python
# meronyms and holonyms
tree = wn.synset('tree.n.01')
tree.part_meronyms()
tree.substance_meronyms()
tree.member_holonyms()
# entailments
wn.synset('walk.v.01').entailments()
wn.synset('eat.v.01').entailments()
wn.synset('tease.v.03').entailments()
# antonymy
wn.lemma('supply.n.02.supply').antonyms()
wn.lemma('rush.v.01.rush').antonyms()
wn.lemma('horizontal.a.01.horizontal').antonyms()
wn.lemma('staccato.r.01.staccato').antonyms()
```

Semantic Similarity
Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term will match documents containing specific terms.
We can quantify the concept of generality (how specific or general a concept is) by looking up the depth of the synset.
path_similarity assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found). Comparing a synset with itself will return 1.

```Python
right_whale = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')
right_whale.lowest_common_hypernyms(minke)
right_whale.min_depth()
right_whale.path_similarity(minke)
```

Processing Raw Text

Accessing Text

From Web

```Python
# from electronic books
from urllib import request
import nltk
from nltk import word_tokenize
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8')
raw = raw[raw.find('PART I'):raw.rfind("End of Project Gutenberg’s Crime")]  # keep only the body of the book
tokens = word_tokenize(raw)
text = nltk.Text(tokens)

# HTML is handled almost the same way, except that to get text out of HTML
# (dropping markup symbols such as << >> |) we use the BeautifulSoup library,
# available from http://www.crummy.com/software/BeautifulSoup/
from bs4 import BeautifulSoup
# html is assumed to hold an HTML string, e.g. request.urlopen(some_url).read().decode('utf8')
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)

# with the Universal Feed Parser library, available from
# https://pypi.python.org/pypi/feedparser, we can access the content of a blog
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
```

From local files

```Python
f = open('document.txt')
f.read()  # read the contents of the entire file as a single string
f = open('document.txt', 'rU')  # after read() the file pointer is at the end, so reopen (or use f.seek(0))
# 'r' means open the file for reading (the default),
# and 'U' stands for 'Universal', which lets us ignore the different
# conventions used for marking newlines ('U' is deprecated in Python 3)
for line in f:  # read line by line
    print(line.strip())
```

ASCII text and HTML text are human-readable formats. Text often comes in binary formats, such as PDF and MS Word, that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats. Extracting text from multi-column documents is particularly challenging. For once-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described above. If the document is already on the web, you can enter its URL in Google’s search box; the search result often includes a link to an HTML version of the document, which you can save as text.

Regular Expression

I’ve uploaded my summary of regular expressions in the post Regular Expression.

```Python
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
[w for w in wordlist if re.search('ed$', w)]  # words ending in 'ed'
```

NLTK provides a regular expression tokenizer: nltk.regexp_tokenize().
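
A small sketch of how it can be used, adapted from the book’s tokenizer example (the non-capturing groups are my adjustment so that whole matches, not just the grouped parts, are returned):

```Python
import nltk

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)            # verbose regexp: allows whitespace and comments
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*            # words, with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?      # currency amounts and percentages, e.g. $12.40, 82%
    | \.\.\.                  # ellipsis
'''
nltk.regexp_tokenize(text, pattern)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```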

Structured Program

The basics are discussed in my Python learning note. In this section I just record some points that were unfamiliar to me.

Assignment always copies the value of an expression, but a value is not always what you might expect it to be. In particular, the “value” of a structured object such as a list is actually just a reference to the object.
Python provides two ways to check that a pair of items are the same. The is operator tests for object identity, while == tests whether two objects have equal values.
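
A small sketch of the difference between copying a reference and copying a value:

```Python
foo = ['Monty', 'Python']
bar = foo            # copies the reference, not the list itself
foo[1] = 'Bodkin'
bar                  # ['Monty', 'Bodkin']: both names point to the same list
bar is foo           # True: the very same object
baz = foo[:]         # slicing builds a new list with the same contents
baz == foo           # True: equal values
baz is foo           # False: distinct objects
```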

A list is typically a sequence of objects all having the same type, of arbitrary length. We often use lists to hold sequences of words. In contrast, a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a collection of different fields relating to some entity.
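
For instance (the record fields here are invented):

```Python
words = ['walk', 'walked', 'walking']    # a list: same-type items, arbitrary length
entry = ('walk', 'VB', 'move on foot')   # a tuple: a fixed-length record of different fields
word, pos, gloss = entry                 # tuples are often unpacked into named fields
```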

Generator Expression:

```Python
>>> max([w.lower() for w in word_tokenize(text)])  # [1]
'word'
>>> max(w.lower() for w in word_tokenize(text))    # [2]
'word'
```

The second line uses a generator expression. This is more than a notational convenience: in many language processing situations, generator expressions will be more efficient. In [1], storage for the list object must be allocated before the value of max() is computed. If the text is very large, this could be slow. In [2], the data is streamed to the calling function. Since the calling function simply has to find the maximum value (the word which comes latest in lexicographic sort order), it can process the stream of data without having to store anything more than the maximum value seen so far.

```Python
def search1(substring, words):      # builds the full result list in memory
    result = []
    for word in words:
        if substring in word:
            result.append(word)
    return result

def search2(substring, words):      # a generator function: yields matches one at a time
    for word in words:
        if substring in word:
            yield word
```
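
A quick usage sketch (the inputs are made up): search1 returns the complete list at once, while search2 returns a generator that yields matches lazily.

```Python
>>> search1('zz', ['buzz', 'fizz', 'jazz', 'ham'])
['buzz', 'fizz', 'jazz']
>>> search2('zz', ['buzz', 'fizz', 'jazz', 'ham'])
<generator object search2 at 0x...>
>>> list(search2('zz', ['buzz', 'fizz', 'jazz', 'ham']))
['buzz', 'fizz', 'jazz']
```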

Function
A function is not required to have any parameters.
A function usually communicates its results back to the calling program via the return statement.
A Python function is not required to have a return statement. Some functions do their work as a side effect, printing a result, modifying a file, or updating the contents of a parameter to the function (such functions are called “procedures” in some other programming languages).

```Python
>>> def my_sort1(mylist):      # good: modifies its argument, no return value
...     mylist.sort()
>>> def my_sort2(mylist):      # good: doesn't touch its argument, returns value
...     return sorted(mylist)
>>> def my_sort3(mylist):      # bad: modifies its argument and also returns it
...     mylist.sort()
...     return mylist
```

When you refer to an existing name from within the body of a function, the Python interpreter first tries to resolve the name with respect to the names that are local to the function. If nothing is found, the interpreter checks if it is a global name within the module. Finally, if that does not succeed, the interpreter checks if the name is a Python built-in. This is the so-called LGB rule of name resolution: local, then global, then built-in.
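
A tiny sketch of the LGB rule (the names here are invented):

```Python
count = 1                      # a global name

def tally(words):
    total = len(words)         # 'total' and 'words' are local; 'len' resolves to the built-in
    return total + count       # 'count' is not local, so the global binding is used

tally(['colorless', 'green', 'ideas'])   # returns 4
```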

Python also lets us pass a function as an argument to another function. Now we can abstract out the operation, and apply a different operation on the same data.

```Python
>>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
...         'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
>>> def extract_property(prop):
...     return [prop(word) for word in sent]
...
>>> extract_property(len)
[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]
>>> def last_letter(word):
...     return word[-1]
>>> extract_property(last_letter)
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']
```