# Natural Language Processing with Python and NLTK

This note is based on *Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit*. I've uploaded my solutions to the exercises to GitHub.

## Texts and words

Texts are represented in Python as lists of tokens, so we can use indexing, slicing, and the `len()` function.

Some word comparison operators: `s.startswith(t)`, `s.endswith(t)`, `t in s`, `s.islower()`, `s.isupper()`, `s.isalpha()`, `s.isalnum()`, `s.isdigit()`, `s.istitle()`.
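A minimal sketch of these operations, assuming the example texts from `nltk.book` have been loaded:

```Python
from nltk.book import text1   # text1 is Moby Dick

len(text1)            # number of tokens
text1[173]            # indexing: the token at position 173
text1[0:5]            # slicing: the first five tokens
'monstrous' in text1  # membership test

# the word comparison operators are ordinary string methods
'whale'.startswith('wh')  # True
'Whale'.istitle()         # True
'gr8'.isalnum()           # True
```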

### Searching

```Python
text1.concordance("monstrous")
```
A concordance view shows every occurrence of a given word together with some context, permitting us to see the word in context.

```Python
text1.similar("monstrous")
```
We can find out which other words appear in a similar range of contexts by appending `similar` to the name of the text and inserting the relevant word in parentheses (a bit like finding synonyms).

```Python
text2.common_contexts(["monstrous", "very"])
```
The term `common_contexts` allows us to examine just the contexts that are shared by two or more words.

```Python
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
```
We can determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot.

### Fine-grained Selection of Words
The mathematical set notation {w | w ∈ V & P(w)} has a corresponding Python list comprehension, `[w for w in V if p(w)]`. For example, to find the long words of a text:

```Python
V = set(text1)
long_words = [w for w in V if len(w) > 15]
```

### Collocations and Bigrams
A bigram is a pair of adjacent words, written in Python as a tuple such as ('said', 'than').
A collocation is a sequence of words that occur together unusually often. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words.

```Python
from nltk import bigrams

list(bigrams(['more', 'is', 'said', 'than', 'done']))  # [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
text4.collocations()
```
### Counting
A **token** is the technical name for a sequence of characters that we want to treat as a group.
We will think of a text as nothing more than a sequence of words and punctuation. Thus we can use `len()` to compute the number of tokens.
A **word type** is the form or spelling of the word independently of its specific occurrences in a text - that is, the word considered as a unique item of vocabulary.
Since our tokens include punctuation symbols as well, we will generally call these unique items **types** instead of word types.
```Python
len(text1) # the tokens of the text
len(set(text1)) # the types of the text considering case
len(set(word.lower() for word in text1)) # without considering case
len(set(word.lower() for word in text1 if word.isalpha())) # eliminate numbers and punctuation
```

A **frequency distribution** tells us the frequency of each vocabulary item in the text.
A `FreqDist` can be treated as a dictionary in Python, where the word (or word length, etc.) is the key and the number of occurrences is the corresponding value.

```Python
fdist1 = FreqDist(text1)     # invoke FreqDist(), passing the name of the text as an argument
fdist1.most_common(50)       # a list of the 50 most frequently occurring types in the text
fdist1['whale']              # the count of the given word
fdist1.hapaxes()             # the words that occur only once
fdist = FreqDist(len(w) for w in text1)  # a frequency distribution of word lengths
fdist.most_common()          # with no argument, returns all samples with their counts
fdist.max()                  # the most frequent word length
fdist[3]                     # the count of word length 3
fdist.freq(3)                # the relative frequency (proportion) of word length 3
```

The functions defined for NLTK's frequency distributions are summarized in a table in the NLTK book.

## Text Corpora and Lexical Resources

### Accessing Text Corpora

A text corpus is a large, structured collection of texts. Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.

NLTK provides many corpora in the `nltk.corpus` package. To apply the functions introduced earlier, we have to employ the following pair of statements:

```Python
import nltk
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))  # wrap the corpus words (a StreamBackedCorpusView) in an nltk.Text
emma.concordance('surprise')
```

A short program to display information about each text: loop over all the values of fileid (the Gutenberg file identifiers) and compute statistics for each text.

```Python
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # raw() gives the characters of the text
    num_words = len(gutenberg.words(fileid))  # words() divides the text into words
    num_sents = len(gutenberg.sents(fileid))  # sents() divides the text into sentences, each a list of words
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # count unique words
    # average word length, average sentence length, and the number of times each
    # vocabulary item appears in the text on average; round() rounds to the nearest integer
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
```

#### Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre (see the NLTK book for details).
We can access the corpus as a list of words or a list of sentences, and we can optionally specify particular categories or files to read.

```Python
from nltk.corpus import brown

brown.categories()
brown.fileids()
brown.words(categories='news')
brown.words(fileids='cg22')
brown.sents(categories=['news', 'editorial', 'reviews'])
```

We can use the Brown Corpus to study stylistics: systematic differences between genres.

```Python
news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
print()  # added to end the output line
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())  # lowercase so capitalized occurrences are counted too; otherwise the statistics would be inconsistent
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']  # or genres = brown.categories()
cfd.tabulate(conditions=genres, samples=modals)
```

A conditional frequency distribution is a collection of frequency distributions, each one for a different “condition”. The condition will often be the category of the text.
A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition.

NLTK’s Conditional Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

| Example | Description |
| ---- | ---- |
| `cfdist = ConditionalFreqDist(pairs)` | create a conditional frequency distribution from a list of pairs |
| `cfdist.conditions()` | the conditions |
| `cfdist[condition]` | the frequency distribution for this condition |
| `cfdist[condition][sample]` | frequency for the given sample for this condition |
| `cfdist.tabulate()` | tabulate the conditional frequency distribution |
| `cfdist.tabulate(samples, conditions)` | tabulation limited to the specified samples and conditions |
| `cfdist.plot()` | graphical plot of the conditional frequency distribution |
| `cfdist.plot(samples, conditions)` | graphical plot limited to the specified samples and conditions |
| `cfdist1 < cfdist2` | test if samples in cfdist1 occur less frequently than in cfdist2 |
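As a small sketch of these methods, here is a conditional frequency distribution over (genre, word) pairs from the Brown Corpus (the genre and sample choices are arbitrary):

```Python
import nltk
from nltk.corpus import brown

pairs = ((genre, word.lower())          # (condition, sample) pairs
         for genre in ['news', 'romance']
         for word in brown.words(categories=genre))
cfd = nltk.ConditionalFreqDist(pairs)

cfd.conditions()        # ['news', 'romance']
cfd['news']             # the FreqDist for the 'news' condition
cfd['news']['could']    # how often 'could' occurs in news
cfd.tabulate(conditions=['news', 'romance'],
             samples=['can', 'could', 'may', 'will'])
```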

#### Reuters Corpus
The Reuters Corpus contains 10788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and grouped into training and test sets.
Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories.

```Python
from nltk.corpus import reuters

reuters.categories('training/9865')
reuters.fileids('barley')
```

The NLTK book lists some of the corpora and corpus samples distributed with NLTK; for more information, consult the NLTK HOWTOs.

Basic Corpus Functionality defined in NLTK:

| Example | Description |
| ---- | ---- |
| `fileids()` | the files of the corpus |
| `fileids([categories])` | the files of the corpus corresponding to these categories |
| `categories()` | the categories of the corpus |
| `categories([fileids])` | the categories of the corpus corresponding to these files |
| `raw()` | the raw content of the corpus |
| `raw(fileids=[f1,f2,f3])` | the raw content of the specified files |
| `raw(categories=[c1,c2])` | the raw content of the specified categories |
| `words()` | the words of the whole corpus |
| `words(fileids=[f1,f2,f3])` | the words of the specified fileids |
| `words(categories=[c1,c2])` | the words of the specified categories |
| `sents()` | the sentences of the whole corpus |
| `sents(fileids=[f1,f2,f3])` | the sentences of the specified fileids |
| `sents(categories=[c1,c2])` | the sentences of the specified categories |
| `abspath(fileid)` | the location of the given file on disk |
| `encoding(fileid)` | the encoding of the file (if known) |
| `open(fileid)` | open a stream for reading the given corpus file |
| `root` | the path to the root of the locally installed corpus |
| `readme()` | the contents of the README file of the corpus |
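A quick sketch of a few of these calls on the Gutenberg corpus (the file names are the standard NLTK ones):

```Python
from nltk.corpus import gutenberg

gutenberg.fileids()                       # e.g. ['austen-emma.txt', 'austen-persuasion.txt', ...]
gutenberg.raw('austen-emma.txt')[:75]     # the first 75 characters of the raw text
gutenberg.words('austen-emma.txt')[:10]   # the first ten words
gutenberg.sents('austen-emma.txt')[0]     # the first sentence, as a list of words
gutenberg.abspath('austen-emma.txt')      # where the file lives on disk
gutenberg.readme()[:100]                  # the start of the corpus README
```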

### Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, `vocab = sorted(set(my_text))` and `word_freq = FreqDist(my_text)` are both simple lexical resources.

Lexicon Terminology: lexical entries for two lemmas having the same spelling (homonyms), providing part of speech and gloss information.

#### Wordlist Corpora

```Python
# find unusual or misspelt words in a text corpus
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
unusual_words(nltk.corpus.nps_chat.words())
```
```Python
# word list for solving word puzzles
puzzle_letters = nltk.FreqDist('egivrvonl')  # frequency distribution of the available puzzle letters
obligatory = 'r'  # the centre letter, which must be included
wordlist = nltk.corpus.words.words()
[w for w in wordlist if len(w) >= 6
                     and obligatory in w
                     and nltk.FreqDist(w) <= puzzle_letters]  # the word must not use any letter more often than it is available
```
```Python
# find names which appear in both male.txt and female.txt
names = nltk.corpus.names  # contains 'male.txt' and 'female.txt'
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]
```
```Python
# comparative wordlist - Swadesh wordlists
from nltk.corpus import swadesh
swadesh.fileids()
fr2en = swadesh.entries(['fr', 'en'])
de2en = swadesh.entries(['de', 'en'])
es2en = swadesh.entries(['es', 'en'])
translate = dict(fr2en)  # convert into a simple dictionary
translate.update(dict(de2en))
translate.update(dict(es2en))
translate['chien']  # 'dog'
```

The CMU Pronouncing Dictionary and Toolbox are also introduced in the book; I omit them in this note.

### WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155287 words and 117659 synonym sets.

#### Synsets

With WordNet we can look up a word's **synsets** ("synonym sets"), along with their definitions and usage examples.

```Python
# synsets
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')  # [Synset('car.n.01')]
wn.synset('car.n.01').lemma_names()  # ['car', 'auto', 'automobile', 'machine', 'motorcar']
wn.synset('car.n.01').definition()
wn.synset('car.n.01').examples()
```

#### Hyponyms and hypernyms

WordNet synsets correspond to abstract concepts, which don't always have corresponding words in English. These concepts are linked together in a hierarchy of lexical relations such as hyponymy.
The corresponding methods are `hyponyms()` and `hypernyms()`.

```Python
# hyponyms and hypernyms
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
parent_of_motorcar = motorcar.hypernyms()
paths = motorcar.hypernym_paths()
```

#### Some other lexical relations

Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). There are three kinds of holonym-meronym relation, accessed with `member_meronyms()`, `part_meronyms()`, `substance_meronyms()` and their counterparts `member_holonyms()`, `part_holonyms()`, `substance_holonyms()`.
There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping; some verbs have multiple entailments. (NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet, which can be accessed with `nltk.corpus.verbnet`.)

```Python
# meronyms and holonyms
tree = wn.synset('tree.n.01')
tree.part_meronyms()
tree.substance_meronyms()
tree.member_holonyms()
# entailments
wn.synset('walk.v.01').entailments()
wn.synset('eat.v.01').entailments()
wn.synset('tease.v.03').entailments()
# antonymy
wn.lemma('supply.n.02.supply').antonyms()
wn.lemma('rush.v.01.rush').antonyms()
wn.lemma('horizontal.a.01.horizontal').antonyms()
wn.lemma('staccato.r.01.staccato').antonyms()
```

#### Semantic Similarity

Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term will match documents containing specific terms.
We can quantify the concept of generality (how specific or general a synset is) by looking up the depth of the synset.
`path_similarity` assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found). Comparing a synset with itself returns 1.

```Python
right_whale = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')
right_whale.lowest_common_hypernyms(minke)
right_whale.min_depth()
right_whale.path_similarity(minke)
```

## Processing Raw Text

### Accessing Text

#### From the Web

```Python
# from electronic books
from urllib import request
import nltk
from nltk import word_tokenize

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8')
raw = raw[raw.find('PART I'):raw.rfind("End of Project Gutenberg’s Crime")]  # trim the Project Gutenberg header and footer
tokens = word_tokenize(raw)
text = nltk.Text(tokens)

# Reading from HTML is almost the same, except that to get text out of HTML
# we have to use a Python library called BeautifulSoup,
# available from http://www.crummy.com/software/BeautifulSoup/,
# since there are many markup symbols, e.g., << >> |
from bs4 import BeautifulSoup
raw = BeautifulSoup(html).get_text()  # html is an HTML string fetched as above
tokens = word_tokenize(raw)

# With the help of a Python library called the Universal Feed Parser, available from
# https://pypi.python.org/pypi/feedparser, we can access the content of a blog
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
```

#### From local files

```Python
f = open('document.txt')
f.read()  # read the contents of the entire file
f = open('document.txt', 'rU')  # after read() we have to reopen the file
# 'r' means open the file for reading (the default),
# and 'U' stands for 'Universal', which lets us ignore the
# different conventions used for marking newlines
# (note: the 'U' mode is deprecated in Python 3)
for line in f:  # read line by line
    print(line.strip())
```

ASCII text and HTML text are human readable formats. Text often comes in binary formats — like PDF and MSWord — that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats. Extracting text from multi-column documents is particularly challenging. For once-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described above. If the document is already on the web, you can enter its URL in Google’s search box. The search result often includes a link to an HTML version of the document, which you can save as text.

### Regular Expression

I’ve uploaded my summary of regular expressions in the post Regular Expression.

```Python
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
[w for w in wordlist if re.search('ed$', w)]
```

NLTK provides a regular expression tokenizer: `nltk.regexp_tokenize()`.
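For example (adapted from the book; the pattern below uses non-capturing groups, which recent NLTK versions prefer), a verbose regular expression that keeps abbreviations, hyphenated words, currency amounts, and ellipses together as single tokens:

```Python
import nltk

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(text, pattern)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```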

## Structured Program

The basics are discussed in my Python learning note; in this section I just record some points that were less familiar to me.

Assignment always copies the value of an expression, but a value is not always what you might expect it to be. In particular, the “value” of a structured object such as a list is actually just a reference to the object.
Python provides two ways to check that a pair of items are the same. The is operator tests for object identity.
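A small sketch of the difference between copying a reference and copying an object, and of `is` versus `==`:

```Python
foo = ['Monty', 'Python']
bar = foo            # bar is another reference to the same list object
foo[1] = 'Bodkin'
print(bar)           # ['Monty', 'Bodkin'], bar sees the change

baz = foo[:]         # slicing makes a (shallow) copy
print(baz == foo)    # True: same contents
print(baz is foo)    # False: different objects
```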

A list is typically a sequence of objects all having the same type, of arbitrary length. We often use lists to hold sequences of words. In contrast, a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a collection of different fields relating to some entity.
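For example (a made-up record, just to illustrate the convention):

```Python
words = ['the', 'cat', 'sat', 'on', 'the', 'mat']  # a list: same-type items, arbitrary length
record = ('walk', 'VB', 121)                       # a tuple: a fixed-length record of mixed types
word, pos, freq = record                           # tuples unpack naturally into named fields
```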

### Generator Expression

```Python
>>> max([w.lower() for w in word_tokenize(text)])  # [1]
'word'
>>> max(w.lower() for w in word_tokenize(text))    # [2]
'word'
```

The second line uses a generator expression. This is more than a notational convenience: in many language processing situations, generator expressions will be more efficient. In [1], storage for the list object must be allocated before the value of max() is computed. If the text is very large, this could be slow. In [2], the data is streamed to the calling function. Since the calling function simply has to find the maximum value - the word which comes latest in lexicographic sort order - it can process the stream of data without having to store anything more than the maximum value seen so far.

```Python
def search1(substring, words):
    result = []
    for word in words:
        if substring in word:
            result.append(word)
    return result

def search2(substring, words):  # a generator function: yields matching words one at a time
    for word in words:
        if substring in word:
            yield word
```

### Function

A function is not required to have any parameters.
A function usually communicates its results back to the calling program via the return statement.
A Python function is not required to have a return statement. Some functions do their work as a side effect, printing a result, modifying a file, or updating the contents of a parameter to the function (such functions are called “procedures” in some other programming languages).

```Python
>>> def my_sort1(mylist):    # good: modifies its argument, no return value
...     mylist.sort()
>>> def my_sort2(mylist):    # good: doesn't touch its argument, returns value
...     return sorted(mylist)
>>> def my_sort3(mylist):    # bad: modifies its argument and also returns it
...     mylist.sort()
...     return mylist
```

When you refer to an existing name from within the body of a function, the Python interpreter first tries to resolve the name with respect to the names that are local to the function. If nothing is found, the interpreter checks if it is a global name within the module. Finally, if that does not succeed, the interpreter checks if the name is a Python built-in. This is the so-called LGB rule of name resolution: local, then global, then built-in.
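A toy illustration of the LGB rule (the names here are invented for the example):

```Python
count = 0                   # a global name

def tally(words):
    count = len(words)      # a local name that shadows the global inside the function
    return count            # resolved as local first

print(tally(['a', 'b', 'c']))  # 3, from the local binding
print(count)                   # 0, the global is untouched
print(len)                     # no local or global 'len', so the built-in is found
```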

Python also lets us pass a function as an argument to another function. Now we can abstract out the operation, and apply a different operation on the same data.

```Python
>>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
...         'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
>>> def extract_property(prop):
...     return [prop(word) for word in sent]
...
>>> extract_property(len)
[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]
>>> def last_letter(word):
...     return word[-1]
>>> extract_property(last_letter)
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']
```

Python’s default arguments are evaluated once, when the function is defined, not each time the function is called (as they are in, say, Ruby). This means that if you use a mutable default argument and mutate it, you will have mutated that object for all future calls to the function as well.
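The classic pitfall, as a short sketch:

```Python
def append_item(item, items=[]):   # the default list is created once, at definition time
    items.append(item)
    return items

print(append_item('a'))   # ['a']
print(append_item('b'))   # ['a', 'b'], the same list object was reused

def append_item_safe(item, items=None):   # the usual fix: use None as a sentinel
    if items is None:
        items = []
    items.append(item)
    return items
```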

### Space-Time Tradeoffs

We can test efficiency using the timeit module. The Timer class takes two parameters: a statement which is executed multiple times, and setup code that is executed once at the beginning.

```Python
>>> from timeit import Timer
>>> vocab_size = 100000
>>> setup_list = "import random; vocab = range(%d)" % vocab_size        # [1]
>>> setup_set = "import random; vocab = set(range(%d))" % vocab_size    # [2]
>>> statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)     # [3]
>>> print(Timer(statement, setup_list).timeit(1000))
2.78092288971
>>> print(Timer(statement, setup_set).timeit(1000))
0.0037260055542
```

### Dynamic Programming
Dynamic programming is a general technique for designing algorithms which is widely used in natural language processing. Dynamic programming is used when a problem contains overlapping sub-problems. Instead of computing solutions to these sub-problems repeatedly, we simply store them in a lookup table.

```Python
def virahanka1(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l

def virahanka2(n):
    lookup = [[""], ["S"]]
    for i in range(n-1):
        s = ["S" + prosody for prosody in lookup[i+1]]
        l = ["L" + prosody for prosody in lookup[i]]
        lookup.append(s + l)
    return lookup[n]

def virahanka3(n, lookup={0:[""], 1:["S"]}):
    if n not in lookup:
        s = ["S" + prosody for prosody in virahanka3(n-1)]
        l = ["L" + prosody for prosody in virahanka3(n-2)]
        lookup[n] = s + l
    return lookup[n]

from nltk import memoize

@memoize
def virahanka4(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka4(n-1)]
        l = ["L" + prosody for prosody in virahanka4(n-2)]
        return s + l
```

Four Ways to Compute Sanskrit Meter: (i) recursive; (ii) bottom-up dynamic programming; (iii) top-down dynamic programming; and (iv) built-in memoization.

## Categorizing and Tagging Words

### Universal Part-of-Speech Tagset
| Tag | Meaning | English Examples |
| ---- | ---- | ---- |
| ADJ | adjective | new, good |
| ADP | adposition | on, of |
| ADV | adverb | really, already |
| CONJ | conjunction | and, or |
| DET | determiner, article | the, some |
| NOUN | noun | year, home |
| NUM | numeral | twenty-four, fourth |
| PRT | particle | at, on |
| PRON | pronoun | he, their |
| VERB | verb | is, say |
| . | punctuation marks | .,;! |
| X | other | gr8, univeristy |

Some related functions:

```Python
import nltk
from nltk import word_tokenize

# nltk.pos_tag()
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
# nltk.tag.str2tuple()
tagged_token = nltk.tag.str2tuple('fly/NN')
# nltk.corpus.brown.tagged_words()
nltk.corpus.brown.tagged_words()
nltk.corpus.brown.tagged_words(tagset='universal')
```

### Automatic Tagging

#### The Default Tagger
Default taggers assign their tag to every single word, even words that have never been encountered before.

```Python
tokens = word_tokenize("I do not like green eggs and ham")  # any tokenized text
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)  # every token is tagged 'NN'
```

#### The Regular Expression Tagger
The regular expression tagger assigns tags to tokens on the basis of matching patterns.

```Python
patterns = [
    (r'.*s$', 'NNS'),    # plural nouns
    (r'.*ing$', 'VBG'),  # gerunds
]
```
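The patterns are tried in order, and the book's full version ends with a catch-all (r'.*', 'NN'); the tagger itself is then built with nltk.RegexpTagger. A minimal sketch:

```Python
import nltk
from nltk import word_tokenize

patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*s$', 'NNS'),     # plural nouns
    (r'.*', 'NN'),        # catch-all: tag everything else as a noun
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(word_tokenize("The cats were sleeping peacefully"))
```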

#### The Lookup Tagger

Find the most frequent words and store their most likely tag, then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger).
For words not among the most frequent ones, it is fine to assign the default tag of NN. In other words, we want to use the lookup table first, and if it is unable to assign a tag, then fall back to the default tagger, a process known as backoff.

```Python
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
```

### N-Gram Tagging

#### Unigram Tagging
Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.

```Python
# train_sents is a list of tagged sentences, e.g. from brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(train_sents)
bigram_tagger = nltk.BigramTagger(train_sents)
```

Note:
n-gram taggers should not consider context that crosses a sentence boundary. Accordingly, NLTK taggers are designed to work with lists of sentences, where each sentence is a list of words. At the start of a sentence, t_{n-1} and preceding tags are set to None.

As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval).
One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary.

```Python
# 1. Try tagging the token with the bigram tagger.
# 2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
# 3. If the unigram tagger is also unable to find a tag, use a default tagger.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
# We can further specify that a tagger needs to see more than one instance of a context in order to retain it.
nltk.BigramTagger(sents, cutoff=2, backoff=t1)
# it will discard contexts that have only been seen once or twice
```

## Classification

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance.
E.g., deciding whether an email is spam or not; deciding what the topic of a news article is, from a fixed list of topic areas such as “sports,” “technology,” and “politics”; deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

```Python
# The example of gender identification.
# gender_features() extracts candidate features from a name,
# e.g., the name's last/first letter or the length of the name.
def gender_features(word):
    return {'last_letter': word[-1]}
    # return {'first_letter': word[0]}
    # return {'word_length': len(word)}

# get the name data
from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
import random
random.shuffle(labeled_names)

# separate the training set and test set
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
# train_set, test_set = featuresets[500:], featuresets[:500]
# nltk.classify.apply_features returns an object that acts like a list
# but does not store all the feature sets in memory
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

# use a Naive Bayes classifier and see the accuracy
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
# Accuracy:
#   last letter: 0.774
#   first letter: 0.622
#   name length: 0.618
```

The training set is used to train the model, the dev-test set is used to perform error analysis, and the test set serves for our final evaluation of the system, just as in standard machine learning practice.
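A sketch of that three-way split for the names data, reusing labeled_names and gender_features from the block above (the sizes follow the book's example):

```Python
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, devtest_set)  # tune and do error analysis against this
nltk.classify.accuracy(classifier, test_set)     # report this only for the final evaluation
```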


The rest of the chapter focuses on methods provided by NLTK. To learn the underlying principles, such as decision trees, which are not covered in Andrew Ng's course, I'd rather turn to Hands-On Machine Learning with Scikit-Learn and TensorFlow than this book, and I'll write a new post recording notes on that book :D
(I'll come back and update this post when necessary.)