This note is based on Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit. I’ve uploaded my exercise solutions to GitHub.
Texts are represented in Python using lists. We can use indexing, slicing, and the `len()` function on them.
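For example, with one of the book’s texts:

```python
from nltk.book import text4

text4[173]   # indexing: the word at a given position
text4[0:6]   # slicing: a sub-list of words
len(text4)   # the length of the text in tokens
```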
Some word comparison operators: `t in s` tests whether t is contained inside s; string methods such as `s.startswith(t)`, `s.endswith(t)`, `s.islower()`, and `s.istitle()` are also useful.
A concordance view shows us every occurrence of a given word, together with some context.
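For example, with the book’s texts loaded:

```python
from nltk.book import text1  # Moby Dick

text1.concordance("monstrous")
```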
We can find out which other words appear in a similar range of contexts by appending the term `similar` to the name of the text, then inserting the relevant word in parentheses (a bit like finding synonyms).
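For example:

```python
from nltk.book import text1

text1.similar("monstrous")  # words used in contexts similar to "monstrous"
```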
`common_contexts` allows us to examine just the contexts that are shared by two or more words.
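For example:

```python
from nltk.book import text2

text2.common_contexts(["monstrous", "very"])  # contexts shared by both words
```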
We can determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot:

```python
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
```
Fine-grained Selection of Words
The mathematical set notation `{w | w ∈ V & P(w)}` has the corresponding Python expression:

```python
[w for w in V if p(w)]
```
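For example, selecting the long words of a text:

```python
from nltk.book import text1

V = set(text1)                              # the vocabulary of the text
long_words = [w for w in V if len(w) > 15]  # the property p: length over 15
sorted(long_words)
```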
Collocations and Bigrams
A bigram is written as a tuple in Python, e.g. `('than', 'said')`.
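NLTK provides a helper to extract the bigrams of a word list:

```python
from nltk import bigrams

list(bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```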
A collocation is a sequence of words that occur together unusually often. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words.
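The `collocations()` method finds these for us:

```python
from nltk.book import text4

text4.collocations()  # e.g. "United States; fellow citizens; ..."
```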
A frequency distribution tells us the frequency of each vocabulary item in the text.
`FreqDist` can be treated as a dictionary in Python, where the word (or word length, etc.) is the key, and the occurrence count is the corresponding value.
The functions defined for NLTK’s frequency distributions can be found here.
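A minimal example:

```python
from nltk import FreqDist
from nltk.book import text1

fdist1 = FreqDist(text1)  # count every token in Moby Dick
fdist1.most_common(10)    # the 10 most frequent (word, count) pairs
fdist1['whale']           # the count for one particular word
```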
A text corpus is a large, structured collection of texts. Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
NLTK includes many corpora in the package `nltk.corpus`. To apply the functions introduced before, we have to employ the following pair of statements:
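```python
import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')  # the corpus as a word list
emma = nltk.Text(emma)  # wrap it so Text methods such as concordance() work
emma.concordance('surprize')
```

This is the book’s pair of statements, reading Jane Austen’s Emma from the Gutenberg corpus.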
A short program can display information about each text, by looping over all the values of `fileid` corresponding to the `gutenberg` file identifiers and then computing statistics for each text:
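```python
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # average word length, average sentence length, lexical diversity
    print(round(num_chars / num_words), round(num_words / num_sents),
          round(num_words / num_vocab), fileid)
```

The three numbers are average word length, average sentence length, and the number of times each vocabulary item appears in the text on average.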
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre (see here for details).
We can access the corpus as a list of words, or a list of sentences. We can optionally specify particular categories or files to read.
We can use the Brown Corpus to study stylistics: systematic differences between genres.
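For example, the book counts modal verbs in the news genre:

```python
import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')  # one genre as a word list
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
```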
A conditional frequency distribution is a collection of frequency distributions, each one for a different “condition”. The condition will often be the category of the text.
A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition.
NLTK’s Conditional Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.
| Example | Description |
| --- | --- |
| `cfdist = ConditionalFreqDist(pairs)` | create a conditional frequency distribution from a list of pairs |
| `cfdist[condition]` | the frequency distribution for this condition |
| `cfdist[condition][sample]` | frequency for the given sample for this condition |
| `cfdist.tabulate()` | tabulate the conditional frequency distribution |
| `cfdist.tabulate(samples, conditions)` | tabulation limited to the specified samples and conditions |
| `cfdist.plot()` | graphical plot of the conditional frequency distribution |
| `cfdist.plot(samples, conditions)` | graphical plot limited to the specified samples and conditions |
| `cfdist1 < cfdist2` | test if samples in cfdist1 occur less frequently than in cfdist2 |
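For example, the book extends the modal-verb study to all genres at once:

```python
import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word)  # one (condition, event) pair per word occurrence
    for genre in brown.categories()
    for word in brown.words(categories=genre))
cfd.tabulate(conditions=['news', 'religion', 'hobbies'],
             samples=['can', 'could', 'may', 'might', 'must', 'will'])
```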
The Reuters Corpus contains 10788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and grouped into training and test sets.
Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories.
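For example:

```python
from nltk.corpus import reuters

reuters.categories('training/9865')  # the topics covered by one document
reuters.fileids('barley')            # the documents filed under one topic
reuters.fileids(['barley', 'corn'])  # documents filed under either topic
```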
Basic Corpus Functionality defined in NLTK:
| Example | Description |
| --- | --- |
| `fileids()` | the files of the corpus |
| `fileids([categories])` | the files of the corpus corresponding to these categories |
| `categories()` | the categories of the corpus |
| `categories([fileids])` | the categories of the corpus corresponding to these files |
| `raw()` | the raw content of the corpus |
| `raw(fileids=[f1,f2,f3])` | the raw content of the specified files |
| `raw(categories=[c1,c2])` | the raw content of the specified categories |
| `words()` | the words of the whole corpus |
| `words(fileids=[f1,f2,f3])` | the words of the specified fileids |
| `words(categories=[c1,c2])` | the words of the specified categories |
| `sents()` | the sentences of the whole corpus |
| `sents(fileids=[f1,f2,f3])` | the sentences of the specified fileids |
| `sents(categories=[c1,c2])` | the sentences of the specified categories |
| `abspath(fileid)` | the location of the given file on disk |
| `encoding(fileid)` | the encoding of the file (if known) |
| `open(fileid)` | open a stream for reading the given corpus file |
| `root` | the path to the root of the locally installed corpus |
| `readme()` | the contents of the README file of the corpus |
A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, `vocab = sorted(set(my_text))` and `word_freq = FreqDist(my_text)` are both simple lexical resources.
Lexicon Terminology: lexical entries for two lemmas having the same spelling (homonyms), providing part of speech and gloss information.
The CMU Pronouncing Dictionary and Toolbox are introduced in the book; I’ll omit them in this note.
WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155287 words and 117659 synonym sets.
With WordNet, we can find a word’s synonyms via synsets (“synonym sets”), as well as definitions and examples.
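The book’s motorcar example:

```python
from nltk.corpus import wordnet as wn

wn.synsets('motorcar')               # [Synset('car.n.01')]
wn.synset('car.n.01').lemma_names()  # the synonyms collected in this synset
wn.synset('car.n.01').definition()   # a gloss of the concept
wn.synset('car.n.01').examples()     # example sentences
```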
Hyponyms and hypernyms
WordNet synsets correspond to abstract concepts, and they don’t always have corresponding words in English. These concepts are linked together in a hierarchy. (See hyponyms and hypernyms in lexical relations.)
The corresponding methods are `hyponyms()`, `hypernyms()`, `hypernym_paths()`, and `root_hypernyms()`.
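Using the motorcar synset again:

```python
from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
motorcar.hyponyms()        # more specific concepts: ambulance, hatchback, ...
motorcar.hypernyms()       # the more general concept: motor_vehicle
motorcar.root_hypernyms()  # [Synset('entity.n.01')]
```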
Some other lexical relations
Another important way to navigate the WordNet network is from items to their components (meronyms), or to the things they are contained in (holonyms). There are three kinds of holonym-meronym relation: part, substance, and member.
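The book’s tree example covers all three:

```python
from nltk.corpus import wordnet as wn

wn.synset('tree.n.01').part_meronyms()       # its parts: trunk, crown, ...
wn.synset('tree.n.01').substance_meronyms()  # what it is made of: heartwood, sapwood
wn.synset('tree.n.01').member_holonyms()     # what it is a member of: forest
```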
There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments. (NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with `nltk.corpus.verbnet`.)
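For example:

```python
from nltk.corpus import wordnet as wn

wn.synset('walk.v.01').entailments()  # [Synset('step.v.01')]
wn.synset('eat.v.01').entailments()   # [Synset('chew.v.01'), Synset('swallow.v.01')]
```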
Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term will match documents containing specific terms.
We can quantify the concept of generality (specific or general) by looking up the depth of the synset.
path_similarity assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found). Comparing a synset with itself will return 1.
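The book’s whale examples illustrate both ideas (exact numbers depend on the WordNet version):

```python
from nltk.corpus import wordnet as wn

wn.synset('baleen_whale.n.01').min_depth()  # 14: a fairly specific concept
wn.synset('entity.n.01').min_depth()        # 0: the most general concept

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
right.path_similarity(orca)                 # about 0.17: closely related
right.path_similarity(right)                # 1.0
```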
From local files
ASCII text and HTML text are human-readable formats. Text often comes in binary formats, like PDF and MSWord, that can only be opened using specialized software. Third-party libraries such as `pywin32` provide access to these formats. Extracting text from multi-column documents is particularly challenging. For one-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described below. If the document is already on the web, you can enter its URL in Google’s search box. The search result often includes a link to an HTML version of the document, which you can save as text.
I’ve uploaded my summary of regular expression in the post Regular Expression.
NLTK provides a regular expression tokenizer:
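The book’s example defines the token grammar as a verbose regular expression:

```python
import nltk

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)           # set flag to allow verbose regexps
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                 # ellipsis
    | [][.,;"'?():-_`]       # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(text, pattern)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```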
The basics are discussed in my Python learning note; in this section, I just record some less familiar points.
Assignment always copies the value of an expression, but a value is not always what you might expect it to be. In particular, the “value” of a structured object such as a list is actually just a reference to the object.
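The book’s demonstration:

```python
foo = ['Monty', 'Python']
bar = foo          # bar is a reference to the same list object
foo[1] = 'Bodkin'  # modify the object through one name...
bar                # ['Monty', 'Bodkin'] ...and the other name sees the change
```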
Python provides two ways to check that a pair of items are the same: the `is` operator tests for object identity, while `==` tests whether the values are equal.
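A small illustration (my own example, not the book’s):

```python
a = ['NLP']
b = ['NLP']
a == b  # True: the two lists hold equal values
a is b  # False: they are two distinct objects
c = a
a is c  # True: both names refer to the same object
```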
A list is typically a sequence of objects all having the same type, of arbitrary length. We often use lists to hold sequences of words. In contrast, a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a collection of different fields relating to some entity.
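A quick sketch of the convention (my own example):

```python
words = ['walk', 'walked', 'walking']  # list: same-type items, variable length
entry = ('walk', 'V', 'move on foot')  # tuple: a fixed-length record of mixed fields
lemma, pos, gloss = entry              # unpack the fields of the record
```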
We can pass a generator expression where a list comprehension would otherwise be used. This is more than a notational convenience: in many language processing situations, generator expressions are more efficient. With the list comprehension (1 below), storage for the list object must be allocated before the value of `max()` is computed; if the text is very large, this could be slow. With the generator expression (2 below), the data is streamed to the calling function. Since the calling function simply has to find the maximum value (the word which comes latest in lexicographic sort order), it can process the stream of data without having to store anything more than the maximum value seen so far.
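The book’s example, with the two variants marked 1 and 2:

```python
import nltk

text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''
max([w.lower() for w in nltk.word_tokenize(text)])  # 1: builds the whole list first
max(w.lower() for w in nltk.word_tokenize(text))    # 2: streams tokens to max()
```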
A function is not required to have any parameters. A function usually communicates its results back to the calling program via the `return` statement.
A Python function is not required to have a return statement. Some functions do their work as a side effect, printing a result, modifying a file, or updating the contents of a parameter to the function (such functions are called “procedures” in some other programming languages).
When you refer to an existing name from within the body of a function, the Python interpreter first tries to resolve the name with respect to the names that are local to the function. If nothing is found, the interpreter checks if it is a global name within the module. Finally, if that does not succeed, the interpreter checks if the name is a Python built-in. This is the so-called LGB rule of name resolution: local, then global, then built-in.
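A tiny demonstration of the LGB order (my own example):

```python
x = 'global'

def f():
    x = 'local'  # shadows the global name inside the function
    return x

f()          # 'local': resolved in the local scope first
print(x)     # 'global': the module-level name is untouched
len('abc')   # 'len' is resolved in the built-in scope
```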
Python also lets us pass a function as an argument to another function. Now we can abstract out the operation, and apply a different operation on the same data.
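The book’s `extract_property` example:

```python
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def extract_property(prop):
    return [prop(word) for word in sent]

extract_property(len)              # the length of each word
extract_property(lambda w: w[-1])  # the last character of each word
```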