Processing Corpus with Python

Recently I’ve been doing corpus-based research. There are several corpus tools available. Some are free, like AntConc and #LancsBox; some are paid, like WordSmith and PowerGREP. The paid software is too expensive, and the free software only provides general functions that cannot meet my requirements. Therefore, I decided to implement the processing procedure in Python on my own. NLTK offers methods for loading corpora, but it doesn’t support my tagged corpus. So I started from pure Python. In this post, I’ll record my experience along with the code. I’ll try to make it as detailed as possible so that some of my best friends majoring in linguistics can benefit from this post and maybe process corpora with Python at their will :D (Or perhaps you are a bit frustrated with code. Don’t worry. Later I will upload some helper Python programs so that you can execute them directly, without dealing with code.)

Before Getting Started

Jupyter Notebook

Download Anaconda

It doesn’t matter whether you already know Python or not. As long as you have learned some programming language, e.g., Java or C, that’s enough. Python is user-friendly and easy to use. You can refer to Python’s official tutorial at any time for a better understanding, or you can leave it out and follow my post step by step. Anyway, I’d suggest that you download Anaconda. It is straightforward and helps to manage troublesome settings. You can write code freely once you have installed it. And I’d highly recommend writing in Jupyter Notebook, which is installed together with Anaconda.

Launch Jupyter Notebook

  • For Windows users, click the Start Menu (or press the Windows key on the keyboard) - Anaconda3 - Jupyter Notebook.

  • For macOS users, open Launchpad (or press F4 on the keyboard) - Anaconda Navigator - notebook.

After that, a command window (black in Windows, white in macOS) will pop up. You can ignore it (but don’t close it!). Then your default browser will open automatically. Click New in the top right corner and select Python 3, and you can start writing Python code right away!

Basic of Jupyter Notebook

The notebook consists of a sequence of cells. There are three types of cells: code cells, markdown cells, and raw cells. By default, a cell is a code cell, and you can write Python code in it directly. You don’t need to worry about markdown or raw cells.

The cell has two modes: command mode and edit mode. Generally, the cell is in edit mode and you can type freely. If it’s in command mode, you can navigate around the notebook using keyboard shortcuts.

Keyboard Shortcuts

Shift + Enter: run cell

Esc: turn cell into command mode

Enter: turn cell into edit mode

D, D: delete the cell

A: insert cell above

B: insert cell below

These are shortcuts that I use most frequently. For the full list of available shortcuts, click Help - Keyboard Shortcuts in the notebook menus.

Well, this is a simple introduction to Jupyter Notebook. You can download my code and open processing.ipynb using Jupyter Notebook, or you can continue reading this post and type the code on your own. The following content is almost the same.

Shutdown

Oops, I almost forgot to tell you how to shut down Jupyter Notebook when you’ve finished your work. Return to the command window and press Ctrl+C twice quickly. Then Jupyter Notebook will shut down and everything will be fine.

Get an Overview of Your Corpus

You have to glance through your corpus before processing it. Generally, corpus files are in .txt or .xml format. You can use the default text editor of your operating system to open a file, or you can use an advanced text editor like Sublime Text for a better view.

The above image shows the header of LCMC_A.xml. (The Lancaster Corpus of Mandarin Chinese (LCMC) addresses an increasing need within the research community for a publicly available balanced corpus of Mandarin Chinese. Click here for more information about this corpus.) We don’t need to care about the header, which indicates information such as the creator, the date, etc. By focusing on the main body, we can see that each word is wrapped in <w POS="..."> and </w>, like <w POS="a">大</w>. Different corpora may use different tagging schemes; here (and in the following sections) I just take LCMC as an example.

Process a Single File

All right. Let’s get started with a single file.

Read the File

filename = 'LCMC_A.xml'
with open(filename, encoding='utf-8') as f:
    read_data = f.read()

The first line declares a variable called filename. The second line tells the system to open the file. Note that encoding='utf-8' indicates the encoding of the file; currently UTF-8 is the most widely used encoding. If you don’t know your file’s encoding, use utf-8 by default. Later I’ll write about how to deal with other encodings. The third line reads the entire content and assigns it to a variable called read_data.
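If your file turns out to be in another encoding, only the encoding argument needs to change. Here is a minimal sketch, assuming (purely for illustration) a file saved in GB18030, a common encoding for Chinese text:

# A sketch assuming the file is encoded in GB18030; check what encoding
# your corpus actually uses before copying this.
filename = 'LCMC_A.xml'
with open(filename, encoding='gb18030') as f:
    read_data = f.read()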

You can type read_data in a new cell and run the cell to see whether it’s successful.

Extract Useful Parts

import re
tag = re.findall(r'<w POS="(\w{0,4})">.{0,7}</w>', read_data)
token = re.findall(r'<w POS="\w{0,4}">(.{0,7})</w>', read_data)

The most useful information is the tags and the tokens. With the above code we can extract these parts in parallel (into two separate lists). Here I use regular expressions (re), which may look a bit abstruse. For now, you just need to know what the code does: recall the word form <w POS="a">大</w>. The second line extracts the content between the quotes (a, which is a tag); the third line extracts the content between the closing angle bracket and the next opening angle bracket (大, which is a token).

Note that the above code only extracts words, not punctuation. If you need to treat punctuation marks as tokens as well, use the following code instead:

tag = re.findall(r'<[wc] POS="(\w{0,4})">.{0,7}</[wc]>', read_data)

token = re.findall(r'<[wc] POS="\w{0,4}">(.{0,7})</[wc]>', read_data)

You can refer to my post for more information about regular expression.
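If you want to convince yourself of what these patterns match, you can try them on a small hand-made sample first (the line below is made up for illustration, not taken from LCMC):

# A made-up sample line in the LCMC style, just to test the patterns.
sample = '<w POS="a">大</w><c POS="w">。</c><w POS="n">学校</w>'
re.findall(r'<w POS="(\w{0,4})">.{0,7}</w>', sample)   # ['a', 'n']
re.findall(r'<w POS="\w{0,4}">(.{0,7})</w>', sample)   # ['大', '学校']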

Count Frequency

import collections
counter = collections.Counter(token)
counter.most_common()

Since we have extracted the useful parts, we can count the frequencies. The first line imports a library called collections for counting. The second line counts the tokens in the list, and the third line outputs the frequencies in descending order.
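most_common() prints every token in the list; if you only want the top of the table, pass it a number. The tags can be counted in exactly the same way (a small sketch reusing the variables above; tag_counter is just a name I chose):

# Top 10 most frequent tokens only.
counter.most_common(10)

# The same idea works for the tags.
tag_counter = collections.Counter(tag)
tag_counter.most_common(10)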

Index Target Word

Well, the frequency list just gives us an overview. Now let’s focus on our target words. We need to know a word’s occurrences in the corpus. For example, the following line finds all the indices of the word ‘是’.

indices = [i for i, x in enumerate(token) if x == '是']

With the indices we get, we can now perform a variety of tasks. We can look at the word’s concordance lines, n-grams, etc.

pre_token = [token[i-1] for i in indices]
next_token = [token[i+1] for i in indices]
pre_tag = [tag[i-1] for i in indices]
next_tag = [tag[i+1] for i in indices]

The first two lines store the tokens before ‘是’ and the tokens after ‘是’ separately, and the last two lines store the corresponding tags before and after ‘是’.
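These lists can be fed straight back into Counter. For example, a quick sketch counting which tokens most often precede ‘是’, using pre_token from above:

# Most frequent tokens immediately preceding '是'.
collections.Counter(pre_token).most_common(10)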

KWIC

Key Word In Context (KWIC) is a typical application in corpus linguistics. We can print KWIC lines easily with the indices we get:

for i in indices:
    for w in range(-5, 6):
        print(token[i+w], end=' ')
    print()

(Well, the code here is simplified: it doesn’t handle hits near the very beginning or end of the corpus, where i+w would fall outside the list.)
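A slightly more careful sketch clamps the window to the list boundaries, so hits at the very start or end of the corpus don’t cause trouble (the window size of 5 is the same choice as above):

# KWIC with the context window clamped to the valid range of indices.
for i in indices:
    start = max(i - 5, 0)
    end = min(i + 6, len(token))
    print(' '.join(token[start:end]))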

Export to Excel

Maybe you are more familiar with Excel. You can export the statistics you get from Python as well.

import pandas as pd
pd.DataFrame(counter.most_common()).to_excel('most_common.xlsx', header=False, index=False)

Once these two lines of code are executed, you will see that a new Excel file has been generated in your current directory. Then you can view the data in Excel.
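If you prefer the spreadsheet to have a header row, a small variation works too (the column names token and frequency are just my own choice):

# Same export, but with named columns kept as the header row.
df = pd.DataFrame(counter.most_common(), columns=['token', 'frequency'])
df.to_excel('most_common.xlsx', index=False)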

Process All Files

Load Files

Generally, a corpus contains several files. Now that we are done with a single file, it’s easy to process all the files as well.

import os
files = [f for f in os.listdir() if f.endswith('.xml')]

The above code lists the files in the current directory whose filename extension is .xml. With this list of files, we can read them all within a for loop.
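It is worth checking that the list contains what you expect, and sorting it gives you a stable processing order (this step is optional):

# Sort the filenames so the files are always processed in the same order.
files.sort()
files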

tags = []
tokens = []
for filename in files:
    with open(filename, encoding='utf-8') as f:
        read_data = f.read()
    tag = re.findall(r'<w POS="(\w{0,4})">.{0,7}</w>', read_data)
    token = re.findall(r'<w POS="\w{0,4}">(.{0,7})</w>', read_data)
    tags.extend(tag)
    tokens.extend(token)

Here tag and token are the same as in the previous section. The difference is that I declared two new variables, tags and tokens, to store the tags and tokens of the entire corpus (while tag and token only store those of a single file). The last two lines merge tag into tags and token into tokens.
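With tokens now covering the whole corpus, the earlier counting step works unchanged, just on the larger list (a brief sketch reusing collections from before; corpus_counter is a name I made up):

# Frequency count over the whole corpus instead of a single file.
corpus_counter = collections.Counter(tokens)
corpus_counter.most_common(10)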

Define Functions

Some lines of code serve a particular purpose and may need to be executed several times. It’s a good idea to wrap such code in a function for reuse. This helps to simplify our programs and makes them easier to use and comprehend.

def previous_tokens(target):
    indices = [i for i, x in enumerate(token) if x == target]
    pre_token = [token[i-1] for i in indices]
    return pre_token

For example, we define a function called previous_tokens. With the above block written, we can call it easily: previous_tokens('的') will give us the tokens preceding ‘的’.
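You can also combine it with Counter from before to see which tokens most often precede a target word (a small sketch building on the function just defined):

# Count the most frequent tokens that appear right before '的'.
collections.Counter(previous_tokens('的')).most_common(10)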