Regular Expression

Introduction to Regular Expression

A regular expression (RegEx) is an algebraic notation for characterizing a set of strings. Regular expressions are particularly useful for searching text, when we have a pattern to search for and a corpus of texts to search through. A regular expression search function searches through the corpus and returns all texts that match the pattern. The corpus can be a single document or a collection.
(In some cases, regular expressions are written between slashes /, but the slashes are not part of the regular expression.)
In this post, I’ll write expression (without quotes) to denote a regular expression, and 'expression' (with quotes) to denote the text it matches.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like A, a, or 0, are the simplest regular expressions; they simply match themselves. E.g., hello matches 'hello', and world matches 'world'.
Regular expressions are case-sensitive, so the lower case h is distinct from the upper case H.
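As a quick check of literal matching and case sensitivity, here is a minimal sketch using Python's re module; the sample text is made up for illustration.

```python
import re

# Made-up sample text, not from any real corpus.
text = "Hello, hello world"

# The pattern hello matches only the lowercase occurrence.
lower_matches = re.findall(r"hello", text)
print(lower_matches)  # ['hello']

# An uppercase H is a different character, so this finds only the capitalized word.
upper_matches = re.findall(r"Hello", text)
print(upper_matches)  # ['Hello']
```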

Character set

[]
Used to indicate a set of characters.

  • Characters can be listed individually, e.g., [amk] will match 'a','m', or 'k'.
  • Ranges of characters can be indicated by giving two characters separated by a -. E.g., [A-Z] matches an upper case letter, and [0-9] matches a single digit. If - is escaped (\-) or placed as the first or last character (e.g., [a-] or [-a]), it matches a literal '-'.
  • Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
  • Character classes such as \w or \S are also accepted inside a set.
  • Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all characters that are not in the set will be matched. This is only true when the caret is the first symbol after the opening square bracket. If it occurs anywhere else, it stands for a literal caret and has no special meaning.
  • To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set.
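The rules above can be tried out in Python. This is only a sketch: the sample strings are invented so that each rule has something to match.

```python
import re

# Invented sample containing letters, punctuation, a bracket, and a caret.
sample = "a-m k(+*) ]^Z9"

listed = re.findall(r"[amk]", sample)       # characters listed individually
upper = re.findall(r"[A-Z]", sample)        # a range of upper case letters
hyphen = re.findall(r"[a-]", sample)        # trailing - is a literal hyphen
specials = re.findall(r"[(+*)]", sample)    # special characters are literal in a set
negated = re.findall(r"[^a-z ]", "ab 1c")   # leading ^ complements the set
caret = re.findall(r"[a^]", sample)         # ^ elsewhere is a literal caret
bracket = re.findall(r"[\]]", sample)       # escaped ] is a literal bracket

print(listed)    # ['a', 'm', 'k']
print(upper)     # ['Z']
print(hyphen)    # ['a', '-']
print(specials)  # ['(', '+', '*', ')']
print(negated)   # ['1']
print(caret)     # ['a', '^']
print(bracket)   # [']']
```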

Special sequences

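Some of the special sequences beginning with \ represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace. The bracket equivalents given below hold for ASCII matching; in Python 3, str patterns match the corresponding Unicode categories unless the re.ASCII flag is set.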

\d
Matches any decimal digit; this is equivalent to [0-9].

\D
Matches any non-digit character; this is equivalent to [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any word character (a letter, a digit, or an underscore); this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-word character; this is equivalent to the class [^a-zA-Z0-9_].
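To see these sequences side by side, here is a small sketch in Python. The sample line is an invented ASCII string, so the bracket equivalents apply directly.

```python
import re

# Invented sample; "\t" in this ordinary (non-raw) string is a real tab character.
line = "Room 42,\tfloor_3!"

digits = re.findall(r"\d", line)           # single decimal digits
non_space_runs = re.findall(r"\S+", line)  # runs of non-whitespace characters
word_runs = re.findall(r"\w+", line)       # runs of letters, digits, underscores
non_word = re.findall(r"\W", line)         # everything \w does not match

print(digits)          # ['4', '2', '3']
print(non_space_runs)  # ['Room', '42,', 'floor_3!']
print(word_runs)       # ['Room', '42', 'floor_3']
print(non_word)        # [' ', ',', '\t', '!']
```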

Wildcard expression

The period (dot) . is a special character that matches any single character except a newline. It is used wherever “any character” should be matched.
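A short sketch of the wildcard in Python, on an invented sample string. The second call uses the re.DOTALL flag, which is one way to let the dot match a newline as well.

```python
import re

# Invented sample; the third "word" has a newline between b and g.
sample = "bag big b\ng b-g"

# By default the dot matches any character except a newline.
dot_matches = re.findall(r"b.g", sample)
print(dot_matches)  # ['bag', 'big', 'b-g']

# With re.DOTALL the dot matches a newline too.
dotall_matches = re.findall(r"b.g", sample, re.DOTALL)
print(dotall_matches)  # ['bag', 'big', 'b\ng', 'b-g']
```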

Anchors

Anchors are special characters that anchor regular expressions to particular places in a string.

^
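The caret ^ matches the start of a line. (In Python, ^ matches only at the start of the string by default; with the re.MULTILINE flag it also matches right after each newline.)
The caret ^ therefore has three uses: to match the start of a line (^), to indicate a negation inside square brackets ([^]), and just to mean a literal caret (\^ or [.^]).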

$
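The dollar sign $ matches the end of a line. (In Python, $ by default matches at the end of the string or just before a trailing newline; with re.MULTILINE it also matches before every newline.)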

There are also two other anchors: \b matches a word boundary, and \B matches a non-boundary. More technically, a “word” for the purposes of a regular expression is defined as any sequence of digits, underscores, or letters.
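Here is a rough sketch of the four anchors in Python, on an invented two-line string. Note that Python's ^ only matches at the very start of the string unless re.MULTILINE is passed.

```python
import re

# Invented two-line sample.
text = "cat concat\ncatalog cat"

start_default = re.findall(r"^cat", text)                  # start of string only
start_multiline = re.findall(r"^cat", text, re.MULTILINE)  # start of every line
end_default = re.findall(r"cat$", text)                    # end of string
whole_word = re.findall(r"\bcat\b", text)                  # cat as a standalone word
word_suffix = re.findall(r"\Bcat\b", text)                 # cat ending a longer word

print(start_default)    # ['cat']
print(start_multiline)  # ['cat', 'cat']
print(end_default)      # ['cat']
print(whole_word)       # ['cat', 'cat']
print(word_suffix)      # ['cat']  (the one inside 'concat')
```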

Quantifiers

?
The question mark ? means “zero or one instances of the previous character”.
ab? will match either 'a' or 'ab'.

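Note that ? also appears in inline flags, where it is not a quantifier: (?i) at the start of a pattern makes the regular expression case-insensitive.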
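Both uses of the question mark can be checked in Python. This is a small sketch on invented sample strings.

```python
import re

# ? as a quantifier: the b is optional, so both 'a' and 'ab' match.
optional = re.findall(r"ab?", "a ab abb")
print(optional)  # ['a', 'ab', 'ab']

# (?i) as an inline flag placed at the start of the pattern.
insensitive = re.findall(r"(?i)hello", "Hello HELLO hello")
print(insensitive)  # ['Hello', 'HELLO', 'hello']
```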

*
Commonly called the Kleene *. The Kleene star means “zero or more occurrences of the immediately previous character or regular expression”.

+
The Kleene + means “one or more of the previous character”.

The ?, * and + quantifiers are all greedy: they match as much text as possible. Non-greedy matching can be enforced by appending a second ?, which here means “lazy” and causes the quantifier to match as few characters as possible: *?, +?, ??.
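The difference between * and +, and between greedy and lazy matching, is easiest to see on a small example. This is a sketch with invented input strings.

```python
import re

# * allows zero b's, + requires at least one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
print(star)  # ['a', 'ab', 'abbb']
print(plus)  # ['ab', 'abbb']

# Greedy .+ runs to the last '>', lazy .+? stops at the first one.
markup = "<b>bold</b>"
greedy = re.findall(r"<.+>", markup)
lazy = re.findall(r"<.+?>", markup)
print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```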

{}
{n}: Exactly n repeats where n is a non-negative integer
{n,}: At least n repeats
{,n}: No more than n repeats
{m,n}: At least m and no more than n repeats
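The four brace forms can be compared in Python with re.fullmatch, which requires the whole string to match. The helper name accepted and the sample strings are made up for this sketch.

```python
import re

# Strings of zero to four a's (invented test inputs).
samples = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    """Return the samples that the pattern matches in full (hypothetical helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]

exactly_two = accepted(r"a{2}")
at_least_two = accepted(r"a{2,}")
at_most_two = accepted(r"a{,2}")
one_to_three = accepted(r"a{1,3}")

print(exactly_two)   # ['aa']
print(at_least_two)  # ['aa', 'aaa', 'aaaa']
print(at_most_two)   # ['', 'a', 'aa']
print(one_to_three)  # ['a', 'aa', 'aaa']
```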

Alternation

The disjunction operator |, also called the pipe symbol, acts like a boolean OR. It matches the expression before or after the |.
In some sense, | is never greedy. As the target string is scanned, REs separated by | are tried from left to right. When one pattern completely matches, that branch is accepted. This means that in A|B, once A matches, B will not be tested further, even if it would produce a longer overall match.
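The left-to-right behaviour can be seen in Python with a pair of overlapping alternatives. This is only a sketch, and the word catalog is an arbitrary choice.

```python
import re

# The first branch that succeeds wins, even though the second would match more.
first_wins = re.match(r"cat|catalog", "catalog").group()
print(first_wins)  # 'cat'

# Putting the longer alternative first changes the result.
longer_first = re.match(r"catalog|cat", "catalog").group()
print(longer_first)  # 'catalog'

# If the rest of the pattern fails, the engine backtracks into the next branch.
# fullmatch requires the whole string, so the cat branch is rejected here.
backtracked = re.fullmatch(r"cat|catalog", "catalog").group()
print(backtracked)  # 'catalog'
```

Ordering alternatives from longest to shortest is a common way to avoid the surprise in the first call.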

Groups

()
Enclosing a pattern in parentheses makes it act like a single character for the purposes of neighboring operators like the pipe | and the Kleene *.

The parentheses have a second function: to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we add ?: right after the opening parenthesis, which makes the group non-capturing.

import re

re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') # ['ing']
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') # ['processing']

Regular Expression in Python

Raw string

To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by particular characters, the interpreter treats them specially. For example, \b would be interpreted as the backspace character. In general, when using regular expressions containing backslashes, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing. We do this by prefixing the string with the letter r, to indicate that it is a raw string.
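The effect is easy to verify. In this sketch the same pattern is written with and without the r prefix; the sample sentence is invented.

```python
import re

plain = "\bcat\b"   # the interpreter turns each \b into a backspace character
raw = r"\bcat\b"    # the backslashes reach the re library untouched

print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: \, b, c, a, t, \, b

# Only the raw version is seen by re as "word boundary, cat, word boundary".
plain_matches = re.findall(plain, "a cat sat")
raw_matches = re.findall(raw, "a cat sat")
print(plain_matches)  # []
print(raw_matches)    # ['cat']
```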

Multiple delimiters using regular expression

Python’s built-in str.split() accepts only a single separator. To split on several separators at once, it is convenient to use a regular expression:
re.split(r'\W+', original_string)
It returns the list of separated strings, but it yields an empty string at the start or the end whenever the text begins or ends with a non-word character.
We can use re.findall(r'\w+', original_string) to get the same tokens, but without the empty strings.
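Here is a small sketch comparing the two calls. The value of original_string is made up for illustration, and it deliberately starts and ends with punctuation.

```python
import re

# Invented sample that begins and ends with non-word characters.
original_string = "'Hello, world!' she said."

# Splitting on runs of non-word characters leaves empty strings at both ends.
split_tokens = re.split(r"\W+", original_string)
print(split_tokens)  # ['', 'Hello', 'world', 'she', 'said', '']

# Collecting runs of word characters gives the same tokens without the empties.
found_tokens = re.findall(r"\w+", original_string)
print(found_tokens)  # ['Hello', 'world', 'she', 'said']
```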

re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
This pattern keeps words with internal apostrophes or hyphens, like it's and ward-hearted, together as single tokens.
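To see what the pattern produces, here is a sketch that runs it on a short sentence. The value of raw is invented for this example.

```python
import re

# Invented sample with an apostrophe, a hyphenated word, and punctuation.
raw = "It's a ward-hearted tale... (really)."

tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
print(tokens)
# ["It's", 'a', 'ward-hearted', 'tale', '...', '(', 'really', ')', '.']
```

The first alternative keeps words with internal hyphens or apostrophes whole, the third groups runs such as the ellipsis, and the last one picks up any remaining non-whitespace character.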

Reference

Regular Expressions, Text Normalization, Edit Distance (from Speech and Language Processing (3rd ed. draft))
re - Regular expression operations
Regular Expression HOWTO