Introduction to Regular Expressions
A regular expression (RegEx) is an algebraic notation for characterizing a set of strings. Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through. A regular expression search function will search through the corpus, returning all texts that match the pattern. The corpus can be a single document or a collection.
(In some cases, regular expressions are shown delimited by slashes /, but the slashes are not part of the regular expression.)
In this post, I’ll write expression (without quotes) to denote a regular expression, and 'expression' (in quotes) to denote the text it matches.
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like A, a, or 0, are the simplest regular expressions; they simply match themselves. E.g., hello matches the 'hello' in 'hello, world'.
Regular expressions are case sensitive. Therefore, the lower case h is distinct from the upper case H.
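A minimal sketch of literal matching and case sensitivity using Python's re module. The sample string is made up for illustration.

```python
import re

text = "hello, world"

# A pattern made of ordinary characters matches itself wherever it occurs.
literal_match = re.search("hello", text)
print(literal_match.group())

# Matching is case sensitive by default, so 'Hello' is not found in the text.
case_mismatch = re.search("Hello", text)
print(case_mismatch)
```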
Character set
[]
Used to indicate a set of characters.
- Characters can be listed individually, e.g., [amk] will match 'a', 'm', or 'k'.
- Ranges of characters can be indicated by giving two characters and separating them with a -. E.g., [A-Z] matches an upper case letter, and [0-9] matches a single digit. If - is escaped (\-) or placed as the first or last character (e.g., [a-] or [-a]), it will match a literal '-'.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
- Character classes such as \w or \S are also accepted inside a set.
- Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all characters that are not in the set will be matched. This is only true when the caret is the first symbol after the open square bracket. If it occurs anywhere else, it usually stands for a caret and has no special meaning.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set.
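The rules above can be checked with a short Python sketch. The sample string and variable names are illustrative only.

```python
import re

sample = "a-b [k] 3+4 ^m"

# Characters listed individually.
listed = re.findall(r"[amk]", sample)

# A range of characters.
digits = re.findall(r"[0-9]", sample)

# A hyphen placed last in the set is a literal '-'.
letter_or_hyphen = re.findall(r"[a-]", sample)

# Special characters lose their special meaning inside a set.
specials = re.findall(r"[(+*)]", sample)

# A leading caret complements the set.
complement = re.findall(r"[^a-z0-9 ]", sample)

# A caret that is not first is just a caret.
caret_literal = re.findall(r"[m^]", sample)

# A literal ']' can be escaped or placed first in the set.
bracket_escaped = re.findall(r"[\]]", sample)
bracket_first = re.findall(r"[]k]", sample)

print(listed, digits, letter_or_hyphen, specials)
print(complement, caret_literal, bracket_escaped, bracket_first)
```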
Special sequences
Some of the special sequences beginning with \ represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
\d
Matches any decimal digit; this is equivalent to [0-9].
\D
Matches any non-digit character; this is equivalent to [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character and the underscore (easily overlooked); this is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
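A small Python sketch of these sequences, with a made-up sample string. One caveat worth knowing: for str patterns in Python 3 these classes are Unicode-aware by default, so the ASCII equivalences listed above hold exactly only when the re.ASCII flag is passed.

```python
import re

sample = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", sample, flags=re.ASCII)
whitespace = re.findall(r"\s", sample)
non_whitespace_runs = re.findall(r"\S+", sample)
word_runs = re.findall(r"\w+", sample, flags=re.ASCII)
non_word = re.findall(r"\W", sample, flags=re.ASCII)

# U+0663 (ARABIC-INDIC DIGIT THREE) is a decimal digit in Unicode,
# so \d matches it unless re.ASCII restricts \d to [0-9].
unicode_digit = re.fullmatch(r"\d", "\u0663")
ascii_only_digit = re.fullmatch(r"\d", "\u0663", flags=re.ASCII)

print(digits, whitespace, non_whitespace_runs, word_runs, non_word)
```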
Wildcard expression
The period/dot . is a special character that matches any single character (except a newline). It is often used where “any character” is to be matched.
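A brief Python sketch of the wildcard on a made-up string. The re.DOTALL flag, which lets the dot match a newline as well, is specific to Python's re module.

```python
import re

sample = "bag big b\ng b-g"

# The dot matches any single character except a newline.
default_matches = re.findall(r"b.g", sample)

# With re.DOTALL the dot also matches the newline.
dotall_matches = re.findall(r"b.g", sample, flags=re.DOTALL)

print(default_matches)
print(dotall_matches)
```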
Anchors
Anchors are special characters that anchor regular expressions to particular places in a string.
^
The caret ^ matches the start of a line.
Thus, the caret ^ has three uses: to match the start of a line (^), to indicate a negation inside of square brackets ([^]), and just to mean a caret (\^ or [.^]).
$
The dollar sign $ matches the end of a line.
There are also two other anchors: \b matches a word boundary, and \B matches a non-boundary. More technically, a “word” for the purposes of a regular expression is defined as any sequence of digits, underscores, or letters.
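A short Python sketch of the anchors, using a made-up two-line string. Note a Python-specific detail: by default ^ and $ anchor to the whole string, and the per-line behavior described above requires the re.MULTILINE flag.

```python
import re

text = "the cat\nthe other"

# Without re.MULTILINE, ^ only matches at the start of the string.
start_of_string = re.findall(r"^the", text)
start_of_lines = re.findall(r"^the", text, flags=re.MULTILINE)

# Likewise, $ needs re.MULTILINE to match at the end of the first line.
end_of_string = re.findall(r"cat$", text)
end_of_lines = re.findall(r"cat$", text, flags=re.MULTILINE)

# \b requires a word boundary; \B requires the absence of one.
whole_word = re.findall(r"\bthe\b", text)   # the two standalone words
inside_word = re.findall(r"\Bthe\B", text)  # the 'the' inside 'other'

print(start_of_string, start_of_lines, end_of_string, end_of_lines)
print(whole_word, inside_word)
```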
Quantifiers
?
The question mark ? means “zero or one instances of the previous character”. ab? will match either 'a' or 'ab'.
(?i) at the start of a pattern makes the regular expression case insensitive.
*
Commonly called the Kleene *. The Kleene star means “zero or more occurrences of the immediately previous character or regular expression”.
+
The Kleene + means “one or more of the previous character”.
The ?, * and + quantifiers are all greedy: they match as much text as possible. Non-greedy matching can be enforced with another meaning of the ? symbol. Appended to a quantifier, ? means lazy, causing it to match as few characters as possible: *?, +?, ??.
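The difference between greedy and lazy matching is easiest to see side by side. This is a minimal Python sketch on made-up strings.

```python
import re

html = "<b>bold</b>"

# Greedy: .* runs to the last '>' it can reach.
greedy = re.search(r"<.*>", html).group()

# Lazy: .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.*?>", html).group()

# The same contrast for + and ?.
greedy_plus = re.search(r"a+", "caaat").group()
lazy_plus = re.search(r"a+?", "caaat").group()
greedy_optional = re.search(r"ab?", "ab").group()
lazy_optional = re.search(r"ab??", "ab").group()

print(greedy, lazy)
print(greedy_plus, lazy_plus, greedy_optional, lazy_optional)
```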
{}
{n}: Exactly n repeats, where n is a non-negative integer
{n,}: At least n repeats
{,n}: No more than n repeats
{m,n}: At least m and no more than n repeats
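The four counter forms can be compared with a small Python sketch. The helper full_matches and the sample list are hypothetical names introduced only for this illustration.

```python
import re

samples = ["a", "aa", "aaa", "aaaa"]

def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

exactly_two = full_matches(r"a{2}")
at_least_two = full_matches(r"a{2,}")
at_most_two = full_matches(r"a{,2}")
two_to_three = full_matches(r"a{2,3}")

print(exactly_two, at_least_two, at_most_two, two_to_three)
```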
Alternation
The disjunction operator, also called the pipe symbol |, acts like a boolean OR. It matches the expression before or after the |.
In some sense, | is never greedy. As the target string is scanned, REs separated by | are tried from left to right. When one pattern completely matches, that branch is accepted. This means that in A|B, once A matches, B will not be tested further, even if it would produce a longer overall match.
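A minimal Python sketch of the left-to-right behavior, using made-up strings: the order of the alternatives decides which one wins when both could match at the same position.

```python
import re

# 'cat' is tried first and succeeds, so 'category' is never attempted.
short_first = re.search(r"cat|category", "category").group()

# Putting the longer alternative first gives the longer match.
long_first = re.search(r"category|cat", "category").group()

# Ordinary use: match either word anywhere in the text.
either = re.findall(r"cat|dog", "cat dog bird")

print(short_first, long_first, either)
```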
Groups
()
Enclosing a pattern in parentheses makes it act like a single character for the purposes of neighboring operators like the pipe | and the Kleene *.
Capture group
The use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. Use a backslash followed by a number, like \1, to refer to a register. Here \1 will be replaced by whatever string matched the first item in parentheses.
Parentheses thus have a double function in regular expressions: they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ).
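A short Python sketch of grouping, backreferences, and non-capturing groups. The sentences are illustrative, and the behavior of re.findall with groups (returning the captured groups rather than the whole match) is specific to Python.

```python
import re

# Grouping: the + applies to the whole group 'ab'.
repeated_group = re.fullmatch(r"(ab)+", "ababab")

# Backreference: \1 must repeat whatever the first group matched.
pattern = r"the (.*)er they were, the \1er they will be"
same_word = re.search(pattern, "the faster they were, the faster they will be")
different_word = re.search(pattern, "the faster they were, the slower they will be")

# Capturing vs. non-capturing: (?:...) groups without filling a register.
sentence = "some cats like a few people"
both_captured = re.findall(r"(some|a few) (people|cats)", sentence)
one_captured = re.findall(r"(?:some|a few) (people|cats)", sentence)

print(same_word.group(1), different_word)
print(both_captured, one_captured)
```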
Precedence
This idea that one operator may take precedence over another, requiring us to sometimes use parentheses to specify what we mean, is formalized by the operator precedence hierarchy for regular expressions. The table below lists the operators from highest to lowest precedence.
| Kind | Operators |
| --- | --- |
| Parenthesis | () |
| Counters | * + ? {} |
| Sequences and anchors | the ^my end$ |
| Disjunction | \| |
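Two consequences of this hierarchy can be sketched in Python: counters bind more tightly than sequences, and sequences bind more tightly than disjunction. The test strings are made up for illustration.

```python
import re

# Counters outrank sequences: ab* means a(b*), not (ab)*.
counter_on_b = re.fullmatch(r"ab*", "abbb")
counter_not_on_ab = re.fullmatch(r"ab*", "abab")
grouped_counter = re.fullmatch(r"(ab)*", "abab")

# Sequences outrank disjunction: the|any means (the)|(any), not th(e|a)ny.
whole_word_choice = re.fullmatch(r"the|any", "the")
not_letter_choice = re.fullmatch(r"the|any", "thany")
grouped_choice = re.fullmatch(r"th(e|a)ny", "thany")

print(bool(counter_on_b), bool(counter_not_on_ab), bool(grouped_counter))
print(bool(whole_word_choice), bool(not_letter_choice), bool(grouped_choice))
```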
Lookahead assertions
There will be times when we need to predict the future: look ahead in the text to see if some pattern matches, but not advance the match cursor, so that we can deal with the pattern if it occurs.
Positive lookahead: (?= pattern)
Negative lookahead: (?! pattern)
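A minimal Python sketch of both assertions on made-up strings. The key point is that the lookahead checks the upcoming text without consuming it, so the match ends before the text that was inspected.

```python
import re

# Positive lookahead: match 'Isaac' only when ' Asimov' follows.
followed = re.search(r"Isaac(?= Asimov)", "Isaac Asimov")
not_followed = re.search(r"Isaac(?= Asimov)", "Isaac Newton")

# Negative lookahead: match a leading word unless it starts with 'Volcano'.
rejected = re.search(r"^(?!Volcano)[A-Za-z]+", "Volcano erupts")
accepted = re.search(r"^(?!Volcano)[A-Za-z]+", "Mountain stands")

# The match stops at index 5: ' Asimov' was checked but not consumed.
print(followed.group(), followed.end())
print(not_followed, rejected, accepted.group())
```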
Regular Expressions in Python
Raw string
To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by particular characters, it will interpret these specially. For example, \b would be interpreted as the backspace character. In general, when using regular expressions containing a backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing. We do this by prefixing the string with the letter r, to indicate that it is a raw string.
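The effect of the r prefix can be seen by comparing the two spellings of the same pattern. This is a small sketch with a made-up sentence.

```python
import re

plain = "\bcat\b"   # Python turns each \b into a backspace character
raw = r"\bcat\b"    # the backslashes reach the re library untouched

# The plain string is 5 characters long, the raw string 7.
print(len(plain), len(raw))

# The plain pattern looks for literal backspaces and finds nothing;
# the raw pattern uses \b as a word boundary and finds the word.
plain_result = re.search(plain, "a cat sat")
raw_result = re.search(raw, "a cat sat")
print(plain_result, raw_result.group())
```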
Multiple delimiters using regular expression
Python’s built-in str.split() method accepts only a single separator. A regular expression makes it convenient to split on multiple separators:
re.split(r'\W+', original_string)
This returns the list of separated strings, but it gives us empty strings at the start and the end when the text begins or ends with a separator.
We can use re.findall(r'\w+', original_string) to get the same tokens, but without the empty strings.
re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
This pattern helps to deal with words like it's and ward-hearted.
re.findall() is also useful for extracting information from tagged corpora, and I am currently using it in my research.
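The three calls above can be compared on one sentence. The sample text here is made up for illustration; the patterns are the ones quoted above.

```python
import re

raw = "'It's a well-known fact,' she said."

# Splitting on non-word runs leaves empty strings at both ends,
# because the text starts and ends with a separator.
split_tokens = re.split(r"\W+", raw)

# Finding word runs gives the same tokens without the empty strings.
word_tokens = re.findall(r"\w+", raw)

# The longer pattern keeps internal apostrophes and hyphens together
# and returns punctuation marks as separate tokens.
rich_tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)

print(split_tokens)
print(word_tokens)
print(rich_tokens)
```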
Markdown inline LaTeX
Most Markdown editors support inline LaTeX well. However, to support LaTeX on Hexo, it’s necessary to use MathJax and wrap each dollar-sign-delimited formula in inline code. This is a good job for a regular expression.
(Well, I just use Sublime Text’s Replace function to make the modification.)
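The pattern actually used for this change is not shown here. As a rough sketch of how the same substitution might look with re.sub, assume inline math is a single pair of dollar signs on one line and that "inline code" means wrapping the formula in backticks. The name INLINE_MATH and the function wrap_inline_math are hypothetical.

```python
import re

# Assumed convention: $...$ on one line, not already wrapped in backticks
# and not part of a $$...$$ display block.
INLINE_MATH = re.compile(r"(?<![`$])\$([^$\n]+)\$(?![`$])")

def wrap_inline_math(markdown_text):
    """Wrap each inline $...$ formula in backticks."""
    return INLINE_MATH.sub(r"`$\1$`", markdown_text)

converted = wrap_inline_math("Let $x + y$ be the sum.")
already_wrapped = wrap_inline_math("Let `$x$` be given.")
display_math = wrap_inline_math("$$a$$")

print(converted)
```

The lookbehind and lookahead keep the substitution from wrapping a formula twice or touching display math, which makes the function safe to run repeatedly over the same file.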
Reference
Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin:
Regular Expressions, Text Normalization, Edit Distance
Python Documentation:
re - Regular expression operations
Regular Expression HOWTO