A regular expression (RegEx) is an algebraic notation for characterizing a set of strings. Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through. A regular expression search function will search through the corpus, returning all texts that match the pattern. The corpus can be a single document or a collection.
(In some cases, regular expressions are shown delimited by slashes /, but the slashes are not part of the regular expression.)
In this post, I'll just use expression (without quotes) to denote regular expressions, and 'expression' to denote the patterns matched.
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 0, are the simplest regular expressions; they simply match themselves.
Regular expressions are case sensitive. Therefore, the lowercase h is distinct from the uppercase H.
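To make the two points above concrete (ordinary characters match themselves, and matching is case sensitive), here is a minimal sketch using Python's re module. The sample string is made up for illustration, and re.IGNORECASE is shown only as the opt-in way to relax case sensitivity.

```python
import re

text = "Hello, hello, HELLO"  # illustrative sample string

# Ordinary characters match themselves, and case matters by default.
lower_matches = re.findall(r"hello", text)

# re.IGNORECASE is an optional flag that makes matching case-insensitive.
any_case_matches = re.findall(r"hello", text, flags=re.IGNORECASE)

print(lower_matches)     # ['hello']
print(any_case_matches)  # ['Hello', 'hello', 'HELLO']
```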
Square brackets [] are used to indicate a set of characters.
- Characters can be listed individually.
- Ranges of characters can be indicated by giving two characters and separating them by a -. For example, [A-Z] matches an uppercase letter, and [0-9] matches a single digit. If - is escaped (e.g., \-) or if it's placed as the first or last character (e.g., [-a]), it will match a literal '-'.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
- Character classes such as \S are also accepted inside a set.
- Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all characters that are not in the set will be matched. This is only true when the caret is the first symbol after the opening square bracket. If it occurs anywhere else, it stands for a literal caret and has no special meaning.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set.
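The rules in the list above can be checked one by one with Python's re module. This is only a sketch: the input strings are invented for illustration, and each line exercises a single rule.

```python
import re

# Ranges: uppercase letters and single digits.
uppercase = re.findall(r"[A-Z]", "Hello World")   # ['H', 'W']
digits = re.findall(r"[0-9]", "a1b22")            # ['1', '2', '2']

# A hyphen placed first in the set is a literal '-'.
dash_or_a = re.findall(r"[-a]", "a-b")            # ['a', '-']

# Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")       # ['(', '+', ')', '*']

# Character classes are accepted inside a set.
digit_or_dot = re.findall(r"[\d.]", "3.5kg")      # ['3', '.', '5']

# A leading ^ complements the set; elsewhere it is a literal caret.
non_digits = re.findall(r"[^0-9]", "a1b2")        # ['a', 'b']
caret_or_a = re.findall(r"[a^]", "a^b")           # ['a', '^']

# Two ways to include a literal ']': escape it, or put it first.
escaped = re.findall(r"[\]x]", "x]y")             # ['x', ']']
leading = re.findall(r"[]x]", "x]y")              # ['x', ']']
```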
Some of the special sequences beginning with
\ represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
\d matches any decimal digit; this is equivalent to the class [0-9].
\D matches any non-digit character; this is equivalent to the class [^0-9].
\s matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
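As a quick check of the predefined classes, the sketch below runs each one over a short made-up string. One caveat: for Python str patterns these classes are Unicode-aware by default, so the ASCII equivalences hold exactly only with the re.ASCII flag (or for ASCII-only input, as here).

```python
import re

sample = "a1 _!"  # illustrative input: letter, digit, space, underscore, punctuation

digit = re.findall(r"\d", sample)       # ['1']
non_digit = re.findall(r"\D", sample)   # ['a', ' ', '_', '!']
space = re.findall(r"\s", sample)       # [' ']
non_space = re.findall(r"\S", sample)   # ['a', '1', '_', '!']
word = re.findall(r"\w", sample)        # ['a', '1', '_']
non_word = re.findall(r"\W", sample)    # [' ', '!']

# With re.ASCII, \d is restricted to [0-9] even for non-ASCII digits.
ascii_digit = re.findall(r"\d", "1\u0663", flags=re.ASCII)  # ['1']
```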
The period . is a special character that matches any single character (except a newline). It is often used where “any character” is to be matched.
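A small sketch of the "except a newline" behaviour in Python: by default . skips the newline, and the optional re.DOTALL flag lifts that restriction. The input string is invented for illustration.

```python
import re

text = "bag big b\ng bug"  # the third candidate has a newline between b and g

# By default, . matches any character except a newline.
default_matches = re.findall(r"b.g", text)

# With re.DOTALL, . also matches a newline.
dotall_matches = re.findall(r"b.g", text, flags=re.DOTALL)

print(default_matches)  # ['bag', 'big', 'bug']
print(dotall_matches)   # ['bag', 'big', 'b\ng', 'bug']
```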
Anchors are special characters that anchor regular expressions to particular places in a string.
^ matches the start of a line.
Thus, the caret ^ has three uses: to match the start of a line (^), to indicate a negation inside of square brackets ([^]), and just to mean a caret.
The dollar sign
$ matches the end of a line.
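One Python-specific detail is worth a sketch here: by default, Python's re module anchors ^ and $ to the start and end of the whole string, and they only apply to every line when the re.MULTILINE flag is passed. The two-line sample text is made up for illustration.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ anchors at the start of the string only.
first_word_default = re.findall(r"^\w+", text)
# MULTILINE mode: ^ anchors at the start of every line.
first_word_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Default mode: $ anchors at the end of the string.
last_word_default = re.findall(r"\w+$", text)
# MULTILINE mode: $ anchors at the end of every line.
last_word_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(first_word_default)    # ['first']
print(first_word_multiline)  # ['first', 'second']
print(last_word_default)     # ['line']
print(last_word_multiline)   # ['line', 'line']
```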
There are also two other anchors:
\b matches a word boundary, and
\B matches a non-boundary. More technically, a “word” for the purposes of a regular expression is defined as any sequence of digits, underscores, or letters.
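The difference between \b and \B is easiest to see from where the matches start. The following is a small sketch on an invented sentence; note the raw string prefix r"", which keeps Python from treating \b as a backspace character.

```python
import re

text = "cat concat cats category"

# \bcat\b: 'cat' as a whole word only.
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat\b: 'cat' at the end of a longer word (no boundary before it).
suffix_starts = [m.start() for m in re.finditer(r"\Bcat\b", text)]

# \bcat\w*: any word that begins with 'cat'.
prefix_words = re.findall(r"\bcat\w*", text)

print(whole_word_starts)  # [0]
print(suffix_starts)      # [7]  (the 'cat' inside 'concat')
print(prefix_words)       # ['cat', 'cats', 'category']
```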
The question mark ? means “zero or one instances of the previous character”. ab? will match either 'a' or 'ab'.
The asterisk * is commonly called the Kleene * (Kleene star). It means “zero or more occurrences of the immediately previous character or regular expression”.
The Kleene + means “one or more of the previous character”.
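The three qualifiers ?, * and + can be compared side by side on the same input. This is a minimal sketch with a made-up string; only the qualifier after b changes between the three patterns.

```python
import re

text = "a ab abb"

zero_or_one = re.findall(r"ab?", text)   # 'a' followed by at most one 'b'
zero_or_more = re.findall(r"ab*", text)  # 'a' followed by any number of 'b'
one_or_more = re.findall(r"ab+", text)   # 'a' followed by at least one 'b'

print(zero_or_one)   # ['a', 'ab', 'ab']
print(zero_or_more)  # ['a', 'ab', 'abb']
print(one_or_more)   # ['ab', 'abb']
```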
The ?, * and + qualifiers are all greedy: they match as much text as possible. There are also ways to enforce non-greedy matching, using another meaning of the ? qualifier. Adding ? after a qualifier (as in *?, +? or ??) makes it lazy, causing it to match as few characters as possible.
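A common way to see greedy versus lazy matching is a tag-like string, where the greedy form runs to the last > and the lazy form stops at the first one. The snippet below is an illustrative sketch with an invented input, not a recommendation for parsing HTML.

```python
import re

html = "<b>bold</b>"

# Greedy: .* grabs as much as possible, up to the last '>'.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? grabs as little as possible, up to the first '>'.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```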
The disjunction operator, also called the pipe symbol |, acts like a boolean OR. It matches the expression before or after the |.
In some sense, | is never greedy. As the target string is scanned, REs separated by | are tried from left to right. When one pattern completely matches, that branch is accepted. This means that in A|B, once A matches, B will not be tested further, even if it would produce a longer overall match.
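The left-to-right behaviour of | shows up as soon as one alternative is a prefix of the other. In this sketch (inputs invented for illustration), swapping the order of the two branches changes what is matched.

```python
import re

# Basic OR: either alternative can match.
pets = re.findall(r"cat|dog", "cat dog bird")

# Branches are tried left to right; the first one that matches wins,
# even when a later branch would give a longer match.
short_first = re.match(r"a|ab", "ab").group()
long_first = re.match(r"ab|a", "ab").group()

print(pets)         # ['cat', 'dog']
print(short_first)  # 'a'
print(long_first)   # 'ab'
```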
Enclosing a pattern in parentheses makes it act like a single character for the purposes of neighboring operators like the pipe | and the Kleene *.
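The effect of parentheses on | and * can be sketched as follows. The example words are chosen for illustration. Because re.findall returns the contents of capturing groups rather than the whole match, the sketch uses re.finditer with group(0) to get the full matched text.

```python
import re

text = "guppy guppies"

# Without parentheses, | splits the whole pattern: 'guppy' OR 'ies'.
ungrouped = re.findall(r"guppy|ies", text)

# With parentheses, | only applies to the suffix: 'gupp' + ('y' OR 'ies').
grouped = [m.group(0) for m in re.finditer(r"gupp(y|ies)", text)]

# Parentheses also let * repeat a whole sequence rather than one character.
repeats_group = re.fullmatch(r"(ab)*", "ababab") is not None
repeats_char = re.fullmatch(r"ab*", "ababab") is not None

print(ungrouped)      # ['guppy', 'ies']
print(grouped)        # ['guppy', 'guppies']
print(repeats_group)  # True
print(repeats_char)   # False
```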
Python's built-in str.split() method only supports a single separator. However, re.split() takes a regular expression as the separator, which makes it convenient to split on multiple separators.
It will return the list of separated strings.
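As a sketch of splitting on several separators at once, the snippet below uses re.split() with a character set. The input string and the choice of separators (comma, semicolon, whitespace) are assumptions made for illustration.

```python
import re

text = "apple, banana;cherry  date"

# str.split() handles one fixed separator at a time.
single_separator = text.split(",")

# re.split() splits on any run of commas, semicolons, or whitespace.
multiple_separators = re.split(r"[,;\s]+", text)

print(single_separator)     # ['apple', ' banana;cherry  date']
print(multiple_separators)  # ['apple', 'banana', 'cherry', 'date']
```

The + after the set treats consecutive separators (such as a comma followed by a space) as a single split point, which avoids empty strings in the result.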
Regular Expressions, Text Normalization, Edit Distance (from Speech and Language Processing (3rd ed. draft))
re - Regular expression operations
Regular Expression HOWTO