Python Regular Expressions
External
Internal
TODO
PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto
Overview
A regular expression pattern is usually written as a raw string literal, r"...", so that backslashes reach the re module unchanged.
Metacharacters
The complete list of metacharacters:
. (dot)
It matches any character except a newline. In the alternate mode re.DOTALL it matches a newline as well. Use it where you want to match “any character”.
To match an actual dot, escape it:
\.
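A minimal sketch of the dot's behavior described above, with and without re.DOTALL. The sample strings are made up for illustration.

```python
import re

text = "a\nb"

# By default the dot does not match a newline.
default_match = re.fullmatch(r"a.b", text)

# With re.DOTALL the dot matches the newline as well.
dotall_match = re.fullmatch(r"a.b", text, flags=re.DOTALL)

# An escaped dot matches only a literal dot.
literal_dots = re.findall(r"\.", "v1.2.3")
```

Here default_match is None, dotall_match is a match object, and literal_dots contains the two dots of the version string.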
^
Inside a character class, it complements the set, which means it matches every character not in the set. This is indicated by including ^ as the first character of the class. For example, [^a] matches any character except "a". If the caret appears elsewhere in the class, it has no special meaning and represents itself. Outside a character class, ^ matches at the start of the string, and also at the start of each line in re.MULTILINE mode.
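The two roles of the caret inside a class can be checked with a short sketch. The input strings are arbitrary examples.

```python
import re

# [^a] matches any single character other than 'a'.
not_a = re.findall(r"[^a]", "banana")

# A caret that is not the first character of the class is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")
```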
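$
Matches at the end of the string, or just before a newline at the end of the string. In re.MULTILINE mode it also matches before each newline.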
*
Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.
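A small sketch of the "as many repetitions as possible" behavior, using made-up inputs. Note that zero repetitions still count as a successful (empty) match.

```python
import re

# The star consumes every available 'a'.
greedy = re.match(r"a*", "aaab").group(0)

# Zero repetitions is still a match: the result is the empty string.
zero = re.match(r"a*", "bbb").group(0)

# Greedy matching extends to the last '>' it can reach, not the first.
tag = re.match(r"<.*>", "<b>bold</b>").group(0)
```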
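+
Matches 1 or more repetitions of the preceding regular expression, as many as possible.
?
Matches 0 or 1 repetitions of the preceding regular expression.
{
Begins a repetition qualifier such as {m} or {m,n}, which matches from m to n repetitions of the preceding regular expression.
}
Ends a repetition qualifier.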
[
Used to specify the beginning of a character class, which is a set of characters you wish to match. Characters can be listed individually, as in [abc], or as a range by using -, as in [0-9].
⚠️ Most metacharacters are not active inside classes; they are stripped of their special nature. For example, [abc$] matches 'a', 'b', 'c' or '$'. The exceptions are \, a leading ^, a - between two characters, and ].
]
Used to specify the end of a character class.
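The character class forms above, sketched on arbitrary sample strings:

```python
import re

# Characters listed individually.
vowels = re.findall(r"[aeiou]", "regex")

# A range of characters.
digits = re.findall(r"[0-9]", "a1b22")

# '$' is an ordinary character inside a class.
dollar = re.findall(r"[abc$]", "a$d")
```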
\
A backslash can be followed by various characters to signal special sequences. The backslash is also used to escape metacharacters so they can be matched literally in patterns. To match [, prefix it with a backslash: \[.
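A brief sketch of escaping, first by hand and then with re.escape(), which adds the backslashes for an arbitrary literal string. The sample strings are illustrative only.

```python
import re

text = "x[0][1]"

# An escaped bracket matches a literal '['.
brackets = re.findall(r"\[", text)

# re.escape() escapes every metacharacter in a literal string.
literal = "a.b[0]"
escaped_pattern = re.escape(literal)

# The escaped pattern matches the literal text itself...
exact = re.fullmatch(escaped_pattern, literal)

# ...but not a string where the dot would have acted as a wildcard.
loose = re.fullmatch(escaped_pattern, "axb[0]")
```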
|
Alternation: A|B matches either A or B.
(
Used to indicate the beginning of a capturing group.
)
Used to indicate the end of a capturing group.
Special Sequences
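All the special sequences described below can be included inside a character class, and they preserve their meaning. The equivalent classes listed hold in ASCII mode (bytes patterns, or str patterns compiled with re.ASCII); by default, str patterns also match the corresponding Unicode characters.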
\d
Matches any decimal digit. It is equivalent to the class [0-9].
\D
Matches any non-digit character. It is equivalent to the class [^0-9].
\s
Matches any whitespace character. It is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character. It is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character. It is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character. It is equivalent to the class [^a-zA-Z0-9_].
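The special sequences can be tried out on a short sample. This is only a sketch: the string s is invented, and the last pattern shows a sequence used inside a character class.

```python
import re

# Sample text: the \t in this ordinary string literal is a real tab character.
s = "id_42 = 3.14\tok"

digits = re.findall(r"\d+", s)   # runs of decimal digits
words = re.findall(r"\w+", s)    # runs of letters, digits and underscores
spaces = re.findall(r"\s", s)    # individual whitespace characters

# \d keeps its meaning inside a class: digits or a literal dot.
numbers = re.findall(r"[\d.]+", s)
```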
NOT Metacharacters
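The following characters are matched without any escaping when they do not form a valid repetition qualifier:
{...}
{ and } are metacharacters only as part of a qualifier such as {m} or {m,n}. In any other position, Python's re module treats them as literal characters, so a pattern like {{color}} matches the text {{color}} as written. Escaping them as \{ and \} is always safe.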
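A quick way to check how braces behave is to compare a qualifier with braces that do not form one. This sketch uses an invented sample string.

```python
import re

text = "a {{color}} car"

# {2} is a repetition qualifier: exactly two 'a' characters.
is_qualifier = re.fullmatch(r"a{2}", "aa") is not None

# Braces that do not form a qualifier are matched literally.
unescaped = re.findall(r"{{color}}", text)

# Escaping the braces gives the same result and makes the intent explicit.
escaped = re.findall(r"\{\{color\}\}", text)
```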
Patterns
An optional group (zero or one occurrence of the group):
(...)?
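A minimal sketch of an optional group. The pattern and inputs are hypothetical: a word followed by an optional numeric suffix.

```python
import re

# A word, optionally followed by '-' and digits.
p = re.compile(r"^(\w+)(-\d+)?$")

# The optional group participates in the match.
with_suffix = p.match("item-7").groups()

# The optional group is absent, so it is reported as None.
without_suffix = p.match("item").groups()
```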
Matching Modes
Match a Pattern and Pick Up Groups
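import re
# Three word groups separated by ':' and '-', anchored at both ends.
p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    # group(0) is the whole match; groups 1 to 3 are the parenthesized parts.
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)
Groups are 1-based. group(0) represents the entire match.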
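Pitfall: when a regular expression ends with optional groups, as in '....()?()?', a group that did not participate in the match is None. m.group(n) returns None for such a group and raises IndexError only if the pattern defines no group n. Note that m.groups(default) takes a default value for unmatched groups, not a group index. One solution is to retrieve the groups as a tuple before any evaluation, and use the tuple for testing. The tuple is 0-based, so groups[4] holds group 5:
groups = m.groups()
if groups[1]:
    ...
...
if groups[4]:
    ...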
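The behavior of unmatched optional groups can be verified with a small sketch. The pattern below is hypothetical (one required group followed by two optional ones), not the pattern from the note above.

```python
import re

# Hypothetical pattern: one required group followed by two optional ones.
p = re.compile(r"^(\w+)(?::(\w+))?(?:-(\w+))?$")
m = p.match("abc:mnp")

# Fetch every group once; unmatched optional groups come back as None.
groups = m.groups()

# The tuple is 0-based: groups[2] is the same value as m.group(3).
third = groups[2]

# m.groups(default) substitutes a default for unmatched groups.
with_default = m.groups("")

# IndexError appears only for a group number the pattern does not define.
try:
    m.group(4)
    undefined_group_raises = False
except IndexError:
    undefined_group_raises = True
```

Testing elements of the tuple keeps all the None checks in one place and avoids confusing group numbers with tuple indexes.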
Replacing Regular Expression Occurrences
import re
s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))
Strip quotes:
s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))
Capture groups and use them in the replacement:
s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2
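Group references in the replacement can also be written in other forms. The sketch below shows a raw-string back-reference, the \g<number> form, and named groups; the group names first and second are made up for illustration.

```python
import re

s = "this is red"

# A raw string avoids doubling the backslash in the replacement.
raw_backref = re.sub(r"^(this is).*$", r"\1 blue", s)

# \g<1> is unambiguous when a digit follows the reference.
followed_by_digit = re.sub(r"^(this is).*$", r"\g<1>2", s)

# Named groups can be referenced with \g<name>.
swapped = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)",
    r"\g<second> \g<first>",
    "hello world",
)
```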
To dynamically build a regular expression, use a raw f-string, rf'...'. Pass interpolated values through re.escape() if they may contain metacharacters:
s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2
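When the interpolated value may contain metacharacters, a safer sketch combines rf'...' with re.escape(). The variable names needle and prefix and the sample text are assumptions for illustration. Literal braces inside an f-string must be doubled.

```python
import re

s = "price: 3.50 (approx)"
needle = "(approx)"  # hypothetical user-supplied text containing metacharacters

# re.escape() keeps the parentheses literal instead of opening a group.
pattern = rf"\s*{re.escape(needle)}$"
result = re.sub(pattern, "", s)

# Literal braces in an f-string are doubled, so {{2}} yields the qualifier {2}.
prefix = "id"
id_pattern = rf"{re.escape(prefix)}\d{{2}}"
```

Without re.escape(), the parentheses in needle would be read as a capturing group and the literal text "(approx)" would not be matched.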
Match a new line:
r'\n'
Match not a new line:
r'[^\n]'
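Both newline patterns can be seen at work on a two-line sample string (invented for this sketch):

```python
import re

text = "first line\nsecond line"

# Runs of characters that are not newlines, i.e. the individual lines.
lines = re.findall(r"[^\n]+", text)

# Count the newline characters themselves.
newline_count = len(re.findall(r"\n", text))
```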