Python Regular Expressions: Difference between revisions
Revision as of 18:14, 16 April 2022
External
Internal
TODO
PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto
Overview
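A regular expression is usually specified as a raw string literal, r"...", so that backslashes reach the re module unchanged instead of being interpreted as Python string escapes.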
Metacharacters
The complete list of metacharacters:
. (dot)
It matches anything except a newline character, and there's an alternate mode (re.DOTALL) where it will match even a newline. Used where you want to match "any character".
To match an actual dot, escape it:
\.
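As a minimal sketch of the dot behavior described above (the sample strings are made up for illustration):

```python
import re

text = "a\nb"

# By default the dot does not match a newline, so this search fails.
default_match = re.search(r"a.b", text)

# With re.DOTALL the dot matches the newline as well.
dotall_match = re.search(r"a.b", text, re.DOTALL)

# An escaped dot matches only a literal '.' character.
literal_dots = re.findall(r"\.", "v1.2.3")

print(default_match, dotall_match.group(0) == text, literal_dots)
```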
^
When used as the first character of a character class, it complements the set, which means it matches all characters not in the set. For example [^a]
will match any character except "a". If the caret appears elsewhere in the class, it does not have a special meaning and represents itself.
Outside a character class, ^ matches at the beginning of the string.
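A short illustration of the caret inside a character class, using made-up input strings:

```python
import re

# Caret as the first character of the class: match anything except 'a'.
not_a = re.findall(r"[^a]", "banana")

# Caret elsewhere in the class: it is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")

print(not_a, caret_literal)
```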
$
*
Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.
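The "as many repetitions as are possible" behavior (greedy matching) can be seen in a small sketch; the inputs are illustrative only:

```python
import re

# 'b*' consumes every available 'b'.
many = re.match(r"ab*", "abbbc").group(0)

# Zero repetitions is also a valid match.
zero = re.match(r"ab*", "ac").group(0)

# Greedy: '.*' runs to the last '>' rather than stopping at the first one.
greedy = re.search(r"<.*>", "<a><b>").group(0)

print(many, zero, greedy)
```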
+
?
{
}
[
Used to specify the beginning of a character class, which is a set of characters you wish to match. Characters can be listed individually:
[abc]
or as a range, by using -
:
[0-9]
⚠️ Most metacharacters are not active inside classes: they are stripped of their special nature. For example [abc$]
will match 'a' or 'b' or 'c' or '$'. The exceptions are ^ as the first character, - between two characters, ] and \.
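A brief sketch of character classes with individual characters, a range, and a metacharacter that loses its special meaning inside the class (sample strings are assumptions):

```python
import re

# '$' is an ordinary character inside a class.
listed = re.findall(r"[abc$]", "a$d c")

# A range of digits.
digits = re.findall(r"[0-9]", "x1y22")

print(listed, digits)
```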
]
Used to specify the end of a character class.
\
A backslash can be followed by various characters to signal special sequences. The backslash is also used to escape all metacharacters so they can be matched in patterns. To match [
, prefix it with a backslash: \[
.
|
Indicates an alternative between two or more regular sub-expressions. The entire regular expression matches if the first sub-expression matches, or if the second sub-expression matches, and so on. Alternatives are tried from left to right.
pattern = re.compile('AND|OR|,')
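One possible use of the alternation pattern above is to split or scan an expression string. The input string here is an assumption made for illustration:

```python
import re

pattern = re.compile('AND|OR|,')
expr = 'a AND b OR c,d'

# Split on any of the three alternatives.
parts = pattern.split(expr)

# List the separators that were found, in order.
separators = pattern.findall(expr)

print(parts, separators)
```

Note that the pattern also matches 'AND' or 'OR' inside longer words, so surrounding whitespace or word boundaries may need to be added in real use.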
(
Used to indicate the beginning of a capturing group.
)
Used to indicate the end of a capturing group.
Special Sequences
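All the special sequences described below can be included inside a character class, and they preserve their meaning. The equivalent classes listed are exact for bytes patterns or when re.ASCII is set; for str patterns, the sequences also match the corresponding Unicode characters by default (for example, \d also matches non-ASCII decimal digits).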
\d
Matches any decimal digit. It is equivalent to the class [0-9]
.
\D
Matches any non-digit character. It is equivalent to the class [^0-9]
.
\s
Matches any whitespace character. It is equivalent to the class [ \t\n\r\f\v]
.
\S
Matches any non-whitespace character. It is equivalent to the class [^ \t\n\r\f\v]
.
\w
Matches any alphanumeric character. It is equivalent to the class [a-zA-Z0-9_]
.
\W
Matches any non-alphanumeric character. It is equivalent to the class [^a-zA-Z0-9_]
.
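The special sequences can be tried out with a few one-liners. This is only a sketch with invented sample strings:

```python
import re

digits = re.findall(r"\d+", "tel 555-1234")
words = re.findall(r"\w+", "foo_bar baz!")
fields = re.split(r"\s+", "a  b\tc\nd")

# Special sequences keep their meaning inside a character class.
digit_or_space = re.findall(r"[\d\s]", "a1 b")

print(digits, words, fields, digit_or_space)
```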
NOT Metacharacters
The following characters are matched without any escaping:
{...}
Research this, { and } are metacharacters.
Patterns
At most one group of characters:
(...)?
Matching Modes
Before performing any match, compile the regular expression:
import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)')
Once there is a compiled regular expression, in this case referenced by the variable pattern, it can be used in the following modes:
match()
Determines if the regular expression matches at the beginning of the string. See Match a Pattern and Pick Up Groups below.
search()
Scans through a string looking for any location where this regular expression matches. See Scan a String below.
findall()
Finds all substrings where the regular expression matches and returns them as a list.
finditer()
Finds all substrings where the regular expression matches and returns them as an iterator.
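The four modes can be compared side by side on the same input. The pattern and text below are illustrative assumptions, not taken from the rest of this page:

```python
import re

pattern = re.compile(r"\d+")
text = "abc 12 de 345"

at_start = pattern.match(text)      # no digits at position 0, so None
first = pattern.search(text)        # first match anywhere in the string
all_hits = pattern.findall(text)    # list of matched substrings
spans = [m.span() for m in pattern.finditer(text)]  # match objects, lazily

print(at_start, first.group(0), all_hits, spans)
```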
Make the Regular Expression Case Insensitive
Optionally, matching can be made case insensitive by passing re.IGNORECASE as a parameter to re.compile():
import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)', re.IGNORECASE)
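The effect of the flag is easiest to see with a pattern that contains letters. A minimal sketch with a made-up pattern:

```python
import re

strict = re.compile(r"hello")
relaxed = re.compile(r"hello", re.IGNORECASE)

text = "Say HELLO"
strict_hit = strict.search(text)     # case differs, no match
relaxed_hit = relaxed.search(text)   # case is ignored

print(strict_hit, relaxed_hit.group(0))
```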
Match a Pattern and Pick Up Groups
import re
p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)
Groups are 1-based. group(0) represents the entire match.
Bug: when a regular expression like this one is used: '....()?()?' (two optional groups), and the last group is None, m.groups(one_based_last_group_index) throws IndexError. The solution was to retrieve the groups as a tuple before any evaluation, and use it for testing:
groups = m.groups()
if m.groups(1):
    ...
...
if groups[4]:
    ...
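For reference, here is a small sketch of how optional groups behave in the standard re module. It uses a hypothetical pattern, not the one from the bug described above, and is not meant to reproduce that bug. Note that the tuple returned by groups() is 0-based, so groups[0] corresponds to group(1).

```python
import re

# Hypothetical pattern with two optional groups (not from the original report).
p = re.compile(r"^(\w+)(:\w+)?(-\w+)?$")
m = p.match("abc")

groups = m.groups()          # 0-based tuple, one entry per group in the pattern
third = m.group(3)           # defined but unmatched: None, no exception

try:
    m.group(4)               # the pattern has only three groups
    raised = False
except IndexError:
    raised = True

print(groups, third, raised)
```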
Scan a String
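import re
# pattern and string are placeholders for your regular expression and input text
match = re.search(pattern, string)
if match:
    process(match)  # process() stands for whatever handling you need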
Iteratively Match a Pattern against a String
import re
pattern = re.compile(r'\$(\w+)')
s = 'a b $c, $d, $f___1 $a_long_var_name, and something else'
for m in re.finditer(pattern, s):
    print(m.start(), m.end(), "group:", m.group(1))
Will display:
4 6 group: c
8 10 group: d
12 18 group: f___1
19 35 group: a_long_var_name
The Match Object
start()
The index of the first character of the matched region in the scanned string.
end()
The index of the first character that follows the matched region in the scanned string.
group(index)
Returns the captured group with the given index. Group indexes are 1-based, and group(0) returns the entire match.
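The match object methods above can be exercised together. This sketch uses an invented input string, and span() is shown as a shorthand for the (start(), end()) pair:

```python
import re

text = "mail bob@example now"
m = re.search(r"(\w+)@(\w+)", text)

whole = m.group(0)
user, host = m.group(1), m.group(2)

# end() is exclusive, so slicing with start() and end() recovers the match.
sliced = text[m.start():m.end()]

print(m.span(), whole, user, host, sliced)
```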
Replacing Regular Expression Occurrences
import re
s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))
Strip quotes:
s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))
Capture groups and use them in the replacement:
s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2
To dynamically build a regular expression, use a raw f-string, rf'...':
s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2
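When the interpolated value may contain metacharacters, it is probably safer to pass it through re.escape() first. A minimal sketch, with a made-up value chosen to show the difference:

```python
import re

value = 'a.b'            # contains a metacharacter
text = 'a.b axb'

# Unescaped: the dot matches any character, so both words are replaced.
unescaped = re.sub(rf'{value}', 'X', text)

# Escaped: only the literal text 'a.b' is replaced.
safe = re.escape(value)
escaped = re.sub(rf'{safe}', 'X', text)

print(unescaped, '|', escaped)
```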
Match a new line:
r'\n'
Match not a new line:
r'[^\n]'
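Both newline patterns can be used to break text into lines. A short sketch with an invented input:

```python
import re

text = "one\ntwo\n"

# Runs of non-newline characters are the lines themselves.
lines = re.findall(r'[^\n]+', text)

# Splitting on the newline keeps the trailing empty string.
pieces = re.split(r'\n', text)

print(lines, pieces)
```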