Python Regular Expressions: Difference between revisions
Latest revision as of 19:36, 7 August 2023
External
Internal
TODO
- PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto
- One of the Python books chapter on regular expressions.
- PyOOP "Regular expressions" + "Matching patterns" + "Matching a selection of characters" + "Escaping characters" + "Matching multiple characters" + "Grouping patterns together" + "Getting information from regular expressions" + "Making repeated regular expressions efficient"
Overview
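A regular expression is usually written as a raw string literal, r"...", so that Python does not interpret its backslashes before the re module sees them.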
Metacharacters
The complete list of metacharacters:
. (dot)
It matches any character except a newline; in the alternate re.DOTALL mode it matches a newline as well. Use it where you want to match "any character".
To match an actual dot, escape it:
\.
However, the dot does not need to be escaped when it is part of a character class:
pattern = re.compile(r'[.,]')
The above pattern will match "." and "," and nothing else.
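As a small illustration of the dot behavior described above, the following sketch compares the default mode, re.DOTALL and an escaped dot. The sample strings are made up for the example.

```python
import re

text = "a\nb"

# By default the dot does not match the newline.
default_matches = re.findall(r".", text)

# With re.DOTALL the dot matches the newline as well.
dotall_matches = re.findall(r".", text, re.DOTALL)

# An escaped dot matches only a literal ".".
literal_dots = re.findall(r"\.", "v1.2.3")
```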
^
Complements a character set, which means it matches all characters not in the set. This is indicated by including ^ as the first character of the class. For example [^a] will match any character except "a". If the caret appears elsewhere in the class, it does not have a special meaning and it will represent itself.
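A minimal sketch of the complement behavior, using made-up sample strings: a leading caret negates the class, while a caret elsewhere in the class is an ordinary character.

```python
import re

# [^aeiou] matches every character that is NOT a lowercase vowel,
# so removing those matches leaves only the vowels.
only_vowels = re.sub(r"[^aeiou]", "", "regular")

# A caret that is not the first character of the class is a literal "^".
literal_caret = re.findall(r"[a^]", "a^b")
```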
$
*
Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.
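The "as many repetitions as are possible" part can be seen in a short sketch like the one below (sample strings are illustrative only):

```python
import re

# Zero or more "a": consumes all three leading "a" characters.
three = re.match(r"a*", "aaab").group(0)

# Zero repetitions is also a match, so this succeeds with an empty string.
empty = re.match(r"a*", "bbb").group(0)

# Greedy matching: .* runs to the LAST ">" in the string, not the first.
greedy = re.search(r"<.*>", "<a><b>").group(0)
```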
+
?
Place ? after a character (or a group) to match zero or one occurrence of it.
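For example, assuming we want to accept both spellings of a word, an optional character could be used as follows:

```python
import re

# "u?" allows zero or one "u", so both spellings match,
# but a doubled "u" does not.
spellings = re.findall(r"colou?r", "color colour colouur")
```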
{
{n}
Matches exactly n repetitions of the character that precedes it. This example matches two spaces:
r' {2}'
If {...} follows a group designated with parentheses (...) then group occurrences are counted. This example matches two lines:
r'(.+\n){2}'
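A quick check of the two patterns above, on invented input. Note that a repeated capturing group keeps only its last repetition:

```python
import re

# Exactly two spaces.
two_spaces = re.fullmatch(r" {2}", "  ")

# Exactly two lines, each ending in a newline.
m = re.match(r"(.+\n){2}", "one\ntwo\nthree\n")
matched_text = m.group(0)
last_repetition = m.group(1)
```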
}
[ (Character Class)
Used to specify the beginning of a character class, which is a set of characters you wish to match. Characters can be listed individually:
[abc]
or as a range, by using -:
[0-9]
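⚠️ Most metacharacters are not active inside classes: they are stripped of their special nature. For example [abc$.] will match 'a', 'b', 'c', '$' or '.'. The exceptions are the backslash, a leading ^, a - between two characters, and the closing ].
⚠️ To match "-" as part of the character class, place it at the end (or at the beginning) of the character class declaration, or escape it as \-: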
pattern = re.compile(r'[abc-]')
assert pattern.match('-')
]
Used to specify the end of a character class.
\
A backslash can be followed by various characters to signal special sequences. The backslash is also used to escape all metacharacters so they can be matched in patterns. To match [, prefix it with a backslash: \[.
|
Indicates an alternative between two or more regular sub-expressions: the entire regular expression matches if the first sub-expression matches, or the second, and so on. Alternatives are tried from left to right.
pattern = re.compile('AND|OR|,')
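As a sketch of how such an alternation might be used, the pattern above can split an expression on any of its separators. The input string is invented for the example:

```python
import re

pattern = re.compile('AND|OR|,')

# Every alternative acts as a separator; surrounding spaces are preserved.
parts = pattern.split("a AND b OR c,d")

# findall() returns the separators themselves, in order of appearance.
separators = pattern.findall("a AND b OR c,d")
```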
(
Used to indicate the beginning of a group capture.
If attempting to match a literal left parenthesis (, escape it as \( in a raw string: r'\('. In an ordinary, non-raw string literal use a double backslash escape \\(, otherwise you'll get an "invalid escape sequence \(" warning.
)
Used to indicate the end of a group capture.
If attempting to match a literal right parenthesis ), escape it as \) in a raw string: r'\)'. In an ordinary, non-raw string literal use a double backslash escape \\), otherwise you'll get an "invalid escape sequence \)" warning.
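The two escaping styles produce the same pattern text, as the following sketch shows. The sample input "call(arg)" is hypothetical.

```python
import re

# A raw string needs a single backslash; a regular string needs two.
same_pattern = (r"\(" == "\\(")

# Literal parentheses around a capturing group.
m = re.search(r"\((\w+)\)", "call(arg)")
argument = m.group(1)
```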
Special Sequences
All the special sequences described below can be included inside a character class, and they preserve their meaning. The class equivalences listed hold for ASCII-only matching (bytes patterns or the re.ASCII flag); for str patterns the sequences also match the corresponding Unicode characters.
\d
Matches any decimal digit. It is equivalent to the class [0-9].
\D
Matches any non-digit character. It is equivalent to the class [^0-9].
\s
Matches any whitespace character. It is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character. It is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character. It is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character. It is equivalent to the class [^a-zA-Z0-9_].
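A few of these sequences in action, on made-up inputs. The last line shows a special sequence used inside a character class:

```python
import re

digits = re.findall(r"\d+", "a1 b22 c333")
words = re.findall(r"\w+", "key_1: value-2")
fields = re.split(r"\s+", "a  b\tc\nd")

# \d keeps its meaning inside a class; the dot there is a literal dot.
versions = re.findall(r"[\d.]+", "v1.25 and 3.5")
```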
NOT Metacharacters
The following characters are matched without any escaping:
{...}
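Research this: { and } are metacharacters. They act as such when they form a valid quantifier like {2}; a brace that does not form one is matched literally, which is why r"{{color}}" works as a pattern.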
Patterns
At most one group of characters:
(...)?
Matching Modes
Before performing any match, compile the regular expression:
import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)')
Note that literal string interpolation can be used in combination with regular expression specification when compiling a pattern. To use literal string interpolation and regular expressions, combine 'r' and 'f' as follows:
word = 'blue'
pattern = re.compile(rf'^{word}$')
Once there is a compiled regular expression, in this case referred to via the variable pattern, it can be used in the following modes:
- match() determines if the regular expression matches at the beginning of the string. See Match a Pattern and Pick Up Groups below.
- search() scans through a string looking for any location where this regular expression matches. See Scan a String below.
- findall() finds all substrings where the regular expression matches and returns them as a list.
- finditer() finds all substrings where the regular expression matches and returns them as an iterator.
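The four modes listed above can be compared side by side on the same input. This is a minimal sketch with an invented pattern and text:

```python
import re

pattern = re.compile(r"\d+")
text = "ab 12 cd 345"

# match() only looks at the beginning of the string, which is "a" here.
at_start = pattern.match(text)

# search() finds the first match anywhere in the string.
first = pattern.search(text).group(0)

# findall() returns all matched substrings as a list.
all_numbers = pattern.findall(text)

# finditer() yields Match objects, which also carry positions.
spans = [m.span() for m in pattern.finditer(text)]
```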
Make the Regular Expression Case Insensitive
Optionally, the matching can be made case insensitive by passing re.IGNORECASE as a parameter to re.compile():
import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)', re.IGNORECASE)
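A short sketch of the effect of the flag, using a hypothetical word and input:

```python
import re

sensitive = re.compile(r"blue")
insensitive = re.compile(r"blue", re.IGNORECASE)

text = "Blue BLUE blue"
exact_case_only = sensitive.findall(text)
any_case = insensitive.findall(text)
```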
Match a Pattern and Pick Up Groups
import re
p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)
Groups are 1-based. group(0) represents the entire match.
Bug: when a regular expression like '....()?()?' (two optional groups) is used and the last group is None, m.groups(one_based_last_group_index) throws IndexError. The solution was to retrieve the groups as a tuple before any evaluation, and use the tuple for testing. Note that the tuple is indexed from 0, and an optional group that did not participate in the match is None:
groups = m.groups()
if m.groups(1):
    ...
...
if groups[4]:
    ...
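The sketch below shows how the standard re module treats optional groups, which may help when revisiting the bug above. The pattern is hypothetical, not the one from the original report. A group that exists but did not participate yields None, while asking for a group number the pattern does not define raises IndexError.

```python
import re

# Hypothetical pattern with two optional trailing groups.
p = re.compile(r"(\w+)(:\w+)?(-\w+)?")
m = p.match("abc:def")

all_groups = m.groups()       # tuple of all groups, indexed from 0
third = m.group(3)            # group 3 exists but did not participate

try:
    m.group(4)                # the pattern defines only three groups
    raised = False
except IndexError:
    raised = True

# The argument of groups() is a default for unmatched groups, not an index.
with_default = m.groups("")
```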
Scan a String
match = re.search(pattern, string)
if match:
    process(match)
Iteratively Match a Pattern against a String
import re
pattern = re.compile(r'\$(\w+)')
s = 'a b $c, $d, $f___1 $a_long_var_name, and something else'
for m in re.finditer(pattern, s):
    print(m.start(), m.end(), "group:", m.group(1))
Will display:
4 6 group: c
8 10 group: d
12 18 group: f___1
19 35 group: a_long_var_name
The Match Object
start()
The index of the first character of the matched region in the scanned string.
end()
The index of the first character that follows the matched region in the scanned string.
group(index)
Return the captured groups. Their index is 1-based, and group(0) returns the entire match.
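The relationship between start(), end() and group() can be sketched as follows, on an invented input. Slicing the scanned string with the two indices gives back the whole match:

```python
import re

text = "mail bob@example now"
m = re.search(r"(\w+)@(\w+)", text)

start, end = m.start(), m.end()
whole = m.group(0)
user, host = m.group(1), m.group(2)

# end() is exclusive, so it works directly as a slice bound.
sliced = text[start:end]
```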
Quick (without a Pattern instance) Searching
text = "..."
# to search for an occurrence anywhere within text
re.search(r'something', text) # returns None if no match was found, or the Match otherwise
# to match at the beginning of text only (use re.fullmatch() to require that the whole text matches)
re.match(r'something', text)
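The difference between the module-level functions is easy to get wrong, so here is a small sketch on a sample string: re.match() anchors at the beginning only, re.fullmatch() requires the entire string to match, and re.search() looks anywhere.

```python
import re

text = "something"

prefix = re.match(r"some", text)          # matches: "some" is at the beginning
not_full = re.fullmatch(r"some", text)    # None: "thing" is left over
full = re.fullmatch(r"some\w+", text)     # matches the entire string
anywhere = re.search(r"thing", text)      # matches in the middle
not_at_start = re.match(r"thing", text)   # None: not at the beginning
```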
Replacing Regular Expression Occurrences
re.sub(pattern, replacement, string, count=0, flags=0)
import re
s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))
Strip quotes:
s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))
Capture groups and use them in the replacement:
s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2
To dynamically build a regular expression, use rf'...':
s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2
Match a new line:
r'\n'
Match not a new line:
r'[^\n]'
Replace an entire line, up to and including its newline:
re.sub(r'chart_url: .*\n', 'chart_url: https://example.com/blah.tgz\n', TEST_APP_CONFIG)
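A few more re.sub() variations, sketched on made-up strings: backreferences in a raw replacement string, the count parameter, and a function as the replacement.

```python
import re

# Backreferences: a raw string avoids doubling the backslashes.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")

# count limits the number of replacements (0, the default, means all).
first_two = re.sub(r"a", "b", "aaa", count=2)

# The replacement can be a function that receives each Match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2), "3 apples, 10 pears")
```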
Constructing Regular Expressions Dynamically
If a part of a regular expression comes in a variable, simply add the strings together, while escaping the content of the variable:
some_var = "m_v"
regex = r"{{ *" + re.escape(some_var) + r" *}}"
assert '- blue -' == re.sub(regex, 'blue', "- {{ m_v }} -")
Also see literal string interpolation above.
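The sketch below illustrates, with hypothetical variable values, why escaping matters and one pitfall of the rf'...' form: inside an f-string, a literal brace must be doubled, so matching "{{" requires writing "{{{{".

```python
import re

name = "a.b"

# Without escaping, the dot in the variable acts as a metacharacter.
unescaped = re.compile(rf"^{name}$")
escaped = re.compile(rf"^{re.escape(name)}$")

loose = unescaped.match("axb")     # matches: "." accepted "x"
strict = escaped.match("axb")      # None: only a literal "." is accepted
literal = escaped.match("a.b")

# In an rf-string, "{{" yields ONE literal "{", so four are needed for two.
some_var = "m_v"
concatenated = r"{{ *" + re.escape(some_var) + r" *}}"
interpolated = rf"{{{{ *{re.escape(some_var)} *}}}}"
result = re.sub(interpolated, "blue", "- {{ m_v }} -")
```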
Use Cases
Matching Sequences across Lines
\n can be used to match a newline, and to build expressions that span lines.
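As an example, assuming a text made of "name"/"value" line pairs (an invented format), a single pattern containing \n can capture both lines of each pair:

```python
import re

text = "name: x\nvalue: 1\nname: y\nvalue: 2\n"

# The explicit \n ties the "name" line to the "value" line that follows it.
pairs = re.findall(r"name: (\w+)\nvalue: (\d+)", text)
```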