Python Regular Expressions: Difference between revisions

From NovaOrdis Knowledge Base
=Internal=
* [[Python_Language_String|Python Strings]]
* [[Python Module re|re]]
=TODO=
<font color=darkkhaki>
* PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto
* One of the [[Books#Python_Books|Python books]] chapter on regular expressions.
* [[PyOOP]] "Regular expressions" + "Matching patterns" + "Matching a selection of characters" + "Escaping characters" + "Matching multiple characters" + "Grouping patterns together" + "Getting information from regular expressions" + "Making repeated regular expressions efficient"
</font>
=Metacharacters=
{{External|https://docs.python.org/3/library/re.html#regular-expression-syntax}}
The complete list of metacharacters:
==<tt>. (dot)</tt>==
It matches any character except a newline; in the alternate mode (re.DOTALL) it matches a newline as well. Use it where you want to match "any character".
To match an actual dot, escape it:
<font size=-1>
\.
</font>
However, the dot does not need to be escaped when it is part of a [[#character_class|character class]]:
<syntaxhighlight lang='py'>
pattern = re.compile(r'[.,]')
</syntaxhighlight>
The above pattern will match "." and "," and nothing else.
==<tt>^</tt>==
Complements a character set, which means match all characters '''not''' in the set. This is indicated by including <code>^</code> as the '''first character''' of the class. For example <code>[^a]</code> will match everything except "a". If the caret appears elsewhere in the class, it does not have a special meaning and it represents itself. Outside a character class, <code>^</code> matches at the start of the string.
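==<tt>$</tt>==
Matches at the end of the string, or just before a newline that ends the string. In <code>re.MULTILINE</code> mode it also matches before each newline.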
==<tt>*</tt>==
Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.
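==<tt>+</tt>==
Causes the resulting regular expression to match 1 or more repetitions of the preceding regular expression, as many repetitions as are possible.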
==<tt>?</tt>==
Use the optional character ? after any character to specify zero or one occurrence of that character.
==<tt>{</tt>==
===<tt>{n}</tt>===
A defined number of repetitions of the character that precedes it.
<syntaxhighlight lang='py'>
r' {2}'
</syntaxhighlight>
If <code>{...}</code> follows a group designated with parentheses <code>(...)</code> then group occurrences are counted. This example matches two lines:
<syntaxhighlight lang='py'>
r'(.+\n){2}'
</syntaxhighlight>
==<tt>}</tt>==
==<span id='.5B'></span> <span id='character_class'></span><tt>[</tt> (Character Class)==
Used to specify the beginning of a '''character class''', which is a set of characters you wish to match. Characters can be listed individually:
<font size=-1>
[abc]
</font>
or as a range, by using <code>-</code>:
<font size=-1>
[0-9]
</font>
⚠️ Most metacharacters are not active inside classes, where they lose their special meaning. For example <code>[abc$.]</code> will match 'a', 'b', 'c', '$' or '.'.
⚠️ To match "-" as part of the character class, place it at the end of the character class declaration:
<syntaxhighlight lang='py'>
pattern = re.compile(r'[abc-]')
assert pattern.match('-')
</syntaxhighlight>
==<tt>]</tt>==
Used to specify the end of a character class.
==<tt>\</tt>==
A backslash can be followed by various characters to signal [[#Special_Sequences|special sequences]]. The backslash is also used to escape all metacharacters so they can be matched in patterns. To match <code>[</code>, prefix it with a backslash: <code>\[</code>.
==<tt>|</tt>==
Indicates an alternative between two or more regular sub-expressions: the entire regular expression matches if the first sub-expression matches, or the second sub-expression matches, etc. Alternatives are tried from left to right.
<syntaxhighlight lang='py'>
pattern = re.compile('AND|OR|,')
</syntaxhighlight>
==<tt>(</tt>==
Used to indicate the beginning of a group capture.
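To match a literal left parenthesis, escape it with a backslash: <code>\(</code>. In a raw string this is written <code>r'\('</code>. In a regular (non-raw) string literal the backslash must be doubled, <code>'\\('</code>, otherwise Python reports an "invalid escape sequence '\('" warning.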
==<tt>)</tt>==
Used to indicate the end of a group capture.
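To match a literal right parenthesis, escape it with a backslash: <code>\)</code>. In a raw string this is written <code>r'\)'</code>. In a regular (non-raw) string literal the backslash must be doubled, <code>'\\)'</code>, otherwise Python reports an "invalid escape sequence '\)'" warning.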
=Special Sequences=
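All the special sequences described below can be included inside a character class, and they preserve their meaning. The class equivalences listed below are exact for bytes patterns or when <code>re.ASCII</code> is set; for <code>str</code> patterns, which are Unicode-aware by default, <code>\d</code>, <code>\s</code> and <code>\w</code> also match the corresponding non-ASCII Unicode characters.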
==<tt>\d</tt>==
Matches any decimal digit. It is equivalent to the class <code>[0-9]</code>.
==<tt>\D</tt>==
Matches any non-digit character. It is equivalent to the class <code>[^0-9]</code>.
==<tt>\s</tt>==
Matches any whitespace character. It is equivalent to the class <code>[ \t\n\r\f\v]</code>.
==<tt>\S</tt>==
Matches any non-whitespace character. It is equivalent to the class <code>[^ \t\n\r\f\v]</code>.
==<tt>\w</tt>==
Matches any alphanumeric character. It is equivalent to the class <code>[a-zA-Z0-9_]</code>.
==<tt>\W</tt>==
Matches any non-alphanumeric character. It is equivalent to the class <code>[^a-zA-Z0-9_]</code>.
=NOT Metacharacters=
The following characters are matched without any escaping:
==<tt>{...}</tt>==
<font color=darkkhaki>Research this, { and } are metacharacters.</font>
=Patterns=
At most one group of characters:
<font size=-1>
(...)?
</font>
=Matching Modes=
Before performing any match, compile the regular expression:
<syntaxhighlight lang='python'>
import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)')
</syntaxhighlight>
<span id='Literal_String_Interpolation'></span>Note that [[Python_Language_String#F-String_.28Literal_String_Interpolation.29|literal string interpolation]] can be used in combination with regular expression specification when compiling a pattern. To use literal string interpolation and regular expressions, combine 'r' and 'f' as follows:
<syntaxhighlight lang='python'>
word = 'blue'
pattern = re.compile(rf'^{word}$')
</syntaxhighlight>
Once there is a compiled regular expression, in this case referred via the variable <code>pattern</code>, it can be used in the following modes:
* <code>[[#Match_a_Pattern_and_Pick_Up_Groups|match()]]</code> determine if the regular expression matches at the beginning of the string. See [[#Match_a_Pattern_and_Pick_Up_Groups|Match a Pattern and Pick Up Groups]] below.
* <code>[[#Scan_a_String|search()]]</code> scan through a string looking for any location where this regular expression matches. See [[#Scan_a_String|Scan a String]] below.
* <code>findall()</code> find all substrings where the regular expression matches and return them as a list.
* <code>[[#Iteratively_Match_a_Pattern_against_a_String|finditer()]]</code> find all substrings where the regular expression matches and return them as an [[Python_Language#Iterator|iterator]].
==Make the Regular Expression Case Insensitive==
Optionally, the matching can be made case insensitive by passing <code>re.IGNORECASE</code> as parameter to <code>re.compile()</code>:
<syntaxhighlight lang='python'>
import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)', re.IGNORECASE)
</syntaxhighlight>
==<span id='Match_Objects'>Match a Pattern and Pick Up Groups</span>==
<syntaxhighlight lang='python'>
import re
p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)
</syntaxhighlight>
Groups are 1-based. <code>group(0)</code> represents the entire expression.
<font color=darkkhaki>
'''Bug''': when a regular expression like this one is used: '....()?()?' (two optional groups), and the last group is None, m.groups(one_based_last_group_index) throws IndexError. The solution was to retrieve the groups as a tuple before any evaluation, and use it for testing:
<syntaxhighlight lang='python'>
groups = m.groups()
if m.groups(1):
  ...
...
if groups[4]:
  ...
</syntaxhighlight>
</font>
==Scan a String==
<syntaxhighlight lang='python'>
match = re.search(pattern, string)
if match:
    process(match)
</syntaxhighlight>
==Iteratively Match a Pattern against a String==
<syntaxhighlight lang='py'>
import re
pattern = re.compile(r'\$(\w+)')
s = 'a b $c, $d, $f___1 $a_long_var_name, and something else'
for m in re.finditer(pattern, s):
    print(m.start(), m.end(), "group:", m.group(1))
</syntaxhighlight>
Will display:
<font size=-1>
4 6 group: c
8 10 group: d
12 18 group: f___1
19 35 group: a_long_var_name
</font>
=The Match Object=
====<tt>start()</tt>====
The index of the first character of the matched region in the scanned string.
====<tt>end()</tt>====
The index of the first character that follows the matched region in the scanned string.
====<tt>group(index)</tt>====
Returns the captured group with the given index. Group indices are 1-based, and <code>group(0)</code> returns the entire match.
=Quick (without a Pattern instance) Searching=
<syntaxhighlight lang='python'>
import re
text = '...'
# search for an occurrence anywhere within text
re.search(r'something', text) # returns None if no match was found, or the Match otherwise
# match at the beginning of text only; use re.fullmatch() to match the entire string
re.match(r'something', text)
</syntaxhighlight>


=Replacing Regular Expression Occurrences=
{{External|https://docs.python.org/3/library/re.html#text-munging}}
<syntaxhighlight lang='python'>
re.sub(pattern, replacement, string, count=0, flags=0)
</syntaxhighlight>
<syntaxhighlight lang='python'>
import re

s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))
</syntaxhighlight>
Strip quotes:
<syntaxhighlight lang='python'>
s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))
</syntaxhighlight>
Capture groups and use them in the replacement:
<syntaxhighlight lang='python'>
s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2
</syntaxhighlight>
To dynamically build a regular expression, use <code>rf'...'</code>
<syntaxhighlight lang='python'>
s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2
</syntaxhighlight>
Match a new line:
<syntaxhighlight lang='python'>
r'\n'
</syntaxhighlight>
Match not a new line:
<syntaxhighlight lang='python'>
r'[^\n]'
</syntaxhighlight>
<syntaxhighlight lang='python'>
re.sub(r'chart_url: .*\n', 'chart_url: https://example.com/blah.tgz\n', TEST_APP_CONFIG)
</syntaxhighlight>
=Constructing Regular Expressions Dynamically=
If a part of a regular expression comes in a variable, simply add the strings together, while escaping the content of the variable:
<syntaxhighlight lang='python'>
some_var = "m_v"
regex = r"{{ *" + re.escape(some_var) + r" *}}"

assert '- blue -' == re.sub(regex, 'blue', "- {{ m_v }} -")
</syntaxhighlight>
Also see [[#Literal_String_Interpolation|literal string interpolation]] above.
=Use Cases=
==Matching Sequences across Lines==
<code>\n</code> can be used to match a newline, and therefore to write expressions that span lines.

Latest revision as of 19:36, 7 August 2023

External

Internal

TODO

  • PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto
  • One of the Python books chapter on regular expressions.
  • PyOOP "Regular expressions" + "Matching patterns" + "Matching a selection of characters" + "Escaping characters" + "Matching multiple characters" + "Grouping patterns together" + "Getting information from regular expressions" + "Making repeated regular expressions efficient"

Overview

A regular expression is usually specified as a raw string literal, r"...", so that backslashes reach the regular expression engine unchanged.

Metacharacters

https://docs.python.org/3/library/re.html#regular-expression-syntax

The complete list of metacharacters:

. (dot)

It matches any character except a newline; in the alternate mode (re.DOTALL) it matches a newline as well. Use it where you want to match "any character".

To match an actual dot, escape it:

\.

However, the dot does not need to be escaped when it is part of a character class:

pattern = re.compile(r'[.,]')

The above pattern will match "." and "," and nothing else.
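
A minimal sketch of the dot behaviours described above, using only the standard re module. The sample strings are made up for illustration.

```python
import re

# By default "." matches any character except a newline.
matches_letter = re.search(r'a.c', 'abc') is not None
matches_newline = re.search(r'a.c', 'a\nc') is not None

# With re.DOTALL the dot matches a newline as well.
matches_newline_dotall = re.search(r'a.c', 'a\nc', re.DOTALL) is not None

# An escaped dot matches only a literal ".".
literal_only = re.search(r'1\.5', '1x5') is None

# Inside a character class the dot is literal and needs no escape.
dots_and_commas = re.findall(r'[.,]', 'a.b,c;d')

assert matches_letter and not matches_newline
assert matches_newline_dotall and literal_only
assert dots_and_commas == ['.', ',']
```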

^

Complements a character set, which means match all characters not in the set. This is indicated by including ^ as the first character of the class. For example [^a] will match everything except "a". If the caret appears elsewhere in the class, it does not have a special meaning and it represents itself. Outside a character class, ^ matches at the start of the string.
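
The three roles of the caret can be checked with a short sketch. The input 'banana' is an arbitrary example, and the last two lines rely on the standard behaviour of ^ as a start-of-string anchor outside a class.

```python
import re

# "^" as the first character of the class negates it.
not_a = re.findall(r'[^a]', 'banana')

# "^" anywhere else in the class is a literal caret.
a_or_caret = re.findall(r'[a^]', 'a^b')

# Outside a class, "^" anchors the match at the start of the string.
anchored_hit = re.search(r'^ban', 'banana') is not None
anchored_miss = re.search(r'^nan', 'banana') is None

assert not_a == ['b', 'n', 'n']
assert a_or_caret == ['a', '^']
assert anchored_hit and anchored_miss
```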

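$

Matches at the end of the string, or just before a newline that ends the string. In re.MULTILINE mode it also matches before each newline.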

*

Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.

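+

Causes the resulting regular expression to match 1 or more repetitions of the preceding regular expression, as many repetitions as are possible.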

?

Use the optional character ? after any character to specify zero or one occurrence of that character.

{

{n}

A defined number of repetitions of the character that precedes it.

r' {2}'

If {...} follows a group designated with parentheses (...) then group occurrences are counted. This example matches two lines:

r'(.+\n){2}'
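
As an illustration of the quantifiers in this section, the sketch below applies each of them to small, made-up inputs.

```python
import re

# "*": zero or more repetitions, as many as possible (greedy).
star = re.match(r'ab*', 'abbbc').group(0)

# "+": one or more repetitions, so 'ac' does not match 'ab+'.
plus = re.match(r'ab+', 'ac')

# "?": zero or one occurrence of the preceding character.
optional_u = re.findall(r'colou?r', 'color colour')

# "{n}": exactly n repetitions of the preceding character.
two_spaces = re.search(r' {2}', 'a  b').span()

# "{n}" after a group counts occurrences of the whole group.
two_lines = re.match(r'(.+\n){2}', 'first\nsecond\nthird\n').group(0)

assert star == 'abbb'
assert plus is None
assert optional_u == ['color', 'colour']
assert two_spaces == (1, 3)
assert two_lines == 'first\nsecond\n'
```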

}

[ (Character Class)

Used to specify the beginning of a character class, which is a set of characters you wish to match. Characters can be listed individually:

[abc]

or as a range, by using -:

[0-9]

⚠️ Most metacharacters are not active inside classes, where they lose their special meaning. For example [abc$.] will match 'a', 'b', 'c', '$' or '.'.

⚠️ To match "-" as part of the character class, place it at the end of the character class declaration:

pattern = re.compile(r'[abc-]')
assert pattern.match('-')

]

Used to specify the end of a character class.

\

A backslash can be followed by various characters to signal special sequences. The backslash is also used to escape all metacharacters so they can be matched in patterns. To match [, prefix it with a backslash: \[.

|

Indicates an alternative between two or more regular sub-expressions: the entire regular expression matches if the first sub-expression matches, or the second sub-expression matches, etc. Alternatives are tried from left to right.

pattern = re.compile('AND|OR|,')
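
One possible use of the pattern above is splitting an expression on its separators. The input strings here are hypothetical.

```python
import re

pattern = re.compile('AND|OR|,')

# Split on any of the three alternatives, then trim the surrounding spaces.
parts = [p.strip() for p in pattern.split('a AND b OR c, d')]

# findall() returns the separators themselves, in order of appearance.
separators = pattern.findall('x AND y, z')

# Alternatives are tried left to right: the first one that matches wins.
first_wins = re.match('a|ab', 'ab').group(0)

assert parts == ['a', 'b', 'c', 'd']
assert separators == ['AND', ',']
assert first_wins == 'a'
```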

(

Used to indicate the beginning of a group capture.

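To match a literal left parenthesis, escape it with a backslash: \(. In a raw string this is written r'\('. In a regular (non-raw) string literal the backslash must be doubled, '\\(', otherwise Python reports an "invalid escape sequence '\('" warning.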

)

Used to indicate the end of a group capture.

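To match a literal right parenthesis, escape it with a backslash: \). In a raw string this is written r'\)'. In a regular (non-raw) string literal the backslash must be doubled, '\\)', otherwise Python reports an "invalid escape sequence '\)'" warning.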
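
The sketch below shows the two equivalent spellings of a pattern that matches literal parentheses, one as a raw string and one as a regular string with doubled backslashes. The text 'call(arg)' is only an example.

```python
import re

# Raw string: a single backslash reaches the regex engine.
raw = re.compile(r'\((\w+)\)')

# Regular string: each backslash has to be doubled.
regular = re.compile('\\((\\w+)\\)')

# Both spellings produce the same pattern text.
assert raw.pattern == regular.pattern

# The escaped parentheses are literal; the unescaped pair is a capture group.
inner = raw.search('call(arg)').group(1)
assert inner == 'arg'
```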

Special Sequences

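All the special sequences described below can be included inside a character class, and they preserve their meaning. The class equivalences listed below are exact for bytes patterns or when re.ASCII is set; for str patterns, which are Unicode-aware by default, \d, \s and \w also match the corresponding non-ASCII Unicode characters.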

\d

Matches any decimal digit. It is equivalent to the class [0-9].

\D

Matches any non-digit character. It is equivalent to the class [^0-9].

\s

Matches any whitespace character. It is equivalent to the class [ \t\n\r\f\v].

\S

Matches any non-whitespace character. It is equivalent to the class [^ \t\n\r\f\v].

\w

Matches any alphanumeric character. It is equivalent to the class [a-zA-Z0-9_].

\W

Matches any non-alphanumeric character. It is equivalent to the class [^a-zA-Z0-9_].
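
A short sketch of the special sequences on a made-up input. The last part assumes a str (Unicode) pattern and shows how re.ASCII restricts \d to [0-9].

```python
import re

text = 'id_42 = 3.14\tok'

digits = re.findall(r'\d+', text)
words = re.findall(r'\w+', text)      # "_" counts as a word character
non_space = re.findall(r'\S+', text)

# A special sequence keeps its meaning inside a character class.
hex_like = re.findall(r'[\da-f]+', 'ff 10 zz')

# For str patterns \d also matches non-ASCII decimal digits unless re.ASCII is set.
arabic_indic_three = '\u0663'
unicode_digit = re.fullmatch(r'\d', arabic_indic_three) is not None
ascii_digit = re.fullmatch(r'\d', arabic_indic_three, re.ASCII) is not None

assert digits == ['42', '3', '14']
assert words == ['id_42', '3', '14', 'ok']
assert non_space == ['id_42', '=', '3.14', 'ok']
assert hex_like == ['ff', '10']
assert unicode_digit and not ascii_digit
```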

NOT Metacharacters

The following characters are matched without any escaping:

{...}

Research this, { and } are metacharacters.

Patterns

At most one group of characters:

(...)?
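
For example, an optional group can make a prefix optional. This is a sketch with a hypothetical host name pattern; a group that did not take part in the match is reported as None.

```python
import re

pattern = re.compile(r'(www\.)?example\.com')

with_prefix = pattern.fullmatch('www.example.com')
without_prefix = pattern.fullmatch('example.com')

assert with_prefix.group(1) == 'www.'
assert without_prefix.group(1) is None
```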

Matching Modes

Before performing any match, compile the regular expression:

import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)')

Note that literal string interpolation can be used in combination with regular expression specification when compiling a pattern. To use literal string interpolation and regular expressions, combine 'r' and 'f' as follows:

word = 'blue'
pattern = re.compile(rf'^{word}$')
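
Two details are worth keeping in mind when combining 'r' and 'f'. The sketch below assumes the interpolated values ('a.b' and 'ab', both hypothetical) should be matched literally.

```python
import re

word = 'a.b'  # hypothetical value that contains a regex metacharacter

# Interpolated text is interpreted as regex syntax unless it is escaped.
unescaped = re.compile(rf'^{word}$')
escaped = re.compile(rf'^{re.escape(word)}$')

assert unescaped.match('axb')          # "." matched the "x"
assert escaped.match('axb') is None
assert escaped.match('a.b')

# Literal braces must be doubled in an f-string, quantifiers included.
unit = 'ab'
twice = re.compile(rf'^({unit}){{2}}$')
assert twice.pattern == '^(ab){2}$'
assert twice.match('abab')
```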


Once there is a compiled regular expression, in this case referred via the variable pattern, it can be used in the following modes:

  • match() determine if the regular expression matches at the beginning of the string. See Match a Pattern and Pick Up Groups below.
  • search() scan through a string looking for any location where this regular expression matches. See Scan a String below.
  • findall() find all substrings where the regular expression matches and return them as a list.
  • finditer() find all substrings where the regular expression matches and return them as an iterator.
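
The four modes listed above can be compared side by side. This is a minimal sketch on an arbitrary input string.

```python
import re

pattern = re.compile(r'\d+')
s = 'ab 12 cd 345'

at_start = pattern.match(s)                        # matches only at index 0
first = pattern.search(s)                          # first match anywhere
all_numbers = pattern.findall(s)                   # list of matched strings
spans = [m.span() for m in pattern.finditer(s)]    # iterator of Match objects

assert at_start is None
assert first.group(0) == '12'
assert all_numbers == ['12', '345']
assert spans == [(3, 5), (9, 12)]
```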

Make the Regular Expression Case Insensitive

Optionally, the matching can be made case insensitive by passing re.IGNORECASE as parameter to re.compile():

import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)', re.IGNORECASE)
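
The flag only makes a visible difference when the pattern contains literal letters, since \w already matches both cases. A small sketch with a made-up log line:

```python
import re

line = 'Error: ERROR, error'

case_insensitive = re.compile(r'error', re.IGNORECASE).findall(line)
case_sensitive = re.compile(r'error').findall(line)

assert case_insensitive == ['Error', 'ERROR', 'error']
assert case_sensitive == ['error']
```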

Match a Pattern and Pick Up Groups

import re

p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)

Groups are 1-based. group(0) represents the entire expression.

Bug: when a regular expression like this one is used: '....()?()?' (two optional groups), and the last group is None, m.groups(one_based_last_group_index) throws IndexError. The solution was to retrieve the groups as a tuple before any evaluation, and use it for testing:

groups = m.groups()
if m.groups(1):
  ...
...
if groups[4]:
  ...
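
The following sketch, with a hypothetical pattern that has two optional groups, shows how the standard library reports groups: an optional group that did not take part in the match is None, groups() accepts a default value rather than an index, and group() raises IndexError only for an index the pattern does not define.

```python
import re

p = re.compile(r'(\w+)(-\w+)?(-\w+)?$')
m = p.match('abc-def')

groups = m.groups()            # tuple of all groups, 1-based group n is groups[n - 1]
with_default = m.groups('')    # the argument replaces None, it is not an index

assert groups == ('abc', '-def', None)
assert m.group(3) is None      # defined in the pattern, but did not participate
assert with_default == ('abc', '-def', '')

try:
    m.group(4)                 # the pattern has only three groups
    raised = False
except IndexError:
    raised = True
assert raised
```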

Scan a String

match = re.search(pattern, string)
if match:
    process(match)

Iteratively Match a Pattern against a String

import re

pattern = re.compile(r'\$(\w+)')
s = 'a b $c, $d, $f___1 $a_long_var_name, and something else'
for m in re.finditer(pattern, s):
    print(m.start(), m.end(), "group:", m.group(1))

Will display:

4 6 group: c
8 10 group: d
12 18 group: f___1
19 35 group: a_long_var_name

The Match Object

start()

The index of the first character of the matched region in the scanned string.

end()

The index of the first character that follows the matched region in the scanned string.

group(index)

Returns the captured group with the given index. Group indices are 1-based, and group(0) returns the entire match.
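
A brief sketch tying the three methods together on a made-up string. start() and end() also accept a group index, which is standard Match behaviour rather than something stated above.

```python
import re

s = 'mail bob@example now'
m = re.search(r'(\w+)@(\w+)', s)

# start() is inclusive, end() is exclusive, so slicing gives the match back.
assert (m.start(), m.end()) == (5, 16)
assert s[m.start():m.end()] == m.group(0) == 'bob@example'

# Groups are 1-based; start() and end() can be asked about a single group.
assert m.group(1) == 'bob'
assert m.group(2) == 'example'
assert m.start(2) == 9
```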

Quick (without a Pattern instance) Searching

import re
text = '...'
# search for an occurrence anywhere within text
re.search(r'something', text) # returns None if no match was found, or the Match otherwise
# match at the beginning of text only; use re.fullmatch() to match the entire string
re.match(r'something', text)
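
The difference between the module-level functions can be seen on a small example. The text is hypothetical; re.fullmatch() is the function that requires the whole string to match.

```python
import re

text = 'something else'

found_anywhere = re.search(r'else', text) is not None
found_at_start = re.match(r'else', text) is not None
prefix_matches = re.match(r'something', text) is not None
whole_matches = re.fullmatch(r'something', text) is not None

assert found_anywhere
assert not found_at_start     # 'else' is not at the beginning
assert prefix_matches         # match() only requires a match at the start
assert not whole_matches      # fullmatch() requires the entire string
```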

Replacing Regular Expression Occurrences

https://docs.python.org/3/library/re.html#text-munging

re.sub(pattern, replacement, string, count=0, flags=0)

import re

s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))

Strip quotes:

s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))

Capture groups and use them in the replacement:

s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2
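
A few variations on group references in the replacement, as a sketch with made-up inputs: a raw replacement string avoids the doubled backslash, named groups are referenced with \g<name>, and count limits the number of substitutions.

```python
import re

# Raw replacement string: r'\1' is the same reference as '\\1'.
s2 = re.sub(r'^(this is).*$', r'\1 blue', 'this is red')

# Numbered groups can be reordered in the replacement.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'key=value')

# Named groups are referenced with \g<name>.
named = re.sub(r'(?P<first>\w+) (?P<second>\w+)', r'\g<second> \g<first>', 'hello world')

# count limits how many occurrences are replaced (0 means all).
limited = re.sub(r'a', 'X', 'banana', count=2)

assert s2 == 'this is blue'
assert swapped == 'value=key'
assert named == 'world hello'
assert limited == 'bXnXna'
```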

To dynamically build a regular expression, use rf'...'

s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2

Match a new line:

r'\n'

Match not a new line:

r'[^\n]'

re.sub(r'chart_url: .*\n', 'chart_url: https://example.com/blah.tgz\n', TEST_APP_CONFIG)

Constructing Regular Expressions Dynamically

If a part of a regular expression comes in a variable, simply add the strings together, while escaping the content of the variable:

some_var = "m_v"
regex = r"{{ *" + re.escape(some_var) + r" *}}"

assert '- blue -' == re.sub(regex, 'blue', "- {{ m_v }} -")

Also see literal string interpolation above.

Use Cases

Matching Sequences across Lines

\n can be used to match a newline, and therefore to write expressions that span lines.
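
A sketch of line-oriented matching on a hypothetical multi-line configuration text. The keys and values are invented for the example, and re.MULTILINE is an addition not mentioned above.

```python
import re

text = 'name: demo\nchart_url: old.tgz\nversion: 1\n'

# ".*" stops at the newline, so the pattern consumes exactly one line.
updated = re.sub(r'chart_url: .*\n', 'chart_url: new.tgz\n', text)

# An explicit \n lets a single pattern span two consecutive lines.
m = re.search(r'name: (\w+)\nchart_url: (\S+)', text)

# With re.MULTILINE, "^" matches at the start of every line.
keys = re.findall(r'^(\w+):', text, re.MULTILINE)

assert updated == 'name: demo\nchart_url: new.tgz\nversion: 1\n'
assert m.groups() == ('demo', 'old.tgz')
assert keys == ['name', 'chart_url', 'version']
```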