Revision as of 18:14, 16 April 2022

External

Internal

Python Strings

TODO

PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto

Overview

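A regular expression pattern is usually written as a raw string literal, r"...", so that backslashes reach the re module unchanged instead of being interpreted as Python string escapes.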

Metacharacters

https://docs.python.org/3/library/re.html#regular-expression-syntax

The complete list of metacharacters:

`. (dot)`

Matches any character except a newline. In the alternate mode re.DOTALL it matches a newline as well. Use it where you want to match “any character”.

To match an actual dot, escape it:

\.
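The three behaviors described above (default dot, dot under re.DOTALL, escaped dot) can be compared in a small sketch. The sample strings are made up for illustration.

```python
import re

text = "a\nb"

# By default, "." does not match the newline between "a" and "b".
default_match = re.search(r"a.b", text)

# With re.DOTALL, "." also matches the newline.
dotall_match = re.search(r"a.b", text, re.DOTALL)

# An escaped dot matches only a literal "." character.
literal_dots = re.findall(r"\.", "v1.2.3")

print(default_match, dotall_match.group(0) == text, literal_dots)
```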

`^`

Inside a character class, ^ complements the set, which means it matches all characters not in the set. This is indicated by including ^ as the first character of the class. For example, [^a] will match any character except "a". If the caret appears elsewhere in the class, it has no special meaning and represents itself. Outside a character class, ^ matches at the start of the string.
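A short sketch of the caret in its different positions. The last case is included for contrast: outside a class, the caret anchors the match at the start of the string. The input "banana" is an arbitrary example.

```python
import re

# A leading ^ inside a class complements the set: everything except "a".
not_a = re.findall(r"[^a]", "banana")

# Elsewhere in the class, ^ is an ordinary character.
caret_or_a = re.findall(r"[a^]", "a^b")

# Outside a class, ^ anchors at the start of the string, so "na" is not found.
anchored = re.search(r"^na", "banana")

print(not_a, caret_or_a, anchored)
```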

`$`

`*`

Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.
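As an illustration of "zero or more, as many as possible", the sketch below uses made-up inputs to show both the zero-repetition case and the greedy behavior.

```python
import re

# "ab*" is "a" followed by zero or more "b" characters.
zero_reps = re.match(r"ab*", "a").group(0)
many_reps = re.match(r"ab*", "abbbc").group(0)

# "*" is greedy: ".*" consumes as much as it can before the final ">".
greedy = re.match(r"<.*>", "<a><b>").group(0)

print(zero_reps, many_reps, greedy)
```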

`+`

`?`

`{`

`}`

`[`

Used to specify the beginning of a character class, which is a set of characters you wish to match. Characters can be listed individually:

[abc]

or as a range, by using -:

[0-9]

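⚠️ Most metacharacters are not active inside classes: they are stripped of their special nature. For example, [abc$] will match 'a' or 'b' or 'c' or '$'. The exceptions are \ (still an escape), ^ as the first character, - between two characters, and ].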
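The class forms above (individual characters, a range, and a metacharacter used literally) can be tried out as follows. The input strings are arbitrary.

```python
import re

# Characters listed individually.
vowels = re.findall(r"[aeiou]", "regular")

# A range of characters.
digits = re.findall(r"[0-9]", "a1b22")

# "$" has no special meaning inside the class and matches itself.
dollar = re.findall(r"[abc$]", "a$d")

print(vowels, digits, dollar)
```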

`]`

Used to specify the end of a character class.

`\`

A backslash can be followed by various characters to signal special sequences. The backslash is also used to escape metacharacters so they can be matched literally in patterns. To match [, prefix it with a backslash: \[.

`|`

Indicates an alternative between two or more regular sub-expressions: the entire regular expression matches if the first sub-expression matches, or the second sub-expression matches, and so on. Alternatives are tried from left to right.

pattern = re.compile('AND|OR|,')
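One possible use of this pattern, sketched on a made-up input string, is to split an expression on any of the three separators.

```python
import re

pattern = re.compile('AND|OR|,')

# Each alternative acts as a separator.
separators = pattern.findall('a AND b OR c,d')
parts = pattern.split('a AND b OR c,d')

# The pattern does not consume surrounding spaces, so strip them afterwards.
tokens = [p.strip() for p in parts]

print(separators, tokens)
```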

`(`

Used to indicate the beginning of a capturing group.

`)`

Used to indicate the end of a capturing group.

Special Sequences

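All the special sequences described below can be included inside a character class, and they preserve their meaning. The equivalent classes listed for each sequence hold for bytes patterns or when re.ASCII is set. In str (Unicode) patterns, \d, \s and \w also match the corresponding non-ASCII digits, whitespace and word characters.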

`\d`

Matches any decimal digit. It is equivalent to the class [0-9].

`\D`

Matches any non-digit character. It is equivalent to the class [^0-9].

`\s`

Matches any whitespace character. It is equivalent to the class [ \t\n\r\f\v].

`\S`

Matches any non-whitespace character. It is equivalent to the class [^ \t\n\r\f\v].

`\w`

Matches any word character, that is, any alphanumeric character or the underscore. It is equivalent to the class [a-zA-Z0-9_].

`\W`

Matches any non-word character, that is, anything other than an alphanumeric character or the underscore. It is equivalent to the class [^a-zA-Z0-9_].
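A brief sketch of the special sequences, on an ASCII-only sample string chosen for illustration. The last line shows a sequence used inside a character class.

```python
import re

s = "id_7 = 42\n"

digits = re.findall(r"\d", s)        # decimal digits
words = re.findall(r"\w+", s)        # runs of word characters, underscore included
spaces = re.findall(r"\s", s)        # whitespace, including the trailing newline
in_class = re.findall(r"[\d=]", s)   # \d keeps its meaning inside a class

print(digits, words, spaces, in_class)
```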

NOT Metacharacters

The following characters are matched without any escaping:

`{...}`

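{ and } are metacharacters: they delimit repetition counts such as a{3} or a{2,4}. Python's re module treats a brace that does not form a valid repetition count as a literal character, which is why a pattern like {{color}} matches without escaping. To be explicit, escape a literal brace as \{ or \}.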

Patterns

At most one group of characters:

(...)?
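A minimal sketch of an optional group. The pattern and the "build-42" inputs are hypothetical, chosen only to show that a group that did not take part in the match is reported as None.

```python
import re

# The group "(-\d+)" may appear zero or one time.
pattern = re.compile(r"^(\w+)(-\d+)?$")

with_suffix = pattern.match("build-42").groups()
without_suffix = pattern.match("build").groups()

print(with_suffix, without_suffix)
```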

Matching Modes

Before performing any match, compile the regular expression:

import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)')

Once there is a compiled regular expression, in this case referenced by the variable pattern, it can be used in the following modes:

match() determines whether the regular expression matches at the beginning of the string. See Match a Pattern and Pick Up Groups below.
search() scans through a string looking for the first location where the regular expression matches. See Scan a String below.
findall() finds all non-overlapping substrings where the regular expression matches and returns them as a list.
finditer() finds all non-overlapping substrings where the regular expression matches and returns them as an iterator of match objects.
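The four modes can be compared side by side on the same input. This is a sketch with a simple digits pattern and a made-up string, not the pattern compiled above.

```python
import re

pattern = re.compile(r"\d+")
s = "a1 b22 c333"

at_start = pattern.match(s)       # the string does not start with a digit
first = pattern.search(s)         # first match anywhere in the string
all_matches = pattern.findall(s)  # list of matched substrings
spans = [m.span() for m in pattern.finditer(s)]  # (start, end) of every match

print(at_start, first.group(0), all_matches, spans)
```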

Make the Regular Expression Case Insensitive

Optionally, the matching can be made case insensitive by passing re.IGNORECASE as parameter to re.compile():

import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)', re.IGNORECASE)
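The pattern above consists only of \w classes, so the flag has no visible effect on it. A sketch with a literal word, chosen here purely for illustration, makes the difference observable.

```python
import re

s = "Error: ERROR, error"

case_sensitive = re.compile(r"error").findall(s)
case_insensitive = re.compile(r"error", re.IGNORECASE).findall(s)

print(case_sensitive, case_insensitive)
```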

Match a Pattern and Pick Up Groups

import re

p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)

Groups are 1-based. group(0) represents the entire match.

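Pitfall: when a regular expression ends in optional groups, like '....()?()?', a group that did not take part in the match is None. m.group(n) returns None for such a group, and raises IndexError only when n is larger than the number of groups defined in the pattern. One way to handle this is to retrieve the groups as a tuple before any evaluation, and use the tuple for testing. Note that the tuple is 0-based, so groups[4] corresponds to group(5):

groups = m.groups()
if groups[0]:
    ...
if groups[4]:
    ...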
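The behavior of optional groups can be checked with a small sketch. The pattern below is a hypothetical variant with two optional groups, not the one from the original report.

```python
import re

p = re.compile(r"^(\w+)(:\w+)?(-\w+)?$")
m = p.match("abc:mnp")

# Unmatched optional groups appear as None in the tuple.
groups = m.groups()

# group(n) also returns None for a defined group that did not match.
third = m.group(3)

# IndexError is raised only for a group the pattern does not define.
try:
    m.group(4)
    raised = False
except IndexError:
    raised = True

# groups(default) substitutes a default for unmatched groups.
with_default = m.groups("")

print(groups, third, raised, with_default)
```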

Scan a String

match = re.search(pattern, string)
if match:
    process(match)
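A self-contained version of the same idea, assuming a hypothetical helper first_number() in place of process(), could look like this.

```python
import re

def first_number(string):
    """Return the first run of digits in string as an int, or None."""
    match = re.search(r"\d+", string)
    if match:
        return int(match.group(0))
    return None

found = first_number("order 42 shipped")
missing = first_number("no digits here")
print(found, missing)
```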

Iteratively Match a Pattern against a String

import re

pattern = re.compile(r'\$(\w+)')
s = 'a b $c, $d, $f___1 $a_long_var_name, and something else'
for m in re.finditer(pattern, s):
    print(m.start(), m.end(), "group:", m.group(1))

Will display:

4 6 group: c
8 10 group: d
12 18 group: f___1
19 35 group: a_long_var_name

The Match Object

`start()`

The index of the first character of the matched region in the scanned string.

`end()`

The index of the first character that follows the matched region in the scanned string.

`group(index)`

Returns the captured group with the given index. The index is 1-based, and group(0) returns the entire match.
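The relationship between start(), end() and group() can be seen in a short sketch. The input string is made up. start() and span() also accept a group index, which is shown on the last lines.

```python
import re

s = "set key=value"
m = re.search(r"(\w+)=(\w+)", s)

# Slicing the scanned string with start() and end() yields the whole match.
whole = s[m.start():m.end()]

key, value = m.group(1), m.group(2)

# start() and span() can also be asked about an individual group.
value_start = m.start(2)
value_span = m.span(2)

print(m.start(), m.end(), whole, key, value, value_span)
```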

Replacing Regular Expression Occurrences

https://docs.python.org/3/library/re.html#text-munging

import re

s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))

Strip quotes:

s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))

Capture groups and use them in the replacement:

s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2
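The replacement can reference several groups. Writing it as a raw string avoids doubling the backslash. The sketch below uses a made-up date string, and also shows named groups with the \g<name> form.

```python
import re

s = "2022-04-16"

# Numbered references in a raw replacement string.
swapped = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1", s)

# Named groups are referenced as \g<name>.
named = re.sub(r"(?P<y>\d+)-(?P<m>\d+)-(?P<d>\d+)", r"\g<d>.\g<m>.\g<y>", s)

print(swapped, named)
```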

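To dynamically build a regular expression, use rf'...'. If the interpolated value should be matched literally, pass it through re.escape() first. Literal braces in the pattern, such as those of a repetition count, must be doubled inside an f-string.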

s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2
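When the interpolated value may contain metacharacters, re.escape() keeps them literal. A sketch with made-up values, which also shows the doubled braces needed for a repetition count inside an f-string:

```python
import re

needle = "3.50"

# Unescaped, the "." in the interpolated value matches any character.
unescaped = re.sub(rf"{needle}", "X", "3x50")

# re.escape() makes the value match only itself.
escaped = re.sub(rf"{re.escape(needle)}", "X", "3x50")
replaced = re.sub(rf"{re.escape(needle)}", "X", "price is 3.50 (approx)")

# In an f-string, a literal {2} must be written as {{2}}.
prefix = "id-"
two_digits = re.findall(rf"{re.escape(prefix)}\d{{2}}", "id-12 id-345")

print(unescaped, escaped, replaced, two_digits)
```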

Match a new line:

r'\n'

Match not a new line:

r'[^\n]'
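Both patterns can be tried on a two-line string. In this sketch the complemented class is combined with + so that each line is returned whole.

```python
import re

s = "line one\nline two"

newlines = re.findall(r"\n", s)
lines = re.findall(r"[^\n]+", s)

print(newlines, lines)
```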

Python Regular Expressions: Difference between revisions