Python Regular Expressions: Difference between revisions

From NovaOrdis Knowledge Base

Revision as of 01:49, 25 March 2022

External

Internal

TODO

PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto

Overview

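A regular expression is usually specified as a raw string literal, r"...", so that Python does not interpret the backslashes before the re module sees them.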

Metacharacters

https://docs.python.org/3/library/re.html#regular-expression-syntax

The complete list of metacharacters:

. (dot)

It matches any character except a newline, and there’s an alternate mode (re.DOTALL) where it will match even a newline. Used where you want to match “any character”.

To match an actual dot, escape it:

\.
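A minimal sketch of the three behaviors described above; the sample strings are made up for illustration:

```python
import re

text = "ab\ncd"

# By default the dot stops at a newline, so only the first line is matched.
default_match = re.match(r".+", text).group(0)

# With re.DOTALL the dot also matches the newline.
dotall_match = re.match(r".+", text, re.DOTALL).group(0)

# An escaped dot matches only a literal dot.
literal_dots = re.findall(r"\.", "a.b.c")
```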

^

Outside a character class, ^ matches at the beginning of the string. Inside a character class, it complements the set, which means match all characters not in the set. This is indicated by including ^ as the first character of the class. For example [^a] will match any character except "a". If the caret appears elsewhere in the class, it does not have a special meaning and it will represent itself.
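The caret behaves differently depending on where it appears. The sketch below, using made-up sample strings, shows a complemented class, a literal caret inside a class, and the caret used outside a class, where it anchors the match at the beginning of the string:

```python
import re

# [^a] matches any single character that is not "a".
not_a = re.findall(r"[^a]", "banana")

# A caret that is not the first character of the class is a literal caret.
literal_caret = re.findall(r"[a^]", "a^b")

# Outside a class, the caret anchors the match at the start of the string.
anchored_hit = re.findall(r"^b", "banana")
anchored_miss = re.findall(r"^a", "banana")
```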

$

*

Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.
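A short illustration of "zero or more, as many as possible", using made-up sample strings:

```python
import re

# The star consumes as many repetitions as it can.
greedy = re.match(r"a*", "aaab").group(0)

# Zero repetitions is still a successful (empty) match.
empty = re.match(r"a*", "bbb").group(0)

# Greediness means .* runs to the last ">" rather than the first one.
widest = re.search(r"<.*>", "<b>bold</b>").group(0)
```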

+

?

{

}

[

Used to specify the beginning of a character class, which is a set of characters you wish to match. Characters can be listed individually:

[abc]

or as a range, by using -:

[0-9]

⚠️ Most metacharacters are not active inside classes - they're stripped of their special nature. For example [abc$] will match 'a' or 'b' or 'c' or '$'. The exceptions are ^ as the first character, - between two characters, ] and \.
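The class forms above can be tried out as follows; the input strings are illustrative only:

```python
import re

# Characters listed individually.
listed = re.findall(r"[abc]", "a-b-z")

# A range of characters.
digits = re.findall(r"[0-9]", "x1y22")

# "$" loses its special meaning inside a class and matches a literal dollar sign.
with_dollar = re.findall(r"[abc$]", "c$d")
```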

]

Used to specify the end of a character class.

\

A backslash can be followed by various characters to signal special sequences. The backslash is also used to escape metacharacters so they can be matched literally in patterns. To match [, prefix it with a backslash: \[.
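A small sketch of escaping metacharacters with a backslash; the sample strings are invented for the example:

```python
import re

# Escaped brackets match literal "[" and "]"; \d is a special sequence.
bracketed = re.search(r"\[\d+\]", "item[42]").group(0)

# An escaped star matches a literal asterisk.
stars = re.findall(r"\*", "2*3*4")

# A literal backslash is matched by escaping the backslash itself.
backslash = re.search(r"\\", "a\\b").group(0)
```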

|

(

Used to indicate the beginning of a capturing group.

)

Used to indicate the end of a capturing group.

Special Sequences

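All the special sequences described below can be included inside a character class, and they preserve their meaning. The equivalent classes listed for each sequence are exact for bytes patterns or when the re.ASCII flag is used; for Unicode string patterns, \d, \s and \w also match the corresponding non-ASCII digits, whitespace and word characters.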

\d

Matches any decimal digit. It is equivalent to the class [0-9].

\D

Matches any non-digit character. It is equivalent to the class [^0-9].

\s

Matches any whitespace character. It is equivalent to the class [ \t\n\r\f\v].

\S

Matches any non-whitespace character. It is equivalent to the class [^ \t\n\r\f\v].

\w

Matches any alphanumeric character. It is equivalent to the class [a-zA-Z0-9_].

\W

Matches any non-alphanumeric character. It is equivalent to the class [^a-zA-Z0-9_].
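The sequences above can be compared side by side. This is a minimal sketch on ASCII-only sample strings chosen for illustration:

```python
import re

digits = re.findall(r"\d", "a1 b2")
non_digits = re.findall(r"\D", "a1")
whitespace = re.findall(r"\s", "a b\tc")
words = re.findall(r"\w+", "foo_1, bar!")
non_word = re.findall(r"\W", "a-b")

# Special sequences keep their meaning inside a character class.
digit_or_space = re.findall(r"[\d\s]", "a1 b")
```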

NOT Metacharacters

The following characters are matched without any escaping:

{...}

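{ and } act as metacharacters only when they form a valid repetition qualifier, such as {3} or {2,5}. A brace sequence that does not form a valid qualifier, such as {{color}}, is matched literally. Escaping the braces (\{ and \}) makes the intent explicit.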
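The behavior of braces can be checked directly. The sketch below assumes the standard re module; the strings are illustrative. Braces that form a repetition qualifier act as a metacharacter, while braces that do not form one are matched as literal characters, with or without escaping:

```python
import re

# {2,3} is a repetition qualifier: two or three "a" characters.
quantified = re.fullmatch(r"a{2,3}", "aaa") is not None
too_short = re.fullmatch(r"a{2,3}", "a") is None

# Escaped braces are always literal.
escaped = re.search(r"\{\{color\}\}", "a {{color}} car").group(0)

# Unescaped braces that do not form a qualifier are also treated literally.
unescaped = re.search(r"{{color}}", "a {{color}} car").group(0)
```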

Patterns

At most one group of characters:

(...)?
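For instance, an optional group lets a prefix appear zero times or once, but not twice. The words below are made up for the example:

```python
import re

pattern = re.compile(r"(un)?happy")

# The group is absent: still a match.
absent = pattern.fullmatch("happy") is not None

# The group is present once: a match.
present = pattern.fullmatch("unhappy") is not None

# The group cannot repeat: no match.
repeated = pattern.fullmatch("ununhappy") is not None
```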

Matching Modes

Before performing any match, compile the regular expression:

import re
pattern = re.compile(r'(\w+):(\w+)-(\w+)')

Once there is a compiled regular expression, in this case referred via the variable pattern, it can be used in the following modes:

  • match() determines if the regular expression matches at the beginning of the string. See Match a Pattern and Pick Up Groups below.
  • search() scans through a string looking for the first location where this regular expression matches. See Scan a String below.
  • findall() finds all non-overlapping substrings where the regular expression matches and returns them as a list.
  • finditer() finds all non-overlapping substrings where the regular expression matches and returns the match objects as an iterator.
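The four modes can be compared on the same input. This is a minimal sketch; the pattern and text are chosen only for illustration:

```python
import re

pattern = re.compile(r"\d+")
text = "a1 b22 c333"

# match() only looks at the beginning of the string, which is "a" here.
at_start = pattern.match(text)

# search() returns the first match found anywhere in the string.
first = pattern.search(text).group(0)

# findall() returns all matched substrings as a list.
all_matches = pattern.findall(text)

# finditer() yields match objects one at a time.
from_iterator = [m.group(0) for m in pattern.finditer(text)]
```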

Match a Pattern and Pick Up Groups

import re

p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)

Groups are 1-based. group(0) represents the entire match.

Caveat: when a regular expression contains optional groups, for example '....()?()?' (two optional groups), a group that did not participate in the match is reported as None, and m.group(n) raises IndexError when n is larger than the number of groups defined in the pattern. The solution was to retrieve the groups as a tuple before any evaluation, and use it for testing. The tuple is 0-based, so groups[4] holds group 5:

groups = m.groups()
if groups[0]:
    ...
...
if groups[4]:
    ...
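A minimal sketch of how optional groups show up on the match object. The pattern and input are hypothetical, chosen so that the middle group does not participate in the match:

```python
import re

p = re.compile(r"(\w+)(:\w+)?(-\w+)?")
m = p.fullmatch("abc-xyz")

# groups() returns a 0-based tuple; a group that did not participate is None.
groups = m.groups()

# group(n) returns None for a defined group that did not participate.
second = m.group(2)

# group(n) raises IndexError only for a group the pattern does not define.
try:
    m.group(4)
    raised = False
except IndexError:
    raised = True

# groups(default) substitutes a default for the groups that did not participate.
with_default = m.groups("")
```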

Scan a String

match = re.search(pattern, string)
if match:
    process(match)
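A concrete version of the scan, as a sketch with a made-up pattern and input. search() returns the first match anywhere in the string, or None when nothing matches:

```python
import re

text = "contact: bob@example.com today"
m = re.search(r"(\w+)@(\w+)\.com", text)

found = m.group(0) if m else None
user = m.group(1) if m else None
domain = m.group(2) if m else None

# No match yields None rather than raising an exception.
missing = re.search(r"\d+", "no digits here")
```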

Iteratively Match a Pattern against a String
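One way to do this is with finditer(), which yields one match object per non-overlapping match, in order. The sketch below uses a made-up key=value string:

```python
import re

text = "k1=v1;k2=v2"

pairs = []
for m in re.finditer(r"(\w+)=(\w+)", text):
    # Each iteration provides a full match object, so groups are available.
    pairs.append((m.group(1), m.group(2)))
```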

The Match Object

start()

The index of the first character of the matched region in the scanned string.

end()

The index of the first character that follows the matched region in the scanned string.
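The two indices can be used to slice the matched region out of the scanned string. A minimal sketch, with an invented sample string:

```python
import re

text = "price: 1250 USD"
m = re.search(r"\d+", text)

start = m.start()   # index of the first matched character
end = m.end()       # index just past the last matched character

# Slicing with start and end reproduces the matched text.
matched = text[start:end]

# span() returns both indices as a tuple.
span = m.span()
```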

Replacing Regular Expression Occurrences

https://docs.python.org/3/library/re.html#text-munging
import re

s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))

Strip quotes:

s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))

Capture groups and use them in the replacement:

s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2
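Group references in the replacement can also reorder the captured text, and named groups can be referenced with \g<name>. A short sketch, using a date string made up for the example:

```python
import re

s = "2022-03-25"

# Numbered references; a raw string avoids doubling the backslashes.
by_number = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1", s)

# Named groups referenced with \g<name>.
by_name = re.sub(
    r"(?P<y>\d+)-(?P<m>\d+)-(?P<d>\d+)",
    r"\g<d>/\g<m>/\g<y>",
    s,
)
```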

To build a regular expression dynamically, combine the raw and f-string prefixes, rf'...':

s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2
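When the interpolated value may contain metacharacters, it is safer to pass it through re.escape() first. The sketch below uses a hypothetical needle containing a dot to show the difference:

```python
import re

needle = "1.50"
text = "1x50 and 1.50"

# Without escaping, the dot in the needle matches any character.
unescaped = re.sub(rf"{needle}", "X", text)

# With re.escape(), the needle is matched literally.
escaped = re.sub(rf"{re.escape(needle)}", "X", text)
```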

Match a new line:

r'\n'

Match not a new line:

r'[^\n]'
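Both patterns in use, as a minimal sketch on a two-line sample string:

```python
import re

text = "first\nsecond"

# Count the newline characters.
newline_count = len(re.findall(r"\n", text))

# Runs of non-newline characters correspond to the individual lines.
lines = re.findall(r"[^\n]+", text)

# Splitting on the newline pattern gives the same result here.
split_lines = re.split(r"\n", text)
```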