Revision as of 01:19, 25 March 2022

External

Internal

Python Strings

TODO

PROCESS: https://docs.python.org/3/howto/regex.html#regex-howto

Overview

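A regular expression pattern is usually written as a raw string, r"...", so that backslashes reach the re module unchanged instead of being interpreted as Python string escapes.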
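A minimal sketch of why the raw-string prefix matters. The sample text and variable names are illustrative assumptions, not part of the original notes:

```python
import re

# In a normal string literal, "\b" is a backspace character (ASCII 8).
# In a raw string it stays as backslash + "b", which re reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

text = "a cat sat"
plain_match = re.search(plain, text)  # searches for literal backspace characters
raw_match = re.search(raw, text)      # searches for the whole word "cat"

print(plain_match)         # None
print(raw_match.group(0))  # cat
```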

Metacharacters

https://docs.python.org/3/library/re.html#regular-expression-syntax

The complete list of metacharacters:

`. (dot)`

Stands for "any one character". To match an actual dot, escape it:

\.
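As a small illustration (the sample strings are assumptions chosen for the demo), the unescaped dot matches any single character except a newline by default, while the escaped dot matches only a literal dot:

```python
import re

sample = "abc a.c a-c"

any_char = re.findall(r"a.c", sample)      # dot stands for any one character
literal_dot = re.findall(r"a\.c", sample)  # escaped dot matches only "."

# By default the dot does not match a newline.
across_newline = re.search(r"a.c", "a\nc")

print(any_char)        # ['abc', 'a.c', 'a-c']
print(literal_dot)     # ['a.c']
print(across_newline)  # None
```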

`^`

Inside a character class, complements the set: the class matches every character not listed. This is indicated by placing ^ as the first character of the class. For example, [^a] matches any character except "a". If the caret appears elsewhere in the class, it has no special meaning and represents itself. Outside a character class, ^ matches at the start of the string.
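The following sketch shows the three roles of the caret. The sample strings are illustrative assumptions; the last line relies on the fact that, outside a class, ^ anchors the match at the start of the string:

```python
import re

# Caret first in the class: complement, match everything except "a".
not_a = re.findall(r"[^a]", "banana")

# Caret elsewhere in the class: an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")

# Caret outside a class: anchors the match at the start of the string.
anchored = re.findall(r"^b", "banana bread")

print(not_a)          # ['b', 'n', 'n']
print(caret_literal)  # ['a', '^']
print(anchored)       # ['b']
```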

`$`

`*`

Causes the resulting regular expression to match 0 or more repetitions of the preceding regular expression, as many repetitions as are possible.
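A short sketch of the "zero or more, as many as possible" behavior. The patterns and inputs are assumptions picked for illustration:

```python
import re

# "b*" consumes as many "b" characters as it can.
many = re.match(r"ab*", "abbbc")

# Zero repetitions is also a valid match.
zero = re.match(r"ab*", "ac")

# Greedy matching: ".*" runs to the last ">" in the string, not the first.
greedy = re.match(r"<.*>", "<a><b>")

print(many.group(0))    # abbb
print(zero.group(0))    # a
print(greedy.group(0))  # <a><b>
```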

`+`

`?`

`{`

`}`

`[`

Used to specify the beginning of a character class, which is a set of characters you wish to match. Characters can be listed individually:

[abc]

or as a range, by using -:

[0-9]

⚠️ Metacharacters (except \) are not active inside classes - they're stripped of their special nature. For example [abc$] will match 'a' or 'b' or 'c' or '$'.
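A brief sketch of character classes in use; the input strings are assumptions chosen for the demo:

```python
import re

# Characters listed individually.
letters = re.findall(r"[abc]", "cabbage")

# A range of characters.
digits = re.findall(r"[0-9]", "a1b22")

# "$" loses its special meaning inside a class and matches a literal dollar sign.
with_dollar = re.findall(r"[abc$]", "a$d")

print(letters)      # ['c', 'a', 'b', 'b', 'a']
print(digits)       # ['1', '2', '2']
print(with_dollar)  # ['a', '$']
```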

`]`

Used to specify the end of a character class.

`\`

A backslash can be followed by various characters to signal special sequences. The backslash is also used to escape metacharacters so that they can be matched literally in patterns. To match [, prefix it with a backslash: \[.
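A minimal sketch of both uses of the backslash. The sample strings are assumptions; re.escape is the standard-library helper that escapes every metacharacter in a string for you:

```python
import re

# Escaping a metacharacter so it is matched literally.
brackets = re.findall(r"\[", "x[0][12]")

# A special sequence: \d matches a decimal digit.
numbers = re.findall(r"\d+", "x[0][12]")

# re.escape builds a pattern that matches the given text literally.
literal_pattern = re.escape("a.b[c]")
exact = re.fullmatch(literal_pattern, "a.b[c]")
other = re.fullmatch(literal_pattern, "axb[c]")

print(brackets)           # ['[', '[']
print(numbers)            # ['0', '12']
print(exact is not None)  # True
print(other)              # None
```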

`|`

`(`

Marks the beginning of a capturing group.

`)`

Marks the end of a capturing group.

Special Sequences

NOT Metacharacters

The following characters are matched literally, without any escaping, as long as they do not form a valid repetition qualifier such as {3} or {2,5}:

`{...}`
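A small sketch of the distinction, assuming the behavior of Python's re module, where braces that do not form a repetition count are treated as ordinary characters. The template string is an illustrative assumption:

```python
import re

# Braces around a name are not a valid repetition count, so they match literally.
literal_braces = re.findall(r"{{color}}", "a {{color}} car")

# Braces around a number are a repetition count: exactly two "a" characters.
repetition = re.findall(r"a{2}", "aaa")

print(literal_braces)  # ['{{color}}']
print(repetition)      # ['aa']
```

When in doubt, escaping the braces (\{ and \}) gives the literal meaning unambiguously.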

Patterns

At most one group of characters:

(...)?
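A hedged sketch of an optional group in practice; the pattern and inputs are hypothetical examples, not taken from the original notes:

```python
import re

# The "-dark" suffix may appear once or not at all.
p = re.compile(r"^color(-dark)?$")

with_suffix = p.match("color-dark")
without_suffix = p.match("color")
twice = p.match("color-dark-dark")

print(with_suffix.group(1))     # -dark
print(without_suffix.group(1))  # None (the group did not take part in the match)
print(twice)                    # None (the group may appear at most once)
```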

Replacing Regular Expression Occurrences

https://docs.python.org/3/library/re.html#text-munging

import re

s = "this is a {{color}} car"
print(re.sub(r"{{color}}", 'blue', s))

Strip quotes:

s = "'something'"
re.sub(r"'$", '', re.sub(r"^'", '', s))

Capture groups and use them in the replacement:

s = 'this is red'
s2 = re.sub(r'^(this is).*$', '\\1 blue', s)
assert 'this is blue' == s2

To build a regular expression dynamically, use a raw f-string, rf'...':

s = 'this is a red string'
color = 'red'
s2 = re.sub(rf'{color}', 'blue', s)
assert 'this is a blue string' == s2
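One caveat worth sketching: if the interpolated value may contain metacharacters, wrapping it in re.escape keeps it literal, and literal braces inside an rf-string must be doubled. The values below are hypothetical and chosen only for illustration:

```python
import re

s = "price: $5 (approx)"
token = "$5 (approx)"  # hypothetical value that happens to contain metacharacters

# Without escaping, "$" and "(...)" keep their special meaning and nothing matches.
unescaped = re.sub(rf"{token}", "unknown", s)

# With re.escape the value is matched literally.
escaped = re.sub(rf"{re.escape(token)}", "unknown", s)

# Doubled braces produce literal braces in the pattern, here the quantifier {3}.
n = 3
triples = re.findall(rf"\d{{{n}}}", "12345 678")

print(unescaped)  # price: $5 (approx)
print(escaped)    # price: unknown
print(triples)    # ['123', '678']
```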

Match a new line:

r'\n'

Match not a new line:

r'[^\n]'
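A short sketch applying both patterns to a two-line string (the sample text is an assumption):

```python
import re

text = "first line\nsecond line"

# r'\n' matches each newline character.
newlines = re.findall(r"\n", text)

# r'[^\n]+' matches runs of characters that are not newlines, i.e. each line.
lines = re.findall(r"[^\n]+", text)

print(len(newlines))  # 1
print(lines)          # ['first line', 'second line']
```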

Match a Pattern and Pick Up Groups

import re

p = re.compile(r'^(\w+):(\w+)-(\w+)$')
s = 'abc:mnp-xyz'
m = p.match(s)
if m:
    assert 'abc:mnp-xyz' == m.group(0)
    assert 'abc' == m.group(1)
    assert 'mnp' == m.group(2)
    assert 'xyz' == m.group(3)

Groups are 1-based. group(0) represents the entire match.

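Pitfall: when a regular expression ends in optional groups, such as '....()?()?', a group that did not take part in the match is reported as None. m.group(n) raises IndexError only when n is larger than the number of groups defined in the pattern. Note also that the argument of m.groups() is a default value for unmatched groups, not an index. The solution was to retrieve the groups as a tuple before any evaluation, and use it for testing. The tuple is 0-based, so groups[4] is the fifth group:

groups = m.groups()
if m.group(1):
    ...
...
if groups[4]:
    ...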
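A minimal sketch of how unmatched optional groups behave, under the assumption of a hypothetical three-group pattern (not the pattern from the original notes):

```python
import re

# Hypothetical pattern with two optional trailing groups.
p = re.compile(r"^(\w+)(:\w+)?(-\w+)?$")
m = p.match("abc:mnp")

groups = m.groups()     # tuple of all groups; index 0 is group 1
third = m.group(3)      # the group exists in the pattern but did not match
filled = m.groups("")   # the argument is a default for unmatched groups

try:
    m.group(4)          # the pattern defines only three groups
    out_of_range = False
except IndexError:
    out_of_range = True

print(groups)        # ('abc', ':mnp', None)
print(third)         # None
print(filled)        # ('abc', ':mnp', '')
print(out_of_range)  # True
```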

Scan a String

match = re.search(pattern, string)
if match:
    process(match)
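For context, a small sketch contrasting re.search with re.match and re.findall; the input string is an illustrative assumption:

```python
import re

text = "id=42; id=7"

# re.match only tries at the beginning of the string.
at_start = re.match(r"\d+", text)

# re.search scans forward and returns the first match anywhere in the string.
first = re.search(r"\d+", text)

# re.findall returns every non-overlapping match.
all_numbers = re.findall(r"\d+", text)

print(at_start)        # None
print(first.group(0))  # 42
print(first.start())   # 3
print(all_numbers)     # ['42', '7']
```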

Python Regular Expressions: Difference between revisions
