Java Regular Expressions: Difference between revisions
Latest revision as of 06:28, 12 May 2021
External
http://docs.oracle.com/javase/10/docs/api/java/util/regex/Pattern.html#sum
Internal
Overview
Regular expressions can be used in Java via the String API or java.util.regex API. Java regular expression flavor is largely similar to Perl's and grep's.
Regular expression metacharacters compete for interpretation with java String Metacharacters.
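A minimal sketch of this double interpretation, assuming a hypothetical class EscapeDemo and helper containsLiteral() that are not part of the original page. The Java compiler first turns "\\$" into the two characters \$, and only then does the regex engine read \$ as a literal dollar sign. Pattern.quote() is shown as one way to escape a whole literal at once.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: returns true if the literal text occurs in the input.
    // Pattern.quote() escapes every metacharacter in the literal.
    static boolean containsLiteral(String input, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\$" in Java source is the two-character regex \$ (a literal dollar sign).
        System.out.println("price: $5".replaceAll("\\$", "USD "));     // price: USD 5
        // An unescaped '.' matches any character; "\\." matches only a dot.
        System.out.println("a.b".matches("a\\.b"));                    // true
        System.out.println("axb".matches("a\\.b"));                    // false
        System.out.println("axb".matches("a.b"));                      // true
        // Parentheses and braces are matched literally once quoted.
        System.out.println(containsLiteral("f(x) = {1}", "(x) = {"));  // true
    }
}
```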
Metacharacters
$
$ matches the end of the string. To match the '$' (dollar sign), the character must be escaped:
"\\$"
.
'.' matches any single character except, by default, a line terminator.
To match a literal '.':
"\\."
()
"\\(" "\\)"
{}
"\\{" "\\}"
Other Characters that Require Special Handling
"
'"' is not a metacharacter, but it must be escaped once, because otherwise it would terminate the String literal that contains the regular expression.
\"
Character Classes
\h: A horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]
Digits
\d A digit: [0-9]
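The two predefined classes above can be tried out with a short sketch. The class name CharClassDemo, the two patterns and the helper methods are illustrative assumptions; \h is assumed to be available (it was added in Java 8).

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // One or more digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");
    // A lowercase word, horizontal whitespace, then digits.
    private static final Pattern KEY_VALUE = Pattern.compile("[a-z]+\\h+\\d+");

    static boolean isAllDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    static boolean isKeyValue(String s) {
        return KEY_VALUE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2021"));          // true
        System.out.println(isAllDigits("20a1"));          // false
        System.out.println(isKeyValue("port \t 8080"));   // true: space and tab are horizontal whitespace
        System.out.println(isKeyValue("port\n8080"));     // false: a newline is not horizontal whitespace
    }
}
```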
java.util.regex API
The default sequence for using regular expressions consists of building a Pattern instance, which can then be matched against multiple strings via Matcher instances. The Pattern instance contains a compiled representation of the regular expression. Compilation can be relatively expensive, because it builds the structure of a state machine. A Matcher uses the Pattern but encapsulates all the state required to match against one string, so a single Pattern can be shared by multiple Matchers and the expensive compilation is performed only once. Matcher instances are not thread safe; see Concurrent Usage Considerations below.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Example {
    // compiled once and shared; Pattern instances are immutable
    public static final Pattern PATTERN = Pattern.compile("red");
    ...
    public void useRegex(String argument) {
        Matcher m = PATTERN.matcher(argument);
        ...
    }
}
Once built, a Matcher instance can be used to match or find.
Matcher.matches()
The Matcher.matches() method attempts to match the entire input sequence against the pattern. The result is binary: the entire input sequence either matches the regular expression or it does not. In the context of the above example,
String argument = "red";
Matcher m = PATTERN.matcher(argument);
m.matches();
returns true, while
String argument = "credential";
Matcher m = PATTERN.matcher(argument);
m.matches();
returns false.
Matcher.find()
Matcher.find() can be used to repeatedly scan the input sequence and look for the next subsequence that matches the pattern. The whole input sequence does not need to match the pattern for find() to return true; it is sufficient that a subsequence of it does. find() starts at the beginning of the matcher's region or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.
Once a match occurs, the internal state of the matcher can be accessed via start(), end() and group() methods.
The typical way find() is used is shown below:
Matcher m = PATTERN.matcher(argument);
int i = 1;
while(m.find()) {
int s = m.start();
int e = m.end();
System.out.println("matching subsequence " + i + " starts at " + s + " and ends at " + e);
i ++;
}
Note that a Matcher holds no match state until a match attempt succeeds: calling state access methods such as start(), end() or group() before then throws an IllegalStateException ("No match available").
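The sketch below illustrates both points: find() succeeds on a subsequence, and the state accessors throw before any match has been attempted. The class FindStateDemo and its two helper methods are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindStateDemo {
    static final Pattern PATTERN = Pattern.compile("red");

    // Returns the start offset of the first match, or -1 if there is none.
    static int firstMatchStart(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.start() : -1;
    }

    // Returns true if start() throws when called before any match attempt.
    static boolean startThrowsBeforeMatch(String input) {
        Matcher m = PATTERN.matcher(input);
        try {
            m.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstMatchStart("credential"));        // 1
        System.out.println(firstMatchStart("blue"));              // -1
        System.out.println(startThrowsBeforeMatch("credential")); // true
    }
}
```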
Capturing Groups
The regular expression may define capturing groups. A capturing group is a regular expression fragment enclosed in parentheses "(" and ")". Note that the parentheses need not be escaped:
"something(.*)somethingelse"
Upon a match, the capturing groups can be retrieved via the Matcher API with group(), groupCount(), group(int index) and group(String name) state accessors.
Group 0 denotes the entire match, so m.group(0) is equivalent to m.group(). Groups with an index i > 0 correspond to the capturing groups inside the pattern, numbered from left to right by the position of their opening parenthesis. If the match was successful but the specified group did not take part in it, null is returned; a group that matched the empty string returns "" instead.
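The difference between a group that returns null and one that returns an empty string, and the use of group(String name), can be sketched as follows. The patterns and the names GroupDemo, groupOf() and key() are assumptions made for illustration; named groups are assumed to be available (Java 7 or later).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Two alternatives, each with its own capturing group; only one takes part in a match.
    static final Pattern EITHER = Pattern.compile("(\\d+)|([a-z]+)");
    // Group 1 is also named "key"; group 2 may match the empty string.
    static final Pattern KEY_VALUE = Pattern.compile("(?<key>[a-z]+)=(\\d*)");

    // Returns the content of the given group after matching the entire input.
    static String groupOf(Pattern p, String input, int index) {
        Matcher m = p.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m.group(index);
    }

    // Returns the named group "key", or null if the input does not match.
    static String key(String input) {
        Matcher m = KEY_VALUE.matcher(input);
        return m.matches() ? m.group("key") : null;
    }

    public static void main(String[] args) {
        System.out.println(groupOf(EITHER, "abc", 1));               // null: group 1 did not take part
        System.out.println(groupOf(EITHER, "abc", 2));               // abc
        System.out.println(groupOf(KEY_VALUE, "port=", 2).isEmpty()); // true: matched, but empty
        System.out.println(key("port=80"));                          // port
    }
}
```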
The example below matches words that may or may not include a sequence of "a"s. The words are terminated by colons. For each match, we display the state of the matcher, including the capturing groups.
Pattern PATTERN = Pattern.compile("[b-z]+(a*)[b-z]+:");
String argument="blah:blaaaaaah:blh:";
Matcher m = PATTERN.matcher(argument);
int i = 1;
while(m.find()) {
System.out.println("match " + (i ++) + ":");
System.out.println(" match starts at: " + m.start());
System.out.println(" match ends at: " + m.end());
System.out.println(" group count for match: " + m.groupCount());
System.out.println(" group(0) for match: " + m.group(0));
System.out.println(" group(1) for match: " + m.group(1));
}
The output is:
match 1:
  match starts at: 0
  match ends at: 5
  group count for match: 1
  group(0) for match: blah:
  group(1) for match: a
match 2:
  match starts at: 5
  match ends at: 15
  group count for match: 1
  group(0) for match: blaaaaaah:
  group(1) for match: aaaaaa
match 3:
  match starts at: 15
  match ends at: 19
  group count for match: 1
  group(0) for match: blh:
  group(1) for match:
Replacing Matched Sequences
The Matcher class exposes an API for replacing matched subsequences with new strings whose contents can be computed from the match result. The methods are Matcher.replaceAll(), Matcher.appendReplacement() and Matcher.appendTail().
Matcher.replaceAll()
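A minimal sketch of the replacement methods, using a hypothetical ReplaceDemo class. mask() uses replaceAll() with a fixed replacement, while doubleNumbers() uses the appendReplacement()/appendTail() pair to compute each replacement from the match. The StringBuffer overloads are used because they exist in all Java versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Replaces every number with the fixed string "#".
    static String mask(String input) {
        return NUMBER.matcher(input).replaceAll("#");
    }

    // Replaces every number with its double, computed from each match.
    static String doubleNumbers(String input) {
        Matcher m = NUMBER.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text between matches, then the replacement.
            // '$' and '\' are special in replacements; plain digits are safe here.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("3 apples, 12 pears"));          // # apples, # pears
        System.out.println(doubleNumbers("3 apples, 12 pears")); // 6 apples, 24 pears
    }
}
```

If a computed replacement may contain '$' or '\', wrapping it in Matcher.quoteReplacement() keeps it literal.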
Matcher.lookingAt()
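Matcher.lookingAt() sits between matches() and find(): like matches() it always starts at the beginning of the region, but it does not require the whole region to match. The sketch below compares the three methods; the LookingAtDemo class and probe() helper are hypothetical, and a fresh Matcher is created for each call so the results do not influence each other.

```java
import java.util.regex.Pattern;

class LookingAtDemo {
    static final Pattern PATTERN = Pattern.compile("red");

    // Returns { matches(), lookingAt(), find() } for the input.
    static boolean[] probe(String input) {
        return new boolean[] {
            PATTERN.matcher(input).matches(),
            PATTERN.matcher(input).lookingAt(),
            PATTERN.matcher(input).find()
        };
    }

    public static void main(String[] args) {
        for (String s : new String[] {"red", "redness", "credential"}) {
            boolean[] r = probe(s);
            System.out.println(s + ": matches=" + r[0] + " lookingAt=" + r[1] + " find=" + r[2]);
        }
        // red: matches=true lookingAt=true find=true
        // redness: matches=false lookingAt=true find=true
        // credential: matches=false lookingAt=false find=true
    }
}
```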
java.util.regex Examples
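Working code examples are available here:
Simple Pattern Matching and Group Usage Example: https://github.com/NovaOrdis/playground/blob/master/java/regex/simplest/src/main/java/io/novaordis/playground/java/regex/simplest/Main.java
Decide whether a String Represents a Correct Number: https://github.com/NovaOrdis/playground/blob/master/java/regex/number-as-string/src/main/java/playground/java/regex/numberAsString/NumberAsString.java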
java.lang.String API
matches()
String s = "...";
s.matches(...);
While convenient in some cases, String.matches() simply delegates to the java.util.regex API via the Pattern.matches() call. It is inefficient when used repeatedly, because it compiles a new Pattern instance on each invocation. If the same regular expression is to be matched repeatedly, the java.util.regex API with a precompiled Pattern is preferred.
Also, the regular expression passed as argument must match the entire string:
String s = "the vehicle is blue";
// this will return false
s.matches("blue");
// this will return true
s.matches("^.*blue$");
contains()
Returns true if a sequence of characters (not a regex) is contained by the string.
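The following sketch contrasts contains(), which treats its argument literally, with the regex-based alternatives. ContainsDemo and containsRegex() are hypothetical names; the helper uses Matcher.find() as the regex counterpart of a substring search.

```java
import java.util.regex.Pattern;

class ContainsDemo {
    // Regex counterpart of contains(): true if any subsequence matches the regex.
    static boolean containsRegex(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println("a.b".contains("."));      // true: '.' is an ordinary character here
        System.out.println("axb".contains("a.b"));    // false: no literal "a.b" in the string
        System.out.println("axb".matches("a.b"));     // true: as a regex, '.' matches 'x'
        System.out.println(containsRegex("the vehicle is blue", "bl.e")); // true
    }
}
```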
Concurrent Usage Considerations
Matcher instances are NOT thread safe: create a separate Matcher per thread. The Pattern they are created from is immutable and can be shared between threads.
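One possible arrangement is sketched below: a single static Pattern is shared, and every task creates its own Matcher inside the method it runs. The ConcurrentRegexDemo class, the pool size of 4 and the helper names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConcurrentRegexDemo {
    // Shared by all threads: compiled once, never modified.
    static final Pattern PATTERN = Pattern.compile("red");

    // Counts matches with a Matcher that is local to the calling thread.
    static int countMatches(String input) {
        Matcher m = PATTERN.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts matches for each input on a small thread pool, preserving input order.
    static List<Integer> countAll(List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String input : inputs) {
                Callable<Integer> task = () -> countMatches(input);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) {
                counts.add(f.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> inputs = new ArrayList<>();
        inputs.add("redredred");
        inputs.add("credential");
        inputs.add("blue");
        System.out.println(countAll(inputs)); // [3, 1, 0]
    }
}
```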
Regular Expression Syntax
TO NORMALIZE across java Regular Expression Syntax, grep Regular Expression Syntax, sed Regular Expression Syntax.
Greedy Matching
Quantifiers are greedy by default. To turn a quantifier into a reluctant one, append "?" to it (for example, ".*?" instead of ".*"). See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#reluc
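The effect can be seen on a short input, as in this sketch. GreedyDemo and firstMatch() are hypothetical names, and the tag-like input is only an illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the first subsequence matching the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .+ consumes as much as possible, up to the last '>'.
        System.out.println(firstMatch("<.+>", "<a><b>"));   // <a><b>
        // Reluctant: .+? consumes as little as possible, up to the first '>'.
        System.out.println(firstMatch("<.+?>", "<a><b>"));  // <a>
    }
}
```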
Any Character
.*
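By default, '.' does not match line terminators, so this pattern stops at the end of the line. To match across lines, compile the pattern with the Pattern.DOTALL flag or prefix it with (?s).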
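A short sketch of the line-terminator behavior, with and without the DOTALL flag. DotAllDemo and its two helpers are hypothetical names.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // '.' excludes line terminators by default.
    static boolean matchesDefault(String input) {
        return Pattern.compile(".*").matcher(input).matches();
    }

    // With DOTALL, '.' matches line terminators as well.
    static boolean matchesDotAll(String input) {
        return Pattern.compile(".*", Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "one\ntwo";
        System.out.println(matchesDefault(twoLines));     // false
        System.out.println(matchesDotAll(twoLines));      // true
        // (?s) is the embedded form of the same flag.
        System.out.println(twoLines.matches("(?s).*"));   // true
    }
}
```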
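Multiple Spaces
"\\s*"
"\\s*" matches zero or more whitespace characters (spaces, tabs, line breaks); use "\\s+" to require at least one.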
Curly Braces
"\\{.+\\}"
$ (Dollar Sign)
"\\$"
Word that May Appear or Not
(jdbc){0,1}
This is equivalent to (jdbc)?. The quantifier is written without whitespace inside the braces.
Optional Sequence of Characters that May Appear Once or None at All
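(someoptionalword)?
Square brackets would define a character class instead: [someoptionalword]? matches at most one character from that set, not the whole word.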
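The sketch below contrasts an optional group with a character class followed by "?", since the two are easy to confuse. The class OptionalDemo, the three patterns and the "jdbc:mysql" sample strings are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class OptionalDemo {
    // The whole word "jdbc:" may appear once or not at all.
    static final Pattern OPTIONAL_GROUP = Pattern.compile("(jdbc:)?mysql");
    // Same meaning, written with an explicit counted quantifier.
    static final Pattern COUNTED = Pattern.compile("(jdbc:){0,1}mysql");
    // A character class: at most ONE of the characters j, d, b, c or ':'.
    static final Pattern CHAR_CLASS = Pattern.compile("[jdbc:]?mysql");

    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(OPTIONAL_GROUP, "jdbc:mysql")); // true
        System.out.println(matches(OPTIONAL_GROUP, "mysql"));      // true
        System.out.println(matches(COUNTED, "jdbc:mysql"));        // true
        System.out.println(matches(CHAR_CLASS, "jmysql"));         // true
        System.out.println(matches(CHAR_CLASS, "jdbc:mysql"));     // false
    }
}
```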
Organizatorium
To break a string into tokens separated by whitespace:
line.split("\\s+");
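A small sketch of the split call in context. The SplitDemo class and tokens() helper are hypothetical; the trim() call is an addition not present in the original snippet, made because leading whitespace would otherwise produce an empty first token.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits a line into whitespace-separated tokens.
    // trim() first: a leading separator would otherwise yield an empty first token.
    static String[] tokens(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("  a \t b\nc ")));   // [a, b, c]
        // Without trim(), the leading space produces an empty first element.
        System.out.println(" a b".split("\\s+").length);               // 3
    }
}
```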