Now that we have a testing program and know how to use it, we need to focus on the regular
expression syntax that Java supports. Regular expression syntax is almost a language unto itself, so we'll
focus on the basics and some of the more commonly used advanced bits. The whole subject is worthy of a
book (and such books exist).
Our simple test case uses a string literal. A string literal is just a piece of text. In the example we just
ran, "Sam" is a string literal. "Spade" is another string literal. If we replace "Sam" with "Spade", we get
the following output in the console:
Found a match for Spade beginning at 4 and ending at 9
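As a minimal sketch of the kind of program that produces a line like this, the following uses the standard java.util.regex classes. The class name LiteralMatchSketch, the helper method findMatches, and the input string "Sam Spade" are assumptions made for illustration; the actual test program and its input may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralMatchSketch {

    // Collects one report line per match, in the same format as the console output above.
    static List<String> findMatches(String regex, String input) {
        List<String> reports = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            // start() is the index of the first matched character;
            // end() is the index just past the last matched character.
            reports.add("Found a match for " + matcher.group()
                    + " beginning at " + matcher.start()
                    + " and ending at " + matcher.end());
        }
        return reports;
    }

    public static void main(String[] args) {
        // Assumed input, chosen so that "Spade" starts at index 4.
        String input = "Sam Spade";
        for (String report : findMatches("Spade", input)) {
            System.out.println(report);
        }
    }
}
```

Note that the ending index is exclusive: "Spade" occupies positions 4 through 8, so the match is reported as ending at 9.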
We won't be able to accomplish much with just string literals. We can find all the instances of a
particular string, but we can't find anything that matches a pattern. To create a pattern, we have to dive
into the key component of regular expressions—metacharacters.
Metacharacters are characters that create patterns. Rather than represent a single literal character, a
metacharacter has a special meaning, such as a set of characters, a position in the text, or a repeat
count. Some metacharacters work by themselves, while other
metacharacters are meaningless in the absence of other metacharacters. Table 15-1 describes the
metacharacters supported by the Java regular expression syntax.
Table 15-1. Java Regular Expression Metacharacters
Metacharacter   Description
(               Starts a subpattern (a pattern within the larger pattern). For example,
                compan(y|ies) lets you match either “company” or “companies”. Also starts the
                definition of a group: (Dog) treats those three characters as a single unit for
                other regular expression operators.
[               Starts a set of characters. For example, [A-Z] matches any single uppercase letter
                from A to Z, and A[A-Z]Z matches “AAZ”, “ABZ”, and so on to “AZZ”.
{               Starts a match count specifier. For example, s{3} matches three s characters in a
                row (sss), and Pas{3} matches “Passs”.
\               Starts an escape sequence, so that you can match a literal instance of a
                metacharacter. For example, to match the periods in a paragraph, use \. (that is,
                a backslash and a period). The period character (.) is itself a regular expression
                metacharacter, so you must escape it to find the actual periods. Similarly, to
                find an actual backslash character, you must escape the escape character: \\
^               Matches the start of the string. By default that means the start of the whole
                input; when the pattern is compiled with the Pattern.MULTILINE flag, it also means
                the start of each line. ^A finds any line that begins with “A”. ^[0-9] finds any
                line that begins with a digit. ^[0-9]{2} finds any line that begins with two
                digits. ^[0-9]+ matches any line that begins with a number of any length (one or
                more digits).
                Inside a set, a leading ^ is the negation character. [^abc] matches any character
                other than a, b, or c. [^abc]at matches “rat”, “sat”, and “eat” (and many others)
                but not “bat” or “cat” (or “aat”).
-               Used within range expressions, such as [0-9], which matches any digit.
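
The rows of Table 15-1 can be tried out with a short program. What follows is a sketch rather than a definitive test harness: the class name MetacharacterSketch and the helpers matchesWhole and foundIn are hypothetical names introduced here, and the sample inputs are made up for illustration. One detail matters when typing these patterns into Java source: the backslash is also the escape character in Java string literals, so the regular expression \. is written "\\." and the regular expression \\ is written "\\\\".

```java
import java.util.regex.Pattern;

class MetacharacterSketch {

    // True when the regular expression matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when the regular expression matches somewhere within the input.
    static boolean foundIn(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // ( starts a subpattern; | separates the alternatives inside it.
        System.out.println(matchesWhole("compan(y|ies)", "companies")); // true

        // [ starts a set; - builds a range inside the set.
        System.out.println(matchesWhole("A[A-Z]Z", "ABZ")); // true
        System.out.println(matchesWhole("A[A-Z]Z", "AbZ")); // false: b is lowercase

        // { starts a match count specifier.
        System.out.println(matchesWhole("Pas{3}", "Passs")); // true
        System.out.println(matchesWhole("Pas{3}", "Pass"));  // false: only two s characters

        // \ escapes a metacharacter; each backslash is doubled in Java source.
        System.out.println(foundIn("\\.", "The end."));  // true
        System.out.println(foundIn("\\.", "No period")); // false
        System.out.println(foundIn("\\\\", "C:\\temp")); // true: the input contains one backslash

        // ^ anchors the match to the start of the input.
        System.out.println(foundIn("^[0-9]{2}", "42 Main Street")); // true
        System.out.println(foundIn("^[0-9]{2}", "Suite 42"));       // false

        // With Pattern.MULTILINE, ^ also matches at the start of each line.
        String twoLines = "Bob\nAlice";
        System.out.println(foundIn("^A", twoLines)); // false
        System.out.println(Pattern.compile("^A", Pattern.MULTILINE)
                .matcher(twoLines).find());          // true

        // A leading ^ inside a set negates the set.
        System.out.println(matchesWhole("[^abc]at", "rat")); // true
        System.out.println(matchesWhole("[^abc]at", "cat")); // false
    }
}
```

The sketch uses Pattern.matches where the whole input should fit the pattern and Matcher.find where the pattern only needs to occur somewhere in the input; mixing the two up is a common source of surprising results.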