CHARACTER CLASS    DESCRIPTION
\s    This represents any whitespace character. A whitespace character is a space, a tab '\t', a newline character '\n', a form feed character '\f', a carriage return '\r', or a vertical tab '\x0B'.
\S    This represents any non-whitespace character and is therefore equivalent to [^\s].
\w    This represents a word character, which corresponds to an upper- or lowercase letter, a digit, or an underscore. It is therefore equivalent to [a-zA-Z_0-9].
\W    This represents any character that is not a word character, so it is equivalent to [^\w].
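As a quick check of the four classes in the table, the following minimal sketch counts how many characters of a short string each class matches. The class name CharClassDemo, the helper countMatches, and the sample text are illustrative assumptions, not part of the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Counts how many separate matches of regex occur in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample: letter, underscore, digit, space, letter, tab, '?'
        String sample = "a_1 b\t?";
        System.out.println("\\s matches: " + countMatches("\\s", sample)); // space and tab
        System.out.println("\\S matches: " + countMatches("\\S", sample)); // everything else
        System.out.println("\\w matches: " + countMatches("\\w", sample)); // a, _, 1, b
        System.out.println("\\W matches: " + countMatches("\\W", sample)); // space, tab, ?
    }
}
```

Each character of the sample is matched by exactly one of \s and \S, and by exactly one of \w and \W, which is what the "equivalent to" entries in the table imply.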
Note that when you are using any of the sequences that start with a backslash in a regular expression, you need to keep in mind that Java treats a backslash as the beginning of an escape sequence. Therefore, you must specify the backslash in the regular expression as \\. For example, to find a sequence of three digits, the regular expression would be "\\d\\d\\d". This is peculiar to Java because of the significance of the backslash in Java strings, so it doesn't necessarily apply to other environments that support regular expressions, such as Perl.
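The doubling of the backslash can be seen in a small sketch. The Java literal "\\d\\d\\d" contains six characters, which is the text \d\d\d that the regular expression engine actually receives. The class name ThreeDigits, the helper firstThreeDigits, and the sample text are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeDigits {
    // Returns the first run of three consecutive digits in text, or null if there is none.
    static String firstThreeDigits(String text) {
        Matcher m = Pattern.compile("\\d\\d\\d").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Each \\ in the source is a single backslash in the string itself.
        System.out.println("\\d\\d\\d".length());
        // Hypothetical sample text containing a four-digit and a two-digit number.
        System.out.println(firstThreeDigits("Room 4207, floor 42"));
    }
}
```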
Obviously, you may well want to include a period, or any of the other meta-characters, as part of the character sequence you are looking for. To do this you can use an escape sequence starting with a backslash in the expression to define such characters. Because Java strings interpret a backslash as the start of a Java escape sequence, the backslash itself has to be represented as \\, the same as when using the predefined character sets that begin with a backslash. Thus, the regular expression to find the sequence "had." would be "had\\.".
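The difference between an escaped and an unescaped period can be sketched as follows. The class name EscapedPeriod, the helper contains, and the sample sentences are hypothetical. Pattern.quote() is a standard library method that escapes a whole string, shown here only as an alternative to writing the backslashes by hand.

```java
import java.util.regex.Pattern;

class EscapedPeriod {
    // Returns true if regex matches anywhere within text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Unescaped: the period matches any character, here the space after "had".
        System.out.println(contains("had.", "I had it"));
        // Escaped: only a literal period is accepted after "had".
        System.out.println(contains("had\\.", "I had it"));
        System.out.println(contains("had\\.", "All he had."));
        // Pattern.quote() treats every character of its argument literally.
        System.out.println(contains(Pattern.quote("had."), "All he had."));
    }
}
```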
The earlier search you tried with the expression "h.d" found embedded sequences such as "hud" in the word huddled. You could use the \s set that corresponds to any whitespace character to prevent this by defining regEx like this:
String regEx = "\\sh.d\\s";
This searches for a five-character sequence that starts and ends with any whitespace character. The output
from the example is now:
Ted and Ned Hodge hid their hod and huddled in the hedge.
                 ^^^^^     ^^^^^
You can see that the marker array shows the five-character sequences that were found. The embedded
sequences are now no longer included, as they don't begin and end with a whitespace character.
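The marker line can be produced along the following lines. This is a minimal sketch rather than the program used for the output above: the class name MarkMatches and the method markerLine are assumptions, and it simply writes a '^' under every character covered by a match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkMatches {
    // Builds a line of spaces with '^' under each character of text that a match covers.
    static String markerLine(String regex, String text) {
        char[] marker = new char[text.length()];
        Arrays.fill(marker, ' ');
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // start() is inclusive and end() is exclusive, matching Arrays.fill's range.
            Arrays.fill(marker, m.start(), m.end(), '^');
        }
        return new String(marker);
    }

    public static void main(String[] args) {
        String str = "Ted and Ned Hodge hid their hod and huddled in the hedge.";
        String regEx = "\\sh.d\\s";
        System.out.println(str);
        System.out.println(markerLine(regEx, str));
    }
}
```

Because each match includes its surrounding whitespace, every marked sequence is five characters wide, one character wider on each side than the word itself.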
To take another example, suppose you want to find hedge or Hodge as words in the sentence, bearing in mind that there's a period at the end. You could do this by defining the regular expression as:
String regEx = "\\s[hH][eo]dge[\\s.]";
The first character is defined as any whitespace by \\s. The next character is defined as either "h" or "H" by [hH]. This can be followed by either "e" or "o" specified by [eo]. This is followed by plaintext dge with either a whitespace character or a period at the end, specified by [\\s.]. Note that the alternatives inside square brackets are simply listed one after another: a | inside a character class is matched literally rather than meaning "or", and a period there needs no escaping. This doesn't cater to all possibilities. Sequences at the beginning of the string are not found, for example, nor are sequences followed by a comma. You see how to deal with these next.
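The behavior and the limitations just described can be checked with a short sketch. The class name HedgeOrHodge, the helper findAll, and the second sample sentence are assumptions for illustration. The pattern is written with the alternatives listed directly inside each character class, because a | placed inside square brackets is matched as a literal character, as the last line of main shows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HedgeOrHodge {
    // Whitespace, h or H, e or o, "dge", then whitespace or a period.
    static final String REGEX = "\\s[hH][eo]dge[\\s.]";

    // Collects the text of every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Both words are found, each with its leading and trailing character.
        System.out.println(findAll(REGEX,
                "Ted and Ned Hodge hid their hod and huddled in the hedge."));
        // Nothing is found: "Hodge" starts the string and "hedge" is followed by a comma.
        System.out.println(findAll(REGEX, "Hodge hid in the hedge, then left."));
        // Inside a character class, | is an ordinary character.
        System.out.println(Pattern.matches("[h|H]", "|"));
    }
}
```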
Matching Boundaries