String regEx = "\\sh.d\\s";
This searches for a five-character sequence that starts and ends with any whitespace character. The
output from the example will now be:
Ted and Ned Hodge hid their hod and huddled in the hedge.
                 ^^^^^     ^^^^^
You can see that the marker array shows the five-character sequences that were found. The embedded
sequences are no longer included, as they don't both begin and end with a whitespace character.
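The search described above can be sketched as follows. This is a minimal reconstruction, not the original example program: the class name WhitespaceDelimitedSearch and the helper markMatches are hypothetical names chosen for illustration. It builds the marker line by filling a character array with '^' under each match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDelimitedSearch {
    // Hypothetical helper: returns a line with '^' under every character
    // of text that belongs to a match of regEx, and spaces elsewhere.
    static String markMatches(String regEx, String text) {
        char[] marker = new char[text.length()];
        Arrays.fill(marker, ' ');
        Matcher m = Pattern.compile(regEx).matcher(text);
        while (m.find()) {
            // start() is inclusive, end() is exclusive
            Arrays.fill(marker, m.start(), m.end(), '^');
        }
        return new String(marker);
    }

    public static void main(String[] args) {
        String text = "Ted and Ned Hodge hid their hod and huddled in the hedge.";
        String regEx = "\\sh.d\\s";   // whitespace, 'h', any character, 'd', whitespace
        System.out.println(text);
        System.out.println(markMatches(regEx, text));
    }
}
```

Only " hid " and " hod " are marked, each including its surrounding spaces; the "hud" in huddled and the "hed" in hedge are skipped because no whitespace follows the 'd'.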
To take another example, suppose we want to find hedge or Hodge as words in the sentence, bearing in
mind that there's a period at the end. We could do this by defining the regular expression as:
String regEx = "\\s[hH][eo]dge[\\s\\.]";
The first character is defined as any whitespace by \\s. The next character is defined as either 'h' or 'H'
by [hH]; the alternatives inside square brackets are simply listed, because a | there would be matched as
a literal character. This can be followed by either 'e' or 'o', specified by [eo]. This is followed by plain
text dge with either a whitespace character or a period at the end, specified by [\\s\\.]. This doesn't
cater for all possibilities. Sequences at the beginning of the string will not be found, for instance, nor will
sequences followed by a comma. We'll see how to deal with these next.
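A minimal sketch of this search, assuming a hypothetical helper findAll that collects every match as a string. The character classes are written without |, since inside square brackets | is an ordinary character rather than an "or". The second sentence in main is made up to illustrate the two limitations just mentioned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HedgeHodgeSearch {
    // Hypothetical helper: returns the text of every match of regEx in text.
    static List<String> findAll(String regEx, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regEx).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // whitespace, h or H, e or o, "dge", then whitespace or a period
        String regEx = "\\s[hH][eo]dge[\\s\\.]";

        String text = "Ted and Ned Hodge hid their hod and huddled in the hedge.";
        System.out.println(findAll(regEx, text));    // prints [ Hodge ,  hedge.]

        // Illustrative sentence: "Hodge" starts the string (no leading
        // whitespace) and "hedge" is followed by a comma, so neither matches.
        String awkward = "Hodge hid in the hedge, then left.";
        System.out.println(findAll(regEx, awkward)); // prints []
    }
}
```

Note that each match includes the delimiting whitespace or period, which is one more reason to prefer a pattern that tests for a boundary without consuming characters.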
Matching Boundaries
So far we have tried to find the occurrence of a pattern anywhere in a string. In many situations you will
want to be more specific. You may want to look for a pattern that appears at the beginning of a line in a
string but not anywhere else, or maybe just at the end of any line. As we saw in the previous example,
you may want to look for a word that is not embedded: you want to find the word "cat" but not the
"cat" in "cattle" or in "Popocatepetl", for instance. The previous example worked for the string
we were searching but would not produce the right result if the word we were looking for was followed
by a comma or appeared at the end of the text. However, we have other options. There are a number of
special sequences you can use in a regular expression when you want to match a particular boundary.
For instance, these are especially useful:
^     Specifies the beginning of a line. For example, to find the word Java at the beginning of any line you could use the expression "^Java".
$     Specifies the end of a line. For example, to find the word Java at the end of any line you could use the expression "Java$". Of course, if you were expecting a period at the end of a line the expression would be "Java\\.$".
\b    Specifies a word boundary. To find three-letter words beginning with 'h' and ending with 'd' we could use the expression "\\bh.d\\b".
\B    A non-word boundary, the complement of \b above.
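The boundary sequences above can be tried out with a short sketch like the one below. The class BoundaryMatching, the helper findAll and the sample strings are hypothetical, chosen only for illustration. One point worth knowing: in java.util.regex, ^ and $ match at the start and end of every line only when the pattern is compiled with the Pattern.MULTILINE flag; without it they match at the start and end of the whole input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryMatching {
    // Hypothetical helper: returns the text of every match of regEx in text,
    // compiling the pattern with the given flags (0 for none).
    static List<String> findAll(String regEx, int flags, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regEx, flags).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sentence = "Ted and Ned Hodge hid their hod and huddled in the hedge.";
        // \b consumes no characters, so the matches carry no surrounding spaces.
        System.out.println(findAll("\\bh.d\\b", 0, sentence));  // prints [hid, hod]

        String cats = "The cat, near the cattle, watched Popocatepetl. I like my cat";
        // Finds "cat" before the comma and at the end of the text,
        // but not inside "cattle" or "Popocatepetl".
        System.out.println(findAll("\\bcat\\b", 0, cats).size());  // prints 2

        String lines = "Java is fun.\nI like Java\nJava rocks";
        // With MULTILINE, ^ and $ apply to each line of the input.
        System.out.println(findAll("^Java", Pattern.MULTILINE, lines).size());  // prints 2
        System.out.println(findAll("Java$", Pattern.MULTILINE, lines).size());  // prints 1
        // Without the flag, only the start of the whole input counts.
        System.out.println(findAll("^Java", 0, lines).size());  // prints 1
    }
}
```

Because \b tests a position rather than matching a character, the same pattern finds a word at the start of the text, at the end, or next to punctuation, which covers the cases the whitespace-delimited expressions missed.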