Java Reference
In-Depth Information
Beware of Backslashes
Beware of using backslashes in regular expressions. The character class \w (that is, a backslash followed by a
"w") represents a word character. Recall that in Java a backslash also begins an escape sequence inside a string literal.
Therefore, \w must be written as \\w in a string literal. You can also use a backslash to nullify the special meaning of
metacharacters. For example, [ marks the beginning of a character class. What would be the regular expression that
will match a digit enclosed in brackets, for example, [1], [5], etc.? Note that the regular expression [0-9] will match
any digit, whether or not it is enclosed in brackets. You may think about using [[0-9]]. It will not give you
any error message; however, it will not do the job either, because you can embed one character class within another. For
example, you can write [a-z[0-9]], which is the same as [a-z0-9]; likewise, [[0-9]] is the same as [0-9]. To match a
literal bracket, the outer [ and ] must be treated as ordinary characters, not as metacharacters. You must escape them
with a backslash, as in \[[0-9]\]. To write this regular expression as a string literal, you need to use two
backslashes, as in \\[[0-9]\\].
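The escaping rule above can be checked with a minimal sketch. The class name BracketedDigitDemo and the helper isBracketedDigit are hypothetical names chosen for illustration; only java.util.regex.Pattern comes from the standard library.

```java
import java.util.regex.Pattern;

public class BracketedDigitDemo {
    // The regex \[[0-9]\] written as a Java string literal: every backslash is doubled.
    static final Pattern BRACKETED_DIGIT = Pattern.compile("\\[[0-9]\\]");

    // Hypothetical helper: true only if the whole input is one digit inside square brackets.
    static boolean isBracketedDigit(String input) {
        return BRACKETED_DIGIT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBracketedDigit("[1]")); // true
        System.out.println(isBracketedDigit("5"));   // false

        // [[0-9]] compiles, but it is only a nested character class equal to [0-9].
        System.out.println(Pattern.matches("[[0-9]]", "5"));   // true
        System.out.println(Pattern.matches("[[0-9]]", "[5]")); // false
    }
}
```

The last two lines show why [[0-9]] does not do the job: it matches a bare digit and rejects the bracketed one.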
Quantifiers in Regular Expressions
You can also specify how many times a character or character class in a regular expression may be matched.
If you want to match all two-digit integers, your regular expression would be \d\d, which is the same as [0-9][0-9].
What would be the regular expression to match any integer? You cannot write the regular expression to match any
integer with the knowledge you have gained so far. You need to be able to express a pattern “one digit or more” using a
regular expression. Here comes the concept of quantifiers. Quantifiers and their meanings are listed in Table 14-4 .
Table 14-4. Quantifiers and Their Meaning
Quantifier    Meaning
*             Zero or more times
+             One or more times
?             Once or not at all
{m}           Exactly m times
{m,}          At least m times
{m,n}         At least m, but not more than n times
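As a rough illustration of the table, the sketch below applies each quantifier to \d and tests it against short strings. The class QuantifierDemo and the helper fullMatch are hypothetical names; Pattern.matches is the standard call that checks whether the entire input matches.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: true if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\d*", ""));         // true: zero digits is allowed
        System.out.println(fullMatch("\\d+", ""));         // false: at least one digit required
        System.out.println(fullMatch("\\d?", "7"));        // true: one digit
        System.out.println(fullMatch("\\d?", "77"));       // false: more than one digit
        System.out.println(fullMatch("\\d{2}", "42"));     // true: exactly two digits
        System.out.println(fullMatch("\\d{2,}", "12345")); // true: at least two digits
        System.out.println(fullMatch("\\d{2,3}", "1234")); // false: more than three digits
    }
}
```

Note that the quantifiers in braces are written without spaces; Java rejects a pattern such as \d{2, 3} with a PatternSyntaxException.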
It is important to note that a quantifier must follow the character or character class whose quantity it
specifies. The regular expression to match any integer would be \d+, which means: match one or more
digits. Is this solution for matching integers correct? No, it is not. Suppose your text is "This is text123 which contains
10 and 120". If you run your pattern \d+ against this string, it will match 123, 10, and 120. Note that 123 is
not used as an integer; rather, it is part of the word text123. If you are looking for integers inside text, 123 in
text123 certainly does not qualify as an integer. You want to match only the integers that form a whole word in the text.
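The behavior described above can be reproduced with a short sketch. The class DigitRunDemo and the method findDigitRuns are hypothetical names; Matcher.find and Matcher.group are the standard calls for scanning a string for successive matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitRunDemo {
    // Hypothetical helper: collects every run of one or more digits found in the text.
    static List<String> findDigitRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        String text = "This is text123 which contains 10 and 120";
        // 123 is reported even though it is embedded in the word text123.
        System.out.println(findDigitRuns(text)); // [123, 10, 120]
    }
}
```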
Necessity is the mother of invention. Now you need to specify that the match should be performed only on word
boundaries, not inside text having embedded integers. This is necessary to exclude integer 123 from your previous
result. The next section discusses the use of metacharacters to match boundaries.
With the knowledge you have gained in this section, let's improve your e-mail address validation. Inside an
e-mail address, there must be one and only one @ sign. To specify one and only one character, you use that character
one time in the regular expression, although you can also use {1} as the quantifier. For example, X{1} and X mean the
same inside a regular expression. You are fine on this account. However, your solution until now supports only one
character before and after the @ sign. In reality, there can be more than one character before and after the @ sign in an
e-mail address. You can specify the pattern to validate an e-mail address as \w+@\w+ , which means: one or more word
characters, an @ sign, and one or more word characters.
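A minimal sketch of this check follows, assuming the whole input must match the pattern. The class SimpleEmailCheck and the method looksLikeEmail are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class SimpleEmailCheck {
    // The regex \w+@\w+ as a string literal: one or more word characters,
    // an @ sign, and one or more word characters.
    static final Pattern EMAIL = Pattern.compile("\\w+@\\w+");

    // Hypothetical helper: true if the entire input matches \w+@\w+.
    static boolean looksLikeEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("john@example"));     // true
        System.out.println(looksLikeEmail("@example"));         // false: nothing before @
        System.out.println(looksLikeEmail("john@@example"));    // false: two @ signs
        System.out.println(looksLikeEmail("john@example.com")); // false: . is not a word character
    }
}
```

The last call shows that this pattern is still only a step toward real validation, since \w does not match the dot that appears in most addresses.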
 