or in Unicode, the 16‐bit Unicode value. Knowing this, let's begin to build up the regular
expression, first with the name part, assuming that the case‐insensitive flag will be set later in
the code:
([a-z]|#\d+)([a-z0-9-]|#\d+)*([a-z0-9]|#\d+)
This breaks the domain into three parts. The RFC doesn't specify how many digits can follow the
# sign, so neither will we. Setting the #\d+ alternative aside, the first part must contain a single
ASCII letter; the second must contain zero or more letters, numbers, or hyphens; and the third must
contain a single letter or number.
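As a rough illustration of the three-part pattern, here is a minimal Python sketch. The choice of Python, the name `is_label`, and the sample strings are assumptions made for this example; `re.IGNORECASE` stands in for the case-insensitive flag mentioned above.

```python
import re

# One name part of the domain, exactly as written in the text.
# re.IGNORECASE plays the role of the case-insensitive flag.
LABEL = re.compile(r"([a-z]|#\d+)([a-z0-9-]|#\d+)*([a-z0-9]|#\d+)", re.IGNORECASE)

def is_label(text):
    """Return True if the whole string is one valid name part (hypothetical helper)."""
    return LABEL.fullmatch(text) is not None

print(is_label("example"))   # True
print(is_label("my-site"))   # True
print(is_label("-example"))  # False: the first part must be a letter (or # plus digits)
print(is_label("example-"))  # False: the third part must be a letter or number
print(is_label("a"))         # False: the first and third parts each consume a character
```

Note that, as written, the pattern needs at least two characters, because the first and third groups are both mandatory.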
The top‐level domain has more restrictions, as shown here:
[a-z]{2,4}
This restricts you to a two‐, three‐, or four‐letter top‐level domain. So, putting it all together
with the periods, you end up with this:
^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$
Again, the domain name is anchored at the beginning and end of the string. The first change is an
extra group that allows one or more “name.” portions; a two‐ to four‐letter top‐level domain is then
anchored at the end in its own group. We have also made most of the quantifiers lazy. Because much of
the pattern is similar, it makes sense to do this; otherwise, it would require too much backtracking.
However, we have left the second group with a “greedy” quantifier: it will match as much as it
can, up until it reaches a character that does not match. Then it only has to backtrack one position
to attempt the third group match. This is more resource‐efficient than a lazy match in this case,
because a lazy match would have to keep stepping forward one character at a time, retrying the rest
of the pattern after each step. One backtrack per name is an acceptable amount of extra processing.
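The complete domain pattern can be tried out with a short Python sketch. The helper name `top_level_domain`, the sample strings, and the `DOMAIN_LAZY` variant are assumptions introduced for illustration only; the lazy variant is there to show that greedy versus lazy changes how the engine searches, not which strings are accepted.

```python
import re

# Full domain pattern from the text. Group 5 is the top-level domain.
DOMAIN = re.compile(
    r"^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$",
    re.IGNORECASE,
)

# Hypothetical variant for comparison: the second group made lazy (*?).
DOMAIN_LAZY = re.compile(
    r"^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*?([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$",
    re.IGNORECASE,
)

def top_level_domain(text, pattern=DOMAIN):
    """Return the matched top-level domain, or None if the string is rejected."""
    m = pattern.match(text)
    return m.group(5) if m else None

samples = [
    "example.com",         # -> com
    "mail.example.co.uk",  # -> uk
    "Example.COM",         # -> COM (case-insensitive flag)
    "example",             # -> None (no "name." portion)
    "-bad.com",            # -> None (name must start with a letter)
    "example.museum",      # -> None (top-level domain longer than four letters)
]
for s in samples:
    result = top_level_domain(s)
    # Both variants accept and reject the same strings; only the work done differs.
    assert result == top_level_domain(s, DOMAIN_LAZY)
    print(s, "->", result)
```

The last sample shows a consequence of the {2,4} restriction: longer top-level domains are rejected by this pattern.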
Validating a person's address
You can now attempt to validate the part before the @ sign. The RFC specifies that it
can contain any ASCII character with a code in the range from 33 to 126. You are assuming
that you are matching against ASCII only, so the engine has only 128 characters to consider.
This being the case, it is simpler to just exclude the disallowed characters as
follows:
[^<>()\[\],;:@"\x00-\x20\x7F]+
Using this, you're saying that you allow one or more characters, as long as none of them is among
those listed inside the square brackets. The square bracket and backslash characters ([, ], and \)
have to be escaped. However, the RFC allows for other kinds of matches.
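To see what the negated character class accepts and rejects, here is a small Python sketch. The function name `is_allowed_run` and the sample strings are assumptions chosen for this example.

```python
import re

# Negated character class from the text: one or more characters, none of
# which is a listed special character, a control character, a space, or DEL.
LOCAL_CHARS = re.compile(r'[^<>()\[\],;:@"\x00-\x20\x7F]+')

def is_allowed_run(text):
    """Return True if every character in text is outside the excluded set."""
    return LOCAL_CHARS.fullmatch(text) is not None

print(is_allowed_run("john.smith"))   # True
print(is_allowed_run("o'brien+tag"))  # True: ' and + are not excluded
print(is_allowed_run("john smith"))   # False: space is \x20
print(is_allowed_run("john@smith"))   # False: @ is excluded
print(is_allowed_run("a,b"))          # False: comma is excluded
print(is_allowed_run(""))             # False: + requires at least one character
```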
Validating the complete address
Now that you have seen all the previous sections, you can build up a regular expression for the
entire e‐mail address. First, here's everything up to and including the @ sign:
^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@
 