Java Reference
In-Depth Information
Each part of a domain name has certain rules it must follow. It can contain any letter or number or a
hyphen, but it must start with a letter. The exception is that, at any point in the domain name, you can
use a #, followed by a number, which represents the ASCII code for a character, or in Unicode, the 16-bit
Unicode value. Knowing this, let's begin to build up the regular expression, first with the name part,
assuming that the case-insensitive flag will be set later in the code.
([a-z]|#\d+)([a-z0-9-]|#\d+)*([a-z0-9]|#\d+)
This breaks the name into three parts. The RFC doesn't specify how many digits can follow the #,
so neither will we. The first part must be a single ASCII letter (or a # followed by digits); the second
must contain zero or more of a letter, number, or hyphen; and the third must be either a letter or a
number. The top-level domain has more restrictions, as shown here:
[a-z]{2,4}
This restricts you to a two-, three-, or four-letter top-level domain. So, putting it all together, with the
periods you end up with this:
^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$
Again, the domain name is anchored at the beginning and end of the string. The first step is to add
an extra group that allows one or more "name." portions (a name followed by a period), and then to
anchor a two-to-four-letter top-level domain at the end in its own group. Most of the quantifiers have
also been made lazy. Because much of the pattern is similar, it makes sense to do this; otherwise, it
would require too much backtracking. However, the second group has been left with a "greedy"
quantifier: it will match as much as it can, up until it reaches a character that does not match. Then it
will backtrack only one position to attempt the third group match. This is more resource-efficient than
a lazy match in this case, because a lazy match would have to keep stepping forward, attempting the
rest of the pattern at each position. One backtrack per name is an acceptable amount of extra
processing.
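As a rough illustration, the domain pattern above can be tried out in Java. This is only a sketch: the class name DomainNameCheck and the method isValidDomain are made up for this example, and Pattern.CASE_INSENSITIVE is assumed to play the role of the case-insensitive flag mentioned earlier.

```java
import java.util.regex.Pattern;

class DomainNameCheck {
    // The domain pattern from the text. Backslashes are doubled because this
    // is a Java string literal. CASE_INSENSITIVE is an assumption standing in
    // for the case-insensitive flag the text says is set later.
    static final Pattern DOMAIN = Pattern.compile(
        "^(([a-z]|#\\d+?)([a-z0-9-]|#\\d+?)*([a-z0-9]|#\\d+?)\\.)+([a-z]{2,4})$",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true only if the whole string fits the pattern.
    static boolean isValidDomain(String domain) {
        return DOMAIN.matcher(domain).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "example.com",      // accepted
            "Mail.Example.ORG", // accepted, thanks to the case-insensitive flag
            "-bad.com",         // rejected: a name cannot start with a hyphen
            "example.museum",   // rejected: top-level domain is longer than four letters
            "a.com"             // rejected: the pattern needs at least two characters per name
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidDomain(s));
        }
    }
}
```

Note that the last two samples show limits of the pattern itself rather than of the code: because the first and third parts of each name are both mandatory, a one-character name never matches, and the {2,4} quantifier rules out longer top-level domains.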
Validating a Person's Address
You can now attempt to validate the part before the @ sign. The RFC specifies that it can contain any ASCII
character with a code in the range from 33 to 126. You are assuming that you are matching against ASCII
only, so you can assume that there are only 128 characters that the engine will match against. This being
the case, it is simpler to just exclude the disallowed values as follows:
[^<>()\[\],;:@"\x00-\x20\x7F]+
Using this, you're saying that you allow any number of characters, as long as none of them are those
contained within the square brackets. The [, ], and \ characters have to be escaped. However, the RFC
allows for other kinds of matches, for example a backslash followed by any single character.
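To see the negated character class at work, here is a minimal Java sketch. The names LocalPartCheck and isPlainLocalPart are invented for this example, and it keeps the text's assumption that the input is ASCII only.

```java
import java.util.regex.Pattern;

class LocalPartCheck {
    // The exclusion class from the text, as a Java string literal:
    // backslashes are doubled and the double quote is written as \".
    static final Pattern LOCAL_PART =
        Pattern.compile("[^<>()\\[\\],;:@\"\\x00-\\x20\\x7F]+");

    // Hypothetical helper: true if every character of s is outside the
    // excluded set and s is not empty.
    static boolean isPlainLocalPart(String s) {
        return LOCAL_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlainLocalPart("john.smith")); // true
        System.out.println(isPlainLocalPart("john smith")); // false: space is \x20
        System.out.println(isPlainLocalPart("john@smith")); // false: @ is excluded
        System.out.println(isPlainLocalPart(""));           // false: + needs one character
    }
}
```

Because the class only lists what is forbidden, any non-ASCII character would also pass; that is acceptable here only because of the ASCII-only assumption stated above.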
Validating the Complete Address
Now that you have seen all the previous sections, you can build up a regular expression for the entire
e-mail address. First, here's everything up to and including the @ sign:
^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@
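The following Java sketch applies this first half of the expression to a few addresses. The class name LocalPrefixCheck and the method matchPrefix are assumptions made for the example; the method simply reports the text matched up to and including the @ sign, if any.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalPrefixCheck {
    // Everything up to and including the @ sign, as given in the text.
    // In the Java literal, \\\\. is the regex \\. (a backslash followed
    // by any single character).
    static final Pattern PREFIX = Pattern.compile(
        "^([^<>()\\[\\],;:@\"\\x00-\\x20\\x7F]|\\\\.)+@");

    // Hypothetical helper: returns the matched prefix, or empty if the
    // start of the address does not fit the pattern.
    static Optional<String> matchPrefix(String address) {
        Matcher m = PREFIX.matcher(address);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Plain local part: the prefix "john.smith@" is matched.
        System.out.println(matchPrefix("john.smith@example.com"));
        // A bare space is excluded by the character class: no match.
        System.out.println(matchPrefix("john smith@example.com"));
        // A backslash-escaped space is accepted through the \\. alternative.
        System.out.println(matchPrefix("john\\ smith@example.com"));
        // Nothing before the @ sign: no match, because + needs one character.
        System.out.println(matchPrefix("@example.com"));
    }
}
```

The find() call is enough here because the pattern is anchored with ^ but deliberately has no end anchor yet; the domain part still has to be appended to validate the complete address.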