String Manipulation - Beginning JavaScript

Each part of a domain name has certain rules it must follow. It can contain any letter, number, or hyphen, but it must start with a letter. The exception is that, at any point in the domain name, you can use a #, followed by a number, which represents the ASCII code for that letter, or in Unicode, the 16-bit Unicode value. Knowing this, let's begin to build up the regular expression, first with the name part, assuming that the case-insensitive flag will be set later in the code.

([a-z]|#\d+)([a-z0-9-]|#\d+)*([a-z0-9]|#\d+)

This breaks each name in the domain into three parts. The RFC doesn't specify how many digits can follow the #, so neither will we. The first part must be a single ASCII letter; the second must contain zero or more letters, numbers, or hyphens; and the third must be a single letter or number. In each part, a # followed by digits can stand in for a character. The top-level domain has more restrictions, as shown here:

[a-z]{2,4}

This restricts you to a two-, three-, or four-letter top-level domain. So, putting it all together, with the periods, you end up with this:

^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$

Again, the domain name is anchored at the beginning and end of the string. The first change is an extra group that allows one or more name. portions (a name followed by a period); a two-to-four-letter top-level domain is then anchored at the end in its own group. You have also made most of the wildcards lazy. Because much of the pattern is similar, it makes sense to do this; otherwise, it would require too much backtracking. However, you have left the second group with a “greedy” wildcard: It will match as much as it can, up until it reaches a character that does not match. Then it will only backtrack one position to attempt the third group match. This is more resource-efficient than a lazy match is in this case, because a lazy match could be constantly going forward to attempt the match. One backtrack per name is an acceptable amount of extra processing.
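To see how the domain pattern behaves, you can try it against a few sample strings. The following is only a minimal sketch: the helper name isValidDomain and the sample domains are assumptions made for illustration, and the case-insensitive flag is set directly on the regular expression literal.

```javascript
// Domain pattern from the text, with the case-insensitive flag (i) set.
const domainPattern =
  /^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$/i;

// isValidDomain is a hypothetical helper name used only for this sketch.
function isValidDomain(domain) {
  return domainPattern.test(domain);
}

console.log(isValidDomain("example.com"));    // true
console.log(isValidDomain("My-Site.co.uk"));  // true
console.log(isValidDomain("-bad.com"));       // false: name starts with a hyphen
console.log(isValidDomain("bad-.com"));       // false: name ends with a hyphen
console.log(isValidDomain("example.museum")); // false: top-level domain is longer than four letters
console.log(isValidDomain("a.com"));          // false: see the note below
```

Note that because the first and third parts of each name are both required, this pattern accepts only names of two or more characters, so a single-letter name such as the a in a.com does not match.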

Validating a Person's Address

You can now attempt to validate the part before the @ sign. The RFC specifies that it can contain any ASCII character with a code in the range from 33 to 126. You are assuming that you are matching against ASCII only, so there are only 128 characters that the engine will match against. This being the case, it is simpler to just exclude the characters that are not allowed, as follows:

[^<>()\[\],;:@"\x00-\x20\x7F]+

Using this, you're saying that you allow one or more characters, as long as none of them are those contained within the square brackets. The [, ], and \ characters have to be escaped. However, the RFC allows for other kinds of matches.
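You can check the negated character class on its own before combining it with anything else. In this sketch, the ^ and $ anchors, the helper name isSimpleLocalPart, and the sample strings are all assumptions added for illustration; the quotation mark inside the class is written as a plain ASCII double quote.

```javascript
// Character class from the text: one or more characters, none of which is a
// listed special character, a control character or space (\x00-\x20), or
// DEL (\x7F). The anchors are added here so the whole string must match.
const localPartPattern = /^[^<>()\[\],;:@"\x00-\x20\x7F]+$/;

// isSimpleLocalPart is a hypothetical helper name used only for this sketch.
function isSimpleLocalPart(text) {
  return localPartPattern.test(text);
}

console.log(isSimpleLocalPart("john.smith")); // true
console.log(isSimpleLocalPart("o'brien"));    // true: the apostrophe is not excluded
console.log(isSimpleLocalPart("john smith")); // false: space is \x20
console.log(isSimpleLocalPart("john,smith")); // false: comma is excluded
console.log(isSimpleLocalPart(""));           // false: + requires at least one character
```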

Validating the Complete Address

Now that you have seen all the previous sections, you can build up a regular expression for the entire

e-mail address. First, here's everything up to and including the @ sign:

^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@
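The extra \\. alternative in this pattern accepts a backslash followed by any single character, so an otherwise excluded character can appear if it is escaped. The sketch below tries this out; the helper name matchLocalPrefix and the sample addresses are assumptions for illustration, and the quotation mark in the class is written as a plain ASCII double quote.

```javascript
// Pattern from the text: everything up to and including the @ sign.
const localPrefixPattern = /^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@/;

// matchLocalPrefix is a hypothetical helper name used only for this sketch.
// It returns the matched prefix, or null when the start of the string
// does not match.
function matchLocalPrefix(address) {
  const result = localPrefixPattern.exec(address);
  return result === null ? null : result[0];
}

console.log(matchLocalPrefix("john.smith@example.com"));   // "john.smith@"
console.log(matchLocalPrefix("john smith@example.com"));   // null: unescaped space
console.log(matchLocalPrefix("john\\ smith@example.com")); // "john\ smith@": escaped space
console.log(matchLocalPrefix("@example.com"));             // null: nothing before the @
```

Because there is no $ anchor yet, only the start of the string is checked here; the part after the @ sign is not validated by this pattern.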
