HTML and CSS Reference
you're searching for years in the recent past, you might want to find any four-digit number beginning with 200.
You may want to search for attribute name=value pairs, but you're not sure whether they're in the format
name=value, name='value', or name="value". You may want to search for all <p> start-tags, whether they have
attributes or not. These are all good candidates for regular expressions.
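As a rough sketch of the last two searches, here is how they might look in Python's re module. Both patterns (ATTR and P_START) are my own illustrations rather than anything prescribed here, and they assume that attribute values never contain a > character.

```python
import re

# Hypothetical pattern: an attribute name, '=', then a value that is
# double-quoted, single-quoted, or unquoted.
ATTR = re.compile(r"""([A-Za-z][\w:-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")

# Hypothetical pattern: a <p> start-tag with or without attributes.
# The \b keeps it from matching <pre>, <param>, and similar tags.
P_START = re.compile(r"<p\b[^>]*>", re.IGNORECASE)

sample = """<p class=intro id='first' title="Hello">Text</p><pre>code</pre><p>More</p>"""

attrs = ATTR.findall(sample)     # list of (name, value) pairs, quotes included
p_tags = P_START.findall(sample)  # the two <p> start-tags, but not <pre>

print(attrs)
print(p_tags)
```

Like any regular expression over HTML, these will also match text that merely looks like markup, so the results still need a quick scan.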
In a regular expression, certain characters and patterns stand in for a set of other characters. For example, \d
means any digit. Thus, to search for any year from 2000 to 2009, one could use the regular expression 200\d.
This would match 2000, 2001, 2002, and so on through 2009.
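A minimal illustration of 200\d, assuming Python's re module (the sample sentence is invented):

```python
import re

YEAR = re.compile(r"200\d")

text = "Released in 2003, revised in 2007, reprinted in 2011."
years = YEAR.findall(text)

print(years)  # ['2003', '2007']
```

The year 2011 is skipped because its third digit is 1, not 0.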
However, the regular expression 200\d also matches 12000, 200032, 12320056, and other strings that are
probably not years at all. (To be precise, it matches the substrings in the form 200\d, not the entire string.)
Thus, you might want to indicate that the string you're matching must be preceded and followed by whitespace of
some kind. The metacharacter \s matches whitespace, so we can now rewrite the expression as \s200\d\s to
match only those strings that look like years from 2000 to 2009.
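To see the difference between the two expressions, here is a small comparison, again assuming Python's re module and an invented sample string:

```python
import re

text = "In 2004 we sold 12000 units, part 200032, until 2008."

# The bare pattern also matches inside longer numbers.
loose = re.findall(r"200\d", text)

# Requiring whitespace on both sides rejects those substrings.
# The matched text now includes the surrounding whitespace.
strict = re.findall(r"\s200\d\s", text)

print(loose)   # ['2004', '2000', '2000', '2008']
print(strict)  # [' 2004 ']
```

Note that the stricter pattern also drops 2008 here, because that year is followed by a period rather than by whitespace.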
Of course, there's still no guarantee that every string you match in this form is a year. It could be a price, a
population, a score, a movie title, or something else. You'll want to scan the list of matches to verify that it is
what you expect. False positives are a real concern, especially for simple cases such as this. However, it's
normally possible to either further refine the regular expression to avoid any false positives or manually remove
the accidental matches.
There is usually more than one way to write a search. For instance, we could write this one as \b200\d\b. The
metacharacter \b matches the beginning or end of a word, without actually selecting any characters. This
keeps the surrounding whitespace out of the match. It also allows us to recognize a year that comes
at the end of a sentence right before a period, as in "This is 2008." However, it can't distinguish periods from
decimal points and would also match the 2005 in 2005.3124.
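The behavior of \b200\d\b can be checked with a short Python sketch. The second pattern, with a negative lookahead (?!\.\d), is my own suggested refinement for the decimal-point case, not something taken from the text above.

```python
import re

text = "This is 2008. The ratio was 2005.3124 and the part number was 200032."

# Word boundaries accept the year before a period,
# but also accept the integer part of a decimal number.
bounded = re.findall(r"\b200\d\b", text)

# Assumed refinement: reject a match that is followed by a period and a digit.
refined = re.findall(r"\b200\d\b(?!\.\d)", text)

print(bounded)  # ['2008', '2005']
print(refined)  # ['2008']
```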
You could even simply list the years separated by the OR operator, |, like so:
2000|2001|2002|2003|2004|2005|2006|2007|2008|2009
However, this has the same boundary problems as 200\d: on its own it still matches inside longer strings such as 12000.
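The alternation can be built and tried out as follows, assuming Python's re module. Wrapping the list in a non-capturing group between two \b anchors is one possible way to add the boundaries; the variable names are illustrative.

```python
import re

# Build 2000|2001|...|2009 rather than typing it out.
alternatives = "|".join(str(year) for year in range(2000, 2010))

plain = re.compile(alternatives)
# (?:...) groups the alternatives so that \b applies to the whole list.
bounded = re.compile(r"\b(?:" + alternatives + r")\b")

text = "12000 units in 2006, serial 20071"

plain_hits = plain.findall(text)
bounded_hits = bounded.findall(text)

print(plain_hits)    # ['2000', '2006', '2007']
print(bounded_hits)  # ['2006']
```

Without the group, \b2000|2001|...|2009\b would anchor only the first and last alternatives.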
Sometimes you stop with a search. In particular, if the content is generated automatically from a CMS, template
page, or other program, the search is used merely to find bugs: places where the program is generating
incorrect markup. You then must change the program to generate correct markup. If this is the case, false
positives don't worry you nearly so much because all changes will be performed manually anyway. The search
only identifies the bug. It doesn't fix it.
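A search-only pass of this kind might look like the following Python sketch. The check itself (unquoted attribute values), the UNQUOTED pattern, and the report function are hypothetical examples. Nothing in the markup is changed; the function only lists where to look.

```python
import re

# Hypothetical check: attribute values written without quotes, e.g. class=intro.
UNQUOTED = re.compile(r"""\s[A-Za-z][\w:-]*=[^\s"'>]+""")

def report(markup):
    """Return (line_number, matched_text) pairs; the markup is not modified."""
    hits = []
    for number, line in enumerate(markup.splitlines(), start=1):
        for match in UNQUOTED.finditer(line):
            hits.append((number, match.group().strip()))
    return hits

generated = (
    '<p class="intro">One</p>\n'
    '<p class=intro>Two</p>\n'
    '<td width=50 align="left">x</td>'
)

hits = report(generated)
print(hits)  # [(2, 'class=intro'), (3, 'width=50')]
```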
If you don't stop with a search, and you go on to a replacement, you need to be cautious. Regular expressions
can be tricky, and ones involving HTML are often much trickier than the textbook examples. Nonetheless, they
are invaluable tools in cleaning up HTML.
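Here is a small sketch of why replacement needs more care than search, assuming Python's re module. The pattern and the quote_attributes helper are hypothetical. Using subn instead of sub returns the number of replacements, so you can review what would change before saving anything.

```python
import re

# Hypothetical fix: wrap unquoted attribute values in double quotes.
UNQUOTED = re.compile(r"""(\s[A-Za-z][\w:-]*=)([^\s"'>]+)""")

def quote_attributes(line):
    """Return (new_line, number_of_replacements) without touching any file."""
    return UNQUOTED.subn(r'\1"\2"', line)

# Intended case: only the unquoted value is changed.
fixed, fixed_count = quote_attributes("<td width=50 align='left'>x</td>")

# False positive: ordinary text that happens to look like an attribute.
damaged, damaged_count = quote_attributes("<p>Set a=1 in the config</p>")

print(fixed, fixed_count)
print(damaged, damaged_count)
```

The second call rewrites a=1 inside the paragraph text, which is exactly the kind of accidental match you need to catch before committing a replacement.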
Note
If you don't have a lot of experience with regular expressions, please refer to Appendix 1 for many more
examples. I also recommend Mastering Regular Expressions, 3rd Edition, by Jeffrey E.F. Friedl (O'Reilly,
2006).