HTML and CSS Reference
Whitespace
Matching whitespace is quite tricky but still quite important. Precisely because HTML does not consider
whitespace to be hugely significant, it's important to pay attention to it. Four whitespace characters are likely to
appear in HTML documents:
The space itself
The carriage return, \r
The line feed, \n
The tab, \t
The space character has no special representation in regular expressions. To match a space, you simply type a
space. Just be careful that you type the right number of spaces, because it won't usually be obvious if you're
trying to match two where one is called for or vice versa.
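As a minimal sketch of the space-counting pitfall, the following uses Python's re module (an assumption; the text names no particular language) on a made-up sample string. It shows that one literal space in a pattern does not match two, and that a quantifier makes the intended count visible.

```python
import re

# Hypothetical sample: note the TWO spaces between the words.
text = "<p>Hello  world</p>"

# One literal space in the pattern does not match a run of two spaces.
one_space = re.search(r"Hello world", text)

# Quantifiers state the count explicitly instead of relying on typed spaces.
one_or_more = re.search(r"Hello +world", text)    # one or more spaces
exactly_two = re.search(r"Hello {2}world", text)  # exactly two spaces

print(one_space)                 # None
print(one_or_more is not None)   # True
print(exactly_two is not None)   # True
```

Writing the count as a quantifier is easier to review than a run of typed spaces.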
\n is particularly tricky. In some dialects, it represents the literal line feed character, ASCII 10. In
others, including jEdit's, it means any line break: a carriage return, a line feed, or a carriage
return-line feed pair. In still other dialects, it means the platform's native line terminator.
Thus, it can match a carriage return on the classic Mac OS, a line feed on UNIX, and a carriage return-line
feed pair on Windows.
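For illustration, Python's re module (chosen here as an example; it is not one of the tools the text names) belongs to the first group: \n in a pattern matches only the line feed character. The sample strings below are made up.

```python
import re

# Hypothetical samples, one per line-ending convention.
samples = {
    "line feed": "a\nb",
    "carriage return": "a\rb",
    "CR LF pair": "a\r\nb",
}

# In Python, \n matches only ASCII 10, so a lone carriage return is missed
# and only the second half of a CR LF pair is matched.
matches = {name: re.findall(r"\n", s) for name, s in samples.items()}

for name, found in matches.items():
    print(name, len(found))
```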
This is quite troublesome for working with HTML because HTML documents are not platform-bound. You are
likely to find all three line-ending conventions in your document collection, sometimes even in the same file.
Consequently, we usually do one of several things instead:
Use \r\n|\r|\n to match all line breaks, regardless of type.
Use \s to match all whitespace, line breaks or otherwise.
Use ^ and $ to anchor the pattern to the beginning and/or end of a line.
Line breaks are usually not significant in HTML, so more often than not we use the second option.
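The three options above can be sketched in Python's re module (an assumption; other dialects differ in the details). The sample document and variable names are hypothetical. Note that Python's ^ and $ recognize only \n as a line boundary, so this sketch normalizes line endings before anchoring.

```python
import re

# Hypothetical sample mixing Windows, classic Mac, and UNIX line endings.
html = "<p>one</p>\r\n<p>two</p>\r<p>three</p>\n<p>four</p>"

# Option 1: match every kind of line break explicitly.
# \r\n is listed first so a CR LF pair counts as one break, not two.
line_break = re.compile(r"\r\n|\r|\n")
lines = line_break.split(html)

# Option 2: \s+ matches any run of whitespace, line breaks included.
collapsed = re.sub(r"\s+", " ", html)

# Option 3: anchor to line starts and ends with ^ and $.
# Python treats only \n as a line boundary, so normalize first.
normalized = line_break.sub("\n", html)
paragraphs = re.findall(r"^<p>(.*)</p>$", normalized, flags=re.MULTILINE)

print(lines)
print(collapsed)
print(paragraphs)
```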
You may encounter documents that include other characters such as a form feed or a vertical tab. These have
no defined meaning in HTML and should usually be replaced with a single space.
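A minimal sketch of that replacement in Python follows; the function name replace_stray_whitespace is hypothetical, and each form feed or vertical tab becomes one space.

```python
import re

# \x0c is the form feed (ASCII 12); \x0b is the vertical tab (ASCII 11).
STRAY_WHITESPACE = re.compile(r"[\x0b\x0c]")

def replace_stray_whitespace(text: str) -> str:
    """Replace each form feed or vertical tab with a single space."""
    return STRAY_WHITESPACE.sub(" ", text)

cleaned = replace_stray_whitespace("<p>first\x0csecond\x0bthird</p>")
print(cleaned)
```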