HTML and CSS Reference
Groups and Back References
You can group expressions inside parentheses and then use the repetition operators after the group. For
example, suppose you want to find all runs of <br> tags. The regular expression (<br>)+ will match <br>,
<br><br>, <br><br><br>, and so forth.
You can further combine the expressions. For example, (<br>\s*)+ will match all runs of <br> tags, even if
they have whitespace in between them.
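As a quick check of the two expressions above, here is a minimal sketch using Python's re module. The choice of Python and the sample string are assumptions made for illustration; the expressions themselves behave the same way in most regular expression dialects.

```python
import re

# Invented sample text: several runs of <br>, some separated by whitespace.
html = "one<br>two<br><br>three<br> <br>\n<br>four"

# (<br>)+ matches each unbroken run of <br> tags.
plain_runs = [m.group(0) for m in re.finditer(r"(<br>)+", html)]

# (<br>\s*)+ also absorbs any whitespace that follows each tag,
# so tags separated by spaces or line breaks form a single run.
spaced_runs = [m.group(0) for m in re.finditer(r"(<br>\s*)+", html)]

print(plain_runs)
print(spaced_runs)
```

With the first expression, the three tags after "three" count as three separate runs; with the second, they are one run.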
Even more powerfully, you can refer back to a group later in the expression. The first parenthesized match is \1.
The second is \2, the third \3, and so forth. (If the groups nest, they are numbered in the order of their left
parentheses.) For example, suppose you want to find all simple HTML elements of the form <foo>Blah Blah
Blah</foo>. That is, you want to find all the elements that have no attributes and contain no child
elements. Furthermore, you want each match to run from the beginning of the start-tag to the end of
the end-tag.
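The numbering rule for nested groups is easy to get wrong, so here is a small sketch in Python's re module. The heading pattern and sample string are hypothetical and chosen only to show which group receives which number.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
#   group 1: (<(h[1-6])>)   the whole start-tag (opens first)
#   group 2: (h[1-6])       the element name, nested inside group 1
#   group 3: ([^<]*)        the text content
# The end-tag reuses the name through the back reference \2.
pattern = re.compile(r"(<(h[1-6])>)([^<]*)</\2>")

m = pattern.search("<h2>Groups</h2>")
groups = m.groups()
print(groups)
```

Because the outer group's left parenthesis comes first, the whole start-tag is group 1 and the name inside it is group 2.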
We can start with the expression <[a-zA-Z]+> to find the start-tags. We can use the expression </[a-zA-Z]+>
to find the end-tags. However, we want only those pairs that match. So, first we put parentheses around the
element name in the start-tag, like so:
<([a-zA-Z]+)>
Then we refer back to that group in the end-tag expression as \1; that is, </\1>. If the start-tag was <div>, the end-tag
must be </div>. If the start-tag was <em>, the end-tag must be </em>, and so forth:
<([a-zA-Z]+)></\1>
Finally, we need to put a character class in the middle that excludes less-than signs but allows line breaks. This
will avoid nested child elements and some overly greedy matches:
<([a-zA-Z]+)>[^<]*</\1>
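The expression built up over these steps can be tried out in Python's re module. This sketch assumes the finished pattern is <([a-zA-Z]+)>[^<]*</\1>, which is what the three steps describe. The name SIMPLE_ELEMENT and the sample markup are invented for illustration.

```python
import re

# <([a-zA-Z]+)>  start-tag; group 1 captures the element name
# [^<]*          content containing no less-than sign (line breaks are allowed)
# </\1>          end-tag whose name repeats whatever group 1 captured
SIMPLE_ELEMENT = re.compile(r"<([a-zA-Z]+)>[^<]*</\1>")

# Invented sample: a <p> with child elements, an <em> spanning a line break,
# and a <b> closed by the wrong end-tag.
sample = "<p>intro <em>stressed\ntext</em> and <b>mismatched</i> end</p>"

matches = [m.group(0) for m in SIMPLE_ELEMENT.finditer(sample)]
names = [m.group(1) for m in SIMPLE_ELEMENT.finditer(sample)]

print(matches)
print(names)
```

Only the em element matches. The p element is skipped because it contains child elements, and the b element is skipped because its end-tag does not repeat the captured name.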
Even more important, you can use the back references \1, \2, and so on in replacement strings. For example, I
was recently faced with this list:
I wanted to put the contents of each list item in a code element. Therefore, I searched for this:
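To illustrate back references in a replacement string, here is a minimal sketch in Python's re module. The list items, the search pattern, and the replacement string are all hypothetical stand-ins, not the list or expressions referred to above.

```python
import re

# Hypothetical list items, invented for this example.
items = "<li>margin</li>\n<li>padding</li>\n<li>border</li>"

# Group 1 captures the contents of each list item.
search = r"<li>([^<]*)</li>"

# \1 in the replacement string re-inserts whatever group 1 captured,
# now wrapped in a code element.
replacement = r"<li><code>\1</code></li>"

wrapped = re.sub(search, replacement, items)
print(wrapped)
```

Each item keeps its own text because \1 is filled in separately for every match.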