The output of that is:
Found a match for Sam Spade;Yosemite Sam;Sam Merlotte;Samwise Gamgee; beginning at 0 and ending at 51
That's not going to work. The trouble is that the .* pattern matches as much as it can (that's called a
greedy match). In this case, it matches the whole line. Fortunately, the Java regular expression syntax
includes a way to make a quantifier such as * not be greedy (regular expression programmers would say
it's reluctant). To make a quantifier reluctant, we append the question mark character (?) directly
after it, as follows:
(Sam).*?;
The output of that regular expression is:
Found a match for Sam Spade; beginning at 0 and ending at 10
Found a match for Sam; beginning at 19 and ending at 23
Found a match for Sam Merlotte; beginning at 23 and ending at 36
Found a match for Samwise Gamgee; beginning at 36 and ending at 51
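The difference between the two forms can be checked with a short sketch. The class name GreedyVsReluctant and the helper describeMatches are hypothetical, the input string is reconstructed from the output shown above, and the greedy pattern (Sam).*; is an assumption about the earlier expression, which is not repeated in this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Input reconstructed from the match output in the text (an assumption).
    static final String INPUT = "Sam Spade;Yosemite Sam;Sam Merlotte;Samwise Gamgee;";

    // Returns one description per match, in the same format as the output in the text.
    static List<String> describeMatches(String regex, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            results.add("Found a match for " + matcher.group()
                    + " beginning at " + matcher.start()
                    + " and ending at " + matcher.end());
        }
        return results;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the last semicolon, so there is a single match.
        describeMatches("(Sam).*;", INPUT).forEach(System.out::println);
        // Reluctant: .*? stops at the first semicolon after each "Sam".
        describeMatches("(Sam).*?;", INPUT).forEach(System.out::println);
    }
}
```

With the reluctant form, the second match starts at the "Sam" inside "Yosemite Sam", which is why it covers only positions 19 to 23.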
We're getting closer, but what happened to the “Yosemite” in “Yosemite Sam”? Well, the expression
starts with (Sam), so it will match only bits that start with “Sam”, which doesn't include “Yosemite Sam”.
The solution is to use the .*? pattern at the beginning as well as at the end, as follows:
.*?(Sam).*?;
Notice that the leading pattern must be reluctant, too, or we get the whole line again. Now the
output is:
Found a match for Sam Spade; beginning at 0 and ending at 10
Found a match for Yosemite Sam; beginning at 10 and ending at 23
Found a match for Sam Merlotte; beginning at 23 and ending at 36
Found a match for Samwise Gamgee; beginning at 36 and ending at 51
In this fashion, we've parsed a line containing multiple records. We could then add code to write
each match to a separate line in a file or otherwise manipulate each of the matching values. This kind of
parsing is a common task in software development, and regular expressions offer one good way to do it.
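As a minimal sketch of that idea, the following collects each record into a list and prints one record per line. The class name RecordSplitter and the method splitRecords are hypothetical, and the input line is reconstructed from the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordSplitter {
    // Reluctant on both sides of (Sam), so each match stops at the first semicolon.
    static final Pattern RECORD = Pattern.compile(".*?(Sam).*?;");

    // Collects every record matched in the line, in order of appearance.
    static List<String> splitRecords(String line) {
        List<String> records = new ArrayList<>();
        Matcher matcher = RECORD.matcher(line);
        while (matcher.find()) {
            records.add(matcher.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String line = "Sam Spade;Yosemite Sam;Sam Merlotte;Samwise Gamgee;";
        // One record per output line.
        System.out.println(String.join(System.lineSeparator(), splitRecords(line)));
    }
}
```

To write the records to a file instead of the console, the same list could be passed to java.nio.file.Files.write, which accepts a collection of lines.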
As I have indicated, regular expressions can get a lot more complicated. The following regular
expression matches each entry but leaves out the “Sam” at the start of any entry that begins with “Sam”:
S(?!am)|(?<!S)a|a(?!m)|(?<!Sa)m|[^Sam](.*?;)
Its output is:
Found a match for Spade; beginning at 3 and ending at 10
Found a match for Yosemite Sam; beginning at 10 and ending at 23
Found a match for Merlotte; beginning at 26 and ending at 36
Found a match for wise Gamgee; beginning at 39 and ending at 51
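A short sketch for trying this expression follows. The class name LookaroundExample and the method findMatches are hypothetical, and the input string is reconstructed from the earlier output. Note that the matches beginning at 3 and 26 start with the space that followed "Sam", which is why they begin where they do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundExample {
    // The expression from the text, copied unchanged.
    static final Pattern PATTERN =
            Pattern.compile("S(?!am)|(?<!S)a|a(?!m)|(?<!Sa)m|[^Sam](.*?;)");

    // Returns the raw text of every match, including any leading space.
    static List<String> findMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "Sam Spade;Yosemite Sam;Sam Merlotte;Samwise Gamgee;";
        // Brackets make the leading spaces visible in the printed result.
        for (String match : findMatches(input)) {
            System.out.println("[" + match + "]");
        }
    }
}
```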
The code to also remove the “Sam” in “Yosemite Sam” would be even more complex. As it happens,
negating a group is one thing that regular expressions don't make easy. In those cases, it's often best to
mix regular expressions with other String operations and to pass the result of one expression to another
regular expression (a process known as chaining). Those techniques let you manage the complexity of
your regular expressions and may offer better performance than a single complex regular expression.
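One way to sketch that approach is shown below: a first expression splits the line into records, a plain String operation drops the trailing semicolon, and a second, much simpler expression removes "Sam" wherever it appears. The class name ChainedCleanup, the method removeSam, and the second pattern are illustrative assumptions, not the only way to chain expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChainedCleanup {
    // First expression: one record per match, as in the text.
    static final Pattern RECORD = Pattern.compile(".*?(Sam).*?;");
    // Second expression (an assumption): "Sam" plus any whitespace around it.
    static final Pattern SAM = Pattern.compile("\\s*Sam\\s*");

    // Splits the line into records, then strips the delimiter and "Sam" from each.
    static List<String> removeSam(String line) {
        List<String> cleaned = new ArrayList<>();
        Matcher records = RECORD.matcher(line);
        while (records.find()) {
            String record = records.group();
            // Plain String operation: drop the trailing semicolon.
            String withoutDelimiter = record.substring(0, record.length() - 1);
            // Chaining: the first expression's result feeds the second expression.
            cleaned.add(SAM.matcher(withoutDelimiter).replaceAll(""));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String line = "Sam Spade;Yosemite Sam;Sam Merlotte;Samwise Gamgee;";
        removeSam(line).forEach(System.out::println);
    }
}
```

Each step stays simple enough to read at a glance, and this version also handles "Yosemite Sam", which the single lookaround expression does not.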
If you want to know more about regular expressions, start with the official Regular Expression
Tutorial at http://download.oracle.com/javase/tutorial/essential/regex/index.html