This version of the program produces the following output:
list1 = [all, and, bus, go, on, round, round., the, through, town., wheels]
list2 = [all, bus, go, on, swish,, swish., the, through, town., wipers]
overlap = [all, bus, go, on, the, through, town.]
Version 3: Complete Program
Our program now correctly builds a vocabulary list for each of two files and computes
the overlap between them. The program printed the three lists of words, but
that won't be very convenient for large text files containing thousands of different
words. We would prefer to have the program report overall statistics, including the
number of words in each list, the number of words of overlap, and the percentage
of overlap.
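As a rough sketch of the kind of statistics described here, the following code reports list sizes and overlap percentages for the three lists printed above. The class name OverlapStats, the helper methods, and the choice to express the overlap as a percentage of each list's size are assumptions made for illustration, not necessarily the form the complete program uses.

```java
import java.util.Arrays;
import java.util.List;

public class OverlapStats {
    // Returns what percentage of total is represented by part.
    // Returns 0.0 for an empty list to avoid dividing by zero.
    public static double percentOf(int part, int total) {
        if (total == 0) {
            return 0.0;
        }
        return 100.0 * part / total;
    }

    // Prints the size of each list, the size of the overlap, and the
    // overlap as a percentage of each list (an assumed definition).
    public static void report(List<String> list1, List<String> list2,
                              List<String> overlap) {
        System.out.println("list1 words = " + list1.size());
        System.out.println("list2 words = " + list2.size());
        System.out.println("overlap words = " + overlap.size());
        System.out.println("% of list1 in overlap = "
                + percentOf(overlap.size(), list1.size()));
        System.out.println("% of list2 in overlap = "
                + percentOf(overlap.size(), list2.size()));
    }

    public static void main(String[] args) {
        // The three lists from the sample output above.
        List<String> list1 = Arrays.asList("all", "and", "bus", "go", "on",
                "round", "round.", "the", "through", "town.", "wheels");
        List<String> list2 = Arrays.asList("all", "bus", "go", "on", "swish,",
                "swish.", "the", "through", "town.", "wipers");
        List<String> overlap = Arrays.asList("all", "bus", "go", "on", "the",
                "through", "town.");
        report(list1, list2, overlap);
    }
}
```

Writing 100.0 rather than 100 in percentOf forces floating-point division, so the fractional part of the percentage is not lost to integer truncation.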
The program also should contain at least a brief introduction to explain what it
does, and we can write it so that it prompts for file names rather than using
hard-coded file names.
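One possible way to print an introduction and prompt for the two file names is sketched below. The class name PromptDemo, the promptFor helper, and the wording of the messages are all hypothetical; the complete program may organize this differently.

```java
import java.util.Scanner;

public class PromptDemo {
    // Prints the prompt and returns the next line typed by the user.
    // Returns an empty string if no more input is available.
    public static String promptFor(Scanner console, String prompt) {
        System.out.print(prompt);
        if (!console.hasNextLine()) {
            return "";
        }
        return console.nextLine().trim();
    }

    public static void main(String[] args) {
        Scanner console = new Scanner(System.in);
        // Brief introduction (illustrative wording).
        System.out.println("This program compares the vocabulary of two");
        System.out.println("text files, reporting the number of words in");
        System.out.println("common and the percent of overlap.");
        System.out.println();
        String name1 = promptFor(console, "file #1 name? ");
        String name2 = promptFor(console, "file #2 name? ");
        System.out.println("Comparing " + name1 + " and " + name2);
    }
}
```

Passing the console Scanner as a parameter means both prompts share a single Scanner over System.in rather than constructing a new one for each prompt.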
This also seems like a good time to think about punctuation. The first two versions
allowed words to contain punctuation characters such as commas, periods, and
dashes that we wouldn't normally consider to be part of a word.
We can improve our solution by telling the Scanner what parts of the input file to
ignore. Scanner objects have a method called useDelimiter that you can call to tell
them what characters to use when they break the input file into tokens. When you call
the method, you pass it what is known as a regular expression. Regular expressions
are a highly flexible way to describe patterns of characters. There is some
documentation about them in the API pages for the class called Pattern.
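To see the effect of useDelimiter in isolation, here is a minimal sketch that reads tokens from a String instead of a file. The class name DelimiterDemo and the tokens helper are hypothetical, and the delimiters used are deliberately simple ones chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Breaks text into tokens, treating anything that matches
    // delimiterRegex as a separator between tokens.
    public static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        Scanner input = new Scanner(text);
        input.useDelimiter(delimiterRegex);
        while (input.hasNext()) {
            result.add(input.next());
        }
        input.close();
        return result;
    }

    public static void main(String[] args) {
        // A single space as the delimiter.
        System.out.println(tokens("the wheels on the bus", " "));
        // A comma as the delimiter.
        System.out.println(tokens("swish,swish,swish", ","));
    }
}
```

The same call works on a Scanner constructed from a File; only the source of the characters changes, not the way the delimiter pattern is applied.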
For our purposes, we want to form a regular expression that will instruct the
Scanner to look just at characters that are part of what we consider words. That is,
we want the Scanner to look at letters and apostrophes. The following regular
expression is a good starting point:
[a-zA-Z']
This regular expression would be read as, “Any character in the range of a to z, the
range of A to Z, or an apostrophe.” This is a good description of the kind of
characters we want the Scanner to include. But we actually need to tell the Scanner what
characters to ignore, so we need to indicate that it should use the opposite set of
characters. The easy way to do this is by including a caret (^) in front of the list of legal
characters:
[^a-zA-Z']
This regular expression would be read as, “Any character other than the characters
that are in the range of a to z, the range of A to Z, or an apostrophe.” Even this
 