ArrayLists - Building Java Programs: A Back to Basics Approach


This version of the program produces the following output:

list1 = [all, and, bus, go, on, round, round., the, through, town., wheels]

list2 = [all, bus, go, on, swish,, swish., the, through, town., wipers]

overlap = [all, bus, go, on, the, through, town.]

Our program now correctly builds a vocabulary list for each of two files and computes the overlap between them. The program printed the three lists of words, but that won't be very convenient for large text files containing thousands of different words. We would prefer to have the program report overall statistics, including the number of words in each list, the number of words of overlap, and the percentage of overlap.
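As a rough sketch of what such a report might look like, the following code prints the size of each list and the overlap as a percentage of each list. The class name OverlapStats, the helper names percentOf and reportStats, and the choice to measure the overlap against each list separately are assumptions made for illustration; the sample lists are copied from the output shown above.

```java
import java.util.Arrays;
import java.util.List;

class OverlapStats {
    // Returns part as a percentage of whole, or 0.0 when whole is 0
    // (this guards against dividing by zero for an empty list).
    static double percentOf(int part, int whole) {
        if (whole == 0) {
            return 0.0;
        }
        return 100.0 * part / whole;
    }

    // Prints the number of words in each list, the number of words of
    // overlap, and the overlap as a percentage of each list (an assumed
    // definition of "percentage of overlap").
    static void reportStats(List<String> list1, List<String> list2,
                            List<String> overlap) {
        System.out.println("list1 words = " + list1.size());
        System.out.println("list2 words = " + list2.size());
        System.out.println("common words = " + overlap.size());
        System.out.println("% of list1 in overlap = "
                + percentOf(overlap.size(), list1.size()));
        System.out.println("% of list2 in overlap = "
                + percentOf(overlap.size(), list2.size()));
    }

    public static void main(String[] args) {
        // Sample data taken from the program output shown earlier.
        List<String> list1 = Arrays.asList("all", "and", "bus", "go", "on",
                "round", "round.", "the", "through", "town.", "wheels");
        List<String> list2 = Arrays.asList("all", "bus", "go", "on",
                "swish,", "swish.", "the", "through", "town.", "wipers");
        List<String> overlap = Arrays.asList("all", "bus", "go", "on",
                "the", "through", "town.");
        reportStats(list1, list2, overlap);
    }
}
```

For the sample lists, the sizes are 11, 10, and 7, so the overlap covers 7 of 11 words in the first list and 7 of 10 words in the second.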

The program also should contain at least a brief introduction to explain what it does, and we can write it so that it prompts for file names rather than using hard-coded file names.

This also seems like a good time to think about punctuation. The first two versions allowed words to contain punctuation characters such as commas, periods, and dashes that we wouldn't normally consider to be part of a word.

We can improve our solution by telling the Scanner what parts of the input file to ignore. Scanner objects have a method called useDelimiter that you can call to tell them what characters to use when they break the input file into tokens. When you call the method, you pass it what is known as a regular expression. Regular expressions are a highly flexible way to describe patterns of characters. There is some documentation about them in the API pages for the class called Pattern.
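To make the mechanics of useDelimiter concrete, here is a minimal sketch that reads tokens from a String rather than a file. The class name DelimiterDemo, the helper method tokens, and the comma-or-semicolon pattern are all assumptions chosen only to illustrate the call; they are not the expression the vocabulary program ends up using.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Reads every token from the text, treating anything that matches
    // delimiterRegex as a separator between tokens.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        Scanner input = new Scanner(text);
        input.useDelimiter(delimiterRegex);
        while (input.hasNext()) {
            result.add(input.next());
        }
        input.close();
        return result;
    }

    public static void main(String[] args) {
        // "[,;]" is an illustrative pattern: a single comma or semicolon.
        System.out.println(tokens("wheels,bus;town", "[,;]"));
        // prints [wheels, bus, town]
    }
}
```

The same call works on a Scanner that is reading a file; only the source of the characters changes.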

For our purposes, we want to form a regular expression that will instruct the Scanner to look just at characters that are part of what we consider words. That is, we want the Scanner to look at letters and apostrophes. The following regular expression is a good starting point:

[a-zA-Z']
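One way to check which characters this expression accepts is to test one-character strings against it with the matches method of String. The sketch below does that; the class name WordCharDemo and the helper isWordChar are hypothetical names used only for this illustration.

```java
class WordCharDemo {
    // Hypothetical helper: returns true if ch is a letter in a-z or A-Z,
    // or an apostrophe, according to the expression [a-zA-Z'].
    static boolean isWordChar(char ch) {
        return String.valueOf(ch).matches("[a-zA-Z']");
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('a'));   // true
        System.out.println(isWordChar('Z'));   // true
        System.out.println(isWordChar('\''));  // true
        System.out.println(isWordChar(','));   // false
        System.out.println(isWordChar('-'));   // false
    }
}
```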

This regular expression would be read as, “Any character in the range of a to z, the range of A to Z, or an apostrophe.” This is a good description of the kind of characters we want the Scanner to include. But we actually need to tell the Scanner what characters to ignore, so we need to indicate that it should use the opposite set of characters. The easy way to do this is by including a caret (^) in front of the list of legal characters:

[^a-zA-Z']

This regular expression would be read as, “Any character other than the characters that are in the range of a to z, the range of A to Z, or an apostrophe.” Even this
